Hacker News | pzo's comments

surprisingly, cursor charges only 0.75x per request for sonnet 4.0 (compared to 1x for sonnet 3.7)

It does say "temporarily offered at a discount" when you hover over the model in the dropdown.

no it's not - it's assumed, perhaps by taking the most likely unit (years). But if the conversation is in a hospital with your kid having an emergency, I guess the doctor would appreciate knowing whether they will have to do surgery on a 3-month-old or a 3-year-old.

If the doctor has trouble figuring out the difference between 3 months and 3 years, there are bigger problems than specificity.

There are places where specificity is necessary, and there are places where the implicit assumptions people make are specific enough, needing additional specification only if the implication is violated. That's how language works - shortcuts everywhere, even with really important things, because people figure it out. There are also lots of examples of this biting people in the ass - it doesn't always work, even if most of the time it does.


A 3 month old kid looks very different to a 3 year old kid.

and that's the exact point - you assume the doctor sees the kid, instead of you calling the doctor, or the doctor getting briefed by emergency staff.

Exactly! It makes sense in context.

As long as you construct a strawman strict enough that there can be no ambiguity, and refuse to acknowledge any context where it's not enough, yeah, it always makes sense in context.

I used AI to summarize this whole article and give me the takeaways - it saved me like 0.5h of reading something I would have ended up disagreeing with, since the article is IMHO too harsh on AI.

I found AI extremely useful and an easy sell for me at $20/m, even if not used professionally for coding - and I'm a person who avoids any type of subscription like the plague.

Even in the educational setting that this article mostly focuses on, it can be super useful. Not everyone has access to mentors and scholars. I saved a lot of time helping family with typical tech questions and troubleshooting by teaching them how to use it and to try to solve their tech problems themselves.


Humane was a very impressive product from a hardware and design perspective, but with poor execution and software (partially because they didn't own a smartphone OS like android/iOS).

If similar hardware was:

- released by apple or google and deeply integrated with android/iOS

- embedded inside apple watch / pixel watch

- embedded inside a slim airpods case that could be worn as a pendant

- apple had siri as good as gemini and very good local STT to reduce latency

- MCP clients got adopted more widely by being integrated into smartphone AI assistants

then it could have been a hit. They lost because they shipped too early, without big pockets for the long game, without owning android OS / iOS, and charged a big price + subscription for this gadget.

I think google is currently best positioned to pull off a seamless experience with pixel devices (smartphone, watch, buds, gemini).


The interface on iPhone can still be improved - the latest ones have dedicated action and camera buttons. Once you can plug those into a better assistant, without unlocking the phone, it becomes more seamless.

Do the same with apple watch: make a hand gesture covering your ear, like you're listening to something, and then you don't even have to pick the phone up out of your pocket.

I think there are a lot of ways the iphone, apple watch, and airpods (case as pendant) could deliver the best UX, but it doesn't matter as long as siri sux.


correct - the youtube video shows it was uploaded 5 years ago. I doubt it is relevant today, even if someone doesn't want to use a VLM for those tasks. Google dropped the ball on tensorflow lite, and its rebrand to LiteRT is still WIP.

A probably safer bet is to use onnxruntime or executorch if someone needs to deploy on edge devices today. At least for onnx, the community on hugging face is huge, and plenty of modern SOTA models are already in transformers and transformers.js.


Google created android - the most popular OS. Sure, maybe samsung or nokia would be used instead, but android definitely helped expand the ad business, the same way Meta/ByteDance expanded the ad business with Instagram/Tiktok. Even if ad spending grew 1.6% per year, it's not certain it would have grown as much if android didn't exist. You also need to take into account the probably reduced cost of advertising - the product just got cheaper. That the ad market grew 50% in 25 years doesn't mean we have only 50% more ads served, just like 50% growth (in $) in the smartphone market doesn't have to mean only 50% more smartphones if they got cheaper.

Technically, Andy Rubin and Chris White weren't at Google when they created Android. In usual big tech fashion, Google made a good acquisition but didn't actually "create" Android - they bought it.

I think it's a simplification to compare progress only at the LLM level.

We've had big progress in AI in the last 2 years, but you have to take into account more than text token generation. We have image generation that is not only super realistic - you can just type what you want modified, without learning complicated tools like ComfyUI.

We have text-to-speech and audio-to-audio that is not only very realistic and fluent in many languages but can also express emotions in speech.

We have video generation that gets more realistic every month while taking less computation.

There is big progress in 3d model generation. Speech-to-text is still improving and is fast enough to run on phones, reducing latency. The next frontier is how AI is applied to robotics. Not to mention areas that aren't sexy to end users, like applications in healthcare.


All of your examples are just other flavors of token generation.

I mentioned that OP focused only on the lack of big improvements in text token generation (since gpt 4.0), but those models got multimodal - and not all generative AI is based on tokens; some is based on diffusion models.

I have a similar feeling. While LLMs have given me a new way to do search/questions, it is the byproducts that feel like the actual game changers. For me, it is vision models and pretty impressive STT and TTS. I am blind, so I have my own reasons why Vision and Speech have so many real world applications for me. Sure, LLMs are still the backbone of the applications emerging, but the real progress in terms of use cases is in the fringes.

Heck, I wrote myself my own personal radio moderator in a few hundred lines of shell, later rewritten in Python. A simple MPD client: watch for a queued track which has album art, and pass the track metadata + picture to the LLM. Send the result through a pretty natural-sounding TTS, and queue the resulting sound file before the next track. Suddenly, I had a radio moderator that would narrate album art for me. It gave me a glimpse into a world that wouldn't have been possible before. And while the LLM is basically writing the script, the real magic comes from multimodality and great-sounding TTS.
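The pipeline above can be sketched in a few lines of Python. The `llm` and `tts` callables here are hypothetical stand-ins for the multimodal LLM and TTS API calls (and the MPD polling is omitted); only the prompt-building and the data flow are real:

```python
# Sketch of the radio-moderator pipeline. llm and tts are hypothetical
# stand-ins: a real llm would take the album-art image and return text,
# a real tts would return an audio file to queue before the next track.

def build_prompt(track):
    """Turn MPD-style track metadata into a prompt for the LLM."""
    return (
        "You are a friendly radio moderator. Introduce the next track "
        f"('{track['title']}' by {track['artist']}, from the album "
        f"'{track['album']}') and briefly describe its cover art."
    )

def moderate(track, llm=None, tts=None):
    prompt = build_prompt(track)
    # With no LLM/TTS wired up, just pass the prompt through so the flow is visible.
    script = llm(prompt, track.get("albumart")) if llm else prompt
    return tts(script) if tts else script

track = {"title": "So What", "artist": "Miles Davis", "album": "Kind of Blue"}
print(moderate(track))
```

In the real version, the returned audio path is queued in MPD ahead of the upcoming track.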

Much potential for really cool looking/sounding PoCs. However, what worries me is that there is not much progress on (to me) obvious shortcomings. For instance, OpenAI TTS really can't speak numbers correctly. Digits maybe, but once you hand it something like "2025", the chance is high it will have pronunciation problems. In the first months, this felt bad but temporarily acceptable. A year later, it feels hilariously sad that nothing has been done to address such a simple yet important issue. You know something bad is going on when you start to consider expanding numbers to written-out form before passing the message to the TTS. My girlfriend keeps joking that since LLMs, we now have computers that totally cannot compute correctly. And she has a point. Sure, hand the LLM a tool to do calculations, and the situation improves somewhat. But the problem seems to be fundamental, as the TTS issues show.
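That written-out-form workaround is easy to sketch. A toy expander, assuming years should be read as two-digit pairs (edge cases like round thousands, e.g. 2000, are not handled):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def spell_number(match):
    s = match.group()
    n = int(s)
    if len(s) == 4 and s[0] != "0":
        # Read years as two pairs: 2025 -> "twenty twenty-five"
        hi, lo = int(s[:2]), int(s[2:])
        if lo == 0:
            return two_digits(hi) + " hundred"
        return two_digits(hi) + " " + (("oh " + ONES[lo]) if lo < 10 else two_digits(lo))
    if n < 100:
        return two_digits(n)
    return " ".join(ONES[int(d)] for d in s)  # fallback: digit by digit

def expand_numbers(text):
    return re.sub(r"\d+", spell_number, text)

print(expand_numbers("The year 2025 has 52 weeks."))
```

It is absurd that the caller has to do this at all, which is exactly the point.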

Vision models have so many applications for me... However, some of them turn out to be actually unusable in practice. That becomes clear when you use a vision model to read the values off a blood pressure sensor. Take three photos, and you get three slightly different values. Not obviously made up stuff, but numbers that could be. 145/90, 147/93, 142/97. Well, the range might be clear, but actually, you can never be sure. Great for scene and art descriptions, since hallucinations almost fall through the cracks. But I would never use it to read any kind of data, neither OCR'd text nor, gasp, numbers! You can never know if you have been lied to. But still, some of the byproducts of LLMs feel like a real revolution. The moment you realize why whisper is named like that. When you test it on your laptop, and realize that it just transcribed the YouTube video you were rather silently running in the background. Some of this stuff feels like a big jump.
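A cheap guard against this kind of hallucinated reading (a sketch, not a medical tool): take several photos and only accept the OCR'd numbers when the readings agree within a tolerance. The 145/90, 147/93, 142/97 example above would be rejected:

```python
from statistics import median

def consistent_reading(readings, tolerance=3):
    """Accept vision-model readings of a blood pressure display only when
    repeated photos agree. readings: list of (systolic, diastolic) tuples."""
    sys_vals = [r[0] for r in readings]
    dia_vals = [r[1] for r in readings]
    spread = lambda vals: max(vals) - min(vals)
    if spread(sys_vals) > tolerance or spread(dia_vals) > tolerance:
        return None  # readings disagree: re-photograph instead of guessing
    # The median is robust against one slightly-off reading.
    return (median(sys_vals), median(dia_vals))

print(consistent_reading([(145, 90), (147, 93), (142, 97)]))  # too spread out
print(consistent_reading([(145, 92), (146, 93), (145, 92)]))
```

This only catches inconsistent hallucinations; a model that confidently misreads the same digit three times still slips through, which is why "you can never be sure" stands.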


I’m kind of disappointed that the generative AI hype has overshadowed how many non-generative tasks are basically “solved”, especially in vision.

Human-level object recognition can easily be trained up for custom use cases. Image segmentation is amazing. I can take a photo of a document and it's accurately OCR'd. 10-15 years ago that would be unfathomable.

I think current LLMs would give AI a much better reputation if they focused on non-generative applications. Sentiment analysis, translation, named entity extraction, etc. - these were all problems that data folks have been wrestling with, that could very well be seen as "solved" and a big win for AI that businesses would be able to confidently integrate into their workflows. But instead they went the generative route, and we have to deal with hallucinations and slop.


Ahh, I wanted to list translation as another "byproduct". That totally feels like solved now.

However, while OCR done by vision models feels neat, I personally don't feel like it changed anything for me. I have been using KNFB Reader and later Seeing AI, and both have sufficiently solved the "OCR a document you just photographed" use case for me. They even aid the picture-taking process by letting me know when a particular edge of the document is not visible.

Besides, I still don't fully understand the actual potential for hallucinations when doing OCR through vision models. I have a feeling there are a number of corner cases that will lead to hallucinations. The tendency to fill in things that might fit but aren't there is rather concerning - especially for spelling errors and numerical data.


There is also Trae - a vscode fork from bytedance.


but if you want to use the google SDK (python-genai, js-genai) rather than the openai SDK (I found the google api more feature-rich when using different modalities like audio/images/video), you cannot use openrouter. Also, if you are developing an app and need higher rate limits - what's the typical rate limit via openrouter?


also, for some reason, when I tested a simple prompt (a few words, no system prompt) with 1 image attached, openrouter charged me ~1700 tokens, while directly via python-genai it's ~400 tokens. Also keep in mind they charge a small markup fee when you top up your account.

