The results are there, they're just hidden away in the appendix. What they show is that those models don't actually suffer drops on 4/5 of the modified benchmarks. The one benchmark that does see drops not explained by margin of error is the one that adds "seemingly relevant but ultimately irrelevant information to problems".
Those results are absent from the conclusion because the conclusion falls apart otherwise.
He is testing several models, some of which cannot reliably output legal moves. That's different from saying that all models, including the one he argues does understand, can't generate a legal move in 10 tries.
3.5-turbo-instruct's illegal move rate is about 5 or fewer in 8,205 moves (well under 0.1%).
>but it seemed the easiest solution (Occam's razor)
In my opinion, it only seems like the easiest solution on the surface, when you take basically nothing else into account. Once you start looking at everything in context, it just seems bizarre.
Because people are not saying "let's replace Casio Calculators with interfaces to GPT!"
By and large, the processes people are scrambling to put LLMs into are ones that typical machines struggle with or fail at and humans excel at or do decently (and that LLMs are making some headway in).
There's no point comparing LLM performance to some hypothetical perfect-understanding machine that doesn't exist. It's nonsensical, actually. You compare it to the performance of the beings it's meant to replace or augment - humans.
Replacing non-deterministic black boxes with potentially better performing non-deterministic black boxes is not some crazy idea.
>If I can choose between having a script that queries the db and generates a report and “Dave in marketing” who “has done it for years”
If you could, that would be nice, wouldn't it? And if you couldn't?
If people were saying "let's replace Casio Calculators with interfaces to GPT", then that would be crazy and I would wholly agree with you. But by and large, the processes people are scrambling to put LLMs into are ones that typical machines struggle with or fail at and humans excel at or do decently (and that LLMs are making some headway in).
You're making the wrong distinction here. It's not Dave vs your nifty script. It's Dave or nothing at all.
There's no point comparing LLM performance to some hypothetical perfect understanding machine that doesn't exist.
You compare it to the things it's meant to replace - humans. How well can the LLM do this compared to Dave?
100%, and a lot of them are truly terrible use cases for LLMs.
For example, using an LLM to transform structured data into JSON, and doing it with two LLMs in parallel to try to catch the inevitable failures, instead of just writing code that outputs JSON.
If your task was already being solved well by a deterministic script/algorithm, you are not going to save money by porting it to LLMs, even if you use open-source models.
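To make the contrast concrete, here's a minimal sketch of the deterministic version, assuming the structured data is already in a Python dict (the field names are made up for illustration):

    import json

    # Hypothetical record pulled from whatever structured source you already have.
    record = {"order_id": 1234, "customer": "Acme Corp", "total": 99.50}

    # Deterministic and self-checking: json.dumps either succeeds or raises,
    # so there's no second LLM needed to catch malformed output.
    print(json.dumps(record, indent=2))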
'could' is doing a whole lot of work in that sentence; I'm being charitable. The reality is that LLMs are being crammed into places where they aren't very sensible, under thin justifications, just like the last few big ideas were (cf. blockchain).
>Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.
For normal OCR models, the state of open source is pretty terrible. Unfortunately, the closed options from Microsoft, Google, etc. are much better. Did you try those?
I tried open-source and closed-source OCR models, and all were pretty bad. Google Vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM along with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.
I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.
I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text - pulling any dates listed in the text into a standard ISO format was nice, as well as grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
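For anyone curious what that looks like in practice, here's a rough sketch of the kind of request I mean, using the OpenAI-compatible chat API; the model name, prompt wording, and output keys are placeholders, not my exact setup:

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes an OpenAI-compatible endpoint, API key in the environment

    with open("scan.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "Transcribe only the hand-written text in this image and ignore printed text. "
        "Return JSON with keys: text, dates (ISO 8601 strings), names."
    )

    resp = client.chat.completions.create(
        model="gemini-flash",  # placeholder; swap in whichever vision model your provider routes to
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)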
The OCR in OneNote is incredible IME. But I've not tested it on a wide range of fonts -- only that I have abysmal handwriting and it will find words that are almost unrecognisable.
Third-party libraries don't spring from nowhere, and no language starts out having them in abundance. People have to be motivated enough to write all those libraries in the first place, and a lot of them were written just to use Python syntax over C code.
I'm not saying it's the only reason to choose Python now, but it's definitely among the biggest reasons.
Why not just call the C code you've already written in C? Because they would rather use Python (or Python-like) syntax.
I don't think we actually disagree here. Even your point about the better C-API doesn't indicate that syntax wasn't a deciding factor, just that one of several options had better compatibility.
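A toy example of what "Python syntax over C code" means in practice; ctypes loading libm is just a stand-in for the kind of wrapping those libraries do (assumes a Unix-like system where find_library can locate the math library):

    import ctypes
    import ctypes.util

    # Load the system C math library and call its sqrt() directly from Python.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double

    print(libm.sqrt(2.0))  # the same C routine, with a Python-shaped call site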
Tokenization is not, strictly speaking, necessary (you can train on bytes). What it is is really, really efficient. Scaling is a challenge as is; bytes would just blow that up.
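A quick way to see the efficiency gap, using tiktoken as an illustrative tokenizer (the exact ratio depends on the text and the vocabulary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Tokenization trades a bigger vocabulary for much shorter sequences."

    print(len(text.encode("utf-8")))  # sequence length if you train on raw bytes
    print(len(enc.encode(text)))      # sequence length after BPE tokenization

Since attention cost grows roughly quadratically with sequence length, sequences several times longer are a real problem at training scale.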