More

coder543 · 2026-02-11T17:44:01 1770831841

If someone releases a benchmark/dataset, I'm sure that significantly increases the chances of one of these AI labs training on the task.

coder543 · 2026-02-11T15:44:45 1770824685

If you want OCR with the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even pass all the pages in parallel in separate requests, and get the better quality response much faster too.

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.

staticman2 · 2026-02-11T17:26:22 1770830782

Gemini Pro 3 seems to be built for handling multiple page PDFs.

I can feed it a multiple page PDF and tell it to convert it to markdown and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested on A.i. studio but I think the API works the same way).

coder543 · 2026-02-11T17:31:02 1770831062

It's not that they can't do multiple pages... but did you compare against doing one page at a time?

How many pages did you try in a single request? 5? 50? 500?

I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.

One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.

staticman2 · 2026-02-11T19:13:03 1770837183

I've been doing small PDFs- usually 5 or 6 pages in length.

I never tested Gemini 3 PDF OCR compared to individual images but I can say it processes a small 6 page PDF better than the retired Gemini 1.5 or 2 did individual images.

I agree that OCR and analysis should be two separate steps.

HPsquared · 2026-02-11T16:24:09 1770827049

You could maybe then do a second pass on the whole text (as plain text not OCR) to look for likely mistakes.

kergonath · 2026-02-11T17:03:02 1770829382

This is not always easy. The models I tried were too helpful and rewrote too much instead of fixing simple typos. When I tried I ended up with huge prompts and I still found sentences where the LLM was too enthusiastic. I ended up applying regexes with common typos and accepted some residual errors. It might be better now, though. But since then I’ve moved to all-in-one solutions like Mathpix and Mistral-OCR which are quite good for my purpose.

coder543 · 2026-02-11T15:34:06 1770824046

There are a bunch of new OCR models.

I’ve also heard very good things about these two in particular:

- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B

- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

The OCR leaderboards I’ve seen leave a lot to be desired.

With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.

I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.

philipkglass · 2026-02-11T18:24:15 1770834255

I have been trying to catch up with recent OCR developments too. My documents have enough special requirements that public benchmarks didn't tell me enough to decide. Instead I'm building a small document OCR project with visualization tools for comparing bounding boxes, extracted text, region classification, etc. GLM-OCR is my favorite so far [1]. Apple's VisionKit is very good at text recognition, and fast, but it doesn't do high level layout detection and it only works on Apple hardware. It's another useful source of data for cross-validation if you can run it.

This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would have previously believed.

[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing; it can make some combined areas "disappear" from the layout data.

dweekly · 2026-02-11T18:50:35 1770835835

I'm going to be the obnoxious person who asks you to please create this leaderboard because you care and have a modicum of knowledge in this space.

StableAlkyne · 2026-02-11T16:29:48 1770827388

How do these compare to something like Tesseract?

I remember that one clearing the scoreboard for many years, and usually it's the one I grab for OCR needs due to its reputation.

kergonath · 2026-02-11T16:57:18 1770829038

Tesseract does not understand layout. It’s fine for character recognition, but if I still have to pipe the output to a LLM to make sense of the layout and fix common transcription errors, I might as well use a single model. It’s also easier for a visual LLM to extract figures and tables in one pass.

chaps · 2026-02-11T17:05:01 1770829501

For my workflows, layout extraction has been so inconsistent that I've stopped attempting to use it. It's simpler to just throw everything into postgis and run intersection checks on size-normalized pages.

kergonath · 2026-02-11T17:51:01 1770832261

Interesting. What kind of layout do you have?

My documents have one or two-column layouts, often inconsistently across pages or even within a page (which tripped older layout detection methods). Most models seem to understand that well enough so they are good enough for my use case.

chaps · 2026-02-11T18:06:15 1770833175

Documents that come from FOIA. So, some scanned, some not. Lots of forms and lots of hand writing to add info that the form format doesn't recognize. Lots of repeated documents, but lots of one-off documents that have high signal.

pogue · 2026-02-12T01:36:19 1770860179

I'd be very curious what works well with FOIA historical documents that have been scanned by hand with redactions by markers & etc.

chaps · 2026-02-12T15:06:51 1770908811

I like to use textual anchors for things like, "line starts with" or "line ends with" or "file ends with" and combining that with levenshtein distance with some normalization stuff (combining adjacent strings in various patterns to account for OCR wonkiness). Turns into building lists of anchors that can be built off of. Of all the things I've tried, including things like image hashing and such, it's been the most effective generalized "tool".

But also, I hold the strong philosophy that it's important to actually read the documents that are being scanned. In that way, OCR tends to be more of a procedural step than anything.

Really, it ultimately depends on your goals.

fudged71 · 2026-02-11T19:17:58 1770837478

I don't know how, but PyMuPDF4LLM is based on Tessaract and has GNN-based layout detection

chaps · 2026-02-11T17:01:58 1770829318

Tesseract v4 when it was released was exceptionally good and blew everything out of the water. Have used it to OCR millions of pages. Tbh, I miss the simplicity of tesseract.

The new models are similarly better compared to tesseract v4. But what I'll say is that don't expect new models to be a panacea for your OCR problems. The edge case problems that you might be trying to solve (like, identifying anchor points, or identifying shared field names across documents) are still pretty much all problematic still. So you should still expect things like random spaces or unexpected characters to jam up your jams.

Also some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite, think of that.

notnullorvoid · 2026-02-12T00:07:56 1770854876

I used Tesseract v3 back in the day in combination with some custom layout parsing code. It ended up working quite well. When looking at many of the models coming out today the lack of accuracy scares me.

sgc · 2026-02-12T02:18:12 1770862692

The best leader board I have used is ocrarena.ai. I agree it is not detailed enough. I wish people could rate what part of the ocr went well or bad (layout, text recognition, etc). However, my more specific results using custom prompts and my own images on their playground page are relatively closely aligned with the rankings as others have voted.

What more are you looking for?

mixedmath · 2026-02-11T18:23:56 1770834236

Are there leaderboards that you follow or trust?

Also, do you have preferred OCR models in your experience? I've had some success with dots.OCR, but I'm only beginning to need to work with OCR.

coder543 · 2026-02-11T18:27:41 1770834461

> Are there leaderboards that you follow or trust?

Not for OCR.

Regardless of how much some people complain about them, I really do appreciate the effort Artificial Analysis puts into consistently running standardized benchmarks for LLMs, rather than just aggregating unverified claims from the AI labs.

I don't think LMArena is that amazing at this point in time, but at least they provide error bars on the ELO and give models the same rank number when they're overlapping.

> Also, do you have preferred OCR models in your experience?

It's a subject I'm interested in, but I don't have enough experience to really put out strong opinions on specific models.

pogue · 2026-02-12T01:39:29 1770860369

Here's one from last year. Helpful, but I guess they gave up on updating it.

https://getomni.ai/blog/ocr-benchmark

noahjohannessen · 2026-02-11T17:28:57 1770830937

is https://www.ocrarena.ai/ not accurate?

fzysingularity · 2026-02-11T17:56:37 1770832597

ELO scores for OCR don't really make much sense - it's trying to reduce accuracy to a single voting score without any real quality-control on the reviewer/judge.

I think a more accurate reflection of the current state of comparisons would be a real-world benchmark with messy/complex docs across industries, languages.

coder543 · 2026-02-11T17:37:14 1770831434

It is missing both models that I mentioned, so yes, I would say one reason it is not accurate is because it is so incomplete.

It also doesn't provide error bars on the ELO, so models that only have tens of battles are being listed alongside models that have thousands of battles with no indication of how confident those ELOs are, which I find rather unhelpful.

A lot of these models are also sensitive to how they are used, and offer multiple ways to be used. It's not clear how they are being invoked.

That leaderboard is definitely one of the ones that leaves a lot to be desired.

coder543 · 2026-02-04T16:16:09 1770221769

The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.

sbrother · 2026-02-04T16:54:40 1770224080

Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..

ashenke · 2026-02-04T19:03:41 1770231821

You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text I tried it on an english-speaking podcast episode, and apart from identying one host as two different speakers (but only once for a few sentences at the start), the rest was flawless from what I could see

sbrother · 2026-02-04T22:16:03 1770243363

Amazing. Thank you.

coder543 · 2026-02-04T17:22:27 1770225747

> Do you have experience with that model

No, I just heard about it this morning.

observationist · 2026-02-04T16:30:57 1770222657

Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!

coder543 · 2026-02-03T18:52:42 1770144762

MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

Not as good as running the entire thing on the GPU, of course.

coder543 · 2026-01-28T00:51:21 1769561481

I wouldn't say "everyone" uses Air. I had never even heard of it, despite frequently developing in Go for close to a decade at this point. Modd is quite nice: https://github.com/cortesi/modd

But, why couldn't you just use WSL? It is one of the main selling points of using Windows as a developer these days.

coder543 · 2026-01-26T17:19:48 1769447988

An LLM that can't understand the environment properly can't properly reason about which command to give in response to a user's request. Even if the LLM is a very inefficient way to pilot the thing, being able to pilot means the LLM has the reasoning abilities required to also translate a user's request into commands that make sense for the more efficient, lower-level piloting subsystem.

coder543 · 2026-01-26T17:08:33 1769447313

Waymo is certainly interested in using LLMs/VLMs for this purpose.

https://waymo.com/research/emma/

https://waymo.com/blog/2024/10/introducing-emma

https://waymo.com/blog/2025/12/demonstrably-safe-ai-for-auto...

coder543 · 2026-01-15T23:30:15 1768519815

Another one is Soprano-1.1.

It seems like it is being trained by one person, and it is surprisingly natural for such a small model.

I remember when TTS always meant the most robotic, barely comprehensible voices.

https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...

https://huggingface.co/ekwek/Soprano-1.1-80M

coder543 · 2026-01-15T16:08:51 1768493331

Not that I've heard. I searched and I see nothing. Where has Intel said they are winding down chip fabrication?

bilekas · 2026-01-15T16:21:14 1768494074

I'm thinking back to over the summer, they were reducing their work force and changing the previous CEO's direction.

https://www.manufacturingdive.com/news/intel-layoffs-25-perc...

etempleton · 2026-01-15T16:28:44 1768494524

In some areas they may be shifting resources. But a lot has happened since last summer. They have received some cash infusions and 18a is in full production with yields, apparently, at acceptable levels. Rumors are Apple has already signed on.

bilekas · 2026-01-15T16:35:01 1768494901

Ah okay, that's good news actually, would love to see Intel growing more. I know I'm rooting for their GPU range too.

BeetleB · 2026-01-15T16:32:43 1768494763

New CEO said he'll continue with Foundry provided he gets significant customers to justify the cost. In a recent comment/press release, Intel said they are continuing production on 14A. Ergo, they have external customers (or Trump is bullying him into it, but I suspect it's mostly the former).