Why LLMs still have problems with OCR (runpulse.com)
217 points by ritvikpandey21 4 days ago | 145 comments





I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (i.e. "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.


This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.


It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.

> It does seem

And therein lies all the problem. The verification required for serious work is likely orders of magnitude more than anybody is willing to spend on.

For example, professional OCR companies have large teams of reviewers who double or triple review everything, and that is after the software itself flags recognition with varying degrees of certainty. I don't think companies are thinking of LLMs as tools that require that level of dedication and resources, in virtually all larger scale use cases.


This seems to be exactly the business model of myriad recent YC startups. It seemingly did work for casetext as an example.

In some cases this is true, but then why choose an expensive world model over a small net or random forest you trained specifically for the task at hand?

we completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed source frontier models. im excited to see where that field progresses

You can literally ask it to mark characters that are not clear instead of inferring them.

Unit tests work remarkably well

> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever, we're not suddenly caught up in a new thing here.


While humans can and do make mistakes, it seems to me like there is a larger problem here that LLMs make mistakes for different reasons than humans and that those reasons make them much worse than humans at certain types of problems (e.g. OCR). Worse, this weakness might be fundamental to LLM design rather than something that can be fixed by just LLM-ing harder.

I think a lot of this gets lost in the discussion because people insist on using terminology that anthropomorphizes LLMs to make their mistakes sound human. So LLMs are "hallucinating" rather than having faulty output because their lossy, probabilistic model fundamentally doesn't actually "understand" what's being asked of it the way a human would.


This is what a lot of people miss. We have thousands of years of understanding the kinds of mistakes that humans make; we only have months to years of experience with the mistakes that LLMs and other AIs make.

This means that most of our verification and testing processes won't inherently catch AI errors because they're designed to catch human errors. Things like "check to see if the two sides of these transactions sum to 0" are fine for human typos, but they won't catch a fake (yet accurately entered) transaction.

It's similar to a language barrier. You don't realize how much you rely on context clues until you spend 3 days of emails trying to communicate a complex topic to someone in their second language.


> we only have months to years of experience with the mistakes that LLMs and other AIs make.

The mistakes are also very much model dependent. That you have built a system which improves the accuracy of one model's output gives you no confidence that it will work on even the next generation of the same model.


Human beings also have an ability to doubt their own abilities and understanding. In the case of transcription, if someone has doubts about what they are transcribing they'll try and seek clarity instead of just making something up. They also have the capacity to know if something is too important to screw up and adjust their behaviour appropriately.

I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing characters could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.

However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.

[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p


> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.


I often think that LLM issues like this could be solved by a final pass of "is the information in this image the same as this text" (ie generally a verification pass).

It might be that you would want to use a different model (non-generative) for that last pass -- which is like the 'array of experts' type approach. Or comparing to your human analogy, like reading back the list to your partner before you leave for the shops.
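A minimal sketch of that read-back pass, assuming the OpenAI Python client; the model name, prompt wording, and YES/NO convention are placeholders, not a recommendation:

  # Hypothetical verification pass: re-send the image together with the
  # extracted text and ask a (possibly different) model whether they match.
  import base64
  from openai import OpenAI

  client = OpenAI()

  def verify_extraction(image_path: str, extracted_text: str) -> bool:
      with open(image_path, "rb") as f:
          b64 = base64.b64encode(f.read()).decode()
      resp = client.chat.completions.create(
          model="gpt-4o",  # placeholder; ideally not the model that did the extraction
          temperature=0,
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text",
                   "text": "Does this text contain exactly the information in the image, "
                           "nothing added or changed? Answer YES or NO.\n\n" + extracted_text},
                  {"type": "image_url",
                   "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
              ],
          }],
      )
      return resp.choices[0].message.content.strip().upper().startswith("YES")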


Isn’t that just following through on “from how it was reached”? Without any of that additional information, if the LLM gave the same result, we should consider it the product of hallucination

On your epistemology, if you correctly guess the outcome of a random event then the statement, even if contingent on an event that did not yet occur, is still true. The same goes for every incorrect guess.

If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.

You then conflate the usefulness of some guessing skill with logic and statistics.

How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.

How often you are right, i.e. the scalar value you attach to the statistical outcome of an event, is very important. If your claim is that LLMs are wrong 50% of the time then you need to update your priors based on some actual experience.


To consider: do we overestimate what we know about how we humans reach an answer? (Humans are very capable of intuitively reading scrambled text, for example, as long as the beginning and ending of each word remains correct.)

The correct way to handle it is to ask the user if it's not clear, like a real assistant would

I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.

yup, this is a pretty common occurrence in using LLMs for data extraction. For personal use (trying to load a receipt) it’s great that the LLM filled in info. For production systems which need high quality, near 100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc. These hallucinations occur quite frequently, and we haven’t found a way to minimize this through prompt eng.

It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414


Maybe ask it to return the bounding box of every glyph.

This universally fails, on anything from frontier models to Gemini 2.0 Flash in its custom fine-tuned bounding box extraction mode.

Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.

My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.

Odd timing, too, given the Flash 2.0 release and its performance on this problem.


I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config.) and it made up ingredients and messed up the quantities.

Sounds like a xerox.

I'm somewhat surprised neither this article nor the previous one mention anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also easily fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.


In case any scientist actually working on adaptive OCR is reading this, I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality, compute demands are not an issue, I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.


I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
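For reference, the call pattern was roughly this (PaddleOCR 2.x-style API; the interface has changed between releases, so treat it as a sketch):

  from paddleocr import PaddleOCR

  # the angle classifier helps with rotated or skewed labels
  ocr = PaddleOCR(use_angle_cls=True, lang="en")
  result = ocr.ocr("parcel_label.png", cls=True)

  for box, (text, confidence) in result[0]:
      print(f"{confidence:.2f}  {text}")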


When I was doing OCR for some screenshots last year I managed to get it done with tesseract, but just barely. When looking for alternatives later on I found something called Surya on github which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?

Surya is on par with cloud vision offerings.

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

Are you targeting business or consumers?

I cannot find the pricing page.


our current customers are both enterprises and individuals.

pricing page is here https://www.runpulse.com/pricing-studio-pulse


Not currently looking for this, but can I just say thank you for being open and direct with your prices. So useful to just be able to look.

How are you on 18th-19th century cursive, English language? Do you have a guarantee for the number of errors?


Not OP, but you might be looking for https://www.transkribus.org/

thanks! re: 18-19th century cursive, while we handle historical handwriting, we can't guarantee specific error rates. each document's accuracy varies based on condition, writing style, and preservation. happy to run test samples to check.

feel free to send over sample docs: sid [at] trypulse [dot] ai


No API pricing available?

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

I would love to get access to that archive!

If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to it then, let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...


99%+ is terrible in the OCR world. 99.8%+ on first pass, and 99.99%+ (1/10k characters error) at the end of the process - which includes human reviewers in the loop - is ok, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.

  99.8%+ on first pass
Even with the best OCR, and high resolution scans, you might not get this due to:

- the quality of the original paper documents, and

- the language

I have non-English documents for which I'd love to have 99% accuracy!


Language is often solvable by better dictionaries. I have been forced to make my own dictionaries in the past, which led to similar error rates as more mainstream languages like English. If you are talking about another alphabet like Cyrillic or Arabic etc, that is another problem.

Yup, I'm talking about another alphabet.

Ha hi Jeswin! I was itching to reply to this post too, I wonder why…

Dave! Our sample sizes were large enough, and tables complex enough to opine on this.

I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.


Using that image and the following prompt on Gemini 2.0 Flash "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible" outputs something that isn't bad but not perfect:

  PI tno Name             Time            3.5 km   18 C   (cont.)
  MEN B (39)                                                  3(34)         4(52)         5(53)         6(54)         7(55)         
  8(40)         9(57)
                                                               12(60)        13(61)        14(62)        15(63)        16(47)        
  17(48)       18(100)
                                                  1(51)         2(33)
                                                  10(58)        11(59)
The first column is offset vertically which mixes up information and is wrong.

I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:

  PI      tno Name                       Time
  
  MEN   B (39)                                                         3.5 km       18 C          (cont.)
                                                   1 (51)                  2 (33)                 3 (34)                  4 (52)                  5   (53)                  6 (54)                  7 (55)                  8 (40)                 9 (57)
                                                  10 (58)                 11 (59)                12 (60)                 13 (61)                 14 (62)                 15 (63)                16 (47)                 17 (48)                18 (100)
                                                  Finish
  
  13     425  Peter  Hodkinson          11:40       0:48   +0: 06 (21)      1:29  +0: 13 (28)      1:58   +0: 13 (24)      2:44   +0: 18 (23)      3:38   +0: 20 (19)     4:28    +0: 22 (18)     5:05   +0: 23 (17)      5:36   +0: 26 (17)      6:19   +0: 29 (19)
              Great  Britain                        0:48   +0: 06 (21)      0:41  +0: 09 (30)      0:29   +0: 01 (4)       0:46   +0: 07 (22)      0:54   +0: 02 (5)      0:50    +0: 03 (7)      0:37   +0: 02 (10)      0:31   +0: 03 (11)      0:43   +0: 05 (20)
                                                    6:47   +0: 28 (17)     7:02   +0: 29 (17)      8:21   +0: 38 (16)      8:41   +0: 39 (16)      9:00   +0: 41 (16)     9:13    +0: 42 (16)     9:43   +0: 42 (16)     10:36   +0: 43 (14)     11:32   +0: 41 (13)
                                                    0:28   +0: 02 (8)      0:15   +0: 01 (4)       1:19   +0: 11 (16)      0:20   +0: 03 (15)      0:19   +0: 02 (4)      0:13   +0: 02 (11)      0:30   +0: 01 (2)       0:53   +0: 01 (3)       0:56    0:00  (1)
                                                   11:40   +0: 40 (13)
                                                    0:08   +0: 00 (8)

(edit: line wrap messes it all up... still I think my version is better ;-)

I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.


As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605


FTA:

> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.


Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959


It seemed you were implying the article was naive to the earlier post, whereas the OP poses itself as a rebuttal. Perhaps a fault of my inference.

That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.


I assume you mean the title of the current thread? I've attempted to make it less baity now.

Indeed, the new title is far better. Thanks!

(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR. They’re not mutually exclusive. VLMs are adaptable with prompting, eg please pull out the entries related to CapEx and summarize the contributions. Layout parsers and OCR are often used for indexing and document automation. Each will have their own place in an enterprise stack.


This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.
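The verifier itself could be as simple as an exact-match reward; a sketch below, with the RL trainer plumbing omitted (names are illustrative, not from DeepSeek's code):

  import json

  def ocr_reward(model_output: str, ground_truth: dict) -> float:
      """Reward is 1.0 only if the model's structured output matches the
      rendered page's ground truth exactly; malformed JSON or any mismatch gets 0."""
      try:
          predicted = json.loads(model_output)
      except json.JSONDecodeError:
          return 0.0
      return 1.0 if predicted == ground_truth else 0.0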

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning?

not an expert - yes, what would usually just be called training, with LLMs here is called RL. You do end up writing a sort of a reward function, so I guess it is RL.

You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.

ChatGPT is also still hilariously bad at drawing diagrams - universally producing a silly cartoon with misspelled words. The rate of improvement over the past two years is effectively zero.

Why would you use ChatGPT to draw diagrams? It’s a generative language model. Just because you can doesn’t mean it’s the best tool for the job.

Why not include some suggested tools in your comment? "you are an idiot for using ChatGPT" (paraphrased) isn't very helpful.

That's DALL-E 3, which is not an LLM

Good point. I probably knew that at one time but now leverage it via chatgpt so forgot. Does anyone know if there is an AI wall with text to image?

I’m quite late here, but if you want a diagram you can ask the LLM to output Mermaid syntax and then paste that into Excalidraw or something else that can render based on Mermaid.

I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and that the positional embeddings only embed the locations of patches, which are 16x16 pixels.

My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why we should want the layout of these memory slots to mimic the layout of the document, except at the very first layer, because then we don't have to think too hard about how to encode the document.


I'm making a simple service that outputs layout-following ASCII from images, PDFs of images or text PDFs. I too think the risk of hallucination is in many cases too great.

I fed my system the first image in the post [0] and got the text below in return.

I will be looking for beta testers next week... Email if interested!

  VEH  YR  MAKE  MODEL        IDENTIFICATION   TYPE SYM  ST  TER USE  CLASS ALARM
     2 02  HOND  CIVIC EX     1HGEM22952L086006 PP    18  IL  37  L  887120
     LOSS PAYEE THAT APPLIES:     2
     3.02  HYUN  SONATA / GL  KMHWF25S72A671544 PP   16   IL  37  P  887120

  H    NO. COVERAGE DESCRIPTION   LIABILITY LIMIT (S) DEDUCTIBLE          PREMIUM
      2   Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor.-BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 141.00
           Other than Collision                            $ 250           $ 92.00
                                                     TOTAL FOR UNIT   2   $ 443.00
     3-  Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor. BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 136.00
           Other than Collision                            $ 250           $ 90.00
                                                     TOTAL FOR UNIT   3   $ 436.00
  
  DRIVER  INFORMATION
  DR VEH  SEX MAR   BIRTH  G / S PRIN     DVR LIC NO.    NAME                  PTS
[0] https://i.imgur.com/sLWQoFG.jpeg

Wasn’t seeing what OCR stands for; I believe it’s Optical Character Recognition.

>Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

>Xerox scanners/photocopiers randomly alter numbers in scanned documents


Shouldn't it be easy to generate a lot of OCR data? Generate HTML, randomize, generate image, apply noise and let it train on it.
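A toy version of that pipeline, assuming imgkit for the HTML-to-image step (one of several options) and arbitrary noise parameters:

  import random
  import imgkit                      # wraps wkhtmltoimage
  import numpy as np
  from PIL import Image, ImageFilter

  def make_sample(i: int, words: list[str]) -> tuple[str, str]:
      # the ground-truth label is whatever text we put into the template ourselves
      text = " ".join(random.choices(words, k=40))
      html = f"<body style='font-size:{random.randint(10, 18)}px'><p>{text}</p></body>"
      path = f"sample_{i}.png"
      imgkit.from_string(html, path)                                        # HTML -> PNG
      img = Image.open(path).convert("L")
      img = img.rotate(random.uniform(-2, 2), expand=True, fillcolor=255)   # slight skew
      arr = np.array(img, dtype=np.int16)
      arr += np.random.normal(0, 8, arr.shape).astype(np.int16)             # sensor noise
      img = Image.fromarray(arr.clip(0, 255).astype("uint8"))
      img.filter(ImageFilter.GaussianBlur(0.5)).save(path)                  # mild blur
      return path, text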

Yes, but if you aren't careful you will end up with a model carefully tuned for the ways that you add noise, not all the types of noise in the real world. But stuff like this can be very useful for some base training, especially if you add many real-world examples afterwards.

Question to anyone with experience in this domain: I have CSAM spam problems on a forum I host, with bots putting link shortener URLs embedded in images rather than the post body. Traditional OCR software deals poorly with them due to font modifications and intentional text edge modifications, and I'm obviously not gonna use a SaaS/closed source model to upload a bunch of may-be-may-not-be-CSAM pictures, so looking for a way to do this locally, with cheapish inference if possible (I don't mind spending a minute of compute to get the result out for one image, but need to do it on the CPU).

Is there any small model that would do this effectively, with pure text extraction (without going for any kind of formatting or whatnot)?


Yup, Moondream is great for this use case! You can use locally with the quickstart: https://docs.moondream.ai/

It is a 2b vision model that runs anywhere and can object detect, point, query, and more.
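A rough sketch of local usage via Hugging Face transformers is below; the docs linked above are authoritative, since the moondream2 API has changed across revisions:

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from PIL import Image

  model_id = "vikhyatk/moondream2"
  model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  image = Image.open("suspicious_post_image.png")
  enc = model.encode_image(image)
  # pure text extraction, no formatting
  print(model.answer_question(enc, "Transcribe all text in this image.", tokenizer))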


Have you looked at https://moondream.ai/?

I noticed llama 3.2 8b has big problems reading white on black text. Black on white goes way better. But I think it makes sense. They don't look at text like a dedicated OCR algorithm. I see the article elaborates on this very well.

thanks for the feedback!

This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who is working in the field by the fact they don't use VLMs for OCR and who isn't because they think it's a solved problem.

A question to the authors.

Do you have resources to train any VLMs from scratch? They aren't quite at the level of the SOTA LLMs, and I think they can be made a lot more useful with:

1). Better training data.

2). Larger vision parts of the model.

In short: 2d attention is not something that anyone's doing at scale - that I know of - and is a no brainer for understanding images.


appreciate the feedback. completely agree, tuning and/or training a VLM will definitely produce better ocr extractions. however, it’s notoriously hard to accumulate a really good ground truth labeled dataset of pdf/excel/pptx. there are some resources online especially for tables, with IBM’s labeled table dataset for example. however, we’d guess the same hallucination issues will persist on complex layouts

You can generate the data synthetically.

We never had the budget to do it but I do have some notes somewhere on a 2d context free grammar to generate arbitrarily nested rows/columns and a css styling that got applied to the xhtml output of the grammar. It dynamically generated as much high quality synthetic data as you wanted - but the IBM and similar data sets were plenty big enough for what we could do even on specialist models.

It depends on what you're doing really. I thought that we'd done pretty well, then someone on HN reached out with a table that spanned 50 pages and I just gave up.

Feel free to drop an email if you'd like a quick chat. I find the state of table models particularly abysmal for how important they are.


I find that LLMs can read text off product label photos I can't even read myself.

If you don't know what the text says, do you have access to some other form of ground truth? Because otherwise you don't know if they're reading illegible labels correctly!

I can know what the text says because I have the actual product available :) but you are right, if the LLM can't read it, it will probably fill in the gap with hallucinations

yes they usually can! we delved into the mathematics behind this a bit in the blog, but tldr the LLMs are making educated guesses based on the embedding similarities - which can be detrimental for ocr systems.

I've had limited but good experience (with both English and French text) with Tesseract, then getting ChatGPT to fix problems with clever prompting (e.g., pretend you are an expert OCR corrector, blah blah, blah).
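The rough shape of that flow (model name and prompt wording are just examples, not a recommendation):

  import pytesseract
  from PIL import Image
  from openai import OpenAI

  # first pass: plain Tesseract, English + French
  raw = pytesseract.image_to_string(Image.open("page.png"), lang="eng+fra")

  # second pass: LLM cleanup with a narrow instruction
  client = OpenAI()
  fixed = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder
      temperature=0,
      messages=[
          {"role": "system",
           "content": "You are an expert OCR corrector. Fix obvious recognition errors "
                      "only; do not add, remove, or reorder content."},
          {"role": "user", "content": raw},
      ],
  ).choices[0].message.content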

for most (text-dense) documents without much layout differences, these small prompt eng tricks work pretty well! scaling this to complex layouts and 1000+ page docs, we found the models don’t stick to their instructions. perhaps there’s some work to be done with 1M+ context length models so they don’t lose layout memory.

Do any models use some sort of context pruning to keep the [most] relevant parts of the context?

What single documents are you processing that are 1000+ pages?


Is processing one page at a time not feasible? I'm always chunking things as small as possible for LLMs

I'd just like to say this is a fantastic "marketing" blog post. Great explanation of an interesting problem, that this company theoretically helps solve. Very well done!

One note - there was a callout at the end to "stay tuned" for a follow-up post about the actual solution. I may have missed it, but I don't see any way to actually sign up to the blog or newsletter or anything. That's a shame - I'd love to follow this topic and product (and potentially have a few real-world use cases for it).


thanks for the kind words! we do have a mailing list for current users, I can add you to that

Thank you! Saw you contacted me, that's great. I'm planning to study your product more during the week to see if it's a fit for something we're building. :)

I was just trying a bunch of models for OCR. I only have 4 GB of VRAM in my personal machine.

My goal was to run an OCR model locally and extract text from scanned PDFs.

Many models could not even be run. Among those that did run, thanks to Ollama, provided very poor experience. Like llava-llama3, phi3.5 vision, etc.

What worked really well, but still not up to the mark- Surya [0].

It works perfectly on screenshots from true text PDFs, but not from scanned PDFs. Also has much better performance for English than Indian languages.

[0]: https://github.com/VikParuchuri/surya


yup, the models you tried out require a lot of work to be able to run efficiently. additionally for actual decent ocr it’ll require very high quality document datasets (of PDF/excel/pptx). scanned documents are especially hard and cause a lot of issues for LLMs, which start making up info a lot of the time.

It's good and useful to see empirical analyses like this. I use open & custom VLMs a lot. The point of VLMs is that OCR is not needed anymore: it's intrinsic to the model. For instance at work we've developed a family of vision-based RAG systems, and their performance is twice that of a text-based one. The point I'd like to make here is that OCR is an intermediate step that is not explicitly needed anymore, in many cases. My hunch is that pure OCR will go away.

I tried the square example from the paper mentioned with o1-pro and it had no problem counting 4 nested squares…

And the 5 square variation as well.

So perhaps it is just a question of how much compute you are willing to throw at it


yea seems like o1-pro was able to solve a few of those variations in the paper we referenced. take a look at the rest of the examples and let me know! the paper’s (somewhat) old at this point, but if you try our more complex square variations with overlapping edges and varying line thicknesses, the same issues arise. although I generally agree a mind boggling amount of compute will increase accuracy for sure.

Check out mathpix.com - we have a hybrid approach towards OCR that features accurate layout understanding (with accurate bounding boxes) plus accurate OCR outputs.

Disclaimer: I'm the founder and CEO.


LLMs do not struggle at all with raw text: they never lose decimal places or drop digits when transcribing a table from raw text. So the problem is not the internal representation. I do this all the time and all major LLMs work eminently well at it.

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.


on raw text, LLM’s usually do not struggle. however, when you start processing low-fidelity images (receipt scans with stains, documents with marks all over it, bent corners/areas, rotated docs) these transcription issues become extremely noticeable. to your point about table extraction, i disagree — we’ve had many examples on complex nested tables where the model hallucinated digits, especially from documents with weird aspect ratios.

fully agree on the last point, the vit architecture will need some working on for this — microsoft’s been doing some excellent research on this lately


If you have raw text you don’t need OCR.

There's lots of hidden gotchas to this. Uploading a screenshot and asking an LLM to transcribe one page is generally ok. Give it a table that spans pages, or a 60 page doc, and you're in dire straits.

I cofounded DocuPanda to handle this issue specifically. Call me biased, but I do believe it's the best solution out there.


Is there an OCR arena out there, similar to lmarena? Would be very useful but couldn't find one yet.

>>When an LLM processes a document image, it first embeds it into a high-dimensional vector space through the attention mechanism…

This is a confusing way to describe attention and gets a bit off topic, the attention mechanism is not really what’s causing any of the issues in the article.


You don’t really feed images to LLMs, rather to a vision model within the multi modal llm

yup, important clarification! the language portion of the model also works with the extraction however, and is prone to the hallucinations

I've worked in data extraction from documents for a decade and have developed algorithms in the space. I've developed a product using LLMs for this purpose too.

This article is essentially correct.


thanks, glad to hear it.

We use Claude 3.5 sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).

I just tried the rectangle test on 4o and it answered correctly.

LLMs seem to be really good at audio speech to text though. One would naively think these are similar problems, but apparently not?

Ripcord demo'd their stack to me yesterday and the use of LLMs works great for OCR, so it is indeed possible.

I use ChatGPT to convert tables in PNGs and PDFs to pandas data frames and it works very well

Resume parsing has been a problem for decades,

and even today it can never be done right

because SOME resumes are just so f** up.


Is this just a training issue? They just need to train a model specifically for OCR?

we don’t think so - we’ve fine tuned most of the SOTA language models available today on table datasets, documents with complex layouts, and while they do perform better, seems like they’re still prone to the same hallucinations. these frontier models have pretty much already been trained on most of the internet at this point, and tons of publicly available documents.

How does your solution compare to AWS textract?

They probably do this already. But the problem is more fundamental: there are simply no process guarantees or guardrails inside a generative model to constrain the failure modes.

To be fair, they would say that due to the fact they are selling a competing thing.

To me, the question is why we keep using PDFs that never get printed?

s/LLMs/VLMs/g

A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.
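As a toy illustration with the stock transformers CLIP classes, you can score candidate transcriptions against an image directly (model choice and candidates are arbitrary):

  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  image = Image.open("receipt.png")
  candidates = ["total $14.30", "total $74.30"]

  inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
  with torch.no_grad():
      out = model(**inputs)
  # cosine-similarity-derived logits; softmax gives relative confidence over candidates
  print(out.logits_per_image.softmax(dim=-1))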

OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.

Oh and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.
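Concretely, something like this with the OpenAI API (model name and the confidence threshold are arbitrary):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o",      # placeholder
      temperature=0,       # greedy-ish decoding: same input -> (mostly) the same output
      logprobs=True,
      top_logprobs=3,      # also expose runner-up tokens
      messages=[{"role": "user", "content": "Transcribe the attached document exactly."}],
  )
  for tok in resp.choices[0].logprobs.content:
      if tok.logprob < -1.0:                     # arbitrary "low confidence" cutoff
          print("uncertain token:", tok.token, tok.logprob)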

And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, production LLMs only really put probability fields around tokens that are legitimately valid for the task at hand (bounded by the LLM's intelligence, of course).

> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong. Consider the sequence "rn" versus "m". To a human reader scanning quickly, or an LLM processing image patches, these can appear nearly identical. The model, trained on vast amounts of natural language, will tend toward the statistically more common "m" when uncertain.

Again, LLMs don't just regurgitate the most "common" stuff. They are context specific. Besides, it's the vision module that would be making the differentiation here between rn and m. A vision module that is likely neither better nor worse than the vision modules traditional OCR systems are using. (Of course, the LLM may process the vision module's output and notice that perhaps it mis-transcribed "rn" vs "m" and "correct" it. But correct it based on _context_ not on some simplistic statistical model as suggested.)

> There’s a great paper from July 2024 (millennia ago in the world of AI) titled “Vision language models are blind” that emphasizes shockingly poor performance on visual tasks a 5 year old could do

Absolutely. I work in this field, and these vision models are not at the same level as their language counterparts. Due in large part to a lack of good data, good training processes, and good benchmarks. The Cambrian-1 paper is quite insightful here, as it studies the vision benchmarks themselves (https://arxiv.org/abs/2406.16860). The TLDR is that most of the vision benchmarks are actually just text benchmarks, and performance barely degrades when the model is blinded. I've found the same to be true of almost all publicly available training datasets for vision models, which is likely why these models don't learn good, robust visual understandings.

That doesn't really speak to the fundamental capabilities of the vision models. It speaks to the lack of training them well. So, if a model is explicitly trained to do OCR using lots of high quality ground truth data (which is easy to get and generate), then their performance can, and does, excel.

---

Now, all of that said, I also don't agree with the prior post this post is in response to. I work with VLMs a lot as part of my research, and I can assure you that they are nowhere near human level on OCR. They can exceed human performance in very specific tasks at the moment, but that's about it.

Are they better than other OCR offerings? As of this moment, I would tend to trust someone who does OCR for a living, so if Pulse says VLMs aren't as good as their solution, I would probably trust that over someone else saying VLMs work for their specific application. And VLMs _absolutely_ come with a myriad of caveats. They aren't as reliable as a more mechanical OCR system. Expect something like GPT4o to completely glitch 1 in every 10,000 queries. And expect them to be "weird". GPT4o will tend to not fully follow instructions maybe 1 in 100 times, so you might get your document back in the wrong format, or have "Sure, I can help with that!" at the start of your document, etc. Gemini tends to have better instruction following, but I don't have a good assessment of its reliability yet.

If I, personally, had a small project that needed OCR, I'd use Tesseract if it's just PDFs or something like that with printed text. If it's something with weird fonts, fancy stuff, handwriting, math formulas, etc. I might give Gemini a try. If it's mission critical, pay an expert to do it, whether that's in-house or paying a service explicitly built for the purpose.

---

NOTE: One thing that got glossed over in the article is that VLMs are not trained on the "embeddings" of the vision model, per se. CLIP processes the images as N number of tokens across L number of layers. At the end, you have N embeddings. For traditional CLIP, the last (or first) embedding is used as the result. Modern CLIPs average the embeddings together. Tomato, tomato.

VLMs are not trained on that single embedding from CLIP. The "head" gets stripped off, and the VLMs get trained on all N processed tokens from CLIP. So they have access to much more information. The vision models also get finetuned during the training of the VLM, and, importantly, CLIP architectures use skip connections throughout. So there is a direct path for the LLM to access pretty much anything from the vision model that it needs, and optimize for any information it needs.

The size of the embedded information given to the LLM, then, is almost about the same as the number of pixels from the source image. For example it might be something like a 384x384x3 image (442,368 dimensions) getting baked down into something like a 150,000 dimensional vector. So it's really not a fundamentally lossy process at that point.


Which VLM models have you found to be superior? A fine-tuned TrOCR is very good based on my experience

Written by someone who knows what they are talking about.

i dont understand. what have llms to do with ocr?

Some like gpt-4o are multi-modal.

the llm isn't multimodal. an llm can only process textual tokens. what should those tokens be for pictures. the llm gets fed a textual representation of what was optically recognized by another process. that's my understanding.

gpt-4o is multimodal. The o in it stands for omni.

https://news.ycombinator.com/item?id=40608269


thanks for the link. will have a look at it. if you tokenize tiles and then feed those serially to an llm. i really wouldn't know why someone thinks that's a good idea. you lose all local spatial context not to mention global context if the scan is produced at a slight angle. it's a really stupid idea. of course, provided enough computational power one might brute force a solution that works somewhat well.

February 6, 2024... okay grandpa

In the article, it references a paper from July 2024, weird...

fixed the year, good catch

very nice blogpost.

Really? I have been using 4o, and it's flawless at OCR

give it a shot with a few of the examples in the blog! or better yet, find some financial statements from Goldman/morgan Stanley and run it through the model.

Check the output again, there will be small mistakes if your text is large enough.

I used it once, was given a screenshot that contained a SHA1 hash and needed it in text. Maybe this is a case where ChatGPT can do a small task quickly for me and save me squinting?

It still fails on this today (the "bdbdffdf" part). Not allowed to share a chat with a picture it seems, my prompt was to upload the file below and "Image to text please.". Just the free 4o model, maybe the paid stuff is better.

https://postimg.cc/m1jNPL0j


Amusingly it tried to write a Python script to OCR it first, decided there were errors and tried to correct it.... it did correct some stuff and nearly got it, but I was able to spot an error 3/4 through with my eyeballs after a couple minutes.

https://i.imgur.com/UuO3JxM.png


Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here’s our take: ingestion is a multistep pipeline, and maintaining confidence from LLM nondeterministic outputs over millions of pages is a problem.


