
The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.

Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.

Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.

I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.



Excited to test this out on our side as well. We recently built an OCR benchmarking framework specifically for VLMs[1][2], so we'll do a test run today.

From our last benchmark run, some of these numbers from Mistral seem a little optimistic. A side-by-side of a few models (our benchmark's measured accuracy vs. the accuracy Mistral reported for the same model):

model  | omni | mistral
gemini | 86%  | 89%
azure  | 85%  | 89%
gpt-4o | 75%  | 89%
google | 68%  | 83%

Currently adding the Mistral API and we'll get results out today!

[1] https://github.com/getomni-ai/benchmark

[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
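
For context on how a number like 86% typically gets computed, here's a minimal sketch of character-level accuracy via edit distance. This is just one common way to score OCR output, not necessarily the exact metric our benchmark uses:

    # Minimal sketch: character-level OCR accuracy as 1 - normalized edit
    # distance. One common way to score OCR output; not necessarily the
    # exact metric the benchmark linked above uses.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def char_accuracy(predicted: str, ground_truth: str) -> float:
        if not ground_truth:
            return 1.0 if not predicted else 0.0
        dist = levenshtein(predicted, ground_truth)
        return max(0.0, 1.0 - dist / len(ground_truth))

    print(char_accuracy("Tota1 revenue: $1,234", "Total revenue: $1,234"))  # ~0.95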


Update: Just ran our benchmark on the Mistral model and the results are... surprisingly bad?

Mistral OCR:

- 72.2% accuracy
- $1/1000 pages
- 5.42s / page

Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure: charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002), whereas the other VLMs are able to interpret those images into a text representation.
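
A quick way to see how much of a document gets dropped this way, assuming the output is markdown with placeholders in the [image](image_002) style shown above (the sample document string is made up):

    # Rough sketch: count image placeholders vs. remaining text in a
    # markdown OCR output. Assumes placeholders look like [image](image_002);
    # the sample document below is made up.
    import re

    PLACEHOLDER = re.compile(r"!?\[image\]\(image_\d+\)")

    def placeholder_stats(markdown: str) -> dict:
        placeholders = PLACEHOLDER.findall(markdown)
        remaining = PLACEHOLDER.sub("", markdown).strip()
        return {"placeholders": len(placeholders), "remaining_chars": len(remaining)}

    print(placeholder_stats("# Q3 Report\n[image](image_002)\nRevenue grew 12%."))
    # {'placeholders': 1, 'remaining_chars': 30}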

https://github.com/getomni-ai/benchmark

https://huggingface.co/datasets/getomni-ai/ocr-benchmark

https://getomni.ai/ocr-benchmark


By optimistic, do you mean 'tweaked'? :)


At my client we want to provide an AI that can retrieve relevant information from documentation (home-building business; the documents detail how to install a solar panel, a shower, etc.). We've set up an entire system with benchmarks, agents, and so on, yet the bottleneck is OCR!

We have millions and millions of pages of documents, and even a 1% OCR error rate compounds with the AI's own errors, which compound with the documentation itself sometimes being incorrect. That leaves the whole thing nowhere near production ready (and indeed the project has never been released).

We simply cannot afford to give our customers incorrect information.

We have set up a back-office app: when users ask questions, the question is sent to our workers along with the response given by our AI application, so a person can review it and ideally correct the OCR output.

Honestly, after a year of working on this, it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable for anything beyond basic throwaway tasks.


As someone who has had a home built (and nearly all my friends and acquaintances report the same thing): a 1% error rate on information in this business would mean not a 10x but a 50x improvement over current practice in the field.

If nobody supervised the building documents all the time during the process, every house would be a pile of rubbish. And even when you do, stuff still creeps in and has to be redone, often more than once.


I have done OCR on leases. It’s hard. You have to be accurate and they all have bespoke formatting.

It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.

The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.


I'd love to try it for my domain (regulation), but $1/1000 pages is significantly more expensive than my current local Docling-based setup, which already does a great job of processing PDFs for my needs.


I think for regulated / high-impact fields $1/1000 pages is well worth the price; if the accuracy is close to 100%, this is way better than using people, who are still error-prone.


It could very well be worth the price, but it still needs to justify the cost increase over an already locally running solution that is nearly free to operate.

I will still check it out, but given the performance I already have for my specific use case with my current system, my upfront expectation is that it probably will not make it to production.

I'm sure there are other applications for which this could be a true enabler.

I am also biased to using as little SaaS as possible. I prefer services on-prem and under my control where possible.

I do use GPT-4o for now as, again, for my use case, it significantly outperformed other local solutions I tried.


Re: real-world implications, LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise (especially in domains like medical or legal).

IMO there's still a large gap for businesses in going from raw OCR outputs -> document processing deployed in prod for mission-critical use cases.

e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
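
To make that concrete, here's a minimal sketch of the shape such a pipeline tends to take; every function and the threshold below are illustrative stubs, not any particular product's API:

    # Minimal sketch of a classify -> split -> extract pipeline with a
    # human-in-the-loop step. Everything here is an illustrative placeholder
    # (stub functions, threshold), not a real product API.

    CONFIDENCE_THRESHOLD = 0.9  # tuned per use case

    def classify(pages):                 # stub: would call a classifier model
        return "invoice"

    def split(pages, doc_type):          # stub: would split into logical sections
        return [pages]

    def extract(section, doc_type):      # stub: would call an extraction model
        return {"total": "1,234.00"}, 0.82

    def send_to_human_review(section, fields):  # stub: queue for manual correction
        print("flagged for review:", fields)
        return fields

    def process_document(pages):
        results = []
        doc_type = classify(pages)
        for section in split(pages, doc_type):
            fields, confidence = extract(section, doc_type)
            if confidence < CONFIDENCE_THRESHOLD:
                fields = send_to_human_review(section, fields)
            results.append({"type": doc_type, "fields": fields})
        return results

    print(process_document(["page 1 text", "page 2 text"]))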

But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.app/)


> Has anyone tried this on specialized domains like medical or legal documents?

I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.


$1 for 1000 pages seems high to me. Doing a Google search:

Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour

I just don't know if in one hour with an A100 I can process more than 1000 pages. I'm guessing yes.
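
Rough back-of-the-envelope (taking the $1.35/hour figure above at face value, and ignoring engineering time and whether the weights are even available):

    # Back-of-the-envelope: what throughput does a rented A100 need to
    # match Mistral's $1 per 1,000 pages? GPU price from the search above;
    # everything else here is an assumption.
    gpu_cost_per_hour = 1.35          # $/hour, A100 80GB rental
    mistral_cost_per_page = 1 / 1000  # $1 per 1,000 pages

    breakeven_pages_per_hour = gpu_cost_per_hour / mistral_cost_per_page
    seconds_per_page = 3600 / breakeven_pages_per_hour

    print(f"break-even: {breakeven_pages_per_hour:.0f} pages/hour "
          f"(~{seconds_per_page:.1f}s per page)")
    # break-even: 1350 pages/hour (~2.7s per page)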


Is the model open source / open weights? Otherwise the cost is for the model, not the GPU.


Also interesting to see that parts of the training infrastructure used to create frontier models are themselves being monetized.


What do you mean by "free"? Using the OpenAI vision API, for example, for OCR is quite a bit more expensive than $1/1k pages.


> 94.89% overall accuracy

There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?
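
Back-of-the-envelope, assuming the 94.89% is per character (the announcement doesn't say what it's measured over):

    # If 94.89% accuracy is per character (an assumption; the unit isn't
    # specified), an average ~47-character sentence would contain:
    chars_per_sentence = 47
    error_rate = 1 - 0.9489
    print(chars_per_sentence * error_rate)  # ~2.4 expected character errors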


We'll just stick an LLM-gateway LLM in front of all the specialized LLMs. MicroLLMs Architecture.


I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.

Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.

The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.

The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
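
A minimal sketch of what that routing layer could look like; the registry, message shape, and handlers are all made up for illustration, and in practice the router itself would be a small classifier or LLM call:

    # Minimal sketch of a "router LLM" in front of specialized models.
    # The registry, Task shape, and handlers are illustrative placeholders.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Task:
        kind: str       # "ocr", "code", "general", ...
        payload: str

    # Standardized inputs/outputs are the "prompt bus" idea: every
    # specialist takes a Task and returns plain text.
    REGISTRY: dict[str, Callable[[Task], str]] = {
        "ocr":     lambda t: f"[ocr specialist would transcribe: {t.payload!r}]",
        "code":    lambda t: f"[code specialist would answer: {t.payload!r}]",
        "general": lambda t: f"[general model would answer: {t.payload!r}]",
    }

    def route(task: Task) -> str:
        handler = REGISTRY.get(task.kind, REGISTRY["general"])
        return handler(task)

    print(route(Task(kind="ocr", payload="scanned_invoice.png")))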

Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.


This is already done with agents. Some agents only have tools and the one model; some agents will orchestrate with other LLMs to handle more advanced use cases. It's a pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models, each with its own context, and give them a supervisor model, just like how humans organize themselves in real life.


I'm doing this personally for my own project: essentially building an agent graph that starts with the image output, orients and cleans it, does a first pass with the tesseract LSTM best models to create PDF/hOCR/ALTO, then passes the result to other LLMs and models based on their strengths to further refine towards markdown and LaTeX. My goal is less about RAG database population and more about preserving, in a non-manually-typeset form, the structure, data, and analysis. There seems to be pretty limited tooling out there, since the goal generally seems to be the obviously immediately commercial one of producing RAG-amenable forms that defer the "heavy" side of chart/graphic/tabular reproduction to a future time.
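
For the tesseract first pass, roughly something like this (pytesseract to get hOCR; the downstream refine step and the file path are just placeholders for whatever model you hand it to):

    # Sketch of the first pass only: tesseract (LSTM engine) -> hOCR, which
    # then gets handed to downstream models. refine_with_llm and the image
    # path are placeholders.
    import pytesseract
    from PIL import Image

    def first_pass(image_path: str) -> bytes:
        img = Image.open(image_path)
        # --oem 1 selects the LSTM engine; hOCR keeps layout and word boxes
        return pytesseract.image_to_pdf_or_hocr(img, extension="hocr",
                                                config="--oem 1")

    def refine_with_llm(hocr: bytes) -> str:
        # Placeholder: this is where a VLM/LLM would turn hOCR plus the page
        # image into markdown/LaTeX while preserving structure.
        return hocr.decode("utf-8")

    hocr = first_pass("scanned_page.png")  # placeholder path
    print(refine_with_llm(hocr)[:500])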


Take a look at MCP, Model Context Protocol.



