constantinum's comments | Hacker News

There was a discussion on this benchmark (https://getomni.ai/ocr-benchmark) a couple of weeks ago here: https://news.ycombinator.com/item?id=43118514

Yeah, this new benchmark is partly inspired by those existing benchmarks; it focuses on the things they're missing with respect to automation.

I see a lot of comments on hallucination risk and the accumulation of non-traceable rotten data. If you want to try a non-LLM-based OCR, give LLMWhisperer a go: https://pg.llmwhisperer.unstract.com/


There is one with LangChain + Pydantic + LLMWhisperer: https://unstract.com/blog/comparing-approaches-for-using-llm...
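Not the exact code from that post, but a rough sketch of the same pattern: a Pydantic schema drives structured extraction through LangChain, with the layout-preserved OCR text passed in as plain text. The schema fields, file name, and model choice here are made up for illustration, and the OCR output is assumed to already be on disk.

    from pydantic import BaseModel, Field
    from langchain_openai import ChatOpenAI

    # Pydantic schema describing the fields we want back from the document text.
    class Invoice(BaseModel):
        invoice_number: str = Field(description="Invoice number as printed on the document")
        total_amount: float = Field(description="Grand total, numeric value only")
        due_date: str = Field(description="Payment due date, as written")

    # Layout-preserved text from the OCR step (LLMWhisperer or any other tool).
    raw_text = open("invoice_ocr.txt").read()

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    structured_llm = llm.with_structured_output(Invoice)

    result = structured_llm.invoke(
        "Extract the fields defined in the schema from this document text:\n\n" + raw_text
    )
    print(result)  # Invoice(invoice_number=..., total_amount=..., due_date=...)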


If you're looking for better accuracy and table layout preservation, give LLMWhisperer and Docling a try! Both keep tables tidy with a Markdown-like structure.


Tested it with the following documents:

* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.

* Edsger W. Dijkstra's handwritten notes (from the University of Texas archive) - Parsing is good.

* Badly (misaligned) scanned bill - Parsing is good. Observation: there is a name field, but it produced a similar name instead of the one in the bill. Hallucination?

* Investment fund factsheet - It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.

* Investment fund factsheet, complex tables - Bad extraction: it could not handle merged tables, and again there was whimsical elimination of rows and columns.

If you're curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

There's also Docling[2], which is handy for converting tables from PDFs into Markdown (a minimal usage sketch follows the links below). It does use Tesseract/EasyOCR under the hood, though, which can sometimes make the OCR results a bit less accurate.

[1] https://pg.llmwhisperer.unstract.com/

[2] https://github.com/DS4SD/docling
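For anyone who wants to try the Docling route, this is roughly what the basic conversion looks like (file name is illustrative; check the project docs for the current API, since it changes between versions):

    from docling.document_converter import DocumentConverter

    # Convert a PDF and export the parsed document (tables included) as Markdown.
    converter = DocumentConverter()
    result = converter.convert("fund_factsheet.pdf")
    print(result.document.export_to_markdown())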


FYI, you can choose which OCR engine Docling uses (from a handful of predefined choices) - it doesn’t have to be Tesseract.

https://ds4sd.github.io/docling/reference/pipeline_options/#...
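Roughly, selecting the engine looks like this (class names are from memory and may differ between Docling versions, so treat it as a sketch; file name is illustrative):

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Ask Docling to run OCR with the Tesseract CLI instead of the default EasyOCR.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractCliOcrOptions()

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert("scanned_bill.pdf")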


The primary issue with LLMs is hallucination, which can lead to incorrect data and flawed business decisions.

For example, LlamaParse (https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction but faces hallucination problems. See this issue for more details: https://github.com/run-llama/llama_parse/issues/420.

For those interested, try LLMWhisperer (https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs, eliminates hallucination issues, and preserves the input document layout for better context.

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf


> try LLMWhisperer (https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs

The website you linked says it uses LLMs?


The tool doesn't use any LLMs for processing/parsing the data. It parses the document and converts it into raw text.

The final output (raw text) of the parsing is then fed to LLMs for data extraction, e.g. extracting data from insurance, banking, and invoice documents. A minimal sketch of that second step is below.
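As a concrete (made-up) example of that second step, you feed the raw text to an LLM and ask for JSON back. The field names and file name here are illustrative:

    import json
    from openai import OpenAI

    client = OpenAI()

    # Layout-preserved plain text produced by the OCR/parsing step.
    raw_text = open("insurance_claim_ocr.txt").read()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract claim_number, policy_holder and claim_amount from the document. "
                        "Reply with a JSON object only; use null for fields you cannot find."},
            {"role": "user", "content": raw_text},
        ],
    )
    print(json.loads(response.choices[0].message.content))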


Those images look exactly like what you get from every OCR tool out there if you use the XY information.


Give LLMWhisperer a try. Here is a playground for testing https://pg.llmwhisperer.unstract.com/


For instance, LlamaParse (https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction, but the problem is hallucination. E.g. https://github.com/run-llama/llama_parse/issues/420

There is also LLMWhisperer, which preserves the layout (tables, checkboxes, forms) and hence the context: https://pg.llmwhisperer.unstract.com/


Is this open source? Is it slow Python? That's where I'm stuck.


This is not open source. It has high accuracy and it is fast. All you need to do is point your documents to the API.


Non-fiction as audiobooks


> The "best" models just made stuff up to meet the requirements. They lied in three ways:

> The main difficulty of this project lies in correctly identifying page zones; wouldn't it be possible to properly find the zones during the OCR phase itself instead of rebuilding them afterwards?

If you're curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

[1] https://unstract.com/llmwhisperer/

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf


Looks interesting, but the cost is prohibitive for a hobby project. Also, it doesn't really solve my problem.

Google Vision already returns the coordinates of each word (and even of each letter), so it's easy to know where the word was on the page, and even, if necessary, to rebuild the page with the words correctly placed -- that's fundamentally what I do with the mouseover on the interactive demo: https://divers.medusis.net/boislisle/pub (at the paragraph level).
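(For readers who haven't used it: this is roughly how you read those per-word coordinates out of a Google Vision response; the file name is illustrative.)

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("page.png", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(image=image)

    # Walk the page -> block -> paragraph -> word hierarchy and print each
    # word together with its bounding-box vertices.
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    box = [(v.x, v.y) for v in word.bounding_box.vertices]
                    print(text, box)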

But my problem isn't to know where the words are (Google Vision provides that); it's to know what belongs to what, what is footnotes, what is main text, etc. This is what the post discusses. Just having the text following the same layout as in the original wouldn't help, because I'm not trying to reproduce the layout or the typesetting, I want to rebuild the content semantically, so as to do different "flows".

That said, it got me thinking... there may be an opportunity to do a cheaper version of LLMwhisperer? ;-)

