Show HN: Zerox – Document OCR with GPT-mini (github.com/getomni-ai)
246 points by themanmaran 4 months ago | 98 comments
This started out as a weekend hack with gpt-4o-mini, using the very basic strategy of "just ask the AI to OCR the document".

But it turned out to perform better than our current implementation of Unstructured/Textract, at pretty much the same cost.

I've tested almost every variant of document OCR over the past year, especially for things like table / chart extraction, and I've found that rules-based extraction has always been lacking. Documents are meant to be a visual representation, after all, with weird layouts, tables, charts, etc. Using a vision model just makes sense!

In general, I'd categorize this solution as slow, expensive, and non-deterministic. But 6 months ago it was impossible. And 6 months from now it'll be fast, cheap, and probably more reliable!




It should be noted that, for some reason, OpenAI prices GPT-4o-mini image requests at the same price as GPT-4o [1]. I have a similar library [2], but we found OpenAI has subtle OCR inconsistencies with tables (numbers will be inaccurate). Gemini Flash, for all its faults, seems to do really well as a replacement while being significantly cheaper.

Here’s our pricing comparison:

*Gemini Pro* - $0.66 per 1k image inputs (batch) - $1.88 per 1k text outputs (batch API, ~1k tokens each) - 395 pages per dollar

*Gemini Flash* - $0.066 per 1k image inputs (batch) - $0.53 per 1k text outputs (batch API, ~1k tokens each) - 1693 pages per dollar

*GPT-4o* - $1.91 per 1k image inputs (batch) - $3.75 per 1k text outputs (batch API, ~1k tokens each) - 177 pages per dollar

*GPT-4o-mini* - $1.91 per 1k image inputs (batch) - $0.30 per 1k text outputs (batch API, ~1k tokens each) - 452 pages per dollar
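
As a sanity check, the pages-per-dollar figures line up if you treat both numbers as per-1k-pages costs (one image request per page, roughly 1k output tokens per page). A quick sketch of that arithmetic - mine, not the commenter's:

    // pages per dollar ≈ 1000 / (image cost per 1k pages + output cost per 1k pages)
    function pagesPerDollar(imageCostPer1k: number, outputCostPer1k: number): number {
      return 1000 / (imageCostPer1k + outputCostPer1k);
    }

    pagesPerDollar(1.91, 0.30); // ≈ 452 (GPT-4o-mini)
    pagesPerDollar(1.91, 3.75); // ≈ 177 (GPT-4o)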

[1] https://community.openai.com/t/super-high-token-usage-with-g...

[2] https://github.com/Filimoa/open-parse


Interesting. It didn't seem like gpt-4o-mini was priced the same as gpt-4o during our testing. We're relying on the OpenAI usage page of course, which doesn't give request-by-request pricing. But we didn't see any huge usage spike after testing all weekend.

For our testing we ran a 1000-page document set, all treated as images. We got to about 24M input / 0.4M output tokens for 1000 pages, which would be a pretty noticeable difference based on the listed token prices.

gpt-4o-mini => (24M/1M * $0.15) + (0.4M/1M * $0.60) = $3.84

gpt-4o => (24M/1M * $5.00) + (0.4M/1M * $15.00) = $126.00


The pricing is strange because the same images will use up 30X more tokens with mini. They even show this in the pricing calculator.

[1] https://openai.com/api/pricing/
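
Back-of-the-envelope arithmetic on why that roughly cancels out, using the standard (non-batch) input prices quoted elsewhere in this thread - my numbers, treat as approximate:

    // gpt-4o input:      $5.00 per 1M tokens
    // gpt-4o-mini input: $0.15 per 1M tokens  -> ~33x cheaper per token
    // If mini bills roughly 30x more tokens for the same image, the dollar
    // cost per image ends up about the same on both models.
    const perTokenPriceRatio = 5.00 / 0.15; // ≈ 33.3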


Indeed it does. But the price for the OCR's output tokens is also cheaper, so in total it's still much cheaper with gpt-4o-mini.


That price compares favourably with AWS Textract. Has anyone compared their performance? Because a recent post about OCR had Textract at or near the top in terms of quality.


Can you locate that post? In my own experience, Google Document AI has superior quality but I'm looking for something a bit more objective and scientific.


I'm using AWS Textract for scanning grocery receipts and I find it does this very well and fast. Can you say which performance metric you have in mind?


I'm surprised by the name choice, there's a large company with an almost identical name that has products that do this. May be worth changing it sooner rather than later.

https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web


> there's a large company with an almost identical name

Are you suggesting that this wasn't intentional? The name is clearly a play on "zero shot" + "xerox"


I think they're suggesting that Xerox will likely sue them so might as well get ahead of that and change the name now.


Even if they don't sue, do you really want to deal with people getting confused and thinking you mean one of the many pre-existing OCR tools that Xerox produces? A search for "Zerox OCR" will lead to Xerox products, for example. Not worth the headache.

https://duckduckgo.com/?q=Zerox+OCR


Yup definitely a play on the name. Also the idea of photocopying a page, since we do pdf => image => markdown.

We're not planning to name a company after it or anything, just the open-source tool. And if Xerox sues, I'm sure we could rename the repo lol.


I was involved in a somewhat similar trademark issue once.

I actually had a leg to stand on (my use was not infringing at all when I started using it), and I came out of it somewhat cash-positive, but I absolutely never want to go through anything like that ever again.

> Yup definitely a play on the name. Also the idea of photocopying a page,

But you? My God, man.

With these words you have already doomed yourself.

Best wishes.


> With these words you have already doomed yourself.

At least they didn't say "xeroxing a page".


It still seems reasonable someone may be confused, especially since the one letter of the company name that was changed has identical pronunciation (x --> z). It is like offering "Phacebook" or "Netfliks" competitors, but even less obviously different.


Surprisingly, http://phacebook.com/ is for sale.


From personal experience, I'd wager that anyone buying that domain will receive a letter from a Facebook lawyer pretty quickly.


If they sue, this comment will be used to make their case.

I guess I just don’t understand - how are you proceeding as if this is an acceptable starting point?

With all respect, I don’t think you’re taking this seriously, and it reflects poorly on the team building the tool. It looks like this is also a way to raise awareness for Omni AI? If so, I’ve gotta be honest - this makes me want to steer clear.

Bottom line, it’s a bad idea/decision. And when bad ideas are this prominent, it makes me question the rest of the decisions underlying the product and whether I want to be trusting those decision makers in the many other ways trust is required to choose a vendor.

Not trying to throw shade; just sharing how this hits me as someone who has built products and has been the person making decisions about which products to bring in. Start taking this seriously for your own sake.


I would happily contribute to the legal defense fund.


If imitation is the sincerest form of flattery, I'd have gone with "Xorex" myself.


We'll see what the new name is when the C&D is delivered.


Let me xerox that C&D letter first...


I'm sure that was on purpose.

Edit: Reading the comments below, yes, it was.

Very disrespectful behavior.


The commercial service is called OmniAI; Zerox is just the name of a component (GitHub repo, library) in a possible software stack.

Am I the only one who finds these sorts of takes silly in a globalized world with instant communication? There are so many things to be named, everything named is instantly available around the world, and there are so many jurisdictions to cover - not all providing the same level of protection to "trademarks".

Are we really suggesting this issue is worth defending and spending resources on?

What is the ground for confusion here? That a developer stumbles on here and thinks Zerox is developed/maintained by Xerox? This developer gets confused but won't simply check who owns the repository? What if there's a variable called zerox?

I mean, I get it: the whole point of IP at this point is really just to create revenue streams for the legal/admin industry, so we should all be scared and spend unproductive time naming a software dependency.


> Are we really suggesting this issue is worth defending and spending resources on?

Absolutely.

Sure, sometimes non-competing products have the same name. Or products sold exclusively in one country use the same name as a competitor in a different country. There are also companies that don't trademark or protect their names. Often no one even notices the shared name.

That's not what's happening here. Xerox is famously litigious about its trademark - it's often used as a case study. This product competes with Xerox OCR products in the same countries.

It's a strange thing to be cavalier about, and to openly document intent to use a sound-alike name. Besides, do you really want people searching for "Zerox OCR" to land on a Xerox page? There's no shortage of other names.


> so we should all be scared and spend unproductive time naming a software dependency

All 5 minutes it would take to name it something else?


Maybe call it ZeroPDF?


ZerOCR maybe!


gpterox


I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This made GPT make fewer mistakes and improved output accuracy.

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text.
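
A minimal sketch of that trigram-overlap check (my own reconstruction, not the actual hotseatai.com code; the 90% threshold and the normalization rules come from this thread):

    // Normalize (lowercase, strip non-alphanumeric), then count character trigrams.
    function trigramCounts(text: string): Map<string, number> {
      const normalized = text.toLowerCase().replace(/[^a-z0-9]/g, "");
      const counts = new Map<string, number>();
      for (let i = 0; i + 3 <= normalized.length; i++) {
        const tri = normalized.slice(i, i + 3);
        counts.set(tri, (counts.get(tri) ?? 0) + 1);
      }
      return counts;
    }

    // Fraction of the source text's trigrams that also show up in the LLM output.
    function trigramOverlap(source: string, output: string): number {
      const src = trigramCounts(source);
      const out = trigramCounts(output);
      let total = 0;
      let matched = 0;
      for (const [tri, count] of src) {
        total += count;
        matched += Math.min(count, out.get(tri) ?? 0);
      }
      return total === 0 ? 1 : matched / total;
    }

    // if (trigramOverlap(pdfText, gptOutput) < 0.9) console.warn("possible omitted text");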


One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. It's especially useful if you've got tables that span pages. The flow is pretty much:

- Request #1 => page_1_image

- Request #2 => page_1_markdown + page_2_image

- Request #3 => page_2_markdown + page_3_image
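
A rough sketch of that chaining loop (my own simplification; `ocrPage` is a hypothetical helper that sends the page image plus the prior page's markdown to the model):

    // maintainFormat-style chaining: each request carries the previous page's
    // markdown as extra context so tables and formatting stay consistent.
    declare function ocrPage(image: Uint8Array, priorMarkdown?: string): Promise<string>;

    async function ocrDocument(pageImages: Uint8Array[]): Promise<string> {
      const pages: string[] = [];
      let priorMarkdown: string | undefined;
      for (const image of pageImages) {
        const markdown = await ocrPage(image, priorMarkdown);
        pages.push(markdown);
        priorMarkdown = markdown; // context for the next page
      }
      return pages.join("\n\n");
    }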


>frequency of character triples

What are character triples? Are they trigrams?


I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g., for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.


> I extracted the embedded text from the PDF

What did you use to extract the embedded text during this step, other than some other OCR tech?


PyMuPDF, a PDF library for Python.


A different approach from vanilla OCR/parsing seems to be ColPali [1], which combines a purpose-built small vision model with ColBERT-style indexing for retrieval. So - if search is the intended use case - it can skip the whole OCR step entirely.

[1] https://huggingface.co/blog/manu/colpali


I would categorize Azure Document AI's accuracy as high, not "mid", including handwriting. However, the $1.50/1000 pages tier doesn't include layout detection.

The $10/1000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.

I have continued to do proofs of concept with Gemini and GPT, and in general with any new multimodal model that comes out, but have found they are not on par with Azure's checkbox detection.

In fact, the results from Gemini/GPT-4 aren't even good enough to use as a teacher for distillation of a "small" multimodal model specializing in layout/checkbox detection.

I would also like to shout out Surya OCR, which is up and coming. It's source-available and free under a certain funding or revenue milestone - I think $5M. It doesn't have word-level detection yet, but it's one of the more promising OCR tools I'm aware of outside the hyperscalers and heavy commercial offerings.


Surya OCR is great in my test use cases! Hoping to try it out in production soon.


Prompts in the background:

  const systemPrompt = `
    Convert the following PDF page to markdown. 
    Return only the markdown with no explanation text. 
    Do not exclude any content from the page.
  `;
For each subsequent page:

  messages.push({
    role: "system",
    content: `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`,
  });

Could be handy for general-purpose frontend tools.


So this is just a wrapper around gpt-4o-mini?


Very interesting project, thank you for sharing.

Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.


That's definitely the plan. Using batch requests would move this closer to the $2/1000 pages mark, which is effectively the AWS pricing.


Xerox tried it a while ago. It didn't end well: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...


> This is not an OCR problem (as we switched off OCR on purpose)


It also says

> This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.

But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.


> It also says...

It was a problem with employing the JBIG2 compression codec, which cuts and pastes things from different parts of the page to save space.

> But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.

Anyone trying to solve for the contents of a page uses context clues. Even humans reading.

You can OCR raw characters (performance is poor); use letter frequency information; use a dictionary; use word frequencies; or use even more context to know what content is more likely. More context is going to result in many fewer errors (of course, it may result in a bigger proportion of the remaining errors seeming to have significant meaning changes).

A small LLM is just a good way to encode this kind of "how likely are these given alternatives" knowledge.


Traditional OCR neural networks like Tesseract crucially have strong measures of their accuracy levels, including when they employ dictionaries or the like to help with accuracy. LLMs, on the other hand, give you zero guarantees and have some pretty insane edge cases.

With a traditional OCR architecture maybe you'll get a symbol or two wrong, but an LLM can give you entirely new words or numbers not in the document, or even omit sections of the document. I'd never use an LLM for OCR like this.


If you use the LLM stupidly, sure. But you can get pseudo-probabilities for the next symbol from the LLM and use, e.g., Bayes' rule to combine them with the information of how well each candidate matches the page. You can also report the total uncertainty at the end.

Done properly, this should strictly improve the results.
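
A minimal sketch of the kind of combination being described (my own framing, not the parent's exact proposal): score each candidate character by OCR likelihood plus LM prior in log space.

    // Bayes-style rescoring: score(c) = log P(scan | c) + log P_LM(c | context).
    // Both maps are hypothetical inputs: per-character log-likelihoods from the
    // OCR engine, and log-probabilities from the language model.
    function pickCharacter(
      ocrLogLikelihood: Map<string, number>,
      lmLogProb: Map<string, number>
    ): string {
      let best = "";
      let bestScore = -Infinity;
      for (const [ch, likelihood] of ocrLogLikelihood) {
        const score = likelihood + (lmLogProb.get(ch) ?? -Infinity);
        if (score > bestScore) {
          bestScore = score;
          best = ch;
        }
      }
      return best;
    }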


It's all fun and games until you need to prove something in court or to the tax office. I don't think that throwing an LLM into this mix helps.


Generally when OCRing documents you should keep the original scans so you can refer back to them in case of any questions or disputes.


It depends what your use case is. At a low enough cost this would work for a project I'm doing where I really just need to be able to mostly search large documents. Less-than-100% accuracy and a lost or hallucinated paragraph here and there wouldn't be a deal-killer, especially if the original page image is available to the user too.

Additionally, this might also work if you are feeding the output to a bunch of humans to proofread.


That was also the first thing that came to my mind; I guess Zerox might be a reference to this.


I did this for images using Tesseract for OCR + Ollama for AI.

Check it out, https://cluttr.ai

Runs entirely in browser, using OPFS + WASM.


If you want to do document OCR/PDF text extraction with decent accuracy without using an LLM, do give LLMWhisperer[1] a try.

Try with any PDF document in the playground - https://pg.llmwhisperer.unstract.com/

[1] - https://unstract.com/llmwhisperer/


You can do some really cool things now with these models, like ask them to extract not just the text but also figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out I tried this with a simple prompt + dumping in a pydantic schema of what I wanted and it was spot on - pretty much this (before JSON mode was supported):

    You are an expert in PDFs. You are helping a user extract text from a PDF.

    Extract the text from the image as a structured json output.

    Extract the data using the following schema:

    {Page.model_json_schema()}

    Example:
    {{
      "title": "Title",
      "page_number": 1,
      "sections": [
        ...
      ],
      "figures": [
        ...
      ]
    }}

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/


My intuition is that the best solution here would be a division of labor: have the big multimodal model identify tables, paragraphs, etc., and output a mapping between segments of the document and the text output. Then a much simpler model that doesn't try to hold entire conversations can process those segments into their contents.

This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.

At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.
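
For illustration, the segment mapping the parent comment describes might look something like this (entirely hypothetical field names, just to make the idea concrete):

    // Hypothetical schema for the "big model segments, small model transcribes" split.
    interface Segment {
      id: string;
      kind: "paragraph" | "table" | "figure" | "header";
      bbox: [x0: number, y0: number, x1: number, y1: number]; // page coordinates
      text?: string; // filled in later by the simpler transcription model
    }

    interface PageSegmentation {
      pageNumber: number;
      segments: Segment[]; // every piece of content lands in exactly one segment
    }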


I am using AWS Textract + LLM (OpenAI/Claude) to read grocery receipts for <https://www.5outapp.com>

So far, I have collected over 500 receipts from around 10 countries with 30 different supermarkets in 5 different languages.

What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.


Cool design. FYI the "Try now" card looks like it didn't render right, just seeing a blank box around the button.


You mean in the web version? It is supposed to look like a blank box in the rectangular grocery-bill shape, but I suppose the design could be a bit better there. Thanks for the feedback.


The current design with that box feels broken


Ok, thanks for the feedback. Will think of something else


FWIW, I have it on good sourcing that OpenAI supplies Tesseract output to the LLM, so you're in a great place - best of all worlds.


At inference time or during training?


Inference


In my own experiments I have had major failures where much of the text is fabricated by the LLM, to the point where I find it hard to trust even with great prompt engineering. What I have been very impressed with is its ability to take medium-quality OCR from Acrobat - with poor formatting, lots of errors, and punctuation problems - and render 100% accurate and properly formatted output by simply asking it to correct the OCR output. This approach of using traditional cheap OCR for grounding might be a really robust and cheap option.
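
A minimal sketch of that correction pass (assuming the official openai Node SDK; the prompt wording and model choice are mine, not the commenter's):

    import OpenAI from "openai";

    const client = new OpenAI();

    // Ask the model to clean up raw OCR text without adding or removing content.
    async function correctOcr(rawOcrText: string): Promise<string> {
      const response = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "system",
            content: "Correct OCR errors, punctuation, and formatting in the user's text. Do not add or remove content.",
          },
          { role: "user", content: rawOcrText },
        ],
      });
      return response.choices[0].message.content ?? "";
    }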


Congrats! Cool project! I’d been curious about whether GPT would be good for this task. Looks like this answers it!

Why did you choose markdown? Did you try other output formats and see if you get better results?

Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.


I think I'll add an optional configuration for HTML vs. Markdown, which at the end of the day will just prompt the model differently.

I've not seen a meaningful difference between the two, except when it comes to tables. It seems like HTML tends to outperform markdown tables, especially when you have a lot of complexity (e.g. tables within tables, lots of subheaders).


Xerox might want to have a word with you about that name.


It seems there's a need for a benchmark to compare all the solutions available in the market on quality and price.

The majority of comments here are about price and quality.

Also, is there any movement on product detection? These days I'm looking for solutions that can recognize goods with high accuracy and output [brand][product_name][variant].


One problem I've not found any OCR solution to handle well is complex column-based layouts in magazines. Perhaps part of the problem is that there are often images spanning anything from one to all columns, so the text might flow in funny ways. But in this day and age, this must be possible for the best AI-based tools to handle?


Ohh, that could finally be a great way to get my TTRPG books readable on Kindle. I'll give it a try - thanks for that.


> And 6 months from now it'll be fast, cheap, and probably more reliable!

I like the optimism.

I've needed to include human review when using previous-generation OCR software, whenever I needed the results to be accurate. It's painstaking, but the OCR offered a speedup over fully manual transcription. Have you given any thought to human-in-the-loop processes?


I've been surprised so far by LLMs' capability, so I hope it continues.

On the human-in-the-loop side, it's really use-case specific. A lot of my company's work is focused on getting trends from large sets of documents.

Ex: "categorize building permits by municipality". If the OCR was wrong on a few documents, it's still going to capture the general trend. If the use case was "pull bank account info from wire forms" I would want a lot more double-checking. But that said, humans also have a tendency to transpose numbers incorrectly.


Our human-in-the-loop process with traditional OCR uses confidence scores from regions of interest and the page coordinates to speed up the review process. I wish the LLM could provide that, but both seem far off on the horizon.


Hmm, sounds like different goals. I don't work on that project any longer but it was a very small set of documents and they needed to be transcribed perfectly. Every typo in the original needed to be preserved.

That said, there's huge value in lossy transcription elsewhere, as long as you can account for the errors they introduce.


Have you tried using the GraphRAG approach of just rerunning the same prompts multiple times and then giving the results along with a prompt to the model telling it to extract the true text and fix any mistakes? With mini this seems like a very workable solution. You could even incorporate one or more attempts from whatever OCR you were using previously.

I think that is one of the key findings from the GraphRAG paper: GPT can replace the human in the loop.
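
A rough sketch of that rerun-and-reconcile idea (my own framing, not the GraphRAG paper's method; `ocrPage` and `askTextModel` are hypothetical helpers for a single vision OCR pass and a plain text completion):

    declare function ocrPage(image: Uint8Array): Promise<string>;
    declare function askTextModel(prompt: string): Promise<string>;

    // Run the OCR prompt several times, then ask a model to merge the drafts.
    async function ocrWithReconciliation(image: Uint8Array, attempts = 3): Promise<string> {
      const drafts: string[] = [];
      for (let i = 0; i < attempts; i++) {
        drafts.push(await ocrPage(image)); // non-deterministic, so drafts differ
      }
      const prompt =
        "Here are several independent transcriptions of the same page. " +
        "Extract the true text and fix any mistakes:\n\n" +
        drafts.map((d, i) => `--- Attempt ${i + 1} ---\n${d}`).join("\n\n");
      return askTextModel(prompt);
    }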


Does it also produce a confidence number?


The closest thing is the "logprobs": https://cookbook.openai.com/examples/using_logprobs

However, commenters around here have noted that these have likely not been fine-tuned to correlate with accuracy, at least for plain-text LLM uses. Would be interested in hearing findings for MLLM use cases!
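
A hedged sketch of what using logprobs as a crude confidence proxy could look like (openai Node SDK; no guarantee the number correlates with OCR accuracy, per the caveat above):

    import OpenAI from "openai";

    const client = new OpenAI();

    // Request token logprobs and average them as a rough page-level signal.
    async function ocrWithConfidence(imageDataUrl: string) {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",
        logprobs: true,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: "Convert this page to markdown." },
              { type: "image_url", image_url: { url: imageDataUrl } },
            ],
          },
        ],
      });
      const tokens = res.choices[0].logprobs?.content ?? [];
      const avgLogprob =
        tokens.reduce((sum, t) => sum + t.logprob, 0) / Math.max(tokens.length, 1);
      return { markdown: res.choices[0].message.content, avgLogprob };
    }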


No, there is no vision LLM that produces confidence numbers to my knowledge.


The AI says it's 100% confident that its hallucinations are correct.


I don't think OpenAI's API for gpt-4o-mini has any such mechanism.


Compared with gpt-4o, gpt-4o-mini uses around 20 times more tokens for the same image: https://youtu.be/ZWxBHTgIRa0?si=yjPB1FArs2DS_Rc9&t=655


I'd be more curious to see the performance of local models like LLaVA, etc.


I think I'm missing something... why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU; you wouldn't even need something crazy powerful.


Tesseract works great for pure label-the-characters OCR, which is sufficient for books and other sources with straightforward layouts, but it doesn't handle weird layouts (tables, columns, tables with columns in each cell, etc.). People will do absolutely depraved stuff with Word and PDF documents, and you often need semantic understanding to decipher it.

That said, sometimes no amount of understanding will improve the OCR output because a structure in a document cannot be converted to a one-dimensional string (short of using HTML/CSS or something). Maybe we'll get image -> HTML models eventually.


And OpenAI uses Tesseract in the background, as it sometimes tells me that the Hungarian language is not installed for Tesseract.


I would be extremely surprised if that's the case. There are "open-source" multimodal LLMs that can extract text from images, as proof that the idea works.

Probably the model is hallucinating and adding "Hungarian language is not installed for Tesseract" to the response.


Great example of how LLMs are eliminating/simplifying giant swathes of complex tech.

I would love to use this in a project if it could also caption embedded images to produce something for RAG...


Yay! Now we can use more RAM, Network, Energy, etc to do the same thing! I just love hot phones!


Oops guess I'm not sippin' the koolaid huh?


Have you compared the results to special-purpose OCR-free models that do image-to-text with layout? My intuition is mini should be just as good, if not better.


Very nice, seems to work pretty well!

Just

    maintainFormat: true
did not seem to have any effect in my testing.


Llama 3.1 now has image support, right? Could this be adapted there as well, maybe with Groq for speed?


Meta trained a vision encoder (page 54 of the Llama 3.1 paper) but has not released it as far as I can tell.


Yup! I want to evaluate a couple of different model options over time, which should be pretty simple!

The main thing we're doing is converting documents to a series of images and then aggregating the responses. So we should be model-agnostic pretty soon.


I would really love something like this that could be run locally.


Man, this is just an awesome hack! Keep it up!


Or not a man, sorry for putting your identity into a bucket.



