Llama-OCR: Document to Markdown (llamaocr.com)
152 points by lapnect 4 hours ago | 47 comments





I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close. It still had enough failures and hallucinations that it was faster to just transcribe the text by hand. Annoying, considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine-tuning or something for it to beat Gemini Flash; it would save me a lot of time. :(


>Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close.

For normal models, the state of open-source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google, etc. are much better. Did you try those?

Interesting about Flash. What LLMs did you test?


I tried open-source and closed-source OCR models; all were pretty bad. Google Vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM along with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.

I don't remember the exact models; I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible, and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.

I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text: pulling any dates listed in the text into a standard ISO format was nice, as well as grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
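
Roughly the kind of single call I mean, as a sketch (OpenRouter's OpenAI-compatible chat endpoint; the model slug and the JSON shape in the prompt are just illustrative, not exactly what I used):

    // Sketch of one request that both transcribes the handwriting and pulls
    // out structured fields. Uses OpenRouter's OpenAI-compatible endpoint;
    // the model slug and the JSON shape asked for are illustrative.
    import { readFile } from "node:fs/promises";

    const imageDataUrl =
      `data:image/jpeg;base64,${(await readFile("photo.jpg")).toString("base64")}`;

    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "google/gemini-flash-1.5",
        messages: [{
          role: "user",
          content: [
            { type: "text", text:
              "Transcribe only the handwritten text, ignoring printed text. " +
              "Return JSON: { text, dates (ISO 8601), names }." },
            { type: "image_url", image_url: { url: imageDataUrl } },
          ],
        }],
      }),
    });
    console.log((await res.json()).choices[0].message.content);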


Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.

convert -density 76 input.pdf output-%d.png

https://github.com/philips/paper-bidsheets


That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further, and with a better algorithm, would help. I remember using CLIP and finding that different scaling algorithms helped text readability. Maybe the text is just being butchered when it's rescaled.

Though I also tried the high-detail setting, which I think would deal with most issues that come from that, and it didn't seem to help much.


That's a bummer. I'm trying to do the exact same thing right now: digitizing family photos. Some of mine have German on the back. The last OCR tool to hit the headlines was terrible, so I was hoping this would be better. ChatGPT 4o has been good though, when I paste individual images into the chat. I haven't tried the API yet; I'm not sure how much it would cost to process 6,500 photos, many of which are blank, but I don't have an easy way to filter those out either.

I found 4o to be one of the worst, but I was using the API. I didn't test it, but it sometimes feels like images uploaded through ChatGPT work better than ones sent through the API. I ended up using Gemini Flash; it seemed better than 4o, and images are so cheap that I have a hard time believing Google is making any money even after bandwidth costs.

I also tried preprocessing images before sending them through. I tried cropping them to just the text to see if it helped. Then I tried filtering on top of that to try to brighten the text; somehow that all made it worse. The most success I had was just holding the original photo in my hand and taking a picture of it; the busy background seemed to help, but I have absolutely no idea why.

The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate, or not understand a crossed-out word with a correction, or miss text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, along with date/location/favourite status.


Use a local rubbish model to extract text. If it doesn't find any on the back, don't send it to ChatGPT?

Tesseract comes to mind.
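
A rough sketch of using it as that filter, via tesseract.js (the thresholds here are guesses):

    // Rough sketch: use local Tesseract as a cheap "is there any text here?"
    // check before paying for an API call. Thresholds are guesses.
    import Tesseract from "tesseract.js";

    const { data } = await Tesseract.recognize("photo-back.jpg", "deu+eng");
    if (data.text.trim().length > 10 && data.confidence > 30) {
      // looks like there is writing on the back: send it to the paid model
    } else {
      // probably blank: skip it
    }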


Have you tried Claude?

It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.


Hi all, I'm the author of llama-ocr. Thank you for sharing and for the kind comments! I built this earlier this week since I wanted a simple API to do OCR. It uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured Markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response as JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!
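
A minimal usage sketch of the npm package (the exact function and option names may differ slightly; check the repo's README):

    // Minimal sketch of calling llama-ocr from Node; option names may differ,
    // see the README for the exact API.
    import { ocr } from "llama-ocr";

    const markdown = await ocr({
      filePath: "./receipt.jpg",             // local image to transcribe
      apiKey: process.env.TOGETHER_API_KEY,  // together.ai API key
    });
    console.log(markdown);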


I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this degree of transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)


Option to use a local LLM?

I made this script, which does exactly the same thing but locally, using koboldcpp for inference. It downloads MiniCPM-V 2.6 with its image projector the first time you run it. If you want to use a different model you can, but you'll want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR


I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.

I do find it rather annoying not being able to get it to consistently output a CSV though. ChatGPT and Gemini seem better at doing that but I haven’t tried to automate it.

The scale of my problem is about 100 pages of bidsheets, so some manual cleaning is OK. It is certainly better than burning volunteers' time.

https://github.com/philips/paper-bidsheets


I gave it a sentence, which I created by placing 500 circles via a genetic algorithm and then drawing them with an actual physical circle:

https://www.instagram.com/marekgibney/p/BiFNyYBhvGr/

Interestingly, it sees the circles just fine, but not the sentence. It replied with this:

    The image contains no text or other elements
    that can be represented in Markdown. It is a
    visual composition of circles and does not
    convey any information that can be translated
    into Markdown format.

Based on the fact that squinting works, I applied a Gaussian blur to the image. Here's the response I got:

    Markdown:

    The provided image is a blurred text that reads
    "STOP THINKING IN CIRCLES." There are no other
    visible elements such as headers, footers,
    subtexts, images, or tables.

    Markdown Content:

    STOP THINKING IN CIRCLES

As the response is not deterministic, I also tried several times with the unprocessed image but it never worked. However, all the low-pass filter effects I applied worked with a high success rate.

https://imgur.com/q7Zd7fa


I can't read this either.

Edit: at a distance it's easier to read


If you squint it's easier too. I wonder if lowering the resolution of the image would make the text visible to OCR.

I can't read anything but the "stop" either without seeing the solution first.

Why is it interesting? The image does not look like anything, and you need to skew it (by looking at an angle) to see any letters (barely).

Should this be a "Show HN" post? It seems to just be the front-end, with no obvious association with the name Llama. Maybe together.ai gave them cloud space?

Holy hallucinations, Batman!

Even the example images hallucinate random text.


Same for me. The receipt headline only says "Trader Joe's" and yet the model insists on adding some information and transcribes "Trader Joe's Receipt". This is like Xeroxgate, but infinitely worse.

Someday this will do great damage in ways we will completely neglect and overlook.


One can run Apache Tika's OCR and feed its output, together with the image, into an LLM to fix typos.
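
A rough sketch of that pipeline (assumes a local Tika server on port 9998 with Tesseract installed; the final LLM call is left out):

    // Sketch: get a rough OCR transcript from Apache Tika, then hand the
    // transcript plus the original image to a vision LLM for correction.
    // Assumes tika-server is running locally on :9998 with Tesseract available.
    import { readFile } from "node:fs/promises";

    const image = await readFile("scan.jpg");
    const ocrText = await fetch("http://localhost:9998/tika", {
      method: "PUT",
      headers: { "Content-Type": "image/jpeg", Accept: "text/plain" },
      body: image,
    }).then((r) => r.text());

    const prompt =
      `Here is a rough OCR transcript:\n${ocrText}\n` +
      "Fix any typos using the attached image and return only the corrected text.";
    // ...send `prompt` together with the image to a vision model of your choice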

I wonder what the watts-per-character is of this tool.

Joules per character

I think it is perfectly fine to describe it in Watts per character as you can easily determine how many characters per second you can process.

All it does is send the image to Llama 3.2 Vision and ask it to read the text.

Note that this is just as open to hallucination as any other LLM output, because what it is doing is not reading the pixels looking for text characters, but describing the picture, using the images and captions it was trained on to determine what the text is. It may completely make up words, especially if it can't read them.


This is also true for any other OCR system; we just never called these errors "hallucinations" in this context.

No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. For one thing, OCR systems are deterministic. Deterministic. Look it up.

OCR systems use vision models and as such can make mistakes. They don't sample, but they produce a probability distribution over words, like LLMs do.

One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!

I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?

OCR tools sometimes make errors, but they don't make things up. There's a difference.

It really isn't, since those systems are character-based.

Seemed pretty good with handwriting. Didn’t make any mistakes with numbers in the sample I tried.

Um, I just quickly uploaded an unstructured RTF file to this and apparently broke it... unless it's just realllly slow.

If this is just for converting hand-written documents, maybe put that in the header of the website. Right now it just says "Document to Markdown", which could be interpreted lots of different ways.


Site is dead now :(

Should be up, please try again!

Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)

That said, I have a few questions if OP/anyone knows the answers:

1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI

2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?

P.S. The header links are broken on my desktop browser -- no onClick triggered

[1] https://facebookresearch.github.io/nougat/

[2] https://github.com/getomni-ai/zerox

[3] https://www.together.ai/products#custom-models


My guess is together.ai is at least partially sponsoring the demo.

The project author is a DevRel at Together.ai. This is a fantastic way to advertise a dev tool, though.

Yeah, I was hoping for something I could self-host, both for privacy and cost.

together.ai serves 100+ open-source models, including multimodal Llama 3.2, with an OpenAI-compatible API.
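
e.g. the standard OpenAI SDK pointed at their base URL should work; a sketch (the model slug here is an assumption, check their model list):

    // Sketch: Llama 3.2 Vision via Together's OpenAI-compatible endpoint.
    // The model slug is an assumption; check Together's model list.
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://api.together.xyz/v1",
      apiKey: process.env.TOGETHER_API_KEY,
    });

    const res = await client.chat.completions.create({
      model: "meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
      messages: [{
        role: "user",
        content: [
          { type: "text", text: "Transcribe this document as Markdown." },
          { type: "image_url", image_url: { url: "https://example.com/scan.jpg" } },
        ],
      }],
    });
    console.log(res.choices[0].message.content);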

Great tool for quickly converting plain text to Markdown, saving time and ensuring consistent formatting for documents

Thank you!

We tried this and it was an absolute shit show for us.


