For my use cases, this has already beaten all "traditional approaches" for at least a few months now. That's just inferring from when I first stumbled across it. No clue how long it's been a thing.
I did some OCR tests on 1960s-era documents (all in English), a mix of typed and handwritten. My results:
Google Vision: 95.62% HW - 99.4% Typed
Amazon Textract: 95.63% HW - 99.3% Typed
Azure: 95.9% HW - 98.1% Typed
If you're curious, TrOCR was the best FOSS solution at 79.5% HW and 97.4% Typed. (However, it took roughly 200x longer than Tesseract, which scored 43% HW and 97.0% Typed.)
When did you do this test? I don't have any numbers handy, but a couple years ago I compared Google's OCR vs AWS's on "text in the wild" pictures. AWS's wasn't bad, but it was definitely outperformed by Google's. The open-source solutions I tried (Tesseract and some academic deep-learning code) were far behind.
This was a couple months ago now, so not that long ago. For OCR I have found that it highly depends on the type of image you are looking at. In my case these were all scanned documents of good but not great scan quality, all in English. I expect if you were talking about random photos with text in them, you'd see the FOSS solutions do much worse, and much more variance between Google, Amazon, and Azure. I would be curious about the academic deep-learning one you tried.
The main one was https://github.com/JaidedAI/EasyOCR, mostly because, as promised, it was pretty easy to use, and uses pytorch (which I preferred in case I wanted to tweak it). It has been updated since, but at the time it was using CRNN, which is a solid model, especially for the time - it wasn't (academic) SOTA, but it wasn't far behind either. I'm sure I could've coaxed better performance out of it with some retraining and hyperparameter tuning.
Interesting. I tried EasyOCR and found it was about 35% on handwriting and 95.7% on typed text, so not bad at all for typed, but pretty bad for handwriting. I focused on Tesseract and TrOCR since EasyOCR wasn't working out that well, though that could easily have just been my particular use case.
I also tested PaddleOCR and keras-ocr to round them all out.
At some point I really need to finish my project enough to write up some blog articles and post a bunch of code repos for others to use.
I did not check. I also never checked whether they share my emails on Google Search with you -- but I trust their ambition to not be sued into the ground for doing something immensely stupid.
Leaking sensitive data of enterprise customers as training material for public reCAPTCHAs falls into that category.
What would you recommend for classifying documents? Most of the companies I've evaluated market their product as using fancy AI/ML, but instead they have hundreds of people, usually in India, manually classifying the documents.
I strongly believe everything just has to go through OpenAI or Anthropic, for now. These models are significantly better than any NLP models I try swapping in.
But this isn’t much help if you must classify images.
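For concreteness, here's roughly what that looks like: a minimal zero-shot classification sketch against the OpenAI API. The label set and model name are placeholders, and this assumes the `openai>=1.0` Python client:

```python
# Hedged sketch: zero-shot document classification via an LLM.
# LABELS and the model name are illustrative, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["invoice", "contract", "resume", "correspondence", "other"]

def classify(document_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Classify the document into exactly one of: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            # Crude truncation so long documents fit in the context window.
            {"role": "user", "content": document_text[:8000]},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```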
For documents which are mostly pretty clean you are probably right. The ceiling for AI/ML is definitely higher though, and very useful right now if you know specifically what type of document you expect to look at, but expect it to be messy.
I was just playing with tesseract last week (I'd used it years ago) and wasn't too happy. I had a pretty simple PDF that was in what you could think of as an old typewritten font, but easily legible, and I got all kinds of word fragments and nonsense characters in the output. I know that high-quality OCR systems include a language model to coerce the read text into the most probable words. Is tesseract just supposed to be the first stage of such a system?
I'll note that when I put the tesseract output into ChatGPT, told it the text was OCR'd, and asked it to clean it up, it worked very well.
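In case it's useful, a sketch of that two-stage pipeline: pytesseract for raw recognition, then an LLM pass to repair fragments. The model name and prompt are just placeholders:

```python
# Sketch: Tesseract as the first stage, an LLM as the "language model" stage.
import pytesseract
from PIL import Image
from openai import OpenAI

raw = pytesseract.image_to_string(Image.open("page.png"))

client = OpenAI()
cleaned = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "The following is noisy OCR output. Correct obvious "
                          "recognition errors without changing the meaning:\n\n" + raw}],
    temperature=0,
).choices[0].message.content
```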
I was just processing a document with tesseract & ocrmypdf, and two things:
My first time processing it, I used `ocrmypdf --redo-ocr` because it looked like there was some existing OCR. After processing, the OCR was crap because ocrmypdf didn't realize the existing text layer was OCR output and treated it as real text that should be kept. This was fixable using `ocrmypdf --force-ocr`.
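For reference, the same flags are exposed through ocrmypdf's Python API, so the fix can live in a script (a minimal sketch; filenames are placeholders):

```python
# redo_ocr replaces only text layers it detects as prior OCR;
# force_ocr rasterizes every page and re-OCRs it unconditionally.
import ocrmypdf

ocrmypdf.ocr("input.pdf", "output.pdf", force_ocr=True)
```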
Before realizing this, I discovered that Tesseract 4 & 5 use a neural network-based recognition. I then came across this step-by-step guide on fine-tuning Tesseract for a specific document set: https://www.statworx.com/en/content-hub/blog/fine-tuning-tes...
I didn't end up following the fine-tuning process because at this point `ocrmypdf --force-ocr` worked excellently, but I thought the draw_box_file_data.py script from their example was particularly useful: https://gist.github.com/flaviut/d901be509425098645e4ae527a9e...
FWIW, I'm using Google's ML Kit which runs completely on-device and doesn't send the documents to Google. It works better than tesseract for my use case.
Tesseract is generally the overall best for typed documents, though it struggles with handwriting. TrOCR is better than Tesseract, especially with handwriting, but requires a GPU to have any speed. Tesseract in my tests was roughly 200x faster than TrOCR (not an exaggeration).
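For anyone wanting to try it, a minimal TrOCR inference sketch via Hugging Face transformers (checkpoint name from the model hub). Note it expects a single cropped line of text, not a full page, which is part of why it's so much slower in practice:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# One cropped text line per call; loop over line crops for a full page.
image = Image.open("line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```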
When I was evaluating options a few months ago I found https://github.com/PaddlePaddle/PaddleOCR to be a very strong contender for my use case (reading product labels), but you'll definitely want to put together some representative docs/images and test a bunch of solutions to see what works for you.
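A quick sketch of what that looks like in code, in case it helps anyone evaluating (the filename is a placeholder, and the result nesting differs slightly between PaddleOCR versions):

```python
from paddleocr import PaddleOCR

# Angle classification helps with rotated text in photos.
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("label.jpg", cls=True)  # detection + recognition
for box, (text, confidence) in result[0]:  # result[0]: first (only) image
    print(f"{confidence:.2f}  {text}")
```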
If you want to OCR specific text, you can use TextSniper on Mac and draw a box around whatever part of the screen you want to capture. I'm guessing under the hood it's just using Apple's OCR tech, which does work very well (at least if you're on Apple Silicon; it's not quite so fast on my 2015 Intel MacBook Pro).
Developments in this space are coming really fast, and reading words is squarely within the capabilities of neural engines. 5 years is a very long time in AI years.
As a developer who has been building IDP solutions, I can say that although this model is a lot larger (more weights) than a graph neural network over OCR tokens (the industry standard before transformers), it outperforms them given enough data. Depending on how heterogeneous the data is, usually around 200 documents is enough to reach human-level accuracy, scoring by Levenshtein ratio.
Smaller graph models could get away with using less data. The problem the "traditional" approach had is that the quality of the OCR was the bottleneck for overall model performance. It amazes me how this problem shifted from a node-classification problem to an image-to-text problem.
Training on CPU was possible with GCN but not with Donut.
If you want to train Donut, check out this notebook on Kaggle. It trains Donut to read plots for a competition, and it contains the full fine-tuning pipeline. https://www.kaggle.com/code/nbroad/donut-train-benetech
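If you just want to poke at a pretrained Donut first, inference looks roughly like this (following the Hugging Face example; the receipt-parsing checkpoint and its task prompt token must match each other):

```python
import re
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"  # receipt parsing
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids,
                         max_length=model.decoder.config.max_position_embeddings)
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "") \
                   .replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1)  # drop the task prompt token
print(processor.token2json(sequence))  # structured fields as JSON
```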
I'm sure everyone is kinda tired of this answer, but GPT-4. At least I have the share feature now, so those who want to avoid it don't have to see a big pasted output.
People are justifiably excited about these language tools, but we're getting tired of this answer because it's not a good answer: "Use GPT-4! Some of the answers aren't that great, but it's at least a starting point." That's like if you asked how to sort a list and remove duplicates, and the answer was to import it into a spreadsheet program and follow these steps instead of just "sort | uniq". It's suggesting a general-purpose tool to do a specific job maybe kinda acceptably instead of suggesting the right tool for the job.
It reminds me of the microwave cookery books that came out after consumer microwaves became available: there are things a microwave is good at, but those books used it for everything, just like we're using GPT-4 today. We'll calm down eventually.
>That's like if you asked how to sort a list and remove duplicates, and the answer was to import it into a spreadsheet program and follow these steps instead of just "sort | uniq".
That's exactly how many Windows office users do it: they paste a list into Excel and use it to remove duplicates. There are alternatives even on Windows, but it's much easier for them to use a single general-purpose graphical tool (and let's not get started on the abominations VLOOKUP is used for).
I used to look down on that, but then I realized that using a graphical program for list manipulation is kinda cool, that this program is rather capable, and that it can handle combinations that are rather difficult with the more specialized tools. I still use the specialized tools (I'm used to them, and I can do some stuff with them that isn't easy in Excel).
Yeah, I used to look down on that kind of thing too, and a decade later, I find myself doing those very things.
Yes, I know sort | uniq. I even have a couple Linux shells open on my Windows work system. But I can't for the life of me remember the magic flags, so I'll either paste the list to Emacs and M-x sort-lines + M-x delete-duplicate-lines, or paste it to Excel and do it there, or do something even more cheesy - whatever is least likely to break my flow.
There are tools more or less optimized for any specific job, but the best tool for the job is the one you have handy, and are experienced in using.
I too increasingly often find myself using GPT-4 for random, ad-hoc tasks. There may or may not be better tools out there. I may even have some installed. But none of them beat being able to just describe what you want, paste some data, and get the results a few seconds later.
> but we're getting tired of this answer because it's not a good answer: "Use GPT-4! Some of the answers aren't that great, but it's at least a starting point."
Some of the answers are straight-up usable; with others, you can go from there if you prefer, because this is a creative language task.
And there isn't really a specific tool for this, is there? It's nothing like your comparison to a very well specified problem. "Identify what this thing does, and come up with a title that also contains a word, and the word is related to the topic" is not the same as sort|uniq Vs a spreadsheet.
(1) The problem is not clearly defined. Does the word need to be thematically related to the topic? (As far as I can tell, "Donut" isn't thematically related to document understanding.) Maybe you could say it's a nice, optional bonus if it's related.
(2) The best solution would be good at two things: (A) satisfying constraints and (B) creativity. ChatGPT is unlikely to be good at A, and a non-AI algorithm that just finds valid words can't do B.
Regarding #1, if people don't all have the same idea of the problem, they're not going to agree on the solution.
Regarding #2, maybe a combined solution would be best. Generate all allowable words, then feed them to ChatGPT and have it say which ones are thematically good.
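A minimal sketch of that combined approach, assuming "allowable" means the word's letters appear in order inside the project description (the way "Donut" hides in "DOcumeNt Understanding Transformer"); the word-list path is a placeholder:

```python
# Part A: a dumb algorithm enforces the constraint.
def is_subsequence(word: str, phrase: str) -> bool:
    it = iter(phrase.lower())
    return all(ch in it for ch in word.lower())  # consumes `it` in order

description = "document understanding transformer"
with open("/usr/share/dict/words") as f:  # any word list will do
    candidates = [w.strip() for w in f
                  if len(w.strip()) >= 4 and is_subsequence(w.strip(), description)]

# Part B: hand the valid candidates to an LLM to judge thematic fit.
prompt = ("Which of these words would make a fun, fitting name for a "
          "document-understanding model? " + ", ".join(candidates[:200]))
```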
Tbh I think GPT-4 shines at this. People's requirements will be different, in weird ways. DuckDB things are all duck-related. Rust is related to crabs. Your project may be all sweets-related. It might be serious or fun. Maybe you want a name that's easy to draw.
These are hard to encode.
Instead I just asked "Make them more fun, and related to literary characters", then Muppets, and then awkward ones based on Harry Potter, which it described as "certainly a unique request". It's faster than getting a word list related to that. And they are frankly great - better than I'd come up with given much longer.
This problem is great for LLMs. It's language, works well with a back-and-forth discussing good and bad options, has no well-defined output requirements but is easy to explain to a person, has a human in the loop, and has almost zero cost if it's wrong.
> QUIXOTE: Quality Unstructured Information Extraction and Organization Through End-to-end transformer
There’s a model for music transcription (audio to MIDI) called MT3 which takes an end-to-end transformer approach and claims SOTA on some datasets. However, from my own research and comparisons with other models, it seems that MT3 is very prone to overfitting and the real-world results are not as impressive. A similar story seems to be playing out in the comments here.
I want to build an application that scans restaurant and café menus (PDFs, photos, webpages) to identify which items are vegetarian or vegan. Would this work for that? If not, I would love to hear people's ideas and suggestions.
With vegan, you can't determine it 100% from the menu alone, because the sauce and other minor ingredients can be animal-based.
If you want to do it, using “plant based” is probably better than “vegan”, and it’s always good to make sure your users are aware that the mark can be wrong and they should double-check with the waiter.
As for your question - I didn't play with Donut, but OCR+GPT, or multimodal GPT-4 once released, should handle this smoothly.
You could combine ingredient search, looking for symbols that actually designate vegan (as some places do), along with long/lat data to determine what the restaurant actually is, and then check it against a database you maintain.
So I could scan a menu, then ask the owner or server about certain dishes, and then crowdsource an updated menu?
It would be great if there was a standard API for all restaurants that included all menu items, prices, ingredients, preparation, and sourcing information. It could be maintained like a wiki, I suppose, and restaurants could be incentivized to include their own menus.
Yeah crowd-sourcing updates is the way I'd like to go. I'm hoping people will submit photos of menus, because lots of bakeries, cafés etc don't have much of an online presence or keep their menus up-to-date.
I'm trying to solve the problem of scanning through menus for multiple restaurants to find something a vegan or vegetarian can eat, by instead just showing all the individual menu options in the area as a list.
You should look at LayoutLM models for an NER task. Your pipeline would then look like:
- Identify the menu substructure (title, item list, ...)
- Classify each item with 2 labels.
The training process is not hard, but the data gathering / cleaning / labelling can be a little long.
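A rough sketch of the inference side with LayoutLMv3 (the fine-tuned checkpoint name is hypothetical - you'd train it yourself on the labelled menus; the built-in OCR requires pytesseract):

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=True)  # runs Tesseract internally
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "your-org/layoutlmv3-menus")  # hypothetical checkpoint fine-tuned on menus

image = Image.open("menu.jpg").convert("RGB")
encoding = processor(image, return_tensors="pt")
logits = model(**encoding).logits
predictions = logits.argmax(-1).squeeze().tolist()  # one label id per token
```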
I bet in a year or so you don't need a specialized app for it, but you just ask your phone whatever you want to know about anything around you, including menus.
Google Maps is pretty close to that; I can already find lots of places or restaurants based on what I want to eat. And now, thanks to crowd sourcing, you can filter by a range of options.
The only problem is that the data is walled off by Google and you can only access it through their API.
What I want to create is a tiny search engine that collects all menus (somehow extracted from images) and lets users find and filter what they like, and even get recommendations nearby!
I will have to investigate this. I am dreaming of a system that can take a PDF scan of a book as input and produce one or more properly formatted (headings, italic, bold, underline, etc.) Markdown files.
In my tests, LLMs have proved very good at cleaning up raw OCR, but they need formatting information to get me all the way.
It's not ready to take a book, but I'm building an app that takes scans of book chapters/journal articles (which I often receive from my college library) and turns them into well formatted PDFs (with OCR, consistent margins, rotation...) https://fixpdfs.com
> You agree that this license includes the right for us to make your Content available to other users of Service, who may also use your Content subject to these Terms.
I'm curious as to why this appears in your ToS, as this is quite a deterrent for many.
I don't think it supports a markdown / formatted-text export, but it looks fantastic as far as PDF cleanup goes (I currently rely on Adobe Scan for that when I'm working from a paper copy). I will try it soon.
No, it doesn't support markdown and it doesn't do analysis of headers/page numbers. It's mainly aimed at making academic PDFs better for reading and annotating (especially on iPad-like devices). Hoping to start charging for it at some point, but I'm still trialing it...
Some users have expected it to "unwarp" bad scans, it also doesn't do that unfortunately. But that's a much harder problem to solve...
I found perspective transformation to be good enough for 90% of cases. Going further would be a lot of effort better spent elsewhere.
Does it deal fine with illustrations and PDF compression?
This is really cool if it delivers. I tried building an app to scan till receipts. The image-to-text APIs out there really don't perform as well as you'd think. AWS Textract performed far better than the GCP and Azure equivalents and traditional OCR solutions, but it still made some really annoying errors that I had to fix with heuristics.
Yup, it was solid but not as good as AWS out of the box. IIRC preprocessing the image did help, but I didn't have enough time to spend on fleshing that out for an MVP. (I gave up on the project when I realised that recently introduced protection of personal information laws in my country would have made this project too risky to continue work on. The intention was to automatically extract spending habits from receipts to improve personal finance management.)
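For anyone curious, the Textract call in question is just a couple of lines with boto3; the per-line confidence scores are where the heuristics hook in:

```python
import boto3

client = boto3.client("textract")
with open("receipt.jpg", "rb") as f:
    response = client.detect_document_text(Document={"Bytes": f.read()})

# Keep line-level blocks; each carries text plus a confidence score.
lines = [(b["Text"], b["Confidence"])
         for b in response["Blocks"] if b["BlockType"] == "LINE"]
```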
I've started using Microsoft's TrOCR (another transformer OCR model) to read the cursive in my pocket journal. (I have a habit of writing programs there first while I'm out and then typing them in manually; I just focus better that way.)
It's surprisingly accurate although you have to write your own program to segment the image into lines. I think with some fine tuning I could have the machine read my notebook with minimal corrections.
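In case anyone wants to try the same thing, a simple version of that line segmentation is a horizontal projection profile: count the dark pixels per row and cut at the gaps. The thresholds here are guesses you'd tune, and it assumes roughly level writing:

```python
import numpy as np
from PIL import Image

def segment_lines(path, ink_threshold=128, min_ink_pixels=3):
    gray = np.array(Image.open(path).convert("L"))
    # A row is "ink" if enough of its pixels are darker than the threshold.
    ink_rows = (gray < ink_threshold).sum(axis=1) >= min_ink_pixels
    lines, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append(gray[start:y])    # crop one text line
            start = None
    if start is not None:
        lines.append(gray[start:])
    return lines  # feed each crop to TrOCR individually
```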
I think the traditional approach to scanning and classifying without AI/ML is the way to go, for the next 5 years at the very least.