The big complicated segmentation pipeline is a legacy from the time, a few years ago, when you had to do it that way. It's error prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.
We need to do end to end text recognition. Not "character recognition", it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
> We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.
Arbitrary nonsensical text requires character recognition. Sure, even a license plate bears some semantics that bound expectations of what text it contains, but text that has no coherence might remain an application domain for character rather than text recognition.
> Arbitrary nonsensical text requires character recognition.
Are you sure? I mean, if it's printed text in a non-connected script, where characters repeat themselves (nearly) identically, then ok, but if you're looking at handwriting - couldn't one argue that it's _words_ that get recognized? And that's ignoring the question of textual context, i.e. recognizing based on what you know the rest of the sentence to be.
Not really. I have an HTR use case where the data is highly specialized codes. All the OCR software I use gets tripped up trying to fit the content into the category of English words.
LLMs can help, but I’ve also had issues where the repetitive nature of the content can reliably result in terrible hallucinations.
Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
The problem with doing OCR with LLMs is hallucination. It creates character replacements like Xerox's old flawed compression algorithm did. At least that was my experience with Gemini 2.0 Flash, and it was a screenshot of a webpage, too.
Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.
I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.
Paradoxically, LLMs should be the tool to fix traditional OCR, by recognizing that "Charles ||I" should be "Charles III", that "carrot ina box" should be "carrot in a box", that the century of the event, given the context, cannot be the one read off the glyphs, etc.
As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents, this will not work with modern OCR. Modern OCR is too good.
If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you do will introduce more errors than it fixes. LLMs are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results.
I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1]
OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly. When it does fail, it fails in ways not easily corrected by LLMs. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations, which an LLM obviously can't fix.
I also have extensive, and even recent, experience: I get a significant number of avoidable errors.
> at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_
You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reasons to suppose that a proper calibration of an LLM based system can achieve action over a large number of True Positives with a negligible amount of False Positives.
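To make that concrete, a minimal sketch of the review loop (file names are illustrative): a human only reads the spans the LLM changed and accepts or rejects each.

```python
# Minimal sketch of the diff-based review step described above.
# File names are illustrative; any OCR output and LLM-corrected text will do.
import difflib

with open("ocr_output.txt") as f:
    ocr_lines = f.readlines()
with open("llm_corrected.txt") as f:
    llm_lines = f.readlines()

# Show only the lines the LLM changed, so a human can accept or reject each one.
for line in difflib.unified_diff(ocr_lines, llm_lines,
                                 fromfile="ocr_output", tofile="llm_corrected", n=0):
    print(line, end="")
```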
> LSTM
I am using LSTM-based engines, and on those outputs I have stated «I get a significant number of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still in the 4.x line), and I have recently noticed (through `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within 4.x, from early to later releases.
> numbers and abbreviations which an LLM obviously can't fix
? It's the opposite: for the numbers it could go (implicitly) "are you sure? I have a different figure for that", and for abbreviations, the LLM is exactly the thing that should guess them out of the context. The LLM is the thing that knows that "the one defeated by Cromwell should really be Charles II-staintoberemoved instead of an apparent Charles III".
> You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`.
Fair, and I'm aware that that makes a huge difference in how worthwhile an LLM is. I'm glad you're not doing the annoyingly common "just throw AI at it" without thinking through the consequences.
I'm doing two things to flag words for human review: checking the confidence score of the classifier, and checking words against a dictionary. I didn't even consider using an LLM for that since the existing process catches just about everything that's possible to catch.
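For anyone curious, the general shape of that kind of check looks something like the rough sketch below (pytesseract plus a plain word list; the 60% confidence threshold is arbitrary).

```python
# Rough sketch of the flagging described above: low classifier confidence
# or out-of-dictionary words get queued for human review.
# Assumes pytesseract and Pillow are installed; the 60% threshold is arbitrary.
import pytesseract
from PIL import Image

with open("/usr/share/dict/words") as f:           # any word list will do
    dictionary = {w.strip().lower() for w in f}

data = pytesseract.image_to_data(Image.open("page.png"),
                                 output_type=pytesseract.Output.DICT)

flagged = []
for word, conf in zip(data["text"], data["conf"]):
    word = word.strip()
    if not word:
        continue
    if float(conf) < 60 or word.lower().strip(".,;:!?") not in dictionary:
        flagged.append((word, conf))

print(f"{len(flagged)} words need review:", flagged[:20])
```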
> I am using LSTM-based engines . . .
I'm using Tesseract 5.5. It could actually be that much better, or I could just be lucky. I've got some pretty well-done scans to work with.
> It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" . . .
I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context?
I think the example you've given makes a lot of sense if you're just using an LLM as one way to flag things for review.
> I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context
Not the one in your example (the page number): that would be pretty hard to check with current general agents (in the future, not impossible - you would need an agent finally capable of following procedures strictly). That the extra punctuation or accent is a glitch, or that the sentence has a mistake, is more within the realm of a Language Model.
What I am saying is that a good Specialized Language Model (maybe a good less efficient LLM) could fix a text like:
"AB〈G〉 [Should be 'ABC'!] News was founded in 194〈S〉 [Should be '1945'!] after action from the 〈P〉CC [Should be 'FCC'!], 〈_〉 [Noise!], deman〈ci〉ing [Should be 'demanding'!] pluralist progress 〈8〉 [Should be '3'!] 〈v〉ears [should be 'years'] ear〈!〉ier [Should be 'earlier'!]..."
since it should "understand" the sentence and be already informed of the facts.
This is moot anyway if the LLM is only used as part of a review process. But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess. There's no way to get around that.
> This is moot anyway if the LLM is only used as part of a review process
Not really, because if you plan to perform a `diff` in order to ensure that there are no False Positives in the corrections proposed by your "assistant", you will want one that finds as many True Positives as possible (otherwise the exercise will be futile and inefficient, if a large number of OCR errors remain). So it would be good to have some tool that could (in theory, at this stage) find the subtle ones (not dictionary-related, not local, etc.).
> But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess
It depends on the text. You may, for example, be interested in texts whose value is in their arguments, in the development of thought (they will speak about known things in a novel way) - where OCR error has a much reduced scope (and yet you want them clean of local noise). And if a report comes out from Gallup or similar, with original figures coming from recent research, we can hope that it will be distributed foremost electronically. Potential helpers today can do more than they could only a few years ago (e.g. hunspell).
I tried it! (How very distracted of us not to have tried immediately.) And it works, mostly... I used a public LLM.
The sentence:
> ABG News was founded in 194S after action from the PCC , _ , demanciing pluralist progress 8 vears ear!ier
is corrected as follows, initially:
> ABG News was founded in 1945 after action from the FCC, demanding pluralist progress 8 years earlier
and as you can see it already corrects a number of trivial and non-trivial OCR errors, including recognizing (explicitly in the full response) that it should be the "FCC" to "demand pluralist progress" (not to mention, all of the wrong characters breaking unambiguous words and the extra punctuation).
After a second request, to review its output to "see if the work is complete", it sees that 'ABG', which it was reluctant to correct because of ambiguities ("Australian Broadcasting Corporation", "Autonomous Bougainville Government" etc.), should actually be 'ABC' because of several hints from the rest of the sentence.
As you can see - proof of concept - it works. The one error it (the one I tried) cannot see is the '8' instead of '3': it knows (it says in a full response) that the FCC acted in 1942 and ABC was created in 1945, but it does not compute the difference internally, nor catch the hint that '8' and '3' are graphically close. Maybe an LLM with "explicit reasoning" could do even better and catch that as well.
I'm saying it's moot because, if you're just flagging things for review, there's already a more direct and reliable way to do that. The OCR classifier itself outputs a confidence score. The naive approach of just checking that confidence score will work. The OCR classifier has less overall information than an LLM, but the information it has is much more relevant to the task it's doing.
When I have some time in front of a computer, I'll try a side-by side comparison with some actual images.
> OCR is already at the point where adding an LLM at the end is counterproductive
That's mass OCR on printed documents. On handwritten documents, LLMs help. There are tons of documents that even top human experts can't read without context and domain language. Printed documents are intended to be readable character by character. Often the only thing a handwriting author intends is to remind himself of what he was thinking when he wrote it.
Also, what is the downstream task? What do you need character-level accuracy for? In my experience, it's often for indexing and search. I believe LLMs have a higher ceiling there, and can in principle (if not in practice, yet) find data and answer questions about a text better than straightforward indexing or search can. I can't count the number of times I've missed a child in genealogy because I didn't think of searching the (fully and usually correctly) indexed data for some spelling or naming variant.
I am working with printed documents. Maybe LLMs currently make a difference with handwriting recognition. I wasn't directly responding to that. It's outside the little bit that I know, and I didn't even think of it as "OCR".
I'm not saying that I need high accuracy (though I do), I'm saying that the current accuracy (and clarifying that this is specifically for printed text) is already very high. Part of the reason it's so high is because the old complicated character-by-character classifiers have already been replaced with neural networks that process entire lines at a time. It's already moving in the direction you're saying we need.
Maybe you can pass the completed text through a simple, fast grammar model to improve it. You don't need a 40B-parameter, 200GB, A100-demanding language model to fix these mistakes. It's absurdly wasteful in every sense.
I'm sure there can be models which can be accelerated on last-gen CPUs' AI accelerators and fix these kinds of mistakes faster than real time, and I'm sure Microsoft Word has already been doing it for some languages for quite some time.
Heck, even Apple has on-device models which can autocomplete words now, and even though their context window and completion size are not that big, they let me jump ahead with a simple tap of the tab key.
I wonder if this is a case where you want an encoder-decoder model. It seems very much like a translation task, only one where training data is embarrassingly easy to synthesize by just grabbing sentences from a corpus and occasionally swapping, inserting, and deleting characters.
In terms of attention masking, it seems like you want the input to be unmasked, since the input is fixed for a given “translation”, and then for the output tokens to use causally masked self attention plus cross attention with the input.
I wonder if you could get away with a much smaller network this way because you’re not pointlessly masking input attention for a performance benefit that doesn’t matter.
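A sketch of how cheaply that training data can be synthesized (the corruption rates here are arbitrary):

```python
# Sketch of the synthetic-data idea above: corrupt clean sentences with
# character-level substitutions, insertions, and deletions to get
# (noisy, clean) training pairs for a correction model.
import random
import string

def corrupt(text: str, p: float = 0.05) -> str:
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3:
            continue                                          # deletion
        elif r < 2 * p / 3:
            out.append(random.choice(string.ascii_letters))   # substitution
        else:
            out.append(ch)
        if random.random() < p / 3:
            out.append(random.choice(string.ascii_letters))   # insertion
    return "".join(out)

clean = "ABC News was founded in 1945 after action from the FCC."
pairs = [(corrupt(clean), clean) for _ in range(3)]
for noisy, target in pairs:
    print(noisy, "->", target)
```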
When I was reading your comment, I remembered an assignment from a Fuzzy Logic course:
"Number of different letters" is a great heuristic for a word guesser. With that method you just take the number of letters, make some educated guesses to start from a semi-converged point (word frequencies are an easy way, I think), and brute-force your way from there; the whole process finds the words in mere milliseconds.
You can extend this method to a 2-3 word window, since we don't care about grammar, only misread words, and brute-force from there.
You may not even need a neural network to fix these kinds of misrecognitions. Add some SSE/AVX magic for faster processing and you have a potential winner on your hands.
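A toy version of that guesser, with a made-up frequency table standing in for real word statistics:

```python
# Toy version of the heuristic above: rank same-length dictionary words by
# how many letter positions differ from the misread word, breaking ties by
# word frequency. The frequency table is a stand-in for a real corpus count.
freq = {"years": 900, "wears": 40, "fears": 80, "bears": 60, "hears": 70}

def differing_letters(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def guess(misread: str, top: int = 3):
    candidates = [w for w in freq if len(w) == len(misread)]
    return sorted(candidates,
                  key=lambda w: (differing_letters(misread, w), -freq[w]))[:top]

print(guess("vears"))   # 'years' ranks first
```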
> Maybe you can passthrough the completed text from a simple, fast grammar model to improve text
Yes - but not really a "grammar model": a statistical model of text, with "transformer attention" - the core of LLMs - should do it: something that identifies whether the fed text has statistical anomalies (which the glitches are).
Unfortunately, small chatbot LLMs do not follow instructions ("check the following text"); they just invent stories, and I am not aware of a specialized model that can be fed text to check for anomalies. Some spoke about a BERT variant - which still does not have great accuracy, as I understood it.
It is a relatively small problem that probably does not have a specialized solution yet. Even a simple input-output box that worked like "evaluate the statistical probability of each token" would be enough - then we would check the spikes of anomaly. (For clarity: this is not plain spellchecking, as we want to identify anomalies in context.)
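As a rough illustration of such a box, here is a sketch that scores per-token surprisal with an off-the-shelf causal LM (GPT-2 via Hugging Face transformers is just a stand-in; the z-score threshold is arbitrary):

```python
# Sketch of the "evaluate the statistical probability of each token" idea:
# flag tokens whose surprisal under a causal LM spikes above the average.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "ABG News was founded in 194S after action from the PCC"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Surprisal of each token given its left context (the first token has no score).
logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

mean, std = surprisal.mean(), surprisal.std()
for token_id, s in zip(ids[0, 1:], surprisal):
    flag = "  <-- anomaly?" if (s - mean) / std > 1.5 else ""
    print(f"{tok.decode(int(token_id))!r:12} {s.item():6.2f}{flag}")
```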
--
Edit: a check I have just done with an engine I had not yet used for the purpose shows a number of solutions... But none a good specific tool, I am afraid.
One might need document/context fine-tuning. It's not beyond possibility that a piece of text really is about Charles ll1 (el-el-one), someone's pet language model or something. Sometimes you want correction of "obvious" mistakes (as with predictive text); other times you really did write keming [with an M].
And that is why I wrote that you need an LLM, i.e. (next post) a «statistical model about text, with "transformer's attention"», as «[what we want] is not plain spellchecking, as we want to identify anomalies in context».
To properly correct text you need a system that checks large blocks of text with some understanding, not just disconnected words.
Edit: minutes ago a member (in a faraway post) wrote «sorry if this is a bit cheeky»... He meant "cheesy". You need some level of understanding to see those mistakes. Marking words that are outside the dictionary is not sufficient.
Gemini does not seem to do OCR with the LLM itself. They seem to use their existing OCR technology and feed its output into the LLM. If you set the temperature to 0 and ask for the exact text as found in the document, you get really good results. I once got weird results where I got literally the JSON output of the OCR result, with bounding boxes and everything.
Interesting, thanks for the information. However, unfortunately, no way I'll be sending my personal notebooks to a service which I don't know what's going to do with them in the long term. However, I might use it for the publicly available information.
I've been looking for an algorithm for running OCR or STT results through multiple language models, compare the results and detect hallucinations as well as correct errors by combining the results in a kind of group consensus way. I figured someone must have done something similar already. If anyone has any leads or more thoughts on algorithm implementation, I'd appreciate it.
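One rough shape such an algorithm could take is to score each transcript by its agreement with the others and flag disagreeing spans; the sketch below uses difflib as a crude aligner, where a real system would use proper sequence alignment or ROVER-style voting.

```python
# Sketch of a consensus check across engines: score each transcript by its
# average similarity to the others and surface low-agreement regions, which
# is where hallucinations tend to hide. The transcripts here are made up.
from difflib import SequenceMatcher
from itertools import combinations

transcripts = {
    "model_a": "ABC News was founded in 1945 after action from the FCC",
    "model_b": "ABC News was founded in 1948 after action from the FCC",
    "model_c": "ABG News was founded in 1945 after action from the FCC",
}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Average agreement of each transcript with all the others.
scores = {
    name: sum(similarity(text, other) for o, other in transcripts.items() if o != name)
          / (len(transcripts) - 1)
    for name, text in transcripts.items()
}
best = max(scores, key=scores.get)
print("per-model agreement:", scores)
print("most consensual transcript:", best)

# Regions where any pair disagrees are candidates for review.
for a, b in combinations(transcripts, 2):
    for op, i1, i2, j1, j2 in SequenceMatcher(None, transcripts[a], transcripts[b]).get_opcodes():
        if op != "equal":
            print(f"{a} vs {b}: {transcripts[a][i1:i2]!r} vs {transcripts[b][j1:j2]!r}")
```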
Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.
On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.
I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.
I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.
For example, the slowest part of my pipeline is picture description since I need a LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes like 30. I might be able to only send sections I don't have the hardware to process.
It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to look at e.g. facial recognition groupings like any document section.
Yes, I agree general purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but no matter how I've prompted it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I try to decipher for my genealogy research.
I just used Gemini as an OCR a couple of hours ago because all the OCR apps I tried on android failed at the task lol
Wild seeing this comment right after waking up
I've seen Gemini Flash 2 mention "in the OCR text" when responding to VQA tasks, which makes me question whether they have a traditional OCR process mixed into the pipeline.
The issue with that is that some writing is not word based. People use acronyms and temporal, personalized, industrial, and global jargon. At the beginning of the year there were some HN posts about moving from dictionary-word to character encoding for LLMs, because of the highly variable nature of writing.
Even I have used symbols with different meanings in a shorthand form when constructing an idea.
I see it the same way laws are: their word definitions are anchored in time to the common dictionaries of the era. Grammar, spelling, and meanings all change through time. LLMs would require time-scoped information to properly parse content from 1400 vs 1900. An LLM would be for trying to take meaning out of the content, versus retaining the work as written.
Character-based OCR ignores the rules, spelling, and meaning of words and provides what is most likely there. This retains any spelling and grammar errors, whether true positives or false positives, based on the rules of their day.
Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:
> The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.
> The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.
I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?
I will say, if nothing else, I can understand certain physical considerations. For example:
A person who is right-handed and is writing near the right edge of a page may start to slant, because of the physical issue of the paper being high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters because their hand is naturally going to press against fresh ink, or alternatively have very "light" strokes because they are hovering their hand over the paper while the ink dries.
In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.
The lower case "e" in gothic cursive often looks like a lower case "r". If you see one of these: ſ maybe you think "ah, I know that one, that's an S!" and yes, it is, but some scribes when writing a capital H makes something that looks a LOT like it. You need context to disambiguate. Think of it as a cryptogram: if you see a certain squiggle in a context where it's clearly an "r", you can assume that the other squiggles that look like that are "r"s too. Familiarity with a scribe's hand is often necessary to disambiguate squiggles, especially in words such as proper names, where linguistic context doesn't help you a lot. And it's often the proper names which are the most interesting part of a document.
But yes, writers can change style too. Mercifully, just like we sometimes use all caps for surnames, so some writers would use antika-style handwriting (i.e. what we use today) for proper names in a document which is otherwise all gothic-style handwriting. But this certainly doesn't happen consistently enough that you can rely on it, and some writers have so messy handwriting that even then, you need context to know what they're doing.
The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.
It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society and automatically bootstrap intelligence.
Surprise: garbage CEOs in, garbage intelligence out.
> OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software.
Looks like a great project, and I don't want to nitpick, but...
https://www.ocr4all.org/about/ocr4all
> Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.
Any end-user application that uses docker is not an end-user application. It does not matter if the end-user knows how to use docker or not. End-user applications should be delivered as SaaS/WebUi or a local binary (GUI or CLI). Period.
Application installation isn't a user-level task. The application being ready for a user to use and being easy to install are separate things. You get your IT-literate helper to install it for you; then, if the program is easy for users to use, you're golden.
That's a very corporate mentality. Outside of an organizational context, installing applications certainly is a normal user level task. And for those users that have somebody help them, that somebody is usually just a younger person who's comfortable clicking 'Next' to get through an installer but certainly has no devops experience.
"Silicate chemistry is second nature to us geochemists, so it's easy to forget that the average person probably only knows the formulas for olivine and one or two feldspars."
A little secret: Apple’s Vision Framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of including PDFs.
This has been one of my favorite features Apple added. When I’m in a call and someone shares a page I need the link to, rather than interrupt the speaker and ask them to share the link it’s often faster to screengrab the url and let Apple OCR the address and take me to the page/post it in chat.
After getting an iPhone and exploring some of their API documentation after being really impressed with system provided features, I'm blown away by the stuff that's available. My app experience on iOS vs Android is night and day. The vision features alone have been insane, but their text recognition is just fantastic. Any image and even my god awful handwriting gets picked up without issue.
That said, I do love me a free and open source option for this kind of thing. I can't use it much since I'm not using Apple products for my desktop computing. Good on Apple though - they're providing some serious software value.
I can't comment on what Apple is doing here, but Google has an equivalent called "lens" which works really well and I use it in the way you suggest here.
I basically wrapped this in a simple iOS app that can take a PDF, turn it into images, and applies the native OCR to the images. It works shockingly well:
How does it work with tables and diagrams? I have scanned pages with mixed media - some are diagrams - and I want to be able to extract the text but also be told where the diagrams are in the image, with coordinates.
I wonder if it's possible to reverse engineer that, rip it out, and put it on Linux. Would love to have that feature without having to use Apple hardware
> How is this different from tesseract and friends?
The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.
I didn't have good results in tesseract, so I hope this is really different ;)
I was surprised that even scraped screen text did not work 100% flawlessly in tesseract. Maybe it was not made for that, but still, I had a lot of problems with high resolution photos also. I did not try scanned documents, though.
I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I'm sure it's no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn't unnecessarily large (i.e. that 1200dpi scan is overkill), you can have a pretty nice pipeline with good results.
In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.
I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.
I have a typewriter written manuscript that is interspersed with hand written editing. Tesseract worked fine until the hand written part, then garbage. Is there a local solution that anyone can recommend? I have a 16gb lenovo laptop and access to a workstation with a with an RTX 4070 ti 16gb card. Thanks.
Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter / fraktur typesetting? I'm trying to convert documents to look fraktury in latex and would like any and all documents I can lay my hands on.
I'm sorry. I suppose this is great, but an .exe file is designed for usability. A docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are just extra steps compared to other OCR programs, which work out of the box.
I think the current sweet-spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open source project which has been shared before as a Show HN:
This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
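For anyone wanting the general shape of such a pipeline, here is a stripped-down sketch (not the linked project's actual code; `correct_with_llm` is a placeholder for whatever model and prompt you choose, and editing the prompt is where the domain-specific tuning happens):

```python
# Stripped-down sketch of a Tesseract-plus-LLM pass.
import pytesseract
from PIL import Image

PROMPT = (
    "You are correcting OCR output. Fix obvious recognition errors and broken "
    "formatting, but do not rephrase, summarize, or add content:\n\n{page}"
)

def correct_with_llm(prompt: str) -> str:
    # Placeholder: call your preferred LLM API here and return its reply.
    # Returning the input unchanged keeps the sketch runnable without a model.
    return prompt.split("\n\n", 1)[1]

raw = pytesseract.image_to_string(Image.open("scan_page_01.png"))
corrected = correct_with_llm(PROMPT.format(page=raw))
print(corrected)
```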
What kind of accuracy have you reached with this Tesseract+LLM pipeline? I imagine there would be a hard limit on how much the LLM could improve the OCR-extracted text from Tesseract, since it's far from perfect itself.
Haven't seen many people mention it, but I've just been using the PaddleOCR library on its own, and it has been very good for me, often achieving better quality/accuracy than some of the best V-LLMs, and generally much better quality than other open-source OCR models I've tried, Tesseract for example.
That being said, my use case is definitely focused primarily on digital text, so if you're working with handwritten text, take this with a grain of salt.
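For reference, basic usage is only a few lines; the constructor options and result layout below match the 2.x releases and may differ in newer versions, so treat it as a sketch.

```python
# Minimal PaddleOCR example for digital/printed text.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)   # downloads models on first run
result = ocr.ocr("invoice.png", cls=True)

for line in result[0]:                            # one entry per detected line
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")
```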
I haven’t, but I bet it would work pretty well, particularly if you tweaked the prompts to explain that it’s dealing with Ancient Greek or whatever and give a couple examples of how to handle things.
What is this? A new SOTA OCR engine (which would be very interesting to me), or just a tool that uses other known engines (which would be much less interesting to me)?
A movement? A socio-political statement?
If only landing pages could be clearer about wtf it actually is ...
„OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material.“
It seems to be based on OCR-D, which itself is based on
OCR is well and good - I thought it was mostly solved with Tesseract, so what does this bring? But what I'm looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings available, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image to compress separately, and then binding them back together into a compatible PDF.
Very interesting article. I'd be interested to know if a M-series Mac Mini (this article was early 2023, so there should've been M1 and M2) would have also filled this role just fine.
> My preliminary speed tests were fairly slow on my MacBook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU).
I don't know a lot about the specifics of where (hardware-wise) this gets run, but I'd assume any semi-modern Mac would also have an accelerated compute for this kind of thing. Running it on a Mac Mini would ease my worries about battery and heat issues. I would've guessed that they'd scale better as well, but I have no idea if that's actually the case. Also, you'd be able to run the server as a service for automatic restarts and such.
> OCR is well and good - I thought it was mostly solved with Tesseract, so what does this bring?
This is specifically for historic documents that tesseract will handle poorly. It also provides a good interface for retraining models on a specific document set, which will help for documents that are different from the training set.
I was playing with PaddleOCR a while ago, it seemed to work quite well. It seems to be geared to Chinese, but it also works with other languages in my experience.
Wow. Setup took 12 GB of my disk. First impression: nice UI, but no idea what to do with it or how to create a project. Tells me "session expired" no matter what I try to do. Definitely not batteries-included kind of stuff, will need to explore later.
I've been looking for a project that would have an easy free/extremely cheap way to do OCR/image recognition for generating ALT text automatically for social media. Some sort of embedded implementation that looks at an image and is either able to transcribe the text, or (preferably) transcribe the text AND do some brief image recognition.
I generally do this manually with Claude and it's able to do it lightning fast, but a small dev making a third party Bluesky/Mastodon/etc client doesn't have the resources to pay for an AI API.
Such an approach moves the cost of accessibility to each user individually. It is not bad as a fallback mechanism, but I hope that those who publish won't decide that AI absolves them of the need to post accessible content. After all, if they generate the alt text on their side, they can do it only once and it would be accessible to everyone while saving multiple executions of the same recognition task on the other end. Additionally, they have more control how the image would be interpreted and I hope that this really would matter.
I maintain a searchable archive of historical documents for a nonprofit, OCR'd with Tesseract over several years. Tesseract 4 was a big improvement over previous versions, but since then its accuracy has not improved at the same rate as other free solutions.
These days, just uploading a PDF of scanned documents (typeset ones, not handwriting) to Google Drive and opening with Google Docs results in a text document generated with impressive quality OCR.
But this is not scriptable, and it doesn't provide position information, which is needed so we can highlight search results as color overlays on the original PDF. Tesseract's hOCR mode was great for that.
For the next version, we're planning to use one of the command-line wrappers to Apple's Vision framework, which is included free in MacOS. A nice one that provides position information is at https://github.com/bytefer/macos-vision-ocr
I asked this question yesterday but it did not get enough votes. I need to OCR and then translate thousands of pages from historical documents, and I was wondering if you knew a scriptable app/technique or technology that includes 'layout recovery', aka overlaying translated text over the original, like the Safari browser etc. does (I'm not sure the Apple Vision framework wrapper does this?).
Apple Vision and its wrappers provide bounding boxes for each line of text. That's slightly less convenient than Tesseract which can give you a bounding box for each word, but more than compensated by Apple Vision's better accuracy. I am planning to fudge the word boxes by assuming fixed-width letters and dividing up the overall width so that each word's width is proportional to its share of the total letters on the line.
Once you have those bounding boxes, it's pretty simple to use a library like [1] (Python) or [2] (JavaScript) to add overlay text in the right place. For example, see how [3] does it.
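A sketch of that proportional split, assuming (x, y, width, height) line boxes (the box format is just an assumption here):

```python
# Sketch of the word-box fudge described above: split a line's bounding box
# so each word's width is proportional to its character count (plus the
# spaces between words).
def split_line_box(line_box, text):
    x, y, w, h = line_box
    words = text.split()
    total_chars = len(text)                 # includes spaces
    boxes = []
    cursor = x
    for i, word in enumerate(words):
        # give each word its letters, plus one trailing space except the last
        share = (len(word) + (1 if i < len(words) - 1 else 0)) / total_chars
        width = w * share
        boxes.append((word, (cursor, y, width, h)))
        cursor += width
    return boxes

for word, box in split_line_box((100, 40, 600, 24), "Tesseract gives word boxes"):
    print(word, box)
```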
FYI the Apple one is best inside the Live Text API which is Swift-only and so some old Python and CLI tools which wrap the older Obj-C APIs may have worse quality (though Live Text doesn't really provide bounding boxes - so what I do is combine its output with bounding box APIs like the old iOS/macOS ones)
Sure, if you're generating the input to be machine readable, then it's not very surprising that it's machine readable withough much effort. But then you could also use a QR code. Most people who want to OCR stuff are doing it because they don't have control over the input.
My read is that he is saying that Tesseract is intended for OCR, not an entire image pipeline, so there is an expectation that you will preprocess those real world images into a certain form rather than throwing an image straight off the sensor at it.
Yes, anecdotally it's a bit better now. Still nowhere near actually usable OCR software though, unless your use case is scanning clear hi-res screenshots in conventional fonts and popular languages, without tables or complicated formatting.
This tool says it includes a workflow GUI and refinement tools, like creating work-specific text recognition models - maybe the others do too? tesseract isn’t packaged with a GUI, but is wrapped by many.
This project seems focused on making tools more accessible and helping the user be more efficient and organized
They lost me when they suggested I install docker.
Now, I wouldn't mind if they suggested that as an _option_ for people whose system might exhibit compatibility problems, but - come on! How lazy can you get? You can't be bothered to cater to anything other than your own development environment, which you want us to reproduce? Then maybe call yourself "OCR4me", not "OCR4all".
Genuine question: What would the ideal docker-free solution look like in your opinion? That is, something that is accessible to the average university student, researcher, and faculty member? What installation tool/package manager would this hypothetical common user use? How many hoops would they have to jump through? The hub page lists the various dependencies [0], which on their own are pretty complicated packages.
I don't wish to speak out of turn, but it doesn't look like this project has been active for about 1 year. I checked GitHub and the last update was in Feb 2024. Their last post to X was 25 OCT 2023. :(
This looks promising, not sure how it stacks up to Transkribus which seems to be the leader in the space since it has support for handwritten and trainable ML for your dataset.
I've been using Tesseract for a few years on a personal project, and I'd be interested to know how they compare in terms of system resources, given that I am running it on a Dell OptiPlex Micro with 8 GB of RAM and a 6th-gen i5 - Tesseract is barely noticeable, so it's just my curiosity at this point; I don't have any reason to even consider switching over. I do, however, have a large dataset of several hundred GBs of scanned PDFs which would be worth digitizing when I find some time to spare.