The big complicated segmentation pipeline is a legacy from the time, a few years ago, when you had to do it that way. It's error prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.
We need to do end to end text recognition. Not "character recognition", it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
> We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.
Arbitrary nonsensical text requires character recognition. Sure, even a license plate bears some semantics that bound expectations of what text it contains, but text that has no coherence might remain an application domain for character rather than text recognition.
> Arbitrary nonsensical text requires character recognition.
Are you sure? I mean, if it's printed text in a non-connected script, where characters repeat themselves (nearly) identically, then ok, but if you're looking at handwriting - couldn't one argue that it's _words_ that get recognized? And that's ignoring the question of textual context, i.e. recognizing based on what you know the rest of the sentence to be.
Not really. I have an HTR use case where the data is highly specialized codes. All the OCR software I use gets tripped up trying to fit the content into the category of English words.
LLMs can help, but I’ve also had issues where the repetitive nature of the content can reliably result in terrible hallucinations.
Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
The problem with doing OCR with LLMs is hallucination. It creates character replacements like Xerox's old flawed compression algorithm did. At least that was my experience with Gemini 2.0 Flash, and it was a screenshot of a webpage, too.
Graybeards like Tesseract have moved to neural-network-based pipelines, and they're reinventing and improving themselves.
I was planning to train Tesseract on my own handwriting, but if OCR4All can handle that, I'll be happy.
Paradoxically, LLMs should be the tool to fix traditional OCR, by recognizing that "Charles ||I" should be "Charles III", that "carrot ina box" should be "carrot in a box", that the century of the event, given the context, cannot be the one read off the glyphs, etc.
As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents, this will not work with modern OCR. Modern OCR is too good.
If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you do will introduce more errors than it fixes. LLMs are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results.
I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1]
OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly. When it does fail, it fails in ways not easily corrected by LLMs. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations, which an LLM obviously can't fix.
I also have extensive, and even recent, experience: I get a significant number of avoidable errors.
> at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_
You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reasons to suppose that a proper calibration of an LLM based system can achieve action over a large number of True Positives with a negligible amount of False Positives.
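To make that concrete, a minimal sketch of the review loop (file names are illustrative): a human only reads the spans the LLM changed and accepts or rejects each.

```python
# Minimal sketch of the diff-based review step described above.
# File names are illustrative; any OCR output and LLM-corrected text will do.
import difflib

with open("ocr_output.txt") as f:
    ocr_lines = f.readlines()
with open("llm_corrected.txt") as f:
    llm_lines = f.readlines()

# Show only the lines the LLM changed, so a human can accept or reject each one.
for line in difflib.unified_diff(ocr_lines, llm_lines,
                                 fromfile="ocr_output", tofile="llm_corrected", n=0):
    print(line, end="")
```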
> LSTM
I am using LSTM-based engines, and on those outputs I have stated «I get a significant number of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still in the 4.x line), and I have recently noticed (through `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within 4.x, from early to later releases.
> numbers and abbreviations which an LLM obviously can't fix
? It's the opposite: for the numbers it could go (implicitly) "are you sure? I have a different figure for that", and for abbreviations, the LLM is exactly the thing that should guess them out of the context. The LLM is the thing that knows that "the one defeated by Cromwell should really be Charles II-staintoberemoved instead of an apparent Charles III".
> You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`.
Fair, and I'm aware that that makes a huge difference in how worthwhile an LLM is. I'm glad you're not doing the annoyingly common "just throw AI at it" without thinking through the consequences.
I'm doing two things to flag words for human review: checking the confidence score of the classifier, and checking words against a dictionary. I didn't even consider using an LLM for that since the existing process catches just about everything that's possible to catch.
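For anyone curious, the general shape of that kind of check looks something like the rough sketch below (pytesseract plus a plain word list; the 60% confidence threshold is arbitrary).

```python
# Rough sketch of the flagging described above: low classifier confidence
# or out-of-dictionary words get queued for human review.
# Assumes pytesseract and Pillow are installed; the 60% threshold is arbitrary.
import pytesseract
from PIL import Image

with open("/usr/share/dict/words") as f:           # any word list will do
    dictionary = {w.strip().lower() for w in f}

data = pytesseract.image_to_data(Image.open("page.png"),
                                 output_type=pytesseract.Output.DICT)

flagged = []
for word, conf in zip(data["text"], data["conf"]):
    word = word.strip()
    if not word:
        continue
    if float(conf) < 60 or word.lower().strip(".,;:!?") not in dictionary:
        flagged.append((word, conf))

print(f"{len(flagged)} words need review:", flagged[:20])
```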
> I am using LSTM-based engines . . .
I'm using Tesseract 5.5. It could actually be that much better, or I could just be lucky. I've got some pretty well-done scans to work with.
> It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" . . .
I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context?
I think the example you've given makes a lot of sense if you're just using an LLM as one way to flag things for review.
> I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context
Not the one in your example (the page number): that would be pretty hard to check with current general agents (in the future, not impossible - you would need an agent finally capable of following procedures strictly). That the extra punctuation or accent is a glitch, or that the sentence has a mistake, is more within the realm of a Language Model.
What I am saying is that a good Specialized Language Model (maybe a good less efficient LLM) could fix a text like:
"AB〈G〉 [Should be 'ABC'!] News was founded in 194〈S〉 [Should be '1945'!] after action from the 〈P〉CC [Should be 'FCC'!], 〈_〉 [Noise!], deman〈ci〉ing [Should be 'demanding'!] pluralist progress 〈8〉 [Should be '3'!] 〈v〉ears [should be 'years'] ear〈!〉ier [Should be 'earlier'!]..."
since it should "understand" the sentence and be already informed of the facts.
This is moot anyway if the LLM is only used as part of a review process. But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess. There's no way to get around that.
> This is moot anyway if the LLM is only used as part of a review process
Not really, because if you plan to perform a `diff` in order to ensure that there are no False Positives in the corrections proposed by your "assistant", you will want one that finds as many True Positives as possible (otherwise the exercise will be futile and inefficient, if a large number of OCR errors remain). So it would be good to have some tool that could (in theory, at this stage) find the subtle ones (not dictionary-related, not local, etc.).
> But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess
It depends on the text. You may, for example, be interested in texts whose value is in their arguments, in the development of thought (they will speak about known things in a novel way) - where OCR error has a much reduced scope (and yet you want them clean of local noise). And if a report comes out from Gallup or similar, with original figures coming from recent research, we can hope that it will be distributed foremost electronically. Potential helpers today can do more than they could only a few years ago (e.g. hunspell).
I tried it! (How very distracted of us not to have tried immediately.) And it works, mostly... I used a public LLM.
The sentence:
> ABG News was founded in 194S after action from the PCC , _ , demanciing pluralist progress 8 vears ear!ier
is corrected as follows, initially:
> ABG News was founded in 1945 after action from the FCC, demanding pluralist progress 8 years earlier
and as you can see it already corrects a number of trivial and non-trivial OCR errors, including recognizing (explicitly in the full response) that it should be the "FCC" to "demand pluralist progress" (not to mention, all of the wrong characters breaking unambiguous words and the extra punctuation).
After a second request, to review its output to "see if the work is complete", it sees that 'ABG', which it was reluctant to correct because of ambiguities ("Australian Broadcasting Corporation", "Autonomous Bougainville Government" etc.), should actually be 'ABC' because of several hints from the rest of the sentence.
As you can see - proof of concept - it works. The one error it (the one I tried) cannot see is the '8' instead of '3': it knows (it says in a full response) that the FCC acted in 1942 and ABC was created in 1945, but it does not compute the difference internally, nor catch the hint that '8' and '3' are graphically close. Maybe an LLM with "explicit reasoning" could do even better and catch that as well.
I'm saying it's moot because, if you're just flagging things for review, there's already a more direct and reliable way to do that. The OCR classifier itself outputs a confidence score. The naive approach of just checking that confidence score will work. The OCR classifier has less overall information than an LLM, but the information it has is much more relevant to the task it's doing.
When I have some time in front of a computer, I'll try a side-by side comparison with some actual images.
> OCR is already at the point where adding an LLM at the end is counterproductive
That's mass OCR on printed documents. On handwritten documents, LLMs help. There are tons of documents that even top human experts can't read without context and domain language. Printed documents are intended to be readable character by character. Often the only thing a handwriting author intends is to remind himself of what he was thinking when he wrote it.
Also, what is the downstream task? What do you need character-level accuracy for? In my experience, it's often for indexing and search. I believe LLMs have a higher ceiling there, and can in principle (if not in practice, yet) find data and answer questions about a text better than straightforward indexing or search can. I can't count the number of times I've missed a child in genealogy because I didn't think of searching the (fully and usually correctly) indexed data for some spelling or naming variant.
I am working with printed documents. Maybe LLMs currently make a difference with handwriting recognition. I wasn't directly responding to that. It's outside the little bit that I know, and I didn't even think of it as "OCR".
I'm not saying that I need high accuracy (though I do), I'm saying that the current accuracy (and clarifying that this is specifically for printed text) is already very high. Part of the reason it's so high is because the old complicated character-by-character classifiers have already been replaced with neural networks that process entire lines at a time. It's already moving in the direction you're saying we need.
Maybe you can pass the completed text through a simple, fast grammar model to improve it. You don't need a 40B-parameter, 200GB, A100-demanding language model to fix these mistakes. It's absurdly wasteful in every sense.
I'm sure there can be models which can be accelerated on last-gen CPUs' AI accelerators and fix these kinds of mistakes faster than real time, and I'm sure Microsoft Word has already been doing it for some languages for quite some time.
Heck, even Apple has on-device models which can autocomplete words now, and even though their context window and completion size are not that big, they let me jump ahead with a simple tap of the tab key.
I wonder if this is a case where you want an encoder-decoder model. It seems very much like a translation task, only one where training data is embarrassingly easy to synthesize by just grabbing sentences from a corpus and occasionally swapping, inserting, and deleting characters.
In terms of attention masking, it seems like you want the input to be unmasked, since the input is fixed for a given “translation”, and then for the output tokens to use causally masked self attention plus cross attention with the input.
I wonder if you could get away with a much smaller network this way because you’re not pointlessly masking input attention for a performance benefit that doesn’t matter.
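A sketch of how cheaply that training data can be synthesized (the corruption rates here are arbitrary):

```python
# Sketch of the synthetic-data idea above: corrupt clean sentences with
# character-level substitutions, insertions, and deletions to get
# (noisy, clean) training pairs for a correction model.
import random
import string

def corrupt(text: str, p: float = 0.05) -> str:
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3:
            continue                                          # deletion
        elif r < 2 * p / 3:
            out.append(random.choice(string.ascii_letters))   # substitution
        else:
            out.append(ch)
        if random.random() < p / 3:
            out.append(random.choice(string.ascii_letters))   # insertion
    return "".join(out)

clean = "ABC News was founded in 1945 after action from the FCC."
pairs = [(corrupt(clean), clean) for _ in range(3)]
for noisy, target in pairs:
    print(noisy, "->", target)
```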
When I was reading your comment, I remembered an assignment from a Fuzzy Logic course:
"Number of different letters" is a great heuristic for a word guesser. With that method you just take the number of letters, make some educated guesses to start from a semi-converged point (word frequencies are an easy way, I think), and brute-force your way from there; the whole process finds the words in mere milliseconds.
You can extend this method to a 2-3 word window, since we don't care about grammar, only misread words, and brute-force from there.
You may not even need a neural network to fix these kinds of misrecognitions. Add some SSE/AVX magic for faster processing and you have a potential winner on your hands.
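A toy version of that guesser, with a made-up frequency table standing in for real word statistics:

```python
# Toy version of the heuristic above: rank same-length dictionary words by
# how many letter positions differ from the misread word, breaking ties by
# word frequency. The frequency table is a stand-in for a real corpus count.
freq = {"years": 900, "wears": 40, "fears": 80, "bears": 60, "hears": 70}

def differing_letters(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def guess(misread: str, top: int = 3):
    candidates = [w for w in freq if len(w) == len(misread)]
    return sorted(candidates,
                  key=lambda w: (differing_letters(misread, w), -freq[w]))[:top]

print(guess("vears"))   # 'years' ranks first
```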
> Maybe you can passthrough the completed text from a simple, fast grammar model to improve text
Yes - but not really a "grammar model": a statistical model of text, with "transformer attention" - the core of LLMs - should do it: something that identifies whether the fed text has statistical anomalies (which the glitches are).
Unfortunately, small chatbot LLMs do not follow instructions ("check the following text"); they just invent stories, and I am not aware of a specialized model that can be fed text to check for anomalies. Some spoke about a BERT variant - which still does not have great accuracy, as I understood it.
It is a relatively small problem that probably does not have a specialized solution yet. Even a simple input-output box that worked like "evaluate the statistical probability of each token" would be enough - then we would check the spikes of anomaly. (For clarity: this is not plain spellchecking, as we want to identify anomalies in context.)
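As a rough illustration of such a box, here is a sketch that scores per-token surprisal with an off-the-shelf causal LM (GPT-2 via Hugging Face transformers is just a stand-in; the z-score threshold is arbitrary):

```python
# Sketch of the "evaluate the statistical probability of each token" idea:
# flag tokens whose surprisal under a causal LM spikes above the average.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "ABG News was founded in 194S after action from the PCC"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Surprisal of each token given its left context (the first token has no score).
logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

mean, std = surprisal.mean(), surprisal.std()
for token_id, s in zip(ids[0, 1:], surprisal):
    flag = "  <-- anomaly?" if (s - mean) / std > 1.5 else ""
    print(f"{tok.decode(int(token_id))!r:12} {s.item():6.2f}{flag}")
```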
--
Edit: a check I have just done with an engine I had not yet used for the purpose shows a number of solutions... But none a good specific tool, I am afraid.
One might need document/context fine-tuning. It's not beyond possibility that a piece of text really is about Charles ll1 (el-el-one), someone's pet language model or something. Sometimes you want correction of "obvious" mistakes (as with predictive text); other times you really did write keming [with an M].
And that is why I wrote that you need an LLM, i.e. (next post) a «statistical model about text, with "transformer's attention"», as «[what we want] is not plain spellchecking, as we want to identify anomalies in context».
To properly correct text you need a system that checks large blocks of text with some understanding, not just disconnected words.
Edit: minutes ago a member (in a faraway post) wrote «sorry if this is a bit cheeky»... He meant "cheesy". You need some level of understanding to see those mistakes. Marking words that are outside the dictionary is not sufficient.
Gemini does not seem to do OCR with the LLM itself. They seem to use their existing OCR technology and feed its output into the LLM. If you set the temperature to 0 and ask for the exact text as found in the document, you get really good results. I once got weird results where I got literally the JSON output of the OCR result, with bounding boxes and everything.
Interesting, thanks for the information. However, unfortunately, no way I'll be sending my personal notebooks to a service which I don't know what's going to do with them in the long term. However, I might use it for the publicly available information.
I've been looking for an algorithm for running OCR or STT results through multiple language models, compare the results and detect hallucinations as well as correct errors by combining the results in a kind of group consensus way. I figured someone must have done something similar already. If anyone has any leads or more thoughts on algorithm implementation, I'd appreciate it.
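One rough shape such an algorithm could take is to score each transcript by its agreement with the others and flag disagreeing spans; the sketch below uses difflib as a crude aligner, where a real system would use proper sequence alignment or ROVER-style voting.

```python
# Sketch of a consensus check across engines: score each transcript by its
# average similarity to the others and surface low-agreement regions, which
# is where hallucinations tend to hide. The transcripts here are made up.
from difflib import SequenceMatcher
from itertools import combinations

transcripts = {
    "model_a": "ABC News was founded in 1945 after action from the FCC",
    "model_b": "ABC News was founded in 1948 after action from the FCC",
    "model_c": "ABG News was founded in 1945 after action from the FCC",
}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Average agreement of each transcript with all the others.
scores = {
    name: sum(similarity(text, other) for o, other in transcripts.items() if o != name)
          / (len(transcripts) - 1)
    for name, text in transcripts.items()
}
best = max(scores, key=scores.get)
print("per-model agreement:", scores)
print("most consensual transcript:", best)

# Regions where any pair disagrees are candidates for review.
for a, b in combinations(transcripts, 2):
    for op, i1, i2, j1, j2 in SequenceMatcher(None, transcripts[a], transcripts[b]).get_opcodes():
        if op != "equal":
            print(f"{a} vs {b}: {transcripts[a][i1:i2]!r} vs {transcripts[b][j1:j2]!r}")
```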
Tesseract wildly outperforms any VLM I've tried (as of November 2024) for clean scans of machine-printed text. True, this is the best case for Tesseract, but by "wildly outperforms" I mean: given a page that Tesseract had a few errors on, the VLM misread the text everywhere that Tesseract did, plus more.
On top of that, the linked article suggests that Gemini 2.0 can't give meaningful bounding boxes for the text it OCRs, which further limits the places in which it can be used.
I strongly suspect that traditional OCR systems will become obsolete, but we aren't there yet.
I just wrapped up a test project[0] based on a comment from that post! My takeaway was that there are a lot of steps in the process you can farm out to cheaper, faster ML models.
For example, the slowest part of my pipeline is picture description since I need a LLM for that (and my project needs to run on low-end equipment). Locally I can spin up a tiny LLM and get one-word descriptions in a minute, but anything larger takes like 30. I might be able to only send sections I don't have the hardware to process.
It was a good intro to ML models incorporating vision, and video is "just" another image pipeline, so it's been easy to look at e.g. facial recognition groupings like any document section.
Yes, I agree general purpose is the way to go, but I'm still waiting. Gemini was the best the last time I tried, but no matter how I've prompted it, it cannot transcribe (or correctly understand the content of) e.g. the probate documents I try to decipher for my genealogy research.
I just used Gemini as an OCR a couple of hours ago because all the OCR apps I tried on android failed at the task lol
Wild seeing this comment right after waking up
I've seen Gemini Flash 2 mention "in the OCR text" when responding to VQA tasks, which makes me question whether they have a traditional OCR process mixed into the pipeline.
The issue with that is that some writing is not word based. People use acronyms and temporal, personalized, industrial, and global jargon. At the beginning of the year there were some HN posts about moving from dictionary-word to character encoding for LLMs, because of the highly variable nature of writing.
Even I have used symbols with different meanings in a shorthand form when constructing an idea.
I see it the same way laws are: their word definitions are anchored in time to the common dictionaries of the era. Grammar, spelling, and meanings all change through time. LLMs would require time-scoped information to properly parse content from 1400 vs 1900. An LLM would be for trying to take meaning out of the content, versus retaining the work as written.
Character-based OCR ignores the rules, spelling, and meaning of words and provides what is most likely there. This retains any spelling and grammar errors, whether true positives or false positives, based on the rules of their day.
Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:
> The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.
> The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.
I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?
I will say, if nothing else, I can understand certain physical considerations. For example:
A person who is right-handed and is writing near the right edge of a page may start to slant, because of the physical issue of the paper being high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters because their hand is naturally going to press against fresh ink, or alternatively have very "light" strokes because they are hovering their hand over the paper while the ink dries.
In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.
The lower case "e" in gothic cursive often looks like a lower case "r". If you see one of these: ſ maybe you think "ah, I know that one, that's an S!" and yes, it is, but some scribes when writing a capital H makes something that looks a LOT like it. You need context to disambiguate. Think of it as a cryptogram: if you see a certain squiggle in a context where it's clearly an "r", you can assume that the other squiggles that look like that are "r"s too. Familiarity with a scribe's hand is often necessary to disambiguate squiggles, especially in words such as proper names, where linguistic context doesn't help you a lot. And it's often the proper names which are the most interesting part of a document.
But yes, writers can change style too. Mercifully, just like we sometimes use all caps for surnames, so some writers would use antika-style handwriting (i.e. what we use today) for proper names in a document which is otherwise all gothic-style handwriting. But this certainly doesn't happen consistently enough that you can rely on it, and some writers have so messy handwriting that even then, you need context to know what they're doing.
The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.
It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society and automatically bootstrap intelligence.
Surprise: garbage CEOs in, garbage intelligence out.
> OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software.
Looks like a great project, and I don't want to nitpick, but...
https://www.ocr4all.org/about/ocr4all
> Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.
Any end-user application that uses docker is not an end-user application. It does not matter if the end-user knows how to use docker or not. End-user applications should be delivered as SaaS/WebUi or a local binary (GUI or CLI). Period.
Application installation isn't a user-level task. The application being ready for a user to use and being easy to install are separate things. You get your IT-literate helper to install it for you; then, if the program is easy for users to use, you're golden.
That's a very corporate mentality. Outside of an organizational context, installing applications certainly is a normal user level task. And for those users that have somebody help them, that somebody is usually just a younger person who's comfortable clicking 'Next' to get through an installer but certainly has no devops experience.
"Silicate chemistry is second nature to us geochemists, so it's easy to forget that the average person probably only knows the formulas for olivine and one or two feldspars."
A little secret: Apple’s Vision Framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of including PDFs.
This has been one of my favorite features Apple added. When I’m in a call and someone shares a page I need the link to, rather than interrupt the speaker and ask them to share the link it’s often faster to screengrab the url and let Apple OCR the address and take me to the page/post it in chat.
After getting an iPhone and exploring some of their API documentation after being really impressed with system provided features, I'm blown away by the stuff that's available. My app experience on iOS vs Android is night and day. The vision features alone have been insane, but their text recognition is just fantastic. Any image and even my god awful handwriting gets picked up without issue.
That said, I do love me a free and open source option for this kind of thing. I can't use it much since I'm not using Apple products for my desktop computing. Good on Apple though - they're providing some serious software value.
I can't comment on what Apple is doing here, but Google has an equivalent called "lens" which works really well and I use it in the way you suggest here.
I basically wrapped this in a simple iOS app that can take a PDF, turn it into images, and applies the native OCR to the images. It works shockingly well:
How does it work with tables and diagrams? I have scanned pages with mixed media - some are diagrams - and I want to be able to extract the text but also be told where the diagrams are in the image, with coordinates.
I wonder if it's possible to reverse engineer that, rip it out, and put it on Linux. Would love to have that feature without having to use Apple hardware
> How is this different from tesseract and friends?
The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.
I didn't have good results in tesseract, so I hope this is really different ;)
I was surprised that even scraped screen text did not work 100% flawlessly in tesseract. Maybe it was not made for that, but still, I had a lot of problems with high resolution photos also. I did not try scanned documents, though.
I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I'm sure it's no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn't unnecessarily large (i.e. that 1200dpi scan is overkill), you can have a pretty nice pipeline with good results.
In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.
I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.
I have a typewriter written manuscript that is interspersed with hand written editing. Tesseract worked fine until the hand written part, then garbage. Is there a local solution that anyone can recommend? I have a 16gb lenovo laptop and access to a workstation with a with an RTX 4070 ti 16gb card. Thanks.
Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter / fraktur typesetting? I'm trying to convert documents to look fraktury in latex and would like any and all documents I can lay my hands on.
I'm sorry. I suppose this is great, but an .exe file is designed for usability. A docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are just extra steps compared to other OCR programs, which work out of the box.
I think the current sweet-spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open source project which has been shared before as a Show HN:
This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
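For anyone wanting the general shape of such a pipeline, here is a stripped-down sketch (not the linked project's actual code; `correct_with_llm` is a placeholder for whatever model and prompt you choose, and editing the prompt is where the domain-specific tuning happens):

```python
# Stripped-down sketch of a Tesseract-plus-LLM pass.
import pytesseract
from PIL import Image

PROMPT = (
    "You are correcting OCR output. Fix obvious recognition errors and broken "
    "formatting, but do not rephrase, summarize, or add content:\n\n{page}"
)

def correct_with_llm(prompt: str) -> str:
    # Placeholder: call your preferred LLM API here and return its reply.
    # Returning the input unchanged keeps the sketch runnable without a model.
    return prompt.split("\n\n", 1)[1]

raw = pytesseract.image_to_string(Image.open("scan_page_01.png"))
corrected = correct_with_llm(PROMPT.format(page=raw))
print(corrected)
```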
What kind of accuracy have you reached with this Tesseract+LLM pipeline? I imagine there would be a hard limit on how much the LLM could improve the OCR-extracted text from Tesseract, since it's far from perfect itself.
Haven't seen many people mention it, but I've just been using the PaddleOCR library on its own, and it has been very good for me, often achieving better quality/accuracy than some of the best V-LLMs, and generally much better quality than other open-source OCR models I've tried, Tesseract for example.
That being said, my use case is definitely focused primarily on digital text, so if you're working with handwritten text, take this with a grain of salt.
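For reference, basic usage is only a few lines; the constructor options and result layout below match the 2.x releases and may differ in newer versions, so treat it as a sketch.

```python
# Minimal PaddleOCR example for digital/printed text.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)   # downloads models on first run
result = ocr.ocr("invoice.png", cls=True)

for line in result[0]:                            # one entry per detected line
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")
```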
I haven’t, but I bet it would work pretty well, particularly if you tweaked the prompts to explain that it’s dealing with Ancient Greek or whatever and give a couple examples of how to handle things.
What is this? A new SOTA OCR engine (which would be very interesting to me), or just a tool that uses other known engines (which would be much less interesting to me)?
A movement? A socio-political statement?
If only landing pages could be clearer about wtf it actually is ...
„OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material.“
It seems to be based on OCR-D, which itself is based on
OCR is well and good - I thought it was mostly solved with Tesseract, so what does this bring? But what I'm looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings available, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image to compress separately, and then binding them back together into a compatible PDF.
Very interesting article. I'd be interested to know if a M-series Mac Mini (this article was early 2023, so there should've been M1 and M2) would have also filled this role just fine.
> My preliminary speed tests were fairly slow on my MacBook. However, once I deployed the app to an actual iPhone the speed of OCR was extremely promising (possibly due to the Vision framework using the GPU).
I don't know a lot about the specifics of where (hardware-wise) this gets run, but I'd assume any semi-modern Mac would also have an accelerated compute for this kind of thing. Running it on a Mac Mini would ease my worries about battery and heat issues. I would've guessed that they'd scale better as well, but I have no idea if that's actually the case. Also, you'd be able to run the server as a service for automatic restarts and such.
> OCR is well and good - I thought it was mostly solved with Tesseract, so what does this bring?
This is specifically for historic documents that tesseract will handle poorly. It also provides a good interface for retraining models on a specific document set, which will help for documents that are different from the training set.
I was playing with PaddleOCR a while ago, it seemed to work quite well. It seems to be geared to Chinese, but it also works with other languages in my experience.
Wow. Setup took 12 GB of my disk. First impression: nice UI, but no idea what to do with it or how to create a project. Tells me "session expired" no matter what I try to do. Definitely not batteries-included kind of stuff, will need to explore later.
I've been looking for a project that would have an easy free/extremely cheap way to do OCR/image recognition for generating ALT text automatically for social media. Some sort of embedded implementation that looks at an image and is either able to transcribe the text, or (preferably) transcribe the text AND do some brief image recognition.
I generally do this manually with Claude and it's able to do it lightning fast, but a small dev making a third party Bluesky/Mastodon/etc client doesn't have the resources to pay for an AI API.
Such an approach moves the cost of accessibility to each user individually. It is not bad as a fallback mechanism, but I hope that those who publish won't decide that AI absolves them of the need to post accessible content. After all, if they generate the alt text on their side, they can do it only once and it would be accessible to everyone while saving multiple executions of the same recognition task on the other end. Additionally, they have more control how the image would be interpreted and I hope that this really would matter.
I maintain a searchable archive of historical documents for a nonprofit, OCR'd with Tesseract over several years. Tesseract 4 was a big improvement over previous versions, but since then its accuracy has not improved at the same rate as other free solutions.
These days, just uploading a PDF of scanned documents (typeset ones, not handwriting) to Google Drive and opening with Google Docs results in a text document generated with impressive quality OCR.
But this is not scriptable, and it doesn't provide position information, which is needed so we can highlight search results as color overlays on the original PDF. Tesseract's hOCR mode was great for that.
For the next version, we're planning to use one of the command-line wrappers to Apple's Vision framework, which is included free in MacOS. A nice one that provides position information is at https://github.com/bytefer/macos-vision-ocr
I asked this question yesterday but it did not get enough votes. I need to OCR and then translate thousands of pages from historical documents, and I was wondering if you knew a scriptable app/technique or technology that includes 'layout recovery', aka overlaying translated text over the original, like the Safari browser etc. does (I'm not sure the Apple Vision framework wrapper does this?).
Apple Vision and its wrappers provide bounding boxes for each line of text. That's slightly less convenient than Tesseract which can give you a bounding box for each word, but more than compensated by Apple Vision's better accuracy. I am planning to fudge the word boxes by assuming fixed-width letters and dividing up the overall width so that each word's width is proportional to its share of the total letters on the line.
Once you have those bounding boxes, it's pretty simple to use a library like [1] (Python) or [2] (JavaScript) to add overlay text in the right place. For example, see how [3] does it.
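A sketch of that proportional split, assuming (x, y, width, height) line boxes (the box format is just an assumption here):

```python
# Sketch of the word-box fudge described above: split a line's bounding box
# so each word's width is proportional to its character count (plus the
# spaces between words).
def split_line_box(line_box, text):
    x, y, w, h = line_box
    words = text.split()
    total_chars = len(text)                 # includes spaces
    boxes = []
    cursor = x
    for i, word in enumerate(words):
        # give each word its letters, plus one trailing space except the last
        share = (len(word) + (1 if i < len(words) - 1 else 0)) / total_chars
        width = w * share
        boxes.append((word, (cursor, y, width, h)))
        cursor += width
    return boxes

for word, box in split_line_box((100, 40, 600, 24), "Tesseract gives word boxes"):
    print(word, box)
```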
FYI the Apple one is best inside the Live Text API which is Swift-only and so some old Python and CLI tools which wrap the older Obj-C APIs may have worse quality (though Live Text doesn't really provide bounding boxes - so what I do is combine its output with bounding box APIs like the old iOS/macOS ones)
Sure, if you're generating the input to be machine readable, then it's not very surprising that it's machine readable withough much effort. But then you could also use a QR code. Most people who want to OCR stuff are doing it because they don't have control over the input.
My read is that he is saying that Tesseract is intended for OCR, not an entire image pipeline, so there is an expectation that you will preprocess those real world images into a certain form rather than throwing an image straight off the sensor at it.
Yes, anecdotally it's a bit better now. Still nowhere near actually usable OCR software though, unless your use case is scanning clear hi-res screenshots in conventional fonts and popular languages, without tables or complicated formatting.
This tool says it includes a workflow GUI and refinement tools, like creating work-specific text recognition models - maybe the others do too? tesseract isn’t packaged with a GUI, but is wrapped by many.
This project seems focused on making tools more accessible and helping the user be more efficient and organized
They lost me when they suggested I install docker.
Now, I wouldn't mind if they suggested that as an _option_ for people whose system might exhibit compatibility problems, but - come on! How lazy can you get? You can't be bothered to cater to anything other than your own development environment, which you want us to reproduce? Then maybe call yourself "OCR4me", not "OCR4all".
Genuine question: What would the ideal docker-free solution look like in your opinion? That is, something that is accessible to the average university student, researcher, and faculty member? What installation tool/package manager would this hypothetical common user use? How many hoops would they have to jump through? The hub page lists the various dependencies [0], which on their own are pretty complicated packages.
I don't wish to speak out of turn, but it doesn't look like this project has been active for about 1 year. I checked GitHub and the last update was in Feb 2024. Their last post to X was 25 OCT 2023. :(
This looks promising, not sure how it stacks up to Transkribus which seems to be the leader in the space since it has support for handwritten and trainable ML for your dataset.
I've been using Tesseract for a few years on a personal project, and I'd be interested to know how they compare in terms of system resources, given that I am running it on a Dell OptiPlex Micro with 8 GB of RAM and a 6th-gen i5 - Tesseract is barely noticeable, so it's just my curiosity at this point; I don't have any reason to even consider switching over. I do, however, have a large dataset of several hundred GBs of scanned PDFs which would be worth digitizing when I find some time to spare.