To anyone screen-capturing small fonts as a demonstration, or capturing digital text, especially at a small resolution: I don't believe that is the purpose of this OCR library. (As a specialized problem, that might be easier to solve, depending on the typeface.)
Magic .
Read this to yourself. Read it silently
Don't move your lips. Don’t make a suund
Listen to yourself. Listen without hearing
What a wonderfully weird thing, huh?
NOW MAKE THIS PART LOUD!
SCREAM IT IN YOUR MIND!
DROWN EVERYTHING OUT.
Now, hear a whisper. A tiny whisper.
New, read this next line with your best crotchety—
old-man voice:
“Hello there, sonny. Does your town have apost 0
Awesome! Who was that? Whose voice was that?
It sure wasn’t yours!
How do you do that?
How?!
Must be magic.
Problems with this text: misspelled 'sound' as 'suund', didn't recognize the word 'anything', and mis-recognized 'a post office' as 'apost 0'.
Not bad. Especially since two of three mistakes are on the edge of the page.
Could someone try upscaling the image with normal bicubic interpolation and see how it performs? If it performs much better, I feel like scaling up the image should be a preprocessing option, since the library seems to do so poorly at small resolutions.
> Note: image should be sufficiently high resolution. Often, the same image will get much better results if you upscale it before calling recognize.
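For illustration, here is a minimal sketch of that preprocessing step. It assumes canvas scaling (which browsers implement as something bilinear-ish rather than true bicubic) and that recognize() accepts a canvas element; the 3x factor is an arbitrary guess:

  // Upscale via an offscreen canvas before OCR.
  function upscale(img, factor) {
    var canvas = document.createElement('canvas');
    canvas.width = img.width * factor;
    canvas.height = img.height * factor;
    var ctx = canvas.getContext('2d');
    ctx.drawImage(img, 0, 0, canvas.width, canvas.height);
    return canvas;
  }

  Tesseract.recognize(upscale(img, 3))
    .progress(function (p) { console.log('progress', p); })
    .then(function (result) { console.log(result.text); });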
Presumably, the result would be even better if you put a filter that did a path-tracing/auto-vectorization step first, upscaled by an arbitrary amount, and then rasterized the result. Analogue sources would have their traced paths distorted by such a process in a way that's lossier than just using the noisy analogue source—but for actually-digital-from-the-beginning images, it'd work perfectly.
This would be really cool! We would welcome a pull request and / or link to any downstream project working on this. In a similar vein, we've been considering adding the stroke width transform [0] as an optional preprocessing step.
The text detection is lacking in comparison to Google's Vision API. Here is a real-life comparison between Tesseract and Google's Vision API, based on a PDF a user of our website uploaded.
> I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well
> “ I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well ”
> \ am also a mp pmfesslonzl on Thummack wmcn Is a sue 1m peop‘e \ookmg (or professmna‘
> semces We on glg salad P‘ezse see my rewews 1mm my cuems were as weH
Although Google's API is certainly better, Tesseract.js should work similarly if you increase the font size. Screenshots taken on 'retina' devices are around the smallest text it can handle well.
"I am also a top professional on thumbtack which is a site for people looking for professional
services like on gig salad. Please see my reviews from my clients there as well"
Although Googie's API is certaihiy better,
Tesseract.js should work simiiarly if you
increase the font size.
Screenshots taken
on 'retiha’ devices are around the smailest
text it can handie well.
Edit:
A screenshot of the same text at a higher
resolution:
httgs:[[imgurxomZaN/UGu
Tesseract.js
output: httgs://imguricom[a[hiIfM
This is a neat toy, but not impressive compared to the results from tesseract-ocr/tesseract [0]:
$ curl -s http://i.imgur.com/uuFhw90.png \
| tesseract stdin stdout
Although Google's API is certainly better,
Tesseract.js should work similarly if you
increase the font size.
Screenshots taken on 'retina' devices are
around the smallest text it can handle well.
Edit:
A screenshot of the same text at a higher
resolution: https:[ZimguncomlalWHGu
Tesseract.js output:
https:[[imgur.com[a[nilfM
Notice how Tesseract.js results suffer from being unable to differentiate between n's and h's, i's and l's.
That's interesting! Given that Tesseract.js wraps an Emscripten'd copy of Tesseract, I would have expected close to identical performance. This might have to do with the way we threshold images, with the age of the tesseract version we're using, or both. I'll look into it!
Edit: In addition to those differences, I think your font size is still a bit too small. On an unedited screenshot from a macbook (https://i.imgur.com/iv4ZdSt.png) I get
Although Google's API is certainly better, Tesseract.js should work similarly if you increase the font size. Screenshots taken on 'retina' devices are around the smallest text it can handle well.
Edit:
A screenshot of the same text at a higher resolution: https:[[imgur.com[a[W7IGu
Tesseract.js output: https:[[imgur.com[a[niIfM
“I am also a top professional on thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well"
What do you mean by "emscripten'd"? Is Tesseract.js using emscripten to effectively bundle the 150KLOC of C/C++ from tesseract-ocr and the upstream dependency on leptonica [0]? If so, that's amazing!
> This might have to do with the way we threshold images,
> with the age of the tesseract version we're using, or
> both. I'll look into it!
I'd be very interested to hear about what is required to make it "match" native functionality. Please do drop me a line if/when you get it figured out! (I'm @jtaylor on twitter [1])
I used Tesseract for a production OCR project some years ago, and can also confirm that it just doesn't work at screen resolution. On the other hand, performance on high-DPI photos was quite OK. The Google Vision API wasn't around at that time, so I can't compare.
I spent many afternoons trying to get tesseract to read Dwarf Fortress screenshots, such as http://i.imgur.com/32vVhnH.png - including much pre-processing, such as converting the text to black and white. Alas, I never even got close.
Edit: just tried Google's, and it had one mistake for that entire file. That's pretty impressive.
I spent a weekend tying Tesseract together with Tekkotsu (the amazing open framework for the Sony AIBO) in an attempt to teach my robot dog to read. The eventual goal being to hook up the output of OCR --> Text To Speech (TTS) and have him read to me.
Alas, the low resolution of the camera was an insurmountable problem. Poor Aibo needed 40-point fonts and I practically had to rub his nose in the book. Not exactly the user experience I was aiming for.
If you're reading screenshots of a non-changing font, you could quite easily get away with plain template-matching. Simply do a run through and label the data one time.
I did something like that a few years ago when making an Eve-Online UI scraper.
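For a fixed bitmap font, the matching step can be as simple as a per-pixel diff against labeled glyph templates. A minimal sketch, assuming the screenshot has already been binarized and sliced into fixed-size cells on the font's grid (the template format here is made up for illustration):

  // Compare one cell against every labeled template; return the
  // character whose template has the smallest pixel-wise error.
  function matchGlyph(cellPixels, templates) {
    var best = { ch: '?', err: Infinity };
    templates.forEach(function (t) {   // t = { ch: 'A', pixels: Uint8Array }
      var err = 0;
      for (var i = 0; i < cellPixels.length; i++) {
        err += Math.abs(cellPixels[i] - t.pixels[i]);
      }
      if (err < best.err) best = { ch: t.ch, err: err };
    });
    return best.ch;
  }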
Google Cloud Vision API is very expensive though -- if you can sacrifice some amount of quality, it might make sense to go with the open source alternative. At $2.50/unit the cost is absurd, and even the free trial expires after 90 days.
This is the correct pricing. One caveat: Even the free tier requires you to enter a credit card.
For higher volumes, there is also the OCR.space API. It offers 25,000 free conversions per month. It is not as good as Google's, but it works fine on screenshots.
HOW is there not a better, almost 100% accurate OCR tool?
I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.
And yet, Microsoft's and Google's OCR spit out gibberish that is full of critical errors.
From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case, that is, PDFs generated from text files. Identify a uniform font size (preventing o-to-O and o-to-0 errors), identify a font family (serif or sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet, the output is useless in my field.
If you need to do OCR, ABBYY is, in my opinion, the best consumer-level package, and it does a pretty good job of OCR and conversion to Excel. Here's a GitHub repo that demonstrates some results:
But to reemphasize, the above repo demonstrates ABBYY maintaining table structure with PDFs that are scanned images, which is considerably harder than the situation you're in.
As much as your response addresses GP's particular problem (OCR being the wrong tool for PDF-to-text), I 100% agree with the extreme annoyance it expresses about the state of free OCR.
In principle, text-PDF-to-text is just a matter of parsing the PDF (and/or Word) format and extracting the text buried inside. (I know it's a lot of work, but still.)
Even setting aside what GP said about the sources being text PDFs: suppose all the sources are PNG images. As long as those PNGs were generated from text documents (Word, PDF, etc.) without any scanning or camera involved, it is unacceptable that today's free OCR tools don't get the job done, when, in 2016, machine learning has produced systems that surpass human accuracy on much harder tasks like object detection and speech recognition.
I know it's not an unsolved problem. It's just a matter of some knowledgeable machine-learning researcher taking a break from the cutting edge for a few months and putting together a package that gets the image-to-text job done. Once such a base tool is available on GitHub, the community will take over and add features and fix bugs as needed. (I'm extremely busy with my own degree work ATM; otherwise I would probably do something like that.)
EDIT 1: As for tesseract, I hate it with the passion of a thousand fiery suns. It's a kludge, a black box of traditional-programming karate chops and overly complicated bloat that spits out text the way it likes, and there is, largely, nothing you can do about it. Compared to machine learning and modern computer vision, tesseract belongs to the dark ages. If there is going to be a quality OCR tool, it has to be written from scratch, based on deep learning from the ground up.
There's a brute-force solution to the "extract text from a 'digital-native' image" problem that you can write in an afternoon:
1. Use an existing OCR library to give you the positions of the words, plus a first-cut guess of their content.
2. Take the first word from the OCRed guess, and loop through a set of {font, size, leading} tuples, rendering out the same word at that {font, size, leading} and overlaying it on the image, and measuring error-distance.
3. If your best match isn't within some minimum error-distance, then assume that the OCR misrecognized the first word, and try again with the second, third, etc.
Once you've got a font-settings match:
4. render the rest of the words onto their respective detected bounding boxes;
5. notice which words have a higher error-distance than the rest;
6. for each word, generate candidate mutations of the word (e.g. everything at a Levenshtein distance of 1 from the OCRed guess), pick the one that lowers the error-distance, and repeat until the distance for that word won't go any lower.
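A rough sketch of the error-distance primitive the steps above rely on (all names are illustrative; it assumes the page has been drawn to a canvas and the OCR pass supplies word bounding boxes):

  // Render a word at a candidate {font, size} and measure per-pixel
  // error against the corresponding region of the source image.
  function errorDistance(sourceCtx, box, word, font, size) {
    var canvas = document.createElement('canvas');
    canvas.width = box.w;
    canvas.height = box.h;
    var ctx = canvas.getContext('2d');
    ctx.fillStyle = '#fff';
    ctx.fillRect(0, 0, box.w, box.h);
    ctx.fillStyle = '#000';
    ctx.font = size + 'px ' + font;
    ctx.textBaseline = 'top';
    ctx.fillText(word, 0, 0);
    var rendered = ctx.getImageData(0, 0, box.w, box.h).data;
    var source = sourceCtx.getImageData(box.x, box.y, box.w, box.h).data;
    var err = 0;
    for (var i = 0; i < rendered.length; i += 4) {
      err += Math.abs(rendered[i] - source[i]);  // red channel only
    }
    return err;
  }

  // Try every candidate {font, size} and keep the best-scoring one.
  function bestFontMatch(sourceCtx, box, word, candidates) {
    var best = null;
    candidates.forEach(function (c) {
      var err = errorDistance(sourceCtx, box, word, c.font, c.size);
      if (!best || err < best.err) best = { font: c.font, size: c.size, err: err };
    });
    return best;
  }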
Actually, come to think of it, it'd be a lot easier to detect and unify "identical" sub-regions of the image first (using e.g. https://en.wikipedia.org/wiki/JBIG2 on a lossless setting). Then you could, in parallel to the above, also try to do frequency-analysis to discover which of your image "tiles" would likely form a basic "alphabet" of character-glyphs—and then hill-climb toward aligning that "alphabet" by attempting to produce the most runs of character-glyphs that translate to known dictionary words in whatever language the OCR thinks the text is in.
The font-matching would still be necessary, though, for the rest of the image samples that don't fall into the easily-frequency-analyzed part. (And for languages that aren't alphabetic, like Chinese, where there are no super-common character-glyphs.)
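The frequency-analysis part could start as simply as bucketing identical binarized tiles. A sketch, assuming tile extraction (e.g. connected components) has already been done:

  // Count occurrences of identical tiles; the most frequent patterns
  // are candidates for the glyph "alphabet".
  function tileFrequencies(tiles) {  // tiles: array of Uint8Array bitmaps
    var counts = {};
    tiles.forEach(function (t) {
      var key = Array.prototype.join.call(t, '');
      counts[key] = (counts[key] || 0) + 1;
    });
    return counts;
  }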
Another partner and I came up with a similar solution. It hinged on detecting the typeface and using a bitmapped (or otherwise rendered) font package to OCR letter by letter.
The PDF files that we are dealing with do not have embedded text and are not searchable, but are "digital-native," to use the term that you suggested.
Does this not exist? If not, why does it not exist?!
Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but everything that I've found is full of errors, missing portions of sentences, rearranged fragments, etc.
Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.
I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.
Debian has a command-line tool, pdftotext, which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
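For example:

  $ pdftotext document.pdf output.txt
  $ pdftotext -layout document.pdf -    # preserve layout, write to stdout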
Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or any of the alignment or uneven-scale issues that are troublesome when attempting to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.
There are a couple of ways a PDF document could contain actual text that is nevertheless not selectable or searchable. One is that the originator could have protected the document; another (more common) cause is that the originator didn't embed the proper font maps when exporting the document. I see the latter a lot with documents produced from LaTeX originals.
As the parent mentioned, pdftotext can often extract text from such documents without the need for OCR. (Although sometimes if the document contains ligatures those don't get converted.)
Maybe not. The PDF probably has embedded text (so it doesn't blur when zooming in), but the text could be either converted into vector curves or protected from copying (see the document properties). The easiest fix is to change the PDF export settings in Word/Ghostscript/Distiller.
For all those reporting issues with reading text from a screenshot of this page: note that this is more an issue with the original Tesseract library than with this one (which appears to wrap Tesseract compiled through Emscripten). I remember having a similar issue when I used the original Tesseract. The quick hack I found to fix it was to rescale any small text input images 3x before feeding them to Tesseract. I'm sure there are more intelligent solutions to mitigate that problem.
Why the promise-like interface? If it returned a promise with a this-returning progress method monkey-patched onto it, then you could use it otherwise like a regular promise:
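A minimal sketch of that suggestion; runOcr stands in for whatever internal entry point actually drives the engine, so the names are illustrative:

  // Return a genuine Promise with a chainable progress() patched on.
  function recognize(image) {
    var progressCallbacks = [];
    var promise = new Promise(function (resolve, reject) {
      runOcr(image, {
        onProgress: function (p) {
          progressCallbacks.forEach(function (cb) { cb(p); });
        },
        onDone: resolve,
        onError: reject
      });
    });
    promise.progress = function (cb) {
      progressCallbacks.push(cb);
      return this;  // keeps the fluent style working
    };
    return promise;
  }

  // Works with the fluent style:
  //   recognize(img).progress(log).then(handle);
  // ...and anywhere a real promise is expected:
  //   Promise.all([recognize(a), recognize(b)]).then(handle);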
I guess I just still have bad memories of jQuery's old almost-like-real promises. I'd rather never have to think ever again about whether I'm dealing with a real promise or one that's going to surprise me and break at run-time because I tried to use it like a real one.
Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background.
For example, I took a screenshot of this comment and ran it through the demo and got this:
Excited ehent this... but the OCR enenty Seems te be very bad. Maybe it's het Dptimized far
recngnizing black text an e white heckgmnhe.
EDI example, 1 tank e Screenshnt at this cement ehe teh it. thmneh the den» ehd get this:
It seems to recognize the bounding boxes just fine but mangles the words.
Did you try increasing the font size a bit? On a retina macbook (so effectively ~2x bigger font) I get:
Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background.
For example, I took a screenshot of this comment and ran it through the demo and got this:
I've been using this library to read screenshots of Pokemon Go to automatically calculate Individual Values for each Pokemon. [1] It's worked great on desktop, but on mobile Safari, where it matters most, the library causes the browser to crash :(
Tesseract was one of the best publicly available CAPTCHA solvers when I was playing around with that stuff a few years ago; I remember somewhere in the neighbourhood of 90%+ accuracy on ReCAPTCHA. No wonder they've changed those considerably since then to make them difficult even for humans.
Does this mean I can implement Tesseract on my home server without using PHP's shell_exec to perform magic on my files? I can just use JavaScript instead? Cool!
My current HRCloud2 project could benefit greatly if I ever get around to it. Currently I make the php interpreter jump through hoops and move stuff all over the place to OCR images and docs. This could save a ton of time and shift the processing to the client instead of my server.
Well, it's pure JS in that it's the C++ tesseract run through Emscripten. So in a way it's pure JS just as much as the original lib is pure assembly when compiled ;-)
As another commenter mentioned, Tesseract.js won't perform very well on 'natural' images (e.g. the very light text you tried).
It should work better if you feed it a screenshot of the black text at the top of the demo page though (Tesseract.js is a pure Javascript port etc...).
The title and description are very misleading: this is technically "pure JavaScript" but the JS is compiled from the original C++ library of the same name using emscripten. I think "pure JS" would imply that all of its sources are written in JS which is not the case here. It's mostly the C++ code doing the actual work, with a little JS wrapper on top.
Pretty cool. I screen captured the text in the bottom right corner of the page and it had some issues. Here's a screenshot of what happened: http://io.kc.io/hkeM
Not sure about this JS version, but plain tesseract comes with trained data for selected languages and fonts, and it should work out of the box.
According to my tests this is true, but out of curiosity: can anyone get equivalent or better speed with tesseract.js? This is nice, but I don't need client-side processing, so is there any reason to pick tesseract.js?
Is this at all affiliated with the already-existing Tesseract OCR library? It doesn't seem to be, from my cursory check, so if not, you need to rename your library, because you're ripping off their name.
"Tesseract.js is a pure Javascript port of the popular Tesseract OCR engine." first sentence on http://tesseract.projectnaptha.com/ linked from the github page
A much better example that works quite well is a picture of someone holding a book: http://i.imgur.com/3JWs64x.jpg