Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages (github.com/naptha)
727 points by bijection on Oct 12, 2016 | 97 comments



To anyone screen-capturing small fonts as a demonstration, or capturing digital text at a small resolution: I don't believe that's the purpose of this OCR library. (As a specialized problem, that might be easier to solve, depending on the typeface.)

A much better example that works quite well is a picture of someone holding a book: http://i.imgur.com/3JWs64x.jpg

    Magic .
    Read this to yourself. Read it silently
    Don't move your lips. Don’t make a suund
    Listen to yourself. Listen without hearing
    What a wonderfully weird thing, huh?
    NOW MAKE THIS PART LOUD!
    SCREAM IT IN YOUR MIND!
    DROWN EVERYTHING OUT.
    Now, hear a whisper. A tiny whisper.
    New, read this next line with your best crotchety—
    old-man voice:

    “Hello there, sonny. Does your town have apost 0
    Awesome! Who was that? Whose voice was that?
    It sure wasn’t yours!

    How do you do that?
    How?!
    Must be magic.

Problems with this text: misspelled 'sound' as 'suund', didn't recognize the word 'anything', and mis-recognized 'a post office' as 'apost 0'.

Not bad. Especially since two of three mistakes are on the edge of the page.


The old man voice was spoken in my mind as Deckard Cain.


I stayed a while and I listened


it was instantaneous Deckard Cain for me.


Isn't this an issue with the algorithm?

Can someone test how it performs if you simply upscale the image using normal bicubic interpolation? If it performs much better, I feel that should be a preprocessing option, since it seems to do so poorly at small resolutions.


The Tesseract.js GitHub README actually recommends upscaling for better results:

https://github.com/naptha/tesseract.js#tesseractrecognizeima...

> Note: image should be sufficiently high resolution. Often, the same image will get much better results if you upscale it before calling recognize.
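
For what it's worth, here's a minimal sketch of doing that client-side before OCR (untested; `myImage` is assumed to be a loaded <img> element, and I'm assuming recognize accepts a canvas, which the docs suggest):

    // Sketch: upscale an image 3x with a canvas, then OCR the canvas.
    function upscaleAndRecognize(img, factor) {
      var canvas = document.createElement('canvas');
      canvas.width = img.width * factor;
      canvas.height = img.height * factor;
      var ctx = canvas.getContext('2d');
      ctx.imageSmoothingEnabled = true;  // let the browser interpolate
      ctx.drawImage(img, 0, 0, canvas.width, canvas.height);
      return Tesseract.recognize(canvas);
    }

    upscaleAndRecognize(myImage, 3)
      .then(function (result) { console.log(result.text); });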


Presumably, the result would be even better if you put a filter that did a path-tracing/auto-vectorization step first, upscaled by an arbitrary amount, and then rasterized the result. Analogue sources would have their traced paths distorted by such a process in a way that's lossier than just using the noisy analogue source—but for actually-digital-from-the-beginning images, it'd work perfectly.


This would be really cool! We would welcome a pull request and/or a link to any downstream project working on this. In a similar vein, we've been considering adding the stroke width transform [0] as an optional preprocessing step.

[0] https://www.microsoft.com/en-us/research/publication/stroke-...


Thanks, the text is awesome!


The text detection is lacking in comparison to Google's Vision API. Here is a real-life comparison between Tesseract and Google's Vision API, based on a PDF a user of our website uploaded.

Original text [http://i.imgur.com/CZGhKhn.png]:

> I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well

Google detects [http://i.imgur.com/pSJym1x.png]:

> “ I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well ”

Tesseract detects [http://i.imgur.com/wwbLU6g.png]:

> \ am also a mp pmfesslonzl on Thummack wmcn Is a sue 1m peop‘e \ookmg (or professmna‘ semces We on glg salad P‘ezse see my rewews 1mm my cuems were as weH


Although Google's API is certainly better, Tesseract.js should work similarly if you increase the font size. Screenshots taken on 'retina' devices are around the smallest text it can handle well.

Edit:

A screenshot of the same text at a higher resolution: https://imgur.com/a/W7IGu

Tesseract.js output: https://imgur.com/a/niIfM

"I am also a top professional on thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well"


Your comment (zoomed in Chrome on Win 10): http://i.imgur.com/uuFhw90.png

Tesseract.js analysis:

    Although Googie's API is certaihiy better,
    Tesseract.js should work simiiarly if you
    increase the font size.
    Screenshots taken
    on 'retiha’ devices are around the smailest
    text it can handie well.
    
    Edit:
    
    A screenshot of the same text at a higher
    resolution:
    httgs:[[imgurxomZaN/UGu
    
    Tesseract.js
    output: httgs://imguricom[a[hiIfM

This is a neat toy, but not impressive compared to the results from tesseract-ocr/tesseract [0]:

    $  curl -s http://i.imgur.com/uuFhw90.png \
        | tesseract stdin stdout

    Although Google's API is certainly better,
    Tesseract.js should work similarly if you
    increase the font size.
    Screenshots taken on 'retina' devices are
    around the smallest text it can handle well.
    
    Edit:
    A screenshot of the same text at a higher
    resolution: https:[ZimguncomlalWHGu
    Tesseract.js output:
    https:[[imgur.com[a[nilfM

Notice how the Tesseract.js results suffer from being unable to differentiate between n's and h's, and i's and l's.

[0] https://github.com/tesseract-ocr/tesseract


That's interesting! Given that Tesseract.js wraps an Emscripten'd copy of Tesseract, I would have expected close to identical performance. This might have to do with the way we threshold images, with the age of the tesseract version we're using, or both. I'll look into it!

Edit: In addition to those differences, I think your font size is still a bit too small. On an unedited screenshot from a macbook (https://i.imgur.com/iv4ZdSt.png) I get

  Although Google's API is certainly better, Tesseract.js should work similarly if you increase the font size. Screenshots taken on 'retina' devices are around the smallest text it can handle well.

  Edit:

  A screenshot of the same text at a higher resolution: https:[[imgur.com[a[W7IGu

  Tesseract.js output: https:[[imgur.com[a[niIfM

  “I am also a top professional on thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well"


What do you mean by "emscripten'd"? Is Tesseract.js using emscripten to effectively bundle the 150KLOC of C/C++ from tesseract-ocr and the upstream dependency on leptonica [0]? If so, that's amazing!

    > This might have to do with the way we threshold images,
    > with the age of the tesseract version we're using, or
    > both. I'll look into it!
I'd be very interested to hear about what is required to make it "match" native functionality. Please do drop me a line if/when you get it figured out! (I'm @jtaylor on twitter [1])

[0] http://www.leptonica.org/

[1] https://twitter.com/jtaylor


Here's the repo of their build process for the core Tesseract Emscripten build:

https://github.com/naptha/tesseract-emscripten/blob/master/j... (specifically the line for lepton)


Hey, just a reminder that this is a Show HN, which has its own rules. Honesty is okay, but aim to be respectful.


I used Tesseract for a production OCR project some years ago, and can confirm that it just doesn't work at screen resolution. On the other hand, performance on high-DPI photos was quite OK. The Google Vision API wasn't around at that time, so I can't compare.


I spent many afternoons trying to get tesseract to read Dwarf Fortress screenshots, such as http://i.imgur.com/32vVhnH.png - including much pre-processing, such as converting the text to black and white. Alas, I never even got close.

Edit: just tried Google's, and it had one mistake for that entire file. That's pretty impressive.


Since we're trading "I failed" stories...

I spent a weekend tying Tesseract together with Tekkotsu (the amazing open framework for the Sony AIBO) in an attempt to teach my robot dog to read. The eventual goal being to hook up the output of OCR --> Text To Speech (TTS) and have him read to me.

Alas, the low resolution of the camera was an insurmountable problem. Poor Aibo needed 40-point fonts and I practically had to rub his nose in the book. Not exactly the user experience I was aiming for.

Never got around to the TTS part.


I completely missed the word "robot" then and was pretty impressed that you wanted to teach your dog to read.


I'm trying to fathom how you thought he loaded Tesseract into a live dog.


There are two openings. One accepts anything, really. It's dangerous to try to use the other one, though.


If you're reading screenshots of a non-changing font, you could quite easily get away with plain template matching. Simply do a run-through and label the data one time.

I did something like that a few years ago when making an Eve-Online UI scraper.
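
Roughly like this (a hypothetical sketch: `CELL_W`/`CELL_H` and the `templates` map of label -> ImageData are things you'd build during the one-time labeling pass):

    // Match one glyph cell against hand-labeled templates by summing
    // per-pixel differences; the smallest total difference wins.
    function matchGlyph(cell, templates) {
      var best = null, bestErr = Infinity;
      Object.keys(templates).forEach(function (label) {
        var t = templates[label], err = 0;
        for (var i = 0; i < cell.data.length; i += 4) {
          err += Math.abs(cell.data[i] - t.data[i]);  // red channel only
        }
        if (err < bestErr) { bestErr = err; best = label; }
      });
      return best;
    }

    // Usage: cut fixed-size cells out of the screenshot.
    // var cell = ctx.getImageData(x, y, CELL_W, CELL_H);
    // var ch = matchGlyph(cell, templates);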


Upscaling worked ok for me:

Upscaled image: https://imgur.com/a/4IQA7

Result on demo page: http://imgur.com/a/A0v5C

The hammerman Tikes flsosushsath: Greetings. My name is Tikes Leafsilk.

You: Rh. hello. I'm Stasbo Murderknower the Craterous Trance of Fins. Don't travel alone at night. or the bogeyman will get you.

You: Tell me about this hall.

Tikes: This is The flccidental Ualley. In 123, Stasho Steamdances ruled from The flccidental Ualley of The Council of Cobras in Ueilapes.

...


This seems like a font issue. Would training the model for this console font help?


Google Cloud Vision API is very expensive though -- if you can sacrifice some amount of quality, it might make sense to go with the open source alternative. At $2.50/unit the cost is absurd, and even the free trial expires after 90 days.


I just checked because I couldn't believe that price.

"Price per 1000 units. Unit volumes are based on monthly usage." It's $2.50 per 1000 units, so 0.25 cents per unit.

Edit: And, according to the pricing page [1] the first 1000 units are free.

[1]: https://cloud.google.com/vision/pricing


Note that's $0.0025 per unit, not 25 cents per unit. Your post is correct, but, ya know, Verizon math is a real thing.


My mistake -- thanks for pointing that out. I think I was in shock, and didn't take a closer look.


This is the correct pricing. One caveat: Even the free tier requires you to enter a credit card.

For higher volumes, there is also the OCR.space API. It offers 25,000 free conversions per month. It is not as good as Google's, but works fine on screenshots.


We spent a decent amount of time evaluating Abbyy vs Tesseract vs Cloud Vision. Cloud Vision wins hands down and is very reasonably priced.


There is a Chrome Extension for Cloud Vision which works well and seems to be free.

Written by a Google employee, see top link at http://www.imjasonh.com/projects


How do you test sample images with Google's Vision API? Do you have to sign up for the 90-day trial, or do they let you upload images to try it?


HOW is there not a better, almost 100% accurate OCR tool?

I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.

And, yet, Microsoft and Google OCR spits out gibberish that is full of critical errors.

From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case, that is, PDFs generated from text files. Identify a uniform font size (to prevent o-to-O and o-to-0 errors), identify a font family (serif or sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet the output is useless in my field.


You do not want OCR for this. You want either ABBYY FineReader (around $99 for a license), or, if you prefer open source, Tabula:

https://source.opennews.org/en-US/articles/introducing-tabul...

The main advantage of ABBYY is that if you need to do OCR, it is, in my opinion, the best consumer-level package. And it does a pretty good job of doing OCR and conversion to Excel. Here's a Github repo that demonstrates some results:

https://github.com/dannguyen/abbyy-finereader-ocr-senate

But to reemphasize, the above repo demonstrates ABBYY maintaining table structure with PDFs that are scanned images, which is considerably harder than the situation you're in.

I've started a repo that eventually will compare text-to-table tools, which is what you want: https://github.com/dannguyen/pdftotablestable


As much as your response tries to solve the GP's particular problem (OCR being the wrong tool for PDF-to-text), I 100% agree with the extreme annoyance it expresses about the state of free OCR.

In principle, text-PDF-to-text is just a matter of parsing the PDF (and/or Word) format and extracting the text buried in the file. (I know it's a lot of work, but still.)
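
When a PDF really does contain a text layer, that extraction needs no OCR at all. For instance, a sketch using Mozilla's pdf.js (API names differ across pdf.js versions, so treat these as assumptions):

    // Pull the embedded text layer from page 1 of a PDF -- no OCR.
    pdfjsLib.getDocument('document.pdf').promise
      .then(function (pdf) { return pdf.getPage(1); })
      .then(function (page) { return page.getTextContent(); })
      .then(function (content) {
        var text = content.items
          .map(function (item) { return item.str; })
          .join(' ');
        console.log(text);
      });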

Even if you set aside what the GP said about the sources being text PDFs, and assume all the sources are PNG images: as long as those PNGs were generated from text documents (Word, PDF, etc.) without any scanning or camera involved, it is unacceptable that today's free OCR tools don't get the job done, when in 2016 machine learning has produced systems that surpass human accuracy on much harder tasks like object detection and speech recognition.

I know it's not an unsolved problem. It's just a matter of some knowledgeable machine-learning researcher taking a break from the cutting edge for a few months and putting together a package that gets the image-to-text job done. Once such a base tool is available on GitHub, the community will take over and add features and fix bugs as needed. (I'm extremely busy with my own degree work at the moment; otherwise I would probably do something like that.)

EDIT 1: As for Tesseract, I hate it with the passion of a thousand fiery suns. It's a kludge, a black box of traditional-programming karate chops and overly complicated bloat that spits out text the way it likes, and there is largely nothing you can do about it. Compared to machine learning and modern computer vision, Tesseract belongs to the dark ages. If there is going to be a quality OCR tool, it has to be written from scratch, based on deep learning from the ground up.


There's a brute-force solution to the "extract text from a 'digital-native' image" problem that you can write in an afternoon:

1. Use an existing OCR library to give you the positions of the words, plus a first-cut guess of their content.

2. Take the first word from the OCRed guess, and loop through a set of {font, size, leading} tuples, rendering out the same word at that {font, size, leading} and overlaying it on the image, and measuring error-distance.

3. If your best match isn't within some minimum error-distance, then assume that the OCR misrecognized the first word, and try again with the second, third, etc.

Once you've got a font-settings match:

4. render the rest of the words onto their respective detected bounding boxes;

5. notice which words have a higher error-distance than the rest;

6. for each word, generate candidate mutations of the word (e.g. everything at a Levenshtein distance of 1 from the OCRed guess), pick the one that lowers the error-distance, and repeat until the distance for that word won't go down any lower.

7. Return the error-minimized set of words.

You could call this a form of https://en.wikipedia.org/wiki/Code-excited_linear_prediction, with fonts as the pre-trained models.
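
A rough sketch of the error measure at the heart of steps 2-6 (everything here is illustrative: `srcCtx` is a canvas 2d context holding the source image, and `box` is one OCR-detected bounding box):

    // Render a candidate word at candidate font settings and compute
    // the mean per-pixel difference against the source region.
    function errorDistance(srcCtx, box, word, font) {
      var c = document.createElement('canvas');
      c.width = box.w; c.height = box.h;
      var ctx = c.getContext('2d');
      ctx.fillStyle = '#fff'; ctx.fillRect(0, 0, box.w, box.h);
      ctx.fillStyle = '#000';
      ctx.font = font;            // e.g. '14px Georgia'
      ctx.textBaseline = 'top';
      ctx.fillText(word, 0, 0);
      var a = srcCtx.getImageData(box.x, box.y, box.w, box.h).data;
      var b = ctx.getImageData(0, 0, box.w, box.h).data;
      var err = 0;
      for (var i = 0; i < a.length; i += 4) err += Math.abs(a[i] - b[i]);
      return err / (box.w * box.h);
    }

You'd then loop this over candidate {font, size} tuples (step 2) and over Levenshtein-1 mutations of each OCRed word (step 6), keeping whatever minimizes the distance.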

---

Actually, come to think of it, it'd be a lot easier to detect and unify "identical" sub-regions of the image first (using e.g. https://en.wikipedia.org/wiki/JBIG2 on a lossless setting). Then you could, in parallel to the above, also try to do frequency-analysis to discover which of your image "tiles" would likely form a basic "alphabet" of character-glyphs—and then hill-climb toward aligning that "alphabet" by attempting to produce the most runs of character-glyphs that translate to known dictionary words in whatever language the OCR thinks the text is in.

The font-matching would still be necessary, though, for the rest of the image samples that don't fall into the easily-frequency-analyzed part. (And for languages that aren't alphabetic, like Chinese, where there are no super-common character-glyphs.)


Another partner and I came up with a similar solution. It hinged on detecting the typeface and using a bitmapped (or otherwise rendered) font package to OCR letter by letter.

The PDF files that we are dealing with do not have embedded text and are not searchable, but are "digital-native," to use the term that you suggested.

Does this not exist? If not, why does it not exist?!


Why do you use OCR and not PDF to text conversion?


Probably because the PDF is just a big image file, if I understand correctly. Otherwise it should be just copy-paste from the PDF.


Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but everything that I've found is full of errors, missing portions of sentences, rearranged fragments, etc.


> The PDF files are not scans. They are PDF files created from a Word file.

I am unsure as to why he can't just copy / paste.


Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.


Debian has a command-line tool, 'pdftotext', which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
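
For example:

    $ sudo apt-get install poppler-utils
    $ pdftotext input.pdf output.txt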


Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the sort that are troublesome when attempting to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.


There are a couple of ways a PDF document could contain actual text that is nevertheless not selectable or searchable. One is that the originator could have protected the document; another (more common) cause is that the originator didn't embed the proper font maps when exporting the document. I see the latter a lot with documents produced from LaTeX originals. As the parent mentioned, pdftotext can often extract text from such documents without the need for OCR. (Although sometimes if the document contains ligatures those don't get converted.)


@iplaw please clarify -- does the PDF have an image or text?


Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the sort that are troublesome when attempting to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.


It doesn't make sense to use OCR for this. Libraries such as Aspose will do much better.


Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the sort that are troublesome when attempting to OCR a scan or photograph. It looks exactly like a Word document, but without selectable or editable text.

So, I do have to use OCR, right?


Maybe not. The PDF probably has embedded text (so it doesn't blur when zooming in), but it could be either converted into vector curves or protected from copying (see its properties). The easiest fix is to change the PDF export settings in Word/Ghostscript/Distiller.


For all those reporting issues with reading text from a screenshot of this page: note that this is more an issue with the original Tesseract library than with this one (which appears to wrap Tesseract compiled through Emscripten). I remember having a similar issue when I used the original Tesseract. The quick hack I found to fix it was to rescale any small text input images 3x before feeding them to Tesseract. I'm sure there are more intelligent solutions to mitigate the problem.


Yeah, in the past, I've had to scale my own scans/photos to get good results. The tesseract github site mentions this:

    "Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images."
- https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuali...


Why the promise-like interface? If it returned a promise with a this-returning progress method monkey-patched onto it, then you could use it otherwise like a regular promise:

    Tesseract.recognize(myImage)
      .progress(function(message){console.log(message)})
      .then(function(result){console.log(result)})
      .catch(function(err){console.error(err)});
or

    Tesseract.recognize(myImage)
      .progress(function(message){console.log(message)})
      .then(
        function(result){console.log(result)},
        function(err){console.error(err)}
      );
I guess I just still have bad memories of jQuery's old almost-like-real promises. I'd rather never have to think ever again about whether I'm dealing with a real promise or one that's going to surprise me and break at run-time because I tried to use it like a real one.


If you want to use a real Promise, you can wrap the call to recognize in Promise.resolve:

  Promise.resolve(Tesseract.recognize(myImage)).then(result => console.log(result))
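
Or, keeping the progress callback (a sketch assuming .progress() returns the job, as the chained examples above suggest):

    var job = Tesseract.recognize(myImage)
      .progress(function (m) { console.log('progress:', m); });

    // Wrap the thenable so downstream code sees a spec-compliant Promise.
    Promise.resolve(job)
      .then(function (result) { console.log(result.text); })
      .catch(function (err) { console.error(err); });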


Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background.

For example, I took a screenshot of this comment and ran it through the demo and got this:

Excited ehent this... but the OCR enenty Seems te be very bad. Maybe it's het Dptimized far recngnizing black text an e white heckgmnhe. EDI example, 1 tank e Screenshnt at this cement ehe teh it. thmneh the den» ehd get this:

It seems to recognize the bounding boxes just fine but mangles the words.


Did you try increasing the font size a bit? On a retina macbook (so effectively ~2x bigger font) I get:

Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background. For example, I took a screenshot of this comment and ran it through the demo and got this:


I've been using this library to read screenshots of Pokemon Go to automatically calculate Individual Values for each Pokemon [1]. It's worked great on desktop, but on mobile Safari, where it matters most, the library causes the browser to crash :(

1: https://github.com/goatslacker/pokemon-go-iv-calculator/blob...


Consider doing it server-side


Tesseract was one of the best publicly available CAPTCHA solvers when I was playing around with that stuff a few years ago; I remember somewhere in the neighbourhood of 90%+ accuracy on reCAPTCHA. No wonder they've since changed those considerably, to the point of being difficult even for humans.


I've always wanted to use Tesseract on .NET projects but it was always clumsy (wrappers). This looks like it'll make things easier.

Thanks for putting this out.


I think a .NET wrapper would be more direct/elegant than using Emscripten-generated code (especially in a .NET project).


A few .NET wrappers exist...but it's always felt heavyweight for me (my use case is pretty rare). I am hoping this makes it trivially easy.

We'll see how it goes.


> Drop an English image on this page to OCR it!

This looks great, and I'd really love to but

> Uncaught ReferenceError: progress is not defined

EDIT: works now!


What browser / OS are you on?

Edit: this affected every browser because it was a typo. Fixed!


OSX El Capitan 10.11.6 (15G31) Chrome Version 53.0.2785.143 (64-bit)


The languages list link is broken -- I'm getting a 404 for the following: https://github.com/naptha/tesseract.js/blob/master/tesseract...



Does this mean I can implement Tesseract on my home server without using PHP's shell_exec to perform magic on my files? I can just use JavaScript instead? Cool!

My current HRCloud2 project could benefit greatly, if I ever get around to it. Currently I make the PHP interpreter jump through hoops and move stuff all over the place to OCR images and docs. This could save a ton of time and shift the processing to the client instead of my server.


Yep, this is completely client side :)

You can even host the external language files yourself as described in the readme: https://github.com/naptha/tesseract.js#local-installation
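
Roughly like this (see the README for the exact option names; the paths here are placeholders):

    // Point Tesseract.js at self-hosted worker, core, and language files.
    window.Tesseract = Tesseract.create({
      workerPath: '/js/worker.js',
      langPath:   '/tessdata/',   // e.g. serves /tessdata/eng.traineddata.gz
      corePath:   '/js/index.js'
    });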


Impressive that this is pure JS; however, trying an image cut from the page itself gave this result:

> Dropan Enghsh Wage on (Ms page to OCR m

Should be

> Drop an English image on this page to OCR it!


> Impressive that this is pure JS

Well, it's pure JS in the sense that it runs the C++ Tesseract through Emscripten. So in a way it's pure JS just as much as the original lib is pure assembly when compiled ;-)


As another commenter mentioned, Tesseract.js won't perform very well on 'natural' images (e.g. the very light text you tried).

It should work better if you feed it a screenshot of the black text at the top of the demo page though (Tesseract.js is a pure Javascript port etc...).


The title and description are very misleading: this is technically "pure JavaScript", but the JS is compiled from the original C++ library of the same name using Emscripten. I think "pure JS" would imply that all of its sources are written in JS, which is not the case here. It's mostly the C++ code doing the actual work, with a little JS wrapper on top.


Pretty cool. I screen-captured the text in the bottom right corner of the page and it had some issues. Here's a screenshot of what happened: http://io.kc.io/hkeM


Awesome! The ability to OCR video in a browser opens up so many interesting possibilities.


For those who may be interested:

I threw together a quick proof-of-concept in Go for exposing tesseract via a web API:

https://github.com/jaytaylor/tesseract-web


Does this include taking a text and, for example when viewing it, 'wiping' the text in the logical native-language order?

For languages that don't employ much whitespace, this would be nice.


Sorry guys, probably a stupid question (googled quickly, didn't work), but does this kind of stuff involve ML? Do I need to train it?


Not sure about this JS version, but plain Tesseract comes with a trained database for selected languages and fonts, and it should work out of the box.


Ah OK, this helped. At least now I know more about Tesseract. Thank you, akerro.


Does it block while it works, doing the work in several setTimeouts, or how do they get it to report progress without freezing everything?


Tesseract.js uses Web Workers in the browser.
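
That's the general pattern that keeps the page responsive (a generic sketch, not Tesseract.js's actual internals; 'ocr-worker.js', updateProgressBar, and showText are hypothetical):

    // Heavy work runs in a worker; the main thread only handles messages.
    var worker = new Worker('ocr-worker.js');
    worker.onmessage = function (e) {
      if (e.data.type === 'progress') updateProgressBar(e.data.value);
      else if (e.data.type === 'result') showText(e.data.text);
    };
    worker.postMessage({ image: imageData });  // hand the pixels to the worker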


Is it true that the original implementation of Tesseract, executed from the command line, is faster than the JavaScript-translated version?


According to my tests this is true, but out of curiosity, can anyone get equivalent or better speed with tesseract.js? This is nice, but I don't need client-side processing, so is there any reason to pick tesseract.js?


What license? It doesn't mention one.



Says MIT in package.json


Tesseract is not specific to JavaScript, right? I recall there being a version for Python.


No, tesseract is a C++ library; this is a wrapper for an Emscripten port of that library.


More instructions, like how to train it, would be nice.


How does it compare with Ocrad.js?


How long did this take to build?


Awesome


Is this at all affiliated with the already-existing Tesseract OCR library? It doesn't seem to be, from my cursory check, so if not you need to rename your library, because you're ripping off their name.

https://github.com/tesseract-ocr/tesseract


It's a wrapper around an Emscripten port of that library. See https://github.com/naptha/tesseract.js-core


"Tesseract.js is a pure Javascript port of the popular Tesseract OCR engine." first sentence on http://tesseract.projectnaptha.com/ linked from the github page



