
Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages - bijection
https://github.com/naptha/tesseract.js
======
xigency
To anyone screen capturing small fonts as a demonstration, or capturing
digital text especially at a small resolution, I don't believe that that is
the purpose of this OCR library. (As a specialized problem, that might be
easier to solve depending on the typeface.)

A much better example that works quite well is a picture of someone holding a
book: [http://i.imgur.com/3JWs64x.jpg](http://i.imgur.com/3JWs64x.jpg)

    
    
        Magic .
        Read this to yourself. Read it silently
        Don't move your lips. Don’t make a suund
        Listen to yourself. Listen without hearing
        What a wonderfully weird thing, huh?
        NOW MAKE THIS PART LOUD!
        SCREAM IT IN YOUR MIND!
        DROWN EVERYTHING OUT.
        Now, hear a whisper. A tiny whisper.
        New, read this next line with your best crotchety—
        old-man voice:
    
        “Hello there, sonny. Does your town have apost 0
        Awesome! Who was that? Whose voice was that?
        It sure wasn’t yours!
    
        How do you do that?
        How?!
        Must be magic.
    

Problems with this text: misspelled 'sound' as 'suund', didn't recognize the
word 'anything', and mis-recognized 'a post office' as 'apost 0'.

Not bad. Especially since two of three mistakes are on the edge of the page.

~~~
minism
The old man voice was spoken in my mind as Deckard Cain.

~~~
holografix
I stayed a while and I listened

------
pyronite
The text detection is lacking in comparison to Google's Vision API. Here is a
real-life comparison between Tesseract and Google's Vision API, based on a PDF
a user of our website uploaded.

Original text
[[http://i.imgur.com/CZGhKhn.png](http://i.imgur.com/CZGhKhn.png)]:

> I am also a top professional on Thumbtack which is a site for people looking
> for professional services like on gig salad. Please see my reviews from my
> clients there as well

Google detects
[[http://i.imgur.com/pSJym1x.png](http://i.imgur.com/pSJym1x.png)]:

> “ I am also a top professional on Thumbtack which is a site for people
> looking for professional services like on gig salad. Please see my reviews
> from my clients there as well ”

Tesseract detects
[[http://i.imgur.com/wwbLU6g.png](http://i.imgur.com/wwbLU6g.png)]:

> \ am also a mp pmfesslonzl on Thummack wmcn Is a sue 1m peop‘e \ookmg (or
> professmna‘ semces We on glg salad P‘ezse see my rewews 1mm my cuems were as
> weH

~~~
ajacksified
I spent many afternoons trying to get tesseract to read Dwarf Fortress
screenshots, such as
[http://i.imgur.com/32vVhnH.png](http://i.imgur.com/32vVhnH.png) - including
much pre-processing, such as converting the text to black and white. Alas, I
never even got close.

Edit: just tried Google's, and it had one mistake for that entire file. That's
pretty impressive.
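
The black-and-white pre-processing mentioned above can be sketched as a fixed-threshold binarization. This is only a minimal sketch, assuming the screenshot has already been decoded into a flat RGBA byte array (the layout used by canvas `ImageData`). The name `binarize` and the threshold of 128 are illustrative choices, not anything Tesseract prescribes.

```javascript
// Hypothetical helper: convert RGBA pixels to pure black/white in place.
// `pixels` is a flat array of bytes [r, g, b, a, r, g, b, a, ...].
function binarize(pixels, threshold = 128) {
  for (let i = 0; i < pixels.length; i += 4) {
    // Rec. 601 luma approximation of perceived brightness.
    const luma = 0.299 * pixels[i] + 0.587 * pixels[i + 1] + 0.114 * pixels[i + 2];
    const value = luma >= threshold ? 255 : 0;
    pixels[i] = pixels[i + 1] = pixels[i + 2] = value; // alpha is left untouched
  }
  return pixels;
}
```

A fixed threshold is the simplest option; whether it helps on a given screenshot depends on the contrast between the text and its background.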

~~~
bijection
Upscaling worked ok for me:

Upscaled image: [https://imgur.com/a/4IQA7](https://imgur.com/a/4IQA7)

Result on demo page: [http://imgur.com/a/A0v5C](http://imgur.com/a/A0v5C)

The hammerman Tikes ﬂsosushsath: Greetings. My name is Tikes Leafsilk.

You: Rh. hello. I'm Stasbo Murderknower the Craterous Trance of Fins. Don't
travel alone at night. or the bogeyman will get you.

You: Tell me about this hall.

Tikes: This is The ﬂccidental Ualley. In 123, Stasho Steamdances ruled from
The ﬂccidental Ualley of The Council of Cobras in Ueilapes.

...

~~~
xigency
This seems like a font issue. Would training the model for this console font
help?

------
iplaw
HOW is there not a better, almost 100% accurate OCR tool?

I routinely (daily) need to OCR PDF files. The PDF files are not scans. They
are PDF files created from a Word file. The text is 100% clear, the lines are
100% straight, and the type is 100% uniform.

And yet Microsoft's and Google's OCR spit out gibberish that is full of
critical errors.

From a problem solving perspective, this seems like an incredibly easy problem
to solve in this exact use case. That is, PDFs generated from text files.
Identify a uniform font size (prevent o-to-O and o-to-0 errors), identify a
font-family (serif, sans-serif, narrow to particular fonts), and OCR the damn
thing. And yet, the output is useless in my field.

~~~
danso
You do not want OCR for this. You want either ABBYY FineReader (around $99 for
a license), or, if you prefer open source, Tabula:

[https://source.opennews.org/en-US/articles/introducing-
tabul...](https://source.opennews.org/en-US/articles/introducing-tabula/)

The main advantage of ABBYY is that if you need to do OCR, it is, in my
opinion, the best consumer-level package. And it does a pretty good job of
doing OCR _and_ conversion to Excel. Here's a Github repo that demonstrates
some results:

[https://github.com/dannguyen/abbyy-finereader-ocr-
senate](https://github.com/dannguyen/abbyy-finereader-ocr-senate)

But to reemphasize, the above repo demonstrates ABBYY maintaining table
structure _with PDFs that are scanned images_ , which is considerably harder
than the situation you're in.

I've started a repo that eventually will compare text-to-table tools, which is
what you want:
[https://github.com/dannguyen/pdftotablestable](https://github.com/dannguyen/pdftotablestable)

~~~
fizixer
As much as your response tries to solve GP's particular problem (OCR not being
the right tool for PDF-to-text), I 100% agree with the extreme annoyance GP
expressed regarding the state of free OCR.

In principle, text-pdf-to-text is just a matter of parsing PDF (and/or Word)
formats and extracting text buried in metadata. (I know it's a lot of work but
still).

Even if you forget about what GP said about the source being text PDFs, and
when all the sources are png images, as long as those pngs were generated from
text documents (Word, PDF, etc) without any scanning or camera involved, it is
unacceptable that today's free OCR tools don't get the job done, when in 2016,
machine-learning has produced systems that have surpassed human accuracy in
much harder tasks like object detection and speech recognition.

I know it's not an unsolved problem. It's just a matter of some knowledgeable
machine learning researcher taking a break from working on cutting edge for a
few months and putting together a package that gets the image-to-text job
done. Once such a base tool is available on github, the community will take
over and add features, fix bugs, as needed. (I'm extremely busy with my own
degree work ATM, otherwise I would probably do something like that).

EDIT 1: As for tesseract, I hate it with the passion of a thousand fiery suns.
It's a kludge, a black box of traditional-programming karate chops and overly
complicated bloat that spits out text the way it likes, and there is largely
nothing you can do about it. Compared to machine learning and modern computer
vision, tesseract belongs to the dark ages. If there is going to be a quality
OCR tool, it has to be written from scratch, based on deep learning from the
ground up.

~~~
derefr
There's a brute-force solution to the "extract text from a 'digital-native'
image" problem that you can write in an afternoon:

1. Use an existing OCR library to give you the positions of the words, plus a
first-cut guess of their content.

2. Take the first word from the OCRed guess, and loop through a set of {font,
size, leading} tuples, rendering out the same word at that {font, size,
leading}, overlaying it on the image, and measuring error-distance.

3. If your best match isn't within some error-distance threshold, then assume
that the OCR misrecognized the first word, and try again with the second,
third, etc.

Once you've got a font-settings match:

4. Render the rest of the words onto their respective detected bounding
boxes.

5. Notice which words have a higher error-distance than the rest.

6. For each such word, generate candidate mutations of the word (e.g.
everything at a Levenshtein distance of 1 from the OCRed guess), pick the one
that lowers the error-distance, and repeat until the distance for that word
won't go down any lower.

7. Return the error-minimized set of words.

You could call this a form of [https://en.wikipedia.org/wiki/Code-
excited_linear_prediction](https://en.wikipedia.org/wiki/Code-
excited_linear_prediction), with fonts as the pre-trained models.
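
The correction loop in step 6 can be sketched as a greedy hill-climb over single-edit mutations. This is a rough sketch under stated assumptions: the render-and-compare step is abstracted into a caller-supplied `errorFn(word)`, and `toyError` is a made-up stand-in that compares against a known string instead of diffing rendered glyphs against an image. All function names here are hypothetical.

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

// All strings at Levenshtein distance 1 from `word`
// (one deletion, substitution, or insertion).
function editsAtDistanceOne(word, alphabet = ALPHABET) {
  const out = new Set();
  for (let i = 0; i <= word.length; i++) {
    const head = word.slice(0, i);
    const tail = word.slice(i);
    if (tail.length > 0) out.add(head + tail.slice(1)); // deletion
    for (const ch of alphabet) {
      if (tail.length > 0) out.add(head + ch + tail.slice(1)); // substitution
      out.add(head + ch + tail); // insertion
    }
  }
  out.delete(word); // substituting a letter for itself is not a mutation
  return [...out];
}

// Greedy hill-climb: keep taking the best single-edit mutation
// until no mutation lowers the error any further.
function refineWord(guess, errorFn) {
  let best = guess;
  let bestError = errorFn(best);
  for (;;) {
    let improved = false;
    for (const candidate of editsAtDistanceOne(best)) {
      const err = errorFn(candidate);
      if (err < bestError) {
        best = candidate;
        bestError = err;
        improved = true;
      }
    }
    if (!improved) return best;
  }
}

// Toy stand-in for "render the word and diff it against the image":
// counts position-wise character mismatches plus the length difference.
function toyError(truth) {
  return (word) => {
    let err = Math.abs(word.length - truth.length);
    for (let i = 0; i < Math.min(word.length, truth.length); i++) {
      if (word[i] !== truth[i]) err++;
    }
    return err;
  };
}

console.log(refineWord('suund', toyError('sound'))); // sound
```

Like any hill-climb, this can stall in a local minimum when no single edit improves the score, so a real implementation would probably need a wider search for badly mangled words.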

---

Actually, come to think of it, it'd be a lot easier to detect and unify
"identical" sub-regions of the image first (using e.g.
[https://en.wikipedia.org/wiki/JBIG2](https://en.wikipedia.org/wiki/JBIG2) on
a lossless setting). Then you could, in parallel to the above, also try to do
frequency-analysis to discover which of your image "tiles" would likely form a
basic "alphabet" of character-glyphs—and then hill-climb toward aligning that
"alphabet" by attempting to produce the most runs of character-glyphs that
translate to known dictionary words in whatever language the OCR thinks the
text is in.

The font-matching would still be necessary, though, for the _rest_ of the
image samples that don't fall into the easily-frequency-analyzed part. (And
for languages that aren't alphabetic, like Chinese, where there are no super-
common character-glyphs.)
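
The frequency-analysis idea can be sketched by grouping identical tiles and counting them. This sketch assumes glyph segmentation has already happened and that each tile arrives as an array of pixel-row strings; the input format and the name `tileFrequencies` are hypothetical, and only exactly identical tiles are unified.

```javascript
// Hypothetical input: each tile is an array of strings, one per pixel row,
// e.g. ['010', '101', '111'] for a 3x3 bitmap.
function tileFrequencies(tiles) {
  const counts = new Map();
  for (const tile of tiles) {
    const key = tile.join('\n'); // identical bitmaps produce identical keys
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most frequent tiles first: likely candidates for common glyphs.
  return [...counts.entries()]
    .map(([key, count]) => ({ tile: key.split('\n'), count }))
    .sort((a, b) => b.count - a.count);
}
```

For English text, the top-ranked tiles would be expected to correspond to frequent letters such as "e" and "t", which gives the hill-climbing step a starting point for aligning tiles with an alphabet.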

~~~
iplaw
Another partner and I came up with a similar solution. It hinged on detecting
the typeface and using a bitmapped (or otherwise rendered) font package to OCR
letter by letter.

The PDF files that we are dealing with do not have embedded text and are not
searchable, but are "digital-native," to use the term that you suggested.

Does this not exist? If not, why does it not exist?!

------
AgentME
Why the promise-_like_ interface? If it returned a promise with a
this-returning progress method monkey-patched onto it, then you could
otherwise use it like a regular promise:

    
    
        Tesseract.recognize(myImage)
          .progress(function(message){console.log(message)})
          .then(function(result){console.log(result)})
          .catch(function(err){console.error(err)});
    

or

    
    
        Tesseract.recognize(myImage)
          .progress(function(message){console.log(message)})
          .then(
            function(result){console.log(result)},
            function(err){console.error(err)}
          );
    

I guess I just still have bad memories of jQuery's old almost-like-real
promises. I'd rather never have to think ever again about whether I'm dealing
with a real promise or one that's going to surprise me and break at run-time
because I tried to use it like a real one.
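
The suggestion above, a real promise with a this-returning `progress` method patched on, could look roughly like this. It is only a sketch of the idea and not how Tesseract.js is implemented; `promiseWithProgress` and `fakeRecognize` are hypothetical names, and the timer stands in for real recognition work.

```javascript
// Hypothetical helper, not part of Tesseract.js: build a real Promise
// that also carries a chainable `progress` method.
function promiseWithProgress(executor) {
  const listeners = [];
  const notify = (message) => listeners.forEach((fn) => fn(message));
  const promise = new Promise((resolve, reject) => {
    executor(resolve, reject, notify);
  });
  // Monkey-patch: register a listener and return the same promise,
  // so `.then` and `.catch` can follow as usual.
  promise.progress = function (fn) {
    listeners.push(fn);
    return this;
  };
  return promise;
}

// Stand-in for a long-running job that reports progress asynchronously.
function fakeRecognize() {
  return promiseWithProgress((resolve, reject, notify) => {
    setTimeout(() => {
      notify({ progress: 0.5 });
      notify({ progress: 1 });
      resolve('recognized text');
    }, 0);
  });
}
```

One caveat of this design: `.progress` has to come before `.then` in the chain, because `.then` returns a fresh, unpatched promise.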

~~~
bijection
If you want to use a real Promise, you can wrap the call to recognize in
Promise.resolve:

    
    
      Promise.resolve(Tesseract.recognize(myImage)).then(result => console.log(result))
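
The wrapping works because `Promise.resolve` adopts any object that has a `then` method. A small sketch, using a made-up thenable in place of whatever `Tesseract.recognize` returns (the real object may look different):

```javascript
// Hypothetical promise-like object: it has `then`, but is not a Promise
// and lacks `catch`, so code that expects a real promise can break on it.
const thenable = {
  then(onFulfilled) {
    onFulfilled('recognized text');
  },
};

// Promise.resolve adopts the thenable's eventual value and returns a
// genuine Promise with the full then/catch/finally interface.
const real = Promise.resolve(thenable);

real
  .then((result) => console.log(result))
  .catch((err) => console.error(err));
```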

------
jameslk
For all those claiming issues with reading text from a screen shot of this
page, note that this is more an issue with the original Tesseract library, not
this library (which appears to wrap Tesseract compiled through Emscripten). I
remember having a similar issue when I used the original Tesseract. The quick
hack I found to fix it was to rescale any small text input images 3x first
before feeding it to Tesseract. I'm sure there are more intelligent solutions to
mitigate that problem.
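
The 3x rescale can be sketched with plain nearest-neighbour sampling. This is a minimal illustration, assuming a grayscale image stored as a flat row-major array; the helper name `upscale` is hypothetical, and a smoothing resampler may well give better OCR results than this.

```javascript
// Hypothetical helper: upscale a grayscale image by an integer factor
// using nearest-neighbour sampling. `pixels` is a flat row-major array
// holding width * height values.
function upscale(pixels, width, height, factor = 3) {
  const outWidth = width * factor;
  const outHeight = height * factor;
  const out = new Array(outWidth * outHeight);
  for (let y = 0; y < outHeight; y++) {
    const srcY = Math.floor(y / factor);
    for (let x = 0; x < outWidth; x++) {
      const srcX = Math.floor(x / factor);
      out[y * outWidth + x] = pixels[srcY * width + srcX];
    }
  }
  return { pixels: out, width: outWidth, height: outHeight };
}
```

In a browser, drawing the image onto a canvas three times larger with `drawImage` would be the more usual way to get the same effect.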

~~~
dunham
Yeah, in the past, I've had to scale my own scans/photos to get good results.
The tesseract github site mentions this:

    
    
        "Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images."
    

- [https://github.com/tesseract-
ocr/tesseract/wiki/ImproveQuali...](https://github.com/tesseract-
ocr/tesseract/wiki/ImproveQuality)
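
As a rough worked example of the 300 dpi guideline, the scale factor is just the target DPI divided by the source DPI. The 96 DPI figure for a desktop screenshot is an assumption made for illustration, and `scaleFactorForDpi` is a hypothetical helper.

```javascript
const TARGET_DPI = 300; // figure taken from the quoted wiki sentence

// Scale factor needed to bring an image up to the target DPI.
// Images already at or above the target are left at their original size.
function scaleFactorForDpi(sourceDpi, targetDpi = TARGET_DPI) {
  if (!(sourceDpi > 0)) throw new RangeError('sourceDpi must be positive');
  return Math.max(1, targetDpi / sourceDpi);
}

// Assumption: a typical desktop screenshot is treated as 96 DPI.
console.log(scaleFactorForDpi(96)); // 3.125
```

Under that assumption the result lands close to the 3x rescale mentioned in the parent comment.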

------
greenpizza13
Excited about this... but the OCR quality seems to be very bad. Maybe it's not
optimized for recognizing black text on a white background.

For example, I took a screenshot of this comment and ran it through the demo
and got this:

Excited ehent this... but the OCR enenty Seems te be very bad. Maybe it's het
Dptimized far recngnizing black text an e white heckgmnhe. EDI example, 1 tank
e Screenshnt at this cement ehe teh it. thmneh the den» ehd get this:

It seems to recognize the bounding boxes just fine but mangles the words.

~~~
bijection
Did you try increasing the font size a bit? On a retina macbook (so
effectively ~2x bigger font) I get:

Excited about this... but the OCR quality seems to be very bad. Maybe it's not
optimized for recognizing black text on a white background. For example, I
took a screenshot of this comment and ran it through the demo and got this:

------
gentleteblor
I've always wanted to use Tesseract on .NET projects but it was always clumsy
(wrappers). This looks like it'll make things easier.

Thanks for putting this out.

~~~
smarx007
I think a .NET wrapper would be more direct/elegant than using emscripten
generated code (especially in a .NET project).

~~~
gentleteblor
A few .NET wrappers exist... but they've always felt heavyweight to me (my use
case is pretty rare). I am hoping this makes it trivially easy.

We'll see how it goes.

------
yankyou
> Drop an English image on this page to OCR it!

This looks great, and I'd really love to but

> Uncaught ReferenceError: progress is not defined

EDIT: works now!

~~~
bijection
What browser / OS are you on?

Edit: this affected every browser because it was a typo. Fixed!

~~~
yankyou
OSX El Capitan 10.11.6 (15G31) Chrome Version 53.0.2785.143 (64-bit)

------
goatslacker
I've been using this library to read screenshots of Pokemon Go to
automatically calculate Individual Values for each Pokemon[1]. It's worked
great on desktop, but on mobile Safari, where it matters most, the library
causes the browser to crash :(

1: [https://github.com/goatslacker/pokemon-go-iv-
calculator/blob...](https://github.com/goatslacker/pokemon-go-iv-
calculator/blob/master/web/components/PictureUpload.js)

~~~
methyl
Consider doing it server-side

------
mdani
Languages list link is broken - getting 404 for the following
[https://github.com/naptha/tesseract.js/blob/master/tesseract...](https://github.com/naptha/tesseract.js/blob/master/tesseract_lang_list.md)

~~~
bijection
Thanks! Fixed. The actual link is
[https://github.com/naptha/tesseract.js/blob/master/docs/tess...](https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md)

------
userbinator
Tesseract was one of the best publicly-available CAPTCHA solvers when I was
playing around with that stuff a few years ago; I remember somewhere in the
neighbourhood of 90%+ accuracy on reCAPTCHA. No wonder they've changed those
considerably since then, to the point of making them difficult even for humans.

------
zelon88
Does this mean I can implement Tesseract on my home server without using PHP's
shell_exec to perform magic on my files? I can just use JavaScript instead? Cool!

My current HRCloud2 project could benefit greatly if I ever get around to it.
Currently I make the PHP interpreter jump through hoops and move stuff all
over the place to OCR images and docs. This could save a ton of time and shift
the processing to the client instead of my server.

~~~
bijection
Yep, this is completely client side :)

You can even host the external language files yourself as described in the
readme: [https://github.com/naptha/tesseract.js#local-
installation](https://github.com/naptha/tesseract.js#local-installation)

------
KiwiCoder
Impressive that this is pure JS, however trying an image cut from the page
itself gave this result

> Dropan Enghsh Wage on (Ms page to OCR m

Should be

> Drop an English image on this page to OCR it!

~~~
bijection
As another commenter mentioned, Tesseract.js won't perform very well on
'natural' images (e.g. the very light text you tried).

It should work better if you feed it a screenshot of the black text at the top
of the demo page though (Tesseract.js is a pure Javascript port etc...).

------
daliwali
The title and description are very misleading: this is technically "pure
JavaScript" but the JS is compiled from the original C++ library of the same
name using emscripten. I think "pure JS" would imply that all of its sources
are written in JS which is not the case here. It's mostly the C++ code doing
the actual work, with a little JS wrapper on top.

------
slajax
Pretty cool. I screen captured the text in the bottom right corner of the page
and it had some issues. Here's a screenshot of what happened:
[http://io.kc.io/hkeM](http://io.kc.io/hkeM)

------
mgalka
Awesome! The ability to OCR video in a browser opens up so many interesting
possibilities.

------
jaytaylor
For those who may be interested;

I threw together a quick proof-of-concept in Go for exposing tesseract via a
web API:

[https://github.com/jaytaylor/tesseract-
web](https://github.com/jaytaylor/tesseract-web)

------
zhte415
Does this include taking a text and for example, when viewing it, 'wiping' the
text in the logical native language order?

For languages that don't employ much whitespace, this would be nice.

------
artf
Sorry guys, probably a stupid question (I googled quickly, but it didn't
help): does this kind of stuff involve ML? Do I need to train it?

~~~
akerro
Not sure about this JS version, but plain tesseract comes with trained data
for selected languages and fonts, so it should work out of the box.

~~~
artf
Ah ok, this helped. At least, now I know more about tesseract. Thank you
akerro

------
maaaats
Does it block while it works and do the work in several setTimeouts or how do
they get it to report progress without freezing everything?

~~~
bijection
Tesseract.js uses Web Workers in the browser.

------
codemode
Is it true that the original implementation of tesseract, executed from the
command line, is faster than the JavaScript-translated version?

~~~
codemode
According to my tests this is true, but out of curiosity, can anyone get
equivalent or better speed with tesseract.js? This is nice, but I don't need
client-side processing, so is there any reason to pick tesseract.js?

------
ckluis
What License? Doesn't mention it.

~~~
talklittle
Apache-2.0

They've added
[https://github.com/naptha/tesseract.js/blob/c26cae7ee956c399...](https://github.com/naptha/tesseract.js/blob/c26cae7ee956c399eeb992de0135e3af29b4edb5/LICENSE.md)

------
mrcactu5
Tesseract is not specific to JavaScript, right? I do recall there being a
version for Python.

~~~
someonewithpc
No, tesseract is a C++ library; this is a wrapper for an Emscripten port of
that library.

------
z3t4
More instructions, like how to train it, would be nice.

------
niutech
How does it compare with Ocrad.js?

------
newtons_bodkin
How long did this take to build?

------
sanketbajoria
Awesome

------
employee8000
Is this at all affiliated with the already-existing tesseract OCR library? It
doesn't seem to be from my cursory check so if not you need to rename your
library, because you're ripping off their name.

[https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-
ocr/tesseract)

~~~
bijection
It's a wrapper around an Emscripten port of that library. See
[https://github.com/naptha/tesseract.js-
core](https://github.com/naptha/tesseract.js-core)

