
Tesseract.js: Pure JavaScript OCR for 100 Languages - petercooper
https://tesseract.projectnaptha.com/
======
crazygringo
In case it's not clear, Tesseract has been developed by Google since 2006, having
been started at HP in 1985 and open-sourced by HP in 2005. [1]

As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

This (Tesseract.js) is a WASM port of the project by a separate group of
people.

I investigated using this port a couple years ago, but as you can see from the
demo, it's fairly slow to initialize and run, so I never found a practical use
for OCR client-side rather than server-side, but I still think it's
tremendously cool.

In case anyone's interested (shameless plug), because I do a lot of academic
research that involves tons of copying from webpages, PDFs and screenshots
and pasting into notes documents, I created a tool at
[https://pastemagic.com](https://pastemagic.com) that helps selectively remove
rich text formatting, remove line breaks, and run OCR on screenshots and
camera photos. Setting up Tesseract on my server and creating a simple HTTP
endpoint for it took less than an hour, and for free I had OCR as powerful as
Google's. Pretty cool I thought.

[1] [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
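
A minimal sketch of the kind of HTTP endpoint described above, assuming the `tesseract` command-line tool is installed on the server and on the PATH. This is not the pastemagic.com implementation; the names `tesseract_command`, `run_ocr`, `OcrHandler` and `serve`, and the port, are made up for illustration. It accepts raw image bytes in a POST body and replies with the recognized text.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def tesseract_command(lang: str = "eng") -> List[str]:
    # "stdin" / "stdout" tell the tesseract CLI to read the image from
    # standard input and print the recognized text to standard output.
    return ["tesseract", "stdin", "stdout", "-l", lang]


def run_ocr(image_bytes: bytes, lang: str = "eng") -> str:
    # Raises CalledProcessError if tesseract exits with a non-zero status.
    result = subprocess.run(
        tesseract_command(lang),
        input=image_bytes,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


class OcrHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # The request body is assumed to be the raw image file.
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        try:
            text, status = run_ocr(image_bytes), 200
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            text, status = f"OCR failed: {exc}", 500
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    # Hypothetical entry point; blocks until interrupted.
    HTTPServer(("127.0.0.1", port), OcrHandler).serve_forever()
```

Calling `serve()` starts the server, after which something like `curl --data-binary @page.png http://127.0.0.1:8000/` should return the text. A real deployment would also want upload size limits and authentication.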

~~~
NewDimension
Somewhat offtopic, do you know of a library that would allow me to select an
area of a PDF through a GUI and only read the text in those coordinates?

~~~
severine
You could simply pipe an area screenshot to tesseract, discard the input image
and get the tesseract output, am I wrong?

~~~
NewDimension
That sounds like a valid approach. Any idea what tools I could use to
define the area and get the screenshot?

~~~
severine
You possibly have one installed. Mine comes with my desktop (Xfce), and gives
me a GUI and a CLI to take screenshots of the full desktop, any window, or a
particular area defined by crosshairs.

There's a very popular and minimalist CLI called scrot that I think would be
ideal... well scratch that, I made a search and our question has already been
asked and answered:

[https://askubuntu.com/questions/280475/how-can-instantaneous...](https://askubuntu.com/questions/280475/how-can-instantaneously-extract-text-from-a-screen-area-using-ocr-tools)

[https://stackoverflow.com/questions/21497447/ocr-on-a-screen...](https://stackoverflow.com/questions/21497447/ocr-on-a-screenshot)
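
The screenshot-then-OCR idea can be sketched in a few lines, assuming both `scrot` and the `tesseract` command-line tool are installed. `scrot -s` waits for a window click or a dragged rectangle, and passing `stdout` as Tesseract's output name prints the text instead of writing a file. The function names here are illustrative only.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def screenshot_command(path: str) -> List[str]:
    # -s: interactively select a window or rectangle with the mouse.
    return ["scrot", "-s", path]


def ocr_command(path: str, lang: str = "eng") -> List[str]:
    # "stdout" as the output base makes tesseract print the text.
    return ["tesseract", path, "stdout", "-l", lang]


def ocr_screen_region(lang: str = "eng") -> str:
    # The screenshot lives in a temporary directory and is discarded
    # as soon as the text has been extracted.
    with tempfile.TemporaryDirectory() as tmp:
        shot = str(Path(tmp) / "region.png")
        subprocess.run(screenshot_command(shot), check=True)
        result = subprocess.run(
            ocr_command(shot, lang),
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

Calling `ocr_screen_region()` on a desktop session would prompt for a selection and return whatever text Tesseract finds in it. For a PDF, this reads the rendered pixels of the selected area, not the embedded text layer.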

------
umvi
I recently did a project where I OCR'd a very rare book that I could only find
in the Library of Congress so I could read it on my Kindle.

Tesseract was amazingly powerful and accurate, but it seemed to struggle if
the page was warped or tilted even a little. I had to preprocess the images
heavily to try to dewarp the natural spine curvature, and even then it could
only get about 99% accuracy (which sounds like a lot, but consider a book
where every 100th letter was wrong - I basically flagged the errors on my
Kindle as I went along and manually corrected them later).

I guess the point of this comment is that, in my experience, Tesseract.js is
probably going to need an accompanying PageDewarp.js for it to be of use
scanning books. Not everyone has access to a right angle scanner or can slice
the spine and get perfectly straight high-res scans.
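
To put the 99% figure in perspective, here is a rough back-of-the-envelope calculation. The page size (2,000 characters) and average word length (5 letters) are assumptions for illustration, and errors are treated as independent, which real OCR errors generally are not.

```python
def expected_errors(num_chars: int, char_accuracy: float) -> float:
    """Expected number of wrong characters among num_chars."""
    return num_chars * (1.0 - char_accuracy)


def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is right (independent errors)."""
    return char_accuracy ** word_length


CHARS_PER_PAGE = 2000  # assumed page size
WORD_LENGTH = 5        # assumed average word length

print(f"errors per page: {expected_errors(CHARS_PER_PAGE, 0.99):.1f}")
print(f"words fully correct: {word_accuracy(0.99, WORD_LENGTH):.3f}")
```

Under these assumptions a page carries about 20 wrong characters, and roughly one word in 20 contains at least one error.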

~~~
superpermutat0r
That's very interesting given that Tesseract uses Leptonica. I'm not sure if
they use it for dewarping but all of my little projects with Leptonica really
worked well. Dewarping, binarizing, extracting individual elements etc.

[https://github.com/DanBloomberg/leptonica](https://github.com/DanBloomberg/leptonica)

~~~
umvi
Maybe I wasn't using Tesseract to its fullest potential, but I had a really
hard time getting it to do accurate OCR on warped pages. Straight pages
worked perfectly.

------
throwaway_ocr
I tested this by taking a screenshot of the introduction blurb. This is what
it came up with:

    
    
      Tesseract s is 2 pure Javascript port of the popular Tesseract OCR engine.
    
      This library supports more than 100 languages, automatic text orientation and script detection, a simple interface for reading paragraph, word, and character bounding boxes. Tesseract s can run either in 2 browser znd on & server with NodeJS.
    

Not bad, but far from useful.

------
alexcnwy
Ugh why does open source OCR still suck so bad?!

Why isn’t there an open source OCR engine even half as powerful as the Google
Cloud Platform API?!

------
petercooper
The reason I posted this is that they've just released v2.0. This isn't
highlighted on the homepage, but I assume it has some significance to the
project overall.

------
JeromeWu
Wow, I was wondering why there were so many new github stars yesterday, and
here I found the reason why. :)

Thanks for being interested in tesseract.js; it makes all the work worth it. And
I have to thank @antimatter15 for creating this library; without him we could not
have come this far.

I have read all the comments and here I would like to provide my two cents for
some questions:

1\. Is tesseract.js pure JavaScript?

Yes, it is 100% JavaScript, and it leverages a WebAssembly port of the original
tesseract-ocr (meaning we compile the C++ source code to WebAssembly
code using Emscripten).

2\. The accuracy of tesseract.js is poor.

In my experience, it is hard to get perfect results without applying
additional techniques to your source images. You may need to do some
preprocessing and sometimes train custom traineddata. It is not easy, but it
is the price of high accuracy.

3\. Cloud OCR service is much more accurate

Yes, that's true. But tesseract.js provides an in-browser, offline option for
your OCR, which is useful for scenarios like PWAs and highly confidential image
content (which you don't want to send to a server). Tesseract.js is not a silver
bullet, but it is handy sometimes.

Hope you enjoy this library and feel free to leave any comment to us!

------
bhanhfo
Related Chrome extension: [https://chrome.google.com/webstore/detail/project-naptha/mol...](https://chrome.google.com/webstore/detail/project-naptha/molncoemjfmpgdkbdlbjmhlcgniigdnf)

Another extension (not using Tesseract.JS):
[https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90...](https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90%9F-free-ocr-soft/eenjdnjldapjajjofmldgmkjaienebbj?hl=en)

------
dang
A thread from 2016:
[https://news.ycombinator.com/item?id=12694004](https://news.ycombinator.com/item?id=12694004)

------
gwbas1c
I'm curious about the performance tradeoffs of JavaScript versus native code.

A few weeks ago, I tried writing something that was long-running, CPU-intensive,
etc., in a web worker. It was so darn slow that I switched to a native language.
(I hope I didn't do something silly that made my code run more slowly than it
should.)

I see some mention of running in WASM. Does this do something like have
ordinary Tesseract compiled for WebAssembly and then fall back to JavaScript?

~~~
fyp
A while back I ported several C++ libraries with Emscripten, and the result is
usually just 2-10x slower than native. It gets maybe another order of magnitude
worse if you're porting a library that relies heavily on vectorization, which
isn't available on the web.

------
gwd
What's slightly weird is that the Chinese "example" text mis-reads a character
in the first line. The image shows:

冬 日 平 泉 路 晚 归

But the OCR reports:

冬 日 平 柳 路 晚 归

(Note the different character right in the middle)

~~~
bufferoverflow
I tried using Tesseract around a year ago to recognize digits only, very clean
images, not some weird font. I had thousands of images, and it failed in
around 3% of them. It was so weird, as it would recognize the same digits in
other images just fine.

I tried 4 or 5 different OCR programs, and none of them worked well enough for
my case.

I was actually surprised, I thought OCR was a solved problem with ridiculously
low error rates.

~~~
mkl
In my experience Tesseract can get confused near the edges of images, and
padding with a wide white border helped a lot. Strange that _none_ of the OCR
programs were good enough, though.
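
The padding trick is simple enough to sketch without an imaging library. The example below treats a grayscale image as a list of rows of 0-255 values, which is an assumption made to keep it self-contained; with a real image library the same idea is usually a single call. The function name `pad_with_white` is illustrative.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def pad_with_white(pixels: Image, border: int, white: int = 255) -> Image:
    """Return a copy of the image surrounded by `border` white pixels."""
    if border < 0:
        raise ValueError("border must be non-negative")
    if not pixels:
        return []
    width = len(pixels[0])
    blank_row = [white] * (width + 2 * border)
    # White rows on top, then each original row with white on both
    # sides, then white rows at the bottom.
    padded = [list(blank_row) for _ in range(border)]
    for row in pixels:
        padded.append([white] * border + list(row) + [white] * border)
    padded.extend(list(blank_row) for _ in range(border))
    return padded


tiny = [[0, 10], [20, 30]]
for row in pad_with_white(tiny, 1):
    print(row)
```

How wide the border should be is something to tune per data set; the comment above only says that a wide one helped.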

------
dbhattar
Even though it says it supports 100 languages, I cannot find the list of
supported languages. I am mostly trying to find out if it supports Indic
languages.

~~~
severine
[https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#d...](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016)

Seems so: it lists at least Hindi, Urdu, Bengali, Sanskrit, Nepali,
Marathi, Sinhala and Punjabi!

------
madsohm
Nice project! Tried it with a screenshot from an eBook. Unfortunately the Os
became Qs and the Is became |s.

~~~
umvi
You never realize how similar characters are until you start an OCR project.

ec

ij

tf

Il|1

hb

OQ

etc.

Even the tiniest addition or subtraction of printed ink can transform one
character in one of the above rows to any other character in the row. Throw in
page tilt/warp/etc. and OCR can frequently confuse them unless you train it
specifically on your text. The pipeline I've found that works best is:

image -> upscale -> dewarp -> OCR -> spellcheck -> grammar check
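
As one illustration of the spellcheck stage, the confusable groups listed above can be used to generate candidate corrections and keep the one that appears in a word list. This is a sketch, not the pipeline actually used for the book; the word list and function names are made up for the example.

```python
from itertools import product
from typing import List, Set

# Groups of easily confused characters, taken from the rows listed above.
CONFUSION_SETS: List[str] = ["ec", "ij", "tf", "Il|1", "hb", "OQ"]


def alternatives(ch: str) -> str:
    """Return every character `ch` might really be, including itself."""
    for group in CONFUSION_SETS:
        if ch in group:
            return group
    return ch


def correct_word(word: str, dictionary: Set[str], max_candidates: int = 10000) -> str:
    """Fix `word` if exactly one confusable variant is in the dictionary."""
    if word in dictionary:
        return word
    options = [alternatives(ch) for ch in word]
    total = 1
    for group in options:
        total *= len(group)
    if total > max_candidates:
        return word  # too many combinations to enumerate; leave unchanged
    matches = {"".join(chars) for chars in product(*options)} & dictionary
    # Only correct when the answer is unambiguous.
    return matches.pop() if len(matches) == 1 else word


WORDS = {"Queen", "Illinois", "the"}  # hypothetical word list
print(correct_word("Oueen", WORDS))
print(correct_word("|llinois", WORDS))
```

Words with no dictionary match, or with more than one, are returned untouched so they can still be flagged for manual review.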

~~~
magicalhippo
I too have been very disappointed in Tesseract for "simple" OCR (converting
subtitles).

In communications, Turbo codes [1], for example, have the decoder produce an
integer value for each bit rather than just a bit. The value is a measure of
how likely the bit is to be 0 or 1.

This is then used with previous bit values, which includes parity data, to
make a "hard" decision.

I wonder if something similar has been tried for OCR? I imagine the OCR front-
end could feed a number of probable hits, along with confidence, into a
spellchecker. Or something along those lines.

[1]:
[https://en.wikipedia.org/wiki/Turbo_code](https://en.wikipedia.org/wiki/Turbo_code)
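
A toy version of that idea, assuming the OCR front-end can emit several candidate characters per position, each with a confidence between 0 and 1. The data layout, the function names and the tiny word list are all hypothetical, and this does not describe what Tesseract does internally. Each dictionary word is scored by multiplying the confidences of its characters, so a slightly less likely character can still win if it completes a real word.

```python
from typing import Dict, List, Optional

# One dict per character position: candidate character -> confidence.
Candidates = List[Dict[str, float]]


def hard_decision(candidates: Candidates) -> str:
    """Pick the single most confident character at each position."""
    return "".join(max(options, key=options.get) for options in candidates)


def word_score(word: str, candidates: Candidates, floor: float = 1e-6) -> float:
    """Product of per-character confidences; `floor` for unseen characters."""
    if len(word) != len(candidates):
        return 0.0
    score = 1.0
    for ch, options in zip(word, candidates):
        score *= options.get(ch, floor)
    return score


def soft_decision(candidates: Candidates, dictionary: List[str]) -> Optional[str]:
    """Return the dictionary word that best fits the soft OCR output."""
    scored = [(word_score(word, candidates), word) for word in dictionary]
    best_score, best_word = max(scored, default=(0.0, None))
    return best_word if best_score > 0.0 else None


reading = [
    {"Q": 0.55, "O": 0.45},
    {"C": 0.90, "G": 0.10},
    {"R": 0.80, "B": 0.20},
]
print(hard_decision(reading))
print(soft_decision(reading, ["OCR", "OGRE", "QED"]))
```

Here the per-character winner is "QCR", while the dictionary-aware choice is "OCR", because "Q" was only marginally more confident than "O".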

------
rolling_roland
I investigated tesseract.js for turning images of spreadsheets into data. I
didn't mind the initial startup time or run time but unfortunately wasn't able
to get good enough accuracy for my case. It seemed to work really well with
plain English text though.

------
sankyo
Does it work only on books and magazines, or would it work on a driver's license
or ID card as well?

~~~
bhanhfo
Tesseract is optimized for images with white backgrounds. ID cards or movie
screenshots do not work well.
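
One common workaround is to normalize the image to dark text on a white background before OCR. The sketch below is a deliberately crude version, assuming a grayscale image stored as rows of 0-255 values and assuming the background covers most of the pixels; the fixed threshold of 128 and the function name are arbitrary choices for illustration, and real inputs usually need something adaptive.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


def normalize_polarity(pixels: Image, threshold: int = 128) -> Image:
    """Binarize to black text on white, inverting dark-background images."""
    flat = [p for row in pixels for p in row]
    if not flat:
        return []
    # Heuristic: if the image is dark on average, assume light text on a
    # dark background and invert it first.
    dark_background = sum(flat) / len(flat) < threshold
    result = []
    for row in pixels:
        fixed = [255 - p if dark_background else p for p in row]
        result.append([255 if p >= threshold else 0 for p in fixed])
    return result


subtitle_like = [[10, 10, 240], [10, 10, 10]]  # one bright pixel on black
print(normalize_polarity(subtitle_like))
```

Busy backgrounds such as ID card patterns or video frames will not be cleaned up by a global threshold like this one.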

~~~
Certified
I have used tesseract ocr combined with imagemagick and ffmpeg to great
success for video text extraction.

~~~
beagle3
Can you list your script/pipeline? I haven't had much success (though, I only
ran ffmpeg's internal tesseract OCR[0], no imagemagick processing or any other
processing in between)

[0] [https://ffmpeg.org/ffmpeg-all.html#ocr](https://ffmpeg.org/ffmpeg-all.html#ocr)

------
ngcc_hk
There is another post about Tesseract, but using Python. Same question, though I
guess it might be a closer fit here: is it compatible with TensorFlow.js? Or is
it OK to run alongside it? Or can the model be “simplified” to run on the client
side, like MobileNet?

------
atum47
This is just amazing. I wonder how much work and how many brilliant people it
took to develop this. Congratulations to everyone involved. I'm sure a lot of
cool stuff will come from this tool.

------
smashah
Ooh exciting! The main reason why I've needed to leverage the GCP vision API
is due to orientation limitations on local OCR.

I'll test and migrate to this soon depending on accuracy. Great job so far.

------
kbumsik
> Tesseract.js wraps an emscripten port of the Tesseract OCR Engine.

So it is a wrapper library around a C++ project, which is cool. But calling it
"Pure" JavaScript is purely misleading.

------
EGreg
I remember reading that Tesseract is the old tech, which does worse than
current ML-based stuff. But since no one gives out their ML data, that’s all
you get.

~~~
mkl
Your knowledge is out of date. Tesseract 4 added a new OCR engine based on
LSTM neural networks.

