
Image text recognition APIs showdown - mohi13
https://dataturks.com/blog/compare-image-text-recognition-apis.php
======
jedberg
I've said this many times: Google's AI services are superior in every way.
They have better results and are easier to use.

Amazon is the master of "good enough". Their service works well enough
that you can check the box that it exists and then point it at all that data
you already have in AWS. And that's all that most people need.

If you are using AI and your competitors aren't, it doesn't really matter all
that much how good the AI is -- you're gonna do better and be more efficient.

It's only after _everyone_ is using AI that it will start to matter how good
your particular implementation is. Right now we're at the stage where any
implementation is better than none.

~~~
binarymax
I disagree with the sentiment that any "AI" is better than none no matter how
poor it is. If I'm a customer and you try to extract text from an image for
me, but it's wrong over half the time, then it looks really bad. An incorrect
extraction is a bug. You're basically just adding bugs to your product. Better
a good product with fewer features than a bug-ridden product with lots of
features.

~~~
garysieling
It's appropriate in search, if you use the data as features to improve recall

------
yegle
There's a picture for which Google's Vision API returned Goo\u011fle.

I was wondering why there's a unicode char in the middle:

    In [5]: c = '\u011f'

    In [6]: c
    Out[6]: 'ğ'

This does closely resemble the characters in the picture :-)

~~~
mohi13
Seems these APIs might work better if they took the expected language as input :)

~~~
boulos
They do! For example, Vision takes languageHints [1]. With the Speech API, I
took an hour long tour of Rouffignac (in French) and translated it into
English. I considered adding some SpeechContext [2] for things like mammoth
and so on, but actually it did a fine enough job as is (besides, I had to
listen to it later, as there was obviously no coverage in the cave).

Disclosure: I work on Google Cloud (but not these APIs).

[1]
[https://cloud.google.com/vision/docs/reference/rest/v1/image...](https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate#ImageContext)

[2]
[https://cloud.google.com/speech/reference/rest/v1/Recognitio...](https://cloud.google.com/speech/reference/rest/v1/RecognitionConfig#SpeechContext)
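
For reference, a minimal sketch of what a request body using languageHints
might look like. The image URI and the particular hint values here are my own
placeholders, not from the thread ("tr" would cover a ğ like the one upthread):

```python
import json

# Sketch of a Vision API images:annotate request body with languageHints.
# The imageUri and the hints are illustrative placeholders.
request_body = {
    "requests": [
        {
            "image": {"source": {"imageUri": "https://example.com/sign.jpg"}},
            "features": [{"type": "TEXT_DETECTION"}],
            # Hint that the text is likely English or Turkish
            "imageContext": {"languageHints": ["en", "tr"]},
        }
    ]
}

print(json.dumps(request_body, indent=2))
```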

------
ckuhl
I find it interesting that, in at least one instance, Microsoft CS seems to
have recognized the text upside down:

    Payloads -> speohed

~~~
mohi13
how is it upside down?

~~~
kzrdude
That wrong result makes sense if you compare the word shapes with one of them
upside down. 180-deg rotated p is d and so on.
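
Just for fun, the idea can be sketched in a few lines of Python. The
look-alike table below is my own eyeballing of rotated letter shapes, not
anything the API does:

```python
# Rough look-alikes for letters rotated 180 degrees (eyeballed, not exhaustive)
ROTATED = {"P": "d", "a": "e", "y": "h", "l": "l", "o": "o", "d": "p", "s": "s"}

def rotate_180(word):
    """Reverse the word, then swap each letter for its upside-down look-alike."""
    return "".join(ROTATED.get(c, "?") for c in reversed(word))

print(rotate_180("Payloads"))  # -> "speolhed", close to the API's "speohed"
```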

------
eastendguy
You can compare Google vs Microsoft vs OCR.space without signing up here:

[https://ocr.space/compare-ocr-software](https://ocr.space/compare-ocr-software)

OCR.space is a "good enough" option for many projects. It has a very generous
free tier of 25,000 free conversions/month _per IP address_ (Google offers
only 1,000/month per account). In my tests it did not perform as well as
Google, but well enough for many applications (and much better than Tesseract).

------
Isamu
I don't know where the assessment "Msft, Google APIs accuracy still below par"
came from -- this is just a comparison of 3 APIs.

The state of the art for OCR "in the wild" (using images off the street, for
example) is in fact far from 100%. Google Cloud Vision does pretty well.

The ICDAR 2017 challenges (especially Robust Reading Challenges) should give
you an idea of where we are now:

[http://u-pat.org/ICDAR2017/program_competitions.php](http://u-pat.org/ICDAR2017/program_competitions.php)

~~~
gajju3588
How do I read this page? It seems to be listing some competitions.

~~~
Isamu
Click through to the competitions. My point was that here are current
competitions giving you an idea of the state of the art. These are not student
competitions or top coder things.

For example the Robust Reading on COCO text:

[http://rrc.cvc.uab.es/?ch=5](http://rrc.cvc.uab.es/?ch=5)

------
notlisted
The analysis is weak. "There were 10 images where all three APIs got it
wrong".

Guess what: zoom in on the "PRINCE" image and you'll see it says, top-right, A
MIKE NEWELL FILM. So both Google and AWS did a nice job; it's not reasonable
to expect PRINCE as the outcome.

The point another person makes below about "payloads" and MSoft is valid
too... As is the one about the accented g (not counted as correct because the
Unicode escape wasn't processed).

Makes ya wonder.

~~~
BrandonY
A human would record that as clearly being PRINCE. For their described use
case, reading images of business and movie names, expecting PRINCE despite the
other, smaller text in the picture seems quite fair.

------
danso
> _The only drawback with AWS rekognition APIs is that it only takes an image
> stored as an AWS S3 object as input while the other API work with any image
> stored on the web._

This isn't quite true -- the Rekognition API will also accept base64-encoded
bytes (5MB max):
[http://boto3.readthedocs.io/en/latest/reference/services/rek...](http://boto3.readthedocs.io/en/latest/reference/services/rekognition.html#Rekognition.Client.detect_text)
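
A quick sketch of how that choice might look with boto3. The helper name and
the fall-back-to-S3 policy are mine; the Image parameter shapes
(`{'Bytes': ...}` / `{'S3Object': ...}`) follow the boto3 docs:

```python
# Choose between inline bytes and an S3 reference for Rekognition's
# detect_text, based on the documented 5 MB limit on inline image bytes.
MAX_INLINE_BYTES = 5 * 1024 * 1024

def image_param(data: bytes, bucket: str, key: str) -> dict:
    """Pass the image inline if it fits under 5 MB, else reference S3."""
    if len(data) <= MAX_INLINE_BYTES:
        return {"Bytes": data}
    return {"S3Object": {"Bucket": bucket, "Name": key}}

# Usage (requires AWS credentials, so not executed here):
# import boto3
# client = boto3.client("rekognition")
# with open("sign.jpg", "rb") as f:
#     resp = client.detect_text(Image=image_param(f.read(), "my-bucket", "sign.jpg"))
```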

------
hcarvalhoalves
It seems his examples include logos and such. Some of those services could be
tuned for OCRing books and documents, which should be the bulk of OCR use
cases in commercial applications. Since there's usually a trade-off between
flexibility vs. accuracy, I wouldn't be surprised to see inverted results w/ a
different dataset. Might be worth doing the test.

~~~
gajju3588
Any idea how to find a good labeled dataset for this use case?

~~~
dalys
[https://www.gutenberg.org](https://www.gutenberg.org) might have this. Some
of the books available are just OCRed; some are proofread and corrected. Not
sure if the scanned image sources are available?

~~~
gajju3588
Testing needs a labeled dataset. Is there a good site for finding publicly
available datasets?

~~~
hcarvalhoalves
There is this one for testing against Project Gutenberg, but you have to
request it by mail:

[1] [http://ciir.cs.umass.edu/downloads/ocr-evaluation/](http://ciir.cs.umass.edu/downloads/ocr-evaluation/)

------
shoshin23
Image text recognition is a major problem we're trying to solve at our
startup. I would love to be pointed to some SOTA research in this space; it's
hard to find anything by Googling.

As far as our experience goes, the Cloud Vision API is a killer option
compared to both AWS and MSFT. It's pricier than AWS, though, and slower. MSFT
is terrible on both price and speed.

~~~
Livven
If you're trying to handle text "in the wild" and not scanned documents, the
keyword is "scene text". Most papers are focused on either
detection/localization, i.e. finding the location of text, or recognition,
i.e. recognizing the actual content given a cropped text image.

Here are some current state-of-the-art papers + code where available about
detection:

Fused Text Segmentation Networks for Multi-oriented Scene Text Detection
[https://arxiv.org/abs/1709.03272](https://arxiv.org/abs/1709.03272)

EAST: An Efficient and Accurate Scene Text Detector
[https://arxiv.org/abs/1704.03155](https://arxiv.org/abs/1704.03155)
[https://github.com/argman/EAST](https://github.com/argman/EAST)

Detecting Oriented Text in Natural Images by Linking Segments
[https://arxiv.org/abs/1703.06520](https://arxiv.org/abs/1703.06520)
[https://github.com/dengdan/seglink](https://github.com/dengdan/seglink)

Arbitrary-Oriented Scene Text Detection via Rotation Proposals
[https://arxiv.org/abs/1703.01086](https://arxiv.org/abs/1703.01086)
[https://github.com/mjq11302010044/RRPN](https://github.com/mjq11302010044/RRPN)

And for recognition:

An End-to-End Trainable Neural Network for Image-based Sequence Recognition
and Its Application to Scene Text Recognition
[https://arxiv.org/abs/1507.05717](https://arxiv.org/abs/1507.05717)
[https://github.com/bgshih/crnn](https://github.com/bgshih/crnn)

Robust Scene Text Recognition with Automatic Rectification
[https://arxiv.org/abs/1603.03915](https://arxiv.org/abs/1603.03915)

~~~
zihotki
I'd also add on the subject the following whitepaper from MS -
[http://digital.cs.usu.edu/~vkulyukin/vkweb/teaching/cs7900/P...](http://digital.cs.usu.edu/~vkulyukin/vkweb/teaching/cs7900/Paper1.pdf)

~~~
Livven
Note that this paper is from 2010 and thus, while quite influential for its
time, is far from the current state-of-the-art. The stroke width transform
method it introduced is simply not as good as current deep learning-based
methods.

If you want an overview (slightly out of date, but what can you do, the field
is moving very fast), see this survey from 2016:

Scene Text Detection and Recognition: Recent Advances and Future Trends
[http://mclab.eic.hust.edu.cn/UpLoadFiles/Papers/FCS_TextSurv...](http://mclab.eic.hust.edu.cn/UpLoadFiles/Papers/FCS_TextSurvey_2015.pdf)

------
gajju3588
I really hope some company is looking at this data and thinking: can we do
something disruptive here?

~~~
tensor
The big companies, particularly Google, already use state-of-the-art
techniques. Additionally, OCR has long been a topic of study in academia. I
think it's unlikely that we'll see any major "disruption" in this industry;
improvements will likely continue to come from the usual suspects.

~~~
mohi13
The really sad thing is that, with all the talk of AI taking over humanity, a
seemingly simple task like reading printed text is so far behind... seems
ML/AI has many years to go before it _really_ makes big leaps.

~~~
paganel
Automatic translation has also already reached the low-hanging statistical
fruit, meaning it has been stuck at 70-80% performance/accuracy for the last
couple of years (i.e. you can use it to get a general idea of what a text
written in a foreign language might mean, but you can't use it for day-to-day
language interactions without sounding unprofessional or even stupid).

------
danso
This article reminded me to check whether any updates had been made to
Google's OCR since Cloud Vision was in beta sometime last year or earlier this
year [0]. It looks like there's a new parameter/option for "Document Text
Detection" -- i.e. something more akin to Tesseract, rather than just
detecting words in images (such as road signs):

[https://cloud.google.com/vision/docs/detecting-fulltext](https://cloud.google.com/vision/docs/detecting-fulltext)

[0]
[https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d](https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d)
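
A minimal sketch of the request-body difference between the two modes. The
function and the URI are illustrative (mine, not from the docs); the
feature-type strings are the documented ones:

```python
import json

# TEXT_DETECTION targets sparse text (signs, logos);
# DOCUMENT_TEXT_DETECTION targets dense, document-like text.
def annotate_request(uri: str, dense: bool) -> dict:
    feature = "DOCUMENT_TEXT_DETECTION" if dense else "TEXT_DETECTION"
    return {
        "requests": [{
            "image": {"source": {"imageUri": uri}},
            "features": [{"type": feature}],
        }]
    }

print(json.dumps(annotate_request("https://example.com/scan.png", dense=True)))
```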

edit: I would attempt my own test right now, but it's been a while since I've
tried to use Google Cloud. Right now I'm getting constant "Server Error"
popups until Chrome decides to crash and die, just from checking my account
and billing page. The Cloud Console's wonkiness is probably one of the reasons
I stopped using GC in favor of AWS :/

~~~
boulos
Sorry to hear about whatever is going on between Chrome and the Console (can
you try an incognito window?).

You can always test out the Vision API via the landing page
([https://cloud.google.com/vision/](https://cloud.google.com/vision/)). The
full text results seem to be under the little document tab. I took a
screenshot of the text above it, and it seemed to work as expected (breaking
it into two paragraphs).

Disclosure: I work on Google Cloud (but not on ML APIs).

~~~
danso
Thanks. I'm not sure what the problem could be, since GC has billed me every
month for the past couple of years for other APIs and services; perhaps I
never officially enabled the Cloud Vision API after the beta. The crash at the
account screen is puzzling, but I'm assuming it must be on my end (a
conflicting plugin, perhaps) -- I'll investigate over the weekend. I'm hoping
to use GC for a class, so I'm going to try the signup and onboarding process
from scratch anyway.

------
gajju3588
Comparison of face recognition APIs from MS/Amazon/Kairos.

Kairos is doing better here. ;-)

[https://dataturks.com/blog/face-verification-api-comparison....](https://dataturks.com/blog/face-verification-api-comparison.php)

------
ape4
A non-cloud API:
[https://docs.opencv.org/3.1.0/d4/d61/group__text.html](https://docs.opencv.org/3.1.0/d4/d61/group__text.html)

------
spunker540
I’m surprised these results are so poor, because many of these reading tasks
are so easy for human readers. And it seems so much more straightforward than
things like face recognition and self-driving cars!

~~~
gajju3588
It would be cheaper too, with something like Mechanical Turk, I guess.

~~~
madamelic
That's what I've been doing. I had ~1,800 images to read; the machine could
only read about 900, so I tossed ~$20 at it ($0.02/image) and the accuracy is
pretty good.

MTurk is hit and miss when it comes to workers: some will just click buttons
to see if they can get paid, others completely knock it out of the park.

I have another project that I've put about $100 into so far, with decent
results (incredibly quickly, too!).

Getting my work to "good enough", then tossing the rest on MTurk, is much,
much cheaper in the long run.
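
As a back-of-the-envelope sketch of those economics (the numbers are from the
comment above; the function is just illustration):

```python
def turk_cost(total_images: int, machine_read: int, price_per_image: float) -> float:
    """Cost of sending only the machine-unreadable remainder to MTurk."""
    remainder = total_images - machine_read
    return remainder * price_per_image

# The comment's numbers: ~1,800 images, ~900 machine-readable, $0.02/image
print(f"${turk_cost(1800, 900, 0.02):.2f}")  # roughly the ~$20 mentioned
```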

------
BLanen
During a college course my group also did a comparison like this but
specifically for handwriting and we simulated low-quality scanned images.

Microsoft was by far the best at this. Google wasn't even close.

------
garysieling
I ran a bunch of stained glass images through several APIs -- Google's did by
far the best, although there are a lot of issues (curved text, hand-written
text in odd alignments or on a curve).

~~~
tomcam
Wait, just how many examples of stained glass text are there? What’s the use
case? Captchas?

~~~
dragonwriter
> Wait, just how many examples of stained glass text are there?

Lots.

> What’s the use case?

Churches, for one.

~~~
tomcam
Thanks, I stepped right into that one, so you’re welcome ... but I meant the
image data.

------
juanmirocks
I would be curious to know if anyone in this community is indeed already using
one of these APIs for a product and what your real-life experience is. Care to
share your use cases?

------
correlation
It would be interesting to see how Tesseract holds up against this.

~~~
mohi13
What's that? Thor :P?

~~~
correlation
Hehe - well, I was referring to the major open-source OCR lib. It will
supposedly have LSTM stuff in the next major rev:
[https://github.com/tesseract-ocr/tesseract/blob/master/READM...](https://github.com/tesseract-ocr/tesseract/blob/master/README.md)

~~~
jremmons
I have used Tesseract, and in my experience, unless you train it for the
particular type of text you want to recognize (font, background color, etc.),
it will do quite poorly (including the recent LSTM-based versions). Would be
great to see how it stacks up against these APIs, though.

~~~
tensor
The supplied models are trained on document-like images, so I wouldn't expect
it to do particularly well on things like street signs. My experience with the
new LSTM-based versions is that they're very much competitive with
closed-source solutions for document-like OCR.

------
red0point
Is there any service that converts scanned PDFs into searchable PDFs using the
Google Cloud Vision API? If so, that would be highly useful for me.

~~~
yorwba
Would a hacky script be acceptable?

The one time I needed to turn a scanned PDF (600+ page book) into searchable
text, I used this Ruby script
[https://github.com/gkovacs/pdfocr/](https://github.com/gkovacs/pdfocr/) ,
which pulls out individual pages using pdftk, turns them into images to feed
into an OCR engine of your choice (Tesseract seems to be the gold standard)
and then puts them back together. It can blow up the file size tremendously,
but worked well enough for my use case. (I did write a very special purpose
PDF compressor to shrink the file back, but that was more for fun.)
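
The rough shape of that per-page pipeline could be sketched like this. The
exact flags below are typical pdftk/pdftoppm/tesseract invocations, not copied
from the script, so treat them as assumptions:

```python
# Per-page pipeline: extract one page with pdftk, rasterize it with pdftoppm,
# then OCR the image with tesseract into a searchable PDF layer.
def page_commands(src_pdf: str, page: int) -> list:
    """Return the shell commands (as argv lists) to OCR a single page."""
    page_pdf = f"page_{page}.pdf"
    page_img = f"page_{page}"  # pdftoppm appends -1.png, -2.png, ...
    return [
        ["pdftk", src_pdf, "cat", str(page), "output", page_pdf],
        ["pdftoppm", "-png", "-r", "300", page_pdf, page_img],
        ["tesseract", f"{page_img}-1.png", f"page_{page}_ocr", "pdf"],
    ]

# Running it would look like (requires the tools installed, so not executed):
# import subprocess
# for cmd in page_commands("book.pdf", 3):
#     subprocess.run(cmd, check=True)
```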

------
gricardo99
Click on "Get dataset and code":

"We will email you the dataset and code."

What's wrong with a GitHub link in the article?

~~~
gricardo99
mailinator to the rescue.

------
Omnipresent
Does Google Cloud Vision use deep learning for its image text recognition?

------
xmly
AWS only has 20+% correctness?

