Image text recognition APIs showdown (dataturks.com)
193 points by mohi13 on Dec 19, 2017 | 77 comments

I've said this many times: Google's AI services are superior in every way. They have better results and are easier to use.

Amazon is the master of the "good enough". Their service works well enough that you can check the box that it exists and then point it at all that data you already have in AWS. And that's all that most everyone needs.

If you are using AI and your competitors aren't, it doesn't really matter all that much how good the AI is -- you're gonna do better and be more efficient.

It's only after everyone is using AI that it will start to matter how good your particular implementation is. Right now we're at the stage where any implementation is better than none.

I disagree with this sentiment that any "AI" is better than none no matter how poor it is. If I'm a customer and you try to extract text from an image for me, but it's wrong over half the time, then it looks really bad. An incorrect extraction is a bug. You're basically just adding bugs to your product. Better a good product with less features than a bug ridden product with lots of features.

> "any "AI" is better than none no matter how poor it is"

You don't have to disagree. The OP said:

> "it doesn't really matter all that much how good the AI is"

I think you're both correct and you bring up an interesting point. As long as your AI is "good enough" to replace what would've taken more resources to do otherwise, it's a win. I'm not sure if that's "half" as you stated, but I bet it depends on the task. If the task is saving a few seconds to query something, then I'd agree, half of the time wrong isn't a savings. But, if you have less than half a chance at saving thousands of dollars or hundreds of hours if it works correctly, then that may be chalked up as a win.
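The tradeoff described above is basically an expected-value calculation. A minimal sketch, with hypothetical numbers (the 40% success rate, $10,000 payoff, and $50 per-attempt cost are made up for illustration):

```python
def expected_value(p_success: float, payoff: float, cost: float) -> float:
    """Expected net value of running an unreliable automated step once."""
    return p_success * payoff - cost

# A 40% chance of saving $10,000, at $50 per attempt, is still clearly
# worth running on average, even though it fails more often than not.
ev = expected_value(0.40, 10_000.0, 50.0)  # 3950.0

# Whereas a coin-flip chance at a small saving isn't worth a big per-use cost.
bad = expected_value(0.50, 10.0, 20.0)  # negative
```

Whether "wrong half the time" is a win really does depend on which side of this inequality the task falls on.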

It's appropriate in search, if you use the data as features to improve recall

I dunno, there's clearly a bar for "good enough". From the article it's not obvious to me that AWS Rekognition meets that bar (yet). 21% precision with 54% recall isn't the kind of thing that achieves "if you're using AI and your competitors aren't, you're ahead".
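For a single "how good is it overall" number, the precision and recall quoted above can be combined into an F1 score (their harmonic mean):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The Rekognition figures quoted above: 21% precision, 54% recall.
score = f1(0.21, 0.54)  # ~0.30
```

An F1 around 0.30 makes the "not yet at the bar" argument fairly concrete.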

I like your framing/sentiment though: it's not about small differences in "betterness", it's about the difference in kind in going from "no ML/AI" => "whoa, it works!".

Disclosure: I work on Google Cloud (but not in ML).

> Disclosure: I work on Google Cloud (but not in ML).

Since you work there, hopefully you can see this feedback and filter it up: I love your tools. They are the best. I have tried using your tools. They are hard to use, despite the fact that I have a pretty solid understanding of how to use them. I would like to use your tools more, but getting support is hard (partly because the docs aren't great, partly because there is no community, because see #1).

I don't know how to fix this, but it would be great if Google spent some time focusing on building a community around your tools, like AWS did. At the beginning they had a lot of employees hanging out on their own forums and elsewhere, answering questions, building a community of users, and especially helping third parties who tried to build libraries for their tools (like boto for Python). It would be great if Google did that too.

Thanks for listening!

Google has contributed quite a bit to Boto. The GCS command line tool, gsutil, is built on top of boto. If you're talking about Boto 3, though, not so much.

Is there any data that points to Google's AI service being superior?

I want to know which AI platform is the strongest. In my anecdotal tests, Microsoft's Vision API performed better, but that was in early-to-mid 2016.

Well the post we're discussing has some hard data. :)

But generally speaking when I talk to AI experts they all agree that for the services that Google actually has, they are better, with MS being #2.

Well, I'm not so sure that Google's AI services are superior in every way.

Actually, Google has fewer public AI APIs than Msft/AWS, e.g. for face recognition. Maybe they just don't release stuff unless it's much better than the competition's.

I should have clarified, I meant the services that they had were better, but you're right, they don't have the breadth of services that the others do.

There's a picture for which Google's Vision API returned Goo\u011fle.

I was wondering why there's a Unicode char in the middle:

    In [5]: c = '\u011f'
    In [6]: c
    Out[6]: 'ğ'
This does closely resemble the characters in the picture :-)

IMHO, only Google's API got that correct.

The test they did is similar to providing:

MICROSOET and expecting the APIs to find MICROSOFT.

If I put in a “typo” (read: ambiguous letter), I want the meaning that was originally intended. Technically correct doesn’t buy me anything; we can play that game all day, but I just don’t have much use for a document full of errors. In a way: I want the AI to do what a human would do.

This is like speech recognition using context to fill in the gaps. Without this it would be unusable.

Then it will need to be told the valid character set and language.

Yes. By its own training data, and context from the image. E.g. from this very thread: “Google” is going to be Google :)

Pretty funny though that google couldn’t recognize its own name as well as the other engine!

Seems these APIs might work better if they took the expected language as input :)

They do! For example, Vision takes languageHints [1]. With the Speech API, I took an hour long tour of Rouffignac (in French) and translated it into English. I considered adding some SpeechContext [2] for things like mammoth and so on, but actually it did a fine enough job as is (besides, I had to listen to it later, as there was obviously no coverage in the cave).

Disclosure: I work on Google Cloud (but not these APIs).

[1] https://cloud.google.com/vision/docs/reference/rest/v1/image...

[2] https://cloud.google.com/speech/reference/rest/v1/Recognitio...
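Following the Vision API reference linked above, the hint goes in the request's imageContext. A minimal sketch of an images:annotate request body (the bucket/file name is a placeholder):

```python
import json

def build_request(image_uri: str, hints: list) -> dict:
    """Build a Vision API images:annotate request body with languageHints set."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [{"type": "TEXT_DETECTION"}],
            "imageContext": {"languageHints": hints},
        }]
    }

# e.g. hint that the sign in the image is Turkish:
body = json.dumps(build_request("gs://my-bucket/sign.jpg", ["tr"]))
```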

That, or had some canonicalization algorithm/lookup table on top of the model. It should be able to map them to Roman characters if wanted.
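For Latin-script output like the Goo\u011fle case above, a lookup table isn't even needed; a minimal sketch using Unicode decomposition to strip diacritics:

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Strip diacritics by decomposing characters (NFKD splits 'ğ' into
    'g' plus a combining breve) and dropping the marks that won't
    encode as ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Goo\u011fle"))  # -> Google
```

This obviously only helps when the stray character decomposes to a base Latin letter; it's no substitute for real language context.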

I find it interesting that it looks like in at least one instance, Microsoft CS recognized the text upside down.

    Payloads -> speohed

Was the image upside down? Sometimes stuff coming out of iOS is stored in a different orientation than it appears on the screen. (There is some metadata in jpeg that says to rotate it before displaying - browsers typically ignore this for backwards compatibility.)
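The EXIF orientation tag mentioned above maps to a fix-up rotation. A minimal sketch covering the common unmirrored values (if Pillow is available, `ImageOps.exif_transpose` handles the full set, mirrored variants included):

```python
# EXIF orientation tag value -> counterclockwise rotation (degrees)
# needed to display the stored pixels upright. Values 2/4/5/7 also
# involve a mirror flip and are omitted here for simplicity.
ROTATE_CCW = {1: 0, 3: 180, 6: 270, 8: 90}

def upright_rotation(orientation: int) -> int:
    """Degrees of CCW rotation to apply; unknown tags treated as upright."""
    return ROTATE_CCW.get(orientation, 0)
```

If a viewer ignores the tag, an orientation-3 photo renders 180° rotated, which is exactly the "Payloads -> speohed" failure mode.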

how is it upside down?

That wrong result makes sense if you compare the word shapes with one of them upside down. 180-deg rotated p is d and so on.

You can compare Google vs Microsoft vs OCR.space without signing up here:


OCR.space is a "good enough" option for many projects. It has a very generous free tier of 25,000 free conversions/month per IP address (Google offers only 1,000/month per account). In my tests it performed not as well as Google, but well enough for many applications (much better than Tesseract).

I don't know where the assessment "Msft, Google APIs accuracy still below par" came from; this article is just comparing three APIs.

In fact the state of the art of OCR "in the wild" (using images off the street, for example) is far from 100%. Google Cloud Vision does pretty well.

The ICDAR 2017 challenges (especially Robust Reading Challenges) should give you an idea of where we are now:


Current state-of-the-art approaches in this field are significantly better than existing commercial solutions at recognizing text in the wild (i.e. scene text).

As an example, see the ICDAR 2015 results [1], where the Google Vision API is at 59.60% (Hmean) while the best ones are over 80%. Note that this test is about localization, i.e. finding the text location without recognizing the actual content, though on a more challenging dataset.

As for recognition, see the table on page 6 of this paper [2]. The "IIIT5K None" column should be pretty close to what was done in the OP, using the same dataset, with recognition accuracies of around 80% while the Google Vision API is at 322/500=64.4%. Note here that since this paper is only about recognition, there is no localization step before which would otherwise act as a filter and decrease the accuracy a bit by failing to localize some text that the recognition step would be able to recognize.

[1] http://rrc.cvc.uab.es/?ch=4&com=evaluation&task=1

[2] https://arxiv.org/pdf/1603.03915.pdf

How do I read this page? It seems to be listing some competitions.

Click through to the competitions. My point was that here are current competitions giving you an idea of the state of the art. These are not student competitions or top coder things.

For example the Robust Reading on COCO text:


The analysis is weak. "There were 10 images where all three APIs got it wrong".

Guess what. Zoom in on the "PRINCE" image, and you'll see it says top-right: A MIKE NEWELL FILM. So... both Google and AWS did a nice job. It's not reasonable to expect PRINCE as the outcome.

The point another person makes below about "payloads" and Microsoft is valid too... As is the accented g (not matched because the Unicode output wasn't processed).

Makes ya wonder.

A human would record that as clearly being PRINCE. For their described use case, reading images of business and movie names, the presence of other, small text in the picture seems quite fair.

> The only drawback with AWS rekognition APIs is that it only takes an image stored as an AWS S3 object as input while the other API work with any image stored on the web.

This isn't quite true -- the Rekognition API will also accept base64-encoded bytes (5MB max): http://boto3.readthedocs.io/en/latest/reference/services/rek...
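A minimal sketch of the bytes path, assuming boto3 and AWS credentials are set up (the response field names follow the DetectText docs linked above):

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # Rekognition's documented limit for raw bytes

def detect_text_from_bytes(image_bytes: bytes):
    """Call Rekognition DetectText with raw image bytes instead of an
    S3 object reference. Raises before any network call if the payload
    is over the 5 MB limit."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds Rekognition's 5 MB bytes limit")
    import boto3  # third-party; imported lazily so the size check works without it
    client = boto3.client("rekognition")
    resp = client.detect_text(Image={"Bytes": image_bytes})
    return [d["DetectedText"] for d in resp["TextDetections"]]
```

Note boto3 takes the raw bytes directly; the base64 encoding is only needed when calling the REST API by hand.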

It seems his examples include logos and such. Some of those services could be tuned for OCRing books and documents, which should be the bulk of OCR use cases in commercial applications. Since there's usually a trade-off between flexibility vs. accuracy, I wouldn't be surprised to see inverted results w/ a different dataset. Might be worth doing the test.

Any idea how to find a good labeled dataset for this use case?

https://www.gutenberg.org might have this. Some books available are just OCRed. Some are proofread and corrected. Not sure if scanned image sources are available?

Testing needs some labeled dataset. Is there some good site for finding publicly available datasets?

There is this one to test against Project Gutenberg, but you have to request by mail:

[1] http://ciir.cs.umass.edu/downloads/ocr-evaluation/

Image text recognition is a major problem we're trying to solve in our startup. I would love to be pointed to some SOTA research in this space. Hard to find anything by Googling about it.

As far as our experience goes, Cloud Vision API is a killer option compared to both AWS and MSFT. It's pricier than AWS though and is slower. MSFT is terrible in both price and speed.

If you're trying to handle text "in the wild" and not scanned documents, the keyword is "scene text". Most papers are focused on either detection/localization, i.e. finding the location of text, or recognition, i.e. recognizing the actual content given a cropped text image.

Here are some current state-of-the-art papers + code where available about detection:

Fused Text Segmentation Networks for Multi-oriented Scene Text Detection https://arxiv.org/abs/1709.03272

EAST: An Efficient and Accurate Scene Text Detector https://arxiv.org/abs/1704.03155 https://github.com/argman/EAST

Detecting Oriented Text in Natural Images by Linking Segments https://arxiv.org/abs/1703.06520 https://github.com/dengdan/seglink

Arbitrary-Oriented Scene Text Detection via Rotation Proposals https://arxiv.org/abs/1703.01086 https://github.com/mjq11302010044/RRPN

And for recognition:

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition https://arxiv.org/abs/1507.05717 https://github.com/bgshih/crnn

Robust Scene Text Recognition with Automatic Rectification https://arxiv.org/abs/1603.03915

I'd also add on the subject the following whitepaper from MS - http://digital.cs.usu.edu/~vkulyukin/vkweb/teaching/cs7900/P...

Note that this paper is from 2010 and thus, while quite influential for its time, far from the current state-of-the-art. The stroke width transform method that it introduced is simply not as good as current deep learning-based methods.

If you want to get a (slightly out of date but what can you do, the field is moving very fast) overview see this survey from 2016:

Scene Text Detection and Recognition: Recent Advances and Future Trends http://mclab.eic.hust.edu.cn/UpLoadFiles/Papers/FCS_TextSurv...

Check out the post on the custom Dropbox implementation. Helping them (and the NSA) to search through files :)

I really hope some company is looking at this data and thinking can we do something disruptive here.

The big companies, particularly Google, already use state-of-the-art techniques. Additionally, OCR has long been a topic of study in academia. I think it's unlikely that we'll see any major "disruption" in this industry. Improvements will likely continue to come from the usual suspects.

The really sad part is that, with all the talk of AI taking over humanity, a seemingly simple task like reading printed text is so far behind. Seems ML/AI has many years to go before it really makes big leaps.

Automatic translation has also already reached the low-hanging statistical fruit, meaning it has been stuck at 70-80% performance/accuracy for the last couple of years (i.e. you can use it to get a general idea of what a text written in a foreign language might mean, but you can't use it for day-to-day language interactions without sounding unprofessional or even stupid).

You can beat all of the general OCR solutions by training a model for a specific use case.

It always seems like this before the big dawn. Rev-share is a decently big carrot. ;-)

This article reminded me to check if any updates had been made to Google's OCR since Cloud Vision was in beta sometime last year or earlier this year [0]. It looks like there's a new parameter/option for "Document Text Detection" -- i.e. something more akin to Tesseract, rather than just detecting words in images (such as road signs):


[0] https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d

edit: I would attempt my own test right now, but it's been a while since I've tried to use Google Cloud. Right now I'm getting constant "Server Error" popups until Chrome decides to crash and die, just from checking my account and billing page. The Cloud Console's wonkiness is probably one of the reasons why I stopped using GC in favor of AWS :/

Sorry to hear about whatever is going on between Chrome and the Console (can you try an incognito window?).

You can always test out the Vision API via the landing page (https://cloud.google.com/vision/). The full text results seem to be under the little document tab. I took a screenshot of the text above it, and it seemed to work as expected (breaking it into two paragraphs).

Disclosure: I work on Google Cloud (but not on ML APIs).

Thanks, I'm not sure what the problem could be since I've had GC bill me every month for the past couple of years for other APIs and services, perhaps I never officially enabled Cloud Vision API after it was in beta. The crash at account screen is puzzling but I'm assuming it must be on my end (conflicting plugin perhaps) -- I'll have to investigate it over the weekend. I'm hoping to use GC for a class so I'm going to try the signup and onboarding process from scratch anyway.

Comparison of face recognition APIs from MS/Amazon/Kairos.

Kairos is doing better here. ;-)


I’m surprised by these results being so poor, because many of these reading tasks are so easy for human readers. And it seems so much more straightforward than stuff like face recognition and self-driving cars!

It would be cheaper too with something like Mechanical Turk, I guess.

That's what I've been doing. I had ~1,800 images to read; the machine could only read about 900, so I tossed like $20 at it ($0.02/image) and the accuracy is pretty good.

MTurk is hit and miss when it comes to workers: some will just click buttons to see if they can get paid; others completely knock it out of the park.

I have another project that I have put about $100 in so far and had decent results (incredibly quickly too!)

Getting my work to "good enough", then tossing the rest on MTurk, is much, much cheaper in the long run.

There are captcha "breakers" that use that method.

Exactly. Google is doing a decent job, but the others are really poor.

During a college course my group also did a comparison like this but specifically for handwriting and we simulated low-quality scanned images.

Microsoft was by far the best at this. Google wasn't even close.

I ran a bunch of stained glass images through several APIs - Google's did by far the best, although there are a lot of issues (curved text, hand-written text in odd alignments or on a curve)

Wait, just how many examples of stained glass text are there? What’s the use case? Captchas?

I ran about 800 from http://sacredwindowrescueproject.org/ through this.

Some of the companies that do this work have been around for many decades and have tons of photographs / scanned images, so I'm investigating ways to ingest images into a search engine to help locate old projects.

> Wait, just how many examples of stained glass text are there?


> What’s the use case?

Churches, for one.

Thanks, I stepped right into that one so you’re welcome ... but I meant the image data.

I would be curious to know if anyone in this community is indeed already using one of these APIs for a product and what your real-life experience is. Care to share your use cases?

It would be interesting to see how Tesseract holds up against this.

Pingtype is my program for learning Chinese. I tried to use Tesseract to recognise some Chinese text, taken from a PDF that I couldn't copy-paste from for some reason. The results were awful.

I tried again with English text. I wanted a word list from a book that helps people learn English, so I took photos of the index. The format is word....page #, in two columns.

The results were just as bad.

I've given up on OCR, and decided I have to transcribe everything by hand. I only do it in my free time, and it's been taking months.

Is there any tool that can take a photo of a book where the pages curl towards the middle, and "flatten" it so that OCR will work better?

What's that? Thor :P ?

Hehe - well, I was referring to the major open-source OCR lib. It's supposed to get LSTM-based recognition in the next major rev: https://github.com/tesseract-ocr/tesseract/blob/master/READM...

I have used tesseract and in my experience unless you train it for the particular type of text you want to recognize (font, background color, etc.) it will do quite poorly (including the recent lstm based versions). Would be great to see how it stacks up against these APIs though.

The supplied models are trained on document-like images, so I wouldn't expect it to do particularly well on things like street signs. My experience with the new lstm based versions is that it's very much competitive with closed source solutions for document-like OCR.

Is there any service that converts scanned PDFs into searchable PDFs using the Google Cloud Vision API? If so, that would be highly useful for me.

Would a hacky script be acceptable?

The one time I needed to turn a scanned PDF (600+ page book) into searchable text, I used this Ruby script https://github.com/gkovacs/pdfocr/ , which pulls out individual pages using pdftk, turns them into images to feed into an OCR engine of your choice (Tesseract seems to be the gold standard) and then puts them back together. It can blow up the file size tremendously, but worked well enough for my use case. (I did write a very special purpose PDF compressor to shrink the file back, but that was more for fun.)
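The per-page pipeline that script performs can be sketched as a command list; a minimal version using pdftoppm (Poppler) to render pages and the Tesseract CLI to OCR them. Flags and output names are the common defaults; check your local tool versions:

```python
def ocr_pipeline_commands(pdf_path: str, pages: int, outdir: str = "."):
    """Build the shell commands for a render-then-OCR pipeline:
    pdftoppm renders every page to page-1.png, page-2.png, ...,
    then tesseract writes page-N.txt for each rendered image."""
    cmds = [["pdftoppm", "-r", "300", "-png", pdf_path, f"{outdir}/page"]]
    for n in range(1, pages + 1):
        cmds.append(["tesseract", f"{outdir}/page-{n}.png", f"{outdir}/page-{n}"])
    return cmds
```

Each argv list can then be handed to `subprocess.run`; reassembling the text layer back into a searchable PDF is the part where a tool like the script above (or Tesseract's own PDF output mode) earns its keep.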

Click on "Get dataset and code":

"We will email you the dataset and code."

What's wrong with a github link in the article?

mailinator to the rescue.

Does google cloud vision use deep learning to build their image text recognition?

AWS only has 20+% correctness?

