Using Google Cloud Vision OCR to extract text from photos and scanned documents (gist.github.com)
128 points by danso on March 24, 2016 | hide | past | favorite | 26 comments



While we're talking about the Google Cloud Vision API I'll take the opportunity to plug the Chrome extension I wrote that adds a right-click menu item to detect text, labels and faces in images in your browser:

https://chrome.google.com/webstore/detail/cloud-vision/nblmo...

Try it out, let me know what you think. File issues at github.com/GoogleCloudPlatform/cloud-vision/


This is such a simple and brilliant idea. Thanks for this!


Another Chrome extension doing OCR, including some server-side processing: http://projectnaptha.com/


Not trying to be rude, but why would you need this?


At work I replaced a [Tesseract](https://github.com/tesseract-ocr) pipeline with some scripts around the Cloud Vision API. I've been pleased with the speed and accuracy so far considering the low cost and light setup.

Btw, here is a Ruby script that will take an API key and image URL and return the text:

https://gist.github.com/jyunderwood/46b601578d9522c0e9ab
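The same kind of call can be sketched in Python. The payload below follows the Vision v1 `images:annotate` request shape (a `requests` array, with the image given by URL via `image.source.imageUri`); it's an illustrative sketch, not the linked Ruby script:

```python
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def text_detection_payload(image_url):
    """Build an images:annotate request body for a remote image URL."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_url}},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }

# POST this as JSON to VISION_ENDPOINT + "?key=" + api_key; the detected
# text comes back in responses[0].textAnnotations[0].description.
```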


Did you see a significant accuracy increase over using tesseract?


Personally I have seen a very significant increase in accuracy; in particular, Tesseract has a hard time with "real life" scenes.


The accuracy is about the same. We process store-circular images, which are actually pretty easy to OCR. It helps that we start with large images, which are converted to grayscale and then edge-sharpened in ImageMagick before being sent to the OCR process.
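That preprocessing step could be sketched as an ImageMagick `convert` command driven from Python; the sharpen sigma here is illustrative, not the commenter's actual setting:

```python
import subprocess

def preprocess_cmd(src, dst, sigma=2):
    """Build the ImageMagick command: grayscale, then edge-sharpen."""
    return ["convert", src,
            "-colorspace", "Gray",      # drop color before OCR
            "-sharpen", f"0x{sigma}",   # edge sharpening
            dst]

def preprocess(src, dst):
    """Run the conversion (requires ImageMagick on PATH)."""
    subprocess.run(preprocess_cmd(src, dst), check=True)
```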


Submitter: If you're also the author, thank you for sharing your efforts. I needed exactly this kind of information to improve protection against CP spammers who had switched to posting images containing the URLs on one of my websites. However, I hadn't been able to figure out how to start using OCR APIs, so this is a godsend.


This was useful information. Testing this has been on my todo list for weeks now.

I read about these limitations in the Cloud Vision OCR API docs, but could not believe that they would indeed not provide data at the word or region level. Does anyone have any idea why?

I mean, they must have this data internally and it is key for useful OCR.

Currently I am using the free ocr api at https://ocr.space/OCRAPI for my projects. It also has a corresponding chrome extension called "Copyfish", https://github.com/A9T9/Copyfish


I was recently testing out Google's OCR on some PDF docs. I thought it worked really well (and it's pretty reasonably priced). I didn't care so much about the structure of the response/document.


@danso, if there are any delimiters in the output (tesseract case) and you are looking for automatic table extraction, check out http://github.com/ahirner/Tabularazr-os

It's been used with different kinds of financial docs such as municipal bonds. Implemented in pure Python, it has a web interface and a simple API, and does nifty type inference (dates, interest rates, dollar amounts...).


Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community though they don't provide the same high-level interface or data-inference, just the PDF-to-delimited text processing:

- http://tabula.technology/ (Java)

- https://github.com/jsvine/pdfplumber (pure Python as well)


OCR is left out for now as a possible future extension, which is why I got interested in this comparison. Thanks, I didn't know about pdfplumber! Its use of additional markup like vertical lines from pdfminer is very interesting. Razr uses the Poppler tools with text-only conversion, but from that it automatically extracts column names and types.

Similar to plumber, and as opposed to Tabula, the goal was to extract tables from a swath of documents without user intervention. Additionally, no knowledge about the location of tables in the document is required. A fully automated workflow would curl -X POST localhost/analyze/... and filter the JSON down to the type or types of tables needed (via context lines, data types, column headers).
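The filtering step of such a workflow could look like the sketch below. The response shape (a `tables` list with `headers` arrays) is hypothetical, since the actual Tabularazr JSON isn't shown in the thread:

```python
import json

def pick_tables(raw_json, wanted_headers):
    """Keep only tables whose column headers cover the wanted set."""
    tables = json.loads(raw_json).get("tables", [])
    wanted = set(wanted_headers)
    return [t for t in tables if wanted <= set(t.get("headers", []))]
```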


While we're talking about the Google Cloud Vision API, I'll take the opportunity to present a simple web interface that detects labels, text, landmarks, faces, logos, etc. using the Vision API:

https://iseeimage.com

I hope it will be useful for those of you who want to try the Vision API without the hassle of getting an API token from Google Cloud.


We are getting amazingly good results using SWT[1] for text detection/boundaries and Tesseract for OCR. Pretty much on par with the results here.

We used to run this on videos.

[1] http://libccv.org/doc/doc-swt/


Can you elaborate a little more on what kind of text you were reading from video? Also, how did you use SWT for detecting text/boundaries?


Seems simple and effective, thanks for sharing. What is the request latency?


Good question...I threw in some median numbers here:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#perfo...

Basically, about 2 seconds for the road-signs photo, and 6+ seconds for the spreadsheet image (with occasional timeouts). So it's probably not optimized/ideal for reading large amounts of text.
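A generic way to collect median latency numbers like these (this is not the gist's actual measurement code, and `call_ocr` is a placeholder for whatever function makes the request):

```python
import time
from statistics import median

def time_calls(fn, n=5):
    """Run fn n times and return the median wall-clock seconds per call."""
    samples = []
    for _ in range(n):
        start = time.monotonic()
        fn()
        samples.append(time.monotonic() - start)
    return median(samples)

# median_s = time_calls(lambda: call_ocr("roadsign.jpg"))
```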


I found this great piece of software (called TIRG), which is free, open source, and finds text in images (though it doesn't normalise to black/white).

Compiles fine on Windows.

https://sourceforge.net/projects/tirg/files/


We recently did some testing of Google's OCR vs Abbyy. Google is much better than Abbyy and is cheaper. Abbyy fails at more complex fonts like script while Google still performs well.


This is cool ... any idea what languages are supported? All I can find in the Google docs is "Vision API supports a broad set of languages."


The HTTP API is actually relatively simple to work with. Here is a quick example of how to use it in Node.js:

    var request = require('request')
    
    var file = require('fs').readFileSync('./testimage.png').toString('base64')
    
    // note: `requests` must be an array of annotate requests
    var body = {
      requests: [
        {
          image: {
            content: file
          },
          features: [
            {
              type: 'TEXT_DETECTION',
              maxResults: 10
            }
          ]
        }
      ]
    }
    
    var url = 'https://vision.googleapis.com/v1/images:annotate?key=your_api_key'
    
    request({
      url: url,
      method: 'POST',
      json: body  // sets Content-Type and parses the JSON response
    }, (err, res, body) => {
      if (err) return console.error(err)
      console.log(body.responses[0].textAnnotations[0].description)
    })
Basically you want to convert the image data to base64, put it in the image.content field of an entry in the requests array, and make a POST request; you'll get back the text.


They compare it to Tesseract, but I really tend to like the open-source version.

A simple service with a free plan built on top of it can be found here - https://scanr.xyz/


Thanks for sharing. Did you try using it for captchas? :)


Well, you sparked my academic curiosity:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#1a-go...

(better than I thought, actually)



