
Using Google Cloud Vision OCR to extract text from photos and scanned documents - danso
https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d
======
ImJasonH
While we're talking about the Google Cloud Vision API I'll take the
opportunity to plug the Chrome extension I wrote that adds a right-click menu
item to detect text, labels and faces in images in your browser:

[https://chrome.google.com/webstore/detail/cloud-vision/nblmokgbialjjgfhfofbgfcghhbkejac](https://chrome.google.com/webstore/detail/cloud-vision/nblmokgbialjjgfhfofbgfcghhbkejac)

Try it out, let me know what you think. File issues at
github.com/GoogleCloudPlatform/cloud-vision/

~~~
fjallstrom
This is such a simple and brilliant idea. Thanks for this!

------
jyunderwood
At work I replaced a [Tesseract](https://github.com/tesseract-ocr) pipeline
with some scripts around the Cloud Vision API. I've been pleased with the
speed and accuracy so far, considering the low cost and light setup.

Btw, here is a Ruby script that will take an API key and image URL and return
the text:

[https://gist.github.com/jyunderwood/46b601578d9522c0e9ab](https://gist.github.com/jyunderwood/46b601578d9522c0e9ab)

~~~
zodiac
Did you see a significant accuracy increase over using tesseract?

~~~
Isamu
Personally, I have seen a very significant increase in accuracy. In particular
with "real life" scenes, Tesseract has a hard time.

------
Mithaldu
Submitter: If you're also the author, thank you for sharing your efforts. I
needed exactly this kind of information to improve protection against cp
spammers who had switched to posting images with the URLs on one of my
websites. I had, however, not been able to find out how to start using OCR
APIs, so this is a godsend.

------
zurbi
This was useful information. Testing this has been on my to-do list for weeks
now.

I read about these limitations in the Cloud Vision OCR API docs, but could not
believe that they would indeed not provide data at the word or region level.
Does anyone have any idea why?

I mean, they must have this data internally and it is key for useful OCR.

Currently I am using the free OCR API at
[https://ocr.space/OCRAPI](https://ocr.space/OCRAPI) for my projects. It also
has a corresponding chrome extension called "Copyfish",
[https://github.com/A9T9/Copyfish](https://github.com/A9T9/Copyfish)

------
misiti3780
I was recently testing out Google's OCR for some PDF docs. I thought it worked
really well (and is pretty reasonably priced). I didn't care so much about the
structure of the response/document.

------
alex_hirner
@danso, if there are any delimiters in the output (Tesseract case) and you are
looking for automatic table extraction, check out
[http://github.com/ahirner/Tabularazr-os](http://github.com/ahirner/Tabularazr-os)

It's been used with different kinds of financial docs such as municipal bonds.
Implemented in pure Python, it has a web interface, a simple API, and does
nifty type inference (dates, interest rates, dollar amounts...).

~~~
danso
Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you
may be interested in these similar projects, which are popular in the
journalism community though they don't provide the same high-level interface
or data-inference, just the PDF-to-delimited text processing:

- [http://tabula.technology/](http://tabula.technology/) (Java)

- [https://github.com/jsvine/pdfplumber](https://github.com/jsvine/pdfplumber) (pure Python as well)

~~~
alex_hirner
OCR is left out as a possible future extension, which is why I got interested
in this comparison. Thanks, I didn't know about pdfplumber! The utilization of
additional markup like vertical lines from pdfminer is very interesting. Razr
uses Poppler tools for text-only conversion, from which it automatically
extracts column names and types.

Similar to plumber and as opposed to Tabula, the goal was to extract tables
from a swath of documents without user intervention. Additionally, no
knowledge about the location of tables in the document is required. A fully
automated workflow would `curl -X POST localhost/analyze/...` and filter down
the JSON to the type or types of tables needed (via context lines, data types,
column headers).
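That filtering step could be sketched as follows — note that the `tables`/`headers` JSON shape and the matching criteria here are illustrative assumptions, not Tabularazr's documented output:

```javascript
// Hypothetical sketch: filter an extracted-tables JSON response down to
// the tables whose column headers contain every wanted keyword.
function filterTables(response, wantedHeaders) {
  return response.tables.filter(function (table) {
    return wantedHeaders.every(function (wanted) {
      return table.headers.some(function (h) {
        return h.toLowerCase().includes(wanted.toLowerCase());
      });
    });
  });
}

// Example: keep only tables that mention a date and an interest rate
var sample = {
  tables: [
    { headers: ['Maturity Date', 'Interest Rate', 'Amount'], rows: [] },
    { headers: ['Issuer', 'CUSIP'], rows: [] }
  ]
};
var matches = filterTables(sample, ['date', 'rate']);
```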

------
langitbiru
While we're talking about the Google Cloud Vision API, I'll take the
opportunity to present a simple web interface that detects labels, text,
landmarks, faces, logos, etc., using the Vision API:

[https://iseeimage.com](https://iseeimage.com)

I hope it will be useful for anyone who wants to try the Vision API without
the hassle of getting an API token from Google Cloud.

------
steeve
We got amazingly good results using SWT [1] for text detection/boundaries and
Tesseract for OCR. Pretty much on par with the results here.

We used to run this on videos.

[1] [http://libccv.org/doc/doc-swt/](http://libccv.org/doc/doc-swt/)

~~~
beagle3
Can you elaborate a little more on what kind of text you were reading from
video? Also, how did you use SWT for detecting text/boundaries?

------
dtjones
Seems simple and effective, thanks for sharing. What is the request latency?

~~~
danso
Good question...I threw in some median numbers here:

[https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#performance-and-latency](https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#performance-and-latency)

Basically, about 2 seconds for the road signs photo, 6+ seconds for the
spreadsheet image (with occasional timeouts). So it's probably not
optimized/ideal for reading large amounts of text.

------
zandorg
I found this great software (called TIRG), which is free, open source, and
finds text in images (though it doesn't normalise to black/white).

Compiles fine on Windows.

[https://sourceforge.net/projects/tirg/files/](https://sourceforge.net/projects/tirg/files/)

------
driverdan
We recently did some testing of Google's OCR vs Abbyy. Google is much better
than Abbyy and is cheaper. Abbyy fails at more complex fonts like script while
Google still performs well.

------
yborg
This is cool ... any idea what languages are supported? All I can find in the
Google docs is "Vision API supports a broad set of languages."

~~~
diggan
The HTTP API is actually relatively simple to work with. Here is a quick
example of how to use it in Node.js:

    
    
    var request = require('request')
    var fs = require('fs')

    // Base64-encode the image for the JSON payload
    var file = fs.readFileSync('./testimage.png').toString('base64')

    // Note: `requests` is an array of annotation requests
    var body = {
      requests: [
        {
          image: {
            content: file
          },
          features: [
            {
              type: 'TEXT_DETECTION',
              maxResults: 10
            }
          ]
        }
      ]
    }

    var url = 'https://vision.googleapis.com/v1/images:annotate?key=your_api_key'

    request({
      url: url,
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body)
    }, (err, res, body) => {
      if (err) return console.error(err)
      console.log(JSON.parse(body).responses[0].textAnnotations[0].description)
    })

Basically, you base64-encode the image data, put it in the request's
`image.content` field, make a POST request, and you'll get back the text.
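If you also want word-level results from that response: in the v1 API, `textAnnotations[0]` carries the full detected text, and each following entry is a single word with its `boundingPoly`. A minimal sketch against a mock response (no live call; the shape is assumed from the v1 response format):

```javascript
// Pull per-word entries out of a Vision API v1 annotate response.
// textAnnotations[0] is the full text block; the rest are individual
// words, each with a bounding polygon.
function wordsWithBoxes(apiResponse) {
  var annotations = apiResponse.responses[0].textAnnotations || [];
  return annotations.slice(1).map(function (a) {
    return { word: a.description, box: a.boundingPoly.vertices };
  });
}

// Mock response standing in for a live API call
var mock = {
  responses: [{
    textAnnotations: [
      { description: 'STOP AHEAD' },
      { description: 'STOP', boundingPoly: { vertices: [{ x: 10, y: 5 }] } },
      { description: 'AHEAD', boundingPoly: { vertices: [{ x: 60, y: 5 }] } }
    ]
  }]
};
var words = wordsWithBoxes(mock);
```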

------
sagivo
They compare it to Tesseract, but I really tend to like the open-source
version.

A simple service with a free plan on top of it can be found here:
[https://scanr.xyz/](https://scanr.xyz/)

------
thesimon
Thanks for sharing. Did you try using it for captchas? :)

~~~
danso
Well, you sparked my academic curiosity:

[https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#1a-google-gmail-captcha-circa-2009](https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#1a-google-gmail-captcha-circa-2009)

(better than I thought, actually)

