Using Google Cloud Vision OCR to extract text from photos and scanned documents (gist.github.com)
128 points by danso on March 24, 2016 | hide | past | favorite | 26 comments



While we're talking about the Google Cloud Vision API I'll take the opportunity to plug the Chrome extension I wrote that adds a right-click menu item to detect text, labels and faces in images in your browser:

https://chrome.google.com/webstore/detail/cloud-vision/nblmo...

Try it out, let me know what you think. File issues at github.com/GoogleCloudPlatform/cloud-vision/


This is such a simple and brilliant idea. Thanks for this!


Another Chrome extension doing OCR, including some server-side processing: http://projectnaptha.com/


Not trying to be rude, but why would you need this?


At work I replaced a [Tesseract](https://github.com/tesseract-ocr) pipeline with some scripts around the Cloud Vision API. I've been pleased with the speed and accuracy so far considering the low cost and light setup.

Btw, here is a Ruby script that will take an API key and image URL and return the text:

https://gist.github.com/jyunderwood/46b601578d9522c0e9ab
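The same kind of call can be sketched in Python. The payload below follows the Vision v1 `images:annotate` request shape (a `requests` array, with the image given by URL via `image.source.imageUri`); it's an illustrative sketch, not the linked Ruby script:

```python
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def text_detection_payload(image_url):
    """Build an images:annotate request body for a remote image URL."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_url}},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }

# POST this as JSON to VISION_ENDPOINT + "?key=" + api_key; the detected
# text comes back in responses[0].textAnnotations[0].description.
```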


Did you see a significant accuracy increase over using tesseract?


Personally I have seen a very significant increase in accuracy; in particular, Tesseract has a hard time with "real life" scenes.


The accuracy is about the same. We process store-circular images, which are actually pretty easy to OCR. It helps that we start with large images, which are converted to grayscale and then edge-sharpened in ImageMagick before being sent to the OCR process.
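That preprocessing step could be sketched as an ImageMagick `convert` command driven from Python; the sharpen sigma here is illustrative, not the commenter's actual setting:

```python
import subprocess

def preprocess_cmd(src, dst, sigma=2):
    """Build the ImageMagick command: grayscale, then edge-sharpen."""
    return ["convert", src,
            "-colorspace", "Gray",      # drop color before OCR
            "-sharpen", f"0x{sigma}",   # edge sharpening
            dst]

def preprocess(src, dst):
    """Run the conversion (requires ImageMagick on PATH)."""
    subprocess.run(preprocess_cmd(src, dst), check=True)
```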


Submitter: If you're also the author, thank you for sharing your efforts. I needed exactly this kind of information to improve protection against CP spammers who had switched to posting images containing the URLs on one of my websites. However, I hadn't been able to figure out how to start using OCR APIs, so this is a godsend.


This was useful information. Testing this has been on my todo list for weeks now.

I read about these limitations in the Cloud Vision OCR API docs, but could not believe that they would indeed not provide data at the word or region level. Does anyone have any idea why?

I mean, they must have this data internally and it is key for useful OCR.

Currently I am using the free ocr api at https://ocr.space/OCRAPI for my projects. It also has a corresponding chrome extension called "Copyfish", https://github.com/A9T9/Copyfish


I was recently testing out Google's OCR on some PDF docs. I thought it worked really well (and it's pretty reasonably priced). I didn't care so much about the structure of the response/document.


@danso, if there are any delimiters in the output (tesseract case) and you are looking for automatic table extraction, check out http://github.com/ahirner/Tabularazr-os

It's been used with different kinds of financial docs such as municipal bonds. Implemented in pure Python, it has a web interface and a simple API, and does nifty type inference (dates, interest rates, dollar amounts...).


Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community though they don't provide the same high-level interface or data-inference, just the PDF-to-delimited text processing:

- http://tabula.technology/ (Java)

- https://github.com/jsvine/pdfplumber (pure Python as well)


OCR is left out for now as a possible future extension, which is why I got interested in this comparison. Thanks, I didn't know about pdfplumber! Its use of additional markup like vertical lines from pdfminer is very interesting. Razr uses the Poppler tools with text-only conversion, but from that it automatically extracts column names and types.

Similar to plumber, and as opposed to Tabula, the goal was to extract tables from a swath of documents without user intervention. Additionally, no knowledge about the location of tables in the document is required. A fully automated workflow would curl -X POST localhost/analyze/... and filter the JSON down to the type or types of tables needed (via context lines, data types, column headers).
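The filtering step of such a workflow could look like the sketch below. The response shape (a `tables` list with `headers` arrays) is hypothetical, since the actual Tabularazr JSON isn't shown in the thread:

```python
import json

def pick_tables(raw_json, wanted_headers):
    """Keep only tables whose column headers cover the wanted set."""
    tables = json.loads(raw_json).get("tables", [])
    wanted = set(wanted_headers)
    return [t for t in tables if wanted <= set(t.get("headers", []))]
```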


While we're talking about the Google Cloud Vision API, I'll take the opportunity to present a simple web interface that detects labels, text, landmarks, faces, logos, etc. using the Vision API:

https://iseeimage.com

I hope it will be useful for those of you who want to try the Vision API without the hassle of getting an API token from Google Cloud.


We are getting amazingly good results using SWT[1] for text detection/boundaries and Tesseract for OCR. Pretty much on par with the results here.

We used to run this on videos.

[1] http://libccv.org/doc/doc-swt/


Can you elaborate a little more on what kind of text you were reading from video? Also, how did you use SWT for detecting text/boundaries?


Seems simple and effective, thanks for sharing. What is the request latency?


Good question...I threw in some median numbers here:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#perfo...

Basically, about 2 seconds for the road-signs photo, and 6+ seconds for the spreadsheet image (with occasional timeouts). So it's probably not optimized/ideal for reading large amounts of text.
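A generic way to collect median latency numbers like these (this is not the gist's actual measurement code, and `call_ocr` is a placeholder for whatever function makes the request):

```python
import time
from statistics import median

def time_calls(fn, n=5):
    """Run fn n times and return the median wall-clock seconds per call."""
    samples = []
    for _ in range(n):
        start = time.monotonic()
        fn()
        samples.append(time.monotonic() - start)
    return median(samples)

# median_s = time_calls(lambda: call_ocr("roadsign.jpg"))
```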


I found this great piece of software (called TIRG), which is free, open source, and finds text in images (though it doesn't normalise to black/white).

Compiles fine on Windows.

https://sourceforge.net/projects/tirg/files/


We recently did some testing of Google's OCR vs Abbyy. Google is much better than Abbyy and is cheaper. Abbyy fails at more complex fonts like script while Google still performs well.


This is cool ... any idea what languages are supported? All I can find in the Google docs is "Vision API supports a broad set of languages."


The HTTP API is actually relatively simple to work with. Here is a quick example of how to use it in Node.js:

    var request = require('request')
    
    var file = require('fs').readFileSync('./testimage.png').toString('base64')
    
    // note: `requests` must be an array of annotate requests
    var body = {
      requests: [
        {
          image: {
            content: file
          },
          features: [
            {
              type: 'TEXT_DETECTION',
              maxResults: 10
            }
          ]
        }
      ]
    }
    
    var url = 'https://vision.googleapis.com/v1/images:annotate?key=your_api_key'
    
    request({
      url: url,
      method: 'POST',
      json: body  // sets Content-Type and parses the JSON response
    }, (err, res, body) => {
      if (err) return console.error(err)
      console.log(body.responses[0].textAnnotations[0].description)
    })
Basically you want to convert the image data to base64, put it in the image.content field of an entry in the requests array, and make a POST request; you'll get back the text.


They compare it to Tesseract, but I really tend to like the open-source version.

A simple service with a free plan built on top of it can be found here - https://scanr.xyz/


Thanks for sharing. Did you try using it for captchas? :)


Well, you sparked my academic curiosity:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d#1a-go...

(better than I thought, actually)



