Documents OCR: Improving Efficiency by Making PDFs Searchable (medium.com)
126 points by ebibi 8 days ago | 42 comments

If these were medical claims documents, wouldn't they contain sensitive data about individuals? I would have tried a local tesseract approach thoroughly before sending them off to a cloud service.

This is the right answer.

Personally I’ve wrapped it up in some shell and python scripts[1], and it does the job just fine.

[1] https://github.com/josteink/autoarchiver

It was the right answer for you. It may not have been the right answer given a different set of requirements (including scale, service availability, ops complexity, etc.).

I would have tried recoll that can work with tesseract in the background. https://www.lesbonscomptes.com/recoll/

Abbyy is, in my opinion, by far the most powerful OCR software. Linux command line OCR is available but unfortunately too pricey for the normal private user: https://www.ocr4linux.com/en:pricing:start

I've been happily using paperwork for this task: https://gitlab.gnome.org/World/OpenPaperwork/paperwork/

A simple button for scanning a document, the software does all the processing for you (rotation, OCR, ...). You can apply tags to documents, search them, export them. The storage format is more than open (a png file + OCR results as an HTML page, ...).

That's great :)

I would have liked a comparison between Google, Amazon and tesseract. There are projects for doing this at home and I've been meaning to do this for some time but haven't got around to it yet.

Been a while but the only one I can remember off the top of my head is paperless: https://github.com/danielquinn/paperless

I would like to know more about their comparison results between Google vs. Amazon vs. Tesseract. How did they determine 98% accuracy? If I assume that was Google Vision's accuracy, what was the accuracy of the others?

98% sounds good but that remaining 2% is crucial. I do OCR on a lot of my scanned documents to make them searchable and easy to copy text. I've tried a number of OCR programs but even after a lot of training something like Omnipage will still think a lower case l is a 1 in the middle of a word. It seems like it should be easy to put in a rule that says it is unlikely that a document will have a number surrounded by letters (without spaces) but that doesn't seem possible.
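The rule described above (a digit wedged between letters is probably a misread letter) can be approximated as a post-processing pass over the OCR output. This is only a minimal sketch: the confusion table and the function name `fix_digits_inside_words` are hypothetical, and it is not a feature of Omnipage or any other OCR product.

```python
import re

# Hypothetical confusion table: digits that OCR commonly emits in place of letters.
DIGIT_TO_LETTER = {"1": "l", "0": "o"}

# Match a single confusable digit that has a letter directly on both sides.
_DIGIT_INSIDE_WORD = re.compile(r"(?<=[A-Za-z])[10](?=[A-Za-z])")


def fix_digits_inside_words(text: str) -> str:
    """Replace a lone '1' or '0' surrounded by letters with its look-alike letter."""
    return _DIGIT_INSIDE_WORD.sub(lambda m: DIGIT_TO_LETTER[m.group(0)], text)


print(fix_digits_inside_words("he1lo wor1d"))
print(fix_digits_inside_words("Coca Cola Zero 1.5L"))
```

Digits at a word boundary (as in "1.5L") are left alone, but the rule would also damage legitimate tokens such as "H1N1", so a real pipeline would probably want a dictionary check before accepting a substitution.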

Depends on your use case. If you need a definitive answer of how often in 2017 you bought “Coca Cola Zero 1.5L” based on scanned receipts then 98% accuracy will not be enough. If you need that receipt from the one time you bought a garden hose, searching for “garden” and “hose” will probably get you there, even if the OCR tool read “garden h0se” or “g4rden hose”.
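If exact matching is the worry, the search side can be made tolerant of a misread character instead. A rough sketch using Python's standard `difflib`; the function name `fuzzy_contains` and the 0.75 threshold are assumptions picked for illustration, not something from the article.

```python
from difflib import SequenceMatcher


def fuzzy_contains(query: str, text: str, threshold: float = 0.75) -> bool:
    """Return True if any whitespace-separated word in text is similar enough to query."""
    query = query.lower()
    for word in text.lower().split():
        # ratio() is 2 * matching_chars / total_chars, between 0.0 and 1.0.
        if SequenceMatcher(None, query, word).ratio() >= threshold:
            return True
    return False


print(fuzzy_contains("hose", "garden h0se 12.99"))
print(fuzzy_contains("garden", "g4rden hose 12.99"))
print(fuzzy_contains("printer", "garden h0se 12.99"))
```

A lower threshold finds more damaged words but also more false hits, and scanning every word this way will not scale to a large archive without some kind of index.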

They OCR'd PDF documents. Interesting (and I wish I had time to do it to some of the PDFs I've been reading), but not earthshaking. Particularly with 98% accuracy; if 1 out of 50 characters are wrong, searches are going to be problematic.

If you find the term, yay; if you don't you can't really conclude it's not there.
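To put a rough number on that: assuming the 98% figure is per character and that errors are independent (both are assumptions, since the thread doesn't say how accuracy was measured), the chance that a search term was read intact shrinks with its length.

```python
def prob_term_intact(term_length: int, char_accuracy: float = 0.98) -> float:
    """Probability that every character of a term was recognized correctly,
    assuming independent per-character errors."""
    return char_accuracy ** term_length


for term in ("hose", "garden", "prescription"):
    p = prob_term_intact(len(term))
    print(f"{term!r}: {p:.1%} chance of being read without error")
```

Under those assumptions a six-letter term is read correctly roughly 89% of the time, so a single occurrence is missed about one time in nine. A term that appears several times in a document is much more likely to be found at least once.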

$3200 seems pretty expensive. At the startup I used to work at, I was tasked with building something similar, except the documents I had to index were almost 1000 pages each; they were building plans and diagrams. We did all of our processing within an AWS Lambda pipeline and Elasticsearch. Elasticsearch was the only real cost.

For 98% accuracy that's pretty cheap for OCRing a large number of pages. It wasn't that long ago that you had to use something like abbyy - now that was expensive.

Wow, I didn't know. But I'd highly recommend using a Lambda function with a local OCR library, to see if it affects cost. No need for Google to be reading your docs. I believe I used pyocr. Not sure how they got the 98% figure.

I'm curious why you think Amazon isn't reading your docs (lambda function) but Google is?

I use Qiqqa http://www.qiqqa.com for searching my library of PDFs. It does OCR on scanned documents. Supports shared and cloud PDF libraries and managing bibtex citations.

As long as there is a BAA between this person’s company and Google, this should be acceptable from a HIPAA standpoint, no? Or are people’s concerns more about the fact that Google is in the mix, without the patients being aware?

How's that pricing work? Google Vision is $.60 per 1000 pages (after 5m pages, $1.50 before that), how can it be $.25?

Even $.25 seems expensive...

More pages? 0.00025/page seems okay, tbh. You'd already be at 10ct for a 400 page document, but consider what people spend on printing. I think the price can probably be justified for many.
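For anyone who wants to check the arithmetic, here is a small sketch of the tiered pricing as quoted in this thread ($1.50 per 1000 pages up to 5 million pages, $0.60 per 1000 after that). The rates come from the comment above and are not verified against Google's current price list; any free tier is ignored and the function name is made up.

```python
def vision_ocr_cost(pages: int,
                    first_tier_pages: int = 5_000_000,
                    first_tier_rate: float = 1.50,
                    later_rate: float = 0.60) -> float:
    """Estimated cost in USD; both rates are USD per 1000 pages."""
    in_first_tier = min(pages, first_tier_pages)
    beyond_first_tier = max(pages - first_tier_pages, 0)
    return (in_first_tier * first_tier_rate + beyond_first_tier * later_rate) / 1000


# The 400-page document at the flat $0.25 per 1000 pages mentioned above.
print(f"${vision_ocr_cost(400, first_tier_rate=0.25):.2f}")
# Six million pages at the quoted tiered rates.
print(f"${vision_ocr_cost(6_000_000):,.2f}")
```

At the quoted tiered rates, a per-page price of $0.00025 is never reached, which is the discrepancy the parent comment is asking about.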

Adobe reader has a built in advanced search function that will search every pdf in a folder. Not that this would replace the system that this author is writing about, but it works well for those times when you have to search through dozens of documents that tally to thousands of pages.

And it works when the PDF is made up of images of text?

google vision api for OCR should be this one:


The headline should be: How we handed Google all your medical information and you didn't even know about it.


The article is from someone at Oscar Health Insurance.

According to Wikipedia, the company is claiming transparency in claims pricing:

>Oscar Health Insurance is a technology-focused health insurance company founded in 2012 and headquartered in New York City. The company has plans to change the health insurance industry through telemedicine, healthcare focused technological interfaces, and transparent claims pricing systems.

Maybe they extended the transparency to people's health data.

Here is another article (also on Medium) that incidentally mentions the author (as an engineer employed at the company):


Is this legal? Did they even read the TOS for Google?

Of course it's legal. They have a BAA with Google.

Do people think companies that deal with PHI have to build every service in house...? Run their own data centers? Build their own backhaul connections to their users?

You have a copy of the BAA? (or a link?)

Here is Google's information on the subject: https://cloud.google.com/security/compliance/hipaa/ (note that the Google Vision API is listed as a covered service, so it's approved for use under a BAA)

Obviously I don't have a copy of the contract between Google and Oscar, but I think it's profoundly unlikely that they would write a blog post about something that would be the end of their company without the "simple" (and industry standard) step of signing a BAA with a vendor.

"Google Cloud Platform was built under the guidance of a more than 700 person security engineering team, which is larger than most on-premises security teams."


With Abbyy and ocr.space Local there are good and (for companies) affordable local OCR solutions available. There is really no need to use online(!) OCR for sensitive data. Plus, local OCR is faster.

I'm sure they have a BAA with Google, which holds Google to the same HIPAA PHI handling requirements.

I am not sure they do. They are probably just using the Google Vision API as a regular customer. The pricing would change if Google had to do a legal deal with these folks.

That would immediately trigger a company ending lawsuit. There is zero chance they are using a service for handling PHI without a BAA in place with the vendor of that service.

Depending on the service provider, there may be an upfront cost for signing a BAA, but generally the service costs remain the same (assuming you stick to "BAA approved" services). In the case of Google, there is no additional cost:

"As such, we can offer HIPAA regulated customers the same products at the same pricing that is available to all customers, including sustained use discounts. Other public clouds charge more money for their HIPAA cloud, we do not."


Evidence? Getting a BAA from Google or Amazon or Microsoft is trivial.

Because that will fit the HN narrative well? How about commenting on the Google Vision API and its accuracy as well?

It’s a HIPAA compliant cloud service. Google is orders of magnitude less scary than your average health insurer.

I've never dealt with HIPAA; what restrictions does it put on internal use? Who, within Google, could they share raw data with? What about "anonymized" data?

See: https://cloud.google.com/security/compliance/hipaa-complianc...

Any reputable cloud service implements technical and process controls to control or eliminate access to customer data. Doing something like incorporating customer data into advertising would be a very serious situation, and frankly I doubt anyone would be that dumb at scale.

Google will contractually agree to a bunch of things relative to this and has third party audits to provide additional assurance.

What about embedded metadata comments in PDF source?

Searchability is the killer feature for reading. That's why no matter how much I try I can't go back to paper books and newspapers except for outdoor fashion purposes.

Isn't it already possible to search PDFs? I can start a search in zathura by typing '/'.

Only if the document is already pdf text, or if it has the text annotation layer. There are tools, like the OP one, that can discover text and add a layer to an image-only doc.

As far as I know, almost every PDF application, including Acrobat, has had this capability for many years and it's widely used for the same purposes. In addition to searching, it also allows you to copy text.

Perhaps using Google's service is faster or more accurate, but it would be necessary to have a comparison.
