
Documents OCR: Improving Efficiency by Making PDFs Searchable - ebibi
https://medium.com/oscar-tech/documents-ocr-improving-efficiency-by-making-pdfs-searchable-b56a261f07d
======
pkz
If these were medical claims documents won't they contain sensitive data about
individuals? I would have tried a local tesseract approach thoroughly before
sending them off to a cloud service.
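
For reference, a local pass could be as small as the sketch below. It assumes the `tesseract` binary is installed and that each page has already been rendered to an image (Tesseract reads images, not PDFs). The function names and the `eng` default are illustrative, not anything taken from the article.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the argv for: tesseract IMAGE OUTPUTBASE -l LANG pdf"""
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_dir, lang="eng"):
    """Run Tesseract locally so the scan never leaves the machine."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    # The trailing "pdf" config asks Tesseract for a PDF with a text layer.
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".pdf" to the output base itself.
    return Path(str(output_base) + ".pdf")
```

Under those assumptions, `ocr_to_searchable_pdf("claim_page1.png", "out")` would write `out/claim_page1.pdf`, provided the `out` directory already exists.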

~~~
josteink
This is the right answer.

Personally I’ve wrapped it up in some shell and python scripts[1], and it does
the job just fine.

[1]
[https://github.com/josteink/autoarchiver](https://github.com/josteink/autoarchiver)

~~~
JshWright
It was the right answer for you. It may not have been the right answer given a
different set of requirements (including scale, service availability, ops
complexity, etc)

------
jayalpha
I would have tried recoll that can work with tesseract in the background.
[https://www.lesbonscomptes.com/recoll/](https://www.lesbonscomptes.com/recoll/)

Abbyy is in my opinion by far the most powerful OCR software. A Linux
command-line version is available, but unfortunately it is too pricey for the
normal private user:
[https://www.ocr4linux.com/en:pricing:start](https://www.ocr4linux.com/en:pricing:start)

------
Davidbrcz
I've been happily using paperwork for this task:
[https://gitlab.gnome.org/World/OpenPaperwork/paperwork/](https://gitlab.gnome.org/World/OpenPaperwork/paperwork/)

One button scans a document, and the software does all the processing for you
(rotation, OCR, ...). You can apply tags to documents, search them, and export
them. The storage format is more than open (a PNG file plus the OCR results as
an HTML page, ...).

------
tjoff
That's great :)

I would have liked a comparison between Google, Amazon and tesseract. There
are projects for doing this at home and I've been meaning to do this for some
time but haven't got around to it yet.

Been a while but the only one I can remember off the top of my head is
paperless:
[https://github.com/danielquinn/paperless](https://github.com/danielquinn/paperless)

------
matwood
I would like to know more about their comparison results between Google vs.
Amazon vs. Tesseract. How did they determine 98% accuracy? If I assume that
was Google Vision's accuracy, what was the accuracy of the others?

------
jccalhoun
98% sounds good but that remaining 2% is crucial. I do OCR on a lot of my
scanned documents to make them searchable and to make copying text easy. I've
tried a number of OCR programs but even after a lot of training something like
Omnipage will still think a lower case l is a 1 in the middle of a word. It
seems like it should be easy to put in a rule that says it is unlikely that a
document will have a number surrounded by letters (without spaces) but that
doesn't seem possible.
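
The rule described above can at least be approximated as a post-processing step on the OCR output. This is a minimal sketch, not a feature of Omnipage or any other OCR product. The look-alike table is an assumption and only covers `1` and `0`.

```python
import re

# Look-alike pairs are an assumption for illustration; extend as needed.
DIGIT_TO_LETTER = {"1": "l", "0": "o"}

# A single 0 or 1 with an ASCII letter directly on both sides.
_DIGIT_BETWEEN_LETTERS = re.compile(r"(?<=[A-Za-z])[01](?=[A-Za-z])")


def fix_digits_inside_words(text):
    """Replace a lone 0/1 sandwiched between letters with its look-alike letter."""
    return _DIGIT_BETWEEN_LETTERS.sub(lambda m: DIGIT_TO_LETTER[m.group()], text)
```

For example, `fix_digits_inside_words("he1lo")` gives `"hello"`, while a standalone `"1.5L"` is left alone. The rule will still misfire on legitimate tokens such as part numbers (`A1B`), and it always substitutes lower-case letters, so it is a heuristic rather than a fix.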

~~~
leokennis
Depends on your use case. If you need a definitive answer to how often in 2017
you bought “Coca Cola Zero 1.5L” based on scanned receipts then 98% accuracy
will not be enough. If you need that receipt from the one time you bought a
garden hose, searching for “garden” and “hose” will probably get you there,
even if the OCR tool read “garden h0se” or “g4rden hose”.
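
One way to make that kind of lookup tolerate "h0se" and "g4rden" is to fold common look-alike digits into letters on both sides before matching. A minimal sketch follows. The look-alike table and the function names are assumptions for illustration.

```python
# Digits that OCR commonly confuses with letters (an illustrative, partial table).
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "4": "a", "5": "s"})


def normalize(text):
    """Lower-case the text and fold look-alike digits into letters."""
    return text.lower().translate(LOOKALIKES)


def matches_all_terms(ocr_text, query):
    """True if every whitespace-separated query term occurs in the OCR text."""
    haystack = normalize(ocr_text)
    return all(normalize(term) in haystack for term in query.split())
```

Here `matches_all_terms("g4rden h0se 12.99", "garden hose")` is true. The folding also mangles real numbers (`1.5L` becomes `l.sl`), so it suits "find that one receipt" and not exact counting.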

------
mcguire
They OCR'd PDF documents. Interesting (and I wish I had time to do it to some
of the PDFs I've been reading), but not earthshaking. Particularly with 98%
accuracy; if 1 out of 50 characters is wrong, searches are going to be
problematic.

If you find the term, yay; if you don't you can't really conclude it's not
there.
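
To put a rough number on that: if the 98% figure is per character and errors are independent (both are assumptions, since the article does not say how accuracy was measured), the chance that an n-character search term survives OCR intact is 0.98 to the power n.

```python
def word_intact_probability(char_accuracy, word_length):
    """Probability that every character of a word is recognized correctly,
    assuming independent per-character errors."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    for n in (4, 6, 10):
        print(n, round(word_intact_probability(0.98, n), 3))
```

Under those assumptions, a 6-character term comes through intact a bit under 89% of the time, and a 10-character term a bit under 82%. A miss on a single occurrence is therefore weak evidence that the term is absent.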

------
catchmeifyoucan
$3200 seems pretty expensive. At the startup I used to work at, I was tasked
with building something similar, except that the documents I had to index were
almost 1000 pages each - they were building plans and diagrams. We did
all of our processing within an AWS Lambda pipeline and Elasticsearch.
Elasticsearch was the only real cost.

~~~
matwood
For 98% accuracy that's pretty cheap for OCRing a large number of pages. It
wasn't that long ago that you _had_ to use something like abbyy - now that was
expensive.

~~~
catchmeifyoucan
Wow, I didn't know. But I'd highly recommend using a Lambda function and
trying a local OCR library to see if it affects cost. No need for Google to be
reading your docs. I believe I used PyOCR. Not sure how they got the 98%
figure.

~~~
tensor
I'm curious why you think Amazon isn't reading your docs (lambda function) but
Google is?

------
snowwindwaves
I use Qiqqa [http://www.qiqqa.com](http://www.qiqqa.com) for searching my
library of PDFs. It does OCR on scanned documents. Supports shared and cloud
PDF libraries and managing bibtex citations.

------
carbocation
As long as there is a BAA between this person’s company and Google, this
should be acceptable from a HIPAA standpoint, no? Or are people’s concerns
more about the fact that Google is in the mix, without the patients being
aware?

------
hayd
How's that pricing work? Google Vision is $.60 per 1000 pages (after 5m pages,
$1.50 before that), how can it be $.25?

Even $.25 seems expensive...

~~~
solarkraft
More pages? 0.00025/page seems okay, tbh. You'd already be at 10ct for a 400
page document, but consider what people spend on printing. I think the price
can probably be justified for many.
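
The arithmetic in this exchange can be checked with a small sketch. The rates are the ones quoted in this thread, not a verified price list, and the function names are made up for illustration.

```python
def tiered_cost(pages, first_tier_pages=5_000_000, first_rate=1.50, later_rate=0.60):
    """Cost in dollars, using the per-1000-page rates quoted in the thread."""
    first = min(pages, first_tier_pages)
    later = max(pages - first_tier_pages, 0)
    return (first * first_rate + later * later_rate) / 1000


def blended_rate_per_1000(pages):
    """Average dollars per 1000 pages at a given total volume."""
    return tiered_cost(pages) / pages * 1000


def flat_cost(pages, per_page=0.00025):
    """Cost at the flat $0.00025/page figure discussed above."""
    return pages * per_page
```

With these numbers, `flat_cost(400)` is about $0.10, matching the 10ct estimate. The blended tiered rate starts at $1.50 per 1000 pages and only approaches $0.60 at very high volume, so it never reaches $0.25, which is the discrepancy being asked about.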

------
SQL2219
Adobe Reader has a built-in advanced search function that will search every
PDF in a folder. Not that this would replace the system that this author is
writing about, but it works well for those times when you have to search
through dozens of documents that tally to thousands of pages.

~~~
CodeWriter23
And it works when the PDF is made up of images of text?

------
ausjke
The Google Vision API for OCR should be this one:

[https://cloud.google.com/vision/docs/ocr](https://cloud.google.com/vision/docs/ocr)

------
Asmod4n
The headline should be: How we handed Google all your medical information and
you didn't even know about it.

~~~
jaclaz
Exactly.

The article is from someone at Oscar Health Insurance.

According to Wikipedia, the company:

[https://en.wikipedia.org/wiki/Oscar_Health](https://en.wikipedia.org/wiki/Oscar_Health)

is claiming transparency in claims pricing:

>Oscar Health Insurance is a technology-focused health insurance company
founded in 2012 and headquartered in New York City. The company has plans to
change the health insurance industry through telemedicine, healthcare focused
technological interfaces, and transparent claims pricing systems.

Maybe they extended the transparency to people's health data.

Here is another article (still on Medium) that incidentally talks about the
author (as an engineer employed at the company):

[https://medium.com/oscar-tech/whats-it-like-to-be-an-enginee...](https://medium.com/oscar-tech/whats-it-like-to-be-an-engineer-at-oscar-fbaaba3ce94d)

~~~
infocollector
Is this legal? Did they even read the TOS for Google?

~~~
JshWright
Of course it's legal. They have a BAA with Google.

Do people think companies that deal with PHI have to build every service in
house...? Run their own data centers? Build their own backhaul connections to
their users?

~~~
infocollector
You have a copy of the BAA? (or a link?)

~~~
JshWright
Here is Google's information on the subject:
[https://cloud.google.com/security/compliance/hipaa/](https://cloud.google.com/security/compliance/hipaa/)
(note that the Google Vision API is listed as a covered service, so it's
approved for use under a BAA)

Obviously I don't have a copy of the contract between Google and Oscar, but I
think it's profoundly unlikely that they would write a blog post about
something that would be the end of their company without the "simple" (and
industry standard) step of signing a BAA with a vendor.

~~~
mcguire
" _Google Cloud Platform was built under the guidance of a more than 700
person security engineering team, which is larger than most on-premises
security teams._ "

That's...encouraging.

------
agumonkey
What about embedded metadata comments in the PDF source?

------
diminish
Searchability is the killer feature for reading. That's why, no matter how
much I try, I can't go back to paper books and newspapers except for outdoor
fashion purposes.

------
crunchiebones
Isn't it already possible to search PDFs? I can start a search in zathura by
typing '/'.

~~~
imglorp
Only if the document is already pdf text, or if it has the text annotation
layer. There are tools, like the OP one, that can discover text and add a
layer to an image-only doc.

------
forapurpose
As far as I know, almost every PDF application, including Acrobat, has had
this capability for many years and it's widely used for the same purposes. In
addition to searching, it also allows you to copy text.

Perhaps using Google's service is faster or more accurate, but it would be
necessary to have a comparison.

