
Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning - wwarner
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
======
arnioxux
> Our Word Detector was also a significant bottleneck. We ended up essentially
> rewriting OpenCV’s C++ MSER implementation in a more modular way to avoid
> duplicating slow work when doing two passes (to be able to handle both black
> on white text as well as white on black text); to expose more to our Python
> layer (the underlying MSER tree hierarchy) for more efficient processing;
> and to make the code actually readable.

Was this ever contributed back to opencv?

~~~
turblety
OpenCV is licensed under the BSD license. [1]

"BSD (unlike some other licenses) does not require that source code be
distributed at all." [2]

Looks like there is no legal obligation to provide their source code or
changes. But I really hope they did push their changes upstream. It is a real
shame when companies that benefit so much from the open source community don't
give back.

1. [https://github.com/opencv/opencv/blob/master/LICENSE](https://github.com/opencv/opencv/blob/master/LICENSE)

2. [https://en.wikipedia.org/wiki/BSD_licenses](https://en.wikipedia.org/wiki/BSD_licenses)

~~~
wiz21c
Hopefully, those who miss that Dropbox code today will think about using a
more powerful license (i.e. AGPL) on the code they write themselves in the
future, so that they make sure their users contribute back like they want
Dropbox to. :-)

~~~
wwarner
Pretty sure that both OpenCV and Dropbox knew what they were doing when they
avoided the affero license.

------
cardigan
A better baseline than an out-of-the-box OCR engine is applying some computer
vision techniques before using an out-of-the-box OCR engine. This can
significantly outperform a pure OCR engine approach (speaking from experience
working at startups doing this kind of stuff for ~10 months).

One dumb but surprisingly effective thing to try is to apply a bunch of random
binarizing filters before putting your text into an OCR engine, then pick the
most common output from those filtered images. Some combination of
morphological operations (dilates/erodes in different orders and different
strengths), different levels of blur and sharpening, adaptive binary threshold
at different levels, resizing the image (hilarious but some OCR engines are
sensitive to relative scale), adding white borders of different thicknesses,
rotating by small angles, denoising, ...

This gave us ~95% word level accuracy on our dataset (with tesseract as the
OCR engine), without tuning the filters (we had used the filters individually
before with some brittle logic for deciding when to use them: the random
filter approach was kind of a joke which ended up working). Tesseract on its
own had an abysmal ~40% word level accuracy on that same dataset. Dataset was
of words segmented out of scanned bank statements.

Main downside was that this was pretty slow (we were generating ~10k filters
per word), but we found a way to make it work (hierarchical stacked AWS
Lambdas, lol).
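A minimal sketch of that voting scheme, with the filters and the OCR call stubbed out. Everything named here is illustrative: a real version would plug in randomized OpenCV ops (cv2.dilate, cv2.erode, cv2.adaptiveThreshold, resizes, small rotations) and use something like pytesseract.image_to_string as the OCR call.

```python
import random
from collections import Counter

def majority_ocr(image, ocr, filters, n_variants=50, seed=0):
    """OCR many randomly filtered copies of `image` and return the
    most common transcription (ties broken by first-seen order)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_variants):
        # Compose a short random chain of preprocessing filters.
        chain = rng.sample(filters, k=rng.randint(1, min(3, len(filters))))
        variant = image
        for f in chain:
            variant = f(variant)
        votes[ocr(variant)] += 1
    return votes.most_common(1)[0][0]

# Toy stand-ins: in practice `image` is a numpy array and each filter is a
# dilate/erode/blur/threshold/resize closure with randomized parameters.
filters = [lambda x: x, str.upper, str.lower]
print(majority_ocr("scan", ocr=lambda img: "hello", filters=filters))
```

With real filters, the vote washes out the variants where a particular binarization happens to garble the word.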

~~~
Jhsto
Selfless plug, but for some examples of these see a side project of mine:
[https://www.juusohaavisto.com/northern-nike-nabob.html](https://www.juusohaavisto.com/northern-nike-nabob.html)

------
philipkglass
10 years ago I used Abbyy Finereader to add an OCR text layer to about 500,000
pages of scanned PDF documents. At the time it appeared to be the best OCR
software available, at least of the options I could try.

Later I played around with the open source Tesseract OCR, but it was _way_
behind commercial desktop packages in accuracy/usability. It lagged especially
in dealing with different document layouts. What's the current leader in
desktop OCR? Has deep learning yet revolutionized the "traditional" OCR market
for desktop document processing, or is it just making it easier for companies
like Dropbox to quickly get something reasonably good working?

~~~
pault
I had to research the commercial OCR market recently for a client project.
It's abysmal. Most of the solutions are horrendously overpriced, Windows-only
enterprise packages, and the ones that I was able to try out bordered on
unusable (bad results and _terrible_ UI). Since OCR is basically "hello world"
for TensorFlow, I don't understand why these incumbents haven't been wiped off
the map.

~~~
bflesch
Because, for example, Abbyy replaces the image with a PDF that has invisible
text overlays. I think it's a bit more complicated once you account for the
formatting of documents.

~~~
milesokeefe
Abbyy has had that ability for longer, so it may be more accurate, but this is
something Tesseract supports[1]. In fact, I'd say most OCR systems support it
with varying degrees of accuracy.

[1]: [https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-produce-searchable-pdf-output](https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-produce-searchable-pdf-output)

------
milad_nazari
> "At some point, we will open source and release this synthetically generated
> data for others to train and validate their own systems and research on."

Glad to hear this.

~~~
Dowwie
This article was posted in April. Did Dropbox follow through and release it?
It's been 7 months...

------
Roritharr
I'd kill for a self-hostable solution like that. I just don't have the
resources to spend like Dropbox. Can anyone recommend a poor man's approach to
this? I can afford a lot of AWS instances, but not the 8 months of
engineering.

~~~
romuloab42
Not self-hostable, but the Google Vision API OCR was pretty good when I last
used it. AWS also has an offering for that, but I haven't personally tested
it.

------
SloopJon
It's pretty remarkable that they were able to build a system more or less from
scratch that "gave us the accuracy, precision, and recall numbers that all met
or exceeded the OCR state-of-the-art."

There was a recent post on the Google Developers blog about a similar project
to recognize bottle-cap codes in a mobile app:

[https://developers.googleblog.com/2017/09/how-machine-learning-with-tensorflow.html](https://developers.googleblog.com/2017/09/how-machine-learning-with-tensorflow.html)

HN discussion:

[https://news.ycombinator.com/item?id=15489176](https://news.ycombinator.com/item?id=15489176)

------
tudorw
Seems like a lot of value is locked away in these training sets. Is there a
commercial platform to buy and sell trained sets yet? Answering my own
question: yes, [https://algorithmia.com/](https://algorithmia.com/). Any
others?

------
ayw
Super cool!

Solving problems like DropTurk with stronger guarantees around
turnaround/quality/confidentiality is exactly what we do at Scale. Would love
to hear more about DropTurk’s experiences around building this internally and
how it compares to our product:

[https://www.scaleapi.com/ocr-transcription](https://www.scaleapi.com/ocr-transcription)

------
brightball
Hadn't really used the document scanner yet...but it sounds like I might have
to now. If they keep moving on this it's going to replace my primary use of
Evernote...and since Dropbox has a Linux client but Evernote doesn't it'll end
up being a pretty easy switch.

Are there any Evernote -> Dropbox Paper migration tools out there?

------
verelo
Was this done by the Endorse team Dropbox acquired a few years back? I know
they were big into this same line of work.

------
evv
Are there any fully-offline alternatives to this? What is the most capable
open-source solution?

------
WalterBright
I'd love to see a system that worked on handwriting. With all the image
recognition progress in the last few years, I never see a mention of
recognizing handwriting.

~~~
ocrcustomserver
There are two options for this, one is online and the other is offline
handwriting recognition.

Online is when you have info about the strokes (e.g. writing on a tablet or an
HTML5 canvas); offline is when you don't have anything other than the image.

Online is mostly solved; offline is not working well yet. Deep learning will
certainly help accelerate progress.

I'm working on a prototype that extracts (and transcribes) handwriting info
from PDF documents (like forms).

~~~
WalterBright
Nice to see someone working on it!

The online system is not so useful, I prefer to type when I have the option
anyway. It's faster.

The offline system initially doesn't even need to be that accurate. OCRing
printed text must be accurate because of the sheer volume being OCRed.
Handwritten stuff tends to be of far smaller volume, and so can be hand-
corrected more practically.

------
jccalhoun
I hope some better OCR comes along. I use Nuance, and either I have confused
it with my training file or it makes some very dumb decisions. It seems like
it should scan something that it sees as "wi1l" or "s0mething" and realize
that it isn't very likely that someone would put a number in the middle of a
word, or put random superscript in the middle of a word.
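A toy post-processing pass along those lines (the confusion table and the letters-vs-digits heuristic are my own illustration, not anything Nuance actually does):

```python
# Common OCR look-alike confusions (digit -> likely intended letter).
CONFUSIONS = {"0": "o", "1": "l", "5": "s", "8": "b"}

def fix_token(token):
    """Swap embedded digits for look-alike letters in tokens that are
    mostly letters; leave digit-heavy tokens (prices, IDs) alone."""
    letters = sum(c.isalpha() for c in token)
    digits = sum(c.isdigit() for c in token)
    if letters > digits and digits > 0:
        return "".join(CONFUSIONS.get(c, c) for c in token)
    return token

def fix_text(text):
    return " ".join(fix_token(t) for t in text.split())

print(fix_text("wi1l s0mething costs 42"))  # will something costs 42
```

A real system would weight this by a dictionary or a character-level language model instead of a hard-coded table.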

------
ksk
Very interesting post. Is there any benchmark data showing how their system
compares to the competition?

------
whatever_dude
All this and yet they still can't implement support for a .dropboxignore file

~~~
dingo_bat
You can just place any such file outside the Dropbox directory. Doesn't seem
like a big deal.

~~~
Rotten194
It's very frustrating if you keep your programming projects backed up through
Dropbox, and want to back up the code but not gigabytes of node_modules and
python virtualenvs. Not that Dropbox has any incentive to make it easier to
use less space...

~~~
pault
Isn't git or mercurial more appropriate for archiving programming projects?

~~~
Nullabillity
For archival, yes. But it's not rare that I want to swap computers without
committing WIP changes.

~~~
parthdesai
Wouldn't branching do the trick?

Check out a WIP branch and push it. You can pull it on any other computer you
want, and keep working on it until you are comfortable enough to merge back
into master/develop.

~~~
whatever_dude
While that sort of works and I do that, it's a much more cumbersome solution.

I commit WIP code to a separate branch, switch machines, and now if I go back
to the original computer, I need to make sure to push, go back, pull again...
god forbid I had done interactive rebasing or amended a commit, since it'll be
all out of sync... it's just more complicated than simply saving a file and
having it reflected. "Temporary commits" should be a rare thing, not
mandatory.

I end up using both git and dropbox for some projects and it mostly works, but
having projects on dropbox is a mess because of node_modules, having sync
running continuously because some tools/projects/IDEs will keep generating
temporary files, etc.

~~~
parthdesai
Oh, I do agree: it's cumbersome, and Dropbox should have a .dropboxignore
file. I was just suggesting a workaround :)

------
misiti3780
Dropturk sounds useful, any plans of open-sourcing it ?

------
webscalist
Why would a personal backup company invest in OCR?

> To peek at personal data

~~~
barkingcat
This statement is obvious. I pay Dropbox specifically so they can hold my data
and, yes, look at my personal data; by OCR'ing it they make my data even more
understandable and searchable.

How is this a concern when that is the purpose of the company's software?

~~~
kinkrtyavimoodh
I understand that you are fine with them looking at your data, but how exactly
is it obvious that the primary purpose of a company you are paying to backup
your data is to mine that data?

~~~
barkingcat
The primary purpose of Dropbox was never "backup". It was always to
intelligently store and share data, and to be able to identify the information
inside it: if it's an image, show a gallery; if it's a Word doc, provide a
lightweight interface for reading and editing, etc. It was also one of the
primary ways to get data out of cell phones, because of the "sharing" and
"syncing" options.

I think you are confusing dropbox for something else?

In fact I think dropbox would be a horrible backup solution! Don't use it for
backup or you will be disappointed.

~~~
derefr
Right: Dropbox isn't a safe-deposit box for your data; it's a checking account
for your data.

------
amelius
Why do they use Mechanical Turk to train on printed text? You could easily
build a tool that transforms ASCII into something that looks like photographed
printed text, with different levels of quality (using e.g. erosion and
dilation, different fonts, etc.)

~~~
dangrossman
They were using Mechanical Turk to evaluate existing OCR systems before they
started building their own.

If you read further into the article, they did exactly what you suggested to
train their own system:

"A corpus with words to use...A collection of fonts for drawing the words...A
set of geometric and photometric transformations meant to simulate real world
distortions...The generation algorithm simply samples from each of these to
create a unique training example."
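The quoted generation algorithm can be sketched like this. The corpus, font names, and transform names below are placeholder stand-ins; a real pipeline would render actual word images (e.g. with PIL) and apply the transforms in image space.

```python
import random

# Illustrative pools standing in for the article's three collections.
CORPUS = ["invoice", "total", "dropbox", "receipt"]
FONTS = ["Helvetica", "Times", "Courier"]
TRANSFORMS = ["rotate", "blur", "erode", "dilate", "jpeg_noise"]

def sample_training_example(rng):
    """Sample one synthetic (word, rendering recipe) pair by drawing
    from each collection, as the quoted generation algorithm does."""
    word = rng.choice(CORPUS)
    recipe = {
        "font": rng.choice(FONTS),
        # A few randomly chosen distortions, applied in order.
        "transforms": rng.sample(TRANSFORMS, k=rng.randint(1, 3)),
    }
    return word, recipe

rng = random.Random(0)
examples = [sample_training_example(rng) for _ in range(1000)]
```

The label for each example is the sampled word itself, which is what makes synthetic generation so much cheaper than human annotation.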

