
Using machine learning to index text from billions of images - bradneuberg
https://blogs.dropbox.com/tech/2018/10/using-machine-learning-to-index-text-from-billions-of-images/
======
teraflop
Gmail already does this with image attachments. I discovered this by accident
not long ago: I searched my inbox for a person's name, and one of the results
was a message containing a photo of a sheet of paper with their name on it.

------
perturbation
It would be nice to benchmark the text extraction against a baseline method,
say Apache Tika ([https://tika.apache.org/](https://tika.apache.org/)).

I would expect the deep learning approach to outperform traditional approaches
in terms of accuracy, but it would be good to see accuracy vs. CPU / memory
used, etc.
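
As a rough sketch of what such a benchmark could measure: character error
rate (CER) is one common OCR accuracy metric, computed as the edit distance
between the ground truth and the OCR output, divided by the length of the
ground truth. The `benchmark` helper and the `ocr_fn` callable below are
hypothetical names for illustration, not part of Tika, Tesseract, or the
Dropbox pipeline, and the timing is wall-clock only (it does not capture
memory or per-core CPU use).

```python
import time


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the ground-truth text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def benchmark(ocr_fn, samples):
    """Run a hypothetical ocr_fn(image) -> str over (image, truth) pairs.

    Returns the aggregate CER and the wall-clock seconds spent.
    """
    total_edits = 0
    total_chars = 0
    start = time.perf_counter()
    for image, truth in samples:
        total_edits += levenshtein(truth, ocr_fn(image))
        total_chars += len(truth)
    elapsed = time.perf_counter() - start
    cer = total_edits / total_chars if total_chars else 0.0
    return {"cer": cer, "seconds": elapsed}
```

Running `benchmark` once per engine over the same labeled samples would give
one accuracy number and one time number per engine to compare.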

~~~
milesokeefe
Tika doesn’t do OCR, it only extracts text content from binary files. For an
image it’ll only give you metadata and such.

A better comparison would be against Tesseract or ABBYY FineReader.

EDIT: I wasn't aware that Tika now embeds Tesseract.[1] Still, it's a simple
wrapper so the real comparison is against Tesseract.

[1]
[https://wiki.apache.org/tika/TikaOCR](https://wiki.apache.org/tika/TikaOCR)

~~~
zawerf
I think they already tried commercial off-the-shelf OCR software (which they
didn't name but I would assume it's ABBYY) before they decided to build their
own solution:

[https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr...](https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/)

~~~
mehrdadn
ABBYY hasn't been all that amazing in my experience. I compared it with Neat
Scanner software a few months ago and the latter seemed to do a noticeably
better job.

------
albemuth
This is opt-in, right?

~~~
bradneuberg
It’s on by default for Dropbox Professional users, and opt-in Early Access for
Dropbox Business Advanced and Enterprise teams. More user-level details here
(the blog post linked above covers the technical background):
[https://t.co/vVRMnnbXIT?amp=1](https://t.co/vVRMnnbXIT?amp=1)

------
bpg_92
Pretty nice. I'm really interested in pipelines for deep learning at scale.
They link to this article:
[https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr...](https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/)
If anyone has other insights, I'd be thankful. So far I have used TensorRT to
deploy to inference servers, but there is a lot of boilerplate, and something
like a 'load balancer' for DL networks would be very interesting.

------
kwijibob
Hmmmm, I have Dropbox Plus, for $99 USD per year.

This is another Dropbox feature I would like that is not included in my
product.

------
shady-lady
Waiting on YouTube to use machine learning so I can select text off a video
frame...

YouTube Text Overlay - coming soon.

~~~
newman8r
I contemplated a startup a few years back that would let people copy the text
from coding tutorials. Uploaders would include a text file/repo for the work
they're referencing in the video, and it would get cross-referenced when shown
on the video to ensure the text was 100% accurate when copied.
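
A minimal sketch of that cross-referencing step, assuming the OCR output for
a frame is already available as a string and the uploader's file has been
read into a list of lines. The function name `snap_to_reference` and the 0.6
cutoff are illustrative assumptions; the fuzzy matching uses the standard
library's `difflib.get_close_matches`.

```python
import difflib


def snap_to_reference(ocr_line, reference_lines, cutoff=0.6):
    """Return the reference line closest to a noisy OCR'd line, or None.

    Matching ignores leading/trailing whitespace, but the original
    reference line (indentation included) is what gets returned.
    """
    # Map stripped text -> original line so indentation survives the lookup.
    candidates = {line.strip(): line for line in reference_lines if line.strip()}
    matches = difflib.get_close_matches(
        ocr_line.strip(), list(candidates), n=1, cutoff=cutoff
    )
    return candidates[matches[0]] if matches else None


reference = ["def add(a, b):", "    return a + b"]
# OCR misread ")" as "}" and "rn" as "m"; the copied text comes from the file.
fixed_first = snap_to_reference("def add(a, b}:", reference)
fixed_second = snap_to_reference("retum a + b", reference)
```

Returning `None` when nothing clears the cutoff lets the caller fall back to
the raw OCR text instead of silently substituting an unrelated line.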

------
sukeshroydsl
I have found a good article on how to design the dataset for image
extraction:
[https://www.datasciencelearner.com/design-best-machine-learn...](https://www.datasciencelearner.com/design-best-machine-learning-datasets/)

