Using machine learning to index text from billions of images (dropbox.com)
157 points by bradneuberg 9 months ago | 17 comments

Gmail already does this with image attachments. I discovered this by accident not long ago: I searched my inbox for a person's name, and one of the results was a message containing a photo of a sheet of paper with their name on it.

It would be nice to benchmark the text extraction against a baseline method, say Apache Tika (https://tika.apache.org/).

I would expect the deep learning approach to outperform traditional approaches in terms of accuracy, but it would be good to see accuracy vs. CPU / memory used, etc.
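A rough sketch of what such a comparison could look like, assuming you have a set of images with ground-truth transcriptions. The `benchmark` helper and the `extract` callable are hypothetical names, not part of Tika, Tesseract, or the Dropbox system; accuracy is measured here as character error rate (edit distance divided by ground-truth length), which is one common choice among several.

```python
import time

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # drop a character from a
                           cur[j - 1] + 1,             # add a character to a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def benchmark(extract, samples):
    """Run a text extractor over (image, ground_truth) pairs.

    `extract` is any callable mapping an image to a string (hypothetical).
    Note: process_time() only counts CPU used by this Python process, so an
    engine that runs as a separate subprocess would need other accounting.
    """
    start = time.process_time()
    predictions = [extract(image) for image, _ in samples]
    cpu_seconds = time.process_time() - start
    errors = sum(edit_distance(p, truth) for p, (_, truth) in zip(predictions, samples))
    total_chars = sum(len(truth) for _, truth in samples)
    return {"cer": errors / max(total_chars, 1), "cpu_seconds": cpu_seconds}
```

Running the same `samples` through each engine's wrapper would give one accuracy figure and one CPU figure per engine, which is the trade-off asked about above.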

Tika doesn’t do OCR, it only extracts text content from binary files. For an image it’ll only give you metadata and such.

A better comparison would be against Tesseract or ABBYY FineReader.

EDIT: I wasn't aware that Tika now embeds Tesseract.[1] Still, it's a simple wrapper so the real comparison is against Tesseract.

[1] https://wiki.apache.org/tika/TikaOCR

For the use case of search, you can "cheat" and provide multiple answers for each word that you find in the image. Evernote does this. (It has 2-3 options for each word in its OCR results.) I don't know whether Tesseract supports this mode of operation, or whether Dropbox is doing this.
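To make the "cheat" concrete, here is a minimal sketch of an inverted index that stores every candidate reading of each word. The data layout and the `build_index` / `search` names are assumptions for illustration; this is not how Evernote or Dropbox actually implement it.

```python
from collections import defaultdict

def build_index(ocr_results):
    """Map every candidate reading of every word to the images containing it.

    ocr_results: {image_id: [[candidate, ...], ...]} with one inner list per
    word position, holding the recognizer's alternative readings.
    """
    index = defaultdict(set)
    for image_id, words in ocr_results.items():
        for candidates in words:
            for candidate in candidates:
                index[candidate.lower()].add(image_id)
    return index

def search(index, query):
    """Return the images in which every query term matches some candidate."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)

# Hypothetical OCR output: the second word of receipt.jpg was ambiguous.
ocr_results = {
    "receipt.jpg": [["total"], ["clue", "due", "dne"]],
    "note.png": [["clue"], ["board"]],
}
index = build_index(ocr_results)
```

With this data, a search for "total due" finds receipt.jpg even if the recognizer's top guess was "clue". The cost is lower precision: a search for "clue" also returns receipt.jpg.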

I think they already tried commercial off-the-shelf OCR software (which they didn't name, but I would assume it's ABBYY) before they decided to build their own solution:


ABBYY hasn't been all that amazing in my experience. I compared it with Neat Scanner software a few months ago and the latter seemed to do a noticeably better job.

My first thought when reading this was it seemed almost over-engineered compared to just using Tika+Tesseract.

I'm not sure what benefit they are getting from using machine learning for this other than "decide whether to try and process this file or not".

Tika + Tesseract seems to be able to do the heavy lifting they spent a lot of time talking about in that article.

I worked on a very similar system for a very different company, and I tend to think that a good reason to implement your own OCR models (if you can afford it) would be optimizing CPU cost. Tesseract can be quite expensive to run at scale, maxing out the CPU at 100% for a simple page and taking about 5-30 seconds for full-page extraction. Also, most Tesseract pipelines take entire PDF files for processing, whilst you could achieve better latency by processing pages in parallel and merging the results, as they suggest in the post.
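A minimal sketch of the per-page parallelism described above. `ocr_page` is a hypothetical placeholder for whatever OCR call you actually use, and a thread pool is an assumption that fits when the OCR engine runs as an external process; for OCR done in pure Python, a process pool would likely be the better swap.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page):
    """Placeholder for a real per-page OCR call (hypothetical)."""
    return page.upper()

def ocr_document(pages, max_workers=4):
    """OCR pages concurrently and merge the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of which page finishes first.
        texts = list(pool.map(ocr_page, pages))
    return "\n".join(texts)

pages = ["page one", "page two", "page three"]
merged = ocr_document(pages)
```

Latency for the whole document then approaches the time of the slowest page rather than the sum of all pages, as long as there are enough workers.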

Tesseract does not work well out of the box and is usually outperformed by custom OCR models.

This is opt-in, right?

It doesn't have to be, according to the current Dropbox Terms of Service[1]:

>We need your permission to do things like hosting Your Stuff, backing it up, and sharing it when you ask us to. Our Services also provide you with features like photo thumbnails, document previews, commenting, easy sorting, editing, sharing, and searching. These and other features may require our systems to access, store, and scan Your Stuff. You give us permission to do those things, and this permission extends to our affiliates and trusted third parties we work with.

This is unarguably something that facilitates searching.

[1] https://www.dropbox.com/terms

It’s on by default for Dropbox Professional users, and opt-in Early Access for Dropbox Business Advanced and Enterprise teams. More user level details here (the blog post linked above is a technical background): https://t.co/vVRMnnbXIT?amp=1

Pretty nice. I'm really interested in pipelines for deep learning at scale. They link to this article (https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr...); if anyone has other insights, I'd be thankful. So far I have used TensorRT to deploy to inference servers, but there is a lot of boilerplate, and something like a 'load balancer' for DL networks would be very interesting.

Hmmmm, I have Dropbox Plus, for $99 USD per year.

This is another Dropbox feature I would like that is not included in my product.

Waiting on YouTube to use machine learning so I can select text off a video frame.

YouTube Text Overlay - coming soon.

I contemplated a startup a few years back that would let people copy the text from coding tutorials. Uploaders would include a text file/repo for the work they're referencing in the video, and it would get cross-referenced when shown in the video to ensure the text was 100% accurate when copied.

I have found a good article on how to design the dataset for the image extraction. https://www.datasciencelearner.com/design-best-machine-learn...
