I would expect the deep learning approach to outperform traditional approaches on accuracy, but it would be good to see accuracy weighed against cost: CPU time, memory used, and so on.
A better comparison would be against Tesseract or ABBYY FineReader.
EDIT: I wasn't aware that Tika now embeds Tesseract. Still, it's a simple wrapper, so the real comparison is against Tesseract.
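For what it's worth, here is a rough sketch of the kind of accuracy-vs-cost comparison I mean. Everything in it is my own assumption, not anything from the article: `benchmark_ocr` is a hypothetical helper, the `ocr` argument is whatever callable wraps the engine you want to test, and accuracy is approximated with a plain string-similarity ratio rather than a proper character error rate. Note that `time.process_time` and `tracemalloc` only see the current Python process, so an engine that runs as a subprocess or in native code would need a different measurement.

```python
import time
import tracemalloc
from difflib import SequenceMatcher
from typing import Callable, Iterable, Tuple


def benchmark_ocr(ocr: Callable[[bytes], str],
                  samples: Iterable[Tuple[bytes, str]]) -> dict:
    """Run `ocr` over (image_bytes, expected_text) pairs.

    Reports mean text similarity, CPU seconds spent in this process,
    and peak memory allocated by Python code while running.
    """
    scores = []
    tracemalloc.start()
    cpu_start = time.process_time()
    for image, expected in samples:
        text = ocr(image)
        # Similarity in [0, 1]; a stand-in for a real OCR accuracy metric.
        scores.append(SequenceMatcher(None, expected, text).ratio())
    cpu_seconds = time.process_time() - cpu_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_similarity": sum(scores) / len(scores) if scores else 0.0,
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
    }
```

Run the same labeled sample set through each engine and compare the resulting dictionaries side by side; that would show whether the accuracy gain is worth the extra compute.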
I'm not sure what benefit they are getting from using machine learning for this, other than deciding whether or not to try to process a given file.
Tika + Tesseract seems to be able to do the heavy lifting they spent a lot of time talking about in that article.
>We need your permission to do things like hosting Your Stuff, backing it up, and sharing it when you ask us to. Our Services also provide you with features like photo thumbnails, document previews, commenting, easy sorting, editing, sharing, and searching. These and other features may require our systems to access, store, and scan Your Stuff. You give us permission to do those things, and this permission extends to our affiliates and trusted third parties we work with.
This is unarguably something that facilitates searching.
This is another Dropbox feature I would like but is not included in my product.
YouTube Text Overlay - coming soon.