
Architecture of Nautilus, the new Dropbox search engine - WalterSobchak
https://blogs.dropbox.com/tech/2018/09/architecture-of-nautilus-the-new-dropbox-search-engine/
======
PokemonNoGo
Very interesting!

>For most documents, we rely on Apache Tika to transform the original document
into a canonical HTML representation, which then gets parsed in order to
extract a list of “tokens” (i.e. words) and their “attributes” (i.e.
formatting, position, etc…).
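
To make the quoted pipeline concrete, here is a minimal sketch of the second half only: turning already-canonical HTML into tokens with position and formatting attributes. It assumes the HTML has been produced upstream (by Tika or anything else) and does not call Tika at all. The names `Token`, `TokenExtractor`, `extract_tokens`, and `FORMAT_TAGS` are hypothetical, and nothing here is claimed to match what Dropbox actually does.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags treated as "formatting" attributes; an arbitrary illustrative choice.
FORMAT_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3", "title"}


@dataclass(frozen=True)
class Token:
    text: str          # lowercased word
    position: int      # 0-based index of the word in the document
    formatting: tuple  # formatting tags open when the word was seen


class TokenExtractor(HTMLParser):
    """Collects words from HTML along with the formatting tags enclosing them."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._open = []  # stack of currently open formatting tags

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Drop the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append(Token(word, len(self.tokens), tuple(self._open)))


def extract_tokens(html):
    parser = TokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    sample = "<html><body><h1>Quarterly Report</h1><p>Revenue <b>grew</b> fast.</p></body></html>"
    for token in extract_tokens(sample):
        print(token)
```

The hard part the question below is about, getting reliable HTML out of PDF, OpenXML, and ODF in the first place, is exactly what this sketch leaves out.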

How good is Apache Tika really at this? I've messed about with it, but it's hard
to find solutions that cover the basic cases.

What are the recommendations for covering, let's say, PDF, OpenXML, and ODF?

