
Architecture of Nautilus, the new Dropbox search engine - jorangreef
https://blogs.dropbox.com/tech/2018/09/architecture-of-nautilus-the-new-dropbox-search-engine
======
Tetris1
I remember when Dropbox released Firefly. It was simple and elegant. Nautilus
is a monster in comparison. It would be nice to see some pieces of code...

------
wiradikusuma
Also worth mentioning: [http://vespa.ai/](http://vespa.ai/)

------
arafalov
In 2015, they evaluated Apache Solr and Elasticsearch and decided to build
their own (Firefly). They said other solutions did not scale. So, instead of
contributing to scaling (like Apple and Bloomberg and Cloudera did), they went
the other way. Now, they seem to be doing it again (at least they are using
Tika).

In the meantime, Solr implemented most of the features they are describing in
their architecture document.

Specifically:

1) General scaling: [https://lucene.apache.org/solr/guide/7_5/introduction-to-
sca...](https://lucene.apache.org/solr/guide/7_5/introduction-to-scaling-and-
distribution.html) (using ZooKeeper and SolrCloud)

2) Search Ranking and click-data training:
[https://lucene.apache.org/solr/guide/7_5/learning-to-
rank.ht...](https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html)
(Contributed by Bloomberg)

3) Offline builds with substitution into production:
[https://lucene.apache.org/solr/guide/7_5/collections-
api.htm...](https://lucene.apache.org/solr/guide/7_5/collections-
api.html#createalias)

4) Near-Real-Time: [https://lucene.apache.org/solr/guide/7_5/near-real-time-
sear...](https://lucene.apache.org/solr/guide/7_5/near-real-time-
searching.html)

5) Sharding specifically: [https://lucene.apache.org/solr/guide/7_5/shards-
and-indexing...](https://lucene.apache.org/solr/guide/7_5/shards-and-indexing-
data-in-solrcloud.html)

6) Extraction pipeline: they do it all together. We have:

a) pre-Solr extraction (usually done in a stand-alone client, though we do
include Tika and DataImportHandler for quick start),

b) in-Solr pre-schema processing with Update Request Processors
[https://lucene.apache.org/solr/guide/7_5/update-request-
proc...](https://lucene.apache.org/solr/guide/7_5/update-request-
processors.html)

c) Actual per-field text processing pipelines, separate for index and
query (they later call the query part "query understanding"):
[https://lucene.apache.org/solr/guide/7_5/understanding-
analy...](https://lucene.apache.org/solr/guide/7_5/understanding-analyzers-
tokenizers-and-filters.html) Also, my own: [http://www.solr-
start.com/info/analyzers/](http://www.solr-start.com/info/analyzers/)

7) Pluggable internal index formats? Here is the latest (FST50):
[https://lucene.apache.org/solr/guide/7_5/the-tagger-
handler....](https://lucene.apache.org/solr/guide/7_5/the-tagger-handler.html)

8) Update system configuration live, over API?
[https://lucene.apache.org/solr/guide/7_5/configuration-
apis....](https://lucene.apache.org/solr/guide/7_5/configuration-apis.html)
[https://lucene.apache.org/solr/guide/7_5/schema-
api.html](https://lucene.apache.org/solr/guide/7_5/schema-api.html)

9) Tolerate small failures, but abort if something is definitely not right:
[http://www.solr-start.com/javadoc/solr-
lucene/org/apache/sol...](http://www.solr-start.com/javadoc/solr-
lucene/org/apache/solr/update/processor/TolerantUpdateProcessorFactory.html)

10) Retrieval root: [https://lucene.apache.org/solr/guide/7_5/solrcloud-query-
rou...](https://lucene.apache.org/solr/guide/7_5/solrcloud-query-routing-and-
read-tolerance.html)

11) Retrieval leaf: That's Solr's basic shard/core

12) The inverted and forward indexes look like standard Lucene index and maybe
docValues:
[https://lucene.apache.org/solr/guide/7_5/docvalues.html](https://lucene.apache.org/solr/guide/7_5/docvalues.html)

13) Search orchestrator seems to be a couple of features on top of Solr's
existing routing linked earlier. There were individual approaches/3rd-party
modules doing some of these (shadow, federation, ACL). Some of this is
definitely unique to Dropbox though.

14) Precision vs Recall vs Ranking would need too many links, but there is a whole
book on this: [https://www.manning.com/books/relevant-
search](https://www.manning.com/books/relevant-search) (mostly about
Elasticsearch, but Solr has added some new features recently to make it even
better)

15) BM25, we had it back in 2015:
[https://opensourceconnections.com/blog/2015/10/16/bm25-the-n...](https://opensourceconnections.com/blog/2015/10/16/bm25-the-
next-generation-of-lucene-relevation/)

16) (Future for Nautilus): Distance-based embeddings, such as Word2Vec.
A commercial offering on top of Solr has it:
[https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-
search...](https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-search/) but
I remember discussion of it for Solr as well

17) (Future for Nautilus): Searching images/videos/etc:
[https://lucidworks.com/2015/08/28/shutterstock-
searches-35-m...](https://lucidworks.com/2015/08/28/shutterstock-
searches-35-million-images-color-using-apache-solr/)
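
To make item 3 concrete: the offline-build-then-swap pattern comes down to building a fresh collection and then repointing an alias at it with the Collections API CREATEALIAS action. A minimal Python sketch that only constructs the request URL (nothing is sent). The base URL, the alias name `files`, and the collection name `files_build_42` are hypothetical and not taken from the Dropbox post.

```python
from typing import Sequence
from urllib.parse import urlencode


def create_alias_url(base_url: str, alias: str, collections: Sequence[str]) -> str:
    """Build a Collections API CREATEALIAS request URL (no request is sent)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,                           # the name that queries use
        "collections": ",".join(collections),    # the collection(s) behind it
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical values: repoint the "files" alias at a freshly built collection.
url = create_alias_url("http://localhost:8983/solr", "files", ["files_build_42"])
print(url)
```

Clients keep querying the alias, so calling CREATEALIAS again with a newer collection name is what performs the swap.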

And a lot more (the Solr Reference manual is more than 1300 pages...).
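
For readers who have not met BM25 (item 15), here is a minimal Python sketch of the textbook per-term score. It is not Lucene's code; the function name is made up for illustration, and the defaults `k1=1.2` and `b=0.75` are the commonly used values. The idf uses the `log(1 + ...)` form so it never goes negative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1) / norm


# Same term, same corpus statistics, increasing term frequency.
scores = [bm25_term_score(tf, 100, 100, 1000, 10) for tf in (1, 2, 3)]
print(scores)
```

The full score of a document for a query is the sum of this value over the query terms.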

Obviously, this is a bit of a dig at a Dropbox reinventing the wheel again (or
perhaps this time actually using Lucene, but forgetting to attribute it so
far).

But more importantly, it is a message to others that got excited by their
architecture post. You can have a similar battle-tested system for yourself,
for free. And if something is not perfect, you can fix it and help the rest of
the world too. We are always happy to see new contributors.

Finally, if you know Apache Solr well, it is not just Dropbox you can work
for, but also Lucidworks, Bloomberg, Cloudera, Alfresco, Shutterstock, Dice,
CareerBuilder, and many others.

~~~
decasteve
> at least they are using Tika

Why is that important? Is it advantageous versus the alternatives? (Genuinely
curious)

I have been using GNU libextractor but I see Tika quite often brought up in
the same breath. When I tried Tika a while back I didn't find it as good or
as fast. Has that changed?

~~~
arafalov
Tika is a very active project that Solr also uses. And they rely on other good
libraries.

If libextractor is sufficient for you, that's great. If you hit its
limits, try Tika.

Some use-cases I know of include

\- Parsing Microsoft Office Files

\- Doing OCR on images

\- Running Tika as a standalone server with HTTP interface
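
As a rough illustration of the last use-case, here is a minimal Python sketch of a client for a standalone Tika server. It assumes a server is already running locally on the default port 9998 and uses the `/tika` endpoint, which takes the document in a PUT body and returns extracted text. The helper names are hypothetical, and the example only builds the request without sending it.

```python
import urllib.request


def build_tika_request(data: bytes, server: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT request asking a Tika server for plain-text extraction."""
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/tika",
        data=data,                         # raw bytes of the file to parse
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text back
    )


def extract_text(data: bytes, server: str = "http://localhost:9998") -> str:
    """Send the request; this needs a running Tika server (assumed, not started here)."""
    with urllib.request.urlopen(build_tika_request(data, server)) as resp:
        return resp.read().decode("utf-8")


# Only build the request here; extract_text() would actually contact the server.
req = build_tika_request(b"hello from a text file")
print(req.get_method(), req.full_url)
```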

Tika is most definitely a secret component inside a lot of systems that
extract content/metadata from files. So, Dropbox leveraging Tika was a good
move and worth recognizing, especially given that the rest of their choices
do not quite make sense (based on the limited information provided).

------
markpapadakis
I am very much looking forward to forthcoming posts describing the actual
architecture and specifics -- this is a great high-level overview, but I hope
and expect they will expand on this exposé soon.

------
tegansnyder
Is this based on Lucene in any way?

------
tomrod
Cool to hear about a revamp. It confuses me why some projects use a well-known
open source name. On Linux, a primary desktop environment (GNOME) uses
Nautilus as its file manager. Dropbox even has a package for Dropbox/Nautilus
integration.

~~~
Insanity
Yeah whenever I hear Nautilus, the first thing that comes to mind is the file
manager.

But I suppose most Dropbox users are on Mac/Windows.

~~~
hackandtrip
What alternatives to Dropbox do Linux users have, other than running their own
server (which is, for a number of reasons, suboptimal for most people)? You
can't really use OneDrive365, and Dropbox offers vast support to Linux, is easy
to set up, and can be used for free too. Is there a reason why Linux users
wouldn't use it? Asking out of curiosity.

~~~
groovybits
> Dropbox offers vast support to Linux

Dropbox only supports unencrypted ext4 filesystems on Linux, so I would not
use the phrase 'vast support'.

~~~
coldtea
Because there are many popular desktop options for Linux that don't default to
unencrypted ext4?

~~~
yjftsjthsd-h
The RHEL family is XFS-centric, and quite a few distros (including Ubuntu) offer
encryption in the default installer.

~~~
danieldk
ext4 on encrypted devices, such as dm-crypt/LUKS, is supported. What is not
supported are encrypted filesystems that are 'filesystem overlays', such as
ecryptfs.

(Since I was using ZFS, I am still debating whether to stay with Dropbox after
November's filesystem apocalypse.)

~~~
module0000
FYI, you can still use ecryptfs with Dropbox. Put the encrypted store inside
your Dropbox folder, and mount it outside of it. From Dropbox's point of
view, you have thousands of files with gibberish as names.

