Architecture of Nautilus, the new Dropbox search engine (dropbox.com)
119 points by jorangreef 5 months ago | 45 comments

I remember when Dropbox released Firefly. It was simple and elegant. Nautilus is a monster in comparison. It would be nice to see some pieces of code...

Also worth mentioning: http://vespa.ai/

In 2015, they evaluated Apache Solr and Elasticsearch and decided to build their own (Firefly). They said other solutions did not scale. So, instead of contributing to scaling (like Apple and Bloomberg and Cloudera did), they went the other way. Now, they seem to be doing it again (at least they are using Tika).

In the meantime, Solr implemented most of the features they are describing in their architecture document.


1) General scaling: https://lucene.apache.org/solr/guide/7_5/introduction-to-sca... (using ZooKeeper and SolrCloud)

2) Search Ranking and click-data training: https://lucene.apache.org/solr/guide/7_5/learning-to-rank.ht... (Contributed by Bloomberg)

3) Offline builds with substitution into production: https://lucene.apache.org/solr/guide/7_5/collections-api.htm...

4) Near-Real-Time: https://lucene.apache.org/solr/guide/7_5/near-real-time-sear...

5) Sharding specifically: https://lucene.apache.org/solr/guide/7_5/shards-and-indexing...

6) Extraction pipeline: they are doing it all together. We have:

a) pre-Solr extraction (usually done in a stand-alone client, though we do include Tika and DataImportHandler for quick start),

b) in-Solr pre-schema processing with Update Request Processors https://lucene.apache.org/solr/guide/7_5/update-request-proc...

c) Actual per-field text processing pipelines, separate for index and query (they later call the query part "query understanding"): https://lucene.apache.org/solr/guide/7_5/understanding-analy... Also, my own: http://www.solr-start.com/info/analyzers/

7) Pluggable internal index formats? Here is the latest (FST50): https://lucene.apache.org/solr/guide/7_5/the-tagger-handler....

8) Update system configuration live, over API? https://lucene.apache.org/solr/guide/7_5/configuration-apis.... https://lucene.apache.org/solr/guide/7_5/schema-api.html

9) Tolerate small failures, but abort if something is definitely not right: http://www.solr-start.com/javadoc/solr-lucene/org/apache/sol...

10) Retrieval root: https://lucene.apache.org/solr/guide/7_5/solrcloud-query-rou...

11) Retrieval leaf: That's Solr's basic shard/core

12) The inverted and forward indexes look like standard Lucene index and maybe docValues: https://lucene.apache.org/solr/guide/7_5/docvalues.html

13) Search orchestrator seems to be a couple of features on top of Solr's existing routing linked earlier. There were individual approaches/3rd-party modules doing some of these (shadow, federation, ACL). Some of this is definitely unique to Dropbox though.

14) Precision vs Recall vs Ranking is too many links, but there is a whole book on this: https://www.manning.com/books/relevant-search (mostly about Elasticsearch, but Solr has added some new features recently to make it even better)

15) BM25, we had it back in 2015: https://opensourceconnections.com/blog/2015/10/16/bm25-the-n...

16) (Future for Nautilus): Distance Based embeddings, such as Word2Vec. Commercial offering on top of Solr has it: https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-search... but I remember discussion for Solr as well

17) (Future for Nautilus): Searching images/videos/etc: https://lucidworks.com/2015/08/28/shutterstock-searches-35-m...

And a lot more (Solr Reference manual is more than 1300 pages....).
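For anyone who has not looked at what BM25 (item 15) actually computes, here is a minimal sketch in plain Python. It is not Solr's or Nautilus's code. It assumes the commonly cited defaults k1=1.2 and b=0.75 and the non-negative IDF form used by Lucene's BM25 similarity, and the function name `bm25_scores` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Illustrative sketch only: whitespace tokens, no analysis chain,
    no index structures. `bm25_scores` is a hypothetical helper name.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example).
docs = [
    "dropbox search engine architecture".split(),
    "solr search ranking with bm25".split(),
    "file manager for the gnome desktop".split(),
]
scores = bm25_scores(["search", "ranking"], docs)
```

The second document matches both query terms, and "ranking" is rarer than "search", so it should rank first; the third document matches nothing and scores zero.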

Obviously, this is a bit of a dig at Dropbox reinventing the wheel again (or perhaps this time actually using Lucene, but forgetting to attribute it so far).

But more importantly, it is a message to others that got excited by their architecture post. You can have a similar battle-tested system for yourself, for free. And if something is not perfect, you can fix it and help the rest of the world too. We are always happy to see new contributors.

Finally, if you know Apache Solr well, it is not just Dropbox you can work for, but also Lucidworks, Bloomberg, Cloudera, Alfresco, Shutterstock, Dice, CareerBuilder, and many others.

> at least they are using Tika

Why is that important? Is it advantageous versus the alternatives? (Genuinely curious)

I have been using GNU libextractor but I see Tika quite often brought up in the same breath. When I tried Tika a while back I didn't find it as good nor as fast. Has that changed?

Tika is a very active project that Solr also uses. And they rely on other good libraries.

If libextractor is sufficient for you, that's great. If you hit its limitation, try Tika.

Some use-cases I know of include

- Parsing Microsoft Office Files

- Doing OCR on images

- Running Tika as a standalone server with HTTP interface
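As a rough sketch of the last use case: a running Tika server accepts a document via HTTP PUT on its `/tika` endpoint and returns the extracted text. The snippet below assumes a server is already listening on `http://localhost:9998` (the usual default port, but check your setup); the helper names `build_tika_request` and `extract_text` are my own, not part of Tika.

```python
import urllib.request


def build_tika_request(data: bytes, base_url: str = "http://localhost:9998"):
    """Build (but do not send) a PUT request asking Tika for plain text.

    `base_url` is an assumption: adjust it to wherever your Tika server runs.
    """
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, base_url: str = "http://localhost:9998") -> str:
    """Send the document bytes to the Tika server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, base_url)) as resp:
        return resp.read().decode("utf-8")
```

Usage would be along the lines of `extract_text(open("report.docx", "rb").read())`, where the file name is just a placeholder.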

Tika is most definitely a secret component inside a lot of systems that extract content/metadata from files. So, Dropbox leveraging Tika was a good move and worth recognizing, especially given that the rest of their choices do not quite make sense (based on the limited information provided).

about 17) we are using a powerful image search plugin (commercial) which does well for us https://pixolution.org/

> So, instead of contributing to scaling (like Apple and Bloomberg and Cloudera did),

Apple has 2-4 different search stacks. The web search one is fully homegrown and closed source.

Entirely possible. Yet, this is what Apple presented in 2014:

Jessica Mallet from Apple, Inc. gave a presentation on how Apple uses SolrCloud. She briefly outlined some terms and concepts and then dug into how Apple built a multi-tenant search platform with each cluster holding around one million logical indexes. She also explained how their automation tool SolrLord uses alarms to trigger several events and can fix issues without any human interaction.


It's OK to not like Solr, which is a large and extremely old codebase.

It's OK to not like Solr. At the same time, half of the features I listed above, are quite new (SolrCloud, docValues, LTR, Config and Schema API, JSON support, etc).

And the other half - the 'old' part you may not like - are battle-tested, multiple-times speed-optimized pieces of code. Like this one: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is... Moreover, their architecture makes it very clear that they are making very similar choices; it is just that their implementation is much fresher.

Sure, there is cruft in Solr; it is an open-source product driven by user needs. Sure, it is possible that - for some use cases - Java is a disadvantage.

I would have loved that refreshed comparison to be in the article. It is very jarring that it was not. As it is, it felt that they walked away from 2015 and have not looked since. Even though their "simpler" approach did not work out and they had to throw it away.

Any idea what language they used to implement these in?

They only mention Tika and Kafka, both of which, I believe, are written in or use Java. I think the next article is supposed to give more details; I am looking forward to that.

I am very much looking forward to forthcoming posts describing the actual architecture and specifics -- this is a great high-level overview, but I hope and expect they will expand on this expose soon.

Is this based on Lucene in any way?

Cool to hear about a revamp. It confuses me why some projects reuse a well-known open source name. On Linux, the primary desktop environment (GNOME) uses Nautilus as its file manager. Dropbox even has a package for Dropbox/Nautilus integration.

Yeah whenever I hear Nautilus, the first thing that comes to mind is the file manager.

But I suppose most Dropbox users are on Mac/Windows.

What alternatives to Dropbox do Linux users have, other than running their own server (which, for a number of reasons, is suboptimal for most people)? You can't really use OneDrive365, and Dropbox offers vast support to Linux, is easy to set up, and can be used for free too. Is there a reason why Linux users wouldn't use it? Asking out of curiosity.

> Dropbox offers vast support to Linux

Dropbox only supports unencrypted ext4 filesystems on Linux, so I would not use the phrase 'vast support'.

Because there are many popular desktop options for Linux that don't use ext4 and non-encrypted as their default?

RHEL family is XFS-centric, and quite a few distros (including Ubuntu) offer encryption in the default installer.

ext4 on encrypted devices, such as dm-crypt/LUKS, is supported. What is not supported are encryption filesystems that are 'filesystem overlays', such as ecryptfs.

(Since I was using ZFS, I am still debating whether to stay with Dropbox after November's filesystem apocalypse.)

FYI, you can still use ecryptfs with Dropbox. Put the encrypted store inside your Dropbox folder and mount it outside of it. From Dropbox's point of view, you have thousands of files with gibberish names.

Every desktop Linux user could use Dropbox and still most users of Dropbox would probably be Windows and Mac.

OwnCloud supports Linux [1]. If you want a SaaS version, they have several hosting partners [2].

[1] https://owncloud.com/client/ [2] https://owncloud.org/hosting-partners/

Better go with Nextcloud than ownCloud. Nextcloud is the fork by the original developer team, and has quite a few nice improvements compared to ownCloud (e.g.: video and text chat, e2e encryption)

Didn't know about the fork. Thanks for the suggestion, will check it out.

> Is there a reason why Linux users wouldn't use it

https://www.theregister.co.uk/2018/08/14/dropbox_encrypted_l... ?

There is Syncthing which looks good on paper, but it lacks an iOS client. I'd love it if someone developed even a read-only iOS app for it.

I use SpiderOak One [1] which is a privacy focused alternative to Dropbox. I run it on Ubuntu (and previously on Debian and Arch Linux). There's no free tier like there is with Dropbox though.

[1]: https://spideroak.com/one/

For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

Yeah, but that is a different group.

"A large percentage of Linux users use Dropbox" does not equal "A large percentage of Dropbox users use Linux".

keybase's kbfs is my dropbox-style replacement. It works really well in my experience

kbfs is a network filesystem, whereas Dropbox provides file synchronization. kbfs does not work when you are offline, whereas with Dropbox the files are always available locally on your machine.

Seafile is great

I'm mostly a windows user but have some exposure to Ubuntu and had never heard of Nautilus

It's funny because the only reason I know of Nautilus on Ubuntu is from installing Dropbox. To use the Dropbox daemon on Ubuntu you have to install it and then restart Nautilus.

You never used a file manager? That was a tiny, little exposure. :-)

I think Ubuntu renames Nautilus to "Files".

GNOME does.

That's why open source projects need to register their trademarks. That's how Gnome managed to stop Groupon from ripping off the name for their own project.

Clear disconnect from the open source community, which isn't a negative signal, but also is not a positive one.

Yeah, it's enough that I am actively disinterested. Seems like the opposite of goodwill.

Yeah, this seems unnecessarily confusing to me. It's an internal project, so it's not like the name matters for advertising or anything.

Surely it's just common courtesy to not step on top of an actively developed, very popular project that is directly related to file management.

Isn't this a proprietary backend program that users will never know about? Who cares?
