In the meantime, Solr has implemented most of the features they describe in their architecture document.
1) General scaling: https://lucene.apache.org/solr/guide/7_5/introduction-to-sca... (using ZooKeeper and SolrCloud)
2) Search Ranking and click-data training: https://lucene.apache.org/solr/guide/7_5/learning-to-rank.ht... (Contributed by Bloomberg)
3) Offline builds with substitution into production: https://lucene.apache.org/solr/guide/7_5/collections-api.htm...
4) Near-Real-Time: https://lucene.apache.org/solr/guide/7_5/near-real-time-sear...
5) Sharding specifically: https://lucene.apache.org/solr/guide/7_5/shards-and-indexing...
6) Extraction pipeline: they are doing it all together. We have:
a) pre-Solr extraction (usually done in a stand-alone client, though we do include Tika and DataImportHandler for quick start),
b) in-Solr pre-schema processing with Update Request Processors https://lucene.apache.org/solr/guide/7_5/update-request-proc...
c) Actual per-field text processing pipelines, separate for index and query (they later call the query part "query understanding"): https://lucene.apache.org/solr/guide/7_5/understanding-analy... Also, my own: http://www.solr-start.com/info/analyzers/
7) Pluggable internal index formats? Here is the latest (FST50): https://lucene.apache.org/solr/guide/7_5/the-tagger-handler....
8) Update system configuration live, over API? https://lucene.apache.org/solr/guide/7_5/configuration-apis.... https://lucene.apache.org/solr/guide/7_5/schema-api.html
9) Tolerate small failures, but abort if something is definitely not right: http://www.solr-start.com/javadoc/solr-lucene/org/apache/sol...
10) Retrieval root: https://lucene.apache.org/solr/guide/7_5/solrcloud-query-rou...
11) Retrieval leaf: That's Solr's basic shard/core
12) The inverted and forward indexes look like a standard Lucene index, and maybe docValues: https://lucene.apache.org/solr/guide/7_5/docvalues.html
13) Search orchestrator seems to be a couple of features on top of Solr's existing routing linked earlier. There were individual approaches/3rd-party modules doing some of these (shadow, federation, ACL). Some of this is definitely unique to Dropbox though.
14) Precision vs Recall vs Ranking would take too many links, but there is a whole book on this: https://www.manning.com/books/relevant-search (mostly about Elasticsearch, but Solr has added some new features recently to make it even better)
15) BM25, we had it back in 2015: https://opensourceconnections.com/blog/2015/10/16/bm25-the-n...
16) (Future for Nautilus): Distance-based embeddings, such as Word2Vec. A commercial offering on top of Solr has it: https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-search... but I remember a discussion about adding it to Solr as well
17) (Future for Nautilus): Searching images/videos/etc: https://lucidworks.com/2015/08/28/shutterstock-searches-35-m...
And a lot more (the Solr Reference Guide is more than 1300 pages...).
Obviously, this is a bit of a dig at Dropbox reinventing the wheel again (or perhaps this time actually using Lucene, but forgetting to attribute it so far).
But more importantly, it is a message to others who got excited by their architecture post. You can have a similar battle-tested system for yourself, for free. And if something is not perfect, you can fix it and help the rest of the world too. We are always happy to see new contributors.
Finally, if you know Apache Solr well, it is not just Dropbox you can work for, but also Lucidworks, Bloomberg, Cloudera, Alfresco, Shutterstock, Dice, CareerBuilder, and many others.
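For anyone wondering what item 15 (BM25) actually computes, here is a minimal sketch of the textbook formula in Python. It is not Lucene's or Solr's implementation; the function name, the pre-tokenized input, and the toy documents in the usage note are my own assumptions for illustration. The k1=1.2 and b=0.75 defaults and the always-positive IDF variant are the commonly cited ones.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    `docs` is a non-empty list of token lists (hypothetical input shape).
    Returns one float per document, in the same order.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # IDF with the "+1 inside the log" form, which never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Usage, with made-up documents: `bm25_scores(["solr"], [["solr", "search"], ["dropbox", "sync", "files"], ["solr", "solr", "cloud"]])` gives zero for the middle document and the highest score to the last one, since the repeated term outweighs its slightly longer length.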
Why is that important? Is it advantageous versus the alternatives? (Genuinely curious)
I have been using GNU libextractor but I see Tika quite often brought up in the same breath. When I tried Tika a while back I didn't find it as good nor as fast. Has that changed?
If libextractor is sufficient for you, that's great. If you hit its limitations, try Tika.
Some use cases I know of include:
- Parsing Microsoft Office Files
- Doing OCR on images
- Running Tika as a standalone server with HTTP interface
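To make the last use case concrete, here is a rough Python sketch of talking to a Tika Server instance over HTTP. It assumes a server is already running locally on Tika's default port 9998 and uses its `PUT /tika` endpoint with `Accept: text/plain`; the helper names and the file name in the usage note are hypothetical, and error handling is left out.

```python
import urllib.request


def build_tika_request(data, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request asking Tika Server for plain text.

    `data` is the raw bytes of the file to parse. `base_url` is an assumption:
    a locally running Tika Server on its default port.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data, base_url="http://localhost:9998", timeout=30):
    """Send the file bytes to Tika Server and return the extracted text."""
    req = build_tika_request(data, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With a server up, something like `extract_text(open("report.docx", "rb").read())` should return the document body as text. Splitting request construction from sending keeps the first function usable without a server.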
Tika is most definitely a secret component inside a lot of systems that extract content/metadata from files. So Dropbox leveraging Tika was a good move and worth recognizing, especially given that the rest of their choices do not quite make sense (based on the limited information provided).
Apple has 2-4 different search stacks. The web search one is fully homegrown and closed source.
Jessica Mallet from Apple, Inc. gave a presentation on how Apple uses SolrCloud. She briefly outlined some terms and concepts and then dug into how Apple built a multi-tenant search platform with each cluster holding around one million logical indexes. She also explained how their automation tool SolrLord uses alarms to trigger several events and can fix issues without any human interaction.
And the other half - the 'old' part you may not like - is battle-tested code that has been speed-optimized multiple times. Like this one: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is... Moreover, their architecture makes it very clear that they are making very similar choices; it is just that their implementation is much fresher.
Sure, there is cruft in Solr; it is an open-source product driven by user needs. Sure, it is possible that - for some use cases - Java is a disadvantage.
I would have loved to see that refreshed comparison in the article. It is very jarring that it was not there. As it is, it felt as if they walked away in 2015 and have not looked since, even though their "simpler" approach did not work out and they had to throw it away.
But I suppose most Dropbox users are on Mac/Windows.
Dropbox only supports unencrypted ext4 filesystems on Linux, so I would not use the phrase 'vast support'.
(Since I was using ZFS, I am still debating whether to stay with Dropbox after November's filesystem apocalypse.)
"A large percentage of Linux users use Dropbox" does not equal "A large percentage of Dropbox users use Linux".
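A quick sketch of why the two statements differ, in Python. All three counts are invented purely for illustration and are not real Dropbox or Linux usage figures.

```python
def share(part, whole):
    """Fraction of `whole` that `part` represents."""
    return part / whole


# Hypothetical numbers, chosen only to show the asymmetry.
linux_users = 1_000
linux_users_on_dropbox = 600
dropbox_users = 100_000

# Share of Linux users who use Dropbox: large.
p_dropbox_given_linux = share(linux_users_on_dropbox, linux_users)

# Share of Dropbox users who use Linux: small, because the denominator
# is the whole Dropbox user base rather than the Linux population.
p_linux_given_dropbox = share(linux_users_on_dropbox, dropbox_users)
```

With these made-up counts, 60% of Linux users are on Dropbox while well under 1% of Dropbox users are on Linux. Same numerator, very different denominators.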
Surely it's just common courtesy to not step on top of an actively developed, very popular project that is directly related to file management.