
FastMail's Email Search Architecture - alfiedotwtf
http://blog.fastmail.com/2014/12/01/email-search-system/
======
ams6110
If you want Xapian search on a local maildir, I highly recommend notmuch[1].
Adding new mail and updating the index can take noticeable time, but searching
is super fast, it allows easy custom tagging, and search results are better
than gmail in my experience.

I use it from the emacs notmuch mode.

[1] [http://notmuchmail.org/](http://notmuchmail.org/)

~~~
darklajid
I'm always a bit jealous when I see this setup (or mu).

But I don't use emacs, and that seems to lead to subpar support and crazy
hacks to get something up and working, unfortunately.

~~~
danieldk
_But I don't use emacs,_

Me neither, but mutt-kz had built-in support for notmuch. Not the hacky kind
that calls 'notmuch', but it actually links against libnotmuch.

[http://kzak.redcrew.org/doku.php?id=mutt:start](http://kzak.redcrew.org/doku.php?id=mutt:start)

~~~
darklajid
That looks really neat. You ruined my day (in terms of productivity), but
might've given me a nice early Christmas present. Thank you!

------
brongondwana
An obvious question that I didn't hit on in the blog is "what about host
crashes"? The nice thing is, every index knows exactly which messages it
covers - and we can quite quickly (within an hour or so for an entire server)
scan all mailboxes and index the missing messages - it's more efficient than
doing it in real time, because you are often indexing multiple messages in
the same mailbox.

Once the indexes are up to date, we can switch back to being masters again. We
index on all the replicas independently so that they are always ready.
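
As a rough illustration of the catch-up pass described above, here is a minimal Python sketch. It is not FastMail's or Cyrus's actual code: the names `find_missing`, `catch_up`, and `index_batch` are hypothetical, and the "index" is modelled as a plain set of (mailbox, message ID) pairs. It only shows the idea of comparing what is on disk with what the index covers, then indexing the gaps one mailbox at a time so each mailbox is handled in a single batch.

```python
from collections import defaultdict


def find_missing(mailboxes, indexed):
    """Group unindexed message IDs by mailbox.

    mailboxes: dict of mailbox name -> iterable of message IDs on disk
    indexed:   set of (mailbox, message ID) pairs the index already covers
    """
    missing = defaultdict(list)
    for mailbox, message_ids in mailboxes.items():
        for message_id in message_ids:
            if (mailbox, message_id) not in indexed:
                missing[mailbox].append(message_id)
    return dict(missing)


def catch_up(mailboxes, indexed, index_batch):
    """Index every missing message, one batch per mailbox.

    index_batch(mailbox, message_ids) is a hypothetical stand-in for the
    real indexer. Returns the number of batches submitted.
    """
    batches = 0
    for mailbox, message_ids in find_missing(mailboxes, indexed).items():
        index_batch(mailbox, message_ids)
        # Record the new coverage so a second pass finds nothing to do.
        indexed.update((mailbox, m) for m in message_ids)
        batches += 1
    return batches
```

Running `catch_up` a second time with the same inputs submits no batches, which mirrors the point that the index itself records what it covers.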

~~~
brongondwana
(during a clean shutdown, we copy all the indexes over to the SSD, and they
get compacted to data in the next day's compact run)

------
kolev
Although in the past I've implemented Xapian [1] over Sphinx [2], Sphinx today
seems to be much better, but both Xapian and Sphinx are under-appreciated
compared to Solr [3] and Elasticsearch [4].

[1] [http://xapian.org/](http://xapian.org/)

[2] [http://sphinxsearch.com/](http://sphinxsearch.com/)

[3] [http://lucene.apache.org/solr/](http://lucene.apache.org/solr/)

[4] [http://www.elasticsearch.org/](http://www.elasticsearch.org/)

~~~
brongondwana
All of those last three are awesome if you either put all the user's search
indexes into a single engine, or have a shit-ton of memory.

With sphinx, we found we had to start and stop daemons all over the place to
manage memory, and it was just unworkable. It was either that or run one big
index per machine, but there are operational reasons I'd rather not be doing
that. We try to keep everything user-sized.

That said, there's still stub Sphinx code in there. Both engines have GPL
licensing on them, which means compiling against Cyrus (BSD licensed) causes a
non-BSD licensed end result. Not an issue for us, since we publish all our
Cyrus code anyway.

There is talk of building an Elasticsearch backend into Cyrus as well - feel
free, it's all open source. We'd definitely take the patch if it's good code
(he says with his Cyrus Project Board Member hat on rather than his FastMail
Director hat on)

~~~
gregnbanks
Sphinx also had a bunch of really bad bugs around server startup and shutdown,
and some ugly code. I ended up rewriting some of their pthreads code and
submitting patches. I have no idea if they ever used them or not, because their
development tree is internal and not visible outside their company.

Solr (well, Lucene) has awesome natural language stemming abilities, many more
languages supported than Xapian. In particular it's much smarter than Xapian
about Chinese. But a) the memory requirement makes running multiple shards on
the same machine hard, and b) nobody in the company wanted to learn how to
handle Java operationally.

EDIT: since 2012 Sphinx appear to have made a public mirror of their internal
tree at
[https://code.google.com/p/sphinxsearch/source/checkout](https://code.google.com/p/sphinxsearch/source/checkout)

------
chatman
Apache Solr is best suited for such applications. AOL Mail uses Solr to power
search for all users [0].

[0] - [http://lucidworks.com/blog/podcast-solr-at-scale-at-aol/](http://lucidworks.com/blog/podcast-solr-at-scale-at-aol/)

~~~
frankwiles
That advice is a bit dated, honestly. Solr is fine, but the current "best
practice", if you will, is to use Elasticsearch.

Not saying you can't do it with Solr or that Solr doesn't scale, it does.
You'll just have an easier and more fun time doing it with ES.

A couple of related examples:

[http://highscalability.com/blog/2014/1/6/how-hipchat-stores-...](http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html)

[http://exploringelasticsearch.com/github_interview.html](http://exploringelasticsearch.com/github_interview.html)

------
hendzen
Seems like they independently invented a sort of log structured merge tree.

~~~
lobster_johnson
LSM-type storage is frequently used in IR; it's not a new technique. I
accidentally "invented" it a few years ago before I realized it had a name.
Even Lucene 1.x used a variation of this for its "segment" files (which it
still does, at least up to 3.x, afaik).

The reason is that you want to keep the inverted indexes sorted on disk, but
you don't want to sort the entire index every time you update. So you create
one mini-index per update and merge them lazily when you get too many of them.
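
To make the "one mini-index per update, merge lazily" idea concrete, here is a toy Python sketch. It is an illustration only, not how Lucene, Xapian, or FastMail's setup is implemented: the class name `TinyLSMIndex`, the `max_segments` threshold, and whitespace tokenisation are all assumptions made for the example. Each update sorts only its own batch of postings, searches consult every segment, and the full merge happens only once too many segments pile up.

```python
import bisect
import heapq


class TinyLSMIndex:
    """Toy inverted index: one sorted segment per update, merged lazily."""

    def __init__(self, max_segments=4):
        self.max_segments = max_segments
        self.segments = []  # each segment: sorted list of (term, doc_id)

    def add_batch(self, docs):
        """docs: dict of doc_id -> text. Only this batch gets sorted."""
        postings = sorted({(term, doc_id)
                           for doc_id, text in docs.items()
                           for term in text.lower().split()})
        self.segments.append(postings)
        if len(self.segments) > self.max_segments:
            self.merge()

    def merge(self):
        """Merge the already-sorted segments into a single sorted segment."""
        merged = []
        for posting in heapq.merge(*self.segments):
            if not merged or merged[-1] != posting:  # drop duplicates
                merged.append(posting)
        self.segments = [merged]

    def search(self, term):
        """Return sorted doc IDs containing term, checking every segment."""
        term = term.lower()
        hits = set()
        for segment in self.segments:
            # (term,) sorts before every (term, doc_id), so this finds
            # the first posting for the term, if there is one.
            i = bisect.bisect_left(segment, (term,))
            while i < len(segment) and segment[i][0] == term:
                hits.add(segment[i][1])
                i += 1
        return sorted(hits)
```

The trade-off is visible in `search`: lookups get slower as segments accumulate, which is why the merge step exists at all.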

------
mikebo
During the compaction phase, when a new temp db is installed while the old is
being compacted, isn't there a window where messages in the old temp db are
not searchable?

------
cobookman
Does anyone know why they chose Xapian over elasticsearch?

~~~
ams6110
For me it would be the use of Java. I've just had too much bad luck with it.
Admittedly that's not really very objective reasoning.

~~~
alfiedotwtf
From memory (sorry, it was a while back) we actually started with
Elasticsearch, but it was way too heavy, so we looked for an alternative solution.
Even with a single user it was consuming way too much memory.

~~~
robn_fastmail
Sorry Alfie, your memory fails. We started with Sphinx.

We use Elasticsearch elsewhere (ELK stack), but not for mail search.

------
iancarroll
Fastmail is a great service, however iCloud completely fails with their web
client. I have >50k emails, probably closer to 100k and searching simply times
out. Never has worked for me. It's sad, really...

~~~
brongondwana
Sorry, I don't understand what you mean. iCloud is a service, and FastMail's
web client doesn't talk to iCloud, it only talks to FastMail's servers - at
least for now.

~~~
iancarroll
I was contrasting the iCloud interface with FastMail's; an implication that
they were connected wasn't intended.

~~~
brongondwana
Ahh, ok. I misread your post then, sorry.

