

Turbocharging Solr Index Replication with BitTorrent - mcfunley
http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication/

======
briandoll
Besides being an awesome post on both the business reason for this
implementation and the details of their solution, this is a perfect model
for how companies use and contribute to open source software.

Etsy wins, Etsy customers win, and now everyone who uses Solr and ttorrent
wins too.

------
jzawodn
Excellent. I've been considering doing something similar for our Sphinx
indexes at Craigslist. Glad to see it works as well as I might have hoped.

------
duggan
I've no idea what your constraints are, but splitting the index into more
manageable chunks, writing to multiple masters and reading from n slaves off
each is an approach that has worked quite well for me (40 million plus
records, big lumps of user generated content, total index about 100 GB iirc).

You sacrifice on the accuracy of IDF [1], but gain some robustness as a result too.

If the BitTorrent approach doesn't work too well, you might consider something
similar. I've jotted down a few reading resources for scaling Solr [2], but I
should probably do a write-up of the architecture I built for Boards.ie.
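
The split described above (several masters for writes, n slaves per master for reads) can be sketched roughly as follows. This is only a minimal illustration, not the Boards.ie setup: the host names, the CRC32 routing rule, and every function name are assumptions made up for the example.

```python
import random
import zlib

# Hypothetical layout: every host name below is made up for illustration.
# Each shard has one master (takes writes) and several slaves (serve reads).
SHARDS = [
    {"master": "master0:8983", "slaves": ["slave0a:8983", "slave0b:8983"]},
    {"master": "master1:8983", "slaves": ["slave1a:8983", "slave1b:8983"]},
    {"master": "master2:8983", "slaves": ["slave2a:8983", "slave2b:8983"]},
]


def shard_for(doc_id, num_shards=len(SHARDS)):
    """Map a document id to a shard number with a stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def master_for(doc_id):
    """Return the master that should index this document."""
    return SHARDS[shard_for(doc_id)]["master"]


def query_targets(pick=random.choice):
    """Pick one slave per shard; a query has to cover every shard."""
    return [pick(shard["slaves"]) for shard in SHARDS]


def shards_param(hosts, path="/solr"):
    """Join the chosen hosts into a comma-separated host/path list."""
    return ",".join(host + path for host in hosts)
```

The string built by `shards_param(query_targets())` has the shape that the `shards` request parameter on the DistributedSearch page [1] expects. Each shard scores against its own term statistics, which is where the IDF accuracy trade-off comes from.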

[1] <http://wiki.apache.org/solr/DistributedSearch>

[2] <http://rossduggan.ie/blog/technology/reading-list-for-scaling-solr/>

------
jamesclarke
Great article and insight.

FYI mrsync is confusingly named, since it doesn’t actually use the rsync
algorithm; it performs multicast file transfer. (The whole file is sent, not
just the deltas.)
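
To make the "whole file vs. deltas" distinction concrete, here is a deliberately simplified sketch. It compares fixed-offset blocks by hash, which is not the real rsync algorithm (rsync uses a rolling checksum so it can match blocks at any offset), and the function names and block size are assumptions for illustration only.

```python
import hashlib


def block_digests(data, block_size):
    """Hash each fixed-size block of a byte string."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=4096):
    """Return indexes of blocks in `new` that differ from `old` at the same offset.

    Blocks past the end of `old` count as changed.
    """
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]


old = b"a" * 8 + b"b" * 8 + b"c" * 8
new = b"a" * 8 + b"X" * 8 + b"c" * 8
delta = changed_blocks(old, new, block_size=8)
print(delta)  # [1]: only the middle block would need to be sent
```

A whole-file transfer would send all 24 bytes here, while a delta transfer would send only the one 8-byte block that changed.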

I’ve always thought it would be cool if someone added rsync capabilities on
top of BitTorrent. I’m not sure Etsy would see any benefit, but I bet Twitter,
with their murder server deploy, would.

------
joevandyk
How does postgresql's search compare to solr?

(With postgresql, I think replicating the search index to read-only nodes is a
solved problem.)

~~~
mcfunley
Well, before Etsy switched to Solr we were using postgres tsearch2, and by the
end of that, search response times were in the 90-second range. And at the time
replication wasn't part of postgres, so we had a hacked-up, buggy in-house
replicator.

~~~
mcfunley
And the results were god-awful.

~~~
mcfunley
And we replaced sixteen extremely beefy postgres slaves with four commodity
solr boxes with plenty of headroom to spare.

(Of course we now have more than four solr slaves.)

------
karussell
With ElasticSearch you wouldn't have that problem. It uses push instead of pull
replication, which is also (near) real-time aware. Even for peta-whatever
bytes.

~~~
karussell
Here is the video where Shay explained the difference
<http://vimeo.com/26710663> (from 18min on)

------
joshu
ha, kellan is standing in my office door right now

