

Displacing MySQL with Solr - mattdeboard
http://mattdeboard.net/2011/12/29/displacing-mysql-with-solr/

======
andrewingram
I'd like to mention the need for caution when using Haystack with Solr for
large indexes and relatively high rates of change. Haystack's default
behaviour is geared towards getting up and running with search quickly; I've
had to tweak a lot of the Solr backend to keep it from falling over with our
index of 9m products.

Solr is incredibly easy to work with directly, so if you know you're going to
be using it as a core part of your site functionality, you might want to
consider skipping Haystack entirely (the 2.0 version is a big improvement but
it's not released yet).

A few examples:

- By default, Haystack commits on every update. Say you're indexing in
batches of 1000: if you've optimised your indexing database queries well, you
might be able to issue an update every 10 seconds or so. The problem is that
issuing a commit against a 9m-document index every 10 seconds (more
frequently if you're using multiple processes or threads) is going to eat up
memory and disk space fast. In that scenario, make sure you disable the
default commit behaviour and issue manual commits instead (we're doing it
every 200k documents).
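As a rough sketch of that batching strategy (the function, field names, and
constants here are mine, not from the post; it assumes a pysolr-style client
whose `add` takes a `commit` flag and which has a `commit()` method):

```python
COMMIT_EVERY = 200000  # one manual commit per 200k docs, as described above
BATCH_SIZE = 1000

def index_products(solr, products, batch_size=BATCH_SIZE,
                   commit_every=COMMIT_EVERY):
    """Index document dicts in batches, committing manually.

    `solr` is any pysolr-style client: add(docs, commit=...) and commit().
    Passing commit=False suppresses the per-update commit that would
    otherwise hammer a large index.
    """
    indexed = 0
    batch = []
    for doc in products:
        batch.append(doc)
        if len(batch) == batch_size:
            solr.add(batch, commit=False)  # no automatic commit
            indexed += len(batch)
            batch = []
            if indexed % commit_every == 0:
                solr.commit()  # periodic manual commit
    if batch:
        solr.add(batch, commit=False)
    solr.commit()  # final commit makes the last documents searchable
```

Usage would be something like `index_products(pysolr.Solr(url), docs)`; the
point is simply that commits are decoupled from updates.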

- Faceting by author was a key feature, and we had about 500k of them.
Haystack's faceting functionality didn't provide any mechanism to limit the
number of facet choices returned, or to restrict them to just those with a
count > 0. It was an easy fix, but you might miss it if you're not careful.
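For reference, the raw Solr parameters that control this are facet.limit and
facet.mincount. A small sketch (the field name and default values are
illustrative, not from the post):

```python
def author_facet_params(limit=20, mincount=1):
    """Solr query params that cap facet choices and drop zero-count ones.

    facet.limit and facet.mincount are standard Solr faceting parameters;
    the field name 'author' and the defaults here are illustrative.
    """
    return {
        "facet": "true",
        "facet.field": "author",
        "facet.limit": str(limit),        # return at most `limit` values
        "facet.mincount": str(mincount),  # skip values with count < mincount
    }
```

With a pysolr-style client this could be passed through as extra query
parameters, e.g. `solr.search("django", **author_facet_params())`.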

- We only had a few fields we were interested in searching over, but about
50 that were candidates for faceting. The Solr schema generated by Haystack
assumes you're going to want to search over every field, which adds a lot of
unnecessary overhead. Make sure you look over every field, think about how
you're going to use it, and turn off as much as possible. From what I can
recall, setting omitNorms and omitTermFreqAndPositions appropriately ended up
saving us 2GB of otherwise wasted RAM. Some wise guy on my project decided to
try to be helpful and ran the manage.py command for generating the schema
before they committed :)
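To sketch the kind of schema.xml tweak involved (the field name and type are
hypothetical): a field used only for faceting still needs to be indexed, but
it doesn't need length norms or term frequency/position data.

```xml
<!-- Hypothetical facet-only field: indexed so it can be faceted on, but
     with norms and term frequency/position data switched off to save RAM. -->
<field name="author_exact" type="string" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true"/>
```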

Haystack is an excellent piece of work, but on this project (an e-commerce
site with 9m products) it ended up costing more time than it saved.
Admittedly, this was the first time I'd worked with a catalogue of this
scale, so I shoulder a lot of the blame. I just wanted to highlight some of
the problems I've faced.

~~~
mattdeboard
Yeah I heavily modified Haystack and pysolr to make it work for us. Standard
disclaimers about leaky abstractions apply.

------
simonw
For <http://lanyrd.com/> we treat Solr as a denormalisation layer for our
MySQL database - there are plenty of queries that are pretty gnarly in SQL
which Solr handles with ease.

A good example is our <http://lanyrd.com/calendar/> page, which shows events
your Twitter contacts are attending in the future - this was horrible as a
SQL query but works great as a Solr search, especially since scaling Solr
read-slaves horizontally is pretty simple.
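A sketch of how that kind of query can look on the Solr side (the field
names attendee_ids and start_date are my guesses for illustration, not
Lanyrd's actual schema):

```python
def contacts_calendar_filters(contact_ids):
    """Build Solr fq clauses for 'future events my contacts attend'.

    Field names are hypothetical. Solr's range syntax [NOW TO *] matches
    everything from the current time onwards.
    """
    ids = " OR ".join(contact_ids)
    return [
        "attendee_ids:(%s)" % ids,  # any event attended by a contact
        "start_date:[NOW TO *]",    # only events in the future
    ]
```

With a pysolr-style client this might be used as
`solr.search("*:*", fq=contacts_calendar_filters(ids))` - the join against
the contact list happens inside Solr rather than in SQL.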

Holding denormalised data in a separate data store entirely also makes it
easier to spot bugs where denormalised data isn't being correctly updated.

~~~
mattdeboard
We do the same wrt denormalisation, though it was not the original intent of
the refactor. Nearly every item on our job sites started life as a "pretty
gnarly" SQL query, but now lives on as a highly performant Solr query.

(Admittedly we don't have nearly as many widgets going on as you do!)

------
rb2k_
I currently run a setup where I interact with CouchDB and have ElasticSearch
use the CouchDB river plugin [0] to maintain a near-exact replica of the
data for ad-hoc queries.

There have been times when I thought about just interacting with ES directly
rather than going through the CouchDB layer. I'm still not 100% sure if I
should or not.

I don't need any consistency guarantees for most of my data and I'm ok with a
few records going missing in case of a crash. I like CouchDB, but I have to
use the CouchBase single server dev preview to get snappy compression and the
copy on write approach to data storage makes the on-disk files grow pretty
large over time. Compaction currently takes several hours :-/

[0] <https://github.com/elasticsearch/elasticsearch-river-couchdb/blob/master/README.md>
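For context, registering the river amounts to PUTting a `_meta` document
like the following into the `_river` index (host, port, and db name here are
illustrative; the linked README has the full set of options):

```json
{
  "type": "couchdb",
  "couchdb": {
    "host": "localhost",
    "port": 5984,
    "db": "my_db"
  },
  "index": {
    "index": "my_db",
    "type": "my_db"
  }
}
```

The plugin then follows CouchDB's `_changes` feed and keeps the
ElasticSearch index in sync.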

------
justincormack
It is not actually novel, as he claims; it is a fairly common use case. Some
sites build their API off a "search" layer too (eg The Guardian), and many
of the Java CMS systems do this for many or all queries.

~~~
mattdeboard
It may be "common" (debatable and highly dependent on frame of reference) but
I was mostly referring to novelty in relation to how Solr is
marketed/billed/normally used.

