Displacing MySQL with Solr (mattdeboard.net)
26 points by mattdeboard on Dec 31, 2011 | 7 comments



I'd like to mention the need for caution when using Haystack with Solr for large indexes with relatively high rates of change. The default behaviour of Haystack is geared towards getting up and running with search quickly; I've had to tweak a lot of the Solr backend to keep it from falling over with our index of 9m products.

Solr is incredibly easy to work with directly, so if you know you're going to be using it as a core part of your site functionality, you might want to consider skipping Haystack entirely (the 2.0 version is a big improvement but it's not released yet).
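For illustration, talking to Solr directly with pysolr is only a few lines. A minimal sketch, assuming a hypothetical "products" core and fields; adjust to your own schema:

    import pysolr

    # Hypothetical core name and fields; adjust to your schema.
    solr = pysolr.Solr("http://localhost:8983/solr/products", timeout=10)

    # Index a couple of documents.
    solr.add([
        {"id": "1", "title": "Red widget", "price": 9.99},
        {"id": "2", "title": "Blue widget", "price": 12.50},
    ])

    # Query them back.
    for doc in solr.search("title:widget", rows=10):
        print(doc["id"], doc["title"])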

A few examples of the Haystack issues we hit:

- The default behaviour of Haystack is to commit on every update. Say you're doing batches of 1000: if you've optimised your indexing database queries well, you might be able to issue an update every 10 seconds or so. The problem is that issuing a commit against an index of 9m documents every 10 seconds (more frequently if you're using multiple processes or threads) is going to eat up memory and disk space fast. If you're in this scenario, make sure you disable the default commit behaviour and issue manual commits instead (we do one every 200k documents); see the first sketch after this list.

- Faceting by author was a key feature, and we had about 500k authors. Haystack's faceting functionality didn't provide any mechanism to limit the number of facet choices returned, or to restrict them to those with a count greater than zero. It was an easy fix, but one you might miss if you're not careful; see the faceting example after this list.

- We only had a few fields we were interested in searching over, but about 50 that were candidates for faceting. The Solr schema generated by Haystack assumes you're going to want to search over every field, which means a lot of unnecessary overhead. Make sure you look over every field, think about how you're going to use it, and turn off as much as possible; see the schema snippet after this list. From what I can recall, using omitNorms and omitTermFreqAndPositions appropriately ended up saving us 2GB of otherwise wasted RAM. (Some wise guy on my project decided to try to be helpful and ran the manage.py command for generating the schema before they committed :))
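To illustrate the commit point: a minimal sketch of batched indexing with pysolr, assuming a hypothetical "products" core and hypothetical iter_batches/build_doc helpers:

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/products")

    BATCH_SIZE = 1000
    COMMIT_EVERY = 200000  # manual commit every 200k documents

    indexed = 0
    for batch in iter_batches(all_products, BATCH_SIZE):  # iter_batches/all_products are hypothetical
        # commit=False stops pysolr asking Solr to commit on every update request
        solr.add([build_doc(p) for p in batch], commit=False)  # build_doc is hypothetical
        indexed += len(batch)
        if indexed % COMMIT_EVERY == 0:
            solr.commit()

    solr.commit()  # pick up any remainder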
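For the faceting point, the raw Solr parameters express what Haystack's default faceting didn't let us say. A sketch, again with hypothetical core and field names:

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/products")

    # facet.limit caps the number of facet values returned;
    # facet.mincount=1 drops values with a zero count.
    results = solr.search(
        "category:books",
        facet="true",
        **{
            "facet.field": "author",
            "facet.limit": 50,
            "facet.mincount": 1,
        }
    )
    # pysolr exposes facet counts as a flat [value, count, value, count, ...] list
    author_counts = results.facets["facet_fields"]["author"]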
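And for the schema point, the relevant knobs live on the field definitions in schema.xml. A hand-written example rather than Haystack's generated schema; the field names are illustrative:

    <!-- Facet-only field: never free-text searched, so norms and
         term positions are wasted space. -->
    <field name="author" type="string" indexed="true" stored="true"
           omitNorms="true" omitTermFreqAndPositions="true"/>

    <!-- A genuinely searched field keeps norms and positions for
         scoring and phrase queries. -->
    <field name="title" type="text_general" indexed="true" stored="true"/>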

Haystack is an excellent piece of work, but on this project (an e-commerce site with 9m products) it ended up costing more time than it saved. Admittedly, this was the first time I'd worked with a catalogue of this scale, so I shoulder a lot of the blame; I just wanted to highlight some of the problems I've faced.


Yeah I heavily modified Haystack and pysolr to make it work for us. Standard disclaimers about leaky abstractions apply.


For http://lanyrd.com/ we treat Solr as a denormalisation layer for our MySQL database - there are plenty of queries that are pretty gnarly in SQL which Solr handles with ease.

A good example is our http://lanyrd.com/calendar/ page, which shows events your Twitter contacts are attending in the future - this was horrible as a SQL query but works great as a Solr search, especially as scaling Solr read-slaves horizontally is pretty simple.
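As a rough idea of what that kind of query looks like on the Solr side (not our actual schema - the core name and the attendee_twitter_ids/start_date fields here are just illustrative):

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/events")  # hypothetical core

    contact_ids = ["123", "456", "789"]  # the signed-in user's Twitter contacts

    results = solr.search(
        "*:*",
        fq=[
            "attendee_twitter_ids:(%s)" % " OR ".join(contact_ids),
            "start_date:[NOW TO *]",  # future events only
        ],
        sort="start_date asc",
        rows=50,
    )
    for event in results:
        print(event["title"], event["start_date"])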

Holding denormalised data in a separate data store entirely also makes it easier to spot bugs where denormalised data isn't being correctly updated.


We do the same with regard to denormalization, though it was not the original intent of the refactor. Nearly every item on our job sites started life as a "pretty gnarly" SQL query at the beginning of the project, but now lives on as a highly performant Solr query.

(Admittedly we don't have nearly as many widgets going on as you do!)


I currently run a setup where I interact with CouchDB and have ElasticSearch use the couchdb river plugin [0] to pretty much replicate an exact copy of the data for ad-hoc queries.
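Registering the river is just a document PUT to ElasticSearch. Roughly like this (a sketch in Python; the "products" database/index name is made up, and the host/port assume a local setup):

    import json
    import requests  # assumes the requests library is available

    river_config = {
        "type": "couchdb",
        "couchdb": {"host": "localhost", "port": 5984, "db": "products"},
        "index": {"index": "products", "bulk_size": "100", "bulk_timeout": "10ms"},
    }

    resp = requests.put(
        "http://localhost:9200/_river/products/_meta",
        data=json.dumps(river_config),
    )
    resp.raise_for_status()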

There have been times when I thought about just interacting with ES directly rather than going through the CouchDB layer. I'm still not 100% sure if I should or not.

I don't need any consistency guarantees for most of my data and I'm OK with a few records going missing in case of a crash. I like CouchDB, but I have to use the Couchbase single-server dev preview to get snappy compression, and the copy-on-write approach to data storage makes the on-disk files grow pretty large over time. Compaction currently takes several hours :-/

[0] https://github.com/elasticsearch/elasticsearch-river-couchdb...


It is not actually novel, despite what he says; it is a fairly common use case. Some sites build their API off a "search" layer too (e.g. The Guardian), and many of the Java CMS systems do this for many or all queries.


It may be "common" (debatable and highly dependent on frame of reference) but I was mostly referring to novelty in relation to how Solr is marketed/billed/normally used.



