I'd like to mention the need for caution when using Haystack with Solr for large indexes and relatively high rates of change. Haystack's default behaviour is geared towards getting up and running with search quickly; I've had to tweak a lot of the Solr backend to keep it from falling over with our index of 9m products.

Solr is incredibly easy to work with directly, so if you know it's going to be a core part of your site's functionality, you might want to consider skipping Haystack entirely (the 2.0 version is a big improvement, but it's not released yet).

A few examples:

- Haystack's default behaviour is to commit on every update. Say you're doing batches of 1,000: if you've optimised your indexing database queries well, you might be able to issue an update every 10 seconds or so (more frequently if you're using multiple processes or threads). The problem is that committing against a 9m-document index that often is going to eat up memory and disk space fast. If you're in this scenario, make sure you disable the default commit behaviour and issue manual commits instead (we're doing it every 200k documents); see the first sketch after this list.

- Faceting by author was a key feature, and we had about 500k authors. Haystack's faceting functionality didn't provide any mechanism to limit the number of facet choices returned, or to restrict them to those with a count greater than zero. It was an easy fix (see the second sketch below), but you might miss it if you're not careful.

- We only had a few fields we were interested in searching over, but about 50 that were candidates for faceting. The Solr schema generated by Haystack assumes you're going to want to search over every field, which adds a lot of unnecessary overhead. Look over every field, think about how you're actually going to use it, and turn off as much as possible (see the schema snippet below). From what I can recall, using omitNorms and omitTermFreqAndPositions appropriately ended up saving us 2GB of otherwise wasted RAM. Some wise guy on my project decided to try to be helpful and re-ran the manage.py schema-generation command before committing, which clobbered our hand-tuned schema :)
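
For the commit issue, here's a minimal sketch of the approach: override Haystack's Solr backend so updates go through pysolr with commit=False, then commit on your own schedule. The class and method names assume Haystack 2.x's solr_backend module; on 1.x the names differ, but the idea is the same.

    # myapp/backends.py -- sketch, assuming Haystack 2.x's Solr backend.
    from haystack.backends.solr_backend import SolrEngine, SolrSearchBackend

    class BatchCommitSolrBackend(SolrSearchBackend):
        def update(self, index, iterable, commit=False):
            # Default commit to False so pysolr doesn't commit per batch.
            super(BatchCommitSolrBackend, self).update(index, iterable, commit=commit)

    class BatchCommitSolrEngine(SolrEngine):
        backend = BatchCommitSolrBackend

Point HAYSTACK_CONNECTIONS['default']['ENGINE'] at the custom engine, then commit yourself in the indexing loop:

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr')  # your core's URL
    # ... after every ~200k documents:
    solr.commit()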
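
For the facet limits, the underlying Solr parameters are facet.limit and facet.mincount. A minimal sketch going through pysolr directly (the 'author' field name is just illustrative):

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr')
    results = solr.search('lord of the rings', **{
        'facet': 'true',
        'facet.field': 'author',
        'facet.limit': 100,     # cap the number of facet values returned
        'facet.mincount': 1,    # skip facet values with a count of zero
    })

If you're patching Haystack instead, the fix amounts to passing those two parameters through in the Solr backend's search kwargs.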
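
And for the schema, the kind of per-field trimming I mean looks roughly like this in schema.xml (field names are hypothetical; check what your scoring and phrase queries actually need before omitting anything):

    <!-- Facet-only field: norms and term positions are wasted space here. -->
    <field name="brand" type="string" indexed="true" stored="true"
           omitNorms="true" omitTermFreqAndPositions="true"/>
    <!-- Genuine full-text field: keep norms/positions for scoring and phrases. -->
    <field name="description" type="text_en" indexed="true" stored="false"/>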

Haystack is an excellent piece of work, but on this project (an e-commerce site with 9m products) it ended up costing more time than it saved. Admittedly, this was the first time I'd worked with a catalogue of this scale, so I shoulder a lot of the blame; I just wanted to highlight some of the problems I've faced.



Yeah, I heavily modified Haystack and pysolr to make them work for us. The standard disclaimers about leaky abstractions apply.



