Looks like a sales/SEO play for a Solr/ElasticSearch consultant. Still seems pretty helpful as a community resource. I emailed the author to see if he's interested in setting up a public GitHub repo to take pull requests.
Personally, I'd like to see similar comparisons for other search engines, like Sphinx and Postgres Full-Text. When I talk to people about search engines, the first questions they ask me are to compare one against some other.
Not sure about codewright's use cases. However in my own brief experimentation with SOLR I ran into performance issues with garbage collection.
I setup a cluster of about 15 cc2.8xlarge machines (5 Shards with 3 replicas each) containing 240Gb worth of documents (48gb per shard). Each node was given on the order of 40GB heap space. While performing load tests with a relatively small load (~150 QPS) after a few minutes the garbage collector on nodes would kick in and run on the order of 15 to 30s. This had a cascading effect of causing zookeper to think nodes were down, start leader re-election, etc.
Admittedly I am quite inexperienced when it comes to dealing with applications using such large heap sizes. Though I tried a few different JVM options with respect to GC I was unsuccessful in resolving the problem.
If any folks here happen to have some good resources regarding GC and large Solr clusters I would definitely be interested.
Thanks for the tips. I was considering trying testing again with more partitions w/ smaller machines. Perhaps N x m1.xlarge w/ 8 GB heap space.
I was starting to think that since the heap space was so big perhaps I should be worrying about page sizes as well. While I tried various GC settings (UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc. ) I didn't take a stab at playing with the size of New Genearation. Could you explain the reasoning behind this? Is the idea that most objects die young so try to increase the number of short run minor GCs and avoid bigger Major GCs? I am quite interested.
Edit: Regarding the cluster you were working on. Would you be able to give general dimensions to the number of nodes & partitions in your cluster + memory for each? Just trying to get a general guideline to aim for.
In general, I fix the newgen size mostly to avoid the optimizer choosing something braindead in a pathological case. 50/50 is safe, but not optimal.
In general, you should have enough unallocated memory on the box to cover your working dataset (it'll get used by caches and memmaps). If you can, find a way to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions per box depending on workload. Using bigger boxes is usually better, because you can avoid communication latency and variance that arises from having tons of boxes.
If you want to know more, you can email me at email@example.com.
I am not codewright, but we had troubles with sphinx on search queries containing larger number of terms (for us hiccups started after 100 terms or so). Besides, setting up delta-indexing is PITA and extensibility/configurability is limited. We ended up using solr (which is a memory hog) but at least it works.