I setup a cluster of about 15 cc2.8xlarge machines (5 Shards with 3 replicas each) containing 240Gb worth of documents (48gb per shard). Each node was given on the order of 40GB heap space. While performing load tests with a relatively small load (~150 QPS) after a few minutes the garbage collector on nodes would kick in and run on the order of 15 to 30s. This had a cascading effect of causing zookeper to think nodes were down, start leader re-election, etc.
Admittedly I am quite inexperienced when it comes to dealing with applications using such large heap sizes. Though I tried a few different JVM options with respect to GC I was unsuccessful in resolving the problem.
If any folks here happen to have some good resources regarding GC and large Solr clusters I would definitely be interested.
Try it again with sane GC parameters, e.g.:
-Xmx<N>G -Xms<N>g -XX:NewSize=<N/2>G -XX:MaxNewSize=<N/2>G -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+CMSIncrementalMode
Edit: I was benchmarking a similarly sized (though very differently configured) Solr cluster for a well-known internet company, and was able to tune it to do 5000qps, with p50 ~2ms and p99 ~20ms.
I was starting to think that since the heap space was so big perhaps I should be worrying about page sizes as well. While I tried various GC settings (UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc. ) I didn't take a stab at playing with the size of New Genearation. Could you explain the reasoning behind this? Is the idea that most objects die young so try to increase the number of short run minor GCs and avoid bigger Major GCs? I am quite interested.
Edit: Regarding the cluster you were working on. Would you be able to give general dimensions to the number of nodes & partitions in your cluster + memory for each? Just trying to get a general guideline to aim for.
In general, you should have enough unallocated memory on the box to cover your working dataset (it'll get used by caches and memmaps). If you can, find a way to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions per box depending on workload. Using bigger boxes is usually better, because you can avoid communication latency and variance that arises from having tons of boxes.
If you want to know more, you can email me at firstname.lastname@example.org.
Jesus dude. I couldn't reproduce that with my single or multi-node ElasticSearch clusters if I wanted to.
How were the EBS backing stores setup for these EC2 nodes?
Edit: Also, when I was talking about "them" falling apart, I meant Postgres or Sphinx, not Solr/ElasticSearch.
Well-configured Solr and ElasticSearch clusters can work very well for most people.
I almost wish it was more of a standard data store. Here's to hoping RethinkDB can fill that void.