I've used ElasticSearch & Logstash (& Kibana, also shown here), and I've been nothing but impressed.
My primary concern was how it would cope when data exceeded RAM (I like to call this the MongoDB fallacy). However, based on this post it looks like the ElasticSearch setup works just fine.
Yeah, it takes a bit of tuning, depending on what you want to do with Elasticsearch; there are so many different use cases. In my case, I found that capping the field data cache at 40% of the heap helped quite a bit.
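For reference, that cap is a single node setting; a sketch of what I mean, assuming an ES 0.90/1.x-era elasticsearch.yml:

```
# elasticsearch.yml -- cap the field data cache at 40% of the heap
# (by default this cache is unbounded and old entries are never evicted)
indices.fielddata.cache.size: 40%
```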
What I found most important is to monitor Elasticsearch while doing that tuning. That's when I set up Graphite and StatsD and https://github.com/ralphm/vor.
First off, you need to make sure Elasticsearch can lock a chunk of memory (using mlock). About half of the available RAM is a good size, as other system processes need some memory too, and not everything Elasticsearch uses lives on the heap.
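Concretely, that's one setting plus the heap size; a sketch assuming the ES 1.x-era config layout:

```
# elasticsearch.yml -- lock the heap so it can't be swapped out
bootstrap.mlockall: true

# e.g. /etc/default/elasticsearch -- roughly half the machine's RAM
# (the elasticsearch user also needs "ulimit -l unlimited" for mlockall to take)
ES_HEAP_SIZE=8g
```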
You want to watch how many items you are indexing and how much of the heap the field data cache is using while you run queries. By default ES tries to keep total heap usage at about 3/4 of the allocated heap. The types of queries matter, too: if you facet or sort on fields that have many distinct values, that will fill up the field data cache in no time.
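As a concrete example, this is the sort of request that blows the cache up -- a terms facet on a high-cardinality field (index and field names made up, facets-era API assumed):

```
# every distinct value of client_ip gets loaded into the field data cache
curl -s 'localhost:9200/logstash-2014.01.01/_search' -d '{
  "query": { "match_all": {} },
  "facets": {
    "by_client": { "terms": { "field": "client_ip", "size": 10 } }
  }
}'
```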
Is that the kind of information you're after? I can go into more detail if you have more specific questions.
I worked on a similar system at somewhat larger scale (maybe 5-10x larger volume and dataset size). There's a lot of basic JVM tuning to be done. You have to monitor heap usage and GC pauses and keep adjusting the heap size, the perm gen size, and the heap occupancy that should trigger a major GC (I forget what the variable is called) until you get things as smooth as possible and meet whatever resource constraints you have on your hardware. In my experience, large datasets really push ES against the memory architecture of the JVM -- it doesn't perform too well once your heap size is something like 30g.
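For what it's worth, the variable the parent is reaching for is probably HotSpot's CMSInitiatingOccupancyFraction. A hedged sketch of the usual knobs (names as in the ES 1.x startup scripts; the values are workload-dependent guesses, not recommendations):

```
export ES_HEAP_SIZE=16g                  # sets both -Xms and -Xmx
export ES_JAVA_OPTS="\
 -XX:+UseConcMarkSweepGC \
 -XX:CMSInitiatingOccupancyFraction=75 \
 -XX:+UseCMSInitiatingOccupancyOnly \
 -XX:MaxPermSize=256m \
 -verbose:gc -XX:+PrintGCDetails"        # log GC so you can actually watch the pauses
```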
After that, you'll have to tune merging. The underlying Lucene storage engine chokes when it tries to merge segments that are large, so you have to tune the max segment sizes, etc.
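For reference, the knobs I mean look something like this (TieredMergePolicy setting names from the ES 1.x era; the values are illustrative guesses):

```
curl -s -XPUT 'localhost:9200/logstash-2014.01.01/_settings' -d '{
  "index.merge.policy.max_merged_segment": "2gb",
  "index.merge.policy.segments_per_tier": 10
}'
```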
Then, you'll have to tune your queries -- it's not feasible to do a full index scan on a large index so you have to get clever about how you pull data in chunks. If your data is timestamped and that timestamp is indexed, you can pull data for smaller time ranges which will be faster than pulling all data for an index at once.
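A made-up sketch of the chunked approach, assuming a `@timestamp` field like Logstash writes:

```
# fetch one hour at a time instead of scanning the whole index in one query
curl -s 'localhost:9200/logstash-2014.01.01/_search' -d '{
  "query": {
    "range": {
      "@timestamp": { "gte": "2014-01-01T00:00:00Z", "lt": "2014-01-01T01:00:00Z" }
    }
  },
  "size": 1000
}'
```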
I have understood that you should definitely try to keep the heap size under 32G, as apparently the JVM can only use compressed pointers (compressed oops) below that. Because we do event logging and our documents don't change after being indexed, some of the problems you mentioned (like merging) are not as pressing. And of course each log event has a timestamp, so we can indeed limit our queries with them.
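If you'd rather check than take it on faith, HotSpot will tell you (the 31g/33g heap sizes here are just illustrative):

```
java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
java -Xmx33g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops   # expect false here
```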
Have you considered using any Graphite alternatives? I found it very slow at rendering data, especially when a graph combines multiple series. It also lacks some features that graph-rendering software is generally expected to have, at least in the half-year-old version I used.
I did look at the greater landscape. What I found great about Graphite is the number of functions it gives you for massaging metrics. Not all metrics arrive in the same form (total number of bytes received since boot versus number of bytes received per second), and things like the `hitcount` function and the `perSecond` function (in the upcoming 0.10) really help.
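A couple of made-up render calls to show what I mean (metric names hypothetical; `perSecond` assumes that 0.10 tree):

```
# turn an ever-growing byte counter into a per-second rate
curl 'http://graphite.example.org/render?target=perSecond(servers.web1.if_octets.rx)'
# total hits per hour out of a per-second request series
curl 'http://graphite.example.org/render?target=hitcount(stats.web.requests,%221hour%22)'
```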
It can indeed be slow, and that is usually an I/O issue. The Whisper storage backend does a lot of seeks, and people recommend using SSDs to deal with that. Alternatively, you can have graphite-web just give you the (calculated) metric data for a particular query as JSON and render it client-side; http://graphite.readthedocs.org/en/latest/tools.html lists a few tools that do that.
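The JSON route is just another format on the same render endpoint; for example (made-up metric name):

```
# return the computed datapoints instead of a rendered PNG
curl 'http://graphite.example.org/render?target=stats.web.requests&from=-1h&format=json'
# -> [{"target": "stats.web.requests", "datapoints": [[12.0, 1389960000], ...]}]
```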
Finally, we are investigating other storage backends for more fault tolerance. Probably we'll settle on something based on Cassandra, like http://blueflood.io/.
At my work we use custom JavaScript/PHP to render our graphs with d3.js. It works quite well: we have things like a live(ish) pie chart of the status codes of web requests over the last hour (or other time period).
The upside is we have a dashboard tailored to our use case; the downside is it took a couple of days to get right.
Elasticsearch is really awesome for keeping large amounts of searchable data. I used it in a previous application where we stored millions of items a week.
For data retention I had different indexes with different TTLs, depending on the type of queries that hit them (queries that only dealt with frequent items were sent to an index with a very short TTL).
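For anyone curious, per-document expiry was a mapping-level thing back then; a hypothetical index set up that way (the `_ttl` field existed in the ES 1.x era; index and type names made up):

```
# documents indexed here are deleted roughly a day after being written
curl -s -XPUT 'localhost:9200/frequent-items' -d '{
  "mappings": {
    "event": {
      "_ttl": { "enabled": true, "default": "1d" }
    }
  }
}'
```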
After reading this article, I checked out logstash.net today for the first time in nearly a year. I'm happy to see they've added so many connectors and that it has grown so much. This open source software could be serious competition for Splunk.
In our case, most of the metrics come from the events flowing through Logstash, and statsd is on the machines running the Logstash server. Note that we are not using Logstash for shipping log events.