Hacker News new | past | comments | ask | show | jobs | submit login
Using ElasticSearch and Logstash to Serve Billions of Searchable Events (elasticsearch.org)
97 points by twakefield on Oct 1, 2013 | hide | past | favorite | 28 comments

I've used ElasticSearch & Logstash (& Kibana, also shown here), and I've been nothing but impressed.

My primary concern was how it would cope when data exceeded RAM (I like to call this the MongoDB fallacy). However, based on this post it looks like the ElasticSearch setup works just fine.

Yeah, it takes a bit of tuning, depending on what you want to do with Elasticsearch. There are so many different use cases. I found that especially keeping the heap size for field data down to 40% helped quite a bit, in my case.

I'd love to see a blog post that went into detail about those issues: what to measure, what to tune etc.

What I found most important is to monitor Elasticsearch while doing that tuning. That's when I set up Graphite and StatsD and https://github.com/ralphm/vor.

First of, you need to make sure Elasticsearch can lock a chunk of memory (using mlock). About half of the available RAM is a good size, as other system processes need some memory, too, and not everything is on the heap.

You want to look for how many items you are indexing and how much of the heap the field data cache is using while doing queries. By default ES tries to keep the total heap size at about 3/4 of the allocated memory. The types of queries are important, too. E.g. if you do faceting or sorting on fields that have many different values, this will fill up the field data cache in no time.

Is that the kind of information you're after? I can go into in more detail if you have more specific question.

I you want to monitor elastic search with Graphite then Collectd provides an excellent curl-json plugin which works really well with the ES health api


That's great stuff - thanks. Just trying to collect tips from those that have been there, so that when I get there I have something to start from!

You're most welcome. Drop me a line any time.

I worked on a similar system at somewhat larger scale (maybe 5-10x larger volume and dataset size). There's a lot of basic JVM tuning to be done. You have to monitor heap usage and GC pauses and keep manipulating heap size, perm gen size, heap usage that should initiate major GC (forget what variable this is) until you get things as smooth as possible and meet whatever resource constraints you want to meet on your hardware. In my experience, for large datasets ES really pushes the memory architecture of the JVM -- it doesn't really perform too well when your heap size is like 30g.

After that, you'll have to tune merging. The underlying Lucene storage engine chokes when it tries to merge segments that are large, so you have to tune the max segment sizes, etc.

Then, you'll have to tune your queries -- it's not feasible to do a full index scan on a large index so you have to get clever about how you pull data in chunks. If your data is timestamped and that timestamp is indexed, you can pull data for smaller time ranges which will be faster than pulling all data for an index at once.

I have understood that you should definitely try to keep the heap size under 32G as apparently the JVM can only compress pointers below that. Because we do event logging and our documents don't change after being indexed, some of the problems you mentioned (like merging) are not as pressing. And of course each log event has a timestamp and we can indeed limit our queries with them.

I wrote an article that covers some of this: http://www.found.no/foundation/elasticsearch-in-production/

It's a bit introductory on the memory-parts. One that goes in more detail on memory- and JVM-tuning is planned.

(Full disclosure: I work for Found)

Have you considered using any Graphite alternatives? I found it to be very slow with rendering data, especially when multiple arrays are being used as sources. Also, it lacks some features that graphs rendering software is generally supposed to have, at least the half-a-year old version.

I did look at the greater landscape. What I found great about Graphite is the number of functions you have to massage the metrics. Not all metrics come in the same form (total number of bytes received since boot, or number of bytes received per second) and things like the `hitcount` and the `perSecond` (in the upcoming 0.10) functions really help.

It can indeed be slow, and this is usually a I/O issue. The Whisper storage backend does a lot of seeks, and people recommend using SSDs to deal with that. Also, you can have graphite-web just give you the (calculated) metric data for a particular query in JSON, and have it rendered client-side. http://graphite.readthedocs.org/en/latest/tools.html lists a few.

Finally, we are investigating other storage backends for more fault tolerance. Probably we'll settle on something based on Cassandra, like http://blueflood.io/.

Have you looked into using Elasticsearch's date histograms as sources for Graphite-functions?

Jordan Sissel wrote a short note about it over here: https://gist.github.com/jordansissel/3760225

That'll obviously be very expensive when the resolution is very high, though.

I missed that, but that looks pretty awesome. Thanks for sharing!

at my work we use custom javascript/php to render our graphs with d3.js. It works quite well - we have things like a live(ish) piechart oh the status codes of web requests in the last (hour/timeperiod).

the upside is we have a dashboard that is tailored for our use case, the downside is it took a couple of days to get right.

Elasticsearch is really awesome for keeping large amount of searchable data. I used it in a previous application where we stored millions of items a week.

For data retention I had different indexes with different TTLs, depending on the type of queries that hit them (queries that only dealt with frequent items were sent to an index with a very short TTL).

For graphing I also used Graphite, with metrics (http://metrics.codahale.com/) for sending data from Java programs and scales (https://github.com/Cue/scales) for sending data from Python applications.

The only problem I had was tuning it for faceting (Faceting consumed lots of RAM).

Here is a blogpost we (https://commando.io) wrote on shipping nginx access logs using LogStash and ElasticSearch


After reading this article, I checked out logstash.net today nearly after an year.... I am happy to see they added so many connectors and it has grown so much.... This open source software can be serious competition to splunk..

How much time was involved in building this setup? Just curious.

2 devs and 2.5 months, including multi-tenant client service with authorization

Is this a compelling OSS replacement for Splunk?

Also, are you using local statsd installations at each server that is sending stats, or you keep statsd separately?

In our case, most of the metrics come from the events flowing through Logstash, and statsd is on the machines running the Logstash server. Note that we are not using Logstash for shipping log events.

I'll bite: what are you using to ship logs?

EDIT: or I can read it in the post...

Probably just syslog or its recent variants.

Anyone used elasticsearch as a database, like mongodb ?

It won't have a decent Disaster Recovery plan until 1.0, so I would suggest holding off on making it the authoritative source of any of your data.

FYI I use Solr cloud for that purpose.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact