

Logstash joins Elasticsearch - j4mie
http://www.elasticsearch.com/blog/welcome-jordan-logstash/

======
clarkdave
Logstash, Elasticsearch and Kibana are just fantastic. After being unsatisfied
with a whole bunch of Logging As A Service providers (I tried loggly.com,
logentries.com and splunkstorm.com) I spent an afternoon setting up Logstash
and co and couldn't be happier.

There's a neat demo of Kibana here:
[http://demo.kibana.org/#/dashboard/elasticsearch/Logstash%20...](http://demo.kibana.org/#/dashboard/elasticsearch/Logstash%20Search)

The only thing that isn't fully baked into this stack is alerting (e.g.
sending an email when a certain error log message comes in), but you can do
that using Logstash filters and outputs, although there's no pretty UI for it.
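As a sketch of what that looks like (assuming logstash 1.2+ conditional syntax; the "level" field and the addresses are made up for illustration):

```
output {
  # Hypothetical alert: the "level" field and email address are
  # placeholders, not details from this thread.
  if [level] == "ERROR" {
    email {
      to      => "ops@example.com"
      subject => "logstash alert: %{message}"
    }
  }
}
```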

There are some excellent Chef cookbooks for setting up Logstash and friends
too:

\- Logstash: [https://github.com/lusis/chef-logstash](https://github.com/lusis/chef-logstash)

\- Elasticsearch: [https://github.com/elasticsearch/cookbook-elasticsearch](https://github.com/elasticsearch/cookbook-elasticsearch)

\- Kibana: [https://github.com/lusis/chef-kibana](https://github.com/lusis/chef-kibana)

~~~
davidy123
it is worth noting there is a Node implementation of logstash.

[https://github.com/bpaquet/node-logstash](https://github.com/bpaquet/node-logstash)

It is "logstash compatible" (at the Elasticsearch level, so it works with
Kibana), in my experience very easy to work with, and probably a lot lighter
weight than the JRuby version.

~~~
stock_toaster
oh neat. thanks for the link -- had never run across it before.

------
capkutay
For anyone who can't immediately see the significance: this is Elasticsearch's
entry into real-time log analytics. There is plenty of room for innovation and
financial opportunity in this area, given the success of Splunk (valued at $5
billion) along with companies like SumoLogic and LogLogic.

What's most interesting is that Elasticsearch seems like a completely open
source (and widely used) offering of a product that Splunk charges close to
Oracle pricing for.

Shameless plug: If you're looking for an opportunity at a well-funded true
real-time analytics company in silicon valley...feel free to ping me. There's
lots of exciting and fun work to do in this area.

~~~
nasalgoat
The one thing Splunk has going for it over ES is how few resources it
requires to work at scale.

I needed 12 ES boxes for every one Splunk box to handle the 100MB/day log load
of my system, and even then they ran at a high load and searches often failed,
and in some cases it took hours for the indexer to catch up.

~~~
jordansissel
This experience sounds especially bad. Sorry about that.

As mentioned in another comment in this post, I was doing 300gigs of data per
day with an elasticsearch cluster size of 7 elasticsearch nodes (16 cores &
16gb ram per node) and load was around 5-10% cpu utilization.

100MB/day is pretty small in terms of log data, I think. If you attempt this
again, please invoke the community (elasticsearch's is great!) and see if we
can assist you in figuring out what's busted.

------
benmmurphy
logstash + elasticsearch are pretty amazing. however, if you are generating a
high rate of log entries you may want to consider using mozilla hekad instead
([http://hekad.readthedocs.org/en/latest/](http://hekad.readthedocs.org/en/latest/)).
on our servers logstash was running around 20% CPU during quiet periods while
hekad was running around 1-2% CPU. during busy periods i think logstash was
going up to 100% CPU while hekad was sitting around 20-30% CPU.

hekad is written in go which compiles down to native code while logstash is
written in jruby which is not the most performant runtime.

~~~
quicksilver03
Another possible log shipper is nxlog, it compiles to native code and does not
have any noticeable impact in terms of CPU or memory usage on my various low-
end servers.

[http://nxlog-ce.sourceforge.net/](http://nxlog-ce.sourceforge.net/)

~~~
kossmoboleat
Do I have to buy the commercial version to get a web interface or GUI to
analyze or browse the logs?

[http://log4ensics.com/](http://log4ensics.com/)

~~~
quicksilver03
Only if you want.

On my servers I use the open source version of nxlog to collect various logs
and forward them to a central nxlog server, which in turn feeds logstash.
Behind logstash I have configured elasticsearch as storage and I use kibana as
a GUI to search and browse.
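A minimal sketch of the logstash end of that pipeline (the port, host, and TCP/syslog framing are assumptions, not details from the comment):

```
input {
  # Receive logs forwarded by the central nxlog server over TCP.
  tcp {
    port => 5140
    type => "syslog"
  }
}
output {
  # Hand the events to elasticsearch for kibana to search.
  elasticsearch {
    host => "localhost"
  }
}
```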

------
JoachimSchipper
I'm confused. Can someone explain to me why this is so obviously interesting,
yet not worth discussing, that it stands - as of 2 hours after submission - at
75 points with zero comments?

Honestly, I've never heard of either company, although I obviously wish them
the best of luck. Am I just out of touch?

~~~
jpgvm
Logstash + Elasticsearch + Kibana is the biggest thing in open source
operational tools since Nagios.

~~~
ape4
Maybe I am too traditional... but I like KISS when it comes to these kinds of
things.

~~~
zwily
Logstash, ES, and Kibana actually are more KISS than any other log searching
setup I've tried.

Except for grep of course.

------
netvarun
This is great news. Our centralized logging system at Semantics3
([https://semantics3.com](https://semantics3.com)) is built using
Logstash+Kibana+Rsyslog+ElasticSearch. Running off a single EC2 large
instance, it has been able to seamlessly aggregate and process logs from about
200-300 instances, processing on average about 15 GB of log data. We hit
some performance bottlenecks (particularly with elasticsearch) when our number
of instances went beyond the 300 mark. But that should get fixed once we shard
and distribute ElasticSearch.

Looking forward to some really tight integration between the Logstash, ES and
Kibana.

------
100k
Logstash is awesome. We use it at Swiftype to index all our logs and it's
super helpful for nailing down support requests and bugs (using Kibana).

Since you can access the logs via the Elasticsearch API, we made users' recent
logs available to them in our dashboard: [https://swiftype.com/blog/api-logs.html](https://swiftype.com/blog/api-logs.html)
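As a rough sketch of how such a per-user log view could be built (the `user_id` field and the `logstash-*` index pattern are assumptions, not details from the post), a recent-logs request against the Elasticsearch search API might look like:

```python
import json
import urllib.request

def recent_logs_query(user_id, size=50):
    """Build a search body for a user's most recent log events.
    The 'user_id' field name is hypothetical."""
    return {
        "query": {"term": {"user_id": user_id}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }

def fetch_recent_logs(es_url, user_id):
    """POST the query against the logstash-managed daily indices."""
    body = json.dumps(recent_logs_query(user_id)).encode()
    req = urllib.request.Request(es_url + "/logstash-*/_search", data=body)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```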

------
victorhooi
I wonder how all this compares to Graylog2?
([http://graylog2.org/](http://graylog2.org/))

Those guys are meant to be releasing a revamped version at the end of
October; from the screenshots and videocasts, it looks pretty good:

[https://www.facebook.com/graylog2](https://www.facebook.com/graylog2)

------
vosper
For people using this, I'd be interested to know what kind of throughput
you're seeing and your cluster size - I'm trying to find something that can
handle upwards of 100k small messages per second for a near-realtime analytics
platform, and although this is a bit left-field (compared to Cassandra, HBase
etc...) it could be a fit.

~~~
jordansissel
At my last job (prior to joining elasticsearch), I had a cluster of 7 machines
(16 cores, 16gb ram, 2TB raid1), each running logstash and elasticsearch.

The event rate going into this cluster was about 5000 events/sec on average
(burst up to 10,000 events/sec sometimes).

During a maintenance (two machines going offline for disk repairs), I
benchmarked the surviving 5-node cluster at 88,000 events/sec peak
performance.

In terms of capacity planning, this means that we could have a 9x increase in
normal event load and still not need to grow the cluster's processing
capacity.

Persistent storage is another story. We stored about 300GB/day of events,
getting us roughly 45 days of data retention before we would run out of space
(2TB * 7 nodes / 300gb/day; roughly 45 days). I'm working on improving storage
efficiency of logstash and elasticsearch, too, so retention should improve
greatly in the long term.
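The retention figure can be sanity-checked with a back-of-envelope calculation (this ignores replica copies and index overhead, which is why the real number lands closer to 45 days):

```python
# Rough retention estimate from the cluster figures above.
nodes = 7
disk_gb_per_node = 2000    # 2TB RAID1 per node
ingest_gb_per_day = 300

retention_days = nodes * disk_gb_per_node / ingest_gb_per_day
print("~%d days of retention" % round(retention_days))  # → ~47 days
```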

For other experiences, it's useful to invoke the community and ask what others
have done - the #logstash irc channel on freenode is very active, as is the
logstash-users@googlegroups.com mailing list.

Hope this helps!

~~~
markelliot
What's the raw scale of input data for your 300GB/day of stored events?
(assuming that's 300GB on disk stored in Elasticsearch)

~~~
jordansissel
I think it was roughly 300 million events/day (1kb per event). There is some
overhead incurred by logstash (turning a log into json, parsing it into
fields) and by elasticsearch (analyzing/indexing data).

In practical terms, and by way of example, a plain text apache access log,
fully parsed by logstash (breaking out fields, etc), has historically bloated
by quite a bit (6.2x in my measurements). Lately, however, with improvements
to logstash, better default settings, and elasticsearch being awesome, the
'inflation' number gets down to something more like 1.5x - which isn't bad
considering all the awesome you get with it.

Long term, I am working towards making the 'raw data to stored data' ratio
something less than 1x.

You can see some experiments I did a year ago on this:
[https://github.com/jordansissel/experiments/blob/master/elas...](https://github.com/jordansissel/experiments/blob/master/elasticsearch/disk/README.md)

I will repeat these experiments after the next release of logstash, and I
expect storage ratios to improve significantly.

------
jaryd
Logstash is really great and Jordan is approachable and very helpful. To all
interested, I recommend joining their IRC channel (#logstash on Freenode) and
talking to the people there a bit.

Congrats :)

------
Keyframe
I'm currently evaluating elasticsearch and riak for rt analytics of large
amount of data. Anyone has similar experience? Maybe even Cassandra, haven't
touched it seriously yet.

~~~
devopser
ElasticSearch itself should be very good now since they have moved to Lucene
4.0, which brought in a lot of improvements in memory usage.

I evaluated elasticsearch for RT analytics. It works wonders for point
queries, where your result set is going to be small. It didn't work well for
aggregate queries which need to scan a lot of data. The biggest problem was
the field cache in Lucene. Almost all our queries needed to do faceting, which
had a big impact on the field cache.

Also, I don't know about Riak, but in ES the joins you can do are very
limited.
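For context, the faceting being described (a pre-1.0 Elasticsearch terms facet, since replaced by aggregations) takes a request body roughly like this - the `status` field name is illustrative:

```python
# Hypothetical terms-facet request body (pre-1.0 Elasticsearch syntax).
# Facets load every value of the faceted field into Lucene's field
# cache, which is the memory pressure described above.
facet_query = {
    "query": {"match_all": {}},
    "facets": {
        "status_codes": {
            "terms": {"field": "status", "size": 10}
        }
    },
    "size": 0,
}
```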

~~~
Keyframe
I'll do extensive testing, but I need to scan a lot of data (aggregate
basically). I'd be comfortable even with index size in multiples of data size
if it delivered RT queries. Have you evaluated anything else?

~~~
devopser
We also checked mongodb. We dropped it mainly because index size was getting
too big.

If your data is read-only then Cloudera Impala is worth a try. It's really
fast.

~~~
Keyframe
I was looking at Impala (and Cassandra) as well as keeping an eye on Drill's
progress. My data is write-only in the ETL stage, so it seems it could be the
right way. Lots of testing ahead! - thanks

------
mrmondo
Both Logstash and elasticsearch are great - but they both suffer from the same
flaw: they're a pain to deploy and it's a pain to manage their packages.

~~~
jordansissel
With logstash, I aim to make it as easy to deploy as possible. That is, in
part, why the releases are self-contained jar files with all dependencies
built-in (except for java itself). We also started working on shipping rpm/deb
packages with recent releases.

Like I always say, if it's hard to use or appears to have major flaws or
pains, it's a bug, and we can fix it. Let us know! :)

------
devopser
This space is heating up. Cloudera is building a similar stack with Solr -
[http://www.cloudera.com/content/cloudera/en/campaign/introdu...](http://www.cloudera.com/content/cloudera/en/campaign/introducing-search.html)

------
vigeek
This is great news as well. @ Wildbit we have a dedicated logging server
consisting of Rsyslog, ES, LogStash and Kibana3. It's been improving
considerably each month.

------
chriscareycode
I love Logstash+Kibana+Elasticsearch. Holding 410 million log files in a 10
node cluster! Congratulations Jordan!

------
koppo
this is the bestest news i've heard in a long long time ...

