

Central Logging with Open Source Software - reyjrar
http://divisionbyzero.net/article/2012/06/17/central-logging-with-open-source-software.html
I'm attempting to implement a Splunk-like setup with open source components. This blog entry is the first of many: a brain dump of how I'm using this setup, what I get from it, and why I've arrived at each of these components.
======
nl
Just get Splunk.

I'm a pretty experienced Solr developer, and I've played with Elastic Search
etc, and I've been using Splunk for about a year.

The thing people miss about Splunk unless they know it is how good the search
interface is. For example, the search language is roughly comparable to
Lucene/Solr/Elastic Search, but it also includes the ability to parse input
files and present results graphically. No open source solution integrates all
of that.

If you want to compete with Splunk (something I've thought about a few times)
then you need to match that. I'd estimate two developer-years to build out
those features on top of Solr or ES.

~~~
packetslave
Yes, except Splunk gets very expensive, very quickly if you want more than the
free tier gives you (features or indexing volume). 500MB/day is not all that
much when you start shoving everything under the sun into it (and once you've
used it, you'll want _everything_ available to it).

~~~
nl
What do you class as _very_ expensive?

We put multiple orders of magnitude more data than the free tier into Splunk,
and it's still a lot cheaper than 2 developer-years.

It is true, though, that if the licensing was cheaper we'd put even more data
into it.

------
Estragon
This reminds me of something I've been wondering about since the Bitcoinica
heist: how do people usually set up secure offline backups which can't be
erased using the credentials on the backed-up server? I would probably do
something with ssh authorized_keys if I had to make it from scratch, but are
there obscure security/reliability risks, and tools which have already
mitigated these risks for you?
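
One common answer to this question is to invert the direction of the transfer: the backup host *pulls* from production, so the credentials that live on the backed-up server can never reach (let alone erase) the archive. A minimal sketch, run on the backup host; the hostnames, paths, and flag choices here are illustrative assumptions, not a hardened recipe:

```python
import subprocess

def pull_backup_cmd(src_host="prod.example.com", src="/var/log/",
                    dest="/backups/prod/"):
    # The backup box initiates the rsync over SSH; production holds no
    # credentials for the backup box. --ignore-existing means that even
    # if a compromised production host serves tampered files, already
    # archived copies are never overwritten.
    return ["rsync", "-az", "--ignore-existing",
            "%s:%s" % (src_host, src), dest]

def run_pull_backup():
    # Assumes the backup host has (ideally read-only) SSH access to prod.
    subprocess.run(pull_backup_cmd(), check=True)
```

The push variant with `authorized_keys` restrictions that Estragon mentions can work too, but it leaves more surface to get wrong, which is presumably the "obscure risks" worry.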

~~~
swombat
The standard way used to be to use write-once media. For example, if you log
to a server which writes the logs incrementally to a DVD writer, you can be
fairly certain that the logs won't be erased...

~~~
beagle3
That's only true if you have software that can mount arbitrary past sessions,
which is rarely the case. When you put in a DVD, what gets mounted is the
latest session, which is supposed to also include all previous sessions, but
doesn't have to.

------
georgebarnett
Just a note - using TCP logging is dangerous. If the syslog server hangs,
clients may block writing to the socket and your whole infrastructure will
lock up.

See: <http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downtime-last-week/>
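
The failure mode is visible even in Python's standard library syslog client: a sketch contrasting the two socket types (the host, port, and logger name below are illustrative):

```python
import logging
import logging.handlers
import socket

def make_syslog_logger(host="localhost", port=514, use_tcp=False):
    # With SOCK_DGRAM (UDP) a send can't block on a hung collector;
    # messages are silently dropped instead. SOCK_STREAM (TCP) gives
    # reliable delivery, but a stalled syslog server can back-pressure
    # the application into blocking, as in the Bitbucket outage.
    socktype = socket.SOCK_STREAM if use_tcp else socket.SOCK_DGRAM
    handler = logging.handlers.SysLogHandler(address=(host, port),
                                             socktype=socktype)
    logger = logging.getLogger("myapp")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Neither default is "safe": UDP trades the lockup risk for silent loss, which is part of why RELP-style acknowledged protocols exist.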

~~~
zobzu
Hmm, shouldn't this be alleviated by RELP [1] (which the author suggests you
use)?

Because otherwise, since AFAIK rsyslog doesn't support DTLS, it means
unencrypted log transmission. (For RELP it also means running stunnel anyway,
which supports DTLS and may be a solution.)

[1]: <http://www.librelp.com/>

------
canistr
Shameless self-promotion, but I wrote a blog post about using Logstash,
ElasticSearch, and Kibana in production to capture not only syslog-ng
messages but also multi-line Java stack traces via log4j.

<http://blog.tagged.com/2012/05/grabbing-full-java-stack-traces-from-syslog-ng-with-logstash/>

------
suprgeek
This is a great blog entry on exactly the kind of system I am trying to build.
When we went through the evaluation for this stack, Elasticsearch came out as
the choice for the datastore and querying part. Where we are still undecided
is Flume vs. Logstash. Have you compared the two? We will be building our
own UI ...

------
ashayh
Is there anyone who has used Splunk and Logstash/Graylog2 at large scale and
can compare the two?

~~~
reyjrar
+1 - Kind of the reason I started this blog post. Besides reporting (which I
know my setup lacks), from a techie perspective, what would this setup lack
that Splunk provides?

Follow-up: is it _honestly_ worth the licensing to get those features?

~~~
gregr401
No on the license front. Splunk is way too pricey, IMO.

Thanks for posting this! I've been eyeballing logstash for a while but had not
run across Kibana's UI. More fun reading ahead.

------
hcarvalhoalves
A bit of a warning, to be fair: he suggests you drop everything in favor of
Graphite. As awesome as Graphite is, it's not really production-ready, and
setting it up is an exercise in piecing together semi-working software
without any documentation.

~~~
packetslave
Companies like Orbitz and Etsy that use the heck out of Graphite would be
surprised to hear that it's "not really production ready".

~~~
mkramlich
Yes. I was a senior software engineer on the Ops Arch team and a coworker of
Chris Davis when he was at Orbitz, and both my team and the prod
sysadmin/netops folks were the first users of it, anywhere in the world,
debuting roughly with the Austin project, which was a big rewrite of the
Orbitz/Cheaptickets codebase to support i18n & white-label functionality. I
can assure everybody it was used in production for a huge travel website at
least as far back as the 2008 timeframe. And it stood up very well back then.
I'd be surprised if it hasn't gotten even better since.

------
spudlyo
Anyone doing this at scale using Scribe/Flume with HDFS and
Hive/Pig/MapReduce?

~~~
flyt
Facebook.

~~~
zobzu
Have a link?

~~~
flyt
<http://axonflux.com/how-facebook-uses-scribe-hadoop-and-hive-for>

------
ova
Has anyone looked at using ELSA?

<https://code.google.com/p/enterprise-log-search-and-archive/wiki/Documentation>

------
j_baker
_Man_ do I wish I had heard of Graylog2 or logstash before now. That would
have eliminated days of hacking together my own two-bit implementation.

------
zobzu
I found this interesting. Do you have any benchmarking of the speed of ES vs
everything else?

~~~
reyjrar
We have a lot of in-house expertise in ElasticSearch and chose it for a few
reasons: 1) it's easier than Lucene/Sphinx to set up; 2) its clustering
support works out of the box and is so easy to configure it's not funny.

ES is basically a usability wrapper around Lucene. I've heard that Sphinx is
better for a single-node configuration (it's faster and uses fewer
resources), but clustering with Sphinx is apparently tricky.

The other competitor, which I have no experience with, is Solr, but this
write-up <http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/>
gives an overview of the ES advantages over Solr.

I'm not a Full-Text Search expert, but a number of really smart people at my
company evaluated a number of them for one of the most critical pieces of our
production site and they chose ElasticSearch.
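
For a feel of the "usability wrapper" point: searching ES is just an HTTP POST of a JSON body. A minimal sketch using only the standard library; the endpoint, index pattern, and `@timestamp` field are assumptions (Logstash-style defaults), not something from this setup specifically:

```python
import json
import urllib.request

def build_search_body(term, newer_than="now-1h", size=50):
    # query_string matches the term; the range clause bounds @timestamp
    # so the search doesn't have to scan the entire history.
    return {
        "query": {"query_string": {"query": term}},
        "filter": {"range": {"@timestamp": {"gte": newer_than}}},
        "size": size,
    }

def search(term, host="http://localhost:9200", index="logstash-*"):
    # POST the body to the _search endpoint; assumes a reachable ES node.
    req = urllib.request.Request(
        "%s/%s/_search" % (host, index),
        data=json.dumps(build_search_body(term)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Something like `search("status:500", ...)` would return matching hits as JSON, with no client library or schema setup required.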

~~~
zobzu
OK. The one part I don't like about ES is that it's Java, and the trouble
that usually goes with it (yeah, judging stuff like this is bad, I know :)

For logging, searching is the major item at large sites IMO (talking
terabytes at least). When you're looking for all occurrences of "item x"
over..

"the past week", it may take 1H

"the past month", it may take 10-30H

"the past year", uh, no, you don't do that.

So you gotta use ranges, but it's often hard to guess and you end up missing
many log entries just because you don't have the time to search through them.

(Obviously loading gigabytes of indexed data takes a while "physically
speaking" anyway. I'm guessing ES can distribute the load, though, much like
a web search engine does.)

~~~
nl
The vast majority of open source "big-data" infrastructure is in Java (Hadoop,
HBase, Cassandra, Solr, Elastic Search etc). It works pretty well.

I'm not sure what your question is, but I've experimented with loading netflow
data in Solr and I'm averaging sub-2 second query times. That's on a laptop,
with a couple of minutes of netflow (around 10Gb).

With proper indexing, your search response time shouldn't increase linearly
with your data size.
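
The reason indexed search beats a linear scan can be seen in a toy inverted index: one pass builds term-to-document postings, after which a query only walks the postings for its term instead of rescanning every log line. A deliberately minimal sketch of the idea (real engines like Lucene add compression, ranking, and merging on top):

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> sorted doc ids
        self.n_docs = 0

    def add(self, line):
        # Index one log line; each distinct whitespace token is a term.
        doc_id = self.n_docs
        self.n_docs += 1
        for term in set(line.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, term):
        # Cost is proportional to the number of matches,
        # not to how many lines were indexed.
        return self.postings.get(term.lower(), [])

idx = InvertedIndex()
for line in ["GET / 200", "GET /x 404", "POST /login 200"]:
    idx.add(line)
print(idx.search("200"))  # -> [0, 2]
```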

~~~
zobzu
Loading 10GB from a traditional HDD takes more than 2s (that's a 5GB/s read
speed; nice hard drive). Either your data is in RAM and you have a lot of
RAM, or it's not 2s, or it's not a 10GB index.

And I'm talking 100GB+ indexes ;-)

Obviously 2 min of netflow data ain't much. I would want to see the result
over 200h (or more) of netflow data, for example.

~~~
nl
No, _querying_ the data takes less than 2 seconds. I can't remember the load
time.

 _Obviously 2min of netflow data ain't much_

Depends where you work...

I just checked, and it was 2GB of netflow I tested on. That seemed small, so
I looked a bit deeper and indeed I was only using a small fraction of our
total netflow for that period. It was adequate for what I was trying, though.

