Hacker News
Central Logging with Open Source Software (divisionbyzero.net)
79 points by reyjrar on June 17, 2012 | 49 comments

Just get Splunk.

I'm a pretty experienced Solr developer, and I've played with Elastic Search etc, and I've been using Splunk for about a year.

The thing people miss about Splunk unless they know it is how good the search interface is. For example, the search language is roughly comparable to Lucene/Solr/Elastic Search, but also includes the ability to parse input files, and present results graphically. No open source solution integrates all that.

If you want to compete with Splunk (something I've thought about a few times) then you need to match that. I'd estimate 2 developers for a year to build out those features on top of Solr or ES.

Yes, except Splunk gets very expensive, very quickly if you want more than the free tier gives you (features or indexing volume). 500 MB/day is not all that much when you start shoving everything under the sun into it (and once you've used it, you'll want everything available to it).

What do you class as very expensive?

We put multiple orders of magnitude more data than the free tier into Splunk, and it's still a lot cheaper than 2 developer-years.

It is true, though, that if the licensing was cheaper we'd put even more data into it.

Splunk is absurdly priced for normal verbose syslogs for a bunch of hosts. You could preprocess or tune your logging to only send important stuff to Splunk to make up for this.

It's cheap for application-specific logs where each line is relatively high value.

I think it defeats the purpose. Splunk is great but you need to pay for a license.

What's missing is a free as in beer and as in freedom solution that is decent. Mostly because it means we can all commit fixes/updates/etc to it. Including people who can't pay for a product (but are willing to pay for support) such as communities.

Don't listen to this guy if you own more than a couple servers.

Why do you say that?

We have a couple of datacenters, so yes, we have more than a couple of servers.

In a situation where one has that much money to blow on something so limited, virtually anything would've sufficed.

We did a trivial test of Splunk at my last company. It's extremely expensive and it's very easy to bump into its limitations. We were able to wreck the poor Splunk server with some rather sundry queries into a dataset that shouldn't have been that big of a deal. We took these issues back to the company and didn't get any real answer on them.

Its popularity leads me to surmise that there is still a lot of money to be made in solving mundane problems. (Which is good news if you're a product-minded programmer)

What is extremely expensive for you? We find the overheads on storing & processing the data are much more than the cost of the license, on a per GB basis.

Without knowing details of exactly what you are doing it's difficult to comment on your problems with queries. It's true that something like Solr gives you more control over the indexing process, so you can optimize it more for specific queries. Splunk tends to rely more on saved searches (and the new search acceleration feature).

>We find the overheads on storing & processing the data are much more than the cost of the license, on a per GB basis.

What are you storing the data with...the etchings on wings of fairies?

>Some blather about Splunk's "saved searches"

We talked to the company, explored every avenue. Our volume of data simply overwhelmed it. (Data from three Apache servers. Lol.)

I am 100% certain you know less than Splunk-The-Company, so our conversation is done here.

*What are you storing the data with..*

It's on a SAN. We'll probably migrate to local disks at some point. The pricing is typical SAN pricing[1].

*Our volume of data simply overwhelmed it. (Data from three Apache servers. Lol.)*

Yeah, well we do a lot more data than that.

[1] Take a look at the NetApp, Dell & EMC prices on http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-h..., or look at http://serverfault.com/questions/76725/whats-the-nominal-cos... and you'll be in the right price range.

This reminds me of something I've been wondering about since the Bitcoinica heist: how do people usually set up secure offline backups which can't be erased using the credentials on the backed-up server? I would probably do something with ssh authorized_keys if I had to make it from scratch, but are there obscure security/reliability risks, and tools which have already mitigated these risks for you?

You pull rather than push.

The webserver has no credentials for accessing the backup server. Instead the backup server accesses the webserver.

This strategy places higher trust on the backup server, but the backup server is easier to defend -- it only needs connectivity to a small number of other IPs.
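As a rough sketch of the pull model, the backup server could build and run an rsync-over-ssh command itself, so the webserver never holds credentials for the backup host. The function name, user, host, and paths below are hypothetical, and rsync over ssh is just one assumed transport.

```python
def build_pull_command(user, host, remote_path, local_dest):
    """Return the argv for an rsync pull, meant to be run ON the backup server."""
    # The remote side is only ever a source; nothing here lets it write to us.
    source = f"{user}@{host}:{remote_path}"
    # -a is archive mode (recursive, keeps permissions and timestamps);
    # -e ssh selects ssh as the transport.
    return ["rsync", "-a", "-e", "ssh", source, local_dest]
```

On the backup server this would be run from cron with something like subprocess.run(build_pull_command("backup", "web1.example.com", "/var/www/", "/srv/backups/web1/"), check=True), using a key that exists only on the backup host.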

1. Don't make your backup server accessible on the public internet.

2. Don't allow shell access from any server that does have access to the public internet. When your web server gets hacked, you don't want your assailant to have the ability to shell around in your network.

3. If you need shell access from outside the network, have a host specifically for this purpose and disallow password authentication (.ssh/authorized_keys indeed).

4. Backup server is write-only. I don't have a hard-and-fast method for enforcing this, but a process (or kernel module?) that watches for incoming backups, moves them immediately, and prevents overwriting existing files seems simple enough.

EDIT: lists on HN- doin it rong
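The "moves them immediately and prevents overwriting" process from point 4 could look roughly like the sketch below. It is a user-space approximation only: the function name and directory layout are made up for illustration, and it assumes the upload account cannot write to the archive directory.

```python
import os
import shutil


def archive_incoming(incoming_dir, archive_dir):
    """Move files out of incoming_dir, never replacing an already archived file."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(incoming_dir)):
        src = os.path.join(incoming_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(archive_dir, name)
        try:
            # O_EXCL makes creation fail if dst already exists, so re-uploading
            # a file under the same name cannot clobber the archived copy.
            fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
        except FileExistsError:
            continue  # leave the duplicate in incoming_dir for inspection
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
        os.remove(src)
        moved.append(name)
    return moved
```

Run from cron or a file-watching loop on the backup server, this keeps the first copy of every name and ignores later attempts to replace it.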

Thanks. Great advice.

The standard way used to be to use write-once media. For example, if you log to a server which writes the logs incrementally to a DVD writer, you can be fairly certain that the logs won't be erased...

That's only true if you have software that can mount arbitrary past sessions, which is rarely the case. When you put in a dvd, what gets mounted is the latest session -- which is supposed to also include all previous sessions, but doesn't have to.

Not softwarily, anyway.

One of the follow-up posts to this is going to be on using OSSEC-HIDS which will give you logfile chained checksums. It's not perfect, but again it's about achieving the most value for the least amount of effort.
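To illustrate what chained checksums buy you in general, here is a minimal hash-chain sketch. It is not OSSEC's actual format or API; the function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def chain_checksums(lines, seed=b""):
    """Return one SHA-256 hex digest per line, each covering the previous digest."""
    digests = []
    prev = hashlib.sha256(seed).hexdigest()
    for line in lines:
        # Mixing in the previous digest means editing line N changes
        # every digest from N onward.
        prev = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        digests.append(prev)
    return digests


def first_tampered(lines, digests, seed=b""):
    """Index of the first line that no longer matches its digest, or None."""
    for i, (expected, actual) in enumerate(zip(digests, chain_checksums(lines, seed))):
        if expected != actual:
            return i
    if len(lines) != len(digests):
        # Lines were appended or truncated after the digests were recorded.
        return min(len(lines), len(digests))
    return None
```

The digests only help if they are stored somewhere the attacker cannot rewrite, which is why this pairs naturally with a separate log or backup host.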

My quick and dirty way to do it is to run a cron job on the backup server that chowns incoming files to another user (with a few refinements, like preventing exec, etc). But I'd definitely like something more solid.

You should look into a tool that stores meta information on the backup files, such as rdiff-backup. Manually restoring ownership/permissions from a backup is probably tiring.

I back up (and chown) archives, not the files directly, so restoring the permissions isn't much of an issue. Sorry, was unclear :s

Just a note - using TCP logging is dangerous. If the syslog server hangs, clients may block writing to the socket and your whole infrastructure will lock up.

See: http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downti...
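One common way to keep a hung log server from stalling the application is to put a bounded queue between the app and the network sender and drop records when it fills up. The sketch below uses only Python's standard logging and queue modules; the class name and queue size are hypothetical.

```python
import logging
import queue


class DroppingQueueHandler(logging.Handler):
    """Hand records to a bounded queue; drop them instead of blocking when full."""

    def __init__(self, maxsize=1000):
        super().__init__()
        self.records = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # count of records discarded because the queue was full

    def emit(self, record):
        try:
            # put_nowait never blocks: it raises queue.Full immediately.
            self.records.put_nowait(record)
        except queue.Full:
            self.dropped += 1
```

A separate thread (for instance logging.handlers.QueueListener feeding a SysLogHandler) would drain the queue and do the actual TCP send, so only that thread stalls if the server hangs. The trade-off is explicit: you lose log lines during an outage rather than losing the service.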

Hmm, shouldn't this be alleviated with RELP [1]? (which the author suggests you use).

Because, otherwise, since AFAIK rsyslog doesn't support DTLS it means unencrypted log transmission. (For RELP it also means running stunnel anyways, which supports DTLS, and may be a solution)

[1]: http://www.librelp.com/

Shameless self promotion but I wrote a blog post about using Logstash, ElasticSearch, and Kibana in production for not only capturing syslog-ng messages but multi-lined Java stack trace errors via log4j.


This is a great blog entry on exactly the kind of system I am trying to build. When we went thru the evaluation for this stack - Elasticsearch came out as the choice for the datastore and querying part. Where we are still not decided is using Flume vs logstash. Have you compared the two? We will be building our own UI ...

Is there anyone who has used splunk and logstash/graylog2 on a large scale and can compare the two?

+1 - Kind of the reason I started this blog post.. Besides reporting (which I know my setup lacks), from a techie perspective, what would this setup lack that Splunk provides?

Follow-up, is it _honestly_ worth the licensing to get those features?

No on the license front. Splunk is way too pricey, IMO.

Thanks for posting this! I've been eyeballing logstash for a while but had not run across Kibana's UI. More fun reading ahead.

I haven't implemented logstash or graylog2 (yet) but I've implemented Splunk multiple times, at multiple companies since back in 2006 and it's a simply fantastic piece of software. Unless the pricing model has changed significantly since the last time I bought it, I don't see it as that expensive. The licensing model is based on daily volume indexed but the licenses are perpetual. It has all the features you'd expect built-in and the reporting/searching/alerting and other integrations are the best that I've seen. I am curious to see what can be accomplished with open source software though.

Splunk licenses may be perpetual, but there is much less up-front cost with a yearly license. The license model is fair, though.

(Another happy Splunk user here)

A little bit of warning, to be fair: he mentions you should quit everything in favor of Graphite. As awesome as Graphite is, it's not really production ready, and setting it up is an exercise in putting together semi-working software without any documentation.

We collect about 2.5 million data points every minute with our Graphite system. While I agree that initial installation is not as easy as "yum install graphite ; /etc/init.d/graphite start", I wouldn't hesitate to call it production ready.

You set it up once. After you get through that hour long process you've got a great system that is battle tested at many large shops, with plenty of domain knowledge on the Internet.

Companies like Orbitz and Etsy that use the heck out of Graphite would be surprised to hear that it's "not really production ready"

Yes, I was a senior software engineer on the Ops Arch team, and a coworker of Chris Davis when he was at Orbitz, and both my team and the prod sysadmin/netops folks were the first users of it, anywhere in the world, debuting roughly with the Austin project, which was a big rewrite of the Orbitz/Cheaptickets codebase to support i18n & white label functionality. I can assure everybody it was used in production for a huge travel website at least as far back as the 2008 timeframe. And it stood up very well back then. I'd be surprised if it hasn't gotten even better since.

I worked at Orbitz as one of the very first users of Graphite, and yes Orbitz was the first corporate user of it, and it was used heavily in production as far back as say 2008. Graphite is (or at least was then) production ready. It does have documentation but it doesn't need much.

Anyone doing this at scale using Scribe/Flume with HDFS and Hive/Pig/MapReduce?


Have a link?

Man do I wish I had heard of Graylog2 or logstash before now. That would have eliminated days of hacking together my own two-bit implementation.

I found this interesting. Do you have any benchmarking of the speed of ES vs everything else?

We have a lot of in-house expertise in ElasticSearch and chose it for a few reasons: 1) It's easier than Lucene/Sphinx to set up. 2) Its clustering support works out of the box and is so easy to configure it's not funny.

ES is basically a usability wrapper around Lucene. I've heard that Sphinx is better for a single node configuration, it's faster and uses less resources, but clustering with Sphinx is apparently tricky.

The other competitor, which I have no experience with is Solr, but this write-up http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-s... gives an overview of the ES advantages over Solr.

I'm not a Full-Text Search expert, but a number of really smart people at my company evaluated a number of them for one of the most critical pieces of our production site and they chose ElasticSearch.

Ok. The one part I don't like in ES is being java and the trouble that usually goes with it (yeah, judging stuff like this is bad, I know :)

For logging, searching is the major item at large sites IMO (talking terabytes at least). When you're looking for all occurrences of "item x" over..

"the past week", it may take 1H

"the past month", it may take 10-30H

"the past year", uh, no, you don't do that.

So you gotta use ranges, but it's often hard to guess and you end up missing many log entries just because you don't have the time to search through them.

(obviously loading gigabytes of indexed data takes a while "physically speaking" anyway. I'm guessing ES can distribute the load tho, much like a web search engine does)

The vast majority of open source "big-data" infrastructure is in Java (Hadoop, HBase, Cassandra, Solr, Elastic Search etc). It works pretty well.

I'm not sure what your question is, but I've experimented with loading netflow data in Solr and I'm averaging sub-2 second query times. That's on a laptop, with a couple of minutes of netflow (around 10 GB).

With proper indexing your search response time shouldn't increase linearly with your data size.
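A toy inverted index shows why: a query looks up the posting sets for its terms and intersects them, instead of scanning every log line. Lucene-based engines such as Solr and ES are far more sophisticated; the function names and whitespace tokenization here are simplifications for illustration.

```python
from collections import defaultdict


def build_index(lines):
    """Map each whitespace-separated token to the set of line numbers containing it."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for token in line.lower().split():
            index[token].add(lineno)
    return index


def search(index, *terms):
    """Line numbers containing every term, found by intersecting posting sets."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

The cost of a query depends mostly on how many lines match the terms, not on how many lines were indexed in total. The indexing work is paid up front, once per line.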

Loading 10 GB from a traditional HDD takes more than 2s (that would be a 5 GB/s read speed. Nice hard drive.). Either your data is in RAM and you have a lot of RAM, or it's just not 2s, or it's not a 10 GB index.
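The arithmetic behind that objection, as a quick sketch. The 100 MB/s sequential read rate is an assumed ballpark for a spinning disk, not a figure from this thread, and the function names are made up.

```python
def required_read_rate_gb_s(data_gb, seconds):
    """Read rate needed to pull data_gb off disk within the given time."""
    return data_gb / seconds


def read_time_s(data_gb, disk_mb_s):
    """Seconds to read data_gb sequentially at disk_mb_s (1 GB taken as 1000 MB)."""
    return data_gb * 1000 / disk_mb_s


rate = required_read_rate_gb_s(10, 2)   # 10 GB in 2 s
full_scan = read_time_s(10, 100)        # 10 GB at an assumed 100 MB/s
```

Here rate comes out to 5 GB/s and full_scan to 100 seconds, which is why a 2 second answer implies the query is not reading the whole dataset from disk.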

And I'm talking 100 GB+ indexes ;-)

Obviously 2min of netflow data ain't much. I would want to see the result over 200h (or more) of netflow data, for example

No, querying the data takes less than 2 seconds. I can't remember the load time.

*Obviously 2min of netflow data ain't much*

Depends where you work...

I just checked, and it was 2 GB of netflow I tested on. That seemed small, so I looked a bit deeper and indeed I was only using a small fraction of our total netflow for that period. It was adequate for what I was trying, though.
