I'm a pretty experienced Solr developer, and I've played with Elastic Search etc, and I've been using Splunk for about a year.
The thing people miss about Splunk unless they know it is how good the search interface is. For example, the search language is roughly comparable to Lucene/Solr/ElasticSearch, but it also includes the ability to parse input files and present results graphically. No open source solution integrates all of that.
If you want to compete with Splunk (something I've thought about a few times), then you need to match that. I'd estimate two developers for a year to build out those features on top of Solr or ES.
We put multiple orders of magnitude more data than the free tier into Splunk, and it's still a lot cheaper than 2 developer-years.
It is true, though, that if the licensing was cheaper we'd put even more data into it.
It's cheap for application-specific logs where each line is relatively high value.
What's missing is a free as in beer and as in freedom solution that is decent. Mostly because it means we can all commit fixes/updates/etc to it. Including people who can't pay for a product (but are willing to pay for support) such as communities.
We have a couple of datacenters, so yes, we have more than a couple of servers.
We did a trivial test of Splunk at my last company. It's extremely expensive, and it's very easy to bump into its limitations. We were able to wreck the poor Splunk server with some rather ordinary queries into a dataset that shouldn't have been that big of a deal. We took those issues back to the company and never got a real answer.
Its popularity leads me to surmise that there is still a lot of money to be made in solving mundane problems. (Which is good news if you're a product-minded programmer)
Without knowing details of exactly what you are doing it's difficult to comment on your problems with queries. It's true that something like Solr gives you more control over the indexing process, so you can optimize it more for specific queries. Splunk tends to rely more on saved searches (and the new search acceleration feature).
What are you storing the data with...the etchings on wings of fairies?
> Some blather about Splunk's "saved searches"
We talked to the company, explored every avenue. Our volume of data simply overwhelmed it. (Data from three Apache servers. Lol.)
I am 100% certain you know less than Splunk-The-Company, so our conversation is done here.
It's on a SAN. We'll probably migrate to local disks at some point. The pricing is typical SAN pricing.
*Our volume of data simply overwhelmed it. (Data from three Apache servers. Lol.)*
Yeah, well we do a lot more data than that.
 Take a look at the NetApp, Dell & EMC prices on http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-h..., or look at http://serverfault.com/questions/76725/whats-the-nominal-cos... and you'll be in the right price range.
The webserver has no credentials for accessing the backup server. Instead the backup server accesses the webserver.
This strategy places higher trust on the backup server, but the backup server is easier to defend -- it only needs connectivity to a small number of other IPs.
EDIT: lists on HN- doin it rong
Because otherwise, since AFAIK rsyslog doesn't support DTLS, it means unencrypted log transmission. (For RELP it also means running stunnel anyway, which does support DTLS and may be a solution.)
Follow-up: is it _honestly_ worth the licensing to get those features?
Thanks for posting this! I've been eyeballing logstash for a while but had not run across Kibana's UI. More fun reading ahead.
(Another happy Splunk user here)
ES is basically a usability wrapper around Lucene. I've heard that Sphinx is better for a single-node configuration (faster, and it uses fewer resources), but clustering with Sphinx is apparently tricky.
The other competitor, which I have no experience with, is Solr, but this write-up http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-s... gives an overview of the ES advantages over Solr.
I'm not a Full-Text Search expert, but a number of really smart people at my company evaluated a number of them for one of the most critical pieces of our production site and they chose ElasticSearch.
For logging, searching is the major item at large sites IMO (talking terabytes at least). When you're looking for all occurrences of "item x" over:
"the past week", it may take 1 hour
"the past month", it may take 10-30 hours
"the past year", uh, no, you don't do that.
So you've got to use ranges, but it's often hard to guess, and you end up missing many log entries just because you don't have the time to search through them.
(Obviously loading gigabytes of indexed data takes a while "physically speaking" anyway. I'm guessing ES can distribute the load, though, much like a web search engine does.)
I'm not sure what your question is, but I've experimented with loading netflow data in Solr and I'm averaging sub-2 second query times. That's on a laptop, with a couple of minutes of netflow (around 10Gb).
With proper indexing, your search response time shouldn't increase linearly with your data size.
And I'm talking 100GB+ indexes ;-)
Obviously 2min of netflow data ain't much. I would want to see the result over 200h (or more) of netflow data, for example.
> Obviously 2min of netflow data ain't much
Depends where you work...
I just checked, and it was 2Gb of netflow I tested on. That seemed small, so I looked a bit deeper and indeed I was only using a small fraction of our total netflow for that period.
It was adequate for what I was trying, though.