
Ask HN: How do you analyze logs? - hckrt_
Analyzing logs is a huge part of a software problem solver's job. What kind of tools and techniques do you use to be more effective in your job?
======
fscof
One thing we've done that's greatly improved the effectiveness of our API logs
is creating a unique context identifier string for each API call and passing
it back in the response headers of the call. This lets you copy the string
from the browser and grep the logs to immediately find the call in question,
and whether an error occurred.
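A minimal shell sketch of the idea - the header name, log path, and log format here are illustrative assumptions, not fscof's actual setup:

```shell
# Generate a unique ID per request (a real service might use uuidgen),
# log it with every line, and echo it back in a response header such as
# X-Request-Id so the client can see it in the browser's network tab.
log=/tmp/api.log
request_id="req-$$-$(date +%s)"

# Server side: every log line for this call carries the ID.
echo "$(date -u +%FT%TZ) request_id=$request_id GET /users 500" >> "$log"

# Later: paste the ID copied from the response headers and grep for it.
grep "request_id=$request_id" "$log"
```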

------
jedberg
When I cared about logs on individual servers, I wrote a program to parse them
to make it easy to find what I want, and I wrote it as a command line utility
so that it could be used in combo with cat, grep, awk, sed, etc. It's on my
github:
[https://github.com/jedberg/quickparse](https://github.com/jedberg/quickparse)

------
dmourati
Splunk/ELK. I have used Splunk since its inception (I was one of the first
paying customers) and enjoy its many features and integrations (e.g. AWS
CloudTrail, Nagios, and anomaly detection). In the past few years, I've come
to know and like the ELK stack: Elasticsearch, Logstash, and Kibana. The open-
source approach has some nice properties as well. It's on you to get the logs
ingested, but generally that is remote syslog (rsyslog, syslog-ng, or
equivalent) and Logstash with Redis. The docs on this integration are not
super strong, but with a little hacking you can get it working. The feature
set is smaller in this stack and the UI is not nearly as nice. The savings are
huge, though, especially as log volume goes up. Splunk also has a free service
targeted at developers called Splunk Storm. It's good for a proof of concept
and easy to set up without any hardware requirements, as it runs on AWS.

~~~
kiyoto
Some people find EFK (Elasticsearch, Fluentd, Kibana) to be another compelling
alternative to Splunk. (Disclaimer: I am one of the maintainers of Fluentd.)

[http://docs.fluentd.org/articles/free-alternative-to-splunk-by-fluentd](http://docs.fluentd.org/articles/free-alternative-to-splunk-by-fluentd)

~~~
otterley
What makes Fluentd better than syslog-ng?

~~~
kiyoto
It is not "better" but different.

1\. Easier to extend than syslog-ng if you have a modest knowledge of Ruby.

2\. Easy to configure file- and memory-based buffering and failover.

3\. Advanced filtering out of the box.

4\. Rich plugin ecosystem with 300+ plugins.

At least that's what I've heard from the users who switched from syslog-ng to
Fluentd. I am happy to learn more about what makes syslog-ng great since I've
never used it seriously myself =)

------
kjhosein
Once Splunk was about to break the bank, we abandoned it and started looking
for something in the open-source world.

We toyed with, and pretty much failed at using, Graylog2. Although it has been
coming along steadily in features and stability, we just found that the
interface - although pretty - was not intuitive to us: lots of links and
multi-click scenarios to get to what you want, and creating filters and
streams was difficult and prone to failure.

After watching a couple of very compelling presentations by Jordan Sissel
(Logstash founder), we decided to test it out. Once I realized that creating a
filter (Grok rocks!) that searched for a term and reorganized the log to my
liking only took a couple of hours, I was sold.

Another selling point for us was that Logstash has over two dozen ways to suck
logs in, including the usual suspects - syslog, files, TCP, UDP, and *mq. You
can also perform a bunch of log parsing on the client (i.e. the servers with
the logs) before sending the logs to your central ELK server/cluster.
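For flavor, a minimal Logstash config sketch of that shape - the port, grok pattern, and field names are made-up examples, not kjhosein's actual filters:

```
# Listen for syslog, pull a field out with grok, ship to Elasticsearch.
input {
  syslog { port => 5514 }
}
filter {
  grok {
    # e.g. extract "took 42ms" into a duration_ms field
    match => [ "message", "took %{NUMBER:duration_ms}ms" ]
  }
}
output {
  elasticsearch { host => "localhost" }
}
```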

At the end of the day, there is nothing magical about any of these systems.
You alone know your logs best and have to figure out how to read/parse/search
them. Our switch to Logstash from Graylog2 was our failing, not Graylog2's.

------
whiskykilo
We use the heck out of Graylog2. We have a dual-datacenter setup with multiple
Elasticsearch instances. A coworker of mine did a fantastic job writing up our
setup:

[http://secopsmonkey.com/migrating-graylog2-servers.html](http://secopsmonkey.com/migrating-graylog2-servers.html)
[http://secopsmonkey.com/migrating-graylog2-servers-part-2.html](http://secopsmonkey.com/migrating-graylog2-servers-part-2.html)
[http://secopsmonkey.com/migrating-graylog2-servers-part-3.html](http://secopsmonkey.com/migrating-graylog2-servers-part-3.html)
[http://secopsmonkey.com/migrating-graylog2-servers-part-4.html](http://secopsmonkey.com/migrating-graylog2-servers-part-4.html)
[http://secopsmonkey.com/migrating-graylog2-servers-part-5.html](http://secopsmonkey.com/migrating-graylog2-servers-part-5.html)
[http://secopsmonkey.com/migrating-graylog2-servers-part-6-lessons-learned.html](http://secopsmonkey.com/migrating-graylog2-servers-part-6-lessons-learned.html)

------
jrro
I've only maintained a small number of servers, but I've found a good solution
in what I'll call the GEL (Graylog2 - Elasticsearch - Logstash) stack. It's
been some time since I last used Graylog2, but I recall that it was somewhat
lacking in the pretty-charts-and-graphs department (though that may have
improved recently) and that the search functioned beautifully.

~~~
toomuchtodo
I'm surprised Graylog2 isn't mentioned more here. We've got tens of terabytes
of logs in Graylog2, and I couldn't imagine not using its streams, alerts, and
search functionality. It's become a core part of our alerting and monitoring
infrastructure.

------
buro9
logstash + elasticsearch + kibana

That's for personal projects, the CloudFlare setup is a little more complex
and perhaps one of the data team would be best answering that... if you're
interested then I can ping them to see if there's a volunteer for a blog post
describing how we do logs at scale.

~~~
johnbellone
An ELK stack is definitely a lot easier and more straightforward to set up.
Since we need to support a dozen or so teams with varying performance
considerations, we found that Logstash left a lot to be desired. In the end,
Fluentd performed well and gave us a lot of flexibility.

~~~
provost
I'd like to hear your view from the trenches regarding Logstash vs. Fluentd -
specifically, why is it better for 12+ teams? I'm having to make a similar
decision myself and would enjoy your insight! Thanks.

~~~
kiyoto
(Disclaimer: I am a Fluentd maintainer)

Here are some comparisons out in the wild (note: neither author is a Logstash
or Fluentd maintainer, afaik):

\- [http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/](http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/)

\- [https://blog.deimos.fr/2014/05/13/logstash-vs-fluentd/](https://blog.deimos.fr/2014/05/13/logstash-vs-fluentd/)

Also, if you have any questions or doubts about Fluentd, please feel free to
email me at kiyoto@treasure-data.com.

------
gargarplex
[https://papertrailapp.com/](https://papertrailapp.com/)

------
tstack
[http://lnav.org](http://lnav.org) \-- a tool I wrote and use every day to
view/analyze the logs of the software I work on.

------
ozgey
Ozge, co-founder of topLog (toplog.io) here. Perhaps my opinion is slightly
biased, but here is what we have learned over the last couple of years.

\- ELK is great (in fact, we use E+L under the hood), but you really need to
know what you're doing with it, and you need to spend some time configuring
things while putting it together. Do you have that kind of time? Maybe, maybe
not...

\- Every available tool lets you search and create alerts for monitoring, so
the analysis is always on you. This still takes a lot of manual search time
during troubleshooting.

\- What if pattern and behaviour detection on your logs could be done
automatically? Well, it can be. And that saves you a good amount of time
compared to creating regexes and following the trails to find the root cause.

I would love to hear your thoughts on automated analysis and anomaly detection
on logs. If you have to give it a keyword and a specific time frame to search
for anomalies, is that real anomaly detection, or just an improved search?

~~~
rand0muid
You should give Echofish a try. It's worked wonders on our network with its
"whitelisting of normal behaviour". You won't believe the things you'll
discover with this approach.

EDIT: The most fascinating aspect for me is that Echofish is geared towards
the actual log entries, rather than statistical analysis, in order to
automatically detect anomalies in your log activity.

~~~
plara
Is echofish geared towards network activity norm / abnorm or general logs
(syslog, app / dev logs, etc)?

Sounds cool.

~~~
rand0muid
Well, its approach (quoting its project page) is pretty simple:

Echofish is a purpose-built solution for filtering & monitoring of syslog
activity. By whitelisting regular messages through the web UI, the
administrator can instruct the log processing mechanism to create alerts only
for anomalies (irregular messages).

...and actually, it can do lots more once you read the built-in help (such as
distribution (using BGP) of IP blacklists, consisting of IP addresses
collected through syslog activity).

TL;DR: It's geared towards filtering noise from logs. This also means you
could have another daemon reporting network activity through syslog, while
Echofish acts as your noise filter.

------
bluedino
Did monitoring logs with Bayesian filters ever catch on? They are very good at
finding things that are "off".

~~~
thrownaway2424
By "monitoring" do you mean anomaly detection?

~~~
plara
That's part of how our engine works - part of it, anyway. (Full disclosure:
co-founder of toplog.io.)

------
espenwa
To the people suggesting ELK, I just want to ask: have you actually used it in
production? Like for real bug hunting and investigating support requests?

As much as we absolutely love Elasticsearch for our other indexing needs, we
find it quite hard to get the LK part of the stack to deliver as promised.
Kibana may serve up nice graphs and charts, but when you need to drill down
into a large amount of log data, we often feel like we're losing both overview
_and_ detail.

It might very well be that we are to blame and that we are just doing it wrong
(tm) - but I would love to hear how other people are leveraging the ELK stack
in production environments.

~~~
dreamdu5t
We use Elasticsearch and Kibana in production for real bug hunting and support
requests. Logstash was too frustrating to deal with, so we wrote our own
simple wrapper around an open-source Elasticsearch client library to do the
logging ourselves.

We log every request (everything but the body, usually) and response. If an
error occurs, it's logged as part of the request. We can practically replay
actions taken by users and easily drill down to the exact requests pertaining
to an error.

------
tmhedberg
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)

------
sikhnerd
awk, grep, sed, ag, cut, sort, etc. for most of the stuff. Elasticsearch,
Logstash, and Kibana for more complex queries.

~~~
emmelaich
We use splunk but there's something to be said for the low-tech approach for
ad-hoc queries.

A great little tool is 'since' \- a stateful tail.
[http://welz.org.za/projects/since](http://welz.org.za/projects/since)

    
    
      since is a unix utility similar to tail.
      Unlike tail, since only shows the lines
      appended since the last time. It is useful
      to monitor growing log files.
    

It's in the usual yum/apt repos as well as homebrew for Mac.

It's a bit hard to websearch for because 'since' is a common word.
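For the curious, the core trick is just remembering a byte offset between runs. Here's a rough, self-contained approximation of the idea (the state-file location is my own choice for illustration, not how since actually stores it):

```shell
# since_lite FILE: print only the lines appended since the last call.
since_lite() {
  log=$1
  state="/tmp/.since_offset.$(basename "$log")"
  last=0
  [ -f "$state" ] && last=$(cat "$state")
  size=$(wc -c < "$log" | tr -d ' ')
  # If the file shrank, assume it was rotated and start from the top.
  [ "$size" -lt "$last" ] && last=0
  tail -c +"$((last + 1))" "$log"   # bytes after the saved offset
  echo "$size" > "$state"
}
```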

------
cle
I regularly search logs across a huge fleet of hosts (thousands). I have a
script that will send an arbitrary command in parallel to each host and return
the output in a JSON file.

Once I get the logs I'm interested in, it's usually a straightforward
combination of jq, grep, sed, awk, cut, sort, uniq, xargs, etc. If I need to
do some fancy queries on the data, I have another script that will parse the
logs and load them into a SQLite db.
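A sketch of that last step - the log format, table name, and columns here are invented for illustration, assuming the sqlite3 CLI is available:

```shell
# Parse "TIMESTAMP LEVEL MESSAGE" lines into CSV, load into SQLite,
# then run ad-hoc SQL over them.
printf '2015-02-02T01:00:00 ERROR timeout\n2015-02-02T01:00:05 INFO ok\n' > /tmp/app.log
awk '{ print $1 "," $2 "," $3 }' /tmp/app.log > /tmp/app.csv

rm -f /tmp/logs.db
sqlite3 /tmp/logs.db <<'EOF'
CREATE TABLE logs (ts TEXT, level TEXT, msg TEXT);
.mode csv
.import /tmp/app.csv logs
EOF

sqlite3 /tmp/logs.db "SELECT count(*) FROM logs WHERE level = 'ERROR';"
```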

~~~
brown9-2
What are you using to execute commands on thousands of hosts in parallel?

~~~
cle
Just forking off processes that SSH into the hosts and write to temp files,
which are picked up by the main process. It's gotten pretty elaborate, but the
basics are pretty simple.
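A stripped-down sketch of that fan-out pattern. To keep it self-contained, a local echo stands in for the real ssh call, and the host names are made up:

```shell
outdir=$(mktemp -d)
for host in web1 web2 web3; do
  # Real version: ssh "$host" 'grep ERROR /var/log/app.log' > "$outdir/$host" &
  echo "$host: pretend log output" > "$outdir/$host" &
done
wait                # block until every forked job finishes
cat "$outdir"/*     # the main process picks up the per-host temp files
```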

------
owlish
As others have said, the classics - awk, grep, cut, sort, uniq - and scripting
languages (Perl/Ruby for on-the-fly one-liners) are great for log analysis.
One additional tool I've found particularly useful (three, actually) is
zgrep/zcat/zless. You'll often be searching through archived gzipped logs, so
it's nice to be able to work with the files without having to decompress
everything first.
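For example (with a made-up log file):

```shell
printf 'GET /a 200\nGET /b 500\n' > /tmp/access.log
gzip -f /tmp/access.log               # produces /tmp/access.log.gz

zgrep ' 500' /tmp/access.log.gz       # grep without decompressing first
zcat /tmp/access.log.gz | wc -l       # pipe it around like a plain file
```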

------
rongster
Lots of companies use SumoLogic
[https://www.sumologic.com/signup/](https://www.sumologic.com/signup/). The UI
is very similar to Splunk. But having used both Splunk and SumoLogic, I
personally think that Sumo is superior in many ways.

1\. It requires very little hacking and setup to work (cough ELK, cough
Splunk) since Sumo is completely cloud-based. It literally takes 2 minutes to
sign up and 5 minutes to download and configure some log collectors; then,
voila, you're ready to send data and search. I also believe you can tell Sumo
Logic to grab logs directly from S3, for example, if you're running everything
on AWS.

2\. SumoLogic is pretty easy to use (it's basically cloud-based grep/awk) and
has some really cool features that make Splunk feel clunky in comparison.
Parsing and transposing data for graphing is really simple. Also, little
things like auto-suggesting sources/hosts while you're typing a query make the
experience much smoother than jumping around tabs copy/pasting shit.

3\. If you start to generate a lot of logs, and I mean a metric fuckton of
logs from 1000s of servers, Splunk Storm will most definitely not be able to
help you. In-house Splunk / ELK clusters will need to be carefully sized (just
google ELK sizing).

As software developers, we have enough on our plates that it really pays to
use tools that help, rather than ones that make you want to throw up your
fists and curse. KISS.

~~~
TomFrost
Coincidentally, I had a demo call with Sumo today. I was very impressed,
except for one point that makes Splunk the clear winner at the moment: I can
send structured logs to Splunk and it automatically finds the keys and allows
me to query on them immediately.

Meaning, I can send:

    2015-02-02 01:00:00 event="Product sold" price=5

And with zero configuration in Splunk, I can now query:

    event="Product *" price>2 | stats sum(price)

And in the next iteration of my app, I could add 30 more key/value pairs to
that message and could query on the rest of them just the same, no
configuration. It makes development incredibly rapid to be able to instantly
report on any metric anyone on my team logs out, debug-related or otherwise,
without having to maintain some master list of every key in every log message
in every service we write.

I was floored on my Sumo call today when I was told that wasn't the case in
that product yet. It seems like such a basic feature -- and it's why many
products have switched entirely to JSON-based logs. Have you discovered a
workaround, or do you find it to be as cumbersome as I'm anticipating?

~~~
JamesHollinger
Sumo Logic currently has the ability to extract known fields on ingest, making
them available for searches, much like the Splunk query provided above.
Dynamic fields, such as new KVPs that are logged out are able to be pulled out
in the query with one extra step, as follows:

    | kv infer "event","price" | sum(price) by event | where price > 2

The kv operator refers to key-value pairs. There is also a json operator,
which functions the same way.

------
lmedinas
Personally, I use Emacs with occur-mode for filtering text; when it finds the
pattern, it can jump between instances quickly. Sometimes I also use regexes
with occur to find multiple items.

Notepad++ also has an Analyze plugin, which I recommend for complex stuff if
you don't like Emacs.

~~~
lord_quas
I second emacs occur.

------
h1fra
Kibana provides an excellent way to track logs.

We currently handle 100+ GB/day of heavy documents (roughly 100 items per
document) on our current setup, and it's probably designed to handle way more.

Dashboards are constantly open on 50+ screens, and we also use them to track
MySQL, mailing, internal stats...

------
amatai
Used papertrailapp.com, but as the number of servers grew, it became
expensive.

Moved to EL + Kibana, but I'm not liking the interface yet, and it doesn't
seem to have 'tail -f' kind of functionality.

~~~
enjo
We use Papertrail. I suppose expense is relative, but it's absolutely worth it
to us. They've really nailed log search, which is what's really important to
us.

------
gunjan2307
Kafka for log aggregation; sed, awk, grep, tail, and other bash utilities for
analysis. Extremely fast at over 5k lines per second, and totally maintenance-
free running on a 2 GB AWS instance.

------
Thaxll
ELK is easy to set up, but when you really need to analyse logs it can be
painful. I think the real problem is Kibana, which is just not good enough for
that task:

\- no log coloring

\- a lot of bugs / weird refresh behavior

\- no auth

~~~
provost
Do you know of a better alternative to Kibana for an EL_ stack?

------
jvehent
At Mozilla Opsec, we use (and write) MozDef:
[https://github.com/jeffbryner/MozDef/](https://github.com/jeffbryner/MozDef/)

~~~
rubiquity
Amazing name

------
gesman
Splunk. Here's my latest Splunk log analytics app:

[http://www.mensk.com/#prettyPhoto/0/](http://www.mensk.com/#prettyPhoto/0/)

------
dreamdu5t
Ship JSON to Elasticsearch, visualize with Kibana, and back up to S3/Glacier.

We use found.no and qbox.io for managed hosting of Elasticsearch clusters.

------
thoufno
Graylog2. They have worked around the shortcomings of Elasticsearch for log
management (in the end, it's a Lucene-based full-text search engine for
general-purpose tasks), and the 1.0 that is about to go GA soon has crazy
stability.

[https://www.graylog2.org/](https://www.graylog2.org/)

------
mattkrea
LogEntries.com

~~~
kawsper
We tried Logentries, but their agent was a terrible Java application that was
hard to get working right.

The list of files was saved on their service (rather than in a text file on
the server), and the names of our servers were also guessed by their service,
which made it hard for us to add and maintain servers.

I think they should build a better agent that embraces UNIX more and can be
configured through a local configuration file. Their platform seems nice, but
we weren't able to use it, sadly.

~~~
tparso
Hey - you might want to check the agent docs here, which cover a local
configuration file:
[https://github.com/logentries/le#configuration](https://github.com/logentries/le#configuration)

So the logs being followed are actually configured in a text file on the
servers. This makes it super simple to deploy via Chef/Puppet in large-scale
environments.

------
falcolas
Grep, awk, and sed for the easy stuff; Logstash/Elasticsearch/Kibana for the
harder stuff.

------
mgrassotti
logstash

------
tobyc
fluentd -> elasticsearch/kibana

Works pretty damn well.

------
rusbus
www.sumologic.com

~~~
sshillo
My work recently switched to Sumo and I love it. If you've used Splunk,
Logstash, and Elasticsearch + Kibana previously, you'll think Sumo is the
best. It has the right amount of power and simplicity, plus great
documentation.

------
bra-ket
grep

------
thrownaway2424
How many logs are we talking about, and produced at what rate? These are the
key questions. If you have tons of logs and might have to make a full pass
over all of them, the last thing you want to do is centralize them on some
kind of log host with lots of disks and few CPUs. But if you have a little or
a moderate amount of logs, you may be able to get away with a single host and
some xargs -P grep type of thing.
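That last suggestion might look something like this (file names invented for illustration; note that grep exits non-zero for files with no match, which xargs will report):

```shell
mkdir -p /tmp/logs
printf 'ok\nERROR boom\n' > /tmp/logs/a.log
printf 'ok\nok\n'         > /tmp/logs/b.log

# -P 4: up to four greps at once; -n 1: one file per grep; -H: show names.
ls /tmp/logs/*.log | xargs -P 4 -n 1 grep -H ERROR || true
```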

------
imaginenore
Logstash/kibana (free, open source)

Splunk (good, but ridiculously expensive)

~~~
michaelmior
You can do quite a lot with Splunk Storm[0] for free.

[0] [https://www.splunkstorm.com/](https://www.splunkstorm.com/)

~~~
nilic
And if you prefer a non-cloud option, there's a Splunk Free license [1] which
allows indexing of up to 500 MB of logs per day.

[1]
[http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/More...](http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/MoreaboutSplunkFree)

------
firefoxNX11
Splunk

