Tips for analyzing logs (jvns.ca)
300 points by historynops on Dec 13, 2022 | 134 comments



One of my pet peeves is "The Useless Use of cat Award". Someone awarded it to me as a teenager in the late 90s and I've been sore ever since.

Yup, it's often a waste of resources to run an extra 'cat'. It really demonstrates that you don't have the usage of the command receiving the output completely memorized. You know, the thousand or so commands you might be piping it into.

But, if you're doing a 'useless' use of cat, you're probably just doing it in an interactive session. You're not writing a script. (Or maybe you are, but even still, I bet that script isn't running thousands of times per second. And if it is, ok, time to question it).

So you're wasting a few clock cycles. The computer does a few billion of those per second. By the time you've explained the 'useless' use of cat to someone, the time you wasted telling them why they're wrong is greater than the total time their lifetime usage of cat was ever going to waste.

There's a set of people who correct the same three homophones that get used incorrectly, but don't know what the word 'homophone' is. (Har har, they're/their/there). I lump the people who are so quick to chew someone out for using cat in with that same batch: what if I just want to use cat because it makes my command easier to edit? I can hit up-arrow, jump to the front of the line, and change it real quick.

Sorry. I did say, it is a pet peeve.


I do “useless” use of cat quite often because, in my brain, the pipeline naturally starts with “given this file”, so it makes the pipeline more consistent, e.g. `cat f | a | b | c` rather than `a < f | b | c` where one must start with the first “function” rather than with the data. I see starting with `cat` as analogous to the `->` thread macro in Clojure, the `|>` pipe in Elixir, and the `&` reverse application operator in Haskell. If bash permitted putting the filename first, I’d stop using `cat`; alas, it does not.


There’s this. Though it would look nicer with another |.

  <f a | b | c


That’s actually not half-bad. I think I’ll use that, thanks!


> One of my pet peeves is "The Useless Use of cat Award". Someone awarded it to me as a teenager in the late 90s and I've been sore ever since.

Wear it as a badge of honor! It marks you as a person who puts clarity, convenience and simplicity before raw performance. I can't think of a single case when that bit of performance matters.

Needless to say, I'm happily using cat (uselessly) myself and have no plans to convert.


> It marks you as a person who puts clarity, convenience and simplicity before raw performance.

This. As noted, even in scripts it usually makes more sense since the result is a pipeline that's easier to read, annotate and modify.

Case in point:

  cat file.txt \
  | sed '1s/^\xEF\xBB\xBF//' `# Strip UTF-8 BOM at the beginning of the file` \
  | ...
Specifying a file name would only make the "black-magic-line" of `sed` more complicated while also making it more complicated to modify the pipeline itself. Now, if I want to skip that step to test something, I don't have to figure out if/how the next command takes an input file (or, ironically, replace `sed` with `cat`).


> It really demonstrates that you don't have the usage of the command receiving the output completely memorized.

No, it demonstrates that you don't have redirection memorized, and don't know that you can place it anywhere in the command line, including on the left.

> So you're wasting a few clock cycles

Keystrokes too:

   cat x | cmd
   < x cmd
It's also possible that cmd may detect that its standard input is connected to a real file, and take advantage of being able to lseek() the file descriptor. For instance say that x is an archive and cmd is an extractor. If cmd needs to skip some indicated offset to get to the desired payload material, it may be able to do it efficiently with an lseek, whereas under cat, it has to read all the bytes and discard them to get to the desired offset.


I still prefer to have cat there because it is interchangeable with other output-producing commands and it can handle globs. In an interactive session I iterate on the last command many times, and if I decide to filter stuff can just replace cat with grep or if I decide to pull from a directory of files can add a glob, if compressed it turns to zgrep or zcat etc. With redirects I'd have to change the structure of the pipeline which wastes mental effort. IMO.


> the time [used to] explain the 'useless[ness]' .. of cat to someone .. is greater than the total time that their lifetime usage of cat was going to waste

If you look for situations like this they are surprisingly common.

> pet peeve


I don't often (err, ever…) reply without reading further but this time I must, because: I've never heard this turn of phrase "useless use of cat" and it turned my brain upside-down for a moment, because: "interactive" is precisely how I learn and do and I suppose it was a nice reminder that sometimes really big (read: useless) things are actually kinda small (useful) and vice versa.


Well, you waste an entire fork and exec, so I believe you are underestimating the time by a few orders of magnitude. Also, it's almost always grep following the cat, so it's not much to memorize.

But it's well worth wasting a process to have a nice pipeline where each command does a single thing so you can easily reason about them.


It's a lot more than the few extra cycles to spin up the process - it's also an extra copy of all the data. Usually that's also not much, but occasionally it's everything, as the consuming program can seek in a file, but not in a pipe, so might otherwise only need a tiny bit of the data.


It totally makes sense, after all the electron apps/docker containers/statically linked utilities/microservices running on my machine, needlessly running cat might be the straw that breaks the camel's back.


not counting the subpixel aliasing for your terminal font ;)


> Sorry. I did say, it is a pet peeve.

You're not alone. I disable the warning in shellcheck. I've been uselessly cat'ing since the 90s, like you.


Don’t people like Rob Pike argue for `cat |` since it fits so orthogonally with the rest of the pipeline?


It's a perfectly valid pet peeve, especially with things like `tr` being weird about accepting files as input.

This works:

  cat file.log | tr 'a' 'b'
This doesn't:

  tr 'a' 'b' file.log


   tr a b < file.log

   < file.log tr a b

The pattern "cat file | command" introduces an unnecessary cat process to achieve the effect of "< file command".

It is not called "useless use of cat" (UUoC) without justification.


It is not called "useless UUoC comment" (UUUoCC) without justification. ;)

From a personal taste perspective, I'm not a fan of either. Having a floating "<" at the start of a line just isn't my cup of tee. Not dealing with explicit stdin/stdout just makes my code easier to read. And especially considering the post's advice is about reading logs, a lot of the post is very likely built around outage resolution. Not the time I want to be thinking "oh yeah `tr` is special and I need to be explicit" -- nah, just use cat as a practice. And no, I'm not going to write `grep blah < filename` as a practice just because of commands like `tr` being weird.

But honestly, if it's such a big deal to have a cat process floating around, there are probably other things you should be concerned about. "Adds extra load to the server" points to other problems. If perf matters, CPU shielding should be used. Or if that's not an option, then sure, there's some room for trifling around, but if you're at a point where you're already running a series of pipes, a single cat command is beans compared to the greps and seds that come after it.


> cat

> pet peeve

nice


My biggest quality of life improvement for understanding logs has been lnav (https://lnav.org/) -- does everything mentioned in this post in a single tool with interactive filtering and quick logical and time based navigation.


Looks great, but I struggle to get this working with the two main log sources on my system:

1) systemd logs (binary data stored at /var/log/journal)

2) docker logs (from various containers running on the host)

Any hints on getting this working? I don't see this covered in the docs (https://docs.lnav.org/en/latest/formats.html).


for systemd logs this works for me

    journalctl -o json | lnav


Thanks for linking to this!

Until now I thought logview.el[0] was the bee's knees, but now I can feel feature envy setting in. There are some seriously powerful ideas listed on the lnav page, and it's also the first time I saw SQLite virtual tables used in the wild.

--

[0] - https://github.com/doublep/logview


At my current company, we're using pdsh with tail -f piped into sed to follow logs from multiple remote servers at the same time, label them and colorize them. Works okay. Decent solution without needing to install much other software. Not my favourite because it doesn't deal well with SSH sessions timing out and leaves some tail -f processes hanging, and some other quirks. But out of laziness and risk of breaking things in prod, we haven't tried experimenting with much else.
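Roughly this shape, for anyone curious (hostnames, log path and colours are made up for illustration; pdsh prefixes each output line with "hostname: ", which is what the sed keys on):

    pdsh -w web01,web02 'tail -f /var/log/app/app.log' \
      | sed -u \
          -e $'s/^web01:.*/\e[31m&\e[0m/' \
          -e $'s/^web02:.*/\e[32m&\e[0m/'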


lnav has some support for tailing files on remote hosts -- https://lnav.org/2021/05/03/tailing-remote-files.html

It works by uploading a stub that is an "actually portable executable" to the remote over ssh. lnav then talks to the stub to sync files back to the local host. I don't know the scale of the hosts or files, but if they're not too big, it might work fine.


Run that in a tmux session. Problem solved.


Huh, I almost posted a duplicate recommendation. My only complaint with lnav was that it had to be built from source on Linux and the build was frigging huge. Apparently they have a pre-compiled linux-musl binary now.


> that it had to be built from source on Linux

Seems to be packaged quite widely https://repology.org/project/lnav/versions


I used nix when I had to make lnav work on some machines where this would have been difficult.


I love lnav and use it constantly, but it crashes a lot. I do wish there was something like lnav that was a little simpler to use, and written in a more resilient way that crashed less. I can cut lnav some slack for the crashes because identifying and parsing arbitrary log formats seems like a messy problem. Still it shouldn't crash 1/3rd of the time I use it.


Sorry for the crashes :( I've been trying to improve its internals more than adding features as of late. If you haven't already, please file bugs on github and/or submit crash logs to the mailing list. (I've taken a break from working on it lately. So, if you've done that and I haven't gotten back, I apologize.)


No worries. It's a great tool and I'd be very happy to make reports and share crash logs.


Yes! lnav (https://lnav.org) is phenomenal. Embedded SQLite... easily scriptable... OOB log formats galore, or define your own... it's a mini ETL powertool that scales to at least a few million rows and runs in your terminal. Maintainer's a friendly dude, too.


Interesting! Do you know how it compares to multitail? (https://www.vanheusden.com/multitail/) They look very similar.


I don't think multitail really understands logs like lnav does, it's just following the last lines in the file. For example, if you try to follow multiple files in multitail like so:

    $ multitail /var/log/install.log -I /var/log/system.log
You get a view with the tail from one file followed by the tail from the other, they are not collated by timestamp. In contrast, if you do the same thing in lnav:

    $ lnav /var/log/install.log /var/log/system.log
You will get a single view with all of the log messages from both files and they're sorted by their timestamps. Here is all of what lnav is doing:

  * Monitoring files/directories for updates
  * Decompressing files/archives
  * Detecting the log format for each file
  * Creating SQLite vtables that provide access to log messages
  * Building an index of all the log messages in all the files so
    you can jump to a certain point in time or to the
    next/previous error message.
  * Displaying all log messages with syntax highlighting


I migrated from multitail to lnav. Turned out to be a no-brainer.

I second the above, just one pain point with multitail to add. I often page/search/filter in the scrollback buffer (I typoed "bugger" - Freudian slip?) and in multitail the scrollback is a separate window with a frame and everything, which is a pain (copying whole lines using mouse includes the frame, ugh). The filtering/searching being a separate pain.

One thing I used in multitail, and I'm not sure I wholly migrated it to lnav, was log file syntax highlighting using regexes.


Our Easy Data Transform software is intended for data wrangling and desktop ETL. But it has lots of features useful for browsing and manipulating log files, including:

* powerful filtering (including with regex)

* smooth scrolling of millions of rows of data

* support for csv, text, xml, json, Excel and other formats

* available for Windows and Mac


Came here to say the same - massive props to the lnav devs.


One thing I've done to identify infrequent log entries within a log file is to remove all numbers from the file and print out the frequency of each resulting line. Basically it just helps to disregard timestamps (not just at the beginning of the line), line numbers, etc.

  cat file.log | sed 's/[0-9]//g' | sort | uniq -c | sort -nr
This has been incredibly helpful in quickly resolving outages more than once.


Brilliant hack. I've used just about all the tricks from the blog and many of the comments here, but never this one. I've stripped timestamps for sure, but never considered all numerics. Nice one!


I love this tip for all kinds of diffing. I'm so sure I'm going to use it that I already assigned it an alias to strip numbers from whatever's on my Mac clipboard:

    alias numberless="pbpaste | sed 's/[0-9]//g' | pbcopy"


It's probably worth exploring making that a frequency-bucketing pipe that dashboards N-minute intervals, so that operators can see any abrupt changes.
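A rough sketch of the minute-level version, assuming timestamps like "2022-12-13 14:05:22" at the start of each line (the bucket is just the first 16 characters):

    awk '{ bucket = substr($0, 1, 16); line = $0; gsub(/[0-9]/, "", line); print bucket "\t" line }' file.log \
      | sort | uniq -c

That gives a count per (minute, stripped-line fingerprint) pair, which is easy to eyeball for abrupt changes or to feed into something that plots them.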


My tips:

1) Fuck grep, use ripgrep, especially if you have to scour over an entire directory.

2) Get good with regex, seriously, it will shave hours off your searching.

3) For whatever application you are using, get to know how the logging is created. Find the methods used where said logs are made, and understand why such a log line exists.

4) Get good with piping into awk if needed if you need some nice readable output.


Piping into AWK feels like a misuse of the tool in all but the simplest of cases. Don't forget that you can write complete AWK scripts and invoke them from a file!


I use ack for this purpose, because the color filtering options are so good.

https://beyondgrep.com/


Every time I've done research (twice) ripgrep comes out on top.

Usually it's something like rg > ag > ack > grep and that's overall testimonial, not just speed.

What am I missing?


I started using ack about 14 years ago, so there's a lot of inertia there. It is slower than the others, but it was so much faster at the time for searching through source trees than grep.

For log colorizing, I make use of the --color-match and --passthru arguments, like https://powdahound.com/2009/10/colorize-log-output-with-ack/


For log colorizing, here's the approximate equivalent of that blog post for ripgrep:

    cat UNLICENSE \
      | rg --passthru --color always --colors 'match:fg:red' public \
      | rg --passthru --color always --colors 'match:fg:green' heirs \
      | rg --passthru --color always --colors 'match:fg:cyan' benefit


Honestly, the most amazing thing I did with logs was learn how to do subtraction. Any time you have multiple instances of a thing and only some of them are bad, you can easily find the problem (if anyone bothered to log it) by performing bad - good.

The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but is not exactly human friendly. For Java, log4j can include the class name in your logging pattern, and that plus log level is usually pretty specific.

Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events for every fingerprint, for both good and bad systems. Then score every fingerprint as (1 + # of bad events) / (1 + # of good events) and everything at the top is most strongly bad. And the more often it's logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".
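For anyone who wants to try it, here's a rough shell/awk sketch of that scoring (the file names are made up, and it borrows the strip-the-numbers trick from elsewhere in this thread as the fingerprint):

    awk '
      { line = $0; gsub(/[0-9]+/, "", line) }    # fingerprint: drop the numbers
      FNR == NR { good[line]++; next }           # first file: count good events
      { bad[line]++ }                            # second file: count bad events
      END {
        for (f in bad)
          printf "%.2f\t%d\t%d\t%s\n", (1 + bad[f]) / (1 + good[f]), bad[f], good[f], f
      }
    ' good.log bad.log | sort -rn | head

The highest-scoring fingerprints are the ones most specific to the bad system.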


Isn’t this basically what structured binary logs are? Instead of writing a string like `Error: foo timeout after ${elapsed_amt} ms` to the log, you write a 4-byte error code and a 4 byte integer for elapsed_amt. I know there are libraries like C++’s nanolog that do this for you, under the hood.


Maybe this is a silly question, but is there much value in a 4-byte binary code compared to a human-readable log with human-readable codes? Maybe size, but logfmt especially is not much less compact than binary data.


Well, the value is mostly in query writing, but at that point you're basically writing a very bad TSDB?


That solves the fingerprint, but you still need to count and score.


You might be interested in the TF-IDF algorithm used in information retrieval and text classification.


Yes, this is pretty much TF-IDF for people too lazy to count the number of unique items in the corpus.

Since that number should be the same (or at least close!) in both good and bad datasets, I'm not sure the extra math matters much.


One thing I didn't see was how to use GREP to view the lines before and after a match:

   grep regex /var/log/logfile -A5 #To view the next 5 lines
   grep regex /var/log/logfile -B5 #To view the previous 5 lines
   grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match
This is super handy to find out what happened just before a service crashed, for example.


    grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match
This appears to work with GNU grep, at least, but YSK that the more portable spelling for this is `-C` (supported by both GNU and BSD grep; POSIX doesn't specify context options at all).


Loosely related: a few years ago I wanted a simpler alternative to some of the more feature-full log viewers out there so I threw together a tiny (50kb) app that might be useful to some folks in here.

All it does is consistently colors the first field in a line from stdin so you can quickly see which log lines have the same first field.

I used it in combination with the parallel[0] command to prefix log lines by replica name when tailing logs across machines: https://github.com/jasisk/color-prefix-pipe

[0]: https://www.gnu.org/software/parallel/


Nice tool. You should really consider including a screenshot showing the actual coloured output on the GitHub README.


As much as I approve of the skill set to analyze local logs, past a relatively small scale (10-20 systems) a decent central log aggregation setup like OpenSearch or ELK just brings so much value, even on 1-3 nodes. It'd be one of the first changes I make to an infrastructure because it's so powerful.

And it's not just the log searching and correlation value. At work, the entire discussion of "oh but we need access to all servers because of logs" just died when all logs were accessible via one web interface. I added log aggregation and suddenly only ops needed access to servers.

Designing that thing with accessibility and discoverability in mind is a whole other topic though.


Just to expand on this more - you don't have to use ELK which is pretty resource and maintenance heavy, there are much easier and faster alternatives that don't need attentive care.

Throw Loki+Grafana on a single VM somewhere (or run it on your Kubernetes/Nomad/ECS/etc. cluster) and it will get you very far, as long as you plan ahead a bit (most notably, indexing happens at ingestion, so you need to have an idea of what you want from your logs or your queries will be slower).

Or use a SaaS like logz.io, AWS OpenSearch, Datadog, etc. Most support OpenTelemetry now, so switching data ingestion is quite easy (unlike dashboards and alerts).

It makes sense to have centralised logs pretty much as soon as you outgrow the "everything runs on this one box and my DR plan is a prayer" stage, IMO.


Agreed! Centralizing logs is so helpful, but you don't know it until you've done it. Too many people rely upon grep, when a handy tool is just a download away. Plug for said tool: https://log-store.com


We went to the trouble of setting up ELK and I was excited, but I never use it any more. You can't scroll/search through the logs as fast as vi and grep. You have to click to see more log lines. And I can write an alias to tail a log so I don't have to log into a UI.


syslog allowed forwarding logs long before ELK existed.


Glad to see lnav (https://lnav.org) already getting some love in the comments. Hands-down the most reliable source of "thank you! I wish I'd known about this tool sooner!" responses, even from v experienced sysadmins / SREs.


A minor optimization is collapsing the grep -v, from this:

  cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4
to this:

  egrep -v 'THING1|THING2|THING3|THING4' file
That gets rid of the cat and three greps. Both POSIX and GNU encourage grep -E to be used in preference to egrep.

A pcregrep utility also used to exist, if you want expansive perl-compatible regular expressions. This has been absorbed into GNU grep with the -P option.


> A pcregrep utility also used to exist, if you want expansive perl-compatible regular expressions. This has been absorbed into GNU grep with the -P option.

'pcregrep' still exists. But with PCRE2 supplanting PCRE, it is now spelled 'pcre2grep'.

I don't know the precise history of 'grep -P' and whether 'pcregrep' was actually absorbed into it, but 'pcregrep' is its own thing with its own features. For example, it has a -M/--multiline flag that no standard grep (that I'm aware of) has. (Although there are some work-arounds, e.g., by treating NUL as the line terminator via the -z/--null-data flag in GNU grep.)


Oddly, there are pcre2 packages in RedHat/Alma 9, but they do not include a pcre2grep.

GNU grep is also linked to pcre, not pcre2.

  # pcre2grep
  bash: pcre2grep: command not found...

  # yum install pcre2grep
  Last metadata expiration check: 1:58:58 ago on Tue 13 Dec 2022 11:45:44 AM CST.
  No match for argument: pcre2grep
  Error: Unable to find a match: pcre2grep

  # yum whatprovides pcre2grep
  Last metadata expiration check: 2:09:25 ago on Tue 13 Dec 2022 11:45:44 AM CST.
  Error: No matches found.

  # rpm -qa | grep pcre2 | sort
  pcre2-10.40-2.0.2.el9.x86_64
  pcre2-syntax-10.40-2.0.2.el9.noarch
  pcre2-utf32-10.40-2.0.2.el9.x86_64

  # which grep
  /usr/bin/grep
  # ldd /usr/bin/grep | grep pcre
   libpcre.so.1 => /lib64/libpcre.so.1 (0x00007efc473c4000)


GNU grep recently migrated to PCRE2. My GNU grep is linked to PCRE2:

    $ grep --version | head -n2
    grep (GNU grep) 3.8
    Copyright (C) 2022 Free Software Foundation, Inc.
    $ ldd /usr/bin/grep
            linux-vdso.so.1 (0x00007ffd2ddd5000)
            libpcre2-8.so.0 => /usr/lib/libpcre2-8.so.0 (0x00007f4f88b81000)
            libc.so.6 => /usr/lib/libc.so.6 (0x00007f4f8899a000)
            /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f4f88c6f000)
As for pcre2grep, Archlinux includes it as part of the pcre2 package:

    $ pacman -Qo $(which pcre2grep)
    /usr/bin/pcre2grep is owned by pcre2 10.40-3
So this is a distro packaging thing. But what I said is true: pcregrep, pcre2grep and grep -P are all distinct things.


I once compared the speed of these two approaches, rather accidentally. I did output colorization by adding ANSI sequences. I thought, of course one process must be more efficient than a pipe of processes. After the rewrite, I was disappointed by the slowdown and reverted back to the pipe.

PS I checked back and I used sed rather than grep. I think the result would hold for grep, but the moral is that you should verify rather than assume.

I have around 50 seds in the pipe, running in parallel (which is what makes it faster); it would have been about half that when I tried the rewrite.


I usually prefer to pipe fgrep -v into fgrep -v. With egrep you need to escape brackets and other characters.


If you haven't seen it, there's been some noise recently about fgrep and its egrep friend

  $ fgrep -h
  fgrep: warning: fgrep is obsolescent; using grep -F
  $ egrep -h
  egrep: warning: egrep is obsolescent; using grep -E


True, for absolute fixed strings, the author's approach is superior.

Off the cuff, another way to do it is with awk's index function. I don't know what speed penalty this might impose.

  $ cat ~/fmgrep
  #!/bin/awk -f

  BEGIN { split(ARGV[1], s, ","); ARGV[1]="" }

  { m = 1; for(c in s) if(index($0, s[c])) m = 0; if(m) print }

  $ ~/fmgrep sshd,systemd,polkitd /var/log/secure


This also works:

fgrep -v -e THING1 -e THING2 -e THING3 -e THING4


Or the following if pipes or egrep make you nervous:

grep -v -e THING1 -e THING2 -e THING3 -e THING4 file


> you’ll get overwhelmed by a million irrelevant messages because the log level is set to INFO

I know this happens, but I think it's because programmers are abusing INFO. In principle it's reserved for messages that are informative at a level sys admins and a few others can make sense of and use. Unfortunately abuse often leads to "We turned INFO off" making it much harder to diagnose things after the fact.


I think there should be a counterpart to "log analysis", which is "logging strategies for your app": WHAT to log and WHEN.

Stuff like: if you are exposing an HTTP endpoint, you should log the request URL and the time it took to serve it. Or if you are invoking an external service, you should log the response time of that service.


And you should produce a single line of output for each request that identifies all of the pertinent information. You can have more than one, for e.g. a thread dump, but there should be one that provides a complete summary. I've lived with apps that logged at each stage of the process as separate lines, and that's just not useful data when grepping for anomalies.


I hypothesize it's not useful because you're using grep. If you use a tool that can show you multiple lines all tied by a request ID, it becomes much more helpful.


I am quite often looking for patterns across thousands of requests. As an example, one thing I inherited didn’t even log how long each request took to serve. Sure, you find out how long a single request took, by comparing the first and last log entry, but that’s just not useful 99% of the time.


This is exactly the kind of missing logging info that exasperates any on-call person.


When I was training sysadmins back in the dark ages, one of the rules I taught was: know what good looks like in your logs. If you are scanning hundreds of lines of logging under duress to find a smoking gun, and you don't know the difference between what the logs normally show, and what you are seeing, you'll waste a lot of time.

The corollary is that good-day logs should be minimal and "clean", e.g. not logging a lot, or logging nicely and predictably (which makes them easy to strip out via grep -v, etc.)


> Often log lines will include a request ID.

Yes, always include a request id in every request structure you create, include it in the response as well, and print it. It would seem like something obvious that everyone does by default, but no, apparently it's not so obvious.


Not so obvious. How do you implement it without passing the request id to all the functions, when they are unrelated to requests/HTTP? Especially in languages without thread-locals, such as JavaScript?


Many frameworks solve this with logger context. Add the properties you want to the logging context and all future logs in that context will have that property.

One simple example - Serilog enrichment https://github.com/serilog/serilog/wiki/Enrichment

This does however assume your entire application uses the same logger. But this is generally a requirement in services that require tracing like this.


Nodejs now has async local storage built in, which makes this fairly easy.



You can stick it in a middleware which wraps the request lifecycle.


I recently used clickhouse-local to do some log analysis on a lot of elastic load balancer logs (~10s of GBs) and it was spectacular.

In short, you can add clickhouse-local to a shell pipeline and then run SQL queries on the data. An example from the docs:

$ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | clickhouse-local --structure "user String, mem Float64" --query "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"


I wrote https://github.com/ljw1004/seaoflogs - an interactive filtering tool, for similar ends to what's described here. I wrote it because my team was struggling to analyze LSP logs (that's the protocol used by VSCode to communicate with language servers). But I made it general-purpose able to analyze more log formats too - for instance, we want to correlate LSP logs with server logs and other traffic logs.

(1) I wanted something where colleagues could easily share links in workplace chat with each other, so we could cooperatively investigate bugs.

(2) For LSP we're often concerned with responsiveness, and I thought the best way to indicate times when viewing a log is with whitespace gaps between log messages in proportion to their time gap.

(3) For LSP we have lots of interleaved activity going on, and I wanted to have visual "threads" connecting related logs.

(4) As the post and lnav say, interactivity is everything. I tried to take it a step further with (1) javascript, (2) playground-style updates as you type, (3) autocomplete which "learns" what fields are available from structured logs.

My tool runs all in the browser. (I spent effort figuring out how people can distribute it safely and use it for their own confidential logs too). It's fast enough up to about 10k lines of logs.


That sounds really interesting, although too limited for my current work focus which deals with Really Big logs.

I tried the demo link, it rendered a lonely @ at the top left of the page, the rest was blank. The console said:

    Uncaught TypeError: [].filter(...).reverse()[0] is undefined
        onload https://ljw1004.github.io/seaoflogs/demo.html:14
This was in Firefox 98.0 in Windows.


Some extra tips:

  Keep access logs, both when a service receives a request and finishes a request.
  Record request duration.
  Always rotate logs.
  Ingest logs into a central store if possible.
  Ingest exceptions into a central store if possible.
  Always use UTC everywhere in infra.
  Make sure all (semantic) lines in a log file contain a timestamp.
  Include thread ids if it makes sense to.
  It's useful to log unix timestamp alongside human readable time because it is trivially sortable.
  Use head/tail to test a command before running it on a large log file.
If you find yourself going to logs for time series data then it is definitely time to use a time series database. If you can't do that, at least write a `/private/stats` handler that displays in memory histograms/counters/gauges of relevant data.

Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), use them appropriately for script output.

Use atop, it makes debugging machine level/resource problems 10 fold easier.

Have a general knowledge of log files (sometimes /var/log/syslog will tell you exactly your problem, often in red colored text).

If you keep around a list of relevant hostnames:

  cat $hostname_list_file | xargs -P $parallelness -I XHOSTNAME ssh XHOSTNAME -- grep <request_id> <log_file>
This needs to be used carefully and deliberately. This is the style of command that can test your backups. This style command has caused multiple _major_ outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.

I've had an unreasonable amount of success opening up log files in vim and using vim to explore and operate on them. You can do command line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything to the logs. Searching and sorting, line jumping, pagedown/up, etc, diffing, jump to top of file or bottom, status bar telling you how far you are into a file or how many lines it has without having to wc -l, etc.

Lastly, it's great to think of the command line in terms of map and reduce. `sed` is a mapping command, `grep` is a reducing command. Awk is frequently used for either mapping or reducing.


Some of these are KPIs (Key Performance Indicators). What we did at a previous job was to have a system like Etsy's statsd [1] (it's an easy system to implement) and it made it easy to add statistics like latency of requests, number of errors, just about anything that could be measured, without excessive overhead (in terms of source code).

[1] https://www.datadoghq.com/blog/statsd/


> Always use UTC everywhere in infra

Can Amazon do this? They use UTC and your local browser’s time seemingly randomly depending on AWS service, and it drives me nuts. They usually (not always) put the timezone next to it, but why can’t they just have a mandate that it either is or is not UTC?! (The worst one is that the Lambda console is UTC but cloudwatch isn’t, so you think you haven’t received a request in hours but then you did)


also, journalctl. Very nice to have all logs in one place in a semi-structured nature

`journalctl -u myservice -S "5 min ago"`


Surprised to see that under the section "correlate between different systems" tracing isn't mentioned as an alternative approach. That's what tracing is: logging across different systems and getting that structure all stitched together for you.


“scroll through the log really fast”

This is a great point and probably underrated.

You may also benefit by scrolling sideways really fast.

https://en.m.wikipedia.org/wiki/Superior_colliculus


Best tip I discovered: use emojis. They have colors and they are easy to spot.

For instance, if an API call is made, use the phone emoji; when there is a timeout, use a clock; when an order is dispatched, use a package...

When you have to go through huge log file it is a life saver.


This seems like a prime candidate for a "7 tips for app devs SREs will hate" article, that only works because you're the only one doing it.


A few weeks ago I had a Windows installer that was silently failing when upgrading from an older version (installation from scratch worked without issues). Windows install logs aren't exactly easy to read, so I was stumped until I took an upgrade log from an older, working build, stripped all date information from both files, and compared them, checking all the sections that differed, until I found a line indicating that a colleague had forgotten about a limitation when dealing with MSPs (don't delete components on minor upgrades). I didn't throw any stones, though, as I've made the same mistake myself, twice, one and two years ago...


One of my favorite tricks is to use a visual difftool.

Copy good log into left panel. Copy bad log into right panel. Quickly show which lines are new, which are missing and which are out of order. Obviously ignore the timestamps ;)
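A terminal version of the same idea, reusing the strip-the-numbers trick from elsewhere in the thread (file names made up):

    diff -y <(sed 's/[0-9]//g' good.log) <(sed 's/[0-9]//g' bad.log) | less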


I do this all the time (PyCharm has an amazing diff tool hidden under Cmd-Shift-A, Show Diff). With the command from above that removes all the numeric characters it's going to be even easier to compare, no more timestamps!


Just use structured logs, it's 2022 ffs.


> cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4

Keyboard manufacturers HATE this simple trick:

grep -vE "THING1|THING2|THING3|THING4" file


One thing we did at my previous job was to add a "trace flag" to each account. Normally, they log nothing about a transaction (other than it happened), but if the trace flag is set, then a lot of information is logged about the transaction. Also, this trace flag is propagated throughout the distributed system they have, so we can trace the action across the network.


One thing that's improved my log analysis is learning awk (well, actually Perl, but I think 95% of what I do with Perl I could also do with just awk). Often the most useful way to look at logs is statistically, and awk can quickly let you aggregate things like time-between-statements, counts, rates, state transition probabilities, etc. for arbitrary patterns.
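For example, a quick sketch of the time-between-statements case (the pattern, file name, and the assumption of an epoch-seconds timestamp in the first field are all made up):

    awk '/connection reset/ { if (prev) print $1 - prev; prev = $1 }' app.log \
      | sort -n | uniq -c

That prints a small distribution of the gaps between consecutive occurrences.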


Also, sometimes logs stretch across multiple lines, and the other lines won't have the identifier you are searching for (Java stack traces, for example). In that case, if you are stuck parsing unstructured logs, the simplest thing to do is open the entire file and search for the timestamp of the line you found.
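If you'd rather automate it, one trick is to treat any line that doesn't start with a timestamp as a continuation of the previous entry. A rough sketch (the request id, file name, and the assumption that real entries start with a year are all made up):

    awk '/req-12345/ { show = 1; print; next }   # matching entry: start printing
         /^20[0-9][0-9]-/ { show = 0 }           # a new timestamped entry ends it
         show' app.log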


Your friends

- grep (or ripgrep), sort, uniq, head, tail, cut, colrm, tr

- diff

- jq

- bat

- ministat

- less, more, percol

- Facebook Path Picker (fpp)


One of my favorites that I don't see used enough is:

    watch -d -n $INT
It watches a command's output and highlights the deltas at the given interval.

I have lots of bad habits on the cli I should correct, now I'm feeling guilty after reading all the comments.


jless for JSON is very good.


I feel like this is a huge anti-pattern. Use a hosted service that does all of this for you, and then have a whole query language, build alerts, graphs, etc based on these results.

It's not super cheap, but it's 10x cheaper than wasting dev time in the terminal. (Sumologic, splunk are the two I can vouch for)


I found the histogram technique to be really helpful. Slight mod - I tend to sort reverse at the end of the pipeline (sort -rn); then |head is often more useful.

It's also good to have histograms by hour or day. I've hacked up scripts to do this but I should really make something better!
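A rough per-hour sketch, assuming ISO-ish timestamps like 2022-12-13T14:05:22 at the start of each line (widen or narrow the cut for day or minute buckets):

    grep ERROR app.log | cut -c1-13 | sort | uniq -c | sort -rn | head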


Usage of correlation ids and structured logs is pretty much standard. Don't go down this path.


Re timing, logs like nginx access logs have their timestamp from when the request completed, not when the request came in. That's a significant difference for long duration (~10s+) requests, and matters when trying to correlate logs or metrics to a request.


I saw a tip a while back about not needing to keep adding "| grep -v stuff", but instead using something like "grep -v -e stuff -e stuff2". I remember getting it to work on Linux, but last I tried it on macOS I didn't have much luck.


We've added a log tailing feature into our product UI which also has a basic find/filter. It's been enormously useful for cases where something weird happens as you can immediately access the last few mins of logs.


If I had a nickel for every time I have used this pattern

... | perl -ne 'print "$1 $2 ...\n" if /some-regex-with-capture-groups/' | ...


If you are using Kubernetes, I highly recommend using https://github.com/stern/stern


I second that. It's fast and convenient, and you can visualize logs across several instances easily.


No mention of klogg? It’s a lifesaver when opening huge log files


Jeez, this stuff is frontpage on HN? Sounds… pretty basic. I’m sure our AI overlords could produce deeper content.


I'm interested in this idea as well. Seems like it would be useful for detecting unusual issues assuming you have a large enough data set of "normal" runs.

I've been mulling this idea over in my head as well. I have a fleet of PCs out in the wild, all running the same software. It would be nice to have an easy way to detect strange behavior like processes that are continually respawning / segfaults / crashes / etc, without explicitly writing a bunch of search terms.


Yours is an unhelpful and non-constructive comment. Clearly a lot of people have been getting something out of the content in this post, as it's started several discussions.


People on HN have varied level of skills, and this is a well structured introduction to diving into logs. It already started conversation about better tooling. Let's celebrate today's lucky 10000 https://xkcd.com/1053/ rather than talk something down for being basic.


In addition to these, one of my favorites is a perl one-liner that generates time deltas for a regex pattern of interest. Then I plot them using gnuplot. It seriously helps to 'see' the events with timing in a chart & allows you to do a quick visual search for problem areas.
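Roughly this shape (a sketch; it assumes an epoch-seconds timestamp in the first field of lines matching the pattern of interest):

    perl -ane 'if (/timeout/) { print $F[0] - $p, "\n" if $p; $p = $F[0] }' app.log > deltas.dat
    gnuplot -p -e "set style data lines; plot 'deltas.dat' title 'seconds between matches'"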


Has anyone tried passing logs into chatGPT? I had thought it would be especially effective here.


It may work for a few lines, but you can't really pipe a GB of logs into it, which is what we typically deal with.

It works for simple explanations though.


My tips for analyzing logs:

cd ~/logs

cat * | grep <my string>


I guess it depends, but would you not encounter issues like permission blocks, or there being so much content in the logs that it is slow to find what you need?


journalctl supports grep, queries by priority, time, daemon, and custom defined fields.


Do we still use utilities like grep for searching logs? Are these for when we cannot stream logs to tools like Splunk & Loggly and use their search services?


One problem is that you start using these services and then the bill arrives and you then spend all your time removing stuff from the logs to keep the bill down.


All of our logs are in Kibana, but sometimes I'll `kubectl logs pod > tmp; grep pattern tmp` because Kibana's search and filtering is often annoying. Actually, I'll usually open the logs in vim and use its search, which also gets me the ability to, e.g., delete every line not containing the pattern and then search only things matching the first pattern. I'm going to try lnav as mentioned in this thread, but I've gotten by fine with judicious use of grep.



