Good Logging (henrikwarne.com)
100 points by henrik_w 9 months ago | 99 comments



I feel that logging is a seriously hard problem and even experienced programmers struggle with it. It is almost impossible to anticipate what information will be useful later. So the first problem is to figure out the right logs at the right places.

The second problem is to figure out if logging is affecting performance. Or if a couple of logs can overpower everything else in a specific edge case. This will need thoughtful rate limiting and a good knowledge of the use cases your system will be subjected to.

The third problem is journald. Those journal logs are the worst. The viewer, journalctl, is probably the worst log viewing app that exists. These log files also have the unique quality of not being of any use outside the system they were generated on. No way to open them on Windows. Even on Linux, the times are all wrong. I don't know if that is a problem with journalctl or my system.


> So the first problem is to figure out the right logs at the right places.

Over the years I have come to the following rule: during the development design-code-test loop, if the code does not work the first time, it gets a logging line where it failed. It helps with debugging right now, and it surfaces a sharp edge, by some definition. You don't know what you don't know, so you might as well have information gathering around entropy-generating places.

Other than that:

* Don't log in high throughput loops (test them extra!); if you must, use a rate limiting logger with expiring delay.

* Log important parameters

* Log assumptions. Especially about data. There's always way more entropy in data.

* This applies to user-reported bugs, which are a form of failed assumption. All information necessary to debug the issue must now go to the log.

Also end-to-end testing >> unit testing


> Don't log in high throughput loops

Looks deceptively simple. Typically a function implements some behaviour or computation, and hopefully does so correctly and efficiently. Its usage is, well, up to the user. With the requirement that we don't log in high-throughput loops, this is no longer the case: we also have to anticipate usage patterns.

To put that another way, you might not know at the time whether your function is going to be invoked from a high-throughput loop.

Is there a nice solution to this problem?


> To put that another way, you might not know at the time whether your function is going to be invoked from a high-throughput loop.

> Is there a nice solution to this problem?

There are two ways to do it from within the logging framework. The first is that you can suppress duplicate messages with a filter, like the DuplicateMessageFilter [1]. It wouldn't be that hard to add a filter that looked for the message template and could suppress based on other properties of the LoggingEvent, such as the request ID or the overall similarity of the event.

The second thing you can do is add a budget for your logger, which will only allow it to log a certain amount within a window. This is something you can do with budget evaluators [2].

Depending on the layer you put over SLF4J, you can also add conditional logging [3], which puts the logic at the application layer. Under this model, you'd have a counter or circuit breaker in the condition that would short-circuit the logging.

[1] http://logback.qos.ch/manual/filters.html#DuplicateMessageFi...

[2] https://tersesystems.github.io/terse-logback/guide/budget/

[3] https://tersesystems.github.io/blindsight/usage/conditional....
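
A rough sketch of that third, conditional approach, using plain SLF4J rather than Blindsight's actual API (the budget size and names here are illustrative):

    import java.util.concurrent.atomic.AtomicLong;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BudgetedLogging {
        private static final Logger logger = LoggerFactory.getLogger(BudgetedLogging.class);
        // Illustrative budget: allow at most 100 messages, then go quiet.
        private static final AtomicLong remaining = new AtomicLong(100);

        static void infoWithinBudget(String template, Object... args) {
            // The condition short-circuits the logging once the budget is spent.
            if (remaining.getAndDecrement() > 0) {
                logger.info(template, args);
            }
        }
    }

A real budget would also reset the counter per time window, which is where the circuit-breaker variant comes in.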


I think this is mostly a result of the OOP ethos of thinking in terms of actions on single objects, where each function call creates a log message. This is then called in a loop.

Meanwhile, the general case is actually that actions are performed on a collection of things, although this is often not obvious due to OOP design. But when working with collections it becomes obvious that log messages should be aggregates over collections ("2 out of N items failed, errors: ..."). This creates orders of magnitude less logging. It's just not mainstream now.


Sounds like you don't have enough information to make a decision when developing the function, so the logging behaviour should be configurable by the user.


Yes, I suppose a simple parameter would do the job. Probably not worth anything fancy.


A parameter where? Obviously not in the code of the function itself; polluting your application code with flags such as log=true|false is a no-go. Especially since logging is only one "cross-cutting concern" of many, and you simply cannot go polluting your code with all of them, because it leads to a mess.

Many Java logging systems allow you to enable/disable logging per package, but this is still not enough for a per function decision.

Logging is a hard problem.


> Many Java logging systems allow you to enable/disable logging per package, but this is still not enough for a per function decision.

I have something for that. [1]

[1] https://tersesystems.github.io/blindsight/usage/conditional....


Log filtering is available per class in Logback [1], as well as Markers, which can target individual logging statements [2].

[1] https://stackoverflow.com/questions/46062711/show-logs-only-...

[2] https://stackoverflow.com/questions/30248169/how-to-deal-wit...


I mean, there's no reason you couldn't have a thread-local override for the log level and restore it after the function call.

The fact that logging is at all intermixed with business logic I think is the main problem. I should be able to take a fully working logless program and apply different logging profiles to it. What gets logged should be a function of the person invoking the program.


Definitely. I was trying to argue something about intermixing logging with app logic :) It indeed seems to me the decision about what to log should be the caller's, not the callee's, and so it shouldn't be hardcoded within the latter.


I interpret your last line as "If an E2E test fails because of a product bug, the result should be a unit test".

I've been fighting this fight for a decade now, and I'm pretty sure I've never won it for a whole team. The best I got was that people would pre-populate bug fix code reviews with reasons why they couldn't sensibly unit test this particular issue. If you have had better success, please tell me your secrets!


> use a rate limiting logger with expiring delay

Hmm, any examples of this?


>> use a rate limiting logger with expiring delay

> Hmm, any examples of this?

https://tersesystems.github.io/terse-logback/guide/budget/



You just have to remember the last time a particular log message was logged, and not log it again until some amount of time has passed.
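
A minimal sketch of that, assuming SLF4J and keying on the message template (names and delay are illustrative):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import org.slf4j.Logger;

    public final class ExpiringDelayLogger {
        private final Logger delegate;
        private final long delayMillis;
        private final ConcurrentMap<String, Long> lastLogged = new ConcurrentHashMap<>();

        public ExpiringDelayLogger(Logger delegate, long delayMillis) {
            this.delegate = delegate;
            this.delayMillis = delayMillis;
        }

        public void info(String template, Object... args) {
            long now = System.currentTimeMillis();
            Long last = lastLogged.get(template);
            // Log only if this template hasn't been logged within the delay window.
            if (last == null || now - last >= delayMillis) {
                lastLogged.put(template, now);
                delegate.info(template, args);
            }
        }
    }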


You forgot making sure you aren't logging things you shouldn't log. Some are obvious, like passwords. Others might be less obvious, like the title of every window on the user's desktop, which Facebook Oculus logs. Those titles end up including every page of every website you've visited, the title of every document you've edited, and every video you've watched.


Ideally the data should be filtered before it gets to logging in the first place, but if not, you can add another layer of defense by employing censors:

https://tersesystems.github.io/terse-logback/guide/censor/

Or you can use logstash-logback-encoder, which has masking:

https://github.com/logstash/logstash-logback-encoder#masking


There's also Witchcraft Logging, which has the concept of "safe" parameters and is available in Go, Rust, and Clojure:

https://docs.rs/witchcraft-log/0.3.0/witchcraft_log/


Is this a bug or a feature?


> The viewer journalctl is probably the worst log viewing app that exists

Completely disagree there. I've had quite a pleasant experience with journalctl. It has sane arguments and you can easily grep any relevant log messages from the journal. You can add custom fields to your log messages. You can get the logs in JSON format to parse log messages from any script. And you can even use the cursor feature of journalctl to keep reading the logs at regular intervals, without having to keep a constantly running process tailing the logs.
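
For example (the unit name is illustrative):

    journalctl -u myapp.service -o json --since "1 hour ago"   # structured output for scripts
    journalctl -u myapp.service --show-cursor                  # prints a cursor after the entries
    journalctl -u myapp.service --after-cursor="<cursor>"      # resume reading from that point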

I can understand people not liking systemd-journald on account of the binary logs, but I definitely don't understand calling journalctl the worst log viewing app.


To me, that's why the original promise of dtrace was so nice: the idea of being able to dynamically turn logging on and off from the outside, at locations never intended to log anything, without even touching the source code, sounded wonderful.

Sadly $dayjob runs servers on linux and isn't going to merge the dtrace patches in there, and python's systemtap probes are… limited.

Apparently bcc can hook into USDT probes though, I should look into that: https://ish-ar.io/python_ebpf_tracing/


I know this doesn't help for Python, but... You can do that in Java with instrumentation. Things like ByteBuddy [1] or ReactorDebugAgent [2] will add debugging statements into your code without you having to touch anything.

If you want something more advanced, then you can look at libstapsdt [3], which enables JVM applications to create USDT probes.

[1] https://tersesystems.github.io/terse-logback/guide/instrumen...

[2] https://github.com/reactor/reactor-tools#reactor-debug-agent...

[3] https://github.com/goldshtn/libstapsdt-jni



Depends on which flavour of Linux one goes by.

https://docs.oracle.com/en/operating-systems/oracle-linux/dt...

Regarding BPF, the book is a must have for anyone doing serious stuff with it.

https://www.informit.com/store/bpf-performance-tools-9780136...


I would strongly doubt the sanity of anyone getting their linux distro from oracle.


Luckily most Fortune 500 management boards think otherwise, so I get to enjoy nice tools.


In cases like HPC and other performance-centric workloads, the mainline kernel just doesn't have the hardware support needed for more exotic stuff like InfiniBand and Mellanox cards, high-end SSDs, etc. Oracle got tired of waiting for sufficient support to be included, which IIRC is why OEL UEK was created.

If such hardware isn't being used, then yes, sanity should be questioned for supporting Uncle Larry's Linux.


I agree. Our architecture is partially built around logging every method called. Although it’s useful information, it still makes our product more difficult to program for.

Performance isn’t much of a problem though. Have different logging levels and only turn on a high level when you’re trying to nail down a particular STR.


You can also hook up your logging to a feature flag system, so it's only activated for a single user. [1]

[1] https://tersesystems.com/blog/2019/07/22/targeted-diagnostic...


> The second problem is to figure out if logging is affecting performance.

Spoiler -- it totally can. Kirk Pepperdine talks about logging as a memory allocation bottleneck [1]:

So I gave my usual logging rant at a workshop that I gave. I was in the Netherlands about a year or so ago. And that night, they went and stripped all the logging out of their transactional monitoring framework that they’re using, which wasn’t getting them the performance they wanted, which is why I was there giving the workshop in the first place. And when they stripped out all their logging, the throughput jumped by a factor of four. And they’re going like, “Oh, so now our performance problem not only went away, we’ve been tuning this thing so long that actually, when they got rid of the real problem, it went much faster than they needed, like twice as fast as what they needed to go.” – The Trouble with Memory

Unfortunately it can be kind of hard to track the memory allocation rate over time, and it's typically not the sort of thing you're focused on. I put together an audio queue that will play the memory allocation rate as a sine wave [2] so I can tell when it's getting to be a problem.

[1] https://www.infoq.com/presentations/jvm-60-memory/

[2] https://tersesystems.com/blog/2020/07/19/listening-to-jvm-me...


It is, but if you log the decisions you make and why, and stupid things like variables at various stages, then you at least have half a chance.


My favorite logging trick that walks the line between “not too much” and “not too little”: a ring buffer.

Put highly detailed log messages into an in-memory ring buffer, and when you encounter an error, log out the entire contents of the ring buffer. That way you get detailed information when things go wrong, an inspectable-while-debugging record of actions, all while not overwhelming your log aggregation system with useless information.
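
A minimal sketch of the idea (capacity and names are illustrative; a real implementation would buffer structured events rather than strings):

    import java.util.ArrayDeque;
    import org.slf4j.Logger;

    public final class DebugRingBuffer {
        private final ArrayDeque<String> buffer = new ArrayDeque<>();
        private final Logger logger;
        private final int capacity;

        public DebugRingBuffer(Logger logger, int capacity) {
            this.logger = logger;
            this.capacity = capacity;
        }

        public synchronized void record(String message) {
            if (buffer.size() == capacity) {
                buffer.removeFirst(); // evict the oldest entry
            }
            buffer.addLast(message);
        }

        // On error, dump everything buffered so far, then the error itself.
        public synchronized void flushOnError(String errorMessage, Throwable cause) {
            for (String line : buffer) {
                logger.error("[buffered] {}", line);
            }
            buffer.clear();
            logger.error(errorMessage, cause);
        }
    }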


That's good for debugging, not so good for analytics


This article is about debug logging.

Debugging is also the primary use case for logging program internals.

Analytics should be done on non-internal logs (such as traffic logs), unless you're just logging everything to the same destination without any structure, which is probably a bad idea.


Using analytics generated from logs is also problematic, because if the operations team decides a service is emitting too many logs, they will start pruning them, and not always from the aggregation service.

Analytics instead should be a separate stream in most cases.


Yup, I have a demo app showing dumping from a ring buffer to JDBC here:

https://terse-logback-showcase.herokuapp.com/


One disadvantage of that - though it only matters in some contexts - is that you pay the cost of formatting all those log messages (or at least copying enough data to be able to format them later) even if they're never read.


That depends on where the ringbuffer is placed.

If you're placing the ringbuffer inside an appender, then the appender's encoder will format. However, if the ringbuffer itself is an appender, then you can store the logging events as is until it's time to dump them -- this is the approach in terse-logback [1]

Another option is to put the ringbuffer BEFORE sending it to the logging framework, so it's never processed by SLF4J at all. This means that it's more accessible to the application, at the cost of losing the initial timestamp of the logging event. This is the approach taken in Blindsight. [2]

Shout out to Bunyan, which is (AFAIK) the only server side logging framework that has ringbuffers out of the box. [3]

[1] https://tersesystems.github.io/terse-logback/guide/ringbuffe...

[2] https://tersesystems.github.io/blindsight/usage/buffer.html

[3] https://github.com/trentm/node-bunyan#raw--ringbuffer-stream


The cost of writing data to memory is incredibly small; you'd have to be working on something like gaming engines for it to even be noticeable.

Logging can also be done in a separate thread, reducing even that exceptionally minor cost.


Can confirm -- writing data to memory is incredibly fast. It's even faster to use a lock-free queue and CAS to insert data, though:

http://psy-lob-saw.blogspot.com/2016/12/linked-array-queues-...

I've benchmarked code myself, and the flamegraphs will barely show the SpscArrayQueue in the profile.
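
A rough sketch of that handoff shape with JCTools (capacity is illustrative; assumes the org.jctools:jctools-core dependency):

    import org.jctools.queues.SpscArrayQueue;

    public class LogHandoff {
        // Pre-allocated single-producer/single-consumer queue: the app thread
        // offers lines, and a dedicated writer thread drains them.
        private final SpscArrayQueue<String> queue = new SpscArrayQueue<>(65536);

        public boolean log(String line) {
            return queue.offer(line); // false means the queue is full: drop or back off
        }

        // Called in a loop from the writer thread.
        public void drainTo(java.io.PrintStream sink) {
            String line;
            while ((line = queue.poll()) != null) {
                sink.println(line);
            }
        }
    }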


Interesting. Never heard of this before.

Any examples you can provide?


Here's some examples I found pretty quickly:

https://www.ibm.com/developerworks/aix/library/au-buffer/ind...

http://www.exampler.com/writing/ring-buffer.pdf

A google search for "logging to a ring buffer" will provide more.


Thank you, very useful



The best advice is hidden in one of the links in the article: always include some sort of session identifier, so you can identify all the log messages that belong to a session or transaction.

The article speaks briefly on log severity levels, which are key to manageable logs. Make sure to specify which levels the application uses, preferably in a design document if one is being used.

For example, high-severity error messages should be reserved for situations that should not normally occur, or for problems internal to the organization. An end user's invalid credentials are not a high-severity error, but invalid credentials towards the backing data store are.

Personally I tend to keep log severity thresholds local to the session, because those are the kinds of filters I normally want to write.


It's as if we should do logging modeling. There's structure to map logging to the code's (sub)processes, and also a need to relate traces to lifecycle.


This is what Honeycomb and Charity Majors talk about when they get into observability. They talk about events rather than logs, and discuss how to construct a meaningful event from a series of data points [1].

Having a good dictionary of logging fields is something that is coming together more [2], although some companies will prevalidate log messages to ensure they meet given conditions, like Palantir's Witchcraft [3].

[1] https://www.honeycomb.io/blog/event-foo-a-series-of-unfortun...

[2] https://github.com/vektra/logging-fields

[3] https://github.com/palantir/safe-logging


thanks, very interesting


A variant of the "No Screaming" rule (which funnily enough mentions "The only acceptable differentiator is to use different log levels, like WARNING and INFO"):

Don't log at an unnecessarily high level. It's tempting to do this when you're working on a new feature: if everything above WARNING goes to the console, maybe you'll make your new feature all WARNING. This is partly just for ease of debugging while you're developing, and partly (especially for junior devs) because it feels like your new feature is way more important than everyone else's, so it needs to be logged more loudly. The right way to do this is to make it configurable per feature which logging levels go to the console.


After having spent some time at a company that banned them, I am not even sure that log levels are an acceptable differentiator, anymore.

The rationale was that knowing that every single log statement absolutely would be written to the log would encourage developers to think more carefully about their logging. From what I saw, it worked.


I work on a small team which does its own operations. Anything logged at WARNING goes to our production radiator. Anything logged at ERROR goes to our production radiator and makes it play an annoying sound.

If you overuse WARNING or ERROR logs - as I initially did on joining the team - you will rapidly get feedback from other team members to knock it off.


What is a radiator in this context? A bit of searching didn't turn up anything except metal objects that exchange heat!


Ah, it means an information radiator, which is a funky term for a big thing on a wall that everyone can see:

https://www.leadingagile.com/2017/11/information-radiators-i...

In this case, it's a screen displaying monitoring information. And since we're all working from home, it actually means a web page we all have open and keep an eye on.

Daftly enough, we don't even call it the radiator, we call it the dashboard, but I thought "radiator" might be a more understandable name!


Amen to the idea of including host address and port in every log line about connection success/failure.

It's a pet peeve of mine to read "Failed to connect to server" or suchlike in an application log and then have to go capturing packet traces to find out what it failed to connect to exactly.


I would add a couple more things to it:

* use some kind of context for the transaction that includes a unique id that is a common theme through the system (maybe a unique id originating from an external system). This is called a mapped diagnostic context in Log4j. This lets you grep all your logs for information about a single transaction. Even better if you do this consistently across all services. (See the MDC sketch after this list.)

* offload metrics from logs to an appropriate tool

* offload audit information from logs. I have a service that gathers all information of the type (service A called service B and got result C for transaction D), and a simple page that can find the information for transaction D and produce a sequence graph where the lanes are the services involved.

* do not log information twice. Don't log an error and throw the error (either you throw it, or you handle and log it). It is OK to log additional information when passing on an error that would otherwise be lost.
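
For the first point, a minimal sketch with SLF4J's MDC (the key name and layout pattern are illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class TransactionHandler {
        private static final Logger logger = LoggerFactory.getLogger(TransactionHandler.class);

        void handle(String transactionId) {
            MDC.put("txnId", transactionId); // visible to every log call on this thread
            try {
                logger.info("processing transaction");
                // a pattern like %X{txnId} in the layout emits the id on every line
            } finally {
                MDC.remove("txnId"); // don't leak the id onto pooled threads
            }
        }
    }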


> use some kind of context for the transaction that includes a unique id that is a common theme through the system (maybe a unique id originating from an external system)

Oracle has this concept called "ECID" (Execution Context ID), where the first component in the system to get an HTTP request assigns it an ID, and that then gets passed down as a custom HTTP header through the system. Since an HTTP request can produce sub-requests (e.g. talking to backend micro-services), the ECID becomes hierarchical: sub-request identifiers get appended to the ECID of the original request. The ECID can even get passed through JDBC into the RDBMS, so you can correlate database logs with logs at the higher layers.
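
A hedged sketch of the hierarchical-id idea, not Oracle's actual implementation (the class and header handling are illustrative):

    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicInteger;

    public final class ContextId {
        private final String id;
        private final AtomicInteger children = new AtomicInteger();

        private ContextId(String id) { this.id = id; }

        // The first component to see the request mints the root id.
        public static ContextId root() {
            return new ContextId(UUID.randomUUID().toString());
        }

        // Each sub-request appends a child index, e.g. "ab12.../0/1".
        public ContextId child() {
            return new ContextId(id + "/" + children.getAndIncrement());
        }

        public String headerValue() { return id; } // propagate as a custom HTTP header
    }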

> do not log information twice. Don't log an error and throw the error (either you throw it, or you handle and log it)

Unfortunately, my personal experience is you can't always trust your caller to log exceptions properly. (Especially when you are just working on one component in some huge system over which you lack overall control.) Throw and log can add noise to the logs (the same exceptions gets logged two or more times), but it also reduces the risk that the exception just disappears and never gets logged at all. (Or gets logged incompletely – don't get me started about Java apps and libraries that rethrow exceptions without setting the cause, meaning you lose information on the root cause of the exception.)


1. The ID should ideally come from the external system, so that it is always possible to locate the records by an ID known to the user. Let's say I am support and I want to figure out what happened: the user gives me the ID, and I can use it to find the logs for this unique transaction, even if the user did not get any response.

2. Well, it is OK to log the error when it crosses component boundaries; this is basically what you do when you log an error before returning it from your service to another service. Ideally, there would be some coding guidelines that enforce rather uniform error logging. We enforce this in code review.


>Unfortunately, my personal experience is you can't always trust your caller to log exceptions properly. (Especially when you are just working on one component in some huge system over which you lack overall control.) Throw and log can add noise to the logs (the same exceptions gets logged two or more times), but it also reduces the risk that the exception just disappears and never gets logged at all. (Or gets logged incompletely – don't get me started about Java apps and libraries that rethrow exceptions without setting the cause, meaning you lose information on the root cause of the exception.)

I know this to be 100% true. I deal with it at my dayjob. However, I will always take the hard-line stance that the caller needs to be fixed in these cases.


Agree with the unique log id. While filename/line number seems good at first, it's more mental overhead to find the file and then go to the line number.

I tend to add a unique id, which is usually just the first 7 characters from any hash function. Something like

    date | md5sum | cut -c 1-7
which gives you something unique for each log message, making it really easy to identify where it happened in any codebase using your search tool of choice. You have to embed the value into the code itself for the benefit, but it's really useful in the short/long term.

I was debugging a lot of lambda functions, and this really ended up saving a lot of time, since at runtime that was about the only information you could get.

One other thing I found useful was to use JSON format for logs. I know there are many against this, but I found being able to add additional params without breaking existing tooling was a massive win, plus it formatted nicely in cloud-trail. Later on I realised most of our logs were processed by humans second and computers first, so having a machine readable format was really a benefit. Additionally, we were using Scalyr at the time, so each new value in the JSON became an additional filter, which was a very nice outcome.


And if you are building anything other than toy systems then please, for the love of the FSM, use structured logging!

If I could get my hands on the developers who wrote all the snowflake log formats I had to deal with on a daily basis back when I was doing support and ops …


Came here to say this as well. If you're not using structured logging in 2020, get on it! It's usually trivial to set up. Most of the work will be converting your unstructured log messages to correctly separate out the application-level fields.

I cannot emphasise enough how much better it is for searching the logs and running any kind of automation off of the logs (analytics, monitoring, alerting).

"Noisy" logs are often cited as a major problem, but well-structured logging allows you to filter out the noise very easily, and focus on only the log events that are relevant to your task.


I tried this, but the only way to "use" them is to pay absurd amounts of money for cloud tools or to set up complex Grafana + ... things.

Using them from the command line? Reading them right there? Nothing exists that's as simple as "nano server.log".


Not at all - grep piped to jq works like a charm. I have made reports using that on the CLI, wrapped in a Makefile, that were exec friendly. And there are other tools, like gron, for example.

And if you are building something serious and with some sort of scale then you will need to have something like ELK stack anyway to aggregate and display data - there is no getting away from that.

Working with unstructured logs, however, is a mess. Perl/Awk helps of course, but it is a lot of extra effort. And in the end, at scale, you will just be writing those scripts to turn that mess into structured data to feed into that same ELK stack anyway.
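
For example, with one JSON object per line (field names illustrative):

    # pull the messages of all ERROR events
    jq -r 'select(.level == "ERROR") | .msg' app.log

    # or pre-filter with grep when the file is huge
    grep '"level":"ERROR"' app.log | jq -r '[.ts, .msg] | @tsv'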



If the logs are stored in JSON, the jq tool can be used. I understand it's more complicated than grep for certain structures, but having a library of jq filters for the app may also be useful, and it can be shared.


But that requires learning extra stuff that is not part of the normal flow: locating that part of the log, or scanning it quickly.


That's a good guide to follow; I appreciate it. The logging-too-much syndrome is what I notice the most, and I think it comes from a tendency to treat a log file like a debugging session, e.g. console.log(). A log file is a tool to help the future hapless developer (could be you) troubleshoot a problem with the system.

So it's a matter of putting yourself in someone else's shoes, but the shoes are a future situation that you have to imagine.


Good logging for me involves:

- aggregation from multiple systems

- one-line per event (so I can use unix tools)

- (ideally) enough information to replay some action in the event of a bug

- being able to live-tail it within seconds

- putting a unique, grepable identifier next to each printed output


> GREP-ABLE MESSAGES

seems to imply text-based logs parsed via some regex variant. Even better are structured logs parsed by field explicitly.


My one principle for logging is that if you think your code needs a comment to explain it, you probably actually need to put whatever you're writing in a debug logging statement.

It's easy to remove that line later. It's a lot harder for ops to figure out what the hell is going on if it's not working.


2019-08-20 14:00:03.270 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:03.856 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:04.137 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

2019-08-20 14:00:04.523 INFO Implements Ukkonen's suffix tree algorithm. For more on how it works, see https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.... or get a PDF from our file share at \\some\path\to\departmental\share\FileThatWasDeletedThreeYearsAgo.pdf

Something tells me that my logging under that heuristic wouldn't fly with sysops.


"Probably" aka use some common sense.

People frequently write business code like this:

    # Setup the main data
    doSomeProcessWithData(data)

    # Do the other process
    doTheNextProcess(data)

    # Check if thing X due to some requirement
    if (some check happens):
        # Do the workaround
        <some hack which currently works>

    # Finish up
    dispatchItToZ()

And provided this works, okay, no problem. But the reality is that this is going to get deployed, hit some edge case the requirements didn't cover, and start failing. Written like this, someone is probably looking at that output wondering where it's failing.

If on the other hand each of those comments was a debug-level log line, as they should be, then at minimum the logs now have a mode where they document exactly what happens and why (of course a business process should be logged at INFO, IMO).


Well, originally it was just a joke. But since we're going there:

Needing to log at quite that level of detail worries me. It implies that non-deterministic behavior is pretty pervasive in the codebase. I'm generally a lot happier when the logging strategy on the happy path looks like:

1. Get all your I/O-type stuff handled before the call stack gets deep.

2. Log your inputs

3. Run the calculation

4. Log your outputs

If what happens in step 3 is deterministic, then it should always be possible to explain the output by looking at the input, and reading the code. And, if you've accomplished that, then it's straightforward to convert the logs into a test case.



He said a debug logging statement. You're logging at INFO level.


I'd argue that you want to keep that debugging line, but put it inside a guard. Diagnostic logging is used a lot in production [1].

[1] https://tersesystems.com/blog/2019/10/05/diagnostic-logging-...


Logging strategy for single-user apps (desktop and mobile) is probably very different from that for a multiuser app such as a web service, since things can interleave and the volume is much larger. With a web service I'm usually not as interested in what the user did fifteen minutes earlier, because the nature of the web usually means any information is kept on either end (a db, a request). But in a client-side app you will often want an activity log that outlines what the user did, and that's not what naturally follows from a normal "subsystem log" where different pieces of code just log what they are up to.


It can also help a lot if you log all incoming/outgoing packets in a format that allows replaying the session.


I've found a lot of value with canonical logging[0] for server-side logging.

Basically, we emit one log per request/response with appropriate severity level (INFO if request succeeded, ERROR if it failed) and the log contains all the interesting information that we care to log. This makes it very easy to track and debug issues at an individual request level, while not having to overly rely on external tools to parse/combine logs to reconstruct requests/sessions.

The other benefit of this is that we can feed these logs directly into log ingestion tools like Datadog and plot graphs, view distributions and set alerts on metrics present in logs.

[0] https://stripe.com/blog/canonical-log-lines
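
A rough sketch of the pattern (field names illustrative; SLF4J underneath):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Accumulate fields over the life of a request, emit exactly one line at the end.
    public final class CanonicalLine {
        private static final Logger logger = LoggerFactory.getLogger("canonical");
        private final Map<String, Object> fields = new LinkedHashMap<>();

        public CanonicalLine put(String key, Object value) {
            fields.put(key, value);
            return this;
        }

        public void emit(boolean succeeded) {
            if (succeeded) {
                logger.info("canonical-log-line {}", fields);
            } else {
                logger.error("canonical-log-line {}", fields);
            }
        }
    }

Used as, say, new CanonicalLine().put("path", "/pay").put("status", 200).put("duration_ms", 42).emit(true) at the end of the request handler.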


From the linked article:

> in addition to their normal log traces

Does this mean that normal logging happens outside the final log line? If so, then that sounds great. Otherwise I'd worry about losing state due to a bug.


This is also known as an "event" in Honeycomb parlance.

https://www.honeycomb.io/blog/how-are-structured-logs-differ...


I'm going through this article, and it seems to me that it is suggesting very low-level techniques. Grep-able messages? Like ssh-ing into a machine and running grep to see your logs?

I would go for a centralized logging system.

If you are working on a PaaS, it probably has its own logging feature already built in.

If you are looking to self-host, give the ELK stack a try (many alternatives as well).

If you are looking for a third-party service, look over at Loggly (many alternatives as well).

And the most important one is: export structured JSON. A centralized logging system will parse and index your logs.

In the ELK case, logs will be indexed in ElasticSearch. From there you will be able to not only filter and search but also create monitors and alerts on your logs.


Strongly agree with this comment. ELK and now EFK using fluentd are fairly mature and the dashboards used to view the data e.g. Kibana, Grafana are full of useful features to get around the "endless series of greps" method of manually parsing through logs. Another benefit of offloading app, system and security logging to a central logging server is removing a potential single point of failure from simply running out of storage. Elasticsearch clusters are easy to build out to scale, also.


It's a topic where people default to doing more or less counterproductive things because they are not clear on goals.

Basically, if you have some kind of cloud based server infrastructure and you do logging, the goal is to get it aggregated in a place where you can do something productive with it. We are talking about a time series database where it is all about the metadata and less about the message. The message is still important but only if you can query it out of millions of other and possibly very similar messages without too much hassle.

Fallacy #1 is believing that a log event is a string. It's not. You might believe it is a string and format it as a string, and you might even have some semi-structured way of doing that, and tools that feed your confirmation bias by treating logs as lines of text. But in the end, something is going to be picking that string apart to extract some metadata just so it can be inserted in a database and aggregated to the right place. Half the success is not losing the data that matters through wacky formatting, using the wrong tools for the job, etc.

Of course, this picking apart of strings is a lot easier if you use some kind of standardized log event format. Which one is appropriate probably depends on where you aggregate your logs. The trick here is avoiding building assumptions about this into your software, and avoiding prematurely discarding information that upstream aggregations might benefit from. A lot of stuff happens between where you invoke some info or warn function in your code and where a message gets stored in some log event aggregator.

Some of the better logging frameworks typically have these features: log levels, an API for logging, an external way of configuring the output of the log engine (formatting, coloring, filtering things out), and some notion of attaching meta data to the events.

The latter is one of those things that are super useful but also one that relatively few developers even know is possible. If you are unfamiliar with the notion of a Mapped Diagnostic Context (MDC): this is where you put any structured information related to sessions, thread ids, and other context. Preserving this information all the way to your log aggregator is the goal.


Agreed, treating log events as strings seems like throwing a lot of readily available information away. There are various logging libraries which do allow more structure, but too often this is ignored.

I think we got some of this right in the standard Julia logging frontend. We allow the user to attach arbitrary key value pairs (name=data) to the event, and also automatically attach some statically known context such as source location.

(The loggers are also dynamically scoped to the task, which has several good properties. For example it gives great control over attaching context known by the caller to the callee's messages and over dynamically routing or filtering messages. Sort of like MDC but more convenient.)


Agree with everything you've said, but MDC does have one big drawback: it's only convenient when you're working within a single thread. If you're writing asynchronous code (Scala, Akka, Vert.x etc) then MDC doesn't help you very much. [1]

I generally find it easier to have an explicit "context" object and pass that around in an operation, and then log once that context has been built up. If you're lucky, the logging framework will do that for you.

Bunyan did this the Right Way from the get go, using `log.child`. [2]

[1] https://tersesystems.github.io/blindsight/usage/context.html...

[2] https://github.com/trentm/node-bunyan#logchild


With Kotlin, there is a way to make the MDC work with coroutines. Basically, it works by leveraging the coroutine scope to keep track of a context. This ships as part of the coroutines library, in the form of extension functions on the SLF4J API: https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutin...


Maybe worth adding that for lower-level software it's well worth baking logging in to start with; nothing worse than trying to add logging to a non-functional ball of mud ;-(


Any recommendations for simple centralized logging for small/hobbyist/non-profit projects?

I'm basically looking for something self-hosted or cloud-based where I can easily POST some data from Python/PHP/JavaScript/Bash running on two or three servers and easily have a look at the logs streaming in.

Spinning up an entire Graylog seems overkill in order to log a couple of dozen lines per day.


I suggest trying syslog-ng.

Have syslog-ng running locally on all your hosts, forwarding logs to a central one. Applications can easily send logs to syslog; all languages/loggers support it.

The central syslog can listen for remote logs and save everything into /storage/hostname/year-month-date/file.log for example. This allows you to grep everything from that host.

It's simple and it uses very little resources.
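
A hedged sketch of the central host's configuration (port and paths are illustrative):

    # receive over TCP, fan logs out per host and day
    source s_remote { network(ip(0.0.0.0) port(514) transport("tcp")); };
    destination d_per_host {
        file("/storage/${HOST}/${YEAR}-${MONTH}-${DAY}/messages.log" create-dirs(yes));
    };
    log { source(s_remote); destination(d_per_host); };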


https://grafana.com/docs/loki/latest/

is pretty good for structured logging. But as always: it depends on your needs, especially with regards to indexing, as Loki stores only the raw log data and set labels.

But putting new data into it via POST is pretty simple https://grafana.com/docs/loki/latest/api/#post-lokiapiv1push


AWS CloudWatch Logs is cheap and very easy to 1. integrate with existing loggers, 2. send logs to your desired log stream (and to multiple streams depending on whether you're in dev, testing, prod, etc.), and 3. fetch/review via the CLI or their web interface.


Check out https://datalust.co/seq. Supports native ingestion over HTTP, as well as plugins for syslog/GELF. Can run locally as well as in docker, so it runs anywhere.


Honeycomb has a free tier that is very generous:

https://www.honeycomb.io/pricing?

You can set up with one of their Beelines, or just send JSON to the REST endpoint.


Here's my background in forestry shining through - when I read the title I thought the article was about a good way to harvest trees. Maybe it is a sign of the times that I was a bit disappointed to see the article was instead about process data collection, something which is part of modern logging (in the forest) but far from the most interesting or challenging aspect.


Here's my interpretation of log levels:

TRACE = low-level details

DEBUG = details

INFO = good to know

WARN = might want to look into this

ERROR = definitely look into this

FATAL = why it died


My tip for good logging: log in JSON.

Just log directly to JSON; it makes any processing later so much easier.
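
For example, a line like this (fields illustrative) stays machine-parseable no matter how many fields you add later:

    {"ts": "2020-08-27T14:00:03Z", "level": "INFO", "msg": "payment accepted", "txnId": "ab12", "duration_ms": 42}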



