Ask HN: Best practices for log format?
137 points by user101010101 on Apr 27, 2017 | 62 comments
We're currently going through a bikeshedding discussion about the log format that our microservices will generate.

Developers without Ops responsibilities want the logs to be as readable as possible to aid their development process (traditional timestamp, severity, component, action and a quick description in a single line and, depending on the type of event, maybe a JSON object in many lines with details about the request, possibly with blank lines and indentation to make everything very easy to read).

Ops people want the files to be easily parseable for ingestion into ElasticSearch. They want to avoid complex configuration and want more flexibility to generate reports. If logs were generated as single-line JSON, they would be happy.

It seems there is no way to please everybody. Are there any standard formats that microservices-oriented architectures are using these days? Is there a standard at all? How to approach this?




Do not invent your own formats.

There is an RFC for structured logging [1]

Also, journald does structured logging, plus indexing and searching like a simple database and it's designed for your use case. It can receive the logs and forward them using a connector for ElasticSearch [2]

[1] https://tools.ietf.org/html/rfc5424 [2] https://www.loggly.com/blog/why-journald/


RFC5424 Syslog defines some of the common stuff all logs should have: hostname, app name, datetime, message, priority.

Many logging tools (Splunk, Loggly, etc.) consume syslog. Syslog works over UDP and TCP. Java Logback has a syslog appender. Linux systems have rsyslog for forwarding, writing, etc.

You'll wind up including your own custom structured data of course (IP address, response time, user id etc.), but it's nice to use a recognized standard for the standard parts. (Kind of like, say, using status code, cache-control, and content-type in HTTP.)
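For illustration, an RFC 5424 message with a custom structured-data element might look roughly like this (the SD-ID "req@32473" uses the documentation enterprise number, and all field values here are made up):

    <165>1 2017-04-27T09:43:04.003Z web4 checkout 2112 ID47 [req@32473 ip="203.0.113.9" user_id="42" response_ms="127"] payment authorized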


The RFC defines key-value structured logging.

https://tools.ietf.org/html/rfc5424#section-6.3.5


I got excited reading your second link (https://www.loggly.com/blog/why-journald/ ), but from it:

> Sadly, journald does not come with a usable remote logging solution. The program systemd-journal-remote is more of a proof-of-concept than an actually useful tool, lacking good authentication among other things. Third-party packages such as journald-forwarder send journal logs to Loggly directly but are very new and experimental. As such, remote logging still has to go through existing syslog implementations and the syslog protocol and therefore cannot make use of many of the additional features of journald, in particular the structured data.

Has anyone on HN cracked structured remote centralised logging with journald? Unstructuring everything for syslog, dealing with pretend facilities like uucp, fax and user3, and restructuring it all again seems like a kludge.

Also: it's weird that Loggly doesn't seize the opportunity and make something that works with their service without the limitations they're writing about.


It's not as difficult as it's made out to be. We run a systemd service that does something like this:

   journalctl -f -o json | send-logs remote-server
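If you don't have a custom forwarder handy, a rough sketch of the same idea using plain ncat over TLS (the remote host and port are placeholders, and you'd need a TLS listener on the other end):

    journalctl -f -o json | ncat --ssl logs.example.com 6514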


And there are also tools like https://github.com/aiven/journalpump


What is send-logs and what protocol does it use?


Speaking of Loggly, I highly recommend them as a log analysis sink.


If you have a microservices-based architecture, then it is very important to track and log a request through the multi-service calls and correlate the logs.

Use syslog with a common UUID set at the entry point of the request (typically at the proxy or load balancer level, e.g. nginx); each logger then records that UUID as a UUID parameter of the syslog message.

The syslog stream can then be parsed into JSON objects for the people in Ops and sent on to a service. Basically you could have one more internal microservice that handles the logs from syslog, parses them, and sends them wherever Ops needs.
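Once every service carries that UUID in its log lines, correlating a single request across services is a one-liner (the field name, UUID and paths here are hypothetical):

    grep -h 'request_id=9f1c2a7e-3b44-4d1a-8e60-5b2f0c9d7a11' /var/log/services/*.log | sort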


I would avoid syslog. The protocol does not handle multiline messages (stack traces, etc.) gracefully, since it's purely line-based. It's also very hard to encode application-specific fields into syslog. However, it's a good idea to pick (or design) something based on which fields traditional log daemons use - either based on syslog, or on systemd's journal (which, btw, does not suffer from any multiline issues).


Along those lines (and possibly what the OP was looking for as well), what are the conventions around key names including namespacing? Or what are the common environment variables you capture for a given service, i.e. the boilerplate stuff.


Syslog has very detailed key names and namespacing (http://www.iana.org/assignments/syslog-parameters/syslog-par...) which can be filled in with application-related data and then parsed by any PaaS logging platform or OSS logging tool.

I think syslog is a widely accepted log format, and most of the tools that revolve around log metrics support it, especially when combined with parsing tools.


+1, request tracking across microservices is a must. You will be crying floods at the worst imaginable time if you do not have this. Think of it like a stack trace for each request.


For a richer view of the call tree, consider full distributed tracing, like Zipkin.


General rule: writing logs is a means, not an end, so whoever reads it most gets to decide what it should look like.

Corollary 1: allow developers to use whatever they see fit for DEBUG logging.

Corollary 2: require single-line json for everything else (if your developers have to read lots of production logs, you probably have a bigger problem than your log format)

You will need some way for Ops to filter DEBUG lines out.

Options:

- disallow having DEBUG logging in production

- separate log file for DEBUG logging

- start each line with the log level

Also: give the developers a tool that converts the 'one json line per event' file to what they want. Configure less and other tooling to know about it.
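As a sketch of such a tool, a jq one-liner goes a long way, assuming one JSON object per line with "ts", "level" and "msg" fields (illustrative names):

    alias devlog='jq -r "[.ts, .level, .msg] | @tsv"'
    tail -f service.json.log | devlog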

Optionally, tweak the json serializer used in logging to always output the time stamp, severity, and short description at the start of the line.


Have you looked into https://brandur.org/logfmt?

Having lived at various points between 'awesome to read' and 'simplest to parse', I currently favor simpler parsing and investing in tooling that makes manually reading the logs easier.

Of course the decision should include the scale of the microservices. There would be no benefit in optimizing too much for readability if the number of requests to your services is so high that you would need complex filtering before you can look at them.

Another way to cut down the bikeshedding is to go with a library that wraps around the format you choose. We built GitHub.com/clever/Kayvee and have changed the format a few times. However, since everyone at Clever uses the Kayvee libraries, it's not too crazy to change the format.

Hopefully this was useful. Apologies for the disconnected paragraphs - on public transit :)


The only thing I can add to everyone else's replies is: VERSION IT.

This lets you write a parser that handles every version of your log format, whatever those versions turn out to be, based on the version field, and it lets you amend the format at any time without worrying that your old logs will suddenly be incompatible with the new ones.

"we" use this even for Apache access log files, and it's one small thing which has helped a ton.


Converting a structured format into a human-readable form is relatively easy. Parsing unstructured logs is hard.


> traditional timestamp

As a sysadmin, I implore you to use the ISO 8601 format, preferably with the UTC offset.


Do you prefer the full format (2017-04-27T09:43:04,0010+0100) or a related but more readable format (2017-04-27 09:43:04.0010 +01)?


My personal preference would be sans TZ, since everything I can control is set to UTC immediately.


Technically "Z" postfix is a TZ identifier for UTC time. It's short and makes the timestamp compatible with RFC 3399:

    2017-04-27T09:43:04Z


Not only technically, but also practically, since many tools, libraries and service APIs accept and produce the Z as the TZ identifier for UTC.


Until a bug creeps in, and then having the TZ there (even though it should always be Z) saves your bacon.


Agreed.


Personally I prefer the RFC 3339 profile as most syslog daemons support this.


RFC 3339 allows applications to use a space instead of "T" between date and time:

      NOTE: ISO 8601 defines date and time separated by "T".
      Applications using this syntax may choose, for the sake of
      readability, to specify a full-date and full-time separated by
      (say) a space character.
https://www.ietf.org/rfc/rfc3339.txt


I prefer to use epoch as an additional timestamp - this allows us to parse the logs without being bothered about timezones and all.


If you're using a configurable logging library like logback for Java, it is fairly easy to have your application support both structured logging directly to Elasticsearch, and logging a more human readable format to disk or stderr for development.

It would be nice if there was a standard format you could just ship to a logging service, but the closest seems to be syslog, which is somewhat janky.

Re-parsing logs after they have been formatted for human consumption feels like an anti-pattern motivated by anemic logging libraries.


Exactly this. Configure two log handlers - one sending data to ES and another to disk.

Alternatively, give the ops team what they want (more information) and then generate log files that please the devs from ES query output


As part of my current role, I set up and managed a unified logging infrastructure for a SIEM install. We're talking about ingesting logs on the order of TBs a day (lots). Here are my observations:

1. Syslog type logs are great when the volume is low, but field extraction can be a pain (unless you're a Regex savant)

2. Structured logs are nice to deal with. Fairly easy to read and a dream to parse. Make sure you have a good ETL process in place to process these logs so they are searchable.

3. If you have a lot of logs, they are only as good as the usability of the search over that corpus of text.

I use Splunk to ingest the logs and once the extractions are in place, it is a dream to search. Other options like ELK are available and work pretty well too.

Show the devs the advantages of having structured logs indexed and ready to search. No more: SSH > cat | grep | tail | less | more

Real-time functionality and the ability to alert on certain conditions is an added bonus.


JSON-formatted log files can be pretty-printed by tools like the Logfile Navigator:

http://lnav.org

That should make the decision a lot easier.


We generate structured logs, i.e. key-value pairs. Depending on the environment, they are formatted either as logfmt or as JSON. This serves our use cases quite well.
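For anyone unfamiliar, the same event in logfmt and in single-line JSON looks roughly like this (the fields are illustrative):

    time=2017-04-27T09:43:04Z level=info msg="request served" status=200 dur_ms=41
    {"time":"2017-04-27T09:43:04Z","level":"info","msg":"request served","status":200,"dur_ms":41}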


This is a node module but the idea can be applied generally (https://github.com/trentm/node-bunyan). Basically you output the logs in the format Ops require (which I assume would be structured), and for devs you provide a beautifier that parses the logs and prints them in the format they require.
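If I remember correctly, Bunyan already ships such a beautifier as its CLI, so the dev workflow can be as simple as (the script name is just an example):

    node server.js | bunyan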



I am the author of a tool called LogViewPlus (https://www.logviewplus.com) which can read log files in multiple formats (including JSON) and display the log entries in a grid. The tool has several features that make log files easier to analyse than other tools like a text editor or grep.

Best of both worlds?


Not directly answering your question but

1) fluentd[0] can be used to parse diverse log formats

2) You can set an environment variable, e.g. `ENV`, and output to a more human readable format when `ENV=development`.

[0] https://github.com/fluent/fluentd/


Fluentd's regexp parsing will be a PITA to maintain, both because regexps only give you match/no-match, which is difficult to debug, and because testing a new regexp set with Fluentd is awkward.

The liblognorm library is a much better choice. A regular rsyslog installation comes with liblognorm already, though I prefer a standalone log parsing daemon (and thus I wrote logdevourer).


Make one team responsible for maintaining logs and they get to decide the "how". This is most likely Ops. Devs are now a customer with requirements and Ops should find a solution that will both meet their needs and the needs of devs. That will at least stop the bikeshedding.


    maybe a JSON object in many lines with details about the
    request, possibly with blank lines and indentation to make
    everything very easy to read
Pretty-printing the JSON is a presentation concern - it shouldn't be baked into what's being written to the logs.
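The pretty view can be produced at read time instead, e.g. with jq (the file name is just an example):

    tail -f app.json.log | jq .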


JSON fed into ElasticSearch works pretty well for us. I was a bit sceptical about it at first, since I was used to grepping classic logs in /var/log/, but it's actually really good. Kibana lets you apply a bunch of dynamic criteria to the logs, and filtering JSON is obviously better than randomly grepping for something in text. I couldn't think of any solution other than files in /var/log, and one thing off the top of my head that annoys me about text logs: when you don't know exactly when something happened (e.g. before or after midnight?), you have to grep two files (assuming you have one log file per day). Also, to grep the files you need access to the servers, and the fewer people who have access there, the better.


This academic paper has some interesting insights about what distributed systems (e.g. Hadoop) log in order to be able to reconstruct what happened after a problem occurs.

http://www.eecg.toronto.edu/~yuan/papers/zhao_stitch.pdf

In particular,

"- log a sufficient number of events — even at default logging verbosity — at critical points in the control path so as to enable a post mortem understanding of the control flow leading up to the failure.

- identify the objects involved in the event to help differentiate between log statements of concurrent/parallel homogeneous control flows. Note that this would not be possible when solely using constant strings. For example, if two concurrent processes, when opening a file, both output "opening file", without additional identifiers (e.g., process identifier) then one would not be able to attribute this type of event to either process.

- include a sufficient number of object identifiers in the same log statement to unambiguously identify the objects involved. Note that many identifiers are naturally ambiguous and need to be put into context in order to uniquely identify an object. For example a thread identifier (tid) needs to be interpreted in the context of a specific process, and a process identifier (pid) needs to be interpreted in the context of a specific host; hence the programmer will not typically output a tid alone, but always together with a pid and a hostname. If the identifiers are printed separately in multiple log statements (e.g., hostname and pid in one log statement and tid in a subsequent one) then a programmer can no longer reliably determine the context of each tid because a multi-threaded system can interleave multiple instances of these log entries."

Some of this may be duh-obvious, but it results in a naming scheme where all important objects in the system have unique, printable ids.

Couple more observations: Hadoop logs tend to fit on one line (they can be long lines, though), except in the case of exceptions. Exceptions are almost always logged, even if they represent an ordinary situation that is recovered from in the error handling code. (Many software failures in these systems occur because a standard, recoverable exception is thrown in one place, improperly handled in the error handling code, and then this causes a much more severe exception further down the line.) All exception logs include a multi-line stack trace.


IMO JSON gives you the most flexibility and carries as much metadata as the developer (application side) and Ops (in case you use an orchestrator such as Kubernetes, etc.) need.

E.g. in Kubernetes it's pretty common for applications to log everything in JSON; Kubernetes will then wrap the app logs in another layer of JSON that includes information about the container name, ID, host IP, etc.

Microservices and containers will most likely require you to have some centralized way to view your logs (kibana etc) so ops and devs can create their own views.

Even in the basic case of tailing log files you can use a jq alias with a custom format to convert the json logs to something colorful and readable.

PS: On mobile, so I apologise for possible mistakes.


We developed a common logging module that we include in our microservices to log to JSON. It also takes in environment variables so we can include deployment-related information to identify and query the logs once they get centralized in Splunk.


All applications should generate JSON logs.

The transport system (syslog, usually) will take care of timestamping and some metadata.

application => (JSON text) => syslog daemon => (syslog message) => graylog/ELK/splunk
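A toy illustration of that pipeline using logger(1), with a made-up tag and payload:

    # the syslog daemon adds the timestamp, hostname and priority before forwarding
    logger -t payments-api '{"event": "charge.created", "amount_cents": 1999}'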


Log format is the least of your problems (FWIW: ElasticSearch is happy to take JSON).

A more interesting question is "can you actually learn from your logs?" Most logs are just isolated data points in time. I've built a logging system called Eliot for Python (https://eliot.readthedocs.io) that lets you generate logs with causality: it's a tree of actions where you can see which action caused which other action, and whether each action failed. Actions can span processes, so it's also a tracing system.

This makes debugging complex systems much easier because instead of "X happened. Y happened. Z happened." you can see that Z happened because of X, but Y was unrelated.

(Output is JSON, easy to push into ElasticSearch.)

I'm sure there are other logging systems out there with similar features, or other approaches that work (combination of service-specific logging and protocol tracing, say). But the basic point is you should be thinking about what your goals are for the logs and what you want to learn from them... formats are easy to change and transform, but lost information is often lost forever.


My preference is:

2017-04-27T02:46:20.762-07:00 [$level] $message

Where $level is one of [debug, info, warn, error] and $message is a string log message.

If you're working with distributed systems, you may want to add $hostname between the date and the level.

Full example:

2017-04-27T02:46:20.762-07:00 web4 [warn] unable to connect to mobile interface


Being in a situation where logs are not read, or can't be read, is very similar to not having logs. So it's important to emphasize formatting and productivity when it comes to logs.

JSON can be useful. You can use jq (https://stedolan.github.io/jq/manual/) and common unix commands to parse logs (grep to filter, wc -l to count results, sort|uniq to get unique results, colrm to truncate lines, cut to get fields)... using those can be as powerful as using a query language.
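For example, counting the most common error messages is a short pipeline (the field names are just examples):

    grep '"level":"error"' app.json.log | jq -r .msg | sort | uniq -c | sort -rn | head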

Then, you can use a log aggregator to have access to all your logs. ElasticSearch/Kibana can be useful for that. Commercial services such as Splunk or SumoLogic can also be very helpful.


Don't forget Log Parser if you're using Windows.


Why not generate the logs as JSON with a key called 'humanreadable' that contains an easy-to-read logging statement that the devs will love?


If folks want a fancy format they can always use AWK to reformat the beautifully readable developer logs. Anyway why use Elastic Search when you could use Splunk?


Probably because Splunk is ridiculously priced, even for enterprise software.


I quite like Splunk! I prefer it over ES, but I haven't used ES as much.


On one really large project with lots of microservices that generated requests to lots of other microservices, we ended up using JSON logs and a convention. We had libraries in all the languages we were using (PHP, Node.js, Go, Python, C/C++) that followed it.

The convention was simple. Every log MUST have at least these three parameters: --action, --r, and --t, and one optional one: --p.

--action would be the "subject" of the message with a dot notation. A lot of times it's simply class.method or class.method.something.finished.

--r is the "request" hash. This hash represented everything that has happened in the current process to handle whatever was asked of it. So for a web request, the --r would start as the request came in from nginx, and the --r would end when the response was sent back.

--t is the "thread" hash. If the calling process sent a --t parameter, use that. Otherwise, generate a new one.

--p is the "parent" hash. If the calling process sent a --r parameter, set that as --p. Otherwise, don't add it.

All of this was automated in the logging libraries. So generally a developer would simply call Log.d('whatever', {something: 'something something'}) or Log::d('whatever', ['something', 'something else']);

Requiring those four params gave us a few fantastic things. First, highlighting the --action field made it a lot like looking at your inbox: every message had a subject, which made for very simple skimming. Second, when hunting errors, it was very easy to find all the related log messages by searching for the matching --r(equest) value. Third, for larger processes that sometimes involved tens of servers, it was easy to generate a tree of the entire request from start to finish, simply by using the --t(hread), --p(arent), and --r(equest) values.

These logs were all sent to rsyslog, which forwarded all logs to a central server (Also running rsyslog), which kept up beautifully. Besides handling the collection and transport for us, it was great to be able to watch the logs centrally - and then on each server when necessary - just by tailing /var/log/syslog. I had written a couple parsers for our logs on the central server, but eventually, as the project grew, our logs became far too much to handle with simple tailing apps, so we went with ELK (Elasticsearch, Logstash, Kibana).

Back then we used Logstash for the conversion, but on recent projects I've just used rsyslog's @cee plugin, which skips the "L" (Logstash).

And, over time, we'd added a few more indices that worked well with kibana dashboards and tables. The two most useful were --i and --ms.

An --i parameter was simply an incrementing int starting at 0 for every new --r(equest) hash. Lots of times our logs came in out of order, so having an explicit incrementing value in the log messages was really handy. An --ms parameter was for milliseconds: any log message that had timed something included an --ms value. That was most excellent for dashboards.

Also, all hashes were simply truncated SHA1 hashes. For instance, on the API server, we explicitly took the server IP, request URI, headers, parameters and a timestamp (to the ms), hashed it, and truncated it to 12 chars. Any collisions were generally spread far enough apart in time that they weren't an issue.
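A rough shell equivalent of that hashing scheme, with placeholder variable names (assumes GNU coreutils):

    printf '%s' "$server_ip$request_uri$headers$params$ts_ms" | sha1sum | cut -c1-12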


What part of making everyone happy are you struggling with? Logging is simple, don't over-complicate it, just dump plaintext to a file.

If Ops really need those features they can have a process read the logs and ingest them into whatever system they want. Plaintext with the fields the devs want (time stamp, severity, etc) is easily parsable. Don't underestimate what you can do with tail and a few scripts.


Ops is maybe trying to help developers and increase security (no login on the boxes needed anymore to look at logs), plus a better workflow for (certain kinds of) alerts, which may then get more complicated, and maybe Kibana dashboarding. That in turn will help developers track the logs for each request (as someone mentioned above, with a common UUID passed around).

And therein lies their mistake: provide the functionality (a single access point to logs, alerting on logs) and make the developers work to gain access to it, like it was just another service. You want ES access? Parse your logs; I'm not going to write filters for you (I can help, but you are going to do the bulk of the work). This guarantees logs in a parsable format within 3 months of developers being on call, particularly if you have a container scheduler (we're talking about microservices, right?) and use ASGs (and your policy for upgrades and box issues is "blow up the box and create another").


> Ops is maybe trying to help developers and increase security (no login on the boxes needed anymore to look at logs).

Can this be handled fine by having tail run as a daemon and forwarding all log entries to somewhere that the developers do have access?

> Better workflow for (certain kind of) alerts, which may then get more complicated.

This is likely to be highly specific as to what sort of alerts you want, but will tailing the log and running it through an awk script work? I assume you'll have to do something much like this with any alert management tool.

Tools like Kibana look nice, but this is one of those areas where I think people might be going for the shiny solution (that looks great to management) instead of analyzing what they really need.


You can use whatever system you want (syslog can drop the logs on a single box, no need for horrible tail-based concoctions), as long as you maintain it and it's not Ops' responsibility to fix your AWK scripts that look for events in logs from 3 different services for the same request and the same customer (and/or respond to the alerts those scripts generate at 2am after Bob forgot to update them to correctly parse a new log message - we have our own fires to fight already).

I'm not saying one should go straight to ELK. There are other ways, but at the end of the day you are going to implement a similar stack, guaranteed, and you are going to regret using freeform logging instead of a sensible structured format.


It sounds like Ops want all the control but want to do none of the work? Maintaining scripts like this is their job.


I'm sorry: it's your application, your alerts, your responsibility. Not Ops.


If it's my responsibility then I need access to the systems; otherwise I can't exercise that responsibility.

How does alert creation and maintenance not fall under ops? They sound like the sort of ops people that devops teams were created to replace.


Forwarding, yes. tail, no. tail will lose log data.

* http://jdebp.eu./FGA/do-not-use-logrotate.html#Problems



