
Ask HN: Best practices for log format? - user101010101
We're currently going through a bikeshedding discussion about the log format that our microservices will generate.

Developers without Ops responsibilities want the logs to be as readable as possible to aid their development process (traditional timestamp, severity, component, action and a quick description in a single line and, depending on the type of event, maybe a JSON object in many lines with details about the request, possibly with blank lines and indentation to make everything very easy to read).

Ops people want the files to be easily parseable for ingestion into ElasticSearch. They want to avoid complex configuration and want more flexibility to generate reports. If logs were generated in a single line in JSON format, they would be happy.

It seems there is no way to please everybody. Are there any standard formats that microservices-oriented architectures are using these days? Is there a standard at all? How to approach this?
======
eeZah7Ux
Do not invent your own formats.

There is an RFC for structured logging [1]

Also, journald does structured logging, plus indexing and searching like a
simple database and it's designed for your use case. It can receive the logs
and forward them using a connector for ElasticSearch [2]

[1] [https://tools.ietf.org/html/rfc5424](https://tools.ietf.org/html/rfc5424)
[2] [https://www.loggly.com/blog/why-journald/](https://www.loggly.com/blog/why-journald/)

~~~
paulddraper
RFC5424 Syslog defines some of the common stuff all logs should have: hostname,
app name, datetime, message, priority.

Many logging tools (Splunk, Loggly, etc.) consume syslog. Syslog works over
UDP and TCP. Java Logback has a syslog appender. Linux systems have rsyslog
for forwarding, writing, etc.

You'll wind up including your own custom structured data of course (IP
address, response time, user id etc.), but it's nice to use a recognized standard
for the standard parts. (Kind of like, say, using status code, cache-control,
and content-type in HTTP.)

~~~
eeZah7Ux
The RFC defines key-value structured logging.

[https://tools.ietf.org/html/rfc5424#section-6.3.5](https://tools.ietf.org/html/rfc5424#section-6.3.5)
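
For reference, a 5424 message with one SD-ELEMENT looks roughly like this (adapted from the RFC's own examples; the SD-ID and params here are illustrative):

    <165>1 2017-04-27T09:43:04.003Z web4.example.com myapp 1234 ID47 [exampleSDID@32473 rid="a1b2c3d4" user="42"] Payment accepted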

------
vayarajesh
If you have a microservices-based architecture, then it is very important to
track and log a request through the multi-service calls and relate the logs.

Use syslog with a common UUID set at the entry point of the request (typically
at the proxy or load balancer level, e.g. nginx); the logger then records that
UUID as a syslog parameter.

The syslog stream can then be parsed into JSON objects for the people in Ops
and sent to a service. Basically you could have one more internal microservice
to handle logs from syslog, parse them, and send them wherever Ops would need.
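
A rough sketch of the nginx side (assuming nginx >= 1.11.0, which has the built-in $request_id variable; the header name and log format are just one choice):

    http {
        log_format with_rid '$remote_addr [$time_iso8601] "$request" $status rid=$request_id';
        access_log /var/log/nginx/access.log with_rid;

        server {
            listen 80;
            location / {
                # pass the generated id to the upstream service,
                # which logs it with every message for this request
                proxy_set_header X-Request-ID $request_id;
                proxy_pass http://127.0.0.1:8080;
            }
        }
    }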

~~~
bruth
Along those lines (and possibly what the OP was looking for as well), what are
the conventions around key names including namespacing? Or what are the common
environment variables you capture for a given service, i.e. the boilerplate
stuff.

~~~
vayarajesh
Syslog has very detailed key names and namespacing
([http://www.iana.org/assignments/syslog-parameters/syslog-parameters.xhtml](http://www.iana.org/assignments/syslog-parameters/syslog-parameters.xhtml)),
which can be filled in with application-related data and can then be parsed by
any PaaS logging platform / OSS logging tool.

I think syslog is a widely accepted log format, and most of the tools that
revolve around log metrics support it, even more so the parsing tools.

------
Someone
General rule: writing logs is a means, not an end, so whoever reads it most
gets to decide what it should look like.

Corollary 1: allow developers to use whatever they see fit for DEBUG logging.

Corollary 2: require single-line json for everything else (if your developers
have to read lots of production logs, you probably have a bigger problem than
your log format)

You will need some way for ops to filter DEBUG lines out.

Options:

\- disallow having DEBUG logging in production

\- separate log file for DEBUG logging

\- start each line with the log level

Also: give the developers a tool that converts the 'one json line per event'
file to what they want. Configure _less_ and other tooling to know about it.

Optionally, tweak the json serializer used in logging to always output the
time stamp, severity, and short description at the start of the line.
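
A minimal sketch of such a converter in Python, assuming the JSON lines carry "ts", "level" and "msg" keys (adjust to whatever your serializer actually emits):

    #!/usr/bin/env python3
    # jsonlog2txt: read one-JSON-object-per-line logs on stdin and print a
    # human-friendly line, pretty-printing whatever structured detail is left.
    import json
    import sys

    for raw in sys.stdin:
        try:
            event = json.loads(raw)
        except ValueError:
            event = None
        if not isinstance(event, dict):
            print(raw, end="")  # pass non-JSON lines through untouched
            continue
        ts = event.pop("ts", "-")
        level = str(event.pop("level", "-"))
        msg = event.pop("msg", "")
        print(f"{ts} {level.upper():7} {msg}")
        if event:  # remaining keys are the structured details
            print(json.dumps(event, indent=2, sort_keys=True))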

------
m0hit
Have you looked into [https://brandur.org/logfmt](https://brandur.org/logfmt)?
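
(For anyone unfamiliar, logfmt is just flat key=value pairs on one line, so it stays greppable but is still trivially machine-parseable; the field names below are illustrative.)

    ts=2017-04-27T09:43:04Z level=info service=checkout request_id=a1b2c3d4 duration_ms=12 msg="payment accepted"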

Having lived with various points between the 'awesome to read' and 'simplest
to parse' I currently favor simpler parsing and investing in tooling that
makes manual parsing of logs easier.

Of course the decision should include the scale of the microservices. There
would be no benefit in optimizing too much for readability if the number of
requests to your services is so high that you would need complex filtering
before you can look at them.

Another way to cut down the bike shedding is to go with a library that wraps
around the format you choose. We built GitHub.com/clever/Kayvee and have
changed the format a few times. However since everyone uses the kayvee
libraries at Clever it's not too crazy to change the format.

Hopefully this was useful. Apologies for the disconnected paragraphs - on
public transit :)

------
mfontani
The only thing I can add to everyone else's replies is: VERSION IT.

This will allow you to write a parser which parses all versions of your log
format, whichever such versions are – based on the version… and allow you to
amend the log format at any time without worry that suddenly your old logs
aren't going to be compatible with new ones.
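
(Even something as small as a version field in each line is enough for the parser to branch on, e.g. a hypothetical JSON log line:)

    {"v": 2, "ts": "2017-04-27T09:43:04Z", "level": "info", "msg": "payment accepted"}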

"we" use this even for Apache access log files, and it's one small thing which
has helped a ton.

------
sz4kerto
Converting a structured format into a human-readable form is relatively easy.
Parsing unstructured logs is hard.

------
gbrown_
> traditional timestamp

As as sysadmin I implore you use ISO 8601 format, preferably with the UTC
offset.

~~~
proaralyst
Do you prefer the full format (2017-04-27T09:43:04,0010+0100) or a related but
more readable format (2017-04-27 09:43:04.0010 +01)?

~~~
doubleplusgood
My personal preference would be sans TZ, since everything I can control is set
to UTC immediately.

~~~
ash
Technically "Z" postfix is a TZ identifier for UTC time. It's short and makes
the timestamp compatible with RFC 3399:

    
    
        2017-04-27T09:43:04Z

~~~
ralmeida
Not only technically, but also practically, since many tools, libraries and
service APIs accept and produce the Z as the TZ identifier for UTC.

------
devillius
As part of my current role, I set up and managed a unified logging
infrastructure as part of a SIEM install. We're talking about ingesting logs on
the order of TBs a day (lots). Here are my observations:

1\. Syslog type logs are great when the volume is low, but field extraction
can be a pain (unless you're a Regex savant)

2\. Structured logs are nice to deal with. Fairly easy to read and a dream to
parse. Make sure you have a good ETL process in place to process these logs so
they are searchable.

3\. If you have a lot of logs, they are only as good as the usability of the
search on that corpus of text.

I use Splunk to ingest the logs and once the extractions are in place, it is a
dream to search. Other options like ELK are available and work pretty well
too.

Show the devs the advantages of having structured logs indexed and ready to
search. No more: SSH > cat | grep | tail | less | more

Real-time functionality and the ability to alert on certain conditions is an
added bonus.

------
tveita
If you're using a configurable logging library like logback for Java, it is
fairly easy to have your application support both structured logging directly
to Elasticsearch, and logging a more human readable format to disk or stderr
for development.

It would be nice if there was a standard format you could just ship to a
logging service, but the closest seems to be syslog, which is somewhat janky.

Re-parsing logs after they have been formatted for human consumption feels
like an anti-pattern motivated by anemic logging libraries.

~~~
justanotherbody
Exactly this. Configure two log handlers - one sending data to ES and another
to disk.
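
A minimal stdlib-only sketch of that in Python (paths and field names are placeholders; logback, bunyan, etc. can do the equivalent with two appenders):

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        """One JSON object per line, for the machines."""
        def format(self, record):
            return json.dumps({
                "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
                "level": record.levelname,
                "logger": record.name,
                "msg": record.getMessage(),
            })

    log = logging.getLogger("myservice")
    log.setLevel(logging.DEBUG)

    # handler 1: single-line JSON to a file, for shipping into ES/Splunk/etc.
    machine = logging.FileHandler("/var/log/myservice/app.json.log")
    machine.setFormatter(JsonFormatter())
    log.addHandler(machine)

    # handler 2: human-readable to stderr, for local development
    human = logging.StreamHandler(sys.stderr)
    human.setFormatter(logging.Formatter("%(asctime)s %(levelname)-7s %(name)s %(message)s"))
    log.addHandler(human)

    log.info("payment accepted")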

Alternatively, give the ops team what they want (more information) and then
generate log files that please the devs from ES query output

------
tstack
JSON-formatted log files can be pretty-printed by tools like the Logfile
Navigator:

[http://lnav.org](http://lnav.org)

That should make the decision a lot easier.

------
tex0
We generate structured logs, I.e. key-value pairs. Depending on the
environment they are either formatted with logfmt or JSON. This serves our use
cases quite well.

------
cosmosgenius
This is a node module but the idea can be applied generally
([https://github.com/trentm/node-bunyan](https://github.com/trentm/node-bunyan)).
Basically you output the logs in the format ops require (which I assume would
be structured), and for devs put a beautifier that parses the logs and prints
them in the format they require.

~~~
atis
Also this: [https://github.com/pinojs/pino](https://github.com/pinojs/pino)

------
olalonde
Not directly answering your question but

1) fluentd[0] can be used to parse diverse log formats

2) You can set an environment variable, e.g. `ENV`, and output to a more human
readable format when `ENV=development` (rough sketch below).

[0] [https://github.com/fluent/fluentd/](https://github.com/fluent/fluentd/)
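
Rough sketch of 2) with Python's stdlib logging (the JSON format string here is crude; in practice you'd use a proper JSON formatter):

    import logging
    import os
    import sys

    handler = logging.StreamHandler(sys.stdout)
    if os.environ.get("ENV") == "development":
        # readable single-line format for humans
        fmt = logging.Formatter("%(asctime)s %(levelname)-7s %(name)s %(message)s")
    else:
        # machine-oriented output in every other environment
        fmt = logging.Formatter('{"ts":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}')
    handler.setFormatter(fmt)
    logging.getLogger().addHandler(handler)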

~~~
dozzie
Fluentd's regexp parsing will be a PITA to maintain, both because regexps only
give match/no match, which is difficult to debug, and because testing a new
regexp set with Fluentd is cumbersome.

Liblognorm library is a much better choice. Regular rsyslog installation comes
with liblognorm already, though I prefer a standalone log parsing daemon (and
thus I wrote logdevourer).

------
al2o3cr

        maybe a JSON object in many lines with details about the
        request, possibly with blank lines and indentation to make
        everything very easy to read
    

Pretty-printing the JSON is a presentation concern - it shouldn't be baked
into what's being written to the logs.

------
yedpodtrzitko
JSON fed into ElasticSearch works pretty well for us. I was a bit sceptical
about it at first since I was used to grepping classic logs in /var/log/, but
it's actually really good. Kibana allows you to have a bunch of dynamic
criteria for the logs, and filtering JSON is obviously better than randomly
grepping for a thing in text. I can't think of much I miss from plain files in
/var/log, and one thing off the top of my head that annoys me in text logs:
when you don't know exactly when something happened (e.g. before midnight or
after midnight?), you have to grep two files (assuming you have one log file
per day). Also, to grep the files you need access to the servers, but the fewer
people have access there, the better.

------
tchaffee
Make one team responsible for maintaining logs and they get to decide the
"how". This is most likely Ops. Devs are now a customer with requirements and
Ops should find a solution that will both meet their needs and the needs of
devs. That will at least stop the bikeshedding.

------
serhei
This academic paper has some interesting insights about what distributed
systems (e.g. Hadoop) log in order to be able to reconstruct what happened
after a problem occurs.

[http://www.eecg.toronto.edu/~yuan/papers/zhao_stitch.pdf](http://www.eecg.toronto.edu/~yuan/papers/zhao_stitch.pdf)

In particular,

"\- log a sufficient number of events — even at default logging verbosity — at
critical points in the control path so as to enable a post mortem
understanding of the control flow leading up to the failure.

\- identify the objects involved in the event to help differentiate between
log statements of concurrent/parallel homogeneous control flows. Note that
this would not be possible when solely using constant strings. For example, if
two concurrent processes, when opening a file, both output “opening file”,
without additional identifiers (e.g., process identifier) then one would not
be able to attribute this type of event to either process.

\- include a sufficient number of object identifiers in the same log statement
to unambiguously identify the objects involved. Note that many identifiers
are naturally ambiguous and need to be put into context in order to uniquely
identify an object. For example a thread identifier (tid) needs to be
interpreted in the context of a specific process, and a process identifier
(pid) needs to be interpreted in the context of a specific host; hence the
programmer will not typically output a tid alone, but always together with a
pid and a hostname. If the identifiers are printed separately in multiple log
statements (e.g., hostname and pid in one log statement and tid in a
subsequent one) then a programmer can no longer reliably determine the context
of each tid because a multi-threaded system can interleave multiple instances
of these log entries."

Some of this may be duh-obvious, but it results in a naming scheme where all
important objects in the system have unique, printable ids.

Couple more observations: Hadoop logs tend to fit on one line (they can be
long lines, though), except in the case of exceptions. Exceptions are almost
always logged, even if they represent an ordinary situation that is recovered
from in the error handling code. (Many software failures in these systems
occur because a standard, recoverable exception is thrown in one place,
improperly handled in the error handling code, and then this causes a much
more severe exception further down the line.) All exception logs include a
multi-line stack trace.

------
geoah
Imo json gives you the most flexibility and adds as much metadata as the
developer (application side) and ops (in case you use an orchestration system
such as kubernetes, etc) need.

Eg in kubernetes it's pretty common for applications to log everything in
json; kubernetes will then wrap the app logs in another layer of json that
includes information about the container name, id, host ip, etc.
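
(For illustration, with Docker's default json-file log driver each stdout line from the app ends up wrapped roughly like this, and the log collector then typically adds pod/container metadata on top:)

    {"log":"{\"level\":\"info\",\"msg\":\"payment accepted\"}\n","stream":"stdout","time":"2017-04-27T09:43:04.003Z"}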

Microservices and containers will most likely require you to have some
centralized way to view your logs (kibana etc) so ops and devs can create
their own views.

Even in the basic case of tailing log files you can use a jq alias with a
custom format to convert the json logs to something colorful and readable.
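
Eg something like this, assuming "time"/"level"/"msg" field names:

    # assumes each JSON line has "time", "level" and "msg" fields
    tail -f app.json.log | jq -r '"\(.time) [\(.level)] \(.msg)"'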

Ps. On mobile so i apologise for possible mistakes

------
mnkmnk
We developed a common logging module that we include in our micro services to
log to json. It also takes in environment variables so we can include the
deployment related information to identify and query the logs once they get
centralized to splunk.

------
user5994461
All applications should generate JSON logs.

The transport system (syslog usually) will take care of time stamping and some
meta data.

application => (JSON text) => syslog daemon => (syslog message) =>
graylog/ELK/splunk

------
itamarst
Log format is the least of your problems (FWIW: ElasticSearch is happy to
take JSON.)

A more interesting question is "can you actually _learn_ from your logs?" Most
logs are just isolated data points in time. I've built a logging system called
Eliot for Python
([https://eliot.readthedocs.io](https://eliot.readthedocs.io)) that lets you
generate logs with _causality_ : it's a tree of actions where you can see
which action caused which other action, and whether each action failed.
Actions can span processes, so it's also a tracing system.

This makes debugging complex systems _much_ easier because instead of "X
happened. Y happened. Z happened." you can see that Z happened _because_ of X,
but Y was unrelated.

(Output is JSON, easy to push into ElasticSearch.)

I'm sure there are other logging systems out there with similar features, or
other approaches that work (combination of service-specific logging and
protocol tracing, say). But the basic point is you should be thinking about
what your goals are for the logs and what you want to learn from them...
formats are easy to change and transform, but lost information is often lost
forever.

------
nodesocket
My preference is:

2017-04-27T02:46:20.762-07:00 [$level] $message

Where $level is one of [debug, info, warn, error] and $message is a string log
message.

If you're working with distributed systems you may want to add $hostname
between the date and level.

Full example:

2017-04-27T02:46:20.762-07:00 web4 [warn] unable to connect to mobile
interface

------
partycoder
Being in a situation where logs are not read, or can't be read, is very
similar to not having logs. So it's important to emphasize formatting and
productivity when it comes to logs.

JSON can be useful. You can use jq
([https://stedolan.github.io/jq/manual/](https://stedolan.github.io/jq/manual/))
and common unix commands to parse logs (grep to filter, wc -l to count
results, sort|uniq to get unique results, colrm to truncate lines, cut to get
fields)... using those can be as powerful as using a query language.

Then, you can use a log aggregator to have access to all your logs.
ElasticSearch/Kibana can be useful for that. Commercial services such as
Splunk or SumoLogic can also be very helpful.

------
NicoJuicy
Don't forget log parser if you're using Windows.

------
gargarplex
why not generate the logs as json with a key called 'humanreadable' that
contains an easy-to-read logging statement that the devs will love?

------
Rodders7884
If folks want a fancy format they can always use AWK to reformat the
beautifully readable developer logs. Anyway why use Elastic Search when you
could use Splunk?

~~~
npalmer
Probably because Splunk is ridiculously priced, even for enterprise software.

------
enobrev
On one really large project with lots of microservices that generated requests
to lots of other microservices, we ended up using JSON logs and some
convention. We had libraries in all the languages we were using (php, node js,
go, python, c / c++) that followed the convention.

The convention was simple. Every log MUST have at least these three
parameters: --action, --r, and --t, and one optional one: --p.

\--action would be the "subject" of the message with a dot notation. A lot of
times it's simply class.method or class.method.something.finished.

\--r is the "request" hash. This hash represented everything that has happened
in the current process to handle whatever was asked of it. So for a web
request, the --r would start as the request came in from nginx, and the --r
would end when the response was sent back.

\--t is the "thread" hash. If the calling process sent a --t parameter, use
that. Otherwise, generate a new one.

\--p is the "parent" hash. If the calling process sent a --r parameter, set
that as --p. Otherwise, don't add it.

All of this was automated in the logging libraries. So generally a developer
would simply call Log.d('whatever', {something: 'something something'}) or
Log::d('whatever', ['something', 'something else']);
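
A single emitted line might then look roughly like this (hypothetical values; I'm writing the params as plain keys here):

    {"action": "order.create.finished", "r": "9f3c2b1a8d7e", "t": "1a2b3c4d5e6f", "p": "0fedcba98765", "data": {"something": "something something"}}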

Requiring those four params allows us a couple fantastic things. First,
highlighting the --action field made it a lot like looking at your inbox.
Every message had a subject; This made for very simple skimming. Second, when
hunting errors, it was very easy to find all the other related log messages by
searching for the matching --r(equest) value. Third, for larger processes that
sometimes involved tens of servers, it was easy to generate a tree of the
entire request from start to finish, by simply using the --t(hread),
--p(arent), and --r(equest) values.

These logs were all sent to rsyslog, which forwarded all logs to a central
server (Also running rsyslog), which kept up beautifully. Besides handling the
collection and transport for us, it was great to be able to watch the logs
centrally - and then on each server when necessary - just by tailing
/var/log/syslog. I had written a couple parsers for our logs on the central
server, but eventually, as the project grew, our logs became far too much to
handle with simple tailing apps, so we went with ELK (Elasticsearch, Logstash,
Kibana).

Back then we used Logstash for the conversion, but on recent projects, I've
just used rsyslog's @cee plugin, which skips the "L".

And, over time, we'd added a few more indices that worked well with kibana
dashboards and tables. The two most useful were --i and --ms.

An --i parameter was simply an incrementing int starting at 0 for every new
--r(equest hash). Lots of times our logs came in out of order, so having an
explicit incrementing value in the log messages was really handy. An --ms
parameter was for milliseconds. Any log message that had timed something
included an --ms message. That was most excellent for dashboards.

Also, all hashes were simply truncated SHA1 hashes. For instance, on the API
server, we explicitly took the server IP, request URI, headers, parameters and
a timestamp (to the ms), hashed it, and truncated to 12 chars. Any collisions
were generally spread far enough apart in time that they weren't an issue.

------
flukus
What part of making everyone happy are you struggling with? Logging is simple,
don't over-complicate it, just dump plaintext to a file.

If Ops really need those features they can have a process read the logs and
ingest them into whatever system they want. Plaintext with the fields the devs
want (time stamp, severity, etc) is easily parsable. Don't underestimate what
you can do with tail and a few scripts.

~~~
chronid
Ops is maybe trying to help developers and increase security (no login on the
boxes needed anymore to look at logs). Better workflow for (certain kinds of)
alerts, which may then get more complicated. Maybe kibana dashboarding. That
in turn will help developers track logs for each request (as someone
mentioned above, with a common UUID passed around).

And therein lies their mistake: provide the functionality (single access
point to logs, alerting on logs) and make the developers work to gain access
to it like it was any other service. You want ES access? Parse your logs, I'm
not going to write filters for you (I can help, but you are going to do the
bulk of the work). This guarantees logs in a parsable format in less than 3
months of developers being on call, particularly if you have a container
scheduler (we're talking about microservices, right?) and use ASGs (and your
policy for upgrades and box issues is "blow up the box and create another").

~~~
flukus
> Ops is maybe trying to help developers and increase security (no login on
> the boxes needed anymore to look at logs).

Can this be handled fine by having tail run as a daemon and forwarding all log
entries to somewhere that the developers do have access?

> Better workflow for (certain kind of) alerts, which may then get more
> complicated.

This is likely to be highly specific as to what sort of alerts you want, but
will tailing the log and running it through an awk script work? I assume
you'll have to do something much like this with any alert management tool.

Tools like Kibana look nice, but this is one of those areas where I think
people might be going for the shiny solution (that looks great to management)
instead of analyzing what they really need.

~~~
chronid
You can use whatever system you want (syslog can drop the logs in a single
box, no need for horrible tail-based concoctions), as long as you maintain it
and it's not Ops' responsibility to fix your AWK scripts that look for events
in logs from 3 different services for the same request and the same customer
(and/or respond to the alerts those scripts generate at 2am after Bob forgot
to update them to correctly parse a new log message - we have our own fires to
fight already).

I'm not saying one should go straight to ELK. There are other ways, but at the
end of the day you are going to implement a similar stack, guaranteed, and you
are going to regret using freeform logging instead of a sensible structured
format.

~~~
flukus
It sounds like Ops want all the control but to do none of the work?
Maintaining scripts like this is their job.

~~~
chronid
I'm sorry: it's your application, your alerts, your responsibility. Not Ops.

~~~
flukus
If it's my responsibility then I need access to the systems, otherwise I can't
exercise that responsibility.

How does alert creation and maintenance not fall under ops? They sound like
the sort of ops people that devops teams were created to replace.

