
Fast and flexible observability with canonical log lines - mglukhovsky
https://stripe.com/blog/canonical-log-lines
======
ianstormtaylor
Great article! I always love hearing Stripe talk about their internals.

I've been using this practice and I agree that it's incredibly useful. I think
because people tend to think in terms of "logs", they end up overlooking the
much more useful construct of "canonical logs". A pile of fine-grained logs
is almost always less useful than a few fully-described canonical logs. Other
observability tools often call these "events" instead of "logs" for that
reason.

There's a tool called Honeycomb [1] that gives you exactly what this article's
talking about in a really nicely designed package out of the box. And since it
handles all of the ingestion and visualization, you don't have to worry about
setting up Kafka, or the performance of logplexes, or teaching everyone SQL,
or how to get nice graphs. I was a little skeptical at first, but after using
it for over a year now I'm completely converted.

If you record fully-described "events" for each request, and you use sub-spans
for the smaller segments of those requests, you also get a waterfall-style
trace visualization, which eliminates the last remaining need for fine-grained
logs.
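
As a rough illustration of the "one wide event per request" idea, here's a
sketch in Python (the field names and stubbed-out helpers are hypothetical,
not Honeycomb's actual SDK). Sub-span timings are just more fields on the
same event, which is what makes the waterfall view possible without any
fine-grained lines:

      import json
      import time
      import uuid

      def authenticate(request):
          # Stub standing in for real API-key authentication.
          return {"id": "usr_123"}

      def create_charge(user, request):
          # Stub standing in for the real charge logic.
          return {"charge_id": "ch_123"}

      def handle_request(request):
          # One wide event, accumulated across the whole request
          # lifecycle and emitted exactly once at the end.
          event = {
              "request_id": str(uuid.uuid4()),
              "http_method": request["method"],
              "http_path": request["path"],
          }
          start = time.monotonic()

          t0 = time.monotonic()
          user = authenticate(request)
          event["user_id"] = user["id"]
          event["auth_duration"] = time.monotonic() - t0

          t0 = time.monotonic()
          charge = create_charge(user, request)
          event["charge_id"] = charge["charge_id"]
          event["charge_duration"] = time.monotonic() - t0

          event["http_status"] = 200
          event["duration"] = time.monotonic() - start
          print(json.dumps(event))  # the single wide event for this request

      handle_request({"method": "POST", "path": "/v1/charges"})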

If this article seems interesting to you, I'd highly, highly recommend
Honeycomb. (Completely unaffiliated, I just think it's a great product.)

[1]: [https://www.honeycomb.io/](https://www.honeycomb.io/)

~~~
twic
Honeycomb's CEO on these "wide events":

> The most effective way to structure your instrumentation, so you get the
> maximum bang for your buck, is to emit a single arbitrarily wide event per
> request per service hop.

> We're talking wiiiide. We usually see 200-500 dimensions in a mature app.
> But just one write.

[https://twitter.com/mipsytipsy/status/1042817542648082432](https://twitter.com/mipsytipsy/status/1042817542648082432)

~~~
spimmy
^ CTO now, but thanks :D

------
firethief
It's interesting that they've found denormalizing their log data so useful.
I'm surprised to hear that it performs better for practical queries than a
database with appropriate indexes, and that they've been able to build more
ergonomic query interfaces than the standard relational approach a lot of
people already have experience with. But I don't know much about log
management at scale, so I'm only mildly surprised.

~~~
rco8786
As others mentioned, normalization is generally about saving space, not
increasing performance.

That said, I'm 100% positive that all those key/value pairs are indexed for
searching and querying purposes.

~~~
RhodesianHunter
Not so much indexed, just columnar.

------
manigandham
Strange that they went with plain text when the industry is converging on
(newline-delimited) JSON logs for structured data. JSON also serves as the
backbone of observability, with metrics and tracing being folded in and
output as JSON too.

Call them events and you can claim all the event-sourcing buzzwords too.

~~~
brandur
I wouldn't put too much emphasis on the plain text — we started logging back
when carrying everything via JSON would've been going against the grain. These
days it might've gone the other way (I'm not sure).

One point that I'd try to convey is that the canonical line technique works
for any kind of structured format. We use logfmt in all our examples, but JSON
would work just as well.
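
To make that concrete, here's the same canonical line rendered both ways,
borrowing field names from examples elsewhere in this thread; the information
carried is identical:

      event="Request finished" duration=0.009 http_status=200 request_id=req_123
      {"event": "Request finished", "duration": 0.009, "http_status": 200, "request_id": "req_123"}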

~~~
spearo77
I searched for logfmt and found your website –
[https://brandur.org/logfmt](https://brandur.org/logfmt)

What I can't find described for logfmt is how to handle escaping special
characters in keys or values, specifically quotes and equals signs.

JSON defines this very well, but simple key=value logging formats like
Splunk's suffer from escaping and parsing problems.
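
From what I can tell, the convention most logfmt implementations (e.g. the Go
packages) settle on is: double-quote any value that is empty or contains
spaces, equals signs, or quotes, and backslash-escape backslashes and quotes
inside the quoted form; keys get no escaping at all, so they're restricted to
safe characters up front. A rough Python sketch of that convention (not any
particular library's code):

      def logfmt_value(value):
          # Quote values that are empty or contain ' ', '=' or '"',
          # escaping backslashes and quotes inside the quoted form.
          s = str(value)
          if s == "" or any(c in s for c in ' ="'):
              escaped = s.replace("\\", "\\\\").replace('"', '\\"')
              return '"' + escaped + '"'
          return s

      def logfmt_line(pairs):
          # Keys can't be escaped in logfmt, so keep them to safe
          # characters rather than trying to quote them.
          return " ".join("%s=%s" % (k, logfmt_value(v)) for k, v in pairs.items())

      print(logfmt_line({"event": "Request started", "note": 'a "b" c', "expr": "a=b"}))
      # event="Request started" note="a \"b\" c" expr="a=b"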

~~~
mirekrusin
You could use a YAML serialiser, which leaves non-special strings as-is and
quotes special ones. But this approach can break naive grepping.

------
chrisweekly
Related tangent: I can't say enough good things about
[lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips,
w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used
it, querying logs was extremely easy, and reasonably fast (w/ up to several
million rows). Highest recommendation.

Disclaimer: I have no affiliation w/ the project or its maintainer -- but out
of gratitude I mention it pretty much every time it's appropriate.

------
osswid
We've been using logging like this, but with jsonl lines. They're still easy
to grep as straight text, but it's very handy to be able to parse them with
jq or other tools, and to have rich values (or even substructures) as part of
the log lines.
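
As a tiny example, filtering those lines doesn't even need jq; a Python
sketch (the user_id field is assumed) equivalent to
`jq -c 'select(.user_id == "usr_123")'`:

      import json
      import sys

      # Read jsonl canonical lines on stdin, keep events for one user.
      for line in sys.stdin:
          event = json.loads(line)
          if event.get("user_id") == "usr_123":
              print(json.dumps(event))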

------
edsiper2
Log structure is really important. From the examples provided, I'd suggest
the same approach can be taken in a full 'logfmt' style, where the timestamp
and the event type are also set as keys, e.g.:

      ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123

The main difference is that it makes parsing easier, since many tools can
parse logfmt without problems.

One interesting use case here for me is the ability to perform queries in a
schema-less fashion, so I'll give a quick pitch for what we are working on in
Fluent Bit[0] (an open source log project): the ability to query your data
while it's still in motion (stream processing on the edge[1]). Consider the
following data samples in a log file:

      ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
      ts="2019-03-18 22:48:32.991" event="User authenticated" auth_type=api_key key_id=mk_123 user_id=usr_123
      ts="2019-03-18 22:48:32.992" event="Rate limiting ran" rate_allowed=true rate_quota=100 rate_remaining=99
      ts="2019-03-18 22:48:32.998" event="Charge created" charge_id=ch_123 permissions_used=account_write team=acquiring
      ts="2019-03-18 22:48:32.999" event="Request finished" alloc_count=9123 database_queries=34 duration=0.009 http_status=200

So if I wanted to retrieve all events associated with user 123, I would
process the file as follows:

      $ fluent-bit -R conf/parsers.conf \
                   -i tail -p alias=data -p path=canonical.log -p parser=logfmt \
                   -T "SELECT * FROM STREAM:data WHERE user_id='usr_123';" \
                   -o null -f 1

The output is:

      [1552949312.991000, {"event"=>"User authenticated", "auth_type"=>"api_key", "key_id"=>"mk_123", "user_id"=>"usr_123"}]

The results are shown in raw mode, but they can be exported to stdout as
JSON, or sent to Elasticsearch, Kafka, or any other supported output
destination.

One of the great things about the stream processing engine is that you can
create new streams of data based on query results, and use windows of time
(tumbling windows) for aggregation queries and such.
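
To illustrate just the tumbling-window idea (a Python toy, not Fluent Bit's
actual SQL engine): events are bucketed into fixed, non-overlapping windows
and aggregated per window.

      from collections import defaultdict

      def tumbling_counts(events, window_secs=5):
          # Bucket (timestamp, fields) pairs into fixed, non-overlapping
          # windows and count events per http_status in each window.
          windows = defaultdict(lambda: defaultdict(int))
          for ts, fields in events:
              window_start = int(ts // window_secs) * window_secs
              windows[window_start][fields.get("http_status")] += 1
          return {w: dict(c) for w, c in windows.items()}

      events = [
          (1552949312.991, {"http_status": 200}),
          (1552949312.999, {"http_status": 200}),
          (1552949318.100, {"http_status": 500}),
      ]
      print(tumbling_counts(events))
      # {1552949310: {200: 2}, 1552949315: {500: 1}}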

[0] [https://fluentbit.io](https://fluentbit.io)

[1] [https://docs.fluentbit.io/stream-processing](https://docs.fluentbit.io/stream-processing)

------
sethammons
This is not unlike what we've been doing for years. We generate billions of
log lines like this daily as JSON and inspect them with Splunk. By having
consistent values across log lines, we can query and do neat things: "What
was our system timing in relation to users who have feature x?" "What
correlations can we find between users whose requests took too long and were
not throttled? -> ah, 99% of those requests show
$correlation_in_other_kv_pair!"
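
A toy version of that second kind of query, assuming canonical events with
duration and rate_allowed fields (names made up for illustration): take the
slow, unthrottled requests and count which other key/value pairs they share.

      from collections import Counter

      def shared_pairs(events, slow_threshold=1.0):
          # Keep slow requests that were not throttled, then count how
          # often each key/value pair appears among them; pairs near the
          # top are correlation candidates worth investigating.
          slow = [e for e in events
                  if e.get("duration", 0) > slow_threshold
                  and e.get("rate_allowed", True)]
          counts = Counter((k, v) for e in slow for k, v in e.items())
          return counts.most_common(10)

      events = [
          {"duration": 2.1, "rate_allowed": True, "feature_x": True},
          {"duration": 1.8, "rate_allowed": True, "feature_x": True},
          {"duration": 0.1, "rate_allowed": True, "feature_x": False},
      ]
      print(shared_pairs(events))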

~~~
brandur
(I'm the author, and) Yeah, whatever you might call them, canonical lines are
an "obvious" enough idea that I'd expect a lot of shops to have arrived at
them independently. Besides yourself, I've heard from a number of people for
whom that's been the case.

That said, it's also a surprisingly _non-_ obvious idea in many respects — a
lot of people are used to traditional trace-style logging and never come up
with a construct like this, so we felt it was worth calling out explicitly.

~~~
pmm
I feel that logs have been around for so long that it's easy to take their
capabilities for granted and not go much further. This is another example
that there's more that can be done. It reminds me of RFC 5424.

At LogSense.com we actually tackled this problem too and came up with
automatic pattern discovery that pretty much converts all logs into
structured data. I actually just posted about it here:
[https://news.ycombinator.com/item?id=20569879](https://news.ycombinator.com/item?id=20569879)
I'm really curious whether this is something you'd consider helpful, and any
feedback is very welcome.

------
thomas536
What's the primary use case for this? I almost always only look at logs to
debug things; very rarely to perform some sort of event math/analysis.

~~~
hinkley
How many requests per second are you dealing with? Scale changes what you look
at.

------
perq
Very similar to what logsense.com does with logs, except you don't need a
canonical log line, since patterns are found automatically.

------
msoad
Is there any legal restriction on how long you can keep internal system
logs? If it's done right they don't contain PII, but they _can_ be used to
track people if you have enough logs.

~~~
rco8786
Not to my knowledge, at least not in the US. CCPA (coming into effect Jan 1)
does give users the right to be deleted, which presents some challenges with
this sort of data, but nothing insurmountable.

------
winrid
This kind of debugging is one reason I loved Sumologic's Join query feature.

------
chasers
We do exactly this with Logflare and send stuff to BigQuery.

