Fast and flexible observability with canonical log lines (stripe.com)
221 points by mglukhovsky 25 days ago | 39 comments

Great article! I always love hearing Stripe talk about their internals.

I've been using this practice and I agree that it's incredibly useful. I think because people tend to think in terms of "logs", they end up overlooking the much more useful construct of "canonical logs". A pile of fine-grained logs is almost always less useful than a smaller number of fully-described canonical logs. Other observability tools often call these "events" instead of "logs" for that reason.

There's a tool called Honeycomb [1] that gives you exactly what this article's talking about in a really nicely designed package out of the box. And since it handles all of the ingestion and visualization, you don't have to worry about setting up Kafka, or the performance of logplexes, or teaching everyone SQL, or how to get nice graphs. I was a little skeptical at first, but after using it for over a year now I'm completely converted.

If you record fully-described "events" for each request, and you use sub-spans for the smaller segments of requests, you also get a waterfall-style trace visualization, which eliminates the last remaining need for fine-grained logs.

If this article seems interesting to you, I'd highly, highly recommend Honeycomb. (Completely unaffiliated, I just think it's a great product.)

[1]: https://www.honeycomb.io/

Honeycomb's CEO on these "wide events":

> The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.

> We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.
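A minimal sketch of the pattern being described (the names here are illustrative, not Honeycomb's actual API): accumulate fields onto one object over the lifetime of a request, then emit a single wide event at the end.

```python
import json
import time


class CanonicalLine:
    """Accumulate context across a request; emit one wide event at the end."""

    def __init__(self):
        self.fields = {}
        self.start = time.monotonic()

    def add(self, **fields):
        self.fields.update(fields)

    def emit(self):
        # Total request duration is just one more dimension on the event.
        self.fields["duration"] = round(time.monotonic() - self.start, 4)
        return json.dumps(self.fields)


# Middleware, auth, rate limiting, and the handler all write to the same line:
line = CanonicalLine()
line.add(http_method="POST", http_path="/v1/charges")
line.add(user_id="usr_123", auth_type="api_key")
line.add(rate_allowed=True, rate_remaining=99)
print(line.emit())  # one write, arbitrarily many dimensions
```

In a real app the `CanonicalLine` would typically live in request-scoped middleware so every layer can contribute fields without passing it around explicitly.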


^ CTO now, but thanks :D

Thanks for talking about Honeycomb, I’ve been looking out for people's experiences as it looks like quite a nice tool.

We’ve just moved over to structured event based logging, but we’re still using tools like Sumo, New Relic, etc., and it’s not all nicely tied together.

I’ve been looking at Honeycomb to either replace or augment our existing tools but had little to go on besides the marketing pages.

Honeybee here. Feel free to just try it, there's a 14 day free trial, and a free community edition for small amounts of data :) experiment away, and our community slack is super friendly!

We certainly wouldn't fit into the community edition :)

Our main project is running on Django 1.11. I'm going to wait until we're on Django > 2 for the database tracing integration.

What I'd love to see is a screencast, demo, or series of screenshots that digs into the (out of the box) Django integration. NewRelic gives us a lot of insight into our database performance, including EXPLAIN traces for slow queries. Does Honeycomb provide something similar?

Yes, Honeycomb is great. It's one of those "I wish I had more big projects, just so I could use this more" services. Other APMs / logging systems are just not really comparable.

It does cost a bit more though :-)

I am wondering how things like OpenTracing-esque spans and sub-spans fit into the format Stripe describes. Are they just logged as `subspan1`, `subspan2`, `subspan3` in the log format?

It seems like that works, but I'm also unclear if maybe each sub-span is better off as its own log line? But that carries its own problems.
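One common way to make the first approach work (hypothetical field names, not necessarily what Stripe does) is to flatten each sub-span's timing into a prefixed field on the single canonical line, so you keep "just one write" while still capturing per-segment durations:

```python
import time
from contextlib import contextmanager

# Fields for the single canonical line, accumulated across the request.
canonical = {}


@contextmanager
def span(name):
    """Record a sub-span's duration as '<name>_duration' on the canonical line."""
    start = time.monotonic()
    try:
        yield
    finally:
        canonical[f"{name}_duration"] = round(time.monotonic() - start, 4)


with span("db_query"):
    time.sleep(0.01)  # stand-in for real work
with span("rate_limit"):
    pass

print(canonical)
```

The trade-off is that you lose the start/end ordering a real waterfall trace gives you, so past a certain depth separate span events are the better fit.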

How are these "events" different from sending this data to a metrics aggregator?

Because the events are correlated with each other. When you send individual metrics, the fact that you incremented A and B together isn't visible.
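A toy illustration of the difference: with counters you can see that A and B each happened, but not whether they happened in the same request; a single event per request preserves that correlation.

```python
from collections import Counter

# Metrics-style: two independent counters.
metrics = Counter()
metrics["cache_miss"] += 1
metrics["retry"] += 1
# We know each happened once, but not whether in the same request.

# Event-style: one record per request keeps the correlation.
events = [
    {"request_id": "req_1", "cache_miss": True, "retry": True},
    {"request_id": "req_2", "cache_miss": False, "retry": False},
]
both = [e for e in events if e["cache_miss"] and e["retry"]]
print(len(both))  # requests where both happened together
```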

It's interesting that they've found denormalizing their log data so useful. I'm surprised to hear that that performs better for practical queries than a database with appropriate indexes, and that they've been able to build more ergonomic interfaces to query that than the standard relational approach a lot of people already have experience with. But I don't know much about log management at scale, so I'm only mildly surprised.

Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there is only one source of truth.

Yes, exactly — normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs that's rotating through the system on a pretty constant basis.

And addressing the parent's point on databases: they don't look like an RDBMS, but you can think of log management/querying systems like Splunk et al. as a specialized database with specific properties:

- Flexible indexing: Logs change frequently, which makes keys come and go, so it's convenient not to have to constantly build new indexes to make them searchable.

- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so these systems generally rotate data through different tiers of storage as it ages — the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.

- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained — it scales horizontally much more easily because it doesn't have to concern itself with things like consensus around write consistency.)

How many columns does the average canonical log entry at Stripe have? What's the mix of low/high cardinality string fields look like vs number of metric/counter fields?

On the order of many dozens of fields and it's a pretty good mix of all of those.

Lots of low cardinality fields, lots of counters and numbers (e.g. request duration), and quite a few high cardinality fields too. e.g. IDs, IPs.

Logs can be treated as database rows regardless of source format (plaintext, csv, JSON, etc). The modern approach for dealing with large scale tables is column-oriented storage and databases which can easily handle billions of log lines without indexes by using ordering, partition maps, compression, etc.

As others mentioned, normalization is generally about saving space, not increasing performance.

That said, I’m 100% positive that all those key value pairs are indexed for searching and querying purposes.

Not so much indexed, just columnar.

It's also about DRY and accurately modelling the data for general operations. Imagine the difference between a data environment where a GDPR deletion request comes in, and everything has a relation back to customer identity, and one where the customer identity is denormalised out to many places or only implicitly present.

Strange that they went with plain text when the industry is converging on (newline delimited) JSON logs for structured data. This also serves as the backbone of observability with metrics and tracing also being folded into and output as JSON.

Call them events and you can claim all the event-sourcing buzzwords too.

I wouldn't put too much emphasis on the plain text — we started logging back when carrying everything via JSON would've been going against the grain. These days it might've gone the other way (I'm not sure).

One point that I'd try to convey is that the canonical line technique works for any kind of structured format. We use logfmt in all our examples, but JSON would work just as well.
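To make that concrete, here is the same (hypothetical) canonical line emitted in both encodings — the fields are identical, only the serialization differs:

```python
import json

# Fields for one canonical log line (illustrative values).
fields = {"http_status": 200, "duration": 0.009, "request_id": "req_123"}

# The same line in two structured encodings:
logfmt_line = " ".join(f"{k}={v}" for k, v in fields.items())
json_line = json.dumps(fields)

print(logfmt_line)   # http_status=200 duration=0.009 request_id=req_123
print(json_line)     # {"http_status": 200, "duration": 0.009, "request_id": "req_123"}
```

(The naive join above only works for values without spaces or quotes; quoting is where logfmt gets fiddly, as discussed elsewhere in the thread.)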

I searched for logfmt and found your website – https://brandur.org/logfmt

What I can't find described for logfmt is how to handle escaping special characters in the keys or values. Specifically the quotes and equals.

JSON defines this very well, but Splunk simple key=value logging formats suffer from escaping and parsing problems.

You could use a YAML serialiser, which leaves non-special strings as-is and quotes special ones. But this approach can break naive grepping.
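For what it's worth, logfmt has no formal spec (which is exactly the problem), but the de facto convention most implementations follow is to double-quote values containing spaces, `=`, or quotes, and backslash-escape embedded quotes and backslashes. A sketch of a serialiser following that convention:

```python
def logfmt_escape(value):
    """Quote values containing spaces, '=' or quotes; escape embedded specials."""
    s = str(value)
    if any(c in s for c in ' ="'):
        s = '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return s


def to_logfmt(fields):
    return " ".join(f"{k}={logfmt_escape(v)}" for k, v in fields.items())


print(to_logfmt({"event": "Request started", "http_path": "/v1/charges"}))
# event="Request started" http_path=/v1/charges
```

Keys are harder: there's no widely agreed escaping for them, so most shops just restrict keys to safe characters like `[a-z0-9_]`.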

Related tangent: I can't say enough good things about [lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips, w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used it, querying logs was extremely easy, and reasonably fast (w/ up to several million rows). Highest recommendation.

Disclaimer: I have no affiliation w/ the project or its maintainer -- but out of gratitude I mention it pretty much every time it's appropriate.

We've been using logging like this but with jsonl lines. Still easy to grep as straight text, but very handy to be able to parse with jq or other tools and be able to have rich values (or even substructures) as part of the log lines.
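A small illustration of that dual nature (made-up log lines): jsonl stays greppable as plain text, but each line also parses into a full structure, nested values included.

```python
import json

# Two made-up jsonl log lines:
lines = [
    '{"event": "Request finished", "http_status": 200, "user": {"id": "usr_123"}}',
    '{"event": "Request finished", "http_status": 500, "user": {"id": "usr_456"}}',
]

# Greppable as straight text...
assert any("usr_123" in line for line in lines)

# ...but also parseable, with rich substructures:
errors = [json.loads(line) for line in lines if json.loads(line)["http_status"] >= 500]
print(errors[0]["user"]["id"])  # usr_456
```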

Log structure is really important. From the examples provided, I would suggest the same approach can be taken using a full 'logfmt' style, so the timestamp and the event type can be set as keys, e.g.:

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
The main difference is that you make parsing easier, since many tools can parse logfmt without problems.
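As a quick-and-dirty sketch of how mechanical that parsing is, `shlex` from the Python standard library happens to handle the double-quoted values (a real logfmt parser is more robust):

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict; shlex handles double-quoted values."""
    return dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)


line = 'ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST'
print(parse_logfmt(line))
# {'ts': '2019-03-18 22:48:32.990', 'event': 'Request started', 'http_method': 'POST'}
```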

One interesting use-case here for 'me' is the ability to perform queries in a schema-less fashion, and I'll give a quick pitch for what we are working on in Fluent Bit[0] (an open source log project): pretty much the ability to query your data while it's still in motion (stream processing on the edge[1]). Consider the following data samples in a log file:

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
  ts="2019-03-18 22:48:32.991" event="User authenticated" auth_type=api_key key_id=mk_123 user_id=usr_123
  ts="2019-03-18 22:48:32.992" event="Rate limiting ran" rate_allowed=true rate_quota=100 rate_remaining=99
  ts="2019-03-18 22:48:32.998" event="Charge created" charge_id=ch_123 permissions_used=account_write team=acquiring
  ts="2019-03-18 22:48:32.999" event="Request finished" alloc_count=9123 database_queries=34 duration=0.009 http_status=200
So if I wanted to retrieve all events associated with user 123, I would process the file as follows:

  $ fluent-bit -R conf/parsers.conf \
               -i tail -p alias=data -p path=canonical.log -p parser=logfmt \
               -T "SELECT * FROM STREAM:data WHERE user_id='usr_123';" \
               -o null -f 1
The output is:

  [1552949312.991000, {"event"=>"User authenticated", "auth_type"=>"api_key", "key_id"=>"mk_123", "user_id"=>"usr_123"}]
The results are in a raw format, but can be exported to stdout as JSON, or to Elasticsearch, Kafka, or any other supported output destination.

One of the great things about the stream processor engine is that you can create new streams of data based on results, use time windows (tumbling) for aggregation queries, and so on.

[0] https://fluentbit.io

[1] https://docs.fluentbit.io/stream-processing

This is not unlike what we've been doing for years. We generate billions of log lines like this daily as json and inspect them with splunk. By having consistent values across log lines, we can query and do neat things. "What was our system timing in relation to users who have feature x?" "What correlations can we find between users whose requests took too long and were not throttled? -> ah, 99% of those requests show $correlation_in_other_kv_pair!"

(I'm the author, and) Yeah, whatever you might call them, canonical lines are an "obvious" enough idea that I'd expect a lot of shops to have arrived at them independently. Besides yourself, I've heard from a number of people where that's been the case.

That said, it's also a surprisingly non-obvious idea in many respects — a lot of people are used to just traditional trace-style logging and never come up with a construct like them, so we felt they were worth calling out as something that might be worth doing.

I feel that logs have been around for so long that it's easy to take their capabilities for granted and not go much further. This is another example that there's more that can be done. It reminds me of RFC 5424.

At LogSense.com we actually tackled this problem too and came up with automatic pattern discovery that pretty much converts all logs into structured data. I just posted it here: https://news.ycombinator.com/item?id=20569879. I am really curious whether this is something you'd consider helpful, and any feedback is very welcome.

Oh, for sure. A lot of folks are doing really interesting things that others could learn from and they don't stop and write something up that helps the industry grow that much more. I'm glad y'all wrote this up; I might do a similar one!

What's the primary use case for this? I almost always only look at logs to debug things; very rarely to perform some sort of event math/analysis.

How many requests per second are you dealing with? Scale changes what you look at.

Very similar to what logsense.com does with logs, except you don't need to have a canonical log line, as patterns are found automatically.

Is there any legal restriction on how long you can keep internal system logs? If it's done right they don't contain PII, but they _can_ be used to track people if you have enough logs.

Not to my knowledge. At least not in the US. CCPA (coming into effect Jan 1) does give users the right to be deleted, which presents some challenges with this sort of data but nothing insurmountable.

I'm not sure if it's due to a legal requirement or not, but at my workplace (a university in Canada) we are required to keep all log files we produce in prod for 7 years.

There's some vague stuff. For example, GDPR requires you only keep data for a "reasonable" period of time, so keeping logs for many, many years would likely not be considered reasonable in most scenarios.

I suspect as a payment processor, though, being able to look back far when investigating breaches etc. would be important.

This kind of debugging is one reason I loved Sumologic's Join query feature.

We do exactly this with Logflare and send stuff to BigQuery.
