Building Netflix's Distributed Tracing Infrastructure (netflixtechblog.com)
98 points by atg_abhishek on Oct 21, 2020 | 17 comments



Always impressed to see what Netflix comes up with -- you almost always know you've had a good idea if it overlaps with what Netflix has built internally.

In my case though, I think people are mostly going about this wrongly -- my ideal observability tool is:

- structured logs as the source of truth. Metrics and tracing are really just special cases of a normal log stream. For applications it doesn't get much simpler than log-to-stdout/stderr, and the log shipping tools we have today can pick out, filter, and bucket the metrics/traces/regular logs that come through (or do it on a node-local collector). You don't even necessarily have to propagate trace context very well if you know that the logs from roughly the last 1-5 seconds are likely relevant and you can filter for signal (warnings/errors/etc.) there. (A sketch of this follows the list.)

- all-in-one but pluggable with reasonable defaults, for easy administration (because you don't really care how team X does their observability, you just want your own little instance with your own data in it)

- federated (rather than worrying about scaling one mega large service, why not run little ones that can move queries to data and come back?)

- Histograms, because distribution/bucketing is the fastest way to solve a lot of these problems

- schema driven (e.g. sprinkle in some JSON-LD)

- standardized RED/USE for everything (so your load balancer, network card, CPU, as low/high as you dare/care to go vertically)
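For the first point, here's a minimal sketch of what "everything is one structured stream" could look like (all field names -- "kind", "trace_id", etc. -- are invented for illustration, not from any particular tool): a counter increment, a span, and a plain log message are just different shapes of the same stdout record, and the shipper routes on the `kind` field:

```java
import java.time.Instant;

// Sketch: metrics, spans, and plain logs as one structured stream on stdout.
// Field names ("kind", "trace_id", ...) are made up for illustration; a log
// shipper or node-local collector would route/bucket on the "kind" field.
public class StructuredLog {
    static void emit(String json) {
        System.out.println(json);
    }

    public static void main(String[] args) {
        String ts = Instant.now().toString();
        // A "metric" is just a log line with a name and a value...
        emit(String.format(
            "{\"ts\":\"%s\",\"kind\":\"metric\",\"name\":\"http_requests_total\",\"value\":1}", ts));
        // ...a "span" is a log line with trace context and a duration...
        emit(String.format(
            "{\"ts\":\"%s\",\"kind\":\"span\",\"trace_id\":\"abc123\",\"span_id\":\"def456\"," +
            "\"op\":\"GET /titles\",\"duration_ms\":42}", ts));
        // ...and a plain log keeps the same envelope, so the "logs from the
        // last few seconds are probably relevant" trick is a cheap time filter.
        emit(String.format(
            "{\"ts\":\"%s\",\"kind\":\"log\",\"level\":\"warn\",\"msg\":\"cache miss storm\"}", ts));
    }
}
```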

To bring all these "features" together, I think the UX you'd put on top of that is incident driven and incident focused -- context is what matters for solving non-obvious problems, and getting the right context as quickly as possible usually comes down to distance-to-incident. Beyond that, the real next-level goal is to never have an un-analyzed/un-tagged incident; then you can start recognizing problems (via simple rules, at least) and suggesting solutions, or identifying problematic commits before they land (in your infra-as-code repo, or your code-as-code repos).


> Metrics and tracing are actually just special cases of any normal log stream.

Theoretically perhaps, but for high-volume metrics (think counters in a tight loop), you really want some form of pre-aggregation in memory before flushing out to disk or the network.

If you look at the Prometheus model for example, there's no flushing to disk or network at all, but rather a Prometheus server scraping from hosts on demand.
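To make the pre-aggregation point concrete, here's a sketch (not any particular library's API) of what both models share: increments land on a process-local counter, and only an aggregated snapshot ever leaves the process, whether pushed on a timer or pulled by a scraper:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch of in-memory pre-aggregation: the hot path touches only a
// process-local LongAdder; one aggregated record leaves the process per
// flush interval (or, in the Prometheus model, whenever a scraper asks).
public class PreAggregatedCounter {
    private static final LongAdder requestsTotal = new LongAdder();

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        // One record per second leaves the process, no matter how hot the loop runs.
        flusher.scheduleAtFixedRate(
            () -> System.out.println("requests_total " + requestsTotal.sum()),
            1, 1, TimeUnit.SECONDS);

        // The tight loop: millions of increments, zero I/O per increment.
        for (long i = 0; i < 50_000_000L; i++) {
            requestsTotal.increment();
        }

        Thread.sleep(1500); // let one flush happen, then exit
        flusher.shutdownNow();
    }
}
```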


I find it interesting that they do not mention https://opentelemetry.io/ at all


OpenTelemetry is just an API for instrumenting your software. For actually collecting and viewing your traces, you'll need an actual implementation -- Jaeger and Zipkin are some open source options.

If you develop an open source project and one of your goals is "I sure wish my users could use Datadog instead of being forced to use an open source thing!" then OpenTelemetry is a project you should keep an eye on. The goal is to make the instrumentation agnostic to the underlying provider.
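Roughly, the instrumentation side looks like this with the OpenTelemetry Java API (a sketch; the service name, span name, and attribute are invented). Note the code never names Jaeger, Zipkin, or Datadog -- which backend receives the spans is decided by whatever SDK and exporter are wired up at startup, not by this code:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Sketch: instrumentation written against the vendor-neutral OpenTelemetry
// API. Whether spans end up in Jaeger, Zipkin, or Datadog is a deployment
// decision (the SDK + exporter configured at startup), not a code change.
public class CheckoutService {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("checkout-service"); // name is illustrative

    void processOrder(String orderId) {
        Span span = tracer.spanBuilder("processOrder").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... business logic; child spans pick up this context ...
        } finally {
            span.end();
        }
    }
}
```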


They do; it wasn't called OpenTelemetry at the time, it was called OpenTracing:

> By 2017, open source projects like Open-Tracing and Open-Zipkin were mature enough for use in polyglot runtime environments at Netflix.

I don't think OpenCensus even existed in 2017; it looks to have been announced in 2018:

https://opensource.googleblog.com/2018/01/opencensus.html


Too new or not mature enough at the time: “When we started building Edgar four years ago, there were very few open-source distributed tracing systems that satisfied our needs.”


Some of their engineers have made PRs to the OpenTelemetry JS repository, but it wouldn't have been mature enough (or even existed) when much of the work described in the blog was underway.


Because they don't use it or because they "reinvented the wheel"? Perhaps it wasn't mature enough when they started this journey.


Curious to know if any TSDBs (time-series databases) were evaluated before settling on Cassandra to store the traces.


I don't think time is a particularly strong keying methodology for distributed tracing. You want to be able to quickly select all spans that have a certain trace ID. The time that the span was ingested is relatively unimportant to the querying mechanics, and you don't do any operations like "combine samples into a 1h average after 5d".
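Concretely (a hypothetical schema, not Netflix's actual tables), a wide-column layout keyed on trace ID makes "give me every span for trace X" a single cheap partition read, with time appearing only as ordinary column data:

```java
// Hypothetical CQL schema (illustration only, not Netflix's actual tables):
// spans are partitioned by trace_id, so the common query -- all spans for
// one trace -- is a single partition read. Time is just a column; nothing
// is keyed or downsampled by it, which is why a TSDB buys little here.
class SpanSchema {
    static final String CREATE_SPANS = """
        CREATE TABLE spans (
          trace_id    uuid,
          span_id     uuid,
          parent_id   uuid,
          service     text,
          start_us    bigint,
          duration_us bigint,
          PRIMARY KEY (trace_id, span_id)
        )""";
    static final String SPANS_FOR_TRACE =
        "SELECT * FROM spans WHERE trace_id = ?";
}
```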


When I was at Dropbox I did do this ... we only retained traces for a very short time, so I did daily aggregations of traces into summary data structures that could attribute service time to various span types at different depths of the request etc. Don't ignore the value of aggregating and summarizing trace data. Of course, a TSDB is also useless for this, I agree there.
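Something in the spirit of that summary structure (types and field names invented here; the real version was surely richer): before the raw traces age out, fold each day's spans into a small map that attributes total service time to each (span type, depth-in-request) pair, and keep only the map:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of trace summarization (names invented): before raw traces expire,
// fold them into a tiny per-day structure attributing total service time
// to each (span type, depth) pair. The summary is cheap to keep for months.
public class TraceSummarizer {
    record SpanRec(String service, String op, int depth, long durationUs) {}

    static Map<String, Long> summarize(List<SpanRec> spansForDay) {
        Map<String, Long> timeByTypeAndDepth = new HashMap<>();
        for (SpanRec s : spansForDay) {
            String key = s.service() + "/" + s.op() + "@depth" + s.depth();
            timeByTypeAndDepth.merge(key, s.durationUs(), Long::sum);
        }
        return timeByTypeAndDepth; // keep this; let the raw traces age out
    }
}
```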


Yeah, using the trace data to get RPC latency and success/error charts is useful, but that's just another branch off the data-stream. Your metrics system needs to inspect every trace, so your database only needs the ability to send you a list.

(Jaeger seems to have Kafka support so you can do this in real time. Haven't tried it.)


I don't think there's any mature free TSDB available; the projects are maybe 3 years old.

InfluxDB is more recent and quite limited. Things like sharding are not supported, and they've stated it never will be, except in a paid edition if they make one.

Prometheus is more recent. Similar story with scaling. They've changed storage formats and rewritten the engine once or twice in the past few years; it's moving really fast. It's more of a standalone product for server metrics (node exporter + Prometheus + Grafana); I wouldn't recommend using it as a general-purpose database.


As of today, sharding is supported in InfluxDB. Prometheus is not suitable for traces; it is more for metrics. The problem I see with all these DBs is that they are well suited only to specific applications: Prometheus for metrics, Influx for IoT. Timescale is still in its nascent stages.


For what it's worth, Jaeger, one of the most popular FOSS tracing servers, uses ES/Cassandra for storing its traces.


Also, Netflix is a Java shop, so anything JVM-based makes sense for them.


I feel that between LinkedIn, Twitter, and Netflix, all the Java design pattern/framework/architecture experts are accounted for. A pleasant surprise to me is that Oracle seems to have gotten rid of such products and people.



