Edgar: Solving Mysteries Faster with Observability (netflixtechblog.com)
80 points by talonx on Sept 12, 2020 | 13 comments



There are many alternatives for distributed tracing, like Lightstep, Jaeger, and so on, but the ambitious level of integration with log searching (like ELK) and payload tracking makes this more like an integrated in-house Splunk. Great idea, and great to see the energy and enthusiasm put into making debugging tools better! One dream feature for a tool like this: code execution counts showing which version of the code and even which lines were executed - useful in aggregate, but ideally per trace.

Unfortunately, the tradeoff between the value gained from saved debugging time and the cost of infrastructure and development is hard to manage. Storage costs are very easy to measure, so it is tempting to go after them rather than the more intangible benefits, which rely on a counterfactual of how hard things would be to debug without the tool.


Super powerful platform tools like these are crazy hard to build, even more so if platform teams don't have iron-fisted control of their infrastructure. So that's the curious thing to me organizationally.

I'm curious how they manage their infrastructure in a way that enables these tools.

If it's truly distributed - where individual teams can provision resources self-service - then they would have to mandate (or template) that new services have service discovery and eventing/logging as a condition for SLA contracts?

Is infra completely abstracted away from product teams? How are resources provisioned and new services developed in a way that ensures these enterprise capabilities are pervasive?


I think Netflix has what they call a paved road, whereby every service in the company comes from a preset collection of templates (a web service in Spring, a CLI in Go, etc.), and all of these templates ship most integrations out of the box. It's fine to bring new frameworks/tools, but for those things to make it to prod they have to have a certain level of integration with the paved road.

This is what I was told by people who tried to implement something similar at a different company, so it's secondhand, not established fact.


I'm seeing this "3 pillars of observability" framing as the main description these days, and I think it is a nice breakdown of the options. However, there is a common problem I am curious how others are dealing with. All metrics systems I have seen don't handle high-cardinality data like IDs well. Most SaaS products base their pricing on unique metric series because each unique aggregation has a cost. In the 3-pillars model, companies like Datadog are pushing those high-cardinality values into the log pillar, and this article seems to imply the same. However, logs are often unstructured. There are tools to search through logs, find values within them, and even do aggregations on them. But when you know ahead of time what you want to aggregate on, text logs are more brittle than simply defining an event in JSON or another structured format. This log tier quickly becomes an ad-hoc version of a data warehouse, and I feel like there is a missing tier here where you would send structured data to a datastore used for aggregations for observability purposes only. I know Datadog supports parsing structured data in this way, but I'm curious what the common solution is.

Is sending structured data through the "log" tier common, or is there a structured event and reporting system that is part of another tier and seldom discussed in this framing?
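
To make the distinction concrete, here's a toy example (field names invented) of the same fact as an unstructured log line versus a structured event you could aggregate on directly:

    Unstructured: 2020-09-12 10:01:22 INFO Checkout failed for user 12345 after 3 retries
    Structured:   {"event": "checkout_failed", "user_id": "12345", "retries": 3, "ts": "2020-09-12T10:01:22Z"}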


In our case, we use Grafana Loki for logging, which emphasizes a small number of indexes but has great support for “sequential scans” of log files.

We send the logs in a semi-structured format (logfmt), and we have no real performance problems with this. The trick is that you want to optimize for “easy-to-use regexes”, which we did not have with JSON. Logfmt gives us this while still being easy to read by eye.
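
Roughly what that looks like in practice - a logfmt line and the kind of Loki (LogQL) query we run against it; the label and field names here are made up:

    level=warn msg="checkout failed" user_id=12345 status=502 duration=1.8s

    {app="checkout", env="prod"} | logfmt | status >= 500 and duration > 1s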

One of the key decisions we made, however, was not to use any cloud provider for this but to run it ourselves. It gives us much more freedom in how we interact with our metrics/data, without the risk of huge bills (I’m looking at you, Datadog; your billing practices are terrible).


Traces can be a good place to handle high-cardinality observability data. A trace is essentially a particular form of structured log: spans support arbitrary tags, and trace analysis systems can be optimized for quick aggregations over high-cardinality fields.
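
As a sketch of what that looks like with OpenTelemetry's Java API (the service and attribute names here are invented), you just attach the high-cardinality values to the span instead of to a metric label:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;

    public class CheckoutTracing {
        private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("checkout-service");

        void chargeCard(String customerId, String orderId) {
            Span span = TRACER.spanBuilder("charge-card").startSpan();
            try {
                // High-cardinality values go on the span, not into a metric series.
                span.setAttribute("customer.id", customerId);
                span.setAttribute("order.id", orderId);
                // ... do the work ...
            } finally {
                span.end();
            }
        }
    }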


> Edgar captures 100% of interesting traces, as opposed to sampling a small fixed percentage of traffic.

Very interesting. This is probably my single biggest complaint with AWS X-Ray, which I otherwise am a huge fan of and find really useful. Would love to learn how they ensure their "interesting" classifier works well, or how to work around it when it doesn't classify things properly.
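
Capturing "100% of interesting traces" usually amounts to tail-based sampling: buffer the whole trace, then decide at the end whether to keep it. A toy keep/drop predicate might look like the sketch below - the types, thresholds, and field names are hypothetical, not Edgar's actual logic:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical summary of a finished trace.
    record TraceSummary(String endpoint, long durationMillis, boolean hasError, String customerId) {}

    class InterestingTraces {
        static boolean isInteresting(TraceSummary t,
                                     Map<String, Long> slowThresholdMs,
                                     Set<String> watchedCustomers) {
            return t.hasError()
                || t.durationMillis() > slowThresholdMs.getOrDefault(t.endpoint(), 1000L)
                || watchedCustomers.contains(t.customerId());  // e.g. a customer under active investigation
        }
    }

The hard part the comment points at is exactly what goes into that predicate, and what you do when a trace you needed was dropped because it didn't look interesting at the time.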


Every tech / SaaS product will eventually be reinvented internally at Netflix. Or vice versa.

Without being facetious: I would love to see some of their cost/benefit analyses.


What would be the equivalent commercial offering for this?


Datadog can give organizations this functionality. You can pull in traces/spans, logs, and metrics, and fully correlate across all of those different sources. Specifically for this blog post, I would focus on APM + Log Management.

(Full Disclosure: I work at Datadog)



Honeycomb comes to mind


Here's what I want from an observability/diagnostic platform. Are there tools, open source or SaaS, that achieve something like this in a reasonable way? Am I ignorant, or is this actually too hard to do at scale?

1. Detailed response time for every single endpoint, as a histogram and not just an average. Tools often only give you the top X endpoints by some global measure. At my work the number of unique endpoints is large, and a good response time varies a lot between them, easily from 50ms to 1s, so a global threshold is useless. There's also the fact that different users have different costs on the same endpoint, but I don't have a great solution for that. (A sketch of per-endpoint histograms is at the end of this comment.)

2. Some level of tracing for requests that cross a threshold. That means the agent has to keep a threshold for every endpoint, but is that so expensive? I think I saw some ideas (in a commercial product) where they collect trace data and drop it if the request ends up being fast enough, which seems like a very good approach. So I want specifics when something is slow, measured against a per-endpoint threshold: perhaps all requests slower than the 99th percentile, perhaps keeping 20% of those above the 90th percentile, etc.

3. The per-endpoint statistics have to be kept for a significant period, months not days, otherwise how would you see the change? I have had things slow down due to a code change, due to usage growth, due to a query plan getting whacky. You get different performance at different levels of usage (concurrency). No one can afford per-second resolution for a year, sure, but if my 9am spikes are averaged out, how would I know whether this is an old problem or the spike actually worsened by 20% in the last two months? I think you can reduce resolution a lot if you keep a histogram and not just an average, but I don't think anyone optimizes for that. And you also need to keep those important traces for quite a long time.

4. I work in Java with unstructured logs and have been trying to figure out how to parse them reliably for debugging so that we don't spend hours grepping. I recently realized the fields behind our most common queries are easily parsable: user, server/app-instance, and code location (Java loggers print class:line). That narrows things down so much that I can afford to just extract those and grep the rest when I need to. But most log tools seem to be either pure grep or fully structured. Also, as someone else mentioned regarding cardinality: the other fields are fine, but user is definitely high-cardinality, so everything might break down there.

For my problems, at my scale (millions of users, not millions of concurrent users), I think these are nicely solvable given the capabilities today's tools have (even if they don't do it exactly like that). But is this unscalable in the large, or is it just not needed if you do `something` that I/we are not doing?
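
For point 1, the sketch I mean is roughly this, using Micrometer's per-endpoint percentile histograms (the registry wiring, metric name, and endpoint template are assumptions, not a recommendation of a specific product):

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
    import java.time.Duration;

    public class EndpointLatency {
        private final MeterRegistry registry = new SimpleMeterRegistry();

        void record(String endpoint, long elapsedMs) {
            Timer.builder("http.server.requests")
                .tag("uri", endpoint)                          // one series per endpoint template
                .publishPercentileHistogram()                  // export buckets, not just an average
                .maximumExpectedValue(Duration.ofSeconds(10))
                .register(registry)                            // registry deduplicates repeat registrations
                .record(Duration.ofMillis(elapsedMs));
        }
    }

The parts that still feel unsolved to me are the long retention at useful resolution (point 3) and the per-endpoint "keep this trace" decision (point 2), which is basically the tail-based sampling idea discussed elsewhere in this thread.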




