
Edgar: Solving Mysteries Faster with Observability - talonx
https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f
======
vii
There are many alternatives for distributed tracing, like Lightstep, Jaeger and
so on, but the ambitious level of integration with log searching (like ELK) and
payload tracking makes this feel like an integrated, in-house Splunk. Great
idea, and great to see the energy and enthusiasm put into making debugging
tools better! One dream feature for a tool like this: code execution counts
showing which version of the code, and even which lines, were executed. That
is useful in aggregate, but ideally you would have it for each trace.

Unfortunately, the tradeoff between the value of saved debugging time and the
cost of infrastructure and development is hard to manage. The storage costs
are very easy to measure, so it is tempting to go after them rather than the
more intangible benefits, which rely on a counterfactual of how hard things
would be to debug without the tool.

------
AndrewKemendo
Super powerful platform tools like these are crazy hard to build, even more so
if platform teams don't have iron-fisted control over their infrastructure.
So that's the curious thing organizationally to me.

I'm curious how they manage their infrastructure in a way that enables these
tools.

If it's truly distributed - where individual teams can provision resources
self-service - then they would have to mandate (or template) that new services
have service discovery and eventing/logging as a condition of their SLA
contracts?

Is infra completely abstracted away from product teams? How are resources
provisioned and new services developed in a way that ensures these enterprise
capabilities are pervasive?

~~~
ohnoesjmr
I think Netflix has what they call a paved road, whereby every service in the
company starts from a preset template (a web service in Spring, a CLI in Go,
etc.), and all of these templates ship most integrations out of the box. It's
fine to bring in new frameworks/tools, but for those things to make it to prod
they have to reach a certain level of integration with the paved road.

This is secondhand - it's what I was told by people who tried to implement
something similar at a different company - so not a fact.

------
aero142
I'm seeing this "3 pillars of observability" framing as the main description
these days, and I think it is a nice breakdown of the options. However, there
is a common problem I am curious how others are dealing with. None of the
metrics systems I have seen handle high-cardinality data like IDs well. Most
SaaS products base their pricing on unique metrics, because each unique
aggregation has a cost. In the 3 pillars model, companies like Datadog push
those high-cardinality values into the log pillar, and this article seems to
imply the same. However, logs are often unstructured. There are tools to
search through logs, find values within them, and even do aggregations on
them. But when you know ahead of time what you want to aggregate on, text
logs are more brittle than simply defining an event in JSON or another
structured format. This log tier quickly becomes an ad-hoc version of a data
warehouse, and I feel like there is a missing tier here where you would send
structured data to a datastore used for observability aggregations only. I
know Datadog supports parsing structured data in this way, but I'm curious
what the common solution to this is.

Is sending structured data through the "log" tier common, or are there
structured event and reporting systems, part of some other system, that are
seldom discussed in this context?
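
The brittleness argument is easy to make concrete. A toy sketch (the event
name, field names, and log wording below are invented for illustration),
contrasting regex extraction from a text log with reading a field from a
structured event:

```python
import json
import re

# Unstructured: aggregating requires a regex that breaks if the wording changes.
text_line = "checkout completed for user 42 in 137ms"
match = re.search(r"completed for user (\d+) in (\d+)ms", text_line)
duration_from_text = int(match.group(2))

# Structured: the same information as an explicit event; fields are stable
# and can be aggregated directly, with no parsing heuristics.
event_line = json.dumps({"event": "checkout", "user_id": 42, "duration_ms": 137})
event = json.loads(event_line)
duration_from_event = event["duration_ms"]

assert duration_from_text == duration_from_event == 137
```

The regex silently stops matching if someone rewords the message; the JSON
field survives any change to human-facing text around it.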

~~~
stingraycharles
In our case, we use Grafana Loki for logging, which emphasizes a small number
of indexes but great support for “sequential scans” of log files.

We send the logs in a semi-structured format (logfmt) and have no real
performance problems with this. The trick is that you want to optimize for
“easy to use regexes”, which we did not get with JSON. Logfmt gives us this
while still being easy to read by eye.
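
The “easy to use regexes” property can be sketched roughly like this; the
emitter and the quoting rule below are simplified assumptions, not Loki's
actual logfmt grammar:

```python
import re

# Emit a logfmt-style line: flat key=value pairs, quoting only values with spaces.
def to_logfmt(fields):
    parts = []
    for key, value in fields.items():
        parts.append(f'{key}="{value}"' if " " in value else f"{key}={value}")
    return " ".join(parts)

line = to_logfmt({"level": "info", "user_id": "42", "msg": "cache miss", "latency_ms": "7"})
# -> level=info user_id=42 msg="cache miss" latency_ms=7

# Parsing it back needs only one short regex, unlike nested JSON.
pairs = {}
for key, quoted, bare in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line):
    pairs[key] = quoted or bare

assert pairs["msg"] == "cache miss"
assert pairs["latency_ms"] == "7"
```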

One of the key decisions we made, however, was to not use any cloud provider
for this and to run it ourselves. It gave us much more freedom in how we
interact with our metrics and data, without the risk of huge bills (I’m
looking at you, Datadog, your billing practices are terrible).

------
yowlingcat
> Edgar captures 100% of interesting traces, as opposed to sampling a small
> fixed percentage of traffic.

Very interesting. This is probably my single biggest complaint with AWS X-Ray,
which I otherwise am a huge fan of and find really useful. I would love to
know how they ensure their "interesting" classifier works well, or how to
work around it when it doesn't classify things properly.
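
For comparison, here is a toy version of a tail-based "keep the interesting
traces" rule. The fields, thresholds, and rates are invented; the post does
not describe Edgar's actual classifier in this detail. The key point is that
the keep/drop decision happens after the trace completes, when errors and
total latency are known, rather than head-based at the first span:

```python
import random

def keep_trace(trace, slow_ms=500, baseline_rate=0.01):
    """Decide, after the trace is complete, whether to retain it."""
    if any(span["error"] for span in trace["spans"]):
        return True                          # always keep failures
    if trace["duration_ms"] > slow_ms:
        return True                          # always keep slow outliers
    return random.random() < baseline_rate   # small sample of normal traffic

failed = {"duration_ms": 80, "spans": [{"error": True}]}
slow = {"duration_ms": 900, "spans": [{"error": False}]}
assert keep_trace(failed)
assert keep_trace(slow)
```

The failure mode the comment worries about is visible here: a trace that is
"interesting" for reasons the rules don't encode falls through to the 1%
baseline sample.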

------
tnolet
Every tech / SaaS product will eventually be reinvented internally at Netflix.
Or vice versa.

Without being facetious: I would love to see some of their cost/benefit
analyses.

~~~
ohnoesjmr
What would be the equivalent offering of this?

~~~
gdgtfiend
Datadog can give organizations this functionality. You can pull in
traces/spans, logs, and metrics, and fully correlate across all of those
sources. For this blog post specifically, I would focus on APM + Log
Management.

(Full Disclosure: I work at Datadog)

------
tmd83
Here's what I want in an observability/diagnostics platform. Are there tools,
open source or SaaS, that achieve something like this in a reasonable way? Am
I ignorant, I wonder, or is this actually too hard to do at scale?

1. Detailed response times for every single endpoint, so a histogram and not
just an average. I have often seen tools give you the top X endpoints (for me,
mostly requests) by some global measure. At my work the number of unique
requests is large, and what counts as a good response time varies a lot among
them, easily from 50ms to 1s, so a global threshold is useless. There's also
the fact that different users have different costs on the same endpoint, but
for that I don't have a super good solution.
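
A per-endpoint histogram is cheap to keep in-process. A rough sketch, with
made-up bucket boundaries (the fixed-bucket approach is what Prometheus-style
histograms use):

```python
from collections import defaultdict
import bisect

# Bucket upper bounds in ms; the final implicit bucket is +inf.
BUCKETS = [10, 50, 100, 250, 500, 1000, 5000]

# endpoint -> list of per-bucket counts
hist = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

def record(endpoint, latency_ms):
    hist[endpoint][bisect.bisect_left(BUCKETS, latency_ms)] += 1

for ms in (40, 45, 900):
    record("GET /search", ms)
record("GET /health", 3)

assert hist["GET /search"] == [0, 2, 0, 0, 0, 1, 0, 0]
assert hist["GET /health"][0] == 1
```

Each endpoint's distribution is its own row, so a 1s search request and a 50ms
health check never share a threshold.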

2. Some level of tracing for requests that cross a threshold. That means the
agent has to keep a threshold for every endpoint, but is that so expensive? I
think I saw the idea (in some commercial product) of collecting trace data and
dropping it if the request ends up being fast enough. I think that's a very
good approach: I want specifics when something is slow, measured against a
per-endpoint threshold. Perhaps I keep all requests slower than the 99th
percentile, perhaps 20% of those above the 90th percentile, etc.
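
The "collect, then drop if fast enough" idea amounts to buffering spans per
request and only shipping them when the finished request beats its endpoint's
own threshold. A minimal sketch; the endpoint names and threshold values are
invented:

```python
# Hypothetical per-endpoint thresholds, e.g. rolling p99 estimates.
P99_MS = {"GET /search": 800, "GET /profile": 120}

shipped = []  # traces retained for the backend

def finish_request(endpoint, duration_ms, buffered_spans):
    threshold = P99_MS.get(endpoint, 250)  # fallback for unknown endpoints
    if duration_ms > threshold:
        shipped.append((endpoint, buffered_spans))  # slow: keep the details
    # fast: buffered spans are simply discarded, costing only memory briefly

finish_request("GET /profile", 300, ["db query 280ms"])  # slow for this endpoint
finish_request("GET /search", 300, ["cache hit"])         # fast for this endpoint
assert shipped == [("GET /profile", ["db query 280ms"])]
```

The per-endpoint table is just a dict lookup per request, which supports the
commenter's hunch that keeping one threshold per endpoint is not expensive.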

3. The per-endpoint statistics have to be kept for a significant period,
months not days, otherwise how would you see the change? I have had things
slow down due to a code change, due to a usage increase, due to a query plan
going whacky. You also get different performance at different levels of usage
(concurrency). Sure, no one can afford per-second resolution for a year, but
if my 9am spikes are averaged out, how would I know whether this is an old
problem or the spike actually worsened by 20% in the last two months? I think
you can get away with reducing the resolution a lot if you keep a histogram
and not just an average, but I don't think anyone optimizes for that. And you
also need to keep those important traces for quite a long time.
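
The histogram point is worth making concrete: bucket counts merge losslessly
under rollup, while averages smear spikes away. A minimal sketch with made-up
three-bucket histograms:

```python
# Roll per-second histograms up into a per-minute histogram by adding
# bucket counts. The spike's slow-bucket count survives; an average wouldn't.
def merge(histograms):
    merged = [0] * len(histograms[0])
    for h in histograms:
        merged = [a + b for a, b in zip(merged, h)]
    return merged

second_1 = [100, 0, 0]   # counts per latency bucket: fast / medium / slow
second_2 = [10, 5, 40]   # the spike second
per_minute = merge([second_1, second_2])
assert per_minute == [110, 5, 40]  # 40 slow requests still visible after rollup
```

Averaging the same two seconds would report one middling number and hide that
40 requests were slow, which is exactly the 9am-spike problem described above.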

4. I work in Java with unstructured logs and have been trying to figure out
how to parse them reliably for debugging, so that we don't spend hours
grepping. I realized recently that the most common query fields are easily
parsable in my case: user, server/app instance, and code line (Java loggers
print class:line). Those narrow things down so much that I can afford to just
export them and grep when I need to. But most log tools seem to be either
plain grep or super structured. Also, someone else mentioned cardinality:
while the rest can be fine, user is definitely high cardinality, so everything
might break down there.
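
For what it's worth, one regex with named groups usually covers extracting
exactly those fields. The line format below is a made-up example, not a
standard logger layout, so the pattern would need adjusting to the real one:

```python
import re

# Hypothetical Java log line with the fields the comment mentions:
# timestamp, level, class:line, and a user id embedded in the message.
line = "2020-10-01 09:14:02 INFO CheckoutService.java:142 user=alice payment declined"

pattern = re.compile(
    r"(?P<ts>\S+ \S+) (?P<level>\w+) (?P<code>[\w.]+:\d+) user=(?P<user>\w+) (?P<msg>.*)"
)
m = pattern.match(line)
assert m.group("code") == "CheckoutService.java:142"
assert m.group("user") == "alice"
assert m.group("msg") == "payment declined"
```

Exporting just `code` and `user` as indexed columns and leaving `msg` for grep
matches the "export those and grep the rest" strategy, though the `user`
column is where the high-cardinality cost lands.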

I think for my problems, at my scale (millions of users, not millions of
concurrent users), these are nicely solvable given the capabilities today's
tools have (even if they don't do it exactly like this). Or is it unscalable
at large, or just not needed if you do 'something' that I/we are not doing?

