A lot of words without any concrete proposals on how to solve the problem.
Telemetry is captured up front because the analysis can’t be done retrospectively otherwise. If you could solve that time-travel problem, people would capture less telemetry. I think the key is anomaly detection that captures the rare events, because 90% of telemetry is happy-path data that doesn’t really give you any extra insight. But doing that anomaly detection cheaply and correctly is extremely hard.
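To make the idea concrete, here is a minimal sketch of anomaly-driven capture (names and thresholds are made up, not from the article or the thread): keep everything that looks interesting (errors, slow requests) and only a small sample of the happy path.

```python
import random

# Hypothetical tail-based filter: keep all "interesting" telemetry
# (errors, slow requests), sample the happy path at a low rate.
HAPPY_PATH_SAMPLE_RATE = 0.01   # keep ~1% of routine events
LATENCY_THRESHOLD_MS = 500      # anything slower than this is worth keeping

def should_keep(event: dict) -> bool:
    """Decide whether a finished request/span is worth exporting."""
    if event.get("status", 200) >= 500:                       # server error: always keep
        return True
    if event.get("duration_ms", 0) > LATENCY_THRESHOLD_MS:    # slow request: always keep
        return True
    return random.random() < HAPPY_PATH_SAMPLE_RATE           # happy path: sample

# Usage: filter at the edge, before anything is shipped to the backend.
events = [
    {"status": 200, "duration_ms": 42},
    {"status": 503, "duration_ms": 12},
    {"status": 200, "duration_ms": 1800},
]
kept = [e for e in events if should_keep(e)]
```

The hard part the comment points at is not the filter itself but deciding the thresholds cheaply and correctly without throwing away the one happy-path event you later need.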
(author) I agree that solving both time travel and anomaly detection is very important and is, in my opinion, the solution to this problem, but I was trying to keep this post focused on the problem itself. I will be writing some follow-up posts on potential solutions. Thank you for reading and commenting!
>Engineers have to pre-define and send all telemetry data they might need – since it’s so difficult to make changes after the fact – regardless of the percentage chance of the actual need.
YES. Let them send all the data. The best place to solve for it is at Ingestion.
There are typically 5 different stages to this process.
Ingestion - Build pipelines that let you process this data and provide tools like streaming aggregation and cardinality controls, so you can 'process it' or act on anomalous patterns. This at least makes working with observability data 'dynamic' instead of always having to go change instrumentation (rough sketch after this comment).
Storage - Provide tiered data storage - blaze (2 hours), hot (1 month), cold (13 months) - with independent read paths.
This, in my opinion, has solved for the bulk of the cost & re-work challenges associated with telemetry data.
I believe Observability is the Big Data of today, without the Big Data tools! (Disclosure: I work at Last9.io and we have taken a similar approach to solve for these challenges.)
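Not how Last9 does it internally — just a minimal sketch, under assumed names, of the kind of ingestion-side cardinality control described above: cap the number of distinct label values per metric and fold the overflow into an `__other__` bucket, so one runaway label can't explode storage.

```python
from collections import defaultdict

# Hypothetical cardinality limiter at ingestion: once a metric has produced
# MAX_SERIES distinct label values, fold new values into a single "__other__"
# series instead of creating unbounded new time series.
MAX_SERIES = 1000

seen_values = defaultdict(set)     # metric name -> distinct label values seen
aggregates = defaultdict(float)    # (metric, label value) -> running sum

def ingest(metric: str, label_value: str, value: float) -> None:
    values = seen_values[metric]
    if label_value not in values and len(values) >= MAX_SERIES:
        label_value = "__other__"  # overflow bucket keeps cardinality bounded
    values.add(label_value)
    aggregates[(metric, label_value)] += value  # streaming aggregation

# Usage: points stream through ingest() before anything reaches storage.
ingest("http_requests_total", "/checkout", 1)
ingest("http_requests_total", "/cart", 1)
```

The point is that the cap and the aggregation live in the pipeline, so they can be changed dynamically without touching instrumentation in the services.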
People put too much junk in their logs. Most logs are irrelevant. Companies have built businesses selling logging solutions, and they sponsor developer conferences while hiding this simple truth.
My solution[1] to this problem is to do what they did in the Apollo Guidance Computer: log to a ring buffer and only flush it (to disk or wherever) on certain conditions.
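A minimal sketch of that idea in Python (the stdlib's `logging.handlers.MemoryHandler` is close, though it also flushes whenever the buffer fills rather than dropping the oldest records):

```python
import logging
from collections import deque

class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` records in memory; only write them out
    when a record at or above `flush_level` shows up (e.g. an ERROR)."""

    def __init__(self, target, capacity=500, flush_level=logging.ERROR):
        super().__init__()
        self.target = target                   # real handler (file, stream, ...)
        self.buffer = deque(maxlen=capacity)   # oldest entries silently drop off
        self.flush_level = flush_level

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.flush_level:  # "certain conditions" = an error
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()

# Usage: chatty DEBUG logging costs nothing until something actually goes
# wrong, at which point the error ships with the context that led up to it.
logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(RingBufferHandler(logging.FileHandler("app.log")))

logger.debug("routine detail, stays in the ring buffer")
logger.error("boom - this flushes the buffer, context included")
```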
stuff logs into s3. learn to mine them in parallel using lambda or ec2 spot. grow tech or teams as needed for scale. never egress data and never persist data outside of the cheapest s3 tiers. expire data on some sane schedule.
data processing is fun, interesting, and valuable. it is core to understanding your systems.
if you can’t do this well, there is probably a lot more you can’t do well either. in that case, life is going to be very expensive.
it’s ok to not do this well yet! spend some portion of your week doing this and you will improve quickly.
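A rough sketch of the mining half of that, assuming plain-text log objects under a made-up bucket/prefix; in real use you would fan the scanning out across lambda invocations or spot instances instead of local threads, and expiring "on some sane schedule" would be an S3 lifecycle rule on the same prefix.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: plain-text log objects under s3://my-log-bucket/logs/.
# Scan them in parallel for a pattern; nothing is egressed or copied elsewhere.
s3 = boto3.client("s3")
BUCKET, PREFIX, NEEDLE = "my-log-bucket", "logs/2024/", b"ERROR"

def scan_object(key: str) -> int:
    """Download one log object and count matching lines."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return sum(1 for line in body.splitlines() if NEEDLE in line)

# List everything under the prefix, then fan the scanning out across workers.
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

with ThreadPoolExecutor(max_workers=32) as pool:
    total = sum(pool.map(scan_object, keys))

print(f"{total} matching lines across {len(keys)} objects")
```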
There's some of that, but even rolling your own on OSS software, your infra costs (compute, storage, network) can start to balloon really rapidly.
TBF, I think this is more of a scale problem for medium- to larger-traffic companies than startups. I've seen it become especially acute when you're transitioning from a small or medium company, have started to hit those growth curves, and get socked with these surprise bills. Something that was a tiny line item suddenly becomes... bad.