A lot of words without any concrete proposals on how to solve the problem.
Telemetry is captured up front because the analysis can’t be done retrospectively otherwise. If you could solve that time-travel problem, people would capture less telemetry. I think the key is anomaly detection that captures the rare events, because 90% of telemetry is happy-path data that doesn’t really give you any extra insight. But doing that anomaly detection cheaply and correctly is extremely hard.
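To make the idea concrete, here is a minimal sketch of anomaly-driven capture (names and thresholds are made up, not from the article or the thread): keep everything that looks interesting (errors, slow requests) and only a small sample of the happy path.

```python
import random

# Hypothetical tail-based filter: keep all "interesting" telemetry
# (errors, slow requests), sample the happy path at a low rate.
HAPPY_PATH_SAMPLE_RATE = 0.01   # keep ~1% of routine events
LATENCY_THRESHOLD_MS = 500      # anything slower than this is worth keeping

def should_keep(event: dict) -> bool:
    """Decide whether a finished request/span is worth exporting."""
    if event.get("status", 200) >= 500:                       # server error: always keep
        return True
    if event.get("duration_ms", 0) > LATENCY_THRESHOLD_MS:    # slow request: always keep
        return True
    return random.random() < HAPPY_PATH_SAMPLE_RATE           # happy path: sample

# Usage: filter at the edge, before anything is shipped to the backend.
events = [
    {"status": 200, "duration_ms": 42},
    {"status": 503, "duration_ms": 12},
    {"status": 200, "duration_ms": 1800},
]
kept = [e for e in events if should_keep(e)]
```

The hard part the comment points at is not the filter itself but deciding the thresholds cheaply and correctly without throwing away the one happy-path event you later need.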
(author) I agree that solving both time travel and anomaly detection is very important and is, in my opinion, the solution to this problem, but I was trying to keep this post focused on the problem itself. I will be writing some follow-up posts on potential solutions. Thank you for reading and commenting!
>Engineers have to pre-define and send all telemetry data they might need – since it’s so difficult to make changes after the fact – regardless of the percentage chance of the actual need.
YES. Let them send all the data. The best place to solve for it is at Ingestion.
There are typically 5 different stages to this process.
Ingestion - Build pipelines that let you process this data and provide tools like streaming aggregation and cardinality controls, so you can 'process it' or act on anomalous patterns. This at least makes working with observability data 'dynamic' instead of always having to go change instrumentation (rough sketch after this comment).
Storage - Provide tiered data storage - blaze (2 hours), hot (1 month), cold (13 months) - with independent read paths.
This, in my opinion, has solved for the bulk of the cost & re-work challenges associated with telemetry data.
I believe Observability is the Big Data of today, without the Big Data tools! (Disclosure: I work at Last9.io and we have taken a similar approach to solve for these challenges.)
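Not how Last9 does it internally — just a minimal sketch, under assumed names, of the kind of ingestion-side cardinality control described above: cap the number of distinct label values per metric and fold the overflow into an `__other__` bucket, so one runaway label can't explode storage.

```python
from collections import defaultdict

# Hypothetical cardinality limiter at ingestion: once a metric has produced
# MAX_SERIES distinct label values, fold new values into a single "__other__"
# series instead of creating unbounded new time series.
MAX_SERIES = 1000

seen_values = defaultdict(set)     # metric name -> distinct label values seen
aggregates = defaultdict(float)    # (metric, label value) -> running sum

def ingest(metric: str, label_value: str, value: float) -> None:
    values = seen_values[metric]
    if label_value not in values and len(values) >= MAX_SERIES:
        label_value = "__other__"  # overflow bucket keeps cardinality bounded
    values.add(label_value)
    aggregates[(metric, label_value)] += value  # streaming aggregation

# Usage: points stream through ingest() before anything reaches storage.
ingest("http_requests_total", "/checkout", 1)
ingest("http_requests_total", "/cart", 1)
```

The point is that the cap and the aggregation live in the pipeline, so they can be changed dynamically without touching instrumentation in the services.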
People put too much junk in their logs. Most logs are irrelevant. Companies have built businesses selling logging solutions, and they sponsor developer conferences while hiding this simple truth.
My solution[1] to this problem is to do what they did in the Apollo Guidance Computer: log to a ring buffer and only flush it (to disk or wherever) on certain conditions.
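A minimal sketch of that idea in Python (the stdlib's `logging.handlers.MemoryHandler` is close, though it also flushes whenever the buffer fills rather than dropping the oldest records):

```python
import logging
from collections import deque

class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` records in memory; only write them out
    when a record at or above `flush_level` shows up (e.g. an ERROR)."""

    def __init__(self, target, capacity=500, flush_level=logging.ERROR):
        super().__init__()
        self.target = target                   # real handler (file, stream, ...)
        self.buffer = deque(maxlen=capacity)   # oldest entries silently drop off
        self.flush_level = flush_level

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.flush_level:  # "certain conditions" = an error
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()

# Usage: chatty DEBUG logging costs nothing until something actually goes
# wrong, at which point the error ships with the context that led up to it.
logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(RingBufferHandler(logging.FileHandler("app.log")))

logger.debug("routine detail, stays in the ring buffer")
logger.error("boom - this flushes the buffer, context included")
```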
stuff logs into s3. learn to mine them in parallel using lambda or ec2 spot. grow tech or teams as needed for scale. never egress data and never persist data outside of the cheapest s3 tiers. expire data on some sane schedule.
data processing is fun, interesting, and valuable. it is core to understanding your systems.
if you can’t do this well, there is probably a lot more you can’t do well either. in that case, life is going to be very expensive.
it’s ok to not do this well yet! spend some portion of your week doing this and you will improve quickly.
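A rough sketch of the mining half of that, assuming plain-text log objects under a made-up bucket/prefix; in real use you would fan the scanning out across lambda invocations or spot instances instead of local threads, and expiring "on some sane schedule" would be an S3 lifecycle rule on the same prefix.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: plain-text log objects under s3://my-log-bucket/logs/.
# Scan them in parallel for a pattern; nothing is egressed or copied elsewhere.
s3 = boto3.client("s3")
BUCKET, PREFIX, NEEDLE = "my-log-bucket", "logs/2024/", b"ERROR"

def scan_object(key: str) -> int:
    """Download one log object and count matching lines."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return sum(1 for line in body.splitlines() if NEEDLE in line)

# List everything under the prefix, then fan the scanning out across workers.
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

with ThreadPoolExecutor(max_workers=32) as pool:
    total = sum(pool.map(scan_object, keys))

print(f"{total} matching lines across {len(keys)} objects")
```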
There's some of that, but even rolling your own on OSS software, your infra costs (compute, storage, network) can start to balloon really rapidly.
TBF, I think this is more of a scale problem for medium- to larger-traffic companies than startups. I've seen it become especially acute when you're transitioning from a small or medium company, have started to hit those growth curves, and get socked with these surprise bills. Something that was a tiny line item suddenly becomes... bad.