Good article, thanks for sharing. I've been working on one part of this problem space for quite a while too. I want the ability to drill down directly into latency reasons and the wall-clock time of the underlying application threads, instead of having to correlate various system-wide utilization metrics and try to manually connect the dots.
I'm using eBPF-based dimensional data analysis, starting from the bottom (every system is a bunch of threads, including distributed systems) and moving up from there. This doesn't replace existing distributed tracing approaches for the end-to-end request view, but it gives you deep observability all the way down to each service's underlying threads' wall-clock time (where they're blocked or sleeping, why, etc).
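To make the idea concrete, here's a toy sketch of thread-state sampling. It polls /proc rather than using eBPF (so it's not how the xcapture collector actually works), and the sampling interval, sample count, and output format are just assumptions on my part, but it shows the kind of (comm, state, wait reason) dimensional samples I mean:

    #!/usr/bin/env python3
    # Toy illustration only: a /proc-based thread-state sampler, not the
    # eBPF xcapture collector. It samples every thread on the host and
    # groups the samples by (comm, state, wchan).
    import glob, time, collections

    INTERVAL = 1.0   # sampling interval in seconds (assumed)
    SAMPLES  = 10    # number of samples to take (assumed)

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return ""

    counts = collections.Counter()
    for _ in range(SAMPLES):
        for task in glob.glob("/proc/[0-9]*/task/[0-9]*"):
            stat = read(task + "/stat")
            if not stat:
                continue  # thread exited between listing and reading
            comm  = stat[stat.find("(") + 1:stat.rfind(")")]
            state = stat[stat.rfind(")") + 2:].split()[0]  # R, S, D, ...
            wchan = read(task + "/wchan") or "-"           # kernel wait location
            counts[(comm, state, wchan)] += 1
        time.sleep(INTERVAL)

    # Each sample approximates INTERVAL seconds of wall-clock time in that state.
    for (comm, state, wchan), n in counts.most_common(20):
        print(f"{n * INTERVAL:8.1f}s  {state}  {comm:<20} {wchan}")

Summing samples grouped by those dimensions gives a breakdown of where threads actually spend their wall-clock time, without any instrumentation in the application itself.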
At this year's P99CONF I will launch the first GA release of my (open source) 0x.tools xcapture eBPF collectors, along with a reference TUI tool (xtop) that demonstrates dimensional performance modeling on these new thread sampling signals.
After a decade building large-scale systems at Google, Datadog, and Meta, I’ve noticed the same pattern repeat: observability keeps getting louder, costlier, and less useful.
We’re drowning in telemetry but starving for insight. The industry incentives are misaligned: they reward ingestion and storage, not intelligence.
I recently started an open, collective movement called omji.ai to explore a fundamental shift: measuring insight per dollar of telemetry stored. We need to push vendors and internal teams toward intelligence, not ingestion.
I’m curious to hear from folks facing this pain - how do we fix it? We need practical, non-obvious ideas.
1. What technical or economic levers would actually shift the industry's focus from volume to intelligence?
2. Has anyone in a large organization tried benchmarking observability systems based on insight (e.g., MTTR impact) vs. telemetry cost?
3. How could open collaboration (tools, standards, benchmarks) make this practical for every engineering team?