For example, if I have a system that alternates between 100 requests/second (each served in 10 ms) for 1 second and 5 requests/second (each served in 200 ms) for 1 second, reporting the latency distribution per request would yield numbers like:
Average latency: 2000 ms / 105 ~ 19 ms;
90th percentile: 10 ms.
These metrics aren't wrong… but they're not very useful if you care about latency rather than throughput. From a latency standpoint, I'd rather know the distribution of latency in this sense: if I kick off a request at a uniformly random point in the next minute/hour, what latency can I expect on average, and what's the 90th percentile over that random start time? I'd find something like:
Average latency: .5 * 10 ms + .5 * 200 ms = 105 ms;
90th percentile: 200 ms.
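The two views can be sketched in a few lines of Python (a hypothetical reconstruction, using the 10 ms fast-phase and 200 ms slow-phase per-request latencies implied by the averages above):

```python
import statistics

# One "fast" second: 100 requests at 10 ms each.
# One "slow" second: 5 requests at 200 ms each.
latencies_ms = [10.0] * 100 + [200.0] * 5

# Per-request (event-based) view: every completed request counts once.
event_avg = statistics.mean(latencies_ms)                        # 2000 / 105 ~ 19 ms
event_p90 = sorted(latencies_ms)[int(0.9 * len(latencies_ms))]   # 10 ms

# Random-arrival (time-based) view: weight each phase by the fraction
# of wall-clock time a random start would land in it (50/50 here).
time_avg = 0.5 * 10.0 + 0.5 * 200.0                              # 105 ms
# Half of all arrival times land in the slow phase, so the 90th
# percentile over a random arrival time is the slow-phase latency.
time_p90 = 200.0
```

The event-based numbers are dominated by the fast phase simply because it produces more events; the time-based numbers weight each phase by how long it lasts.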
The difference matters most for percentiles, in my view. With event-based sampling, you can significantly improve percentiles by piling more work onto the cases that are already good (exactly what percentiles are meant to guard against): a system that serves 200 rps for 1 sec, then 5 rps for 1 sec, now has a 90th percentile latency of... 1000 ms / 200 = 5 ms! Even worse, we can also improve the 99th percentile simply by processing fewer events in the slow region.
In the extreme case, the system is so completely useless during its "slow" phase that it processes 0 events… and the information we gather now represents only the fast phase, even when the system is in slow mode 99% of the time.
TL;DR: Pay attention to the X axis in distribution data! Is it actually what you want to sample from to characterise your system?