DDSketch: A fast, fully-mergeable quantile sketch with relative-error guarantees (arxiv.org)



Author here. We wanted to be able to graph p99, p99.9 metrics with arbitrary ranges, and found the existing solutions were not accurate enough for our needs. Happy to answer any questions.

Code here:

https://github.com/DataDog/sketches-go

https://github.com/DataDog/sketches-py

https://github.com/DataDog/sketches-java


Nice work! Averaging percentiles is well known to give terrible results. Glad to see more people taking this problem seriously and providing viable alternatives!

A note on accuracy: at Circonus, we have been using a version of HDR-Histograms [1] for many years to aggregate latency distributions and calculate accurate aggregated percentiles. Accuracy has never been a problem (worst-case error <5%, usually _much_ better).

If I read your evaluation results correctly, you also found HDR-Histograms to be as accurate as or more accurate than DDSketch, correct?

The differentiator from HDR Histograms seems to be merging speed and size, where DDSketch seems to have an edge.

One thing that is not immediately clear to me from the paper is how much of the distribution function can be reconstructed from the sketch. E.g., for SLO calculations one is often interested in latency bands: "How many requests were faster than 100ms?" [2]

Is it possible to approximate CDF values ("lower counts") from the sketch with low/bounded error?

[1] https://github.com/circonus-labs/libcircllhist

[2] http://heinrichhartmann.com/pdf/Heinrich%20Hartmann%20-%20La...


HDR is great for use cases where you can bound your range beforehand and merging is not a requirement, but those two limitations are exactly why we needed to develop DDSketch.


Sorry, but I don't follow this argument:

(1) HDR-Histogram merges are 100% accurate and very fast (a few microseconds).

(2) The range of HDR-Histogram is bounded in the same way that floating-point numbers are bounded. Hence the name "High Dynamic Range":

> For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000,000 while maintaining a value precision of 3 significant digits across that range.

http://hdrhistogram.github.io/HdrHistogram/

Circllhist offers a default range of 10^-128 .. 10^+127, which has been more than ample for all use cases I have seen.

https://github.com/circonus-labs/libcircllhist/blob/master/s...


Yes, the sketch can quickly answer questions like "How many requests were faster than 100ms?" 100ms maps to a bucket index, and we would just sum up the counts of all buckets with indices at most that one.
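
Roughly, in Python (using a sparse dict of bucket counts purely for illustration; the helper names and gamma value are mine, not the actual DataDog code):

    import math

    # Minimal model: a log-bucketed sketch stored sparsely as
    # {bucket_index: count}, where a value v > 0 lands in bucket
    # ceil(log(v) / log(gamma)).
    def bucket_index(value, gamma):
        return math.ceil(math.log(value) / math.log(gamma))

    def count_at_most(buckets, threshold, gamma):
        """Approximate count of recorded values <= threshold (a 'lower count')."""
        cutoff = bucket_index(threshold, gamma)
        return sum(count for index, count in buckets.items() if index <= cutoff)

    # e.g. "how many requests were faster than 100 (ms)?"
    buckets = {bucket_index(v, 1.02): 1 for v in (12.0, 80.0, 250.0)}
    print(count_at_most(buckets, 100.0, 1.02))  # -> 2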


FYI, I proposed exactly the same idea as an improvement to HdrHistogram back in 2015. See https://github.com/HdrHistogram/HdrHistogram/issues/54 and the corresponding code https://github.com/oertl/HdrHistogram/blob/memory_efficiency.... Unfortunately, the author of HdrHistogram did not pick up my proposal, as that would have led to major changes and compatibility problems. The mapping to histogram bins is the same as in your code https://github.com/DataDog/sketches-java/blob/master/src/mai.... I am curious, were you inspired by that?


This sucks, but a problem we have in academia is that you can't cite a closed PR at a top-tier conference like VLDB. If you had written up your proposal as a citable document, you could rightly claim the scholarly recognition you deserve. Morally, you deserve recognition (which you are getting here, at least). It's decent of the author to admit he was inspired by you; there are others who would find it easier to deny it. But there is no recourse if it's only a PR.

I did see that thread and that's what inspired the mapping with the quadratic interpolation.

To clarify though, DDSketch as defined in the paper uses the logarithmic mapping, as our main goal was to make the memory footprint as small as possible. See: https://github.com/DataDog/sketches-java/blob/1650d939f1485f...

The Java implementation abstracts out the index mapping. This lets us add alternative ways to map values to indices, and we added the quadratic-interpolation mapping because it seems to be a good tradeoff between index computation speed and memory footprint.
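
Roughly, the logarithmic mapping looks like this in Python (alpha, gamma = (1 + alpha) / (1 - alpha), and the representative value 2 * gamma^i / (gamma + 1) follow the paper; treat it as an illustrative sketch, not the exact implementation):

    import math

    # Bucket i covers (gamma**(i-1), gamma**i]; every value in that interval
    # is within relative error alpha of the representative returned by value().
    class LogarithmicMapping:
        def __init__(self, alpha):
            self.gamma = (1 + alpha) / (1 - alpha)
            self._log_gamma = math.log(self.gamma)

        def index(self, value):
            return math.ceil(math.log(value) / self._log_gamma)

        def value(self, index):
            return 2 * self.gamma ** index / (self.gamma + 1)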


Thanks for publishing this. I admit I have only skimmed the paper and plan to look at it more closely later today. My first critical thought was "I wonder why section 4 doesn't make comparisons to t-digest?" I think of t-digest as the most common modern mergeable streaming quantile algorithm in practice. Why didn't you include it in your comparisons?

Either way - looks very promising, I am excited to take a closer look and possibly use this. Thanks!


T-digest is definitely the best rank-accuracy sketch for the higher percentiles in terms of size. However, it was much slower than GK, and we found that by increasing the rank accuracy of GK (taking the penalty of a larger sketch size), the results were not too different. Both, of course, had the issues inherent in all rank-accurate sketches at the higher percentiles over long-tailed data.


No background in this; my layman interpretation is: track a histogram with log-scaled bins. These are easy to increment and only take up memory proportional to the number of bins. Two histograms from different servers can easily be merged together since they use the same bins.
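
(In toy Python, that interpretation might look like the following, assuming both servers share the same base so the bins line up; purely illustrative, not the paper's exact algorithm.)

    import math
    from collections import Counter

    # Toy log-bucketed histogram: sparse Counter of bins, trivially mergeable.
    class ToyLogHistogram:
        def __init__(self, base=1.02):
            self.base = base
            self.bins = Counter()

        def add(self, value):
            self.bins[math.ceil(math.log(value, self.base))] += 1

        def merge(self, other):
            # Bins line up because both sketches share the same base.
            self.bins.update(other.bins)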

But then I got completely lost in the math that shows why this guarantees anything. What's actually preventing the scenario where everything falls into the same bin or just a few bins (as happens in the real world, where all your data is of a relatively similar order of magnitude)? For example, what would happen if my data were random from 1 to 10? And on a different server, random from 1000 to 1010?


> number of bins in terms of memory

Less than that, with the appropriate compression scheme (sparse encoding).

> But then I got completely lost in the math to show why this guarantees anything.

You get guarantees on _relative_ error:

- 100 to 110 is a 10% error

- 1000 to 1010 is a 1% error

Depending on your logarithmic base, the error is bounded by a certain relative percentage.
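
Back-of-the-envelope, for base gamma (assuming each bin reports its best representative value; illustrative only):

    # Worst-case relative error for a log-bucketed histogram with base gamma.
    for gamma in (1.02, 1.05, 2.0):
        print(gamma, (gamma - 1) / (gamma + 1))  # ~1%, ~2.4%, ~33%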


That makes a lot of sense, thank you! I was stuck on the visualization analogy, where choosing the wrong axes for a histogram makes it useless (e.g., when all the data falls into the same bin, you can't see the shape of the distribution). But for answering quantile queries, this doesn't matter. Even if P1 and P99 are in the same bin, you still know what their values are up to a multiplicative constant. Fantastic!


Is it fair to say that if you know the min and max values for your dataset, DDSketch (fast) is strictly better than HDR Histogram for applications where avoiding runtime memory allocations and add performance are critical?


In theory yes, but none of our implementations specifically prevent reallocations, as we were not targeting that use case. It should be easy enough to add, though.
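
For example (just a sketch of how it could be added, not what the published implementations do): if min and max are known up front, the number of buckets is fixed, so a flat array can be allocated once and the add path never allocates.

    import math

    # Hypothetical fixed-size variant: with known min/max and base gamma, the
    # bucket array size is ceil(log(max, gamma)) - ceil(log(min, gamma)) + 1,
    # so it can be preallocated and add() stays allocation-free.
    class PreallocatedLogHistogram:
        def __init__(self, min_value, max_value, gamma=1.02):
            self.gamma = gamma
            self.offset = math.ceil(math.log(min_value, gamma))
            size = math.ceil(math.log(max_value, gamma)) - self.offset + 1
            self.counts = [0] * size  # allocated once

        def add(self, value):
            self.counts[math.ceil(math.log(value, self.gamma)) - self.offset] += 1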


Minor nit from the paper: the range for Moments is listed as bounded due to overflow risk when using floats.

This may be true for the provided implementation, but it's not inherent to the algorithm. The calculation of the higher-order moments would normally be done entirely in the log domain, where there's little risk of overflow.
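
For example (a rough sketch of the idea, not the Moments implementation itself): the k-th power sums can be accumulated in the log domain with a running log-add-exp, so x**k is never materialized as a float.

    import math

    def log_add_exp(a, b):
        """log(exp(a) + exp(b)) without overflow."""
        hi, lo = (a, b) if a >= b else (b, a)
        return hi + math.log1p(math.exp(lo - hi))

    def log_power_sums(values, max_k):
        """Return [log(sum(x**k)) for k = 1..max_k]; assumes positive values."""
        sums = [-math.inf] * max_k
        for x in values:
            log_x = math.log(x)
            for k in range(1, max_k + 1):
                sums[k - 1] = log_add_exp(sums[k - 1], k * log_x)
        return sums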


I've relied on HDR histograms for that. It looks like your sketches are a bit more performant in the paper, but what was the accuracy issue that histograms couldn't address?


Hey, congratulations, this is a really cool algorithm! Thanks for sharing it.

I'm interested in this paper because I worked on a somewhat related problem some time ago, but got stuck on how to handle data that morphs into a multimodal distribution. Modes that are close together are no big deal, but modes that are spaced more exponentially apart are tricky to deal with. For an example of something that would be in DataDog's purview, consider sketching the histogram of response times from an endpoint that sometimes takes a "fast" path (e.g. a request for a query whose result was cached), sometimes a "normal" path, and sometimes a "slow" path (e.g. a query with a flag requesting additional details to be computed). If the response times from the slow path are much bigger than the others, e.g. by an order of magnitude, their statistics might essentially drown out the data from the other two paths, since you're using them to calculate bin size.

I noticed you had some results from measuring DDSketch's performance on a multimodal distribution that looked pretty good (the "power" distribution on the last page). I was wondering whether you have done any more investigation in this area, e.g. how messy/mixed the data can get before the sketch starts to break down?



Ah, there is a short comparison in "related work":

> The problems of having high relative errors on the larger quantiles has been addressed by a line of work that still uses rank error, but promises lower rank error on the quantiles further away from the median by biasing the data it keeps towards the higher (and lower) quantiles [7], [8], [17]. The latter, dubbed t-digest, is notable as it is one of the quantile sketch implementations used by Elasticsearch [18]. These sketches have much better accuracy (in rank) than uniform-rank-error sketches on percentiles like the p99.9, but they still have high relative error on heavy-tailed data sets. Like GK they are only one-way mergeable.


Great! So is the main difference more accuracy on average, or the fact that the maximum possible error is bounded?


Most previous sketches used what they called "rank accuracy". So if your inputs were [2^1, 2^2, ..., 2^1000], the actual p95 is 2^950, and a 0.01 rank-accurate sketch would be allowed to give you any value between 2^940 and 2^960 (which can be off by up to a factor of 1024).

A 0.01 relative-accurate sketch has to give you a value within 1% (a factor of 1.01) of the true p95.

The above example is contrived, but with real web request data there is often a very long tail, and rank-accurate sketches quickly start giving values far from the actual percentiles.

So to more directly answer your question: it's that the maximum possible error is bounded.
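
To put numbers on the contrived example (illustrative Python; arbitrary-precision ints keep 2^1000 exact):

    # Inputs 2^1 .. 2^1000; the true p95 sits at rank 950.
    values = [2 ** i for i in range(1, 1001)]
    true_p95 = values[949]               # 2**950

    # 0.01 rank accuracy allows any answer within +/- 10 ranks of the truth:
    print(values[959] / true_p95)        # 2**960 / 2**950 = 1024.0

    # 0.01 relative accuracy forces the answer into [0.99, 1.01] * true_p95.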



