
Perks – Effectively compute quantiles for unbounded data streams - bmizerany
https://groups.google.com/forum/?pli=1#!topic/golang-nuts/pP602bLRmhA
======
avibryant
People interested in this might also want to check out simmer, which provides
a simple unix stdin/stdout interface for this class of bounded summaries of
unbounded streams. Quantiles are actually a weak point for simmer right now (I
have a TODO item to add Q-Digest), but it has a bunch of other useful sketches
implemented, via the Algebird library which I also hack on. Would love to get
feedback & patches. <https://github.com/avibryant/simmer>

------
kingfishr
Hey, this is very close to something I've been needing recently (and in Go, no
less).

Is there any way to get a similar thing for a sliding window of a stream? For
example, to be able to report (estimated) 90th percentile latencies for server
requests in the last 5 minutes, hour, and day.

~~~
bmizerany
Yes, I use it for exactly that: Query, then Reset every 5 minutes.

~~~
kingfishr
Oh, that'll be great! That's not _quite_ the same as a sliding window of last
5 minutes, but it'll definitely work for my use case.

(Intuitively, a sliding window seems very hard/impossible -- how do you
discard old events without keeping a complete record?)

~~~
teraflop
There are other algorithms that handle the sliding-window case, e.g.
<http://research.microsoft.com/pubs/77611/quantiles.pdf>

But for all but the most extreme cases, it's sufficient to just keep all the
values in memory until they fall out of your window. Even at 1,000
requests/second, a 5-minute window is still only 300,000 values to store.

~~~
bmizerany
There may be a way to implement this in perks/quantile by adding another
piece of metadata for the timestamp. There's a space cost to that, but it may
be acceptable, or could be made opt-in. Maybe I'll look into this soon.

Edit:

I do agree that if you're working with datasets that fit in memory, you're
probably better off keeping all the samples to find your percentile and not
using this package. In fact, perks will not compress for datasets under 500
values.

------
azmenthe
This is a really awesome problem! I tackled this for my work at TempoDB and
ended up going with the Q-Digest algorithm, although I took a good look at
CKMS. Really cool to see this implements merging streams; I remember reading
that merging streams is more difficult with CKMS than with Q-Digest.

If anyone is interested, this was my write-up on algorithm selection:
<http://blog.tempo-db.com/post/42318820124/estimating-percentiles-on-streams-of-data>

~~~
bmizerany
Great post.

Distributing the computation was quite easy. I emailed the authors of the
paper and they gave me a quick answer, which is what I implemented and tested.

I'm interested in adding more algorithms for this problem to perks, like
Q-Digest, along with solutions to other streaming-data problems.

~~~
azmenthe
Thanks!

I use this library for Q-Digest; it's worth taking a look at as an
implementation reference. <https://github.com/clearspring/stream-lib>

------
vosper
I don't really understand what this does, I didn't find a "for dummies"
section on the site - can anyone give a real-world use case and (ideally) a
comparison with some other system that does the same thing?

~~~
bmizerany
azmenthe's article covers the problem well:
<http://blog.tempo-db.com/post/42318820124/estimating-percentiles-on-streams-of-data>

Basically, when you have more data than you have the memory (or time) to sort
in order to find the percentile you're looking for, you need an algorithm that
trades rank-selection accuracy for lower memory and CPU costs. This package
does that.

