
Optimal streaming histograms - sskates
http://blog.amplitude.com/2014/08/06/optimal-streaming-histograms/
======
StefanKarpinski
Online (i.e. single pass) estimation of quantiles is a well-studied subject –
doing some literature research would have gone a long way here. Back in 2004,
this was the best paper I found:

[http://ftp10.us.freebsd.org/users/azhang/disc/SIGMOD/pdf-files/058/303-space.pdf](http://ftp10.us.freebsd.org/users/azhang/disc/SIGMOD/pdf-files/058/303-space.pdf)

The technique is efficient, straightforward and has provable error bounds. It
was published in 2001, so it would also be worth looking at subsequent papers
that have cited it for further improvements:

[http://scholar.google.com/scholar?cites=6184540005789557130&...](http://scholar.google.com/scholar?cites=6184540005789557130&as_sdt=5,33&sciodt=0,33&hl=en)

~~~
infinite8s
I've found that people in industry are often unaware of recent discoveries in
academia (even if they themselves were once academics). Unfortunately most of
the top journals of interest to data scientists are locked behind a paywall,
but sometimes Google Scholar can find a free copy scraped from some
academic's personal site (the "All N versions" link is usually helpful, unless
the paper was published in an ACM or IEEE journal; those organizations seem to
hunt down carelessly posted papers on people's personal sites).

------
alceufc
Log-binning can be useful. However, it has some disadvantages.

In your case the data (server response time?) probably looks good because it
follows a log-logistic or log-normal distribution.

Suppose instead that the values were exponentially distributed, which is also
a reasonable hypothesis for response times. In that case the log-binned
histogram would look like a plateau, except for the first and last bins. In
that scenario a linear-binning approach would probably be better.

Unfortunately, there is no bucketing approach that is good for all
situations. The best approach usually depends on your data and on what you
are trying to analyze.

------
noelwelsh
Hmmmm... my gut reaction is to consider reframing the question.

Do you really need to know the shape of the distribution, or is it sufficient
to know, say, various percentiles (e.g. 50%, 75%, 99%, 99.9%)? If you're
looking at metrics like response time, the 99th percentile is usually all you
care about. If so, there are efficient algorithms that estimate these values
from streaming data in O(1) space, with high-probability bounds on the error.
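For a flavor of how cheap this can get, here is a minimal sketch of the
Frugal-1U streaming quantile estimator (Ma, Muthukrishnan & Sandler), one of
the simplest O(1)-space approaches. It is not one of the provably-bounded
algorithms alluded to above, and the class name, seeds, and fake latencies
are mine for illustration:

```python
import random

class FrugalQuantile:
    """Single-value streaming quantile estimate (Frugal-1U).

    On each observation the estimate is nudged up with probability q
    when the observation is above it, and down with probability 1 - q
    when below. At equilibrium, upward and downward nudges balance
    exactly when the estimate sits at the q-th quantile.
    """
    def __init__(self, q, seed=None):
        self.q = q                     # target quantile, e.g. 0.99
        self.estimate = 0.0
        self._rng = random.Random(seed)

    def update(self, x):
        if x > self.estimate and self._rng.random() < self.q:
            self.estimate += 1.0
        elif x < self.estimate and self._rng.random() < 1.0 - self.q:
            self.estimate -= 1.0
        return self.estimate

# Track the running median of a stream of fake latencies.
rng = random.Random(7)
median = FrugalQuantile(q=0.5, seed=1)
for _ in range(100_000):
    median.update(rng.uniform(0.0, 200.0))  # true median is 100
```

The unit step size makes convergence slow; later variants (Frugal-2U) adapt
the step, and the papers above give estimators with real error guarantees.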

If you just need counts indexed by some key, then the count-min sketch might
be sufficient (e.g.
[http://www.auai.org/uai2012/papers/231.pdf](http://www.auai.org/uai2012/papers/231.pdf))
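As a concrete illustration of the count-min idea (a plain sketch, not the
improved estimator from the linked paper; the hash choice and defaults are
mine):

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counts in fixed space.

    depth hash rows of width counters each; a point query takes the
    minimum over rows, which can only overestimate the true count,
    and does so by a probabilistically bounded amount.
    """
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Independent-ish hash per row by salting with the row number.
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))
```

With 4 rows of 1024 counters this uses constant space no matter how many
distinct keys flow through, at the price of occasional overestimates.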

Finally, if you really do need to know the shape of the distribution should
you be considering kernel density estimation (KDE) instead of histograms (see
[http://en.wikipedia.org/wiki/Kernel_density_estimation](http://en.wikipedia.org/wiki/Kernel_density_estimation))?
Most people would argue you should. Looks like there are streaming KDE
algorithms, though I don't know how practical they are.
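For reference, the batch version of what KDE computes is tiny: each sample
contributes a Gaussian bump, and the density is their average. A sketch
(streaming variants have to bound the number of kernels kept, e.g. by
merging nearby ones):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density estimate: the average of one Gaussian bump
    (standard deviation = bandwidth) centered on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# Density of three points, smoothed with bandwidth 0.5.
f = gaussian_kde([0.0, 1.0, 2.0], bandwidth=0.5)
```

Unlike a histogram, the result is smooth and doesn't depend on where you
happened to place bin edges; the bandwidth plays the role the bin width does.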

------
calebreach
You might want to consider using kernel density plots rather than histograms,
as histograms exhibit aliasing artifacts.

In general, binning/bucketing can be seen as filtering the empirical density
function of your dataset with a box filter and then sampling it. The frequency
response of a box filter is the sinc function, which has a lot of energy above
the Nyquist frequency of this sampling rate. Kernel density plots with a
Gaussian kernel, on the other hand, can be seen as filtering the empirical
density function with a Gaussian filter and are thus approximately bandlimited.

------
CraigJPerry
Crap. I wish I hadn't read the HN comments page now. I read the first sentence
of the article and immediately thought "you could bin on a log scale".

To be clear, this isn't an area I'd consider myself all that knowledgeable in,
so coming up with a useful idea felt like an achievement.

As I read through, comparing each proposed solution to mine, I decided I
quite liked mine. Then I saw that the last proposal was my first idea. Great!

So over I wander to the HN comments and see that this is a common approach. I
didn't just come up with a nice neat solution; I probably learned it, forgot
it, and regurgitated it from my subconscious.

That sure doesn't feel as nice as the sparkly feeling I had just 3 minutes
ago...

~~~
sskates
That's still impressive! It took you two seconds, while it took us three code
iterations to get it right. It seems so obvious in retrospect, but in the
moment it can be hard to use the full range of your intuition about math. That
knowledge could probably have saved us a day or two of coding.

OTOH it's extra difficult because when we did the first implementation we
didn't have a full understanding of the shape of the problem.

------
kiyoto
The optimal number of bins has been studied extensively in the statistics
literature. The Freedman-Diaconis rule is a good place to start:
[http://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule](http://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule)
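For reference, the rule itself is one line, bin width = 2·IQR/n^(1/3), so a
batch sketch looks like this (the interpolated quantile helper is mine, since
the exact quantile convention varies):

```python
def fd_bin_width(samples):
    """Freedman-Diaconis bin width: 2 * IQR / n**(1/3)."""
    xs = sorted(samples)
    n = len(xs)

    def quantile(p):
        # Linear interpolation between adjacent order statistics.
        i = p * (n - 1)
        lo = int(i)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (i - lo) * (xs[hi] - xs[lo])

    iqr = quantile(0.75) - quantile(0.25)
    return 2.0 * iqr / n ** (1.0 / 3.0)
```

The catch for streaming data, as noted below, is that the IQR itself then has
to be estimated online.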

That said, the challenge with online (streaming) data is that you have to
constantly estimate the distribution you are sampling from. Someone here might
know better =)

~~~
viggity
I worked in R&D for an extremely large ag company. We found that 0.5 or 1
standard-deviation bucket sizes seemed to work pretty well.

------
msmillie
[http://dtrace.org/blogs/bmc/2011/02/08/llquantize/](http://dtrace.org/blogs/bmc/2011/02/08/llquantize/)
describes a very similar log binning technique added to dtrace a few years
back. In practice, it's been one of the most useful things in the toolkit.
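The core trick is easy to sketch: find the value's order of magnitude, then
subdivide that magnitude linearly. The following is a simplification of
dtrace's llquantize, not its exact bucket layout, and the parameter names are
mine:

```python
def llq_bucket(value, factor=10, low=0, high=6, nsteps=10):
    """Lower bound of the log-linear bucket containing `value`:
    each power of `factor` between factor**low and factor**high is
    split into `nsteps` equal linear steps."""
    if value < factor ** low:
        return 0                      # underflow bucket
    if value >= factor ** high:
        return factor ** high         # overflow bucket
    mag = factor ** low
    while value >= mag * factor:      # find the order of magnitude
        mag *= factor
    step = mag * factor / nsteps      # linear step size within it
    return int(value // step * step)
```

So with the defaults, 7 lands in bucket 7, 42 in bucket 40, and 290 in bucket
200: constant relative resolution across many orders of magnitude.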

------
jdmichal
Technique #4 would be extremely interesting if a user could click into a
bucket and drop to the appropriate resolution level. Using their example, the
user could click on the big yellow bar at 290 and see the .1-increment buckets
for a 10-sized distribution.

------
aroden
This is strikingly similar to the problem of generating a close-enough
dendrogram over large data sets very quickly. Back in ~2008 I did some
undergrad research on the topic to speed up DNA analysis. Basically it solves
your grouping problem gracefully by staying within an error level at N
levels. Since you are aiming for N buckets, this would solve your problem
well. There was a very good review paper on the subject, but it's not on the
first page of Google and my memory fails me completely on the author (also
I'm at work).

It might not meet your memory or computation constraints, but grouping is
such a wide topic.

------
infogulch
Could you not keep statistics on the values that went into each bucket, and
use those to redistribute the contents of the bucket when the buckets needed
to be resized?

Sure, it's an estimate so you'd have some error term but you could pass that
on to the new buckets. If the total error is small enough in practice it
wouldn't be a problem.

~~~
sskates
One of the problems we had to solve is compressing the data set so you don't
need to keep too many bins around. To get one more significant figure, you
need to keep 10x the number of bins. Once you have those individual bins you
can always merge them in the visualization, the problem is defining them in a
way that covers the distribution and doesn't take up too much space.

------
paladin8
Hive seems to use something similar to the 2nd algorithm (merging close bins
together):

[https://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/...](https://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java)
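The underlying algorithm (Ben-Haim & Tom-Jacobs' streaming decision-tree
histogram, which NumericHistogram appears to be based on) is pleasantly
small: keep a capped list of (centroid, count) bins and merge the two closest
whenever the cap is exceeded. A rough sketch, not Hive's exact code:

```python
import bisect

class StreamingHistogram:
    """Fixed-memory histogram: at most `max_bins` (centroid, count)
    pairs, kept sorted by centroid. When a new value pushes the bin
    count over the cap, the two closest bins are merged into their
    weighted average."""
    def __init__(self, max_bins=20):
        self.max_bins = max_bins
        self.bins = []                      # [[centroid, count], ...]

    def add(self, x):
        i = bisect.bisect_left([c for c, _ in self.bins], x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [float(x), 1])
            if len(self.bins) > self.max_bins:
                self._merge_closest()

    def _merge_closest(self):
        # Adjacent pair with the smallest gap between centroids.
        i = min(range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
        (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        self.bins[i:i + 2] = [merged]
```

Memory stays bounded no matter how much data streams through, at the cost of
bin edges that depend on arrival order.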

------
avibryant
There are some nice, very efficient implementations of their solution #4 at
[http://hdrhistogram.github.io/HdrHistogram/](http://hdrhistogram.github.io/HdrHistogram/)

------
scottlocklin
A cute one I have been meaning to try: using a Haar wavelet as the histogram,
with online updates:
[http://theory.stanford.edu/~matias/papers/wavelets-vldb00.pdf](http://theory.stanford.edu/~matias/papers/wavelets-vldb00.pdf)

------
TheLoneWolfling
I wonder if a power of two would be better?
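If that means base-2 log binning: one attraction is that the bucket falls
straight out of the float's binary exponent, with no log call or loop. A tiny
sketch (the function name is mine):

```python
import math

def pow2_bucket(x):
    """Lower bound of the power-of-two bucket containing positive x."""
    m, e = math.frexp(x)       # x == m * 2**e, with 0.5 <= m < 1
    return 2.0 ** (e - 1)
```

The trade-off is coarser resolution than base-10-with-substeps schemes like
llquantize, unless each octave is also subdivided linearly.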

