Optimal streaming histograms (amplitude.com)
52 points by sskates on Aug 6, 2014 | 19 comments



Online (i.e. single pass) estimation of quantiles is a well-studied subject – doing some literature research would have gone a long way here. Back in 2004, this was the best paper I found:

http://ftp10.us.freebsd.org/users/azhang/disc/SIGMOD/pdf-fil...

The technique is efficient, straightforward and has provable error bounds. It was published in 2001, so it would also be worth looking at subsequent papers that have cited it for further improvements:

http://scholar.google.com/scholar?cites=6184540005789557130&...


I've found that people in industry are often unaware of recent discoveries in academia (even if they themselves were once academics). Unfortunately, most of the top journals of interest to data scientists are locked behind a paywall, but sometimes Google Scholar is able to find a free copy scraped from some academic's personal site (the "All N versions" link is usually helpful unless the paper was published in an ACM or IEEE journal - those guys seem to hunt down carelessly posted papers on people's private sites).


Log-binning can be useful. However, it has some disadvantages.

I think that in your case your data (server response time?) looks good under log-binning because you probably have a log-logistic or log-normal distribution.

Suppose instead you were working with values that are exponentially distributed, which is also a reasonable hypothesis for your data. In that case the log-binned histogram would look like a plateau, with the exception of the beginning and ending bins. In this scenario a linear-binning approach would probably be better.
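For what it's worth, here's a quick way to eyeball the difference on synthetic data (a throwaway sketch assuming numpy; the exponential scale and bin counts are made up, not from the article):

    import numpy as np

    # Fake "response times": exponentially distributed, mean 100ms
    rng = np.random.default_rng(0)
    data = rng.exponential(scale=100.0, size=100_000)

    # Same sample, two binning schemes
    lin_edges = np.linspace(data.min(), data.max(), 30)
    log_edges = np.logspace(np.log10(data.min()), np.log10(data.max()), 30)

    lin_counts, _ = np.histogram(data, bins=lin_edges)
    log_counts, _ = np.histogram(data, bins=log_edges)

    print("linear bins:", lin_counts)
    print("log bins:   ", log_counts)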

Unfortunately, I think that there is no approach for bucketing that is good for all situations. Usually the best approach will depend on your data and also on what you are trying to analyze.


Hmmmm... my gut reaction is to consider reframing the question.

Do you really need to know the shape of the distribution, or is it sufficient to know, say, various percentiles (e.g. 50%, 75%, 99%, 99.9%)? If you're looking at metrics like response time, the 99th percentile is usually all you care about. If this is the case, there are efficient algorithms to estimate these values from streaming data using O(1) space, with high-probability bounds on the error.

If you just need counts indexed by some key, then the count-min sketch might be sufficient (e.g. http://www.auai.org/uai2012/papers/231.pdf)
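For reference, a count-min sketch is only a handful of lines. A minimal, illustrative Python version (the width/depth and the salted built-in hash are arbitrary choices for the sketch, not from that paper):

    import random

    class CountMinSketch:
        """Approximate counter: estimates never under-count, may over-count."""
        def __init__(self, width=2048, depth=5, seed=42):
            self.width, self.depth = width, depth
            rnd = random.Random(seed)
            # One salt per row so the rows hash (roughly) independently
            self.salts = [rnd.getrandbits(64) for _ in range(depth)]
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, row, key):
            return hash((self.salts[row], key)) % self.width

        def add(self, key, count=1):
            for row in range(self.depth):
                self.table[row][self._index(row, key)] += count

        def estimate(self, key):
            # The smallest cell is the least-contaminated estimate
            return min(self.table[row][self._index(row, key)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    for k in ["a", "a", "b"]:
        cms.add(k)
    print(cms.estimate("a"), cms.estimate("b"))  # 2 1 (up to collisions)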

Finally, if you really do need to know the shape of the distribution, should you be considering kernel density estimation (KDE) instead of histograms (see http://en.wikipedia.org/wiki/Kernel_density_estimation)? Most people would argue you should. It looks like there are streaming KDE algorithms, though I don't know how practical they are.


You might want to consider using kernel density plots rather than histograms, as histograms exhibit aliasing artifacts.

In general, binning/bucketing can be seen as filtering the empirical density function of your dataset with a box filter and then sampling. The frequency response of a box filter is the sinc function, which has a lot of energy above the Nyquist frequency of that sampling rate. Kernel density plots with a Gaussian kernel, on the other hand, can be seen as filtering the empirical density function with a Gaussian filter and are thus approximately band-limited.
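To make that concrete, here's a rough side-by-side of the two estimators on the same sample (a sketch assuming numpy and scipy; the log-normal sample and the bin count are placeholders):

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)
    sample = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)

    # Histogram: box-filtered density sampled at the bin centers
    counts, edges = np.histogram(sample, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Gaussian KDE: smooth estimate of the same density (Scott's rule bandwidth by default)
    kde = gaussian_kde(sample)
    smooth = kde(centers)

    for c, h, k in zip(centers[:5], counts[:5], smooth[:5]):
        print(f"{c:8.1f}  hist={h:.5f}  kde={k:.5f}")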


Crap. I wish I hadn't read the HN comments page now. I read the first sentence of the article and immediately thought "you could bin on a log scale".

To be clear, this isn't an area where I'd consider myself all that knowledgeable, so coming up with a useful idea was an achievement.

So as I read through, comparing each proposed solution to mine, I decided I quite liked mine. Then I saw that the last proposal was my first idea. Great!

Then I wandered over to the HN comments and saw that this is a common approach: I didn't just come up with a nice, neat solution; I probably learned it, forgot it, and regurgitated it from my subconscious.

That sure doesn't feel as nice as the sparkly feeling I had just 3 minutes ago...


That's still impressive! It took you two seconds, while it took us three code iterations to get it right. It seems so obvious in retrospect, but in the moment it can be hard to use the full range of your intuition about math. That knowledge could probably have saved us a day or two of coding.

OTOH it's extra difficult because when we did the first implementation we didn't have a full understanding of the shape of the problem.


Whenever I encounter "orders of magnitude" these days, my brain visualizes a log scale, lol.


The optimal number of bins has a rich statistics literature. The Freedman-Diaconis Rule is a good place to start: http://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
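For reference, the rule is just bin width = 2 * IQR / n^(1/3). A quick sketch (assuming numpy; the normal sample is a made-up placeholder):

    import numpy as np

    def fd_bin_width(data):
        """Freedman-Diaconis: width = 2 * IQR / n^(1/3)."""
        q75, q25 = np.percentile(data, [75, 25])
        return 2.0 * (q75 - q25) / len(data) ** (1.0 / 3.0)

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=200.0, scale=30.0, size=5_000)

    width = fd_bin_width(sample)
    n_bins = int(np.ceil((sample.max() - sample.min()) / width))
    print(width, n_bins)

    # numpy can also pick FD edges directly:
    edges = np.histogram_bin_edges(sample, bins="fd")
    print(len(edges) - 1)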

That said, the challenge with online (streaming) data is that you have to constantly estimate the distribution you are sampling from. Someone here might know better =)


I worked in R&D for an extremely large ag company. We found that bucket sizes of 0.5 or 1 standard deviation seemed to work pretty well.


http://dtrace.org/blogs/bmc/2011/02/08/llquantize/ describes a very similar log binning technique added to dtrace a few years back. In practice, it's been one of the most useful things in the toolkit.
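For anyone unfamiliar, the log-linear idea is linear steps inside each power of a base, so resolution tracks the magnitude of the value. A rough Python sketch of the bucketing (just the gist, not dtrace's exact bucket layout):

    import math

    def loglinear_bucket(value, base=10, steps=10):
        """Map a positive value to (order of magnitude, linear step within that order)."""
        if value <= 0:
            return (0, 0)  # park non-positive values in a floor bucket
        order = math.floor(math.log(value, base) + 1e-12)  # nudge past float error at exact powers
        lo = base ** order                                  # start of this order of magnitude
        step = (base ** (order + 1) - lo) / steps
        idx = min(steps - 1, max(0, int((value - lo) // step)))
        return (order, idx)

    for v in [3, 37, 370, 3700, 9999]:
        print(v, loglinear_bucket(v))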


Technique #4 would be extremely interesting if a user could click into a bucket and drop to the appropriate resolution level. Using their example, the user could click on the big yellow bar at 290 and see the 0.1-increment buckets inside that bucket of size 10.


Strikingly similar to the problem of generating a close-enough dendrogram over large data sets very quickly. Back in ~2008 I did some undergrad research on the topic to speed up DNA analysis. Basically it solves your grouping problem gracefully by attempting to stay within an error level at N levels. Since you are aiming for N buckets, this would solve your problem well. There was a very good review paper on the subject, but it's not on the first page of Google and my memory fails me completely on the author (also I'm at work).

It might not meet your memory or computation constraints, but grouping is such a wide topic.


Could you not keep statistics on the values that went into each bucket, and use those to redistribute the contents of a bucket when the buckets need to be resized?

Sure, it's an estimate so you'd have some error term but you could pass that on to the new buckets. If the total error is small enough in practice it wouldn't be a problem.
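Something along these lines, maybe: keep (count, sum) per bucket and drop each bucket's mass at its mean when the edges change. A hypothetical sketch (the edges and stats below are made up):

    import bisect

    def rebucket(old_stats, new_edges):
        """old_stats[i] = (count, total) of the values that fell in old bucket i."""
        new_stats = [[0, 0.0] for _ in range(len(new_edges) - 1)]
        for count, total in old_stats:
            if count == 0:
                continue
            mean = total / count
            # Put the whole bucket at its mean; error is bounded by the old bucket width
            j = bisect.bisect_right(new_edges, mean) - 1
            j = min(max(j, 0), len(new_stats) - 1)
            new_stats[j][0] += count
            new_stats[j][1] += total
        return new_stats

    # Old buckets [0,10), [10,20), [20,30), [30,40) collapsed into [0,20), [20,40)
    old_stats = [(5, 20.0), (8, 120.0), (2, 55.0), (1, 38.0)]
    print(rebucket(old_stats, [0, 20, 40]))  # [[13, 140.0], [3, 93.0]]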


One of the problems we had to solve is compressing the data set so you don't need to keep too many bins around. To get one more significant figure, you need to keep 10x the number of bins. Once you have those individual bins you can always merge them in the visualization, the problem is defining them in a way that covers the distribution and doesn't take up too much space.


Hive seems to use something similar to the 2nd algorithm (merging close bins together):

https://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/...
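The gist of that style of algorithm: keep at most N (center, count) pairs and, when you exceed N, merge the two adjacent centers that are closest together. A minimal, illustrative Python sketch (not Hive's actual implementation):

    class StreamingHistogram:
        """At most max_bins (center, count) pairs; the closest neighbours get merged."""
        def __init__(self, max_bins=20):
            self.max_bins = max_bins
            self.bins = []  # sorted list of [center, count]

        def add(self, value):
            self.bins.append([value, 1])
            self.bins.sort(key=lambda b: b[0])  # fine for a sketch; use bisect for speed
            if len(self.bins) > self.max_bins:
                # Find the adjacent pair with the smallest gap between centers
                i = min(range(len(self.bins) - 1),
                        key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
                (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
                self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    h = StreamingHistogram(max_bins=5)
    for x in [1, 2, 2, 3, 10, 11, 50, 51, 52, 100]:
        h.add(x)
    print(h.bins)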


There are some nice, very efficient implementations of their solution #4 at http://hdrhistogram.github.io/HdrHistogram/


A cute one I have been meaning to try: using a Haar wavelet as the histogram, with online updates. theory.stanford.edu/~matias/papers/wavelets-vldb00.pdf


I wonder if a power of two would be better?




