
Best-Ever Algorithm Found for Huge Streams of Data - signa11
https://www.quantamagazine.org/best-ever-algorithm-found-for-huge-streams-of-data-20171024/
======
gwenzek
Clickbait title. Should be something like: efficiently find the most frequent
items in a stream of data.

TL;DR: split your items into chunks of 8 bits. For each chunk A, store the
number of times you saw it before chunk B followed by C. That way you keep
track of only 2^24 counters instead of one counter per possible item. Then,
using some graph algorithm (the interesting part, which is not explained in
the article), you can use those weighted edges to reconstruct the most
frequent items with high confidence.
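The chunk-counting idea can be sketched roughly as below. This is a toy illustration, not the paper's actual algorithm: it counts consecutive 8-bit chunk *pairs* (rather than the triples described above), and it replaces the paper's graph reconstruction step with a naive greedy chaining of heavy edges. All function names and parameters are made up for the sketch.

```python
from collections import Counter

def chunk_pairs(item, chunk_bits=8, item_bits=32):
    """Split an integer item into fixed-size chunks, most significant
    first, and return the list of consecutive (chunk, next_chunk) pairs."""
    mask = (1 << chunk_bits) - 1
    chunks = [(item >> shift) & mask
              for shift in range(item_bits - chunk_bits, -1, -chunk_bits)]
    return list(zip(chunks, chunks[1:]))

def sketch_stream(stream, chunk_bits=8, item_bits=32):
    """Count how often each positioned (chunk, next_chunk) edge occurs.

    For 32-bit items and 8-bit chunks this needs at most 3 * 2^16
    counters, instead of one counter per possible 32-bit item."""
    edges = Counter()
    for item in stream:
        for pos, pair in enumerate(chunk_pairs(item, chunk_bits, item_bits)):
            edges[(pos, pair)] += 1
    return edges

def reconstruct(edges, threshold, chunk_bits=8, item_bits=32):
    """Greedily chain heavy edges back into candidate heavy items.

    A crude stand-in for the paper's graph step: starting from heavy
    position-0 edges, extend each candidate with any edge at the next
    position whose count exceeds `threshold` and whose first chunk
    matches the candidate's last chunk."""
    n_pairs = item_bits // chunk_bits - 1
    candidates = [[a, b] for (pos, (a, b)), c in edges.items()
                  if pos == 0 and c >= threshold]
    for pos in range(1, n_pairs):
        candidates = [cand + [b]
                      for cand in candidates
                      for (p, (a, b)), c in edges.items()
                      if p == pos and a == cand[-1] and c >= threshold]
    # Reassemble each surviving chunk list back into an integer.
    return [sum(ch << (item_bits - chunk_bits * (i + 1))
                for i, ch in enumerate(cand))
            for cand in candidates]
```

For example, a stream of 100 copies of 0x12345678 mixed with a few copies of 0xDEADBEEF, sketched and reconstructed with `threshold=50`, recovers only the frequent item. The greedy chaining can produce false positives when distinct heavy items share chunks; avoiding that is exactly what the paper's cleverer graph machinery is for.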

~~~
mcroydon
The original paper
([https://arxiv.org/abs/1604.01357](https://arxiv.org/abs/1604.01357)) from
2016 goes into more detail.

~~~
AstralStorm
Thank you for the link. Interesting: they use the graph resampling and r-tree
approaches already in use for certain kinds of machine learning (esp.
approximate eigenvalue-based classifiers).

Very well written too. Unfortunately, the approach described only works for
dense countable data that is linearly separable (eta spectral clusters; in
fact an even stronger assumption), which still makes it quite useful as a
database index.

------
vog
Please change the title to something more descriptive and less click-baity.

Also, the original paper is worth a read:
[https://arxiv.org/abs/1604.01357](https://arxiv.org/abs/1604.01357)

~~~
Radim
Upvoted your comment, but I worry this trend will only get stronger if we
endorse and reinforce such behaviour. Why give it eyeballs via the HackerNews
front page?

The OP is truly disgusting clickbait and crossed the line for me.

I flagged it for this reason.

