

Frequency Counting Algorithms Over Data Streams - mvogiatzis
http://micvog.com/2015/07/18/frequency-counting-algorithms-over-data-streams/

======
haberman
This area of CS (one-pass analysis over data streams with tight memory bounds)
is pretty well-studied; the techniques are known as "data stream algorithms."

This 2002 paper describes a data structure for this that I suspect is superior
to what is described in the article:
[https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf](https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf)
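If I recall correctly, that paper's core idea is the Frequent algorithm, a Misra-Gries-style summary: keep at most k counters, and when an unseen item arrives with no free counter, decrement them all. A minimal sketch of that idea (the function name and choice of k are mine, not the paper's exact presentation):

```python
def misra_gries(stream, k):
    """Misra-Gries-style heavy-hitters summary using at most k counters.

    Any item occurring more than len(stream)/(k+1) times is guaranteed
    to survive in the returned map; surviving counts underestimate the
    true counts by at most len(stream)/(k+1).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Memory is O(k) regardless of how many distinct items the stream contains, which is the whole point versus keeping one exact counter per item.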

------
georgehm
"Count-Min and Count Sketches" section at
[http://web.stanford.edu/class/cs166/](http://web.stanford.edu/class/cs166/)
provides good explanations of this topic as well.

------
talaketu
> “Which pages are getting an unusual hit in the last 30 minutes?”

A good statistic for this is an exponentially weighted moving average, which
can be computed online by a leaky integrator using a single accumulator
register per counted entity.

It's not so much counting as estimating a rate.
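A minimal sketch of the leaky-integrator idea in Python (the class name, the decay constant `alpha`, and the dict-of-accumulators layout are my illustration choices, not part of the parent comment):

```python
class EWMARate:
    """Exponentially weighted moving average of per-item event rates.

    One accumulator register per counted entity. On each event every
    accumulator leaks by a factor (1 - alpha) and the observed item's
    accumulator is bumped, so recent events dominate the estimate.
    """

    def __init__(self, alpha=0.1):
        self.alpha = alpha   # weight given to the newest observation
        self.acc = {}        # one accumulator register per entity

    def observe(self, item):
        # Eager decay of all registers for clarity; a real implementation
        # would decay lazily using per-entity timestamps.
        for key in self.acc:
            self.acc[key] *= (1 - self.alpha)
        self.acc[item] = self.acc.get(item, 0.0) + self.alpha

    def rate(self, item):
        return self.acc.get(item, 0.0)
```

With this formulation the rates sum to at most 1, so `rate(item)` is roughly the item's share of recent traffic.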

~~~
haberman
> a single accumulator register per counted entity.

That requires O(n) memory space, where n is the number of counted entities. If
you're willing to spend this much memory, the problem is easy. Algorithms like
the one in the article (or the paper I mentioned) are for when you want to use
far less memory than that.

~~~
ted_dunning
Count-min sketches can be adapted to use exponential decay. Presumably any
sort of recurrent decay could be used to get approximately rectangular windows
as well.
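One simple way to add exponential decay to a count-min sketch is to scale every cell by a factor gamma on a fixed tick; this is only a sketch of that idea, not necessarily the adaptation the parent comment has in mind, and the hashing scheme here is an illustration:

```python
import hashlib

class DecayedCountMin:
    """Count-min sketch whose cells decay geometrically over time."""

    def __init__(self, width=256, depth=4, gamma=0.99):
        self.width, self.depth, self.gamma = width, depth, gamma
        self.table = [[0.0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hash function per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1.0):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the row-wise minimum is an
        # upper bound on the item's (decayed) count.
        return min(self.table[row][col] for row, col in self._buckets(item))

    def decay(self):
        # Call once per tick (e.g. once per minute) to age old events out.
        for row in self.table:
            for col in range(self.width):
                row[col] *= self.gamma
```

Because decay scales every cell uniformly, the usual count-min guarantees carry over to the decayed counts; a rectangular window would instead need something like a ring of per-interval sketches that are dropped as they expire.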

------
amelius
Why not store the items in a random order? This can be achieved by performing
a sort using a random key. Sorting can be done as a background job, so that
the list is always "reasonably" random.

~~~
mvogiatzis
The items are stored in a map in random order (Red -> 45, Blue -> 23 etc). The
picture makes it easier to read by showing many red boxes but the actual
implementation just stores a number for each item label.

------
JohnDoe365
How does HyperLogLog fit in here?

[http://antirez.com/news/75](http://antirez.com/news/75)

~~~
steamer25
My understanding is that HyperLogLog estimates the cardinality, i.e., the
number of distinct event types in the stream. The frequency counting
algorithms in the article instead estimate the number of occurrences of
each of the most frequently occurring types.

E.g., where the frequency counting algorithms might estimate the following
counts for a given stream: A: 2, B: 5, C: 1, HyperLogLog would estimate only
that the stream contains three distinct event types (A, B and C, but all you
get is the count: 3). Presumably, HyperLogLog requires less time and space
to produce this more summarized view.
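The distinction is easy to see with exact (non-sketch) counterparts; the sketches approximate these same two answers using far less memory than the exact structures below:

```python
from collections import Counter

stream = ["A", "B", "B", "A", "C", "B", "B", "B"]

# What a frequency-counting sketch (count-min, lossy counting, ...)
# approximates: the number of occurrences of each item.
freqs = Counter(stream)         # A: 2, B: 5, C: 1

# What HyperLogLog approximates: only the number of distinct items.
cardinality = len(set(stream))  # 3
```
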

