
Quickly counting lots of stuff in little RAM: CountMinSketch in Python - Radim
https://rare-technologies.com/counting-efficiently-with-bounter-pt-2-countminsketch/
======
wenc
Yahoo has a full library of production Sketches; that is, fast approximate
algorithms for computing stats on large datasets.

[https://datasketches.github.io/](https://datasketches.github.io/)

HyperLogLog for fast COUNT DISTINCTs is included in this library, as well as
quantile sketches, reservoir sampling sketches, theta sketches, etc.

[https://datasketches.github.io/docs/TheChallenge.html](https://datasketches.github.io/docs/TheChallenge.html)

~~~
Radim
Are there any benchmarks for this Yahoo lib? It would be good to compare actual
performance/accuracy numbers rather than impressions.

There's no shortage of various streamed counting implementations; the sweet
spot of Bounter [0], the Python lib from the original blog post, is its super
simple use (drop-in replacement for Python's built-in Counter), small memory
footprint & high performance (optimized C underneath).

[0] [https://github.com/RaRe-Technologies/bounter](https://github.com/RaRe-Technologies/bounter)

------
philipkglass
Eric Zhu also has a Python library called datasketch. The docs are good and I
found the code a great help in understanding implementation details of some
methods. I am not using his code in production but it was a great outline for
how to implement certain techniques I wanted to use in a Scala project.

[https://ekzhu.github.io/datasketch/index.html](https://ekzhu.github.io/datasketch/index.html)

[https://github.com/ekzhu/datasketch](https://github.com/ekzhu/datasketch)

I don't know how CountMinSketch compares with Zhu's HyperLogLog and
HyperLogLog++ for cardinality counting, in terms of speed/memory/accuracy. But
if you're interested in Bounter you may well be interested in Zhu's code too.

~~~
Radim
Interesting link, thanks. HyperLogLog is a simpler algo in a way, as it only
estimates the _set cardinality_ ("count distinct"), as opposed to _frequencies
of individual items_.

Bounter includes HyperLogLog as well (mentioned in the blog post); it comes
for free with its optimized CountMinSketch.
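
To make the frequency-vs-cardinality distinction above concrete, here is a toy Count-Min Sketch in pure Python. It is only an illustrative sketch, not Bounter's implementation (which is optimized C): the class name, the `width`/`depth` defaults and the salted-BLAKE2 hashing are all assumptions picked for readability.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: approximate per-item counts in fixed memory.

    Illustrative only; names and parameters here are hypothetical.
    """

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth  # must stay below 256 for the one-byte salt used here
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row; each row uses a differently salted hash.
        data = item.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over the rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
words = "a few words a few times".split()
for w in words:
    sketch.add(w)

print(sketch.estimate("few"))  # per-item frequency; never below the true count
print(len(set(words)))         # exact "count distinct", which HyperLogLog approximates
```

The sketch answers "how often did I see X?" with an estimate that can overshoot but never undershoot, while a cardinality estimator like HyperLogLog only answers "how many different X did I see?".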

~~~
philipkglass
Good point. I mention it because Zhu's code, the Yahoo DataSketches library
(mentioned in another comment), and Bounter all seem like they may be of
interest if you are trying to solve certain classes of problems or just like
probabilistic algorithms.

------
thanatropism
That header that comes up every time one scrolls up is an absolute impediment
to reading this.

~~~
mjcohen
Right-click, Inspect, go up a few levels, Delete element

~~~
thanatropism
Wow.

