
Bounter: A Python counter that uses limited memory regardless of data size - Radim
https://github.com/RaRe-Technologies/bounter
======
asrp
The introduction is misleading about the trade-offs it makes.

    
    
      >>> from bounter import bounter
      >>> bounts = bounter(size_mb=1)
      >>> bounts.update(str(i) for i in xrange(1000000))
      >>> bounts['100']
      0L
    

should give 1, not 0. The description should be more upfront that you are
getting approximate, potentially inaccurate answers.

> Bounter implements approximative algorithms using optimized low-level C
> structures, to avoid the overhead of Python objects.

"approximative algorithms" isn't a commonly used phrase (at least according to
Google), and this sentence doesn't say what the actual trade-off is (loss of
accuracy in the counts). A memory-bounded counter could just as well spill to
hard disk instead, so which trade-off is being made isn't obvious.

I also don't know what the precision and recall percentages in the example
table mean.

~~~
gipp
The tradeoffs themselves I think are fine. If I have an issue it's that the
three different use cases outlined have _completely_ different tradeoffs using
completely different data structures, but are presented behind one API
differing only in boolean flags.

I don't know about you, but a boolean flag that completely changes the behavior
of a collection definitely violates the principle of least surprise. They
should be different classes with descriptive names, IMO.
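A hypothetical sketch of that suggestion (all names invented here, not
Bounter's actual API): distinct classes make the distinct guarantees visible at
the call site.

```python
from collections import Counter

class ExactCounter(Counter):
    """Exact counts; memory grows with the number of distinct keys."""

class ApproxFrequencyCounter:
    """Bounded-memory per-item counts (e.g. backed by a count-min sketch)."""
    def __init__(self, size_mb):
        self.size_mb = size_mb  # placeholder; a real class would allocate here

class CardinalityEstimator:
    """Bounded-memory count of *distinct* items only (e.g. HyperLogLog)."""
    def __init__(self, size_mb):
        self.size_mb = size_mb

# The class name, not a boolean flag, now tells you which trade-off you get.
exact = ExactCounter("aab")
print(exact["a"])  # → 2
```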

~~~
Radim
Hi there, thanks for the constructive comments!

They actually are completely different classes internally. The `bounter`
function is just a convenience wrapper (a factory). For power users, the
internal classes offer more control and parameters.

If you have any suggestions / concerns, please raise an issue on github. It’s
a new library, we’re looking for feedback!

------
alonmln
Looks useful! I like the way this library is documented, very well explained.

------
gtrubetskoy
Is this a variant of Space Saving or Frequent algorithm?

Space Saving:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.8360&rep=rep1&type=pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.8360&rep=rep1&type=pdf)

Frequent: [https://stackoverflow.com/questions/3260653/algorithm-to-find-top-10-search-terms/3260905#3260905](https://stackoverflow.com/questions/3260653/algorithm-to-find-top-10-search-terms/3260905#3260905)

And there is a variant with a Sketch function, known as "Filtered Space
Saving". I cannot find a working link to the paper, but here is a Golang
implementation of it:

[https://godoc.org/github.com/dgryski/go-topk](https://godoc.org/github.com/dgryski/go-topk)

------
visarga
It would be useful to have load/save (with mmap for faster loads).

------
sametmax
So, like a HyperLogLog?
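One distinction worth noting (my reading, not an authoritative statement about
Bounter's internals): a HyperLogLog estimates only the *number of distinct*
items, while a bounded-memory counter targets per-item frequencies. In exact
Python terms, the two answer different questions:

```python
from collections import Counter

data = ["a", "b", "a", "c", "a", "b"]

# The question a HyperLogLog answers (approximately, in bounded memory):
distinct = len(set(data))   # 3 distinct items

# The question a frequency counter answers:
freqs = Counter(data)       # Counter({'a': 3, 'b': 2, 'c': 1})
print(distinct, freqs)
```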

------
suff
It is good that someone wrote this. It is embarrassing how bad Python's native
performance is.

~~~
daveFNbuck
This isn't a replacement for any native Python data structure. It's a
collection of three different approximate data structures for dealing with
large data sets, where you're OK trading accuracy for reduced memory usage.

