
Counting Bloom Filter in C++ - ingve
https://medium.com/@cyril0allen/counting-bloom-filter-in-c-9672ec25b3ec
======
panic
_The Add() function has to account for the possibility of the count
potentially exceeding the maximum counter value. In this case, there’s not
much we can do except avoid an increment and flag the filter data as
potentially erroneous._

Most counting bloom filters handle this situation by using saturating
arithmetic: once the count hits the maximum, it remains stuck there, never
decrementing again until the filter is completely cleared (see WebKit's
implementation, for example:
[https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/...](https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/BloomFilter.h#L228-L231)).
This maintains the Bloom Filter Guarantee™ that you can get false positives
but never false negatives.
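
As a rough sketch of what saturating counters look like (this is not WebKit's actual code; the class name `SaturatingCounters` and the 8-bit counter width are assumptions made purely for illustration):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical array of 8-bit counters with saturating updates.
template <std::size_t N>
class SaturatingCounters {
public:
    static constexpr std::uint8_t kMax =
        std::numeric_limits<std::uint8_t>::max();

    // Increment, but stick at kMax instead of wrapping around to 0.
    void increment(std::size_t i) {
        assert(i < N);
        if (counts_[i] != kMax) ++counts_[i];
    }

    // Decrement, but leave saturated counters alone: once a counter has
    // saturated, its true value is unknown, so decrementing it could
    // reach 0 too early and cause a false negative.
    void decrement(std::size_t i) {
        assert(i < N);
        if (counts_[i] != kMax && counts_[i] != 0) --counts_[i];
    }

    std::uint8_t get(std::size_t i) const {
        assert(i < N);
        return counts_[i];
    }

    // Clearing the whole filter is the only way to reset a stuck counter.
    void clear() { counts_.fill(0); }

private:
    std::array<std::uint8_t, N> counts_{};
};
```

A stuck counter only makes the filter report "maybe present" more often, so the cost of saturation is extra false positives rather than false negatives.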

~~~
pzh
Wouldn't that only work if you only remove stuff that you already added, i.e.
you can't remove an element that would be considered a 'false positive' and
still expect to have no 'false negatives' after that?

~~~
stormbeard
Right, but I think they mean the particular cause of false negatives in the
quote (counter overflow). I can't think of a good way to mitigate the false
negatives caused by removing items that were never inserted into the filter,
since you have no way of knowing exactly what was added and what wasn't.
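
To make that failure mode concrete, here is a toy sketch. Everything in it is hypothetical: the class name `ToyCountingFilter`, the four-counter table, and the two deliberately weak hash functions, which were picked only so that a collision is easy to construct by hand.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy counting filter: 4 counters and 2 weak hash functions.
// Overflow handling is omitted to keep the example short.
class ToyCountingFilter {
public:
    void add(unsigned x) {
        ++counts_[h1(x)];
        ++counts_[h2(x)];
    }

    // The filter cannot tell whether x was ever added, so it just
    // decrements whatever counters x happens to map to.
    void remove(unsigned x) {
        if (counts_[h1(x)] > 0) --counts_[h1(x)];
        if (counts_[h2(x)] > 0) --counts_[h2(x)];
    }

    bool maybe_contains(unsigned x) const {
        return counts_[h1(x)] > 0 && counts_[h2(x)] > 0;
    }

private:
    static std::size_t h1(unsigned x) { return x % 4; }
    static std::size_t h2(unsigned x) { return (x / 4) % 4; }

    std::array<std::uint8_t, 4> counts_{};
};
```

With these hashes, 1 and 17 both map to counters 1 and 0. After `add(1)`, `maybe_contains(17)` is true (a false positive). Calling `remove(17)` then zeroes both counters, and `maybe_contains(1)` becomes false even though 1 was really added: a false negative.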

There's a paper linked in the blog post that goes into a lot of detail about
minimizing false negative probability.

------
stormbeard
This is mine! Thanks for posting. I'm really glad people enjoyed the post.

------
vivekseth
For those who don’t know, there’s a data structure called HyperLogLog
([https://en.m.wikipedia.org/wiki/HyperLogLog](https://en.m.wikipedia.org/wiki/HyperLogLog))
that does something similar. I just learned about it last week.

~~~
stormbeard
The LogLog and HyperLogLog algorithms are a probabilistic way to estimate the
cardinality (the number of distinct elements) of some set using a small
amount of memory. I think it’s pretty awesome because it still works even if
there are duplicate elements in the set.

This counting bloom filter lets us estimate about how many times we’ve
encountered a particular element in some huge set using a relatively small
amount of memory.

So we’re answering two different questions with these guys:

Counting bloom filter: About how many times has a particular element shown up
in this set?

(Hyper)LogLog: About how many unique elements are in this set?
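
For the first question, one common way to get the estimate is to take the smallest of the counters an element maps to. Below is a minimal sketch of that idea; it is not necessarily how the blog post does it, and the class name `CountEstimator`, the table size, the number of hashes, and the double-hashing scheme are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical counting filter that reports an approximate count per key.
class CountEstimator {
public:
    void add(const std::string& key) {
        for (std::size_t idx : indices(key)) {
            if (counts_[idx] != kMax) ++counts_[idx];  // saturate, never wrap
        }
    }

    // The smallest counter for the key. Collisions with other keys can only
    // inflate counters, so (ignoring saturation and removals) this never
    // underestimates the true count.
    unsigned estimate(const std::string& key) const {
        unsigned best = kMax;
        for (std::size_t idx : indices(key)) {
            best = std::min<unsigned>(best, counts_[idx]);
        }
        return best;
    }

private:
    static constexpr std::size_t kSlots = 64;  // must be a power of two
    static constexpr std::size_t kHashes = 3;
    static constexpr std::uint8_t kMax = 255;

    // Double hashing: an odd step with a power-of-two table size
    // guarantees that the kHashes slots for one key are distinct.
    static std::array<std::size_t, kHashes> indices(const std::string& key) {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = (h1 >> 17) | 1;
        std::array<std::size_t, kHashes> out{};
        for (std::size_t i = 0; i < kHashes; ++i) {
            out[i] = (h1 + i * h2) % kSlots;
        }
        return out;
    }

    std::array<std::uint8_t, kSlots> counts_{};
};
```

Adding "apple" three times to an empty `CountEstimator` gives `estimate("apple") == 3`. Once other keys are added, the estimate can only stay the same or go up, which is the "about how many times" part of the answer.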

