
Show HN: PDD – Probabilistic De-Duplication of Streams with Bloom Filters - jparkie
https://github.com/jparkie/PDD
======
snissn
I've been running similar algorithms in production for around 4 years. One
issue we've never solved well is unit testing our bloom filter algorithms.
We've used end-to-end tests to our satisfaction, but not really tests of the
bloom filter algorithms themselves. Do you have some test / benchmark code
that you could recommend for testing different bloom filter false positive
rates?

~~~
jparkie
I'm thinking of building a test suite which takes a distribution of distinct
elements and runs a test for the false positive probability (FPP) and false
negative probability (FNP). Then it asserts whether the actual FPP and FNP
are within an epsilon of the calculated probabilities.

I also want to add statistics to my implementations so that they can be
monitored, at least by some estimate.

~~~
nightcracker
I would like to add that it's incredibly important to use a fixed seed, or at
least a logged seed, in tests like this.

If you fail to do this you get unreproducible ghost tests that you can never
investigate if they happen to fail rarely.
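
A minimal sketch of what such a test could look like, assuming a plain
textbook Bloom filter rather than PDD's classes. BloomFilter, expected_fpp and
measure_fpp are hypothetical names made up for illustration, and the sizes and
the epsilon are arbitrary choices. The seed is fixed and printed in the
assertion message so a failure can be replayed.

```python
import hashlib
import math
import random


class BloomFilter:
    """Textbook Bloom filter; a hypothetical stand-in, not PDD's implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Double hashing: derive all indexes from two halves of one digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))


def expected_fpp(num_bits, num_hashes, num_inserted):
    # Standard approximation: (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_inserted / num_bits)) ** num_hashes


def measure_fpp(seed, num_bits=10_000, num_hashes=5,
                num_inserted=1_000, num_probes=20_000):
    rng = random.Random(seed)  # all randomness flows from the given seed
    # Distinct keys: the first part is inserted, the rest is never inserted.
    keys = rng.sample(range(10**9), num_inserted + num_probes)
    inserted, probes = keys[:num_inserted], keys[num_inserted:]
    bloom = BloomFilter(num_bits, num_hashes)
    for key in inserted:
        bloom.add(key)
    # Every probe that the filter claims to contain is a false positive.
    false_positives = sum(1 for key in probes if key in bloom)
    return (false_positives / num_probes,
            expected_fpp(num_bits, num_hashes, num_inserted))


SEED = 42  # fixed; if you randomize it per run, log it
EPSILON = 0.005
observed, expected = measure_fpp(SEED)
assert abs(observed - expected) < EPSILON, (
    f"seed={SEED}: observed {observed} vs expected {expected}")
```

The epsilon has to be wide enough to cover sampling noise for the chosen
number of probes, otherwise the test fails for reasons unrelated to the
filter.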

------
fatdog
Fascinating. Just reading about probabilistic data structures in general, and
would like to know if there is an efficient general method for generating
statistics about the FN rates. Are they related to ROC curves?

~~~
jparkie
I haven't looked into the relationship with ROC curves.

I'm not sure of an efficient general method for generating statistics. I know
empirically testing the structures is the easy way. For Bloom Filters which
evict old data, you can calculate the probability of FN by calculating the
probability that a given element is a duplicate but reported as distinct.
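
A rough sketch of the empirical route, assuming a toy evicting filter that
clears a few random bits before each insert (loosely in the spirit of a Stable
Bloom Filter, and not one of PDD's algorithms). ToyEvictingFilter,
classify_distinct and measure_error_rates are hypothetical names, and all
parameters are arbitrary. An exact set serves as ground truth, so an FN is
counted whenever a true duplicate is reported as distinct.

```python
import hashlib
import random


class ToyEvictingFilter:
    """Bloom filter that clears random bits on every insert (illustrative only)."""

    def __init__(self, num_bits, num_hashes, evictions_per_insert, seed):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.evictions_per_insert = evictions_per_insert
        self.bits = bytearray(num_bits)
        self.rng = random.Random(seed)  # seeded so evictions are reproducible

    def _indexes(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def classify_distinct(self, item):
        """Return True if the item looks new, then record it."""
        indexes = self._indexes(item)
        looks_distinct = not all(self.bits[i] for i in indexes)
        # Evict: clear a few randomly chosen bits to make room for new data.
        for _ in range(self.evictions_per_insert):
            self.bits[self.rng.randrange(self.num_bits)] = 0
        for i in indexes:
            self.bits[i] = 1
        return looks_distinct


def measure_error_rates(seed, stream_length=50_000, universe=5_000,
                        evictions_per_insert=4):
    rng = random.Random(seed)
    filt = ToyEvictingFilter(num_bits=20_000, num_hashes=3,
                             evictions_per_insert=evictions_per_insert,
                             seed=seed)
    seen = set()  # exact ground truth, only affordable in a test
    false_neg = false_pos = duplicates = distincts = 0
    for _ in range(stream_length):
        item = rng.randrange(universe)
        is_duplicate = item in seen
        said_distinct = filt.classify_distinct(item)
        if is_duplicate:
            duplicates += 1
            false_neg += said_distinct      # duplicate reported as distinct
        else:
            distincts += 1
            false_pos += not said_distinct  # distinct reported as duplicate
        seen.add(item)
    return false_neg / duplicates, false_pos / distincts


fnr, fpr = measure_error_rates(seed=7)
print(f"empirical FNR={fnr:.4f}, FPR={fpr:.4f}")
```

Setting evictions_per_insert=0 turns this back into a plain Bloom filter, for
which the measured FNR should be exactly zero; that makes a handy sanity check
on the harness itself.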

------
contravariant
This looks like a neat technique, but it's bothering me slightly that
'peekDistinct' returns _false_ if the element you're testing _is_ distinct.

~~~
jparkie
My bad. README/Documentation error. The tests are fine though.

------
muzakthings
Is this a library that does entity recognition and matching on individual
elements in streams, or are the streams themselves being featurized/matched?

~~~
jparkie
This library provides probabilistic set membership tests on an individual
element basis as the stream progresses, with a fixed bound on memory. Every
call to classifyDistinct() advances the internal history of a
ProbabilisticDeDuplicator.

Traditionally, probabilistic set membership tests were accomplished by using
Bloom Filters. However, Bloom Filters only work well on finite sets because as
Bloom Filters age (fill up), the false positive rate approaches 1. This issue
was addressed by Stable Bloom Filters, which free up space for inserts;
however, this introduces false negatives.
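
To make the "approaches 1" point concrete, here is the usual approximation for
a standard Bloom filter, assuming m bits, k hash functions and n distinct
inserted elements (the symbols are illustrative notation, not the library's):

```latex
p_{\mathrm{FP}}(n) \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
\lim_{n \to \infty} p_{\mathrm{FP}}(n) = 1 \quad \text{for fixed } m, k.
```

With m and k fixed, an unbounded stream drives n up without limit, so
eventually every bit is set and every query reports a duplicate.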

This library contributes three implementations of de-duplication algorithms
centered around Bloom Filter variants whose design and replacement strategies
allow them to reach stability faster than a Stable Bloom Filter while reducing
the false negative rate (FNR), in some cases by several orders of magnitude.

