
BloomFilter Experiments - m00dy
https://github.com/erenyagdiran/BloomFilter
======
jorangreef
If the keys being hashed are fixed-length, then a tabulation hash can offer
better collision resistance and performance qualities than Jenkins. It's
simple and easy to understand and safe to use.

The Power of Simple Tabulation Hashing:
[http://people.csail.mit.edu/mip/papers/charhash/charhash.pdf](http://people.csail.mit.edu/mip/papers/charhash/charhash.pdf)
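
For readers who have not seen it, simple tabulation hashing can be sketched in a few lines. This is only an illustration under assumptions: the 4-byte key length, the 32-bit output width, and the names `make_tables` and `tabulation_hash` are made up here, and the tables are filled from Python's non-cryptographic `random` module.

```python
import random

KEY_BYTES = 4    # assumed fixed key length in bytes (illustrative)
HASH_BITS = 32   # assumed output width in bits (illustrative)


def make_tables(seed, key_bytes=KEY_BYTES, hash_bits=HASH_BITS):
    """Build one table of 256 random words for each byte position of the key."""
    rng = random.Random(seed)
    return [[rng.getrandbits(hash_bits) for _ in range(256)]
            for _ in range(key_bytes)]


def tabulation_hash(key, tables):
    """XOR together one table entry per byte of a fixed-length bytes key."""
    if len(key) != len(tables):
        raise ValueError("key length must match the number of tables")
    h = 0
    for position, byte in enumerate(key):
        h ^= tables[position][byte]
    return h


tables = make_tables(seed=12345)
print(tabulation_hash(b"\x00\x01\x02\x03", tables))
```

A Bloom filter that needs k hash functions could build k independent sets of tables from different seeds.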

~~~
m00dy
Let me read the paper.

------
prody
You have a very weird way of doing those graphs.

~~~
daveguy
For the author to clarify what's "weird":

1) Independent variable (thing you are changing) is usually on the x axis and
dependent variable (thing you are measuring) is usually on the y axis. Having
the two flipped requires a confusing in-head translation among people who look
at graphs regularly.

2) Your plotting software is connecting the points in data order rather than
in order of increasing x. This gives zigzags where the line doubles back.
Addressing #1 could help, or you could plot as a scatter plot ("." option in
matplotlib).

3) A detailed description of parameters under the plot would also be nice.
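
A rough sketch of points 1 and 2, assuming the data is a list of (independent, dependent) pairs. The helper names `sorted_series` and `plot_series` and the axis labels are hypothetical, not taken from the repository; `plot_series` needs matplotlib installed.

```python
def sorted_series(points):
    """Return (xs, ys) with the points ordered by the independent variable x."""
    ordered = sorted(points, key=lambda p: p[0])
    xs = [x for x, _ in ordered]
    ys = [y for _, y in ordered]
    return xs, ys


def plot_series(points, xlabel, ylabel, path):
    """Plot y against x in increasing-x order and save the figure to path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    xs, ys = sorted_series(points)
    fig, ax = plt.subplots()
    ax.plot(xs, ys, ".-")  # use "." alone for a pure scatter plot
    ax.set_xlabel(xlabel)  # independent variable on the x axis
    ax.set_ylabel(ylabel)  # dependent variable on the y axis
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_series(data, "filter size (bits)", "false positive rate", "fp.png")` would then draw a line that never doubles back, whatever order the measurements were recorded in.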

Regardless this is a cool example of bloom filters in action. Thank you!

------
jerguismi
Is there any "standard" for Bloom filters? I think there are plenty of things
Bloom filters could be used for, but often they would be useful for
communication. This means it should be easy for other parties to use the same
implementation. So, is there a Bloom filter library that has wrappers for
multiple languages?

------
niklabh
Bloom filter false positives cost us $100K.

~~~
LunaSea
Are you allowed to explain why exactly that happened?

~~~
hueving
Used a bloom filter to see if a password was correct to get into their bitcoin
stash.

------
chrstphrhrt
What is a Bloom filter?

~~~
brightball
My understanding is that it's a means of checking a dataset very quickly with
a 100% chance of avoiding "false negatives" but no guarantees around false
positives.

There are some interesting use cases for them in very high-performance
filtering operations, but they're probably overkill for the vast majority of
problems out there. It's something you'd generally look at if you were trying
to performance-tune a lot of checks against a big dataset.
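
To make the "no false negatives, possible false positives" behaviour concrete, here is a minimal sketch. It is not the linked repository's implementation: the class layout, the parameter names, and deriving the bit positions from a single SHA-256 digest by double hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus num_hashes positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive all positions from one SHA-256 digest using
        # double hashing, position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("alice")
print("alice" in bf)  # an added item is always reported as present
```

An item that was never added usually comes back `False`, but it can come back `True` if other items happened to set all of its bits. That is the false positive case, and it is why a Bloom filter is normally used only as a cheap pre-check in front of an authoritative lookup.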

~~~
KMag
In the Gnutella P2P protocol, searches are broadcast to the nearest neighbors
in the (essentially random) graph (with a hop count to prevent infinite
propagation, and keeping a list of recently seen query IDs to prevent cycles).
This essentially causes a random graph with N-ary nodes to act like a tree
with a branching factor of N-1. Think of it as running the spanning tree
algorithm[0] for each query. (The Nth link is the link up to the parent of the
node. This analysis ignores duplicate links.) Peers exchange Bloom filters of
the keywords for the files they have. That way, if the hop count indicates
that a query has only one more hop to live, the Bloom filters are consulted to
eliminate the last hop in cases where a leaf node doesn't have any matching
files. (Since Bloom filters don't have false negatives, this pruning is
safe/conservative.) It might not sound like much to only save the last hop,
but remember that for a tree with branching factor N-1, the leaves make up
roughly (N-2)/(N-1) of the nodes. So, if every peer is connected to 21 other
peers, you get a 20-way tree, and a Bloom filter can at best save you almost
95% of your query traffic.
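
The leaf fraction is easy to check numerically. This sketch assumes an idealized complete tree in which every non-leaf node has exactly `branching` children; the function name `leaf_fraction` is made up for the illustration.

```python
def leaf_fraction(branching, depth):
    """Fraction of nodes that are leaves in a complete tree of the given depth."""
    total = sum(branching ** level for level in range(depth + 1))
    leaves = branching ** depth
    return leaves / total


# 21 neighbours per peer gives a branching factor of 20.
for depth in (1, 2, 3, 4):
    print(depth, leaf_fraction(20, depth))
```

As the depth grows, the fraction settles toward (branching - 1) / branching, which is 19/20 for a 20-way tree.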

There's lots of simplification in the above analysis (ignoring multiple paths
to the same host, etc., etc.) but you get the gist of it.

(This assumes a query for some really rare keyword... the peers use sampling
of their nearest neighbors to first estimate the hop count they need in order
to get a target number of search results... so searches for popular keywords
are unlikely to be helped by the Bloom filter, but they're also sent with a
low hop count.)

[0] [https://en.wikipedia.org/wiki/Spanning_Tree_Protocol](https://en.wikipedia.org/wiki/Spanning_Tree_Protocol)

