Hacker News
BloomFilter Experiments (github.com/erenyagdiran)
71 points by m00dy on Feb 26, 2016 | 22 comments



If the keys being hashed are fixed-length, then a tabulation hash can offer better collision resistance and performance than Jenkins. It's simple, easy to understand, and safe to use.

The Power of Simple Tabulation Hashing: http://people.csail.mit.edu/mip/papers/charhash/charhash.pdf
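Simple tabulation hashing is short enough to sketch. A minimal illustration (the 8-byte key length, 64-bit output width, and fixed seed are arbitrary choices for the example, not anything prescribed by the paper):

```python
import random

# Simple tabulation hashing for fixed-length keys: one table of
# random words per key byte, XORed together. The seed is fixed
# only so this sketch is reproducible.
random.seed(42)
KEY_LEN = 8
TABLES = [[random.getrandbits(64) for _ in range(256)] for _ in range(KEY_LEN)]

def tab_hash(key: bytes) -> int:
    """XOR together one random table entry per key byte."""
    assert len(key) == KEY_LEN
    h = 0
    for i, b in enumerate(key):
        h ^= TABLES[i][b]
    return h
```

Lookups are just table reads and XORs, which is where the speed comes from; the paper's contribution is showing this scheme is much stronger than its 3-independence suggests.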


Let me read the paper.


You have a very weird way of doing those graphs.


To clarify for the author what's "weird":

1) The independent variable (the thing you're changing) usually goes on the x axis and the dependent variable (the thing you're measuring) on the y axis. Flipping the two forces a confusing in-head translation for people who read graphs regularly.

2) Your plotting software is not drawing the line in monotonically increasing order, which produces zig-zags where the line doubles back on itself. Addressing #1 could help, or sort the points before plotting, or use a scatter plot (the "." option in matplotlib).

3) A detailed description of parameters under the plot would also be nice.
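The fix for point 2 is usually a one-liner: sort the (x, y) pairs by the independent variable before handing them to the plotting call. A sketch with made-up measurements (the data values here are purely illustrative):

```python
# Hypothetical unsorted measurements: (filter_size, false_positive_rate).
# A line plot of these in file order zig-zags; sorting by the
# independent variable first gives a clean left-to-right line.
data = [(500, 0.02), (100, 0.30), (300, 0.08), (200, 0.15), (400, 0.04)]

data.sort(key=lambda point: point[0])  # sort by x before plotting
xs = [x for x, _ in data]
ys = [y for _, y in data]
# In matplotlib: plt.plot(xs, ys) now draws left to right,
# or plt.plot(xs, ys, ".") for a scatter regardless of order.
```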

Regardless, this is a cool example of Bloom filters in action. Thank you!


Is there any "standard" for Bloom filters? I think there are plenty of things Bloom filters could be used for, and often they'd be useful for communication. That means it should be easy for other parties to use the same implementation. So, is there a Bloom filter library with wrappers for multiple languages, etc.?


Bloom filter false positives cost us $100K


If that's the case then you were doing it wrong. False positives from Bloom filters are always possible, by design.


Bloom filters are not about depending on true positives, but rather depending on true negatives.


The proximate cause might have been a false positive, but I think it's more fruitful to think of the likely root cause: picking a data structure without thinking through/understanding the ramifications ultimately cost you 100k USD.


Are you allowed to explain why exactly that happened?


Used a Bloom filter to check whether a password was correct to get into their bitcoin stash.


It looks interesting. Tell us what happened.


can you elaborate on what happened?


What is a Bloom filter?


This is a great video that introduces the concept: https://www.youtube.com/watch?v=-SuTGoFYjZs


My understanding is that it's a means of checking membership in a dataset very quickly, with a 100% guarantee of avoiding false negatives but no guarantees around false positives.

Some interesting use cases around them for very high-performance filtering operations, but probably overkill and overthink for the vast majority of issues out there. It's something you'd generally look at if you were trying to performance-tune a lot of checks against a big dataset.
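The "no false negatives, possible false positives" behavior is easy to see in a toy implementation. A minimal sketch (the bit-array size, hash count, and use of blake2b are all illustrative choices, not any particular library's design):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: m bits, k index positions derived from blake2b."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0  # bits stored as one big int

    def _positions(self, item: str):
        # Derive k bit positions by salting the hash with an index.
        for i in range(self.k):
            d = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means only "probably present".
        return all(self.bits >> p & 1 for p in self._positions(item))
```

Anything you `add` is guaranteed to test positive afterwards; the risk is that an item you never added happens to hit k set bits, which is the false-positive case the $100K comment ran into.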


In the Gnutella P2P protocol, searches are broadcast to the nearest neighbors in the (essentially random) graph (with a hop count to prevent infinite propagation, and a list of recently seen query IDs to prevent cycles). This essentially causes a random graph with N-ary nodes to act like a tree with a branching factor of N-1. Think of it as running the spanning tree algorithm[0] for each query. (The Nth link is the link up to the parent of the node. This analysis ignores duplicate links.)

Peers exchange Bloom filters of the keywords for the files they have. That way, if the hop count indicates that a query has only one more hop to live, the Bloom filters are consulted to eliminate the last hop in cases where a leaf node doesn't have any matching files. (Since Bloom filters don't have false negatives, this pruning is safe/conservative.)

It might not sound like much to only save the last hop, but remember that for a tree of degree N-1, the leaves represent almost an (N-2)/(N-1) portion of the nodes. So, if every peer is connected to 21 other peers, you get a 20-way tree, and a Bloom filter can at best save you almost 95% of your query traffic.

There's lots of simplification in the above analysis (ignoring multiple paths to the same host, etc., etc.) but you get the gist of it.

(This assumes a query for some really rare keyword... the peers use sampling of their nearest neighbors to first estimate the hop count they need in order to get a target number of search results... so searches for popular keywords are unlikely to be helped by the Bloom filter, but they're also sent with a low hop count.)

[0]https://en.wikipedia.org/wiki/Spanning_Tree_Protocol
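The leaf-fraction arithmetic in that comment is easy to sanity-check. A small sketch (the branching factor and depth are arbitrary illustration values):

```python
# In a full tree with branching factor b, the leaves at depth d
# outnumber all internal nodes combined by roughly (b - 1) to 1,
# so the leaf fraction approaches (b - 1) / b as the tree deepens.
# With 21 neighbors per peer (b = 20 children), that's 19/20 = 95%.
def leaf_fraction(b: int, depth: int) -> float:
    internal = sum(b**d for d in range(depth))  # nodes above the leaves
    leaves = b**depth
    return leaves / (leaves + internal)

print(round(leaf_fraction(20, 4), 4))
```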


I've used bloom filters in cross-referencing multiple data sources, where queries from one badly-optimized database were very expensive. In this case, the penalty of a false positive was merely a five-second lookup.


There is also an overload of terms. In addition to the algorithmic meaning (as in this case) it's also a computer graphics technique: https://en.wikipedia.org/wiki/Bloom_(shader_effect)


well, it is a probabilistic data structure.


Ahh okay: https://en.wikipedia.org/wiki/Bloom_filter

That's super interesting!

I wonder what kinds of stuff you could do with that in a fancy datastore like Elasticsearch or GraphDB.

If it could be mapped to a resource model in a straightforward way, it'd be fun to make a special "ORM" that plugs into an API framework to expose public services.


Bloom filters are used to greatly reduce the amount of straw when you're looking for a needle in a haystack.

Take bup (https://github.com/bup/bup/) for instance. It splits your content into chunks and stores them in GB-sized packs. With large backups you easily end up with hundreds of packs. During the splitting process it needs to search for a hash (the needle) among a huge number of chunks (the haystack). That's why bup has one Bloom filter per pack, which means that instead of searching through hundreds of packs you only have to search through maybe 2 or 3. There are false positives (you know a hash can only realistically be contained in a single pack), but you, the developer, choose how many you tolerate. There is some complicated math that lets you balance the false-positive rate, the size of the structure, and the expected number of elements.
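That "complicated math" boils down to two textbook formulas: for n expected elements and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = (m/n)·ln 2. A sketch (not bup's actual sizing code, just the standard relations):

```python
import math

# Standard Bloom filter sizing: given n expected elements and a
# target false-positive rate p, return (bits needed, hash count).
def bloom_parameters(n: int, p: float):
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # bits
    k = max(1, round(m / n * math.log(2)))              # hash functions
    return m, k

bits, hashes = bloom_parameters(1_000_000, 0.01)
# Roughly 9.6 bits per element and 7 hash functions for a 1% rate.
```

The rule of thumb that falls out: about 9.6 bits per element per 1% false-positive target, regardless of element size, which is why the filters stay tiny relative to the packs they index.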

There's a similar process inside LevelDB (http://leveldb.org/), and the same idea applies to any system that stores its data in multiple parts and knows that what it's looking for actually lives in a single part.



