If the keys being hashed are fixed-length, a tabulation hash can offer better collision resistance and performance characteristics than Jenkins. It's simple, easy to understand, and safe to use.
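For concreteness, here's a minimal sketch of simple tabulation hashing for fixed-length keys (4-byte keys and 64-bit words chosen purely for illustration): split the key into bytes, look each byte up in its own table of random words, and XOR the results together.

```python
import random

# Minimal sketch of simple tabulation hashing for fixed-length (4-byte) keys.
# One table of 256 random 64-bit words per key byte; the fixed seed is only
# so the example is reproducible -- use a real entropy source in practice.
rng = random.Random(0)
TABLES = [[rng.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key: bytes) -> int:
    """XOR together one random table entry per byte of the key."""
    assert len(key) == 4, "this sketch only handles 4-byte keys"
    h = 0
    for i, byte in enumerate(key):
        h ^= TABLES[i][byte]
    return h

print(hex(tabulation_hash(b"abcd")))
```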
1) The independent variable (the thing you are changing) usually goes on the x axis and the dependent variable (the thing you are measuring) on the y axis. Flipping the two forces a confusing in-head translation on anyone who reads graphs regularly.
2) Your plotting software is not drawing the line graph with the x values in monotonically increasing order, which produces zig-zags where the line doubles back on itself. Sorting the data by the x variable fixes it (addressing #1 could help too), or plot it as a scatter plot (the "." option in matplotlib); see the sketch after this list.
3) A detailed description of parameters under the plot would also be nice.
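To illustrate #1 and #2, a small matplotlib sketch (the data and axis labels are made up for the example): sort by the independent variable before drawing a line, or fall back to a scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: false-positive rate (dependent, y) vs. filter size
# (independent, x), arriving in no particular order.
x = np.array([8, 2, 16, 4, 1])
y = np.array([0.02, 0.30, 0.01, 0.12, 0.55])

order = np.argsort(x)            # sort by the independent variable
plt.plot(x[order], y[order])     # line now runs left to right, no zig-zags
# plt.plot(x, y, ".")            # alternative: a scatter plot sidesteps it
plt.xlabel("filter size (KiB)")
plt.ylabel("false positive rate")
plt.show()
```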
Regardless this is a cool example of bloom filters in action. Thank you!
Is there any "standard" for bloom filters? There are plenty of things bloom filters could be used for, and often they would be useful for communication, which means it should be easy for other parties to use the same implementation. So, is there a bloom filter library with wrappers for multiple languages, etc.?
The proximate cause might have been a false positive, but I think it's more fruitful to think of the likely root cause: picking a data structure without thinking through/understanding the ramifications ultimately cost you 100k USD.
My understanding is that it's a means of checking membership in a dataset very quickly, with a 100% guarantee of avoiding "false negatives" but no guarantees around false positives.
There are some interesting use cases for very high performance filtering operations, but it's probably overkill and overthink for the vast majority of issues out there. It's something you'd generally be looking at if you were trying to performance-tune a lot of checks against a big dataset.
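That matches the textbook definition. A minimal sketch of the idea (the class name, m and k defaults, and hashing scheme are just for illustration; real libraries tune these for you):

```python
import hashlib

class BloomFilter:
    """Minimal sketch: k bit positions per item, derived from blake2b
    with a different salt for each of the k hashes."""
    def __init__(self, m_bits: int = 1024, k: int = 4):
        self.m = m_bits
        self.k = k
        self.bits = 0  # a big int used as a bit array

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.blake2b(item, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True only means "probably present".
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add(b"needle")
assert bf.might_contain(b"needle")   # never a false negative
print(bf.might_contain(b"straw"))    # usually False, occasionally True
```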
In the Gnutella P2P protocol, searches are broadcast to the nearest neighbors in the (essentially random) graph, with a hop count to prevent infinite propagation and a list of recently seen query IDs to prevent cycles. This essentially causes a random graph with N-ary nodes to act like a tree with a branching factor of N-1. Think of it as running the spanning tree algorithm[0] for each query. (The Nth link is the link up to the parent of the node. This analysis ignores duplicate links.)

Peers exchange Bloom filters of the keywords for the files they have. That way, if the hop count indicates that a query has only one more hop to live, the Bloom filters are consulted to eliminate the last hop in cases where a leaf node doesn't have any matching files. (Since Bloom filters don't have false negatives, this pruning is safe/conservative.)

It might not sound like much to only save the last hop, but remember that for a tree with branching factor N-1, the leaves make up almost an (N-2)/(N-1) fraction of the nodes. So, if every peer is connected to 21 other peers, you get a 20-way tree, and a Bloom filter can at best save you almost 95% of your query traffic (a quick numeric check follows below).
There's lots of simplification in the above analysis (ignoring multiple paths to the same host, etc., etc.) but you get the gist of it.
(This assumes a query for some really rare keyword... the peers use sampling of their nearest neighbors to first estimate the hop count they need in order to get a target number of search results... so searches for popular keywords are unlikely to be helped by the Bloom filter, but they're also sent with a low hop count.)
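A back-of-the-envelope check of the ~95% figure, assuming an ideal complete 20-way tree (which the caveats above already flag as a simplification):

```python
# In a complete tree with branching factor b, the leaves dominate:
# leaves / total -> (b - 1) / b as the depth grows.
b = 20                      # 21 connections => effectively a 20-way tree
depth = 6                   # arbitrary depth; the ratio converges quickly
levels = [b ** d for d in range(depth + 1)]
leaf_fraction = levels[-1] / sum(levels)
print(f"{leaf_fraction:.3f}")   # ~0.950, i.e. ~95% of nodes are leaves
```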
I've used bloom filters in cross-referencing multiple data sources, where queries from one badly-optimized database were very expensive. In this case, the penalty of a false positive was merely a five-second lookup.
I wonder what kinds of stuff you could do with that in a fancy datastore like Elasticsearch or GraphDB.
If it could be mapped to a resource model in a straightforward way, it'd be fun to make a special "ORM" that plugs into an API framework to expose public services.
Bloom filters are used to greatly reduce the amount of straw when you're looking for a needle in a haystack.
Take bup (https://github.com/bup/bup/) for instance. It splits your content into chunks and stores them in GB-sized packs. With large backups you easily end up with hundreds of packs. During the splitting process it needs to search for a hash (the needle) among a huge number of chunks (the haystack). That's why bup has one bloom filter per pack, which means that instead of searching through hundreds of packs you only have to search through maybe 2 or 3. There are false positives (you know a hash can only realistically be contained in a single pack), but it is you, the developer, who chooses how many you accept. There is some complicated math that lets you balance the false-positive rate, the size of the structure, and the expected number of elements (see the sizing sketch below).
There's a similar process inside LevelDB (http://leveldb.org/), and the same idea applies to any system that stores its data in multiple parts and knows that what it's looking for can only live in a single part.
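For the "complicated math": the standard sizing formulas for a target false-positive rate p are m/n = -ln(p) / (ln 2)^2 bits per element and k = (m/n) ln 2 hash functions. A small sketch (the one-million-chunks-per-pack figure is made up for the example):

```python
import math

def bloom_parameters(n_elements: int, p_false_positive: float):
    """Return (bits, hash_count) for the standard optimal sizing:
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2."""
    m = math.ceil(-n_elements * math.log(p_false_positive) / math.log(2) ** 2)
    k = max(1, round(m / n_elements * math.log(2)))
    return m, k

# e.g. one million chunk hashes per pack at a 1% false-positive target
m, k = bloom_parameters(1_000_000, 0.01)
print(m // 8 // 1024, "KiB,", k, "hash functions")   # ~1170 KiB, 7 hashes
```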
The Power of Simple Tabulation Hashing: http://people.csail.mit.edu/mip/papers/charhash/charhash.pdf