Took me a few reads, but it's actually quite simple.
You want a structure that can tell you something has been seen, but sometimes forgets, but will never incorrectly tell you something has been seen.
Solution: an array. Hash the item to find its index, swap out what's there, and see if it is your item. If so, you know for sure it was previously placed. If not, then it might not have (it may have been forgotten).
let arr = Array of key
let contains_key key =
let index = hash key % arr.length
let prev_key = arr.swap[index, key]
return prev_key = key
The hash algorithm is crucial. Reducing forgetfulness is as simple as making the array longer. And he points out that if you can compress the keys, you can reduce storage size.
@jderick: Specifically, it functions the same as a cache with a random removal policy, assuming a good hash function. So a cache and a Bloom filter are opposites: the first knows when something hasn't been seen, and might know when it has; the second knows when something hasn't been seen, and might know when it hasn't. The only difference between this and a normal cache is that we're dropping the values.
On that subject, you might as well have an LRU valueless cache if locality is going to play a role.
Pretty much. There's a reason that there's no wikipedia page for this algorithm (where there is for Bloom Filter, of course) -- it's just a hash table without chaining (i.e. it simply drops collisions).
So combining the two will not give you something that will make up for each of the ones weaknesses and always give a definite answer, but will give one of three answers – definite positive, maybe, and definitive negative.
> So combining the two will not give you something that
will make up for each of the ones weaknesses and always
give a definite answer
True, although the probability of a false answer given that
we combine them can be smaller from the probability of the same false answer in the case when we use each of them separately. This may be important in cases when the aversion
to false answers is extremely high.
Seems like a cache using open addressing but no probing, just instant eviction on any initial collision.
So, could also consider adding probing to lessen destructive collisions (up to some chosen probing depth) while there are other unused slots, in still-constant average time. Or, any other choice-of-slots approach such as 'cuckoo hashing':
Whether you'd want to pay that cost could depend on how soon you reach saturation within each reset period, after which 'every' insert results in an eviction. But if you were usually in one-for-one eviction mode, you might add a bit or more that hints 'age' or 'recent access' to probed items. Then each eviction could be biased towards an older/less-accessed value... taking more advantage of the clustering-in-time that was reported.
And once you add that aging bit you might not need discrete full-restart periods ("hourly") at all: just make a useful proportion of older entries eligible for eviction each interval, maintaining a steady-state of full-array, evictions-biased-towards-older-entries.
These sorts of filters have limits to accuracy and usefulness (and a definite tradeoff between memory requirements and accuracy). You can take both to the extreme, while still maintaining the definition:
A structure that may report false positives, but no false negatives: A single bit set to 1
A structure that may report false negatives, but no false positives: A single bit set to 0
Obviously that's entirely useless, but I think it shows that you can't make any definitive proof of size requirements
I don't think that shows that you can't make any definitive proof of size requirements. All it shows is that you have to make a statement about the accuracy. For example, it might be possible to prove that for all accuracies, an optimal bloom filter with a particular accuracy will always be smaller than an optimal inverse bloom filter of the same accuracy.
You have to define more specifically what you mean about accuracy here. Accuracy for the bloom filter would be probability of false positives. For the inverse it would be probability of false negatives.
For a bloom filter, if you're targeting a false probability f with n items, you want k = lg(1)/f hash functions, and m = 1.44 k*n bits.
That's the thing though. The bloom filter version scales up easily from the degenerate version (adding more bits increases accuracy). Unless I'm missing something, the inverse bloom filter doesn't scale up from the degenerate version in any natural way.
This will nag at me now. I'm going to need to give this some thought. If I come up with any proof on mem reqs. I'll let you know.
This is a very good point, and one I think the author completely missed.
I use bloom filters in an application at work and was very interested to hear about this, wondering if some sort of hybrid might be useful. But when I saw that the hash map was storing the actual objects, I immediately dismissed the idea for my use case - we can't afford to store anything more than a handful of bits. (Our bloom filters set between 9 and 11 bits per signature.)
The key advantage to bloom filters is that the represent set identity lookups (albeit in an adjustable lossy false-positive way). They do not, however, store actual sets of data. They must be coupled with a storage mechanism (traditionally with a several-order-of-magnitude slower lookup speed than the bloom filter) to actually store the items contained within the set - the bloom filter is just a way of definitively answering the question "Could this set possibly contain this item?"
You'd need to make your claim more precise, in order to prove it, but a proof might go something like this: assume that an anti-bloom filter starts empty, and that it over time sees some small fraction of the possible objects (because objects are 64+ bits, say). If the anti-bloom filter usually reports "yes" for a significant fraction (even just 1%) of the seen objects, then an information theoretic argument tells you it must have size around the number of objects times the object size.
IIRC, the real reason for this "power of 2" is for fast rehashing when the table grows. Calculate a 32 bit hash and store it all, but if your table currently has 1024 entries, only use 10 bits of it. Later, when your table grows to 2048 entries, you can just mask off 11 bits worth instead of recomputing every hash.
But, since the hash table is not going to grow in this particular case, any perf gain from using a power of 2 will probably be far less important.
It's fairly common to keep array sizes bounded at a power of two, because when you take a hash you can use bitwise shifting instead of a modulus to determine the hash bucket to use. Depending on the performance of your hash function it can have a noticeable impact, although you're right that it's probably not a matter of 'much' faster.
I've noticed the difference between modulo and bitshift in java code written in the last five years. If it's getting 99.9+% cache hits, being used in an innermost loop, and jdk7 hasn't gotten around to JITing special cases in modulo, this could absolutely matter.
I believe you. There are two reasons I think paul's right in the context of hash tables.
1. If you're using a non-trivial hash function, one more modulo calculation at the end to get the actual memory address is not a big difference.
2. If you're using a trivial hash function, I bet for a lot of data sets you'll get fewer collisions with a hash table whose size is a prime number, canceling out any benefit from calculating the address slightly faster.
Python dicts will grow and grow as you add more keys to them, but one of the constraints mentioned in the article (and implicitly included, because he's talking about a variation on a bloom filter) is that the data structure should not grow without bound. They wanted to use a fixed-size array.
Also, in your example you're storing True as the value, but if you do that, you can't tell the difference between rbloom_array[hashfunc(item1)] and rbloom_array[hashfunc(item2)], if hashfunc(item1) == hashfunc(item2) (that is, they collide). You'd need to store the actual item in the data structure, not just a bit, and then you compare what you get out of the data structure to what you're trying to put in it. If the _items_ are equal then it's a duplicate and you can skip the write (to the database, from the example). If the items are not equal, you need to do the write.
Couldn't you build a composite data structure that internally makes use of both a bloom filter and one of these "opposite of a bloom filter" structures, and get the best of both worlds - no false positives and no false negatives?
Sadly not - instead you'd get disagreement from the two structures, with no way to resolve it.
If the bloom filter says 'yes I've seen it!', and the opposite filter says 'nope, not here', which is accurate? You know it's either a false-positive from the bloom filter or it a false-negative from the opposite filter, but that's the limit of your information. Effectively, you've partitioned your set into three: [definitely not, ambiguous, definitely seen].
If you want to know 100% of the items you've stored with full fidelity, you need to store information for all of them (e.g. in a hash table). Which is a valid thing to do, but requires a lot more memory than the author of this article was willing to use.
There's no free lunch; anything that can track a set perfectly is going to end up taking as much space as just maintaining it in a more straightforward way.
In the case of your specific suggestion, though, the problem is this: what if the bloom filter says "I've seen it (but could be lying)", and the "opposite" filter says "I've never seen it (but could be lying)"? Then you still have no idea whether the object is in the set.
This is an appealing idea and it took me a moment to see why it wouldn't work (other than the part where you simply don't have enough information).
It's correct to say that the bloom filter will produce no false negatives and the OOABF will produce no false positives, so if your OOABF returns true or your bloom filter returns false then you know for sure whether you've seen the item before. The trouble happens when your OOABF returns false and your bloom filter returns true, in which case you might have seen it before.
The thing is, when the bloom filter returns "I've seen it", you still need to resort to the array that has the actual keys.
A good way to look at the problem is that you have N keys to store (because you cannot have false positives). So with N keys, if you want less than sizeof(N)*N bytes, the only real answer is compression. So either you just discard data (what the OP does), or find a clever way to compress things.
The benefit is that you're only testing for existence, so that gives you some leeway. For instance, suppose you are getting random 8-byte user IDs. You could store, say, 1M of them in a hashtable, then take those items, sort them, and store deltas. Instead of 8 bytes per item, you'd only need 44 bits on average (2^64/1M) to store the differences. I believe this is what the OP suggested at the end of the article.
So, it really depends on the key types, and the penalty for forgetting an item. At a really high penalty (say, something that needs to perform an ACID SQL transaction), maintaining a compressed block of IDs is rather attractive. Whereas, if it's just sending an extra write to a high-perf write store, maybe it doesn't matter, and the OPs quick forgetful array is useful.
Um, wouldn't just a cache of the last N (based on memory available) items (or their hashes, whatever) just do the job, and do it better than his solution (which removes elements randomly based on collisions, rather than just removing the oldest element in the cache when it gets full, a much better approach given the time locality he describes)?
The description in this article seems to add too much complexity ("the opposite of a bloom filter") for what really is just a simple cache (albeit, IMO, a relatively inefficient one for the problem he describes).
That's not going to work at all. If I have an object A that hashes to bits 1, 5, 17, 24; an object B that hashes to bits 1, 5, 13, 15; and an object C that hashes to bits 2, 9, 17, 24; and I add objects B and C, then the filter will incorrectly think A has already been seen.
False positives are possible but rare due to hash collisions
When there is a collision you check to see if the object living there is the one you're trying to store. If it's not then you assume negative. If it is the object you're checking then it's positive but it's a correct positive.
If it's not the object you're checking then you evict the object living there and put in the one you're checking, thus allowing the one you evicted to cause a false negative again.
The whole system works BECAUSE of intentional collisions. If there were no collisions it would just be a hashmap and you'd have unbounded data structure size.
I re-read the article and I stand corrected. I had misread it as being just a store of the hash, but this stores the entire original object at the location identified by the hash of the object. And it needs to, because if even one byte is different out of a gigabyte of input data it must return false. That's horribly inefficient.
With a bloom filter, you can take gigabyte-sized video files and determine if they have never been seen by the bloom filter. And the size of the bloom filter is fixed; choose the size based on expected inputs, desired accuracy, and so on and it doesn't grow based on the size of the objects being processed. With this implementation, it stores not only the hash of the video but the entire byte array of the video at the hash location in order to not have a false positive.
So it's possible to have an opposite-of-a-bloom-filter with only 8 buckets, but it consumes terabytes of storage space because you're processing really massive files.
I'm pretty sure the structure was designed around relatively small objects. If you need to store large objects, then you can modify this structure to store the full un-masked hash instead of the object. Assuming no hash collisions (even with MD5 this should be a safe assumption if you're working with non-malicious data) the structure should behave identically.
Though this is because they're storing an ID to check against and not anything inherent in the hashing or data structure which makes it not only the opposite of a bloom filter but rather different in spirit too.