
The Opposite of a Bloom Filter - DanielRibeiro
http://somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/
======
MichaelGG
Took me a few reads, but it's actually quite simple.

You want a structure that can tell you whether something has been seen before:
it sometimes forgets, but it will never incorrectly claim that something has
been seen.

Solution: an array. Hash the item to find its index, swap the item with
whatever is there, and check whether what you got back is your item. If so,
you know for sure it was previously placed. If not, it may never have been
placed, or it may have been placed and forgotten (evicted by a collision).

    
    
      let size = 1 <<< 20                            // table length
      let arr : string array = Array.create size ""
      let contains_key key =                         // note: also inserts key
        let index = (hash key &&& 0x7FFFFFFF) % size
        let prev_key = arr.[index]                   // swap out whatever was there
        arr.[index] <- key
        prev_key = key                               // true => definitely seen before
    

The hash algorithm is crucial. Reducing forgetfulness is as simple as making
the array longer. And he points out that if you can compress the keys, you can
reduce storage size.

~~~
jderick
So it's just a cache?

~~~
JonnieCache
A lot of things turn out to be "just" caches.

Relatedly, the majority(?) of audio processing is just creative use of
buffers.

~~~
Dylan16807
Sure, but this is a rather simple cache. Hash into bucket, LRU per bucket,
stores the entire object. Nothing like the magic of a bloom filter that
represents an object with 3-5 bits.

------
gojomo
Seems like a cache using _open addressing_ but no probing, just instant
eviction on any initial collision.

So, could also consider adding probing to lessen destructive collisions (up to
some chosen probing depth) while there are other unused slots, in still-
constant average time. Or, any other choice-of-slots approach such as 'cuckoo
hashing':

<http://en.wikipedia.org/wiki/Hash_table#Cuckoo_hashing>

Whether you'd want to pay that cost could depend on how soon you reach
saturation within each reset period, after which 'every' insert results in an
eviction. But if you were usually in one-for-one eviction mode, you might add
a bit or more that hints 'age' or 'recent access' to probed items. Then each
eviction could be biased towards an older/less-accessed value... taking more
advantage of the clustering-in-time that was reported.

And once you add that aging bit you might not need discrete full-restart
periods ("hourly") at all: just make a useful proportion of older entries
eligible for eviction each interval, maintaining a steady-state of full-array,
evictions-biased-towards-older-entries.
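
Something like this, perhaps (an untested Python sketch of the
probing-plus-age idea; the probe depth and table size are arbitrary choices of
mine):

    
    
      SIZE, PROBE_DEPTH = 1 << 16, 4
      slots = [None] * SIZE                  # each slot holds (key, age) or None
      
      def contains_and_add(key):
          probe = [(hash(key) + i) % SIZE for i in range(PROBE_DEPTH)]
          for i in probe:                    # already present? refresh its age
              if slots[i] is not None and slots[i][0] == key:
                  slots[i] = (key, 0)
                  return True
          empty = [i for i in probe if slots[i] is None]
          # use a free probed slot if any, else evict the oldest probed entry
          victim = empty[0] if empty else max(probe, key=lambda i: slots[i][1])
          slots[victim] = (key, 0)
          return False
      
      def age_sweep():                       # run per interval, not a full reset
          for i, s in enumerate(slots):
              if s is not None:
                  slots[i] = (s[0], s[1] + 1)
    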

------
ajb
This is useful. However, it is larger than a bloom filter, because a bloom
filter needs only a single bit per slot where this stores a whole value. I
suspect that this asymmetry is unavoidable, but I can't prove it. Anyone?

~~~
gms7777
These sorts of filters have limits to accuracy and usefulness (and a definite
tradeoff between memory requirements and accuracy). You can take both to the
extreme, while still maintaining the definition:

A structure that may report false positives, but no false negatives: A single
bit set to 1

A structure that may report false negatives, but no false positives: A single
bit set to 0

Obviously that's entirely useless, but I think it shows that you can't make
any definitive proof of size requirements.

~~~
ajb
I don't think that shows that you can't make any definitive proof of size
requirements. All it shows is that you have to make a statement about the
accuracy. For example, it might be possible to prove that for all accuracies,
an optimal bloom filter with a particular accuracy will always be smaller than
an optimal inverse bloom filter of the same accuracy.

~~~
gms7777
Intuitively, I think that might be the case.

You have to define more specifically what you mean by accuracy here. Accuracy
for the bloom filter would be the probability of false positives; for the
inverse it would be the probability of false negatives.

For a bloom filter, if you're targeting a false positive probability f with n
items, you want k = lg(1/f) hash functions, and m ≈ 1.44*k*n bits.
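
To make that concrete (just plugging into the formula above): targeting f = 1%
gives about 7 hash functions and roughly 10 bits per item, independent of item
size:

    
    
      import math
      f = 0.01                       # target false positive probability
      k = math.log2(1 / f)           # ~6.64, round up to 7 hash functions
      bits_per_item = 1.44 * k       # ~9.6 bits per item
    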

That's the thing though. The bloom filter version scales up easily from the
degenerate version (adding more bits increases accuracy). Unless I'm missing
something, the inverse bloom filter doesn't scale up from the degenerate
version in any natural way.

This will nag at me now. I'm going to need to give this some thought. If I
come up with any proof on memory requirements, I'll let you know.

~~~
ajb
That would be interesting, thanks.

------
paul
I'm skeptical that "hashing is much faster on an array the size of a power of
two".

Also, if memory is actually an issue, he should store hashes instead of full
objects (assuming the byte arrays are more than 16 bytes).

~~~
AlisdairO
It's fairly common to keep array sizes bounded at a power of two, because then
you can use a bitwise AND with (size - 1) instead of a modulus to determine
the hash bucket. Depending on the performance of your hash function it can
have a noticeable impact, although you're right that it's probably not a
matter of 'much' faster.
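
To spell the trick out (a throwaway Python check; the values are arbitrary):
with a power-of-two size the mask and the modulus pick the same bucket, but
the mask avoids an integer division:

    
    
      size = 1 << 16                        # power-of-two table size
      h = hash("some key") & 0x7FFFFFFF     # force a non-negative hash
      assert h % size == h & (size - 1)     # same bucket, cheaper operation
    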

~~~
paul
I bet that there's no measurable difference in this code. It just bugs me to
see so much unnecessary complication based on blindly imitating an
optimization that maybe made sense 20 years ago.

~~~
spullara
It was 10% faster to switch from modulo to & on a recent real-world problem
for me.

~~~
brlewis
I believe you. There are two reasons I think paul's right in the context of
hash tables.

1. If you're using a non-trivial hash function, one more modulo calculation
at the end to get the actual memory address is not a big difference.

2. If you're using a trivial hash function, I bet for a lot of data sets
you'll get fewer collisions with a hash table whose size is a prime number,
canceling out any benefit from calculating the address slightly faster.

------
pdeuchler
This may be naive of me, but wouldn't it be better to leverage your language's
existing tools? For instance, in Python you would use a dict...

    
    
      >>> my_dict = {}
      >>> item = "example"         # any hashable key
      >>> my_dict[item] = True     # insert an item
      >>> try:
      ...    tmp = my_dict[item]
      ...    print "Found"
      ... except KeyError:
      ...    print "Not found"
    

I can't imagine a custom implementation would be more efficient, though there
may be other reasons to roll your own.

~~~
sbov
The problem is your solution (as shown) will grow in memory forever.

You can add code to get around that (e.g. expiration times), but it might be
easier (and use less memory) to write your own dict class that overwrites
values when a collision occurs.

Also, since the OP seems to be using java: if you can get away with storing a
primitive data type in an array (e.g. array of int or long) you can save a lot
of memory vs using a Set or Map.
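
Roughly like this, say (an untested Python sketch of the
overwrite-on-collision dict; a Java version over primitive arrays would follow
the same shape):

    
    
      class OverwritingMap:
          """Fixed-size map; a colliding key silently evicts the old entry."""
      
          def __init__(self, size=1 << 16):
              self.size = size
              self.keys = [None] * size
              self.values = [None] * size
      
          def put(self, key, value):
              i = hash(key) % self.size
              self.keys[i], self.values[i] = key, value
      
          def get(self, key):
              i = hash(key) % self.size
              return self.values[i] if self.keys[i] == key else None
    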

------
davidkellis
Couldn't you build a composite data structure that internally makes use of
both a bloom filter and one of these "opposite of a bloom filter" structures,
and get the best of both worlds - no false positives and no false negatives?

~~~
gwillen
There's no free lunch; anything that can track a set perfectly is going to end
up taking as much space as just maintaining it in a more straightforward way.

In the case of your specific suggestion, though, the problem is this: what if
the bloom filter says "I've seen it (but could be lying)", and the "opposite"
filter says "I've never seen it (but could be lying)"? Then you still have no
idea whether the object is in the set.

~~~
dllthomas
> There's no free lunch; anything that can track a set perfectly is going to
> end up taking as much space as just maintaining it in a more straightforward
> way.

Unless there are regularities in the data that you can exploit.

~~~
denom
But this invalidates the idea of a general-purpose algorithm.

~~~
dllthomas
Yes; there is no compression function that will work for all data.

------
afc
Um, wouldn't a cache of the last N items (N based on available memory), or
their hashes, do the job, and do it better than his solution? His version
removes elements essentially at random, based on collisions, rather than
removing the oldest element when the cache gets full, which seems a much
better fit for the time locality he describes.

The description in this article seems to add too much complexity ("the
opposite of a bloom filter") for what really is just a simple cache (albeit,
IMO, a relatively inefficient one for the problem he describes).
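
Something along these lines (an untested Python sketch of that last-N cache,
using an ordered dict for recency):

    
    
      from collections import OrderedDict
      
      class SeenLastN:
          def __init__(self, n):
              self.n, self.d = n, OrderedDict()
      
          def contains_and_add(self, key):
              seen = key in self.d
              self.d[key] = True
              self.d.move_to_end(key)          # mark as most recently seen
              if len(self.d) > self.n:
                  self.d.popitem(last=False)   # evict the oldest entry
              return seen
    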

------
cnlwsu
I had pretty much exactly this problem. We keep an index of row keys in
Cassandra on time-partitioned data (by hour), but do not want to re-insert the
index entry each time a piece of data arrives.

A ConcurrentLinkedHashMap
(<http://code.google.com/p/concurrentlinkedhashmap/>) solved it easily. It
does use more RAM though... but an insignificant amount on our heap.

~~~
sprsquish
This is almost exactly our use case. I slightly altered the implementation to
be more GC-friendly, though:
<http://squishtech.posterous.com/addendum-to-the-opposite-of-a-bloom-filter>

------
scotty79
I'd implement it exactly like a bloom filter, but instead of bits I'd use the
values 0, 1, and 2: 0 means never touched, 1 means I've seen it, 2 means I've
seen it more than once.

A positive (already seen) is when all the places for a given entry contain 1.

Adding increments all of the entry's places (capping them at 2).

Before putting a new value in, check whether it's already there. If it is,
don't add it.

~~~
eridius
That's not going to work at all. If I have an object A that hashes to bits 1,
5, 17, 24; an object B that hashes to bits 1, 5, 13, 15; and an object C that
hashes to bits 2, 9, 17, 24; and I add objects B and C, then the filter will
incorrectly think A has already been seen.
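
You can check this directly (a quick Python snippet using the bit positions
above):

    
    
      from collections import Counter
      
      A = {1, 5, 17, 24}; B = {1, 5, 13, 15}; C = {2, 9, 17, 24}
      table = Counter()
      for obj in (B, C):                         # insert B and C
          for pos in obj:
              table[pos] = min(table[pos] + 1, 2)
      print(all(table[p] == 1 for p in A))       # True: A wrongly looks seen
    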

~~~
scotty79
You are right. I haven't properly thought that through.

------
mtrimpe
Another option: if you know the bounds of the parameter space (e.g. all
natural numbers under 1000), you can implement this with a counting bloom
filter that has every possible element pre-inserted.

------
webreac
How to give a fancy name to a simple algorithm using a hash map. I use perl a
lot, and as a result this kind of algorithm comes up all the time.

~~~
webreac
In fact, it is simpler than a hash. The provided code is not complete (the
one-hour limit is missing). Here is a fully functional solution to the
problem:

    
    
      perl -MDigest::MD5 -nle '($id,@data)=split;$t=time;$j=unpack("L", Digest::MD5::md5($id))%100;print join(" ",$id,@data) if $t-$h{$j}>3600 or $hi{$j} ne $id;$h{$j}=$t;$hi{$j}=$id' input

------
biot
False positives are possible but rare due to hash collisions. With a bloom
filter, false negatives are impossible.

~~~
anonymoushn
False positives are not possible in the structure discussed in the OP.

~~~
biot
I re-read the article and I stand corrected. I had misread it as being just a
store of the hash, but this stores the entire original object at the location
identified by the hash of the object. And it needs to, because if even one
byte is different out of a gigabyte of input data it must return false. That's
horribly inefficient.

With a bloom filter, you can take gigabyte-sized video files and determine if
they have never been seen by the bloom filter. And the size of the bloom
filter is fixed; choose the size based on expected inputs, desired accuracy,
and so on and it doesn't grow based on the size of the objects being
processed. With this implementation, it stores not only the hash of the video
but the entire byte array of the video at the hash location in order to not
have a false positive.

So it's possible to have an opposite-of-a-bloom-filter with only 8 buckets,
but it consumes terabytes of storage space because you're processing really
massive files.

~~~
eridius
I'm pretty sure the structure was designed around relatively small objects. If
you need to store large objects, then you can modify this structure to store
the full un-masked hash instead of the object. Assuming no hash collisions
(even with MD5 this should be a safe assumption if you're working with non-
malicious data) the structure should behave identically.
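
A rough sketch of that variant (untested Python; MD5 and the table size are
just illustrative choices): each slot stores a fixed-size digest instead of
the object, so slot cost stays constant no matter how large the object is:

    
    
      import hashlib
      
      SIZE = 1 << 16
      digests = [None] * SIZE    # 16-byte digests, not the objects themselves
      
      def contains_and_add(data):
          d = hashlib.md5(data).digest()
          i = int.from_bytes(d[:4], "big") % SIZE   # slot from a digest prefix
          prev, digests[i] = digests[i], d
          # full-digest compare: no false positives barring an MD5 collision
          return prev == d
    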

~~~
bbrtyth
I can imagine the head scratching over this design decision 500 years in the
future when everyone is an AI that has bajillions of files. Or one really
unlucky guy tomorrow.

