
The Opposite of a Bloom Filter (2012) - espeed
http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/
======
Rangi42
So in short, the opposite of a Bloom filter is a hash-map where, if you add
two items that have the same hash, the second one _replaces_ the first instead
of them both being stored in a linked-list bucket.

> The old id at that index is compared to the new id just added. If they
> differ, containsAndAdd returns false and the process should write the event.
> If the ids are the same, containsAndAdd returns true and the process can
> safely drop the event on the ground. Occasionally, two byte arrays may hash
> to the same index. When this happens, containsAndAdd will return false and a
> duplicate write will be emitted. However, since duplicate items come close
> together in time, this is relatively rare.

~~~
adrianN
So the opposite of a bloom filter is a cache with a random replacement
strategy. For some applications they work okay, [1], but for other
applications they're worse than FIFO, [2].

[1] [http://arxiv.org/abs/1202.4880](http://arxiv.org/abs/1202.4880)

[2]
[http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf](http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf)

------
mistercow
This seems like an alright solution, but I think an LRU or LFU cache would
perform better (they wouldn't have pathological cases like two colliding
events that frequently alternate), and you probably wouldn't have to implement
it yourself.

I also think it's a bit much to call this "the opposite of a bloom filter". A
bloom filter has a lot of properties that don't come through in the algorithm
proposed.

For one thing, the space required is O(MN) (size of an element * size of the
filter) instead of a bloom filter's O(N). For another, bloom filters give the
same results regardless of what order previous elements were added in. It
seems to me that if you're looking for "the opposite of a bloom filter",
you're probably wanting those properties.

~~~
sweezyjeezy
The space requirement is surely impossible to fulfil generically. As soon as
you do any sort of compression you make it possible for two elements to map to
the same thing. Unless you know that M << N, in which case you can just save
exactly one element, and just replace it every time you see a new one!

~~~
mistercow
That's basically true. Technically you can meet these requirements by
exploiting the false negatives loophole, but if the data structure can _ever_
report that it contains a size M element, it will need O(M) space.

------
Strilanc
So... a cache. You keep track of a bounded number of items, and can avoid
doing some work when a currently-tracked item shows up again.

~~~
eridius
The first footnote in the article says:

> _[1] It’s been 3 years since I wrote this article, and every time it pops
> back up people are like “so, a cache”, and every time I’m like “that wasn’t
> obvious from context?” and so now I’ve added this text._

~~~
Strilanc
I missed that.

It certainly wasn't clear to me from the rest of the post that the author
realized they were talking about a cache. Possibly because I'd be tempted to
make that the _focus_ , since I consider "by the way we just re-invented the
cache by accident!" to be an excellent climax-twist on a post like this. You
think it's about this cool new data structure, but then it's actually about
thinking from the ground up or about reasoning from multiple perspectives.

------
mooneater
Combine a bloom filter with opposite-of-bloom and get: a) a lossless
compression data structure (no false negatives or positives) OR b) garbage
(both false negatives AND positives)

~~~
eridius
Interesting idea. Took me a moment to figure out what the actual results would
be (beyond "garbage"). But unfortunately it doesn't sound very useful. What
you'd end up with is a structure that reports a tri-state value: "Yes I've
definitely seen it", "No I've definitely not seen it", and "I have no idea".
With the added drawback of having to know your data set size in advance to
tune the bloom filter (or resetting it periodically as the article suggests,
but that weakens the "No" answer to "Not since my last reset").

~~~
fit2rule
Pipe the I have no ideas back into another set of bloom/nonbloom stages, the
output of that back into this one, and the network becomes a computer.

------
gobengo
I can't hate on this guy for sharing a real-world situation and what he
designed.

Though I was expecting something exotic that can help with set membership
problems, but instead of being log n in size it would be exponential... which
would be comically useless (...or is there a use?)

------
robryk
So, every entry is either stored in memory or will be emitted again. This is
very different than in Bloom filters (though, admittedly, you can't do better
than that on this problem).

