Hacker News new | comments | show | ask | jobs | submit login
The opposite of a bloom filter (somethingsimilar.com)
76 points by Anilm3 67 days ago | hide | past | web | favorite | 25 comments

I would argue that the opposite of a bloom filter doesn't really exist, at least not in a satisfying way. A bloom filter's size is dependent only on the desired false positive rate, whereas its opposite must be dependent on the size of the data. (And don't be fooled by data that can be represented by a primary key, that's not as general as a bloom filter.) I tried, with limited success, to explain my point of view in this answer on StackExchange: https://cstheory.stackexchange.com/questions/6596/a-probabil...

This probably runs afoul of your "at least not in a satisfying way" constraint, but:

It is pretty easy (an exercise) to implement the "opposite of a Bloom filter" if you start from a summary of the complete set of events and support deletion, rather than starting from the empty set and supporting addition.

What makes everything seem hard is the (often unstated) requirement that you start from an empty set and support addition, which is roughly as hard as implementing a Bloom filter that starts from the complete set and supports deletion. Neither of the links make this requirement explicit (though, it is implicit in their "motivation" sections).

Bloom filters scale logarithmically with false positive rate and linearly with the number of items stored.

The article doesn't mention the false negative rate. Unlike a Bloom filter, it'll depend on order when the input includes repeated elements. But in general, required memory will increase quadratically in the number of items stored at a constant false negative rate (because of the "birthday paradox").

So it isn't the opposite of a Bloom filter. But what is?

TLDR: a cache with LRU eviction, but only storing the keys, not the values.

I was thinking the same thing

"A Bloom filter is a data structure that may report it contains an item that it does not (a false positive), but is guaranteed to report correctly if it contains the item (“no false negatives”)."

I'm afraid that is not how it works. A Bloom filter can tell whether an item may be in the set (false positive) but can definitely tell an item is NOT in the set (no false negative).

It's correct; you've misread it — in fact, you're agreeing with it. What it's saying is, if the item is in the set, the Bloom filter is guaranteed to report that it is in the set. What you're saying is, if the filter says the item is not in the set, then it is guaranteed not to be in the set. Those two statements are equivalent (being contrapositives: "A implies B" is equivalent to "not B implies not A").

The article states a Bloom filter is "guaranteed to report correctly if it contains the item", however a Bloom filter cannot do this. The bits set by the various hash functions could very well be set due to some other key. What the Bloom filter _can_ say, is that if none of the bits are set, then clearly the key was never inserted.

> The article states a Bloom filter is "guaranteed to report correctly if it contains the item", however a Bloom filter cannot do this.

I think what's going on here is that you're reading "if" as meaning "whether". In fairness, this is common English usage: "I'll tell you tomorrow if I'm going" usually means "I'll tell you tomorrow whether I'm going".

That's not what the OP means. The correct reading is perhaps clarified with a comma:

> A Bloom filter is guaranteed to report correctly, if it contains the item.

or perhaps better

> If it contains the item, a Bloom filter is guaranteed to report correctly.

Ah yes, that would be it.

IMO it is better to present the Bloom filter by asking/stating what one can be certain of given each return value of the Bloom filter.

I think it might just be a strangely formed sentence where the part "if it contains the item" is not the subject of the report, but a condition:

If it contains the item -> reports that it contains the item (correctly)

If it doesn't contain the item -> may report that it contains the item (incorrectly)

Nothing wrong with being pedantic when dealing with definitions.

Since we're on the topic of being pedantic, "well, you know, that's just, like... your opinion, man". I'd argue that there's nothing excessive about arguing if A may, in fact, be the opposite of A :)

Certainly not, I'm all for some arguing on the internet.

The logic you wrote is exactly the _opposite_ of what a Bloom filter does.

"If the item is in the set, the Bloom filter is guaranteed to report that it is in the set." This statement while correct is not useful since the query is against the Bloom filter, not the set. The Bloom filter reporting the object's hash key is in the filter does NOT mean the object is in the set. Multiple objects can have the same hashkey. A check of the hashkey on the Bloom filter just means ONE of these objects is in the set. The existence of a hashkey in the Bloom filter does not give much useful information.

He is saying "contains implies report". You are saying "not-report implies not-contain".

These are logically equivalent.

They are not. Multiple objects can map to the same hash key due to collision. Just because a hashtable containing the hash key of an object does not mean the object has been seen before. An object's hash key not contained in the hashtable definitely assures that the object has not been seen before.

This is called the contrapositive of your statement. It's logically guaranteed.

No one is claiming that "if the hash matches the object must be there". They're saying "if the object is there, the hash will match". This is logically equivalent to saying "if the hash does not match, the object is not there".

GP is right; you need to be careful about how you read it.

“contains implies report” is different than “report implies contains”. You are (correctly) arguing the against latter statement, which GP does not make.

I believe it is a typo. The rest of the article does align to that concept.

At first glance I do disagree that his/her solution will be as efficient in terms of size of the set and as performant as bloom filters or cuckoo filters. But I would have to benchmark it to be sure.

I can see what's confusing you, but it's correct as stated.

Item is not a reference to the physical data structure in memory. If you treat it as a black box where you put items in and then ask it if it has seen the 'item' then the wording is more clear.

"It's a cache." The use case is deduplication in an event-stream environment. This calls for exact matching without hash collisions.

Low memory version:

`return false;`

Not sure why it's getting fownvoted so much. On large datasets it's just as effective...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact