
Probabilistic M2M Relationships Using Bloom Filters - luu
http://zacharyvoase.com/2012/08/31/m2mbloom/
======
wtfleming
Maybe i'm misunderstanding, but wouldn't this mean that there would be no way
to safely remove a relationship? I guess there are a number of cases where
this would be ok.

But people tend to make mistakes. When you discover that someone has added a
link between Daniel Day-Lewis and Ernest Goes to Jail all you would need to do
is delete a row in the movies_people table. But because of potential
collisions in the bit arrays you can't safely remove a relationship?

~~~
Smerity
You're indeed correct as this is built upon bloom filters and standard bloom
filters don't support deletion[1]. You could reconstruct the bloom filter
minus the deleted element from each of the individual relations but then you'd
need the relation table.

The other substantial issue with the proposed scheme is false positives. In
this situation, you have no way to ensure that a given relation is not an
accidental match. For that you would again need the relations table.

What's fun about this story is that the author has inadvertently re-discovered
the bloom join[2] - he's just missing one of the important steps.
Traditionally bloom filters are used for a fast, initial check which is then
followed by a full check to remove false positives.

[1]: There are some modifications to bloom filters that support deletion but
they're all more complex than bitstrings.

[2]: [http://www.databasejournal.com/features/oracle/oracle-
bloom-...](http://www.databasejournal.com/features/oracle/oracle-bloom-
filters.html)

~~~
brudgers
Sometimes a few false positives don't matter. Or matter less than latency.
Nobody dies if a query returns Daniel Day Lewis as the star of _Ernest Goes to
Camp_ and many people might just write it off as human error...even if it
isn't.

------
DonHopkins
Came here looking for some Grindr social networking data analysis. Was
disappointed.

------
neil_s
Perfect timing, was just thinking about ways to do this, to store which images
have already been viewed by the user, with fast lookups to ensure that the
next image hasn't already been seen by them.

------
klkvsk
The guy did not add indexes for junction table, that is why it took so much
time. The benchmark is incorrect.

