
Probabilistic Many-to-Many Relationships (with Bloom Filters) - zacharyvoase
http://blog.zacharyvoase.com/2012/08/31/m2mbloom/
======
fusiongyro
This is a really cool technique and warrants some investigation, but I can't
let this go unaddressed:

> and the upper bound on the time taken to join all three tables will be the
> square of that

These kinds of first-principles assertions about what Postgres's (or other
DBs') performance will be like sound helpful but usually aren't. The kinds of
queries you issue can change everything. Indexing can change everything.
Postgres's configuration can change everything. Actual size of the table can
change everything. For example, if the table is small, Postgres will keep it
in memory and your plans will have scary looking but actually innocent
sequential scans, which I think actually happened in his join table example.

Anyway, it's good to have a lot of tools in your toolbox, and this is an
interesting tool with interesting uses. I just think it would be a grave error
to take the performance ratios here as fixed.

~~~
adgar
This is why DBAs are a thing. Just sit back and enjoy the rediscovery of
relational datastores in the blogosphere. The "thought leaders" are going to
slowly figure out things like your post that were worked out properly in the
80s.

------
zacharyvoase
OP here.

The lack of an index on the junction table definitely did have a major effect.
By just doing the following:

    
    
        CREATE INDEX ON movie_person (movie_id);
        CREATE INDEX ON movie_person (person_id);
    

The junction query speeds up to around 2ms—it's comparable to or faster than
the bloom filter query. But the trade-off is revealed when you see the total
size of the movie_person table (including indexes):

    
    
        SELECT pg_size_pretty(pg_total_relation_size('movie_person'));
        => 45MB
    

Whereas, by my calculations, the bloom filters and hashes on movie and person
add just 2094kB in total.

I plan on adding an erratum to my article explaining my error, the time/memory
trade-off, and ideas for further improvement or exploration, potentially
including bloom-based GiST indexes and the opportunities for parallelization.

~~~
fdr
Also look at GIN. Although the documentation claims it is typically larger
than GiST, in certain cases (a small lexeme set, which is not true of
general-case full-text search) it can be smaller and faster. Mostly this holds
over constrained symbolic domains (which describes a lot of computing, too).

We've recently replaced a GiST index with a GIN one and got a lot more than
the 'rule of thumb' speedup. More like an order of magnitude. Or two. So much
so that Tom Lane suggested we might have stumbled onto a bug in GiST, so we'll
see.
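
If you want to try the comparison yourself, both are one-line index builds
over a tsvector column (the table and column names below are hypothetical,
not from the article):

    
    
        -- GiST vs. GIN over the same full-text column:
        CREATE INDEX docs_tsv_gist ON docs USING gist (tsv);
        CREATE INDEX docs_tsv_gin  ON docs USING gin (tsv);
    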

------
slig
Now _that's_ the kind of content that I'd love to see more of here. It covers
everything from basic tools like sed and awk to concepts I didn't know, like
the BIT field in Postgres.

Any recommended book or set of articles for starting with Postgres?

~~~
skb_
I found the postgres manual [1] to be quite sufficient. It covers pretty much
anything you might want to do.

[1] - <http://www.postgresql.org/docs/9.1/static/>

~~~
slig
Thanks, will take a look!

------
DEinspanjer
I like the exploration of this method, but I would have liked to see an
actual measurement of the false positives. Bad data can be acceptable in
statistical analysis, but if you were showing someone a list of their ratings
or the actors who were in the latest Kevin Bacon movie, false positives have a
much stronger impact.

Is there any chance that the bloom could be used as a short-circuit filter but
still follow up with the m2m join to filter out the false positives? If the
query optimizer can take advantage of that, then you could likely balance the
size and cost of the bloom field.
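
Something like this sketch, using the article's person_filter/hash columns
(the exact containment predicate, which assumes the filter and hash are
equal-length bit strings, and the EXISTS recheck are my assumptions, not from
the post):

    
    
        -- Bloom test as a cheap candidate filter; EXISTS recheck drops false positives.
        SELECT m.*
        FROM movie m, person p
        WHERE p.id = 160
          AND m.person_filter & p.hash = p.hash       -- candidates (may include false positives)
          AND EXISTS (SELECT 1 FROM movie_person mp   -- exact verification via the junction table
                      WHERE mp.movie_id = m.id
                        AND mp.person_id = p.id);
    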

~~~
spullara
This is exactly right. Bloom filters used in this way should generate
candidates, not results, to reduce the total amount of work you do when it is
expensive to confirm membership in the set. The best use cases check the
opposite, e.g. Kevin Bacon was definitely not in this movie, since there are
no false negatives.
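
For example, the guaranteed-negative direction might look like this (a sketch
assuming the article's bitwise containment test; the name column is
hypothetical):

    
    
        -- Any hash bit missing from the filter means membership is definitely
        -- false: bloom filters have no false negatives.
        SELECT (m.person_filter & p.hash) <> p.hash AS definitely_not_in_movie
        FROM movie m, person p
        WHERE m.id = 1 AND p.name = 'Kevin Bacon';
    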

------
mhale
Using a custom index implementation that uses bloom filters internally is
probably going to work out better in the long run. It should be way more
efficient than storing the data in a bit field, using app-layer code to
generate the bloom filter values, then doing bitwise comparisons on-the-fly at
query time.

The Postgres query planner can also recheck constraints automatically to
recover from bloom filter false positive matches at query time.

FYI -- bloom filters are already used internally within the PostgreSQL
intarray contrib module and the full-text search functionality.

See: [http://postgresql.1045698.n5.nabble.com/bloom-filter-indexes...](http://postgresql.1045698.n5.nabble.com/bloom-filter-indexes-td1899247.html) and <http://code.google.com/p/postgres-learning/wiki/BloomFilter>

EDIT: for clarity, typo correction

------
ocharles
These seem to produce entirely different sets of results:

For the bloom filter: (actual time=0.033..2.546 rows=430 loops=1)

And for the join: (actual time=7.440..64.843 rows=9 loops=1)

So the join returned 9 movies for person_id=160, while the bloom filter
returned 430.

I understand it's a probabilistic model, but that's a pretty whopping
difference in data. Have I missed something?

~~~
Groxx
I saw that too... I'm suspicious of the `SET person_filter = (SELECT
bit_or(person.hash) ...);` line, personally. If the hash function is supposed
to produce roughly random bits (I'm not familiar with murmur hash though, so
this may not be true), and you're simply OR-ing them together, you run out of
zero bits pretty quickly.

But I need to brush up on murmur hash, and bloom filters in general, so I may
be entirely wrong. Maybe it's valid, and maybe that does represent a 0.5%
error.
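
(For reference, the standard bloom-filter analysis covers exactly this
concern. Assuming each hash sets k roughly uniform bits in an m-bit filter,
and n of them are OR-ed together:

    
    
        \Pr[\text{bit still } 0] = \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m},
        \qquad
        \text{false-positive rate} \approx \left(1 - e^{-kn/m}\right)^{k}
    

So the zero bits do vanish exponentially as n grows; whether the quoted 0.5%
error holds depends on m, here reportedly 1632 bits, being sized for the
expected number of related rows.)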

------
brown9-2
Regarding the space consumption, it seems like this is a tradeoff of storing
two integers (plus the row overhead) per rating versus storing 335 bytes per
movie/user.

For the join table, that's 2 integers * 575,281 ratings * 4 bytes = 4,602,248
bytes used in the join table.

With the filter, each movie row needs to store 1632 bits for the
person_filter and 1048 bits for the hash, which is (1632 + 1048) / 8 = 335
bytes, so 3,883 movies * 335 bytes = 1,300,805 bytes.

Each user row needs to store the same number of bits for its filter and hash,
so 6,040 users * 335 bytes = 2,023,400 bytes.

Is my math here wrong? With this approach you save about 1.22MB, or about 27%
over the join table approach (ignoring how much overhead there is for each row
of the table and each page to store the table in).

Depending on the dataset it doesn't seem like the space savings would be worth
the sacrifice in accuracy.

~~~
SeoxyS
The benefit is not in the data set size, it's in performance of querying. 63ms
vs. 0.03ms is a HUGE difference.

~~~
nierman
or you could add an index to the join table and see a similar speedup without
the accuracy tradeoff. I'm not saying that the approach doesn't have
applications, but it would be much more useful if the post discussed a case
that couldn't be addressed easily by standard means.

------
tmoertel
EDITED ONE LAST TIME – I PROMISE, I REALLY DO, THIS IS THE LAST TIME – TO ADD:

I think the reason that OP's join-based queries are slow is that there are no
indexes over his junction table's foreign keys:

    
    
        CREATE TABLE movies_people (
          movie_id INTEGER REFERENCES movie,
          person_id INTEGER REFERENCES person
        );
    

Thus when he wants the movies associated with person 160, the database must
examine the entire junction:

    
    
        EXPLAIN ANALYZE SELECT * FROM movies_for_people_junction WHERE person_id = 160;
    
        Hash Join  (cost=282.37..10401.08 rows=97 width=33) (actual time=7.440..64.843 rows=9 loops=1)
          Hash Cond: (movie_person.movie_id = movie.id)
          ->  Seq Scan on movie_person  (cost=0.00..10117.01 rows=97 width=8) (actual time=2.540..59.933 rows=9 loops=1)
                Filter: (person_id = 160)
          ->  Hash  (cost=233.83..233.83 rows=3883 width=29) (actual time=4.884..4.884 rows=3883 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 233kB
                ->  Seq Scan on movie  (cost=0.00..233.83 rows=3883 width=29) (actual time=0.010..2.610 rows=3883 loops=1)
        Total runtime: 64.887 ms
    

Note the sequential scan on movie_person that accounts for 2.5 to 60 seconds.
If there were an index on movie_person(person_id), this could be an index
scan.

(EDITED TO ADD: I _totally_ misread the timings in milliseconds for timings in
seconds, so most of what I originally wrote is off by a factor of 1000. I'm
leaving it here for entertainment value and because you might want to play
with the data set in SQLite. But my point is still valid: a vanilla join is
comparable in performance to the OP's bloom-filter method.)

I'm having a hard time believing that the straightforward join on a data set
as small as the OP's sample is really going to take 65 seconds on PostgreSQL.
Maybe that's what EXPLAIN predicts (with spotty stats, I'd wager), but EXPLAIN
is not a reliable way to measure performance. For this data, I'd expect real
queries to perform much better.

EDITED TO ADD: The OP's article shows the results for EXPLAIN ANALYZE, which
ought to have performed the queries. So I'm not sure why the results are so
slow.

Heck, even SQLite, when processing a superset of the data set on my 4-year-old
computer, can do the OP's final query (and return additional ratings data)
almost instantly:

    
    
        $ time sqlite3 ratings.db '
            select *
            from
              users
              natural join ratings
              natural join movies
            where user_id = 160
          ' > /dev/null
    
        real    0m0.006s
        user    0m0.002s
        sys     0m0.004s
    

If you want to try some queries yourself, here you go:

    
    
        $ wget http://www.grouplens.org/system/files/ml-1m.zip
        $ unzip ml-1m.zip
        $ cd ml-1m
        $ sqlite3 ratings.db <<EOF
    
          CREATE TABLE movies (
              movie_id INTEGER PRIMARY KEY NOT NULL
            , title TEXT NOT NULL
            , genres TEXT NOT NULL
          );
    
          CREATE TABLE users (
              user_id INTEGER PRIMARY KEY NOT NULL
            , gender TEXT NOT NULL
            , age TEXT NOT NULL
            , occupation TEXT NOT NULL
            , zipcode TEXT NOT NULL
          );
    
          CREATE TABLE ratings (
              user_id INTEGER REFERENCES users(user_id)
            , movie_id INTEGER REFERENCES movies(movie_id)
            , rating INTEGER NOT NULL
            , timestamp INTEGER NOT NULL
            , PRIMARY KEY (user_id, movie_id)
          );
    
          .separator ::
          .import movies.dat movies
          .import users.dat users
          .import ratings.dat ratings
    
        EOF
    

Fully enumerating the joined data takes under a second:

    
    
        $ time sqlite3 ratings.db '
            select count(*)
            from
              users
              natural join ratings
              natural join movies
          ' 
        1000209
    
        real    0m0.953s
        user    0m0.925s
        sys     0m0.021s
    

And even going so far as to _print_ the joined data takes under six seconds:

    
    
        $ time sqlite3 ratings.db '
            select *
            from
              users
              natural join ratings
              natural join movies
          ' > /dev/null
    
        real    0m5.586s
        user    0m5.497s
        sys     0m0.059s
    

Sometimes, the old stuff works better than we give it credit for.

~~~
jaekwon
Skimmed through it, but it seems the bloom filter still gives an improvement
by a factor of 2~3. Is that correct? (estimated 6(?)ms once corrected with
indices, vs 2ms with the bloom)

~~~
wiredfool
This sample set is way too small to be arguing about a millisecond here or
there. The variation from what's cached, or from whatever else happens to be
running on the machine, is going to affect the benchmarks. For a 'real'
benchmark, I'd like to see a few gigs of data, or something that's not going
to fit in cache. The real gain from the bloom filter is going to be a 4-10x
reduction in the working set for the same amount of data. But when the working
set is trivial, the fluctuating and fixed overheads are going to dominate.

I'd also like to see how good the bloom filter's indexing is at scale, or if
it winds up being some sort of table scan.

~~~
tmoertel
For a "real" benchmark, I'd like to see the query repeated until the response-
time distribution stabilizes. Then take the response time at the 90th, 95th,
and 99th percentiles, and also the mean. Actually, I'd like to see the CDF
plot for the whole distribution.

------
b0b0b0b
Thanks for this interesting writeup; it's definitely thought provoking.

Because you can't index that bloom column, it seems you'd always be doing
full table scans.

In fact, it doesn't appear that any indexes were used throughout this whole
exercise; is that right?

~~~
asharp
I believe that you can index the bloom column, to some extent, with some
hacks.

A major problem is the lack of indexes on array elements, which then forces
you to build an otherwise unneeded table.

That being said, it's definitely doable, to some extent.

~~~
mhale
FYI - the intarray contrib module for Postgres does support indexing of arrays
of integers.

[http://www.postgresql.org/docs/9.1/static/intarray.html#AEN1...](http://www.postgresql.org/docs/9.1/static/intarray.html#AEN134364)

And it happens to use bloom filters under-the-hood to do it.

From the docs:

"Two GiST index operator classes are provided: gist__int_ops (used by default)
is suitable for small- to medium-size data sets, while gist__intbig_ops uses a
larger signature and is more suitable for indexing large data sets (i.e.,
columns containing a large number of distinct array values). The
implementation uses an RD-tree data structure with built-in lossy
compression."

That bit about the lossy compression in the implementation is referencing the
use of a bloom filter. Also mentioned here:
[http://archives.postgresql.org/pgsql-hackers/2005-05/msg0147...](http://archives.postgresql.org/pgsql-hackers/2005-05/msg01475.php)

~~~
asharp
I saw that, but it doesn't seem to implement any of the indexes you'd need in
order to run a bloom query efficiently.

That being said, if you stored your bitvector as an array of powers of two,
then it would work. But that would be horribly inefficient in terms of space
usage.

~~~
saurik
Which is why PostgreSQL is scriptable: the various contrib modules are often
better looked at as examples of how to build your own indexes using GIN/GiST
than "this is what we provide".

In your case, though, a strict immutable function mapping the bitvector to an
int array as part of a functional index should be sufficient to use the
existing contrib module: you don't need to store the things you index in the
table.
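
A minimal sketch of that expression-index idea, assuming the article's
bit-string person_filter column (the function name and opclass choice are
mine, not from the article):

    
    
        CREATE EXTENSION intarray;
    
        -- Strict immutable function: positions of the set bits in a bit string.
        CREATE FUNCTION set_bit_positions(b bit varying) RETURNS int[] AS $$
            SELECT array_agg(i)
            FROM generate_series(0, length($1) - 1) AS i
            WHERE get_bit($1, i) = 1
        $$ LANGUAGE sql IMMUTABLE STRICT;
    
        -- Expression index using intarray's large-signature (bloom-ish) opclass.
        CREATE INDEX movie_person_filter_gist ON movie
            USING gist (set_bit_positions(person_filter) gist__intbig_ops);
    

Queries would then test containment with intarray's @> operator against the
set-bit positions of the person's hash, which the planner can satisfy with
the index.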

------
pr0filer_net
Nice article!

I see the table `movies_people` uses (SIGNED) INTEGERs as its datatypes, but
they reference an UNSIGNED BIGINT (SERIAL).

~~~
saurik
No: "integer" on PostgreSQL is, in fact, a signed 32-bit number, but so is
"serial". The difference between serial and integer is only that usage of the
serial type as a column implicitly creates a sequence (which itself is a
64-bit counter, but is unrelated to the storage in the column), sets its owner
to the column, and makes the column's value default to the next value of the
serial. PostgreSQL also has the type "bigint", which is a signed (not
unsigned) 64-bit number, and an equivalent "bigserial" type which you would
need to use if you want a 64-bit version of serial.
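
Concretely, the equivalence is roughly this (a sketch of what the serial
shorthand expands to):

    
    
        -- "id SERIAL" is approximately shorthand for:
        CREATE SEQUENCE movie_id_seq;
        CREATE TABLE movie (
            id integer NOT NULL DEFAULT nextval('movie_id_seq')
        );
        ALTER SEQUENCE movie_id_seq OWNED BY movie.id;
    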

------
luney
Can anyone recommend any good interactive charting examples for many-to-many
relationships?

------
secure
As the article mentions at the very bottom, this technique is not accurate.

This fact makes it unusable for many use-cases, but it’s an interesting and
good article nevertheless.

~~~
mcherm
Yes, and if I were to make one addition to the article it would be to add a
section talking about the implications of that 0.5% error rate with some
examples of uses where this would be appropriate and where it wouldn't be.

