
Lab Notes: How We Made Joins 23 Thousand Times Faster, Part Two - nslater
https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-two/
======
mgsouth
tldr: CrateDB implemented hash joins for the case where both tables are too
big to fit into memory, by reading the left table in chunks and scanning the
entire right table for each chunk.
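
To make the chunked scheme concrete, here is a rough Python sketch of a block hash join. It is not CrateDB's code; `block_hash_join`, `right_factory` and `chunk_size` are names made up for illustration, and rows are assumed to be plain tuples.

```python
from collections import defaultdict
from itertools import islice


def block_hash_join(left, right_factory, left_key, right_key, chunk_size):
    """Yield (left_row, right_row) pairs whose join keys are equal.

    left          -- iterable of rows, consumed once in chunks
    right_factory -- hypothetical zero-argument callable returning a fresh
                     iterator over the right table (one call per chunk)
    """
    left_iter = iter(left)
    while True:
        # Read the next chunk of the left table into memory.
        chunk = list(islice(left_iter, chunk_size))
        if not chunk:
            break
        # Build a hash table over just this chunk.
        table = defaultdict(list)
        for row in chunk:
            table[left_key(row)].append(row)
        # Scan the entire right table and probe the chunk's hash table.
        for right_row in right_factory():
            for left_row in table.get(right_key(right_row), ()):
                yield left_row, right_row
```

With a left table of M rows and a chunk size of c, the right table is scanned about M/c times, which is where the repeated-scan cost comes from.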

Part one is at
https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-one/

The hashing is used for equi-joins, where two tables are related with
(possibly multiple) equality operators; e.g. "select * from t1 join t2 on t1.a
= t2.b and t1.x = t2.y".

The benchmarks show a very large improvement over the previous algorithm, but
it's still O(M/c * N/d). It would be interesting to see CrateDB keep only the
hashes in memory, possibly using Bloom filters or something similar, then
re-read the tables, ignoring any rows that don't match a hash. If the
_selected_ rows from the second table fit into memory, then you can get
O(2M+N).
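
Here is one way to read that suggestion, sketched in Python. A plain `set` of key hashes stands in for the Bloom filter, and `scan_m` / `scan_n` are hypothetical callables that each start a fresh scan of a table. The first table is read twice and the second once, which matches the 2M+N count.

```python
from collections import defaultdict


def hash_filtered_join(scan_m, scan_n, key_m, key_n):
    """Yield (m_row, n_row) pairs with equal keys using two scans of M, one of N."""
    # Pass 1 over M: keep only the key hashes, not the rows themselves.
    seen = {hash(key_m(row)) for row in scan_m()}

    # Single pass over N: retain only rows whose key hash appeared in M.
    # This assumes the selected rows fit into memory.
    selected = defaultdict(list)
    for row in scan_n():
        key = key_n(row)
        if hash(key) in seen:
            selected[key].append(row)

    # Pass 2 over M: probe the selection. Looking up by the real key
    # discards any false positives caused by hash collisions.
    for row in scan_m():
        for match in selected.get(key_m(row), ()):
            yield row, match
```

A real Bloom filter would shrink `seen` further at the cost of more false positives in `selected`; the final lookup by actual key keeps the result correct either way.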

