

Things SQL needs: MERGE JOIN that would seek - thisisnotmyname
http://explainextended.com/2010/05/07/things-sql-needs-merge-join-that-would-seek/

======
btilly
I think it would be better still to, when seeking forward, have it examine a
couple of records, then start jumping forward in powers of 2 until it has
located an interval with the next value in it, which it then does a binary
search on to find the right location.

This is at most a factor of 2 less efficient than a binary search, it is fast
when the sequential reads are fast, and it doesn't rely on statistics telling
the whole story. (For instance the size discrepancy could be explained by a
small number of large gaps. This strategy would quickly jump the gaps while
still getting fast sequential read behavior elsewhere.)

~~~
quassnoi
Unfortunately, with a B+Tree index, you cannot jump in powers of 2 (well, you
can but only within one page). To find a record with a given offset you'd need
to traverse the whole linked list of the leaf-level pages, and that's what
MERGE JOIN does by default.

However, this solution would be nice if you had a sorted recordset which
allows efficient positional seek (for instance, a materialized result of the
previous operation which required sorting).

------
ars
He's cheating a bit. He created two tables each with 1 million rows, offset by
999,000. (i.e. 1 - 1,000,000 and 999,000 - 1,999,000)

Which of course is the best case scenario for this query.

But this is very unlikely in the real world. In the real world you use
sequential IDs, and almost all the IDs exist. So seek is not useful.

If you use random IDs (like using a GUID for a primary key), it's still not
useful. If your random number generator is at all good your index numbers are
pretty well distributed, without the large gaps this query likes.

~~~
quassnoi
> He's cheating a bit. He created two tables each with 1 million rows, offset
> by 999,000. (i.e. 1 - 1,000,000 and 999,000 - 1,999,000)

Of course this is cheating, but only to demonstrate the point :)

> But this is very unlikely in the real world. In the real world you use
> sequential IDs, and almost all the IDs exist. So seek is not useful.

Yes, and that's why the HASH JOIN is almost always used instead by the real
optimizers :)

However, there are situations when the gaps are expected. Say, the right
recordset is small compared to the left one. Usually, a NESTED LOOPS is used
in this case with the right recordset leading, but each loop uses a full-
fledged seek from the top of the B+Tree which can be inefficient when the
right recordset is not so small.

Using a statistics-assisted MERGE JOIN in this case would help to mix the
seeks and the scans.

Say, if the statistics say "current right is 500, current left is 300, there
are 40000 records from 0 to 1000 in the left recordset", we know that most
probably we'll need to skip 8000 records to reach the right pointer and we
better seek.

But if on next step it will say "current right is 501, current left is 500,
there are 40000 records from 0 to 1000 in the left recordset", we know that
there are only 40 records left which are most probably in one page and we can
scan instead.

It can be thought of as an improved NESTED LOOPS which caches the current
position.

