
Joins In Steps - zindlerb
http://www.zindlerb.com/joins-in-steps/
======
p2t2p
I think about my self as quite SQL-savvy person, I used to optimize quite
complex queries and is able to read plans for Oracle, Postgres and MySQL.

And yet, I've got not idea why would anybody need right join.

Have you guys ever had a case when you'd need a right join? I've been to the
field for 15 years and yet to see people using right join in the wild.

Like the last example in this link - why would you do that? Most probably your
business logic focuses on dogs, something like "find dogs with no owner" or
something, In this case it is much more readable and straight forward to go
with left join or even with sub-select where you'd have something like 'select
* from dogs where owner_id not in (select id from owners)'.

Have you used right joins and if you have can you explain the use case?

~~~
notafraudster
I use right_join never in SQL, but often in offline data analysis (e.g. in R).
Modern R has an operator called the "pipe" (%>%) which allows left to right
evaluation of functions, and so it's fairly common to write code that
basically reads like, say, chained method invocations in JavaScript (i.e.
object.method1().method2().method3()). The operator works so that the invoking
object is automatically passed as the first argument of the function, so
func(x, y) is the same as x %>% func(y). You might see where this is going.

Think of left join as a join from x to y, and a right join as a join from y to
x, where x is the data we keep all of and y is the data we keep only when
there's a match.

Then, in R, I often use right joins when my "y" data requires preprocessing,
resulting in lines of code that are like: (y %>% preprocess1() %>%
preprocess2() %>% right_join(x)). I could of course write this as
"y_preprocessed = y %>% preprocess1() %>% ...; x %>%
left_join(y_preprocessed)" but I think the former is actually a little
syntactically clearer.

~~~
wodenokoto
I tend to do like

    
    
        x %>% left_join(y %>% preprocess1() %>% preprocess2())
    

But yes, my understanding is that right and left joins are the exact same
function, just with the order of input switched.

------
rtpg
I really feel like we need a lower level expression language than SQL, one
based on actual query plans rather than a declarative thing.

Having that, and helping people learn that, would make it much harder to not
“get” how joins and the like end up working

~~~
contradictioned
Would you also force users to decide on the implementation of the join
operator? Because that is one of the benefits of a declarative query, as well
as placement of other operators and join ordering if multiple relations are
involved.

~~~
nothrabannosir
Would that come with the benefit of not getting nasty surprises when the
cantankerous scheduler suddenly decides to mess with a previously efficient
solution? Or even protect me from myself in the future accidentally breaking
an efficient query through an ostensibly innocent change?

I would appreciate a way to encode execution efficiency parameters in the
semantics of the query itself. Specifying which indexes to use, or the runtime
complexity of certain operations. This would also help me figure out which
indexes I need to create, in the first place. Today I can add a clause in an
order which makes using an index impossible, rendering a previously efficient
query suddenly extremely expensive. The failure mode is very obscure: it all
works, the DB just works overtime. I might not have a large enough DB yet to
actually notice this, until one day I see these slow queries and have to
backtrack all the way to that one commit that caused it, six months ago.

I have encountered this scenario a few odd times. It is not always obvious to
someone changing a query that it is meant to use an index, and it is very hard
for someone writing it to specify so in a unit test.

Flexibility for the scheduler to choose efficient strategies cuts both ways.
I'd appreciate the ability to leave more explicit annotations about expected
runtime characteristics, than mere comments offer.

~~~
jodrellblank
This talk ("How Modern SQL Databases Come up with Algorithms that You Would
Have Never Dreamed Of by Lukas Eder"
[https://www.youtube.com/watch?v=wTPGW1PNy_Y](https://www.youtube.com/watch?v=wTPGW1PNy_Y))
is not particularly gripping, but he does claim that modern database engines
will ignore indexes if they estimate it will be faster to read all the data
than it is to access the index first then read the relevant data. And that
Oracle can have multiple execution plans and switch between them mid-query if
the first one isn't working out as expected.

Could this be a rare case of the mythical sufficiently smart compiler having
more information at runtime than the programmer has at coding time, and that
if you could hint which index to use, the trade off would be that you'd
instead be troubleshooting queries which end up slower because of it?

~~~
kec
> Could this be a rare case of the mythical sufficiently smart compiler having
> more information at runtime than the programmer has at coding time [...]

Mythical? Isn't this the core concept behind optimizing JITs?

------
obituary_latte
That’s a typo in the last paragraph, no?

>The left right is the table named in the outer join part of the query.

~~~
zindlerb
Thanks for pointing that out. It has been updated.

------
revskill
Does SQL Join allow equality with operators like ilike, like,... or custom
equality function ?

select * from a inner join b on mycustomfunction(a, b);

~~~
hobs
Generally any equality is fine, but you'd usually do something like on
mycustomfunction(a) = mycustomfuntion(b)

And if you do this, it will be incredibly slow because you cant use any
indexes in this case.

~~~
mhotchen
Good news! Many database systems support indexes on functions, including on
user-defined functions. The only one I've used is Postgres[0] but apparently
MySQL added support recently as well[1]

[0] [https://www.postgresql.org/docs/current/indexes-
expressional...](https://www.postgresql.org/docs/current/indexes-
expressional.html)

[1] [https://dev.mysql.com/doc/refman/8.0/en/create-
index.html#cr...](https://dev.mysql.com/doc/refman/8.0/en/create-
index.html#create-index-functional-key-parts)

~~~
hobs
That's cool, but then I'd assume you have to make an index for each function.
In general if a person is asking basic questions about sargability, I will
steer them away from functions on a predicate (it's usually my last choice.)

Often you can just think a bit harder about your set based logic and make an
index that's reusable by other queries.

In other SQL's you can also make a materialized view, which may be able to
persist some of these things, its just like keeping n+1 copies of your data to
satisfy a different set of relations/projections.

