
Common Mistakes and Missed Optimization Opportunities in SQL - haki
https://hakibenita.com/sql-dos-and-donts
======
Svip
While counting columns will not include NULL columns, how about counting
joined tables?

    
    
      SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
    

is not permitted in Postgres.

Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more
complicated example might not allow for that. For instance, if b were a virtual
table.

~~~
gnud
I'm sorry, if you want to count NULL-rows in b.* , how will it ever be
different from just COUNT( * )? Maybe I'm misunderstanding what you're after?

~~~
Svip
You're right, it's a bad example. Imagine I'm joining two tables:

    
    
      SELECT a.id, COUNT(b.*), COUNT(c.*)
        FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
        GROUP BY a.id
    

I want to know how many occurrences a_id has in both table b and c. Again in
this simple example, I could just count on b.a_id and c.a_id, respectively,
but imagine if b and c were complex virtual tables:

    
    
      JOIN (SELECT NULL AS foo, 1 AS bar 
            UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
    

This would be useful if we are aggregating data together, where essentially,
there are two ways to join the data with the main table, and both columns can
be null.

Of course, in this example, you could count with COUNT(b.foo) + COUNT(b.bar),
but that's a bit awkward; or you could count a column in table b that you know
is never null. But what if you don't know of one? And still have table c next
to it?

Yes, in all cases, there would be a way out. In the extreme case, you could
wrap it in a virtual table, where you add a column that is just always 0 (not
null), so you can count on it. It would just be neat if b.* was possible.

~~~
sixbrx
I might misunderstand, but I think the join just isn't the right approach for
that sort of child record counting, i.e. counting records from two or more
independent child tables associated with your rows (if that is what you're
after).

You're grouping and counting _after_ the three-way join. That join will
produce all combinations of child records between the two child tables
associated with any given parent row (almost never what is wanted). So any
given non-null thing you're counting from one child record will appear
multiple times, equal to the number of child records in the other table
associated with the parent row.

I think you just want to use correlated subqueries to count the child records:

    
    
      SELECT a.id,
             (SELECT COUNT(whatever) FROM child1 c1 WHERE c1.a_id = a.id),
             (SELECT COUNT(whatever) FROM child2 c2 WHERE c2.a_id = a.id) ...
    

TLDR: You almost never want to join independent children to a common parent,
use independent correlated subquery expressions instead.

~~~
Svip
Except that's potentially super slow if the optimiser does not realise what to
do. In its default state, it will make two table lookups for each row in
table a. So that's 1+N*2 lookups compared to 3 lookups in my example.

For little data, that's probably fine, but for a big database, it will be
slow. However, the optimiser may be able to handle that? I know Sybase's and
MSSQL's had trouble with it, but I've heard Postgres' might be able to.

~~~
sixbrx
First, correctness first. Second, correlated subqueries in selects really
aren't noticeably slow at all in my experience (Postgres, Oracle) when they're
really needed, meaning the alternatives are as slow or slower. Especially
compared to the alternative of actually doing a cartesian product across
independent children of any size. I don't think the optimizer could do much to
keep that cartesian product from actually being realized, either; the grouping
after that join is going to be a _big sort_ in general, with results that
depend on the distribution of child values on a row-by-row basis. But in any
case it would give the wrong answers for both children, regardless of speed
and memory used!
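
The inflation is easy to see with a toy example (table and column names here
are hypothetical): if a parent row has 2 rows in one child table and 3 in the
other, the three-way join produces 2 × 3 = 6 combined rows, so both counts
come out as 6.

    
    
      -- Joining both children multiplies the rows: both counts come back as 6
      SELECT a.id, COUNT(c1.id), COUNT(c2.id)
        FROM parent a
        JOIN child1 c1 ON c1.a_id = a.id
        JOIN child2 c2 ON c2.a_id = a.id
       GROUP BY a.id;
      
      -- Correlated subqueries count each child independently: 2 and 3
      SELECT a.id,
             (SELECT COUNT(*) FROM child1 c1 WHERE c1.a_id = a.id),
             (SELECT COUNT(*) FROM child2 c2 WHERE c2.a_id = a.id)
        FROM parent a;
    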

------
adamiscool8
Some of these have been learned through trial and error over the years, but a
few were new and great to know.

On a related note, is the MCSE the gold standard for SQL education? Have been
looking for a way to brush up and formalize my SQL skills.

------
Dowwie
Would someone please confirm whether this article is misrepresenting a
subquery as an inline CTE? It is my understanding that as of PostgreSQL 12, a
programmer denotes a CTE as "AS MATERIALIZED", "AS NOT MATERIALIZED", or
neither, allowing the default behavior: the CTE subquery will be inlined if
its result is used only once.

for reference: [https://sudonull.com/posts/998-Important-changes-in-the-
CTE-...](https://sudonull.com/posts/998-Important-changes-in-the-CTE-in-
PostgreSQL-12)

Generally speaking, some clarification would be helpful!

~~~
rgharris
I think both the article and you are correct; the article is worded a little
oddly, though, and ignores the fact that in Postgres 12 CTEs that are
referenced multiple times are MATERIALIZED by default.

Before Postgres 12, CTEs were always materialized, so you did not get the
query optimization benefits of CTEs acting like inline subqueries.

Since Postgres 12, a CTE defaults to NOT MATERIALIZED if it is referenced only
once, or MATERIALIZED if referenced more than once. You can override this with
MATERIALIZED or NOT MATERIALIZED when defining the CTE.

Their example is showing that you can let Postgres (before 12) optimize a CTE
for you by writing it as an inline subquery instead of a CTE:

    
    
      SELECT *
      FROM (
        SELECT *
        FROM sale
      ) AS inlined
      WHERE created_by_id = 1
    
    

But with Postgres 12 their "don't" example would result in an index scan
without refactoring to the "do" example. Basically their advice on do vs don't
applies to before Postgres 12.

[https://www.postgresql.org/docs/12/queries-
with.html](https://www.postgresql.org/docs/12/queries-with.html) is pretty
thorough on this
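
For reference, the explicit Postgres 12 syntax looks like this (assuming the
article's `sale` table with an index on `created_by_id`):

    
    
      -- Forces the optimization barrier (pre-12 behavior):
      WITH s AS MATERIALIZED (
        SELECT * FROM sale
      )
      SELECT * FROM s WHERE created_by_id = 1;
      
      -- Inlined, so an index on created_by_id can be used:
      WITH s AS NOT MATERIALIZED (
        SELECT * FROM sale
      )
      SELECT * FROM s WHERE created_by_id = 1;
    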

~~~
Dowwie
Thanks for confirming

------
tempguy9999
This is a pretty trivial list. Useful for beginners I guess.

I seriously take issue with "Reference Column Position in GROUP BY and ORDER
BY" though. _If_ it is restricted to ad-hoc (AKA messing-about) queries I'd be
fine with it, but it won't be. Just don't do it.

~~~
wfriesen
It's especially egregious in the ORDER BY, since there you have the option of
using column aliases.

~~~
commandlinefan
Are you saying you can't use column aliases in group by? What version of
Postgres are you using? I just tried it in 11.5 and it worked:

    
    
        # select cust_id as c, sum(avail_balance) as b from account group by c order by b;

~~~
tempguy9999
Interesting. It _doesn't_ work in MSSQL, and I understand that's correct (i.e.
it isn't allowed) per the standard.

~~~
commandlinefan
Huh - I guess I never thought about it. It makes sense to disallow it, though
- column aliases are there to rename complex expressions, which you probably
_shouldn't_ be grouping on anyway.

~~~
yellowapple
> which you probably _shouldn't_ be grouping on anyway.

This is frequently unavoidable, though. Or more precisely: it could be avoided
with a sane database design, but the databases on which I have to work for my
day job are the precise opposite of "well-designed", so grouping on complex
expressions is unfortunately an inevitability.

~~~
commandlinefan
That's true - I can definitely imagine having to group on something like
"concat(lastname, ', ', firstname)".
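
In standard SQL the whole expression then has to be repeated verbatim in the
GROUP BY (a hypothetical `person` table for illustration), which is part of
why grouping by an alias is so convenient where it's allowed:

    
    
      SELECT concat(lastname, ', ', firstname) AS full_name, COUNT(*)
        FROM person
       GROUP BY concat(lastname, ', ', firstname)  -- Postgres also accepts: GROUP BY full_name
    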

------
godshatter
I'd never run across coalesce before. I usually end up doing nested NVL calls
if I'm trying to find the first non-null in a series of expressions (I'm on
Oracle, btw). I've now added this function to my toolbox.

~~~
oarabbus_
Coalesce and NVL are synonyms for each other.

~~~
godshatter
They don't seem to be in Oracle. Giving more than two parameters to NVL gives
me an error, but it works fine with COALESCE. Granted, they are basically the
same thing if you give each just two parameters.
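
For the record (a hypothetical `contact` table):

    
    
      -- Oracle's NVL takes exactly two arguments:
      SELECT NVL(phone, 'unknown') FROM contact;
      
      -- COALESCE is standard SQL, takes any number, and returns the first non-NULL:
      SELECT COALESCE(mobile, office, home, 'unknown') FROM contact;
    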

------
Foobar8568
I would add to the common mistakes (these should be generic, but I have more
experience with SQL Server):

Not indexing: most often, tables are not indexed at all or are poorly indexed.

Implicit conversion: can generate a lot of I/O and lead to poor performance,
or just prevent indexes from being used.

SQL functions: sorry, but they are most often crap and useless; better to
inline or use a TVF, and no, it's not code logic duplication.

Read uncommitted: unless you enjoy missing rows, reading them multiple times,
or reading half of a value (page splits and/or LOB values).

~~~
SigmundA
> SQL functions: sorry, but they are most often crap and useless; better to
> inline or use a TVF, and no, it's not code logic duplication.

Functions have helped me tremendously in SQL server, but you do have to know
the issues, which can be taken advantage of to some degree.

Code reuse is the obvious use case, but the lack of inlining prior to SQL
Server 2019 meant you could lose performance compared to a hand-inlined CASE
statement or the like. Hopefully, now that functions can be inlined in 2019,
this will be a non-issue going forward.

They are an optimization barrier, which can be a good thing. I have used this
to my advantage to stabilize tricky queries that were using views for code
reuse. The performance becomes consistent and predictable rather than going
pathological on some databases, even though it may be slightly slower on
others.

~~~
pjungwir
Can an MS SQL Server expert explain why creating table variables inside sprocs
is so popular? I've done lots of Postgres & MySQL (and, long ago, lots of
Oracle), but now I find myself maintaining some MS SQL Server projects, and I
see `DECLARE @intermediateResults` all over the place. I thought it was just
the past developers being more familiar with an imperative style than with
SQL, but it seems to be part of the broader MS SQL Server culture, and
somewhere I came across a passing reference to read locking being a reason,
but it was too brief for me to understand. Is there any reason to do this so
often in MS SQL Server vs other RDBMSes?

~~~
monkeycantype
A great thing about table variables is that you can pass them as parameters to
stored procedures, which gives a straightforward way to have procs do
set-based operations, avoiding the need to call procs in a loop.

I find this approach allows me to write smaller procs that become reusable in
different contexts, effectively composable.

Be careful, though: beyond about a thousand records, table variables perform
poorly, so I usually dump the data into a temp table inside the proc, or into
a permanent table with a key unique to this execution of the proc (this could
be the SPID).
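
A minimal T-SQL sketch of the pattern (type, proc, and table names are
illustrative; note that table-valued parameters must be declared READONLY):

    
    
      CREATE TYPE dbo.IdList AS TABLE (id INT PRIMARY KEY);
      GO
      
      CREATE PROCEDURE dbo.ProcessIds @ids dbo.IdList READONLY
      AS
      BEGIN
          -- set-based work against the whole batch, no loop of proc calls
          SELECT t.*
            FROM dbo.SomeTable t
            JOIN @ids i ON i.id = t.id;
      END
      GO
      
      DECLARE @ids dbo.IdList;
      INSERT INTO @ids VALUES (1), (2), (3);
      EXEC dbo.ProcessIds @ids;
    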

------
esnard
In the "Avoid Transformations on Indexed Fields" part, I fail to understand
how the example can work if you're applying the timezone computation on the
right-hand side.

I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL); can
someone explain to me how it works?

~~~
moron4hire
Your only failure is that it's just wrong. It's about the same as trying to
change "if((a + b) > c)" to "if(a > (c + b))". If this weren't time zones,
it'd obviously be "if(a > (c - b))", because you have to balance the equation
by applying the same operation to both sides. But because this is dealing with
timezones, the offset of "b" is different depending on the value of "a", so we
don't know what to subtract from "c" to get the right comparison. So the right
transformation for this "gotcha" is not even possible.

~~~
paulclinger
I think the advice will still work, but you'd need to switch from "named"
timezones to a numeric offset, so for example replace `PST` with `-08:00`, and
then apply the opposite conversion on the right side (as you and I suggested).
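
Something like this, assuming the article's `sale` table and accepting that a
fixed offset ignores DST transitions (which is exactly the objection above):

    
    
      -- Don't: the expression on the indexed column defeats the index
      SELECT * FROM sale
       WHERE created AT TIME ZONE 'PST' > '2019-11-01';
      
      -- Do: keep the column bare and put the (fixed) offset on the constant
      SELECT * FROM sale
       WHERE created > TIMESTAMP WITH TIME ZONE '2019-11-01 00:00:00-08:00';
    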

------
irrational
In regards to formatting SQL, I used to do it the way shown, but a coworker
formatted the columns in the select with the commas in front. This seemed
strange to me until I tried it. I realized that it solved the problem where a
query would be changed and the last item in the select list removed, but the
trailing comma would not be removed. Or a new item would be added to the end
of the select list, but the comma at the end of the previous last item would
be forgotten.

    
    
      SELECT
          col1
          ,col2
          ,COUNT(col3)
      FROM
          t1
          JOIN t2 ON t1.pk = t2.fk
      WHERE
          col1 = col2
          AND col3 > col4
      GROUP BY
          col1
          ,col2
      HAVING
          COUNT(col3) > 1
    

~~~
mwexler
Similarly, by having the AND start each subclause instead of sitting at the
end of the previous clause, it's easier to "--"-comment out specific clauses
during development. Again, harder to read, but easier to work with.

~~~
1996
Interesting. Can you please provide an example of what you mean?

~~~
irrational
    
    
      WHERE
          T.first_name = 'Ender' AND
          T.last_name = 'Wiggins'
    

versus

    
    
      WHERE
          T.first_name = 'Ender'
          AND T.last_name = 'Wiggins'
    

you could also do

    
    
      WHERE 1=1
          AND T.first_name = 'Ender'
          AND T.last_name = 'Wiggins'
    

~~~
1996
Thanks a lot, I thought it might have been WHERE 1=1.

I will now use that to easily remove filters.

An extension for the SELECT, based on the very insightful top comment, could
be

    
    
      SELECT NULL
          ,something1
          ,something2
    

------
monkeycantype
I wish I could use ON for the selection criteria on the first table instead of
a WHERE clause:

    
    
      SELECT A.value, B.value
        FROM tableA A ON A.id = 77
        JOIN tableB B ON B.id = A.bId
    
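
For inner joins, the planner treats a filter on the first table the same way
wherever it appears, so the standard spelling is equivalent:

    
    
      SELECT A.value, B.value
        FROM tableA A
        JOIN tableB B ON B.id = A.bId
       WHERE A.id = 77
    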

------
kbenson
> 2019-22-11: Fixed the examples in the "Faux Predicate" section after several
> keen eyed readers noticed it was backwards.

What abomination of a date format is this? I can only assume this is a bug, a
typo, or an easter egg for those paying attention. _Please_ let it be one of
those. The last thing the world needs is people pushing yet another crazy date
format into use.

~~~
jandrese
It looks like a typo. Hebrew date style is yyyymmdd[1].

[1]
[https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/...](https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/XFDL/i_xfdl_r_formats_he_IL.html)

------
gigatexal
Edit: “Don’t use an ORM” should be point 1

~~~
ars
The opposite. Point 1 should be: _don't_ use an ORM unless you don't know SQL.
But you should know SQL, so don't use an ORM.

An ORM only works until the point where you need to join tables. As soon as
that's needed the ORM just causes you endless trouble.

~~~
gigatexal
Edited, I meant don't use one.

