
Avoiding “OR” for Better Query Performance - pauloxnet
https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/
======
Someone1234
In Oracle land you have no choice but to use a long list of OR because IN has
a low clause limit (1000). So if you need e.g. 1.1K rows, even using a fast
indexed PK, your choices are OR or pain. The workarounds are horrible, like
splitting your IN clauses and UNION ALLing two copies of the same query (it
runs twice), or using a global temp table (which is awkward as heck to set up
and tear down on each execution, particularly with IN lists that could range
from 1 through 1K+ items).
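For what it's worth, the batching workaround looks roughly like this sketch (Python, with a made-up `orders`/`id` schema): split the key list into chunks of at most 1000 and build one parameterized IN query per chunk.

```python
# Sketch of batching keys to stay under a 1000-item IN limit.
# The orders/id schema is invented for illustration.

def chunked(items, size=1000):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def in_clause_queries(keys, size=1000):
    """Build one parameterized SELECT per chunk of keys."""
    queries = []
    for chunk in chunked(keys, size):
        placeholders = ", ".join(["?"] * len(chunk))
        queries.append(
            (f"SELECT id FROM orders WHERE id IN ({placeholders})", chunk)
        )
    return queries
```

Each (sql, params) pair is then executed separately and the results concatenated client-side, which is exactly the per-batch round-tripping being complained about here.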

Postgres has no practical limit; there are examples of people using 100K+
values in an IN clause (not that I'm recommending it). People keep telling me
that Oracle is "enterprise
grade" and that Postgres is not. To be honest all I see is Postgres is
flexible and modern, and Oracle is kludgy and outmoded. I'm forced to use
Oracle but I'd never voluntarily start any new projects on it, I'm all in on
Postgres from here on out.

~~~
silveroriole
Used Oracle for a long time but have never had the need to do this. Can I ask
what your use case is? I can’t imagine one which isn’t solved by a subquery or
a built-in Oracle temp table which automatically restricts the contents of the
temp table to one transaction, with no need to setup/teardown per execution
(as long as you’re using sensible transactions).

~~~
jzoch
We have a problem at work where one of our tables is 5 billion rows and we get
a csv (or maybe a few csv's ;) ) with a total of 21 million primary keys. We
now need to find which of these 21 million keys are already in our table and
do some task on those that are not.

Originally, we batched 1k records at a time and used an IN. It was taking
hours. Each query finished in 1.5 seconds. Then we moved to a global temp
table. It was around 30% faster iirc. Still not good enough (open to
suggestions!!!) but way better. Wish oracle would just do this for us
internally.

~~~
bhaak
How about this: 21 million primary keys are easily kept in memory. Sort them.

Then do "SELECT id FROM five_billion_rows ORDER BY id" and iterate over that
with the CSV ids in lockstep.

Cursor management is pretty good in Oracle, so this is likely much better than
your own paging with 1k IN values at a time. I would assume it would be I/O
bound on the index of the id column.

This also might be an interesting question for "Ask HN".
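The lockstep walk described above could be sketched like this (plain Python iterators stand in for the Oracle cursor and the CSV; names invented):

```python
# Sketch of the lockstep merge: walk a sorted cursor over the table's ids
# and the sorted CSV ids together, collecting CSV ids absent from the table.
# Both inputs are plain ascending iterables, so the table side could be a
# database cursor over "SELECT id FROM five_billion_rows ORDER BY id".

def missing_ids(csv_ids_sorted, table_ids_sorted):
    """Return CSV ids not present in the table; both inputs must be ascending."""
    missing = []
    table_iter = iter(table_ids_sorted)
    current = next(table_iter, None)
    for key in csv_ids_sorted:
        # Advance the table cursor until it catches up with the CSV key.
        while current is not None and current < key:
            current = next(table_iter, None)
        if current != key:
            missing.append(key)
    return missing
```

This makes a single sequential pass over each side, so the cost is one index-ordered scan plus an in-memory sort of the 21M keys, rather than millions of per-batch lookups.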

------
adrianmonk
It's surprising to me that "where id = 42 or id = 4711" behaves differently
than "where id in (42, 4711)".

I would have guessed the database optimizes this away. It seems like an
easy-to-implement and safe optimization. (Look at the abstract syntax tree, find "or"
expressions, find equality test expressions under there, check if two or more
compare the same sub-expression against constants, then combine.)

Did I misunderstand what the article is saying PostgreSQL does, or does it
actually not do this optimization for you?
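The rewrite described above can be illustrated with a toy version (the tuple-based "AST" is invented for the example; real planners work on far richer trees):

```python
# Toy version of the OR-to-IN rewrite: scan the disjuncts of an OR for
# equality comparisons against constants and merge those testing the same
# column into a single IN list. The ("eq", column, constant) representation
# is made up for illustration.

def merge_or_equalities(disjuncts):
    """disjuncts: list of ("eq", column, constant) terms joined by OR.
    Returns ("in", column, constants) when every term tests one column,
    otherwise returns the input unchanged."""
    columns = {col for op, col, _ in disjuncts if op == "eq"}
    if len(columns) == 1 and all(op == "eq" for op, _, _ in disjuncts):
        (col,) = columns
        return ("in", col, [const for _, _, const in disjuncts])
    return disjuncts
```

So `id = 42 OR id = 4711` collapses to `id IN (42, 4711)`, while an OR over different columns is left alone.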

~~~
commandlinefan
> I would have guessed the database optimizes this away

And one thing about this type of "beneath the covers" optimization is that,
over time, the tool ends up doing it for you, and applying these sorts of
optimizations by hand can end up hurting more than helping. So by all means
learn it, but keep an eye on it as versions progress and make sure you're not
fighting the optimizer.

~~~
sudhirj
I think the simpler takeaway is that OR consistently represents a branch in
query planning: the planner now always has to do two things and merge the
results back instead of doing one. It makes sense for OR to carry this cost
consistently rather than leave the user guessing about when it will or won't.

~~~
beart
I don't understand why that makes sense? Isn't the point of a declarative
language to remove the need to consider these details?

~~~
cyborgx7
I have long since resigned myself to the fact that "declarative language" in
practice only means that you don't have to manually write the for-loops
anymore. The concept of declarative languages seems like it has so much
potential, but I haven't seen any that have even come close to it.

------
the_duke
All of the examples seem like relatively straightforward transformations.

I'm quite surprised that the PG query optimizer doesn't do those. How do other
engines compare here?

~~~
gwd
sqlite certainly seems to optimize it automatically; note the 'IN operator' in
the plan where none was in the query:

    
    
        sqlite> explain select * from words where wid=1 or wid=2;
        addr  opcode         p1    p2    p3    p4             p5  comment
        ----  -------------  ----  ----  ----  -------------  --  -------------
        0     Init           0     21    0                    00  Start at 21
        1     OpenRead       0     3     0     2              00  root=3 iDb=0; words
        2     Integer        11    2     0                    00  r[2]=11; return address
        3     Once           0     11    0                    00
        4     OpenEphemeral  1     1     0     k(1,)          00  nColumn=1; RHS of IN operator
        5     Integer        1     3     0                    00  r[3]=1
        6     MakeRecord     3     1     4     D              00  r[4]=mkrec(r[3])
        7     IdxInsert      1     4     3     1              00  key=r[4]
        8     Integer        2     3     0                    00  r[3]=2
        9     MakeRecord     3     1     4     D              00  r[4]=mkrec(r[3])
        10    IdxInsert      1     4     3     1              00  key=r[4]
        11    Return         2     0     0                    00
        12    Rewind         1     20    0                    00
        13      Column         1     0     1                    00  r[1]=
        14      IsNull         1     19    0                    00  if r[1]==NULL goto 19
        15      SeekRowid      0     19    1                    00  intkey=r[1]
        16      Rowid          0     5     0                    00  r[5]=rowid
        17      Column         0     1     6                    00  r[6]=words.text
        18      ResultRow      5     2     0                    00  output=r[5..6]
        19    Next           1     13    0                    00
        20    Halt           0     0     0                    00
        21    Transaction    0     0     97    0              01  usesStmtJournal=0
        22    Goto           0     1     0                    00
    

I've verified that "where wid in (1,2)" produces the same plan.
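This is easy to check from Python's bundled sqlite3 module (the `words` schema is recreated here to mirror the comment; exact bytecode may vary by SQLite version):

```python
import sqlite3

# Compare the bytecode opcodes for the OR form and the IN form; SQLite is
# expected to rewrite the OR into an IN internally, so the opcode sequences
# should come out the same.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (wid INTEGER PRIMARY KEY, text TEXT)")

def opcodes(sql):
    """Return just the opcode column of EXPLAIN output for `sql`."""
    return [row[1] for row in conn.execute("EXPLAIN " + sql)]

or_plan = opcodes("SELECT * FROM words WHERE wid = 1 OR wid = 2")
in_plan = opcodes("SELECT * FROM words WHERE wid IN (1, 2)")
print(or_plan == in_plan)
```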

------
jordigh
Here's a slightly different and older take on the same problem:

http://blog.rhodiumtoad.org.uk/2017/01/22/performance-issues-with-ored-conditions/

------
methodover
In MySQL, discovering the UNION hack a couple years ago was like a revelation.
It’s something that I’ve since been preaching to everyone who’ll listen.

It’s extremely common to want to OR a small set of items. It _feels_ like the
optimizer would be smart enough to use an index (presuming one exists matching
the searched fields), but it doesn’t (at least not in MySQL 5.6 or 5.7).
UNION makes it use the index.

It’s a nice tool to have in your kit.
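The shape of the rewrite, with an invented `items` schema (SQLite is used here only to check that both forms return the same rows; the performance effect being described is specific to MySQL):

```python
import sqlite3

# OR-across-two-indexed-columns vs. the UNION rewrite, checked for result
# equivalence on an invented items(id, sku, name) schema.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, sku TEXT, name TEXT);
    CREATE INDEX items_sku ON items (sku);
    CREATE INDEX items_name ON items (name);
    INSERT INTO items VALUES
        (1, 'A1', 'bolt'), (2, 'B2', 'nut'), (3, 'A1', 'washer');
""")

or_rows = conn.execute(
    "SELECT id FROM items WHERE sku = 'A1' OR name = 'nut' ORDER BY id"
).fetchall()

union_rows = conn.execute(
    "SELECT id FROM items WHERE sku = 'A1' "
    "UNION "
    "SELECT id FROM items WHERE name = 'nut' "
    "ORDER BY id"
).fetchall()

print(or_rows == union_rows)
```

Each UNION branch can use one single-column index, which is what lets MySQL avoid the full scan the OR form falls back to.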

------
mtzaldo
IDK about the profiler... but I have a hard time seeing why the union of 2
joins would be faster than filtering first and then joining the results :thinking:

~~~
Izkata
Yeah, seeing "sort" in the union-based plan also threw me for a moment: why
would that be any faster? Normally sorts kill performance, especially compared
to the merge join in the "ugly" version.

I think the key is in the description afterwards, where each subquery only
returns 1 row. Which just reinforces: measure everything, this advice may not
be generally applicable.

------
wolf550e
The simple query:

    
    
        explain (analyze, timing) SELECT id FROM a
        WHERE a_val LIKE 'something%'
           OR a_val LIKE 'other%';
                                                               QUERY PLAN                                                       
        ------------------------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on a  (cost=8.87..12.88 rows=20 width=4) (actual time=0.012..0.012 rows=0 loops=1)
           Recheck Cond: ((a_val ~~ 'something%'::text) OR (a_val ~~ 'other%'::text))
           Filter: ((a_val ~~ 'something%'::text) OR (a_val ~~ 'other%'::text))
           ->  BitmapOr  (cost=8.87..8.87 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=1)
                 ->  Bitmap Index Scan on a_val_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
                       Index Cond: ((a_val ~>=~ 'something'::text) AND (a_val ~<~ 'somethinh'::text))
                 ->  Bitmap Index Scan on a_val_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                       Index Cond: ((a_val ~>=~ 'other'::text) AND (a_val ~<~ 'othes'::text))
         Planning Time: 0.088 ms
         Execution Time: 0.033 ms
        (10 rows)
    

using union:

    
    
        explain (analyze, timing)
        select id from a
        where a_val like 'something%'
        union
        select id from a
        where a_val like 'other%';
                                                                  QUERY PLAN                                                          
        ------------------------------------------------------------------------------------------------------------------------------
         HashAggregate  (cost=17.23..17.43 rows=20 width=4) (actual time=0.135..0.135 rows=0 loops=1)
           Group Key: a.id
           ->  Append  (cost=0.42..17.18 rows=20 width=4) (actual time=0.133..0.133 rows=0 loops=1)
                 ->  Index Scan using a_val_idx on a  (cost=0.42..8.44 rows=10 width=4) (actual time=0.128..0.128 rows=0 loops=1)
                       Index Cond: ((a_val ~>=~ 'something'::text) AND (a_val ~<~ 'somethinh'::text))
                       Filter: (a_val ~~ 'something%'::text)
                 ->  Index Scan using a_val_idx on a a_1  (cost=0.42..8.44 rows=10 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                       Index Cond: ((a_val ~>=~ 'other'::text) AND (a_val ~<~ 'othes'::text))
                       Filter: (a_val ~~ 'other%'::text)
         Planning Time: 0.124 ms
         Execution Time: 0.164 ms
        (11 rows)
    

using gin index:

    
    
        EXPLAIN (analyze, timing)
        SELECT id FROM a
        WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);
                                                                QUERY PLAN                                                        
        --------------------------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on a  (cost=176.15..247.11 rows=20 width=4) (actual time=0.036..0.036 rows=0 loops=1)
           Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
           ->  Bitmap Index Scan on a_val_trgm_idx  (cost=0.00..176.15 rows=20 width=0) (actual time=0.034..0.034 rows=0 loops=1)
                 Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
         Planning Time: 0.091 ms
         Execution Time: 0.193 ms
        (6 rows)
    
    

Doesn't the simple query win?

------
CharlesDodgson
Thank you, that was a well written and informative post.

~~~
pauloxnet
You're welcome! I found it very helpful myself, so I shared it.

------
mkonecny
Is this applicable to mysql?

~~~
stevekemp
You might consider running "EXPLAIN" to see for yourself.

~~~
GrumpyNl
Both give the exact same result.

~~~
stevekemp
Well, fill a table with a few million rows and compare the results. If they
both come back with roughly the same times then you've learned they're both
equally good :)

Sometimes it's good to ask people "Is this true of XX?", but if you're curious
it often takes less time to explore and test than to wait for somebody to give
you an answer - an answer that might actually be wrong. (Some things change
across DB releases, as I've learned to my cost. If I'm ever unsure whether one
approach is faster than the other IN MY SPECIFIC SITUATION, I test. Random
benchmarks are often not as useful as real-world tests.)

------
C1sc0cat
Why not just use a subquery?

------
crimsonalucard
I've always said SQL is a bad abstraction. An artifact from the past that we
have to deal with, like CSS and JavaScript.

This article solidifies my point. Seriously, ORs are bad, so union two queries
as a cheaper alternative? There's no rhyme or reason to it at the API level.
You have to look under the hood to see why this hack works. Hence why SQL is
such a bad abstraction: we program and optimize SQL not with logic but with
memorization of hacks.

~~~
ben509
> Seriously, ors are bad so union two queries as a cheaper alternative?

They should be equivalent, letting you choose the notation that expresses the
problem cleanly. If relations were sets of tuples, boolean expressions could
be rewritten safely as straightforward set arithmetic and vice versa.

But SQL relations (tables) are bags of tuples (rows), which breaks that
equivalence.
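The bag-vs-set point can be made concrete with a tiny check (invented schema, using SQLite): on a table with duplicate rows, filtering with OR keeps duplicates, UNION collapses them, and UNION ALL double-counts rows that match both branches.

```python
import sqlite3

# Where OR and UNION stop being equivalent: a keyless table is a bag,
# so duplicates behave differently under each form.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (v INTEGER);          -- a bag: duplicates allowed
    INSERT INTO t VALUES (1), (1), (2);
""")

def rows(sql):
    return sorted(conn.execute(sql).fetchall())

with_or = rows("SELECT v FROM t WHERE v = 1 OR v <= 2")
with_union = rows(
    "SELECT v FROM t WHERE v = 1 UNION SELECT v FROM t WHERE v <= 2"
)
with_union_all = rows(
    "SELECT v FROM t WHERE v = 1 UNION ALL SELECT v FROM t WHERE v <= 2"
)
```

Three different answers from three formulations of the "same" predicate, which is exactly the broken equivalence being described.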

~~~
crimsonalucard
That's why the abstraction is bad. The abstraction is not zero cost.

Instead, a high-level API should be exposed that mirrors the underlying
implementation with a bijection. Databases are the bottleneck of most web
development, and we need to optimize "bags of tuples", not "sets", to get the
performance we need. The problem is that the API (SQL) exposes things as if
the underlying implementation were sets. Hence the need to memorize ugly
hacks like this in order to optimize SQL.

SQL is a language designed for business users, not a language designed to
optimize data retrieval speed. We need an API that can do the latter because
web devs are doing all these optimizations in the business language.

I'm not saying there's a better solution out there. I'm exposing a problem.
The problem is, SQL is not a zero-cost abstraction. We need a "Rust" for
databases: a high-level query language where the optimization paths are
isomorphic to the optimization paths in the underlying implementation.

Maybe at first glance you don't agree with me. But think of it this way: if an
optimization path in PostgreSQL involves rewriting a query with an OR clause
as a UNION of two queries, and all SQL implementations are loaded with strange
hacks like this... is there a problem? And what is the solution?

