
How good are query optimizers, really? [pdf] (2015) - guodong
https://www.vldb.org/pvldb/vol9/p204-leis.pdf
======
devmunchies
One of the most important things I learned with databases was to run each of
my queries under EXPLAIN (EXPLAIN QUERY PLAN in SQLite) and see which
indexes are being used, if any.
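
A minimal sketch of that workflow, using Python's stdlib `sqlite3` (the table and index names here are invented for illustration):

```python
import sqlite3

# Toy schema, just to show the EXPLAIN QUERY PLAN workflow.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Ask SQLite how it intends to run the query; no rows are actually fetched.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall()

for row in plan:
    # The detail column names the chosen access path, something like
    # "SEARCH users USING COVERING INDEX idx_users_email (email=?)".
    print(row)
```

If the plan says SCAN instead of SEARCH ... USING INDEX, the query is not hitting the index you expected.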

One of the reasons I don't like ORMs is that I'm not able to see the
underlining query and truly optimize a service. That may be fine for a new
service where performance isn't crucial, but once it needs to scale, you need
to put on your engineering hat, get your hands dirty, and optimize queries.

You'll find you need to rewrite queries so that there isn't complex nesting
in the WHERE clause, flattening your logic so that the SQL optimizer can
use your indexes. You may need to put SELECT statements within SELECT
statements, where the innermost SELECT uses indexes and the outer queries
use the result of the inner query, which is smaller than the whole table.
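
A small SQLite sketch of that inner/outer pattern (the `orders` table, index, and numbers are invented for illustration):

```python
import sqlite3

# Toy data: an orders table indexed on customer_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 5.0), (3, 40.0)],
)

# The outer query works on the (small) result of an inner SELECT that can
# use the index, instead of tangling everything into one big WHERE clause.
rewritten = conn.execute(
    """
    SELECT SUM(amount)
    FROM (SELECT amount FROM orders WHERE customer_id = ?)
    """,
    (1,),
).fetchone()[0]
print(rewritten)  # 35.0
```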

~~~
obi1kenobi
I wrote a thread on this on Twitter:
[https://twitter.com/PredragGruevski/status/12639165990625402...](https://twitter.com/PredragGruevski/status/1263916599062540288?s=20)

I feel that SQL aimed to be Python and became x86 assembly instead. It's no
longer a simple "just works" query language the moment you have to worry about
predicate flattening, join decomposition, CTEs that introduce optimization
barriers, and "IN()" being faster than equivalent "JOINs".
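
To make the last point concrete, here are two semantically equivalent phrasings of "users with at least one order" (toy schema and data invented for illustration); which one the planner handles better varies by engine and version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cay');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Phrasing 1: a JOIN, deduplicated.
with_join = conn.execute(
    "SELECT DISTINCT u.id FROM users u "
    "JOIN orders o ON o.user_id = u.id ORDER BY u.id"
).fetchall()

# Phrasing 2: an IN subquery. Same answer, potentially a different plan.
with_in = conn.execute(
    "SELECT id FROM users WHERE id IN (SELECT user_id FROM orders) ORDER BY id"
).fetchall()

print(with_join, with_in)  # [(1,), (3,)] [(1,), (3,)]
```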

As a result, I started a project that allows you to write read-only,
database-agnostic queries, called GraphQL compiler:
[https://graphql-compiler.readthedocs.io/](https://graphql-compiler.readthedocs.io/)
[https://github.com/kensho-technologies/graphql-compiler](https://github.com/kensho-technologies/graphql-compiler)

The core idea of the project is to get us the convenience of specifying the
"what question I want answered," but without the inconvenience of "how is the
answer computed / with which specific set of queries / where did the data come
from?" -- unless you want to peek under the hood, of course. All the
visibility into the nitty-gritty details available on demand, but without the
tedium of having to hand-optimize queries and know all the "magic" ways in
which queries get faster or slower for each individual kind of database.

~~~
petergeoghegan
> The core idea of the project is to get us the convenience of specifying the
> "what question I want answered," but without the inconvenience of "how is
> the answer computed / with which specific set of queries / where did the
> data come from?"

So...exactly like SQL, then?

~~~
ghusbands
The post you're replying to directly addresses that. When you write SQL:

> It's no longer a simple "just works" query language the moment you have to
> worry about predicate flattening, join decomposition, CTEs that introduce
> optimization barriers, and "IN()" being faster than equivalent "JOINs".

Though it doesn't address how, when the need arises, you'd optimize queries
produced by the GraphQL compiler without massaging them by hand, as with SQL.

~~~
obi1kenobi
Just like how GCC and Clang/LLVM know all the quirks of various CPUs and can
optimize accordingly, GraphQL compiler aims to know the quirks of various
databases (down to individual database versions: e.g., in Postgres 12 certain
kinds of CTEs are no longer an optimization barrier) and optimize accordingly.

This is clearly a massive challenge, but one made easier by the fact that
GraphQL compiler queries (unlike SQL queries) operate at a much higher level
of abstraction. In SQL, you write "here's a CTE, now recursively JOIN X.foo to
Y.bar" where X and Y could be just about anything, even something where a
recursive JOIN is nonsensical. If you want to put a WHERE clause, you have to
decide whether it goes in the recursive CTE itself, in a separate CTE that is
ordered before the recursive CTE, or if you want to wrap the recursive CTE
into another SELECT and put the WHERE there. The correct answer varies from
database to database, and as a function of the size, layout, and index
coverage of your data.

In GraphQL compiler, your queries are much more declarative in comparison:
your query would say "find all subgraphs where vertex A's field 'foo' has
value 123, and where A has a recursively-expanded edge (i.e. 0+ hops along
that edge) to a vertex with field 'bar' with value 456". It's then the
compiler's job to figure out which of the many equivalent SQL statements (or
other db language queries, if you aren't using SQL) is going to be the best
way to compute the result you asked for.

Here's an example from our test suite: input query:
[https://github.com/kensho-technologies/graphql-compiler/blob...](https://github.com/kensho-technologies/graphql-compiler/blob/main/graphql_compiler/tests/test_input_data.py#L941)

Microsoft SQL Server-flavored compiled SQL output:
[https://github.com/kensho-technologies/graphql-compiler/blob...](https://github.com/kensho-technologies/graphql-compiler/blob/main/graphql_compiler/tests/test_compiler.py#L2449)

I'm writing a blog post about this with more detail, follow me on Twitter if
you'd like to see it when it comes out.

~~~
darksaints
If you know all of the quirks of the various databases, why aren't you hacking
on their optimizers? Why write a compiler that knows what's slow and what's
not when you can just fix what's slow?

~~~
obi1kenobi
It's an "and" rather than an "either-or" :)

When the database is open-source, and I spot something that's broken that I
know how to fix, I try to fix it. Here's a fix for a severe database query
planner correctness bug I contributed to an open-source database called
OrientDB:
[https://github.com/orientechnologies/orientdb/pull/7015](https://github.com/orientechnologies/orientdb/pull/7015)

Unfortunately, Microsoft SQL Server, Oracle, and many other databases are not
open-source, and I can't hack on their query planners. And even if they were,
SQL is an absolutely _massive_ language (the complete spec is 100,000+ pages).
The GraphQL compiler query language is tiny in comparison, the spec is maybe
10 pages:
[https://graphql-compiler.readthedocs.io/en/latest/language_s...](https://graphql-compiler.readthedocs.io/en/latest/language_specification/definitions.html)

It's a lot easier to intelligently map a small language to a big one than it
is to optimize the big language outright.

In a sense, SQL is just not designed to be easy to optimize — it's too broad,
and there are too many equivalent-ish ways of doing the same thing. This is
why even after incredibly smart people cumulatively spent engineer-millennia
on the query execution and optimization systems in SQL databases, we still
keep having issues and there are still plenty of areas for improvement.

More info and more concrete examples of "why not just write SQL" in my
upcoming blog post!

~~~
petergeoghegan
> In a sense, SQL is just not designed to be easy to optimize — it's too
> broad, and there are too many equivalent-ish ways of doing the same thing.
> This is why even after incredibly smart people cumulatively spent engineer-
> millennia on the query execution and optimization systems in SQL databases,
> we still keep having issues and there are still plenty of areas for
> improvement.

The main reason is the inherent difficulty of cardinality estimation, as the
paper says.

Not every optimization is worth having. There is typically a distributed cost,
paid in extra planner cycles for queries that don't benefit from the
optimization. This is one of the main reasons why it's hard to contribute new
optimizations to the Postgres planner. Naturally, it's possible that a
marginal optimization will be incredibly important to one particular query or
user. It's a value judgement in each case.

Frankly, I find the suggestion that SQL is not designed to be easy to optimize
baffling.

~~~
obi1kenobi
I think we agree more than may seem apparent at first glance. In a sense, you
are also making the same point I was trying and probably failed to make.
Please bear with me as I give it another shot.

The difficulty of cardinality estimation is a function of the expressiveness
of the language. Imagine a new query language, SQL2, that only has the SELECT
and FROM keywords -- no WHERE, no JOIN, nothing else. Cardinality estimation
in SQL2 is trivial: just store the row count for each table. Optimal query
plans are trivial by extension as well.

Now let's add the WHERE keyword and a few operators to this SQL2. Cardinality
estimation and good query planning got much harder now! For example, if the
WHERE predicate touches two columns, we need to know about correlations
between the two columns, or we might make incorrect estimates and therefore
get worse plans. And since the plan space got bigger, we spend more cycles on
planning. If we continue to add more features to SQL2 to bring it to parity
with SQL proper, all these problems get harder as we go.
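
A toy illustration of that correlation problem (data invented for illustration): with two perfectly correlated columns, an estimator that assumes independence multiplies the per-column selectivities and badly underestimates the conjunctive predicate.

```python
import random

random.seed(0)
# Column b always equals column a: perfect correlation.
rows = [(x, x) for x in (random.randrange(10) for _ in range(10_000))]

sel_a = sum(1 for a, b in rows if a == 3) / len(rows)
sel_b = sum(1 for a, b in rows if b == 3) / len(rows)

# Independence assumption: sel(a=3 AND b=3) = sel(a=3) * sel(b=3).
independent_estimate = sel_a * sel_b * len(rows)

# Ground truth: the predicates always fire together.
actual = sum(1 for a, b in rows if a == 3 and b == 3)

print(independent_estimate, actual)  # estimate is roughly 10x too low here
```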

The language semantics behind GraphQL compiler aim to get sufficient
expressiveness for many (hopefully, most) use cases, while limiting the scope
so that the problems of cardinality estimation and good planning don't become
too hard to solve effectively. In comparison, SQL is significantly more
expressive, and as a result also much more difficult to execute and optimize.

------
SigmundA
I have really tried to let the optimizer do its thing, and generally it does
and everything's OK.

Until it's not, and then I want hints to save my ass. And they are not hints,
I want to TELL the f'ing computer what to do, because I know better than the
optimizer, period.

I was so surprised to find out PG doesn't support hints. I don't think I will
ever be able to move anything serious to it until it does; I'm just not going
to take that kind of risk.

I have played the whole game of rewriting queries to try to convince the
optimizer what to do with barrier tricks. No thanks: give me some hints and I
will tell it exactly what to do and when, thanks.

~~~
ninkendo
Is the query actually slower, or is it just not using an index you want it to
use?

Oftentimes PG won't bother with an index for a variety of reasons (sequential
scans can be legitimately faster in some scenarios), especially when the
number of rows is small.

~~~
stubish
The cool thing about hints is you can quickly drop in a hint to confirm your
suspicions and narrow down the problem, rather than trying to do this sort of
diagnosis in a vacuum. But because some people use hints for evil, nobody is
allowed to use them.

~~~
why-el
You can also disable sequential scans (SET enable_seqscan = off) and force pg
to consider indexes; usually that's enough.

~~~
iracic
The access method is just part of the story. The same index may be accessed in
different ways, and you might also want to combine indexes. Sometimes you may
change the table join order to see how it estimates (or executes). Usually
there are two parts: 1) the cost and the reasoning behind an estimate, and
2) how it executes (timings, resource usage, locks/contention).

------
abernard1
This paper appears to be from 2015.

Can anyone comment on how relevant this is with the enhanced statistics types
in Postgres 10, 11, 12?

~~~
jzelinskie
I can't comment on your question, but thank you for finding the year this
paper was written. Having read many older papers, figuring out what year a
paper was written is sometimes like solving a murder mystery. The year is
vital for understanding the social context of the research, in addition to
any context cited in the paper itself.

~~~
nighthawk454
Agreed, especially in fast-changing fields or after recent breakthroughs. One
trick I use is to look at the References and find the approximate max year
cited. It's generally the same as, or pretty close to, the year of the paper.

------
sradman
I like the methodology of the Join Order Benchmark (JOB). The key takeaway is
PostgreSQL specific:

> ...the most important statistic for join estimation in PostgreSQL is the
> number of distinct values. These statistics are estimated from a fixed-sized
> sample, and we have observed severe underestimates for large tables.

Live statistics, incrementally updated on DML execution, are a key feature of
a good query optimizer. As a zero-administration RDBMS, SQL Anywhere had
gained a reputation for a best-of-breed query optimizer [1] a decade ago; I'm
curious whether this still holds true.

In the last decade, the importance of OLAP queries in row stores has
diminished due to the superiority of column stores. I'd be interested in a
comparison of the Citus query optimizer vs. say Presto.

[1]
[https://www.student.cs.uwaterloo.ca/~cs448/W11/cs448_Paulley...](https://www.student.cs.uwaterloo.ca/~cs448/W11/cs448_Paulley_March_2011.pdf)

------
flooo
I'm surprised there is no mention of Postgres' Genetic Query Optimizer (GEQO)
here. It kicks in when there is a large number of joins and reduces query
planning time at the cost of query execution time.

Another post in this thread mentions adaptive query planning and mistakenly
implies that the GEQO is a module for this. My hands have been itching to
experiment with some improvements to the GEQO, specifically by improving the
genetic algorithms used. When there are many similar queries (i.e. in the
adaptive query planning setting), one could also use reinforcement learning
to improve query planning over time.

------
jzoch
The simplified cost model they use was really surprising: 34% better than the
complex pg one not only sounds great (incoming "simple is better" replies
below) but is really nice to hear. Hopefully postgres has considered, or will
consider, changing the default cost model to a simpler, more modern function
that takes the current landscape into account.

~~~
petergeoghegan
That may be true, but that doesn't seem like the important takeaway to me. The
important takeaway is "In contrast to cardinality estimation, the contribution
of the cost model to the overall query performance is limited". Actually, the
paper itself says "This improvement [the 34% one you mention] is not
insignificant, but on the other hand, it is dwarfed by improvement in query
runtime observed when we replace estimated cardinalities with the real ones".

Optimizers are weird.

~~~
jzoch
Oh, definitely agree. Mainly I find the cost model interesting because it's so
simple and contained. Cardinality estimation is a hard problem and requires
real expertise. But the easy wins you get just by throwing out something based
on old assumptions, like the cost model, are fun!

