
PostgreSQL Query Optimization - craigkerstiens
http://jinchengli.me/post/postgres-query-opt/
======
allan_s
Maybe a very newbie question but I've always been wondering:

Since the query optimizer can't explore the whole possibility tree because of
time constraints, would it be possible in theory (and if so, has any RDBMS
implemented it) to do something like a prepared statement where you tell the
optimizer "take your time and give me the best plan you can", and even, let's
be crazy, craft some JIT-ish code heavily optimized just for that one query?

I'm certainly missing a lot of things, but it would make sense for web
services where you have a very limited set of "critical" queries that you may
want to tell your RDBMS to optimize as well as it can for the current set of
data at service startup.

~~~
twic
To do this, you would have to prepare a plan once and then use it for many
queries. The problem with that is that you can't make use of query-specific
information when making the plan, and that can be crucial to picking the right
plan.

For example, if you have a query like 'select * from products where category =
?', then you basically just have to choose between a table scan and an index
lookup. Which is best will depend on how selective the predicate is: if the
category in question only covers one in a million products, an index lookup is
probably faster, but if it covers a quarter of them, then a table scan will
probably be faster.
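
A minimal sketch of that tradeoff in practice (the table, index, and category
values are hypothetical; which plan the planner actually picks depends on your
data and settings):

    -- Hypothetical schema: a products table with an indexed category column.
    CREATE TABLE products (
        id       serial PRIMARY KEY,
        category text NOT NULL,
        name     text
    );
    CREATE INDEX products_category_idx ON products (category);

    -- Give the planner fresh statistics to estimate selectivity from.
    ANALYZE products;

    -- A rare category: the planner will typically choose an index scan.
    EXPLAIN SELECT * FROM products WHERE category = 'rare-category';

    -- A very common category: the planner will typically prefer a sequential
    -- scan, because reading most of the table through the index would cost
    -- more than scanning it directly.
    EXPLAIN SELECT * FROM products WHERE category = 'popular-category';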

If you know from statistics that no category covers more than a small fraction
of products, then sure, prepare an index lookup plan for all queries, and vice
versa if you know that all categories cover a substantial fraction of all
products. But if you have a spread, with some categories being big and some
small, then you are best off leaving the selection of the plan to runtime.

It's much the same situation as with ahead-of-time vs just-in-time
compilation, but with a bigger penalty for picking the wrong plan, and a
smaller ratio of time spent planning to time spent executing, both of which
push towards deferring the decision.

That said, perhaps a good strategy would be to do what JIT compilers can do
with dynamic deoptimisation: pick a plan that looks good, but guard it with
assertions about the things which vary at runtime. So, with the example above,
if your categories are mostly small, prepare a plan which uses an index, but
guard it with a check that the category is in fact small. If a query comes
along for a big category, ignore the prepared plan and make a new one from
scratch. Then track statistics about the hit rates on the guard conditions,
and if it looks like different assumptions would lead to a more useful plan,
switch to those. You could even have multiple plans for different conditions.
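
For what it's worth, Postgres prepared statements already do something loosely
in this spirit: the first few executions are planned with the actual parameter
values ("custom plans"), and the server only falls back to a cached generic
plan if it doesn't look worse than the custom ones. A rough sketch, reusing
the hypothetical products table from above:

    -- Prepared statement with a parameter whose selectivity varies.
    PREPARE by_category(text) AS
        SELECT * FROM products WHERE category = $1;

    -- Inspect which plan is actually used for a given parameter value.
    EXPLAIN EXECUTE by_category('rare-category');
    EXPLAIN EXECUTE by_category('popular-category');

    -- In newer releases (12+) the behaviour can be forced either way:
    -- SET plan_cache_mode = force_custom_plan;  -- or force_generic_plan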

As for crazily crafting some JIT-ish code heavily optimized just for that one
request, people are already working on that (the slides are in Russian, but
the abstract is still interesting!):

[https://pgconf.ru/en/2016/89652](https://pgconf.ru/en/2016/89652)

------
jamii
The main benchmarks that these compilers compete on are those from
[http://www.tpc.org/](http://www.tpc.org/). They are synthetic datasets which
are fairly naively generated.
[http://www.vldb.org/pvldb/vol9/p204-leis.pdf](http://www.vldb.org/pvldb/vol9/p204-leis.pdf)
is a really interesting paper that looks at how some of the heuristics used in
those compilers work well on TPC but break down on real datasets. It can give
you a good intuition of when you have to be wary of bad plans.

------
saosebastiao
As someone who spends a fair amount of time in databases but also large scale
mathematical optimization, it's been a goal of mine to dig into the Postgres
code and start hacking on the query optimizer. From a very quick overview, I
see tons of potential for huge improvements.

1) The query optimizer doesn't explore all plans, but it also doesn't need to.
There are very smart algorithms for exploring branches of micro-optimizations
only when they are likely to improve upon an existing plan, ranging from local
search to genetic algorithms to branch-and-cut with column generation. Outside
of a very limited application of genetic algorithms, which most users won't
ever see used, Postgres doesn't do this very smartly.

2) Most production databases have the same set of queries running on them 99%
of the time. Queries are optimized on the spot, or pre-optimized but missing
important runtime information. Postgres could take a hybrid approach...pre-
optimize queries, but explore LP duality models of the optimization to
discover where the thresholds are for alternative plans. For example, use one
query plan for all values of columnA, except when columnA = "FOO" or columnB >
42.

3) The statistics that are made available to the optimizer are very
rudimentary. Multi-column statistics and conditional probability statistics,
for example, are missing (see the extended-statistics sketch after this list).
They are left out because gathering such statistics can be very expensive.
Understanding cardinality can help identify which statistics could be cheaply
obtained, and machine learning models and LP duality information could help
determine which of the more expensive sets of statistics would be valuable for
the types of queries that tend to be run on the database. I would estimate
that conditional probability collections on multi-column indexes alone could
yield massive (order of magnitude) improvements over full table scans.

4) I have not verified this, but I don't think Postgres does any sort of
statistical or ML-based analysis of the IO or compute characteristics of the
server (see the cost-settings sketch after this list). Making this information
available could result in huge improvements in query optimization wherever
there is a tradeoff between IO and compute (such as how to compress data, or
where the thresholds should sit for activating parallel aggregates and scans).
Postgres could also learn the read-write ratios at the table level, which
could enable further important optimizations.

5) There is room for extensions of the SQL table creation syntax that could
make available static information that can both be applied as a constraint but
also used for query optimization. For example, let's say you have a multi-
column key (or even not a key, but just known in advance) where the
information is purely hierarchical from column1->column2->column3. If that is
the case, certain values of column3 would only ever occur if column2 = 'bar'
and column1 = 'foo'. Even if your query doesn't ever touch column1, the query
optimizer could possibly choose to not evaluate a predicate on column3 until
it knows that column2 = 'bar' and column1 = 'foo'. And as a benefit, you could
get a crucial DML runtime error if that ever magically changes on you.
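
Regarding 3): as a concrete illustration, this is roughly what multi-column
statistics look like using the extended-statistics syntax that PostgreSQL 10
introduced (table and column names are hypothetical):

    -- Hypothetical table with two strongly correlated columns.
    CREATE TABLE orders (
        id       serial PRIMARY KEY,
        city     text,
        zip_code text
    );

    -- Ask the planner to gather cross-column statistics, so it stops
    -- treating city and zip_code as independent.
    CREATE STATISTICS orders_city_zip (dependencies, ndistinct)
        ON city, zip_code FROM orders;

    ANALYZE orders;

    -- Row estimates for correlated predicates like this one should now be
    -- much closer to reality than with per-column statistics alone.
    EXPLAIN SELECT * FROM orders WHERE city = 'Oslo' AND zip_code = '0150';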
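
Regarding 4): Postgres does expose the relevant IO/compute knobs today, but
they are static settings you tune by hand rather than anything the server
learns; a minimal sketch (values shown are illustrative, not recommendations):

    -- Relative cost of a random page read vs. a sequential one; the planner
    -- uses this to trade index scans against sequential scans.
    SHOW random_page_cost;
    SET random_page_cost = 1.1;          -- e.g. closer to 1.0 on SSD storage

    -- How much of the data the planner may assume is already cached.
    SET effective_cache_size = '8GB';

    -- Cap on workers used for parallel scans and aggregates (9.6+).
    SET max_parallel_workers_per_gather = 4;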

Now, I just need to learn some C and find some time to hack on it :)

~~~
AlterEgo20
As for number 5 - Postgres already does that. This feature is called
"constraints". Such constraints can be applied to multiple columns and are
used to build better query plans (as well as limiting DML operations).
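
A minimal sketch of a multi-column CHECK constraint in that spirit (the table
and hierarchy are hypothetical; how much the planner exploits it depends on
your version and the constraint_exclusion setting):

    -- Hypothetical hierarchy: region -> country -> city.
    CREATE TABLE stores (
        id      serial PRIMARY KEY,
        region  text NOT NULL,
        country text NOT NULL,
        city    text NOT NULL,
        -- Encode one piece of the hierarchy: 'Oslo' only exists under
        -- country 'NO' in region 'EU'.
        CONSTRAINT oslo_is_in_norway CHECK (
            city <> 'Oslo' OR (country = 'NO' AND region = 'EU')
        )
    );

    -- Violating DML fails at runtime, as described above.
    INSERT INTO stores (region, country, city) VALUES ('US', 'US', 'Oslo');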

------
Someone
For comparison, SQLite's optimizer is described in
[https://www.sqlite.org/optoverview.html](https://www.sqlite.org/optoverview.html)
and, linked from it,
[https://www.sqlite.org/queryplanner-ng.html](https://www.sqlite.org/queryplanner-ng.html).

SQLite doesn't want to spend time considering all possible join combinations,
accepting the risk that it chooses a plan that is a lot worse than the optimal
one.

------
brightball
That's a great overview. This will come in handy for training at work.

------
known
And throw more memory at the system.

