
Following a Select Statement Through Postgres Internals - craigkerstiens
http://patshaughnessy.net/2014/10/13/following-a-select-statement-through-postgres-internals
======
tofflos
I liked the article, and I know this is nitpicking, but I disagree that
creating an index is the best solution for eliminating the sequential scan.
Hopefully the next article mentions it.

Just drop the sort clause. Returning the user with the lowest id seems like an
unusual use case.

Imagine searching for a name in a phone book and only finding the oldest
person with that name, or the person with the lowest phone number.

More common cases are checking whether a user exists, which doesn't require a
sort, or finding all the users with a particular name, which I guess could be
paged using limit and offset.

~~~
pat_shaughnessy
Oh absolutely. I thought about saying that, and just ran out of energy before
posting the article this morning.

The reason I used this particular SQL statement as an example is that
ActiveRecord (a Ruby ORM) generates it from a fairly simple, common Ruby
expression. I suppose ActiveRecord could be improved to drop the sort when you
know there's only one possible match.

~~~
talaketu
Unless "name" is unique, there could be multiple matches. Asking for the
"first" demands an ordering.

Perhaps (a) the user doesn't care what ordering is used, and/or (b) (such as
in your example in the preceding post) hasn't specified an order (and
apparently the ordering defaults to the PK - a model option?). It would be
error-prone for the ORM to take (b) as implying (a).

~~~
pat_shaughnessy
Yes, good point. Rails has a lot of conventions like this - that "first"
implies ordering by primary key (by default). I just imagine that most Rails
developers don't think about sorting and its implications when they ask for
the first record.

------
cheez
The Ruby AST -> SQL String -> PG SQL AST -> Plan process is a nice reminder of
why we should use prepared queries wherever possible.

~~~
mvc
I'm not a postgres hacker but as I understand it, query plans generated at
"exec time" can be better than those generated at "prepare time" because they
have more information to hand about the expected size of nodes within the
plan.

There are other reasons to use prepared queries but if performance is your
only one it may not be worth it unless your query is either very complex, or
runs in a tight loop.
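
The tight-loop case looks roughly like this (a sketch, with SQLite standing in
for Postgres; sqlite3 caches parsed statements per connection, so repeating
the same SQL text with different parameters reuses the parse, analogous to a
prepared statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(1000)])

# Same SQL text, different bound parameters: one parse/plan, many executions.
query = "SELECT id FROM users WHERE name = ?"
hits = sum(1 for i in range(1000)
           if conn.execute(query, (f"user{i}",)).fetchone() is not None)
print(hits)
```

With a trivial query like this, the per-execution planning cost being saved is
tiny, which is the point of the comment above.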

~~~
fdr
You are absolutely right. That's because when a parametric query arrives, the
parameters are unbound and the planner cannot take advantage of fit-to-purpose
selectivity estimation. It must instead estimate generically.

Newish versions of Postgres (9.2+ I believe) try to paper over this surprising
effect by re-planning queries a few times to check for cost stability before
really saving the plan. It has proved very practical.

See the Notes section of [http://www.postgresql.org/docs/9.2/static/sql-
prepare.html](http://www.postgresql.org/docs/9.2/static/sql-prepare.html),
reproduced here:

    
    
        Notes
        If a prepared statement is executed enough times, the server may
        eventually decide to save and re-use a generic plan rather than
        re-planning each time. This will occur immediately if the prepared
        statement has no parameters; otherwise it occurs only if the generic
        plan appears to be not much more expensive than a plan that depends
        on specific parameter values. Typically, a generic plan will be
        selected only if the query's performance is estimated to be fairly
        insensitive to the specific parameter values supplied.

------
netcraft
I've toyed with the idea in the past of creating a custom parser for an
alternative SQL syntax - although I doubt I have enough knowledge to do so.
Not necessarily revolutionary, mostly just a shuffling of normal SQL syntax to
make it more understandable, like moving the selected columns to the end. One
day, maybe.

~~~
jhallenworld
There is something called "relational algebra" notation:

    
    
      a * b           Cross-product / Join if column names match
      a + b           Union
      a(first=="joe" && salary>100000)  Select rows
      a[first,last,ssno]    Select columns / project
    

I hate SQL syntax and wish relational algebra was used instead.
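
The notation above can be sketched as a toy over lists of dicts (purely
illustrative; the function names `select_`, `project`, and `join` and all the
data are invented here):

```python
def select_(rel, pred):
    """a(pred) - keep rows satisfying the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, cols):
    """a[cols] - keep only the named columns."""
    return [{c: row[c] for c in cols} for row in rel]

def join(a, b):
    """a * b - natural join on shared column names."""
    shared = set(a[0]) & set(b[0]) if a and b else set()
    return [{**x, **y} for x in a for y in b
            if all(x[c] == y[c] for c in shared)]

emp = [{"ssno": 1, "first": "joe", "salary": 150000},
       {"ssno": 2, "first": "ann", "salary": 90000}]
dept = [{"ssno": 1, "dept": "eng"}, {"ssno": 2, "dept": "ops"}]

# a(first=="joe" && salary>100000), joined with dept, projected:
rich = select_(emp, lambda r: r["first"] == "joe" and r["salary"] > 100000)
result = project(join(rich, dept), ["first", "dept"])
print(result)
```

The operators compose as plain function application, which is the
composability complaint about SQL in a nutshell.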

~~~
lmz
Also agree. SQL is not easily composable. I wish for relational algebra
programs that can have an optimizer run over it and determine the optimal join
order etc.

The argument that SQL allows you to say what the result should be without any
reference to the order in which the steps should be done is fine, except that
it is hard to say some things in SQL. If C compilers can transform sequential,
imperative code to an equivalent optimized form, I don't see why relational
algebra compilers / optimizers cannot.

~~~
KMag
> I wish for relational algebra programs that can have an optimizer run over
> it and determine the optimal join order etc.

Optimal join order usually is a function of both the query and the data, and a
query optimizer inside your database doesn't necessarily find the optimal join
order. It uses heuristics (which often include particulars about the data
currently stored in the database) to find a join order that's hopefully better
than a naive query plan, in much the same way that optimization passes in a C
compiler use heuristics to generate machine code that's hopefully faster than
a naive translation to machine code.

> The argument that SQL allows you to say what the result should be without
> any reference to the order in which the steps should be done is fine, except
> that it is hard to say some things in SQL.

The GP is talking about a surface syntax change, not a change in the
underlying computation model. Using a relational algebra notation, the query
optimizer would have just as much freedom as an SQL query optimizer.
Relational algebra isn't any more inherently imperative than SQL is.

> If C compilers can transform sequential, imperative code to an equivalent
> optimized form, I don't see why relational algebra compilers / optimizers
> cannot.

It sounds like you want a semantic change that gives the query optimizer less
freedom than an SQL query optimizer. The GP is suggesting only a syntactic
change. The thing it sounds like you want does sort-of exist... many SQL
databases will let you inspect the query plan that their optimizer has
generated.

However, it sounds like you want some statically defined query plan. The
problem with this is that the optimal plan depends on the data that's in the
database at the time the query is run. For instance, a query optimizer can
look at a complex query with multiple constant WHERE clauses on indexed
columns, and use the indexes to quickly determine the size of intermediate
tables when deciding the order in which to perform joins. A query language
that statically defines a query plan cannot take advantage of this
information, unless you want to "re-compile"/"re-optimize" once a day or
something. However, if you trust the database to automatically re-optimize on
some schedule, then you've lost your static query plan and it seems you might
as well let the query optimizer regenerate the plan based on heuristics
created by the database developers rather than sticking to some static
schedule of recompilation/reoptimization points.

~~~
lmz
What I want is to be able to transform relations in a certain order for ease
of reasoning (and introspection of intermediate values), but then have the
optimizer transform it into whatever equivalent plan it can determine that
gives the same results. Of course, there may be places where the steps taken
will over-constrain the optimizer, but that's probably an acceptable risk (and
as long as we're wishing for things here, that should preferably be detected).

------
trevorriles
Good post! Looking forward to your next post about how indexes are handled.

~~~
pat_shaughnessy
Thanks :) Yea indexes are where some of the real "magic" of database servers
happens. It will be fun to write that up...

~~~
preillyme
Yeah I'm excited to read about indexes in PostgreSQL. Indexes also add
overhead to the database system as a whole, so they should be used sensibly —
I'm looking forward to seeing your approach for good index candidates.

------
dmunoz
If you enjoyed this post, dig into Pat's archive. He's written a handful of
posts that are in a similar style.

Indeed, I'm starting to feel like a shill given how often I praise it, but if
you have any interest in language implementations and don't mind Ruby as a
vehicle for some exploration, check out Pat's book Ruby Under a Microscope.

------
adamnemecek
If this tickled your fancy, you should checkout the book "SQL Performance
Explained" [http://sql-performance-explained.com](http://sql-performance-
explained.com)

------
Animats
Postgres did an awful job on that query. Small values of LIMIT should not
require a full sort, even when there's no index. I think MySQL has a special
case for small values of LIMIT.
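
The idea being described is a bounded-heap top-k pass instead of a full sort,
which can be sketched like this (illustrative only, not Postgres's actual
code):

```python
import heapq
import random

random.seed(0)
ids = [random.randrange(1_000_000) for _ in range(100_000)]

limit = 1  # i.e. ORDER BY id LIMIT 1

# One pass with a k-sized heap: O(n log k) instead of O(n log n).
top_k = heapq.nsmallest(limit, ids)

# Equivalent result via a full sort, for comparison.
full = sorted(ids)[:limit]

print(top_k == full)
```

For `limit = 1` the heap degenerates to a single running minimum, so the
"sort" is really just one scan.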

Looking forward to the next installment. That query is so simple that you're
not seeing what a real database can do. Let's see something with a JOIN and
lots of indices, so the optimizer can do some work.

~~~
davidw
My reading of the cost statements is that the sort doesn't matter at all: the
sequential scan is what took up virtually all of the time.

But I'm hardly a PG expert.

Excellent article in any event, it's really interesting to see how things work
under the hood.

~~~
Animats
Depends on the contents of the database. The query optimizer doesn't know, at
optimization time, how many hits such a statement will produce. Given a
database with only a few entries with the same name, the sort cost is tiny. If
there are a lot of hits, it's expensive.

~~~
yangyang
It can estimate though, using per-column statistics. ANALYZE (and the
autoanalyze background process) update these in PostgreSQL.
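
A toy sketch of the kind of estimate those stats enable (all numbers invented;
real Postgres statistics also include most-common-values lists and histograms,
which this ignores):

```python
# With no most-common-values information, an equality predicate like
# "name = ?" is commonly estimated as 1 / n_distinct of the table.
total_rows = 10_000
n_distinct = 2_500        # analogous to pg_stats.n_distinct

selectivity = 1.0 / n_distinct
estimated_hits = total_rows * selectivity
print(estimated_hits)
```

An estimate like this is what lets the planner decide before execution whether
the sort feeding a LIMIT will be cheap or expensive.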

These stats are a huge part of cost-based query optimisation, which all major
DBMSs do these days.

Details here: [http://www.postgresql.org/docs/9.3/static/planner-
stats.html](http://www.postgresql.org/docs/9.3/static/planner-stats.html)

