
The best Postgres feature you're not using – CTEs aka WITH clauses - craigkerstiens
http://www.craigkerstiens.com/2013/11/18/best-postgres-feature-youre-not-using/
======
pilif
One thing to keep in mind: CTEs are defined as being optimisation boundaries
and databases are not allowed to optimise across CTEs. So CTEs have to always
be resolved first.

If you later join that CTE against some other table, eliminating most of the
rows fetched by the CTE, you will have wasted all the time for retrieving
these rows.

This means that you might get a worse plan by using a CTE instead of a direct
join.

It also means that you might convince the system to give you a much better
plan by using a CTE (if you can limit the total amount of rows by the rows in
the CTE instead of the tables you join it against).

Just keep that in mind when you're debugging query speed issues, but in
general, CTEs are good for very readable queries that are much easier to
understand and maintain than if you join inline.

The additional readability is for sure worth it to first try to solve the
issue with a CTE and only if you run into above problem and if the performance
is inacceptable to then convert the query to a more traditional (in a sense of
"what the various open source RDBMs provided until two years ago when Postgres
introduced CTEs") join.

~~~
pradocchia
_CTEs are defined as being optimisation boundaries and databases are not
allowed to optimise across CTEs._

Wait, no. I know for a fact that SQL Server will happily optimize across CTEs.
They are effectively syntactic sugar, and handled much like a view or derived
table.

So, either the SQL standard defines this optimization boundary and SQL Server
ignores it, or this is an idiosyncrasy of the Postgres implementation, or this
is an idiosyncrasy of some other implementation that you have assumed applies
to all implementations.

I honestly don't know. I wouldn't mind a syntactic construct that did define
optimization boundaries, but to my mind CTEs are not that construct.

~~~
pilif
Unfortunately I haven't read the SQL standard(s) and I also seem to have
trouble finding them in full on the internet.

However I was told on IRC in #postgres that not optimising across CTEs was
something mandated by the standard.

Of course I might have been told something that's not quite correct or I might
have misunderstood.

However in the context of postgres (which is what the article is talking
about), I know for a fact that CTEs are first fully resolved. This is what I'm
seeing in my query plans and it's what the manual says:
[http://www.postgresql.org/docs/9.3/static/queries-
with.html](http://www.postgresql.org/docs/9.3/static/queries-with.html) (at
the end of 7.8.1)

If I'm wrong what the standard is concerned, then I apologise.

~~~
bjourne
Yes, as far as the standard is concerned you are wrong. See
[http://dba.stackexchange.com/questions/27425/is-the-
optimisa...](http://dba.stackexchange.com/questions/27425/is-the-optimisation-
fence-behaviour-of-a-cte-with-query-specified-in-the-sql2). But it is a very
common misconception you can hardly be faulted for because it is the way
postgres does it and many people claims it is due to the sql standard.

Of the only annoying features in postgres, imho. You have a large query and
you try to break it up into smaller self-contained cte parts and voila,
performance goes crap.

~~~
einhverfr
As per discussons on the email lists, PostgreSQL does it this way because this
is the safest way to adhere to the standard. Presumably over time cross-CTE
optimizations will be added. The problem though is guaranteeing the stability
the standard requires without doing so. That's a technical issue, not a
standards issue though and you are quite right for pointing that out.

------
adwf
I love using small CTE's to help with the hassle of the alternative - creating
temp tables and then cleaning them up. They can be particularly useful for
data modification, as you can't ordinarily do a recursive delete or insert all
that easily.

PS: I always cringe when I see this sort of syntax though: "FROM users, tasks,
project"

Much clearer using explicit JOINS imo, especially if your goal is readability.

~~~
socialist_coder
Completely disagree. All the database developer "experts" I know of (including
myself) hate the explicit join syntax.

I prefer to list out your tables and then do your joins in the "where" clause.
For 3+ tables and mixing inner & outer joins it's especially better. I think
maybe this only looks better once you become really really good at reading and
writing SQL though. Maybe it's also because I come from the Oracle world where
it is the standard.

~~~
matwood
I'm the opposite. I hate joins in the where clause because it is semantically
wrong and confuses the building the set part with the filtering the result
part. I want to see how a table is joined into the query at the place it is
joined and additional filtering in the where. To me it makes it much easier to
follow and not miss a join item, especially if the join happens on multiple
columns.

I really hope you are not suggesting going back to things like <star>= and
=<star>

~~~
neilc
_I hate joins in the where clause because it is semantically wrong and
confuses the building the set part with the filtering the result part._

It is not "semantically wrong": conceptually, you can compute a join by taking
the Cartesian product and then applying the join predicates to the result,
which is actually what the implicit join syntax suggests. Semantically, there
is no difference between applying an (inner) join predicate and applying any
other predicate. There is also no actual division between "building the set"
and "filtering the result": different predicates might be applied in different
orders, before or after different relations in the query have been accessed.

~~~
socialist_coder
Thats how I think about it too. Thanks for the better explanation.

------
ef4
> For some reason most people throw out principles we follow in other
> languages such as... composability just for SQL

This is because the most common API for accessing databases is wrong and
doesn't support composition.

Imagine designing a general-purpose, rich API that has only one method. The
method takes one giant string argument. Insane? Absolutely.

SQL is actually awesome and amazing, but many programmers hate it because
their API for accessing it is horribly broken. The whole point of relational
algebra is that it's... an algebra. None of which is usefully exposed if
you're just mashing strings of SQL together.

~~~
bunderbunder
I'll one up you on that. . . the common approach of treating the database as
if it were an API that you shove string arguments into is wrong.

Imagine if you developed your website without ever editing a *.js file
directly, and instead just programmatically mashed together strings of
JavaScript code and emitted them inline in the HTML wherever necessary. You'd
probably end up generating much crummier JavaScript code, too. But should you
blame JavaScript or your Web application framework for the fact that your
client-side scripting is crummy? No, of course not. The real problem is that
you're horribly misusing your tools.

------
buckbova
mssql has a lot of "tricks" associated with CTE's. Using the OUTPUT clause on
updates is one of them.. Here's some queue code using a cte to grab the top
one by queued date.

    
    
      DECLARE @Queue TABLE (QueueID INT)
    
      WITH q AS (
      		SELECT TOP (1) QueueID, StatusID, ModifiedDate
    		FROM worker.Queue WITH (ROWLOCK, READPAST)
    		WHERE StatusID=(SELECT TOP 1 s.StatusID FROM worker.Status s (NOLOCK) WHERE s.[Status] IN ('Queued'))
    		ORDER BY QueueDate ASC
      	)
      	UPDATE q SET 
    		StatusID=(SELECT s.StatusID FROM worker.Status s (NOLOCK) WHERE s.[Status]='Processing'), 
    		StatusMessage='New Process',
    		ModifiedDate=GETDATE()
    	OUTPUT INSERTED.QueueID INTO @Queue

~~~
pradocchia
The OUTPUT clause in MSSQL doesn't require a CTE, and can be used with INSERT,
DELETE and MERGE in addition to UPDATE.

Likewise, your example can also be done with derived tables, which predate
CTEs in MSSQL:

    
    
        UPDATE a SET col2 = 10
        OUTPUT INSERTED.key1 INTO @local_table (key1)
        FROM (SELECT TOP (1) * FROM table1 ORDER BY col1) a

------
mindcrime
CTEs in general are fine, just beware of using recursive CTEs to model graphs
- and do graph operations - inside Postgres. My experience has been that the
performance is pretty abysmal in that case, for even fairly small graphs. IMO,
if you find yourself doing a lot of "graph stuff" just go ahead and store the
graph in Neo4J or Titan or something.

Note that I don't mean to argue against the point of this post... CTEs _are_
very handy. It's just that one of the more commonly cited uses for them is to
do graph stuff, and my experience trying to that, specifically, has not been
positive.

~~~
batbomb
Doing graph processing by explicitly populating a temp table is nearly as fast
in many cases as neo4j with the added benefit of a vastly improved analytics
experience if you plan to analyze the data or do range queries, etc... It is
more work to do it smartly and efficiently, but it is possible. I'll try to
find a python script I wrote as a PoC with SQLite.

CTEs and recursive queries are okay for many cases, but they can blow up your
RDBMS temp space easily if you aren't careful. Especially useless is a non-
materialized view based on them.

~~~
aidos
I'd be interested in seeing how you approach some of the things that graphs
are better at. I have an understanding of creating efficient tree structures
in a relational model - I haven't thought about the more general case of
graphs though.

I was recently involved in a project that was using neo4j as a datastore. In
the end we switched to postgres because the cognitive load was lower, the
tooling was better and the relationship management was easier and more
explicit. The trade off is going to be that later the relationship exploration
will be a little harder. My plan was to add a graph datastore back in to
mirror the relationships in the system so we could query it.

------
seanwoods
The topic of CTEs was recently discussed (with much vigor) on the SQLite users
mailing list:

[https://www.mail-archive.com/sqlite-
users@sqlite.org/msg8119...](https://www.mail-archive.com/sqlite-
users@sqlite.org/msg81199.html)

~~~
fphhotchips
"as long as the 'RECURSIVE' part of CTE is ignored."... To me, the recursive
part of CTEs is half the point! All the other stuff is syntactic sugar that
can be done elsewhere, but recursion is a function that I'm unable to find
another method for elsewhere in the language.

Note: I have far more SQL Server experience than I do with any other DBMS.

------
AznHisoka
"In most cases I have seen performance differences smaller than a 2X
difference, this tradeoff for readability is a nobrainer as far as I’m
concerned. "

I wouldn't call a 2X difference in performance a "no brainer". Far from it. It
could be the difference between a visitor trying your product and him leaving
b/c it takes too long to load.

~~~
risratorn
I'd agree but it really depends to be honest.

A 100% increase in query time for a query that originally takes 50ms is
neglectable, for a query that takes 500ms it will indeed make a vital
difference.

------
socialist_coder
All SQL databases have this feature, it's nothing specific to PostgreSQL.

I don't like the "with" syntax though, I just do it all inline.

    
    
      select * from
      (
        select * from blah 1
      ) a,
      (
        select * from blah 2
      ) b
      where a.x = b.x;

~~~
adwf
The advantage of CTEs is they have the explicit "WITH RECURSIVE" option.
Whereas an aliased SELECT statement can't refer to itself as far as I know.

~~~
twic
Recursive CTEs are super amazing. One of the great underused features of SQL.
Give me tall PostgreSQL and a WITH RECURSIVE to steer by, and i will happily
take on any graph database.

Not promising i'll win, mind.

------
poolpool
I don't know about postgres but joining CTEs in Sql Server is a recipe for
blowing up your tempdb or at the very least waiting for ages for a trivial
query to finish.

~~~
endianswap
You must be doing something wrong then. MSSQL resolves the entire statement
(CTEs plus final query) into a single query that it optimizes as a whole.

Can you share an example of two queries that are equivalent where one is
written using CTEs and it doesn't construct the same plan?

