Hacker News new | past | comments | ask | show | jobs | submit login
Modern SQL: With – Organize Complex Queries (modern-sql.com)
165 points by MarkusWinand on Nov 24, 2015 | hide | past | favorite | 61 comments



Markus Winand is a great SQL expert. I'm really grateful for everything that he shares with us. He writes well, explains us the differences between the different SQL engines. And his books are fantastic references.

I highly recommend his "No to offset" tutorial http://use-the-index-luke.com/no-offset Should be common knowledge, but I still see offset being used way to often, even in core ORM frameworks :-(


A summary: "don't use offset unless you need to be able to navigate to specific pages, in which case you have to use offset". That "navigate to specific pages" use case is pretty common.


If page numbers are important to you, then you have to find a different method anyway because of the drifting issue.


Or just accept that the page won't be exactly "right", but close. The solution proposed in the article doesn't have a "good enough" option for pages, as far as I can tell.


If you can use cursor-based pagination, then definitely do so—but as you point out, you can't always do that. Further, unless you're using bigserials (i.e. incrementing ints), you can't check $last_seen_ID either. With UUIDs (whether v4 or a custom impl.) there's no way to check `WHERE id > $last_seen_id`.


Yes, but people need to understand how expensive it is to implement.


Yeah, sorry if I seemed to suggest otherwise. I thought it was a good article, and it should be more widely understood that offset isn't a free lunch.


If your query is complex enough to warrant a (non-recursive) CTE, it's probably complex enough to warrant breaking that CTE out into a separate view (or function, if it performs a write). Now you can independently test and reuse that view, which you can't do with a CTE.

(Bonus if you're in Postgres: Postgres optimizes across views, but not across CTEs. Go figure.)


Allow me to disagree. Before CTEs I would have agreed, but CTEs gain you the same in clarity (decomposition really helps with comprehensibility of complex queries) without the loss of having to chase down 5 different views in order to understand the one you're looking at.

My rule is to refactor into a view when and only when it's going to be reused.

> Postgres optimizes across views, but not across CTEs

Sadly true, and one of the few big gripes I have with Postgres. I half recall reading that they are reconsidering this decision, so there's hope.


> Sadly true, and one of the few big gripes I have with Postgres. I half recall reading that they are reconsidering this decision, so there's hope.

I've read quite a few people on the email lists claiming that this is a benefit of CTEs. How they justify that, I don't know. Sure, you get predictability of performance, but I have never once written a query worried about how predictably it performed, instead of how fast it performed.


It's a bizarre side-effect of lacking query hints. This is my interpretation of the sequence of events:

1. Postgres will never implement query hints, because query hints are evil

2. In the real world people need them anyway, so they (ab)use implementation details that offer some control over the query plan (WITH is one example, see also OFFSET 0 and other hacks)

3. Everyone starts relying on these implementation details, e.g. you'll find people actually recommending this as a way to optimize queries in the mailing lists

4. Now when someone asks to fix the planner, someone else will point out this will break all the queries that have been "optimized" by depending on the old behavior

Postgres is generally a fairly sane project though, so I have some confidence they will come to their senses. Eventually.


For the record, I have also read a quite a few people on the mailing lists strongly disagreeing with those claims.


Probably 30% of my queries are complex enough to have a CTE but none of them should be views. There is no reuse and views are opaque and exist in a single namespace. Views have their purpose but this isn't it.


> There is no reuse and views are opaque and exist in a single namespace.

I'm not sure what you mean by "opaque"; they encapsulate logic, which is a plus.

On the "single namespace" thing, that sort of depends on the RDBMS; e.g., in postgres, within a single DB, one can have multiple schemas, and there is nothing stopping a view in one schema from referencing objects in another schema, so views for one database may exist in separate schemas, which act as namespaces. Other DBs offer equivalent functionality, though the names of structures may be different.


For many of my use cases, this is a handy workflow. CTEs are just perfect for iteratively building up a query to what I need. Once it's done, I can either package the whole shebang or break out the really useful bits into views.

Notably, MSSQL does optimize across CTEs, so there's not as much pressure to break out to views on that platform. And if you're feeling especially perverse, you can make a CTE a view, which sorta... sporks the problem.


Disagree completely, CTEs and local temp/variable tables can serve an important purpose. I do think that a fair amount of CTEs can just be inner queries (not joins, well, joined queries...).


True in some cases, but I find CTEs invaluable for simplifying my ad-hoc/personal queries.


Updatable CTEs are great, for example if you want to set a column to 1 where the date is the latest day for each value in a different column, you would do something like this

  ;WITH cte AS(SELECT   
        ROW_NUMBER() OVER(PARTITION BY SomeVal   
                  ORDER BY SomeDate DESC) AS row,  
                * FROM SomeTable)

  UPDATE cte  
  SET id =  CASE WHEN row = 1 THEN 1 ELSE 0 END
Here is the DDL and DML in case you want to play around with this

  CREATE TABLE SomeTable(id int,SomeVal char(1),SomeDate date)
  INSERT SomeTable 
  SELECT null,'A','20110101'
  UNION ALL
  SELECT null,'A','20100101'
  UNION ALL
  SELECT null,'A','20090101'
  UNION ALL
  SELECT null,'B','20110101'
  UNION ALL
  SELECT null,'B','20100101'
  UNION ALL
  SELECT null,'C','20110101'
  UNION ALL
  SELECT null,'C','20100101'
  UNION ALL
  SELECT null,'C','20090101'


So I've never come across INSERT table SELECT values UNION ALL SELECT values.

Any reason to prefer that over INSERT INTO table (col, col) VALUES (v1, v2), (v1, v2), ... ?


No reason, I do use Row Value Constructor/Table Value Constructor, these were introduced in SQL Server 2008, while CTEs were introduced in SQL Server 2005, I think when I created this example initially SQL Server 2008 was just released so this would not have worked for a lot of people on 2005


As I very vaguely recall, not all DB systems have/had support for multiple VALUES statements.


You'll find INSERT SELECT UNION ALL interesting when you want to insert rows from multiple table sources. For example, when refactoring a two tables into a single table or loading data into a temporary table (BCP) before copying it into a target table.


SQL Server has a limitation of 1000 rows maximum allowed in a VALUES clause, so sometimes a SELECT ... UNION ALL is required instead (this limitation probably doesn't exist in Postgres, though).


One advantage of this syntax is that you can just run the select part, look at the data to make sure it is correct and then finally run the whole statement


Wow, what I liked on this website is that you can turn off social plugins, why not more websites do that...


It's actually "opt-in" (=default off) and common in Europe.

A German IT media website (heise.de) pioniered that approach in 2011: https://web.archive.org/web/20150318081341/http://www.heise.... ; the newer code: https://web.archive.org/web/20150424102213/http://www.heise....


The optimization fence in PostgreSQL is usually bad for SELECTs but it has it's own benefits. See this thread http://www.postgresql.org/message-id/201209191305.44674.db@k...


What I really want from SQL is the ability to do a single select with one or more parts of each returned record being an array (defined by a one-to-many join) and others simple scalars.

It looks like Postgres can do it [1] with its JSON capabilities, and to be honest its being a little while since I've dug into it, so things might have changed (I'm actually not even sure I'm phrasing the question correctly), but the old way of doing a single select and then looping over it in whatever code is calling the SQL and doing N more selects, one for each row to get the sub-array is horrible.

[1] http://bender.io/2013/09/22/returning-hierarchical-data-in-a...


Recursive CTE can be very fast in Postgres. I can create a closure table of 2 Million rows for indirect and direct relations i.e. All paths from 15000 source direct relations in about 12 seconds on my MBPro 1Tb ssd 16 Ram. The makes it practical to just recreate all paths as a materialzed view refresh instead of worring about complicated incremental graph logic such as moving branches. With such a closure view - any graph traversal query is lightning fast - no need for neo4j !


Can you post the code for this somewhere? Thanks!


You mean like

    SELECT key, value1,
      ARRAY(SELECT value2 FROM bar WHERE bar.key = foo.key)
        AS value2_array
    FROM foo;
?

(In Postgres; no "JSON capabilities" needed.)


In the old days, SQL operated on rectangular tables and produced rectangular tables ... which doesn't work so well for data that's better modelled as a tree or graph. But there now lots of ways to produced multi-valued output and to operate on graph-structured data. In addition to PG's JSON, you might checkout recursive unions - which supports a subset of graph navigation. But while SQL may have grown just a tad more expressive, I'm not sure that performance is up to snuff. And even if it's now possible to express some things in SQL, it doesn't mean it's easy or readable.


It works pretty well and fairly quickly. We use the array_agg and other json features in a Postgres to Elasticsearch ETL. Postgres is able to spit out the exact format that we need, so the application code isn't really doing much more than coordinating the reads and writes.


What you need is good old array_agg(expression) with group by. Json capabilities may be needed for more complex hierarchical structures.


> It looks like Postgres can do it [1] with its JSON capabilities

Postgres could do it long before its JSON capabilities, array_agg was added in Postgres 8.4 (2009) and hstore in 8.3 (2008)


Also, ARRAY(), since 7.4 in 2003.


Very interesting site, but does anyone know why Oracle and mysql have a so poor support of the standard ? It looks as if postgres was the only one trying to comply.


In oracle's case "poor support" seems to imply being more lenient than required (not needing recursive, allowing qualified names for CTE), whereas Postgres is just more strict.

Which seems to be a general with those two DBMS'...


Have you read the SQL99 or even 2003 spec? No database supports everything. It's a matter which features one cherrypicks. If you want to use lots of XML and ORMs inside your database or use Java as stored procedures than SQL 2003 spec is great. The enterprise grade database software Oracle, DB2, MSSQL and Postgres are all somewhat comparable.

MySQL with e.g. its InnoDB storage engine is very fast and web scale. Whould a materialized view or other advanced features be useful in MySQL? Sure. But it's a tradeoff. And many devs are willing to go with MySQL for websites.

And SQLite is different than the rest too. And then there are a bunch of NoSQL databases for various usecases.


I think you're glossing over some important differences there. Some databases really are more standards-compliant than others, and that Postgres would be more standards-compliant than Oracle or MySQL is 100% consistent with my experience.

For some use cases that's fine, sure. But I've also seen many developers come to regret choosing MySQL when, months or years down the line, they ran into issues that hadn't been apparent at first. And honestly I don't see how it's a "tradeoff" when Postgres can offer performance that's every bit the equal of MySQL.


Ans DB2, typically DB2 is ahead of the game with newer standard SQL.


But that's because IBM has a strong presence in the SQL standard committee and can nudge it to standarise (post-factum) things they already implemented in DB2.


How do you feel Oracle's support for "with" is lacking? I use it on a daily basis, and it seems to work fine for me.


I don't know for MySQL but I'd guess in the case of Oracle is that they have equivalents in their own dialect of SQL.


with has completely changed my usage of PostgreSQL. I am a newbie and being able to split queries into logical small steps with this has made SQL much more fun.


But don't you still get big long queries. Or can you make separate statements?


A long query written with CTEs is eminently more readable than the logically equivalent query written with nested sub-queries.

Just like a pipe operator makes long strings of function calls more readable than the semantically identical nested function calls.


Not sure what you mean?


Sometimes SQL looks like a good spot to break out the old CWEB literate programming tools...



A common pattern I still haven't solved is this: I have table A, and now I want to add table B and give it one row for each row in table A. Then I want to add a foreign key column to table A pointing at the row I just added.

Of course this is easy to do with a loop, but can you do it all in pure SQL? If you have a solution I would love to see it.

Here is my history of attempts:

    DROP TABLE IF EXISTS dogs;
    DROP TABLE IF EXISTS doghouses;
    DROP SEQUENCE IF EXISTS dogs_id_seq;
    DROP SEQUENCE IF EXISTS doghouses_id_seq;


    BEGIN;

    CREATE SEQUENCE dogs_id_seq;
    CREATE SEQUENCE doghouses_id_seq;

    CREATE TABLE dogs (
      id INTEGER PRIMARY KEY DEFAULT nextval('dogs_id_seq'),
      name TEXT NOT NULL
    );

    INSERT INTO dogs
    (name)
    VALUES
    ('Sparky'),
    ('Spot')
    ;

    -- Now we want to give each dog a doghouse:
    CREATE TABLE doghouses (
      id INTEGER PRIMARY KEY DEFAULT nextval('doghouses_id_seq'),
      name TEXT NOT NULL
    );
    ALTER TABLE dogs ADD COLUMN doghouse_id INTEGER REFERENCES doghouses (id);

    /*
    -- ERROR:  syntax error at or near "INTO"
    UPDATE  dogs AS d
    SET     doghouse_id = (
      INSERT INTO doghouses
      (name) VALUES (d.name)
      RETURNING id
    )
    ;
    */

    /*
    -- ERROR:  WITH clause containing a data-modifying statement must be at the top level
    UPDATE  dogs AS d
    SET     doghouse_id = (
      WITH x AS (
        INSERT INTO doghouses
        (name) VALUES (d.name)
        RETURNING id
      ) SELECT * FROM x
    )
    ;
    */

    /*
    -- ERROR:  missing FROM-clause entry for table "dogs"
    WITH homes AS (
      INSERT INTO doghouses
      (name)
      SELECT  name
      FROM    dogs
      RETURNING doghouses.id AS doghouse_id, dogs.id AS dog_id
    )
    UPDATE  dogs
    SET     doghouse_id = homes.doghouse_id
    FROM    homes
    WHERE   dogs.id = homes.dog_id
    ;
    */

    -- ERROR:  syntax error at or near "INTO"
    UPDATE  dogs AS d1
    SET     doghouse_id = h.id
    FROM    dogs d2
    INNER JOIN LATERAL (
      INSERT INTO doghouses
      (name) VALUES (d2.name)
      RETURNING id
    )
    ON true
    WHERE d1.id = d2.id
    ;


    COMMIT;


This script works in SQL Server - I would think you could do the same in Postgres, but I'm not as familiar as I'd like to be with Postgres.

Note that it does require 2 separate steps rather than 1 as you appear to desire, so may not work for you:

    IF OBJECT_ID('dbo.dogs', 'U') IS NOT NULL
        DROP TABLE dbo.dogs
    GO
    
    IF OBJECT_ID('dbo.doghouses', 'U') IS NOT NULL
        DROP TABLE dbo.doghouses
    GO
    
    CREATE TABLE dogs (
        id INTEGER IDENTITY (1, 1) PRIMARY KEY
       ,name VARCHAR(MAX) NOT NULL
    );
    
    -- Now we want to give each dog a doghouse:
    CREATE TABLE doghouses (
        id INTEGER IDENTITY (1, 1) PRIMARY KEY
       ,name VARCHAR(MAX) NOT NULL
    );
    GO
    
    ALTER TABLE dogs ADD doghouse_id INTEGER REFERENCES doghouses (id);
    GO
    
    INSERT INTO dogs (name)
    VALUES
        ('Sparky'),
        ('Spot')
    ;
    
    DECLARE @DogsAndHouses TABLE (
        doghouse_id INT
       ,name VARCHAR(MAX)
    );
    
    INSERT INTO dbo.doghouses (name)
    OUTPUT INSERTED.id, INSERTED.name INTO @DogsAndHouses
    SELECT
        d.name
    FROM
        dbo.dogs d;
    
    UPDATE d
    SET
        doghouse_id = dah.doghouse_id
    FROM
        dogs d
        JOIN @DogsAndHouses dah ON d.name = dah.name;
    GO


Yup, the OUTPUT keyword is the key to making this work. Before this was added to TSQL, the only way I knew of to make this work was by using a cursor and handling each record independently.


You could add a temporary column to Table B that stores the primary key of column A during the insert, then a 2nd query can insert the primary key of Table B into the corresponding rows of Table A by matching against the temporary column in B.

For example:

    INSERT INTO tableB (whatever, tempIdFromTableA) SELECT whatever, id FROM tableA;
    UPDATE tableA SET yourForeignKeyColumn = (SELECT id FROM tableB WHERE tempIdFromTableA=tableA.id)
... or instead of that update you could do it via a JOIN:

    UPDATE tableA INNER JOIN tableB ON tableA.id=tableB.tempIdFromTableA SET tableA.yourForeignKeyColumn=tableB.id;
Or you could use a temp table to achieve similar results (if you don't want to have to add and then remove the temp column from table B).


Your examples are confusing and I'm still not sure what you want to achieve.

From the description it sounds like you want to have two tables with 1:1 relationship (every dog has a single doghouse). Although this is different than typical 1:1, because both entries supposed to always match. In that scenario you should actually have a single table with two columns (dog name and doghouse name). It'll also be more performant (no need to use joins to fetch the data). There's no benefit to have two tables, especially when you're duplicating the data (doghouse.name = dog.name).

If every dog has to have dog house, and no dogs can share one, then this information is useless and there's no point for storing it.

If for example there are dogs without a doghouse, you can still use one table and have a column with boolean field to store that information.

If multiple dogs can share same doghouse, then you have a foreign key for dog, kind of the way you set it up, but you can't blindly insert data to two tables, because you still need to know which doghouse is shared by which dogs.

If you have multiple dogs and multiple dog houses and need to match one house with one dog, you use 3 tables (one for dogs, one for doghouses, and the third one containing primary keys for both of them that performs the matching. In that scenario you first need to populate dogs and doghouses tables and then you do the matching, so once again you shouldn't insert all data at the same time. This pattern also allows you to easily change which dogs reside in which doghouses.

If you absolutely need to insert to two tables at the same time (your example doesn't show justification for that) then you can do two inserts within a single transaction. You can wrap it in stored procedure or use triggers.


Predictably you're criticizing the question rather than answering it. That's boring. Dogs and doghouses is supposed to be sufficiently contrived that you don't get hung up on whether it's the correct model.

Although initially I'm creating one doghouse for each dog, the model will let dogs start to share doghouses. It's a one-to-many relationship, but to migrate the old data I need to do something, so I'm starting out by giving each dog its own doghouse.

Of course I can use stored procedures, but I'm asking if there is any way to avoid that.

"you can't blindly insert data to two tables, because you still need to know which doghouse is shared by which dogs." That is indeed the crux of the question. :-)


Wow, never knew this. I've been fine with just doing nested subqueries but this is definitely easier to read. Great to learn a helpful piece of everyday syntax before I've had my coffee.


Typically I advise people to use CTE's for ad hoc queries and to avoid CTE's for production code. CTE's are easily abused.


Easily abused and can throw the optimizer for a loop, especially if using them to reference within the CTE (not a recursive CTE but joining back).


One of the most useful things about CTEs is PostgreSQL's support of having DELETE ... RETURNING and UPDATE ... RETURNING in the the WITH clauses. This way you can easily implement moving rows from one table to another.


Nice reference.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: