Modern SQL in PostgreSQL (slideshare.net)
446 points by lelf on Feb 8, 2015 | 130 comments



One of my favorite idioms lately for doing bulk updates without incurring lock contention is to chain CTEs, like so:

  with candidate_rows as (
     select id
       from some_table          -- placeholder table name
      where conditions          -- placeholder predicate
      limit 1000                -- batch size; tune to taste
        for update nowait       -- fail fast if another session holds a lock
  ), update_rows as (
     update some_table
        set some_column = 'new value'
       from candidate_rows
      where candidate_rows.id = some_table.id
  returning some_table.id
  )
  select count(1) from update_rows;
...and loop on issuing that query until the "count(1)" returns zero for some number of iterations (three works pretty well).

Want to add a column to your "orders" table and populate it without blocking concurrent writes for as long as it will take to rewrite a multi-million row table? Want to re-hash user passwords using something stronger than MD5, but not prevent users from ... you know, logging in for the duration?
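
For the first case, a concrete instantiation of the statement above might look like the sketch below (the "orders" column names are hypothetical); the application reissues it, each run in its own transaction, until it reports zero a few runs in a row:

  -- backfill a newly added orders.total_cents column in batches of 1000
  with candidate_rows as (
     select id
       from orders
      where total_cents is null
      limit 1000
        for update nowait
  ), update_rows as (
     update orders o
        set total_cents = (o.total_dollars * 100)::bigint
       from candidate_rows c
      where c.id = o.id
  returning o.id
  )
  select count(1) from update_rows;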

CTEs are all that and the bag of chips.


Do you have any resources for learning cool advanced SQL like this? I've used CTEs in the past, just from reading about them in the documentation, but I feel like I'm only scratching the surface of what's possible out there.


http://www.postgresql.org/docs/

Just read it. Seriously. Literally cover to cover, or nearly so. (Maybe simply scan the C-related sections if you're not into that, but do take note that they exist. Similarly for some of the pl/ sections.) It might take you a few evenings, but you won't regret it. The Postgres docs are some of the best out there...


Other than The Fine Manual, linked by a sibling comment, not really. Everything I've learned, I've just picked up along the way.

EDIT: When I wanted to learn about CTEs, for example, I conveniently had a large, ugly materialized view that needed refactoring, and they happened to fit the need perfectly. The previous version was, in places, five or six layers of subqueries deep, many of them repeated several times as they were reused. It hurt to read. Rewriting it to use CTEs made it about eight times faster, and the SQL script was a third the size of the original.


The author of the slides, Markus Winand, wrote the book "SQL Performance Explained".


The book appears to me to be the contents of use-the-index-luke.com. A friend of mine impulse-bought it when I told him I was planning on buying it after confirming that it had more content than the website, but after skimming the table of contents it doesn't appear to have anything more than the website.

Still, it's good to support Markus' effort, and if you prefer having a PDF rather than going to the website, just buy it. It's like $15.


Quoting from http://Use-The-Index-Luke.com/ (the main page):

Use The Index, Luke is the free web-edition of SQL Performance Explained. If you like this site, consider getting the book. Also have a look at the shop for other cool stuff that supports this site.

http://use-the-index-luke.com/shop


I was in his Modern SQL talk at FOSDEM on Saturday, he clearly knows his stuff. His book is definitely on my list.


X-comment so you get it too:

Quoting from http://Use-The-Index-Luke.com/ (the main page):

Use The Index, Luke is the free web-edition of SQL Performance Explained. If you like this site, consider getting the book. Also have a look at the shop for other cool stuff that supports this site.

http://use-the-index-luke.com/shop


Second that request. What can I read to go beyond my SELECT * from t order by c-level knowledge?


Apologies for the self-plug, but you might find some stuff in the later-stage Aggregates and Recursive exercises on http://pgexercises.com . There's a bunch of stuff on CTEs and window functions in there.


I like Date's SQL and Relational Theory (http://shop.oreilly.com/product/0636920022879.do). It won't tell you about the latest changes to the standard, but it will help you get a handle on the relational model, which is the key to using SQL intelligently.


"Want to add a column to your "orders" table and populate it without blocking concurrent writes for as long as it will take to rewrite a multi-million row table?"

PostgreSQL ALTER TABLE ... ADD COLUMN is an O(1) operation, and requires no data rewrite (as long as you are OK with NULLs).

You can even add an attribute to a composite type, and existing tables using that type will see the extra attribute. Again, O(1), no data rewrite.


Yeah, it's not about adding the column; I know that's instantaneous. I'm talking about the UPDATE you'll have to issue to populate the column once it's been added — unless you're okay with it staying NULL (or doing some sort of "lazy loading" is viable in your specific scenario). If you try to do that for the entire table in one go, you're contending with concurrent writes, incurring a lot of IO at once, and as much as doubling the disk size of your table.

This way, you're hitting a limited number of rows per iteration, significantly softening the IO impact (granted, at the expense of wall-clock time); the NOWAIT fails fast on rows that have write locks out against them when you try to grab them; and you leave a radically smaller number of dead tuples (an indexed target column rules out HOT updates), particularly if you up-tune autovacuum on the table in question while doing this (though that gives back some of the reduction in disk IO).


Could you elaborate on why there'd be a smaller number of dead tuples? Are you just taking into account the autovacuuming that can take place as you continue through the iterations?


Primarily, yes. Once autovacuum is marking dead tuple space as reusable (dead tuple space that was largely created by earlier iterations of the loop issuing your query), subsequent iterations can use that space for their new tuples, and the table should reach steady-state size, or at the very least, grow much more slowly. This is even more likely when you make autovacuum more aggressive on the table in question while you're doing this.

Additionally, however, if you're bulk rewriting the table in one query, autovacuum has no (or very limited) opportunity to mark dead tuples as truly dead — especially if there's any lock contention going on, so other, older xids are waiting for your bulk update to complete — so in the worst case your table is half dead tuples, and twice the size it needs to be.
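
For what it's worth, "making autovacuum more aggressive on the table in question" can be done with per-table storage parameters; a minimal sketch (table name and values are illustrative):

    ALTER TABLE orders SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead rows
        autovacuum_vacuum_cost_delay   = 0      -- don't throttle the vacuum worker
    );
    -- ...run the batched updates...
    ALTER TABLE orders RESET (
        autovacuum_vacuum_scale_factor,
        autovacuum_vacuum_cost_delay
    );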


Really? I had the impression that if you added a column to a composite type, the rows that have that composite type as a column would be rewritten (tested on 9.3 a couple of months ago)?


> re-hash user passwords using something stronger than MD5

Isn't this only possible if you do it on user login - unless you are also cracking user passwords... Actually has anybody cracked their own MD5 user passwords to upgrade them?


You can sort of upgrade hashes in place at the cost of extra bookkeeping.

Take the old hash, and use it as the input to a new, stronger hash. Mark some column to indicate you've done this. Then the next time the user logs in, you calculate the old hash and then the new hash from that. Once you validate the password, use it to calculate a new hash directly from the password, put that in the field, and clear the update column.
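
A sketch of that scheme in SQL using pgcrypto's bcrypt support (the users table, its password_hash/hash_scheme columns, and the username are all hypothetical; the bulk UPDATE would normally be run through the batched CTE pattern from upthread, and the final re-hash from the plaintext happens in application code at the next login):

    -- requires the pgcrypto extension for crypt()/gen_salt()
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- wrap every legacy MD5 hash in bcrypt, in place, and record the scheme
    UPDATE users
       SET password_hash = crypt(password_hash, gen_salt('bf', 12)),
           hash_scheme   = 'bcrypt-over-md5'
     WHERE hash_scheme = 'md5';

    -- login check for an upgraded row; md5() produces the same lowercase
    -- hex digest the legacy hashes were stored as
    SELECT id
      FROM users
     WHERE username = 'alice'
       AND password_hash = crypt(md5('submitted password'), password_hash);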


You don't even really need the bookkeeping: you can update the hash in place as you say, then just try both, e.g. bcrypt(password) and bcrypt(md5(password)). It would be even simpler to store all future passwords as bcrypt(md5(password)), but I'm not sure of the security implications of that as a permanent measure and so wouldn't do it personally.


Security-wise, bcrypt(md5(password)) is at least as strong as bcrypt(password). However broken md5 is, it is certainly no easier to find a preimage than when you already have the preimage. (Edit: now that I think about it, it could be worse if there is a side channel in your md5 implementation.)

If you do it that way you'll pay a performance cost on every password check forever. I suppose that trade-off might be worth it in some cases.


So much of password hashing is a deliberate performance cost anyway that it hardly matters if your password passed through md5 or sha{1,2} first. bcrypt will eclipse them all in relative cost, as it should.


This is exactly what Facebook does. They've done it several times over, such that a plaintext password goes through about 8 steps before making it into the database.


You might be right. It was a while ago that we did that, and I may have conflated it with another bulk update of our "users" table.

Either way, I've used this idiom at least half a dozen times in production, all without any downtime or user-visible effect, and we have millions of orders, SKUs and users.


Is this really different from using a transaction with separate read (using an update lock) and update statements, and looping over that in the same way from the application until the count is 0?


Yes. That incurs many of the consequences of doing them all in one query (particularly lock contention and dead tuple bloat), and takes longer — making those consequences worse than doing it in a single query.


I was interested in NoSQL. Went to a Mongo presentation. Both examples would have been easier and faster in SQL, even the SQL I used 20 years ago. I asked for an example that would show a performance advantage. I got a tired, vague statement about the supposed performance advantage. Seems like snake oil to me.


Unfortunately you were downvoted, but it wasn't so long ago on HN that every second story was NoSQL this, NoSQL that. There were even "SQL is dead"/"relational DBs are dead" posts, just ridiculous. So it's nice to see stories like this.

I've been writing a lot of recursive queries for Postgresql lately using CTEs. Quite cool though a little mindbending at times.


I remember those days. I was never sold on NoSQL, but if anything can be said, it lit a fire under SQL and competition in the space was ultimately a good thing.

Glad I didn't go all in on NoSQL, though.


HN is very fad-oriented.


Except some fads turn out to be long-term trends, so it's worth keeping one eye open.


Absolutely, but it pays to be very skeptical and to pay closer attention to the negative "I have used this technology and it sucks because [links to the bug tracker]" articles than the ones that gush about how great they are.


There's probably a trendy Hacker News technology lifestyle cycle chart to be drawn, step 3 or 4 of which is "developer is bitten by deficiency in the technology, writes blog post saying it sucks and not to use it, gets 200 points and front page".


The main reason I read HN is to keep up with whatever manic fancy will catch the developers' eyes this week, leading to me supporting it for a couple of years.


"NoSQL" is generally a misnomer. It's not SQL that is/was the problem, but that there are a lot of cases where specific common properties of RDBMSs are limiting. The NoSQL moniker is a result of the fact that most of these RDBMSs use SQL as the query language, and most of the "new" database engines do/did not.

Since then, a lot of the RDBMSs have adopted features that have reduced the gap. E.g. Postgres' rapidly improving support for indexed JSON data means that for cases where you have genuine reasons to have data you don't want a schema for, you can just operate on JSON (and best of all, you get to mix and match).
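
A minimal mix-and-match sketch (the table and columns are made up): fixed relational columns plus a schemaless jsonb attribute bag, with a GIN index so containment queries stay indexed:

    CREATE TABLE products (
        id    bigserial PRIMARY KEY,
        name  text  NOT NULL,
        attrs jsonb NOT NULL DEFAULT '{}'
    );

    CREATE INDEX products_attrs_idx ON products USING gin (attrs);

    -- a relational predicate and a JSON containment test in the same query
    SELECT id, name
      FROM products
     WHERE name ILIKE '%phone%'
       AND attrs @> '{"color": "black"}';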

For some of the NoSQL databases that puts them in a pickle because they're not distinguishing themselves enough to have a clear value proposition any more.

But it is not the lack of SQL that has been the real value proposition.


I love Postgres but its support for sharding, multi-master, and most forms of scaling that aren't just "buy a bigger box" is still way behind most of the NoSQL solutions.

Lots of use cases don't need that kind of scalability, but if you do, then Postgres can be more difficult to work with.


Skype used to run entirely on a Postgres cluster before Microsoft bought them. There are lots of examples of large Postgres clusters in the wild. Have you considered hiring an experienced Postgres admin? These types of setups are not impossible.


Our current usecase for NoSQL is that CouchDB can be replicated to mobile devices - I'm not aware of Postgres being able to do this. So I think there still are cases where it can be useful.


Do you have an example of how to use this? I'm interested in analogs of the way Firebase works.


All you need is Couchbase Mobile. I'll make a blog post about its use when I get around to it.

http://developer.couchbase.com/mobile/index.html


http://pouchdb.com/ is one option.


(accidentally posted this to the root of the story)

Postgresql 9.4 with jsonb sends Mongo to the dustbin, IMHO. If you have to write it in JS close to the data, or if plpgsql is too steep a learning curve, you can play with the experimental plv8.

But you should really pick up plpgsql: it's "Python" powerful, with an awesome db behind it (and it has a Python 2.x level of API consistency, sadly, but the docs are very good). There is a great Sublime 2 package that makes writing and debugging functions in one file just awesome. Write an uncalled dummy function that has a lot of the API in it at the top of your file, and you'll get autocomplete on that part of the API.

Specifically, do not miss getting acquainted with json and hstore, in particular using json as a variable-size argument passing and returning mechanism; it's just hilariously effective.

Cheers, and keep making this place (not only HN, our blue dot) better, F


It sounds like you are talking about plpython rather than plpgsql.


Sorry for the confusion; no, I'm really referring to plpgsql, which is quite a powerful language. The APIs for the essential extensions such as hstore, and for integrated types such as json, arrays, strings etc., have evolved over several years, so they lack a little consistency in naming conventions, something Python 2 also suffers from IMHO. plpython is quite cool too, but using that, or JS with plv8, IMHO obscures some of the power of the underlying db server.


> Both examples would have been easier and faster in SQL.

That's easy, so long as we mean the whole project when we say "faster". When I worked at Timeout.com they were importing information about hotels from a large number of sources. For some insane reason, they were storing the data in MySQL. Processing was 2-step:

1.) the initial import was done with PHP

2.) a later phase normalized all the data to the schema that we wanted, and this was written in Scala

The crazy thing was that, during the first phase, we simply pulled in the data and stored it in the form that the 3rd party was using. That meant that we had a separate schema for every 3rd party that we imported data from. I think we pulled data from 8 sources, so we had 8 different schemas. When the 3rd party changed their schema, we had to change ours. If we added a 9th source of information, then we would have to create a 9th schema in MySQL. We also validated the 3rd-party schema at this phase, which struck me as silly: it did not mean that step 2 could be innocent of the schema; rather, both step 1 and step 2 had to know the structure of those foreign schemas. But it was necessary because we were writing to a database that had a schema.

The system struck me as verbose and too complicated.

It's important to note that most of the work involved with step 1 could be skipped entirely if we used MongoDB. Simply import documents, and don't care about their schema. Dump all the data we get in MongoDB. Then we can move straight to step 2, which is taking all those foreign schemas and normalizing them to the schema that we wanted to use.

For ETL situations like this, NoSQL document stores offer a huge convenience. Just grab data and dump it somewhere. Simplify the process. Your transformation phase is the only phase that should have to know about schemas; the import phase should be allowed to focus on the details of getting data and saving it.


You can do that in SQL databases too: store the XML / JSON / whatever in a blob. There is no need to have a normalized import schema, especially since you are doing the transformation using an external application (Scala program).
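
For example (hypothetical names, Postgres 9.4 jsonb assumed), the landing table can be completely ignorant of each provider's schema, and only the normalization step needs to interpret the shapes:

    CREATE TABLE raw_hotel_imports (
        id          bigserial   PRIMARY KEY,
        source      text        NOT NULL,   -- e.g. 'provider_a'
        payload     jsonb       NOT NULL,   -- the document exactly as received
        imported_at timestamptz NOT NULL DEFAULT now()
    );

    -- the import step just dumps whatever it got
    INSERT INTO raw_hotel_imports (source, payload)
    VALUES ('provider_a', '{"hotel": {"name": "Grand Budapest", "stars": 4}}');

    -- only the normalization step knows the provider-specific shape
    SELECT payload #>> '{hotel,name}'         AS name,
           (payload #>> '{hotel,stars}')::int AS stars
      FROM raw_hotel_imports
     WHERE source = 'provider_a';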


I'm not about to go to bat for Mongo in an SQL thread (there's plenty of problems with that platform that are real), but I rather enjoy their query syntax, it's very AST-like, even through the aggregation pipeline.

I don't believe it's probably much faster even in the best case (and I'm sure an experienced SQL expert wouldn't find it any "easier"), but on a grammatical level I do find it a fresh take on query structure, and writing queries and map-reduce jobs in coffeescript was extremely satisfying because of how terse, elegant, and pseudocode-like it turned out.


MySQL (and thus MariaDB I presume) and SQLite seem to be pretty poor in supporting these "new" features.

SQLite seems logical because it needs to be kept lean for embedding purposes, but do people know why MySQL is lagging behind so much?


It seems to be a philosophical difference. A lot of people seem to think that the database should only support primitive operations, and the rest should be left to application code.

I am not one of those people -- I think a good database system (like postgres) can make many things dramatically simpler.


In the 90s the MySQL guys believed that transactions and referential integrity should be done in the application...


Maybe you need just key-value access and the rest to be in the app, maybe you need just file access and the rest in app, maybe you need just hd-block-access and the rest in app etc


My experience with PostgreSQL and MySQL is that PostgreSQL supports lots of features, but performance needs a very careful eye - often a neat feature isn't as useful as it seems because optimization isn't good enough. Whereas MySQL tends to be very good at what it can do, and makes it easy to do silly things, like turn off referential integrity in a slow query to boost performance.

It almost feels like a worse-is-better story. As a programmer, PostgreSQL is much better to work with; more tools, better EXPLAIN, more features, more types, more of almost everything. But to use in the heat of battle, it's less clear-cut. PostgreSQL's replication story is complicated. MySQL master-master replication is fairly easy to set up, and if you use a master as a hot failover, it all mostly just works; when the primary site comes back up, it resyncs with the failover. PostgreSQL has a lot of different replication stories - without a strong central narrative, it's hard to gain confidence.


it all mostly just works

Yes, that sums up MySQL pretty well.

And then one day, one of your masters segfaults for no discernible reason. And when you restart it, the replication process (or even the InnoDB recovery) fails to resume with a generic error message that you can't find any useful information about in the mess that MySQL calls "documentation".

That's when you realise that "it all mostly just works" is really not what you want from your database.


I did some tests with Postgres recently to look at query optimization. I was pretty impressed, especially where it applied the correct index on an inline view (2nd example):

* http://mxblog.eu/link/post/sql-is-about-what-data-you-want-n...

* http://mxblog.eu/link/post/impact-of-inline-views-on-query-p...


I've never been with a company that didn't have serious problems with MySQL replication. At least once a month master-master replication breaks horribly and must be manually fixed; often by having to delete and remove data. MySQL replication half-works at best.


I look forward to the day that happens at my work then :)

To be clear, our master/master replication is strictly read-only at one site and read-write at the other site, and never read-write simultaneously. We have yet to see issues in production under fairly hefty write load, and we've failed over numerous times, and back again. But it's only been 10 months or so.


This doesn't match my experience at all. MySQL performance is highly variable, and it's incredibly easy to write simple queries that have truly abysmal performance.

Postgres on the other hand, has always tended towards making things work well (a step that mysql often skips), and then work quickly thereafter.

A lot of MySQL acolytes say Postgres is slow because, unlike MySQL, it doesn't ship with unsafe defaults. MySQL doesn't just allow you to do dumb things, it starts off with many of those settings as the defaults.

To me, the real problem is that people like barrkel exist. He doesn't know what he's on about, but he likes MySQL. Most of what he wrote is flatly false, but he said it confidently. And he's employed someplace that probably uses MySQL.

MySQL got adoption for two reasons:

1) it used to be easier to install; and

2) it has unsafe defaults that mean if an idiot runs a benchmark, it wins.

That's it. That's how it won the market. After that, it was network effects, and nothing else. MySQL is a turd. It requires substantial expertise to use MySQL because it is such an awful and dangerous tool. It slows you down as you get better. But most of the people who use it don't know any better, or (like barrkel) they spew nonsense that is the opposite of reality. So it wins.

Network effects suck.


He doesn't know what he's on about, but he likes MySQL.

You're putting words in my mouth that I didn't say. I don't like MySQL. I prefer PostgreSQL. And I have had a few rough times optimizing some queries in PostgreSQL, whereas I've had fewer such bad times with MySQL, despite using it more often. It's anecdata. Take it for what it's worth.

Time sinks in MySQL have come more from its crappy defaults, from its bizarre error handling (or lack thereof) in bulk imports, and most recently, a regression caused by a null pointer in the warning routine.

If I were working on my own project, I'd probably go with PostgreSQL and figure out the replication story. But I'm not. I do use PostgreSQL on my personal projects.

(If there was one feature I'd add to PostgreSQL, it would be some means of temporarily and selectively disabling referential integrity. Not deferring it, not removing and readding foreign keys, just disabling. The app I work on does regular 10k-1M+ row bulk inserts, usually into a new table every time (10s of thousands of tables), but sometimes appending to an already 100M+ row table. It would be nice to have referential integrity outside of the bulk inserts, but not pay the cost on bulk insert.)


You can disable all triggers in a session.

   set session_replication_role='replica';
If memory serves me correctly. Foreign keys are maintained by triggers.


http://stackoverflow.com/a/2681413

SET CONSTRAINTS ALL DEFERRED

Then do your inserts, followed by whatever work needs to be done without referential integrity, in the same transaction.
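
A sketch of that pattern (hypothetical tables). Note it only affects foreign keys declared DEFERRABLE, and the checks still run at COMMIT, so it defers the cost rather than removing it:

    BEGIN;
    SET CONSTRAINTS ALL DEFERRED;

    -- within the transaction, child rows can go in before their parents
    INSERT INTO line_items (order_id, sku, qty) VALUES (42, 'ABC-1', 3);
    INSERT INTO orders (id, customer_id)        VALUES (42, 7);

    COMMIT;  -- deferred foreign-key checks fire here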


[flagged]


>> performance needs a very careful eye

> Clearly implying that performance doesn't require a careful eye on MySQL

This is highly ironic, since you later say:

> you suck at logic

Funny chap!

(My point was that all the extra features in PostgreSQL give you more solution space, but not all of the solution space is suited to your problem. You have to watch out you're using the specific features in a way that pays off. Whereas MySQL is a bit like a hammer; you need to shape your problem like a nail, otherwise you won't be able to hit it at all. But when you do hit it, it mostly works. It's not a precision instrument though.)

>> makes it easy to do silly things, like turn off referential integrity in a slow query to boost performance.

> Which is really misleading, because it ships with most of that stuff off.

MySQL ships with referential integrity turned on - InnoDB is the default back end. Care to explain more?

(Personally, I wouldn't use MyISAM for my phone book, never mind in production.)


" InnoDB is the default back end"

This comment reveals that you're utterly and completely ignorant. If you think InnoDB treats your data with respect by default, you've clearly never actually paid close attention to anything.


As a MySQL DBA, I can tell you that neither MySQL nor Postgres are slow. MySQL has less features than Postgres and that was intentional. MySQL was supposed to be a basic feature set that serves 90%-95% of websites and has easy to set up replication. MySQL has always had more the mentality of light weight and fast. Postgres is a fine database. It has a lot of features I would like to see in MySQL. However, Postgres is still fighting issues with regular replication while MySQL is now progressing to sync/group-replication.

The problems that Postgres's extra features solve can be handled in MySQL, but they take some work. Sometimes they require me to use temporary tables and pre-calculated summary tables.

MySQL's speed varies only with the queries that are run against it. MySQL is "simple enough" for developers to write queries for, and in 10%-20% of those cases, those queries could be a bit more optimal or the data model could use some more tweaking. In terms of getting things done, you can do a whole lot before needing someone like me to come along and tweak things.

A performance audit from someone like me every 6-9 months after your company's website has been in use for >3 years can be perfectly fine.

Also the main reason MySQL won out back in the 90s was because they had better documentation and answered questions on their forums faster.


> To me, the real problem is that people likebarrkel exist.

Seriously? you're getting worked up about a database and you conclude that it would be better if certain people didn't exist?

I'd like to remind you that this is HN, not the Linux kernel dev list. For all its flaws, HN still is about civility. You just wished someone out of existence over a database, can you please stop that? Grow up!


I have noticed that comments here are getting abusive. I got replied to with a post full of profanity for making an observation on GNOME, without taking sides in the argument.

It makes for an uncomfortable community.


He got worse. https://news.ycombinator.com/item?id=9022358

He created that account to "go after" me with a lot of vitriol. It's a bit mysterious though. Why would someone work themselves up into such rage over a database? Normally, you'd explain this as teenage frustration or something. But he's clearly very unhappy, lashing out.

More mystifying than offensive, since it's impossible to take seriously. How can you deal with these people?


It can be quite upsetting though, particularly with the vitriolic posts from complete strangers. If it was from someone I'd known for years, I would cope with it better, because you can just shrug and agree to disagree; there's no need for real offensiveness in any situation, though.

But from complete strangers who are hammering profanity and abuse into their keyboards as hard as they can, that is quite unnecessary and doesn't make for a nice community. If someone was speaking the things some people on here post to my face, I'd be incredibly offended and they wouldn't be the sort of person you'd want to work with, hang around with or even live next door to. But they don't seem to mind typing it????

These posts rear their head in C++ articles and anything to do with OSX it seems. Really disappointing.

I suppose the best way to deal with it is just detach from it for a while, use another forum, go outside, look at the birds or stroke a cat or something.

There's something very therapeutic about picking up a fluffy cat (I have 4 British Shorthairs, great for fussing) or simply watching sparrows and small birds go about their business in the dust or seed feeders. They continue working without worries, but work hard to survive still and seem happy about it (as far as a bird can be happy). I find it a contrast to us sat in yellow-lit offices with deadlines, stresses, possibly incompetent managers/colleagues and concerns about our existence/paying bills etc.


Speaking as someone who made the switch from MySQL to Postgres... I'm just happy that I can now trust subqueries and schema updates to not blow up my database. And I don't know how I got by without lateral joins and hstore.
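
For anyone who hasn't met LATERAL yet, a small sketch (hypothetical tables): a per-row subquery that can reference the outer row, here pulling the three latest posts for each user:

    SELECT u.name, p.post_id, p.created_at
      FROM users u
      CROSS JOIN LATERAL (
          SELECT post_id, created_at
            FROM posts
           WHERE posts.user_id = u.user_id
           ORDER BY created_at DESC
           LIMIT 3
      ) p;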


I would also say MySQL was adopted by lots of people because for a long time it had a better replication story than Postgres. Easy-to-set-up replication was a pretty killer feature, even if it didn't always quite work as expected.


I think "easier replication" and "unsafe defaults" kind of go hand in hand...


MySQL also had much cheaper connection setup costs, which was a big deal for the CGI and PHP crowds who set up the world at the beginning of every request.


This is very true. If they used something like pgpool-ii it'd be fine. But that would require more effort for the server administrators.


Why are you coming down so hard on barrkel, it seems like you essentially agree!?


A simpler and faster product won? I am truly shocked.

And the issues around "safety" stopped being a concern for most developers a decade ago when the ORM was invented. So this idea that you need "substantial expertise" to use it is simply ridiculous.


Oracle influence? Postgres is the leader now in mindshare and features for the open-source offerings. If you like MySQL you could check out MariaDB.


> but do people know why MySQL is lagging behind so much

Because for 99% of the MySQL user base there isn't a need for these features.

Those using MySQL at scale are using it as a dumb key-value store with horizontal sharding. Those who aren't are typically using it with an ORM, so they aren't dealing with the database at the SQL layer.

Everyone else who is manually writing SQL generally was on or moved to Oracle, Teradata, SQL Server, PostgreSQL etc anyway.


This whole statement is 99% wrong


One thing that would be nice is if SQL provided first class support for sub records.

So instead of "SELECT name, (SELECT GROUP_CONCAT(CONCAT_WS(',', post_id, post) SEPARATOR ';') FROM posts p WHERE p.user_id = u.user_id) AS 'posts' FROM users u WHERE u.user_id = 1",

you could do "SELECT name, (SELECT post_id, post FROM posts p WHERE p.user_id = u.user_id) AS 'posts' FROM users u WHERE u.user_id = 1".

and the query result would be { name : 'Todd', posts : [ { post_id : 1, post : 'My Comment' } ] }.

Obviously this is a simple example and could have been rewritten as a query on the posts table, inner joined on the user table, and duplicating the user's name in the result. But it becomes much nicer to have as queries get more complex.

A query that supports sub-records would give you the flexibility to structure data like a JSON object and simplify the server end of REST APIs.


With Postgres:

    SELECT u.name, json_agg(p) AS posts
        FROM users u, posts p
        WHERE p.user_id = u.user_id
        GROUP BY u.user_id;


What if you only want the id and markup of each post?

Postgres 9.4 gave us json_build_object:

    SELECT 
      u.name, 
      array_agg(
        json_build_object(
          'id', p.id,
          'markup', p.markup
        )
      ) posts
    FROM users u, posts p
    WHERE p.user_id = u.user_id
    GROUP BY u.user_id;


Thank you so much for this! I spent a good chunk of today trying to work out how to do this, and had given up concluding that it wasn't possible.


I was playing around with postgres and was able to get a query that puts everything in JSON:

    select json_agg(sub)
      from (select u.username,
                   (select array_agg(p) from posts p where u.id = p.user_id) posts
              from users u) sub;


Using json_build_object (Postgres 9.4) to map each username to an array of posts:

    SELECT 
      json_build_object(
        u.username,
        (SELECT json_agg(p) FROM posts p WHERE u.id = p.user_id)
      )
    FROM users u
Output:

    [
      {"chuck": null},
      {"blair": [
        {"id": 1, "markup": "hello"},
        {"id": 4, "markup": "world"}
      ]},
      {"serena": [{"id": 5, "markup": "testing"}]}
    ]
At least I think you were trying to do that.


EDIT: Nevermind my last post. Your query makes sense now and would indeed work well.


Glad it helped! :) It honestly took me a while to boil it down to that – the docs for those JSON functions are poorly explained and illustrated, IMO.


If you're just trying to get JSON out (for a simple query to REST api, or in a node.js environment), have you considered converting the record/recordset to json in the subquery?

The JSON functions in 9.3+ are pretty handy for that sort of thing. Andrew (the core developer who wrote most of that functionality) and I decided to keep the API pretty lightweight, as it's easy to also add your own functions to suit your needs.


JSON functions do look pretty helpful and go a long way, even if not first class. As you can tell from the syntax, I'm stuck in the MySQL/MariaDB world, so I'm not up with everything in PostgreSQL. This and the features in the slideshare are an eye opener to me. Thanks!


Postgres/Oracle supports such constructs.

And Revenj has been using it for years: https://github.com/ngs-doo/revenj

And yes, Revenj now comes with an offline compiler ;)


How can you do that with Postgres? A simple example / documentation link?


https://github.com/ngs-doo/revenj/blob/master/tutorials/reve...

    SELECT p,
           (SELECT array_agg(c)
              FROM "MasterDetail"."Child_entity" c
             WHERE c."parentID" = p."ID") AS c
      FROM "MasterDetail"."Parent_entity" p


Note, just replace array_agg with json_agg to get JSON.


And doing this, the query iterates only once over the MasterDetail table?


This particular query does a nested loop on the detail table, but basically yes. Only one iteration.


Anyone know where I can get the full video lecture for this? I'd really like to see it.


I gave this talk again on Saturday in Moscow and it was videotaped there. The organizers told me they'll publish the English original as well as the Russian voice-over version. I'm keen for the second one ;)

I'll keep you posted.


It was given at FOSDEM, the videos should be available "soon".


Interesting. I don't think Django implements a lot of this in its ORM.


I work on several Django apps, and also write advanced SQL in Postgres (CTEs, plv8 functions, materialized views, etc.). One thing the ORM allows, though, is writing a custom SQL query, and it's pretty easy to 'cast' the SQL results into a Django model.

With that said, it's still sometimes a struggle to justify it if almost all the queries are highly custom and the app doesn't require the Django Admin. Certainly there are advantages, but the minimalism of Flask and focusing on the REST interface for integration and building microservices is more appealing in some cases.


If one wants to use these Postgres features, Flask without an ORM is the way to go. Django is too constraining, especially with its primary key limitations.


If you use Django without the ORM, why would it be any more constraining than Flask?

I usually try to keep my queries within the ORM, as it means things like sorting and pagination become very easy, but it's lacking in certain areas.


It's not necessarily more constraining, but from my perspective there are two reasons to choose Django over other python frameworks: the ORM and the out-of-the-box Admin (which depends on the ORM). If you take those away, there is no good reason, in my opinion, to choose Django over, say, Tornado.

I wouldn't choose Flask for anything; it's too slow, even by the relatively poor standards set by Tornado and Django.


The admin has stopped me going to another framework before.

My additional reasons for choosing Django over a minimalist framework:

There is a large selection of third-party add-ons for Django. I had a look around and couldn't find anything like reversion for any other framework (maybe it exists, but I didn't find it).

There is a standard set of components, so you get a default choice. These will work together and likely have some consistency in the way they get used. If you choose your own components for Flask (or whatever framework), they may well work together, or you may run into problems.


> wouldn't choose Flask for anything; it's too slow, even by the relatively poor standards set by Tornado and Django.

What are your picks for good and fast web frameworks which are also light weight?


You can work around a lot of django's ORM limitations and end up with something that's less laborious and error prone than writing SQL for everything.


Same goes with Doctrine.

You can write crazy complex hydrators, or use simple result set mapping for populating custom query results into objects.

Nearly all of my complex queries for some analytics software I wrote live in SQL statements that aren't inlined in the EntityRepository of a specific entity. ORMs are nice, but doing complex stuff in DQL (Doctrine) or the other query builders is more hassle than it's worth.


SQLAlchemy supports many of these advanced concepts.


Django's ORM is very limited; it doesn't even expose GROUP BY. I don't think there will ever be support for more powerful constructs like these.


You are right in that the Django ORM will always be relatively simple, but it does have some GROUP BY support: https://docs.djangoproject.com/en/1.7/topics/db/aggregation/...


I'm aware, I was already years deep into Django when they released it. But since it doesn't expose GROUP BY (the aggregation API only covers a subset of the uses), when you need it you have to drop to raw SQL, in which case your query will not return a QuerySet anymore, breaking compatibility with the rest of the codebase, defeating the point of an ORM in the first place. An uphill battle.


We're getting closer to full support for custom/advanced queries with full queryset support. See the release notes for 1.8 expressions: https://docs.djangoproject.com/en/1.8/ref/models/expressions...

The GROUP BY still isn't exposed directly, but neither are joins or subqueries. The rules that make up what should be included in the group by are fairly elaborate. However, writing custom expressions will now allow you to craft some pretty cool SQL, and allow the sub-expression to specify whether it should contribute to GROUP BY or not.

Disclaimer: I did a lot of the work for query expressions in 1.8. The ORM team is expanding, and people are actively working on bringing more powerful features to the ORM.


You can use a Meta.managed=False model and a Postgresql view and map your model to that view for custom SQL.

And just query from that model as normal.


Group by is kind of implicit depending on the model you start with.

Model_a.objects.all().select_related(extra_tables).aggregate(arg_params)

will group by the Model_a table.


I often find myself wishing that SQL had a more uniform syntax. So much of it resembles COBOL to me, with all the capital letters and "english-like" arbitrary word order. At the same time, relational algebra exists and is very elegant, in an APL-like way. Would be very interested if there was a project to expose such an api to SQL databases.


SQLAlchemy (for python) tries to sanitize the syntax away from SQL with some level of success.


SQL is case insensitive. It is a bit quirky though.


Indeed, something people sometimes forget, leading to collation problems in the (odd) instance where case sensitivity is set.


True, but very few people write it as anything but.


Is it even possible?


According to my understanding, Teradata has already supported these so-called "modern" features for several years.


And yet, despite all these cool features, Postgres still doesn't support upsert or merge, which even MySQL manages to get right.


What's really funny about this comment is that, while all of the features discussed in TFA are fully standards-compliant SQL, MySQL implements none of them, and its "upsert" isn't.

EDIT: And before one slags PostgreSQL too hard for not currently supporting upsert, one might peruse the relevant pg wiki page [1] to better understand where development stands and why we don't yet have it. (Hint: it's actually kinda complicated, assuming you want it done, you know, "right".)

[1] https://wiki.postgresql.org/wiki/UPSERT


Everyone who needs 'upsert' is forced to go and read those discussions (and probably a few more articles as well), and then implement their own version of the same thing many times over.

The fact that it's complicated is precisely the reason this ought to be solved for the general case.

Otherwise, Postgresql is still an awesome product.


Nah, you do not need to read those discussions unless you want to help out with the current patch, which, if enough people help out, might land in PostgreSQL 9.5.


Both MySQL's and SQLite's implementations work satisfactorily, so I tend to think that PostgreSQL's reluctance is a case of the perfect being the enemy of the good.

It's a pattern that shows up often when you're mirroring data from a third-party, so it's a shame that the programmer has to do conflict handling for an operation that the database could easily do atomically.


> Both MySQL and SQLite's implementations work satisfactorily

Satisfactory for you perhaps, but it may not be in the general case.


"Get right" probably isn't the right term. Mostly I'm just annoyed that I can't choose an open source DB that has upsert (or merge if you want to do the standards version) and CTEs. I generally have uses for both of these in every project I work on.


You can use writable CTEs as a kludge for an upsert depending on your use case.
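
The usual shape of that kludge (hypothetical counters table). Note it is not safe under concurrency without extra locking or a retry loop, since two sessions can both miss the UPDATE and both attempt the INSERT:

    WITH upsert AS (
        UPDATE counters
           SET hits = hits + 1
         WHERE key = 'homepage'
     RETURNING key
    )
    INSERT INTO counters (key, hits)
    SELECT 'homepage', 1
     WHERE NOT EXISTS (SELECT 1 FROM upsert);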


> which even MySQL manages to get right.

hahahaha.

If you think MySQL gets this right, you haven't thought about it much.

http://www.pgcon.org/2014/schedule/events/661.en.html https://wiki.postgresql.org/wiki/UPSERT


None of those links say that MySQL's ON DUPLICATE KEY UPDATE doesn't work as intended.

There's a single caveat: "If more than one unique index is matched, only the first is updated. It is not recommended to use this statement on tables with more than one unique index."

That isn't a problem for most users, since the use cases for upsert tend to be simple by nature, with a primary key that you want to set to the latest values whether or not it exists.
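
For reference, the MySQL statement in question looks something like this (hypothetical table with a single unique key on k):

    INSERT INTO counters (k, hits)
    VALUES ('homepage', 1)
    ON DUPLICATE KEY UPDATE hits = hits + 1;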



