Hacker News new | comments | show | ask | jobs | submit login
Pivot – Rows to Columns (modern-sql.com)
145 points by mooreds 6 months ago | hide | past | web | favorite | 40 comments

Ahh, love the old pivot. At my company we had a horribly designed schema for our MVP and our postgres DB was getting hammered running our daily analytics. We had like 6 replicas going at one point and each of them getting maxed out... Then we hired an older gentlemen with plenty of DB experience and as soon as he got onboard he wouldnt stop talking about 'pivoting' our data. Nobody really listened to him for a month or two but after we did a few prototypes it was like holy shit this is revolutionary! Fully implemented the redesign of our DB with a pivoted schema and that sh has been powering our service for 3 years without a blip.

What a concept... the benefits of experience.

I have a vague understanding of pivot, never really needed to know more than that. Sort of glanced through the article with a "meh" attitude, but after reading your anecdote, decided I should read it properly and added it to my reading list (which actually does get read!). Thanks.

Could you elaborate please, how switching from rows to columns made such a difference?

A lot depends on what he changed, of course, but in general pivoting like this reduces the overall amount of data that’s both transmitted down the wire and processed through the CPU. For instance, a lot more work can be done on a single pass through the table than before, it takes less memory to keep the calculation, and the end result is smaller - sometimes by an order of magnitude.

Those things all stack up.

Maybe I haven't understood the article fully but I don't understand the use case here. Showing data in columns vs rows is purely about representation of the data as opposed to how it is stored in a database. The first query with one value per row seems like a perfect way of representing the data in the database.

In addition, the author claims that very poor performance comes from using an EAV model due to joins. In reality with the right indexes, joins can be very high performance and an EAV model can give you better performance than de-normalised tables. Performance degradation often comes from attempting to pivot the data into a de-normalised structure.

Take the example the author uses where the attributes are name, email and website. The proper EAV normalised structure would have three tables, one for name, one for email and one for website. With the three tables you can enforce the constraint of one value per entity so you don't have to use the MAX hack and the joins between the tables will be more performant than joins on a single table. With this database structure, lots of operations are more performant than using the de-normalised table. For a given submission ID we can get the name, email, website (if they exist) faster by using three queries in parallel. Each individual query will operate faster because the index for each table will be smaller than the index for the de-normalised table.

>The proper EAV normalised structure would have three tables

But then it's not EAV. In this case he is creating a single table to hold all of the attributes. The pros of this model are that once the business decides they also want to record which browser was used, there are no database changes required. In your case, you'd have to spin up a new table (and then another one once they want to store the time, and another where they want to store whether this is a returning customer, etc.)

I'm not suggesting EAV is the correct solution, but the model has its merits... rarely :).

I see what you're saying, I think my definition of EAV is probably not the correct one. In a real EAV where you have just one table and you only change data not the table structure you can only really query for a list of attributes. In that case there is no performant way of querying for a specific attribute.

The value of this style of database seems limited because you have to change the application in order to make good use of the new attribute you've added but you refuse to change the database at the same time. The best systems are where application and database match; either both contain knowledge of an attribute or neither contain knowledge of an attribute. The case of neither having knowledge is only useful where an outside party (such as the end user) alone can interpret the attribute. For example listing device specs where the user alone can compare them and not the application like: backlight = LED, subpixel = PenTile, power-supply = switch-mode.

Your model is perhaps EV - you don't need the attribute if the data is in separate tables? It seems like normalization on steroids - I was going to joke about 6NF, but turns out that's a thing: https://en.wikipedia.org/wiki/Sixth_normal_form. I'm one of today's lucky 10000!

One thing EAV is really good for is allowing for user-defined attributes - some privileged users are allowed to add a new user defined field, and there's a UI that allows regular users to view and maintain them. A pain for reporting, but you can either update a view as UDF's are added or drop and recreate a materialized view at set intervals. I've worked with applications where there's a user-defined screen for each entity in the system, it can be a great get-out-of-jail card for a lot of late/ad-hoc requests.

I agree. This is all about display of the data. However, the point of pivots is getting a two-dimensional display of the data. Using the example from the guide, a simple aggregate query would give you:

    Year   Month   Revenue
    2016    01     1
    2016    02     23
    2016    03     345
    ...     ...    ...
    2017    01     567
    2017    02     1234
    2017    03     4567
    ...     ...    ...
That's fine, but for some readers, it's hard to see year-to-year changes. With a pivot, you get:

    Revenue / Month
    Year    01    02    03   ... 
    2016     1    23   345   ...
    2017   567  1234  4567   ...
     ...   ...   ...   ...   ...
Also, for what it's worth, I gotta say that Excel does a great job at pivoting. Better than MySQL, anyway, running on the same hardware. I've created pivots with thousands of columns.

In case anyone wants to use crosstab for postgres, here's what I learned so you don't have to: https://github.com/makmanalp/sqlalchemy-crosstab-postgresql

The R package tidyr - a tidyverse package - makes this incredibly easy http://tidyr.tidyverse.org/. Just wanted to note that if you're considering trying out R and work with SQL, you will probably love the tidyverse. The SQL-like task replacements in tidyr, dplyr, and others, are how I came to love and appreciate R.

For pandas people: the pivot / pivot_table / melt / stack / unstack series of methods help with this stuff.

For anybody finding pivot complex (I know I did), I've written a long post about pivot and friends: https://spapas.github.io/2016/09/21/pandas-pivot-table-prime...

i was looking into this last week. spread is what i found but i need to pivot on 2 keys, which I dont think is supported

I'm not sure exactly what you mean by 2 keys. Is it two key/value pairs or hierarchal 2 keys per value?

If it is 2 keys per value, you can simply `unite()` the key columns.

If it is two sets of key/values you simply spread twice.

`df %>% spread(key1, val1) %>% spread(key2, val2)`

tidy og dplyr are very much about composing things, so you do have to think a bit outside of the box of "which function" to use and rather "which functions" to use.

You want the data.table [1] implementation of dcast and melt. They support multiple pivot values, inline aggregation, are super fast, and are easy to use compared to trying to wrap your brain around gather/spread. Frankly I think Hadley went a little too far with those tidyr functions, the dcast/melt interfaces he created in reshape2 are much, much easier to intuit and the DT versions are just improvements on that.

[1] https://cran.r-project.org/web/packages/data.table/vignettes...

Huh, I feel the exact opposite here. melt/gather feel very similar, but spread seems like a massive improvement in UI over dcast.

Used to be a big data.table fan, until I realized I could read my dplyr code as easily a week later as when I wrote it.

I just had to deal with this. In my case, I needed two columns pivoted and aggregated per "slot." (So each "slot" would have an aggregated A and another aggregated B column.) I couldn't find a syntax that allowed me to do that, though I'd still love to find it, for instructive purposes.

I highly recommend you search about that on stackexchange and ask question if not already present. Seem easily doable but HN comment are not a goo place to get specific feedback.

Sounds like you could do that with a join. If you post a simple example I'm sure there would be a race to figure out an answer!

Is doing something like the author suggests

SELECT submission_id , MAX(CASE WHEN attribute='name' THEN value END) name , MAX(CASE WHEN attribute='email' THEN value END) email , MAX(CASE WHEN attribute='website' THEN value END) website FROM form_submissions GROUP BY submission_id

actually faster than doing the equivalent using joins? I would think the opposite, since you are going to have a unique key on entity and name for sure, and you can't really use the name part of the key in the group by syntax. Or can you?

Of course you can, GROUP BY take as many columns as you want according to standard.

And it’s way faster too on my typical use cases.

It's not the group by that needs to take the key into account, it's the case. I would think you get the result from the where/ group by clause first, but then for each row in that result check if it matches the case criteria. I don't think the case check is going to be able to take advantage of te fact that the name is the second column of a composite key.

I’m not sure to understand (and again stackexchange is a better place to seek advices) but I still think it doable. If I get you right you just have to put the aggregate function inside a CASE instead of the opposite.

One limitation of an aggregate function is that you can’t nest aggregate function in the same SELECT clause. But you can perfectly nest it inside a non-aggregate expression.

If anyone wants to play around with an implementation of PIVOT without installing anything, check out data.world (supports PIVOT and UNPIVOT). Check out https://docs.data.world/documentation/sql/concepts/advanced/... (Disclosure: I'm an employee at data.world)

Is there anyway to support N months? This one has them hardcoded in, but what if I don't know how many months are going to be in the dataset?

The example still groups by YEAR. So you won't need more than 12 months.

Regardless: SQL is a statically typed language. Also the type of the result—i.e. the names and types of the columns—must be known in advance. Some databases offer extensions for "dynamic pivot".

Markus is modest; he is the author of the top post, and imho is up with Joe Celko in offering great techniques to use SQL effectively (both at http://modern-sql.com/ as above and http://use-the-index-luke.com/). I'm a fan.

It's possible to do it with a prepared statement. I just did this a couple of weeks ago and it worked nicely. Inspired by this Stack Overflow answer:


I wouldn't recommend this unless it really fits your use case. For me I needed to report whether a set of users had completed various tasks. The tasks wouldn't be known ahead of time, but the report is always limited to a set of users who share the same set of tasks. This approach allowed me to build a report that works for different sets of users and the tasks don't need to be known ahead of time.

Big downside is that my query creates a bunch of dependent subqueries. Because the number of rows (users) and tasks (columns) in a given report is always within a reasonable number, it runs quickly. Your mileage may vary.

Yes, if you take a detour via Python. As Markus pointed out, you would still need to know data types and filter values in advance to configure the pivot - I would assume this is normally the case.

Using Python, you can build conditional rules for building a SQL query based on parametrized inputs with Jinja2. Example here: https://multithreaded.stitchfix.com/blog/2017/07/06/one-weir...

You can then execute said query with a given DB module.

I mentioned this in another comment, but with postgres you can use jsonb_object_agg and get out a variable number of columns, you just have to pull the data out of the json blob instead of them being columns.

In postgres you can use the crosstab function with two SQL queries, the first to get the data to pivot, the second to dynamically query the categories from the data (e.g. some kind of SELECT DISTINCT).

Actually, I just realized I'm wrong. You can do my trick to automatically filter values, but you still have to pre-declare the schema of the table being returned. This means you can't have a dynamic pivot where maybe you end up with 6 columns instead of 5.

dynamic sql is the ugly solution - using a loop to generate the sql string to be executed.

This is what I use at work. :) It can make the code difficult to read and update later though.

Transposing and cross tabulation are common tasks in maths and stats, and languages like R, Python, SAS, or even tools like Tableau are much better suited to it.

Using SQL for transposing data is using the proverbial hammer because you lack better tools.

when you have a hammer everything looks like a nail

in postgres there are also alternatives you can use with json fields, which will allow you to return an arbitrary and variable number of columns - the only trick is that the "columns" are in a json object.

Here is an example if anyone is interested: http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed9... - jsonb_object_agg is the secret sauce.

Good explanation of the options. I needed this half a year ago and found a solution and stackoverflow. Using MAX and CASE still feels like an ugly hack, though.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact