
Pivot – Rows to Columns - mooreds
http://modern-sql.com/use-case/pivot
======
brootstrap
Ahh, love the old pivot. At my company we had a horribly designed schema for
our MVP and our postgres DB was getting hammered running our daily analytics.
We had like 6 replicas going at one point and each of them getting maxed
out... Then we hired an older gentleman with plenty of DB experience, and as
soon as he got on board he wouldn't stop talking about 'pivoting' our data.
Nobody really listened to him for a month or two, but after we did a few
prototypes it was like holy shit, this is revolutionary! We fully implemented
the redesign of our DB with a pivoted schema and that sh__ has been powering
our service for 3 years without a blip.

~~~
thijsvandien
Could you please elaborate on how switching from rows to columns made such a
difference?

~~~
jmelloy
A lot depends on what he changed, of course, but in general pivoting like this
reduces the overall amount of data that's both transmitted over the wire and
processed by the CPU. For instance, a lot more work can be done in a single
pass through the table than before, the intermediate calculation takes less
memory to hold, and the end result is smaller - sometimes by an order of
magnitude.

Those things all stack up.
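
To make that concrete, here's a minimal sketch (table name from the article,
sample values made up) of three EAV rows collapsing into one pivoted row in a
single pass:

    -- EAV shape: three rows per submission go over the wire.
    --   submission_id | attribute | value
    --   ------------- | --------- | -----------------
    --              17 | name      | Alice
    --              17 | email     | alice@example.com
    --              17 | website   | alice.example

    -- Pivoted in one pass over the table: a third as many rows come back.
    SELECT submission_id,
           MAX(CASE WHEN attribute = 'name'    THEN value END) AS name,
           MAX(CASE WHEN attribute = 'email'   THEN value END) AS email,
           MAX(CASE WHEN attribute = 'website' THEN value END) AS website
      FROM form_submissions
     GROUP BY submission_id;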

------
barbegal
Maybe I haven't understood the article fully but I don't understand the use
case here. Showing data in columns vs rows is purely about representation of
the data as opposed to how it is stored in a database. The first query with
one value per row seems like a perfect way of representing the data in the
database.

In addition, the author claims that very poor performance comes from using an
EAV model due to joins. In reality, with the right indexes, joins can perform
very well, and an EAV model can give you better performance than de-normalised
tables. Performance degradation often comes from attempting to pivot the data
into a de-normalised structure.

Take the example the author uses where the attributes are name, email and
website. The proper EAV normalised structure would have three tables, one for
name, one for email and one for website. With the three tables you can enforce
the constraint of one value per entity so you don't have to use the MAX hack
and the joins between the tables will be more performant than joins on a
single table. With this database structure, lots of operations are more
performant than using the de-normalised table. For a given submission ID we
can get the name, email, website (if they exist) faster by using three queries
in parallel. Each individual query will operate faster because the index for
each table will be smaller than the index for the de-normalised table.
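
A minimal sketch of that layout (names made up, following the article's
example):

    -- One table per attribute; the primary key enforces one value per entity.
    CREATE TABLE submission_name    (submission_id int PRIMARY KEY, name    text);
    CREATE TABLE submission_email   (submission_id int PRIMARY KEY, email   text);
    CREATE TABLE submission_website (submission_id int PRIMARY KEY, website text);

    -- All three attributes for one submission, without the MAX hack:
    SELECT n.name, e.email, w.website
      FROM submission_name n
      LEFT JOIN submission_email   e ON e.submission_id = n.submission_id
      LEFT JOIN submission_website w ON w.submission_id = n.submission_id
     WHERE n.submission_id = 42;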

~~~
MSM
>The proper EAV normalised structure would have three tables

But then it's not EAV. In this case he is creating a single table to hold all
of the attributes. The advantage of this model is that once the business
decides they also want to record which browser was used, there are no database
changes required. In your case, you'd have to spin up a new table (and then
another one once they want to store the time, and another when they want to
store whether this is a returning customer, etc.)

I'm not suggesting EAV is the _correct_ solution, but the model has its
merits... rarely :).

~~~
barbegal
I see what you're saying; I think my definition of EAV is probably not the
correct one. In a real EAV model, where you have just one table and you change
only the data, never the table structure, you can really only query for a list
of attributes. In that case there is no performant way of querying for a
specific attribute.

The value of this style of database seems limited, because you have to change
the application in order to make good use of the new attribute you've added,
yet you refuse to change the database at the same time. The best systems are
ones where application and database match: either both contain knowledge of an
attribute or neither does. The case where neither has knowledge is only useful
when an outside party (such as the end user) alone can interpret the
attribute, for example listing device specs that the user alone, and not the
application, can compare: backlight = LED, subpixel = PenTile, power-supply =
switch-mode.

~~~
ianmcgowan
Your model is perhaps EV - you don't need the attribute if the data is in
separate tables? It seems like normalization on steroids - I was going to joke
about 6NF, but it turns out that's a thing:
[https://en.wikipedia.org/wiki/Sixth_normal_form](https://en.wikipedia.org/wiki/Sixth_normal_form).
I'm one of today's lucky 10000!

One thing EAV is really good for is allowing user-defined attributes: some
privileged users are allowed to add a new user-defined field, and there's a UI
that allows regular users to view and maintain them. A pain for reporting, but
you can either update a view as UDFs are added or drop and recreate a
materialized view at set intervals. I've worked with applications where
there's a user-defined screen for each entity in the system; it can be a great
get-out-of-jail card for a lot of late/ad-hoc requests.
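
As a sketch of the reporting-view idea (all names hypothetical; you'd
regenerate the view whenever a new user-defined field is registered):

    -- Pivot the user-defined fields into a reporting view; recreate this
    -- (or a materialized version of it) as UDFs are added.
    CREATE OR REPLACE VIEW entity_udf_report AS
    SELECT entity_id,
           MAX(CASE WHEN attribute = 'color'    THEN value END) AS color,
           MAX(CASE WHEN attribute = 'priority' THEN value END) AS priority
      FROM user_defined_values
     GROUP BY entity_id;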

------
makmanalp
In case anyone wants to use crosstab for postgres, here's what I learned so
you don't have to: [https://github.com/makmanalp/sqlalchemy-crosstab-
postgresql](https://github.com/makmanalp/sqlalchemy-crosstab-postgresql)
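
The short version, as a sketch against the article's form_submissions table
(assuming an integer id; crosstab comes from the tablefunc extension, and the
two-argument form keeps values aligned when an attribute is missing):

    CREATE EXTENSION IF NOT EXISTS tablefunc;

    SELECT *
      FROM crosstab(
             $$ SELECT submission_id, attribute, value
                  FROM form_submissions
                 ORDER BY 1 $$,
             $$ VALUES ('name'), ('email'), ('website') $$
           ) AS ct(submission_id int, name text, email text, website text);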

------
alexilliamson
The R package tidyr - a tidyverse package - makes this incredibly easy:
[http://tidyr.tidyverse.org/](http://tidyr.tidyverse.org/). Just wanted to
note that if you're considering trying out R and you work with SQL, you will
probably love the tidyverse. The SQL-like task replacements in tidyr, dplyr,
and others are how I came to love and appreciate R.

~~~
autokad
I was looking into this last week. spread is what I found, but I need to pivot
on 2 keys, which I don't think is supported.

~~~
extr
You want the data.table [1] implementation of dcast and melt. They support
multiple pivot values and inline aggregation, are super fast, and are easy to
use compared to trying to wrap your brain around gather/spread. Frankly, I
think Hadley went a little too far with those tidyr functions; the dcast/melt
interfaces he created in reshape2 are much, much easier to intuit, and the DT
versions are just improvements on that.

[1]
[https://cran.r-project.org/web/packages/data.table/vignettes...](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html)

~~~
claytonjy
Huh, I feel the exact opposite here. melt/gather feel very similar, but spread
seems like a massive improvement in UI over dcast.

Used to be a big data.table fan, until I realized I could read my dplyr code
as easily a week later as when I wrote it.

------
TheRealDunkirk
I just had to deal with this. In my case, I needed two columns pivoted and
aggregated per "slot." (So each "slot" would have an aggregated A and another
aggregated B column.) I couldn't find a syntax that allowed me to do that,
though I'd still love to find it, for instructive purposes.

~~~
Twisell
I highly recommend you search for that on Stack Exchange and ask a question if
one isn't already there. It seems easily doable, but HN comments are not a
good place to get specific feedback.
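
As a rough sketch, though: repeating the conditional aggregation once per
measure gives you two aggregated columns per slot (all names hypothetical):

    -- Two aggregated columns (a and b) for each pivoted "slot" value.
    SELECT entity_id,
           SUM(CASE WHEN slot = 'slot1' THEN a END) AS slot1_a,
           SUM(CASE WHEN slot = 'slot1' THEN b END) AS slot1_b,
           SUM(CASE WHEN slot = 'slot2' THEN a END) AS slot2_a,
           SUM(CASE WHEN slot = 'slot2' THEN b END) AS slot2_b
      FROM measurements
     GROUP BY entity_id;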

------
gweinberg
Is doing something like the author suggests

    SELECT submission_id,
           MAX(CASE WHEN attribute = 'name'    THEN value END) AS name,
           MAX(CASE WHEN attribute = 'email'   THEN value END) AS email,
           MAX(CASE WHEN attribute = 'website' THEN value END) AS website
      FROM form_submissions
     GROUP BY submission_id

actually faster than doing the equivalent using joins? I would think the
opposite, since you are going to have a unique key on entity and name for
sure, and you can't really use the name part of the key in the group by
syntax. Or can you?
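
For reference, the join equivalent I have in mind is something like this
(sketch only; self-joins on the EAV table):

    SELECT s.submission_id,
           n.value AS name, e.value AS email, w.value AS website
      FROM (SELECT DISTINCT submission_id FROM form_submissions) s
      LEFT JOIN form_submissions n
             ON n.submission_id = s.submission_id AND n.attribute = 'name'
      LEFT JOIN form_submissions e
             ON e.submission_id = s.submission_id AND e.attribute = 'email'
      LEFT JOIN form_submissions w
             ON w.submission_id = s.submission_id AND w.attribute = 'website';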

~~~
Twisell
Of course you can; GROUP BY takes as many columns as you want, according to
the standard.

And it's way faster, too, on my typical use cases.

~~~
gweinberg
It's not the group by that needs to take the key into account, it's the CASE.
I would think you get the result from the WHERE/GROUP BY clauses first, but
then for each row in that result check whether it matches the CASE criteria. I
don't think the CASE check is going to be able to take advantage of the fact
that the name is the second column of a composite key.

~~~
Twisell
I'm not sure I understand (and again, Stack Exchange is a better place to seek
advice), but I still think it's doable. If I get you right, you just have to
put the aggregate function inside a CASE instead of the other way around.

One limitation of aggregate functions is that you can't nest an aggregate
function inside another in the same SELECT clause. But you can perfectly well
nest one inside a non-aggregate expression.

------
calosa
If anyone wants to play around with an implementation of PIVOT without
installing anything, check out data.world (it supports PIVOT and UNPIVOT). See
[https://docs.data.world/documentation/sql/concepts/advanced/...](https://docs.data.world/documentation/sql/concepts/advanced/PIVOT.html)
(Disclosure: I'm an employee at data.world)

------
sgarman
Is there any way to support N months? This one has them hardcoded in, but what
if I don't know how many months are going to be in the dataset?

~~~
MarkusWinand
The example still groups by YEAR, so you won't need more than 12 months.

Regardless: SQL is a statically typed language. The type of the result, i.e.
the names and types of the columns, must be known in advance. Some databases
offer extensions for "dynamic pivot".
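
In other words, the month columns can be hardcoded once and for all; a sketch
(table and column names assumed):

    SELECT year,
           SUM(CASE WHEN month =  1 THEN revenue END) AS jan,
           SUM(CASE WHEN month =  2 THEN revenue END) AS feb,
           -- ... months 3 through 11 ...
           SUM(CASE WHEN month = 12 THEN revenue END) AS "dec"
      FROM monthly_sales
     GROUP BY year;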

~~~
mwexler
Markus is modest; he is the author of the top post, and IMHO is up there with
Joe Celko in offering great techniques for using SQL effectively (both at
[http://modern-sql.com/](http://modern-sql.com/) as above and at
[http://use-the-index-luke.com/](http://use-the-index-luke.com/)). I'm a fan.

------
mr_toad
Transposing and cross tabulation are common tasks in maths and stats, and
languages like R, Python, SAS, or even tools like Tableau are much better
suited to it.

Using SQL for transposing data is using the proverbial hammer because you lack
better tools.

~~~
hackits
When you have a hammer, everything looks like a nail.

------
netcraft
In Postgres there are also alternatives you can use with JSON fields, which
allow you to return an arbitrary and variable number of columns - the only
trick is that the "columns" are in a JSON object.

Here is an example if anyone is interested:
[http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed9...](http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed92f8d6a49fc5be8f3f2)
- jsonb_object_agg is the secret sauce.
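
The heart of it, as a minimal sketch against the article's form_submissions
table:

    -- Collapse each submission's attribute/value rows into one JSON object;
    -- the "columns" come back as keys of attrs.
    SELECT submission_id,
           jsonb_object_agg(attribute, value) AS attrs
      FROM form_submissions
     GROUP BY submission_id;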

------
rawfan
Good explanation of the options. I needed this half a year ago and found a
solution on Stack Overflow. Using MAX and CASE still feels like an ugly hack,
though.

