Those things all stack up.
In addition, the author claims that very poor performance comes from using an EAV model due to joins. In reality, with the right indexes, joins can perform very well, and an EAV model can give you better performance than de-normalised tables. Performance degradation often comes from attempting to pivot the data into a de-normalised structure.
Take the example the author uses, where the attributes are name, email and website. The properly normalised EAV structure would have three tables: one for name, one for email and one for website. With three tables you can enforce the constraint of one value per entity, so you don't have to use the MAX hack, and the joins between the tables will perform better than joins on a single table. With this structure, lots of operations are faster than with the de-normalised table: for a given submission ID we can fetch the name, email and website (if they exist) with three queries run in parallel, and each individual query is faster because each table's index is smaller than the index on the de-normalised table.
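Roughly what I have in mind (table and column names are just for illustration, not taken from the article):

CREATE TABLE submission_name (
    submission_id integer PRIMARY KEY,  -- at most one name per submission
    value         text NOT NULL
);

CREATE TABLE submission_email (
    submission_id integer PRIMARY KEY,
    value         text NOT NULL
);

CREATE TABLE submission_website (
    submission_id integer PRIMARY KEY,
    value         text NOT NULL
);

-- The three lookups are independent, so they can run in parallel, each
-- against its own small primary-key index:
SELECT value FROM submission_name    WHERE submission_id = 42;
SELECT value FROM submission_email   WHERE submission_id = 42;
SELECT value FROM submission_website WHERE submission_id = 42;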
But then it's not EAV. In this case he is creating a single table to hold all of the attributes. The pro of this model is that once the business decides they also want to record which browser was used, there are no database changes required. In your case, you'd have to spin up a new table (and then another once they want to store the time, another when they want to store whether this is a returning customer, and so on).
I'm not suggesting EAV is the correct solution, but the model has its merits... rarely :).
The value of this style of database seems limited, because you have to change the application in order to make good use of the new attribute you've added, yet you refuse to change the database at the same time. The best systems are ones where the application and database match: either both contain knowledge of an attribute or neither does. The case where neither has knowledge is only useful when an outside party (such as the end user) alone can interpret the attribute, for example listing device specs that only the user, not the application, can compare: backlight = LED, subpixel = PenTile, power-supply = switch-mode.
One thing EAV is really good for is user-defined attributes: some privileged users are allowed to add a new user-defined field, and there's a UI that allows regular users to view and maintain them. It's a pain for reporting, but you can either update a view as UDFs are added or drop and recreate a materialized view at set intervals. I've worked with applications where there's a user-defined screen for each entity in the system; it can be a great get-out-of-jail card for a lot of late/ad-hoc requests.
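Hypothetically, the shape of it looks something like this (table, column and field names invented for the sake of the example):

-- Privileged users add rows to udf_definitions; regular users maintain
-- values in udf_values through the UI.
CREATE TABLE udf_definitions (
    udf_id   serial PRIMARY KEY,
    udf_name text NOT NULL UNIQUE
);

CREATE TABLE udf_values (
    entity_id integer NOT NULL,
    udf_id    integer NOT NULL REFERENCES udf_definitions,
    value     text,
    PRIMARY KEY (entity_id, udf_id)
);

-- For reporting, regenerate a (materialized) view whenever the set of UDFs
-- changes, pivoting each definition out to its own column:
CREATE VIEW entity_udfs AS
SELECT entity_id
     , MAX(CASE WHEN d.udf_name = 'favourite_colour' THEN v.value END) AS favourite_colour
     , MAX(CASE WHEN d.udf_name = 'loyalty_tier'     THEN v.value END) AS loyalty_tier
FROM udf_values v
JOIN udf_definitions d USING (udf_id)
GROUP BY entity_id;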
Year  Month  Revenue
2016  01     1
2016  02     23
2016  03     345
...   ...    ...
2017  01     567
2017  02     1234
2017  03     4567
...   ...    ...
Revenue / Month

Year  01    02    03    ...
2016  1     23    345   ...
2017  567   1234  4567  ...
...   ...   ...   ...   ...
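In SQL you can get from the first shape to the second with conditional aggregation, roughly like this (assuming a long-format table revenue(year, month, revenue)):

SELECT year
     , SUM(CASE WHEN month = '01' THEN revenue END) AS "01"
     , SUM(CASE WHEN month = '02' THEN revenue END) AS "02"
     , SUM(CASE WHEN month = '03' THEN revenue END) AS "03"
     -- ... one expression per month you want as a column ...
FROM revenue
GROUP BY year
ORDER BY year;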
If it is 2 keys per value, you can simply `unite()` the key columns.
If it is two sets of key/values you simply spread twice.
`df %>% spread(key1, val1) %>% spread(key2, val2)`
tidyr and dplyr are very much about composing things, so you do have to think a bit outside the box of "which function" to use and instead ask "which functions" to use.
Used to be a big data.table fan, until I realized I could read my dplyr code as easily a week later as when I wrote it.
, MAX(CASE WHEN attribute='name' THEN value END) name
, MAX(CASE WHEN attribute='email' THEN value END) email
, MAX(CASE WHEN attribute='website' THEN value END) website
GROUP BY submission_id
actually faster than doing the equivalent using joins? I would think the opposite, since you are going to have a unique key on entity and attribute name for sure, and you can't really use the name part of that key in the GROUP BY. Or can you?
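For comparison, the join-based version I have in mind would be something like this (table names here are assumptions, with a unique key on (submission_id, attribute)):

SELECT s.submission_id
     , n.value AS name
     , e.value AS email
     , w.value AS website
FROM submissions s
LEFT JOIN submission_attributes n
       ON n.submission_id = s.submission_id AND n.attribute = 'name'
LEFT JOIN submission_attributes e
       ON e.submission_id = s.submission_id AND e.attribute = 'email'
LEFT JOIN submission_attributes w
       ON w.submission_id = s.submission_id AND w.attribute = 'website';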
And it’s way faster too on my typical use cases.
One limitation of aggregate functions is that you can't nest one aggregate function inside another in the same SELECT clause. But you can perfectly well nest an aggregate inside a non-aggregate expression.
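For example, wrapping the pivoting aggregate in COALESCE is fine (the EAV table name here is just a placeholder):

-- MAX(...) is nested inside COALESCE, a non-aggregate expression
SELECT submission_id
     , COALESCE(MAX(CASE WHEN attribute = 'email' THEN value END), 'n/a') AS email
FROM submission_attributes
GROUP BY submission_id;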
Regardless: SQL is a statically typed language, so the type of the result (i.e. the names and types of the columns) must be known in advance. Some databases offer extensions for "dynamic pivot".
I wouldn't recommend this unless it really fits your use case. In my case, I needed to report on whether a set of users had completed various tasks. The tasks wouldn't be known ahead of time, but the report is always limited to a set of users who share the same set of tasks. This approach let me build a report that works for different sets of users without the tasks needing to be known in advance.
The big downside is that my query creates a bunch of dependent subqueries. Because the number of rows (users) and columns (tasks) in a given report is always within a reasonable range, it runs quickly. Your mileage may vary.
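Roughly the shape of it, purely as an illustration (all names invented, and in the real report the task columns are generated from whatever task set applies):

-- One dependent (correlated) subquery per task column
SELECT u.user_id
     , (SELECT c.completed_at
          FROM task_completions c
         WHERE c.user_id = u.user_id AND c.task_id = 101) AS task_101
     , (SELECT c.completed_at
          FROM task_completions c
         WHERE c.user_id = u.user_id AND c.task_id = 102) AS task_102
FROM users u
WHERE u.cohort_id = 7;  -- the set of users sharing the same task list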
Using Python, you can build conditional rules for building a SQL query based on parametrized inputs with Jinja2. Example here: https://multithreaded.stitchfix.com/blog/2017/07/06/one-weir...
You can then execute said query with a given DB module.
Using SQL for transposing data is reaching for the proverbial hammer because you lack better tools.
Here is an example if anyone is interested: http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed9... - jsonb_object_agg is the secret sauce.
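The core of it is roughly this (the table name is assumed):

-- jsonb_object_agg folds each submission's attribute/value pairs into a
-- single jsonb document, so no column list has to be known up front.
SELECT submission_id
     , jsonb_object_agg(attribute, value) AS attributes
FROM submission_attributes
GROUP BY submission_id;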