Hacker News new | past | comments | ask | show | jobs | submit login

It would be useful if the data was loaded into different RDBMS' and their execution plans compared. My understanding is that MySQL has the worst query planner and Oracle the best. MSSQL sits somewhere in between, near PgSQL but better in a few areas.

I think MySQL basically gives up optimizing the moment you ask it to do a correlated subquery.

I use MSSQL at work and given it some really dumb schemas and really dumb queries and it still manages to pick good plans. It wouldn't surprise me if MSSQL generated the "optimal" plan from the first query.

It doesn't choose good plans if you break 3NF by a long shot or avoid the most obvious indexes. Otherwise the only thing to watch out for is stale stats and parameter sniffing.

Oracle 12c has adaptive plans which would be a great addition to MSSQL in the fight against parameter sniffing. Basically it figures out that it's probably got the wrong plan ("this nested loop join stopped making sense about 10,000 rows ago") and adjusts it accordingly.

Again, wouldn't be surprised if Oracle got it "right" from the first query.

For all its faults, this is one thing I like about Oracle. No matter how many levels of nested correlated subqueries or other convoluted junk I throw at it, it (almost) always untangles the query into a nicely optimized plan. Of course, this makes it less likely I'll spend time actually thinking about how the query could be cleaned up and streamlined. Which means if I tried to move over to MSSQL or MySQL my queries would probably take forever or fail completely. Ergo, I'll stick to Oracle... ?

Vendor lock in through dumbing down your clients ... I like it.

no, vendor lock in through having a better product

I agree with everything you've said, however I want to add that the general technique of compressing your data through pre-aggregation is a sound one, whether it is performed auto-magically by the optimizer, or manually through whatever hooks your DMBS provides.

While it's hard to deny the results, the human readability of the final solution leaves a lot to be desired, especially for SQL casuals.

At least with Postgres (and MySQL to some degree, but Postgres is probably a better launching point), you could hack the query optimizer to handle situations like this if you really wanted to. I like to think this is how Postgres would like you to approach the problem, instead of creating a less comprehensible query.

OP here again. This was such a good idea, we decided to do it. :)

Here are the same queries tested on MySQL, MSSQL and Oracle as well: https://periscope.io/blog/count-distinct-in-mysql-postgres-s...

tl;dr: Oracle and MSSQL do very well, and even better on the more specific queries. Postgres's naive query is much worse than all the others.

Thanks for the extra testing.

My biggest problem with these benchmarks and your queries I cover in this [1] post.

If databases.name is unique then you'd have a unique index/constraint on it. Doing this in the environment I set up causes SQL Server to generate identical execution plans for all three queries.

If databases.name is not unique then your query's group by name doesn't make sense. I'll just quote what I wrote verbatim:

> No online retailer is looking at their sales reports and thinking that John Smith is their best customer and Amorette McFredson is their worst.

> ... I don’t know whether the query that was chosen for optimization because it’s good for demonstrating the technique on PgSQL or because it represents a query that is actually in use. I don’t like group by name without knowing whether dashboard names are unique.

> With that said, if it’s not unique then the query is arguably “wrong” because it aggregates results for distinct dashboard names rather than distinct dashboards.

> Removing the unique index on dashboards.name does change the execution plan of the first query – it’s slightly worse than the other two (which remain identical). “Fixing” the first query to group by dashboards.id and dashboards.name causes the better plan to be generated again.

[1] http://ta.speot.is/2014/01/25/use-subqueries-to-count-distin...

Historically the MySQL query planner has been pretty bad, although it's improved significantly in the past couple years. It was particularly bad dealing with subqueries as you mentioned, although several of the performance issues have been solved in MySQL 5.6.


Yup, this. Of course, it hurts that there are probably less than 10 people in the world who fully understand the optimizer (and the majority are employed by Monty); though I'm sure the MySQL folks at Oracle are getting there.

OP here. Great points -- especially around MySQL and subqueries, which has been a frustration for us.

Comparing across different RDBMS's is a great idea. Maybe we can squeeze it into a future blog post. :)

I tried to do a fair job in recreating your queries and techniques with MSSQL. Turns out MSSQL picks the best plan from the start [1].

[1] https://news.ycombinator.com/item?id=7119976 or http://ta.speot.is/2014/01/25/use-subqueries-to-count-distin...

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact