How good are query optimizers, really? [pdf] (2015) (vldb.org)
70 points by guodong 25 days ago | 56 comments



One of the most important things I learned with databases was to run each of my queries using EXPLAIN (EXPLAIN QUERY PLAN in SQLite) and see which indexes are being used, if any.
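
For example (a rough sketch against a hypothetical users table; the plan output format differs by engine and version):

    -- Postgres: show the chosen plan without running the query
    EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';

    -- Postgres: actually run it and report real row counts and timings
    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'a@example.com';

    -- SQLite equivalent
    EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'a@example.com';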

One of the reasons I don't like ORMs is that I'm not able to see the underlying query and truly optimize a service. That may be fine for a new service where performance isn't crucial, but once it needs to scale, you need to put on your engineering hat, get your hands dirty, and optimize queries.

You'll find you need to rewrite queries so there isn't complex nesting in the WHERE clause, and flatten your logic so that the SQL optimizer can use your indexes. You may need to put SELECT statements within SELECT statements, where the innermost SELECT uses indexes and the outer queries work on the result of the inner query, which is smaller than the whole table.
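
Something like this, roughly (hypothetical orders table, assuming an index on customer_id only):

    -- The inner SELECT narrows rows via the customer_id index; the outer
    -- query applies the expensive, non-indexed filter to that much
    -- smaller intermediate result instead of the whole table.
    SELECT *
    FROM (
        SELECT * FROM orders WHERE customer_id = 42
    ) AS mine
    WHERE lower(notes) LIKE '%refund%';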


I wrote a thread on this on Twitter: https://twitter.com/PredragGruevski/status/12639165990625402...

I feel that SQL aimed to be Python and became x86 assembly instead. It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".
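
A small example of that last point (hypothetical tables; which form wins depends entirely on the engine and the data, which is exactly the problem):

    -- JOIN form
    SELECT o.*
    FROM orders o
    JOIN vip_customers v ON v.id = o.customer_id;

    -- roughly equivalent IN() form (identical results if v.id is unique)
    SELECT o.*
    FROM orders o
    WHERE o.customer_id IN (SELECT id FROM vip_customers);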

As a result, I started a project that allows you to write read-only database-agnostic queries called GraphQL compiler: https://graphql-compiler.readthedocs.io/ https://github.com/kensho-technologies/graphql-compiler

The core idea of the project is to get us the convenience of specifying the "what question I want answered," but without the inconvenience of "how is the answer computed / with which specific set of queries / where did the data come from?" -- unless you want to peek under the hood, of course. All the visibility into the nitty-gritty details available on demand, but without the tedium of having to hand-optimize queries and know all the "magic" ways in which queries get faster or slower for each individual kind of database.


> The core idea of the project is to get us the convenience of specifying the "what question I want answered," but without the inconvenience of "how is the answer computed / with which specific set of queries / where did the data come from?"

So...exactly like SQL, then?


The post you're replying to directly addresses that. When you write SQL:

> It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".

Though it doesn't seem to address how to optimize things using the GraphQL compiler, when the need arises, without massaging the queries the way you do with SQL.


Just like how GCC and Clang/LLVM know all the quirks of various CPUs and can optimize accordingly, GraphQL compiler aims to know the quirks of various databases (down to individual database versions: e.g., in Postgres 12 certain kinds of CTEs are no longer an optimization barrier) and optimize accordingly.

This is clearly a massive challenge, but one made easier by the fact that GraphQL compiler queries (unlike SQL queries) operate at a much higher level of abstraction. In SQL, you write "here's a CTE, now recursively JOIN X.foo to Y.bar" where X and Y could be just about anything, even something where a recursive JOIN is nonsensical. If you want to put a WHERE clause, you have to decide whether it goes in the recursive CTE itself, in a separate CTE that is ordered before the recursive CTE, or if you want to wrap the recursive CTE into another SELECT and put the WHERE there. The correct answer varies from database to database, and as a function of the size, layout, and index coverage of your data.
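
To make that concrete, here are two of those placements sketched out (hypothetical nodes/edges tables; both compute the same result, but planners can treat them very differently):

    -- Placement A: the seed filter lives inside the recursive CTE.
    WITH RECURSIVE reachable AS (
        SELECT id FROM nodes WHERE region = 'EU'
        UNION ALL
        SELECT e.dst FROM reachable r JOIN edges e ON e.src = r.id
    )
    SELECT * FROM reachable;

    -- Placement B: the seed filter is computed in a separate CTE that is
    -- ordered before the recursive one.
    WITH RECURSIVE seeds AS (
        SELECT id FROM nodes WHERE region = 'EU'
    ), reachable AS (
        SELECT id FROM seeds
        UNION ALL
        SELECT e.dst FROM reachable r JOIN edges e ON e.src = r.id
    )
    SELECT * FROM reachable;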

In GraphQL compiler, your queries are much more declarative in comparison: your query would say "find all subgraphs where vertex A's field 'foo' has value 123, and where A has a recursively-expanded edge (i.e. 0+ hops along that edge) to a vertex with field 'bar' with value 456". It's then the compiler's job to figure out which of the many equivalent SQL statements (or other db language queries, if you aren't using SQL) is going to be the best way to compute the result you asked for.

Here's an example from our test suite: input query: https://github.com/kensho-technologies/graphql-compiler/blob...

Microsoft SQL Server-flavored compiled SQL output: https://github.com/kensho-technologies/graphql-compiler/blob...

I'm writing a blog post about this with more detail, follow me on Twitter if you'd like to see it when it comes out.


If you know all of the quirks of the various databases, why aren't you hacking on their optimizers? Why write a compiler that knows what's slow and what's not when you can just fix what's slow?


It's an "and" rather than an "either-or" :)

When the database is open-source, and I spot something that's broken that I know how to fix, I try to fix it. Here's a fix for a severe database query planner correctness bug I contributed to an open-source database called OrientDB: https://github.com/orientechnologies/orientdb/pull/7015

Unfortunately, Microsoft SQL Server, Oracle, and many other databases are not open-source, and I can't hack on their query planners. And even if they were, SQL is an absolutely massive language (the complete spec is 100,000+ pages). The GraphQL compiler query language is tiny in comparison, the spec is maybe 10 pages: https://graphql-compiler.readthedocs.io/en/latest/language_s...

It's a lot easier to intelligently map a small language to a big one than it is to optimize the big language outright.

In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

More info and more concrete examples of "why not just write SQL" in my upcoming blog post!


> In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

The main reason is the inherent difficulty of cardinality estimation, as the paper says.

Not every optimization is worth having. There is typically a distributed cost, paid in extra planner cycles for queries that don't benefit from the optimization. This is one of the main reasons why it's hard to contribute new optimizations to the Postgres planner. Naturally, it's possible that a marginal optimization will be incredibly important to one particular query or user. It's a value judgement in each case.

Frankly, I find the suggestion that SQL is not designed to be easy to optimize baffling.


I think we agree more than may seem apparent at first glance. In a sense, you are also making the same point I was trying and probably failed to make. Please bear with me as I give it another shot.

The difficulty of cardinality estimation is a function of the expressiveness of the language. Imagine a new query language, SQL2, that only has the SELECT and FROM keywords -- no WHERE, no JOIN, nothing else. Cardinality estimation in SQL2 is trivial: just store the counts for each table, and you can estimate everything trivially. Optimal query plans are trivial by extension as well.

Now let's add the WHERE keyword and a few operators to this SQL2. Cardinality estimation and good query planning got much harder now! For example, if the WHERE predicate touches two columns, we need to know about correlations between the two columns, or we might make incorrect estimates and therefore get worse plans. And since the plan space got bigger, we spend more cycles on planning. If we continue to add more features to SQL2 to bring it to parity with SQL proper, all these problems get harder as we go.
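
A toy illustration with made-up numbers, assuming the usual independence heuristic:

    -- people: 1,000,000 rows
    --   selectivity(city = 'Cambridge') ~ 0.01
    --   selectivity(state = 'MA')       ~ 0.02
    SELECT * FROM people WHERE city = 'Cambridge' AND state = 'MA';

    -- Independence assumption: 1,000,000 * 0.01 * 0.02 = 200 estimated rows.
    -- If in this dataset nearly every 'Cambridge' row is also in 'MA', the
    -- true count is closer to 10,000: a ~50x underestimate, which can push
    -- the planner toward a bad join order or access method.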

The language semantics behind GraphQL compiler aim to get sufficient expressiveness for many (hopefully, most) use cases, while limiting the scope so that the problems of cardinality estimation and good planning don't become too hard to solve effectively. In comparison, SQL is significantly more expressive, and as a result also much more difficult to execute and optimize.


I think the parent's point might have been: SQL tried with mixed success. What makes the GraphQL compiler different?


The user you're replying to presumably understands a bit about the cited inconveniences, as well as SQL in general.

> user: petergeoghegan
> about: PostgreSQL major contributor, committer.


It’s a leaky abstraction, yes. In the ideal world, SQL query planners always would find the optimal plan in zero seconds and DDL would allow you to express non-functional requirements for various kinds of queries (“retrieving this record by ID should take at most 1ms”, “looking up related records in that table by order number should…”), but the world isn’t ideal. We have to manually create indices, split tables, move different tables to different disks, etc. for performance.

“The GraphQL compiler turns read-only queries written in GraphQL syntax to different query languages” is useful for GraphQL ‘fans’, but I don’t see how that is going to solve that.

Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.

For SQL, there's half a century of research on that. This paper is part of it, and indicates that, at this moment in time, effort is better spent on methods for keeping statistics on data up to date than on making cost models more fine-grained.


> Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.

I couldn't agree more with this part :)

To be clear, GraphQL compiler isn't trying to "solve" SQL itself, merely the fact that by the time all the table splitting, sharding across disks and machines, and similar required maintenance operations are done, your SQL query has grown impractically complex and entirely unreasonable to write. At sufficient scale, and with sufficient additional non-SQL databases in play, writing adequate queries becomes wildly impractical.

In the GraphQL compiler world, all your databases (SQL and non-SQL) are represented in one unified schema against which you write database-agnostic queries, and GraphQL compiler handles the nitty-gritty details of "which query runs where." GraphQL compiler is not a toy project I built for fun — it's a core piece of data infrastructure that Kensho (the company where I work) has been happily using in production for over 3 years now.

I'm writing a blog post about exactly this, and I hope to publish it very soon! Follow me on Twitter if you'd like to see it when it comes out.


>One of the reasons I don't like ORMs is that I'm not able to see the underlying query and truly optimize a service.

I thought most of them had some feature where you could dump the query before it gets sent to the DB.

Stuff like this: https://stackoverflow.com/questions/1412863/how-do-i-view-th...


yeah but if the generated SQL doesn't look how you want it, now you gotta optimize it using the ORM's language and not SQL. Tweak the ORM, see the generated SQL, tweak again... etc.

If you're looking at the generated SQL, I would rather just use the SQL directly in my code. There are probably features in ORMs where you can write raw SQL and tell it how to map the result to an object, but I haven't used an ORM in a while.


Generally it is the table design, and not the ORM, that constrains how you can write your queries. A good ORM lets you customize any part of the query, or even just write plain SQL. Most performance problems that pop up as a product matures aren't because the ORM generates "slow" queries; it's because the table design didn't scale. That can't be fixed by writing plain SQL.


This doesn’t really make sense to me. You don’t just design a good table schema independently and start querying it. The design isn’t a step that comes before the queries. The queries and the design are created in tandem. You design the schemas with the indices and queries in mind, write queries that use your indices. A good design doesn’t magically scale. It’s based on how you set up your indices and write SQL.

A good table design is only good BECAUSE it enables efficient predicate use on the SQL queries.

You can’t just query any column willy nilly, you have to plan it. That’s why I like thinking in SQL with the table definition on-hand.

For example, if I write "SELECT * WHERE x OR y" and "y" isn't indexed, then this will do a full table scan. Not OK. I need to plan my queries so they do something like "WHERE x OR (y AND z)", where "z" is indexed, so it filters by "z" and then "y". I don't want to have to try and figure out how to get the ORM to produce that.
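
Roughly this shape (hypothetical table; note the rewrite is only valid when the "y" rows you care about always carry that "z" value):

    -- Only x and z are indexed. The un-indexed y branch forces a full scan:
    SELECT * FROM events WHERE x = 1 OR y = 'pending';

    -- Constraining the y branch with an indexed column lets the planner
    -- combine the two index paths and check y only on the matching rows:
    SELECT * FROM events WHERE x = 1 OR (y = 'pending' AND z = 7);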


ANDs and ORs are trivial in any good ORM. There are valid reasons to not want to use an ORM, but they are more around the object/relational impedance mismatch, coupling table design to the domain model, etc.

But the alternative to an ORM is not opaque blobs of SQL hard-coded all over the app. How do you handle SQL injection attacks, for example? What if you add/rename/drop a column? Do you just grep your code and edit every blob of SQL in the app?


You still separate your views from your controllers, and SQL injection is not an issue when you bind params. But even if you are using an ORM, you can't change and add columns without some impact on the code (you need to update forms to add the new value to the created objects, show it, and, presumably, do something useful with it).
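
For example, with server-side prepared statements in Postgres-flavored SQL (a sketch with a hypothetical users table; most drivers expose the same idea through placeholder parameters):

    -- The user-supplied value is passed as a parameter, never spliced into
    -- the SQL text, so it can't change the structure of the query.
    PREPARE find_user (text) AS
        SELECT id, name FROM users WHERE email = $1;
    EXECUTE find_user('alice@example.com');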

And that also assumes that you are using databases as an object store. Databases are also useful to answer questions like: show me the number of users who have signed up each day for the last month.


Why would we have table references all over the app? We still use centralized models, just not ORMs.

Have a class representing a table and methods where you hit the database and map the response to an instance of the class.

It’s nice in a typed language when I map what the query will return and the compiler enforces it.

But not all my queries map to a class. Still, it's not a big mess: since we only use statically typed languages on the server, I map those results to a tuple or dictionary if not a class.


In the end, if you're writing in a non-relational language, using a relational database, and processing/using the DB responses in any way other than just returning them to the user, then you will have an ORM layer/library somewhere. Whether you should be using an off-the-shelf one or not is a different question, but you can't really escape the need to Map between your Object and Relational models (substitute Object for Datatype or Struct if you prefer).


Ah ok, so you wrote your own mapping of object relations.


Yes, mapping is the trivial part. The query optimization is the more important part IMO, so I like doing it manually.

I guess my problem isn't with ORMs, it's with ORM SQL generation.


ORMs are just a tool, and they really don't preclude query optimisation.

All the ones I've used or written allow bypassing selects, joins or an entire query and manually translating results, so they don't have to get in the way when optimisation really matters. But the vast majority of the time, IME, in most apps that just isn't necessary, so I'll take the reduced friction of a query builder that automates the basics, as long as it allows bypassing it when required.


> You don’t just design a good table schema independently and start querying it

This is only true for trivial problems; in the real world you are going to have tradeoffs. You can't design a schema for transactional and analytical workloads at the same time, yet every type of business needs some sort of analytics on its data.

The great thing about SQL is that you don't exactly need to know what your future queries are going to look like. Or your dataset.


Yes, you don’t exactly need to know what queries will look like, but I don’t think the GP claimed that.

You need to have a fairly detailed knowledge, though. How are you going to make educated trade-offs if you don’t know what kind of queries will be made, how often, and how important it is they run fast?

That’s why, in evolving programs, the database evolves, too, even if the table content stays 100% the same. Moving some tables to faster storage, splitting them horizontally or vertically, adding or removing indices, compressing or no longer compressing fields, updating statistics more frequently, etc.


That depends on the ORM -- ActiveRecord has a ".explain" method on the proxy objects for its query builder which displays the SQL, and the query plan (displayed in db-specific format).


Note that AR's `explain` actually runs the query (effectively an explain analyze). This might or might not matter depending on what you are trying to achieve.


> One of the reasons I don't like ORMs is that I'm not able to see the underlying query and truly optimize a service.

The vast majority of your queries will not fall into the category of "bottlenecks that need to be optimized" though, and you (and your probably more inexperienced team) will benefit massively from the less error-prone & more extensible nature of ORMs (I never again want to have to deal with an attempt at SQL code reuse that has grown into a string-formatted, quadruple-manifestation, triple-escaped nightmare).

A good ORM will also ease the transition into more manual SQL too, so that you can still retain the benefits of e.g. uniform abstract objects app-side.


This kind of thing is the reason why I do like ORMs. Even if you can see the SQL, that doesn't mean you can tell what it's doing: you don't know what indices will be used or not used, you don't know why the query planner will do things one way or another, you're not actually "close to the metal" at all. So might as well have the convenience of an ORM; sure, you'll have to do some fiddly profiling as and when you have performance problems, but even if you were using hand-written SQL you'd still have to do that.


> Even if you can see the SQL, that doesn't mean you can tell what it's doing

You can. I mentioned EXPLAIN in my comment.

And the query planner isn't a black box. Once you read the documentation on how the engine determines the order of operations, you can start thinking on the same plane as the query engine. You can infer how a query will use indices and how the WHERE clause will be evaluated.

Admittedly it’s not as easy as using an ORM, but if you’re a SQL expert then you can make queries much more optimized. You’ll never internalize how a SQL interpreter reads your queries unless you do it.

I’m talking about huge tables that are hit many times a second, where you need to start thinking like a Formula One team, being creative with queries to shave off hundreds of milliseconds.


> And the query planner isn't a black box. Once you read the documentation on how the engine determines the order of operations, you can start thinking on the same plane as the query engine. You can infer how a query will use indices and how the WHERE clause will be evaluated.

If you're willing to put that kind of time and effort in, you should have no trouble understanding how your ORM generates queries - IME they tend to be far clearer, better documented, and more introspectable than database query planners.


It's not about whether one can or cannot, it's about whether one does or does not.

Also, there's a feedback loop to consider. You can choose to get to know your DBMS, or you can choose to hold it at arm's length. Both can be reasonable options, but only one is going to foster expertise.

I've worked in enough places to see a clear pattern: Companies that use ORM for anything beyond really simple CRUD tend to have creeping performance problems with their database. (And typically also a healthy contingent of team members pushing to solve those problems by doing something drastic like migrating to NoSQL.) Companies that don't use ORM generally don't have the same problems.

For my part, I find micro-ORMs such as Dapper to be the sweet spot. They get the vast majority of the convenience benefit, without encouraging poor housekeeping practices.


It’s not that I’m willing to put in the effort, it’s that the scale I operate at requires it. I would rather become highly proficient at SQL and use it universally with any server language than become an expert at a specific ORM.

But I don’t see how one could become an expert at writing ORM queries without knowing the underlining SQL, which means you’re putting in the time to master two languages.

I may just be an outlier, and that's fine. I like geeking out over SQL optimization.


> But I don’t see how one could become an expert at writing ORM queries without knowing the underlining SQL, which means you’re putting in the time to master two languages.

It's like working in a transpiled language: you need to understand the intermediate language a little, but you don't need to master it. Indeed I'd argue that the ORM often corresponds more clearly to what's actually going on at the query planner level than the SQL does: pulling out an entity via an indexed link to another entity is very different from scanning through a table for cases where one value matches another, and they look different in the ORM and in the query planner, but in SQL they're both just "JOIN".
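
For instance (hypothetical tables; the plan shapes are illustrative, not actual EXPLAIN output):

    -- The SQL is the same either way:
    SELECT o.*
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'NZ';

    -- With indexes on customers(country) and orders(customer_id) the planner
    -- may pick an index-driven nested loop; without them it may hash-join or
    -- merge-join full scans of both tables. Same JOIN keyword, very
    -- different executions.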


I have really tried to let the optimizer do its thing and generally it does and everything's ok.

Until it's not, and then I want hints to save my ass. And they are not hints, I want to TELL the f'ing computer what to do because I know better than the optimizer, period.

So surprised to find out PG doesn't support hints; I don't think I will ever be able to move anything serious until it does, just not going to take that kind of risk.

I have played the whole "rewrite the query to try and convince the optimizer" game with barrier tricks. No thanks, give me some hints and I will tell it exactly what to do and when, thanks.
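
For anyone who hasn't seen them, the usual Postgres "fence" tricks look roughly like this (hypothetical table; the first is folklore rather than documented behavior, the second is PG 12+ syntax):

    -- OFFSET 0 keeps the subquery from being flattened into the parent:
    SELECT *
    FROM (SELECT * FROM orders WHERE status = 'open' OFFSET 0) o
    WHERE o.total > 100;

    -- In PG 12+, MATERIALIZED forces the CTE to be computed on its own:
    WITH open_orders AS MATERIALIZED (
        SELECT * FROM orders WHERE status = 'open'
    )
    SELECT * FROM open_orders WHERE total > 100;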


Is the query actually slower, or is it just not using an index you want it to use?

Oftentimes PG won't bother with an index for a variety of reasons (sequential scans can be legitimately faster in some scenarios), especially when the number of rows is small.


The cool thing about hints is you can quickly drop in a hint to confirm your suspicions and narrow down the problem, rather than trying to do this sort of diagnosis in a vacuum. But because some people use hints for evil, nobody is allowed to use them.


You can also disable seq_scan and force PG to consider indexes; usually that's enough.
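
E.g. (real planner settings, scoped to the session; they nudge plan costs rather than hard-force a plan, and the query here is just a hypothetical example):

    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    RESET enable_seqscan;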


Access method is just part of the story. The same index may be accessed in different ways, and you might also want to combine indexes. Sometimes you may change the table join order to see how it estimates (or executes). Usually there are two parts: 1) the cost and reasoning behind an estimation, and 2) how it actually executes (timings, resource usage, locks/contention).


The query is slower, otherwise I would not even notice. I don't care which index it uses if it's fast.

It's not just about index usage, it's also about which type of join (nested loop, hash, or merge) and the join order.


One big reason I appreciate Couchbase is its USE INDEX functionality.


This paper is from 2015, it appears.

Can anyone comment on how relevant this is with the enhanced statistics types in Postgres 10, 11, 12?


Here's a neat extension that tries to use genetic algorithms to learn better planning for one's queries; it includes slides which cite this paper & have TPC numbers:

https://www.pgcon.org/2017/schedule/events/1086.en.html


I can't comment on your question, but thank you for finding the year this paper was written. Having read many older papers, it's sometimes like solving a murder mystery to figure out what year a paper was written. The year a paper is written is vital for understanding the social context of the research being presented, in addition to any context cited in the paper.


Agreed, especially in fast-changing fields or after recent breakthroughs. One trick I use is to look at the References and find the approximate max year cited. It's generally the same as or pretty close to the year of the paper.


With CS conference papers, it's quite easy to see the date on the bottom-left of the first page. You see the copyright year, conference name, etc.


Isn’t it usually a matter of just googling the title?


This has long been a pet peeve of mine.


Most of that stuff depends on explicit CREATE STATISTICS commands being run in order to work around column correlations and stuff like that. The general assumption of independence among columns/attributes is pretty universal (as the paper actually says).
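
For reference, the Postgres 10+ syntax looks roughly like this (hypothetical table and columns):

    -- Tell the planner that city and zip are functionally dependent, so it
    -- stops multiplying their selectivities as if they were independent.
    CREATE STATISTICS addr_city_zip (dependencies) ON city, zip FROM addresses;
    ANALYZE addresses;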

One of the most useful areas for future improvement is making plans more robust against misestimations during execution, for example by using techniques like role-reversal during hash joins, or Hellerstein's "Eddies".


> The general assumption of independence among columns/attributes is pretty universal (as the paper actually says).

So, the paper definitely talks about how independent column statistics are a problem with big tables in the default stats configuration.

...But the option of creating correlated, non-independent column statistics did not exist in PG until after this paper. Which was my point.

In my experience, flat-out increasing statistics sample rates fixes 80%+ of the problems in this paper, with basically no downsides. (You can push that computation to downtime when no one cares.)
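
E.g., something like this (hypothetical table and column; the default per-column target is 100):

    -- Sample more rows and keep more MCV/histogram entries for this column.
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;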


I like the methodology of the Join Order Benchmark (JOB). The key takeaway is PostgreSQL specific:

> ...the most important statistic for join estimation in PostgreSQL is the number of distinct values. These statistics are estimated from a fixed-sized sample, and we have observed severe underestimates for large tables.

Live statistics, incrementally updated on DML execution, are a key feature of a good query optimizer. As a zero-administration RDBMS, SQL Anywhere had gained a reputation for a best-of-breed query optimizer [1] a decade ago; I'm curious if this still holds true.

In the last decade, the importance of OLAP queries in row stores has diminished due to the superiority of column stores. I'd be interested in a comparison of the Citus query optimizer vs. say Presto.

[1] https://www.student.cs.uwaterloo.ca/~cs448/W11/cs448_Paulley...


I'm surprised there is no mention of Postgres' Genetic Query Optimizer (GEQO) here. It kicks in when there is a large number of joins and reduces query planning time at the cost of query execution time.
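
The relevant knobs, for anyone who wants to poke at it (standard Postgres settings; the values here are just examples):

    SHOW geqo_threshold;        -- default 12: GEQO kicks in at 12+ relations
    SET geqo_threshold = 8;     -- switch to the genetic search earlier
    SET geqo_effort = 7;        -- 1-10, trades planning time for plan quality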

Another post in this thread mentions adaptive query planning and mistakenly implies that the GEQO is a module for this. My hands have been itching to experiment with some improvements to the GEQO, specifically by improving the genetic algorithms used. When there are many similar queries, i.e. in the adaptive query planning setting, one could also use reinforcement learning to improve query planning over time.


The simplified cost model they use was really surprising. 34% better than the complex PG one not only sounds great (incoming "simple is better" replies below) but is really nice to hear. Hopefully Postgres has considered, or will consider, changing the default cost model to a simpler, more modern function that takes the current landscape into account.


That may be true, but that doesn't seem like the important takeaway to me. The important takeaway is "In contrast to cardinality estimation, the contribution of the cost model to the overall query performance is limited". Actually, the paper itself says "This improvement [the 34% one you mention] is not insignificant, but on the other hand, it is dwarfed by improvement in query runtime observed when we replace estimated cardinalities with the real ones".

Optimizers are weird.


Oh, definitely agree. Mainly I find the cost model interesting because it's so simple and contained. Cardinality estimation is a hard problem and requires real expertise. But the easy wins you get by just throwing out something based on old assumptions, like the cost model, are fun!



