This may help you build a mental model of what happens. But this is all table scanning and doesn't include indexes, which organize data in a tree. When you write SQL, you should consider whether the table will contain thousands or millions of rows. If millions, you have to think about how that JOIN or WHERE will take advantage of indexes.
In a way, it is even deceptive, because naive programmers might be led to think that repeated(!) table scanning is what real database engines do.
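The difference an index makes is easy to see for yourself. Here is a small sketch using SQLite (chosen only because it ships with Python; the server engines discussed here behave analogously), with a made-up table and index name:

```python
import sqlite3

# Toy schema: one table queried first without, then with, an index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, email TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"u{i}@example.com") for i in range(10_000)])

def plan(sql):
    # EXPLAIN QUERY PLAN summarises how SQLite will execute the query.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM users WHERE email = 'u42@example.com'")
con.execute("CREATE INDEX idx_users_email ON users(email)")
after = plan("SELECT * FROM users WHERE email = 'u42@example.com'")

print(before)  # a full table SCAN
print(after)   # a SEARCH using idx_users_email
```

The same query goes from touching every row to a tree lookup, which is exactly the gap the comments below are about.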
It's actually very rare that this happens. For example, SQL Server will create a temporary index on the fly if one is missing. It can create B-Tree, Hash, and Bitmap indexes which might be unexpected for some people because only B-Tree indexes can be created "permanently".
So in some ways database engines do even more than just use statically defined indexes.
So I've had a little consultancy gig for a few decades now where I spend a few days a month optimising bad software for performance (it is what I like; I don’t do anything else but ‘make shit faster’). I can tell you that over the past 10 years, 99% of the optimisations I did were fixing MySQL queries and indexes that table scan. I had projects where table-scanning queries were literally over 50% of the queries run. The result, as you know but apparently is not very common knowledge, is that these sites and apps grind to a halt (after incurring bizarre bills on AWS RDS; I moved many apps from $100k/month bills to $10/month) when even a little traffic comes in.
Or; table scans should be rare but are not.
Edit; removed ‘time’ as that was not a good way of expressing this
How do you build your intuition about creating queries in a way to avoid this?
Is it a matter of having a conceptual model of relational algebra and the way the different db engines work, or is it more an accumulation of heuristics over time for what probably will cause problems, and an iterative process of using EXPLAIN, adjusting the query and seeing what happens?
I am old :) But experience is a big thing; I can sniff out, just by skimming the table definitions, where something is probably very wrong. In uni in the 90s I studied both relational theory and formal methods, and I had to spend a lot of time figuring out and fixing complexity; if you take a university-level book on big-O complexity and work through it, you will have a good feeling for what software can and cannot do, and in what way. That has not really changed; we have more efficient hardware and more cache, we have improved algorithms, but things that cannot be looked up in O(1) are still dangerous and can incur enormous IO even with only a few million records. Naive developers see that things are blazingly fast locally on their laptops and that’s it. I have met, especially in the last few years (in my bubble this is getting worse, quite fast), quite a lot of lead devs who actually do not know what an index is for, and so I see entire dbs without any indexes, or with one only on the id field. I know people (for some reason) do not like ORMs that create tables and indexes, but it would prevent many rookie mistakes if they did.
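The O(1) point can be made concrete with a toy experiment that counts row "touches" instead of wall-clock time. This is a model of the idea, not a database benchmark; the dict stands in for a hash index:

```python
# Linear scan (no index) vs hash lookup (index-like): count touched rows.
rows = list(range(100_000))

def linear_find(rows, key):
    touches = 0
    for r in rows:
        touches += 1
        if r == key:
            break
    return touches

index = {r: r for r in rows}   # stands in for a hash index

scan_touches = linear_find(rows, 99_999)   # worst case: every row is touched
hit = index[99_999]                        # one bucket probe, regardless of size

print(scan_touches)  # 100000
```

Blazingly fast on a laptop with 10k rows, catastrophic at a few hundred million; the scan cost grows with the table, the lookup cost does not.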
I can’t remember the last time I came across a non-CotS database schema with a significant number of secondary indexes. Like more than half a dozen for a hundred tables or more.
I’ve never seen a database use “advanced” features like clustered columnstore or even just page compression.
I just have an email in my inbox from this morning from a small vendor that “doesn’t recommend” columnstore for a database containing 10 TB of numeric metrics in one table.
That would compress to a few gigabytes and query times would go from minutes to milliseconds.
But they “don’t support it”.
Which I now translate to: “we haven’t even flipped through the manual and when we googled it in a panic we didn’t understand it.”
This is how your data is being managed at huge enterprises and government agencies around the world.
Exactly. And what I find far, far worse is that this is bleeding into startups. I did not express myself too clearly above; I don't consult for one client (I used 'gig', not 'gigs', which I cannot edit anymore); it's different clients every time; I make things fast(er) and then go on to the next one. I do this really fast (aka cheap) because I don't look at or care about the code (if I did care, this would be a fulltime job). I find stuff by going over the code with grep and checking the storage, and then taking a sledgehammer to beat things into performance.
I had a startup with around $1m seed investment who asked for help (actually one of the board members, who is a friend) because they were burning through the 1m too fast, and a very big cost was AWS. I wasn't allowed to make changes, but recommended actually adding indexes to the database and adding caching in some places in the code. I also found some strange O(2^n) 'algorithms' in the code, but they weren't used much; I recommended not being clever and using libraries or the database (they all had to do with geo pathfinding stuff; do people know how to use google?). I estimated that their AWS costs would drop dramatically by doing that. Instead of doing this, their investors are upping their investment so the company can keep iterating fast.
I kind of understand this to some extent. However, these things won't cost too much time, and when you are building things the first time they don't cost extra time at all; you just need to know them (yes, I'm trying to be polite and nice about people who create software and do not know about db indexing). Some of these companies will grow to be the next something-you-use-every-day, and this is how the data is handled.
Maybe I should write a book of anonymised client misery stories. I have too many, and one day I will die and some people will never encounter this; because I usually work in gigs I got via C-level execs, I see many layers of absolute garbage at the same time inside a company, especially inside big ones. People here and on Reddit who have never seen these things and think large enterprises are smoothly run places really should be exposed to the absolute chaos that goes on there.
The knee-jerk reaction I get to any proposed database tuning is: "That sounds expensive, let's just throw more cores at it to solve the scalability issues."
Of course, tuning is often a one-time activity and cores cost money monthly, but they ignore that.
They also ignore that if one user gets a poor experience, then it is by definition not caused by a lack of scale. Conversely, it will cause scalability issues, but that's a side effect and not the root cause.
I'm starting to suspect that 90-98% of all "web-scale" architectures are compensating for errors like this. Nobody has tried to use the "release" build, add an index, or just use a binary data format.
First, realize that SQL is not a procedural language; you are only describing the result set. The data store will then create an execution plan, which is the actual code that gets run. Learn to read the explain output in your data store of choice (very few SWEs do this). If you have access to a database administrator in your company, befriend and learn from them. Read about how databases store and retrieve data: from SQL to data pages. Measure, measure, measure. Learn about different types of indexes and their trade-offs. Remember that “it depends” is the answer to almost every db question, and that you should be thinking through all your code's interactions. That is the path to mastery of dbs.
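"Learn to read the explain" can be practiced in miniature. A sketch using SQLite (the table and index names are invented; server engines show much richer plans, but the reading skill is the same):

```python
import sqlite3

# A two-table join: the plan tells you which table is scanned and
# which index drives the join -- the "actual code" behind the SQL.
con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE customers (id INTEGER, name TEXT);
  CREATE TABLE orders (id INTEGER, customer_id INTEGER);
  CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

sql = "SELECT * FROM customers c JOIN orders o ON o.customer_id = c.id"
details = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]
for d in details:
    print(d)   # e.g. a SCAN of customers, then a SEARCH of orders via the index
```

The declarative query says nothing about loop order; the plan does, and that is where the performance lives.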
Yeah, that works. My process is very different, but your advice is better as it hangs more on actionable tactics than learning intuition.
> Once you understand the underlying data structures, all the magic goes away. As it should.
Once you actually understand them, I feel you don’t need explain in most cases; you will simply ‘see’ why certain queries or definitions or structures are bad.
> Once you actually understand them, I feel you don’t need explain in most cases; you will simply ‘see’ why certain queries or definitions or structures are bad.
Ummmm.... my experience pushes me strongly in the other direction. The same query's behaviour can be utterly different depending on the distribution of the data it runs on, and the parameters, and the currency of the statistics, and maybe memory and CPU pressure, and more. One of the heads of the team that wrote SQL Server said "don't try to outguess the optimiser". That's very good advice that's served me well; otherwise it's exactly like optimising a program without profiling it: just don't (learnt that the hard way).
All the internal details of how the server works that you learn are 95% useless until you need them, then they're essential, but just profiling is the best bang for the buck by a long way as a first step, and is always the first step (for me anyway).
> The same query's behaviour can be utterly different depending on the distribution of the data it runs on, and the parameters, and the currency of the statistics, and maybe memory and cpu pressure, and more
Yeah, but it is tough and time-consuming to test all those parts. OK, you can specifically try to benchmark by inserting skewed data, trying one parameter and then another, but the extra time it takes from you... it could be done for critical workloads, probably. Maybe a better way would be just to force some particular index you see is suitable for that query, or maybe use other hints that will hopefully prevent the query optimizer from biting you down the road.
> > The same query's behaviour can be utterly different depending on the distribution of the data it runs on, and the parameters, and the currency of the statistics, and maybe memory and cpu pressure, and more
> Yeah, but it is tough and time consuming to test all those parts.
GP was saying that because it is so tough and time consuming to test this, the best thing to do is just to look at the slowest queries and optimize them, instead of trying to identify problems by reasoning from first principles.
To clarify an earlier one where I said "The same query's behaviour can be utterly different depending on..." what I was saying was that the same query might be executed completely, totally differently depending on what the optimiser deems best. It might use a merge join with an index, or it might (depending on the data distribution) use a loop join and ignore the index(es) entirely. Same query, different execution.
This is why you should (almost) never use hints unless you know a lot more about the data than the DB server, which is rare (or unless the optimiser is being consistently stupid and very sub-optimal, that happens). If you add an index hint the DB will be forced to use it even if it reduces performance, and it can reduce performance a lot. If you force join orders explicitly you can catastrophically fuck performance - I think I've only ever done that once in production.
So profile, add indexes accordingly, but typically don't force them to be used; let the DB use them, or not, as it sees fit. Save hints for especially problematic situations.
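SQLite makes the same trade-off visible with its `NOT INDEXED` / `INDEXED BY` clauses, which play the role of hints here (MySQL and SQL Server have their own hint syntax; this is just an analogy):

```python
import sqlite3

# Overriding the planner: the same query, free vs "hinted" into a scan.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.execute("CREATE INDEX idx_t_a ON t(a)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

free   = plan("SELECT * FROM t WHERE a = 1")              # planner picks idx_t_a
forced = plan("SELECT * FROM t NOT INDEXED WHERE a = 1")  # forced into a scan

print(free)
print(forced)
```

Here the override obviously hurts; in real systems the damage is the same shape but harder to see, which is the point about saving hints for emergencies.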
I see you deal with these issues too. In my case, SQL Server provides storage for a 3rd-party application (Microsoft Dynamics AX). It is customizable, but only up to a point. I want to share some experience.
What we have is a huge covering index, consisting of 6+ key columns and many INCLUDE columns. The query executes millions of times per day, actually per hour (which implies the query plan MUST be cached and the query parametrized), and the table itself is close to a billion rows. SQL 2008 R2. And what the SQL optimizer may sometimes think is: "hmm, that index takes up many bytes, let's use the clustered index - not so effective on lookup, but anyway." And then performance tanks, because, yeah, for that particular parameter value the reasoning was true, but not for the 99% of other cases. And in these cases, forcing a particular index is a lifesaver. Along with plan guides, which actually enable forcing that index on that query without touching the application.
So what did I want to share? Sadly, even a covering index may in some cases not be chosen by the query engine.
If the covering index is almost as fat as the table, there may be little benefit in it. If the original table can just fit in RAM but the original table + covering index together can't, they may be fighting for space in RAM, which means hitting disk, which means slooow. But it's hard to diagnose from a distance.
> The query executes millions of times per day, actually per hour (that implies query plan MUST be cached and query parametrized) ...
Not at all! See below (Edit: I see what you're saying. Still, see below. Opt. 2 might be best here, but it depends)
> And then performance tanks, because, yeah, for that particular parameter value the reasoning was true, but not for 99% other cases
We had billion row searches and the cost of a recompile might be a second or two but the cost of a bad query (from previous, cached plan) can be vastly larger. Esp. here as you talk about it not suiting the other 99% of queries.
2 poss solutions:
1. Force a recompile every time. <https://docs.microsoft.com/en-us/sql/relational-databases/st...> (... WITH RECOMPILE clause). This really worked for us. Note that cpu is proportionately cheaper if you have multiple cores, and who doesn't these days. Recompiles are cheap on such machines. Recompile works very well IME!
2. More advanced. If you know the type of data the query will handle, then write identical queries in stored procs with different names, and consistently use a given proc for given expected parameter(s). Each proc compiles and stores its own query plan, so you have multiple query plans ready to go.
Further, make sure your stats are a) present and b) up-to-date. Bad stats = train-wreck query plans.
Further redux: don't assume that a covering index is all good news. I'd have to look at the query plan (edit: what I'm saying is that multiple 'thinner' covering indexes may be better than one fat one).
1 & 2 may be complementary rather than mutually exclusive, but I've never used both together.
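What a covering index buys (and what it doesn't) can be shown in miniature. A sketch in SQLite with invented names; SQL Server's INCLUDE columns are the richer version of the same idea:

```python
import sqlite3

# A covering index answers the query from the index alone; a column
# outside the index forces a lookup back into the table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (k INTEGER, v INTEGER, extra TEXT)")
con.execute("CREATE INDEX idx_m_cover ON m(k, v)")  # covers SELECT v ... WHERE k=?

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

covered = plan("SELECT v FROM m WHERE k = 7")      # answered from the index alone
lookup  = plan("SELECT extra FROM m WHERE k = 7")  # must go back to the table

print(covered)
print(lookup)
```

SQLite even labels the first plan "COVERING INDEX". The flip side, as noted above, is that a very fat covering index is nearly a second copy of the table.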
Nah. That may take multiple milliseconds. When fine-tuned, this query runs in <1ms. It would be an option if it executed thousands of times per hour, but not millions.
> write identical queries in stored procs with different names
Changing a 3rd-party vendor application is not an option. At least for a DBA. Well, maybe it could be an option when there are really no other options and the business starts shelling out tons of money to the vendor. But that's a hack - better to hack the SQL directly than to bother implementing these ugly hacks in a vendor app, hacks that may be applicable only to a particular case and a particular data distribution, with hardcoded stuff.
> Further, make sure your stats are a) present and b) up-to-date
Yep, that's valid - for some tables, there is even an agent job which does this fairly regularly.
> Further redux, don't assume that a covering index is all good news. I'd have to look at the query plan.
Query is very simple: SELECT SUM(COL1), SUM(COL2), ...snip..., SUM(COL20) FROM X WHERE OTHERCOL1=@P1 AND OTHERCOL2=@P2 AND OTHERCOL3=@P3 AND OTHERCOL4=@P4 AND OTHERCOL5=@P5 AND OTHERCOL6=@P6 AND (OTHERCOL7=@P7 OR (OTHERCOL8=@P8 AND OTHERCOL9=@P9)) AND OTHERCOL10>@P10 AND OTHERCOL11>@P11 AND OTHERCOL12<=@P12 AND (OTHERCOL13<=@P13 OR OTHERCOL14<>@P14)
I don't know the exact purpose of the query, but I think it's part of a process that rolls up some transactional data into fewer rows (recalculates item quantities).
A covering index provides a simple seek from a single index. Having a clustered index with that many key columns would be bad for every other index, which would all have to carry those columns and use them for lookups. But this is a business-critical table, and this is what it takes so that SQL Server and the application keep ticking. Of course, there are probably solutions if you own the application and can rearchitect, etc. But here, operations must be transactionally correct, locking must be minimized, etc. If execution time grows to 20ms (I just checked how much time it takes for this query to generate a query plan), we get more than 20x the total duration and a MUCH longer process to execute. For a particular day, I see ~15.2 mil executions between 08:00 and 20:12. That is 3h2m total duration with an average 0.7ms execution time (excellent). Generating a query plan for every execution would be a disaster, and I'd get a call for an incident. For other queries, a 20ms execution time is not a big deal.
I'll add that for this particular query, it is rare that it would cache a bad plan. But in those cases we get an incident from the customer, and thus we had to prevent it from happening from time to time.
I don't want to argue, because we have different experiences, situations, applications, possibilities, designs and data. What you write are valid points, applicable to most cases. But I keep going with replies because I've learned a ton from HN, and I think our discussion may help someone learn a thing or two about SQL Server. And I like finally talking to someone who experiences this stuff.
Gotcha. It's a very different scenario from what I imagined, and my suggestions don't apply. Agreed also on your limited control over rewriting, and with the comment on clustered indexes.
I do wonder how it is possible to run that query on a billion rows and typically get a sub-millisecond response, but however you did it, well done!
I mostly work the way you do, but still find explain helpful when my intuition fails. Most notably, this usually happens when an index exists but the query manages to miss it. Most often I experience this with dates and ORM-built queries that type-cast.
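That failure mode is easy to reproduce: wrap the indexed column in a function (which is what an ORM-injected cast amounts to) and the B-tree order can no longer be used. A sketch in SQLite with made-up names:

```python
import sqlite3

# An expression on the indexed column defeats the index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ev (id INTEGER, created TEXT)")  # ISO-8601 dates as text
con.execute("CREATE INDEX idx_ev_created ON ev(created)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

good = plan("SELECT * FROM ev WHERE created >= '2024-01-01'")
bad  = plan("SELECT * FROM ev WHERE date(created) >= '2024-01-01'")  # cast-style miss

print(good)  # range SEARCH via idx_ev_created
print(bad)   # full SCAN: the index exists but is missed
```

Both queries return the same rows; only explain reveals that the second one scans the whole table.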
> For example, SQL Server will create a temporary index on the fly if one is missing
Err, I may be misunderstanding, but can you explain why you believe this? I have never (IME) seen MSSQL do this, and it wouldn't make sense, because constructing an index needs a table scan plus a lot of work on top. Just doing a hash join is simply the better option.
I mean, it would be nice at times, but there are traps to this, which is why (again, AFAICS) it's not done; it would be unsafe to do without a lot of info about resources and future expected queries that the query planner just doesn't have.
It won't create an index you can see in the GUI or query via "sys.indexes" or anything like that.
It's a temporary object, much like a temporary table that exists only in the scope of the query.
As another comment mentioned, this is what a hash-join does internally: it builds a temporary "hash index" of one input, and then uses it to look up rows while scanning through the other input.
If you look at the query plans in SSMS, you'll occasionally see bitmap indexes as well.
The equivalent of a standard B-Tree index that you would create permanently is the "Index Spool" operator. You'll also see "Table Spool", which is basically a temporary heap.
The example in the original article was the equivalent of this loop:
    foreach( a in table_a ) {
        foreach( b in table_b ) {
            if ( a.id == b.aid ) ...
        }
    }
That's hideously inefficient. Most databases will automatically do something like:
    var a_hash = new Hashtable( a.row_count )
    foreach( a in table_a ) {
        a_hash.add( a.id, a )
    }
    foreach( b in table_b ) {
        if ( a_hash.lookup( b.aid )) ...
    }
The clever part in all of this is that you can do this two ways: build a hashtable of "a" and lookup "b" rows in it, OR build a table of "b" and lookup "a" rows in it. They're equivalent, but the performance can be wildly different.
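A runnable version of that hash join, including the build-side choice, might look like this (Python, with hypothetical `table_a`/`table_b` rows as dicts):

```python
# Hash join with the planner's decision baked in: build the hash table
# on the smaller input, probe with the larger one.
def hash_join(left, right, left_key, right_key):
    if len(left) > len(right):
        # Swap sides so we always build on the smaller input, then
        # restore the original (left_row, right_row) pair order.
        return [(l, r) for (r, l) in hash_join(right, left, right_key, left_key)]
    buckets = {}
    for row in left:
        buckets.setdefault(row[left_key], []).append(row)
    out = []
    for row in right:
        for match in buckets.get(row[right_key], []):
            out.append((match, row))
    return out

table_a = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
table_b = [{"aid": 1, "item": "x"}, {"aid": 1, "item": "y"}, {"aid": 3, "item": "z"}]

joined = hash_join(table_a, table_b, "id", "aid")
print(joined)  # ann matches x and y; bob and the aid=3 row have no partner
```

A real engine makes the build-side decision from row-count statistics rather than `len()`, and spills to disk when the build side doesn't fit in memory, but the shape of the algorithm is the same.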
RDBMS query planners have the job of figuring out which to pick. Even if you think you can outperform the database by writing code like the above in Java or C# or whatever, you won't write out every combination and have the statistics available to choose. The database engine can and does.
SQL Server can do both steps in parallel across all CPU cores, which is the topic of several PhD-level research papers. For example, hash tables can have performance issues if the same key occurs too often (e.g. many NULL columns). Balancing this across multiple cores is... complicated.