Ignore all the advice about "rules". Create a realistic dataset and run performance tests before you release anything into production. Check your query plans to find the pain points. The database is frequently treated as a black box surrounded by superstition and cargo-cult "rules" that are either out of date or just educated guesses.
Wrap anything weird in a view so you can rebuild it as you see fit.
Avoid getting fancy as much as possible until the simple, normalized, naive implementation fails.
This IS the answer for all databases. Start simple, do straightforward queries, and then measure, measure, measure.
Unfortunately query plan builders can do odd things. Is IN faster than EXISTS? Sometimes yes, sometimes no. The only way to know is to test, and this is true across the board.
Not only is it sometimes faster, and sometimes slower, but the answer can change, for the same schema, for the same query, depending on the volume and distribution of data.
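To make the "sometimes yes, sometimes no" point concrete: the two forms are logically equivalent, so only measurement against your data tells you which wins. A minimal sketch, using SQLite via Python's stdlib sqlite3 purely for illustration (the thread is about SQL Server/Postgres, whose planners differ; the table names "customers" and "orders" are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Same question asked two ways: "which customers have at least one order?"
with_in = conn.execute(
    "SELECT id FROM customers WHERE id IN (SELECT customer_id FROM orders) "
    "ORDER BY id"
).fetchall()
with_exists = conn.execute(
    "SELECT id FROM customers c WHERE EXISTS "
    "(SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY id"
).fetchall()

# The results are identical; the plans are chosen independently.
assert with_in == with_exists == [(1,), (3,)]
for sql in ("SELECT id FROM customers WHERE id IN (SELECT customer_id FROM orders)",
            "SELECT id FROM customers c WHERE EXISTS "
            "(SELECT 1 FROM orders o WHERE o.customer_id = c.id)"):
    print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```

The plans printed at the end are where the engines disagree with each other, and with themselves across data volumes.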
Yes. Every day. The developers here write SQL that runs acceptably for a given schema at a given point in time. Unfortunately, the code pattern isn't scalable (which is generally why people say that RDBMSes can't "scale", but that's a different rant) and performance changes, sometimes dramatically. Where a LEFT OUTER JOIN ... WHERE outer.value IS NULL worked great as an anti-semi-join earlier, NOT EXISTS may be better later. What jpitz is talking about are the statistics tracked by the engine. They are a key part of how SQL Server's CBO determines the best plan to access the data. As data grows, chances are you're going to get skew on the keys that you've defined. If your data is such that the counts and distinct counts of your keys are guaranteed to remain consistent, then you have a lower chance of plan changes as data scales.
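The anti-join spellings mentioned above can be sketched as follows. This uses SQLite via Python's sqlite3 for illustration (hypothetical tables "foo"/"bar"); the LEFT JOIN ... IS NULL and NOT EXISTS forms always agree, while the tempting NOT IN form silently matches nothing if the subquery produces a NULL, which is a classic trap in standard SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (foo_id INTEGER);          -- nullable on purpose
    INSERT INTO foo VALUES (1), (2), (3);
    INSERT INTO bar VALUES (1), (NULL);
""")

# Anti-join, form 1: outer join and keep the unmatched rows.
left_join = conn.execute("""
    SELECT f.id FROM foo f
    LEFT OUTER JOIN bar b ON b.foo_id = f.id
    WHERE b.foo_id IS NULL ORDER BY f.id
""").fetchall()

# Anti-join, form 2: NOT EXISTS.
not_exists = conn.execute("""
    SELECT f.id FROM foo f
    WHERE NOT EXISTS (SELECT 1 FROM bar b WHERE b.foo_id = f.id)
    ORDER BY f.id
""").fetchall()

# "Anti-join", form 3: NOT IN -- broken by the NULL in bar.foo_id,
# because id NOT IN (1, NULL) evaluates to unknown for every row.
not_in = conn.execute(
    "SELECT id FROM foo WHERE id NOT IN (SELECT foo_id FROM bar)"
).fetchall()

assert left_join == not_exists == [(2,), (3,)]
assert not_in == []
```

Which of the first two forms the optimizer handles better is exactly the kind of thing that flips as statistics and volumes change.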
It's been a while since I've spent time with MSSQL, but I remember seeing this with sprocs and cardinality. The sproc would compile with one plan that dealt with tables holding little data. The plan would not always get updated as the tables filled with data.
The underlying reason is that as data changes the plan changes. For example, on a small table, it is faster/more efficient to do a table scan than go to the index and then to the table for data. At some threshold this changes. If table size can change the plan, then the query itself could also need changing as the table grows, indexes grow, and differing index cardinality emerges.
Offhand, I don't remember the specifics - I observed this a decade ago. Sorry. I don't have a SQL Server around to try to craft it either.
The gist of it is, the query planner for some database engines (pg and mssql) often takes cardinality into account when selecting plans. Therefore those plans change with the size of the dataset.
EDIT: It occurs to me to wonder if you're asking if the resultset changed. No, I've not seen that answer change.
The very fact that IN can produce different query plans than an equivalent EXISTS is a very nasty smell. As a developer, I shouldn't have to care. The database is supposed to figure that out. That's why we're paying no small sum of money for this.
Agreed, fancy and clever in optimization/design is super dangerous, especially since you don't know what will or won't work until you measure. Also, if you do "wrap anything weird in a view" (which is an acceptable approach), realize that anything you've hidden behind that view could itself be the problem.
A query plan can look good even on a representative data set, but your customers don't always have representative data sets, and they don't really care that it works on the bench with 10 million records if their own 10-million-record data set doesn't.
I have found that Use The Index, Luke is a good reference when looking at optimizations. It's also nice that it covers most of the major SQL databases, so you can see differing optimizations and how to handle them.
What intrigues me about Joe Celko's book is that he is completely against all forms of auto indexing (and it makes sense), yet auto indexing is the baked-in assumption of every web framework I come across. Most recently I hit this when looking at yesod.
Good list here. A few things I would add about execution plans specifically.
1. SET STATISTICS IO ON. This is much more useful than SET STATISTICS TIME ON, imo. It will give you physical/logical reads per table as well as how, specifically, they were fetched. Usually I don't care that the query took 11 seconds; I want to know what took those 11 seconds.
2. The % numbers that everyone relies on when skimming execution plans are total BS. No one seems to realize this, but the percentage values are based off of the estimated query plan even when you're viewing the actual execution plan. Use the plan to determine which operators were chosen, and disregard the % values. To get the true amount of work done, look at STATISTICS IO above. If your underlying issue is missing or stale stats, for example, that incorrect data will pass through to your execution plan, and that plan will lie to you.
3. Try to determine why a plan is getting generated. Don't just keep trying wacky code changes until you get the correct plan (for now); find out what the optimizer is seeing and why it's doing what it's doing. You may know what operation is best in the current moment ("this was much faster as a nested loop"), but instead of using a join hint, force the optimizer's hand by correcting whatever underlying issue is making it think that the merge/hash is the better route. This will save future-you hours of head-bashing when, inevitably, the join hint that was added three years ago causes a big production issue.
A lesser-known hint I remember hearing: if you use an indexed column in your query, use the same character type in the query as in the column (varchar vs nvarchar, for example), otherwise you end up with a full scan.
I'm a junior DBA so take what follows with a grain of "he probably has no idea what he's talking about":
One of the most frustrating things, coming from a world of say Python or C, is that a function implementing an O(n) algorithm runs in O(n) all day, every day. But if a table's stats are out of date, or there's a missing index, or anything related goes awry, a typically fine-running query can perform terribly. And that's infuriating, but it's the nature of the beast, I guess.
Yup. As developers we're always told to trust the database, to use proper, normalized relational theory, and that it's so much better that in the database we just say what we want instead of specifying exactly how to calculate it like we do in client programming languages...
But that beautiful kernel of relational theory is wrapped up in a six-foot ball of duct tape, bubble gum, and hot glue, in the form of a zillion hacks and leaky abstractions.
At a certain point it really feels like databases are failing to live up to their promises. I expect things like sharding to be hard. I'm okay with that. But if I have one database server, I expect to be able to create tables, maybe indexes, and for it to Just Work given that information. That's what we're paying for.
Until there's an automated way for a database to tell a developer that their particular query is running terribly, and why, there'll always be a need for a DBA to interpret and investigate. I wish it weren't so. I wish we could just treat databases as "dumb" stores of information that could be queried n different ways without issue. Some or most DB-as-a-service offerings will handle the mundane tasks of backups, upgrades, etc., but I know of none that offer this DBA-like level of service: an automated tool that takes the context of your query, matches it against the current execution plan and the underlying structure of the tables and your data (data types included), and tells you exactly what is wrong. That'd be a huge innovation if it existed. And I'd be out of a job.
Did you compare the execution plans of both queries? Without doing that, it's hard to know why the planner did what it did. Maybe statistics were off and it was choosing a worse plan for the left join because it thought that plan would be faster, while the sub-select forced it into what should have been an inferior plan (if the statistics were correct) but was in reality faster.
Technically the sub select is a semi join which could be optimized differently than left join (and could give different results). If bar.id had a unique constraint though the queries should generate identical plans.
Well, I simplified the example a bit, for the real query bar.id is not unique and has to be constrained by other means.
Think of an application with a table of credit-scoring variables, where each variable has a case id and a variable type id, so I had to join the same table multiple times with different criteria.
But replacing all of the left joins here with subselects did reduce the query time by an insane amount.
However, I have seen this behaviour with unique-indexed left joins too, just not so extreme.
And for comparison (i have the same schema in my analytics postgresql database), the exact same query here takes about 1.5 minutes with left joins.
The two statements, as mentioned by gfody, are not equivalent. For the first statement, there can be any number of output rows based on the join condition. For the second statement, by definition, there can only be one output row per input row in foo. That's a lot more information that the query planner can use to make optimizations.
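The row-count difference is easy to see in miniature. A sketch using SQLite via Python's sqlite3, with hypothetical tables foo/bar where bar.foo_id is not unique (note: SQL Server would raise an error if the scalar sub-select found more than one row; SQLite quietly picks one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (foo_id INTEGER, val TEXT);   -- duplicate keys allowed
    INSERT INTO foo VALUES (1), (2);
    INSERT INTO bar VALUES (1, 'a'), (1, 'b');
""")

# Left join: output cardinality depends on the join condition.
joined = conn.execute(
    "SELECT f.id, b.val FROM foo f LEFT JOIN bar b ON b.foo_id = f.id"
).fetchall()

# Scalar sub-select: by definition, exactly one output row per row of foo.
subselect = conn.execute(
    "SELECT f.id, (SELECT b.val FROM bar b WHERE b.foo_id = f.id) FROM foo f"
).fetchall()

assert len(joined) == 3      # foo row 1 matched twice, row 2 once (NULL)
assert len(subselect) == 2   # one row per row of foo, no matter what
```

That one-row-per-input-row guarantee is the extra information the planner gets to exploit.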
It's possible, as the other commenters mentioned, it would come down to reviewing the query plan. I was just trying to address the two parts of your original comment, which were "is very baffeling" and "left join performance in sql server is very bad in some situations.". Hopefully my answer helped to un-baffel a little.
You've got this backwards. Assuming the join brings back at most one row (i.e. the join key is a unique key), the left join is better because it can be optimized differently. The sub-select must be done as a nested loop, but the join can be done via a hash join. For a full data set (not just a few selected rows) it'd be more efficient. Simply count the net logical IOs.
That is why I think this is baffling behaviour by SQL Server, since in certain queries you can get an order-of-magnitude faster execution by dropping the left join and using a subselect.
It's counterintuitive, but try it on a real dataset.
This may no longer be current, but the query parser has an easier time if you use semicolons. After making a habit of this for a few years, I find that the brain has an easier time too.
* DON'T USE USER DEFINED FUNCTIONS IN ANY PREDICATE OF ANY TYPE EVER, THIS IS DEATH.
* If you want things to go fast, consider that your query needs to be able to be converted into a "search argument" to use any of those precious indexes you made, so:
LIKE 'hdfhafsdh%' -- This works!
LIKE '%fafads%' -- Scanning all the things, so sad.
In the same vein, any function you use on your values in the tables means that there are often no indexes you can use to help your operation (with a few technical exceptions around specific optimizations SQL Server has for datetime <-> date calculations).
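Both points above can be demonstrated with any engine's plan output. A sketch using SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 (SQL Server's plan format differs, but the principle is the same; the table/index names are made up, and the case_sensitive_like pragma is a SQLite-specific prerequisite for its LIKE-prefix optimization):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # lets LIKE 'x%' use the index
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
    CREATE INDEX idx_books_title ON books(title);
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

prefix   = plan("SELECT id FROM books WHERE title LIKE 'Database%'")
contains = plan("SELECT id FROM books WHERE title LIKE '%Database%'")
wrapped  = plan("SELECT id FROM books WHERE upper(title) = 'DATABASE'")

assert "SEARCH" in prefix    # sargable: becomes a range seek on the index
assert "SCAN" in contains    # leading wildcard: scanning all the things
assert "SCAN" in wrapped     # function on the column defeats the index
```

The same before/after check (seek vs scan) is a quick smoke test whenever you're unsure whether a predicate is sargable.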
Download SQL Sentry Plan Explorer. It just became free, and with it you can shred apart your execution plans with ease, keep histories of the work you have done on a query, or watch a plan change over time as you tune it.
Not only that, but I have been working with some queries that contradict this. We had some UDFs that are hyper-simple fetchers of config values from a KVP config table.
I tried optimizing them out. Nothing. Looked at the query plans: 0%. SQL Server had reasonably noticed what they were and done a bang-up job handling them.
That's the infuriating thing about SQL server "advice" - it's such a black box that so many of the rules are cargo-cult stuff.
The biggest problem with scalar UDFs is not the IO performance. It is that they force the query to serial (non-parallel). On queries that return a non-trivial quantity of data, that hurts.
It's not a "black box" but what ends up happening is that the recommendation is handed down/out to others without full explanation. Additionally, data volumes truly matter, for the reason(s) outlined above vis a vis table scanning due to (non)determinism.
I see it as a sort of trade-off. Give a lengthy explanation and people often ignore the advice. Make the advice short and frightening and people listen, but might not be so sure why!
Without judgement, here is an explanation of those items mentioned above:
1.) Don't use user defined functions in any predicate.
In order to evaluate a predicate (WHERE clause, HAVING clause, JOIN condition, aka a boolean comparison), the query engine needs to know what both sides of the comparison evaluate to. So for each row, it has to pull a value out of the database and then also execute the UDF (User Defined Function) in order to answer the question "does this predicate match?". UDFs are generally going to be slower than their native equivalents, and the best suggestion is to always try to do predicate matching on pre-computed values. So if you find that you are doing a lot of UDF-based predicate matching, it may be useful to see if you can pre-calculate the value of the UDF for each row, then do the predicate matching against the pre-computed value. In MSSQL, these columns are called "Computed Columns", and computed columns in that engine can be persistent or not. You would generally want the persistent type for best performance.
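The "pre-compute, then match" idea can be sketched without engine-specific features. Here SQLite via Python's sqlite3 stands in, with the computed value maintained by hand at write time (SQL Server's persisted computed columns do this maintenance for you; the "orders" table and the tax calculation are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount REAL,
        amount_with_tax REAL        -- pre-computed at insert time
    );
    CREATE INDEX idx_orders_tax ON orders(amount_with_tax);
""")

def insert_order(order_id, amount, tax_rate=0.2):
    # The "UDF" runs once per write, not once per row per query.
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (order_id, amount, round(amount * (1 + tax_rate), 2)))

insert_order(1, 100.0)
insert_order(2, 50.0)

# The predicate now matches a plain indexed column; no per-row function call.
rows = conn.execute(
    "SELECT id FROM orders WHERE amount_with_tax > 100"
).fetchall()
assert rows == [(1,)]
```

The trade-off is the usual one: a little extra write-time work and storage in exchange for cheap, indexable reads.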
2.) Consider that your query needs to be able to be converted into a "search argument" to use any indexes.
At a logical level, an index is a way of sorting rows. Imagine taking all of the books in your local library, if you have one, and sorting them in alphabetical order. If someone asks you to find all of the titles that begin with "Database", that's pretty easy. You walk down the aisles until you find the first book that starts with the letters D, A, T, A, B, A, S, and E. Then you continue reading off the titles until you find a book that doesn't start with those letters anymore. Since the books are already alphabetized, you know that you've found them all. This is the equivalent of like 'Database%'.
Now imagine that someone instead asks you to find all of the books that have the word Database anywhere in their title. That's harder. Even though the books are alphabetized, the word Database could show up anywhere in the title, not just at the beginning. So you have to read the title of every single book in order to know if it contains the word Database. This takes a lot more time. This is the equivalent of like '%Database%'.
3.) Any function you use on your values in the tables mean that there are often no indexes you can use
The scenario presented here is something like where left(title, 4) = 'asdf' or where firstName + ' ' + lastName = '_quick_q'. Internally, the query planner is going to use what it determines is the best index based upon the left side of the predicate, and even though it might look like "left(title, 4)" would be able to use an index (and maybe in some engines it can), there is generally no index that is going to match that definition. Many times people fall into this trap with simple stuff like where UPPER(LastName) = 'Q'. It seems like the query engine could make this work, but it won't, because the left-hand side of the predicate doesn't match the index definition.
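Some engines do let you make the index definition match the expression. A sketch using SQLite's expression indexes via Python's sqlite3 (SQL Server's closest analogue is an indexed computed column; the "people" table and names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, LastName TEXT);
    INSERT INTO people VALUES (1, 'quick'), (2, 'slow');
""")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without a matching index definition, the function call forces a scan:
assert "SCAN" in plan("SELECT id FROM people WHERE upper(LastName) = 'QUICK'")

# An index whose definition matches the left-hand side makes it a seek:
conn.execute("CREATE INDEX idx_people_upper ON people(upper(LastName))")
assert "SEARCH" in plan("SELECT id FROM people WHERE upper(LastName) = 'QUICK'")
```

The expression in the query has to match the index definition for this to kick in, which is exactly the point the parent is making.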
Programmers avoid working with DBAs because programmers think of SQL and databases as an annoyance that gets in the way of just shoving data into a hole somewhere. This is why MongoDB et al are so popular; they've provided cover for programmers to drop any pretense that organized data is useful. The end result is a continuous stream of data catastrophe after data catastrophe, and thousands of wasted man-hours trying to figure out how to do something with Mongo/Hadoop/whatever that is simple with a SQL server (and often reaching incorrect conclusions anyway).
I'm not saying there's never a good reason to use a doc DB like Mongo or a "Big Data" implementation like Hadoop, but programmers don't choose to use these things for good reasons 95% of the time. They choose to use them a) because they're new, and they want the resume points; and b) because they allow them to shed the pesky DBA and the pesky database engine, always in their hair about the "Right Way" to do things so that the data can be recalled and analyzed quickly, easily, and repeatably.
There is absolutely a flip side to this too, where DBAs behave in a massively unjustified manner toward non-DBAs.
IMO the heart of the issue is the bifurcation in roles. Both DBAs and traditional coders are programmers/developers. Both should be trained in the others' concerns and expected to help out in both parts of the system, and the role distinction of 100% DBA or 100% non-DBA should not exist. This keeps balance in the incentives on both sides.
My problem, as a developer, is how much database behaviour seems actively user-hostile.
The horrifying implications of tri-value logic and the lack of a modern type-system to allow us to have a sensible alternative to NULL is infuriating.
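The tri-value (three-valued) logic complaint in miniature, sketched with SQLite via Python's sqlite3 (this is standard SQL behaviour, not a SQLite quirk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL is neither true nor false -- it is NULL (unknown).
assert conn.execute("SELECT NULL = NULL").fetchone()[0] is None

# A WHERE clause keeps only rows where the predicate is TRUE, so a
# comparison and its negation can BOTH silently drop the NULL rows:
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (NULL);
""")
assert conn.execute("SELECT count(*) FROM t WHERE x = 1").fetchone()[0] == 1
assert conn.execute("SELECT count(*) FROM t WHERE x <> 1").fetchone()[0] == 0
assert conn.execute("SELECT count(*) FROM t").fetchone()[0] == 2  # NULL row exists
```

Two rows, and neither `x = 1` nor `x <> 1` accounts for both of them: that's the kind of surprise the parent means.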
The bipolar attitude about how normalizing data is so important but how so much falls apart when you actually have normalized data.
How infuriatingly hard basic, sane code-reuse is. "Don't use scalar UDFs" basically screams out to me "this is a bad programming language".
Is it any wonder that developers go running to NoSQL solutions? It's not "relational algebra" that they're running from - schemaless behavior is just gravy on top of escaping dealing with the muddy, deceptive black-box world of databases.
The database takes care of everything until it doesn't.
Yeah, this is actually a great example of why it's important that if you're developing an application, you need to do some of its database work, and if you're developing a database, you need to get into the application and write code that accesses the database.
If you're a hobby programmer and do full-stack stuff on your own to make your dreams real, you organically stabilize on a pretty-reasonable balance between database-side accommodation and application-side accommodation. Since you have to feel the full impact of the tradeoffs made, you select the most efficient, reasonable tradeoff available in the large-scale sense.
When you introduce a hard separation between data-side and app-side responsibilities, it's just human nature to get that complexity off of your plate and onto theirs. This competition to put the burden on the other group burns bright and causes massive resentment between DBA and dev.
Unify the roles and watch this disappear. Psycho DBAs who insist on fifth normal form everywhere will gain an appreciation for data locality and consider trivial bloat in table size an acceptable tradeoff, because it will make their application-side code tasks much more tolerable. Self-important developers who treat the DBA like their personal data lackeys will gain an appreciation for good data organization and reporting facilities, because it will make their database-side management and reporting tasks much easier.
Database and application are too tightly coupled to ever successfully divorce the areas of responsibility. You just end up with people at each others' throats all the time. Merging the responsibilities (and thus the incentives) is the only way you'll get these two groups to adopt a unified perspective.