
Techniques for Microsoft SQL database design and optimization - DenTurpitka
https://www.apriorit.com/dev-blog/421-microsoft-sql-optimization
======
Pxtl
Here's my SQL Database Optimization Guide:

Ignore all the advice about "rules". Create a realistic dataset and run
performance tests before you release _anything_ into production. Check your
query-plans to find the pain-points. The database is frequently treated black
box surrounded by superstition and cargo-cult "rules" that are either out of
date or just educated guesses.

Wrap anything weird in a view so you can rebuild it as you see fit.

Avoid getting fancy as much as possible until the simple, normalized, naive
implementation fails.

~~~
matwood
This IS the answer for all databases. Start simple, do straight forward
queries, and then measure, measure, measure.

Unfortunately query plan builders can do odd things. Is IN faster than EXISTS?
Sometimes yes, sometimes no. The only way to know is to test, and this is true
across the board.

~~~
jpitz
Not only is it sometimes faster, and sometimes slower, but the answer can
change, for the same schema, for the same query, depending on the volume and
distribution of data.

~~~
nine_k
Can you please provide an example of answer changing?

~~~
swasheck
Yes. Every day. The developers here write SQL that runs acceptably for a given
schema at a given point in time. Unfortunately, the code pattern isn't
scalable (which is generally why people say that RDBMS' can't "scale" anyway,
but that's a different rant) and performance changes, sometimes dramatically.
Where a LEFT OUT JOIN ... WHERE outer.value is NULL worked great for an anti-
semijoin earlier, NOT EXISTS may be better. What jpitz is talking about are
statistics that are tracked by the engine. They are a key part of how SQL
Server's CBO determines the best plan to access the data. As data grows,
chances are you're going to get skew on the keys that you've defined. If your
data is such that the counts and distinct counts of your keys are guaranteed
to remain consistent, then you have a lower chance of plan changes as data
scales.

------
dsp1234
I have found that _Use The Index, Luke_ is a good reference when looking at
optimizations. It's also nice that it covers most of the major SQL databases,
so you can see differing optimizations and how to handle them.

[http://use-the-index-luke.com/sql/preface](http://use-the-index-
luke.com/sql/preface)

~~~
jpitz
That is a great resource.

Here are some I've used:

Inside Microsoft SQL Server 2008 T-SQL Querying: T-SQL Querying
[https://books.google.com/books?id=FZlCAwAAQBAJ&printsec=fron...](https://books.google.com/books?id=FZlCAwAAQBAJ&printsec=frontcover#v=onepage&q&f=false)

Good resource for designing high performing queries and investigating
performance problems.

PostgreSQL 9.0: High Performance
[https://books.google.com/books?id=OWOAu0GcsqoC&printsec=fron...](https://books.google.com/books?id=OWOAu0GcsqoC&printsec=frontcover#v=onepage&q&f=false)

Greg lays out a great foundation for doing performance analysis that is
largely applicable to any rdbms platform.

Optimizing Oracle Performance
[https://books.google.com/books?id=D7ElCwAAQBAJ&printsec=fron...](https://books.google.com/books?id=D7ElCwAAQBAJ&printsec=frontcover#v=onepage&q&f=false)

Another great book that explains the process of approaching and solving
performance problems.

Joe Celko's SQL For Smarties
[https://books.google.com/books?id=i-19BAAAQBAJ&printsec=fron...](https://books.google.com/books?id=i-19BAAAQBAJ&printsec=frontcover#v=onepage&q&f=false)

Everyone writing SQL should read Joe.

~~~
icc97
What intrigues me about Joe Celko's book is that he is completely anti all
forms of auto indexing (and it makes sense) but is the assumption for all the
web frameworks that I come across. Most recently I hit this when looking at
yesod.

------
MSM
Good list here. A few things I would add about execution plans specifically.

1\. SET STATISTICS IO ON. This is _much_ more useful than setting time on imo.
This will give you physical / logical reads per table as well as how
specifically they were fetched. Usually I don't care if the query took 11
seconds, I want to know what took those 11 seconds.

2\. The % numbers that everyone relies on when skimming execution plans are
total BS. No one seems to realize this, but the percentage values are based
off of the estimated query plan even when you're running the actual execution
plan. Use the plan to determine which operators were chosen, disregard the %
values. To get the true amount of work done, look at stats IO above. If your
underlying issues is missing or stale stats for example, that incorrect data
will pass through to your execution plan and that plan will lie to you.

3\. Try to determine _why_ a plan is getting generated. Don't just keep trying
wacky code changes until you can get the correct plan (once..), find out what
the optimizer is seeing and why it's doing what it's doing. You may know what
operation is best in the current moment ("this was much faster as a nested
loop") but instead of using a join hint, force the optimizer's hand by
correcting whatever underlying issue is making it think that the merge/hash is
a better route. This will save future-you hours of head-bashing when,
inevitably, that join hint that was added three years ago causes a big
production issue.

------
cm2187
A lesser known hint I remember having heard. If you use an indexed column in
your query, use the same character encoding in the query than in the column,
otherwise you end up with a full scan (like varchar vs nvarchar).

~~~
olavgg
This is also true for date types. Be careful when you use date trunc
functions.

------
gigatexal
I'm a junior DBA so take what follows with a grain of "he probably has no idea
what he's talking about":

One of the most frustrating things coming from a world of say python or C is a
function that implements an algorithm of say O(n) it will run in O(n) all-day-
everyday but if a table's stats are out of date or there's a missing index or
anything related goes awry a typically fine running query could perform
terribly. And that's infuriating but it's the nature of the beast I guess.

~~~
Pxtl
Yup. As developers we're always told to trust the database and to use proper,
normalized relational theory and how it's so much better that in the database
we just say what we want instead of specifying exactly how to calculate it
like we do in the client programming languages...

But that beautiful kernel of relational theory is wrapped up in a six-foot-
ball of duck-tape, bubble-gum, and hot glue in the form of a zillion hacks and
leaky abstractions.

At a certain point it really feels like databases are failing to live up to
their promises. I _expect_ things like sharding to be hard. I'm okay with
that. But if I have _one_ database server, I expect to be able to create
tables, _maybe_ indexes, and for it to Just Work given that information.
That's what we're paying for.

~~~
gigatexal
Until there's an automated way for a database to tell a developer that their
particular query is running terribly because of some defacto reason there'll
always be a need for a DBA to interpret and investigate, though, I wish it
weren't so. I wish we could just treat databases as "dumb" stores of
information that could be queried in n number of ways without issue. Some or
most DB-as-a-service offerings will handle the mundane tasks of backing up,
upgrades, etc. but I know of none that offer this DBA-like level of service
where an automated tool can tell you exactly what's going wrong, basically
take the context of your query and match it with the current execution plan
and the underlying structure of the tables and your data (data types included)
and tell you what is wrong. That'd be a huge innovation if it exists. And I'd
be out of a job.

------
tyfon
Another issue with sql server which is very baffeling. It is MUCH faster to
use sub selects than left join if you data allows it.

That is do not:

select a.bla, b.blabla from foo a left join bar b on (a.id = b.id)

Do:

select a.bla, (select blabla from bar where id = a.id) as blabla from foo a

I have reduced a big etl query with about 20 left joins from 2 hours running
time down to 7 minutes by this.

And before you ask, yes the tables have been properly indexed, it is just that
left join performance in sql server is very bad in some situations.

~~~
Tostino
Did you compare the execution plans of both queries? Without doing that, it's
hard to know why the planner did what it did. Maybe statistics were off and it
was choosing a worse plan for the left join because it thought it'd be faster,
where the sub-select it was forced into what should have been an inferior plan
(if statistics were correct), which in reality was faster.

~~~
gigatexal
yes I, too, would like to see more on this. I am skeptical that doing sub
queries is faster than joins

------
rdiddly
Is it just Firefox or is this page a pioneer of "staunchly unresponsive
design?"

Edit: nope, Chrome too. I literally can't fit all the text on a small screen,
no matter how feverishly I pinch, rotate or pan!

~~~
iask
Same issue on my iPhone 5s. I changed to text mode and was able to view.

------
zamalek
It may no longer be current, however, the query parser has an easier time if
you use semicolons. After making a habit out of this for a few years, I find
that the brain has an easier time too.

------
hobs
Some tips from a SQL Server DBA:

* DON'T USE USER DEFINED FUNCTIONS IN ANY PREDICATE OF ANY TYPE EVER, THIS IS DEATH.

* If you want things to go fast, consider that your query needs to be able to be converted into a "search argument" to use any of those precious indexes you made, so:

LIKE 'hdfhafsdh%' \-- This works!

LIKE '%fafads%' \-- Scanning all the things, so sad.

In the same vein, any function you use on your values in the tables mean that
there are often no indexes you can use to help your operation (with a few
technical exceptions around specific optimizations SQL Server has around
datetime <-> date calculations )

Download SQL Sentry Plan Explorer - it just became free and with it you can
shred apart your execution plans with ease and keep histories of the work you
have done on a query, or watch a plan change over time as your tune it with
ease.

[https://sentryone.com/plan-explorer](https://sentryone.com/plan-explorer)

~~~
dang
Please don't use all-caps for emphasis. This is in the HN guidelines:

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

