
Results of the SQL Performance Quiz - randomdrake
http://use-the-index-luke.com/blog/2014-02/results-three-minutes-sql-performance-quiz
======
pdubs
Regarding #5: >That caught me by surprise. Both options “roughly the same” and
“depends on the data” got about 25% — the guessing probably.

I don't think it was guessing so much as reasoning that fetching 100 rows (and
filtering by value) instead of 10 rows doesn't have significant real-world
impact unless the row data is particularly large. I'll admit I didn't think of
the out-of-index lookup, but my main thought was 100/1000000 vs 10/1000000
isn't a big deal unless the DB is running on an abacus.

~~~
brianberns
I had the same thought. Seems like the optimizer could still perform an index-
only scan to get to 100 rows, then go to the table to filter them down to 10
rows. Yes, the second step is extra, but should still be fast. What am I
missing?

~~~
eterm
That was my reasoning when I answered, but I had missed the fact that it was a
GROUP BY, which means you can't just filter after the fact.

Edit: In other words, it was 100 or 10 aggregated rows. An extra WHERE clause
will change the values of each of those rows rather than just filtering them
from 100 down to 10 (which is what a HAVING clause would do).
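The distinction can be sketched in a few lines (a minimal example using Python's sqlite3 with an invented table and values, not the quiz's actual schema): WHERE removes rows before aggregation, so the counts themselves change, while HAVING only drops already-computed groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (a INT, b INT, date_column TEXT);
    INSERT INTO tbl VALUES
      (1, 1, '2014-01-01'),
      (1, 2, '2014-01-01'),
      (1, 1, '2014-01-02');
""")

# WHERE filters rows *before* grouping: the counts themselves change.
with_where = conn.execute("""
    SELECT date_column, count(*) FROM tbl
    WHERE a = 1 AND b = 1
    GROUP BY date_column
""").fetchall()
# -> [('2014-01-01', 1), ('2014-01-02', 1)]

# HAVING filters whole groups *after* the counts are computed.
with_having = conn.execute("""
    SELECT date_column, count(*) FROM tbl
    WHERE a = 1
    GROUP BY date_column
    HAVING count(*) > 1
""").fetchall()
# -> [('2014-01-01', 2)]
```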

~~~
buckbova
It's much simpler than that. The first query only has to reference the index
because the data is IN the index. The second query has to access the table.
That's it.

It's called a covering index.

~~~
eterm
But if it wasn't for the GROUP BY, filtering 100 results of a million down to
10 results wouldn't change performance much even if you read every column of
every row of those 100.

The trick is that because of the GROUP BY, "it used to return 100 rows, now it
returns 10" is a red herring: the engine still has to read every underlying
row to make up those 10.

~~~
buckbova
I don't understand what you mean by "trick".

    
    
      SELECT date_column, count(*)
      FROM tbl
      WHERE a = @a
        AND b = @b
      GROUP BY date_column;
    

The "AND b = @b" causes the SQL engine to access data in the table instead of
relying solely on the index. GROUP BY has nothing to do with it. If you
changed the query to

    
    
      SELECT a, date_column
      FROM tbl
      WHERE a = @a
    

and

    
    
      SELECT a, date_column
      FROM tbl
      WHERE a = @a
        AND b = @b
    

The answer would be the same.

~~~
eterm
No, it wouldn't.

If we're told that:

SELECT a, date_column FROM tbl WHERE a = @a

Returns 100 rows.

Then:

SELECT a, date_column FROM tbl WHERE a = @a AND b = @b

Will only have to check column b across those 100 rows.

Even without an index that will always be negligible: not merely compared to
using the index to grab 100 rows out of 10 million, but compared to the cost
of running a query and returning results at all.

The reason the original can be a lot slower is that the 100 and 10 result
rows are aggregates of far more underlying rows, because of the grouping.

You're right that:

SELECT a, date_column FROM tbl WHERE a = @a AND b = @b

would be a lot slower given the same data, but that isn't the scenario: the
GROUP BY changes what "returns 100 rows" vs. "returns 10 rows" actually means
in terms of data read.

~~~
buckbova
Query 1 is an index seek only. It does not access the table data.

Query 2 will perform the same index seek but will need to do a key lookup on
each row and filter.

It's not negligible. The 100 results are not made up of a lot more
information in this case, regardless of the grouping, because the first query
never touches the table.

Edit:

I happen to have a table lying around with a little over a million rows, so I
set up a similar pair of queries.

The query plan showed the index seek taking 6% of total operation time and
the key lookup taking the other 94%. The rest was negligible.
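The same effect is visible in a query plan. A minimal sketch using Python's sqlite3 (schema and values invented; the quiz used a much larger table, but the plan shape is the same): when every referenced column is in the index, the plan reports a covering index; adding `b` to the WHERE clause forces per-row table access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a INT, b INT, date_column TEXT)")
conn.execute("CREATE INDEX tbl_idx ON tbl (a, date_column)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Query 1: a and date_column are both in the index -> index-only access.
covered = plan("""SELECT date_column, count(*) FROM tbl
                  WHERE a = 1 GROUP BY date_column""")

# Query 2: b is not in the index -> a table lookup for every index hit.
not_covered = plan("""SELECT date_column, count(*) FROM tbl
                      WHERE a = 1 AND b = 2 GROUP BY date_column""")

print(covered)      # the detail text mentions "COVERING INDEX"
print(not_covered)  # a plain index search: rows must be fetched from the table
```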

------
tom_b
I consider the critical path to SQL performance to be understanding what the
data looks like and the type of queries to be executed against it rather than
general guidelines.

To be frank, knowing the difference in how to write an inner and outer join
when given a three table schema and a desired output is a frightening filter
of candidates. Having an instinct that some type of index could help a query
probably makes you a db wizard.

~~~
tbrownaw
_knowing the difference in how to write an inner and outer join when given a
three table schema and a desired output is a frightening filter of candidates_

Even just asking for a basic understanding at the "use this to look up that"
level -- since SQL syntax is easy to learn if you know what you want, and
whoever does the recruiting has taken to mostly finding us new college
graduates -- filters out an absurd number of candidates.

I guess it's the same thing as makes FizzBuzz a useful question.

~~~
RogerL
I honestly don't grasp why this is a good question.

Sure, I write a SQL query now and again, but if I have to join I look it up. I
don't do it a lot. If I had to do it a lot for you, I would learn it inside
out.

If I told you I was a DB wizard, then sure, I better already know how to join.
But I know tons of people that don't really do SQL. They are eminently
hireable and valuable.

source: a guy (me) who was bounced out of an interview because he didn't
recall some specific detail about boost::shared_pointer().

~~~
cwyers
If you don't know how to do a JOIN, you really don't know how to use SQL, and
that's not a tricky use of JOINs at all. If you're hiring for a position that
mostly uses an ORM and raw SQL only matters when the ORM produces bad
performance, maybe not knowing how to do a three-table join is fine. If
actually writing SQL is a regular part of the job, though, the sort of thing
the grandparent is talking about is absolutely bare-minimum knowledge, maybe
even below the bare minimum.

------
bradleyjg
The fact that something like this is necessary just goes to show you how much
of a failure SQL is compared to what it was supposed to be.

Remember the promise of a "declarative language" where you just had to tell it
what you want, and it took care of the details?

~~~
jcampbell1
I reach the opposite conclusion. These are premature-optimization tweaks,
which means tons of people are using SQL successfully without knowing them.

You could argue CSS is a failure, because designers don't know that

    
    
        #sidebar .widget
    

is slower than

    
    
        .widget
    

I'd argue that it is evidence CSS is a success because it makes no real
difference, and people opt for the more readable solution.

~~~
taeric
Do you have a good link for numbers on this css point? I would expect speed
differences, but my naive view would be that they wouldn't be much.

~~~
jcampbell1
The difference is probably 0.1 µs. You may find it interesting why browsers
match CSS selectors from right to left (though this is a useless bit of
trivia). See:

[http://stackoverflow.com/questions/5797014/why-do-browsers-match-css-selectors-from-right-to-left](http://stackoverflow.com/questions/5797014/why-do-browsers-match-css-selectors-from-right-to-left)

------
nwatson
Hmmm, I'm no SQL jockey, rarely deal with it directly, and currently work with
MongoDB. I got 5-of-5 on the PostgreSQL flavor of the test but am sure I'd
fail a whiteboard interview on SQL particulars. Being a generalist I can
usually figure things out and think I have an intuitive grasp of issues
involved in many systems -- but unfortunately only 30% of workplaces
understand that intuition, adaptability, and general experience usually trump
specific knowledge.

Fortunately I've landed pretty good jobs in spite of employer myopia, and work
for a SV salary and a SV company though I live in a non-tech town in North
Carolina. Just venting a frustration.

~~~
mattfenwick
Just curious -- where did the "30%" figure come from?

~~~
herge
Probably his intuition.

------
henrikschroder
Interesting, every question in this quiz is about the concept of covering
indexes, and figuring out if the query in question is covered by the suggested
index or not.

I'm surprised people didn't score better on this, it's a very simple concept.
:-/

~~~
cwyers
The author has a website and sells a book devoted to teaching people about SQL
indexing, so that's where his focus is. I wonder what kind of a pool he got;
given that he seems to have given it mostly to people reading his site, you'd
think that it'd be people who know more about SQL indexing than the average.

Most troubling to me was how people who chose the MySQL option did so much
worse than those who chose pretty much any other database. (I took the MySQL
option, even though I work on MS SQL these days, and got 4 out of 5.) I
suppose for people who consider MySQL to be a toy RDBMS, that's less
"troubling" than "confirming current perception."

------
einhverfr
He wonders about PostgreSQL users doing poorly on question 4. This is a
fairly tricky question. Since the index is a btree index, it is pretty clear
what the right answer is, but there are ways to address the wildcard issue
with better indexes. PostgreSQL folks who are less familiar with the db may
miss the fact that a GIN or GiST index, along with the pg_trgm extension, is
needed to make that query fast.

~~~
cbsmith
Actually, since ~9.2 you don't need pg_trgm.

But yeah, I was kind of surprised that he was surprised about why PostgreSQL
folks failed: they just didn't think about whether the index was optimal.

------
tbrownaw
I have two nearly-identical tables, both shaped like (foreign_key
some_data_type, name varchar2, value varchar2). (Before you say "that means
you should use NoSQL": this is a staging environment for loading data into a
claims-handling system, which is built in a language that comes with a
relational db built in.)

They both have about 50M rows, with statistically identical data. I was
running near-identical large queries on both, with the same execution plan
(nested-loop join with index lookups), and getting vastly different timings.

This being a staging environment, these tables are re-populated by truncating
them and running an ETL tool. What turned out to be happening, is that in one
case the source query on the ETL tool was sorted by the foreign key, and in
the other it wasn't. So in one case all those fetch-by-index-lookup
operations added up to essentially a partial table scan, and in the other
they added up to what you'd expect when blocks are fetched in random order
and probably re-fetched after falling out of cache.

------
deepsun
Interestingly, if you have worked a lot with NoSQL systems, you could get 5/5
on that quiz.

Although I worked a lot with SQL in the past, I think good knowledge of just
a NoSQL system (e.g. GAE Datastore) would give you enough understanding of
how simple indexes work, and how to deal with them, to pass this pure-SQL
test.

Would like to hear feedback from someone with only NoSQL experience after
taking it.

------
troels
Maybe I'm just a bit slow, but what's going on with the 50/50 "guessing
score" reasoning? Surely you should halve the wrong answers as well, rather
than just subtracting from the right answers?!?

~~~
Robin_Message
If a proportion X (between 0 and 1) of the population knows the right answer
(call it A, with the wrong answer B), and the rest guess evenly between the
two, you would expect:

    
    
        A = X + (1 - X) / 2
    

So to calculate X, given A,

    
    
        2A = 2X + 1 - X = X + 1
        X = 2A - 1
    

The author implied the formula

    
    
        X = A - 0.5
    

which is not correct. In particular, assume everyone got the answer right
(X=A=1). Then the assertion that half of the answers are correct guesses is
absurd.

The correct lucky guess fraction (amount of the right answers to discard) is:

    
    
        A - X = A - (2A - 1) = 1 - A
    

If A is near 0.5 (which we would expect if our model is accurate and few
people know the answer), the 50% approximation the author used is about
right.
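A quick numerical check of the model, with an arbitrary example value for A:

```python
A = 0.75               # observed fraction of correct answers (example value)
X = 2 * A - 1          # fraction who actually knew the answer: 0.5
lucky = A - X          # lucky-guess fraction: 1 - A = 0.25

# Sanity check: the knowers plus half of the guessers reproduce A.
assert X + (1 - X) / 2 == A
```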

~~~
troels
Thanks - Not just me then.

------
Shivetya
I like some of the examples they provide; however, this is very
platform-dependent. Some advanced platforms have highly optimized query
engines, where even bad queries can be handled well if they are run many
times. Expressions can be reordered without the user's knowledge and the
results will be the same.

------
conradfr
I'm guilty of using multiple single indexes instead of multi-column, so I'll
try to correct that.

~~~
Hannan
There are valid uses for both approaches; it all depends on how you're
querying the data.
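One way to see the trade-off (a sketch using SQLite with a made-up table): a composite index on (a, b) serves lookups on both columns and on the leading column alone, but a lookup on b by itself degenerates into a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT)")
conn.execute("CREATE INDEX t_ab ON t (a, b)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

both   = plan("SELECT * FROM t WHERE a = 1 AND b = 2")  # SEARCH via the index
a_only = plan("SELECT * FROM t WHERE a = 1")            # SEARCH: leading column works
b_only = plan("SELECT * FROM t WHERE b = 2")            # SCAN: b is not the leading column
```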

------
bowyakka
I wonder if the higher Postgres scores come from the excellent visual explain
inside pgAdmin III?

------
ugk
Not enough info to make good choices on all, at least for MSSQL. Besides, the
correct answer to most database issues is "it depends".

------
veddev
Very interesting, thank you. Can anyone recommend any books on the matter (on
top of the one recommended in the article itself)?

------
eie
I missed #2. I wonder in which cases assigning a sort order (ASC or DESC) to
the columns of an index gives a performance advantage.

------
sergiotapia
Wow, this just highlights why I absolutely hate working with raw SQL, be it
Postgres or MSSQL.

~~~
bhaak
How do you think an ORM will help you avoid those performance issues? Unless
the ORM isn't very feature-rich, but then you've got other problems anyway.

SQL's syntax is ugly because it was designed in the 70s, when some people had
quite different ideas of what a DSL should look like (hey, COBOL, you're
guilty too!).

~~~
dreamdu5t
Ugly compared to what query language of equal power and expressiveness?

~~~
saosebastiao
Datalog.

I honestly love SQL...I consider it a noble language:) It is the only
declarative language out there that is still used today, and there are some
solid reasons for that.

However, I believe the biggest mistake made in the creation of SQL was trying
to emulate natural-language flow patterns. This makes it hard to format and
edit, and when people optimize for writability they kill readability (such as
the use of leading commas). The lack of any real standard indentation
practice makes cooperative work frustrating. And the natural-language flow
confuses the writer/reader into thinking the order of operations is linear,
as opposed to the actual FROM -> WHERE -> GROUP BY -> HAVING -> SELECT ->
ORDER BY -> LIMIT order. This makes for really annoying logical bugs, such as
how filter conditions in the WHERE clause on a left-joined table can
effectively turn your left join into an inner join.
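That last pitfall is easy to demonstrate (a sketch with SQLite and invented tables): a WHERE condition on the left-joined table's columns throws away the NULL-extended rows, while the same condition in the ON clause preserves the outer-join semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (id INTEGER, customer TEXT);
    CREATE TABLE payments (order_id INTEGER, status TEXT);
    INSERT INTO orders   VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO payments VALUES (1, 'paid');
""")

# WHERE on the left-joined table: bob's unmatched row has p.status = NULL,
# which fails the predicate, so the LEFT JOIN behaves like an INNER JOIN.
as_inner = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN payments p ON p.order_id = o.id
    WHERE p.status = 'paid'
""").fetchall()
# -> [(1,)]

# The same condition in the ON clause keeps the outer-join semantics.
still_outer = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN payments p ON p.order_id = o.id AND p.status = 'paid'
""").fetchall()
# -> [(1,), (2,)]
```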

~~~
einhverfr
The biggest mistake in SQL was in using NULLs to mean three different things
(no record found in an outer join, vs not applicable, vs not yet known).
Fortunately with some types (varchar) on sane db's you can use dedicated empty
values ('' for varchar) for not applicable, but you are still stuck with the
other two as possibly conflicting.

This is actually a big issue because your ability to have complex declarative
constraints goes up with table width, so breaking off potentially nullable
columns also removes them from being available for cross-column check
constraints.
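A small illustration of that conflation (SQLite, invented schema): a NULL that was actually stored in the table and a NULL manufactured by an outer join come back indistinguishable in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INT, nickname TEXT);
    CREATE TABLE login  (person_id INT, last_ip TEXT);
    INSERT INTO person VALUES (1, NULL);  -- stored NULL: "not applicable"
    -- no login rows: the join below manufactures a NULL for last_ip
""")

row = conn.execute("""
    SELECT p.nickname, l.last_ip
    FROM person p LEFT JOIN login l ON l.person_id = p.id
""").fetchone()

print(row)  # (None, None) -- two different "meanings", one representation
```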

------
fooyc
5/5, as a MySQL user.

I'm a bit ashamed by how badly MySQL users performed on this test.

------
uptown
Scored 4/5 ... Missed #3 - the multi-column index question.

------
nousernamesleft
38.2% of people got 4/5 or 5/5 questions right. That is amazing. I never would
have guessed anywhere even close to that. I would have figured it would be
around 15% or so given that guessing randomly would put it at 12.5%. I guess I
get a skewed perspective from looking at applicants since presumably most of
those 38% are happy with their current jobs.

~~~
club7g
I took the test twice so I'm guilty of skewing the stats. My reasoning - first
time I got 4/5 and it told me 'you know a little bit about SQL performance
tuning.' I was curious about what the 5/5 message would be. It's interesting
on many levels that the message was the same.

