Hacker News new | past | comments | ask | show | jobs | submit login

> (If you build an index on surname, firstname and dob, but execute select * from users where firstname="Bob" and surname="Jones", the index won't be used).

I don't believe that's true, at least not for all RDBMS. I just checked the docs for MySQL and PostgreSQL and they seem to agree with me.

http://dev.mysql.com/doc/refman/5.0/en/multiple-column-index... | "The name index is an index over the last_name and first_name columns. The index can be used for queries that specify values in a known range for last_name, or for both last_name and first_name."

http://www.postgresql.org/docs/8.2/static/indexes-multicolum... | "A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns"

If we want to 100% correct, indexes don't just hurt writes, and they don't hurt all writes either. Two examples (there are more):

- If I do an UPDATE query, and I don't modify indexed columns, the index doesn't need to be modified.

- If I have indexes on (last_name, first_name) and (last_name) the (last_name) index is redundant. Some queries may use the last_name index when I don't specify the first_name - but they could have used the other index. Now I have bloated what contents need to be memory resident (affecting reads as well).

I'd say that suggesting there's a "penalty" at all implies a naive hardware setup[1].

Total system I/O is a finite resource, but this can be hard to reach if indexes are on separate spindles[2] from the data and each other. This can cost substantially more initially, but disappears at scale.

Perhaps, then, as for the OP's original topic, one mistake is not hiring someone who knows these performance strategies. Another is using an RDBMS that doesn't support them.

[1] which, granted, is far and away the norm

[2] or sets of spindles, or SSDs

It's not totally redundant. The index on last_name alone will be more efficient than (last_name, first_name) because the index will be able to contain more keys per index page. The difference can be significant if you're doing something that does an index scan on this column.

This is called a "covering" index, e.g. the index on lastname+firstname also "covers" queries on with a predicate on just the last name. Most relational databases will do this, but it's always good to be sure in your particular case.

It's a composite index. A covering index would be this:

ALTER TABLE staff ADD INDEX idx (last_name, first_name); SELECT first_name FROM staff WHERE last_name = 'n';

A covering index is something very special - all results are returned from the index. Not all databases do this - Postgres does not, because it does not want to maintain all the versions (see MVCC in wikipedia) in the index.

Indeed you are correct.

[Edit: removed defensive response. I've updated my original comment to remove its absoluteness]

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact