This might sound dumb, but I always thought database tables used some sort of hash-table-like implementation with O(1) retrieval times on indexed columns. Don't b-tree indexes imply that we're looking at O(log(n)) retrieval times instead?
If so, why would database implementations make this tradeoff?
Two advantages I can think of: a b-tree offers very quick sorted retrieval (because the data is already kept in order), and it is probably more efficient to grow a b-tree in place on disk without having to move too much data around too often.
While these two things matter, O(log(n)) vs. O(1) read performance is a big difference. And if I'm not mistaken, most databases ought to be read-optimized to begin with.
Can someone enlighten me on the implementation considerations here?
Another point is that with a b-tree you have your rows in sorted order, which means you can do "select * from blah where id > 100 and id < 200" and you're still using the index. If you were using a hash table, you'd be doing 100 random disk seeks instead of a few sequential ones.
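Roughly what I mean, as a PostgreSQL-flavoured sketch (table and column names are hypothetical; the default index type there is a b-tree):

    -- a plain (b-tree) index keeps ids in sorted order
    create index blah_id_idx on blah (id);

    -- the planner can seek to id = 100 and read the leaf pages forward,
    -- so the whole range comes from a handful of mostly-sequential reads
    explain select * from blah where id > 100 and id < 200;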
For starters, algorithms with O(1) performance can have large constant factors, and resizing the hash now and then can make the occasional operation disproportionately expensive.
Mainly, though, each b+-tree node is the same size as a disk I/O block. The disk is often the slowest part of the database, so making as few reads as possible will go a long way. If you can't work entirely in memory, that's the next best thing.
> Mainly, though, each b+-tree node is the same size as a disk I/O block. The disk is often the slowest part of the database, so making as few reads as possible will go a long way. If you can't work entirely in memory, that's the next best thing.
That makes no sense. An external hash index is also designed to do as few I/Os per read as possible (typically just a single I/O to read the page that holds the target bucket). Just like b-trees are designed differently than in-memory trees, external hash indexes are designed differently than in-memory hash tables.
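To be concrete, PostgreSQL (for one) will happily build a disk-based hash index; names here are made up:

    -- equality lookups hash the key and read (roughly) one bucket page
    create index users_email_hash on users using hash (email);

    select * from users where email = 'someone@example.com';
    -- usable for =, but not for range predicates or ORDER BY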
Hash functions are prone to collisions depending on the data you are storing: if the data is similar, keys can collide, and collisions break the O(1) retrieval guarantee. In other words, retrieval time depends on the content stored, and O(1) cannot be guaranteed every time. One reason to go with a tree-based index is a guaranteed O(log n) lookup regardless of the kind of data that is stored.
Any database engine worth its salt will use some kind of balanced tree structure (canonically, a B-tree or variant thereof, but since I'm on a project that isn't using a B-tree, I'll be a little less restrictive). This is primarily, as said above, because trees speed up range queries, not just point queries (hash tables only improve point query speed), but also because you don't need to rehash everything whenever your data set grows beyond some size. Balanced trees have the nice property that, though their point queries are O(log n), everything else is O(log n) as well (updates, inserts, range queries, deletes), while hash tables suffer the penalty of being amortized (fast for almost everyone means slow for some). Besides, log is a very powerful function: even if your database is extremely large (a million elements, say), the log factor is only going to give you about twenty seeks.
Another problem with hash tables is that if your keys are large (varchars), you either have to hash the entire string or deal with more collisions. A reasonable balanced tree index on varchars just needs to call strncmp(3), which can terminate much sooner than a hash function. Of course, it's unlikely that your index, if it's large enough for you to care, is going to be CPU-bound, but it's important to remember that the constant hidden inside the hash table's O(1) (amortized) lookup is probably going to be pretty large in this case.
A hash table isn't much use for queries that return multiple rows, or use aggregate functions. (SELECT avg(price) FROM sales WHERE date_trunc('month', timestamp) = '2009-06-01').
Not only can a b-tree locate the edges of such a range, but if you're accessing k rows from an n-row table, your index access time is O(k + log(n)) with k quickly becoming the dominant factor.
Even when you're accessing one row per query, these queries will often favour a particular cluster of rows, rather than being perfectly evenly distributed over the table. In such cases a b-tree will provide better locality of reference; you'll hit cached data in memory rather than going to disk far more often.
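For the aggregate example above, the usual trick is to state the month as a range so a plain b-tree on the timestamp column can drive it; a sketch, keeping the column names from that query:

    create index sales_ts_idx on sales (timestamp);

    -- half-open range: the b-tree seeks to the start of June and scans
    -- forward until July, instead of evaluating date_trunc on every row
    select avg(price)
    from sales
    where timestamp >= '2009-06-01' and timestamp < '2009-07-01';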
> A hash table isn't much use for queries that return multiple rows, or use aggregate functions.
Neither of those claims is true. A hash index is perfectly useful for queries that return multiple rows: the index just stores all the duplicate values for the search predicate in the matching hash bucket. And hash indexes can certainly be used with aggregate functions: you first use the hash index to evaluate the query predicate (WHERE clause), and then you evaluate the aggregation or grouping clause over the results of the hash index.
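For example, with an equality predicate (PostgreSQL-flavoured sketch, column name made up):

    create index sales_store_hash on sales using hash (store_id);

    -- the hash index finds the matching rows; the aggregate is then
    -- computed over whatever the index probe returns
    select avg(price) from sales where store_id = 42;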
Sorry, I should have been clearer here. I was referring to queries with range predicates.
Your counterexamples are true where all the returned/aggregated rows have the same indexed value; it may make sense to use a hash index under such circumstances, but b-trees are a more sensible default.
I guess you could also have a situation where a range predicate maps nicely to a proportionate number of hash buckets. I don't know if any databases are clever enough to do this.
If you have a table that you basically always query by primary key, then Oracle offers a 'Hash Cluster table', which uses a hash instead of an index to locate the data. It generally makes PK lookups very fast, but it has downsides.
1. You need to know roughly how many rows the table will have, and the row size at table creation time.
2. Oracle will allocate all that space at creation time, and so a full table scan will take a long time.
So long as you know the deal with them, hash cluster tables can prove very efficient for the correct problem.
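For the curious, setting one up looks roughly like this (names and sizes are purely illustrative):

    -- HASHKEYS ~ expected number of distinct keys,
    -- SIZE ~ bytes reserved per hash key for its rows
    create cluster orders_cluster (order_id number)
        size 256
        single table
        hashkeys 1000000;

    create table orders (
        order_id    number,
        customer_id number,
        total       number
    ) cluster orders_cluster (order_id);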
> 10% selectivity is the minimum selectivity necessary for a b-tree index to be helpful.
That's probably on the high side. Suppose that records are stored in 8KB pages (which is probably too small for a modern system, but never mind), the selectivity of the scan is 10%, the records that satisfy the scan are randomly distributed over the pages, each page holds, say, 40 records (which is conservative, even for small 8KB pages), and we're considering scanning a secondary index[1].
That means that a typical page in the heap has 4 matching records, which means we are very likely to have to read every page in the heap anyway. Worse still, the order in which the index scan returns results is unlikely to correlate with the physical storage order of the heap pages, which means that the 4 I/Os per page are going to be random, not sequential. On top of that, in a scan of a secondary index, you need to read both the index page and the pointed-to heap page, which is an additional I/O (compared to a sequential scan of the heap). So it's easy to construct a realistic scenario in which you wouldn't want to use a secondary index when the scan selectivity is "only" 10%.
[1] And that an index-only plan can't be used: i.e. we need to fetch the pointed-to heap page.
1. Create a multi-column index that satisfies the query entirely on its own, without the need to go to the table to read additional data (see the sketch after this list). This technique solves a large percentage of slow query problems.
2. A bitmap index in Oracle kills concurrency. Do not use bitmap indexes for tables with multiple concurrent transactions.
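A minimal sketch of point 1, with hypothetical table and columns:

    -- the index contains every column the query touches, so the lookup
    -- never has to visit the table itself
    create index orders_cust_status_total on orders (customer_id, status, total);

    select status, total
    from orders
    where customer_id = 42;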
Also consider that if you have a 100M row table and the column has only 4 values, but those values are highly skewed (i.e. 3 values account for 99% of the rows and the 4th for 1%), then a function-based index on the skewed column can be very useful:
    -- my_table / skewed_col are placeholder names
    create index fbi on my_table (
        case
            when skewed_col = 'small value' then skewed_col
            else null
        end
    );
The null values are never indexed, so now you have a nice small index that is good only for finding your 'needles in the haystack', so long as you put that same case expression in your queries.
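i.e. something like this in the query, so the expression matches the index definition exactly (same placeholder names as the index above):

    select *
    from my_table
    where case when skewed_col = 'small value' then skewed_col else null end
          = 'small value';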