How database indexing works internally

eatonphil · on Aug 20, 2021

Awesome guide! If you want to see a very minimal walkthrough of actually implementing this in code, the third post of my "write a SQL database in Go" series does just this [0].

Once you have the index set up and are able to use it the next challenging part is pattern matching (plus algebraic rewrites) to figure out if the index is actually applicable.

For example, in my linked blog post, an index is only used if the WHERE expression has only AND-ed operations and one of those expressions is a filter using <, >, =, <> (and a few other operators) and one of the columns is indexed. This is simpler than what a real database might attempt.

Another interesting part of seeing it in code (and messing with the code) is observing cases like where even though you're using an index it doesn't speed anything up if you filter still matches every item in the index.

Also if you just want to learn more about indexes as a user, I highly recommend the site "Use the index, Luke" that has many useful guides/explanations.

[0] https://notes.eatonphil.com/database-basics-indexes.html

[1] https://use-the-index-luke.com/

masklinn · on Aug 20, 2021

The article claims to talk about database indexing but as soon as it goes into any sort of details the information only applies to mysql.

> So here, indexing comes into picture. What indexing does is, sets up the column your search conditions are on, in a sorted order to assist in optimising query performance.

That's mysql specific, and not just that but it's innodb specific: innodb uses the primary key as a clustering index, or the first UNIQUE index if there's no PK.

Other engines don't usually do that, some (e.g. postgres) don't even have clustered indexes (aka indexes which affect the table's own layout), in postgres clustering is an explicit discrete table-rewriting operation.

And of course a table can have multiple indices, so it's not possible for an index to always determine the table order. In databases which support clustered indices, only one index per table can be clustered.

> The major advantage of B-tree is that the data in it is sortable

That's… not really true. A b-tree is sorted, period. You can't have an unordered btree.

> What are other types of indexes?

There are tons of index types not mentioned by the article e.g. BRIN, GIN, GiST

> For example, suppose you want to find out all of the companies which has company_id less than 40. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality (like “WHERE company_id = 18”).

And because btrees naturally handle "range" queries, they also handle prefix queries well. After all, "foo*" is just looking for every value between "foo" included and "fop" excluded.

cogman10 · on Aug 20, 2021

> innodb uses the primary key as a clustering index, or the first UNIQUE index if there's no PK.

In MSSQL, the primary key or the even a "Clustered index" is ultimately the ordering of a table. If a neither are provided then MSSQL uses an invisible row id column. If the clustered index is not made with the "UNIQUE" constraint, MSSQL will also create the row id column.

That is to say, yes, this article is very MySQL oriented.

tomnipotent · on Aug 20, 2021

> is ultimately the ordering of a table

Clustered index leaf nodes contain the actual row data of the table itself, which is why there can only be one.

da_chicken · on Aug 20, 2021

> > So here, indexing comes into picture. What indexing does is, sets up the column your search conditions are on, in a sorted order to assist in optimising query performance.

> That's mysql specific, and not just that but it's innodb specific: innodb uses the primary key as a clustering index, or the first UNIQUE index if there's no PK.

> Other engines don't usually do that, some (e.g. postgres) don't even have clustered indexes (aka indexes which affect the table's own layout), in postgres clustering is an explicit discrete table-rewriting operation.

It kind of depends on exactly what they mean. The physical index itself is essentially always going to be stored in a "sorted" order, clustered or non-clustered. It's [by default almost always] a B-tree of some flavor, but every index is going to be stored in a way that makes searching it faster, and in that sense it can always be described as "sorted" even if the table itself has no clustering index. It's in a well-defined order that aids in identifying the records in the table itself. Even if the query engine identifies that the 100 records that satisfy the WHERE clause are in 100 different, discontinuous pages, an index can help with that even if the I/O won't be optimal.

> > The major advantage of B-tree is that the data in it is sortable

> That's… not really true. A b-tree is sorted, period. You can't have an unordered btree.

Oh, I've just noticed you know this already. What's your confusion then?

masklinn · on Aug 20, 2021

> It kind of depends on exactly what they mean. The physical index itself is essentially always going to be stored in a "sorted" order, clustered or non-clustered.

They're not "meaning" anything, they're literally saying that adding an index sorts the table by the indexed column, with a little picture showing this change to the table itself.

> Oh, I've just noticed you know this already. What's your confusion then?

I'm not confused, I'm pointing out that the article is factually incorrect. Not just "this only applies to one database" but literally saying something which is not true, or at best hopelessly confused.

da_chicken · on Aug 20, 2021

> They're not "meaning" anything, they're literally saying that adding an index sorts the table by the indexed column, with a little picture showing this change to the table itself.

No, I don't think so. That's one interpretation of what they said. Since that's clearly not correct, why should we assume that's what it means? That's dishonest reading.

If I'm explaining to a beginner how an index works -- which is literally what this article is doing -- then I don't see why I wouldn't refer to an index as a specially sorted version of the records in the table. That's not a huge difference from what the author actually wrote. Yeah, you might prefer that the author had said "organized" rather than "sorted", but those two words are synonyms in plain English.

Look, here's the statement in context:

> What indexing does is, sets up the column your search conditions are on, in a sorted order to assist in optimising query performance.

> With index on company_id, table would look like this <image>

> Now, due to sorted order, the database knows which records to examine and when to stop.

> The main point of having a index is to cut down the number of records/rows in a table which needs to be examined by the database, to speed up the searching queries.

If you look at an index physically, it does literally store the values being indexed. That's why indexes take up disk space and require I/O. They're actually duplicating information. It does actually take the column values from the table and "sort" or "organize" the columns covered by the index. They're organized into a B-tree [usually], but the values from the table for the indexed columns are there and they are organized in a way you can describe as "sorted." The big difference is that the pages of the index, rather than having leaf records like a table, instead reference the records directly (often using an identifier tied to the clustered unique index).

srcreigh · on Aug 20, 2021

An important detail which this guide leaves out: The "actual location of the row" is usually the leaf node of another B-Tree. That is the primary B-Tree and it's indexed by the primary key.

The major consequence is every non-primary index query involves dereferencing N pointers, which could mean loading N pages from disk/SSD to get leaf nodes spread out across the primary tree. Whereas if you query a range in the primary B-Tree directly, the rows are consecutive so you would only load N/M consecutive pages based on the ratio of rowsize to pagesize.

That's why some people use composite keys for primary key, to get better data locality in the primary B-Tree index.

See "How We've Scaled Dropbox"[1] to hear the same thing.

At 48:36 they explain why they changed from PRIMARY KEY (id) to PRIMARY KEY (ns_id, latest, id). ns_id is kinda like user_id. So that change groups journal entries for users together on disk

Specifically. PRIMARY KEY (id) orders things by creation date whereas PRIMARY KEY (ns_id, latest, id) orders things by ns_id primarily.

[1]: https://youtu.be/PE4gwstWhmc?t=2914

iaabtpbtpnn · on Aug 20, 2021

This is true of MySQL (which the guide uses), but not necessarily of other databases such as Postgres.

srcreigh · on Aug 20, 2021

You're right. Postgres doesn't give any control over primary data locality. That might cause querying 1 row a bit faster in Postgres (no log(N) traversal of the primary B-tree index) but picking out N rows could be a lot slower.

https://www.postgresql.org/docs/13/indexes-index-only-scans....

thinkharderdev · on Aug 20, 2021

I don't understand why ``` SELECT total(amount) FROM orders WHERE createdAt BETWEEN '2020-01-01 00:00:00' AND '2020-12-31 23:59:59'; ```

would do a full table scan. Wouldn't the engine be able to use the index to find just the correct rows and then only do the total on those?

tomnipotent · on Aug 20, 2021

Statistics.

The query planner has decided that it's going to have to visit a lot of pages regardless, so rather than reading the index AND most of the table it can get the job done with less by just scanning the table.

WJW · on Aug 20, 2021

To elaborate on this with a concrete example: if ALL rows were created between 2020-01-01 00:00:00 and 2020-12-31 23:59:59, going to the index is pure overhead.

da_chicken · on Aug 20, 2021

How do you know that's when the rows were created? Maybe it means when the orders were created, but the records are from a foreign system and they were inserted in a completely different order.

More relevantly, how does the query engine know that the `createdAt` field has anything to do with the order of records stored on disk?

tomnipotent · on Aug 20, 2021

> how does the query engine know

It doesn't, really.

When a query planner is deciding the best path for answering a query, it will generate many different plans with different costs. Estimations are sometimes wrong, so planners may try several different plans for the same query over time and eventually settle on the one that actually did the best.

Databases are also constantly collecting statistics on the values in each column, such as min/max, distribution, cardinality, rows per page/block, values at different percentiles. It's possible for these statistics to change over time, either suddenly or slowly, and this can mean the cached plan is sub-optimal. Or maybe you're adding/removing indexes. Plans become stale, and new plans will need to be tested.

So let's say you create an index on `createdAt`. When you ask for rows `WHERE createdAt >= '2020-01-01' AND createdAt < '2020-04-01'`, it's going to generate plans considering that index and compare it against a plan without that index.

Because the index is sorted and we have a range query (min/max), the planner knows it can very quickly find the inner b-tree nodes that contain all rows that match the query and how many pages may need to be read from disk and how many rows that will include. It will then know exactly what data pages we need to scan to find the rows we're interested in.

Without that index, it has no idea about the distribution of values of `createdAt`. It's very possible that 99% of all rows are sorted, but for whatever reason 1 row is completely out of place (should be row #100, but is row #10,000). Every row will need to be scanned to be sure.

Even with the index, the database statistics include min/max values. Let's say we change our query to `WHERE createdAt >= '1900-01-01' AND createdAt < '2100-01-01'`, and it includes EVERY row in the table. The query planner will be able to figure this out, and will generate a less costly plan that just does a full table scan instead and skips the index.

tomnipotent · on Aug 20, 2021

Or even if it's most of the rows. The optimizer will determine a cost which includes data distribution, and the cumulative cost of an index scan + table seeks could still exceed just the cost of a table scan.

da_chicken · on Aug 20, 2021

It depends on a lot of factors. It's almost always down to cardinality and organization.

How many records are in the table, and what proportion of records are estimated to be necessary to read? If it's more than a given threshold, then the query engine may decide to just read everything than waste time sorting out which pages it needs and which it doesn't. I/O is slow, but CPU is not free.

Is the table clustered on the `createdAt` field? If not, the system will not assume that the table records are going to be stored in any particular order that benefits the query execution. After all, the field may not represent when the record was created in this particular system. It will then estimate how many pages it thinks it will need to read. Again, at a certain threshold it will decide to just read every page in the table. Sometimes it's more expensive to figure out what to read and still end up reading 50% of the table instead of just reading everything and scanning through it quickly.

Remember, databases almost always read from disk in pages that contain multiple records. They can't read just one record. They read the whole page and then read the records out of that. That's the smallest unit of I/O. If records are evenly distributed, then the system will need to read almost every page anyways. That's why clustering indexes help so much, but you can only cluster on one set of fields.

As an aside for datetime thresholds it's a better idea to specify `WHERE createdAt >= '2020-01-01 00:00:00' AND createdAt < '2021-01-01 00:00:00'`. Different systems have different time precision. You don't want to miss records because of fractional seconds. Yes, it probably won't matter, but you can also just write it this way and not have a hole in your logic. BETWEEN is just syntactic sugar, too.

CapriciousCptl · on Aug 20, 2021

I was curious so I tried it in postgres 13. Postgres, at least, uses the index to form a bitmap and scans that when aggregating in the first case (10% rows in the bitmap) and not in a second case WHERE "createdAt" BETWEEN '1990-01-01 00:00:00' AND '2020-12-31 23:59:59'; (100% rows in the bitmap, obviating the need for the intermediate step). I also tried ~20% rows (2019-2020) and the planner skipped the index. '''

CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY, "createdAt" TIMESTAMPTZ); CREATE INDEX ON temp ("createdAt");

INSERT INTO temp(id, "createdAt", amount) SELECT generate_series(1,1000000) AS id, NOW() + (random() * (interval '10 years')) - interval '10 years' AS createdAt, random() * 100::money AS amount.

EXPLAIN SELECT sum(amount) FROM temp WHERE "createdAt" BETWEEN '2020-01-01 00:00:00' AND '2020-12-31 23:59:59';

Aggregate (cost=10286.06..10286.07 rows=1 width=8) -> Bitmap Heap Scan on temp (cost=2148.00..10033.48 rows=101032 width=8) Recheck Cond: (("createdAt" >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone)) -> Bitmap Index Scan on "temp_createdAt_idx" (cost=0.00..2122.75 rows=101032 width=0) Index Cond: (("createdAt" >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone))

And when running a longer query: Finalize Aggregate (cost=14596.71..14596.72 rows=1 width=8) -> Gather (cost=14596.49..14596.70 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=13596.49..13596.50 rows=1 width=8) -> Parallel Seq Scan on temp (cost=0.00..12620.00 rows=390597 width=8) Filter: (("createdAt" >= '1990-01-01 00:00:00-05'::timestamp with time zone) AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone))

tomnipotent · on Aug 20, 2021

Buffer pool plays a big part. Very possible all the data is already in-memory, and for certain data sizes it'll be faster to just follow leaf nodes start-to-finish than it is to determine what pages you can skip.

Postgres buffer pool is a ring, and relies on "clock sweep" to decide what pages it can evict on each iteration. It has a shared buffer, and per-query buffers to eliminate shared buffer evictions (for costly queries). When doing index scans, worst-case the same page is being accessed in random order multiple times and it's evicted between those accesses so we end up with redundant disk I/O.

Bitmap scans ensure each page is only scanned once and in-order, so it's a great solution when you need more than an index scan but less than a full table scan (worth of data), not to mention multiple indexes can be combined into one bitmap scan.

If every page is already in memory, the query planner may pick plans that look sub-optimal if you factor in disk I/O but are otherwise very efficient in-memory.

masklinn · on Aug 20, 2021

FWIW you can indent with 4 spaces for a preformatted block e.g.

    CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY, "createdAt" TIMESTAMPTZ); CREATE INDEX ON temp ("createdAt");
    INSERT INTO temp(id, "createdAt", amount) SELECT generate_series(1,1000000) AS id, NOW() + (random() * (interval '10 years')) - interval '10 years' AS createdAt, random() * 100::money AS amount;

masklinn · on Aug 20, 2021

> Wouldn't the engine be able to use the index to find just the correct rows and then only do the total on those?

Yeah, the explanation doesn't really make sense in general, though computing anything might be sufficient to throw off mysql's query optimiser.

dboreham · on Aug 20, 2021

Indexing isn't magic. You can define an ordered index on the createdAt field and your query could use that index if it feels by its notion of the best thing to do that it'd be worthwhile. You can do things to persuade it what to do. It's up to you.

masklinn · on Aug 20, 2021

The problem is that the article's explanation makes absolutely no sense: it claims that a full scan is performed because of the computation (function call) in the `select` clause. And worse, that indexing the column used in the function call fixes it:

> But, It will still do a full table scan because we are using "amount" column to calculate the total. So, we need to put an index on "amount" column too.

Seems to me like TFA tried it, saw that the index was not used, and invented a justification for it. And after adding the second index the query planner had collected the relevant information and started using the index, or something.

walski · on Aug 20, 2021

Great video from the SQLite founder that covers a lot of ground in terms of how indices work under the hood: https://m.youtube.com/watch?v=gpxnbly9bz4