If you want to see a dumb example of how you can implement indexing on top of a SQL database that doesn't have it built in, I wrote a tutorial on implementing basic indexes [0] as part of a series on making a SQL database from scratch.
And of course, Use the Index, Luke [1] is a great reference for the real world.
[1] above is a fantastic resource. I highly recommend Markus Winand's book "SQL Performance Explained" if you're in need of better-performing SQL queries. It's quite short, but it covers all the bases and uses introductory language.
I loved your series on writing a SQL DB from scratch. I also enjoyed your recent SMTP post. I hope you continue to do these, they're always good reading.
Now picture someone updating the file at the same time you are searching, and someone updating the index while you are searching, and return to step 1.
My team has really enjoyed the concise and informative book [0] from Markus Winand. It comes up sometimes here on HN, and IMHO it's a great resource on the topic for anyone who touches a relational database in their line of work.
His website "Use the Index, Luke!" [1] is also great.
Also remember to create tables using efficient structure packing and reordering[1], taking into consideration how the storage serializer operates (see [2] for PostgreSQL). An example provided by GitLab[3] demonstrates structure packing and reordering principles in PostgreSQL. Depending on the type of workload, it may also be beneficial to limit the number of columns in a table to optimise for locality of reference[4] once the data is deserialized into memory.
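To make the packing point concrete, here's a minimal sketch in PostgreSQL DDL (table and column names are hypothetical): ordering columns from the widest alignment requirement to the narrowest avoids per-row padding bytes.

    -- Poorly ordered: 8-byte columns force alignment padding after narrow ones.
    CREATE TABLE events_padded (
        active boolean,  -- 1 byte + 7 bytes padding (id must start on an 8-byte boundary)
        id     bigint,   -- 8 bytes
        flag   boolean,  -- 1 byte + 3 bytes padding (count must start on a 4-byte boundary)
        count  integer,  -- 4 bytes
        ts     bigint    -- 8 bytes; row data: 32 bytes
    );

    -- Reordered widest-to-narrowest: same columns, less padding.
    CREATE TABLE events_packed (
        id     bigint,
        ts     bigint,
        count  integer,
        active boolean,
        flag   boolean   -- row data: 22 bytes before final row alignment
    );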
>> How it works: Edgesearch builds a reverse index by mapping terms to a compressed bit set (using Roaring Bitmaps) of IDs of documents containing the term, and creates a custom worker script and data to upload to Cloudflare Workers
I had assumed btree in Postgres was referring to a binary tree, but that is definitely not the case. I also learned when to use a hash index rather than a btree. (If I understand correctly: if you're only using the = operator, rather than something that needs sorting with >, <, <=, etc., then a hash will perform better.)
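A quick sketch of that rule of thumb (hypothetical table and index names):

    -- The default B-tree handles =, <, <=, >, >=, ORDER BY, and range scans:
    CREATE INDEX sessions_created_idx ON sessions (created_at);

    -- A hash index handles only =, but the equality probe can be cheaper:
    CREATE INDEX sessions_token_idx ON sessions USING hash (token);

    SELECT * FROM sessions WHERE token = 'abc';  -- can use the hash index
    SELECT * FROM sessions
     WHERE created_at > now() - interval '1 day';  -- needs the btree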
A btree is a kind of binary tree, one which is self-balancing to keep O(log n) lookups from turning into O(n) lookups which is possible without the self-balancing. (Think about a naïve binary tree where you insert elements in sorted order—then everything will be in the right (or left if you start with the maximum value) branch and you need to traverse the whole structure to find the node you're looking for.)
For sorting or adjacency queries, you want a btree index. A hash index can give you expected O(1) lookup, with the drawback that you lose ordering completely (whereas a tree hands you records in sorted order essentially for free, via an in-order traversal). The trick here is having a hashing function which is both fast and avoids collisions (the worst case scenario is hash(x)=1, where everything hashes to the same value and you end up again with O(n) lookups).
It's not, because in a standard B-tree all the leaf nodes are at exactly the same depth, whereas in a binary tree they are not in general, unless the binary tree is balanced with exactly 2^N leaves.
This difference occurs because a B-tree can have interior nodes that aren't completely full, whereas a binary tree's interior nodes have exactly 2 children.
Binary implies two. Both B-trees and BSTs exist in the universe of m-ary trees. A BST and a B-tree would be equivalent only if the branching factor of the B-tree was set to 2, but practically this is rarely the case with indexes, given that a higher branching factor is generally more favorable to lookup times (“block accesses”) since the height of the B-tree is reduced as m increases.
> practically this is rarely the case with indexes
Seems like it would be extremely odd to have a 2-2 btree, you’d just get a significantly more complicated BST no?
I’d figure you’d want to fill a cache line with something typical-ish, so probably at least 4 (this way, if you have 4 children and 3 keys, the keys are 8 bytes, and the child links are straight pointers, your node is 56 bytes and you can add some metadata, e.g. a bitmap).
Apparently Rust’s BTreeMap is a 6-11 btree but I don’t know how they picked the branching factor.
Unless your own benchmarks prove hash indexes to have major benefits for your queries, stick with the default B-tree index: you never know what queries you'll need in the future.
That's a bit too YAGNI for me. You have a `users` table with a `tenant_id uuid not null` column. You aren't going to need uniqueness on it and you aren't going to need range operations on it.
Less important now, but they're unsafe to use prior to PG 10.
As of 14.1, they still don't support unique constraints:
    select amname, pg_indexam_has_property(oid, 'can_unique') from pg_am;

     amname | pg_indexam_has_property
    --------+-------------------------
     heap   | ¤
     btree  | t
     hash   | f
     gist   | f
     gin    | f
     spgist | f
     brin   | f
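For example, this is what you'd see if you tried (a sketch; the exact error wording may vary by version):

    CREATE UNIQUE INDEX ON t USING hash (x);
    -- ERROR:  access method "hash" does not support unique indexes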
Indeed, they perform much better, but caveat emptor: hash index operations were not WAL-logged prior to PostgreSQL 10. On those older versions that means: 1) if you crash during a hash index update, all bets are off and you'll have to rebuild, and 2) they will just be wrong in a replicated configuration.
Theoretically a hash lookup is O(1), so it should perform better if the key is unique, but you give up a lot of other features (like partial matches, sorting, and range queries).
Binary trees would be ideal in a pure von Neumann model, where every memory access costs the same. In practice, data structures perform better when related data is grouped together, and that is what the B-tree gets you: it's both more disk- and cache-efficient in real-world workloads.
This. I was hopeful that the linked article would explain how B-trees are practically used for indexes in DB engines. I already know what a DB index is (databases 101 software eng) and why it's used, but would love a clear explanation of how they're implemented.
It doesn't actually match low-level implementations, for a number of reasons, such as the need to fix the page size a block fits in (rather than the number of entries per block), and the locking needed for concurrent access. But the data structure itself still looks the same.
Can someone expand on this statement (from the top answer, last paragraph):
"Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided."
Specifically "indexing fields used only for output", what does this mean? I interpreted it as "indexes used only for `select` statements", but that would seem completely counter to the main reason you'd want an index, e.g. to speed up record retrieval.
And suppose you have an index on x. You can locate the x=1 INDEX records using the index quickly (probably no more than 1-3 disk accesses). But then the qualifying TABLE records have to be retrieved. That index lookup could turn up thousands of qualifying records, and now you have to retrieve each record to get the a, b values.
Now suppose that you replace your index on (x) with an index keyed by (x, a, b). You can still search for x=1 using this index, but now the (a, b) values that you are SELECTing are present in the index itself. You don't have to get the table records to get those values. That can save a lot of random accesses, and there are secondary effects, since the pages containing those records don't need to be brought into the disk cache.
Yes, this wider index has space and update costs, but the benefit is often worth it.
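A sketch of the two variants, assuming a hypothetical table t(x, a, b):

    -- Narrow index: finds the x=1 entries fast, but a and b still have
    -- to be fetched from the table pages, one random access at a time.
    CREATE INDEX t_x_idx ON t (x);

    -- Wider index: a and b are stored alongside x, so
    --   SELECT a, b FROM t WHERE x = 1
    -- can be answered from the index alone.
    CREATE INDEX t_x_a_b_idx ON t (x, a, b);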
Right, Postgres supports this use case explicitly with INCLUDE [1], which can be more efficient than just indexing all the columns you need. Pretty sure SQL Server has something similar.
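The INCLUDE form looks like this (PostgreSQL 11+; same hypothetical table). The included columns are stored only in the leaf pages and aren't part of the key, so they don't fatten the upper levels of the tree, and the index can still enforce uniqueness on x alone:

    CREATE UNIQUE INDEX t_x_incl_idx ON t (x) INCLUDE (a, b);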
Yes, Postgres has been playing various tricks in the last few years to make this optimization actually optimize things, but the Postgres MVCC approach makes it difficult. The problem is that index information can be stale, and you often have to go to the table page itself to determine the current state of a row. They've made progress in lessening that "often" to "sometimes", but unless version information is added to indexes, they won't be able to fix this completely.
I was teaching a database course a couple of years ago, using Postgres, and in exercises on query optimization, I found it surprisingly difficult to get columns added to indexes to produce a convincing improvement.
That part is wrong if your database can perform index-only queries. In that case if all columns you query are part of the index the query can be answered from the index alone, so you don't have to read any parts of the actual table.
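You can check whether that's happening with EXPLAIN (same hypothetical table; the plan node names are Postgres's real ones, but the output here is abbreviated):

    VACUUM t;  -- refreshes the visibility map, which index-only scans rely on
    EXPLAIN SELECT a, b FROM t WHERE x = 1;
    -- With a covering index in place you should see "Index Only Scan",
    -- rather than an "Index Scan" followed by heap fetches.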
So how does a binary search work when the disk blocks are essentially a linked list, as the top answer initially says? They just assume that it's possible to jump through a contiguous array of blocks for their later analysis, but initially state blocks aren't necessarily contiguous.
Honest answer: If the blocks aren't* contiguous, there's an index for that!
(* It depends on the database type. In an LSM-tree layer, the blocks are typically contiguous in a file, so direct binary search is possible, though it might not be the most efficient method. The filesystem handles locating blocks in this case.)
The database table is made of disk blocks containing data, where the blocks are logically in key order, as if in a contiguous array. In reality they are laid out differently on disk.
So there's a block index, which is just a smaller version of the same table data structure: a table mapping keys to values, implemented as blocks in logical order of keys, as if in a contiguous array. Except in this index, the map is from key-ranges of the first table to block locations on disk. This index lets you go from keys to block locations on disk, and it also lets you do a coarse-grained version of the binary search to narrow down to a single block of the larger table.
Ok, but then how do you look things up in the block index if it's using the same kind of data structure, made of multiple blocks? Same again: Each block index has its own smaller block index.
This neat recursive definition gives you a tower of progressively smaller block indexes until the size is just one block, which doesn't need an index.
Guess what you get when each table of logically sorted, contiguous blocks has a smaller index to say where the blocks are really located?
The data structure is called a B-tree (technically a B+tree), and the smallest index is the root block.
The tree structure arises from the recursive description, where each table has another table (until it stops).
This is a decidedly non-standard way of describing the B-tree data structure. There's no explicit tree. But it's a valid and useful alternative view. (Note, these block indexes are not what is generally meant by database indexes, and they are not visible at the SQL level. They are an implementation detail.)
One of the useful things to emerge from this view is that each level is just logically a flat table, made of blocks that are logically in key order but have arbitrary physical location. It's not necessary for every index to use the same data structure, or be on the same storage. Depending on how you think about algorithms, this description might be simpler to work with.
Their point is that in order to get log(n) access speed you need a sorted random access list, not that if you sort the items in a linked block list you can use binary search on it. They go on to explain how a database index constructs such a thing over the linked block list.
I've seen it happen a bunch. Since StackOverflow is a Q&A format, if someone knows that many people will ask a question they know the answer to, they will write both. Preemptive questioning perhaps!
He also wrote the other question that he links to, How to index a database column, with the request to get answers for each major type of database. So he's asking not so much to learn the answer but to provide a place for others to provide a catalog of answers.
This is different of course from the case where someone asks a genuine question then comes back and writes an answer to themselves when they have learned it. This happens a lot too.
In school and public libraries it was a physical piece of thick paper, similar to a Rolodex card, of which there were multiple copies: always one per author and one per title, but sometimes also by subject matter. And when I was a kid, there were often multiple, redundant copies that one could take to the librarian to check out the physical book if it were particularly important.
Wait, you had card catalogs where you took the cards out of the drawer? I grew up with card catalogs where there was a metal rod at the bottom of each drawer that ran through holes in the bottom of the cards. There were cards for author, title and subject with multiple subject card possible for each book. The CIP blocks on the copyright pages of books list all the possible subject headings.
And I've been thinking that I don't use the subject indexes on the online catalog anywhere near enough. They can surface things that the standard keyword search might miss or bury among irrelevant results.
There's kind of three things you want to show people to get them to understand indexes. Let's say you sell an internet police service and have a table of URLs you are crawling for a table of clients; each client gets some client_id, and each URL references the client_id of the client who wants it policed. Here are the things you want to say:
1. Binary search. It'd be really handy to have the URLs stored in a different order, maybe grouped by client_id first and then all the URLs for the same client alphabetical. The index conceptually just contains the same rows in a different order (see the sketch after this list).
2. The read vs write trade-off. Your data fits in memory if there's less than 10 TB of it (possibly up to 100 TB), so the big thing is latency. What an index does is require you to write the same information twice in order to read it exponentially faster, so depending on how often you write versus how often you read, indexes make more or less sense.
3. High fan-out trees. Once you have explained that the index is just the same data in a different order, you have inserted a sort of ticking time bomb of misunderstanding. The problem is, a beginner programmer’s notion of a freely indexable list, such that binary search works on it, is an array. How do we insert into an array? With all the work of array copies! So you get people who think that random ID columns should never be indexed, because every write into the middle of the array has to shove half the data one cell to the right—unacceptable! So it really helps to say, “now we can store this as a binary tree, make one comparison per level of the tree.” Pause for understanding that the “list” can be stored with such a tree. “But modern computers have a nice property that right after accessing some memory, the memory near that location is loaded into a cache... this makes arrays real fast. Suppose we don't just have the binary tree thing of 2 child nodes each representing 50% of our data, but an array with 100 pointers each representing 1% of the data; you get something like a 6x memory speedup from the cache, because your tree is one sixth the depth of the binary tree. And that's basically what a B-tree is: you use a higher fan-out of your sort tree to exploit the cache.”
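A sketch of point 1 in SQL, using the hypothetical schema from the example above:

    CREATE TABLE clients (client_id bigint PRIMARY KEY);
    CREATE TABLE urls (
        client_id bigint REFERENCES clients (client_id),
        url       text
    );

    -- Conceptually "the same rows in a different order": grouped by
    -- client, then alphabetical within each client.
    CREATE INDEX urls_by_client_idx ON urls (client_id, url);

    -- Answered by descending the index, not by scanning the whole table:
    SELECT url FROM urls WHERE client_id = 42 ORDER BY url;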
This feels a bit reductive. Database indexes definitely have more to consider than just simple lookup tables. I'd say that mental model breaks down in practice.
It is, but with all respect, it’s not a useful answer (or even accurate; see below). If one takes, for example and contrast, the top-voted answer, well, I could probably implement a database index scheme of some sort based on that explanation. Might not be a good one, but it would work. Parent comment’s answer? I wouldn’t know where to start.
Additionally, parent's answer doesn't actually answer the question: it says what an index is, but the question was "how does it work?" The top SO answer does a great job with this one, I think. Another one a few answers down also goes into some of the downsides (it slows down writes, for one). The whole page is worth a skim if you're not a DBA but have to fiddle with DBs on occasion.
[0] https://notes.eatonphil.com/database-basics-indexes.html
[1] https://use-the-index-luke.com/