I feel like FAT and FAT32 back in the DOS days were so simple and easy to explain that my generation of programmers hit the "sweet spot" in terms of learning technology.
FAT / FAT32 wasn't about B-trees or balancing or anything. It was a simple linked list that got severely tangled up during use and required an hour of "defragmenting" to get your hard drive back up to speed.
As a computer user of the time, I was too young / ignorant to know why it happened like that, but that experience stuck with me until college. In college, I learned how the FAT32 filesystem worked fundamentally (though obsolete, it's so simple it's a great introduction).
From there, I learned why we have moved on to more advanced data structures like B-Trees, why things like B-Trees are less prone to fragmentation, etc.
----------
A big part of my learning process was the hours of waiting I'd put in to defrag my hard drive back in the day. Those memories stuck with me; I know what the "downsides" are if I don't use a self-balancing B-Tree to store data on a hard drive.
Perhaps today, we're no longer at a point where we can just convince legions of new computer-users (or programmers) to use inferior file systems like FAT32 in their daily lives. But we can at least tell stories about those times, so that they can understand why we put so much complexity into our modern filesystem data-structures.
Wasn’t defragmentation more important in general on spinning media to reduce random seeks? And now most performance oriented things are happening on SSD?
I'm very confused by your question. I don't think FAT or FAT32 was ever a performance-oriented filesystem.
Back then, we were just happy having a personal computer. I'm sure there was better tech being used somewhere. But MS-DOS / Windows was just your cheap $99 OS in an age when other compilers and OSes cost you something like $1000 to $10,000.
You formatted your hard-drive in FAT because MS-DOS used FAT. You used MS-DOS because that's what PCs and PC-clones came with. There was no alternative, and PCs were already considered something of a luxury / nerd item. (I know people brag about their DEC Alphas, Unix copies or OS/2... but you got enough "nerd cred" back then if you simply had a PC at all)
It's just how computers worked back then. Every PC was using FAT, and every few months of regular use the disk reads/writes would get so slow that performance was unbearable.
At that point, you spent an hour defragmenting the hard drive, and everything worked great again.
-------
Eventually 4GB hard drives were made, and you upgraded your computer to FAT32. Soon after that, Windows promised an end to defragmenting with its next-generation NTFS filesystem (based on the B-trees discussed in this article) and we never looked back.
Well, one purpose of defragging a drive was to put data from the same file in one place in order to reduce random seeks while reading sequential data. One would defrag in order to increase performance in this regard.
On an SSD, the random seek time is much lower. There may be some gains from sequential reads, but the gains from having data physically defragmented are not as pronounced as with spinning media.
Defragmentation was indeed more important on spinning disks; I think your "to reduce random seeks" is largely correct: another way to think about it is that it was ensuring sequential reads ("read in this 10MB file") did not become random seeks ("Sure, but that 10MB file is in these 5,381 chunks all over the place")
You are correct that SSDs do not require defragmentation, and to a certain degree defragmentation is somewhat harmful in the sense that it forces unnecessary read/write cycles.
If I understand correctly, NTFS describes a file as a sequence of cluster ranges, e.g. [7,12), [0,4), [20,25). So the more fragments a file or folder has, the more space is required to describe how it's stored. This applies to the file system and is independent of HDD/SSD technology.
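A minimal sketch of that idea (toy cluster numbers, not NTFS's actual on-disk record format), showing how the same cluster list looks coalesced into extents, and why more fragments mean more metadata:

    # Hypothetical illustration: a file's clusters as a FAT-style per-cluster
    # list vs. NTFS-style extents (half-open [start, end) ranges).
    fat_chain = [7, 8, 9, 10, 11, 0, 1, 2, 3, 20, 21, 22, 23, 24]  # one entry per cluster

    def to_extents(clusters):
        """Coalesce an ordered cluster list into [start, end) runs."""
        extents = []
        for c in clusters:
            if extents and c == extents[-1][1]:
                extents[-1][1] = c + 1       # extend the current run
            else:
                extents.append([c, c + 1])   # start a new run
        return [tuple(e) for e in extents]

    print(to_extents(fat_chain))  # [(7, 12), (0, 4), (20, 25)] -- 3 extents
    # A contiguous file of the same size needs only 1 extent, while the FAT
    # approach always costs one table entry per cluster.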
As for SSDs, they need some degree of garbage collection - it's considered bad for an erasable block (e.g. 1 MiB) to contain a mix of live pages and deleted pages (4 KiB each).
How would B-trees on their own solve the fragmentation problem? The source of slowdown was random seek speed, not the data structure used. You can optimize a FAT32 driver to always fit files into sequential gaps; early Windows implementations didn't.
Files grow and shrink. Think about even the simplest "log" file (ex: Apache logs) appending constantly to the end of the file. First you have an access.log (+1 line to the end). Then Apache opens the error.log and adds some info to the end. Then PHP has something logged somewhere, again opening and closing a new set of files.
Even if these files were all perfectly allocated in sequential order at the beginning, a few operations like this quickly fragments the files.
--------------
I don't recall all the files MS-DOS or Win95 opened / closed on a regular basis, but play a video game (save file changed), open a word document, edit a photo, and come back to the video game... and the save file will now start getting fragmented as the size of the save files change and new allocated blocks need to be figured out.
> - can you name a couple of games saving state by appending/modifying files in place on disk?
Hmm, maybe the video game wasn't the example I should have chosen. But swap the order around: open a word doc (which creates an "auto-save" file which is almost certainly modified in place), and then do the other stuff and I think it applies.
------
> - How would btrees help with that?
At a minimum, your indices are all together and in order and are read together all at once.
From there, you can build the file's representation in RAM with far fewer random traversals than the FAT32 linked-list approach.
Perhaps "fixing fragmentation" is the wrong description. B-Trees mitigate the random reads associated with fragmentation. There are far fewer seeks from a B-tree-based representation than a FAT/linked-list style representation.
-----------------
Consider this scenario. You have a FAT/Linked list of the following: 0 -> 10 -> 5 -> 7 -> 2 -> 1
So you need to move-head to 0, read, move-head to 10, read, move-head to 5, read, (etc. etc.). That's 6 times you move the head randomly.
Now let's say you have a b-tree of [0, 10, 5, 7, 2, 1]. Exactly the same order of nodes as the linked list, but you can optimize much better.
You initially read the b-tree, and then you read 0, 1, 2, 5, 7, 10 in order. You probably only move the head twice: first you move the head to 0 to read [0, 1, 2] as a sequential scan.
Then, you move the head to 5, and read [5, 6, 7, 8, 9, 10] in order. Except you throw away the results for 6, 8, and 9.
So long as you have sufficient RAM to re-order the data after you've read them from disk, you're gold.
----------
There's no way to optimize the FAT32's reads, because the indices are scattered across the linked list. You only know that 10 is the "next" FAT32 node after you've read the page at 0. You only know that 5 follows 10 after you've read 10.
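A toy way to count the head moves in that scenario (a sketch, not driver code; a "seek" here just means jumping to a non-adjacent cluster):

    # Toy model of the scenario above: a "seek" is counted whenever the next
    # cluster to read is not physically adjacent to the one just read.
    def count_seeks(read_order):
        seeks, pos = 0, None
        for cluster in read_order:
            if pos is None or cluster != pos + 1:
                seeks += 1            # head has to move
            pos = cluster
        return seeks

    chain = [0, 10, 5, 7, 2, 1]       # logical (file) order of clusters

    # FAT-style: the next cluster number is only discovered by reading the
    # current entry, so the disk must be visited in chain order.
    print(count_seeks(chain))         # 6

    # B-tree / extent-style: every cluster number is known up front, so reads
    # can be issued in ascending physical order and reassembled in RAM.
    print(count_seeks(sorted(chain))) # 4, and fewer still if you read through
                                      # the gaps and discard clusters 3-4, 6,
                                      # and 8-9 as described above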
Office autosave does not modify files in place; it creates a new copy with some additional backup/history/unsaved-changes mechanisms thrown in. Modifying files in place is very rare (preallocation in the case of many downloaders like bittorrent/aria2, or straight-up disk editing); appending is more common (logging).
> From there, you can build the file's representation in RAM with far fewer random traversals than the FAT32 linked-list approach.
You can do the same thing in FAT by reading the whole cluster map at the start. Wiki: "Other high-level mechanisms may read in and process larger parts or the complete FAT on startup or on demand when needed and dynamically build up in-memory tree representations of the volume's file structures different from the on-disk structures." This passage references OS/2 documentation, but it's unclear to me if OS/2 employed such optimizations. Later NT builds might have this, as I've seen large files allocated into empty space in the middle of a disk instead of being forced into the first available gap.
> B-Trees mitigate the random reads associated with fragmentation. There are far fewer seeks from a B-tree-based representation than a FAT/linked-list style representation
After reading the whole cluster map you have the same representation in RAM; there is no information disparity. It's less elegant, slower to initialize, and wasteful - a 2TB FAT32 partition with 32KB clusters means a >250MB cluster map on disk. But at the end of the day the disk will have to perform the same random seeks to read an already fragmented file. The magic bullet for fragmentation is a map of free sectors letting the OS pre-plan where to put files in a smart way, and even NTFS doesn't use fancy data structures for that: NTFS's $BITMAP is a https://en.wikipedia.org/wiki/Free_space_bitmap. exFAT has a nice feature of marking unfragmented files in the directory entry, removing the need for FAT traversal altogether.
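For what it's worth, the free-space-bitmap idea really is that simple; a rough sketch of "find n contiguous free clusters" (just the concept, not NTFS's actual $BITMAP layout):

    # One bit per cluster, 1 = free. Scan for the first run of n free clusters.
    def find_free_run(bitmap, n):
        run_start, run_len = None, 0
        for i, free in enumerate(bitmap):
            if free:
                if run_len == 0:
                    run_start = i
                run_len += 1
                if run_len == n:
                    return run_start
            else:
                run_len = 0
        return None                   # no contiguous run large enough

    bitmap = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]
    print(find_free_run(bitmap, 3))   # 5 -> clusters 5..7 can hold the file contiguously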
>Consider this scenario. You have a FAT/Linked list of the following: 0 -> 10 -> 5 -> 7 -> 2 -> 1 So you need to move-head to 0, read, move-head to 10, read, move-head to 5, read, (etc. etc.). That's 6 times you move the head randomly.
Or the whole FAT table is cached in RAM and there is no difference :)
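A sketch of what caching the whole FAT buys you: the chain walk becomes pure memory lookups, so you end up with the same cluster list (which you can then sort into physical read order) before touching the disk. The table contents are made up to match the scenario above:

    # With the whole FAT resident in RAM, following the chain costs no disk I/O.
    EOC = -1                                         # end-of-chain marker
    fat = {0: 10, 10: 5, 5: 7, 7: 2, 2: 1, 1: EOC}   # in-memory copy of the FAT

    def chain_clusters(fat, start):
        clusters, cur = [], start
        while cur != EOC:
            clusters.append(cur)
            cur = fat[cur]            # pure memory lookup, no seek
        return clusters

    logical_order = chain_clusters(fat, 0)   # [0, 10, 5, 7, 2, 1]
    read_order = sorted(logical_order)       # issue the reads in physical order
    print(logical_order, read_order)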
I had to implement a simple B+ tree for a databases class in college (since they're often used when implementing databases).
It was one of those assignments few people fully completed. Most completed parts of it to varying degrees. I got more than halfway through but wasn't able to complete it.
It was an extremely difficult assignment and stands out in my mind.
They're very interesting and very complex. Nice article!
Implementing a B+ tree was the final assignment in a grad-level Algorithms & Data Structures class that I took. I thought it was a reasonable assignment in that context. It felt like a good capstone to the class, building on what we had learned and finally crossing the line from toy examples to something that approached real-world usefulness.
It seems very out-of-place for a databases course. Yes, B-Trees are important for a database, but I don't see how implementing one teaches you about how a database uses one or why it's important.
As another data point, I implemented a B+ tree with rebalancing deletion functionality back in CS3090 at UVU circa 2008, taught by Dr. Curtis Wellborn.
It was an undergraduate course, but the professor emphasized that everywhere else it would be a graduate-level course.
My B+ tree implementation served as the foundation for my own (shitty) DBMS ;D
As a funny sidenote, the professor claimed that even Stanford DB courses had a buggy / broken algorithm for balanced deletes. While preparing the course, he'd spent a chunk of the summer researching it and had developed his algorithm in Prolog. He shared the concepts and our implementations worked fine. However, it was ultimately mostly an academic exercise, as balanced deletions come with significant tradeoffs which make them not as useful as I initially thought.
Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).
> Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).
It's intuitively clear that worst-case performance [1] can degrade to a function of the number of insertions since a rebuild (as opposed to the number of currently live items) when you use relaxed deletion. If you can afford to globally rebuild the tree you get to play the usual amortization tricks, but global rebuilds can be problematic in a large-scale live database; one solution is to rebuild from a snapshot in the background (on another thread or machine) and then replay late arriving operations before switching over to the rebuilt tree. But if someone is okay with the amortization spikes from rehashing a hash table, they should be okay with non-incremental global rebuilds for search trees; a hash table where deletions leave behind tombstones and the table is rehashed when the ratio of tombstones gets too high is basically the same idea.
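A bare-bones sketch of that relaxed-deletion-plus-rebuild idea, using a plain dict as the backing structure rather than a real B+ tree or open-addressing table (all names made up):

    # Deletes only mark entries dead; the structure is rebuilt in one go once
    # the dead fraction crosses a threshold (the amortized global rebuild).
    TOMBSTONE = object()

    class LazyDeleteIndex:
        def __init__(self, rebuild_ratio=0.25):
            self.slots = {}
            self.dead = 0
            self.rebuild_ratio = rebuild_ratio

        def put(self, key, value):
            if self.slots.get(key) is TOMBSTONE:
                self.dead -= 1
            self.slots[key] = value

        def get(self, key):
            value = self.slots.get(key)
            return None if value is TOMBSTONE else value

        def delete(self, key):
            if key in self.slots and self.slots[key] is not TOMBSTONE:
                self.slots[key] = TOMBSTONE              # relaxed delete: just mark it
                self.dead += 1
                if self.dead / len(self.slots) > self.rebuild_ratio:
                    self._rebuild()

        def _rebuild(self):
            self.slots = {k: v for k, v in self.slots.items() if v is not TOMBSTONE}
            self.dead = 0

    idx = LazyDeleteIndex()
    for k in range(8):
        idx.put(k, k * k)
    for k in range(3):
        idx.delete(k)                 # the third delete crosses 25% and triggers a rebuild
    print(len(idx.slots), idx.get(5)) # 5 25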
Perhaps it was just my professor, but what really helped was that we first did red-black trees, and after that our teacher explained to us that red-black trees are just a type of B-Tree. Once you have that, it's pretty easy to go from one to the other.
That's the approach Sedgewick of the (in?)famous textbook takes. He also has one neat tweak that simplifies rebalancing: when traversing a B-tree for insertion, preemptively split any full nodes as you visit them. This ensures there will always be room in a parent node if child nodes force keys up.
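Roughly what that looks like as an insert-only sketch (minimum degree t = 2, no deletion, no storage layer; in the spirit of the top-down approach described above rather than Sedgewick's exact code):

    # Split full nodes on the way down, so the parent always has room for the
    # promoted middle key by the time a child forces one up.
    T = 2  # minimum degree: nodes hold at most 2*T - 1 = 3 keys

    class Node:
        def __init__(self, leaf=True):
            self.keys = []
            self.children = []
            self.leaf = leaf

    class BTree:
        def __init__(self):
            self.root = Node(leaf=True)

        def _split_child(self, parent, i):
            """Split the full node parent.children[i] around its middle key."""
            child = parent.children[i]
            right = Node(leaf=child.leaf)
            mid = child.keys[T - 1]
            right.keys = child.keys[T:]
            child.keys = child.keys[:T - 1]
            if not child.leaf:
                right.children = child.children[T:]
                child.children = child.children[:T]
            parent.keys.insert(i, mid)           # parent is never full here
            parent.children.insert(i + 1, right)

        def insert(self, key):
            if len(self.root.keys) == 2 * T - 1: # preemptively split a full root
                old_root = self.root
                self.root = Node(leaf=False)
                self.root.children.append(old_root)
                self._split_child(self.root, 0)
            node = self.root
            while not node.leaf:
                i = sum(k < key for k in node.keys)
                if len(node.children[i].keys) == 2 * T - 1:
                    self._split_child(node, i)   # preemptive split while descending
                    if key > node.keys[i]:
                        i += 1
                node = node.children[i]
            node.keys.insert(sum(k < key for k in node.keys), key)

    tree = BTree()
    for k in [10, 20, 5, 6, 12, 30, 7, 17]:
        tree.insert(k)
    print(tree.root.keys)  # [10, 20]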
I got interested in the topic a few years ago and wrote an implementation in Python[0] for myself. It's just a toy, but the funny thing is that I regularly get waves of students starring the repository, probably when a teacher gives a similar assignment.
It doesn't balance between neighbours on the left and right like some B-trees do, but it handles splits in a really simple way and balances otherwise.
I would recommend everybody write the core algorithm in a simple language like Python first to get the core algorithm understood before trying to implement in a low level language like C or Rust.
I still remember my B-tree implementation in college. The professor wanted us to run it against his data. I quickly commented out my debug prints, which introduced a bug with an "if" statement, and it failed for some of his test cases, but I got most of the credit. Later I found the cause of the bug and got pissed off at such a silly mistake.
An in-memory b-tree or one stored on disk? The first one should not be a huge amount of work over implementing, say, a red-black tree. But sorting out the disk storage is hard (and in the end, you've just written a database!)
Fantastic. Also a great related listen is the recent Corecursive Podcast episode with Richard Hipp as guest. He's the man who created SQLite, with good technical stories to share.
Just FYI, adding right-sibling pointers to a B-tree makes it impossible to implement a COW B-tree. Kinda like COW semantics makes it impossible to have doubly-linked lists.
But if you're not going to COW, then right-sibling pointers are probably a good idea.
This is really cool. I recently attempted to build a toy database, and subsequently implemented my own b-tree (https://github.com/spandanb/learndb-py). I ended up running into a lot of these issues.
This incomplete tutorial is about how to write a sqlite clone, with particular emphasis on how to implement the underlying b-tree: https://cstack.github.io/db_tutorial/
For YEARS that tutorial has been stuck on "Alright. One more step toward a fully-operational btree implementation. The next step should be splitting internal nodes. Until then!". Must not be worth the time to finish writing the tutorial.
I just implemented one in Go, and I'm working on the REPL now. I plan to eventually write it in C++ for a real database, but I'm doing this for my team, to teach them the difference between LSM trees and B-trees. I'm hoping to show that a specialized concurrent B-tree, though harder to write, performs better than LSM trees for the same data. Maybe I'll make a blog post about it.
EDIT: For more context, this is for time series metrics, so super specialized data.
IMO, unlike the opening drawing, in a disk-based B-Tree index Leaf nodes should not have "next" pointers; one should instead go back up the Tree node(s) (iterating upward when necessary) to find the next Leaf node. Next pointers are a duplication of information violating referential integrity within the index, and will be invalidated when an index update causes a split or merge rebalancing operation, creating concurrency issues.
When you split a node, you're probably [1] already modifying the node you're splitting, so there's no problem updating unidirectional sibling pointers. If duplication of information is a problem, well, you have a bug in your storage engine somewhere.
[1] It occurs to me that you don't necessarily need to modify the node you're splitting, if the key you're writing is on the new node's half of the split.
However, sibling pointers would be the dumbest way to traverse the data, even on a read-only tree, because they incur serial round-trip latency to disk for every block. You would never use them for traversal.
They can be useful, for a fancy concurrency hack, if you use them to split the node without updating the parent, and then update the parent in a separate operation after the fact. This lets write operations release their lock on the parent before accessing the child. In that case any code traversing to a child makes note of the child's right sibling stored in the parent and uses the child's sibling pointer only if it's different from the right sibling as stored in the parent (which means the parent hasn't been updated yet).
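A much-simplified, single-threaded illustration of that check (no latching shown, made-up node layout; a real B-link/Lehman-Yao tree is considerably more involved):

    # When descending, remember the right sibling recorded in the parent; only
    # chase the child's own sibling pointer if it disagrees, i.e. the child was
    # split and the parent hasn't been updated yet.
    class Node:
        def __init__(self, keys, children=None, right=None):
            self.keys = keys          # separator keys (internal) or items (leaf)
            self.children = children  # child nodes, or None for a leaf
            self.right = right        # right-sibling pointer

    def descend(parent, key):
        i = sum(k <= key for k in parent.keys)
        child = parent.children[i]
        expected_right = parent.children[i + 1] if i + 1 < len(parent.children) else None
        while child.right is not expected_right and child.keys and key > child.keys[-1]:
            child = child.right       # "move right": split not yet reflected in parent
        return child

    # Leaf b was split into b and b2, but the parent still only knows [a, b].
    b2 = Node(keys=[25, 30])
    b = Node(keys=[15, 20], right=b2)
    a = Node(keys=[1, 5], right=b)
    parent = Node(keys=[10], children=[a, b])

    print(descend(parent, 27) is b2)  # True: reached via the sibling pointer
    print(descend(parent, 16) is b)   # True: no need to move right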
Concurrency issues are a challenge, but a bigger problem is when designing a copy-on-write B-Tree, which is what I did. With such a design, any change to a node causes a replacement to be allocated, and then the pointer into it must be changed too. This means that the referencing node must be copied, and so on.
If only the parent pointer needs to be changed (and up to the root), this isn't a big deal considering that B-Trees aren't very deep, and a good implementation will cache modifications anyhow.
When sibling pointers are introduced, a copy-on-write B-Tree will on average need to read and rewrite half of the B-Tree on every modification. For this reason, I don't use sibling pointers.
PostgreSQL and MySQL, for example, use B-tree variants that have pointers between leaf nodes. Postgres is based on Lehman & Yao (itself a B+(*?) Tree variant), and InnoDB uses a B+ Tree with a next pointer.
> one should instead go back up the [parent] node(s) (iterating upward when necessary) to find the next Leaf node.
Nitpick: rather, you should not go anywhere, but refer back to the parent nodes that you already have in memory, because how else did you find the current leaf node in the first place? (This also allows you to avoid round-trip latency by fetching sibling-after-next before you actually get the data in the sibling node.)
> In my college Data Structures and Algorithms course, we covered B-Trees, but I didn’t grok why I’d choose to use one. As presented, B-Trees were essentially “better” Binary Search Trees, with some hand-waving done that they had improved performance when used in database applications.
With modern memory hierarchies that also tends to be the case in-memory (with lower node densities more suited to caches and cache lines).
DDR4 has a burst-length of 8, meaning the smallest you can write is a block of 64 bytes (corresponding to a modern cache-line).
It also should be noted that DDR4 has "rows" roughly 1024 bytes in size. A RAS command opens up a "row" (aka 1024 bytes), while a CAS command reads from a column (aka 64 bytes).
DDR4 cells can only be read once (!!!), and afterwards, the data gets mangled. To prevent the loss of data, all "reads" are first into sense-amplifiers... and these sense-amplifiers can be read infinitely.
This "read into sense-amplifiers" is called a RAS command, and it transfers the entire 1024-byte row into sense amplifiers. A CAS can then read 64-byte chunks from the page at high speeds and randomly.
--------
Before a new RAS command can be issued, a few steps need to happen.
1. The "current row" of data needs to be written back to DRAM (remember, the data in the DRAM was lost when you read the data in the first place).
2. After writing the data, the sense-amplifiers must be emptied (aka: precharged), so that they're ready to receive the new data.
3. After #1 and #2, it is finally legal to issue a RAS
--------
So in fact, DDR4 RAM is also a block device. Sure, RAS is much faster than a hard-drive seek, but given how much slower RAS (row address strobe) is than a CAS (column address strobe)... a lot of these disk-based discussions (such as B-trees) end up applying to DDR4 in practice.
--------
The only thing that really works like classic RAM is L3 / L2 / L1 caches. Everything else in the modern world is basically a block device.
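A back-of-the-envelope model of why that matters; the latencies are illustrative round numbers, not taken from any particular DDR4 datasheet:

    # Toy DDR4 cost model: a CAS within the currently open row is cheap;
    # touching a different row pays precharge (write-back) + RAS first.
    CAS_NS = 15        # read a 64-byte chunk from the open row
    RAS_NS = 15        # sense a new row into the sense amplifiers
    PRECHARGE_NS = 15  # write the old row back and reset the amplifiers

    def access_cost(addresses, row_bytes=1024):
        open_row, total = None, 0
        for addr in addresses:
            row = addr // row_bytes
            if row != open_row:
                if open_row is not None:
                    total += PRECHARGE_NS   # close the previously open row
                total += RAS_NS             # open the new row
                open_row = row
            total += CAS_NS                 # 64-byte burst from the open row
        return total

    sequential = [i * 64 for i in range(16)]    # 16 reads inside one row
    scattered = [i * 4096 for i in range(16)]   # 16 reads, each hitting a new row
    print(access_cost(sequential), access_cost(scattered))  # 255 vs 705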
B+Trees minimize the maximum depth of the lookup. All the values have exactly the same cost. In all honesty, I don't care about the lookup cost of the values I never need.
Sounds like you are looking for a Splay Tree implementation. Their dynamic optimality properties ensure your most popular data lives right at the root.
For example: in essence, any tree that allows you to "rotate" based on some statistics about subtree popularity. This becomes more and more interesting in combination with copy-on-write strategies, because there you have to rewrite the top of the tree anyway. So why not aggregate some statistics while you're at it.
The "hot items" optimisation you're thinking of have less of an effect on trees with wide fanout, such as B-trees with a substantial block size.
This is because the ratio of number of "hot" items you can store at a reduced depth to total number of item is a much smaller ratio with a high fanout tree than a low fanout tree; and at the same time, the depths are lower with a high fanout tree, reducing the benefit of reduced depth anyway.
In other words, you can only give the benefit to a smaller number of hot items, and the benefit you can give is smaller.
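Rough numbers for that argument (a sketch that reads "items you can store at a reduced depth" as the items that fit in interior, non-leaf nodes, and assumes full nodes):

    import math

    N = 100_000_000                  # items in the tree

    for fanout in (2, 256):
        depth = math.ceil(math.log(N, fanout))
        promotable = 1 / fanout      # interior nodes hold roughly 1/fanout of all items
        print(f"fanout {fanout:3}: depth ~{depth:2}, "
              f"share of items that can sit above the leaves ~{promotable:.2%}")

    # fanout   2: depth ~27, share of items that can sit above the leaves ~50.00%
    # fanout 256: depth ~ 4, share of items that can sit above the leaves ~0.39%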
On top of that, the added complications of rotations and statistics mean fewer items can be stored in interior nodes (which I'd call index nodes) for a given node size. This reduces the fanout and increases the full depth.
And the rotations and access statistics cause more writing.
There are tradeoffs and perhaps it is worth it for some tree parameters. Perhaps for exceptionally large trees with particular access patterns. It would be interesting to see how it's implemented.
But generally for a high-fanout B-tree, you might as well just store frequently accessed items in a separate small tree as a cache, and keep that cached in memory.
You're absolutely right. However a COW strategy forces you to rewrite parts of the tree _anyway_ (typically the path from the node you're updating to the root). So you might as well rotate an out-of-balance node in that path. The rewriting also causes you to shy away from high fanouts because of the overhead.
Well, I would definitely be interested to see the paper where that B-tree rotation is described along with measurements! There are many effective B-tree modifications; maybe that's one of them. At least with storage B-trees, I haven't seen a paper describing anything like that yet.
One of the "rules of thumb" of optimising high-fanout B-trees larger than memory is to generally put values in the leaves, because the benefit of being able to keep more pointers cached in memory in compact internal nodes to help locate values in fewer I/O steps outweighs the benefit of having an arbitrary and small subset of larger values in the internal nodes. But if you have particularly well-chosen larger values in the internal nodes, and a low-entropy access pattern, perhaps the rule of thumb doesn't apply then.
There are quite a few B-tree modifications where separate small trees or logs or both are maintained to improve performance, but these are generally to optimise writes, which can be sorted and distributed in batches to reduce random access block I/O, rather than to optimise frequently accessed reads by acting as an on-storage persistent cache.
On fanout:
High fanout is generally used with storage devices, because low fanout always increases the overhead.
For example on NVMe SSDs I've used, reading or writing any block smaller than 4096 bytes takes the same time as 4096 bytes. On HDDs the equivalent number is much larger due to seek time.
So lowering fanout below 4096 bytes per block would just increase the number of block I/Os per tree operation without reducing the time per I/O.
On COW:
You're right about the rewriting, although it's worth knowing that times have changed: On NVMe you have durable write atomicity guarantees so you don't need to COW as much to have the same storage guarantees. Inside the NVMe of course it may be doing COW on hidden page remappings to provide this atomicity, or it may use other techniques (e.g. a supercapacitor to ensure writes always complete) but from the outside you can simply overwrite blocks or parts of blocks in place.
This adds quite a saving to safe tree updates, and it also reduces padding required in write-ahead logs at durable commit points, including the in-block logs used in several B-tree optimisations (e.g. see bcache on Linux).
Even with COW, it isn't always necessary to update the path to the root by rewriting parent blocks in general. To avoid a parent block write, sometimes it's faster to log an update to the parent's pointer entry, either in an actual logical log somewhere, or using a separate logical to physical page mapping table, or by using pointer-to-log entries that are resolved at read time.
If it's an in-memory B-tree, a COW strategy doesn't need to rewrite parent blocks at all. It can update the pointers in place directly with atomic pointer updates, although some implementations prefer a logical-to-physical page mapping table instead and to atomically update the table.
NVMe devices can be split into regions with a different block size per region. If you look at the spec and are interested in building a storage system on top of them (as I was in a previous life) you will see they don't resemble a hdd at all.
It's a shame to put a file system on top of them and throw away all the IOPS. Also, there's no such thing as an atomic write. HDDs, SSDs, NVMes all lie to you in different ways. I've seen SSD power failures where in a whole bank every byte had its 2nd bit set. So it's an illusion to think that you don't break what you don't touch.
I tried to split an NVMe into different block sizes per region recently. It was to get the small performance improvement (~2%) of 4096-byte sectors, while needing to boot on a motherboard that needs 512-byte sectors (I didn't want to deal with switching a remote server to UEFI and figuring out every issue remotely; it's a shame there isn't some trivial BIOS extension to support 4096-byte sectors).
So I decided to format a boot region and main region.
What I found with the particular models I had was:
- Multiple regions were fine, and different block sizes (512 or 4096 bytes) were fine too, but they wouldn't accept different block sizes in different regions at the same time. Format commands failed when you tried that.
- The atomicity size and alignment were identical regardless of block size. Durable (power fail safe) atomicity was reported as 4096 byte aligned blocks. Write command overlapping atomicity was reported as 512kiB aligned blocks.
I'm not sure why you may think I think NVMe resembles a HDD regarding atomicity. HDDs don't claim to do atomic replacement of sector contents, and neither do non-NVMe SSDs.
However, NVMe specifies the atomicity, and the device reports atomicity parameters. Are you saying the NVMe devices lie about their atomicity as well? If so, that's a real shame, after going to the effort of reporting atomicity parameters in the first place. Or are you inferring it from experience with HDDs and non-NVMe SSDs?
With SSDs (at least low-quality non-NVMe ones) I'm not convinced you can make use of "you don't break what you don't touch". Blocks are remapped all over the place, so you don't know which banks you are touching, so if there are faults with the SSD you don't know which random blocks may become corrupt as a side effect of the blocks you are writing to, so you can't work around it.
In other words, such devices are not usable for anything if you need power fail safety. Atomicity is the least of your worries.
However, to the extent you decide to depend on a device not corrupting random unrelated data on power failure, it's hard to see why you should not include the NVMe-reported block atomicity in that assumption and modify your storage algorithm to take advantage of it.
Of course any device can fail catastrophically to meet its own basic specifications, such as your bit 2 in every byte example, and in general random corruption due to design or manufacturing issues. It doesn't make sense to design around those kinds of "unknown unknowns" at the B-tree and logging format design level though. Those need to be handled at a higher level, where we are willing to write off a database and perhaps switch to a backup.
Edit: ps. Thanks for the Usenix paper link. It's a good paper, and I'd hoped to see someone do those measurements. I hope to see a similar paper about NVMe someday.