Then I implemented a little FUSE driver in Python to read the disk image from the host system and it was wonderful to mount it the first time and see the files! https://github.com/vinc/moros-fuse
I have read down to the implementation section, but for my money, this is the best way to describe the high level function and behavior of a filesystem that I have ever seen.
I'm not asking about the structure or how it's organized. I mean... is the filesystem in a file or... how?
Background: I mostly do embedded stuff so at a glance I would have expected low level primitives (like, HW interactions, registers and stuff) but I see none. So maybe, my expectation, when tacking a problem, of interacting with the HW directly, does not stand in modern environments.
Even better, but unrelated question... how the heck does a x86 OS request data from the HDD?
You'd presumably have some "block device" abstraction between your filesystem and your device driver. Don't want to re-implement a FS for each type of hardware. On a Linux system, you can read, eg, /dev/sda1 from userspace, which is what it looks like this filesystem probably does.
As for how you actually request data from the hard drive:
There's older ATA interfaces, and BIOS routines from them, which I suspect is what most hobbyist OSes would use.
A more modern interface is AHCI. The OSDev wiki has an overview, where you can see how the registers work:
https://wiki.osdev.org/AHCI
as an aside, for our embedded system we use https://github.com/ARMmbed/littlefs for our flash file system, it has a bit of a description on its design and its copy on write system so that it can handle random power loss. Be nice to see some of these kinds of libraries done in Nim or Rust.
> how the heck does a x86 OS request data from the HDD?
Entirely too short summary: Use PCI to discover the various devices attached to the CPU. One or more of them are AHCI or NVMe devices. The AHCI and NVMe standards each describe sets of memory-mapped configuration registers and DMA engines. Eventually, you get to a point where you can describe linked lists of transactions to be executed that are semantically similar to preadv, pwritev, and so on.
Always fun to see this type of work. I notice the usage of OsString, and it made me wonder: does the way an OS encodes it’s strings potentially make this FS non-portable between OSes? If I want to mount a drive formatted with this FS, would the OsString be potentially non-portable?
There was a lot of discussion in the past around TFS https://github.com/redox-os/tfs, my understanding is that effort has kinda lost steam.
It would be nice if the intro had a brief explanation of why a disk needs to be divided into blocks. Otherwise, I really enjoyed this read from the perspective of a lay person.
> It would be nice if the intro had a brief explanation of why a disk needs to be divided into blocks.
One reason is that HDDs simply don't have a byte-wise resolution, so there's little point talking to HDDs in sub-sector units. Sectors are usually 512 bytes to 4k.
A second reason is being able to simply address the drive. Using 32b indices, if you index bytewise you're limited to 4GB which was available in the early 90s. With 512 bytes blocks, you get an addressing capacity of 2TB, and with 4k blocks (the AF format), you get 16TB. In fact I remember 'round the late 90s / early aught we'd reformat our drives using higher-size blocks because the base couldn't see the entire thing.
> Now why is it that disks lack byte-wise resolution?
Overhead.
Each disk sector is not 512 bytes (or 4096 bytes in modern disks), it's a bit more. It needs at least the sector number (to detect when the head is flying over the correct sector), a CRC or error-correcting code, a gap of several hundred bits before the next sector (so writing to one sector won't overwrite the next one), and a preamble with an alternating pattern (to synchronize the analog to digital conversion). All this overhead would be a huge waste if the addressable unit were a single byte.
Cost, more resolution means more transistors, address lines, and protocol overhead (your memory address take up
space too).
You could build a HDD with byte-wise resolution, which did a whole load of clever things internally to handle error correction (modern HDDs have far more complex error correction that just block wise checksums, they’re a piece of technological magic and analogue electronics).
But sure a device would cost far more, and offer no real benefits. Especially when you consider that reading a single byte in isolation is a very rare task, and it takes so long (compared to accessing a CPU cache) to find the data and transfer it that you might as well grab a whole load of bytes at the same time.
Byte-wise resolution in a HDD is like ordering rice by the grain in individual envelopes. Sure you could do it, but why the hell
would you?
I believe for performance and error correction features for spinning hard drives.
Check out the tables and graphic in the overview section here[1]. Any write to a sector requires updating the checksum, which means the whole sector has to be read before the drive can complete the write. Since the sector is probably in the kernel disk cache already, the whole sector can be flushed to disk at once and the drive can compute the checksum on the fly (instead of flushing 1 byte, making the drive do a read of the sector, recompute the checksum, and then do a write).
Because of the sector based error correction. I remember some earlier 8-bit disks has 128 byte sectors. Some newer flash SSDs have 1k-4k physical sector size for ecc coding efficiency so in those cases they will internally do a read-modify-write to handle a 512 byte logical sector size.
Try designing the format keeing the required features in mind and you might seen why they have.
The hardware has to be able to read and write data from an arbitrary location and do some kind of checksum for error detection and they have to be able to efficiently transfer this data to the cpu.
Yes it would, but you would be working in cache lines rather than disk blocks.
When working with RAM your PC will transfer a cache line of memory from RAM to your CPU caches. Again this is a feature of hardware limitations and trading off granularity against speed.
Your CPU operates many time faster than RAM so it makes sense to request multiple bytes at once, then your RAM can access those bytes, and line them up on it’s output pins ready for your CPU to come back and read them into cache. This give you better bandwidth. (It’s a little more complicated than this because the bytes are transferred in serial, rather than in parallel, but digging into the details here is tad tricky).
On the granularity point, the more granular your memory access pattern is, the more bits you need to address that memory. In RAM that either means more physical pins and motherboard traces, or more time to communicate that address. It also means more transistors are needed to hold that address in memory while you’re looking it up. All of those things mean more money.
And before we start looking at memory map units, that let you do magical things like map a 64 bit address space into only 16GB of physical RAM. Again the greater the granularity, the more transistors your MMU needs to store the mapping, and thus more cost.
So really the reason we don’t have bit level addressing granularity is ultimately cost. There’s no reason you could build a machine that did that, it would just cost a fortune and wouldn’t provide any practical benefits. So engineers made trade off to build something more useful.
A drive (be it HDD or SSD) is a block device, all drives are divided into blocks/sectors since they are a rather lossy medium. Data written to a bit may not necessarily be read accurately and so the solution is to have an Error-Correcting-Code that will allow to recover from some number of badly read bits (above that and you get a medium error from the drive). Since an ECC for a byte is a bit excessive the drives use blocks of 512 bytes or 4096 bytes, one reason for the move from 512 to 4096 is that the ECC becomes more efficient.
HDDs also have small "gaps" that have headers to locate the sectors (in the distant past you could do a low-level format to correct these gaps as well).
SSDs do not have these gaps but they do need the ECC.
In the bad old days we stored data on spinning-rust harddrives. In order to find the data you wanted you had to position the reader head. This took several milliseconds. You could read a lot of data in that time. So basically if you took the trouble to seek somewhere you better read a bunch of data to make it worthwhile.
People would even run defragmenters, that would rearrange file blocks so they were stored continously and the whole file could be read without seeeking.
The disk / inodes need to know where to start looking for a file's contents, like the address in memory for RAM. Or like the mail: We subdivide by city, then ZIP, then street, then address.
So the inode says "The data for my file starts at block 72 and is 3 blocks long" (or something like that). The disk then goes there, and reads blocks 72,73,74.
Each block is 4KiB large often, so if you have a 10KiB file, you still take up ceiling(file size/block size) blocks.
That's why there is a difference between "File size" and "Size on disk" when you look at disk usage summaries.
This doesn't explain why blocks are valuable as you could use byte addressing. The reason why blocks are valuable is for similar reasons to why memory is divided into pages. (Which is all I am going to write, as I don't have the time to answer this well today. But "you need to know where things are on disk" isn't an answer.)
This isn’t the reason about block addressing at all since it existed before paging was a thing. Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks. Disks don’t have byte access interfaces, so it doesn’t make sense to use one.
It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.
(I am awake now, so I can provide a more useful answer to this question.)
> This isn’t the reason about block addressing at all since it existed before paging was a thing.
I didn't say "blocks exist because pages exist", I said "blocks exist for similar reasons to why pages exist". How you thought that required temporal causality is beyond me. :/
> Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks.
OK, great: now you have to answer why, as you are just punting the answer. The answer, FWIW, is similar to the answer for why memory is divided into pages.
> Disks don’t have byte access interfaces, so it doesn’t make sense to use one.
But, of course, they could have a byte-access interface; in fact, that was in some sense the entire question: why don't they? Note, BTW, that RAM also doesn't have a byte-access interface: you have to access it by page. (And hell: many modern disks are just giant flash memories!)
The super high-level API for working with memory provided by the CPU as part of instructions lets you access it by the byte, but again: that's also true of disks and blocks, where the super high-level API for working with files absolutely lets you work with it accessing specific bytes.
> It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.
This is true, and is a partial answer to the original question. Essentially, a filesystem is serving the same problem space as a page table, but instead of mapping virtual pages of processes to physical pages it is mapping block-sized regions of files to blocks on disk.
In both cases, it is thereby beneficial to save some bits in that map. However, as pages of memory are usually 4k and blocks on disk are usually 512 bytes--though you see larger for both of these (64k pages are becoming popular)--that really isn't that many bits you are saving.
The real reason is that there is a cost to random access: the more entries in your page-table / filesystem block list the more space it takes and the harder it is to find the one you want; and, while you can "compress" ranges, that also comes with its own cost on random access.
On disks made out of spinning rust, this random access cost is particularly ridiculous, as it might require rotating a heavy piece of metal and physically adjusting magnets to achieve the specific offset you are looking for. You thereby never want to load a single byte at a time: you want to pull a full block.
There is also another interesting cost to random access on writes, which applies to some kinds of memory and applies to most kinds of disks, which is that you just don't have the precision required to write a single byte at a time: think about trying to isolate one byte using a magnet!
So the way these things work is that they have to read large subsets of data at one time and then write it back. If you can align your filesystem to match the blocks of a file along these write boundaries, you thereby decrease the amount of work the disk has to do.
(And, again: this is very similar to the work that memory managers are doing with respect to attempting to align pages of large allocations to make the virtual memory areas efficient in the page table, which, again, is pretty much a filesystem for memory with processes as files.)
You thereby also often find yourself attempting to tune your block/page size to that of the underlying hardware, and if you build higher-level data structures (such as databases using B-trees or dictionaries using RB-trees) you will want to align those at the higher-level as well.
(And it is thereby then also fascinating that it took as long as it did for people to realize that if you want to implement an in-memory tree that you should always prefer a B-tree--a data structure designed around disk blocks--over an RB-tree, for all of the same reasons.)
Blocks (and pages) thereby exist to help all layers of the system most efficiently take advantage of linear access patterns to get efficient parallel loading and caching to solve a problem that is otherwise subject to high random access costs and potentially-ridiculous fragmentation issues.
Is there any advantage in writing a custom file system for a niche purpose? It seems like most file systems are just different variations of managing where/when files are written simultaneously. Could a file system written specifically for something like PostgreSQL cut out the middle-man and increase performance?
Yes. Oracle has done this (ASM) to eliminate overhead, implement fault tolerance and provide a storage management interface based on SQL, for example.
I once made a 'file system' to mount cpio archives (read-only) in an embedded system. Cpio is an extremely simple format to generate and edit (in code) and mounting it directly was very effective.
I suspect operating on block storage directly may both be easier and more reliable for databases, since about 75 % of the complication in writing transactional I/O software is working around the kernel's behavior.
Kernel's fsyncing behavior is one thing, but just relying on a massive amount of fragile C code running in kernel is a significant liability, especially if your software is a centralized database and crashes, panics will bring down everything.
Yes, and also the traditional answer was that the kernel handles weird and complicated hardware and can talk to RAID controllers properly, but nowadays hardware has much less variance, and RAID is rare (and arguably unnecessary for a direct-IO database).
I think it'd be viable for an enterprise-y database to do IO directly over NVMe. Imagine the efficiency and throughput gains you could get from a database that (1) has a unified view of memory allocation in the system (2) directly performs its page-level IO on the storage devices.
Wow this comment just made me fall down a rabbit hole. I've only just surfaced. The Kaitai project actually comes with some pre-defined bindings for cpio which meant I was up and running very quickly.
You may be interested in a paper written by the Ceph team: "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution"
Yes, this is common in database engines. Doing so allows you to optimize the file system along a very different set of performance tradeoffs and assumptions than a typical generic file system. Beyond that, it also gives you direct control of file system behavior, the lack of which is a source of code complexity and edge cases. This is not transparent to the database, something like PostgreSQL would need to have its storage layer redesigned to explicitly take advantage of the guarantees.
It isn't just about performance gains, which are substantial, it also greatly simplifies the design and code by eliminating edge cases, undesirable behaviors, and variability in behavior across different deployment environments.
We've attempted this as well and it's not as simple as it seems. The issues we've run into have made us reconsider porting our FS handlers to Rust, although we are cautiously optimistic about later results.
I'd be curious to experiment with a file system where all of the file and path metadata is centrally stored in a sqlite blob. Is sqlite fast enough for dealing with file system metadata requests?
Then I implemented a little FUSE driver in Python to read the disk image from the host system and it was wonderful to mount it the first time and see the files! https://github.com/vinc/moros-fuse