
Writing a file system from scratch in Rust - carlosgaldino
https://blog.carlosgaldino.com/writing-a-file-system-from-scratch-in-rust.html
======
vinc
I recently wrote a very simple and naive filesystem in Rust for a toy OS I'm
building, and it was quite an interesting thing to do:
[https://github.com/vinc/moros/blob/master/doc/filesystem.md](https://github.com/vinc/moros/blob/master/doc/filesystem.md)

Then I implemented a little FUSE driver in Python to read the disk image from
the host system and it was wonderful to mount it the first time and see the
files! [https://github.com/vinc/moros-fuse](https://github.com/vinc/moros-fuse)

------
unethical_ban
I have read down to the implementation section, but for my money, this is the
best way to describe the high level function and behavior of a filesystem that
I have ever seen.

~~~
ridiculous_fish
A very accessible (though dated) intro to filesystems is Practical File System
Design, by Dominic Giampaolo.

PDF link: [http://www.nobius.org/practical-file-system-design.pdf](http://www.nobius.org/practical-file-system-design.pdf)

~~~
vondur
Is he the guy who did the BeOS filesystem?

~~~
peterkelly
Yes. Later went to Apple and did Spotlight.

[https://en.wikipedia.org/wiki/Dominic_Giampaolo](https://en.wikipedia.org/wiki/Dominic_Giampaolo)

~~~
saagarjha
And APFS:
[https://developer.apple.com/videos/play/wwdc2016/701/](https://developer.apple.com/videos/play/wwdc2016/701/)

------
azhenley
There’s also this file system chapter from a series on writing an OS in Rust:
[http://osblog.stephenmarz.com/ch10.html](http://osblog.stephenmarz.com/ch10.html)

~~~
est31
And for code there is TFS: [https://github.com/redox-os/tfs](https://github.com/redox-os/tfs)

------
Immortal333
Shameless plug: I did something similar in my OS course, but in C. GitHub:
[https://github.com/immortal3/EbFS](https://github.com/immortal3/EbFS)

Warning: terribly written, many hacks.

~~~
RealityVoid
Soooo... how does it work?

I'm not asking about the structure or how it's organized. I mean... is the
filesystem in a file or... how?

Background: I mostly do embedded stuff, so at a glance I would have expected
low-level primitives (like HW interactions, registers and stuff), but I see
none. So maybe my expectation of interacting with the HW directly when
tackling a problem doesn't hold in modern environments.

Even better, but unrelated question: how the heck does an x86 OS request data
from the HDD?

~~~
mcpherrinm
You'd presumably have some "block device" abstraction between your filesystem
and your device driver. Don't want to re-implement a FS for each type of
hardware. On a Linux system, you can read, eg, /dev/sda1 from userspace, which
is what it looks like this filesystem probably does.
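A minimal sketch of what such a block-device abstraction might look like in Rust (the `BlockDevice` trait and `RamDisk` type are hypothetical illustrations, not from the linked project):

```rust
// Hypothetical sketch: a filesystem talks to any storage backend
// through a narrow block-device trait instead of driver-specific code.
use std::io;

const BLOCK_SIZE: usize = 512;

trait BlockDevice {
    fn read_block(&mut self, index: u64, buf: &mut [u8; BLOCK_SIZE]) -> io::Result<()>;
    fn write_block(&mut self, index: u64, buf: &[u8; BLOCK_SIZE]) -> io::Result<()>;
}

// An in-memory backend, standing in for /dev/sda1 or a disk image file.
struct RamDisk {
    data: Vec<u8>,
}

impl BlockDevice for RamDisk {
    fn read_block(&mut self, index: u64, buf: &mut [u8; BLOCK_SIZE]) -> io::Result<()> {
        let start = index as usize * BLOCK_SIZE;
        buf.copy_from_slice(&self.data[start..start + BLOCK_SIZE]);
        Ok(())
    }

    fn write_block(&mut self, index: u64, buf: &[u8; BLOCK_SIZE]) -> io::Result<()> {
        let start = index as usize * BLOCK_SIZE;
        self.data[start..start + BLOCK_SIZE].copy_from_slice(buf);
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // Four blocks of zeroed "disk".
    let mut disk = RamDisk { data: vec![0; 4 * BLOCK_SIZE] };
    let block = [0xABu8; BLOCK_SIZE];
    disk.write_block(2, &block)?;
    let mut readback = [0u8; BLOCK_SIZE];
    disk.read_block(2, &mut readback)?;
    assert_eq!(readback, block);
    println!("block 2 round-trips");
    Ok(())
}
```

The FS layer then only ever sees block indices; whether the backend is a file, a RAM buffer, or an AHCI driver is invisible to it.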

As for how you actually request data from the hard drive: there are older ATA
interfaces, and BIOS routines for them, which I suspect is what most hobbyist
OSes would use.

A more modern interface is AHCI. The OSDev wiki has an overview, where you can
see how the registers work:
[https://wiki.osdev.org/AHCI](https://wiki.osdev.org/AHCI)

------
bluejekyll
Always fun to see this type of work. I notice the usage of OsString, and it
made me wonder: does the way an OS encodes its strings potentially make this
FS non-portable between OSes? If I want to mount a drive formatted with this
FS, would the OsString be potentially non-portable?
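The concern can be shown in a few lines: on Unix an `OsString` is arbitrary bytes (on Windows it is potentially ill-formed UTF-16), so serializing one verbatim ties the on-disk format to the host's convention. A small Unix-only illustration (the filename is fabricated):

```rust
// On Unix, a filename is just bytes and need not be valid UTF-8.
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt; // Unix-only extension trait

fn main() {
    // A perfectly valid Unix filename that is NOT valid UTF-8:
    let name = OsString::from_vec(vec![b'f', b'o', b'o', 0xFF]);

    // to_str() fails, so this name has no lossless UTF-8 representation...
    assert!(name.to_str().is_none());

    // ...meaning a filesystem that stores names as raw host bytes can
    // create entries another OS cannot faithfully re-encode.
    println!("{:?} is not valid UTF-8", name);
}
```

A portable on-disk format would pin down one encoding (e.g. UTF-8, rejecting or escaping other byte sequences) rather than storing whatever the host hands it.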

There was a lot of discussion in the past around TFS
[https://github.com/redox-os/tfs](https://github.com/redox-os/tfs), my
understanding is that effort has kinda lost steam.

~~~
fiddlerwoaroof
This is really cool, I wish someone would fund it.

~~~
still_grokking
That dead[1] project?

Actually everything around "Redox" looks like:

[https://gitlab.redox-os.org/redox-os/tfs/issues/66](https://gitlab.redox-os.org/redox-os/tfs/issues/66)

[1] [https://gitlab.redox-os.org/redox-os/tfs/issues/80](https://gitlab.redox-os.org/redox-os/tfs/issues/80)

~~~
qchris
Redox is still very much active...

[1] [https://gitlab.redox-os.org/groups/redox-os/-/activity](https://gitlab.redox-os.org/groups/redox-os/-/activity)

[2] [https://www.redox-os.org/news/](https://www.redox-os.org/news/)

------
dm319
It would be nice if the intro had a brief explanation of why a disk needs to
be divided into blocks. Otherwise, I really enjoyed this read from the
perspective of a lay person.

~~~
masklinn
> It would be nice if the intro had a brief explanation of why a disk needs to
> be divided into blocks.

One reason is that HDDs simply don't have a byte-wise resolution, so there's
little point talking to HDDs in sub-sector units. Sectors are usually 512
bytes to 4k.

A second reason is being able to simply address the drive. Using 32-bit
indices, if you index byte-wise you're limited to 4 GB, which was already
available in the early 90s. With 512-byte blocks you get an addressing
capacity of 2 TB, and with 4 KB blocks (the Advanced Format) you get 16 TB. In
fact I remember around the late 90s / early aughts we'd reformat our drives
with larger blocks because the base system couldn't see the entire drive.
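The capacities above are easy to check: a 32-bit index addresses 2^32 blocks, so total capacity scales linearly with block size. A quick sanity check:

```rust
// Addressable capacity with 32-bit block indices, for a few block sizes.
fn capacity_bytes(block_size: u64) -> u64 {
    (1u64 << 32) * block_size // 2^32 blocks, each `block_size` bytes
}

fn main() {
    const GIB: u64 = 1 << 30;
    const TIB: u64 = 1 << 40;

    assert_eq!(capacity_bytes(1), 4 * GIB);     // byte-wise indexing: 4 GiB
    assert_eq!(capacity_bytes(512), 2 * TIB);   // 512 B sectors: 2 TiB
    assert_eq!(capacity_bytes(4096), 16 * TIB); // 4 KiB (Advanced Format): 16 TiB
    println!("ok");
}
```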

~~~
jagged-chisel
> HDDs simply don't have a byte-wise resolution

That's a sufficient explanation for the code. But now, why is it that disks
lack byte-wise resolution?

~~~
e12e
.. And does it hold true for ram disks?

~~~
avianlyric
Yes it would, but you would be working in cache lines rather than disk blocks.

When working with RAM your PC will transfer a cache line of memory from RAM to
your CPU caches. Again this is a feature of hardware limitations and trading
off granularity against speed.

Your CPU operates many times faster than RAM, so it makes sense to request
multiple bytes at once; your RAM can then access those bytes and line them up
on its output pins, ready for your CPU to come back and read them into cache.
This gives you better bandwidth. (It's a little more complicated than this,
because the bytes are transferred serially rather than in parallel, but
digging into the details here is a tad tricky.)

On the granularity point, the more granular your memory access pattern is, the
more bits you need to address that memory. In RAM that either means more
physical pins and motherboard traces, or more time to communicate that
address. It also means more transistors are needed to hold that address in
memory while you’re looking it up. All of those things mean more money.

And that's before we even look at memory-management units, which let you do
magical things like map a 64-bit address space onto only 16 GB of physical
RAM. Again, the finer the granularity, the more transistors your MMU needs to
store the mapping, and thus the more cost.

So really, the reason we don’t have bit-level addressing granularity is
ultimately cost. There’s no reason you couldn’t build a machine that did that;
it would just cost a fortune and wouldn’t provide any practical benefit. So
engineers made a trade-off to build something more useful.
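The pin-count point above can be made concrete: the number of address bits you need is log2(capacity / granule size), so coarser granularity directly saves bits (and pins, traces, and transistors). A small worked example:

```rust
// Address bits needed to index `capacity` bytes at a given granule size:
// bits = log2(capacity / granule). Exact for power-of-two inputs.
fn address_bits(capacity: u64, granule: u64) -> u32 {
    (capacity / granule).trailing_zeros()
}

fn main() {
    const GIB: u64 = 1 << 30;

    // 16 GiB of RAM:
    assert_eq!(address_bits(16 * GIB, 1), 34);  // byte-addressed: 34 bits
    assert_eq!(address_bits(16 * GIB, 64), 28); // 64-byte cache lines: 28 bits
    println!("ok");
}
```

Six fewer address bits per cache-line access may not sound like much, and bit-level addressing would only add 3 bits over byte-level, but those bits multiply across every bus, buffer, and lookup structure in the memory hierarchy.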

------
ravenstine
Is there any advantage in writing a custom file system for a niche purpose? It
seems like most file systems are just different variations of managing
where/when files are written simultaneously. Could a file system written
specifically for something like PostgreSQL cut out the middle-man and increase
performance?

~~~
topspin
Yes. Oracle has done this (ASM) to eliminate overhead, implement fault
tolerance and provide a storage management interface based on SQL, for
example.

I once made a 'file system' to mount cpio archives (read-only) in an embedded
system. Cpio is an extremely simple format to generate and edit (in code) and
mounting it directly was very effective.
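Cpio really is simple: in the "newc" variant (magic `070701`), each entry starts with a 110-byte ASCII header of thirteen 8-digit hex fields, followed by the name and file data. A hedged sketch of parsing such a header (the `hex_field` helper and the fabricated entry are my own illustration, not the commenter's code):

```rust
// "newc" cpio header: 6-byte magic "070701", then thirteen 8-digit
// ASCII-hex fields: ino, mode, uid, gid, nlink, mtime, filesize,
// devmajor, devminor, rdevmajor, rdevminor, namesize, check.
fn hex_field(header: &[u8], i: usize) -> u32 {
    let start = 6 + i * 8;
    let s = std::str::from_utf8(&header[start..start + 8]).expect("ASCII header");
    u32::from_str_radix(s, 16).expect("hex field")
}

fn main() {
    // Fabricate a minimal entry: empty regular file "a"
    // (namesize 2 includes the trailing NUL of the name).
    let mut header = Vec::new();
    header.extend_from_slice(b"070701");
    // ino, mode (regular file 0644), uid, gid, nlink, mtime, filesize,
    // devmajor, devminor, rdevmajor, rdevminor, namesize, check
    for v in [1u32, 0o100644, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0] {
        header.extend_from_slice(format!("{:08X}", v).as_bytes());
    }

    assert_eq!(header.len(), 110);
    assert_eq!(&header[..6], b"070701");
    assert_eq!(hex_field(&header, 6), 0);  // filesize
    assert_eq!(hex_field(&header, 11), 2); // namesize
    println!("parsed cpio header");
}
```

Because every field is fixed-width ASCII hex, both generating and parsing entries needs no byte-order handling at all, which is a big part of why it works so well for embedded use.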

~~~
formerly_proven
I suspect operating on block storage directly may be both easier and more
reliable for databases, since about 75% of the complication in writing
transactional I/O software is working around the kernel's behavior.

~~~
zzz61831
The kernel's fsyncing behavior is one thing, but just relying on a massive
amount of fragile C code running in the kernel is a significant liability,
especially if your software is a centralized database: crashes and panics will
bring down everything.

~~~
formerly_proven
Yes, and also the traditional answer was that the kernel handles weird and
complicated hardware and can talk to RAID controllers properly, but nowadays
hardware has much less variance, and RAID is rare (and arguably unnecessary
for a direct-IO database).

I think it'd be viable for an enterprise-y database to do IO directly over
NVMe. Imagine the efficiency and throughput gains you could get from a
database that (1) has a unified view of memory allocation in the system (2)
directly performs its page-level IO on the storage devices.

------
phjesusthatguy3
We've attempted this as well and it's not as simple as it seems. The issues
we've run into have made us reconsider porting our FS handlers to Rust,
although we are cautiously optimistic about later results.

~~~
ianlevesque
Any more specifics?

------
Ericson2314
My dream is to add enough type parameters so that in-memory collections can
also work as (not horribly tuned!) on-disk data structures.

It's a nice ambitious goal which can really drive language and library design.

------
sjwright
I'd be curious to experiment with a file system where all of the file and path
metadata is centrally stored in a sqlite blob. Is sqlite fast enough for
dealing with file system metadata requests?

------
shmerl
Something like bcachefs could have been written in Rust.

------
blackrock
Once you have the file system, and a scheduler, don’t you have a basic
rudimentary operating system?

How soon until someone builds an Operating System developed in Rust? Maybe
make it microkernel-based this time.

~~~
smt88
> _How soon until someone builds an Operating System developed in Rust?_

Redox[1] has been around for almost as long as Rust has. I first heard about
it 4-5 years ago.

They had an interesting competition a while back challenging people to figure
out how to crash it.

1. [https://www.redox-os.org/](https://www.redox-os.org/)

~~~
blackrock
Yeah, I heard about this project. But there’s a graveyard of dead OS projects
out there.

What’s the progress and potential of Redox?

~~~
mlindner
Redox is being worked on continuously. They just added support for gdb.

