50 years in filesystems: towards 2004 – LFS (koehntopp.info)
148 points by todsacerdoti on May 23, 2023 | 42 comments



Also, we’re getting things that can seek a lot faster than disks: Flash Storage.

NAND FTLs are by necessity log-structured because of the nature of the medium: pages can only be programmed in sequential order in each block, only entire blocks can be erased at once, and ideally you want to evenly use all blocks even when you're just updating a single (logical) sector repeatedly.
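A hypothetical sketch of a page-mapped FTL's write path shows why even an in-place "update" of a logical sector has to append at the current write frontier. All names and geometry here are made up, and garbage collection and wear leveling are elided:

  #include <stdint.h>

  #define PAGES_PER_BLOCK 64
  #define NUM_BLOCKS      1024

  /* logical page -> physical page map, the heart of the FTL */
  static uint32_t l2p[NUM_BLOCKS * PAGES_PER_BLOCK];
  static uint32_t cur_block = 0, next_page = 0;  /* current write frontier */

  static void nand_program(uint32_t block, uint32_t page, const void *data)
  { (void)block; (void)page; (void)data; /* program one NAND page (stub) */ }

  /* Every logical write appends at the frontier, because pages within a
     block can only be programmed in order and erase is per-block. */
  void ftl_write(uint32_t lpn, const void *data)
  {
      if (next_page == PAGES_PER_BLOCK) {            /* block full, move on */
          cur_block = (cur_block + 1) % NUM_BLOCKS;  /* real FTLs pick by wear */
          next_page = 0;
      }
      nand_program(cur_block, next_page, data);
      l2p[lpn] = cur_block * PAGES_PER_BLOCK + next_page; /* old copy is stale */
      next_page++;
  }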


NILFS2 is upstream in Linux - the log structure means snapshots happen for free on every disk operation, so you can get point-in-time rollback without needing to manually create disk snapshots.

It does require a userspace daemon running to compact the log, though.


Current file systems are impressive - flexible, robust, close to the hardware's performance. But I'm disappointed that we are still using such low-level models for our day-to-day computing. Files = everything is an array of bytes, and every program/library has to interpret and manage those bytes "manually", individually, and slightly differently from every other program!

It's understandable to use "files" when running retro apps, but it's way past time that a high level model rendered the concept of files obsolete.

(I can be hopeful, but I hold out no expectation of such better models. Too many backwards-compatible apps, and too much depends on our existing code.)


I think the simplicity and flexibility and lack of overall framework is the benefit. Dead simple bytes that may or may not be arranged in a way that works with the program you’re trying to open them with. Then build the relational model on top of it.

Git’s now out of style and we’re onto ____ but my storage is identical. I used to use flickr but now I dump directly to s3 and my jpgs are indistinguishable.

Especially so that some consortium of tech companies doesn't come up with a next-gen db/fs with bolt-on features that no one's asking for and telemetry to improve your file recall experience. Or logging into my fs because I need customization. For instance, any modern web app is built with overkill tech that adds complexity, because at a certain scale that complexity is necessary.

Give me trees of utf-8 encoded flat files any day. Not nested object relational models of stuff that ages faster than milk.


I don't think many people are arguing against having arbitrary byte arrays for storage and using application specific serialization formats. The real problem with file systems, imo, is that they present a leaky abstraction over something that's internally very complex. Any single operation might look simple, but as soon as you start combining operations you're going to have a bad time with edge cases.

For example, let's say you need to ensure that your writes actually end up on disk: https://stackoverflow.com/questions/12990180/what-does-it-ta...

The typical file abstraction introduces buffering on top of this, which adds additional edge cases: https://stackoverflow.com/questions/42434872/writing-program...
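To make the two links above concrete, here is a minimal sketch of the full dance in C (POSIX; error handling elided for brevity). The directory fsync in the last step is the part that's easiest to forget:

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  void write_durably(const char *dir, const char *path,
                     const void *buf, size_t len)
  {
      FILE *f = fopen(path, "w");
      fwrite(buf, 1, len, f);
      fflush(f);           /* 1. flush stdio's userspace buffer to the kernel */
      fsync(fileno(f));    /* 2. flush the kernel's page cache to the device */
      fclose(f);

      int dfd = open(dir, O_RDONLY | O_DIRECTORY);
      fsync(dfd);          /* 3. persist the directory entry for the file */
      close(dfd);
  }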

If you want ordered writes, then you have to handle this through some kind of journalling or at the application level with something like this: https://pages.cs.wisc.edu/~vijayc/papers/UWCS-TR-2012-1709.p...

And that's only something that's visible at the application level. There are plenty of similar edge cases below the surface.

Even if all of this works correctly, you still have to remember that very few systems give any guarantees about what data ends up on disk. So you often end up using checksums and/or error correcting codes in your files.
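For example, a trailing CRC over each record, here using zlib's crc32() (the record layout is made up purely for illustration):

  #include <zlib.h>
  #include <stdint.h>
  #include <string.h>

  /* A record is payload bytes followed by a 4-byte CRC of the payload.
     On read-back, recompute and compare before trusting the payload. */
  int record_valid(const uint8_t *rec, size_t len)
  {
      uint32_t stored, actual;
      if (len < 4) return 0;
      memcpy(&stored, rec + len - 4, 4);
      actual = (uint32_t)crc32(0L, rec, (uInt)(len - 4));
      return stored == actual;
  }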

Finally, all this is really only talking about single files and write operations. As soon as you need to coordinate operations between multiple files (e.g., because you're using directories) things become more complicated. If you then want to abuse the file system to do something other than read and write arrays of bytes you will have to deal with operations that are even more broken, e.g., file locking: http://0pointer.de/blog/projects/locking.html

---

It's not an accident that you are using other services to store your files. For example, S3 handles a lot of the complexity around durable storage for you, but at a massive cost in latency compared to what the underlying hardware is capable of.

Similarly, application programmers often end up using embedded databases, for exactly the same reason and with exactly the same problem.

This is a shame, because your file system has to solve many of the same problems internally anyway. Metadata is usually guaranteed to be consistent and this is implemented through some sort of journaling system. It's just that the file system abstraction does not expose any of this and necessitates a lot of duplicate complexity at the application level.

---

Edit: After re-reading the grandparent comment, it sounds like they are arguing against the "array of bytes" model. I agree that this is usually not what you want at the application level, but it's less clear how to build a different abstraction that can be introduced incrementally. Without incremental adoption such a solution just won't work.


Those are all good points. I will read the rest of the links!

My question is: can those uncertainties be fixed with a less performant but ordered and safe file system for typical application use, keeping the bleeding edge, with its plenty of sharp edge cases, for high-performance compute work where application programmers can handle it at the app level? Because it's nuts how fast hardware is and how inexpensive RAM is, and I think adding +30% time to file write IO would not greatly impact the user experience vs all the other causes of lag that burden us, like network and bloat.

Then in the HPC world, if a new byte cloud, where all context lives in some database with a magic index, naturally comes to be, we can move to that. I won't rule out needing to change the underlying file system, because that's pretty far over my head and there are good ideas I don't understand.

My point is to push against the proprietary-format, vendor-lock-in file system abstractions like I get with nested objects in Microsoft PowerPoint or Word or Apple GarageBand, where the app is merely wrapping files and hiding the actual data that you could otherwise pick up and move to another app. I don't want to need to adopt a new way of thinking about pretty simple objects for every different program.

I like wavs > flac, plain text > binary, constant bit rate > variable bit rate, sqlite > cloud company db (not really fair but just saying sqlite one-file db is amazing). Storage is inexpensive and adding in layers to decode the data runs a risk of breaking it and I like interoperability. Once you lose the file data and just have content clouds there might be compression running on the data changing the quality, e.g. youtube as a video store with successive compression algorithms aging old videos.

It drives me nuts when, needing to attach things, I'm faced with a huge context list where I'd rather go find a directory. Abstractions are just that, mental models to avoid the low level stuff. I'm still cool thinking of my information as file trees; I think that's an OK level. But you're right, complex operations with a file system have issues. I've messed up logging and file IO by not thinking them through, and it made me realize I needed to fix my mistaken code.


There are roughly three ways you can look at files.

The first is the traditional way: a file is a bag of bytes. Operating systems could do a better job of handling bags of bytes (really, they should default to making sure that the bags of bytes are updated atomically--you either see only the old bag of bytes or the new bag of bytes, never a weird mixture of both), but this is the fundamental view that most APIs tend to expose.
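The standard application-level workaround today is the write-a-temp-file-then-rename pattern, relying on rename() being atomic within a POSIX filesystem. A sketch (error handling elided; the ".tmp" naming is just illustrative):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  void replace_atomically(const char *path, const void *buf, size_t len)
  {
      char tmp[4096];
      snprintf(tmp, sizeof tmp, "%s.tmp", path);

      int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      write(fd, buf, len);
      fsync(fd);          /* make the new bytes durable before they're visible */
      close(fd);

      rename(tmp, path);  /* readers see the old bag or the new bag, never a mixture */
  }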

The second is that a file is a collection of fixed-size blocks, stored at not-necessarily-contiguous offsets. This is where something like mmap comes into play, or sparse files. A lot of higher-level formats actually tend to be built on this model, and this tends to be how the underlying storage thinks of files.

The third is that a file is a collection of data structures. It's tempting to think that the OS should expose this view of files natively in its API, but this turns out to be a really bad idea. If you limit it to well-supportable primitives, it's too simple for many applications, so they need to build their own serialization logic anyways. Cast too wide a net, and now applications have to worry about representing things they can't support. Or you take a third option and have a full serialization/deserialization framework that allows custom pluggable things, which is a ticking time bomb for security.


The "stream of bytes" model is what lead to easy data interchange and interoperability. There were plenty of proprietary "structured file" schemes invented in the past, but (fortunately) none of them seem to have become widespread.


I agree that where we are now is bad, but I also think files could be an answer too.

What we saw in 9p was a file orientation as well, but the files were much finer-grained structures. We can see various kernel interfaces like /proc and /sys where we have file structures representing bigger objects too.

Rather than use the file system structure, apps have been creating their own structures within files. This obstructs homogeneous user access to the data!

If we could start to access finer-grained data, start to have objects as file-system trees, I think a lot of progress could be made in computing, especially vis-à-vis the rifts of human-computer interaction. It would give us leverage to see & work with the data broadly, rather than facing endless different opaque streams of bytes.


I think the closest thing to what you are looking for is SQLite.

It is basically designed to be an fopen replacement. It is designed to be robust. The relational model is very flexible. It provides great interoperability and backwards compatibility.
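A minimal sketch of what that looks like with the documented C API (error handling elided; "notes.db" is just an example filename):

  #include <sqlite3.h>

  void save_note(const char *text)
  {
      sqlite3 *db;
      sqlite3_stmt *st;

      sqlite3_open("notes.db", &db);  /* one ordinary file on disk, like fopen() */
      sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS notes(body TEXT)", 0, 0, 0);

      sqlite3_prepare_v2(db, "INSERT INTO notes VALUES (?)", -1, &st, 0);
      sqlite3_bind_text(st, 1, text, -1, SQLITE_STATIC);
      sqlite3_step(st);               /* runs in its own transaction by default */
      sqlite3_finalize(st);
      sqlite3_close(db);
  }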


Are you proposing something similar to the Apple Newton Soup? https://en.wikipedia.org/wiki/Soup_(Apple)


I'm thinking something where the system maintains "objects" of arbitrary types, presumably they could include multiple other objects. You just access objects and the system makes them available - the object could be a document, a game, a purchase order, etc. The system would also handle moving them around - what we know as "networking".

(A little like Smalltalk.)

In that case, I see no need for lower level data. No need to read text into memory, no need to serialise anything, no need to open/read/write/close. So all the work we do to handle low level stuff becomes obsolete :) (Or at least provided once by OS programmers.)


Reminds me of KeyKOS and presumably other capability-oriented OSes (none of which I've ever gotten familiar with, I'm afraid).


Not familiar with that one. Thanks for the pointer.



Most technology is able to do useful things by building layers of simpler things.

Files are not sequences of bytes in day-to-day computing. They are videos, or databases, or applications. In fact, a lot of the time while you're doing your day-to-day computing, thousands of files are being accessed and you wouldn't even know it.


This has attracted a lot of flak, but you can see from actual usage that "S3 blob" is a not-quite-filesystem API that people actually use. Given all the latency and mutability tradeoffs, it might be useful to have something that sits on the PCIe bus and speaks Blob.


In some mainframes and micros, the filesystem is based on a database model; there are no files.


I'm familiar with OS400 and Pick. Pick provided the DB table as its lowest level of storage.


this is the same thinking that gave us the 'advanced intelligent network'

current ip networks are impressive - flexible, robust, close to line speed. but i'm disappointed that we are still using such low level models for our day to day computing. tcp/ip = everything is a sequence of packets and every computer has to interpret and manage those packets, 'manually', individually, and slightly differently than other computers do!

it's understandable to use 'packets' when running retro apps, but it's way past time that a high-level model rendered the concept of packets obsolete

that's not a quote from a pre-stupid-network bellhead 25 years ago but it could have been

or the intel iapx432

current cpu architectures are impressive - flexible, robust, with impressive performance. but i'm disappointed that we are still using such low level models for our day to day computing. 8086 = everything is a sequence of computations on 16-bit integers and every program/library has to interpret and manage those 16-bit integers, 'manually', individually, and slightly differently than other programs do!

it's understandable to use '16-bit words' when running retro apps, but it's way past time that a high-level model rendered the concept of untyped words obsolete

in fact file storage forms the same sort of nexus as the rest of the posix system call interface, the 8086 instruction set, ip packets, bytes, and dollars: many things can store files fairly efficiently, and many things can use them for many different purposes, and the nexus permits those things to evolve independently with minimal coupling to one another

(there are many ways the posix concept of files could be improved, which is also true of 8086)

if we want to replace files with a better storage interface, it should probably be something dumber rather than something smarter

'it's done in the os so it's simple' is the same kind of cognitive error as 'it's done in the hardware so it's cheap' https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... (though see https://blog.cr.yp.to/20190430-vectorize.html for some 02019 updates on the relative costs of things like dispatching and floating point)

actual good systems design amounts to more than 'move the problem somewhere where i don't understand what's involved in solving it anymore'


https://www.oilshell.org/blog/2022/02/diagrams.html is maybe a better explanation of this idea


I'm still hoping that the OS will make our lives (programmers and users) easier, rather than us having to know about the low-level stuff, even if (these days) we usually handle it by invoking yet another library.


despite the needlessly hostile tone of my initial message, i'm curious what you see as the pros and cons of "yet another library" knowing about the low-level stuff rather than the kernel knowing about it


The cons that come to mind:

Size: The more externals you drag in, the bigger and less efficient the apps will get. An extreme version of this is deployment packages like flatpak and snap - you drag in a substantial copy of an OS! Bigger needs more storage, slower startup, etc.

Difference: Every library does things a little differently - uses different init schemes, different serialisation, different naming conventions, etc. Makes for steeper learning curves (for users and programmers), incompatibilities, etc.

Knowledge: over a few projects, you'll need a fairly detailed understanding of many libraries. Makes sense to consolidate this into one place (the OS). It will be less than perfect for some apps, but realistically, so are the libraries.

Bugs: Really a by-product of the above - but bigger systems (more libraries and/or more hand-cut low level code) invariably have more bugs.

I'm afraid the only pros that come to mind:

Potential performance: low level stuff gives you the option to fine tune. But: #1 for almost all apps, a better design gives performance more easily than low-level coding, #2 the industry default is to accept poor performance and buy a faster machine ;)

Familiarity: using a new paradigm involves changing thinking and skills - uncomfortable and time consuming.


these all seem like arguments for standardizing on a particular implementation of any given functionality, rather than having many different implementations

they don't seem to bear on the question of whether to put that functionality in userspace or in the kernel

you are presumably aware that at times these advantages of standardization on a single implementation are outweighed by its disadvantages; see http://akkartik.name/freewheeling for some interesting exploration of the opposite extreme from the kind of high modernism that seems to be your ideal


added...

> (I can be hopeful, but I hold out no expectation of such better models. Too many backwards-compatible apps, and too much depends on our existing code.)

I see it that we (you, me, almost all programmers) are so practiced at the "file" way of thinking, that we genuinely struggle to look far beyond that paradigm. We see the advantages of "files" but have no experience with much else, so we struggle to make comparisons.


The article claims garbage collection was invented in the JVM! I wonder what that old DEC-20 was doing when it reported to all terminals that garbage collection was ongoing...

Or what the mark/release garden of eden model of Smalltalk was...


The article claims JVMs were invented around the same time, and that they happened to include garbage collectors.


Speaking as a guy who's done enterprise storage for close to 30 years, the main issue here is IO stack integration. There's almost none. There are people like Oracle that try to bypass at least some of these disconnected layers that don't work well together, but why don't the drive vendors do this? Intel makes a compiler for their CPUs. Why isn't there a WDFS that has built-in LVM?

Here's the main issue. You have your application that sits on a filesystem. The filesystem tries to predict what the application is doing. That sits on a volume manager. That's just a dumb table of pointers. That sits on top of a disk drive, talking to the RAM on the drive. Then you have the backend of the disk controller trying to predict what to put in RAM.

Oracle knows best what it's going to need next from the disk, based on the query it's running: whether it expects a drop in IO soon during which the disk can do background cleanup, or whether it's about to do a lot of reads in 3 minutes, once it's done with a lot of writes. The filesystem has no idea. The disk controller has no idea. Wouldn't it be great, more performant, and less wasteful, if the application could tell the disk drive about its behavior using some sort of standard API, and the disk controller could translate that into what the backend disk should do - whatever the various types of spinning rust or different flash types require?

TRIM is a very basic example of that. What we need is more things like TRIM, so that application IO libraries can tell the backend controller their intent. That API belongs in the filesystem, which should just blindly pass it on, all the way to the backend.
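The closest thing that exists today at the filesystem layer is probably posix_fadvise(), which at least lets an application declare its access pattern; how far the hint travels down the stack is up to the kernel. A sketch:

  #include <sys/types.h>
  #include <fcntl.h>

  void sequential_scan_hints(int fd, off_t len)
  {
      /* "I'm about to read this linearly" - the kernel can ramp up readahead. */
      posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);

      /* ... do the big sequential read ... */

      /* "I won't need this again" - the kernel can drop it from the page cache. */
      posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
  }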


> why don't the drive vendors do this? Intel makes a compiler for their CPUs. Why isn't there a WDFS that has built-in LVM?

Given the quality of firmware in RAID controllers and disk drives and... er, everything, actually, I would really rather that they do as little as possible, unless they're going to make the firmware open source so we can fix the bugs.


it doesn't even need to be open source, just reprogrammable.


Why would that be enough? Being able to replace the firmware doesn't help if you don't have something to replace it with, and part of the problem is vendors not bothering to ship bugfixes.


It wouldn't help in the short term, but in the same way that the 3rd-party open source community made OpenWRT for routers, over time people other than the OEMs could develop 3rd-party firmware for SSDs. It would obviously be better if the OEMs would open-source their firmware, but with enough time, open source re-implementations could be made.


This is a very general issue in computing. You could make many of the same arguments about a web app running on a computer and all the involved modules (graphics, networking, JS VM, the app code itself, etc). We have abstractions and interfaces that enforce separation of concerns, which gives us many desirable properties, but at the same time there's a temptation, especially to gain performance at the cost of modularity, to commit some "layering violations" and take advantage of knowledge of other modules' unexposed internals.

I think one way around it, letting us have our cake and eat it too, would be to enable some whole-system program transformations, a bit like what unikernels have started nibbling at the edges of.


I did some searching & am a bit shocked: I couldn't find any way to adjust io priority other than by altering the entire process's io level. I would have thought this would be a semi-commonly used routine, to set high/regular/low priority QoS for io, but indeed, per your claims, I can't seem to find anything.

Hypothetically one could maybe spawn a bunch of child processes and give them each their own io priority? Or maybe io priority is sticky: one could change it just before doing some io work, and that work would keep its priority even after the priority is changed again for the next operation?

I feel like we have a bunch of possible things we could do to get better qos with the controls we have.

There are also a variety of madvise hints we can provide, telling the kernel what we will need, what to drop, what we won't need, and what will be random access (no benefit from lookahead) vs. sequential. These are already some pretty useful knobs, which I'd guess are quite broadly underused.
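For mapped files that looks like the sketch below (these are hints, and the kernel is free to ignore them):

  #include <sys/mman.h>
  #include <stddef.h>

  void advise_mapping(void *addr, size_t len, int will_scan)
  {
      /* Sequential: prefetch aggressively. Random: readahead would be wasted. */
      madvise(addr, len, will_scan ? MADV_SEQUENTIAL : MADV_RANDOM);
  }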


On Linux, you can set I/O priority on a per-thread basis with the ioprio_set syscall (glibc provides no wrapper, so it's invoked via syscall(2); pid 0 means the calling thread):

  syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, val);


NVMe devices can support multiple namespaces and each namespace is assigned a specific command set upon creation, normally NVM with LBA. But there's also a key-value command set. I'd expect NVMe KV-enabled devices to directly use their FTL for the mapping.

Zoned namespaces provide a "trimless" future, as zones are allocated explicitly, written sequentially and must be released explicitly by the host.

edit: I've worked on ACID stuff before and another thing that's kinda annoying is how poorly FS APIs line up with both what you want for ACID databases and how the hardware works. FS APIs are "flush/sync" oriented, somewhere between device and byte-range/sector granularity. Log-structured databases, which is most of 'em (page-oriented RDBMS with WAL are effectively log-structured), don't need or care about that, it's just an additional complication. They really only need barriers. Hardware also has barriers, at least on paper. FTLs in SSDs provide barriers for free almost by definition; writes go to fresh NAND, but they're only visible once the log entry in the FTL is persisted. Writes between FTL flushes can be reordered any way, doesn't matter, if power fails all of them are either gone or visible.


I think that's for the same reason most OS schedulers don't have functionality for applications to tell them things such as "this program needs m MB of RAM, s seconds of a standard CPU, doesn't use vector instructions, will do r I/O reads and w writes to disk d, and has to finish before 8 PM": on general-purpose systems, scheduling with that information is effectively an intractable problem.

Also, even if the OS could compute an optimal schedule, that may not be so good that it makes up for time spent computing that schedule.


Isn't this also the same reason 'prosumer' storage hardware / use of off-the-shelf stuff mostly doesn't exist? If the storage manufacturers dared to provide a low-level interface to the real hardware, without the abstractions that make things easy for Windows, they'd both get their lunch eaten (by everyone who moves their current excuse for market segmentation into the OS / database daemons) and take a loss on the majority market share of dumb-as-bricks Windows, which lacks a mature VFS API other than NTFS (its de facto VFS API, which MS should just declare that all new filesystems must implement, given the crushing weight of legacy).


Smart idea. There are a bunch of different basic strategies and policies that could be implemented in a series of weekend projects: ranged/extent reads and writes, upcoming allocations, locality-sensitive data, short-range vs. long-range data structures, access frequency estimates, historical file size estimates, etc.

This pairs well with microkernel architectures too: a separate FS policy manager service that is pluggable. You could write a dozen simple policies in a month and also shore things up in terms of open-source defensive patents.

Or, if you're a commercial house and not worrying about day-to-day operations you could fill your patent portfolio.

Smart idea.


Storage manufacturers now give the application more control over SSD FTL operations through ZNS: https://zonedstorage.io/docs/introduction/zns. Curious to see how it will be used.


Your writing made me think of the fact that Pure Storage is designing its own (flash-based) drives for its storage appliances… I wonder if they're doing what you're saying, in their own stack at least.



