Ask HN: Why don't file systems and OS's provide file system transactions?

chrsig · on July 23, 2022

i don't think it's enough to just say "provide transactions" -- that's way too general.

what sort of semantics would you want out of the transaction? atomicity? isolation? durability? how should concurrency behave? should there be a mvcc implementation?

linux provides atomic writes up to 4k. moves on the same fs are also atomic.

fsync ensures writes are durable and written to disk (allegedly[0])
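A minimal sketch of how those two primitives (fsync plus a same-filesystem rename) are commonly combined into an all-or-nothing file replacement. The helper name `atomic_write` is made up for illustration, the directory fsync assumes a POSIX system, and none of this helps if the disk lies about fsync.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the contents to disk
        os.replace(tmp_path, path)  # rename over the target in one step
    except BaseException:
        # Remove the temp file if anything failed before the rename finished.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also sync the directory so the new name itself survives a crash (POSIX).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This covers exactly one file. It gives no isolation between concurrent writers and no way to update several files as a unit.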

file advisory locks can be used to ensure mutual exclusion. or you can memory map the file into shared memory and allocate a mutex in it (libapr provides a few options for interprocess mutual exclusion)
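For the advisory-lock route, a rough sketch using `flock` through Python's `fcntl` module (Unix only). The `exclusive_lock` helper and the lock-file convention are assumptions for illustration. The lock is advisory, so it only works if every participating process takes it.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on `lock_path` for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds the lock
        yield fd
    finally:
        # Closing the descriptor releases the lock (it was never duplicated here).
        os.close(fd)
```

Usage would look like `with exclusive_lock("/tmp/myapp.lock"): ...` around the critical section, where the path is whatever all cooperating processes agree on.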

...but in reality, if you need transactional semantics, you're really just better off using a database. because the database developers will have a much better idea of the nuances that applications need from transactional semantics than the kernel devs will.

and if you want your program that requires transactional semantics to be portable, major database vendors have already dealt with inconsistencies across multiple major operating systems. because of that the database gives one system to handle transactions, versus pushing the portability concerns onto each individual application.

[0] https://news.ycombinator.com/item?id=19119991

chrsig · on July 23, 2022

I'll also add that if it's a case that doesn't require more than one process writing simultaneously, sqlite really shines. I know that's probably not satisfying in a thread that starts as "i want fs transactions" - but it's a solid consolation prize.

they've put a lot of effort into making it work well even in likely-to-crash situations (e.g., running on a phone with a user who doesn't tend to the battery)

https://www.sqlite.org/fasterthanfs.html

https://www.sqlite.org/atomiccommit.html

https://www.sqlite.org/hirely.html
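To make the "use sqlite instead" suggestion concrete, here is a small sketch that treats a SQLite database as a file store and writes several "files" in one transaction. The `files` table and the helper names are invented for this example; only the standard `sqlite3` module is used.

```python
import sqlite3
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a database holding one row per stored file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def write_files(conn: sqlite3.Connection, files: dict) -> None:
    """Write every entry in `files` (name -> bytes), or none of them."""
    with conn:  # commits on success, rolls back if the block raises
        for name, data in files.items():
            conn.execute(
                "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
                (name, data),
            )


def read_file(conn: sqlite3.Connection, name: str) -> Optional[bytes]:
    """Return the stored bytes for `name`, or None if it does not exist."""
    row = conn.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

If any insert in `write_files` fails, the earlier ones in the same call are rolled back, which is the multi-file guarantee that plain rename cannot give you.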

andreareina · on July 23, 2022

On the contrary, "sqlite competes with fopen()" and so is a perfectly crimeless alternative/suggestion

andreareina · on July 23, 2022

s/crimeless/cromulent

dataflow · on July 23, 2022

> I have read that Windows has a transactional API, but they've actually deprecated it! [2]

IMHO you can probably ignore their deprecation and keep using it. The set of things MS deprecates and the set of things they actually remove from the OS are quite different. IIRC their own components depend on FS transactions and I don't see them rewriting their own components anytime soon. However, note that even without deprecation, transactions can fail for a variety of reasons (not just conflicts), so you'll need fallbacks anyway.

> why does Windows hide its equivalent of `openat()` in the NT API?

I don't know for certain but I've always imagined it's because (a) Win32 programmers (or for that matter, most programmers) are used to the path-based API, and (b) it would be much slower to perform manual traversal level-by-level, and (c) I think in practice there aren't that many common scenarios where the race condition can realistically turn into a security vulnerability.

codeflo · on July 23, 2022

As a general principle, I think low-level APIs shouldn’t provide abstractions that are both expensive and have no clear “best” design: in those cases, you want your applications to have the ability to make different trade-offs, rather than being locked in to a design that might be suboptimal for your use case. I think sibling comments explain nicely how that’s the case for transactions; I just wanted to point out why that matters in the big picture. Similar arguments can be made for other higher-level features that someone might wish their OS provided, like (tracing) GC.

klodolph · on July 23, 2022

Our existing file APIs already provide abstractions that are both expensive and have no clear best design.

There’s a lot that these APIs hide from you in order to make it so you can pretend that this is the 1970s and your file is essentially a piece of tape. We’ve then built filesystems and directory structures on top of this piecemeal. The abstraction covers all these fancy journaling structures and a whole page cache.

For something that dives into the consequences, see the classic paper “Don’t stack your log on my log”:

https://www.usenix.org/system/files/conference/inflow14/infl...

deadmutex · on July 23, 2022

> Our existing file APIs already provide abstractions that are both expensive and have no clear best design.

Even if that is the case, let's not make it worse :).

i_have_to_speak · on July 23, 2022

> Quite frankly, I don't know why filesystems don't provide these things.

They do. F2FS [1] does. There were attempts to add them to xfs/ext4 too, but they petered out, probably because of lack of interest.

[1] https://www.kernel.org/doc/html/latest/filesystems/f2fs.html

vivegi · on July 23, 2022

That is because almost all applications use filesystem calls (directly or indirectly), but only some apps may need a transactional API.

Consider a process P1 that uses a fictional transactional API in an OS, accesses the path `/a/b/c`, and creates some files under directory `c`.

Consider another process P2 executing a `mv /a/b /x`.

P1 uses the transaction API, but P2 does not. So, under the covers, all system calls would have to go through the new transactional API to ensure global correctness. That is asking a lot of the kernel and possibly makes a lot of legacy programs slower.

The other question to answer is: do we want ACID properties or eventual consistency to be guaranteed by the transaction? What to do when some processes want ACID guarantees and some processes are okay with eventual consistency? How does the kernel handle these processes running concurrently and contending for the same resources under two different transaction semantics?

These are some of the reasons why the transaction management is better handled in userspace.

rvdginste · on July 23, 2022

I have wondered about that too. But when I think about a file system transaction, I immediately think about a file system transaction enlisting in an ambient transaction, together with a database transaction. This would make code much simpler/cleaner for cases where you must create a file and store metadata on the file inside the database. And on the other hand, some database systems do provide features for storing large file-like blob objects, which give you these transactional features.

So I think it depends on the context and what you wanna use it for. I don't see the transactional features of a file system as useful for actual users who directly interact with files on their file system. It seems more useful in the context of applications that maintain files and where you likely do not want the user to directly interact with those files.

marcell · on July 23, 2022

Move in Unix is atomic, which handles a lot of the common use cases for a transactional file system.

bhawks · on July 23, 2022

Move _on_ the same filesystem/partition is atomic ;).

beagle3 · on July 23, 2022

and only on _files_. If you want to atomically replace a whole directory, "move" alone can't do it - though you can compensate with a symbolic link.
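A sketch of that symlink trick, assuming a layout where readers always go through a link (say `current`) that points at a versioned directory. The `publish` helper and the naming scheme are hypothetical. The new link is created under a temporary name and then renamed over the old one, which is the atomic step.

```python
import os


def publish(link_path: str, new_target: str) -> None:
    """Atomically repoint the symlink `link_path` at the directory `new_target`."""
    tmp_link = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(new_target, tmp_link)
    try:
        # rename(2) acts on the link itself, so the old link is swapped in one step.
        os.replace(tmp_link, link_path)
    except BaseException:
        os.unlink(tmp_link)
        raise
```

Readers that already opened files under the old directory keep seeing the old version, so the old directory can only be deleted once they are done.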

mook · on July 23, 2022

Isn't that only specified (for the same filesystem, as mentioned) for Linux? As far as I can tell the only thing POSIX specifies came from ISO C:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/r...

ISO C 9899:2018 (actually the 2017 draft) says in 7.21.4.2.3 that on a failure return the file is available by its original name, but that's a much weaker guarantee as that doesn't say anything about failing to return (e.g. hard system crash)…

IgorPartola · on July 23, 2022

creat is as well if you do the magic correctly. Well except if the file system is network mounted. Maybe. And if another process respects what you are doing.
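One reading of "the magic" here, and this is an assumption, is exclusive creation with `O_CREAT | O_EXCL`: the open either creates the file or fails because it already exists, with no window in between. A minimal sketch with a made-up `try_claim` helper:

```python
import os


def try_claim(path: str) -> bool:
    """Create `path` only if it does not exist yet; return True if we created it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # someone else got there first
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True
```

As the caveats above say, this only coordinates processes that all check for the same file, and behavior on network mounts is a separate question.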

zamalek · on July 23, 2022

It's trivial to implement with CAS (e.g. Git is a transactional file system). That's a lot of code/time/money/attack surface to spend on the kernel when it is so easy to do in userspace.
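A toy sketch of the content-addressed idea, loosely in the spirit of Git but not its actual format: blobs are stored under their hash and never modified, a snapshot is a manifest of hashes, and a commit is a single atomic rename of a `HEAD` file. The `Store` class and on-disk layout are invented for illustration, and concurrent writers are not handled (the last rename wins).

```python
import hashlib
import json
import os


class Store:
    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(os.path.join(root, "objects"), exist_ok=True)

    def _put(self, data: bytes) -> str:
        """Store `data` under its SHA-256 digest and return the digest."""
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, "objects", digest)
        if not os.path.exists(path):  # identical content is stored once
            tmp = f"{path}.tmp-{os.getpid()}"
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        return digest

    def _get(self, digest: str) -> bytes:
        with open(os.path.join(self.root, "objects", digest), "rb") as f:
            return f.read()

    def commit(self, files: dict) -> str:
        """Store a snapshot of `files` (name -> bytes) and make it current."""
        manifest = {name: self._put(data) for name, data in files.items()}
        snapshot = self._put(json.dumps(manifest, sort_keys=True).encode())
        head_tmp = os.path.join(self.root, f"HEAD.tmp-{os.getpid()}")
        with open(head_tmp, "w") as f:
            f.write(snapshot)
            f.flush()
            os.fsync(f.fileno())
        # The only visible state change: HEAD flips from old to new snapshot.
        os.replace(head_tmp, os.path.join(self.root, "HEAD"))
        return snapshot

    def read(self) -> dict:
        """Return the files of the current snapshot."""
        with open(os.path.join(self.root, "HEAD")) as f:
            snapshot = f.read()
        manifest = json.loads(self._get(snapshot))
        return {name: self._get(d) for name, d in manifest.items()}
```

A crash before the final rename leaves some unreferenced objects behind but never a half-updated snapshot.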

hansvm · on July 23, 2022

It's only easy to do in userspace for consenting adults though. The whole point of an OS is to manage contended resources, and this particular contended resource requires every process to implement their own way of safely interacting with every other process.

Joker_vD · on July 23, 2022

And now I have flashbacks of trying to use flock(2)/lockf(3) for their intended purposes... such miserably misdesigned (and underdocumented) pieces of file API, bloody hell.

bhawks · on July 23, 2022

The filesystem abstraction can only work when it's used by cooperating, well-mannered consenting adults. At its core it is just a hierarchical key-value store where the core OS plays a middle manager role, connecting applications looking up keys and filesystem plugins serving values and persisting them onto the kernel's block device APIs.

It certainly can be a convenient and productive abstraction but the model itself has very clear natural limits.

bhawks · on July 23, 2022

Transactions are for databases, databases are for sets of 1 or more processes that are coupled to, permissioned for and willing to cooperate with each other.

Transactional operations bring the chance of deadlocks. Deadlocks have performance and denial-of-service implications. Deadlocks are far easier to detect than to prevent or avoid. Detected deadlocks are resolved by killing one of the requests, which must be handled by the cooperating and highly coupled processes.

The filesystem is an abstraction of convenience and very loose rules. Instead of all the structure and rigor a database brings, a program just gives a string to the OS and gets bytes back. The cost of this ease of use is that you must keep your program's demands and expectations low.

chrismorgan · on July 23, 2022

It feels to me rather like the difference between cooperative versus preemptive scheduling. Some mainstream operating systems used to use cooperative scheduling. All have long shifted to preemptive scheduling. File system transactions would be global in the same way kernel-space cooperative scheduling is—and one malicious or even buggy process can now bring down the whole system.

In a fully controlled environment you could do it, just like you can do cooperative scheduling, but outside of such an environment you just can’t do it in any way sanely.

charcircuit · on July 23, 2022

A filesystem is a database.

wmf · on July 23, 2022

I read that the Tandem NonStop OS from the 1980s had a built-in transaction manager (because the OS was designed to run databases) and they built their filesystem on top of the transaction manager which gave them filesystem transactions for "free".

JPLeRouzic · on July 23, 2022

I think the Pick OS, at least in its native form, had no file system but used a database.

It was very easy to program. I saw colleagues with no computer science education writing complex queries in "Pick Basic", which would have been impossible for them on classical computers of the time (or of today as well).

https://en.wikipedia.org/wiki/Pick_operating_system

TheAceOfHearts · on July 23, 2022

I think macOS at some point supported file system transactions as well. If you look through the AppleScript docs there used to be references to file system transactions, but it appeared incomplete and undocumented.

Would love to hear more details if anyone is knowledgeable of this arcane history.

jbverschoor · on July 23, 2022

I guess the number of comments explains it.

nobozo · on July 23, 2022

Mike Stonebraker and others are working on an OS that is based on a database.

Take a look at https://vldb.org/pvldb/vol15/p21-skiadopoulos.pdf

jasfi · on July 23, 2022

Concepts like transactions are often about trade-offs between performance and features (which introduce complexity). It's likely that the OS architects realized that if devs wanted transactions they'd use a database.

dmpk2k · on July 23, 2022

ZFS does, although to fully exploit it you'll need to make DMU calls.

gavinhoward · on July 23, 2022

And this is one of the many reasons I use ZFS personally.