NixOS on Btrfs+tmpfs (cnx.srht.site)
81 points by shiryel on May 7, 2022 | 53 comments



> Most subvolumes can be mounted with noatime, except for /home where I frequently need to sort files by modification time.

That doesn't sound right. Noatime turns off recording of the last access time, not modification.


> Most subvolumes can be mounted with noatime

This noatime thing is an old wives' tale that needs to die.

AFAIK, most "modern" filesystems (XFS, Btrfs, etc.) all default to relatime.

relatime maintains atime but without the overhead.

EDIT TO ADD:

Actually, I've just done a bit of searching... relatime has been the kernel mount default since kernel 2.6.30! [1]

[1] https://kernelnewbies.org/Linux_2_6_30 (scroll to 1.11. Filesystems performance improvements)
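
For anyone who wants to check what their own mounts are doing, a quick sketch (the mount point is just an example):

    # Show the options a mount is currently using (relatime will show up here):
    findmnt -no OPTIONS /home
    # Switch to noatime without a reboot, or add "noatime" to the options in /etc/fstab:
    mount -o remount,noatime /home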


> but without the overhead

The cost of atime is an extra write every time you read something.

Relatime changes this to one atime update per day (by default), low enough that it usually doesn't matter.

However, that update per day may have significant impact when you are using Copy-on-Write filesystems (btrfs, zfs). Each time the atime field is updated you are creating a new metadata block for that file. Old blocks can be reclaimed by the garbage collector (at an extra cost), but not if they exist in some snapshot.

All of this means that if you use btrfs/zfs and have lots of small files and take snapshots at least once per day, there's a noticeable performance difference between relatime and noatime.

I've been using noatime everywhere for several years and I've never noticed any downside. This is definitely my recommended solution.


If you don't use atime, there's no need to keep it around. Daily or otherwise. Unless you have reason to believe an application uses atime, it probably doesn't.

I've used noatime by default, except for a few cases where I know atime is used, in professional settings for probably two decades. Hopefully you know what kind of application you are running. There are many parameters in a system and this is just one of them.

The only times I've seen atime used have been for one or two queues, and only for the question "has this file changed since it was last read?". And that is precisely what relatime is for; the daily update is just an optional extra.
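
For what it's worth, that check is just a comparison of two timestamps; a rough sketch with GNU stat (the file name is a placeholder):

    f=somefile
    # %X = last access (atime), %Y = last modification (mtime), as epoch seconds.
    if [ "$(stat -c %X "$f")" -ge "$(stat -c %Y "$f")" ]; then
      echo "read since last modification"
    else
      echo "modified since last read"
    fi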


I prefer lazytime.


AFAIK those are independent features, I use both noatime and lazytime.
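
As I understand it, noatime decides whether atime gets updated at all, while lazytime only changes when timestamp updates are written to disk. A sketch of using both together (lazytime support varies by filesystem, as far as I know):

    # fstab-style options column: defaults,noatime,lazytime
    mount -o remount,noatime,lazytime /home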


IIRC, noatime is useful everywhere except /var/spool or /var/mail where certain daemons may depend on access time correctness.


You can't depend on access time correctness... Because someone else can come along and grep through all your files and now they're all accessed right now.


That's the theory. In practice for example courier IMAP relied (still relies?) on at least relatime for some notification aspects.


Perhaps it just triggers some wasted CPU time, as opposed to incorrect behavior.


> After freeing the new SATA SSD, I also filled it with butter. Yes, all the way, no GPT, no MBR, just Btrfs, whose subvolumes were used in place of partitions

I would not recommend doing that. It might work for now, but there's a high risk of the disk being seen as "empty" (since it has no partition table) by some tool (or even parts of the motherboard firmware), which could lead to data loss. Having an MBR, either the traditional MBR or the "protective MBR" used by GPT, prevents that, since tools which do not understand that particular partition scheme or filesystem would then treat the disk as containing data of an unknown type, instead of being completely empty; and the cost is just a couple of megabytes of wasted disk space, which is a trivial amount at current disk sizes (and btrfs itself probably "wastes" more than that in space reserved for its data structures). Nowadays, I always use GPT, both because of its extra resilience (GPT has a backup copy at the end of the disk) and the MBR limits (both on partition size and the number of possible partition types).
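
For reference, a minimal sketch of the GPT route with a single Btrfs partition (the device name is a placeholder, and this wipes the disk):

    parted -s /dev/sdX mklabel gpt
    parted -s /dev/sdX mkpart primary 1MiB 100%   # one partition, 1 MiB aligned
    mkfs.btrfs -L root /dev/sdX1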


Neat! I went looking for ways to create a protective MBR, learned a lot from the Gentoo wiki and some interesting info about how Windows does things in the link below, but the way to achieve this seems to be to just format the disk as GPT and then truncate it to the MBR (or just use one big GPT partition).

https://thestarman.pcministry.com/asm/mbr/GPT.htm


I would also not recommend formatting the disk as GPT and "truncating it to the protective MBR". Not only is there a good chance of the GPT re-appearing on its own, either because some software noticed it was corrupted and copied it from the backup copy at the end of the disk, or because some software noticed it was missing and created a new one, but there's also a chance of it once again being treated as if the whole disk were empty (since the "protective MBR" says it's a GPT disk, and the GPT has no entries). If you want to have just the single MBR sector, then create a traditional MBR with a single partition spanning the whole disk instead of a GPT "protective MBR". But that will not gain much, since you should align your partitions (IIRC, usually to multiples of 1 megabyte) for performance and reliability reasons (not as important on an HDD, where you can align to just 4096 bytes or even 512 bytes depending on the HDD model, but very important on an SSD), and the space "wasted" by that alignment is more than enough to fit the GPT.
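
If someone really wants the traditional-MBR variant described above, a sketch (again, the device name is a placeholder and this is destructive):

    parted -s /dev/sdX mklabel msdos
    parted -s /dev/sdX mkpart primary 1MiB 100%   # single partition, aligned to 1 MiB
    mkfs.btrfs /dev/sdX1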


Ah, I missed that there was a unique label indicating GPT was involved in the protective MBR; as I'd read it, it was essentially just an MBR with a single max-size entry, and I didn't consider there might be anything else; and of course, it is a bit of a pointless thought experiment. Thanks for sharing!


I would prefer to do this on zfs, for which there is a lovely installation guide on the openzfs docs site.

https://openzfs.github.io/openzfs-docs/Getting%20Started/Nix...
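
Not the guide's exact commands, but the general shape looks something like this (pool name, dataset layout and the by-id path are placeholders):

    zpool create -o ashift=12 -O compression=lz4 -O mountpoint=none rpool /dev/disk/by-id/DISK
    zfs create -o mountpoint=legacy rpool/root
    zfs create -o mountpoint=legacy rpool/home

With mountpoint=legacy, the datasets are then listed like any other filesystem (fstab, or fileSystems in configuration.nix).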


I can vouch for how well it works. Using it on my personal laptop.


I tried ZFS-on-linux with Ubuntu 21.10 and it ate my data (ZFS panics when accessing certain files). Sure, Ubuntu does have a habit of using unstable kernels, but I was still disappointed. It should be stable at this point.


That corruption was caused by a patch that Ubuntu created themselves, it was never in upstream. ZFS on any other platform would be OK.

https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/190...


I've never tried Ubuntu, but zfs has been rock solid on Fedora, Arch and Debian for me. Whatever the issue that hit you was, I hope recent versions of Ubuntu/zfs have fixed it.


Neat:) I would never use btrfs myself[0], but very happy to see people exploring all variations of these ideas. The one thing that's starting to bug me though, as I read blog posts about installing nixos: why is the install process so imperative/non-declarative? Once the system is up, the whole thing fits in configuration.nix, but to get there we still have to use masses of shell commands. Is anyone working on bridging that last gap and supporting partitions, filesystems, and mounts (I think that's all that's left?) from nix itself?

[0] I lost 2 root filesystems to btrfs, probably because it couldn't handle space exhaustion. I'm paranoid now.


I have a fork of `justdoit.nix` https://github.com/cleverca22/nix-tests/blob/master/kexec/ju... which generates installer images with an auto-partitioning script and a stub "configuration.nix" embedded in it, with basically just enough of a system to get nixops or morph to deploy to it. It's kind of a pain to get that working the first few times, since you have to wait for an image to bake and then test it in QEMU, and then make changes for NVMe, etc., but it's brought up three systems now.


Thanks! I'll have to give it a spin:)


Every two years (when I test it again) I lose a volume to btrfs, the last time two months ago, with this simple "trick":

- Fill your root partition as root with "dd if=/dev/urandom of=./blabla bs=3M"

- rm blabla && sync (we don't want to be unfair to such a fragile system)

- Reboot and end up with an unbootable /
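
A consolidated sketch of those steps, for anyone who wants to try it in a throwaway VM only (the target path assumes the test is against the root filesystem):

    # DESTRUCTIVE: only run inside a disposable VM.
    dd if=/dev/urandom of=/blabla bs=3M    # runs until "No space left on device"
    rm /blabla && sync
    reboot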

It's a mess; for a filesystem, I would declare it alpha stage.


I can't reproduce this on a 5.17.5 kernel and loop device. So if you're able to trivially reproduce it, I'm guessing it's configuration specific. It's still a bug, but to find it means providing more detail about that configuration, including the kernel version.

Ideally reproduce it with mainline kernel or most recent stable. And post the details on the linux-btrfs list. They are responsive to bug reports.

If you depend on LTS kernels then it's OK to also include in the report that the problem happens with kernel X but not kernel Y. Upstream only backports a subset of fixes and features to stable. Maybe the fix was missed and should have gone to stable, or maybe it was too hard to backport.

These are general rules for kernel development, it's not btrfs specific. You'll find XFS and i915 devs asking for testing with more recent kernels too.

But in any case, problems won't get fixed without a report on an appropriate list.


>So if you're able to trivially reproduce it,

Make an openSUSE or Fedora VM (I tested just these two), fill it, and see it not boot anymore... it is trivial.


OK this is a very different scenario than what I thought you were talking about (a broken file system). If a file system is full such that writes aren't possible, boot can fail no matter the file system. This isn't a file system problem. It's a failure to manage space.

Most distros require a read-write /sysroot, and expect the ability to write to disk. If they can't write, various services will fail and that can prevent startup from proceeding further. But without any logs, we have no idea what you're actually experiencing.

You are saying it won't boot, but that's not at all a case of a broken file system. It's an expected consequence of the file system being full. Since the example's clear intent was to make the file system completely full, it's a setup that prevents the file system from accepting further writes.

Overwriting file systems are expected to run into this problem less, but aren't immune to it. If the data write requirement is new writes, rather than overwriting, it'll fail whether ext4, XFS or Btrfs. If the requirement involves overwriting, it's expected overwriting file systems will succeed where COW file systems simply can't. It might be a valid argument in favor of non-root users being disallowed to use the last 1-5% of free space on any file system.
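
That last idea is roughly what ext4 already does with its reserved blocks (5% by default); a sketch of tuning it, with a placeholder device:

    # Reserve 5% of the blocks for root-owned processes; ordinary users hit
    # ENOSPC before the filesystem is truly full.
    tune2fs -m 5 /dev/sdXN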


>You are saying it won't boot but that's not at all a case of a broken file system

Please read my comment in full, especially this point:

- rm blabla && sync

- reboot


All these "clever" filesystems can never guarantee not to run out of space for their own metadata. That's because even to delete a file they might need more space in the journal, or to un-copy-on-write some metadata.

The mistake, however, is that even though it isn't practical to make theoretical guarantees that the filesystem won't end up full and broken, it is very possible to make such a thing happen only in exceedingly unlikely cases. One runaway dd isn't that...


>it is very possible to make such a thing only happen in exceeding unlikely cases. One runaway dd isn't that...

It's not dd, it's one process run by root that fills the filesystem with one big file. That's like the first thing I would test to see whether it can destroy my filesystem.

It's really the filesystem's responsibility: if it needs to reserve 30%, so be it; if it needs more because I wrote billions of files, so be it (even if it says "sorry, I told you I had 50GB free, but because you wrote so many small files it's now just 45GB"; after all, it can only make an estimate). But it's the filesystem's job to tell me roughly how much free space I have, and to stop writing before it really/internally can't take any more. And NOT to kill itself because I allocate 100% of it; there is just no excuse. That's the filesystem's responsibility.

PS: The clever ZFS survives that "unlikely" test easily.


Why can't they? For example, Btrfs reserves some storage for its internal use, which should be more than enough to update the journal to fix a full filesystem.
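
You can actually see that reserve; a quick check (the mount point is an example):

    btrfs filesystem usage /mnt
    # The "Global reserve" line is the space btrfs holds back for metadata
    # operations such as deletes on an otherwise full filesystem.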


Calculating exactly how much you need to reserve for the worst case is a near-impossible task.

For example, say you try to delete a file, which is part of one of multiple identical snapshots, so deleting the file doesn't free up any space, but does require extra metadata to be written (since a new directory entry will be needed that shows the file is deleted in this snapshot only).

The same operation could be done for millions of files, eating up all the reserved space. End result: full disk and unusable filesystem, even for deletes.

The alternative is not to allow file deletes to use reserved space. But now when you have a full disk, some things become 'undeletable', since the only way to free space is to delete all copies of the file, but it isn't permitted to delete any one copy of the file since the intermediate state would use more disk space.
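
A small sketch of the snapshot case, with made-up paths, on an existing Btrfs mount:

    btrfs subvolume snapshot -r /mnt/data /mnt/data-snap
    rm /mnt/data/bigfile && sync
    # The extents are still referenced by the read-only snapshot, so "used"
    # barely changes while the delete itself consumed new metadata.
    btrfs filesystem df /mnt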


What is supposed to happen is that the metadata commit fails with ENOSPC before the super block update. Thus the current super points to the current, valid working tree roots, not the partial/failed tree roots.

Btrfs won't issue the writes for super block update until the device says the current metadata transaction is successfully on stable media.

It is possible the filesystem is completely consistent (can be mounted, btrfsck finds no error), and yet not bootable due to the interruption of updates. Software updates are one transaction in user space but not atomic unless expressly designed for it. From the fs point of view, a software update might be broken up into dozens of fs transactions.

It's also possible the device lies about writes being on stable media. If the fs writes some metadata, does flush/FUA, then the super block write, and flush/FUA, the device should only write the super block after the prior write is on stable media. If it says the first flush succeeded but that write is still happening, and the super block write goes to stable media before all the metadata writes get to stable media and there's a crash or power failure, then you can in fact have a broken filesystem. The super points to tree roots that don't exist. This is definitely a device flaw, not an fs flaw.

Btrfs super blocks contain 3 backup roots. So it's possible to revert to an older and hopefully correct metadata generation (seconds to a couple minutes ago). But this has limited recovery potential. It's also completely thwarted right now if you use any discard mount option on an SSD because discard will ask the device to garbage collect recently freed metadata blocks. So the backup root trees pointed to by the super may already be zeros when they're needed.

But any need for backup roots already means some kind of device (firmware) flaw.
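
For completeness, the backup roots are reachable from userspace; a read-only recovery attempt looks roughly like this (device name is a placeholder, and IIRC older kernels spell it as a standalone "usebackuproot" option):

    mount -o ro,rescue=usebackuproot /dev/sdX /mnt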


A lot of consumer grade SSDs and flash (microSD, eMMC) don't like it when they are near full. That's why you should set reserved space and quotas. Notice your dd trick requires root, which ignores the reserved space. At some point, it's PEBCAK.


> A lot of consumer grade SSDs and flash (microSD, eMMC) don't like it when they are near full

I've heard about this, but my understanding was that when this happens, performance becomes extremely poor. While that may be quite bad, it's still worlds apart from losing data.

There's also the fact that the user may have partitioned the drive in such a way as to prevent it from ever filling up. Even root can't fill the partition beyond its size. Here, you have to go out of your way to make sure the partition doesn't fill up, or else you have a bad time. Shit happens, so this does look like a FS bug to me, much more than PEBCAK.


> While that may be quite bad, it's still worlds apart from losing data.

Except if you keep overwriting a flash-based storage system, at some point that flash storage gets destroyed (wear level). You can absolutely achieve such by having a near full filesystem on flash. Mechanical harddrives or partitions don't suffer from this issue.

Perhaps the issue occurs more quickly on btrfs, that I don't know, but it could happen on any filesystem. On the other hand, you should have backups. Personally, I use ZFS on two of my machines, with snapshot feature.


> Except if you keep overwriting a flash-based storage system, at some point that flash storage gets destroyed (wear level). You can absolutely achieve such by having a near full filesystem on flash. Mechanical harddrives or partitions don't suffer from this issue.

Wearing out is yet a different thing. I've had this happen on an SD card. It would refuse to write anything new, although it reported being mostly free. But the stuff that was already on it was readable.

I've had SD cards that got full. They didn't lose any data, and once I'd moved the things off them, they became usable again.

Granted, this was with a digital camera, so using fat32 at the time, so no fancy FS.


>Shit happens, so this does look like a FS bug to me, much more than PEBCAK.

Exactly; a swimming pool should never explode if you overfill it, but it's the user's responsibility to turn the water off to prevent "data/water loss".

That's why we made filesystems: to preserve and organize data and to tell the user/system when they can't take any more.


Without more information on the exact mount error and btrfsck --readonly output, it's a guessing game.

In the ordinary cases, btrfs full behavior is the same as other filesystems. It gets full, you can delete files. Keep in mind deleting files on any CoW filesystem is a write that consumes free space before space is freed by the delete operation. There is reserve space for this. If you hit an edge case, that would be a bug, but there are currently no known data-eating bugs. However, it's not always obvious to the user that their data hasn't been eaten if the filesystem won't mount; as in, this is indistinguishable from data loss. Nevertheless it's a serious bug, so if you have a reproducer it needs to be reported.
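
The usual escape hatches for a completely full Btrfs, as a sketch with placeholder paths and devices:

    btrfs filesystem usage /mnt              # is it data, metadata, or both that's full?
    btrfs balance start -dusage=0 /mnt       # reclaim data block groups that are allocated but empty
    # Last resort: temporarily add a spare device, rebalance, then remove it.
    btrfs device add /dev/sdY /mnt
    btrfs balance start -dusage=5 /mnt
    btrfs device remove /dev/sdY /mnt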


>In the ordinary cases, btrfs full behavior is the same as other filesystems.

Exactly that's NOT the case:

>>-Reboot and end up with unbootable /

You can test it for yourself; it has "worked" reliably for more than 8 years.


I just tried it like I mentioned in another comment in this thread and cannot reproduce it.

As much effort as you've spent complaining about this problem in this thread you could have provided a proper bug report in the proper venue. It's not going to get fixed by complaining on HN.

So I have to ask if you just want to complain about it or if you want it fixed? Both can be true and legit. But it still requires a report in the proper venue with enough detail for someone else to reproduce.


>As much effort as you've spent complaining about this problem in this thread you could have provided a proper bug report in the proper venue. It's not going to get fixed by complaining on HN.

You didn't even try... a loop device, really? And hey, I use ZFS; I don't trust btrfs, which has had problems (data loss) for 10 years. No thanks... hell, even SLES recommends XFS for data partitions.

>But it still requires a report in the proper venue with enough detail for someone else to reproduce.

Hey, how about you? Since you seem to care... install a VM (~10 minutes), fill it (~10 minutes), there you go.


Ah, so it's a special btrfs feature: as root you can kill your filesystem by simply writing a file to it. I wonder what function that serves.

But hey, how about a quota behind the scenes... you know, like ZFS? AFS? ReFS? You know, so the filesystem tells the user "sorry, can't take any more" before it really can't take any more? That would be some crazy enterprise-level stuff...

You know, a filesystem that immediately stops writing and instead cares more about the data that's already on the platter?

BTW: It was a DC hard disk.


This can happen on any filesystem under the circumstances of it becoming full as root. You are misinformed if you believe such isn't the case.

"Can't boot" is also vague. I've had data loss with XFS in 2002 or so (didn't have backups), couldn't mount filesystem anymore. Thanks to help on IRC from devs I got almost all data back. I've been able to get recover data from a dying Deathstar, too. And then there's the RAID5 write hole (can be mitigated), and RAID5 issues on btrfs (which are well known). For all we know you were using RAID5 shrug. Anyway, did you file a bug report, did you contact the devs?


> This can happen on any filesystem under the circumstances of it becoming full as root. You are misinformed if you believe such isn't the case.

Not true with ZFS and XFS; you are trying to defend an ill-designed filesystem... in typical Linux fashion ;)

It's s#it, but at least we "invented" it.


Why isn't btrfs declared abandoned and just let people move on?

Every time I read about it, someone is losing data.

Thank god Ubuntu makes zfs very easy to use. No reason to even consider touching btrfs.


How does zfs handle that test?


In both ZFS and btrfs, at initial setup you can create an extra dataset/subvolume of 1-2GB or whatever and leave it unused. If you ever fill up root and run into problems freeing up space on root, you can unallocate that extra volume and add it to root to fix the problem.
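
On the ZFS side that trick can be done with a reservation instead of an actual data-filled dataset; a sketch with placeholder names:

    # Set aside 2G that no other dataset can use:
    zfs create -o refreservation=2G -o mountpoint=none rpool/reserved
    # Later, when the pool is full and you need room to delete things:
    zfs set refreservation=none rpool/reserved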


I haven't tried that particular test, but I've had a ZFS drive filled to the brim. I just deleted a bunch of files and was back to normal. This was on a computer with a single pool, also used for booting. The system didn't even crash or anything.


dd says "no space left" and kills itself, "df" says 100% full, "zpool list" says 98% 80GB free...it's like black magic that ZFS knows to keep some space for itself ;)

And to other comments:

It was a DC hard disk, and NO, not even root should be capable of destroying the filesystem by simply writing to it; it's not 1970 anymore.

Calculating the metadata blocks to reserve should be rather trivial, since it's ONE big file. And it's not dd that is the problem; it's btrfs that cannot handle a process that writes ONE BIG file.


It would probably make sense to display <subdomain>.srht.site for submissions which match this domain pattern, similar to github.io sites.


+1. This style of domain shortening should use https://publicsuffix.org/ to determine how to trim subdomains. And lo, https://publicsuffix.org/list/public_suffix_list.dat contains "srht.site".


Perhaps we can call this a "Considered State" system. No more haphazardly rearranging bytes on your drive. Managed OS objects, linked at boot and your home dir / config dirs under VCS.

I use nixos with zfs on /home, /nix and /persist. Everything else is tmpfs, including /etc. Mostly you can configure applications to read config from /persist, but when not, a bind mount from /etc/whatever to /persist/whatever works pretty well.
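
The one-off shell version of that bind mount, keeping the commenter's "whatever" placeholder (in NixOS itself you would normally declare this rather than run it by hand):

    mkdir -p /persist/whatever
    mount --bind /persist/whatever /etc/whatever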

I will never use a computer any other way again.


> To make use of snapshots, the backup drive gotta be Btrfs as well. The compression level was turned up to 14 this time (default was 3):

Isn't this useless? My understanding is that compression is only done at file write time. When you "btrfs send" a snapshot, the data is streamed over without recompression, so there's no point in setting up a higher compression level in the backup disk.
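
For reference, the setup in question looks roughly like this (device and paths are placeholders):

    mount -o compress=zstd:14 /dev/sdY1 /backup
    btrfs send /mnt/snapshots/2022-05-07 | btrfs receive /backup/snapshots

As far as I know, a plain send stream carries the file data uncompressed and the receiving side writes it through its normal write path, so the destination's compress= mount option is what actually applies to the copy; newer kernels can optionally send compressed extents as-is, in which case it wouldn't.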



