Linux: Ext4 data corruption in 6.1.64-1 (debian.org)
226 points by zdw 5 months ago | 131 comments



Here's my understanding so far:

In the upstream Linux kernel there were two fixes posted months apart, one for direct I/O [0] and the other for ext4 [1]. The ext4 one was marked for backport to stable (CC: stable@vger.kernel.org); the other was not. The problem is that these commits depend on each other to work properly. If you have both, you're fine. If you have only the backported one, you have a problem.

Which versions are affected? We know for sure that 6.1.64 is affected and that 6.1.55 is not (it doesn't have the commit). As of right now, 6.1.64 is still marked as "stable" in Debian [2], but if you actually try to install it from the official mirrors (deb.debian.org) you will get a 403 error. The fix is included in version 6.1.66, which should be available soon.

The issue is mostly being discussed in the context of Debian, but it is not specific to Debian: the problem is/was in the official upstream stable release.

[0] https://github.com/torvalds/linux/commit/936e114a245b6e38e0d...

[1] https://github.com/torvalds/linux/commit/91562895f8030cb9a04...

[2] https://packages.debian.org/source/stable/linux
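
If you want to check a given stable release yourself, one rough way (assuming a clone of the linux-stable tree, and relying on the convention that stable backports cite the mainline SHA as "commit ... upstream." in their commit message) is to grep the log of each release range for the two commits:

  # ext4 fix [1]: its backport shows up in the 6.1.64 range
  git log --oneline v6.1.63..v6.1.64 --grep=91562895f803
  # direct IO prerequisite [0]: nothing in 6.1.64...
  git log --oneline v6.1.63..v6.1.64 --grep=936e114a245b
  # ...it only arrives in 6.1.66
  git log --oneline v6.1.65..v6.1.66 --grep=936e114a245b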


Did you update to 6.1.66? Sucker! Your data might be safe(r), but your Wi-Fi is not working! The exact same thing happened again ([0] needs [1], but that one was not backported). Fixed in 6.1.67 [2], released this morning.

[0] https://github.com/torvalds/linux/commit/7e7efdda6adb385fbdf...

[1] https://github.com/torvalds/linux/commit/076fc8775dafe995e94...

[2] https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.6...


Were any Ubuntu versions affected?


I think not, unless you've explicitly installed the 6.1 series (none of the officially supported releases ship with it by default, as far as I can tell).


6.1.65 is affected as well.


"Causes non-serious data loss". What does that mean. Only affects cat videos?


I believe the Debian policy for bug reports uses that phrasing for corruption you can recover from. Serious data loss would be losing your metadata and not being able to figure out where any of the files are, or corrupting all copies of the disk encryption key.


I think "corruption you can recover from" seems far too vague; you can recover from any corruption with backups.


If you're trying to be pedantic and missing the forest for the trees... sure.


The way I'd interpret it is that it's non-serious if fsck can repair it. Not sure if that's accurate though, just the way it sounds to me.


Sure, and you can recover from any hardware failure with a separate computer. In fact, you can recover from death simply by cloning yourself. What's the issue?

The issue is that some people don't have the time, money, and/or willpower to invest in backups. Some people only have one volume, or one copy of each volume at least, and need to recover their data using only that one volume if something gets corrupted. That's what "corruption you can recover from" means. Using a backup is not recovering from corruption; it's replacing the corruption with the pre-corrupted data. Recovering from corruption involves actually transforming the corrupted data back into non-corrupted data.


I think you're agreeing with userbinator here.


Not really. They're arguing that backups count as recovering from corruption, where my argument is that backups don't count because the only thing you're doing is discarding the corruption, not recovering from it. You're basically throwing the corruption out and getting your data back from somewhere else.

So you recover your data, but not from the corruption. You recover it from a backup. Similar, but not the same thing.


In this case, I would say that backups are not relevant protection. In order to restore something from backup, you would first need to know that a restore from backup is needed, and secondly you need to know what the last-known good copy is.

This is a silent data corruption bug (data is written at the wrong offset, IIUC) with no known method to detect the error condition except to compare the resulting data with a known-good copy. The next backup run will happily accept the corrupted file contents into the backup vault, and unless someone happens to stumble upon the corrupted data, the last-known good copy will eventually be rotated out.


I think they meant non-catastrophic (no pun intended). From what I can tell, individual files can get corrupted when opened with O_DIRECT, but the filesystem structure will never fail.

From the link in https://news.ycombinator.com/item?id=38591444 :

> file position is not updated after direct IO write and thus [we] direct IO writes are ending in wrong locations effectively corrupting data.

This may be a major problem for hypervisors using ext4-backed disk images. It wouldn't surprise me if qemu opens image files with O_DIRECT by default, but it would surprise me if professional shops were using ext4 as the backing store rather than a provisioning abstraction layer like dm_thin. Still, anyone using kvm on top of ext4 must not be having a good day.
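
If you're wondering about a specific host, one rough way to check whether a process (qemu or anything else) actually has files open with O_DIRECT is lsof's file-flag column; a sketch only, assuming your lsof supports +fg and the pgrep pattern matches your process names:

  # O_DIRECT is abbreviated as "DIR" in the FILE-FLAG column printed by +fg
  sudo lsof +fg -p "$(pgrep -d, qemu)" | grep -w DIR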


Reminds me of this exchange from Coupling (UK):

“How is it not serious if she died from it?”

“She made an unsuccessful recovery”


Funny, I also thought of Coupling, except I was thinking of

- how badly can you phrase yes?

[Cuts to other scene]

- No

Such a good show.


You lose your data but, like, it's just a prank man.


yeah, practical joke for using Linux...have a laugh and deal with it


Most data is recoverable, albeit slowly. It takes some very bad conditions or intentional actions to make data recovery impossible. Current high-level standards for military/diplomatic data sanitisation call for complete physical destruction of hard disks.

Even Linux tools like shred have given up claiming they can actually delete data from disks, due to how SSDs work these days.


I don't agree - modern NVMe drives have secure erasure mechanisms. All data is by default encrypted on the fly within the flash memory, and when a secure erase is requested, the controller throws away the key and generates a new one.

https://man.archlinux.org/man/nvme-format.1
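
For reference, with nvme-cli that erase is a one-liner; a sketch only - the device name is an example, and it's worth checking what your drive claims to support (e.g. via nvme id-ctrl) before relying on it:

  # --ses=1 requests a user-data erase, --ses=2 a cryptographic erase
  # (the controller discards its internal key). This destroys all data.
  sudo nvme format /dev/nvme0n1 --ses=2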


> I don't agree - modern NVMe drives have secure erasure mechanisms.

Assuming the firmware doesn't lie about actually doing it:

* https://www.zdnet.com/article/flaws-in-self-encrypting-ssds-...


In "reduce, reuse, recycle", this is the #2 R

Please don't trash physically good drives with a hammer. It's not good for the environment (or the drives!) when you have such a simple technology at your disposal!


What percentage of people on Earth use this in your expert opinion?

Also have you done any work in digital forensics?

Eagerly awaiting your response.


> Even Linux tools like shred have given up claiming they can actually delete data from disks, due to how SSDs work these days.

Which emphasizes the importance of enabling full disk encryption immediately whenever you start using a new device--BitLocker if you're on Windows, FileVault on macOS, LUKS on Linux, etc. Trying to decrypt data is much harder than reconstructing deleted data on a stolen drive.


True, properly zeroing out the headers on an encrypted drive will make recovery impossible.

How to do that reliably is another question.
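
For LUKS specifically, cryptsetup can at least destroy every keyslot in one command; a sketch, with the caveat (as the reply below points out) that on an SSD the old keyslot area may survive in remapped flash, and that a header backup stored elsewhere would still unlock the data:

  # wipe all LUKS keyslots on the device; without a keyslot (or a header
  # backup) the volume key can no longer be derived from a passphrase
  sudo cryptsetup luksErase /dev/sdX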


You cannot with normal tools, as writing a 0 to the SSD does not guarantee it overwrites the block you want. At best it does; at worst it writes the 0 somewhere else and remaps the block (or whatever its physical storage unit is).


This is especially a problem on macOS:

https://support.apple.com/guide/disk-utility/erase-and-refor...

> Note: With a solid-state drive (SSD), secure erase options are not available in Disk Utility. For more security, consider turning on FileVault encryption when you start using your SSD drive.

So if you set up a Mac without FileVault you can never erase everything.

At least with my Lenovo I can do the secure erase.


[flagged]


Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


I'm sorry what? You are calling me non-human here. A quick glance will tell that I'm very fallible for both opinions and assertions.


SSDs are not magic. For shredding to be impossible the drives would have to have hidden capacity (and a lot of it, more than 1%). People often say wear levelling makes overwriting useless, but what if you overwrite the entire disk? How would you recover data after something like that?

Of course providing something like 10% of hidden extra capacity would extend the life of the drive significantly, but are manufacturers really doing that and are not mentioning it in their marketing materials? I never heard that they do that.


> Of course providing something like 10% of hidden extra capacity would extend the life of the drive significantly, but are manufacturers really doing that and are not mentioning it in their marketing materials? I never heard that they do that.

Every SSD has considerably more physical blocks than reported blocks. They have to: SSD bad blocks are common, and the number of writes per block is quite limited compared to hard disks.

https://download.semiconductor.samsung.com/resources/others/...


SSDs don't cycle blocks based on how much you've written. You can write over the entire reported capacity of a drive 100 times and there's still the chance, however small, that your encryption key and some encrypted data are still sitting around on some chip. You just can't guarantee that it's gone. For all you know, the controller has marked them as bad and isn't going to reuse them, even though there's still recoverable data on them. Since there's no way to reach them through the drive interface, there's also no way to erase them. But if someone opened up the drive and pulled the chips, there they would be.


Sure, but very few business tools are doing that; they are playing the old Gutmann game, which he himself has walked away from as no longer reliable. Filling up the drive certainly gets you somewhere compared to some of the snake oil out there. Drives are only getting bigger, though, and are rarely filled up. I've done some of this stuff in my time, and if cost is no barrier you'd be surprised how much can be reclaimed. Degauss and destroy is the only real method we have now, and that's probably not changing in the future.

>but are manufacturers really doing that and are not mentioning it in their marketing materials

SSD manufacturers have been caught out repeatedly over their built-in deletion API claims. Recovery of a significant percentage of files is nearly always possible, even without any tampering having occurred.


> I never heard that they do that.

Not sure how you missed it, as it's not a new thing. Straight from the horse's mouth, if that helps :)

https://www.kingston.com/en/blog/pc-performance/overprovisio...

Note that it's not just a Kingston thing, it's common across SSD manufacturers. Or at least with the major ones.


> what if you overwrite the entire disk

People who want secure deletion generally want their storage device to remain functional afterwards. Same reason they don't simply destroy the disk to get rid of its contents: those things are expensive. Overwriting the entire SSD every time you want a file gone just isn't a solution.


So you’re a dog person?


I thought we all were dog people. Woof.


Some of us are just dogs.


On the Internet, nobody need ever know.

P.S. Squirrel!


Some of us are different kinds of animals.


In Debian bookworm:

"sudo systemctl stop unattended-upgrades.service" ...wasn't able to prevent unattended-upgrades from going on ahead and just upgrading (to this problematic kernel) anyway.

Unintuitively, the "right" way to disable unattended-upgrades is:

"sudo dpkg-reconfigure unattended-upgrades" ...and choose "No" when asked.


Most likely because you stopped the wrong service. You should have stopped the relevant timer service, not the service that the timer starts.


> You should have stopped the relevant timer unit, not the service unit that the timer starts.

FTFY. It helps to keep them apart if you don't use the same word for both ;)

    systemctl disable --now unattended-upgrades.timer;
    systemctl disable --now unattended-upgrades.service
No need to uninstall or mask.


Can you give example code to stop that timer?


It's the same, but you disable the timer instead of the service. `sudo systemctl stop unattended-upgrades.timer`


It will still start when you next boot up. 'systemctl disable --now unattended-upgrades.timer' will stop it now _and_ remove the symlink that starts it at boot.


Probably better to mask it instead if you don't want it to execute at all


Usually, it's the same service name but with .timer at the end instead of .service.


dpkg-reconfigure is really important and maybe a bit outdated, but definitely something users should know? I think it's also responsible for turning 'apt upgrade' into a little text adventure, e.g. by showing lots of prompts like "Do you want to replace/skip your sshd config"?


It's better to use configuration / override directories such as sshd_config.d/ to specify customizations.
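
A sketch of that approach (the file name is arbitrary; Debian's stock sshd_config pulls in /etc/ssh/sshd_config.d/*.conf via an Include directive, but verify that line is present on your system):

  # keep local changes out of the packaged sshd_config so upgrades never
  # prompt about replacing it
  printf 'PasswordAuthentication no\nPermitRootLogin no\n' | \
    sudo tee /etc/ssh/sshd_config.d/99-local.conf
  sudo sshd -t && sudo systemctl reload ssh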


Tbf, you also would have to explicitly enable automatic reboots (or manually reboot after installing the new kernel).

That said, I certainly wouldn’t blame anyone for blindly trusting a kernel in Stable. This hurts.


systemctl stop is a oneshot operation, right? Perhaps sudo apt remove unattended-upgrades?


Severity: grave

I'm not sure if that's a standard value, but it made me chuckle a little.



It would have been really funny if they reported that one as:

Severity: `


This might start executing a subshell in some ancient script. Avoid!


The bug system has a list of predetermined severities, so that would just be rejected. :-)


Am I missing something? grave means very serious, which seems like an appropriate category.


Maybe the GP read it as a noun.


The irony of seeing this, right after Linus remarked how serious and thorough the filesystem kernel developers are compared to other device driver developers.


https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuh...

It seems like a commit from the main branch was backported to the stable branch, but it actually depended on another one which was not backported.

Linux kernel maintainers backport commits heavily, as the distinction between bug fixes and new features seems hard to draw. Also, bug fixes can depend on new features.

I'm always a bit scared when I look at the number of commits in a single stable kernel point release, plus the fact that these point releases come out every week or so. The velocity of change in what should be a stable kernel is very high.

I'm wondering who exactly wants to run a stable kernel that receives all these updates so fast. If you want all the latest and greatest, just use -latest?


> I'm wondering who exactly wants to run a stable kernel that receives all these updates so fast. If you want all the latest and greatest, just use -latest?

I find the kernel is now the least stable bit of Debian stable.

I've had LTS kernels break the networking on my workstation (with a 10-year-old onboard NIC), and they deliberately broke ZFS in the past.

In LTS there shouldn't be ANYTHING other than security fixes.


Still, where is the care that Linus mentioned in the keynote interview?


I wonder if Linus will publicly scold the developer responsible for this, like the famous "Mauro, STFU!" e-mail


Aren't those days over?


If they are, then that's why the quality is declining.


“Verbal abuse is a necessary component of effective software reliability engineering!”

“If my file system developer hasn’t been through the Full Metal Jacket experience, I don’t trust the code they write!”

“Linux is going down the toilet! What we really need is leadership from a guy who’s VERY angry all the time!”

Sure, giving good, candid feedback is a gift, but degrading people isn't a necessary part of honesty.


Funny. You don't even know if that's the case, but somehow you know it's the cause?


Weird. I had a few upvotes from people with a sense of humor before it went back down to 1 and these comments appeared.


As someone who is heavily dependent on ZFS, this is a little bit of reassurance. It's sad that data corruption exists in any file system that is shipped to so many people, but reassuring that it happens even to the most stable and least "interesting" (in the newness sense) file systems.

ZFS's own recent file system corruption issue is in roughly the same category of edge case, but accessible to reasonable if niche workloads.


Note though that unlike zfs, ext4 has a battle-tested fsck.

Combine ext4's dumb but robust approach to journaling and its robust metadata layout (for example, inodes are statically allocated) with fsck.ext4 (which has been refined for years), and you can recover from any situation.

To give you an example, fsck.ext4 will happily carve a working filesystem out of random data, as long as there is a valid superblock. Seriously - try it yourself:

  # create a working filesystem image
  dd if=/dev/zero of=test1.img bs=1M count=256
  /sbin/mkfs.ext4 test1.img
  # write file with random data
  dd if=/dev/urandom of=test2.img bs=1M count=256
  # copy superblock into random file
  dd if=test1.img of=test2.img bs=1024 count=4 seek=1 skip=1 conv=notrunc
  /sbin/fsck.ext4 -fy test2.img
  sudo mount test2.img /mnt/somewhere


I'm trying to understand what exactly this is testing/demonstrating.

Like: "Wow, but why should I care?" I'm not sure being able to fsck a random disk image shows resiliency. Doing this could do all sorts of nasty things to data you actually care about.


(not the author) The metadata is a contiguous range of disk blocks. I think the intuition is that such layouts are likely to require simpler filesystem code to manipulate. [versus (iirc) ZFS which may have metadata blocks scattered throughout the disk, probably requiring more intricate code to keep track of].


> [versus (iirc) ZFS which may have metadata blocks scattered throughout the disk, probably requiring more intricate code to keep track of].

So -- you think redundant metadata is a bad thing? Try wiping your metadata and then trying to fsck that random disk image.

Again, has this been the source of data corruption you're aware of? This seems like a "Maybe, it could be this way, but I don't know" kind of take.


> To give you an example, fsck.ext4 will happily carve a working filesystem out of random data

What application does this have for recovering user data? Would it not be more interesting to see how much of the user data can be recovered? Or are you implying that the random data can symbolize the user data here and much of it would be recovered?


You’re forgetting the btrfs folks.

— someone who lost data due to btrfs bug


Yeah, all the threads full of folks bashing on ZFS because their trusty ext won’t corrupt data should be eating crow right about now, but I’m sure they’ll hop in and explain why this bug is different.


It's not an extfs bug, it's a "stable kernel" process bug. Linux master branch/mainline was never affected by this bug. Someone cherry-picked a wrong patch into stable without realizing the consequences, and it was caught too late.

The stable process is cherry-picking thousands to tens of thousands of patches from the current master kernel branch into years-old kernel branches, spraying tens of thousands of emails at the original patch authors, and hoping (with some limited testing on top) that the resulting frankenkernels will still work and that all those patches have their dependencies applied, too.

I don't trust this process very much, and just run the stable branch for the latest kernel release only. Staying with an older stable release branch for too long seems too risky, unless you're some bigcorp that can afford the testing required, or you're running a highly mainstream setup that is probably covered by the tests done by the stable team and the testing teams they cooperate with.


I understand your reasoning, but running a popular « stable » distro in a popular configuration will help detect bugs much faster. I’m not sure this would be the case for the very latest kernel.


A bunch of popular distros run the latest kernel.


Nobody ships vanilla kernels. Sure, the reference implementation was unaffected, but users never experience vanilla Linux kernels.


Distros like Arch and NixOS use the vanilla kernel with a very small number (single digits) of patches on top. For the purposes of ext4, it's very likely identical to vanilla. (I am counting the official linux-stable releases from kernel.org as 'vanilla' here.)


Even a single patch to something like default kernel config parameters means it's not vanilla. I'm not arguing that there are significant patches, only that nobody seems to ship a true vanilla kernel.


> Even a single patch to something like default kernel config parameters means it's not vanilla.

Anybody who cares that much is very likely already compiling their own kernels (I speak for myself here). It doesn't make sense for distros to do the extra work to support it.

Just expanding on this a bit: my Debian laptop has a Kconfig with modules disabled that only includes the exact set of drivers it needs. It takes ten minutes to rebuild on the laptop when I pull from git. Even if Debian did all the work to let me automatically install the latest vanilla kernel... I'd still build it myself.
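
For anyone curious, a rough sketch of that kind of self-built-kernel workflow on Debian (assuming a checked-out kernel tree and the usual build dependencies; localmodconfig trims the config down to the modules currently loaded, which is close to, though not exactly, the modules-disabled setup described above):

  # inside the kernel source tree
  cp /boot/config-"$(uname -r)" .config   # start from the running kernel's config
  make localmodconfig                     # drop modules that aren't currently loaded
  make -j"$(nproc)" bindeb-pkg            # build installable .deb packages
  sudo dpkg -i ../linux-image-*.deb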


Fedora, openSUSE tumbleweed, Manjaro,...


> Fedora

You know it's not that hard to debunk this:

> You'll then be left with a kernel-6.X.? directory, containing both an unpatched 'vanilla-6.X.?' dir, and a linux-6.X.?-noarch hardlinked dir which has the Fedora patches applied. [1]

WOW fedora patches! Sounds to me like they're not shipping vanilla.

[1] https://fedoraproject.org/wiki/Kernel


That's not what my original argument was about. So not sure what you're debunking.

I have a problem with backpatching tens of thousands of changes to years-old kernels, like 5.4, 5.15, 6.1 or whatever, by people who don't really understand the changes or their consequences. Not with patching up a 6.6 or 6.7 kernel with a few out-of-tree patches, where they mostly understand what they're doing and can test the limited set of changes they're applying.


If they change anything it's no longer vanilla. Anything at all. No matter how insignificant.


This isn't a sports tournament, nobody is cheering or bashing "teams". If ZFS has bugs and caveats, they should be called out. Same for ext4.


[flagged]


Well said. Twisted partisanism.

I'm in the middle ground. It all sucks in one way or another, but we're generally trying to do our best.


> When a core rust developer is exposed as a psychotic pedophile, there's crickets - it's not news because it happens every month.

bro what on earth are you on about


The problem here was Debian's distribution process. Any distro compiling linux from the releases on kernel.org was not affected.


This is not accurate. As you can see in the changelogs, the problematic commit made it to 6.1.64 and the fix was merged in 6.1.66:

"properly sync file size update after O_SYNC direct IO": https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.6...

"update ki_pos a little later in iomap_dio_complete": https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.6...

This post explains the relationship between the two commits: https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuh...


This is a misreading of the bug. The bug comes from upstream stable kernels before 6.5 that include commit 91562895f803 but not 936e114a245b6 [1].

In this case Debian's current process is good - its kernels track kernel.org stable releases. This Debian bug is responsibly flagging "for visibility" that a serious bug has been discussed and fixed upstream.

[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuh...


Do you have a link to just the 2 patches (in patch form) that could be applied to a kernel tarball to revert these 2 changes?


Are you sure about that (genuine question)? The linked discussion involves a SUSE engineer and a request made to the kernel maintainer directly, not to a Debian packager.


No, it's an upstream bug being discussed in the Debian bug tracker.


What if you build your own kernel from kernel.org, but you use Debian tools to make dpkgs? (and you use old config)?


So this is another bug introduced by Debian itself by patching things?

I remember that there was a fairly severe one which was caused by patching OpenSSL I think? But I remember the change they made being fairly weird and no one understood why but it was easy to see that it would introduce a vulnerability.


No, the issue was caused by backporting a patch A but not backporting a patch B. Sadly in this case the overall behavior after applying just A was broken when issuing direct IO writes.


I remember when they introduced 2 CVEs back to back in Apache by incorrectly backporting changes.


No, it exists in the upstream release.


To get past apt errors in the meantime (copied from the bottom of the bug report):

> This should block just the buggy kernel. Which might help with unattended upgrades problem or just being forgetful. It might even uninstall the buggy kernel, though I didn't test that. And it shouldn't impact upgrading to 6.1.66 when it's available.

> create a file: /etc/apt/preferences.d/buggy-kernel

> with the contents:

    # avoid kernel with ext4 bug 
    # 1057843
    Package: linux-image-*
    Pin: version 6.1.64-1
    Pin-Priority: -1
> (the comment isn't required but is helpful for remembering why this file is around in 6 months.)
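
To confirm the pin actually took effect, something like the following should show the 6.1.64-1 entry with priority -1 in the version table (the amd64 metapackage is just an example; use whatever linux-image package your machine tracks):

  apt-cache policy linux-image-amd64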


Why wasn't automatic updating to this halted immediately? The "grave" bug report was yesterday, but today unattended-upgrades updated me to the affected kernel and I restarted before I heard about the problem.


The way the Debian update process works doesn't make it possible to halt updates immediately. The servers you're getting the updates from aren't controlled by the Debian project itself, but are operated by third parties. They sync their data from a master machine (which is operated by Debian) 4 times per day. If you're using a mirror that's not under the debian.org domain, there might even be another machine or two in between, each adding some delay. So even if Debian pulled the update immediately, which their archive software isn't really set up to do, it'd take a bit for that to propagate through to the mirrors.

If this all seems archaic, keep in mind that all of this was designed 20+ years ago, when bandwidth was quite a bit less abundant.


I found a few things surprising about this bug too, but this might be due to my own ignorance about how these processes work:

- I was expecting the package to either be revoked, or to have a new `6.1.64-2` version published with the previous known-good state.

Maybe it wasn't done because it was not possible (mirrors being write-only, and maybe other complications in publishing a 6.1.64-2).

- I was expecting some guidelines to be published for affected machines. I've seen questions being asked [1], but no answers to them yet (so users are not sure whether it would be safe to roll back to the previous kernel version (6.1.55-1) or not).

Maybe it's because the problem is not well understood yet, or maybe there are not enough people available to answer such questions on a weekend?

If someone can provide some context about "how these things work", it would be interesting to learn more about this.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1057843#33


The `6.1.66-1` release with a fix was uploaded to Debian's infrastructure about 2 hours after the bug was filed, but it takes a bit for it to be compiled for all of Debian's architectures (the mipsel builder, e.g., needed 9 hours) and for it to propagate through the mirror network.



Have any guidelines been posted somewhere for systems already updated to linux-image-amd64 6.1.64-1?


I also manually upgraded yesterday (bad luck, just a few hours after the update) and just now reverted to the previous kernel. Not sure this was the right thing to do but didn't want to take any chances


My desktop and laptop are both root on ZFS so I dodged this bug. (Actually I saw the press on this before upgrading anyway so I'm holding off until I see a fix released.)


I saw the message yesterday late in the evening, I still had 8 hours before my automated updates ran and I thought the package would be revoked by then. Mistake :/


Seems like I dodged this bullet by using Debian unstable, how ironic.


HWE kernel (currently 6.2.0-37-generic #38~22.04.1-Ubuntu) on LTS Ubuntu 22.04 is not affected


Meanwhile my NTFS partitions have been acting up pretty much all the time lately.



It seems they have frozen the rollout of 6.1.0-14 too? Did they just freeze all kernel updates, using `HTTP 403 - broken package` as the response?


That version is the one (or one of the ones) affected. I'm guessing that's why they froze it, though why they didn't replace it with a newer one, or even the old safe version I'm not sure.


Honestly, the handling of this bug is horrendous, to say the least. Why can't they just retract the update? Why do people have to actively avoid updating to dodge this?

This doesn't help Debian's reputation.



The reason I'm using Debian stable was to avoid these kinds of problems.

I should just move on to FreeBSD.


Curious why it doesn’t happen on GNU?

Context: this is seen in the following environments:

* dragonboard-845c
* juno-64k_page_size
* qemu-arm64
* qemu-armv7
* qemu-i386
* qemu-x86_64
* x86_64-clang


What do you mean by GNU here? The bug is in the Linux kernel, specifically in the code for the ext4 filesystem. There are not many systems out there that run a GNU kernel.


I think they're referring to the `x86_64-clang` part (`clang` vs `gcc`).


Is there some widely used software that does O_DIRECT writes? MariaDB?


O_DIRECT is almost always the wrong choice. sync_file_range gives you much better control over scheduling of the write backs, and madvise gives you better control over caching policy.

There were some old UNIX variants where O_DIRECT actually bypassed the filesystem cache, but Linux's cache is coherent, so reading a file immediately after an O_DIRECT write completes is guaranteed to give the new value. That is more sane than the cache-incoherent approach (how do you know your write is going to a page that is clean in OS cache in the other UNIXes?), but also eliminates most of the code-path-length benefit of O_DIRECT.

Also, if I remember right, O_DIRECT doesn't bypass the I/O scheduler on Linux. That, and bypassing the cache are the two main benefits of the old API, and you get neither.

As a bonus, filesystems like tmpfs passive-aggressively return error if you try to use O_DIRECT.


> but Linux's cache is coherent

Linux achieves that coherency by making O_DIRECT invalidate the page in the page cache. I think it paints a more accurate picture to say the caching is disabled with O_DIRECT, not that it is coherent (although it certainly is).


Three months using kernel 6.6.5 from Zabbly (for hardware support) and having no issues:

https://github.com/zabbly/linux

Quote "As those kernels aren't signed by a trusted distribution key, you may need to turn off Secure Boot on your system in order to boot this kernel."


6.6.5 was released 2 days ago and 6.6 a month ago :)


6.5 is available too.

Would anybody be so kind as to tell me why I have been downvoted?



