
There are actually several distinct issues being reported there. I replied to everyone who posted backtraces and to a few who did not:

https://github.com/openzfs/zfs/issues/9130#issuecomment-2614...

That said, many others stress ZFS on a regular basis and it handles the stress fine. I do not doubt that there are bugs in the code, but I suspect other things are at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly for reasons external to ZFS (and sometimes for reasons internal to ZFS, such as data deduplication).
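
For anyone who wants to rule slow IO in or out, two generic checks are the kernel log and per-device latency. This is only a rough sketch; the exact messages vary by kernel and iostat comes from the sysstat package:

dmesg | grep -i "blocked for more than"    # the hung task warnings that name txg_sync (or other threads)

iostat -x 5    # high await/%util on a disk that is moving little data points at the device or the layers below ZFS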

I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.



Regarding your comment - it seems unlikely that it "affects Ubuntu less". I don't see why that would be the case; it's not like Ubuntu runs a heavily customized kernel or anything. And thanks for taking a look - ZFS is just the way things should be in filesystems and logical volume management. I do wish I could stop doing hash compares after large, high-throughput copies and just trust it to do what it was designed to do.


Ubuntu kernels might have a different default IO elevator than proxmox kernels. If the issue is in the IO elevator (e.g. it is reordering in such a way that some IOs are delayed indefinitely before being sent to the underlying device) and the two use different IO elevators by default, then it would make sense why Ubuntu is not affected and proxmox is. There is some evidence for this in the comments as people suggest that the issue is lessened by switching to mq-deadline. That is why one of my questions asks what Linux IO elevator people’s disks are using.

The correct IO elevator to use for disks given to ZFS is none/noop as ZFS has its own IO elevator. ZFS will set the Linux IO elevator to that automatically on disks where it controls the partitioning. However, when the partitioning was done externally from ZFS, the default Linux elevator is used underneath ZFS, and that is never none/noop in practice since other Linux filesystems benefit from other elevators. If proxmox is doing partitioning itself, then it is almost certainly using the wrong IO elevator with ZFS, unless it sets the elevator to noop when ZFS is using the device. That ordinarily should not cause such severe problems, but it is within the realm of possibility that the Linux IO elevator being set by proxmox has a bug.
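
For people who want to report what their disks are using, the active elevator is the bracketed entry in sysfs. As a rough sketch (adjust the glob for your device names; NVMe devices show up as nvme*n1, for example):

grep . /sys/block/sd*/queue/scheduler    # lists each disk's available elevators, with the active one in brackets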

I suspect there are multiple disparate issues causing the txg_sync thread to hang for people, rather than just one issue. Historically, things that cause the txg_sync thread to hang are external to ZFS (with the notable exception of data deduplication), so it is quite likely that the issues are external here too. I will watch the thread and see what feedback I get from people who are having the txg_sync thread hang.


Thanks a lot for elaborating. I'm traveling at the moment, but I'm going to try reproducing this issue once I'm back in town. IIRC I did the partitioning myself, using a GPT partition table and the default partition settings in fdisk.

Upd mq-deadline for all drives seems to be `none` for me. OS is Ubuntu 22.04


> Upd mq-deadline for all drives seems to be `none` for me.

I am not sure what you mean by that. One possibility is that the ones who reported mq-deadline did better were on either kyber or bfq, rather than none. The none elevator should be best for ZFS.


I mean that it's already "none" on the machine where I encountered this bug. "Upd" was merely to signal that I've made an edit to my post.


The mq-deadline part is what confused me. That is a competing option for the Linux IO elevator that runs under ZFS. Anyway, I understand now and thanks for the data point. I added a list of questions to GitHub that could help narrow things down if you take time to answer them. I will be watching the GitHub thread regularly, so I will see it if you post the answers after you return from your travels.


I was confused, actually, not you. The output was:

cat /sys/dev/block/8:176/queue/scheduler

[mq-deadline] none

However, this output does not mean what I thought it did - it means that mq-deadline is in use.

If I do

echo "none" | sudo tee /sys/dev/block/8:176/queue/scheduler

This changes to

cat /sys/dev/block/8:176/queue/scheduler

[none] mq-deadline
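
For what it is worth, a scheduler change made with echo/tee like that only lasts until reboot or until the device is re-probed. To make it persist, a udev rule along these lines should work - the file name is arbitrary and the KERNEL match should be narrowed to the disks that ZFS is actually using:

# e.g. /etc/udev/rules.d/66-zfs-scheduler.rules (illustrative name)

ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"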



