Debugging a FUSE deadlock in the Linux kernel (netflixtechblog.com)
166 points by andsoitis 9 months ago | 33 comments



This article is mostly about rediscovering the distinction between interruptible and non-interruptible waits.

At the end there is some sort of deadlock between namespace teardown, signal delivery, and FUSE, but... it isn't articulated in a way that is super comprehensible to me. The kernel flushes open files on exit and also kills things in the namespace on exit. But that means the race condition was always hittable if you killed the FUSE daemon at the wrong time relative to the FUSE client shutdown? It's not totally obvious to me why this would impact other non-FUSE filesystems.

Signal delivery and multithreaded process teardown in the kernel is certainly tricky, and it's really easy to get these weird edge cases wrong.


> It's not totally obvious to me why this would impact other non-FUSE filesystems.

Any filesystem that involves writing to another process, possibly a distant one (think NFS), will be susceptible to this.

Generally this wouldn't be a problem because PID 1 would normally be init, and init wouldn't have any open FDs for remote files (unless it was a diskless boot), and it wouldn't exit (because init never exits, and historically init exiting/dying would cause Unixes to panic), so this doesn't usually come up.

But containers can break all these assumptions since you can have applications be init. And if applications-as-init start helper daemons for FUSE or whatever, they need to clean up in order, and if they don't then bad things happen. In this case the application being init caused the kernel to kill all the other processes in the namespace when the application exited.

Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies. The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.


> Generally this wouldn't be a problem because PID 1 would normally be init, and init wouldn't have any open FDs for remote files (unless it was a diskless boot), and it wouldn't exit (because init never exits, and historically init exiting/dying would cause Unixes to panic), so this doesn't usually come up.

I mean, the kernel cannot rely on any userspace process to do anything -- even init. Even ignoring containers. I don't think the root cause here is even init related -- that's just what caused flush to hang forever in this situation, but as you point out that could happen for any number of reasons.

> Apps-as-init can always fail to exit cleanly by crashing, and that shouldn't cause unkillable zombies.

Right.

> The fix described is correct: allow flushing during exit to fail, since that could always happen (e.g., ENOSPC). Better than waiting forever for a flush that can't complete.

Sure. You could also do an interruptible flush before blocking signals instead of after.


I don't understand why wants_signal returns false on PF_EXITING even if the signal is SIGKILL (and from the kernel). Shouldn't it still wake up the process, so it can get out of the flush?

I am curious: if you were just to walk over every PID in the pid namespace after zap_pid_ns_processes sends its signals, and perform wakeups, would that break out of the `wait_event` loop?

Btw, this class of weirdness with FUSE isn't that unusual.


Author here (hi Sargun), it's not really about rediscovering killable vs. unkillable waits, and any confusion is probably a result of my poor writing.

The crux of it is that once you've called exit_signals() from do_exit(), signals will not get delivered. So if you subsequently use the kernel's completions or other wait code, you will not get the signal from zap_pid_ns_processes(), so you don't know to wake up and exit.

There's a test case here if people want to play around: https://github.com/tych0/kernel-utils/tree/master/fuse2


Hi Tycho!

I'm glad you inherited this :).

Oh, I wasn't suggesting that it was about killable vs. unkillable.

Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?

2. Following this:

  zap_pid_ns_processes ->
     __fatal_signal_pending(task)
     group_send_sig_info
       do_send_sig_info
         send_signal_locked
           __send_signal_locked -> (jump to out_set)
             sigaddset // It has the pending signal here
             ....
             complete_signal


Shouldn't it wake up, even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

One note, in the post:

  # grep Pnd /proc/1544574/status
  SigPnd: 0000000000000000
  ShdPnd: 0000000000000100

> Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under SigPnd, which is the signal number corresponding to SIGKILL.

Shouldn't it be "ShdPnd"?


> Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?

I would argue they're all invalid if PF_EXITING is present. Maybe I should send a patch to WARN() and see how much I get yelled at.

> Shouldn't it wake up, even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

No, because the signal doesn't get delivered by complete_signal(). wants_signal() returns false if PF_EXITING is set. (Another maybe-interesting thing would be to just delete that check.) Or am I misunderstanding you?

> Shouldn't it be "ShdPnd"

derp, fixed, thanks.


> Or am I misunderstanding you?

Oh, I see, you're suggesting exactly,

> (Another maybe-interesting thing would be to just delete that check.)

I agree.


Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

As somebody who has written a non-trivial amount of upstream Linux filesystem code and who is leading the containers team at my current employer, I've found your writing more interesting than perhaps most people on the face of the planet might. I'm also a bit surprised at how often companies write their own custom FUSE filesystems. A lot of them I only hear about as former employees of those companies join mine and then clue me in about their existence. It seems like every large-ish company has at least one these days.

It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

Also, my whole career I've been doing ps aux, but TIL about ps awwfux. Which I guess goes to show there's always some gap in one's basic knowledge of Linux foo!


> Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

Hi Mike. So far so good for me.

> It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

We do have both of these enabled, and have alerts to log them in the fleet. I have found them very useful for saying "there's a bug", but not generally applicable to debugging it. Without those tools, though, we wouldn't catch these things at all except via user reports.

Something that might (?) be useful is something like lockdep for hung tasks. It wouldn't have helped in this case, since it was a bug in signal wakeup, but in e.g. the xfs case I cited at the bottom maybe it would have.


Should processes not be able to wait after exit_signals? That seems like a plausible invariant.


I think they definitely should not. I've considered sending a patch that adds a WARN() or some syzkaller test for it or something, especially now that I've seen it in other filesystems.


Makes sense to me.


I think that’s the point. Currently doing that will potentially result in a deadlock.


Well, only if the wait is for userspace or a remote resource, right? Regular disks are sometimes considered infallible (or at least, the IO will timeout eventually in the generic SCSI logic) and might be ok to wait on.

To generalize a bit, I think the problem is doing any sort of interruptible wait -- because we can no longer be interrupted. Uninterruptible waits aren't any different without signal delivery. I might be oversimplifying, though.


It sounds like exit_signals() is being called too early, and based on the test case linked this might be a library issue rather than an issue in the user's code or the kernel?

Edit: Reading the article, it's clearer that this happens in the kernel's:

  do_exit() {
    ...
    exit_signals(tsk); /* sets PF_EXITING */
    ...
    exit_files(tsk);
    ...
  }

Would a better solution not be to call exit_signals(tsk) later in do_exit(), after all possible signal sources are exhausted?


It doesn't matter. Filesystem waits are historically non-interruptible. The correct fix is indeed to allow the flushes to fail fast rather than wait forever.


> It sounds like exit_signals() is being called too early

Or zap_pid_ns too late, yeah.


Later would be better, no? Since it'd allow the FUSE process to outlive the init process, thus allowing the flushes to complete.


> I don't understand why wants_signal returns false on PF_EXITING even if the signal is SIGKILL (and from the kernel). Shouldn't it still wake up the process, so it can get out of the flush?

First, a signal wouldn't get the process out of blocking on flushing because filesystem waits are non-interruptible.

Second, if a process is exiting then a) no handler for a signal (other than SIGKILL or SIGSTOP) could run, and b) any default-exit action wouldn't change the state of the process.

Therefore the process can't want the signal if the process is exiting.


I'm reminded of Bedrock Linux hanging during sleep, because Linux suspends the FUSE daemon while another process is waiting on FUSE, and that waiting process can't be suspended for sleep: https://news.ycombinator.com/item?id=34583495


To be clear, Bedrock Linux does not hang during sleep. Rather, it cannot enter sleep consistently; if it fails to, it simply continues operating as it was before the request to suspend. The underlying issue is a long-standing Linux kernel bug [0] in which the kernel cannot reliably suspend if FUSE is in use. When the kernel detects this scenario, it simply doesn't suspend, but rather logs its difficulty to dmesg. This is not specific to Bedrock, but hits all projects which leverage Linux's FUSE functionality [1]. It probably hits Netflix's ndrive as well.

In the few months since my comment you've linked, I've put some work into a possible FUSE-less Bedrock implementation. It will likely have some downsides compared to a FUSE-based solution, but the trade-off may be worthwhile for some users. While it's too early to commit to this, I'm hoping to eventually support switching between a FUSE-mode and a non-FUSE-mode with a reboot to allow users to pick the desired trade-off in the desired contexts.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=34932 https://bugzilla.kernel.org/show_bug.cgi?id=198879 https://lists.debian.org/debian-kernel/2011/10/msg00412.html

[1] https://github.com/keybase/client/issues/12458 https://github.com/libfuse/libfuse/issues/248 https://bugs.launchpad.net/ubuntu/+source/sshfs-fuse/+bug/17...


TIL about `ps awwfux`. Great way to remember it


I love `ps axufww` so much and ah fuck to having learned it that way. It's the first thing I run on a server when logging in. It tells you so much about the system. That, `w` and a `dmesg -T` will go 90% of the way to diagnosing most system issues.


Fun to see how admins remember those flags. My version is fauxww


See explainshell [0], a great resource, for the full explanation of these flags.

(Another one of my favorites is cat -vet)

[0] https://explainshell.com/explain?cmd=ps+awwfux


The explainshell output is actually largely wrong here. For historical reasons, GNU ps has two completely different sets of flags depending on whether you use a dash (ps -abcd) or not (ps abcd). The man page documents both sets, but explainshell is getting confused and showing a mix of dash versions and non-dash versions. For example, it shows the documentation for '-f' ("full-format listing"), but it should be showing the documentation for 'f' ("ASCII art process hierarchy (forest)").

(I considered filing a bug report, but there is one already: https://github.com/idank/explainshell/issues/214)


Another one I like is `netstat -tulpen` (tulpen is german for tulips)


I use `cat -A` on Linux, which is equivalent to `cat -vET`



"I thought <bigcorp> used <xyz>" is always a funny question/assertion.

Bigcorps are large and diverse. In this case, this seems to be user/desktop facing, as it's a FUSE module for studio assets.


They use FreeBSD for the CDN but I think their application servers use Linux.

https://netflixtechblog.com/linux-performance-analysis-in-60...


Netflix uses FreeBSD for their dataplane and AWS Linux for control plane, at least as of ~2020.



