Amazon should IMO come up with a little ACPI spec by which it can set a flag saying "diagnostic interrupt requested" and then send an NMI. That way it would be a known NMI and would work reliably and without magic command line options.
I suspect that is why you are seeing unreliability with unknown NMI panics.
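For context, the "magic command line option" is presumably something along the lines of the kernel's unknown_nmi_panic knob, which makes an NMI that can't be attributed to any known source panic the machine instead of just being logged (whether AWS recommends exactly this is my assumption):

# as a boot parameter on the kernel command line
unknown_nmi_panic
# or at runtime, on a typical x86 kernel that exposes the sysctl
sysctl kernel.unknown_nmi_panic=1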
There are a few relevant things about interrupts that matter here:
1. Interrupts, in general, don't have useful payloads. This is because the CPU and its interrupt controller keep track of pending interrupts so that they can invoke the interrupt handler when it's next appropriate to do so. To avoid a potentially unbounded queue of pending interrupts, the pending-interrupt state is just a mask of bits indicating which vectors are pending. On x86, NMI is vector 2: it is either pending or it isn't, just like a regular device IRQ. In the case of regular device IRQs, the OS can query the APIC to get more information, and if the IRQ vector is shared, it can query all the possible sources. In the case of performance counter NMIs, the kernel can read the performance counters. In the case of magic AWS NMIs, there is nothing to query. This is why I think it's a poor design. (The /proc/interrupts counters shown after this list illustrate the attribution side of this.)
2. The x86 NMI design is problematic: NMIs are blocked while the CPU thinks an NMI is running, but the kernel has insufficient control of the "NMIs are blocked" bit, and the CPU's heuristic for "is an NMI running?" is inaccurate. The result is some extremely nasty asm code to fix things up after the CPU gets them wrong. Ugly, but manageable.
3. x86_64 has some design errors that mean that the kernel must occasionally run with an invalid stack pointer. The kernel developers have no choice in the matter. Regular interrupts are sensibly masked when this happens, so interrupt delivery won't explode when the stack pointer is garbage. But NMIs are non-maskable by design, which necessitates a series of unpleasant workarounds that, in turn, cause their own problems.
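As a rough illustration of the attribution problem in point 1: after the fact, all the kernel can really tell you is how many NMIs it took and how many it managed to pin on a source it knows how to poll, e.g. (column layout varies by machine):

# total NMIs taken vs. NMIs attributed to the perf subsystem
grep -E '^ *(NMI|PMI):' /proc/interrupts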
The kernel can definitely have bugs that cause NMI handling to fail. I've found and fixed quite a few over the years. Fortunately, no amount of memory pressure, infinite looping, or otherwise getting stuck outside the NMI code is likely to prevent NMI delivery. What will kill AWS's clever idea is if the kernel holds locks needed to create a stackdump at the time that the NMI is delivered. The crashkernel mechanism tries to work around this, but I doubt it's perfect.
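For what it's worth, whether crashkernel is even armed on a given box can be sanity-checked with something like this (paths are for a typical Linux setup):

# was a crash kernel region reserved at boot?
grep -o 'crashkernel=[^ ]*' /proc/cmdline
# has a capture kernel actually been loaded via kexec? (1 = yes)
cat /sys/kernel/kexec_crash_loaded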
Also, regarding point two, is the cumulative result of those factors you describe (other NMIs being blocked, lack of kernel control, the unreliable heuristic) that a diagnostic NMI may end up never being run?
Look up SYSCALL in the manual (AMD APM or Intel SDM). SYSCALL switches to kernel mode without changing RSP at all, which means that at least the first few instructions of the kernel's SYSCALL entry point run with a completely bogus RSP. An NMI, MCE, or #DB hitting there is fun; for the #DB case, see CVE-2018-8897. You can also read the actual NMI entry asm in Linux's arch/x86/entry/entry_64.S for how the kernel handles an NMI that arrives while it's trying to return from an NMI :)
To some extent, the x86_64 architecture is a pile of kludges all on top of each other. Somehow it all mostly works.
> is the cumulative result of those factors you describe (other NMIs being blocked, lack of kernel control, the unreliable heuristic) that a diagnostic NMI may end up never being run?
No, that’s unrelated. A diagnostic NMI causes the kernel’s NMI vector to be invoked, and that’s all. Suppose that this happens concurrently with a perf NMI. There could be no indication whatsoever that the diagnostic NMI happened: the two NMIs can get coalesced, and, as far as the kernel can tell, only the perf NMI happened.
Once all the weird architectural junk is out of the way, the NMI handler boils down to:
    for each possible NMI cause:
        did it happen? if so, handle it
    if no cause was found, complain
I don't know if that's enough to make an NMI "unreliable", but claiming there's nothing the kernel can do to make NMI handling fail might be a bit strong.
A triple fault will reboot or do whatever else the hypervisor feels like doing.
Though arguably, a hypervisor can dump out some useful state on a triple fault, as they actually tend to do. But that's not an in-band kernel panic, and in VirtualBox at least you can get a dump of that state without sending an NMI or otherwise disturbing the VM.
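For instance, something along these lines writes out a core dump of a running guest for offline inspection (the VM name is a placeholder, and the exact option spelling varies a bit between VirtualBox versions):

VBoxManage debugvm "My VM" dumpvmcore --filename guest.core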
A much more useful feature would simply be the ability to see the VM's own console; kernel panics often print their information on the default virtual terminal.
It was extremely handy, and I miss having it. Debugging issues in AWS is a bit trickier.
It might be a better idea to trigger a machine check (#MC). That said, AWS probably knows more than I do.
source: I help keep Linux’s NMI and MCE entry asm working. NMI is also a turd, but it’s a turd that can work reliably. Not crashing due to MCE is strictly best-effort.
Historically it took the form of a physical button on the computer/server that, when pressed, would signal the kernel to bluescreen/kernel panic, dump memory to disk, and then reboot. That same button is now also present in the virtual hardware of a VM.
This article is talking about that virtualized button, not the generic idea of hardware raising an NMI on error conditions (ECC errors, RAID controllers, and other devices do this), which is likely what you are seeing in your logs.
I tried to debug crashes on my laptop, and without real console access it's pretty much impossible to tell why kdump fails.
There is also a 'forced shutdown' API, but for whatever reason it has always been slower than a graceful shutdown in my testing. I have no idea what it does internally.
I have sort of reverse engineered what the two options ("shutdown", "force shutdown") in AWS do.
A regular shutdown causes the hypervisor to issue the equivalent of "init 0" to the VM, over whatever communication channel is used between dom0 and domU. Semantically this is treated as a request: if the VM is completely catatonic, the domU kernel won't be able to process it and the shutdown will simply fail.
A forced shutdown causes the hypervisor to issue a similar request, but also puts the VM on a watchlist. If the domU kernel has not terminated in a given time window, probably 10 or 20 minutes[ß], the hypervisor will nuke the VM.
ß: I'm not entirely sure about the time window; 10 and 20 minutes are what I have witnessed most often. On one occasion I saw a VM take 90 minutes to get nuked. And of course it was a legacy snowflake system that we couldn't just respawn on demand. That was fun.
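For reference, the two options map onto the same CLI call, with force as an extra flag (the instance ID below is a placeholder):

# graceful: the hypervisor asks the guest to shut itself down
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# forced: same request, but the instance gets nuked if it doesn't comply in time
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --force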
echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger
to trigger a crash (kernel panic). This could also easily be automated via SSH.
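e.g. something like this (the hostname is a placeholder, and it assumes the box is still healthy enough to accept an SSH session at all):

ssh root@wedged-host 'echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger'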
> Today, we are announcing a new Amazon Elastic Compute Cloud (EC2) API allowing you to remotely trigger the generation of a kernel panic on EC2 instances.
Which can be done with the commands I mentioned in my parent comment.
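For comparison, the announced API is exposed through the CLI as something like the following (the instance ID is a placeholder, and the exact subcommand name is my reading of the announcement):

aws ec2 send-diagnostic-interrupt --instance-id i-0123456789abcdef0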
And again, the point of this is for VMs that are unresponsive, and that generally includes SSH. It is remarkably easy to get Linux into a state where SSH won't let you log in, or where logins take 10+ minutes to actually happen.
Someone else mentions console access for these things, which is sadly missing in AWS. I'm not sure how you'd take something like console access (very dynamic, keyboard input, screen output? text?) and put it into an AWS API, though. (Their APIs tend to follow certain patterns.)