
Trigger a Kernel Panic to Diagnose Unresponsive EC2 Instances - ingve
https://aws.amazon.com/blogs/aws/new-trigger-a-kernel-panic-to-diagnose-unresponsive-ec2-instances/
======
amluto
unknown_nmi_panic is not entirely reliable.

Amazon should IMO come up with a little ACPI spec by which it can set a flag
saying "diagnostic interrupt requested" and _then_ send an NMI. That way it
would be a _known_ NMI and would work reliably and without magic command line
options.

~~~
throwaway2048
It still wouldn't be reliable, because if a kernel is fully wedged it's not
going to respond to anything, including an NMI.

I suspect that is why you are seeing unreliability with unknown NMI panics.

~~~
derefr
You can't be so wedged that you can't respond to an NMI; an NMI is the CPU
stealing execution from the kernel and giving it to the configured interrupt
handler. The interrupt handler is _usually_ coded entirely independently of
any state in the kernel itself, so—for any reasonable production
kernel—there's nothing the kernel can do that would make the NMI interrupt
handler fail to function.

~~~
amluto
Not really. An NMI is delivered just like any other interrupt, and the NMI
handler absolutely depends on kernel state. This is needed for the handler to
do things like figuring out that it's a performance counter event and queueing
up the perf event record involved.

There are a few relevant things about interrupts that matter here:

1. Interrupts, in general, don't have useful payloads. This is because the
CPU and its interrupt controller keep track of all interrupts pending so that
they can make sure to invoke the interrupt handler when it's next appropriate
to do so. To avoid having a potentially unbounded queue of pending interrupts,
the pending interrupt state is just a mask of bits indicating which vectors
are pending. On x86, NMI is vector 2. It is pending or it isn't. This is like
a regular device IRQ, which can be pending or not. In the case of regular
device IRQs, the OS can query the APIC to get more information. If the IRQ
vector is shared, the OS can query all possible sources. In the case of
performance counter NMIs, the kernel can read the performance counters. In the
case of magic AWS NMIs, there _is nothing to query_. This is why I think it's
a poor design.

2. The x86 NMI design is problematic: NMIs are blocked while the CPU thinks
an NMI is running, but the kernel has insufficient control of the "NMIs are
blocked" bit, and the CPUs heuristic for "is an NMI is running?" is
inaccurate. The result is extremely nasty asm code to fix things up after the
CPU messes them up. This is nasty but manageable.

3. x86_64 has some design errors that mean that the kernel _must_
occasionally run with an invalid stack pointer. The kernel developers have no
choice in the matter. Regular interrupts are sensibly masked when this
happens, so interrupt delivery won't explode when the stack pointer is
garbage. But NMIs are non-maskable by design, which necessitates a series of
unpleasant workarounds that, in turn, cause their own problems.

The kernel can definitely have bugs that cause NMI handling to fail. I've
found and fixed quite a few over the years. Fortunately, no amount of memory
pressure, infinite looping, or otherwise getting stuck outside the NMI code is
likely to prevent NMI delivery. What _will_ kill AWS's clever idea is if the
kernel holds locks needed to create a stackdump at the time that the NMI is
delivered. The crashkernel mechanism tries to work around this, but I doubt
it's perfect.
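
To make the lock hazard concrete, here is a tiny userspace model in C (an
editorial sketch, not kernel code; the lock and function names are invented
for illustration): if the code the NMI interrupted already owns the lock the
dump path needs, spinning on it from the handler can never succeed, which is
roughly why NMI-safe paths trylock or buffer output instead of blocking.

    /* Userspace model only -- not kernel code. If the interrupted context
     * already holds console_lock, an NMI handler that spins on it can never
     * make progress, because the owner cannot run again until the handler
     * returns. So the dump path must trylock and bail instead of blocking. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_flag console_lock = ATOMIC_FLAG_INIT;

    /* Stand-in for ordinary kernel code that takes the lock and may be
     * interrupted by an NMI while holding it. */
    static void ordinary_path(void)
    {
        while (atomic_flag_test_and_set(&console_lock))
            ;                             /* spin until acquired */
        /* ...critical section: write to the console... */
        atomic_flag_clear(&console_lock);
    }

    /* Stand-in for the NMI-context stack dump: it must not spin here. */
    static bool nmi_dump(void)
    {
        if (atomic_flag_test_and_set(&console_lock))
            return false;                 /* interrupted owner holds it: give up */
        puts("stack dump would be printed here");
        atomic_flag_clear(&console_lock);
        return true;
    }

    int main(void)
    {
        ordinary_path();
        printf("dump %s\n", nmi_dump() ? "succeeded" : "skipped");
        return 0;
    }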

~~~
mrbrowning
If you're able to, would you mind linking to some resources about the case you
describe in point three where the kernel needs to deal with an invalid stack
pointer? I'm very curious about the underlying causes.

Also, regarding point two, is the cumulative result of those factors you
describe (other NMIs being blocked, lack of kernel control, the unreliable
heuristic) that a diagnostic NMI may end up never being run?

~~~
amluto
> If you're able to, would you mind linking to some resources about the case
> you describe in point three where the kernel needs to deal with an invalid
> stack pointer?

Look up SYSCALL in the manual (AMD APM or Intel SDM). SYSCALL switches to
kernel mode without changing RSP at all. This means that at least the first
few instructions of the kernel’s SYSCALL entry point have a completely bogus
RSP. An NMI, MCE, or #DB hitting there is fun. For the latter, see
CVE-2018-8897. You can also read the actual NMI entry asm in Linux’s
arch/x86/entry/entry_64.S for how the kernel handles an NMI while trying to
return from an NMI :)

To some extent, the x86_64 architecture is a pile of kludges all on top of
each other. Somehow it all mostly works.

> is the cumulative result of those factors you describe (other NMIs being
> blocked, lack of kernel control, the unreliable heuristic) that a diagnostic
> NMI may end up never being run?

No, that’s unrelated. A diagnostic NMI causes the kernel’s NMI vector to be
invoked, and _that’s all_. Suppose that this happens concurrently with a perf
NMI. There could be no indication whatsoever that the diagnostic NMI happened:
the two NMIs can get coalesced, and, as far as the kernel can tell, only the
perf NMI happened.

Once all the weird architectural junk is out of the way, the NMI handler boils
down to:

    
    
        for each possible NMI cause
          did it happen?  if so, handle it.
        
        If no cause was found, complain.
    

Amazon’s thing is trying to hit the “complain” part. What they should do is
give some readable indication that it happened so it can be added to the list
of possible causes.
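
As a rough sketch of how "add it to the list of possible causes" could look on
the Linux side: the kernel's existing register_nmi_handler() API (arch/x86)
lets a handler be polled on every NMI. Here hv_diag_requested() is a
placeholder for reading the hypothetical hypervisor-set "diagnostic requested"
flag proposed above; no such interface exists in EC2 today.

    /* Sketch only: hv_diag_requested() stands in for reading a hypothetical
     * "diagnostic interrupt requested" flag that the hypervisor would set
     * before sending the NMI. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <asm/nmi.h>

    static bool hv_diag_requested(void)
    {
        return false;   /* placeholder: read the hypervisor-provided flag here */
    }

    static int hv_diag_nmi(unsigned int type, struct pt_regs *regs)
    {
        if (!hv_diag_requested())
            return NMI_DONE;        /* not ours; let other causes be polled */

        panic("diagnostic NMI requested by hypervisor");
        return NMI_HANDLED;         /* not reached */
    }

    static int __init hv_diag_init(void)
    {
        /* NMI_LOCAL handlers are consulted on every NMI, so this becomes a
         * known cause instead of falling through to the "complain" path. */
        return register_nmi_handler(NMI_LOCAL, hv_diag_nmi, 0, "hv_diag");
    }

    static void __exit hv_diag_exit(void)
    {
        unregister_nmi_handler(NMI_LOCAL, "hv_diag");
    }

    module_init(hv_diag_init);
    module_exit(hv_diag_exit);
    MODULE_LICENSE("GPL");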

------
djsumdog
AWS doesn't give you console access, does it? Or at least I haven't been able
to find it. I know on Linode they have LISH (console over SSH) and Xen will
give you a VNC console (I think Vultr and Digital Ocean can too).

It seems like a much more useful feature would simply be the ability to see
the console of the VM itself. Kernel panics will often print information on
the default virtual terminal.

~~~
Human_USB
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html)

~~~
technion
That's only an output image, though. The ability to interact with the console
is something I've always missed from on-premises hypervisors.

~~~
andoma
FWIW, GCE has a console. Another thing I miss in AWS is the ability to change
instance user-data while running (GCE also allows this, including the ability
to long-poll for changes).

~~~
kornholi
Yeah, I've used GCE's console many times with SysRq to debug frozen instances.
Surprised that AWS doesn't have one.

------
comex
Looks like a useful addition. What would be even better is if they provided a
GDB stub to allow for full remote kernel debugging. Most hypervisors have
support for a GDB stub, but for some reason it's not the norm for cloud
providers to expose it to customers. It would be especially useful for
unikernels. Most kernels of any type don't offer self-debugging over the
network, because it's hard and unreliable (you have to ensure that any
functions you might want to set a breakpoint on aren't needed by the debugger
itself, which means you basically need a whole separate network stack). But
they can at least allow debugging userland processes; that doesn't work with
unikernels, which bake the entire system into the kernel.

~~~
toast0
Does EC2 offer a serial console equivalent? I've got FreeBSD on bare metal
boxes with IPMI; you can get a serial-over-LAN console and hit the NMI via
IPMI as well to drop into the kernel debugger; super helpful for confirming
where threads are stuck.

~~~
comex
As far as I know, there’s only a read-only serial console.

------
kccqzy
Not all NMIs cause kernel panics. In fact, in a normally running system you are
likely to receive NMIs as part of normal operation. I've seen a typical
Windows 10 system handle hundreds of NMIs in a short span, all handled
gracefully (interrupt vector 2). (Perhaps from certain drivers? Not sure.)

It might be a better idea to trigger a machine check (#MC). That said, AWS
probably knows more than I do.

~~~
throwaway2048
NMI (Non-Maskable Interrupt) is both a general term and a term that
specifically refers to an IBM PC-compatible x86 function built into
motherboards.

Historically it took the form of a button on the computer/server that, when
pressed, would signal the kernel to bluescreen/kernel panic, dump memory to
disk, and then reboot. It is now also present in the virtual hardware of a VM.

This article is talking about the latter, not the generic idea of hardware
raising an NMI during error conditions (ECC memory, RAID controllers, and
other devices can do this), which is likely what you are seeing in your logs.

~~~
kccqzy
No, I'm talking about the latter. To be specific, I'm referring to the kind of
NMI interrupt defined by Intel in Volume 3 Section 6.7 of its manual, that
triggers interrupt vector 2 in protected mode. I write code for a hypervisor
and it's the hypervisor's job to pass on NMIs from the hardware to the guest,
which is why I see these NMIs.

------
viraptor
I'm glad they provide a working kdump config. Normally kdump is kind of magic:
either it works or it doesn't, and in the latter case trying to figure out why
can be insanely hard.

I tried to debug crashes on my laptop, and without real console access it's
pretty much impossible to tell why kdump fails.

------
wmf
In this vein I think it would be interesting to have APIs that simulate a
server, AZ, or region failure/partition. Obviously these could be quite
dangerous and would need to have appropriate safeties (maybe AWS could ship
you a box where you turn two keys at once).

~~~
notalogo
This is why Netflix made "chaosmonkey" and other tools nearly a decade ago:
[https://github.com/Netflix/chaosmonkey](https://github.com/Netflix/chaosmonkey)

------
nodesocket
Would this help in cases where sending a shutdown to an EC2 instance takes 10+
minutes before the instance is stopped?

~~~
slovenlyrobot
If I understand the EC2 instance lifecycle correctly, shutdown causes ACPI
shutdown, and the machine does not enter stopped state until the OS in turn
requests ACPI to power it off. That means shutdown time is dependent on the
configuration of your software -- for example, if a single service is
configured to gracefully wait up to 10 minutes for, say, a persistent database
client connection to disconnect, you might pay the full timeout during
shutdown.

There is also a 'forced shutdown' API, but for whatever reason in testing for
me it has always been slower than a graceful shutdown. I have no idea what it
does internally

~~~
bostik
> _There is also a 'forced shutdown' API, but for whatever reason in testing
> for me it has always been slower than a graceful shutdown. I have no idea
> what it does internally_

I have sort of reverse engineered what the two options ("shutdown", "force
shutdown") in AWS do.

A regular shutdown causes the hypervisor to issue the equivalent of "init 0"
to the VM, over whatever communication channel is used between dom0 and domU.
Semantically this is treated as a request. If the VM is completely catatonic,
the domU kernel won't be able to process the request and will just fail.

A forced shutdown causes the hypervisor to issue a similar request, but also
puts the VM on a watchlist. If the domU kernel has not terminated in a given
time window, probably 10 or 20 minutes[ß], the hypervisor will nuke the VM.

ß: I'm not entirely sure about the time window, 10 and 20 minutes are what I
have witnessed _most often_. I have on one occasion seen a VM take 90 minutes
to get nuked. And of course it was a legacy snowflake system that we couldn't
just respawn on demand. That was fun.

------
herpderperator
On Linux, you've always also been able to just run:

    echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger

to trigger a kernel panic. This could also easily be automated via SSH.
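
For reference, a minimal C equivalent of those two shell commands (nothing
AWS-specific here; it needs root on the target instance and, of course, takes
the instance down when it works):

    /* Equivalent of:
     *   echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger
     * Must run as root; the second write deliberately crashes the kernel. */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            exit(1);
        }
        fputs(val, f);
        if (fclose(f) != 0) {
            perror(path);
            exit(1);
        }
    }

    int main(void)
    {
        write_str("/proc/sys/kernel/sysrq", "1");  /* allow all sysrq functions */
        write_str("/proc/sysrq-trigger", "c");     /* 'c': crash -> kernel panic */
        return 0;
    }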

~~~
tln
If the feature is for unresponsive servers, would SSH really work?

~~~
herpderperator
I suppose I misunderstood the diagnosis part. My understanding was that it was
an API to trigger a kernel panic.

> Today, we are announcing a new Amazon Elastic Compute Cloud (EC2) API
> allowing you to remotely trigger the generation of a kernel panic on EC2
> instances.

Which can be done with the commands I mentioned in my earlier comment.

~~~
deathanatos
Not really. From the hypervisor's point of view, there isn't a guarantee that
the VM is running ssh, that it's listening on the usual port, that there's a
valid login, etc.

And again, the point of this is for VMs that are unresponsive, and that
generally includes SSH. It is remarkably simple to get Linux into a state
where SSH isn't going to let you log in, or where logins take 10+ minutes to
actually happen.

Someone else mentions console access for these things, which is sadly missing
in AWS. I'm not sure how you'd take something like console access (very
dynamic, keyboard input, screen output? text?) and put it into an AWS API,
though. (Their APIs tend to follow certain patterns.)

