Hacker News
Trigger a Kernel Panic to Diagnose Unresponsive EC2 Instances (amazon.com)
141 points by ingve 59 days ago | 35 comments

unknown_nmi_panic is not entirely reliable.

Amazon should IMO come up with a little ACPI spec by which it can set a flag saying "diagnostic interrupt requested" and then send an NMI. That way it would be a known NMI and would work reliably and without magic command line options.

It still wouldn't be reliable because if a kernel is fully wedged it's not going to respond to anything, including an NMI.

I suspect that is why you are seeing unreliability with unknown NMI panics.

You can't be so wedged that you can't respond to an NMI; an NMI is the CPU stealing execution from the kernel and giving it to the configured interrupt handler. The interrupt handler is usually coded entirely independently of any state in the kernel itself, so—for any reasonable production kernel—there's nothing the kernel can do that would make the NMI interrupt handler fail to function.

Not really. An NMI is delivered just like any other interrupt, and the NMI handler absolutely depends on kernel state. This is needed for the handler to do things like figuring out that it's a performance counter event and queueing up the perf event record involved.

There are a few relevant things about interrupts that matter here:

1. Interrupts, in general, don't have useful payloads. This is because the CPU and its interrupt controller keep track of all interrupts pending so that they can make sure to invoke the interrupt handler when it's next appropriate to do so. To avoid having a potentially unbounded queue of pending interrupts, the pending interrupt state is just a mask of bits indicating which vectors are pending. On x86, NMI is vector 2. It is pending or it isn't. This is just like a regular device IRQ, which can be pending or not. In the case of regular device IRQs, the OS can query the APIC to get more information. If the IRQ vector is shared, the OS can query all possible sources. In the case of performance counter NMIs, the kernel can read the performance counters. In the case of magic AWS NMIs, there is nothing to query. This is why I think it's a poor design.

2. The x86 NMI design is problematic: NMIs are blocked while the CPU thinks an NMI is running, but the kernel has insufficient control of the "NMIs are blocked" bit, and the CPU's heuristic for "is an NMI running?" is inaccurate. The result is extremely nasty asm code to fix things up after the CPU messes them up. This is nasty but manageable.

3. x86_64 has some design errors that mean that the kernel must occasionally run with an invalid stack pointer. The kernel developers have no choice in the matter. Regular interrupts are sensibly masked when this happens, so interrupt delivery won't explode when the stack pointer is garbage. But NMIs are non-maskable by design, which necessitates a series of unpleasant workarounds that, in turn, cause their own problems.
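The "pending state is just a mask of bits" observation in point 1 is easy to model. Here is a hypothetical toy sketch in Python (not kernel code): if two NMIs are raised before the first is delivered, they collapse into a single pending bit, so the payload-free second one simply vanishes.

```python
# Toy model of x86 pending-interrupt state: one bit per vector, no queue.
NMI_VECTOR = 2

class ToyInterruptController:
    def __init__(self):
        self.pending = 0  # bitmask of pending vectors

    def raise_irq(self, vector):
        # Raising an already-pending vector is a no-op: the bit is already set.
        self.pending |= 1 << vector

    def take_next(self):
        # Deliver the lowest pending vector and clear its bit.
        for v in range(64):
            if self.pending & (1 << v):
                self.pending &= ~(1 << v)
                return v
        return None

ic = ToyInterruptController()
ic.raise_irq(NMI_VECTOR)   # e.g. a perf counter NMI
ic.raise_irq(NMI_VECTOR)   # e.g. a diagnostic NMI arriving before delivery
deliveries = []
while (v := ic.take_next()) is not None:
    deliveries.append(v)
print(deliveries)  # [2] -- the two NMIs coalesced into one delivery
```

The kernel sees one NMI and has no architectural way to learn that two were requested, which is exactly why a payload-less "magic" NMI is a fragile signal.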

The kernel can definitely have bugs that cause NMI handling to fail. I've found and fixed quite a few over the years. Fortunately, no amount of memory pressure, infinite looping, or otherwise getting stuck outside the NMI code is likely to prevent NMI delivery. What will kill AWS's clever idea is if the kernel holds locks needed to create a stack dump at the time that the NMI is delivered. The crashkernel mechanism tries to work around this, but I doubt it's perfect.

If you're able to, would you mind linking to some resources about the case you describe in point three where the kernel needs to deal with an invalid stack pointer? I'm very curious about the underlying causes.

Also, regarding point two, is the cumulative result of those factors you describe (other NMIs being blocked, lack of kernel control, the unreliable heuristic) that a diagnostic NMI may end up never being run?

> If you're able to, would you mind linking to some resources about the case you describe in point three where the kernel needs to deal with an invalid stack pointer?

Look up SYSCALL in the manual (AMD APM or Intel SDM). SYSCALL switches to kernel mode without changing RSP at all. This means that at least the first few instructions of the kernel’s SYSCALL entry point have a completely bogus RSP. An NMI, MCE, or #DB hitting there is fun. For the latter, see CVE-2018-8897. You can also read the actual NMI entry asm in Linux’s arch/x86/entry/entry_64.S for how the kernel handles an NMI while trying to return from an NMI :)

To some extent, the x86_64 architecture is a pile of kludges all on top of each other. Somehow it all mostly works.

> is the cumulative result of those factors you describe (other NMIs being blocked, lack of kernel control, the unreliable heuristic) that a diagnostic NMI may end up never being run?

No, that’s unrelated. A diagnostic NMI causes the kernel’s NMI vector to be invoked, and that’s all. Suppose that this happens concurrently with a perf NMI. There could be no indication whatsoever that the diagnostic NMI happened: the two NMIs can get coalesced, and, as far as the kernel can tell, only the perf NMI happened.

Once all the weird architectural junk is out of the way, the NMI handler boils down to:

    for each possible NMI cause
      did it happen?  if so, handle it.
    If no cause was found, complain.
Amazon’s thing is trying to hit the “complain” part. What they should do is give some readable indication that it happened so it can be added to the list of possible causes.
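That dispatch loop can be sketched in Python (a hypothetical illustration of the shape of the logic, not Linux's actual notifier-chain code): each registered source reports whether it claimed the NMI, and only if none does is the NMI treated as unknown. It also shows the coalescing failure mode from upthread: a concurrent perf NMI claims the event, so the diagnostic NMI never reaches the "complain" path.

```python
# Hypothetical sketch of the "scan every possible cause" NMI dispatch.
def handle_nmi(handlers):
    handled = False
    for name, check_and_handle in handlers:
        if check_and_handle():   # did this source cause it? if so, handle it
            handled = True
    if not handled:
        # The "complain" path: with unknown_nmi_panic set, Linux panics
        # here -- this is the path AWS's diagnostic interrupt aims for.
        return "unknown NMI"
    return "handled"

# A perf NMI pending at the same time will claim the (coalesced) event.
perf_pending = [True]
def perf_check():
    if perf_pending[0]:
        perf_pending[0] = False
        return True
    return False

print(handle_nmi([("perf", perf_check)]))  # "handled" -- diagnostic NMI lost
print(handle_nmi([("perf", perf_check)]))  # "unknown NMI"
```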

So if the kernel loads a random value into CR3, will an NMI still do anything useful? (Or if the IDT is overwritten, or the NMI handler unmapped.)

I don't know if that's enough to make an NMI "unreliable", but the claim that there's nothing the kernel can do to make NMI handling fail might be a bit strong.

Total garbage in CR3 will likely triple fault. Garbage in the IDT, etc. will also double or triple fault. Corruption of the NMI "don't recurse now" state will likely just ignore the NMI. Corruption of various other things may cause the NMI handler to crash before it makes a useful dump.

A triple fault will reboot or do whatever else the hypervisor feels like doing.

Yeah, that was my point. No kernel panic on a triple fault.

Though arguably, a hypervisor can dump out some useful state on a triple fault, as they actually tend to do. But that's not an in-band kernel panic, and in VirtualBox at least you can get a dump of that state without sending an NMI or otherwise disturbing the VM.

No, it's not. If you have a legit source of NMIs (e.g. if you're running perf), it's entirely possible for a magic diagnostic NMI to be ignored because it is indistinguishable from a perf NMI.

AWS doesn't give you console access, does it? Or at least I haven't been able to find it. I know on Linode they have LISH (console over SSH) and Xen will give you a VNC console (I think Vultr and Digital Ocean can too).

It seems like a much more useful feature is simply being able to see the console on the VM itself. Often kernel panics will print information on the default virtual terminal.

That's only an output image though. The ability to interact with the console is always something I've missed from on-premises hypervisors.

FWIW, GCE has a console. Another thing I miss in AWS is the ability to change instance user-data while running (GCE also allows this, including the ability to long-poll for changes).

Yeah, I've used GCE's console many times with SysRq to debug frozen instances. Surprised that AWS doesn't have one.

Rackspace (at least, when I worked for a company that used them) has access to the console; their website has a link to a VNC page (running a Java applet) that connects to the console of the VM; you can even type into it, unlike AWS.

It was extremely handy, and I miss having it. Debugging issues in AWS is a bit trickier.

This. Console access is one of the handiest user-level features of VMware in my opinion.

Looks like a useful addition. What would be even better is if they provided a GDB stub to allow for full remote kernel debugging. Most hypervisors have support for a GDB stub, but for some reason it's not the norm for cloud providers to expose it to customers. It would be especially useful for unikernels. Most kernels of any type don't offer self-debugging over the network, because it's hard and unreliable (you have to ensure that any functions you might want to set a breakpoint on aren't needed by the debugger itself, which means you basically need a whole separate network stack). But they can at least allow debugging userland processes; that doesn't work with unikernels, which bake the entire system into the kernel.

Does ec2 offer a serial console equivalent? I've got FreeBSD on bare metal boxes with IPMI; you can get a serial over lan console and hit the NMI via IPMI as well to drop into the kernel debugger; super helpful for confirming where threads are stuck.

As far as I know, there’s only a read-only serial console.

Not all NMIs cause kernel panics. In fact in a normally running system you are likely to receive NMIs as part of the normal operation. I've seen a typical Windows 10 system handle hundreds of NMIs in a short span, and they are all handled gracefully (interrupt vector 2). (Perhaps certain drivers? Not sure.)

It might be a better idea to trigger a machine check (#MC). That said, AWS probably knows more than I do.

#MC is a colossal turd. Don’t go there.

source: I help keep Linux’s NMI and MCE entry asm working. NMI is also a turd, but it’s a turd that can work reliably. Not crashing due to MCE is strictly best-effort.

NMI (Non Maskable Interrupt) is both a general term and a term that specifically refers to an IBM PC x86 compatible function built into motherboards.

Historically it took the form of a button on the computer/server that, when pressed, would signal the kernel to bluescreen/kernel panic, dump memory to disk, and then reboot. It is now present also in the virtual hardware of a VM.

This article is talking about the latter, not the generic idea of hardware raising an NMI (which many types of hardware raise during error conditions, such as ECC or raid controllers or other devices), which is likely what you are seeing in your logs.

No, I'm talking about the latter. To be specific, I'm referring to the kind of NMI interrupt defined by Intel in Volume 3 Section 6.7 of its manual, that triggers interrupt vector 2 in protected mode. I write code for a hypervisor and it's the hypervisor's job to pass on NMIs from the hardware to the guest, which is why I see these NMIs.

I'm glad they provide a working kdump config. Normally kdump is kind of magic and either it works or doesn't. In the second case, trying to figure out why may be insanely hard.

I tried to debug crashes on my laptop and without a real console access, it's pretty much impossible to tell why kdump fails.

In this vein I think it would be interesting to have APIs that simulate a server, AZ, or region failure/partition. Obviously these could be quite dangerous and would need to have appropriate safeties (maybe AWS could ship you a box where you turn two keys at once).

This is why Netflix made "chaosmonkey" and other tools nearly a decade ago: https://github.com/Netflix/chaosmonkey

Why not just allow the configuration of your particular application against a set of alternate AWS endpoints that act like a failing instance/AZ/region?

Would this help in the case when sending shutdown to an EC2 instance sometimes takes 10+ minutes before the instance is stopped?

If I understand the EC2 instance lifecycle correctly, shutdown causes ACPI shutdown, and the machine does not enter stopped state until the OS in turn requests ACPI to power it off. That means shutdown time is dependent on the configuration of your software -- for example, if a single service is configured to gracefully wait up to 10 minutes for, say a persistent database client connection to disconnect, you might pay the full timeout during shutdown.
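For example (a hypothetical unit file, not anything AWS ships), a systemd service configured like this would add up to ten minutes to every ACPI-initiated shutdown while systemd waits for the process to exit:

```ini
# /etc/systemd/system/example-db-client.service (hypothetical)
[Service]
ExecStart=/usr/local/bin/db-client
# systemd waits this long after SIGTERM before escalating to SIGKILL --
# and the instance can't reach "stopped" until the OS finishes powering off.
TimeoutStopSec=10min
```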

There is also a 'forced shutdown' API, but for whatever reason in testing for me it has always been slower than a graceful shutdown. I have no idea what it does internally.

> There is also a 'forced shutdown' API, but for whatever reason in testing for me it has always been slower than a graceful shutdown. I have no idea what it does internally.

I have sort of reverse engineered what the two options ("shutdown", "force shutdown") in AWS do.

A regular shutdown causes the hypervisor to issue the equivalent of "init 0" to the VM, over whatever communication channel is used between dom0 and domU. Semantically this is treated as a request. If the VM is completely catatonic, the domU kernel won't be able to process the request and will just fail.

A forced shutdown causes the hypervisor to issue a similar request, but also puts the VM on a watchlist. If the domU kernel has not terminated in a given time window, probably 10 or 20 minutes[ß], the hypervisor will nuke the VM.

ß: I'm not entirely sure about the time window, 10 and 20 minutes are what I have witnessed most often. I have on one occasion seen a VM take 90 minutes to get nuked. And of course it was a legacy snowflake system that we couldn't just respawn on demand. That was fun.

On Linux, you've always also been able to just run:

    echo 1 > /proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger

to trigger a freeze. This could also easily be automated via SSH.

If the feature is for unresponsive servers, would SSH really work?

I suppose I misunderstood the diagnosis part. My understanding was that it was an API to trigger a kernel panic.

> Today, we are announcing a new Amazon Elastic Compute Cloud (EC2) API allowing you to remotely trigger the generation of a kernel panic on EC2 instances.

Which can be done with the commands I mentioned in my parent comment.

Not really. From the hypervisor's point of view, there isn't a guarantee that the VM is running ssh, is running it on the typical port, that there's a valid log-in, etc.

And again, the point of this is for VMs that are unresponsive, and that generally includes SSH. It is remarkably simple to get Linux into a state where SSH isn't going to allow you to log in, or logins might take 10+ minutes to actually happen.

Someone else mentions console access for these things, which is sadly missing in AWS. I'm not sure how you'd take something like console access (very dynamic, keyboard input, screen output? text?) and put it into an AWS API, though. (Their APIs tend to follow certain patterns.)
