

Hunting deep-seeded kernel bugs - kezza
http://www.anchor.com.au/blog/2012/03/bugfixing-the-in-kernel-megaraid_sas-driver-from-crash-to-patch/

======
caf
It's "deep-seated" (although I don't see what was wrong with just using the
original article title).

~~~
rquantz
Ever since I learned the name for this kind of mistake, I can't help but
call it out every time I see one: eggcorn!

<http://eggcorns.lascribe.net/english/46/seed/>

~~~
topbanana
This term was coined by Adam and Joe. You can listen to their podcast
here: <http://www.bbc.co.uk/podcasts/series/adamandjoe> (and I recommend that
you do)

~~~
davidwtbuxton
As great as Adam and Joe are, the term predates their podcasts. I always
thought it was coined on Language Log,
[http://itre.cis.upenn.edu/~myl/languagelog/archives/000018.h...](http://itre.cis.upenn.edu/~myl/languagelog/archives/000018.html)

------
munin
> MegaCli is still the ultimate cause

strongly disagree! no kernel component should introduce memory corruption, and
especially not in response to an ioctl from any user mode code. they say:

> Specifically, various offsets provided from userspace are used by the driver
> without any checks. If these offsets are maliciously chosen, the driver can
> be induced to write to arbitrary kernel memory

sure hope they patched that too!

~~~
rwmj
Bugs happen though.

And this sort of bug is old. I wrote a super [root] exploit for OS-9 which worked
by tricking the GETSTAT call [equivalent of ioctl] for some driver to write
its result to a random, unchecked place in memory. That was in 1992 / OS-9
2.4.

~~~
munin
bugs DO happen.

the root cause is in the kernel component that handles memory
inappropriately, though. user space should be free to blast whatever it
likes into kernel space via system calls and ioctls, and rely on the
operating system not to crash and to keep enforcing its own rules.

this is like saying "the root cause is Dan, who typed in a `drop table`.
our solution is to take away Dan's keyboard"
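
to be concrete, here's the kind of check that was missing, as a sketch
(hypothetical names and layout, not the actual megaraid_sas code):

    #include <linux/fs.h>
    #include <linux/string.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    #define KBUF_SIZE 4096

    /* hypothetical ioctl packet: userspace picks an offset and length */
    struct my_ioc_packet {
        u32 buf_offset;
        u32 len;
    };

    static long my_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
    {
        struct my_ioc_packet pkt;
        static u8 kbuf[KBUF_SIZE];

        if (copy_from_user(&pkt, (void __user *)arg, sizeof(pkt)))
            return -EFAULT;

        /* the missing step: refuse offsets/lengths that escape kbuf,
         * otherwise userspace chooses where in kernel memory we write */
        if (pkt.buf_offset > KBUF_SIZE ||
            pkt.len > KBUF_SIZE - pkt.buf_offset)
            return -EINVAL;

        memset(kbuf + pkt.buf_offset, 0, pkt.len); /* stand-in for real work */
        return 0;
    }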

------
jwildeboer
They claim to use RHEL, but instead of asking for support from Red Hat they
decide to try to fix it themselves? Or am I missing something? Wouldn't such a
patch be beneficial to other RHEL users too? (disclaimer: I work at Red Hat)

~~~
pmjordan
Well, they _have_ pushed the fix upstream and it's been merged into mainline.
I assume you guys have people monitoring the subsystems for backportable
bugfixes?

I have no doubt that you guys have great support, but since the MegaCLI
utility that triggers the issue is a closed source third-party piece of
software, I could imagine the process of getting Red Hat engineers to even
reproduce the problem would probably take longer than just going in and fixing
it themselves. I can imagine having hosts with dozens of VMs going down is a
high-priority issue - OK, less so once they discovered they could work around
it by disabling monitoring, and eventually by upgrading the monitoring
utility, but at that point they were already close to a fix. (also: they
didn't say they _didn't_ report it to RH...)

On a more technical note, if the MegaRAID driver only does DMAs to/from 32-bit
memory addresses, this could be an artificial performance limitation on
systems with lots of RAM (like the 128GiB in the article) as it potentially
means waiting for <4GiB physical memory buffers to become available, and
copying the data. It's not clear from the article whether the 32-bit-only
restriction applies only to STP requests or to all I/O. As the STP stuff only
seems to be used for monitoring, that's not a problem, but if all disk I/O
uses bounce buffers, that's a bad thing.
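
(For reference, a Linux PCI driver declares how much of physical memory it
can DMA to in its probe routine - roughly like the sketch below, which is
not the actual megaraid_sas probe code. If only the 32-bit mask can be set,
the kernel bounces every buffer above 4GiB.)

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        /* prefer full 64-bit addressing... */
        if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
            pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
        } else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) == 0) {
            /* ...else fall back to 32 bits: any buffer above 4GiB now
               gets copied through a bounce buffer below the line */
            pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
        } else {
            return -ENODEV;   /* can't DMA at all */
        }
        /* ... rest of device setup ... */
        return 0;
    }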

I wonder if a hardware IOMMU could have caught this bug.

~~~
justincormack
Lots of poor hardware has limitations like this. It does hurt performance.

~~~
pmjordan
I _think_ it's a requirement of PCIe for devices to support DMA to memory with
64-bit physical addresses.[1] In any case, the hardware clearly supports it
here - the bug was that a 32-bit I/O vector was being passed off to the
hardware claiming to be a 64-bit one, and subsequently interpreted as such.
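
Schematically, the bug looks something like this (field and flag names are
made up for illustration; the real driver's differ):

    #include <stdint.h>

    struct sge32 { uint32_t phys_addr; uint32_t length; };
    struct sge64 { uint64_t phys_addr; uint32_t length; };

    #define FRAME_SGL64 0x02   /* invented flag value */

    struct frame {
        uint16_t flags;        /* FRAME_SGL64 set => firmware parses sge64[] */
        union {
            struct sge32 sge32[2];
            struct sge64 sge64[2];
        } sgl;
    };

    /* the bug, in miniature: fill the scatter/gather list as 32-bit
       entries, but flag it as 64-bit... */
    void build_frame_buggy(struct frame *f, uint32_t addr, uint32_t len)
    {
        f->sgl.sge32[0].phys_addr = addr;
        f->sgl.sge32[0].length    = len;
        f->flags |= FRAME_SGL64;
        /* ...so the hardware reads 8 bytes per address field and
           DMAs to a garbage location */
    }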

The situation was certainly bad with classic PCI - I have an nForce5-based
motherboard whose PCI bus silently truncates 64-bit DMA buffer addresses to 32
bits. The onboard ethernet chip does in principle support 64-bit DMA, so I was
getting all sorts of weird behaviour after upgrading my RAM from 2 to 6 GiB.
The southbridge chip in question is now blacklisted for 64-bit DMA in the
Linux kernel.

The 32-bit DMA cutoff isn't the only such constraint on buffer physical
addresses of course; historically, ISA devices could only DMA to 20-bit
addresses IIRC (or was it 24 bits?). I wrote an OS X driver for the _virtio_
paravirtual ethernet adapter a while back[2] and its I/O buffers must be 4K
page-aligned (low 12 bits 0), and its high bits are specified by a 32-bit
number. This means you can only use buffers with physical addresses that fit
in 44 bits. That's 16TiB, which is probably still a way off but could
conceivably be reached not too far in the future. Current x86_64 CPUs
certainly support 48-bit physical addresses in principle.
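
(Quick sanity check on that arithmetic:)

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* a 4K-aligned address is communicated as a 32-bit page frame
           number; shifting it back left by 12 bits caps the reachable
           physical address space at 44 bits */
        uint32_t pfn = UINT32_MAX;                /* largest expressible PFN */
        uint64_t max_addr = ((uint64_t)pfn << 12) | 0xfff;
        assert(max_addr == (1ULL << 44) - 1);     /* 16TiB - 1 */
        return 0;
    }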

[1] That doesn't mean hardware bugs don't exist, but at least it's part of the
basic spec now, as opposed to being tacked on later as it was for PCI.

[2] <http://github.com/pmj/virtio-net-osx>

------
spydum
Enjoyed the read; hats off to you guys for digging in and finding/resolving
and pushing the bugfix upstream.

I do have to wonder: did this bug all of a sudden spring up on you? Seems like
the potential would have always been there on this platform. Could a change
management process have led you to find that "adding megaCLI monitoring to
all KVM hosts" or "upgrading to megacli 8.01.06" was the cause of the problem?
Just curious whether you think a better CM or system inventory could have
allowed you to deduce the problem area a bit more efficiently?

------
jilebedev
> MegaCli specifies the address in one format, but sets a flag indicating that
> it’s in another

As far as I can discern (I'm not a kernel hacker), that is the root cause. Can
anyone with static code analysis experience chime in on whether it's possible
to detect that kind of error by examining the source?

