

Nasty Lockup Issue Still Being Investigated for Linux 3.18 - harshreality
http://www.phoronix.com/scan.php?page=news_item&px=MTg1MDc

======
jacquesm
The more mature a system is, the harder the remaining bugs will be. It's
simple math: easy bugs can be found by just reasoning about the code;
harder bugs can be found with some persistence and sometimes a debugger.
Those are all found relatively early in the life of a codebase. But a
_really_ hard bug that shows up only under stress, on a system that has
been running stable for days, weeks or even months, will have a debug
cycle time roughly tied to how often you can make it occur on a single
machine. Hard to reproduce -> hard to analyze -> therefore hard to fix.

Anything you can do to speed up the occurrence of such a bug should be
reached for first, because just increasing the frequency with which the
bug appears means you'll be able to fix it sooner, and lets you be more
certain that you have actually fixed it once you think you have
identified the problem.

You can waste months on nasty little ones like this. I love it though,
digging in on a bug and not letting go until the sucker is nailed for
good. Especially hardware related bugs (interrupt handlers that are not
100% transparent, for instance) and subtle timing bugs can be super hard
to fix. But that feeling when you finally find it is absolutely
priceless; it's like solving a very complex puzzle. I wish it did not
take such a toll on my nights though :)

~~~
Someone
_" easy bugs can be found by just reasoning about the code"_

I probably live in a dream world, but I would say software bugs of the 'causes
deadlock in a OS that is used in millions of devices' type _should_ (maybe
even _must_ ) be _avoided_ by reasoning about the code. If you find that you
cannot reason about the whole system at the time, you either simplify it
(python's GIL is a nice extreme example; you never read about race conditions
inside Python's core) or you improve your logic.

In my world, you _should_ only need the debugger for hardware related
bugs, and even then it is often a combination of intuition and sheer
perseverance that solves the issue. Especially if timing is involved,
looking at the system can make the problem go away. But even there,
having thought about the system will help a lot. For example, your
reasoning might have said "between two X-es, we always get at least one
Y, so…". You can turn that into a hypothesis that can be tested easily
(in theory; practice can be quite different).
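
To make that concrete, here is a minimal sketch in C of what such a
hypothesis check could look like; the event names and handlers are
entirely invented for illustration:

    /* Hypothetical invariant check: "between two X events we always
     * see at least one Y event". */
    #include <assert.h>
    #include <stdatomic.h>

    static atomic_int ys_since_last_x = 1;  /* start with it satisfied */

    void on_event_y(void)
    {
        atomic_fetch_add(&ys_since_last_x, 1);
    }

    void on_event_x(void)
    {
        /* If the hypothesis holds, at least one Y arrived since the
         * last X; if not, we've caught the bug in the act. */
        int seen = atomic_exchange(&ys_since_last_x, 0);
        assert(seen > 0 && "two X events with no Y in between");
    }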

On the other hand, if you must use modern PC hardware, you get a lot of
unavoidable complexity (multiple interrupt levels, drivers that you
cannot all check for compliance with your model of how drivers work,
etc), so you may end up in a situation where reasoning about the code
isn't really possible.

~~~
nieve
You're not just living in a dream world, you're using the phrase to
excuse a snide strawman. The Linux kernel is 12-20 million lines of code
depending on how you count it (and how far the top end has grown
recently). It is not possible for humans to know and comprehend the
entire codebase, much less avoid all lockup bugs by reasoning about the
code. In fact I'm pretty sure all extant methods of creating a
correctness proof fail at that scale (even if you can ignore drivers).
Much of the kernel requires specialized skills and experience that the
majority of kernel developers don't have.

Beyond that, the kernel developers value both correctness and
performance. Most of these bugs involve multiprocessing, and a huge
amount of effort has gone into getting rid of global locks, but even the
remaining locks aren't easy to reason about at that scale - I think
you're missing the common thread in "deadlock" and "Global Interpreter
Lock". Pretending that a world could exist where we can simply reason
about something like the Linux kernel, and then using it to claim that
devs can simplify or improve the logic, is taking a completely
unsupported cheap shot, and the disclaimer about the real world doesn't
change that. I'm not even sure what your motivation is for the
equivalent of "if you pretend that friction & drag don't exist it's
really easy to figure out aerodynamics (hint: it's not), so plane
designers need to do better", but it certainly isn't to inform,
contribute, or facilitate the debate.

~~~
mrich
Who says a kernel must have 15 million lines of code? I'm pretty sure
there are kernels out there that have been verified correct. Of course
the hardware could still be broken (though hardware can also be proven
correct), or a cosmic ray could flip some bits...

~~~
lomnakkus
L4 was verified (mostly, at least):
[http://ertos.nicta.com/research/l4.verified/approach.pml](http://ertos.nicta.com/research/l4.verified/approach.pml)

------
bjourne
Hey, I've had that bug! The whole computer freezes for up to a minute and then
you get a dmesg stack trace:

    
    
        Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
        CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
        0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
        ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
        ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
        Call Trace:
        <NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
        ...
    

It was easily reproducible by running two or more cores at 100% for a
few minutes. I wanted to be a good citizen and report it, but I wasn't
able to figure out where or how you're supposed to report kernel bugs.
If it's the same bug the Phoronix article mentions, then it's not a
regression, because I had the problem in 3.13 too.
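
For reference, pegging the cores doesn't need anything fancy; something
like this minimal C sketch is enough (the thread count and all names
here are just illustrative):

    /* Spin N threads in busy loops to load N cores.
     * Build with: cc -pthread spin.c -o spin */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void *spin(void *arg)
    {
        (void)arg;
        for (;;)
            ;               /* burn CPU until the process is killed */
    }

    int main(int argc, char **argv)
    {
        int n = argc > 1 ? atoi(argv[1]) : 2;  /* default: two cores */
        for (int i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, NULL, spin, NULL);
        }
        pause();            /* wait forever; Ctrl-C to stop */
        return 0;
    }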

~~~
panzi
Have you tried [https://bugzilla.kernel.org/](https://bugzilla.kernel.org/)?

~~~
bjourne
Afaict, that site only wants bugs for the mainline kernel, not for
kernels modded by Ubuntu.

~~~
cpach
[https://wiki.ubuntu.com/Kernel/Bugs](https://wiki.ubuntu.com/Kernel/Bugs)

------
byuu
So it's said to take days of stress testing for the crash to occur each time.

I've had to trace bugs under emulation that took substantial amounts of
time to trigger. Thankfully, in my case I was able to serialize the
entire system into snapshots. So by just taking a snapshot every few
minutes, the one right before a crash let me quickly reproduce the
problem.

Does this crash event occur inside of VMs as well, and if so, could VM
snapshots be used to accelerate the bug hunt here?

~~~
xuhu
Honest question: what guarantees that if the bug reproduced 10 minutes
after a snapshot, it will reproduce again if you resume running
nondeterministically after that snapshot?

~~~
byuu
That's always an issue, but in my testing with emulators, once you rule
out external inputs (eg keyboard input), it tends to become predictable
and reproducible. And using a VM should help dramatically in reducing
randomness. You keep capturing states and narrowing the window until you
can reproduce it in eg ten seconds, since too much can happen in the
span of ten minutes. The trick is that you have to get a state from
right before the problem actually begins, but you don't yet know where
that is. So it's possible your state capture is a bit too late, after
the issue has already eg corrupted memory somewhere.

I know that there are certainly bugs that this kind of technique would never
work on. I've hit a few bugs that could only be triggered on live hardware
before. But I'm curious if they've tried this kind of approach for this bug
yet or not.

~~~
rwmj
Could you post more about your technique? Like, what emulator are you using
(qemu)? How do you trigger the snapshots? Are you snapshotting memory or disk
or both? How much disk space is consumed by all these snapshots? Do you
discard snapshots? How do you know when the bug has been triggered?

~~~
byuu
Sure, it's my own software. I hook up saving and loading states to key
presses, so eg F5 would save, F7 would load. And then F6/F8 would increment or
decrement the save slot number.

You basically have to capture everything possible: all memory, the state of
all the CPU registers, the state of all hardware registers, etc. Obviously
disk would be a real challenge, where you'd have to keep a delta list of disk
changes since program start, or simply not serialize that state. If you miss
anything, you can have problems loading states correctly. However, there is
quite a bit of tolerance between theory and practice, so if there is something
that you really can't capture the state of for some reason (like a hardware
write-only register when you weren't logging what was previously written to
it), a lot of times you can get away with it anyway.
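
In other words, the serialized state is conceptually just one big
struct. All names and sizes below are invented for illustration; a real
machine's layout will differ:

    /* Hypothetical "capture everything" layout for a small machine. */
    #include <stdint.h>

    #define RAM_SIZE (128 * 1024)

    struct cpu_state {
        uint16_t pc, sp;          /* program counter, stack pointer */
        uint8_t  a, x, y, flags;  /* general-purpose registers */
    };

    struct machine_state {
        struct cpu_state cpu;
        uint8_t hw_regs[256];     /* hardware registers, incl. mirrors of
                                     write-only ones logged at write time */
        uint8_t ram[RAM_SIZE];
        /* disk: a delta list of writes since start, or just left out */
    };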

Because the system I am emulating is so old, snapshots are only 300KB
each. Sometimes I dump them to disk, sometimes I just keep them in RAM.
I know that a PC would be much more challenging, given how much more
hardware is at play and given that VMs aren't quite the same as pure
software emulation like qemu (though you could potentially use qemu for
this too). But VM software _does_ implement this kind of snapshot
system, so clearly it's possible.

You know when the bug has been triggered from the visual output. And
what's cool is that by saving periodic snapshots automatically to a ring
buffer, you can code a special keypress to "rewind" the program. So it
crashes, you go back a bit, save a disk snapshot there, and wait to see
if the bug repeats. If it does, you turn on trace logging and dump all
of the CPU instructions between your save point and the crash. Then you
go to the crash point and slowly work your way back to try and find out
where things went wrong.
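
The automatic ring buffer part is simple. A sketch, reusing the
hypothetical machine_state struct from above (only the rewind arithmetic
is at all subtle):

    /* Keep the last SLOTS snapshots, overwriting the oldest as we go. */
    #define SLOTS 32

    static struct machine_state ring[SLOTS];
    static int head;                /* next slot to overwrite */

    /* serialize/deserialize are emulator-specific; assumed to exist */
    extern void serialize_state(struct machine_state *s);
    extern void deserialize_state(const struct machine_state *s);

    void snapshot_tick(void)        /* call every few emulated seconds */
    {
        serialize_state(&ring[head]);
        head = (head + 1) % SLOTS;
    }

    void rewind_steps(int n)        /* rewind n snapshots, n < SLOTS */
    {
        int slot = ((head - 1 - n) % SLOTS + SLOTS) % SLOTS;
        deserialize_state(&ring[slot]);
    }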

------
Animats
Is this an out-of-memory lockup or something else?

Linux has a history of out-of-memory lockups where disk caching has tied
up most of memory. Linux uses free memory as disk cache, and when memory
is needed, clean cache pages are reused. Sometimes the I/O system has
the cache locked at the moment memory is needed, resulting in an
out-of-memory lockup. This has been fixed in the past, and broken again
in the past.

------
tveita
So has anyone besides Dave Jones managed to reproduce the issue? The
article doesn't say.

This doesn't look like front-page news until it's confirmed to be a
widespread problem.

------
spydum
I find it interesting that someone posted a patch to the KERNEL with a
comment of "doesnt always work?"

It would seem prudent to invest a bit more time in understanding why a
particular thing doesn't work, for such a critical bit of code. (I'm not
implying that I know this is the cause of, or related to, these newly
found bugs, but the article mentions it, so I'm just a little
surprised.)

~~~
cbsmith
> It would seem prudent to invest a bit more time in understanding why a
> particular thing doesn't work, for such a critical bit of code. (I'm
> not implying that I know this is the cause of, or related to, these
> newly found bugs, but the article mentions it, so I'm just a little
> surprised.)

It would indeed. That's no reason not to post the patch though. Welcome to the
imperfect world of software development.

------
ghshephard
The title is a little off - it appears that this is a 3.17 issue as
well, so not a 3.18 regression. What's interesting is that in attempting
to track down what exactly is causing the kernel lockup, the team is
likely to track down other, quite possibly unrelated, kernel issues as
well.

------
dkhenry
I don't think this is new to 3.18. I have had a lockup issue in every
version since 3.2:

[https://bugzilla.kernel.org/show_bug.cgi?id=29162](https://bugzilla.kernel.org/show_bug.cgi?id=29162)

~~~
0x0
That looks like a reiserfs bug. Is there any reason to go with reiserfs these
days instead of ext4, xfs, btrfs?

~~~
lmm
Versus ext4: online resize, and better handling of large directories.
Versus xfs: better at handling power loss/crashes (no "trailing zeroes
problem"); reiserfsck is not 100% reliable, but it can sometimes save
you, whereas if you reach the same situation with xfs you have a 0%
chance of recovering your data. Versus btrfs: more mature (isn't btrfs
still tagged as experimental?), and the same situation wrt fsck.

I use ZFS where I can, but for root filesystems on linux machines reiser still
seems like the best tradeoff.

~~~
feld
reiserfsck used to be a game of Russian roulette -- either it sort-of
works and you can mount your filesystem, or it eats all your data.

I can't say I'd trust my data to it these days.

~~~
lmm
It's a gamble, sure. But if you get into the same situation with XFS or
btrfs, there's no gamble, just guaranteed data loss.

~~~
feld
XFS is not as bad as it once was, but I don't use either of those anyway.

