
Another Case of Obscure CPU Nondeterminism - gbrown_
http://robert.ocallahan.org/2017/06/another-case-of-obscure-cpu.html
======
dom0
rr := [https://github.com/mozilla/rr](https://github.com/mozilla/rr)

rr is a lightweight tool for recording and replaying execution of applications
(trees of processes and threads).
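For readers who haven't tried it, the basic workflow is just two commands (`./your_app` below is a placeholder for whatever binary you want to record):

```
$ rr record ./your_app --your-args   # runs the program, saving a trace
$ rr replay                          # re-executes the latest trace under gdb
```

Replays are deterministic, so the same trace can be replayed (and debugged) as many times as needed.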

~~~
_yosefk
Wow, only learned about rr today - an amazing project! Google for instance
reported a sizable number of "flaky tests" which tend to pass but sometimes
fail. Always running tests under rr would take care of that (since each
failure would be reproducible.)

This is a huge deal. While I prefer the Cilk approach (automated debugging
pinpointing places which can theoretically execute non-deterministically),
it's not always applicable and isn't always or even often applied where
applicable. This is definitely the next best thing, and in absolute terms,
it's pretty damn good.

~~~
DannyBee
" Always running tests under rr would take care of that (since each failure
would be reproducible.)"

That's not actually the hard part :)

It honestly wouldn't help that much, even if it weren't infeasible
for other reasons (the increased resource usage, even at their cited 1.2x, is a
_ton_, etc.).

Even if you could completely reliably restore the state to a random _other_
machine than the one it ran on (i.e. possibly a different arch/memory/etc.
configuration), and you met all the requirements, the main thing rr helps you
with is reproducing the failure, which, most of the time, Google could
actually do anyway.

The hard part is _figuring out what went wrong_. Remember Google already has
asan, ubsan, and tons of other stuff running. So the trivial causes are pretty
much not occurring. It's not like people look at the flakes and say, welp, i
messed up this one variable and that was that!

It's usually torturous debugging of trying to understand the set of conditions
that have occurred.

I.e. the reason there are so many flakes is that there's so much that can go
wrong, not that people can't reproduce the failures.

Also note that in Google's world, RR would probably not be compatible with how
the tests are run anyway (RR wants fairly exclusive access to the perf
counters, but the tests may be getting sampled), and for a set of flakes,
perturbing the performance counter settings will change things enough to make
them less flaky!

~~~
glandium
OTOH, what rr also brings to the table is reverse debugging, and _that_ makes
these kinds of torturous debugging an order of magnitude easier.

So not only are you able to reproduce the error, but you're also able to go
backwards to find _how_ it happened!

I, for one, barely use gdb anymore because reverse debugging with rr makes
debugging _so_ much easier (well, technically, I still do use it, since rr is
not an entirely new debugger: you still end up at a gdb prompt).
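In practice, a reverse-debugging session looks something like this (`buggy_value` is a placeholder variable; `reverse-continue` and friends are gdb's standard reverse-execution commands, which rr implements efficiently):

```
$ rr replay                 # attach gdb to the recorded execution
(rr) continue               # run forward until the crash
(rr) watch -l buggy_value   # watchpoint on the corrupted location
(rr) reverse-continue       # run backwards to the write that clobbered it
(rr) reverse-step           # or step backwards one line at a time
```

The "run forward to the crash, set a watchpoint, run backwards to the culprit" idiom is what makes torturous-to-reproduce corruption bugs tractable.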

FWIW, one of the first things I tried when I first used rr was to debug a
crash I had debugged years earlier, one due to a miscompilation by GCC. As
that had happened years earlier, I didn't remember the details of what code
was miscompiled in what particular way, but I did know that debugging it had
taken a long time, and that I had only figured it out by chance because
valgrind pointed at related code. With rr it took only minutes to find the
root cause of the crash.

~~~
DannyBee
"So not only are you able to reproduce the error, but you're also able to go
backwards to find how it happened!"

For some (and i'd guess a bunch of the people i mentioned in the parent
comment), they definitely find this easier, but just to present a contrarian
position: i actually don't. I admit to being weird - in a former life, i was a
gdb maintainer.

I also was trained by people who believed the right approach was not to
immediately try to find the sets of conditions and variables that caused your
problem and declare victory, but to go and meditate upon the code and think
about it until you understood it well enough to understand why this might
happen even when you think it couldn't. That will often enable you to
understand the code well enough to see what else is wrong.

(Again, i don't claim it's better, i just claim that's why i tend not to care
about RR. The hard part for me is the thinking about the code, not the finding
the sets of conditions and variables that caused a particular set of errors)

It's definitely the case that, personally, when i follow the "find conditions,
fix bugs" approach, i tend to write much buggier code (even with good testing
strategies) than when i follow the other way.

~~~
roca
This reminds me of Linus Torvalds' distaste for debuggers.

I think eschewing debugging is fine for code you understand pretty well and
when you already have significant information about the failure. But when
those conditions aren't met, debuggers are very useful. (NB, if you use
logging code and think "I'm not using a debugger!", you're just using a bad
one.)

It's true that a good debugger tempts one to think less deeply about the code
than one should, but that temptation can be overcome.

------
userbinator
_commodity CPUs running user-space code really are deterministic in practice,
or at least the nondeterminism can be efficiently detected and controlled,
without resorting to pervasive binary instrumentation. We're mostly winning
that bet on x86 (but not ARM)_

I'm curious about that mention of ARM --- since I always thought it was a far
simpler and more rigidly defined architecture, with fewer edge cases and less
undefined (more like implementation-defined) behaviour than x86.

~~~
ajdlinux
[https://github.com/mozilla/rr/issues/1373](https://github.com/mozilla/rr/issues/1373)

This disappoints me, as I'd love to see rr ported to PowerPC, which is another
LL/SC architecture that I suspect might run into this issue.

------
yuhong
The story is that Intel later introduced XSAVEOPT etc. to speed up XSAVE by
saving only the modified state.

