
Debugging an evil Go runtime bug - cmsimike
https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/
======
warent
Marcan's attitude is great; I know of a ton of people (myself included) who
would've written that article with far more complaining interleaved. Super
informative as well, I learned a ton from this article (GRUB 2 feature for
marking off bad RAM? Wow!). Very well written, informative, humorous, etc.
Love it

~~~
jchw
Agreed. Marcan's a chill dude. I saw him a few times in the Dolphin Emulator
IRC and he always had interesting things to say.

From my perspective this approach was pretty unique; going all the way down to
debugging the hardware first may seem obvious to some, but it's a totally
opposite approach to how I'd go about it. My mind would jump directly to
producing a minimal test case. Would've never thought to mark off bad RAM with
an obscure(ish?) GRUB 2 feature. Would've never thought to selectively flip a
kernel flag for some parts of the code.

It's great to get these perspectives from people who really know how to dig
down and debug deep.

~~~
Twirrim
From a slightly more old-school sysadmin approach, I learned to troubleshoot
roughly in line with the OSI model
([https://www.lifewire.com/layers-of-the-osi-model-illustrated...](https://www.lifewire.com/layers-of-the-osi-model-illustrated-818017)),
starting at layer 1 (physical) and working up.

That's not to say I spend a whole lot of time looking at the lower levels, but
my quick mental checklist starts off down at physical points, and I try to
quickly eliminate possibilities. In a lot of cases it's obvious it's a code /
logic bug, and you can completely skip the lower layer stuff, but making it a
conscious step pays off.

------
frikkasoft
That is such a well-written post, and it showcases some serious diagnostic
skills by an experienced person.

Btw, thanks for telling me about the badram GRUB 2 feature, had no idea that
existed.
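
For anyone curious how a bad-RAM pattern picks out addresses: GRUB's `badram` takes address/mask pairs, and a page is filtered out when its address agrees with the base address on every bit that is set in the mask. Below is a minimal Go sketch of that matching rule only. The helper name `matchesBadRAM` and the sample addresses are made up for illustration and are not taken from the article or from GRUB's source.

```go
package main

import "fmt"

// matchesBadRAM reports whether address a is covered by one addr/mask
// pattern. Hypothetical helper: it models the matching rule, nothing more.
func matchesBadRAM(a, addr, mask uint64) bool {
	return a&mask == addr&mask
}

func main() {
	// Illustrative pattern: the 4 KiB page starting at 0x12345000.
	const addr, mask = 0x12345000, 0xfffffffffffff000
	fmt.Println(matchesBadRAM(0x12345abc, addr, mask)) // true: same page
	fmt.Println(matchesBadRAM(0x12346000, addr, mask)) // false: next page
}
```

Clearing more bits in the mask widens the excluded region, so a single pair can cover a whole stripe of bad cells.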

------
ereyes01
Man I loved the approach to narrow down the offending object file using the
regex on the SHA hash of the binaries. That would've saved me lots of time
hunting and guessing bugs with cscope+kdb back in my kernel hacker days!

------
cfstras
Hah. I remember Bryan Cantrill complaining about this exact thing. Glad that
it's fixed.

Turns out somebody else did, too:
[https://twitter.com/bcantrill/status/774290166164754433?lang...](https://twitter.com/bcantrill/status/774290166164754433?lang=en)

/edit: spelling

------
donquichotte
Wow, these are some serious debugging skills. I also admire the tenacity and
the will to investigate the root cause.

> I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level
> thread to run Go code. This also stopped the crashes, again pointing
> strongly to a concurrency issue.

I think I would have stopped there.

------
shurcooL
The investigation in the linked Go issue [1] is also impressive.

[1]
[https://github.com/golang/go/issues/20427](https://github.com/golang/go/issues/20427)

~~~
smegel
Don't forget about this one, similar but affects BSD and still open:

[https://github.com/golang/go/issues/15658](https://github.com/golang/go/issues/15658)

------
stmw
This is a great story, probably wins the year for "best bug you've ever
encountered?" question. Having implemented some weird runtimes for weird
languages, I am sympathetic to the Go team here -- these odd tradeoffs of
pushing the envelope on OS <-> your_own_compiler interactions can trigger some
wild experiences.

------
FLUX-YOU
> Since the problem gets worse with temperature, what happens if I heat up the
> RAM?

Neat. I wonder if that makes Rowhammer more likely to occur.

~~~
dboreham
Probably. Hotter usually means closer to not working for semiconductors.

~~~
stmw
Possibly, although hotter fundamentally means more thermal noise, which might
actually reduce correlations / ability to communicate effectively between
adjacent circuits.

Think of it as SNR (signal-to-noise ratio) -- increasing temperature increases
thermal noise (there are other kinds), and with the same signal, it should
actually reduce the efficiency of the side channel.

But it brings up a good question, I wonder if anyone has studied this...
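
To put a rough number on the SNR framing: thermal (Johnson-Nyquist) noise power is k·T·B, so with a fixed signal the SNR falls as temperature rises. The Go sketch below only illustrates that arithmetic. The signal power and bandwidth are made-up values, and nothing here models Rowhammer or DRAM behaviour.

```go
package main

import (
	"fmt"
	"math"
)

const boltzmann = 1.380649e-23 // J/K

// thermalNoiseWatts is the Johnson-Nyquist noise power k*T*B.
func thermalNoiseWatts(kelvin, bandwidthHz float64) float64 {
	return boltzmann * kelvin * bandwidthHz
}

// snrDB converts a signal/noise power ratio to decibels.
func snrDB(signalW, noiseW float64) float64 {
	return 10 * math.Log10(signalW/noiseW)
}

func main() {
	const signal, bw = 1e-9, 1e9 // made-up values, for illustration only
	cool := snrDB(signal, thermalNoiseWatts(300, bw))
	hot := snrDB(signal, thermalNoiseWatts(350, bw))
	fmt.Printf("SNR drops by %.2f dB\n", cool-hot)
}
```

Going from 300 K to 350 K costs well under 1 dB in this model, so thermal noise alone looks like a small effect compared with whatever else heat does to the cells.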

------
squeed
For a similar tale of vDSO getting someone in trouble, check out this fun talk
"Really crazy container troubleshooting stories":
[https://media.ccc.de/v/ASG2017-115-really_crazy_container_tr...](https://media.ccc.de/v/ASG2017-115-really_crazy_container_troubleshooting_stories)

------
0x0
Setting up for a 104 byte stack seems pretty crazy, wouldn't you risk
overrunning the red zone even without all that stack probing?
[https://en.wikipedia.org/wiki/Red_zone_(computing)](https://en.wikipedia.org/wiki/Red_zone_\(computing\))

~~~
zaarn
The redzone is only non-explicit stack and only really matters if you violate
it. If vDSO allocates stack properly, which it should considering it's an
exported function, there is no problem.

~~~
0x0
How does being an exported function change the rules, I thought the x86-64 ABI
mandated an implicit safe 128 bytes below rsp at all times? Also, how can a
vDSO function "allocate" stack? It would have to know about the current stack
space as configured by the go runtime, and somehow dig into this go-runtime-
specific record of the current stack limit? Isn't the only available option
for any exported function just to /use/ pre-allocated stack space (by
subtracting from rsp) - I don't see how it could possibly extend the pre-
allocated stack.

~~~
zaarn
The Red Zone is only a thing you _should_ be doing within a function. Once you
call another function you need to update the RSP regardless.

This "at all times" is most relevant when you have interrupts in your
execution that you can't predict. When an interrupt runs, you can't know how
much of the redzone has been used, so you assume 128 bytes below the current
RSP value.

> Also, how can a vDSO function "allocate" stack?

By updating RSP and using the stack properly, that is what I mean with
allocating the stack. It's normal code.

> It would have to know about the current stack space as configured by the go
> runtime, and somehow dig into this go-runtime-specific record of the current
> stack limit?

No, you simply use push and pop as usual. The go routine sets up the stack,
the bug in the blogpost is that the go runtime assumed the vDSO would only use
a couple bytes on the stack, but it used more because it did a stack probe
about 4 kilobytes into the stack.

> Isn't the only available option for any exported function just to /use/
> pre-allocated stack space (by subtracting from rsp) - I don't see how it
> could possibly extend the pre-allocated stack.

Not at all, as previously mentioned, stack is simply the value in RSP, you can
update that as you want. PUSH and POP the values you need and the OS will
implicitly allocate memory for the stack as needed.

~~~
0x0
Agreed on all those points (except - I think interrupts use the kernel stack?
Otherwise triggering an interrupt would clobber the red zone, which should be
preserved, since it pushes the return address and flags IIRC? So interrupts
don't "use" the red zone)

I was just wondering, with the Red Zone in the ABI specification requiring 128
bytes below RSP, wouldn't that also mean any function can assume the 128 bytes
below RSP are actually mapped? If Go sets up a stack with only 104 bytes to
go, then couldn't accessing rsp-105 to rsp-128 from the vDSO (which should be
safely within the red zone) risk causing a segfault even if the function
didn't do the race-y 4k stack probe?
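
The arithmetic behind that worry, as a small Go sketch. The 128-byte red zone is from the x86-64 System V ABI and the 104 figure is the one quoted in this thread; `redZoneOverhang` is a made-up helper, and whether the overhang ever matters depends on what the callee actually touches.

```go
package main

import "fmt"

const redZone = 128 // bytes below RSP reserved by the x86-64 System V ABI

// redZoneOverhang returns how many red-zone bytes fall below the reserved
// region when only `reserved` bytes remain under the stack pointer.
func redZoneOverhang(reserved int) int {
	if reserved >= redZone {
		return 0
	}
	return redZone - reserved
}

func main() {
	fmt.Println(redZoneOverhang(104)) // 24
}
```

So a callee that used its full red zone could reach 24 bytes past what was set aside, independent of any stack probe.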

~~~
zaarn
> (except - I think interrupts use the kernel stack? Otherwise triggering an
> interrupt would clobber the red zone, which should be preserved, since it
> pushes the return address and flags IIRC? So interrupts don't "use" the red
> zone)

Interrupts may use the kernelstack, you can use the userstack but this is
trouble if they don't uphold the redzone, which they may not.

The redzone is a mere suggestion the compiler _may_ apply when it exposes
functions elsewhere. The stack is mostly free game.

When you are writing a kernel, the red zone is off, I may have worded this
wrong previously. The userspace should use the redzone, the kernel can't,
interrupts must not assume a redzone since they don't have the time to add or
sub from the RSP.

> wouldn't that also mean any function can assume the 128 bytes below RSP are
> actually mapped?

The stack is always sort of mapped and not at all. Normally the stack region
is unmapped in the page table. Should a process run the stack into an unmapped
region, the OS will generally map this region if memory is available. And
because programmers are silly, the OS will also happily map memory way above
the last mapped page if accessed.

Go setting up a 104 byte stack only means that the runtime assumes the vDSO
will not use more than 104 bytes of stack and another go routine may be
allocated just after those 104 bytes. So if the vDSO does its stack probe 4k
into foreign stack just under the current thread stack, it will cause a race
condition on the memory write.
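
Why an OR with zero can still corrupt memory: it is a read-modify-write, and without a lock prefix nothing stops another thread from writing the same location between the read and the write-back. The Go sketch below simulates that interleaving deterministically instead of using real threads. `probe` and the `between` callback are hypothetical stand-ins, not the vDSO code.

```go
package main

import "fmt"

// probe models `or $0, mem[i]` as a non-atomic read-modify-write.
// between stands in for another thread that runs after the read and
// before the write-back.
func probe(mem []byte, i int, between func()) {
	v := mem[i] // read
	between()   // the other thread's write lands here
	mem[i] = v | 0 // write back the stale value
}

func main() {
	mem := []byte{1}
	probe(mem, 0, func() { mem[0] = 2 })
	fmt.Println(mem[0]) // 1: the concurrent write of 2 is lost
}
```

The probe never changes the value it read, yet the other thread's update still disappears, which fits the rare and random-looking corruption described in the post.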

The stack probe is there to prevent overflowing the stack into the heap, there
is always a guard page between user memory and user stack which will kill the
task if it is accessed. So the probe will jump into this guard page and kill
the program if any funny business is going on.

~~~
0x0
But shouldn't Go really assume that the vDSO (or any other external function)
will use at least 128 bytes of stack then? If go only reserves 104 bytes, then
even without all this 4k probe business, there would be a risk that the vDSO
could overwrite up to 128 bytes (which could be another go-routine's stack, or
an unmapped page (which would probably cause a segfault if Go's trap handler
for segfaults doesn't expect an access there, instead of the OS or the trap
handler silently mapping more memory))? I mean, isn't the assumption flawed
from the beginning by reserving less than 128 bytes?

Also, regarding:

> Interrupts may use the kernelstack, you can use the userstack but this is
> trouble if they don't uphold the redzone, which they may not.

I don't see how this is possible without corrupting the redzone. The CPU
pushes the return RIP and the flags onto RSP when an interrupt occurs, which
would overwrite parts of the redzone, so the damage would be done on the
userspace stack even before the first instruction of the interrupt handler is
executed.

~~~
zaarn
> I mean, isn't the assumption flawed from the beginning by reserving less
> than 128 bytes?

Not technically since the vDSO does not use more than 104 bytes of stack. The
problem was the compiler assuming wrongly there might be more to have and
writing data above this limit.

The actual stack usage of a vDSO is not documented afaik.

------
igravious
Ninja level debugging and diagnostic skills. A fascinating read from start to
finish. Bonus points for the GRUB 2 feature for masking out bad RAM blocks –
still dreaming of owning a laptop with ECC memory :/

------
emmelaich
Such a thorough and well written write up.

To think that some experienced programmers I know declare that concurrency is
easy.

~~~
EpicEng
It's only 'easy' because

A) other (probably better?) engineers have created abstractions for them, and

B) they've never had to debug a truly difficult issue related to concurrency

------
cwzwarich
Why doesn't the vDSO code just use MOV in its stack probe rather than an
OR?

~~~
jicks
Apparently, because it's shorter [0].

[0]: See
[https://lkml.org/lkml/2017/11/10/348](https://lkml.org/lkml/2017/11/10/348)

~~~
cwzwarich
Why is it shorter? Both MOV and OR have one byte encodings, and with the OR
you either have to use an immediate zero (which burns a byte) or materialize
zero in some other way. As that email points out, the entire sequence would be
shorter using a different addressing mode anyways. And a read-modify-write is
definitely slower at runtime.

~~~
simcop2387
I wonder if it's because it's safer in that it doesn't change anything there
if you've gone over the stack limit and into the heap? I know that
-fstack-protect was designed a long time ago, possibly before guard pages and
before 64bit addressing.

------
fierro
this is incredibly impressive

~~~
MrBuddyCasino
That was Captain Ahab level persistence. I wonder how long it took him.

~~~
squeed
I was one of the spectators on the Prometheus thread. It took him 2 days. It
was insane.

~~~
nikanj
I ordinarily speak against the whole "10x engineer" trope, but this case
would've easily taken me 20 days. Or months.

------
alistproducer2
I learned so much from that post. The author clearly loves tinkering with
computers. I wish I had that same level of curiosity. Well done.

------
euph0ria
Thanks for taking the time to do this writeup! Super fun to read and
informative.

------
ezoe
Not only does he do that, he also explains the debugging procedure like we're
5 years old. I'm so impressed.

------
jbub
I wish I could do something like this one day :) Impressive skills by the
author! Well done!

------
voiper1
An awesome mystery story!

