

Anatomy of a Compiler Bug - nealyoung
http://www.mikeash.com/pyblog/friday-qa-2013-06-28-anatomy-of-a-compiler-bug.html

======
CountHackulus
I used to work on compilers, C/C++, COBOL, PL/I, Java, a whole bunch of them.
This is actually somewhat similar to my favorite bug I encountered while
implementing some stack mapping optimizations.

The compiler itself was crashing out while compiling a SPEC2000 test case
(perlbmk I think?) with an illegal instruction. This was already quite suspect
since it was branching to somewhere WAY outside of where the program usually
resides, and the compiler was compiled with a compiler that's known to be
nearly rock solid. I got quite lucky in that I managed to find one level of
the stack trace, and it pointed me towards sprintf. Using some awesome tools
some coworkers and I had developed over the years, I managed to narrow down
the test case to about 5 lines of code that involved long doubles. So I
grepped the compiler source code for sprintf, set breakpoints on the ones that
I thought would get called, and just kept stepping through them until it
finally crashed hard. Then I just reran, and stopped at the final breakpoint
and started stepping through the assembly. What I saw happen just blew my
mind. The code was just a simple:

sprintf(buffer, "fold: %Lf", result);

But what was happening is that the buffer was only 200 characters long, and
the formatted long double was roughly 1000 characters long. It was just a
buffer overflow, one so long that it ended up overwriting the register save
area and the return address. So the sprintf completed, but when it went to
branch back, it loaded some characters instead of the return address. Just
hilarious, and good thing I was working on stack mapping and was familiar with
the stack layout of this linkage convention.
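
To get a feel for the size of that overflow, here is a minimal C sketch that measures how many characters "fold: %Lf" needs for a given value. The function names are made up for illustration, and the sketch assumes a C99 snprintf (which the platform in the story lacked), so it only shows the arithmetic on a modern system. The exact worst-case length depends on the platform's long double format.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: number of characters "fold: %Lf" produces for `value`,
   not counting the terminating NUL. C99 snprintf returns the would-be length
   when called with a NULL buffer and a size of 0. */
static int fold_message_length(long double value)
{
    int length = snprintf(NULL, 0, "fold: %Lf", value);
    assert(length >= 0); /* a negative result means an output error */
    return length;
}

/* Hypothetical helper: length needed for the largest finite long double.
   %Lf prints every digit before the decimal point, so this grows with the
   exponent range of the type. */
static int fold_message_worst_case(void)
{
    return fold_message_length(LDBL_MAX);
}
```

A small value such as 1.5L needs only 14 characters, but fold_message_worst_case() comes out well above 200 on common platforms, which is how a 200-character buffer gets overrun.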

The solution of course was to just use snprintf instead. No sorry, that's
wrong since that platform doesn't have an snprintf (yay mainframes!), and so I
had to use %0.6Lg instead of %Lf.
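
A rough C sketch of why switching the conversion helps, assuming the same 200-character buffer: %Lg falls back to exponent notation for large or tiny values, so the output is limited to a sign, six significant digits, a decimal point, and a short exponent. format_fold and FOLD_BUFFER_SIZE are hypothetical names, not taken from the compiler in question.

```c
#include <assert.h>
#include <stdio.h>

enum { FOLD_BUFFER_SIZE = 200 }; /* buffer size mentioned in the story */

/* Hypothetical helper: formats `result` with %0.6Lg, as in the fix described
   above. With six significant digits, %Lg switches to exponent notation once
   the decimal exponent is below -4 or at least 6, so the text stays short no
   matter how large the value is. Returns the number of characters written. */
static int format_fold(char buffer[FOLD_BUFFER_SIZE], long double result)
{
    int written = sprintf(buffer, "fold: %0.6Lg", result);
    /* Sanity check only: it runs after the write, so it documents the bound
       rather than enforcing it. */
    assert(written >= 0 && written < FOLD_BUFFER_SIZE);
    return written;
}
```

With this format, 1.5L comes out as "fold: 1.5" and 1e300L as "fold: 1e+300", both far below the buffer size.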

Compilers are fun!

~~~
asveikau
> that platform doesn't have an snprintf (yay mainframes!)

One trick I did once when I wrote code somewhere that didn't have snprintf:
create a pipe, fprintf into it, and only read at most N bytes back.

It worked and was portable but I'm sure the performance was horrible; this
wasn't anything professional, I was just messing around as a kid (back when it
was more common to come across platforms that hadn't gotten to SUSv2 or C99
yet). Probably a better solution would be to steal an implementation from an
open source libc.
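
For the curious, the pipe trick might look roughly like the C sketch below. It assumes a POSIX system (pipe, fdopen, read), and pipe_snprintf is a made-up name. One big caveat: the formatted text has to fit in the kernel's pipe buffer, because nothing reads from the pipe until the writing is finished. Anything larger would block forever, so treat this as a toy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: writes formatted text into a pipe, then reads back at
   most cap - 1 bytes and NUL-terminates. Returns the number of bytes stored,
   or -1 on failure. Unlike the real snprintf, it does not report how long
   the full output would have been.
   Assumption: the output fits in the pipe buffer; larger output deadlocks. */
static int pipe_snprintf(char *out, size_t cap, const char *format, ...)
{
    int fds[2];
    FILE *writer;
    va_list args;
    size_t used = 0;
    ssize_t got;

    assert(out != NULL && format != NULL);
    if (cap == 0 || pipe(fds) != 0)
        return -1;

    writer = fdopen(fds[1], "w");
    if (writer == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    va_start(args, format);
    vfprintf(writer, format, args);
    va_end(args);
    fclose(writer); /* flushes the text into the pipe and closes the write end */

    /* Read back no more than fits; the rest is discarded with the pipe. */
    while (used < cap - 1 &&
           (got = read(fds[0], out + used, cap - 1 - used)) > 0)
        used += (size_t)got;

    close(fds[0]);
    out[used] = '\0';
    return (int)used;
}
```

Calling pipe_snprintf(buf, 8, "%s", "abcdefghijklmnop") leaves "abcdefg" in buf and returns 7. Every call costs several system calls, which fits the "horrible performance" remark above.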

~~~
asveikau
Not sure why that was downvoted. I'm sorry I offended you, downvoter, but in
my defense I've seen much more inane comments not downvoted here.

------
mrich
Great analysis and tenacity in hunting this one down :)

Letting new compilers loose on existing codebases is always fun and you learn
lots of things in the process, I can only recommend it. I debugged a problem
once that also had to do with interfacing runtime-generated code with
compiletime-generated code. There were differences in the expectations of the
ABI, which is described in this bug:

[http://llvm.org/bugs/show_bug.cgi?id=12207](http://llvm.org/bugs/show_bug.cgi?id=12207)

It only surfaced when compiling the codebase with clang (previously gcc). Took
quite some digging to find the problem.

~~~
Bootvis
Great analysis, indeed. I miss one thing though: what was causing the
incomplete printing of 'Testing' in the loop?

~~~
mikeash
That's a good question. I just attributed that to general corruption during
the run, but never checked it out in detail. Presumably the calculation of the
string pointer was sometimes being offset by a few bytes somehow, but I don't
know exactly why.

------
limmeau
Thanks for sharing -- I love puzzles like that.

Just curious: Were the lldb session snippets taken from the original debugger
session? I keep getting weird looks when I use a command-line debugger just to
have a transcript afterwards. (The weird looks being from people who'd rather
send me a screenshot of their GUI debugger's call stack.)

~~~
mikeash
If you mean the original bug from last year or so, no. I didn't even have it
on my computer, and was kind of remote-debugging through my friend to figure
it out. It only happened on 10.6, and I didn't have that handy, so I couldn't
check it out locally.

------
comex
Nice article. But I wish it had gotten into an explanation of the actual bug
in LLVM's code. Anyone have a bug number?

~~~
mikeash
I wanted to get into that as well, but ran out of brainpower. This is
supposedly the revision that fixes the bug:

[http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-...](http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120702/145937.html)

Would love to hear your conclusions about what the ultimate cause was if you
get that far.

------
sublimit
Haha, it's almost like some detective short story.

Surprisingly it wasn't as hard to follow as I thought. Maybe I'm starting to
get good at this Computer Science thing.

~~~
mikeash
Must be my fantastic writing.

Seriously though, assembly isn't all that hard, mentally. It's difficult in
the same way that unloading a truck full of sand with your bare hands is
difficult. Which is to say, it takes a long time, but all it really needs is
time, not deep thought.

People get scared away from it because they don't know where to start with it,
or because it looks really hard, or because it just takes too much time to
understand what's going on, but it can be really rewarding.

~~~
sublimit
Or because they don't see immediate value in it, as typical programming isn't
done in ASM nowadays. I think it's worthwhile just to "demystify" computers,
but since embedded programming, demos and retro gaming are part of my
interests, I get practical benefits out of it as well.

~~~
ryanmolden
True, but when it comes in handy it can be immensely so. I was looking at a
customer crash dump at work the other day and the callstack shown by two
different debuggers didn't make any sense: it was showing call sequences the
code clearly didn't make. Looking at the raw stack in memory and using the
disassembly to help see the layout of the frames made it (relatively) easy to
see there were a couple of calls in the sequence that both debuggers simply
weren't showing. Once I had the real sequence and the disassembly to see how
the frames evolved and recompute call targets for earlier in the frame's life,
it became a lot easier to see what was going wrong. If I hadn't been able to
do that (and my ASM skills are definitely not strong) this would have been a
No Repro (and hence no fix) bug for sure.

