Perhaps my favorite part of that paper is the fact that there is no ground truth (section 2.6). They discover bugs by testing random programs against multiple compilers. If the results from the compilers disagree, then there must be a bug in at least one of them. (They guarantee that the inputs are legal.) In theory, it's possible that all of the compilers could be wrong in the same way, which means they wouldn't discover that bug. In practice, this is extremely unlikely, but you can't know for sure. (They also never saw an instance where there were three different results from three compilers; at least two of the compilers always agreed.)
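To make the voting idea concrete, here is a minimal sketch in Python. It is not the paper's actual harness: the function name `find_outliers`, the compiler names, and the checksum strings are all made up for illustration, and it assumes each compiler's build of the test program has already been run and its output captured.

```python
from collections import Counter


def find_outliers(results):
    """Majority vote over outputs (hypothetical helper, not from the paper).

    results maps a compiler name to the output its build of the test
    program produced, for example a checksum string.
    Returns (majority_output, outliers). If no output has a strict
    majority, returns (None, all compiler names).
    """
    if not results:
        raise ValueError("need at least one compiler result")
    counts = Counter(results.values())
    output, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        # No strict majority: we cannot say which compiler is wrong.
        return None, sorted(results)
    # Compilers that disagree with the majority are the bug suspects.
    outliers = sorted(name for name, out in results.items() if out != output)
    return output, outliers


# Example with made-up outputs: two agree, one differs.
print(find_outliers({"gcc": "a1", "clang": "a1", "other": "b2"}))
```

An empty outlier list only means the compilers agree with each other, not that they are right, which is the "no ground truth" caveat above.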
How does this make sense?
If the result differs from the specification, it is a bug.
If the result is unspecified in the specification, the different compilers can differ as much as they want without any of them being considered buggy.
If they can guarantee that the test programs stay within specified behavior, the approach finds a subset of bugs, with no false positives.
A large part of the C standard is implementation-defined (see acqq's post here: http://news.ycombinator.com/item?id=4131828), so the results could differ across compilers without that being a bug, and STILL be completely within spec.