
A Clever Way to Find Compiler Bugs - furcyd
https://rjlipton.wordpress.com/2019/11/19/a-clever-way-to-find-compiler-bugs/
======
0xid
Interesting but not new. Back in 2005 I implemented a random syntax-forest
walker that generated random Verilog/VHDL test cases. A colleague extended
the tool so it compared simulation results against other simulators. It became
part of the Quartus II code base at Altera (acquired by Intel in 2015). On the
first day we generated 10k test cases and found ~1k failures, which boiled
down to ~20 bugs.

------
alain94040
This is a silly article.

 _This is the clever idea of this paper. They assume that there are at least
two compilers say {X} and {Y}. Then let {A} be the output of {X(P)} and let
{B} be the output of {Y(P)}. The key insight is: If {A} is not equal to {B},
then one of the compilers is wrong._

Seriously? It's called having a reference, and comparing your fancy compiler
against a reference to make sure the results match.

That's the entire insight of that article, after spending half a page
discussing the order of authors in a paper.

Really weird.

~~~
setr
The idea seems to be a bit broader than that: we can test any group of
compilers, none of which we trust to be correct, by comparing their outputs
against each other.

Though it won't tell us who is wrong, it tells us that one of the pair is (or
both are).

~~~
WalterBright
> Though it won't tell us who is wrong, it tells us that one of the pair is
> (or both are).

If it's implementation-defined behavior, it won't even tell you that.

~~~
recursivecaveat
Csmith is carefully designed to only generate fully defined programs. That
obviously limits the scope of bugs it covers quite a lot, but it still managed
to uncover quite a few. Its biggest weakness is the difficulty of minimizing
and root-causing bugs.

~~~
zellyn
regehr's posts on program minimization are quite fascinating in their own
right:
[https://blog.regehr.org/archives/1679](https://blog.regehr.org/archives/1679)

------
dwheeler
It's a decent technique for checking compilers, but it's older than that.
Ariel Faigon discusses the approach in an undated blog post:
[http://www.yendor.com/testing/#compiler](http://www.yendor.com/testing/#compiler)

I don't know the date, but I cited it in my 2009 dissertation:
[https://dwheeler.com/trusting-trust/dissertation/html/wheeler-trusting-trust-ddc.html](https://dwheeler.com/trusting-trust/dissertation/html/wheeler-trusting-trust-ddc.html)

~~~
pascal_cuoq
The original article that describes differential fuzzing for compilers is from
1998. In addition, that article is extremely readable:

[https://pdfs.semanticscholar.org/fc88/1e8d0432ea8e4dd5fda497...](https://pdfs.semanticscholar.org/fc88/1e8d0432ea8e4dd5fda4979243cac5e4b9e3.pdf)

~~~
payne92
Being the “Andy Payne” cited in the acknowledgment, I can authoritatively say
“can confirm”.

I built an implementation of this method around 1992-93.

------
carapace
> As of early 2011, the under-development version of CompCert is the only
> compiler we have tested for which Csmith cannot find wrong-code errors. This
> is not for lack of trying: we have devoted about six CPU-years to the task.

Buries the lede?

Sure, formal methods have a high overhead, but for compilers maybe it's worth
it?

~~~
ncmncm
CompCert does not generate production-grade code. Which is to say, it does not
optimize as well as production-grade compilers. There are circumstances where
this is not so important; anywhere one would otherwise use a slow language,
like JavaScript or Java, CompCert output might be fast enough.

More important, though, is that CompCert output is not, generally, more
correct than its input. So unless you are formally verifying the program
source, CompCert is unlikely to make your resulting system more correct. Most
of us go years without finding a bug caused by our compiler. Furthermore,
unless you are formally verifying your system specification, a formally-
verified program may not help much. Being definitively wrong is rarely much
better than accidentally wrong.

But what do you use to verify your specification? Formal methods can verify
numerous properties, but correctness is not among them. Similarly: programming
language standards -- specifications -- are also not formally verified.
Correctly implementing a buggy language design gets you little closer to a
correct system.

Testing finds bugs in all layers, but not reliably. Bugs found in testing tend
to be implementation bugs, at first, but soon testing finds mainly
specification bugs.

~~~
carapace
I'm hearing, "The tolerances on our gears aren't that great to begin with so
why care if sand is falling in?"

You're making good points, but the undeniable fact that CompCert is not a
panacea doesn't contradict the fact that other compilers are buggier?

~~~
ncmncm
Other compilers are buggier in ways that demonstrably do not much affect the
correctness of systems built with them.

It's a matter of managing expectations. How much less-buggy would systems be,
if built with compilers that are less buggy? The number is probably not
negative, but the space between it and zero would be hard to see.

So it is an interesting achievement, academically, but a long way from
industrially useful. Another way to say it is that it is a necessary step in
achieving verifiable systems, but there are many other such steps yet to be
taken.

~~~
carapace
Oh, well, if you're going to be reasonable about it... :)

I still feel like the conclusion is wrong, so one of the axioms or deductions
must be wrong, but I can't quite put my finger on it.

In any event, well met.

------
perlgeek
> If {A} is not equal to {B}, then one of the compilers is wrong.

Or the program relies on undefined behavior.

~~~
IshKebab
That possibility is mentioned - read a bit further.

------
sitkack
Is this a fertile ground tilled by the plow of C, or is it broadly applicable
to other languages like Scala, Rust or Haskell?

~~~
lmkg
You need two different compilers, and the ground is more fertile the more
compilers you have. It would be more applicable to standards-based languages
with a robust ecosystem of implementations like Common Lisp or SQL. Less so
languages where there's mostly just the one compiler, like Scala or Rust.

You can still use it to test different settings on a single compiler, though.
Like optimization settings or target architecture.

------
anfilt
The example compares signed and unsigned chars, so 255 is cast to -1 in the
comparison. The compiler is not wrong for returning 1. Generally, read left
to right: the first operand is a signed char, so the compiler may then try to
convert any types to the right of the operator to a signed value as well.

If you're comparing between types, make your casts explicit.

~~~
jjnoakes
Promotion rules are more complicated than just left to right. In this example
both sides are promoted to int.

~~~
anfilt
I am quite aware of that; keep in mind I said generally. However, the
compiler is still not wrong to return 1 in the example. Let me show the parts
of the standard as to why.

First, the operation is a relational expression. Both signed char and
unsigned char are of arithmetic type, so we follow the usual arithmetic
conversions.

[https://i.ibb.co/92bHJMj/1.png](https://i.ibb.co/92bHJMj/1.png)

So when converting integer types, the following applies, particularly the
section I have highlighted. Signed and unsigned char are also of the same
rank, so the signed char is converted to unsigned char. However, we have a
problem: -1 can't be represented in unsigned, so we need to look a bit
further.

[https://i.ibb.co/jRhqd9H/2.png](https://i.ibb.co/jRhqd9H/2.png)

If we look here, we can see that 3 applies. As you can see, implementation-
defined.

[https://i.ibb.co/8XrXt34/3.png](https://i.ibb.co/8XrXt34/3.png)

However, if we promote to a signed int, as you suggest happens, we still run
into the same thing.

[https://i.ibb.co/RppGK5D/4.png](https://i.ibb.co/RppGK5D/4.png)

~~~
jjnoakes
You are close, but a little off.

Your first citation is correct - we have to perform the usual arithmetic
conversions.

In your second citation, though, you missed the very first paragraph in your
picture:

> The integer promotions are performed on both operands, _then_ the following
> rules apply:

(emphasis is mine).

Before you apply those rules you have to perform the "integer promotions";
they are defined in section 6.3.1.1 subsection 2 and they say, in part, that
operands of rank smaller than int (which we have in this example) are
converted to int first (which is what my original reply was pointing out).

Once both operands are converted to int, no more "usual arithmetic
conversions" are needed, so the resulting operation is done on two ints, and
the result has to be 0.

This is not implementation-defined.

(By the way your last citation is for _extended_ integer types, which we are
not dealing with here - we are dealing with the basic integer types here).

------
_bxg1
My first thought was, assuming your compiler is self-hosting, run it through
itself a series of times. It seems like, if there's something wrong with it
(which would have to affect its own code, admittedly), its behavior would
"drift", getting more wrong with each iteration.

~~~
infogulch
Compilers aren't a complete - or even representative - sample of source code
that they are capable of compiling. If a bad output only manifests with source
code that is valid but is not present in the self-hosted compiler's own source
code, then this method will never find it.

~~~
_bxg1
I acknowledged that already. It wouldn't be a comprehensive test, just a novel
one that wouldn't rely on there being another compiler available for the same
language.

~~~
Dylan16807
The drifting you describe would pretty much never happen. The chance of
outputting wrong code that affects self-compilation without breaking self-
compilation, that then _repeats_? Very very small. And the compiler's already
being used to compile itself all the time, so you're also testing some of the
most-tested code.

