A Clever Way to Find Compiler Bugs (rjlipton.wordpress.com)
78 points by furcyd 8 months ago | 44 comments



Interesting, but not new. Back in 2005 I implemented a random syntax-forest walker that generated random Verilog/VHDL test cases. My colleague extended the tool so it compares simulation results with other simulators. This became part of the Altera (acquired by Intel in 2015) Quartus II code base. On the first day we generated 10k test cases, found ~1k failures, and they boiled down to ~20 bugs.


This is a silly article.

> This is the clever idea of this paper. They assume that there are at least two compilers say {X} and {Y}. Then let {A} be the output of {X(P)} and let {B} be the output of {Y(P)}. The key insight is: If {A} is not equal to {B}, then one of the compilers is wrong.

Seriously? It's called having a reference, and comparing your fancy compiler against a reference to make sure the results match.

That's the entire insight of that article, after spending half a page discussing the order of authors in a paper.

Really weird.


The underlying idea is differential fuzzing. It's useful and can often be surprisingly effective, but is pretty obvious in application. It's disappointing to see the author bury it under such obtuse language.
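In pseudocode-ish C, a minimal harness looks something like this (the compiler names, file names, and use of a POSIX shell are placeholder assumptions, not anything from the article):

    /* Differential-testing sketch: compile one generated program with two
       compilers, run both binaries, and flag any disagreement. A mismatch
       means at least one compiler (or the test case itself) is wrong. */
    #include <stdlib.h>

    int main(void) {
        /* "cc1", "cc2", and "test.c" are hypothetical placeholders. */
        if (system("cc1 test.c -o bin1") != 0) return 1;
        if (system("cc2 test.c -o bin2") != 0) return 1;
        if (system("./bin1 > out1.txt") != 0) return 1;
        if (system("./bin2 > out2.txt") != 0) return 1;
        /* Nonzero exit iff the two outputs differ. */
        return system("diff -q out1.txt out2.txt") != 0;
    }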


The idea seems to be a bit broader than that: we can test any group of compilers, none of which we trust to be correct, by comparing their outputs against each other.

Though it won't tell us which one is wrong, it tells us that one of the pair is (or both are).


> Though it won't tell us which one is wrong, it tells us that one of the pair is (or both are).

If it's implementation-defined behavior, it won't even tell you that.


Csmith is carefully designed to only generate fully defined programs. That obviously limits the scope of bugs it covers quite a lot, but it still managed to uncover quite a few. Its biggest weakness is the difficulty of minimizing and root-causing bugs.
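To give a flavor of how that works (a sketch of the general approach, not Csmith's actual emitted code): generated arithmetic goes through guard functions, so operations that would be undefined in C, like signed overflow, can never execute.

    /* Sketch of a Csmith-style safe-math wrapper: signed overflow is
       undefined behavior, so the addition is checked before it happens. */
    #include <limits.h>

    static int safe_add(int a, int b) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            return a;  /* a defined fallback instead of overflowing */
        return a + b;
    }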


regehr's posts on program minimization are quite fascinating in their own right: https://blog.regehr.org/archives/1679


That doesn't matter though; if there is any difference, we can sort out the root cause by inspection.

There are also false negatives: either or both compilers may be wrong, or the test program may be wrong (undefined behavior or non-portable constructs), or some combination of these, and yet by fluke the results agree.

That's no big loss; if you're fuzzing, you're generating a lot of this stuff. If this generated program won't catch a problem, maybe another one will. Individually, the test cases are cheap.


Unless all are wrong in the same way, which doesn't seem entirely impossible.


The tool isn't meant to find all bugs, but rather to suggest that bugs exist. That there exist limitations to the kinds of bugs it can find is probably acceptable.

Also, if everyone implements something the same way, incorrectly according to the standard, then perhaps the standard itself is wrong. But otherwise, you just need one compiler that implements the standard more strictly for the bug to be suggested.


One of their compilers is never wrong: "Formal Methods as Bugscreen [...] the formally-verified CompCert compiler"


John Regehr did find 11 bugs in CompCert (the unverified portion).


If I've understood correctly, the compiling step is verified but the parsing and assembly steps are not, and no bugs have been found where valid input would have been compiled to valid but wrong assembly. I didn't see a statement about this on CompCert's web site though.


There were no bugs that their program could find; not that there are no bugs at all :-)


To me, the real insight is fuzz-testing compilers with automatically generated test cases that compare not the compiled binaries directly but the results of executing those binaries. That way, they can find bugs in optimizing compilers as well. Also, one of the compilers is formally verified, so they have an oracle (a golden reference). EDIT: Plus of course, their method is able to generate test cases whose results are free of any undefined behaviour.
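Concretely, a generated test case can funnel all of its computation into one printed value; this made-up example shows the shape:

    /* Sketch of a generated test case: compute over globals, then print a
       single checksum. Two compilers agree iff the checksums match. */
    #include <stdio.h>

    static unsigned int g1 = 3, g2 = 7;  /* hypothetical generated globals */

    int main(void) {
        g1 = g1 * g2 + 5;   /* unsigned arithmetic: overflow is defined */
        g2 ^= g1 >> 2;
        printf("checksum = %u\n", g1 ^ g2);
        return 0;
    }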


Anyone who has bootstrapped a compiler using the interpreter for the same language, and then validated the original interpreted test suite under compilation, has done this. Likewise anyone who bootstrapped a new compiler using an existing one. There is nothing earth-shattering here. Fuzzing is also not new. The idea of feeding the same input to multiple implementations of the same program side by side is not new, either. Just maybe the combination of all three is what is new.


> Just maybe the combination of all three is what is new.

Precisely.


There's a related technique called metamorphic testing, where the insight is that the only part of the program being compiled that matters is the part that will be executed to compute the differentiating output. The other parts can therefore be mutated (even with code that has undefined behavior, since those parts are not executed), and the differently mutated versions can be tested against one another. This works even when one has just a single compiler.
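A small made-up illustration: two variants of one program differ only inside a branch that never executes, so they must print the same thing even under a single compiler.

    /* Metamorphic-testing sketch: the branch is never taken at run time,
       so mutating its body (variant A vs. variant B) must not change the
       output. Differing results under one compiler indicate a bug. */
    #include <stdio.h>

    int main(void) {
        volatile int flag = 0;   /* opaque to the optimizer, always 0 */
        int x = 40, y = 2;
        if (flag) {
            x = y << 100;        /* variant A: dead, so even UB is fine */
            /* variant B might say: x = x / (y - 2);  (also dead) */
        }
        printf("%d\n", x + y);   /* must print 42 in both variants */
        return 0;
    }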


It's a decent technique for checking compilers, but it's older than that. Ariel Faigon discusses the approach in an undated blog post: http://www.yendor.com/testing/#compiler

I don't know the date, but I cited it in my 2009 dissertation: https://dwheeler.com/trusting-trust/dissertation/html/wheele...


The original article describing differential fuzzing for compilers is from 1998, and it is extremely readable:

https://pdfs.semanticscholar.org/fc88/1e8d0432ea8e4dd5fda497...


Being the “Andy Payne” cited in the acknowledgment, I can authoritatively say “can confirm”.

I built an implementation of this method around 1992-93.


Cool! Thanks for the link!!


Archive.org has copies as early as 09/2000:

https://web.archive.org/web/20000919193619/http://www.yendor...


I'm surprised it warrants a blogpost specifically aimed at compilers. People test against simulations all the time.


> As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task.

Buries the lede?

Sure, formal methods have a high overhead, but for compilers maybe it's worth it?


I felt like that was kind of misleading when I read that section.

CompCert is hardly an optimizing compiler compared to GCC. And bugs were found in it (just not in the middle end).

It's unclear to me how the defect rates really compare for code of equivalent performance (which is probably gcc at -O0 or maybe -O1).

CompCert also had correctness as a headline goal in a way that other compilers have not, so because of these differences it's hard to tell how much impact the formal methods themselves have.

Though it is reason to be hopeful, at least.


CompCert does not generate production-grade code. Which is to say, it does not optimize as well as production-grade compilers. There are circumstances where this is not so important; anywhere one would otherwise use a slow language, like JavaScript or Java, CompCert output might be fast enough.

More important, though, is that CompCert output is not, generally, more correct than its input. So unless you are formally verifying the program source, CompCert is unlikely to make your resulting system more correct. Most of us go years without finding a bug caused by our compiler. Furthermore, unless you are formally verifying your system specification, a formally-verified program may not help much. Being definitively wrong is rarely much better than accidentally wrong.

But what do you use to verify your specification? Formal methods can verify numerous properties, but correctness is not among them. Similarly: programming language standards -- specifications -- are also not formally verified. Correctly implementing a buggy language design gets you little closer to a correct system.

Testing finds bugs in all layers, but not reliably. Bugs found in testing tend to be implementation bugs, at first, but soon testing finds mainly specification bugs.


I'm hearing, "The tolerances on our gears aren't that great to begin with so why care if sand is falling in?"

You're making good points, but the undeniable fact that CompCert is not a panacea doesn't contradict the fact that other compilers are buggier?


Other compilers are buggier in ways that demonstrably do not much affect the correctness of systems built with them.

It's a matter of managing expectations. How much less-buggy would systems be, if built with compilers that are less buggy? The number is probably not negative, but the space between it and zero would be hard to see.

So it is an interesting achievement, academically, but a long way from industrially useful. Another way to say it is that it is a necessary step in achieving verifiable systems, but there are many other such steps yet to be taken.


Oh, well, if you're going to be reasonable about it... :)

I still feel like the conclusion is wrong, so one of the axioms or deductions must be wrong, but I can't quite put my finger on it.

In any event, well met.


> If {A} is not equal to {B}, then one of the compilers is wrong.

Or the program relies on undefined behavior.


That possibility is mentioned - read a bit further.


Is this fertile ground tilled by the plow of C, or is it broadly applicable to other languages like Scala, Rust, or Haskell?


You need two different compilers, and the ground is more fertile the more compilers you have. It would be most applicable to standards-based languages with a robust ecosystem of implementations, like Common Lisp or SQL, and less so to languages where there's mostly just the one compiler, like Scala or Rust.

You can still use it to test different settings on a single compiler, though. Like optimization settings or target architecture.


This works for every programming language, but C/C++ are the most widely used, appearing in e.g. OSes, network stacks, and ML/AI. Hence the correctness of C/C++ compilers affects most software.


The example compares signed and unsigned chars, so 255 is cast to -1 in the comparison. The compiler is not wrong for returning 1. Generally, read left to right: the first operand is a signed char, so the compiler may then try to convert any types to the right of the operation to a signed value as well.

If you're comparing between types, make your casts explicit.


Promotion rules are more complicated than just left to right. In this example both sides are promoted to int.


I am quite aware of that; keep in mind I said "generally". However, the compiler is still not wrong to return 1 in the example. Let me show the parts of the standard that explain why.

First, the operation is a relational expression. Both signed char and unsigned char are arithmetic types, so we follow the usual arithmetic conversions.

https://i.ibb.co/92bHJMj/1.png

So when converting between integer types, the following applies, particularly the section I have highlighted. Signed char and unsigned char also have the same rank, so the signed char is converted to unsigned char. However, we have a problem: -1 can't be represented as unsigned, so we need to look a bit further.

https://i.ibb.co/jRhqd9H/2.png

If we look here, we can see that rule 3 applies. As you can see, it is implementation-defined.

https://i.ibb.co/8XrXt34/3.png

However, if we promote to a signed int, as you suggest happens, we still run into the same thing.

https://i.ibb.co/RppGK5D/4.png


You are close, but a little off.

Your first citation is correct - we have to perform the usual arithmetic conversions.

In your second citation, though, you missed the very first paragraph in your picture:

> The integer promotions are performed on both operands, then the following rules apply:

(emphasis is mine).

Before you apply those rules you have to perform the "integer promotions"; they are defined in section 6.3.1.1 subsection 2 and they say, in part, that operands of rank smaller than int (which we have in this example) are converted to int first (which is what my original reply was pointing out).

Once both operands are converted to int, no more "usual arithmetic conversions" are needed, so the resulting operation is done on two ints, and the result has to be 0.

This is not implementation-defined.

(By the way, your last citation is about extended integer types, which we are not dealing with here; this example involves only the basic integer types.)
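A small program spelling this out (assuming the article's example pits a signed char holding -1 against an unsigned char holding 255, on a platform where int is wider than char):

    /* Both operands have rank less than int, so the integer promotions
       convert each one to int before the comparison: -1 vs. 255. */
    #include <stdio.h>

    int main(void) {
        signed char   sc = -1;
        unsigned char uc = 255;
        printf("%d\n", sc == uc);  /* prints 0: (int)-1 != (int)255 */
        return 0;
    }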


I hope I never depend on code written by anfilt.

It is a forlorn hope: anfilt is legion.


My first thought was, assuming your compiler is self-hosting, run it through itself a series of times. It seems like, if there's something wrong with it (which would have to affect its own code, admittedly), its behavior would "drift", getting more wrong with each iteration.
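That's close to the classic bootstrap comparison. A sketch with made-up names: build the compiler with itself twice and check that the binary reaches a fixed point.

    /* Bootstrap-comparison sketch: gen1 and gen2 are built from the same
       source by supposedly equivalent compilers, so any byte difference
       between them signals a miscompilation somewhere. */
    #include <stdlib.h>

    int main(void) {
        /* "mycc" and "compiler.c" are hypothetical names. */
        if (system("./mycc compiler.c -o mycc.gen1") != 0) return 1;
        if (system("./mycc.gen1 compiler.c -o mycc.gen2") != 0) return 1;
        return system("cmp -s mycc.gen1 mycc.gen2") != 0;
    }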


Compilers aren't a complete - or even representative - sample of the source code they are capable of compiling. If bad output only manifests with source code that is valid but not present in the self-hosted compiler's own source code, then this method will never find it.


I acknowledged that already. It wouldn't be a comprehensive test, just a novel one that wouldn't rely on there being another compiler available for the same language.


The drifting you describe would pretty much never happen. The chance of outputting wrong code that affects self-compilation without breaking self-compilation, and that then repeats? Very, very small. And the compiler is already being used to compile itself all the time, so you're also testing some of the most-tested code.



