Clang-11.0.0 Miscompiled SQLite (sqlite.org)
302 points by marcobambini on June 4, 2020 | 207 comments



If I understand correctly, everything is working as intended: a fuzzer caught a bug in an unreleased version of clang. The title makes it sound like somebody fucked up pretty badly.


No, the point here is that the fuzzer "caught" a bug in sqlite. The blog post mostly clears things up. To me it seemed pretty neutral, just explaining it was actually with clang. No snarkiness or anything.


> No, the point here is that the fuzzer "caught" a bug in sqlite. The blog post mostly clears things up.

Could you explain what you mean? What bug is there in sqlite? What blog post are you referring to?

Via someone else in the thread, this was reported as a bug in LLVM, which was confirmed and is now fixed: https://bugs.llvm.org/show_bug.cgi?id=46194. In light of this, I wonder why you claim that this was a bug in sqlite.


What happened:

1. OSSFuzz reports a bug against SQLite.

2. SQLite dev tries to fix the reported problem but is unable to repro the bug on his desktop

3. SQLite dev replicates the OSSFuzz build environment which uses clang-11.0.0 and is then able to repro the reported bug

4. SQLite dev finds that the bug isn't in SQLite at all, but rather in clang-11.0.0

5. SQLite dev patches SQLite to work around the clang bug, posts a brief note about this on the SQLite forum, and goes to bed.

6. The post on the SQLite forum is picked up by HN while the SQLite dev is asleep. LLVM devs read HN, isolate and fix the clang problem, all before the SQLite dev wakes up.


7. SQLite dev wakes up and responds to comments on HN!!


SQLite dev tries... / SQLite dev replicates... / SQLite dev finds... / SQLite dev patches...

Now compare this with what happens when you meet an MSSQL bug.


For those of us saddled with the misfortune of having to use MSSQL but not yet having run into a bug, what's the experience?


MSSQL was an excellent product from version 7 on. Reporting bugs and getting support for MSSQL 2000 was a good and rapid experience. It's just gone downhill slowly since. The documentation at that time was first class too - it's important to remember that, docs really matter.

In 2014 the small company I worked for made a release with my work in it and I was spoken to the next day because it had crashed. The boss was not pleased as it had to be down to my work, nothing else had changed. In fact it was down to geographical data in the DB being changed as well. One of the geog functions (edit: think it was STIntersects, but the next bit applies to many such functions) claimed to return a 0 or a 1 - but in fact I found, separately documented, that it could also return NULL.

While spending a couple of hours digging this out, I found via a blog that geography data types used approximate algorithms (which is not surprising, and quite reasonable when you understand why, but it was not documented). These two problems caused us what could have been a serious production problem.

There's lots more. I've hit too many. Just a few weeks ago I did a straightforward recursive CTE on a small data set, which took a minute to run. Eh? It was an optimiser bug. Breaking the base query off into a temp table first dropped it to a second.

I've had too much of this shit, much more. MS doesn't care. BTW if you use the long-existing "update from ...join" syntax which MS released years ago as an extension, be aware it allows you to do stupid things that ansi-standard SQL does not, like repeated unpredictable updates to the same row, in the same single sql statement. I hit that last year in someone else's code. It's broken. Well, what's new.


So much of microsoft's documentation is like this nowadays.

There are parts of the 'What's new in C# 8.0' docs that aren't really there, and the provided code samples are entirely wrong (i.e. not just one issue; the language literally can't do the feature).

See also the state of most documentation for ASPNETCORE.

> BTW if you use the long-existing "update from ...join" syntax which MS released years ago as an extension, be aware it allows you to do stupid things that ansi-standard SQL does not, like repeated unpredictable updates to the same row, in the same single sql statement. I hit that last year in someone else's code. It's broken. Well, what's new.

I'm curious if you have an example. I know it's not ANSI standard but I've yet to run into any issues.

MERGE on the other hand is a dumpster fire and everyone knows it.


Merge... christ. When I saw the list of fuckups around that my heart sank, it's such a nice statement, if only it worked. I've chosen never to use it (in MSSQL).

As for update from, see here https://www.sqlservercentral.com/forums/topic/is-update-from...

IIRC, if you join on unique columns you're safe, but like I said, I found 2 cases of this last year in some code where the cols joined were not unique. That meant the output was, quite simply, junk.


Ahhhhh... I've never run into that issue, but I can see how it would be frustrating. I guess I've always just taken those risks from a cross-join for granted. Possibly because I cut my teeth on Oracle 10g, and when you have to consistently write update statements the ANSI way, you get in the habit of making sure that you'll only get a single record per row. Unless I know the table in FROM is going to be 1-1, I'll make sure to consider the right way to put a TOP 1 in (usually null merges).


Just a small one,

  select isnumeric('$+.')
  -- returns 1

so it thinks it's a number but obviously it isn't

  select cast('$+.' as int)
  -- Conversion failed when converting the varchar value '$+.' to data type int.

This kind of thing makes data cleaning even less fun than it already is. They now have TryParse(), but the above (which I'm pretty sure used to allow an even wider range of crap) is just unnecessary, and is a sign of how they view their own flagship product.


That's actually pretty awesome overall, besides the initial mis-attribution of the bug, which is an understandable mistake. Props to everyone involved.


Hi, I love SQLite. It is the crux of a system that allows me to make a living.


There is no bug in sqlite. That's the reason this blog post was written. And that's why it was interesting. Not because anyone claimed clang has a major fuckup.


You wrote earlier:

> No, the point here is that the fuzzer "caught" a bug in sqlite.

Which is where the confusion stems from. Perhaps "'caught a bug' in sqlite" would have been a more clear way of writing what you intended?


Ah I see now. Yes I could have extended the quotation. I was assuming the context was given by the start of the second paragraph:

> OSSFuzz has been reported bug 23003 against SQLite.


What strikes me as odd is why SQLite is committing changes just to make an unreleased compiler happy...


It was late at night. I didn't realize the clang-11.0.0 was unreleased until I saw this HN thread the next morning. The only reason I was using clang-11.0.0 is because the OSSFuzz bug report against SQLite used clang-11.0.0 and I couldn't get the (alleged) SQLite problem to repro using any other compiler.


I think it’s a code improvement, too. It now computes the new flags in one place.

Just looking at these lines, I don’t understand the code, though. Apparently, sqlite3VdbeMemRelease can change pMem->flags, but that change can be ignored. I can’t think of a simple explanation for that.

Also, the function name sqlite3VdbeMemRelease seems to indicate memory gets released, but pMem->flags = MEM_Str|MEM_Term…, to me, indicates it contains a string.


> but that change can be ignored

No, the changes _must_ be ignored. This is where the bug came from.

The code stores the flags, performs an operation, and then restores some flags while setting some new ones.

But after clang reorders the operations, it no longer restores the flags from _before_ the function call but instead takes them from _after_ it.

I.e. any change `sqlite3VdbeMemRelease` makes to `pMem` must be ignored, but after the reordering, changes to the `MEM_AffMask` and `MEM_Subtype` bits are no longer ignored.
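A minimal sketch of that pattern (hypothetical names and flag values, reconstructed from the description above rather than copied from the SQLite source):

  /* Save the flags, let a call clobber them, then rebuild them from the
     SAVED copy. */
  #define MEM_Str     0x0002
  #define MEM_Term    0x0200
  #define MEM_AffMask 0x003f
  #define MEM_Subtype 0x0800

  typedef struct Mem { unsigned flags; } Mem;

  static void memRelease(Mem *p){ p->flags = 0; }  /* may scribble on flags */

  void setAsString(Mem *pMem){
    unsigned f = pMem->flags;   /* must be read BEFORE the call...         */
    memRelease(pMem);           /* ...whose changes to flags are discarded */
    pMem->flags = MEM_Str | MEM_Term | (f & (MEM_AffMask | MEM_Subtype));
    /* The miscompilation effectively replaced 'f' on the line above with
       a fresh load of pMem->flags from AFTER the call. */
  }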


I think what they are saying is, why does sqlite3VdbeMemRelease change the flags in the first place if the change must be ignored?


Because these changes might not be ignored in some other place?


Probably, but it doesn’t look like good API design to provide a function whose actions you sometimes have to partially undo.


Perhaps, but the justification behind the change should be that one, not the miscompilation...


SQLite is so pervasive that when, not if, someone compiles it with the busted compiler (or another with a similar mistake), they want to be ready.


The compiler is unreleased and months away. There are many, many broken revisions of compilers in development.

A project cannot burden its code trying to work on every single commit of every compiler out there.


Why is OSSFuzz using an unreleased compiler to test things?


Likely because the company that provides OSSFuzz employs a large number of developers that work on the unreleased compiler, so consider it part of the QA process.


But not with gcc, which has been releasing broken versions since 9.0 and cannot catch up on fixing these regressions, mostly in the str library. Just look at their Bugzilla.

Clang fixed it in one day. For gcc I expect 5 years to fix their regressions.


I read the SQLite commit as saying they weren't aware it's not a released compiler.


Indeed, the commit hash reported in the version corresponds to something that landed on master at 11PM this Monday.


Clang is the lucky one here, it has both SQLite and Dr Hipp on its side.

Rust has Crater, which allows new rustc builds to be run against the entire public crates.io graph.

https://github.com/rust-lang/crater


> Clang is the lucky one here, it has both SQLite and Dr Hipp on its side.

Clang is lucky that OSSFuzz does testing of unreleased clang version on other OSS projects, like SQLite.


Love it. This is the dream of open-source.


Not really, at least, not given the current info.

Does anyone have a link to the reported clang bug?

For example, it might well be that the function that clang believes does not modify pmem->flags "promises" that it does not modify the value. In which case it might be an SQLite bug, that clang just did not happen to "correctly optimize" before.

This can happen, e.g., if you mark your function with __attribute__((const)), which promises that your function does not access memory via its pointer arguments. It then becomes trivial for clang to make the transformation above.
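To illustrate (hypothetical declarations, not anything from the SQLite source):

  /* __attribute__((const)) promises the function reads no memory at all
     (it may only compute from its argument values), so clang can merge
     or reorder calls to it freely. */
  struct Mem;
  int memFlags(struct Mem *p) __attribute__((const));
  void memRelease(struct Mem *p);

  int demo(struct Mem *p){
    int before = memFlags(p);
    memRelease(p);              /* even if this mutates *p ...          */
    int after  = memFlags(p);   /* ... clang may reuse 'before' here,   */
    return before + after;      /* justified by the 'const' promise     */
  }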

The bug wouldn't even need to be on the function definition; it might just be on a declaration somewhere, in which case you have an ODR violation as well.

All that seems very unlikely since it appears to be clear that the function does modify memory, but still, I'd just wait for the clang bug to be filed with a reproducer, and the issue to be tracked down.

It does strongly hint at a bug in clang, but from the linked page, I at least cannot conclude that yet.



Appears to already be fixed, thanks for the link!


Please, the code is there; read it rather than make assumptions that this might happen or that might happen. It's just plain C code all the way; there was no fancy "const" stuff anywhere involved in this case. Just as claimed, Clang tried to do some optimization and unfortunately it was FUBAR.


Sorry, but no. As a compiler writer, the burden of proof that this is a bug isn't on me, it's on you. If you believe this is a bug in clang, open up a bug with a reproducer, and get somebody to analyze it and confirm it. SQLite isn't a tiny project, so doing that should be quite easy.

I have better things to do than reading the whole SQLite code base, like, for example, fixing the bugs of the people that actually bothered to file a bug report.

Until this bug is confirmed, it can be anything, and unfortunately, bugs reported for C and C++ projects are quite often not bugs in the compiler, but UB in user code.

The fact that you think one can deduce, just by looking at a single function signature, whether this could be const-related UB illustrates the problem. UB is unfortunately not that simple, and I can think of a couple of kinds of const-related UB in which the declaration "looked fine".


> As a compiler writer, the burden of proof that this is a bug isn't on me, its on you.

Normally true, but not for SQLite. SQLite is probably the most thoroughly tested and correctness conscious piece of software anywhere in open source. I think with SQLite, the burden is on the compiler writer.


Nevertheless, OSSFuzz has found legitimate UB bugs in SQLite before.

https://bugs.chromium.org/p/oss-fuzz/issues/list?can=1&q=Pro...


> SQLite is probably the most thoroughly tested and correctness conscious piece of software anywhere in open source.

Given the number of bugs that are both filed and fixed on a daily basis, reality does not seem to reflect this claim.

If you expect compiler writers to go and explore your open source project of choice in a blind hunt for compiler bugs that they can fix, I hope you sit down while you wait, and that you do not really need these kinds of bugs actually fixed.


As a developer, this attitude on the part of C compiler writers is a reason to avoid C and C++.

The actual semantics of the language are effectively impossible for a non-expert to tell from the code as written and operating. And what you don't know can hurt you. That's like juggling caltrops with bare hands - you're guaranteed to get hurt.


> As a developer, this attitude on the part of C compiler writers is a reason to avoid C and C++.

Which attitude? The one that asks users to file a bug report if they think there is an error in our compiler?

Please let us all know which software you write and ship, and how you inspect all inputs that your users pass it in search of bugs, instead of requiring users to file a bug report if they think they encounter a bug.

Or maybe you are talking about the attitude of not assuming that all users know the C standard to the letter and never write code that has bugs?

I suppose that for all the C and C++ code that you write, when the binary doesn't work as expected, you start by looking for bugs in compilers instead of checking your own code, right? Because it is almost impossible to write subtly buggy code in C and C++, right?


Which attitude?

The attitude that undefined behavior is seen as an opportunity for optimization, and backwards compatibility with your installed base is not particularly important. Not even if you know that the change breaks code in the wild which was intended for security checks.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 for discussion around one such change.
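If I remember that report right, the pattern at issue was an overflow check along these lines (a sketch, not the exact code from the report):

  #include <assert.h>

  /* Intended to detect signed overflow - but signed overflow is UB, so
     the optimizer may assume it cannot happen and fold the test to
     'always true', silently deleting the safety check. */
  void check(int a){
    assert(a + 100 > a);
  }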

I will elide the specific self-congratulatory attitudes that you claim to have, and your not-so-subtle insults toward me. With one exception.

Or maybe you are talking about the attitude of not assuming that all users know the C standard to the letter and never write code that has bugs?

This is exactly the reverse of the attitude that I've observed.

Including from you, 2 posts up, when you described how hard it can be to decide whether something is undefined behavior - but that if it is undefined behavior, then it isn't a compiler bug. Which, as a developer, tells me that I definitely am not able to decide whether the code will do anything like what appears to be written, or whether I will get unexpected nasal demons in some unrelated section of code.


> The attitude that undefined behavior is seen as an opportunity for optimization

When you choose to write C, you are accepting the contract that if your program has UB, there are no guarantees about what it does.

If you can't uphold your end of the contract, C is not for you.

There is a huge difference between complaining that this contract is too hard to uphold, which is a complaint that you should take up with the C standard, and complaining to the compiler that, on contract violation, your program misbehaves.

If you can't accept that, again, C is not for you.

For every novice that complains about their mistake misbehaving I have 100 expert C programmers that would complain about bad codegen for programs that do uphold the contract.

From a compiler writer perspective, it is impossible to argue that we should penalize those that take the time and experience to play by C rules in favor of those that write broken programs.

100% of the experts would agree that it is hard to write correct code, but "disabling compiler optimizations" is not the right place to fix that.

If you believe that's the case, you are free to compile all your code with -O0, use volatile everywhere, or just use Python or some other language that suits you best.

But the argument that you are making is ridiculous. You don't like the consequences of breaking the law, so your conclusion is that prisons should be 5 star hotels and all the law-abiding tax-payers should finance them so that you enjoy them at their cost. I wish you the best luck in life with that attitude mate.


Gee, it didn't take you long to revert to form, did it? I will just respond to the most important bits.

When you choose to write C, you are accepting the contract that if your program has UB, there are no guarantees about what it does.

Back when I started programming, C was widely understood as having a very different contract. Namely, "It is a very simple language with no guard rails, but describes exactly what the computer will do." This was the language as described in K&R.

The contract that you describe was something that compiler writers came up with long after the language was in wide adoption. Based on my anecdotal impression, it also wasn't the direction that most of said user base was asking for. And it is the opposite of the direction that most other languages have gone.

If you can't uphold your end of the contract, C is not for you.

Which is what I said from the start. My exact words were, "As a developer, this attitude on the part of C compiler writers is a reason to avoid C and C++."

I generally am fine in slower languages like Python, JavaScript, SQL, Go, etc. The next time that I need performance, I will look into Rust. Isn't this exactly what you think that I should do?

But the argument that you are making is ridiculous. You don't like the consequences of breaking the law, so your conclusion is that prisons should be 5 star hotels and all the law-abiding tax-payers should finance them so that you enjoy them at their cost. I wish you the best luck in life with that attitude mate.

You know, if you argue with what I said rather than what you think I said, you might make more sense. Again, here is what I said:

"The actual semantics of the language are effectively impossible for a non-expert to tell from the code as written and operating. And what you don't know can hurt you. That's like juggling caltrops with bare hands - you're guaranteed to get hurt."

Given the level of care and knowledge which is required to identify undefined behavior, the first sentence is true. Given how much compiler authors value squeezing more performance over backwards compatibility, so is the second. The third is the logical result of the first two.


I think the attitude of compiler writers is a good reason to fix the spec so they can't keep ratfucking developers trying to get work done.


This got worse over the past 10 - 15 years, as people went optimization crazy.


A fuzzer should be one of the last lines of defence against human error. And this was caught by fuzzing sqlite, not by testing the offending compiler, clang. When all but the last lines of defence are breached, it's absolutely cause for concern and shouldn't be brushed away like all is fine.


Testing compiler output by building and testing popular libraries and apps with it is a fine and common approach.

Expecting a compiler (and the hardware it runs on) to be perfect is a mistake.


Except for compcert, which is basically perfect.


Well, not quite, because there are parts of CompCert that are not proven [1]:

- Parsing, type-checking, and pre-simplifications

- Assembling and linking

And there have been bugs on edge cases there. Moreover, correctness is limited to what is specified. Coq does not guarantee that there are no missing specifications or theorems. The C standard is written in English; people had to manually encode it in Coq.

[1] http://compcert.inria.fr/compcert-C.html


And the performance of the generated binary might be underwhelming for some.


It's all right if it's running the testcases of these popular libraries; however, it seems that this was not caught by the testcases, only by a fuzzer (which not only is not as deterministic, but also can be much more costly to run).


It is a fuck up. Ignoring a sequence point (a function call) is serious business. Probably too-aggressive inlining.


Relax, it’s a bug in a prerelease version of a compiler that was caught early and fixed promptly. Sure, it’s a bug in an important part of the compiler, but there’s no need to cry foul.


It might not be that simple: bugs like this should preferably be caught by the test suite of the compiler.


LLVM is one of the most critical pieces of infrastructure out there. And yet you can modify master without actually passing any tests.

I build LLVM from source a lot, and master not even building because somebody made a change that doesn't even compile is astonishing.

FWIW, LLVM has a pretty big test suite. But... to catch errors... you actually... have to... run it...

It is often the case that Rust, which does have a policy of "master always passes all tests", catches bugs in LLVM before LLVM does.


It's the most important rule in my opinion. Master must always compile and pass tests.

I'm glad to send patches or look into bugs whenever I find them in OSS, but having to jump through hoops just to get things compiled is a big no-no.


If you are not even compiling a change, then why would you submit it?

The bot will catch it, yes, but you are still filling the queue of maintainers.


> If you are not even compiling a change, then why would you submit it?

For example, some people send doc fixes. Sometimes these doc fixes have "bugs" that actually make compilation fail. Like adding a line break and forgetting to mark the next line as a comment...

Compiling LLVM for such changes takes quite a bit of time, so of course people don't even do it.


That does not answer the question.


Of course it does. Some changes, like doc changes, should not affect the program.

So why would you try to compile your software to verify those, instead of just, e.g., generating your documentation?


Not what I wrote. I never send in untested patches, because I don't start writing patches if I can't easily compile the project myself in the first place.

Also, this was not specifically targeted at llvm but all open source software which may not even have a build-bot or published regression checks.

All I'm saying is: Please keep master clean so it always compiles so people can easily pick it up and contribute.


Honestly "master doesn't pass tests" is just a naming problem. Solving is just a matter of renaming master to something else ('dev'? 'potentially-incorrect'?) and making a new version of it that does pass tests.


Having a version that doesn't pass tests is just a way of delaying the fixing of tests until some later time, when the authors of the tests might have forgotten about them.

And obviously, under the hope that somebody notices the failure by then, tracks it down to the right commit, and fixes the commit, instead of adding a workaround that avoids having to find the root of the problem.

"master doesn't pass tests" isn't just about naming. Its about culture. Its about agreeing that there is some source of code that everyone agrees on should work, and having the culture of agreeing that all modifications to it should work as well, and that it is the responsibility of those doing the modifications to make sure that's the case.


This.


That would mean that there would be a default branch with the meaning of "development branch that is as up to date as possible while still guaranteed to build and pass tests". Which would be a change from current practice, since there is currently no such branch under any name. So this is not just a naming problem. It is a problem of maintaining a development branch that is as up to date as possible while still guaranteed to build and pass tests.


Aren't there prerelease builds though? The branch of prereleases seems to fall under that bucket.


From a very quick search I can't seem to find such a branch. Judging from https://github.com/llvm/llvm-project/releases they seem to occasionally branch release candidate branches from master, every two weeks or so. That's very far from a well-tested master updated several times a day.

Continuous integration for compilers is a well-understood problem. There are posts in this thread explaining how Rust handles it. I understand that it's a royal pain to set up and maintain. I am a compiler engineer and am happy to work on compilation stuff but wouldn't want to touch our CI system with a ten-foot pole. But LLVM is a project driven by Google and Apple; there are really no excuses for not finding the people willing to do this.


How can you make a prerelease build of non-compiling code?


OK then. Every commit in the dev branch should pass all tests.

If the commits don't even build they might as well be thrown away. It's going to make regression tracking next to impossible otherwise. The only reason to keep a branch around is because it's a small ephemeral branch under active development, or because it's an eternal branch that you can use to track regressions (using git bisect, for example).


Yes, which is why bors-ng[0] is a thing. The fact that LLVM doesn't use it (or something similar) is strange.

[0]: https://github.com/bors-ng/bors-ng


LLVM doesn't use GitHub pull requests, so obviously bors-ng isn't an option.


I strongly disagree. If you keep things tidy coming in then you have a known good base.


Don't you always have a known good base? It's just the latest build that worked correctly, right?


> It's just the latest build that worked correctly, right?

But how would you tell which one that is? I guess even within the current system one could set up a bot that monitors master, and whenever the latest commit builds and passes tests, it updates a "known working" branch. But this doesn't seem to be the case. And if such a system is set up, it would make a lot more sense for contributors to push to a "testing" branch instead of master and then only update master for working builds, rather than the other way round.


How do you even tell whether something is right? That's not always a simple question to answer.

Sure, one way is to have an automatic test suite and an integration system that moves commits that passed it from one branch to another. If you can actually make a test suite that has no false negatives and finds enough negatives to be useful, that's great.

In my (limited) experience test suites are not close to that ideal. I've worked with systems where there was an automatic build test at least, and sometimes that catches a bad commit. But I think for many fast-moving projects there are few useful errors these systems can find, and sometimes they have annoying false negatives. The thing that I can think of that would most improve my productivity right now is a check that nobody added tab characters in the source code.

(I realize my comment might not apply to a situation like LLVM, where there could be a large test suite of programs to compile and run for example).


I think your comments apply more to other situations. Compilers are very testable: The input is usually human-readable text that can be stored in a file (source code), the final output is usually human-readable text that can be stored in a file (assembly code), intermediate states can usually be dumped to human-readable text that can be stored in a file (in the case of LLVM, bitcode). There is no nondeterminism, no network, no database state, no particular hardware or system requirements for most tests, etc., to consider during testing.

Compilers are also often very modular. LLVM in particular makes it very easy to take a bitcode file, apply a certain well-defined set of transformations, dump the resulting bitcode file, and check whether it fulfills the expected properties, e.g., "this computation has been optimized to this other computation". LLVM has a tool called lit that does this kind of test, and there is an extensive test suite that uses it. Of course none of this guarantees how other programs will behave, but in my experience these test suites really are very good and useful.

The only missing piece in LLVM is consistent enforcement of a policy that tests must pass in order for the repository to reach certain well-defined states.


Well, that explains everything...


Gcc is. Llvm is the wannabe.


Anything and everything with tests implemented in the right languages can be used as a test suite for a compiler, though. So, I'd rather see that that was done systematically than trying to catch everything with the compiler's test suite.

I mean something like setting up something building and running the tests for everything included in Debian, homebrew and vcpkg and looking for regressions compared to the last release of the compiler.


It should, but I don’t think it’s fair to say it’s a huge issue. Perhaps it’s just highlighting an area of improvement for the test suite or perhaps a need for more caution when working in that area of the compiler.


Do you have a pointer to where this was fixed in the compiler?


I do not; I think I misread the bug report as somehow saying this had been fixed; I think I should really go to bed. I actually have no idea if this has been fixed yet, though I would still assume that it won’t be like that in master for long. Sorry for the unsubstantiated claim!


No problem, I was just wondering if I missed something. Filed https://bugs.llvm.org/show_bug.cgi?id=46194


http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-202... says that the offending change was reverted.


Question for any low-level optimizing compiler engineers here: I obviously realize these are all important, but in your judgment, how much of making an error-free compiler would you say is about having a comprehensive test suite, vs. having very careful software engineers, vs. having extremely thorough code reviews, vs. something else? Put another way, if you were to lose one of these, which ones do you think would have the most/least negative impact (or be the easiest/hardest to make up for with other things) in terms of the correctness of the final product?


Somewhat unrelated but I read somewhere that the Rust team test their releases by compiling the entirety of crates.io and all Rust GitHub repositories to find regressions. The tool is called Crater. Source: https://www.pietroalbini.org/blog/shipping-a-compiler-every-...


I was reading the docs of some library the other day, and they had a list of projects using that library. It was an impressive list: long, and with some serious projects on there.

Then I saw that they try building those projects against their library as part of their test pipeline. If you want to add your project to the list, you can submit a pull request. No wonder the list is so substantial - it gets you free testing! Highly ingenious!


No longer the entirety but still a sizeable chunk of it; it helps with regression detection for them. World's biggest test suite?


It is very common for compilers to be tested using client code. It is a practice that goes back decades.

Nowadays with so many open source projects and so much distributed computer power, it is even easier.


What makes it easy and powerful for Rust is that ~all Rust projects use Cargo as the build system and test runner, so Crater can really test every Rust project in existence, not just code samples hand-picked into compiler's test suite.


Only open source projects, which will not be the majority if Rust ends up picking up speed.


The test suite is ultra important. Even in the unlikely case that you have an engineer who thoroughly understands the machine and language specs, you have a ton of existing, useful, programs that rely on under-defined behaviour in one or both of them.


Ah, I might have to tease out my question here. By correctness I'm only referring to the implementation being true to the language specification, not to "not breaking existing programs". So if a program relies on undefined behavior and then gets broken in the next version of the compiler because of that behavior, I don't consider that a correctness problem in the compiler for the purposes of this question. (Which is not to suggest keeping it working isn't a worthwhile goal. I'm saying that's not the aspect of it that I'm asking about.)


The problem is that many don't learn C and C++ properly; they have read some language introduction book and assume "how my compiler does it" is the right way, and have probably never touched a copy of the ISO standard.

So plenty of programs rely on UB even without realizing it.


I have absolutely no problem finding out that we have a dozen new bugs in production for our 15 year old codebase, because we upgraded our compiler and it revealed places we used undefined or unspecified behavior.

I’d much rather have that than a compiler that evolves slower because its developers worry about existing non-conforming code.

I can always use the old version of the compiler if I don’t want it to break.


You might if the compiler yelled at you for it before you shipped, but it’s just as likely to compile your code into something you didn’t really foresee…


How much do you expect your compiler to yell? Undefined behavior can result from extremely trivial things like x+1, where x is signed. Obviously, you don't want the compiler yelling at you for almost every single for loop or simple signed addition in your program.

There also seems to be this weird misapprehension (probably not shared by you) that this is the compiler being "out to get you" for UB. That's not what happens... the compiler ASSUMES that no UB can happen and optimizes accordingly. This can certainly have surprising results, but doing anything else would leave a lot of performance on the floor. What has happened over the last decade is that compilers have become better and better at optimizing, leading to more and more surprising results.
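To make that concrete, a minimal sketch:

  /* Dereferencing a null pointer is UB, so after the load the compiler
     is entitled to assume p is non-null and delete the 'redundant'
     check as dead code. */
  int first(const int *p){
    int v = *p;                 /* UB if p == NULL        */
    if (p == 0) return -1;      /* may be folded away     */
    return v;
  }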

To anyone (again, not you) ITT complaining that their program 'used to work fine': It just happened to work, but it's been buggy from the get-go. If that's a problem for your ego, then you need to adjust your ego.


> There also seems to be this weird misapprehension (probably not shared by you) that this is the compiler being "out to get you" for UB.

Of course nobody actually thinks that the compiler is "out to get you". But if you do any work in a security context, that is the attitude you have to take: "If the compiler was lawful evil and trying to put security vulnerabilities into my code, what excuse could it use from the spec to justify doing so?" That's the only way to make sure that you won't get bitten by "well-intentioned" optimizations.

> To anyone (again, not you) ITT complaining that their program 'used to work fine': It just happened to work, but it's been buggy from the get-go. If that's a problem for your ego, then you need to adjust your ego.

This would be valid if what people were complaining about was random cowboy behavior. What people are complaining about is the fact that it is nearly impossible to write code without UB. The fact that UB can result from "trivial things like x+1 where x is signed" is exactly the problem.


I mostly agree with your points, but I still think compiler authors have dropped the ball by not putting as much effort into detecting cases where they have applied optimizations based on undefined behavior as they should have. We’ve seen some of these be rolled back with flags for overflow or null checks (which I personally think is a bit too far in the other direction).


Alternatively, you can instrument the code with checks for UB using a sanitizer. It won't be the compiler screaming at you but your CI running all your tests with this instrumented version of your software.

Not perfect, but should work for most sane people running CI before merging changes.
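For example, a minimal sketch (UBSan's exact message wording may differ):

  /* overflow.c - build with:  clang -fsanitize=undefined overflow.c
     Running it prints something like:
       runtime error: signed integer overflow: 2147483647 + 1 cannot
       be represented in type 'int' */
  #include <limits.h>

  int main(void){
    volatile int x = INT_MAX;   /* volatile keeps the UB until runtime */
    return x + 1;
  }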


Sanitizers don't catch all instances of undefined behavior, and they sure won't find anything that doesn't run…


The compiler doesn’t do a good job of revealing undefined behavior, it usually just makes your program break in mysterious ways.


A better example is perhaps behavior that shouldn’t be relied upon: a standard library unordered collection (Set) where the implementation for 10 years preserves insertion order but says in the docs not to rely on it. Lots of programs have deliberately or not relied on it. Then suddenly it changes impl to one that doesn’t preserve insertion order for slightly better perf. You don’t want the std library developers to worry about existing code as long as the documentation always said “don’t rely on insertion order”.

Same with a list sort that is stable but never guaranteed it would be, then turns unstable much later. I was bitten by this one in .NET, as were many others, and in response the developers added hacks into the code to simulate the old behavior when it detected the code was built against the older lib. That didn’t make the problem easier to find...


> You don’t want the std library developers to worry about existing code as long as the documentation always said “don’t rely on insertion order”.

Sometimes practicality wins out and the change is reverted when it breaks everything. It’s nice to come up with purist views, and in this case being able to point to documentation and say that the entire world needs to change is the morally high stance to take, but it’s not always what happens. (Some may say that this is what needs to happen to undefined behavior, but that one’s still up in the air. As it so happens, I am in the “read the standard” camp…)


More to the point, anyone doing compiler validation work would a) always have a full test suite as their first line of attack, b) always have a well understood developer pipeline in place, to ensure a consistent understanding - coding rules, code review processes, etc., and c) always look at what the compiler is actually doing, compared to what they think it's doing - i.e. be able to navigate code -> assembly and all the stages in between, with competence.

Anything less is asking for trouble.


A ‘full’ test suite? I’m not sure that one of those has ever existed for any large program, let alone one under active development. We have to work in the real world here!


Yes, a full test suite. 100% code coverage is a thing, even in compiler validation.

Also, every test that was ever written for a compiler is still out there, being run, somewhere.

And, for the record, the real world is pretty big.


But 100% code coverage isn't necessarily a useful thing in this context, as different code (and different options) mean different interactions between passes. And you're never going to cover all the different paths through the compiler.

Back when I worked on a commercial compiler, we had a smoke test which we ran before committing. It would take ~30 minutes to run. We had a longer, roughly eight hour, set of tests which we ran daily. And we had machines running tests 24/7, which managed to clock up roughly 20 million distinct test cases (combinations of code and options) over the course of a year.

Every single compiler bug we'd ever seen was present (in a reduced form) in the conformance tests we'd run before release. That took about a week.


>And you're never going to cover all the different paths through the compiler.

Depends who you work for and how willing they are to let such justifications stand in a court of law.


100% code coverage does not mean a full test suite. You can still find huge numbers of bugs in code that has 100% coverage in a test suite. The two are not the same!


I'm not offering 100% coverage as anything but a level playing field, so that things are tested. Also, code-path analysis is much more advanced than you might think, especially when you get auto-generated testing rigs configured for particular tasks - and especially if you have a compiler to test, as well as all the (reproducible) applications it may produce...


Then the test suite is still hugely important -- you just have to design it to reflect what you consider "true to the language specification".

Or phrased differently, the test suite is (or should be) the machine-readable language specification.


That is only theoretical correctness, not useful correctness.


They all have a critical impact.

If you look at Rust, you cannot commit to master:

- without the resulting master merge passing the whole test suite, which would probably take a day or more to run locally, and

- without a review by someone that knows the part of the code you modified well, or somebody they delegate to - and often a group of multiple people that know that part of the code well. These people verify that the code is correct, commented, tested, appropriate, etc. Often there is a design team in charge of that part of the code that makes sure the change goes in a direction that aligns with the project strategy/goals (i.e. that this is a change we want to make).

Even the best of the best software developers:

- don't always test that their code compiles before committing: for Rust, they really can't; they'd need to cross-compile the compiler dozens of times, run these compilers under qemu on different hardware, and verify that they work - too much work for any single human,

- don't always test that their code passes the test suite: the test suite is _huge_. It's impossible for anybody to test everything locally; it takes too much time,

- don't always benchmark the compiler after a change on reasonable applications: takes too much time,

- etc.

Point being, you cannot trust the best of the best, much less anyone else. As a Rust compiler dev, the first person that you should not trust is _you_.

The whole point of the process is identifying all possible sources of errors, and making them impossible, or at worst, extremely unlikely to happen.

Most of these errors are only quite unlikely to happen, e.g., because all the automated tooling that exists to prevent them is also written by humans that make mistakes.

But every time a new source of error is identified, there is at least a discussion about how to make it impossible to happen in the future.

Having a test suite and having reviewers are just two things you have to do to avoid a large set of errors (see a comment below about how each release of Rust is required to successfully compile all existing public Rust source code and run all their tests).

They are barely enough. And even if your process makes errors extremely unlikely, Rust and LLVM have hundreds of PRs in flight and hundreds of developers working on the source code regularly. Throw enough developers at a project, and unlikely events become a daily thing.


Not a compiler author, but I have some experience with finding bugs in compilers. If I had to venture a guess, I'd pick the test suite (and other tools like fuzzing and a mass user base), since the complexity is quite high. In our experience with the EMI project (https://web.cs.ucdavis.edu/~su/publications/emi.pdf), some of the miscompilations were caused by one pass of the compiler emitting code that was considered to contain undefined behavior by the next pass (see Figure 1(b)). I'd say a code reviewer will catch issues within a single pass, but the interactions between passes are less likely to be caught by a human reviewer as the complexity grows.


Good thoughts, thanks!


Completely off topic but reading this thread (without glasses in preparation to go to bed) was a bit confusing until I realized that your usernames differed by a single letter…


Haha, glad you cleared it up so that you can sleep easily now :-)


I'm not quite the class of people you're asking; I worked on an interpreter for an old language for a long time.

Tests are fantastic for defined behavior, but running actual real-world, gigantic, battle-tested programs is where I found bugs. It _seems_ like you can just define the behavior. In my experience, the interpreter would mostly do what I wanted, unless like 4 complicated factors were in play. Often, it's so confusing, the user of your language is just trying to make something work. They're under a deadline, they find a way, and they make it work. They're not morons; hell, you have to be pretty sophisticated to find that weird edge to exploit.

I was super new, but I thought of a lot of my coworkers as surgeons. They'd see the problem, and find the smallest possible change to fix a bug. They were very careful.

I dunno. It's super hard. I made some optimizations where I had a choice: I could break a working program (one relying on undefined behavior) in one way or another. Seemed like with those big projects, the optimization is totally worth it. But you have to pick. If you choose the semantics to mean 'a' then X breaks; if you choose 'b' then Y breaks. Our users had been around the block a few times. They were sympathetic, and we held a pretty firm line on undefined behavior.

I dunno. I think Rust probably handles this the best, recompiling and retesting against every crate ever. But here's the thing: if you're relying on some weird edge case, the test probably is as well. So you don't actually detect the change in behavior.

I think it's a hard problem, and there's no silver bullet. Write tests. Have careful engineers. Run big programs. In spite of all that, you're going to have to go to your users hat in hand occasionally, because when you build big systems there are lots of subtle edge cases. Shit's hard man.


> having a comprehensive test suite, vs. having very careful software engineers, vs. having extremely thorough code reviews

I’ll be a contrarian, and say the careful software engineers are the most important. They’ll stop feature work to recreate the test suite.

On the other hand, nothing can save a good test suite from being selectively disabled or regressed by sloppy engineers.

Similarly, having a process that dictates careful code reviews just slows everyone down. The careful engineers would have done the code reviews properly without the policy. The sloppy ones will spend more time to produce comparably bad (or even worse) code reviews.


The OpenBSD project used to verify gcc (back when it was at 4.2 or so) as the compiler for their OS. Don't know if they still do; I'd imagine it's a lot of effort, and both clang and gcc have churned out yearly releases with ever-growing complexity since. C was once supposed to be a simple language closer to a portable macro assembler than a high-level language, and one for which writing a compiler is relatively straightforward and takes no longer than a year. This went out of the window with clang, and with gcc (and glibc!) being based on C++. Though of course optimizing for out-of-order execution on modern CPUs is much more complex than for your olde microcontroller-like CPU.


> C was once supposed to be a simple language closer to a portable macro assembler than a high-level language, and one for which writing a compiler is relatively straightforward and takes no longer than a year.

Do you have a source for this claim? AFAIU the history of C, it was invented for entirely pragmatic reasons.

> This went out of the window with clang, and with gcc (and glibc!) being based on C++.

This is nonsensical, whichever way one interprets "being based on C++". The GCC and Clang compilers conform to the standard whichever version of C or C++ you ask them to compile. (Anything else is a bug and should be reported. They may accept a few non-conforming programs, but that's allowable under either implementation-defined behavior or undefined behavior. Remember that UB can include 'doing what the programmer expects'.)


That entirely pragmatic reason was to be an assembler with curly braces. Consider the memory capacity of the PDP-7 and think about what you can reasonably compile with those resources. Fun fact: if you don't specify a variable's type, a C compiler (pre-C99, at least) assumes int; this is how the first C compiler worked. You don't need types in assembler; the register type (machine word) is almost the only type you deal with. Implicit function declaration is an assembler-ish feature too.
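Both defaults are easy to demonstrate; a sketch using pre-C99 semantics (modern compilers warn about or reject this):

  /* Old-school C: everything defaults to int. */
  count;                     /* implicit-int variable at file scope    */

  main(){                    /* implicit int return type               */
    count = square(7);       /* implicit declaration: square() is
                                assumed to return int, no prototype    */
    return count;
  }

  square(x){ return x * x; } /* untyped parameter x defaults to int    */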


The first C compiler ran on the PDP-11 rather than the PDP-7, didn't it? Not that it makes a huge difference to your point.


That's also how tcc (a simple, functional, modern one-pass C compiler) works, I guess.


> C was once supposed to be a simple language closer to a portable macro assembler than a high-level language,

That might have been K&R's original intention, but that went out the window pretty early on when optimizing compilers arrived on the scene. And eventually when the ANSI spec was written, it was explicitly written to allow all kinds of optimizations (the "as-if" rule).


> And eventually when the ANSI spec was written, it was explicitly written to allow all kinds of optimizations (the "as-if" rule).

The ironic thing about that is that when they wrote C89, they said they were trying to standardize existing practice. I don't think they had any idea that they were effectively declaring 90% of all programs buggy.


> I don't think they had any idea that they were effectively declaring 90% of all programs buggy.

I agree. During the standardization process, there was a proposal (‘noalias’) that dmr described as “a license for the compiler to undertake aggressive optimizations that are completely legal by the committee's rules, but make hash of apparently safe programs”, and, that “[i]t negates every brave promise X3J11 ever made about codifying existing practices, preserving the existing body of code, and keeping (dare I say it?) ‘the spirit of C.’”¹

If he or anyone else at the time had realized that ‘undefined behaviour’ would turn out to do the same, their response would have been the same: “[it] must go. This is non-negotiable.”

¹ https://www.lysator.liu.se/c/dmr-on-noalias.html


And then retroactively declare that code broken and not conforming. Good times!


In defense of whoever was involved in the C89 process, I think they had an impossible task. At the same time they wanted to standardize the "portable macro-assembler" as well as allowing optimizations. With such constraints they could not but fail.

With hindsight, maybe K&R should have consulted with PL experts and researchers and designed a "proper" language with a better emphasis on soundness and safety. I'm not saying they could have invented something like Rust back in the 1970'ies, but even then the state of the art was far ahead of K&R C.


> comprehensive test suite, vs. having very careful software engineers, vs. having extremely thorough code reviews, vs. something else

In practice, you need all, for various reasons.

> in terms of the correctness of the final product

What is there "correctness"? Ah, I see in your later post:

> I'm only referring to the implementation being true to the language specification

The answer is: even checking for that is not something which can take the humans out of the loop. But humans alone (as in "extremely thorough code reviews" or "very careful software engineers") are provably not enough, as the "space" that has to be tested is too big to cover without "a comprehensive test suite." Still, I'm not aware that a complete "comprehensive test suite" which would be "enough" for any non-trivial language can exist.

In my experience, non-trivial goals are simply bigger than any "simple" or "single" "solution." So in that sense, any project eventually fails due to the human factor. It's the humans that will use anything else as the tools to achieve the goal. It's the quality of humans that are in charge that is the precondition for everything else.

So the answer is always: prioritize for having good people, anything else will be taken care of by them: they will maintain the tests, be careful, do the reviews etc. Don't expect you can "save" on something.


Test suite, without a doubt.

Asking humans to be careful at scale (whether authoring or reviewing) just won't work.

Keep in mind that you need loooooots of test suites, not just one.


llvm has a very thorough test suite, and its main test suite is exactly for this purpose. The error witnessed here by sqlite will be turned into a test case that takes the C code, compiles it, and checks whether the output was miscompiled or not.

Here's a test from 2007 that tests a similar error in the instruction selector:

https://github.com/llvm-project/llvm/blob/master/test/CodeGe...

Evidently the error that caused this test to be rewritten reordered the `ret` instruction before the `mov $1, $eax` instruction that sets the return value. This test assures that the bug will never happen again.


These sorts of bugs are more common than you may think! Here’s gcc miscompiling fish shell: https://github.com/fish-shell/fish-shell/issues/6962


These sorts of things get very tricky. Years ago in college my team and I were building an operating system, and we were 100% sure we had found a GCC bug. It turned out there was a gdb bug (we were disassembling with gdb) that made us think it was a GCC bug, but it was actually a bug in our own code. Very subtle.


I don't understand trying to use the trunk version of an under-development compiler. Would you use binaries built with an unstable compiler in production?

Given that we're talking about SQLite here, what's the advantage of using clang 11 instead of a previous version?


You're looking at the wrong side of sqlite usage. If you develop sqlite, why wouldn't you test it on every compiler/version combination you can get your hands on? Whether someone uses it in production is irrelevant.


Because you'll spend a lot of time chasing down bugs that are actually issues in the compiler. Unless you're trying to vet the compiler itself, this won't improve your software but cost you a lot of effort.


If you’re SQLite, you generally do care about popular compilers converting your code into something wrong because you care about your code doing something wrong in general. If there was a bug in Intel processors that made SQLite fault randomly during tests, I’m sure they’d be interested in that too.


Yes.

Two things though:

1. Hopefully none of your users run an unreleased clang from head, so you're not using anything a user will run into.

2. Consider whether it is worth the effort to validate every commit pushed to head, or whether you'd get more out of your investment if you only tested releases that received some more vetting (e.g. nightlies).


> 1. Hopefully none of your users run an unreleased clang from head, so you're not using anything a user will run into.

Yes, but clang eventually gets released. You want to catch bugs in clang and get them fixed before that release happens.


Yes, but Clang is many months away from a release.

It is not the same as playing with a release candidate.


It's an automated fuzzer; it caught a bug very early, but the report got sent to SQLite instead of Clang. Probably because OSSFuzz doesn't handle this distinction. (Though maybe it should. Or maybe new compiler versions usually catch problems in old code a lot more frequently than they expose compiler bugs. Dunno.)


I don’t think SQLite is literally vetting every commit. (Although kudos to them if they are!)


On the other hand, you might catch either program or compiler bugs before it's too late and it has shipped to production. I think this would improve the software, and I'm also not so sure a lot of time is spent on this, and it wasn't even something caught by SQLite devs, but rather an external project which SQLite decided to work around for now.


There was a chance this miscompilation would not have been found until after the release of clang 11. Surely both the clang and sqlite developers benefit from people trying pre-release versions and reporting bugs?


LLVM follows an open model for commits. Therefore, their trunk being broken early in the release schedule is not a surprise. They are months away from a release.

Whether that development model is good or not is another question, but it is what it is.


If I read the article right, the SQLite people don't use clang 11, but OSSFuzz does, and someone ran it against SQLite and filed a bug. The bug turned out to be not in SQLite but in clang 11.

I'm not sure why OSSFuzz uses a prerelease compiler, but maybe it's purposefully for this reason, to add some extra surface area for finding compiler bugs?


there are 3 versions of this comment floating around, and another 2 asking why this wasn't discovered earlier

combine both to find the zen koan: test early and often


sqlite testing is pretty involved, and impressive. There are articles written on this, such as: https://www.sqlite.org/testing.html


I believe Chromium uses the under-development trunk of Clang, and bumps it forward every couple weeks - and that’s probably one of the most heavily used code bases in the world.

https://chromium.googlesource.com/chromium/src.git/+/master/...


OSS-Fuzz wants to get the new sanitizer features fresh from the oven, I imagine, to find new bugs in fuzzed software as early as possible.
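As a concrete (hypothetical) example of the kind of bug those sanitizers exist to catch, here's a one-byte heap overflow that AddressSanitizer flags immediately:

    /* build with: clang -fsanitize=address -g overflow.c */
    #include <stdlib.h>

    int main(void) {
        char *buf = malloc(8);
        buf[8] = 'x';   /* one past the end: ASan aborts here with a
                           heap-buffer-overflow report */
        free(buf);
        return 0;
    }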


I don't work on sqlite, but on postgres. We've found numerous bugs in gcc, clang and postgres by building / testing with unreleased compilers, and found usability issues with new compiler warnings.

That doesn't mean we use such compilers in prod.


The Google people do use them in production; it's the clang version they build Chromium with.


Why?


Clang 11 hasn't been released yet, right?


Right. But we've also observed non-determinism / undefined behavior in Clang 10:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=246630#c26

  ==120363== Conditional jump or move depends on uninitialised value(s)
  ==120363==    at 0x1634474: llvm::ConstantExpr::getGetElementPtr(llvm::Type*, llvm::Constant*, llvm::ArrayRef<llvm::Value*>, bool, llvm::Optional<unsigned int>, llvm::Type*) (Constants.cpp:2191)
  ==120363==    by 0x112D6D9: getGetElementPtr (Constants.h:1163)
  ==120363==    by 0x112D6D9: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1005)
  ==120363==    by 0x112DF70: (anonymous namespace)::ConstantFoldInstOperandsImpl(llvm::Value const*, unsigned int, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1039)
  ==120363==    by 0x112C165: (anonymous namespace)::ConstantFoldConstantImpl(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::SmallDenseMap<llvm::Constant*, llvm::Constant*, 4u, llvm::DenseMapInfo<llvm::Constant*>, llvm::detail::DenseMapPair<llvm::Constant*, llvm::Constant*> >&) [clone .part.0] (ConstantFolding.cpp:1114)
  ==120363==    by 0x112C5CF: llvm::ConstantFoldConstant(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1194)
  ==120363==    by 0x188F410: prepareICWorklistFromFunction (InstructionCombining.cpp:3584)
  ==120363==    by 0x188F410: combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::ProfileSummaryInfo*, unsigned int, llvm::LoopInfo*) (InstructionCombining.cpp:3703)
  ==120363==    by 0x189205F: runOnFunction (InstructionCombining.cpp:3789)
  ==120363==    by 0x189205F: llvm::InstructionCombiningPass::runOnFunction(llvm::Function&) (InstructionCombining.cpp:3768)
  ==120363==    by 0x16F4352: llvm::FPPassManager::runOnFunction(llvm::Function&) (LegacyPassManager.cpp:1482)
  ==120363==    by 0x16F4DE8: llvm::FPPassManager::runOnModule(llvm::Module&) (LegacyPassManager.cpp:1518)
  ==120363==    by 0x16F51A2: runOnModule (LegacyPassManager.cpp:1583)
  ==120363==    by 0x16F51A2: llvm::legacy::PassManagerImpl::run(llvm::Module&) (LegacyPassManager.cpp:1695)
  ==120363==    by 0x1FF4CFE: EmitAssembly (BackendUtil.cpp:954)
  ==120363==    by 0x1FF4CFE: clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (BackendUtil.cpp:1677)
  ==120363==    by 0x2C471A8: clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (CodeGenAction.cpp:335)
  ==120363==  Uninitialised value was created by a stack allocation
  ==120363==    at 0x112C653: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.c


I’m curious, does clang promise to compile deterministically? (This is inspired by your bug but not directly relevant here, it seems?)


Non-deterministic compilation would be pretty bad.

First, producing different outputs from the same input raises a problem for caching tools (e.g. ccache, FastBuild), which assume the compiler is a pure function of its inputs.
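To make that concrete, here's a minimal sketch (hypothetical names, FNV-1a just for brevity) of the kind of key such a cache computes; if the compiler is not a pure function of these inputs, equal keys no longer imply equal objects:

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a over a NUL-terminated string. */
    static uint64_t fnv1a(uint64_t h, const char *s) {
        while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
        return h;
    }

    /* A ccache-style key: compiler identity + flags + preprocessed source.
     * The cache assumes this fully determines the object file. */
    uint64_t cache_key(const char *compiler, const char *flags, const char *src) {
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
        h = fnv1a(h, compiler);
        h = fnv1a(h, flags);
        h = fnv1a(h, src);
        return h;
    }

    int main(void) {
        printf("%016llx\n", (unsigned long long)
               cache_key("clang-10.0.0", "-O2", "int main(void){return 0;}"));
        return 0;
    }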

Moreover, assuming your code and the compiler are correct, you might still end up with a situation where the performance of the resulting binary differs depending on the alignment of the planets at build time.

Worse: when the input code is wrong (which does happen when you're writing new code and trying it on your machine): you build your code locally, and you're in "luck", as the compiler generates binary code "that won't crash". So you push your changes, and then you start getting complaints from your coworkers, because they pulled your commit and now they're getting crashes. At this point you don't know what's happening yet, so you might even tell them "it works on my machine, did you try a full rebuild?". And this might appear to solve the issue, if this time your coworkers are "lucky"!

Finally, let's suppose your code is correct, but the compiler has a code generation bug. Have you ever tried, as a user, to diagnose a compiler bug? You spend many hours minimizing the input file that triggers the bug, so that it's executable without the rest of your project and you can send it to the compiler devs. I wouldn't even try to do this if I knew the compiler was non-deterministic.

In short, non-deterministic compilation is an invitation for trouble and confusion.


I always thought that big compilers (clang/gcc/msvc) have some degree of non-determinism, due to the myriad of optimization passes in combination with heuristics deciding when and where to apply a certain kind of optimization. Is that true or not?

And certainly if you define deterministic compilation as always producing the same binary, this is already broken by compiler macros like "__DATE__" and randomly generated names during link-time optimization. [1]

[1] https://blog.conan.io/2019/09/02/Deterministic-builds-with-C...
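The __DATE__ case takes two lines to demonstrate: rebuild this file on another day, with identical source and flags, and the binary's bytes change (which is why reproducible-build projects pin or strip the date):

    #include <stdio.h>

    int main(void) {
        /* expanded at compile time, so the strings are baked into the binary */
        printf("built %s %s\n", __DATE__, __TIME__);
        return 0;
    }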


It's not true. They are usually deterministic for the purpose of reproducibility. The same compiler version, flags, and sources should produce the same binary.

Yes, __DATE__ in sources can break reproducibility. That does not mean compilers get carte blanche to be nondeterministic.


I would think that profile-guided optimization (PGO) also makes compilation nondeterministic (unless the profile is stored for reuse.)

https://en.wikipedia.org/wiki/Profile-guided_optimization

I've been enabling PGO when building Python lately and I imagine the resulting binaries are a little different every time due to random events during profiling.
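A hypothetical sketch of the kind of code PGO reshuffles: built with -fprofile-generate, run on training input, then rebuilt with -fprofile-use, the compiler lays out whichever branch the profile says is hot as the fall-through path, so different training runs can yield different binaries unless the profile file itself is kept fixed:

    /* classify() is what PGO reorders: the branch the training runs
     * marked hot becomes the straight-line path. */
    int classify(int x) {
        if (x < 0)
            return -1;       /* cold in typical training input */
        return x % 7;        /* hot path */
    }

    int main(int argc, char **argv) {
        (void)argv;
        return classify(argc - 1);
    }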


Except that gcc and clang are nondeterministic, to an extent

Sometimes the compiler needs to generate a random value and base part of the compilation on that value. Things like trying to predict which is the best branch, or things done at compile time.

A lot of work has been done to reduce the nondeterminism, and some of it can only be reduced by using things like "-frandom-seed=$your_git_commit" for example

Build determinism also goes deeper than that; for example, static libraries are archives that include the date of the archive's creation, and so on.

The simplest programs might generate the same hash, but don't expect all code to generate the same-hash binary by default


> Sometimes the compiler needs to generate a random value and base part of the compilation on that value.

As a compiler engineer with experience (among others) in LLVM and GCC this is the first time I'm hearing of this. Could you provide more details or a source?

I can't imagine where such behavior would be useful, let alone required. The only slightly plausible scenario I can think of would be representing some internal data structures as hash tables with random seeds to avoid denial-of-service attacks. But then the compiled code would still have to rely on, at some point, picking an arbitrary element out of such a hash table. I can't think of contexts inside a compiler where this would be a useful thing to do.


No experience here, but from the manpage:

-frandom-seed=string

  This option provides a seed that GCC uses in place of random numbers in generating certain symbol names that have to be different in every compiled file.  It is also used to place unique stamps in coverage data files and the object files that produce them.  You can use the -frandom-seed option to produce reproducibly identical object files.
Though my example/guess on branch prediction is probably wrong.


Ah, thanks. Looks like this only applies to symbol names, so the generated executable code should be deterministic, only symbol table entries might differ.


> The simplest programs might generate the same hash, but don't expect all code to generate the same-hash binary by default

Debian and others have put quite a lot of work into reproducible software builds:

https://wiki.debian.org/ReproducibleBuilds#Even_more

This of course only works if the compiler cooperates.

The bug linked earlier is a regression in Clang 10. Clang 9 was deterministic for the same file, flags, etc.


> First, producing different outputs from the same input brings the question of caching tools (e.g ccache, FastBuild), who assume the compiler is a pure function of its inputs.

Why should they have to? Shouldn’t they just be able to reach for any valid compilation of this particular object file and slot it in?

> Moreover, assuming your code and the compiler are correct, you might still end-up with a situation where the performance of the resulting binary differs depending on the planet's alignment at build time.

This is already the case due to your environment. If you have the wrong number of environment variables, you might already penalize your program's performance by a significant amount just because you misalign the stack!

> At this moment you don't know what's happening yet, so you might even tell them "it works on my machine, did you try to rebuild all?". And this might appear to solve the issue, if this time your coworkers are "lucky"!

This sounds like the situation already with nondeterministic bugs like races, albeit with the same binary?

> Have you ever tried, as a user, to diagnose a compiler bug? You spent many hours trying to minimize the input file that triggers the bug, so it's executable without needing the rest of your project, so you can send it to the compiler devs.

I deal with nondeterministic programs all the time…they’re a bit more difficult to file bugs for, but it’s still possible.


Nope, 10.0 was just released recently.


Why is OSSFuzz using such a bleeding edge compiler? That seems a little nuts.


Wouldn't you rather catch bugs before they're released in a stable version?


It is unfair to the authors of the software that is actually tested, in this case SQLite.

You are forced to investigate; otherwise people will attribute the bug to your software.

Toolchain bugs take an amazing amount of time and energy and happen more often than people think.


Exactly. This is precisely the point of nightly builds, is it not?


Clang 11 is still in early development stages. Release date is several months away. Clang 10 was released just a couple of months ago. 11 is expected to be buggy and not fit for use yet.

The SQLite devs now have to deal with "is it or isn't it a compiler bug" nonsense, taking their time away from fixing actual problems, working on features, etc., all because OSSFuzz decided to use a compiler that the compiler devs themselves don't yet consider fit for use.

How much trust can you have that the fuzz results are actually legitimate, either? False positives, or worse still, false negatives?


If you're going to go down that route, I would expect them to test using both the latest stable version and whatever unstable version they want. Bugs found using the stable compiler should be reported to the project, while bugs found only with the unstable version should be reported to the compiler.


Given that it's SQLite, this is likely a compiler bug. However, the code given is insufficient to demonstrate a compiler bug. Given:

      c = pMem->flags;
      sqlite3VdbeMemRelease(pMem);
      pMem->flags = MEM_Str|MEM_Term|(c&(MEM_AffMask|MEM_Subtype));
You could have, say:

    sqlite3VdbeMemRelease(struct foo *pMem) {
        *(some_other_type *)&pMem->flags = bar;
    }
In which case the C aliasing rules would allow the compiler to assume that the assignment through "some_other_type" does not affect the assignment through whatever type pMem->flags is.

I have seen this bug happen before, when compilers got better at inlining, where it was something like

   void getAddressOfSomething(intptr_t *address);
   ...
   char *p;
   getAddressOfSomething((intptr_t *)&p);
   *p = foo;
The compiler could reorder the `*p = foo;` line to be before the getAddressOfSomething call for the same reason.

TL;DR: Turn off strict aliasing via compiler flags (-fno-strict-aliasing on gcc) if you ever type-pun anywhere without using a union.
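A minimal sketch of the difference (cast vs. union; assumes a 32-bit float and little-endian byte order, for illustration only):

    #include <stdint.h>
    #include <stdio.h>

    float f = 1.0f;

    uint32_t bits_ub(void) {
        /* punning through a cast: undefined behavior under strict
         * aliasing, and exactly what -fno-strict-aliasing papers over */
        uint32_t *p = (uint32_t *)&f;
        return *p;
    }

    uint32_t bits_ok(void) {
        /* punning through a union is allowed in C (C11 6.5.2.3) */
        union { float f; uint32_t u; } pun = { .f = f };
        return pun.u;
    }

    int main(void) {
        printf("%08x %08x\n", bits_ub(), bits_ok());   /* 3f800000 3f800000 */
        return 0;
    }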


The code for sqlite3VdbeMemRelease can be found here: https://www.sqlite.org/src/file?name=src/vdbemem.c&ci=4218c7...

I don't see any type-punning of the flags going on. (The sqlite3VdbeMemRelease method doesn't touch the flags directly, but it calls vdbeMemClear which in turn calls vdbeMemClearExternAndSetNull which resets flags to MEM_Null.)


That's exactly what I expected. SQLite is one of the very few pieces of software where, when compiled on a new compiler, a bug is much more likely to be in the compiler than in the software it's compiling.

That's not because compilers are bug-free; compilers do have bugs. It's because most software is compiled with very few compilers and has a fairly low ratio of effort invested in being correct versus effort invested in writing features. Neither of those is true for SQLite.


Notice that C allows you to introduce UB for sqlite3VdbeMemRelease in any TU of the program, not only in the TU that defines it. So you'd need to check at least every declaration in the program, which includes checking every file that includes its header, making sure that, e.g., all defines are identical, among other things.
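For instance, a layout-changing macro is enough. A hypothetical sketch (nothing like this is in SQLite's actual headers):

    /* mem.h -- struct layout depends on a configuration macro */
    struct Mem {
    #ifdef WITH_DEBUG_FIELD
        int debug;
    #endif
        unsigned short flags;
    };
    void sqlite3VdbeMemRelease(struct Mem *p);

    /* If release.c is compiled without -DWITH_DEBUG_FIELD and caller.c
     * with it, the two TUs disagree on the offset of `flags`; calling
     * across that mismatch is UB, and no diagnostic is required. */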


> Turn of strict aliasing via compiler flags (-fno-strict-aliasing on gcc) if you ever type-pun anywhere without using a union.

That's the topic of Gary Bernhardt's (of JS WAT and Execute Program fame) very important "A Case Study in Not Being A Jerk in Open Source". https://www.destroyallsoftware.com/blog/2018/a-case-study-in...


> loses none of the meaning

It absolutely loses some meaning and important information, and most importantly some of its impact.

Communicating as if you are in a neutral state of mind when you feel very strongly about a subject puts everybody at risk of not understanding the perceived importance of it.

The rewritten email reads like corporate bullshit and is the kind of thing you risk immediately forgetting. It looks like The PowerPoint slide that crashed a space shuttle: https://medium.com/@JoachimLasoen/the-powerpoint-slide-that-...

It would have been far less widely distributed and discussed if it had originally been written this way.

Now, the original is maybe too much, and while a few repetitions are fine, it probably contains way too many. But there is probably a middle ground between the original email and the rewritten one. For example, "AND STANDARD" in caps is perfectly fine if you want such strong emphasis on that point; likewise "documented" for emphasis, though somewhat less than caps. It makes no sense to pretend that such emphasis is a meaningless expression of anger; those are the key points of the mail, called out early on. I agree you probably want to avoid "f*ckin moron" and "braindamage", but there is no point in removing "piece of garbage" about this point in the C standard, and the precise point under discussion is contextually clear in the mail.


Linus actually matches the standard in that one, interestingly.


What an utter bullshit article. If people can't see the technical merit of an article past the swear words, they should stop working and join their convent of choice. Weaklings.


I think that is the entire point of the article.


      c = pMem->flags;
      sqlite3VdbeMemRelease(pMem);
      pMem->flags = MEM_Str|MEM_Term|(c&(MEM_AffMask|MEM_Subtype));
'pMem->flags' is a u16[1]. Shouldn't it be copied to 'c'? How can 'sqlite3VdbeMemRelease' alter the value of 'c'?

[1]https://github.com/smparkes/sqlite/blob/8caf9219240123fbe6cf...


'sqlite3VdbeMemRelease' doesn't alter the value of 'c', but it may alter the value of 'pMem->flags', which is what the compiler replaced 'c' with.

From the post by Dr. Richard Hipp: "From the -S output, it looks like Clang-11.0.0 is compiling these three lines as if they were written as:"

      sqlite3VdbeMemRelease(pMem);
      pMem->flags = MEM_Str|MEM_Term|(pMem->flags&(MEM_AffMask|MEM_Subtype));
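A reduced sketch of the pattern (hypothetical names and mask values; the real reduction is in the LLVM bug report):

    struct Mem { unsigned short flags; };

    void release(struct Mem *p) {
        p->flags = 0;                 /* like vdbeMemClearExternAndSetNull */
    }

    unsigned short rewrite(struct Mem *p) {
        unsigned short c = p->flags;  /* must be read BEFORE the call */
        release(p);
        p->flags = 0x0202 | (c & 0x0c3f);
        return p->flags;
    }

    /* The bad optimization effectively sank the read of p->flags past
     * release(p), which is only legal if release() cannot modify it;
     * here it plainly can. */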


It doesn't. The compiler is moving the access to `pMem->flags` to come later in the sequence, after `pMem` has been modified by `sqlite3VdbeMemRelease`.


He said

>> Clang seems to be assuming that the sqlite3VdbeMemRelease() function does not change the value of pMem->flags.


You're right.

But for some reason it thinks it can just re-read the flags from the structure after that function, which takes a non-const pointer:

void sqlite3VdbeMemRelease(Mem *p);

I'll be waiting for the excuses from the "C standard evangelists" and the extreme-optimization people on how the SQLite developers are wrong in this case and how it's funny to pull the rug out from under them.


> But for some reason, it's thinking it can just reread the flags from the structure after that function, which has a non-const pointer

The pointer being const would not give the compiler any more guarantees that would allow it to reorder the read.
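To illustrate: a const parameter only forbids writes through that particular pointer; the pointee can still change through another name, so the read couldn't legally be sunk past the call either way. A tiny sketch:

    int x = 1;

    void observe(const int *p) {
        (void)p;
        x = 2;       /* legal: modifies the pointee through another name */
    }

    int read_then_call(void) {
        int c = x;   /* may not be moved past the call below */
        observe(&x);
        return c;    /* must return 1, even though x is now 2 */
    }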


Looks like an average compiler bug rather than malformed code?


pMem is a pointer; the contents of pMem will not be copied.

[edit] I see what you're asking. The point is that the compiler changes:

      c = pMem->flags;
      sqlite3VdbeMemRelease(pMem);
To something equivalent to:

      sqlite3VdbeMemRelease(pMem);
      c = pMem->flags;
Which is correct only if sqlite3VdbeMemRelease doesn't change pMem->flags.


Holy cow, this happened at -O1. This doesn't seem like the sort of optimization that should be possible at such a low level. I've run into plenty of trouble with higher level optimization flags in compilers before, but this is wild.


I see several references in this thread to "a bug in Clang that is already fixed", but I can't see anywhere where anyone references this bug.

Can anyone point to the bug report (and/or fix) in Clang?


https://bugs.llvm.org/show_bug.cgi?id=46195, it's now linked in the sqlite bug


Why would OSSFuzz send the bug report to SQLite instead of Clang?


OSSFuzz is an automated system.


More generally, bug reports frequently get filed against a project that uses the system that actually has the bug as a dependency - even with non-automated systems.


So... if I'm understanding correctly, you're accessing released memory? How is this anything but undefined behaviour?


So... apparently I'm not understanding correctly haha.


If I understand correctly, sqlite3VdbeMemRelease doesn't free pMem nor pMem->flags, just some other fields in the pMem struct.


You're almost certainly reading that correctly: pMem is a pointer to a pointer to a block of memory, plus some metadata for that block, and sqlite3VdbeMemRelease either releases the memory back to an application-level pool, frees it back to the memory allocator (glibc malloc/jemalloc/tcmalloc/dmalloc/etc.), or brk()/munmap()s it directly back to the OS.

I have no experience with sqlite internals, but it's pretty clearly doing some application-level accounting on top of the normal memory management.



