I know this was possible in the past, but I'm not sure if tcc can still compile a relatively new gcc.
Best method I can think of is assembly writes a macro assembler that writes compiler for simple, high-level language which is used for the first, C compiler. That compiles the complicated, optimizing one. HLA, a P-code variant, and Hansen's Edison are all potential contenders for low-level language. An Oberon-like language with something closer to C syntax & Wirth-style compiler might do for initial, high-level language. tcc or QBE backend have potential for simple, initial, C compiler. I'd do both LLVM and GCC for last step as good projects rely on each.
If one wanted to spend money, then license CompCert, extract it to ML, compile with CakeML, verify each function against assembly, and equivalence check everything with testing. CompCert verifies C to ASM with its outputed ML verified by CakeML assuming their specs are correct. Easiest cheat. Use CompCert to compile first C compiler, the main C compilers, or anything else you can.
https://github.com/thepowersgang/mrustc exists, anyway. I don't think it can compile rustc yet.
> Thankfully, (from what I have seen), the borrow checker is not needed to compile rust code (just to ensure that it's valid)
It's not finished, though, and honestly I'm pretty sure that borrow checking isn't even that hard compared to all the other stuff a Rust compiler has to do, like generic/trait resolurion. But we'll see.
You can implement basic trait resolution via name mangling and a global map. I did this as an experiment a number of years ago to implement type class-like overloading in C. I'm not sure how far you could take that technique though. I don't think it would work for higher-kinded types, but Rust doesn't have those.
But yes, such a compiler can be used.
The misattribution and forcing a correction by Thompson is in 3.2.4 of their lessons learned paper:
Here's the original paper where the so-called Thompson attack is described on page 17.
Also note they were inventing both hacking techniques and INFOSEC while doing this evaluation. It was part of forerunner work happening among small number of people with little to draw on. It's why you see me say "the legendary Paul Karger" when describing the results they got. Due credit might be "Paul Karger's compiler attack popularized & further explored by Ken Thompson's paper, Trusting Trust." I keep mentioning it until more give it.
Back on topic. These days we have verified and certifying compilers, too. Even typed assembly language with correctness proofs. Lots of stuff to base it off of that's close to what most developers can understand. Basic refinement from Rust compiler code with no optimizations to macro assembly or local scripting languages is what I've been recommending outside verified compilers since any developer can do it without special tooling. I even proposed bash one time although as a compile target more than what I'd try code it in lol. I see using local scripting is in your suggestions, too. That different people are thinking on same lines here more often might mean it's worth exploring further.
For other interested readers:
a) html version of David Wheeler's dissertation: http://www.dwheeler.com/trusting-trust/dissertation/html/whe.... I read it over a week last month, and it made a big impression on me.
b) the HN discussion for the Wheeler dissertation: https://news.ycombinator.com/item?id=12666923
Here's his high-assurance and SCM pages:
It would be extra cool if you had a series and did this with Go as well :)
It shouldn't be hard, and I've played with the Go AST before. I ... could :)
This was sort of a bucket list thing though, doing it a second time is less fun :p And I don't have that much time.
Why don't you try? The technique explained in the blog post is pretty universal. Go has an AST too, and while the Folder abstraction doesn't exist in its codebase (as far as I know), you can still tweak the AST. In the case of Go it would be easier to do the injection during parsing. https://github.com/golang/go/blob/5c9035acc1e1540a564ea66000... would be a good entry point. https://golang.org/pkg/go/ast/ should help.
(I am not a Go compiler dev so there may be better entry points)
> It shouldn't be hard, and I've played with the Go AST
> before. I ... could :)
I appreciate the tips on starting points. Very interesting. If I do end up doing something like this I will give credit where it's due.
> Simply recompile the purported source code twice: once with a second (trusted) compiler, and again using the result of the first compilation. If the result is bit-for-bit identical with the untrusted binary, then the source code accurately represents the binary
This sounds very familar to me. If I remember correctly, this is exactly what happens when you build GCC. GCC's 3-stage bootstrap process is well-known for a long time, as well as the decussion about whether this makes sense or is just paranoid.
It's nice to have this written down and analyzed throughly in a scientific paper, though.
but… can we assume that any compiler is trusted?
Or you can use the incremental approach I describe elsewhere in this thread.
For the minority of you that haven't read the original Reflections on Trusting Trust, you should do so now. It goes more in depth on this attack category, and its implications. Personally, I would say it's one of the few papers that should be required reading for programmers (along with In The Beginning Was The Command Line, and The Lambda Papers).
This is an interesting solution to a problem I often face (whenever I write a tool to in environment X to process something for environment X) . Is this a common way to handle this problem? I don't think I've seen this before.
The other one that comes up in Rust is 'type', people tend to use 'ty'.
Would something like randomizing memory layouts, or reversing stack direction be an easy mitigation or solve an attack like this?
One potential way to try to remove attacks like this is to run the source code of the compiler through something that randomizes the structure and text without changing how it actually functions. That way, you may be able to defeat the code that is trying to match on code that looks like the part of the compiler it's trying to hijack.
what I mean is that I wouldn't expect a the 2012 rust prerelease to be able to compile the 2016 rust source code, but it can probably do 2013, use that to compile 2014, use that to compile 2015 use that to compile 2016.
As long as you have a single binary from any point before the attack was introduced it shouldn't be an issue. The whole point is that at no point does the source code contain the trust backdoor, so you can just work forward from any binary that doesn't have it yet.
if the very first version of rust binary already had it as an issue, as well as was written in rust, you could conceivably have a problem though. then you would need an alternative compiler, however sub-optimal it might be...
or you could simply patch and remove the backdoor from the binary and then have it compile itself without inserting the backdoor.
Clearly you aren't aware of Rust's history :)
proto-Rust has been under so many rapid changes that each compiler usually only compiles with a specific hash. Now stuff works with a numbered Rust release, but that's a relatively new phenomenon. This process will likely need to go through hundreds of compilation steps. Doable, but not as simple as a year-by-year process.
In the worst case, you have to follow each commit in the version control system of the compiler, but I'm pretty sure you don't need to do it that fine grained.
>you use an old compiler binary that didn't have the attack yet to compile the latest source code that it is able to compile, then use that to compile the latest it can compile, and so forth.
in trying to find the "latest source code that it is able to compile" you can do a binary search backward from the current version of the source code. it doesn't take long to find the latest version for each one (i.e. the latest each one compiles without error, into a working binary that passes some test suite). And anyway whenever you're binary-searching forward (i.e. after any point where the binary search yields ""greater-than" because the one you just tried compiled) then you can just use the version you just successfully compiled. Let me illustrate what i mean with this binary search:
so if we're at version 1 million today, which the version 7 doesn't compile, then you try version 7 on version 500,000 (it'll fail), then on version 250,000 (it'll fail), version 125,000 (it'll fail), version 125,000 (it'll fail), version 62,500 (it'll fail), version 31,250 (it'll fail), version 62500 (fail), version 31250 (fail), versions, 15625, 7812, 3906, 1953 and 976.
Now suppose that version 7 compiles version 976 successfully. So the failure with version 7 is between 976 and 1953. But since version 976 is stronger, you can start working forward with version 976. So it's like a binary search that's restarted whenever you get a "greater-than".
Even if for some reason this were a manual process, each time you get a working version you at least halve the remaining space.
Finally, as you said above, someone could make a batch file / shell script that literally goes through every commit in the version control system (not skipping any) and always use the previously working version on the next working version. A script doing so may well run in matter of days, though, depending on how long it takes to compile a the compiler compiler. Ordinarily there are a lot of commits!
The binary search above cuts this down significantly. (However the binary search isn't theoretically guaranteed to be faster; after all theoretically we can imagine that version 7 compiles version 8 but fails on version 9; version 8 compiles version 9 but fails on version 10; etc. So theoretically every single commit could be breaking. (Theoretically.)
But that's extremely unlikely to be the case. I don't imagine you'd have to do more than a few hundred compilations with the above binary-search methodology before you got one that started with version 7 but stepping through the commits in the way specified, produced a binary that compiles the latest version.
The process may have to be partly manual due to breaking changes in semantics of invoking the compiler or its dependencies, but that should be rare.
This process would also in the end allow you to do a bit-for-bit comparison of the output of the current version of the compiler when compiled using the above trusteable version, versus compiled from source code with the "trusting-trust" backdoor (where every version is backdoored and inserts a backdoor when compiling itself, without this backdoor being in the source code anywhere.
so the above process would let you tell whether there's a trusting-trust backdoor, as long as there is a single early version that for sure didn't have it yet, and you have the commit history (from which the trusting-trust has been edited out).
as I said above, if you don't have a single known-good version without the trusting-trust backdoor, then you'll have to write something that can compile version 2 or 3 (or 7) yourself, in another language.
I just wanted to note that the scale was way off. You said "I wouldn't expect a the 2012 rust prerelease to .." followed by "but it can probably do 2013", so it was clear you had an expectation of what the dates would be approximately like.
And for many compilers, this is true -- you can use really old compilers to compile the new one. But not everyone is aware of how tumultuous Rust's history is, so I thought it interesting to note.
I should have worded it better I guess.
(By comparison the specifications for C don't break earlier compilers often at all.)
Thanks for the clarification :)
>So you have a string containing the contents of the module, except for itself
I assume interesting versions of TT would have to avoid this trick, since someone running "strings" on the binary would notice something very suspicious, unless something strange is done to string literals.
The original article was really about this -- at the end of the day, you have to trust _someone_. Of course, we more easily trust microprocessors and assemblers over binary blobs, so
> I assume interesting versions of TT would have to avoid this trick
Pretty easy to encode the string literal into some binary format.
Alternatively, serialize and later deserialize the AST with a stable binary serialization mechanism.
A really good version of TT operating on the AST would have to backdoor not only the part where it creates the AST, but also the parts where intermediate state is displayed by the compiler (e.g. where it can dump AST/MIR output).
There are things you can do. As a proof of concept, I didn't bother to do them. My current POC is toothless and I like it that way!
It's cleaner to instead operate at the end of the pipeline; on llvm ir or the generated binary (but it's also harder to write). And if you can insert a trusting trust attack in llvm itself, well, that would be something :)
True... but it has to be backdoored for that particular system. There are many ways you can make it very unlikely that the other compiler/system is backdoored for the same target.
For more information, see my work: http://www.dwheeler.com/trusting-trust/
With about 5... esp U.S., French, Russian, Chinese, and Japanese... you should be fine. Add South Korea these days with Samsung hardware. The software doesn't have to come from same countries as hardware. Better if it doesn't. All safety checks on in it with POLA enforced at the least. Shady exploits based on language primitives (esp C's) is why the compilers need to be in safe languages.
Great work on DDC, btw, I really enjoyed that paper.
I hated the idea that there could be an uncounterable attack... so I kept at it until I could show that there is a countermeasure.
On the lowest level, you can hand-check it with pencil and paper. Per Brinch Hansen used to write the earliest compilers in ALGOL that he type-checked and tested by hand. He'd then write optimized assembly that visibly matched each function or procedure. He claimed the ALGOL still aided correctness plus served as better documentation of algorithms than commented ASM.
Far as Rust, one could do it by hand if they skipped safety checks and optimizations. Otherwise, do it in older or randomly acquired stuff highly unlikely to be subverted. Many such devices.
I remember being impressed at the time. An excellent read covering many topics often featured on HN.
Yep. This was the original intention of the talk/paper; it was about _trust_, not attacking compilers. Attacking compilers is just a super cool example that was used.
I'm not even sure that the open source RISC-V initiative might prevent this, as theoretically the NSA could gain access to the manufacturing plant and insert their own core.
The only way to catch that at that point would be to X-ray the chip and look for their mod, or something. Anyway, assume the NSA has access to your shit, focus on preventing the random from grabbing your bank account access and stealing your money.
Hypothetically, Graydon could have inserted a trusting trust attack in the ocaml compiler that would also insert a version of itself inside rustc. Highly unlikely; very hard to do, but ... possible.
(This file is no longer in master because the compiler bootstraps from the previous stable release; too lazy to accurately isolate the last time it was updated)