Reproducible builds are important: they prevent a build server or developer laptop from being a single point of failure, since a tainted build can now be detected by others.
The NetBSD wiki entry is about progress on reproducible builds for tools, kernel and userland.
As for packages, reproducible builds under pkgsrc are still a WIP.
My use of third-party binary packages and pkgsrc is minimal, but I am continually building custom kernels and crunched binaries with build.sh, and I do make use of the binary userlands from releng.
For me at least, having reproducible builds for these alone is quite useful.
Valerie Young had a good progress report at Linux.conf.au this year:
Reproducible Builds for a Better Future:
"I would also like to acknowledge the work done by the Debian folks who have provided a platform to run, test and analyze reproducible builds. Special mention to the diffoscope tool that gives an excellent overview of what's different between binary files, by finding out what they are (and if they are containers what they contain) and then running the appropriate formatter and diff program to show what's different for each file."
The output of those Jenkins builds is then run through Debian-provided test suites to validate the reproducibility of the aforementioned NetBSD builds.
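The core of such a check can be sketched in a few lines of shell: build twice (or fetch someone else's artifact), hash both, and only reach for diffoscope when the hashes diverge. Everything below is illustrative; the `build1`/`build2` directories and file name are made up:

```shell
# Minimal reproducibility check: two builds must be bit-for-bit identical.
set -eu
mkdir -p build1 build2
printf 'hello' > build1/prog.bin    # stand-in for a first build
printf 'hello' > build2/prog.bin    # stand-in for an independent rebuild
h1=$(sha256sum build1/prog.bin | cut -d' ' -f1)
h2=$(sha256sum build2/prog.bin | cut -d' ' -f1)
if [ "$h1" = "$h2" ]; then
    echo "REPRODUCIBLE"
else
    echo "DIFFERS; inspect with: diffoscope build1/prog.bin build2/prog.bin"
fi
```

In the real pipeline the two builds would come from separate machines or deliberately varied environments, which is what makes a matching hash meaningful.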
This one, at least, can be done in NixOS/Guix once you check out the source -- and the Nix package manager can technically be installed on any Linux distro too (plus ports to Cygwin/FreeBSD/Mac etc.), after which a single command gets you the ISO, or any other build product you want.
The carefully tested and maintained portability/cross-compilation is another thing, though: NetBSD has fantastic support here that is not easily replicated without a ton of work. Its universal, basically-always-works cross-compilation, everywhere, is rather unique. You can't build NixOS ISOs natively on e.g. Nix-on-Darwin, which is rather unfortunate.
1) Prevent tampering by making part of the system immutable? The fingerprint isn't necessary; unconditionally prevent modification to the relevant files instead.
2) Prevent tampering by using trusted files? Normally this should be done by having a set of trusted keys, not hardcoded hashes. That way you can still securely upgrade the system.
3) Accessing files from a remote untrusted filesystem? This doesn't seem to work either; see the caveats section in veriexec(9).
Am I missing something here?
You can just as easily download the source over one or more links to your OS and then compile it locally. That is what TCSEC required for trusted distribution. TCSEC was partly designed by the guy (Karger) who invented the compiling-compiler attack, if you're wondering about credibility. If your existing compiler is subverted, then your system is probably subverted with a rootkit, given that's easier than sniping one program you may never use in the intended way with an exotic attack. If you can't trust the repos for source or a local compiler, then how did you trust your kernel or anything else to begin with?
The real solution is secure SCM, the ability to build from source, and a trusted point to start with. You have the latter unless your system is pre-backdoored. Reproducible builds might have other benefits, such as during debugging, but they bring little security benefit over the stronger alternative I mentioned.
The maliciousness in the compiler would have to be introduced in source code (which is harder) or somehow already be present in the initial operating system binary (disregarding bad hardware, which is bad for all systems testing builds).
That's kind of what I'm saying. In your model, you trust the binaries and files you received in the OS, a bunch of which run in kernel mode. You don't trust the compiler. In my model, I trust the semi-privileged compiler I received with the OS since I already trust all the OS code. An attacker who could subvert the compiler would prefer subverting the OS instead of, or in addition to, the compiler. Then they could possibly avoid any countermeasures I have, such as double compilation. If I use a stock scheme, they might even subvert it.
If you're not trusting the repo, you have much bigger problems than reproducible builds can solve. At the least, you will be doing reproducible builds plus a bunch of downloads from mirrors, hash/sig checks, comparisons, etc.
Even if you have all that, how can you be sure no one tampered with your system at some point? Reproducibility lets you compare your entire system against known-good states.
If you have a known, good state, then you don't have to. You simply use the source or binaries in the repo after checking hashes or signatures on a few mirrors. If your state is that bad, then you're already compromised, possibly with a rootkit.
"Even if you have all that, how can you be sure no one tampered with your system at some point?"
That takes way more than reproducible builds. You need source code for a compiler that doesn't have 0-days in it, in terms of backdoors, optimizations that eliminate security checks, or passes that create exploitable faults. That's called a verified compiler. Only one exists for C (CompCert), which is proprietary, with quite a few open ones for ML languages. The SCMs distributing that source code must be secure so they aren't adding modifications to the source or the OS binary you started with. During transfer, the files have to be signed or the transport has to be secure. You then need some kind of local tool you can trust to bootstrap it. If you trust nothing, you're going to be using ancient hardware you bought in person with cash to run an interpreter you wrote by hand that executes something equivalent to a small C compiler, which you use to compile the others that you inspected by eye or trusted via 3rd-party review w/ hash/sig confirmation.
If exotic attacks like Karger's are in your threat model, so are 0-days that are accidental or deliberate along with endpoint and SCM attacks. Reproducible builds don't protect you from much in that model. The stronger methods, from endpoint protection to SCM to local tooling, will pay off over time in more situations given they're independently critical. It's why Karger et al put them in first standards for INFOSEC to begin with.
The other problem is that making all binaries come out equal creates more of a monoculture that aids attacks on endpoints. Many good tools in the field automatically harden C programs by making them memory or data-flow safe. Softbound+CETS comes to mind. Others obfuscate the hell out of them on a per-user basis to reduce odds of one-size fits all attack. You can't use these techniques with everyone having same binary. So, it's a step back from existing compiler and system security methods in total assurance you can achieve.
This is a red herring, for two reasons. First, even if the build is not reproducible, the distribution produces packages from some source code and those packages are fixed. If one is installing a distribution's binary packages there's no more or less of a monoculture if that package can be reproduced.
Second, the sources of non-reproducibility we're trying to fix here come from things like timestamps embedded in the binaries, arbitrary directory file ordering, etc. These provide little to no meaningful diversity in the resulting binaries anyhow.
Having the ability to reproducibly build packages does not preclude the application of tools or techniques to introduce diversity in the compilation process or otherwise obfuscate built binaries on one's own system.
That's not a red herring. It means the first distros will all start the same, whether installed from a binary or built from source (e.g. Gentoo). In the binary case, they can build from source on the local machine to immediately diversify, then do the rest with source. This means the monoculture risk is only present when the system is first set up. The system can even be programmed to connect exclusively to repos until that risk is removed, or warn the user if they turn that off manually. One could go further and have multiple binaries available that each use a different security technique or combination of them, all built into an automated build system.
"Second, the sources of non-reproducibility we're trying to fix here come from things like timestamps embedded in the binaries, arbitrary directory file ordering, etc. These provide little to no meaningful diversity in the resulting binaries anyhow."
That's a red herring. The diversity I referred to comes from compiler transformations that provably enhance security, with site-specific obfuscations on top of that. Those transformations couldn't happen if everyone's compilation results in the same binary with hashes they're checking against each other. The other things don't affect security much, as you said. That's why I didn't bring them up.
"Having the ability to reproducibly build packages does not preclude the application of tools or techniques to introduce diversity in the compilation process or otherwise obfuscate built binaries on one's own system."
That much is true. However, solving the SCM problem plus local build tools already eliminates the need for the security aspect of it. The other techniques can then be included by default. A reproducible build can also be done on top for the extra benefits it brings, but compiler subversion is already knocked out by the former techniques plus a certifying compiler.
No. The work to get to a state where packages build reproducibly, in general, consists of removing timestamps, providing stable sort order for inputs, and similar cases. In the vast majority of cases the only diversity in a distro's package set comes from these sorts of differences. If addressing these cases results in a reproducible build, there was no meaningful diversity to begin with.
If the goal is stopping subversion, I identified a bunch of other things you have to do. Some conflict with reproducible binaries, where you avoid them or throw them away immediately. Some of the strongest measures... memory-safe languages, certified compilation, highly-assured SCM... you aren't doing at all, as far as I'm aware. Your attackers will try to hit all of this, though, rather than just do a compiler-compiler-subversion thing in a MITM scenario. Hence the need for strong, holistic stuff instead of tactical hacks.
Of course there's a lot more that needs to be done to prevent or detect malfeasance, and while it's related, it's beyond the scope of the reproducible builds effort.
The main threat model that reproducible builds are meant to guard against is simple: an attacker, who has write access to a binary distribution of some piece of software, uploads a malicious binary rather than the true output of compiling the corresponding source.
With a reproducible build, anyone can prove that a given binary is non-malicious by re-running the build and verifying that the output is the same as the binary they downloaded. This leaves the problem of ensuring that everyone gets the same binary, i.e. the server distributing binaries hasn't been modified to serve different files to different IP addresses or something like that. But there are ways to solve that: for example, you could have multiple independent parties who each verify all the builds and sign them with their keys, and end users could check for signatures from N different trusted parties before installing.
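The N-of-M check in that last step can be sketched crudely in shell: count how many independently published hashes match the one you computed locally, and accept only above a threshold. The hashes, the threshold, and the three hypothetical rebuilders below are all invented for illustration:

```shell
# Accept a package only if at least N independent rebuilders
# published the same hash we computed locally.
set -eu
N=2
local_hash="abc123"      # stand-in for our locally computed hash
# Hashes as published by three hypothetical independent rebuilders:
published="abc123
abc123
fffff0"
matches=$(printf '%s\n' "$published" | grep -cx "$local_hash")
if [ "$matches" -ge "$N" ]; then
    echo "ACCEPT: $matches independent rebuilders agree"
else
    echo "REJECT: only $matches agree"
fi
```

A real deployment would verify detached signatures over those hashes rather than trusting a plain list, but the threshold logic is the same.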
Maybe you understand this, but I have a hard time seeing how many of the things you propose in any way substitute for reproducibility.
> If you have a known, good state, then you don't have to. You simply use the source or binaries in the repo after checking hashes or signatures on a few mirrors. If your state is that bad, then you're already compromised, possibly with a rootkit.
Checking signatures on a few mirrors is nice, but you still have to trust the single machine where the package was originally built. Reproducible builds let you avoid that.
If you use no binaries at all, i.e. you start with some existing (trusted) OS with a compiler already installed, and from that OS you build the entire target OS from scratch, then sure, you don't need reproducible builds. But nobody installs operating systems that way. The vast majority of OSes don't even support being built from any OS other than themselves. As has been mentioned in this thread, NetBSD is a partial exception as it can build from any POSIX system, but that still rules out building from, say, Windows. Much easier to just verify some GPG signatures on your OS of choice, at least if you trust that the N verifiers won't collude or all be hacked, etc.
> That takes way more than reproducible builds. You need source code for a compiler that doesn't have 0-days in it in terms of backdoors, optimizations that eliminate security checks, or passes that create exploitable faults. That's called a verified compiler.
A verified compiler would be nice to have in many cases (with the drawback that the output binaries are usually rather slow), but that's pretty much orthogonal to reproducible builds, as the threat models of "0-day via bad compiler optimization" and "binary distribution compromised" are completely separate. Compiler backdoors are also orthogonal, since a binary purportedly corresponding to a verified compiler can still be backdoored (the verification is usually done on the compiler's source)...
> The other problem is that making all binaries come out equal creates more of a monoculture that aids attacks on endpoints. Many good tools in the field automatically harden C programs by making them memory or data-flow safe. Softbound+CETS comes to mind. Others obfuscate the hell out of them on a per-user basis to reduce odds of one-size fits all attack. You can't use these techniques with everyone having same binary.
If you want to customize binaries per-user, or use different compiler settings/passes from upstream, then that's great. In that case, reproducible binaries are still useful for the initial bootstrap step, as I mentioned above.
But I don't know what that has to do with hardening passes. There is nothing inherently nondeterministic about those. They can be part of the upstream build, in which case they should be reproducible, or they can be not part of it, in which case you need to build from source if you want them. But in that respect they don't differ from any other compiler option or compiler variant.
This isn't the only attack reproducible builds are about. They started as part of Wheeler's Countering Trusting Trust paper on how to beat Karger's attack of modifying compilers to backdoor themselves. So there's a MITM problem and a compiler-subversion problem they're about in most places doing them. Most of the threads on the topic (including this one) also have people bringing up stuff from the Thompson paper or Wheeler's technique. That threat model requires more security, especially against compiler vulnerabilities or malicious source.
My other argument was that users are trusting an image of their OS from a repo that comes in binary. If they trust that, then why not trust the binary of the compiler or other apps in that repo? If they don't trust the repo, they shouldn't be downloading privileged software from it that runs in kernel mode. Bit of a contradiction. One counter is that it might be compromised later on. Well, do they not do software updates either then? Benefits of reproducible binaries over source-based distribution and updates from secure repo are slim to none outside saving compilation time.
If the former, then as I said, it's theoretically possible to build the entire desired OS from source on the existing system, but that's often not supported by OS build systems, and very uncommon in practice as an installation method.
If the latter, with reproducible builds you don't have to trust the repo - even for the initial install. You only have to trust that at least one of the people who signed the packages in the repo, ideally after verifying using entirely separate infrastructure, is honest. There's no single point of failure.
Well, you also have to trust that the source isn't backdoored, of course, but that's at least somewhat easier to detect than backdoors in binaries.
If you download a binary package, that still leaves you trusting the packager who built and signed it, and it makes it much harder for other packagers to cross-check their work. To what extent does anyone currently disassemble the binary packages in, say, Debian?
At the moment every packager has the keys to the kingdom. Cutting that down to only the packagers of core system binaries like the compiler would be a win. Having multiple independent packagers double-checking each other's work would be a win. It's not everything, but it's valuable.
I'm agreeing with that. I'm also saying you're already doing it for a whole distro. Why not the compiler they can ship with it on top of that? If you don't trust repo, then you need to be doing more than reproducibly building a compiler's source.
This is true unless the repo you're using can be accessed by malicious parties or contain code from them. As in, if you're avoiding packages from person A but not B and both have write access to the repo, its server, or its network then your security is unchanged. If different source/binaries are completely isolated by different people, then your choice might reduce your risk in event one becomes malicious.
A and B sign the packages they build with their own PGP keys, no?
(Cool thought: with reproducible builds, multiple independent packagers could perform the build locally and upload the package signature - only the first packager would have to upload the actual package, but we could check the others' signatures to increase our confidence that no funny business was going on)
Satisfying this requirement made GCC a lot better.
I always dread dealing with build systems, mostly in the C land.
Deterministic behaviour, especially in this rigorous fashion, is probably very helpful for many more cases than just trust.
The long-standing assumption that make executes pure functions to produce its outputs could actually become true. Then it really would suffice for make to trigger a target only when one of its inputs changed.
This is exactly how Nix and Guix work. If any input changes, it forces a rebuild of all dependent packages.
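As a toy sketch of that input-addressed model (assuming nothing about Nix's actual store layout; the `cache` directory and hash function choice are mine), a target's cache key is the hash of its inputs, so a rebuild happens exactly when an input changes:

```shell
# Input-addressed caching: reuse the output iff no input changed.
set -eu
mkdir -p cache
echo 'int main(void){return 0;}' > main.c
inputs_hash() { sha256sum < main.c | cut -d' ' -f1; }
build() {
    key=$(inputs_hash)
    if [ -e "cache/$key" ]; then
        echo "cache hit: $key"
    else
        echo "building..."       # stand-in for the real compile step
        : > "cache/$key"         # record the output under its input hash
    fi
}
build                            # first build: cache miss
build                            # nothing changed: cache hit
echo '/* tweak */' >> main.c
build                            # input changed: new key, so rebuild
```

Nix extends the key to cover the compiler, flags, and every transitive dependency, which is why changing any of them cascades into rebuilds.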
Indeed, there are a load of QA benefits for reproducible builds. Let alone the CO2 savings that result from cache hits instead of pointlessly rebuilding dependencies.
"Reproducible builds of Debian as a whole are still not a reality, though individual reproducible builds of packages are possible. So while we are making very good progress, it is a stretch to say that Debian is reproducible."
In my experience, if you have a single home-made package in C, it is pretty easy to make it reproducible.
OpenBSD - Absolute Security
FreeBSD - General purpose
What is NetBSD aiming at?
From their website:
NetBSD is a free, fast, secure, and highly portable Unix-like Open Source operating system. It is available for a wide range of platforms, from large-scale servers and powerful desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent for use in both production and research environments, and the source code is freely available under a business-friendly license.
> Unfortunately this was not easy to find on NetBSD, because we are still using CVS as the source control system
seems just weird.
These special strengths -- vast hardware compatibility, rump kernels, and now fully reproducible builds -- are all enabled by a greater underlying (and seemingly underrated) technical excellence.
2. dates/times/authors etc. embedded in source files
3. timezone sensitive code
4. directory order/build order
5. non-sanitized data stored into files
6. symbolic links/paths
7. general tool inconsistencies
9. build information / tunables / environment
10. making sure that the source tree has no local changes
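Item 2 is often easiest to see with an archive. As a hedged sketch using GNU tar's determinism flags (BSD tar spells these differently, and the `src` directory here is made up), pinning member order, mtimes, and ownership makes two archives of the same content byte-identical even when file timestamps differ:

```shell
# Embedded dates/times: make a tarball that ignores filesystem mtimes.
set -eu
mkdir -p src && echo 'hello' > src/file
tar_det() {  # deterministic archive: pin order, mtime, and ownership
    tar --sort=name --mtime='@0' --owner=0 --group=0 --numeric-owner \
        -cf "$1" src
}
tar_det a.tar
touch src/file                   # only the mtime changes, not the content
tar_det b.tar
cmp a.tar b.tar && echo "IDENTICAL"
```

Without `--mtime`, the `touch` alone would make the two archives differ, which is exactly the class of spurious difference reproducible-build work removes.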
* Non-isolated build environment. This is just asking for all sorts of trouble (users, hostname, network access, etc).
* File system time stamps.
* Recording times in the build process (although even gcc supports SOURCE_DATE_EPOCH since version 7).
* Usage of CPU-specific instructions, e.g. -march=native to GCC.
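For the recorded-times case, a build script can honor SOURCE_DATE_EPOCH itself, following the same convention gcc >= 7 uses for __DATE__/__TIME__. A minimal sketch (GNU date syntax; BSD date would use `-r` instead of `-d`, and the stamp format is arbitrary):

```shell
# Use the pinned epoch for build stamps when SOURCE_DATE_EPOCH is set.
set -eu
build_stamp() {
    if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
        date -u -d "@$SOURCE_DATE_EPOCH" '+%Y-%m-%d'   # reproducible
    else
        date -u '+%Y-%m-%d'      # "now": varies from build to build
    fi
}
export SOURCE_DATE_EPOCH=0
stamp=$(build_stamp)
echo "build stamp: $stamp"
```

Distributions typically set SOURCE_DATE_EPOCH to the timestamp of the latest source change (e.g. the last changelog entry), so the stamp is stable yet still meaningful.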