Debian Buster will only be 54% reproducible, while we could be at 90% (debian.org)
280 points by JNRowe on Mar 5, 2019 | 160 comments

I find it fascinating that, from a security standpoint, it's best for the static representation of a program (the executable and the package) to be perfectly predictable, while the dynamic representation of a program (once it's loaded into memory and running) should be as unpredictable as possible while maintaining correctness (using ASLR and other methods to slow down attacks on vulnerable code). I guess running a program is like setting off an explosive: you need to know exactly what you're handling before you release its power.

Anyway, many kudos for this work!

What do you mean by "as unpredictable as possible"?

From the perspective of an application developer I want that dynamic representation to also be predictable.

You are probably talking about different kinds of representations.

hathawsh meant representation in the sense of memory layout and other low-level details of how application state is encoded. You seem to be talking about the programmer-visible abstractions built out of that layer. Behaviour at that level has to be predictable in order to "maintain correctness", but predictability at the lower level mostly helps maintain the correctness of malware.

Interesting. Out of curiosity, is there any correlation between programming paradigms and the predictability of low-level details? I would imagine that a “boring” C program with one big chunk of mutable state would be predictable in this sense. What about programs developed in functional languages with immutable data structures?

> I would imagine that a “boring” C program with one big chunk of mutable state would be predictable in this sense.

It likely wouldn't. ASLR will choose a random load address for the module (.so or .exe), which means that main() wouldn't be at a predetermined address. The heap and stack would similarly have arbitrary offsets. Furthermore, any other modules that get loaded would be at arbitrary addresses. Ultimately, even "hello world" should become extremely randomized, provided ASLR is enabled for the process.
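To make the point above concrete, here is a minimal sketch that starts the same tiny program several times and prints where one small allocation landed in each process. The helper name `sample_addresses` is made up for illustration, and the sketch assumes a platform where Python's `ctypes` is available; on a system with ASLR enabled the values will usually differ between runs, but that is not guaranteed.

```python
import subprocess
import sys

# Child program: allocate one C int and print the address it landed at.
CHILD = "import ctypes; print(ctypes.addressof(ctypes.c_int(0)))"


def sample_addresses(runs=3):
    """Run the child in fresh processes and collect the printed addresses."""
    addresses = []
    for _ in range(runs):
        out = subprocess.run(
            [sys.executable, "-c", CHILD],
            capture_output=True, text=True, check=True,
        ).stdout
        addresses.append(int(out.strip()))
    return addresses


if __name__ == "__main__":
    # With ASLR on, these typically differ from run to run.
    for addr in sample_addresses():
        print(hex(addr))
```

The program text is identical every time; only the runtime layout changes, which is the static/dynamic split the top comment describes.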

ASLR prevents attacks arising from known addresses in virtual memory, because those addresses are otherwise reliably predictable (just run the program, attach a debugger, and find the addresses that you care about).

I think the idea is that (ideally) these low level details are unpredictable even for a boring C program. I.e., when the C program calls malloc, it doesn't actually need to know what the exact format of the heap is. And if every running system has a slightly different format of heap and stack, that makes it harder to write exploits which will work across all of that variation. Can't smash the stack if you don't know how it's formatted.

For someone who's found a way to inject data into your app, you don't want them to know the address of your symbols and data structures.

In other words, computer exploitation is basically hostile debugging.

Enbugging. ;-)

Well, bugfinding :-)

If your debug symbols still work, are you sure you really care?

Wow, I'm impressed by the consistent progress.

There are so many things based on Debian, and adding reproducibility to Debian is a massive security improvement.

I have a lot of respect for people working on this, it must be hard to respect release freezes when you've been working on this for years.

Slightly fluid numbers aside, half the packages for Buster still sounds like good progress to me.

It's certainly the right direction! I applaud them for pushing this.

From the article, it looks like many of the others are reproducible on a code level, but the release system is using older binaries, which haven't been rebuilt yet.

It's too bad it won't be fixed in time for this release, but bodes very well for future releases.

What's the importance of reproducible builds?

Edit: found this nice overview by clicking a few links from the original post https://reproducible-builds.org/

It allows you to verify that the source code was unaltered when the original build was produced. (By building it a second time with known good code.)

Does anyone diff public releases with builds done using a known good environment?

It would be a lot of work but could expose malicious providers.

Then again, it might be hard to create an actual good environment considering all the software/firmware/hardware layers modern systems have.

Yes, this is known as a rebuilder.

There is still work to be done, but NYU was one of the organizations working on developing and running a rebuilder. The idea is that you pull buildinfo files from https://buildinfo.debian.net/, then try to verify them and if you got the same artifact you sign that you successfully verified this binary package.

A user could then configure "I trust rebuilder X, Y, Z and I require that at least N have successfully verified the package" before installing it.
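The "at least N of my trusted rebuilders" policy can be sketched roughly as below. This is only an illustration, not how any Debian tool implements it: the function name `is_trusted`, the rebuilder names, and the idea of passing attestations as a plain dict are all assumptions, and a real client would also have to check each rebuilder's signature.

```python
def is_trusted(package_hash, attestations, trusted_rebuilders, threshold):
    """Return True if at least `threshold` trusted rebuilders reproduced the hash.

    `attestations` maps a rebuilder name to the hash that rebuilder reports
    having reproduced (hypothetical format; signature checks are omitted).
    """
    confirmations = {
        name
        for name, reported in attestations.items()
        if name in trusted_rebuilders and reported == package_hash
    }
    return len(confirmations) >= threshold


attestations = {
    "rebuilder-x": "abc123",
    "rebuilder-y": "abc123",
    "rebuilder-z": "ffff00",  # disagrees with the published hash
    "unknown": "abc123",      # not in the trusted set, so ignored
}
trusted = {"rebuilder-x", "rebuilder-y", "rebuilder-z"}

print(is_trusted("abc123", attestations, trusted, threshold=2))  # True
print(is_trusted("abc123", attestations, trusted, threshold=3))  # False
```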

See here for an example: https://gitian.org/. As I understand it, it's used to build a (the?) Bitcoin client.

One thing that gets me excited about reproducible build systems in general, which has less to do with trust and verifiability, is cacheability/content addressable storage of build artifacts.

It's extremely valuable for debugging. When you've got a bug that is in a specific build you can collect the set of dependencies that produced that build and BAM you've got an env that should replicate the bug.

This is the best argument I've heard for it. It's also an argument for my priorities of safer languages, development rigour, and use of verification and test tooling. Each of these reduces the amount of debugging one does later, on top of improving reliability and security.

one thing that I think got lost culturally in the move to linux was the emphasis on open source. yes, not open source in some abstract warm fuzzy way, but actual useful open source.

if I can't run the build process, and I don't really know which patch sets have been applied on my system, and which version of the source was used, and I don't really have a handle on the dependencies, doesn't that make it substantially more difficult to debug and work with the resulting system?

it's really quite odd that the dominant open source platform settled on opaque and unreproducible binaries as the distribution mechanism early on.

Assuming you're referring to Debian, you've missed the mark. Debian is more committed to open source, in the RMS 'free software' sense, than any other distro I'm aware of... it's the foundation of the distro. Any Debian user can absolutely rebuild from source any[1] part of the system.

The problem that reproducible builds is attempting to solve is that my build, while being 100% functional, might not be bit identical to yours or what shipped with the distro for any number of valid reasons.[2] The problem this creates given the current world of increasingly sophisticated malware is that I can't prove a given build is identical to what was intended based on the binaries. So it's an issue of 'we've always been giving you the source, now we're going to ensure that it's possible to create bit-identical binaries' rather than 'we were giving you free binaries, now we're giving you the source'.

If your argument is that one should always build from source, reproducible builds is part of getting to the same effective result without the overhead and headache that the vast majority of users are not willing to deal with.

[1] In the cases when this turned out not to be possible, holy hell tends to get raised... see things like the Chromium BLOB issues (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786909) etc. These issues tend to have a lot of noise made and get resolved very quickly since it goes against the very core of what Debian is intended to be. In cases where you have things that are binary-only such as proprietary GPU drivers, these get segregated into a separate repo with the general advice from Debian of 'you shouldn't use these'. This doesn't tend to be realistic advice for many users given the state of many of the open source drivers... but they've tried their best to steer people away while not making the distro unusable for those who insist on using them.

[2] A trivial example would be baking the build date/machine name into the binary... there's no expectation that one would ever reproduce that build again even on the same machine.

I think the gripe is that it's never been as easy as it could be to say "OK, I'm going to make a change in this here source file, then run the result on my machine while keeping everything else the same".

For example if I'm reporting a bug and I have a proposed patch, it would be nice to be able to say "I've been running with this patch for the last month and it's been fine".

Ideally there'd be a coherent way to download the source for a particular package, rebuild it, and have the system use the rebuilt result. Bonus marks if it did something useful when a newer version of the package came along.

I think https://packages.debian.org/sid/dgit is an attempt to make the download-and-rebuild part look more uniform, but I think there's still work to be done to make it easy to substitute the result for the distro package.

This is trivial on Funtoo/Gentoo: you just make a private tree of ebuild files containing only the packages that you care about modifying, with the patches located in the same directory tree and applied by adding the patch to the ebuild.

You can also host this somewhere so others can run/test your modified version easily. No real special infrastructure is required; they can sync this directory via git.

There are of course downsides: the annoyance of building updates, and of hitting bugs that nobody else has seen because of some unique set of patches / build-time options.

What's missing from: "apt-get source the-package", add patches, "pbuilder build the-package.dsc" ? (I'm assuming that pbuilder was already configured to work)

FWIW, as someone who touches Debian package building once in a blue moon, it seems like there's wide variability in which is the "right way" to do the actual build step and that everyone has their own artisanal preference. pbuilder, debuild, debootstrap, dpkg-buildpackage, sbuild, fakeroot all leave me scratching my head.

If we're talking about the reproducible packages, then I believe only pbuilder is a valid answer here (maybe sbuild, I haven't used that one). debootstrap / fakeroot / dpkg are part of the pbuilder system, not full solutions.

But yeah, I noticed that Debian is not going out of its way to advertise one blessed solution.

Two main things, I think:

- coping with the variety of different patch systems used in Debian packages

- an integrated way of keeping track of which locally-built packages have been installed

Also of course getting pbuilder set up, and knowing that it's an option in the first place.

In an ideal world I'd like to be able to say that the only knowledge you need to modify one of the programs on your system is the language that the program is written in, rather than having to also fill your head with bits of distro-specific arcana.

I do think that things are better than they were even a few years ago (in particular, the variety-of-different-patch-systems situation is getting noticeably better).

1) These days, there's only quilt.

2) There isn't really a difference between locally built packages and ones from the distro, but one way to distinguish yours is to append a suffix to your versions, e.g. andrewshX in my case, where X is an incrementing number.

That's a valid complaint (I too wish it were simpler), but I'm not sure how feasible that is given the myriad ways upstream packages are managed and the varying degrees to which they do or don't support distros in general, let alone a specific one like Debian. I think that's part of the reason Debian has become such a popular upstream distro: it converts an incoming Tower of Babel collection of source repos into a reasonably coherent and standardized set of packages.

> baking the build date/machine name into the binary

This I find especially moronic. A deterministic build would have provided a perfect way to trace back a binary to its sources/source commit, but let's give that up and trade it for a date and the hostname of the build machine :/

Reproducible builds were already infeasible for other reasons at the time when people thought this was a good idea. So in that context, it wasn't so dumb.

Yes, to some extent I understand where this is coming from. I'm also talking more about the Sisyphean situation than about the person actually doing it: it's like placing a cherry on top of a mountain of dirt.

You can run the build process of any Debian package you want. It is quite easy to reproduce a functionally identical package.

What is hard is to reproduce binaries that are byte-for-byte identical. This is not really needed to debug or work with a system, but it is needed for security, to assure that no one has meddled with the binary.

If you think source package distribution is more useful than binary distribution, then Gentoo Linux is still a viable option.

Move to Linux from where? Windows? SunOS?

> it's really quite odd that the dominant open source platform settled on opaque and unreproducible binaries as the distribution mechanism early on

It was perfectly logical when I was installing slackware from floppies, and it remains more useful in most use cases - smaller downloads that installed faster. Debian's great innovation was dependency management of binary packages, after all.

oh yes, sorry, yes I'm that old. before linux, except for maybe gcc, open source was largely distributed as source.

and I agree that installing linux as a binary makes sense. but people didn't think it was important to maintain the reference back to the actual source used. I think to some extent (RedHat specifically), the patch sets used and the hidden build process were used to make it less trivial to reproduce.

The distros do maintain a reference back to the source. Debian makes it trivial to download the source code for packages and to build it yourself.

  apt source htop
  apt build-dep htop
  cd htop-2.0.2
  dpkg-buildpackage -us -uc -nc

I should mention that reproducible builds are really about making bit-for-bit identical builds. There are reasons this is difficult, including individual projects inserting timestamps at the time that they build, or compile-time randomness, sort order of files in the filesystem based on the current locale, etc.

> less trivial to reproduce.

Seriously, this is nitpicking, and you seem to acknowledge that here by using the term "less trivial". Instead of "tar -x" you run "rpmbuild -bp" to get the patched source RedHat will unleash the compiler on. For Debian it's "dpkg-source -x". I'm not even sure that it could be called "less trivial".

Getting hold of the source in the first place looks to be "more trivial" to me, as there is no Wumpus hunt to find the URL. For Debian it's just "apt-get source package-name". The RedHat equivalent is "yumdownloader --source". In both cases that gets you the exact source used to build the binary on your machine. I don't know what you would call a strong reference back to the actual source used, but I'm struggling to see how it could be much stronger.

As for the "hidden build process": it's only hidden because it's automated. You may as well be blaming "make" for hiding the kernel build process, which is more complex than most distro package builds. But every step of the distro build processes (any distro) can be made very visible, naturally enough because otherwise it would be impossible to debug when things go wrong. And there is excellent doco on how to do all of this for both Red Hat and Debian.

One of the things I liked about Gentoo was the idea of building everything yourself. They even had a license manager that would allow you to decide if you wanted to use packages based on their license (and would warn you when the license changed). I always wondered why that never caught on.

But binary distributions are just convenient. Even MCC was a binary distribution and that's going a long way back. I remember rebuilding everything by hand because I wanted a lean system that included exactly what I wanted and nothing more, but that was considered pretty extreme even at that time.

One of the problems with having the flexibility of a pure source based system (including being able to recompile with exactly the flags and hence features you want) is making a decent dependency graph. If you install package X and it wants capability A, it's practically impossible to guarantee that it will "all just work" because the user can easily screw it up. In many ways, this was one of the problems with Gentoo (apart from it being unreasonable these days to keep a complete system up to date while recompiling every single package -- it just takes too much time).

Of course Gentoo (and derivatives) have binary packages these days as well. I've moved over to Arch which has very minimal configuration for packages and so you have to take responsibility for making everything work. You can easily build things and the build scripts make it very easy to see what patches, etc you are using. However, it's a lot more work than most people want.

I think the reality of the situation is that most people do not debug their systems. They just use them. If something is "broken" then they complain that it is "broken" and wait for someone to fix it. Actually maintaining your system is akin to working on hotrods in your garage. It's mostly a hobby, even for professionals.

The great thing about a reproducible build system and a large enough userbase is that it blurs the line between a compile-from-source distro like Gentoo and a binary distro. This could be huge: flexibility when needed, and convenience everywhere else, all within a single toolchain. But the work required to reach that goal is massive, and probably not worth the effort if the above outcome is the only reason to pursue it. I think trust/security/verifiability might be the more realistic driver, but the fallout could just be a new kind of distro.

> it's really quite odd that the dominant open source platform settled on opaque and unreproducible binaries as the distribution mechanism early on

It's not that odd. Reproducible builds are very hard. "gcc foo.c -o foo" is very easy.

Also as a sibling comment mentioned it's not priority number one. People were happy they didn't have to compile the whole damn thing anymore.

imho (and I wasn't there at the time) this is a result of proprietary unices constricting the growth of 'free unix' (e.g. source based BSD) in the commercial realm along with the rapid growth of space-constrained 90s PCs and "unix" users who came up in this commercial "unix" environment.

linux distributions were designed around the model of binary-only proprietary "unix" that just happened to also have source available as a bonus, instead of comprising a fully self-hosting source+binaries+documentation system like research unix and source-available derivatives/patchsets (like bsd, etc) always were from day 1

My personal opinion is that unix is a culture and not a trademark, the culture is the users and systems based around this self-hosting system with source lineage back to the original. Binary unix-derived distributions (and unix clones like linux) are therefore implicitly not unix since they are not out of the box self hosting and source included to one degree or another, and do not have a direct code lineage back to at&t v7. At best, they are some approximation of 'unix like', which can still of course be pretty useful.

Not much. Mostly a fad. Repo security plus secure transport handles most of the risk there. Whereas, most black hats will use bad configurations, social engineering, or app/OS vulnerabilities. There's enough of those that vulnerability brokers aren't paying high for hitting that OS.

So, best use of time would be one of three activities:

1. Gradually rewriting apps or kernels in safe languages. Can even make stuff reproducible as a side effect of rewrites.

2. Improving tools like SAFECode and SoftBound+CETS that make C code safe. Esp usability and packaging. If not that, then more mitigations in OS like OpenBSD does. Can even port theirs.

3. Running static analyzers, fuzzers, etc on all that code out there. Then, fixing the vulnerabilities. I'd say prioritize highly privileged and newest code.

That would actually improve the security of Debian against the kind of attacks that hit it the most. There's some people doing this. Not enough by far.

While I mostly agree with your assessment on "best use of time", labor is not fungible and cannot be redirected like that.

Labor can be pulled or pushed in certain directions. They're bringing in labor on reproducible builds. I'm saying they bring it in on QA. Those currently directing it try to move it in a different direction.

BTW I was just looking at diffoscope - developed as part of the "reproducible builds" Debian project.


We need a body/service that verifies that the code-binary combo being distributed by a developer is actually reproducible, because most users don't have the time or resources to verify each package they use. But it's useful to society if distributed packages are reproducible. So a trusted third party should provide this service.

I would envisage it working in the following way. The developer submits links to their code and links to their package. The service builds from code and checks if the final package matches what is being distributed by the developer. If yes, it then publishes the hashes of the code and package, so users can quickly check that they are using a reproducibly built package from the correct source.
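The comparison step of such a service could look roughly like this. It is a sketch under assumptions: the function names `sha256_of` and `verify_rebuild` are invented here, the two paths stand for the developer's published package and the service's own rebuild, and fetching the source and running the build are left out entirely.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large packages need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_rebuild(distributed_path, rebuilt_path):
    """Return (matches, distributed_hash, rebuilt_hash) for the two artifacts."""
    distributed = sha256_of(distributed_path)
    rebuilt = sha256_of(rebuilt_path)
    return distributed == rebuilt, distributed, rebuilt
```

If `matches` is true, the service would publish the hash; a user then only needs to hash the download and compare.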

This is being actively worked on, and quite a bit of work has already been done. The project has also been submitted to Google Summer of Code under the Debian project.

The way F-Droid (Android app distribution service) does this is that all builds are done server-side (devs only update the publicly available source), and then anyone can quickly set up a verification server that builds and diffs the packages: https://verification.f-droid.org

For those of us unfamiliar with the term "reproducible", yet in the business of software - what does it mean and why is it desirable or needed?

If you build the package correctly, you get the exact same binary.

This allows users to verify that the binary packages you see were actually gotten by compiling the source code (presuming your compiler isn't compromised). This means that auditing the source-code is almost as good as auditing the package (the exception being compiler mistakes). Thus, we need less trust in the software packagers.

Just wanted to add that besides the security gains there's a real performance gain. If you've got a slow-to-build codebase where you've made a small modification you can use a remote artifact cache so that all the parts that don't change just get downloaded from when someone built it. Then you have faster incremental compilation.

You can do this with Bazel and Buck and with a few assumptions and some configuration even with Gradle.

I think that's slightly different: if a build is reproducible, it means that the build products I get from my own machine are identical to those on a remote machine or cache (e.g. they have the same SHA hash).

However, that doesn't speed anything up, since I don't know what the hash is unless I actually do the build or download the file (and then hash the result). If a cache used such hashes as IDs, I would only know which file to fetch once I've already got it!

For such a cache to work there needs to be another mechanism for obtaining the IDs. As an example, Nix caches using the hash of build inputs (scripts, tarball URLs + hashes, etc.), not the build outputs. Since the build inputs are known before we do the build, we can combine these into a hash and query the cache.
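A rough sketch of an input-addressed cache in that spirit follows. It is not Nix's actual hashing scheme: the names `cache_key` and `build_or_fetch`, the JSON encoding of the inputs, and the in-memory dict standing in for a remote store are all assumptions made for illustration.

```python
import hashlib
import json


def cache_key(build_script, source_hashes, dependency_keys):
    """Derive a cache ID from build inputs only, so it is known before building."""
    payload = json.dumps(
        {
            "script": build_script,
            "sources": sorted(source_hashes),
            "deps": sorted(dependency_keys),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # stand-in for a remote artifact store: key -> artifact bytes


def build_or_fetch(build_script, source_hashes, dependency_keys, build_fn):
    """Return (key, artifact), running build_fn only on a cache miss."""
    key = cache_key(build_script, source_hashes, dependency_keys)
    if key not in cache:
        cache[key] = build_fn()
    return key, cache[key]
```

Nothing here looks at the hash of the output, which is why a cache like this works whether or not the build is byte-for-byte reproducible.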

Since the whole point of using a cache is to avoid building things ourselves, we also need a separate mechanism to trust/verify what the cache has sent us (we could build it ourselves and compare, but then we might as well throw away the cache entirely!). Nix allows certain GPG keys to be trusted, and checks whether cached builds are signed.

Since caching mechanisms don't make use of reproducibility, and verification mechanisms don't make use of reproducibility, such caches turn out not to require byte-for-byte reproducibility. All that's required is that plugging in the cached files gives a working result: in some cases that might be practically the same as byte-for-byte reproducibility (e.g. a C library compiled against certain ABIs at certain paths, etc.); others, like scripting languages, might work despite all sorts of shenanigans (e.g. a Python file might get converted into a different encoding; might get byte-compiled; might get optimised; might even get zipped!)

I think you missed something in the parent comment. Bazel can skip compiling an output file if the hashes for its source code files + BUILD files have an artifact in the remote (or local) cache. This requires reproducible builds or else you could introduce build errors when your build environment changes.

> Bazel can skip compiling an output file if the hashes for its source code files + BUILD files have an artifact in the remote (or local) cache.

This is what I mentioned in a sibling comment: we need some way to identify binaries that doesn't rely on their hash. Using the hash of their source code files and build instructions is one way to identify things (Nix also does this, as well as those of any dependencies (recursively)). A different approach is to assign each binary an arbitrary name and version, which is what Debian packages do; although this is less automatic and is more prone to conflicting IDs.

> This requires reproducible builds or else you could introduce build errors when your build environment changes.

No, this only requires that builds are robust. For example, scripting languages are pretty robust, since their "build" mostly just copies text files around, and they look up most dependencies at run time rather than linking them (or a hard-coded path) into a binary. Languages like C are more fragile, but projects like autotools have been attempting to make their builds robust across different environments for decades. In this sense, reproducibility is just another approach to robustness.

Don't get me wrong, I'm a big fan of reproducibility; but caching build artefacts is somewhat tangential (although not completely orthogonal).

> I don't know what the hash is unless I actually do the build

Yes you do, this is the point of reproducible builds. The same source always produces the exact same binary.

No I don't. I can either:

- Do the build, then hash the result

- Download and hash a pre-built binary, but I have to trust whoever I get the file from

- Ask someone for the hash, but I have to trust them

If the build is reproducible then I don't need to trust the second two options, since I can check them against the first. But in that case there's no point using the second two options, since I'm building it myself anyway.

If you take that quote in context, you'll see I'm talking about a (hypothetical) cache which uses the binaries' hashes as their IDs, i.e. to fetch a binary from the cache I need to know its hash.

In this scenario the first option is useless, since there's no point using a cache if we've already built the binaries ourselves.

The second two options in this scenario have another problem: how do we identify which binary we're talking about (either to download it, or to ask for its hash)? We need some alternative way of identifying binaries, for example Nix uses a hash of their source, Debian uses a package name + version number.

Yet if we're going to use some alternative method of identification, then we might as well cut out the middle man and have the cache use that method instead of the binaries' hashes!

The important point is that the parent was claiming that reproducible builds improve performance over non-reproducible builds because of caching. Yet nothing about such a cache requires that anything be reproducible! We can make a cache like this for non-reproducible builds too. Here are the three scenarios again:

- We're doing the build ourselves. Since we're not using the cache, it doesn't matter (from a performance perspective) whether our binary's hash matches the cached binary or not.

- We're downloading a pre-built binary. Since we must identify the desired binary using something other than its hash (e.g. the hash of its source), it doesn't matter what the binary's hash is, so it doesn't need to be reproducible. Pretty much all package managers work this way, it doesn't require reproducibility.

- We're asking someone for the hash, then fetching that from the cache. Again, we must identify what we're after using something other than the hash. The only thing we need for this scenario to work is that the hash we're given matches the one in the cache. That doesn't require reproducibility, it only requires knowing the hashes of whatever files happen to be in the cache. This is what we're doing whenever we download a Linux ISO and compare its hash to one given on the distro's Web site; no reproducibility needed.

By definition, for open-source software, you can compile it yourself. But most users of most distros don't do that, instead they use pre-built binary packages.

Which means someone has to build and upload those binary packages.

How can you tell that the person who built or hosts the binary packages didn't change the source code (for example, putting in backdoors or other malware)?

It helps with this problem if anyone else can build the same source code, and get a byte-for-byte identical copy of the binary package.

This sounds trivial, but it actually requires some dedicated support from build tools and build scripts. (You have to have the exact same version of the compiler and everything else, the compiler wants to automatically put the build timestamp in the executable, the order the OS lists files in a directory can sometimes change, the timestamps in files contained in tar/zip archives can't be set to the current time, etc.)
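As one illustration of the archive-related items in that list, here is a sketch of packing a directory so that file order, timestamps and ownership no longer leak into the output. The function name `deterministic_tar_gz` is invented for this example; it only handles regular files (empty directories are dropped), leaves file modes untouched, and the compressed bytes can still vary across zlib versions, so treat it as a starting point rather than a complete recipe.

```python
import gzip
import os
import tarfile


def deterministic_tar_gz(src_dir, out_path, mtime=0):
    """Pack src_dir so identical file contents give identical archive bytes."""

    def normalize(info):
        # Drop metadata that depends on who built the archive and when.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    paths = []
    for root, dirs, files in os.walk(src_dir):
        paths.extend(os.path.join(root, name) for name in files)
    paths.sort()  # fixed order instead of whatever the filesystem returns

    with open(out_path, "wb") as raw:
        # filename="" and a fixed mtime keep the gzip header constant too.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=mtime) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
                for path in paths:
                    arcname = os.path.relpath(path, src_dir)
                    tar.add(path, arcname=arcname, recursive=False, filter=normalize)
```

Packing the same tree twice, even after touching the files in between, should then produce the same bytes on the same machine.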

Some people in the Debian community have been making an effort to update all of Debian's packages to be byte-for-byte reproducible.

This seems to be the issue https://issues.apache.org/jira/browse/MNG-6276 for reproducible builds with Maven. I'm not sure what the current status is given that all the issue links are closed.

Go in and vote for it anyway. I'm sure some of the Debian packages are written in Java. ;-)

I definitely understand the merit of making it super easy to verify, by simply hashing the build ISO. But that also shifts the burden of forcing deterministic results to the build process. I wonder whether, for a little more complexity in verification, a lot less complexity would be needed on the build side?

For example the ISO file system, typically ISO 9660 or UDF, will have a volume UUID from a random number. Sure you can code a flag for mkfs to specify a fixed number, that's easy. But then next that ISO typically contains a payload in the form of a squashfs file. And quite a bit of work on squashfs has happened to make sure file timestamps can be set to a known value. However, if one build process uses xz level 3 compression, and another build process uses xz level 7 compression - hashing will of course fail. The point is, be it inode UUIDs and timestamps and compression levels, there's a lot being measured that we don't really care about, just to have a simple verification method at the back end.

Can anyone recommend a good way to install/maintain/remove new software on distros like Debian or Ubuntu LTS? (By 'new' I mean versions newer than the ones available through apt).

I never figured out a maintainable scheme. Some of my packages are in /opt, some are in ~, some in ~/local, and some under GNU stow. And other than GNU stow's half-assed uninstall method, it's always a pain in the neck when I have to remove or upgrade any of these packages.

One thing that works many times is to download the deb source package from a newer distro and rebuild it in the desired system.

If there’s no newer package available you can try to reuse the debian directory from the older source package to build from the newer source.

Sometimes this works in the first attempt, but if you need this for a lot of packages, it’ll be a lot of work.

Why do you find GNU stow's uninstall method to be half-assed? Just curious.

Helps having Google as a client wanting reproducible builds.


One has to step back and ask: "What made you decide to introduce non-determinism into your compilation process in the first place?"

Nobody sat down and said "yes, I'll make this nondeterministic". If you look at the wiki page's sampling of different issues -- https://wiki.debian.org/ReproducibleBuilds/Howto#Identified_... -- you'll see it's a mix of various things:

* output is deterministic but dependent on some aspect of the build environment (locale, hostname, etc)

* output is accidentally non-deterministic (eg "we put all the .html files into a tarball with a shell glob pattern, which gets you an order dependent on your filesystem implementation and the phase of the moon")

* a wide array of "output contains a timestamp" issues

"Let's put the timestamp into our version string/a generated file so the user knows when it was built" is a really common thing, and it seemed like purely a nice convenience feature until the concept of 100% binary-reproducible builds became a current concern.

Curious: What's the usual fix for this class of problem? Is it possible to flag certain bits of data as known-to-change and evaluate the rest of the build in isolation?

Example: Say you have a version string that outputs a build time. Can you hash the program with just that bit of string data marked as unknown (or that string table entry replaced with a placeholder, as far as our verifier is concerned) and verify the rest of the program is unchanged across builds?

This only works in some limited circumstances; more often, the build differences cascade and produce a completely different result.

For example, if the compiler puts full path names in intermediate object files, so on a different computer the object files have a different size, this may result in choosing a completely different layout for the binary.

It varies by case.

For the particular example of time, some common approaches are to hook the sys call for time such that it returns a constant value, or to teach your tool to output a constant value.

The Nix thesis (https://nixos.org/~eelco/pubs/phd-thesis.pdf) is a great read and goes through a bunch of these examples as part of its research.

https://reproducible-builds.org/docs/ has a lot of strategies. For example, you could use the SOURCE_DATE_EPOCH environment variable, which if set, uses that value instead of the current actual time. This means that your normal builds do the normal thing, but you can configure the variable and get a reproducible build.
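A minimal sketch of how a build script might honour SOURCE_DATE_EPOCH, in Python. The helper names `build_timestamp` and `build_date_string` are made up for this example; the only thing taken from the spec is that the variable holds an integer number of seconds since the Unix epoch.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ):
    """Return SOURCE_DATE_EPOCH if it is set, otherwise the current time."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)            # pinned by whoever drives the build
    return int(time.time())          # normal, non-reproducible behaviour


def build_date_string(environ=os.environ):
    """Format the build timestamp in UTC so the local timezone cannot leak in."""
    ts = build_timestamp(environ)
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
```

With `SOURCE_DATE_EPOCH=1551744000` in the environment, `build_date_string()` gives `2019-03-05` no matter when or where the build runs.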

Typically you just stop including the build time. It's a deterministic build, so why would the build time matter?

I was going to say "because it's convenient", but I realized that you're right: if it's identical anyway, the release date is as good a timestamp of the code as you're going to get. If for some sysadmin reason the build time is relevant, file metadata can tell you.

Maybe I want to know that someone was messing with stuff, and even if they put something 'back in place' that matches, I may still want to know that it was built this morning and that something is going on.

Of course code signing is going to mess that up.

> Is it possible to flag certain bits of data as known-to-change and evaluate the rest of the build in isolation?

This is what I do. I'm working with a hardware vendor that is not particularly security-conscious. I have to send them compiled binaries as part of a build pipeline, which they then modify in a non-deterministic way. Fortunately the changes are limited to data which has no impact on the function of the software. So I've created hashes of all the individual components that make up the binaries, and I skip hashing the parts that they modify.

It's not ideal, but I can be reasonably confident that the binaries I compile function the same as the binaries that my hardware vendor distributes.
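For anyone wondering what that kind of masked comparison can look like, here is a rough Python sketch. It assumes you already know the byte ranges that are allowed to change; the function name `masked_sha256` and the half-open `(start, end)` range convention are my own choices, not anything from the vendor pipeline described above.

```python
import hashlib


def masked_sha256(data, volatile_ranges):
    """Hash data with each half-open (start, end) byte range zeroed out."""
    buf = bytearray(data)
    for start, end in volatile_ranges:
        if not 0 <= start <= end <= len(buf):
            raise ValueError(f"range ({start}, {end}) is outside the data")
        buf[start:end] = bytes(end - start)   # same length, all zero bytes
    return hashlib.sha256(bytes(buf)).hexdigest()
```

Two files that differ only inside the masked ranges get the same digest. As noted elsewhere in the thread, this only helps when the differences stay confined to known offsets and do not cascade into the layout of the rest of the binary.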

You leave it out and go off revision hash or something. Back when we first introduced build times it was pre-modern-source-control.

The Rev hash fulfils the function now.

You fix the tools so that they work properly. You modify the compiler to always produce deterministic output (e.g. dummy timestamps).

Except for the first one, the last three are great examples of why I don't care about 100% binary reproducible builds as an end user.

Knowing when something was built is more valuable to me than a reproducible build. Seems like a fad triumphing over common sense.

Reproducible builds allow for better auditing, which makes it more difficult for malicious maintainers to plug backdoors into systems.

What's the relevance of knowing when something was built if you can check the source version and rebuild it yourself in the exact same way? If the build is reproducible it's merely a cache. You could potentially recompile it every time you try to execute the command and you'd get the exact same thing.

Users often don't care about things until they become a problem. For example, most users don't care how their password is stored on a server... until it gets compromised. Also, timestamps in output usually indicate source code age, not build date. I have a package with '2015' in the version but it was built in 2018.

That's a condescending non-answer.

We're not talking about passwords in a database, we're talking about reproducible builds. I already trust Debian, so it's irrelevant to me, as a user, if their builds are 100% reproducible. Can you explain why I should feel differently?

Although I don't think it's a valuable thing to spend time on, it's not my time, so I honestly don't care too much.

From an end user point of view, probably the most important thing it allows Debian developers to do is verify that the binary packages they are shipping are the packages they intended to ship. You trust them, they have to trust their tools... so they're making sure their tools tell them about yet another situation when something might have gone wrong. (it could be something innocent or malicious, but without the tooling, they might not even be aware of the issue)

It doesn't: there can still be backdoors or compiler errors before the hash of the binary. That's why DO-178C certification requires proof of source-to-binary equivalence. It's also why CompCert and CakeML were built. Stopping backdoors requires doing that with independent review for requirements, design specs, security policy, and implementation with them all corresponding.

People concerned about this stuff are putting the least effort into what causes the most failures. Reproducible builds are a distraction if what's reproducible is 0-day-ridden software whose safety/security checks might have been removed by a buggy, black-box compiler.

You are right that this isn't the be-all, end-all but is merely one of the pieces needed. I was trying to give a layperson's reason for caring (i.e. what was being asked) which often involves oversimplifying things.

Oh, you explaining it is a nice thing to do. I was just showing what was actually necessary to meet the intended goal since that's my field. I learned a lot of the problems from people like Myers who wrote the book on subversion back in 1980. Current approaches still aren't as thorough as what they were doing in the 1970's-1980's.


If it's about repos, I recommend David A. Wheeler's page on SCM security since following it eliminates most of the risks.


Knowing when something was built is more useful than having better verification of exactly what code went in?

Besides, it's a deterministic build. You can use a lookup table with the hash to find out exactly what revision it is.

If the build is reproducible, why does it matter when it was built?

Knowing the build was reproducible tells me nothing of value.

Knowing the timestamp of when it was built, or the source was downloaded tells me roughly how old the package is, which can be useful for a variety of reasons.

>Knowing the build was reproducible tells me nothing of value.

It does. It proves that the given package was produced by the given source code. However, you can keep the timestamp as well! One of the goals of reproducible builds is to record values like this so it can be replayed later at a future point in time. This is done by the BUILDINFO format.


> Knowing the build was reproducible tells me nothing of value.

It means you can uniquely address what you wanted by what you got (and vice versa). This makes storing and retrieving things much easier, more scalable and more secure. It also makes it much easier to ensure that you got what you wanted, and someone else did too, on their machine.

I can do a build off of year old code.

Would you rather get an embedded timestamp that tells you it was built today, or take the hash and be able to find the exact revision?

You don't care about the build date, you care about the release date, which is specifically what debian is trying to achieve with SOURCE_DATE_EPOCH.

Do you care when the build was made, or how old the code that went into it is? The former is a problem for reproducible builds; the latter is possible.

You don't really 'introduce non-determinism'.

That's like saying, "why did you introduce entropy?" It's there. It happens, and it's difficult to get rid of in any complex system.




Debian was one of the first large distributions to use a build farm for many different architectures. Maintaining deterministic builds (even as far as just file timestamps) in a distributed environment is very challenging.

If you aren't thinking about reproducibility, things like putting timestamps in the binary seem reasonable, which is what lots of tools did. Hence the problem.

This is true and it's also the model I use to avoid bugs. Any time I'd add a bug to the codebase, I just look at it and decide not to write the bug. In this way, no bugs are ever introduced.

What software currently has 100% reproducible builds?

What software? Like, individual packages? Many of them - here's the ones that do so on Arch: https://tests.reproducible-builds.org/archlinux/archlinux.ht...

If you mean which distribution has 100% of its packages reproducible, probably none yet. But Arch and Debian are both making progress.

To be fair:

> 56 (100.0%) out of 56 built NetBSD files were reproducible

That is not really comparable to debian's 26475/28522 for buster https://tests.reproducible-builds.org/debian/reproducible.ht...

But is it "reproducible" or reproducible? Holger still considers the debian numbers "reproducible" as we are only building things twice. To achieve proper reproducible builds the artifacts need to be reproducible by users. User-facing tools need to be provided and I have yet to see how NetBSD provides this.

As far as I understand your pages that applies only to the kernel, not the whole distribution. The Linux kernel is reproducible as well: https://tests.reproducible-builds.org/archlinux/archlinux.ht...

Package management-wise, Nix.

OS-wise, NixOS.

Build system-wise, there are lots of options: Blaze, Buck, Pants, Please (AFAIK)

Unfortunately, nix does not produce fully reproducible builds. The build environment is portable and produced in a way that it can be repeated, but due to the limitations of the software that is being built, the builds are not binary reproducible. You can see some commentary on the nix team hoping to adopt some of the work being done by debian et al here: https://github.com/NixOS/nixpkgs/issues/9731

There is also https://r13y.com/


In case any Nixers are reading this, here is how I got NSPR to build reproducibly in Guix:


Interesting, thanks for sharing!

NixOS currently isn't there. A lot of this work is done, but much still remains. We have benefited greatly from Debian's work, though (Debian maintainers frequently come across as happy upstream participants to fix issues like this in the ecosystem, which really helps everyone!)

https://r13y.com/ tracks the progress of NixOS reproducibility; currently we're at 98.23% bit-for-bit identical for our minimal installer ISO. After that, we'll need the graphical installer, and then more of the base package set. So we've still got a ways to go.

Any program with a build system designed in such way it doesn't introduce anything beyond the source code into the binary should be.

If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

> If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

There is so much context that is normally embedded into a binary that this is usually not true unless explicit measures have been taken.

Two very common sources that introduce variability are time-stamps used in the build, and environment variables such as $HOME and $USER.
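One common mitigation is to run the build with a scrubbed environment. A minimal Python sketch, assuming a hypothetical helper `pinned_build_env` and placeholder values (`/nonexistent`, `build`) that I picked for illustration:

```python
import os


def pinned_build_env(source_date_epoch, base=None):
    """Return a small environment dict with builder-specific values pinned."""
    base = dict(os.environ if base is None else base)
    # Only PATH is carried over from the host; everything else is dropped.
    env = {"PATH": base.get("PATH", "/usr/bin:/bin")}
    env.update({
        "HOME": "/nonexistent",      # do not leak the builder's home directory
        "USER": "build",             # do not leak the builder's user name
        "LC_ALL": "C",               # fixed locale for sorting and messages
        "TZ": "UTC",                 # fixed timezone for any date formatting
        "SOURCE_DATE_EPOCH": str(source_date_epoch),
    })
    return env
```

The result can be passed as `env=` to `subprocess.run` when invoking the real build command. This only covers the environment; absolute build paths, file ordering and the toolchain version still need separate handling.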

If you're generating or modifying source code at build time (eg. adding timestamps or build IDs) then you have violated the constraints on build reproducibility.

If you define the problem as excluding things a large percentage of real-world build systems do by default, then it's not very interesting. The interesting part of Debian's and others work here is making this work with small, unintrusive changes to such systems.

As long as the intrusive changes are taken upstream, I've no problem with it.

what if you have a multi-threaded backend to the compiler that happens to lay down data in different orders?

You don't even need multi-threading. In gcc we had at least one case where a key=>value data structure was keyed by memory address, causing symbols to be emitted in different order depending on ASLR, phase of the moon, or whatever.


Most compilers give no guarantees about the order in which they lay out the data. I love deterministic processes as much as everyone. But randomized approaches have their advantages too. And if a compiler has reasons to randomize output, e.g. for speed, then it’s a trade-off to consider.

Thread finishes work

grabs lock

writes to file

writes to index

releases lock

That's not a race condition. The output order doesn't matter, but it is nondeterministic.
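The pattern in those steps can be sketched in Python. The workers append to a shared index under a lock, so there is no data race, but the append order depends on scheduling; sorting when the index is emitted is one way to normalize it. The function name `build_index` and the `normalize` flag are assumptions for this example.

```python
import threading


def build_index(names, normalize=True):
    """Let one thread per name append to a shared index under a lock."""
    index = []
    lock = threading.Lock()

    def worker(name):
        with lock:                   # grabs lock, writes to index, releases lock
            index.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Without normalization the order reflects thread scheduling.
    return sorted(index) if normalize else index
```

With `normalize=True` the result is the same on every run; with `normalize=False` the set of entries is the same but their order is not guaranteed.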

Why is it a bug? I write a program to download four files. I do so in parallel. Sometimes X finishes first, sometimes Y finishes first, and the files are written to disk in a different order. Why do I want to serialize this operation?

The end result is a set of four files. You don't care about the order they are laid out on the disk and the next steps shouldn't let the order of those files influence the end result.

Let's assume there is a latent bug in the compiler that gets triggered if file four is the first one. Good luck debugging that.

But parent claimed that creating a set with a different order was inherently a bug. Not that depending on the order of an unordered set was a bug.

It's the internal structure of the files.

Don't forget the absolute paths of the source files...

It always bugged me that that's considered part of reproducibility.

That's 100% controllable and deterministic.

Until very recently you needed root access to do it on linux (user namespaces can let you do it without root).

Until a build process starts naming things with timestamps, locales, etc. Just because the build is "source code only" doesn't mean it is deterministic.

That's why I wrote "it doesn't introduce anything beyond the source code into the binary". Unfortunately, I forgot to emphasize the anything.

A build process that names things with timestamps or leaks your locale into the build configuration (or doesn't pin build-time dependency versions) will make the build depend on things other than the source code (both program and build settings) you made available.

It may even be desirable for it to be non-reproducible - if, for instance, you want to use optimizations targeted to your specific system, then your build system will have to introduce the architecture information into the build process and your build will result in a unique binary that targets your own machine.

Unfortunately, if we take this definition of "anything" literally, it is impossible to build such a build system.

For example, depending on the input order, linker may produce different output. Surely you can sort the object files, but the sorted object files order is still effectively "stored" into the binary, and that's not source code.

You can only normalize such things (like in the example above, sorting), you can not eliminate them, they naturally exist.

> you can sort the object files, but the sorted object files order is still effectively "stored" into the binary, and that's not source code.

No, but the order should be explicitly defined in the build scripts or the result will not be deterministic.

If the order triggers, say, a linker bug that makes one in 50 builds crash, execution will not be deterministic and that's really, really bad.

This is actually an annoying challenge of reproducible builds. In many cases it is actually useful to have a build timestamp, git sha, or build number available for debug output from the program. I've often gone as far as embedding a sha and/or timestamp into a file on export into a tgz which allows it to be reproducible from the tarfile, although builds directly out of source control would not be.

Git hashes can be inserted in reproducible builds, they are deterministic.

Compilers haven't been built with that as a condition, so this isn't true.

It's not true in practical code either, people like to stick in timestamps.

It's not ever true on windows, unless you use the fairly recent PE header changes.

The vast majority of code written at Google does, for one.

Bit for bit reproducible?

So I can checkout an arbitrary version from years ago and reproduce the exact same set of output files?

Yes, look up Google Bazel (the open source version of Blaze)

Right, that's not a bit-for-bit reproducible build.

Think of it as absent a cache I should get bit for bit identical out (perhaps ignoring logs and such).

I think GNU Guix offers what you are after.

I maintain my own build farm and tried comparing my results against the official CI server:

  $ guix challenge --substitute-urls="https://ci.guix.info"
  14,224 store items were analyzed:
    - 4,972 (35.0%) were identical
    - 265 (1.9%) differed
    - 8,987 (63.2%) were inconclusive
Of the 5237 build artifacts that were available on the substitute server, only 265 (5%) differed.

All of these items can be (and have been) built entirely from source, starting with Guix' initial "binary seeds", on (probably) different hardware and kernel compared to the CI system.

I don’t think “one artifact, one vote” is a fair way to measure this.

One reason builds become irreproducible is when a build is multi-threaded, and the order in which artifacts are combined into larger ones becomes unpredictable. That problem doesn’t exist, or at least is a lot smaller, for ‘leaf’ artifacts (example: if your C compiler is single-threaded, and you run make multi-threaded, individual object files do not have the ordering problem, but libraries built from multiple object files do)

On the other hand, a single static struct with a padding “hole” that isn’t consistently written that happens to end up in lots of binaries will decrease your percentage a lot.

Sorry, I think my use of "artifact" here caused some confusion.

Each of these "artifacts" is an actual isolated build of a complicated program such as Chromium or GCC. The technical term is "derivation", which produces "outputs".

All of those packages can be reproduced from source now or 100 years into the future and SHOULD produce the exact same binary output. If they don't, it's a bug.

What does guix mean by "inconclusive"?

It means that the items could not be found on the remote server(s).

GuixSD, NixOS

I believe Solaris is as well.

My experience at Dave & Buster's is around 80% reproducible, so I think progress is possible.

FYI, Debian codenames are all based off Toy Story characters. See https://unix.stackexchange.com/questions/222394/linux-debian...

Might be good progress but it still sounds very low to me as I didn't know anything below 100% was possible... it sounds crazy to me (almost like something that was introduced to be able to inject backdoors undetected).

Lots of problems come from things like timestamps, or race conditions in concurrent build systems giving slightly different bytes on disk. These generally aren't "trusting trust" level problems, since they do not and cannot affect program behaviour; but they do screw up things like digital signing, cryptographic hashes, etc. which are useful for automatically verifying that self-built artefacts are the same as distro-provided ones.

These problems can also cascade, if component A embeds the hash of another component B, e.g. to verify that it's been given a correct version. If that hash comes from an unreproducible upstream, and building it ourselves gives a different hash, then we'll need to alter component A to use that new hash. That, in turn, changes the hash of component A, which might be referenced in some other component C, and so on.
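A toy Python model of that cascade, where each component's hash covers its own source plus the hashes of its dependencies. The component names A, B and C match the description above; `component_hash` and `build_chain` are hypothetical helpers, not any real package manager's scheme.

```python
import hashlib


def component_hash(source, dep_hashes=()):
    """Hash a component's source together with the hashes it embeds."""
    h = hashlib.sha256()
    h.update(source)
    for dep in dep_hashes:
        h.update(dep.encode("ascii"))
    return h.hexdigest()


def build_chain(b_output):
    """C embeds A's hash, and A embeds B's hash."""
    b = component_hash(b_output)
    a = component_hash(b"source of A", [b])
    c = component_hash(b"source of C", [a])
    return a, b, c
```

If B's output differs by a single byte between two builds, all three hashes differ, even though the sources of A and C never changed.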

Nope, loads of build tools were never built with reproducibility in mind.

Look at Windows. Even if you fix the compiler and linker, you still have non-reproducibility by design: the PE header contains a timestamp.

People also like to stick non-reproducible stuff into builds directly, like timestamps.

Compilers don't have any reason to lay down data in a specific order, so if they are threaded in the backend they just don't.

IDL tools might stick in the timestamp of when a file was generated, for convenience.

and on and on and on.

Every significant project I've worked on embedded the build host and build time in the resulting executable or firmware image. This was along with other static build information, like version number, compiler version and build flags.

Once you make the sensible choice to include build time in the result you've broken reproducibility. Fixing this means tracking down every package that does this and removing the timestamp.

Why is including build time sensible?

If one has reproducible builds, wouldn't a commit/tag from the version control system also do the job of traceability and reproducibility ?

What I've moved to is splatting that info into the binaries during the release process. As far as I can tell there aren't standard tools to do that, though. At least there weren't last time I looked.

Would be nice if there was. I think this is the root of issues such as firmware with the same password/crypto keys across a whole product family instead of unique ones.

That's what coreboot moved to (incl. the timestamp of the commit in its timestamp field) for reproducibility.

Thing is just that host + build time is what was traditionally used. There's no single commit you could use in cvs.

A timestamp is sensible if reproducibility isn't your goal, and exact reproducibility of build artifacts was never a goal on any of my projects. It was simply never a priority.

My guess is that making code reproducible involves some kind of change that hasn’t been applied to all of the code or build files.
