Anyway, many kudos for this work!
From the perspective of an application developer I want that dynamic representation to also be predictable.
hathawsh meant representation in the sense of memory layout and other low-level details of how application state is encoded. You seem to be talking about the programmer-visible abstractions built out of that layer. Behaviour at that level has to be predictable in order to "maintain correctness", but predictability at the lower level is mostly a help for maintaining the correctness of malware.
It likely wouldn't. ASLR will choose a random load address for the module (.so or .exe), which means that main() wouldn't be at a predetermined address. The heap and stack would similarly have arbitrary offsets. Furthermore, any other modules that get loaded would be at arbitrary addresses. Ultimately, even "hello world" should become extremely randomized, provided ASLR is enabled for the process.
ASLR prevents attacks arising due to known addresses in virtual memory because they are otherwise reliably predictable (just run the program, attach a debugger, and find the addresses that you care about).
There are so many things based on Debian, and adding reproducibility to Debian is a massive security improvement.
I have a lot of respect for people working on this, it must be hard to respect release freezes when you've been working on this for years.
From the article, it looks like many of the others are reproducible on a code level, but the release system is using older binaries, which haven't been rebuilt yet.
It's too bad it won't be fixed in time for this release, but it bodes very well for future releases.
Edit: found this nice overview by clicking a few links from the original post https://reproducible-builds.org/
It would be a lot of work but could expose malicious providers.
Then again, it might be hard to create an actually good environment considering all the software/firmware/hardware layers modern systems have.
There is still work to be done, but NYU was one of the organizations working on developing and running a rebuilder. The idea is that you pull buildinfo files from https://buildinfo.debian.net/, then try to verify them and if you got the same artifact you sign that you successfully verified this binary package.
A user could then configure "I trust rebuilder X, Y, Z and I require that at least N have successfully verified the package" before installing it.
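For what it's worth, that "at least N of my trusted rebuilders" policy is easy to sketch in Python. This is only an illustration: the rebuilder names and digests are made up, and it skips checking the rebuilders' signatures on their attestations, which a real client would have to do first.

```python
from typing import Dict, Set


def enough_rebuilders_agree(
    expected_sha256: str,
    attestations: Dict[str, str],
    trusted: Set[str],
    threshold: int,
) -> bool:
    """Return True if at least `threshold` trusted rebuilders got this hash."""
    confirming = {
        name
        for name, digest in attestations.items()
        if name in trusted and digest == expected_sha256
    }
    return len(confirming) >= threshold


# Hypothetical data: these rebuilder names and digests are placeholders.
attestations = {
    "rebuilder-x": "aaaa",
    "rebuilder-y": "aaaa",
    "rebuilder-z": "bbbb",  # this rebuilder produced a different artifact
    "unknown": "aaaa",      # not in the trusted set, so it is ignored
}
trusted = {"rebuilder-x", "rebuilder-y", "rebuilder-z"}

print(enough_rebuilders_agree("aaaa", attestations, trusted, threshold=2))  # True
print(enough_rebuilders_agree("aaaa", attestations, trusted, threshold=3))  # False
```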
If I can't run the build process, and I don't really know which patch sets have been applied on my system or which version of the source was used, and I don't really have a handle on the dependencies, doesn't that make it substantially more difficult to debug and work with the resulting system?
it's really quite odd that the dominant open source platform settled on opaque and unreproducible binaries as the distribution mechanism early on.
The problem that reproducible builds is attempting to solve is that my build, while being 100% functional, might not be bit identical to yours or what shipped with the distro for any number of valid reasons. The problem this creates given the current world of increasingly sophisticated malware is that I can't prove a given build is identical to what was intended based on the binaries. So it's an issue of 'we've always been giving you the source, now we're going to ensure that it's possible to create bit-identical binaries' rather than 'we were giving you free binaries, now we're giving you the source'.
If your argument is that one should always build from source, reproducible builds is part of getting to the same effective result without the overhead and headache that the vast majority of users are not willing to deal with.
 In the cases when this turned out not to be possible, holy hell tends to get raised... see things like the Chromium BLOB issues (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786909) etc. These issues tend to have a lot of noise made and get resolved very quickly since it goes against the very core of what Debian is intended to be. In cases where you have things that are binary-only such as proprietary GPU drivers, these get segregated into a separate repo with the general advice from Debian of 'you shouldn't use these'. This doesn't tend to be realistic advice for many users given the state of many of the open source drivers... but they've tried their best to steer people away while not making the distro unusable for those who insist on using them.
 A trivial example would be baking the build date/machine name into the binary... there's no expectation that one would ever reproduce that build again even on the same machine.
For example if I'm reporting a bug and I have a proposed patch, it would be nice to be able to say "I've been running with this patch for the last month and it's been fine".
Ideally there'd be a coherent way to download the source for a particular package, rebuild it, and have the system use the rebuilt result. Bonus marks if it did something useful when a newer version of the package came along.
I think https://packages.debian.org/sid/dgit is an attempt to make the download-and-rebuild part look more uniform, but I think there's still work to be done to make it easy to substitute the result for the distro package.
You can also host this somewhere so others can run/test your modified version easily. No special infrastructure is required; they can sync this directory via git.
There are of course downsides: the annoyance of building updates yourself, and of hitting bugs that nobody else sees because they come from your unique set of patches / build-time options.
But yeah, I noticed that Debian is not going out of its way to advertise one blessed solution.
- coping with the variety of different patch systems used in Debian packages
- an integrated way of keeping track of which locally-built packages have been installed
Also of course getting pbuilder set up, and knowing that it's an option in the first place.
In an ideal world I'd like to be able to say that the only knowledge you need to modify one of the programs on your system is the language that the program is written in, rather than having to also fill your head with bits of distro-specific arcana.
I do think that things are better than they were even a few years ago (in particular, the variety-of-different-patch-systems situation is getting noticeably better).
2) There isn't really a difference between locally built packages and ones from the distro, but one way to distinguish yours is to append a suffix to your versions, e.g. andrewshX in my case, where X is an incrementing number.
This I find especially moronic. A deterministic build would have provided a perfect way to trace back a binary to its sources/source commit, but let's give that up and trade it for a date and the hostname of the build machine :/
What is hard is to reproduce binaries that are byte-for-byte identical. This is not really needed to debug or work with a system, but it is needed for security, to assure that no one has meddled with the binary.
If you think source package distributions are more useful than binary distributions, then Gentoo Linux is still a viable option.
> it's really quite odd that the dominant open source platform settled on opaque and unreproducible binaries as the distribution mechanism early on
It was perfectly logical when I was installing slackware from floppies, and it remains more useful in most use cases - smaller downloads that installed faster. Debian's great innovation was dependency management of binary packages, after all.
and I agree that installing linux as a binary makes sense. but people didn't think it was important to maintain the reference back to the actual source used. I think to some extent (RedHat specifically), the patch sets and the hidden build process were used to make it less trivial to reproduce.
apt source htop
apt build-dep htop
dpkg-buildpackage -us -uc -nc
Seriously, this is nit picking, and you seem to acknowledge that here by using the term "less trivial". Instead of "tar -x" you run "rpmbuild -bp" to get the patched source RedHat will unleash the compiler on. For Debian it's "dpkg-source -x". I'm not even sure that it could be called "less trivial".
Getting hold of the source in the first place looks to be "more trivial" to me, as there is no Wumpus hunt to find the URL. For Debian it's just "apt-get source package-name". The RedHat equivalent is "yumdownloader --source". In both cases that gets you the exact source used to build the binary on your machine. I don't know what you would call a strong reference back to the actual source used, but I'm struggling to see how it could be much stronger.
As for the "hidden build process": it's only hidden because it's automated. You may as well be blaming "make" for hiding the kernel build process, which is more complex than most distro package builds. But every step of the distro build processes (any distro) can be made very visible, naturally enough because otherwise it would be impossible to debug when things go wrong. And there is excellent doco on how to do all of this for both Red Hat and Debian.
But binary distributions are just convenient. Even MCC was a binary distribution and that's going a long way back. I remember rebuilding everything by hand because I wanted a lean system that included exactly what I wanted and nothing more, but that was considered pretty extreme even at that time.
One of the problems with having the flexibility of a pure source based system (including being able to recompile with exactly the flags and hence features you want) is making a decent dependency graph. If you install package X and it wants capability A, it's practically impossible to guarantee that it will "all just work" because the user can easily screw it up. In many ways, this was one of the problems with Gentoo (apart from it being unreasonable these days to keep a complete system up to date while recompiling every single package -- it just takes too much time).
Of course Gentoo (and derivatives) have binary packages these days as well. I've moved over to Arch which has very minimal configuration for packages and so you have to take responsibility for making everything work. You can easily build things and the build scripts make it very easy to see what patches, etc you are using. However, it's a lot more work than most people want.
I think the reality of the situation is that most people do not debug their systems. They just use them. If something is "broken" then they complain that it is "broken" and wait for someone to fix it. Actually maintaining your system is akin to working on hotrods in your garage. It's mostly a hobby, even for professionals.
It's not that odd. Reproducible builds are very hard. "gcc foo.c -o foo" is very easy.
linux distributions were designed around the model of binary-only proprietary "unix" that just happened to also have source available as a bonus, instead of comprising a fully self-hosting source+binaries+documentation system like research unix and source-available derivatives/patchsets (like bsd, etc) always were from day 1
My personal opinion is that unix is a culture and not a trademark, the culture is the users and systems based around this self-hosting system with source lineage back to the original. Binary unix-derived distributions (and unix clones like linux) are therefore implicitly not unix since they are not out of the box self hosting and source included to one degree or another, and do not have a direct code lineage back to at&t v7. At best, they are some approximation of 'unix like', which can still of course be pretty useful.
So, best use of time would be one of three activities:
1. Gradually rewriting apps or kernels in safe languages. Can even make stuff reproducible as a side effect of rewrites.
2. Improving tools like SAFECode and SoftBound+CETS that make C code safe. Esp usability and packaging. If not that, then more mitigations in the OS like OpenBSD does. Can even port theirs.
3. Running static analyzers, fuzzers, etc on all that code out there. Then, fixing the vulnerabilities. I'd say prioritize highly-privileged and newest code.
That would actually improve the security of Debian against the kind of attacks that hit it the most. There are some people doing this. Not enough by far.
I would envisage it working in the following way. The developer submits links to their code and links to their package. The service builds from code and checks if the final package matches what is being distributed by the developer. If yes, it then publishes the hashes of the code and package, so users can quickly check that they are using a reproducibly built package from the correct source.
This allows users to verify that the binary packages you see were actually gotten by compiling the source code (presuming your compiler isn't compromised). This means that auditing the source-code is almost as good as auditing the package (the exception being compiler mistakes). Thus, we need less trust in the software packagers.
You can do this with Bazel and Buck and with a few assumptions and some configuration even with Gradle.
However, that doesn't speed anything up, since I don't know what the hash is unless I actually do the build or download the file (and then hash the result). If a cache used such hashes as IDs, I would only know which file to fetch once I've already got it!
For such a cache to work there needs to be another mechanism for obtaining the IDs. As an example, Nix caches using the hash of build inputs (scripts, tarball URLs + hashes, etc.), not the build outputs. Since the build inputs are known before we do the build, we can combine these into a hash and query the cache.
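To make the "hash the inputs, not the outputs" idea concrete, here is a rough Python sketch. It is not Nix's actual scheme (the field names and JSON encoding are my own assumptions); it only shows that the key can be computed before any build has happened.

```python
import hashlib
import json


def input_cache_key(build_script: str, source_hashes: dict, dependency_keys: list) -> str:
    """Derive a cache key from build inputs only; no build output is needed."""
    payload = json.dumps(
        {
            "script": build_script,
            "sources": source_hashes,         # e.g. tarball name -> known hash
            "deps": sorted(dependency_keys),  # keys of dependencies, order-independent
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical inputs, purely for illustration.
key = input_cache_key("./configure && make", {"foo-1.0.tar.gz": "abc123"}, ["dep-b", "dep-a"])
print(key)  # this is what we would ask the cache for
```

Any change to the script, the sources or a dependency's key gives a different key, so a cache hit means "someone built exactly these inputs", not "the bytes are what you would have produced".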
Since the whole point of using a cache is to avoid building things ourselves, we also need a separate mechanism to trust/verify what the cache has sent us (we could build it ourselves and compare, but then we might as well throw away the cache entirely!). Nix allows certain GPG keys to be trusted, and checks whether cached builds are signed.
Since caching mechanisms don't make use of reproducibility, and verification mechanisms don't make use of reproducibility, such caches turn out not to require byte-for-byte reproducibility. All that's required is that plugging in the cached files gives a working result: in some cases that might be practically the same as byte-for-byte reproducibility (e.g. a C library compiled against certain ABIs at certain paths, etc.); others, like scripting languages, might work despite all sorts of shenanigans (e.g. a Python file might get converted into a different encoding; might get byte-compiled; might get optimised; might even get zipped!)
This is what I mentioned in a sibling comment: we need some way to identify binaries that doesn't rely on their hash. Using the hash of their source code files and build instructions is one way to identify things (Nix also does this, as well as those of any dependencies (recursively)). A different approach is to assign each binary an arbitrary name and version, which is what Debian packages do; although this is less automatic and is more prone to conflicting IDs.
> This requires reproducible builds or else you could introduce build errors when your build environment changes.
No, this only requires that builds are robust. For example, scripting languages are pretty robust, since their "build" mostly just copies text files around, and they look up most dependencies at run time rather than linking them (or a hard-coded path) into a binary. Languages like C are more fragile, but projects like autotools have been attempting to make their builds robust across different environments for decades. In this sense, reproducibility is just another approach to robustness.
Don't get me wrong, I'm a big fan of reproducibility; but caching build artefacts is somewhat tangential (although not completely orthogonal).
Yes you do, this is the point of reproducible builds. The same source always produces the exact same binary.
- Do the build, then hash the result
- Download and hash a pre-built binary, but I have to trust whoever I get the file from
- Ask someone for the hash, but I have to trust them
If the build is reproducible then I don't need to trust the second two options, since I can check them against the first. But in that case there's no point using the second two options, since I'm building it myself anyway.
If you take that quote in context, you'll see I'm talking about a (hypothetical) cache which uses the binaries' hashes as their IDs, i.e. to fetch a binary from the cache I need to know its hash.
In this scenario the first option is useless, since there's no point using a cache if we've already built the binaries ourselves.
The second two options in this scenario have another problem: how do we identify which binary we're talking about (either to download it, or to ask for its hash)? We need some alternative way of identifying binaries, for example Nix uses a hash of their source, Debian uses a package name + version number.
Yet if we're going to use some alternative method of identification, then we might as well cut out the middle man and have the cache use that method instead of the binaries' hashes!
The important point is that the parent was claiming that reproducible builds improve performance over non-reproducible builds because of caching. Yet nothing about such a cache requires that anything be reproducible! We can make a cache like this for non-reproducible builds too. Here are the three scenarios again:
- We're doing the build ourselves. Since we're not using the cache, it doesn't matter (from a performance perspective) whether our binary's hash matches the cached binary or not.
- We're downloading a pre-built binary. Since we must identify the desired binary using something other than its hash (e.g. the hash of its source), it doesn't matter what the binary's hash is, so it doesn't need to be reproducible. Pretty much all package managers work this way, it doesn't require reproducibility.
- We're asking someone for the hash, then fetching that from the cache. Again, we must identify what we're after using something other than the hash. The only thing we need for this scenario to work is that the hash we're given matches the one in the cache. That doesn't require reproducibility, it only requires knowing the hashes of whatever files happen to be in the cache. This is what we're doing whenever we download a Linux ISO and compare its hash to one given on the distro's Web site; no reproducibility needed.
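The ISO case in that last scenario amounts to nothing more than the following (a minimal Python sketch; the payload is a stand-in for a real download, and the published hash would come from the distro's site rather than being computed locally as it is here):

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_hash(downloaded: bytes, published_hex: str) -> bool:
    """Check a download against a hash published by whoever we trust."""
    return hmac.compare_digest(sha256_hex(downloaded), published_hex.strip().lower())


# Stand-in data: in real use `published` is copied from the distro's Web site.
iso = b"pretend this is an ISO image"
published = sha256_hex(iso)

print(matches_published_hash(iso, published))              # True
print(matches_published_hash(iso + b"tamper", published))  # False
```

Nothing here rebuilds anything, which is the point: the check only ties the file to the publisher's hash, not to the source code.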
Which means someone has to build and upload those binary packages.
How can you tell that the person who built or hosts the binary packages didn't change the source code (for example, putting in backdoors or other malware)?
It helps with this problem if anyone else can build the same source code, and get a byte-for-byte identical copy of the binary package.
This sounds trivial, but it actually requires some dedicated support from build tools and build scripts. (You have to have the exact same version of the compiler and everything else, the compiler wants to automatically put the build timestamp in the executable, the order the OS lists files in a directory can sometimes change, the timestamps in files contained in tar/zip archives can't be set to the current time, etc.)
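As an illustration of the archive part of that list, here is a small Python sketch that packs files into a tar whose bytes don't depend on directory listing order or on the current time. The file names and contents are placeholders, and I leave the archive uncompressed because compressors tend to add their own metadata.

```python
import io
import tarfile


def deterministic_tar(files: dict, mtime: int = 0) -> bytes:
    """Pack {name: bytes} into an uncompressed tar with stable bytes."""
    buf = io.BytesIO()
    # Pin the archive format explicitly instead of relying on the library default.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):      # fixed order, not filesystem order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime          # fixed timestamp, not "now"
            info.mode = 0o644
            info.uid = info.gid = 0     # do not leak the build user
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# The two dicts list the same (hypothetical) files in different orders.
a = deterministic_tar({"b.html": b"two", "a.html": b"one"})
b = deterministic_tar({"a.html": b"one", "b.html": b"two"})
print(a == b)  # True
```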
Some people in the Debian community have been making an effort to update all of Debian's packages to be byte-for-byte reproducible.
Go in and vote for it anyway. I'm sure some of the Debian packages are written in Java. ;-)
For example the ISO file system, typically ISO 9660 or UDF, will have a volume UUID from a random number. Sure you can code a flag for mkfs to specify a fixed number, that's easy. But then next that ISO typically contains a payload in the form of a squashfs file. And quite a bit of work on squashfs has happened to make sure file timestamps can be set to a known value. However, if one build process uses xz level 3 compression, and another build process uses xz level 7 compression - hashing will of course fail. The point is, be it inode UUIDs and timestamps and compression levels, there's a lot being measured that we don't really care about, just to have a simple verification method at the back end.
I never figured out a maintainable scheme. Some of my packages are in /opt, some are in ~, some in ~/local, and some under GNU stow. And other than GNU stow's half-assed uninstall method, it's always a pain in the neck when I have to remove or upgrade any of these packages.
If there’s no newer package available you can try to reuse the debian directory from the older source package to build from the newer source.
Sometimes this works in the first attempt, but if you need this for a lot of packages, it’ll be a lot of work.
* output is deterministic but dependent on some aspect of the build environment (locale, hostname, etc)
* output is accidentally non-deterministic (eg "we put all the .html files into a tarball with a shell glob pattern, which gets you an order dependent on your filesystem implementation and the phase of the moon")
* a wide array of "output contains a timestamp" issues
"Let's put the timestamp into our version string/a generated file so the user knows when it was built" is a really common thing, and it seemed like purely a nice convenience feature until the concept of 100% binary-reproducible builds became a current concern.
Example: Say you have a version string that outputs a build time. Can you hash the program with just that bit of string data marked as unknown (or that string table entry replaced with a placeholder, as far as our verifier is concerned) and verify the rest of the program is unchanged across builds?
For example, if the compiler puts full path names in intermediate object files, so on a different computer the object files have a different size, this may result in choosing a completely different layout for the binary.
For the particular example of time, some common approaches are to hook the sys call for time such that it returns a constant value, or to teach your tool to output a constant value.
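One way to "teach your tool" is the SOURCE_DATE_EPOCH environment variable, a convention specified by the reproducible-builds.org project (nobody above mentioned it by name, so take this as my addition). A minimal Python sketch of a tool honouring it, with a hypothetical version_string helper:

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ) -> int:
    """Use SOURCE_DATE_EPOCH if set, otherwise fall back to the current time."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())


def version_string(version: str, environ=os.environ) -> str:
    """Hypothetical helper: embed the (possibly pinned) build time in a version string."""
    stamp = datetime.fromtimestamp(build_timestamp(environ), tz=timezone.utc)
    return f"{version} (built {stamp:%Y-%m-%d %H:%M:%S} UTC)"


# With the variable pinned, every build prints the same string.
print(version_string("1.0", {"SOURCE_DATE_EPOCH": "0"}))
# 1.0 (built 1970-01-01 00:00:00 UTC)
```

The build driver would normally set the variable to something meaningful, such as the date of the last source change, so the timestamp stays informative without varying between rebuilds.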
The Nix thesis (https://nixos.org/~eelco/pubs/phd-thesis.pdf) is a great read and goes through a bunch of these examples as part of its research.
Of course code signing is going to mess that up.
This is what I do. I'm working with a hardware vendor that is not particularly security-conscious. I have to send them compiled binaries as part of a build pipeline, which they then modify in a non-deterministic way. Fortunately the changes are limited to data which has no impact on the function of the software. So I've created hashes of all the individual components that make up the binaries, and I skip hashing the parts that they modify.
It's not ideal, but I can be reasonably confident that the binaries I compile function the same as the binaries that my hardware vendor distributes.
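Roughly, "skip hashing the parts that they modify" can look like the Python sketch below. The byte ranges are assumed to be known in advance (in practice you would get them from the file format), and the sample "binaries" are just made-up strings.

```python
import hashlib


def masked_sha256(data: bytes, masked_ranges) -> str:
    """Hash `data` with each (start, end) byte range replaced by zero bytes."""
    buf = bytearray(data)
    for start, end in masked_ranges:
        if not (0 <= start <= end <= len(buf)):
            raise ValueError(f"range {start}:{end} is outside the data")
        buf[start:end] = bytes(end - start)  # same length, so offsets stay put
    return hashlib.sha256(buf).hexdigest()


# Two made-up builds that differ only in an embedded date at bytes 11..21.
build_a = b"prog built 2019-01-01 end"
build_b = b"prog built 2020-06-30 end"
date_field = [(11, 21)]

print(masked_sha256(build_a, date_field) == masked_sha256(build_b, date_field))  # True
print(masked_sha256(build_a, []) == masked_sha256(build_b, []))                  # False
```

The obvious caveat is that anything inside a masked range is unverified, so the ranges need to be as small as possible and must not cover anything that affects behaviour.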
The Rev hash fulfils the function now.
Knowing when something was built is more valuable to me than a reproducible build. Seems like a fad triumphing over common sense.
What's the relevance of knowing when something was built if you can check the source version and rebuild it yourself in the exact same way? If the build is reproducible it's merely a cache. You could potentially recompile it every time you try to execute the command and you'd get the exact same thing.
We're not talking about passwords in a database, we're talking about reproducible builds. I already trust Debian, so it's irrelevant to me, as a user, if their builds are 100% reproducible. Can you explain why I should feel differently?
Although I don't think it's a valuable thing to spend time on, it's not my time, so I honestly don't care too much.
People concerned about this stuff are putting the least effort into what causes the most failures. Reproducible builds are a distraction if what's reproducible is 0-day-ridden software whose safety/security checks might have been removed by a buggy, black-box compiler.
If it's about repos, I recommend David A. Wheeler's page on SCM security since following it eliminates most of the risks.
Besides, it's a deterministic build. You can use a lookup table with the hash to find out exactly what revision it is.
Knowing the timestamp of when it was built, or the source was downloaded tells me roughly how old the package is, which can be useful for a variety of reasons.
It does. It proves that the given package was produced by the given source code. However, you can keep the timestamp as well! One of the goals of reproducible builds is to record values like this so it can be replayed later at a future point in time. This is done by the BUILDINFO format.
Would you rather get a timestamp embedded that tells you it was built today, or take the hash and be able to find the exact revision?
That's like saying, "why did you introduce entropy?" It's there. It happens, and it's difficult to get rid of in any complex system.
Debian was one of the first large distributions to use a build farm for many different architectures. Maintaining deterministic builds (even as far as just file timestamps) in a distributed environment is very challenging.
If you mean which distribution has 100% of its packages reproducible, probably none yet. But Arch and Debian are both making progress.
> 56 (100.0%) out of 56 built NetBSD files were reproducible
That is not really comparable to debian's 26475/28522 for buster https://tests.reproducible-builds.org/debian/reproducible.ht...
Build system-wise, there are lots of options: Blaze, Buck, Pants, Please (AFAIK)
In case any Nixers are reading this, here is how I got NSPR to build reproducibly in Guix:
https://r13y.com/ tracks the progress of NixOS reproducibility; currently we're at 98.23% bit-for-bit identical for our minimal installer ISO. After that, we'll need the graphical installer, and then more of the base package set. So we've still got a ways to go.
If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.
There is so much context that is normally embedded into a binary that this is usually not true unless explicit measures have been taken.
Two very common sources that introduce variability are time-stamps used in the build, and environment variables such as $HOME and $USER.
Most compilers give no guarantees about the order in which they lay out the data. I love deterministic processes as much as everyone. But randomized approaches have their advantages too. And if a compiler has reasons to randomize output, e.g. for speed, then it’s a trade-off to consider.
writes to file
writes to index
That's not a race condition. The output order doesn't matter, but it is nondeterministic.
Let's assume there is a latent bug in the compiler that gets triggered if file four is the first one. Good luck debugging that.
That's 100% controllable and deterministic.
A build process that names things with timestamps or leaks your locale into the build configuration (or doesn't pin build-time dependency versions) will make the build depend on things other than the source code (both program and build settings) you made available.
It may even be desirable for it to be non-reproducible - if, for instance, you want to use optimizations targeted to your specific system, then your build system will have to introduce the architecture information into the build process and your build will result in a unique binary that targets your own machine.
For example, depending on the input order, the linker may produce different output. Sure, you can sort the object files, but the sorted order is still effectively "stored" in the binary, and that's not source code.
You can only normalize such things (like in the example above, sorting), you can not eliminate them, they naturally exist.
No, but the order should be explicitly defined in the build scripts or the result will not be deterministic.
If the order triggers, say, a linker bug that makes one in 50 builds crash, execution will not be deterministic and that's really, really bad.
It's not true in practical code either, people like to stick in timestamps.
It's not ever true on Windows, unless you use the fairly recent PE header changes.
So I can checkout an arbitrary version from years ago and reproduce the exact same set of output files?
Think of it as absent a cache I should get bit for bit identical out (perhaps ignoring logs and such).
I maintain my own build farm and tried comparing my results against the official CI server:
$ guix challenge --substitute-urls="https://ci.guix.info"
14,224 store items were analyzed:
- 4,972 (35.0%) were identical
- 265 (1.9%) differed
- 8,987 (63.2%) were inconclusive
All of these items can be (and have been) built entirely from source, starting with Guix' initial "binary seeds", on (probably) different hardware and kernel compared to the CI system.
One reason builds become irreproducible is when a build is multi-threaded, and the order in which artifacts are combined into larger ones becomes unpredictable. That problem doesn’t exist, or at least is a lot smaller, for ‘leaf’ artifacts (example: if your C compiler is single-threaded, and you run make multi-threaded, individual object files do not have the ordering problem, but libraries built from multiple object files do)
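A toy Python version of that situation, where compile_unit is a stand-in for a real compiler and "linking" is just concatenation: the units finish in whatever order the threads happen to run, but the combining step uses a fixed order, so the result is stable.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed


def compile_unit(name: str) -> bytes:
    """Stand-in for a compiler: deterministic output per source file."""
    return hashlib.sha256(name.encode("utf-8")).digest()


def link_deterministically(unit_names) -> bytes:
    """Build units in parallel, then combine them in sorted-name order."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(compile_unit, n): n for n in unit_names}
        for fut in as_completed(futures):  # completion order is unpredictable
            results[futures[fut]] = fut.result()
    # Combining in completion order would be the bug; sort instead.
    return b"".join(results[n] for n in sorted(results))


first = link_deterministically(["b.c", "a.c", "c.c"])
second = link_deterministically(["c.c", "a.c", "b.c"])
print(first == second)  # True
```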
On the other hand, a single static struct with a padding “hole” that isn’t consistently written that happens to end up in lots of binaries will decrease your percentage a lot.
Each of these "artifacts" is an actual isolated build of a complicated program such as Chromium or GCC. The technical term is "derivation", which produces "outputs".
All of those packages can be reproduced from source now or 100 years into the future and SHOULD produce the exact same binary output. If they don't, it's a bug.
These problems can also cascade, if component A embeds the hash of another component B, e.g. to verify that it's been given a correct version. If that hash comes from an unreproducible upstream, and building it ourselves gives a different hash, then we'll need to alter component A to use that new hash. That, in turn, changes the hash of component A, which might be referenced in some other component C, and so on.
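The cascade is easy to see in a toy Python model where a component's hash covers its own content plus the hashes it embeds. The component names and contents are placeholders, and real systems hash much more than this.

```python
import hashlib


def component_hash(own_content: bytes, embedded_hashes: list) -> str:
    """Hash of a component that embeds the hashes of the things it references."""
    h = hashlib.sha256()
    h.update(own_content)
    for dep_hash in embedded_hashes:
        h.update(dep_hash.encode("ascii"))
    return h.hexdigest()


def chain(b_content: bytes):
    """C references A, and A references B; return the hashes of (A, B, C)."""
    b = component_hash(b_content, [])
    a = component_hash(b"component A", [b])
    c = component_hash(b"component C", [a])
    return a, b, c


upstream = chain(b"B as shipped by upstream")
rebuilt = chain(b"B as rebuilt locally")  # only B's bytes differ

# One differing byte in B changes the hashes of A and C as well.
print([x != y for x, y in zip(upstream, rebuilt)])  # [True, True, True]
```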
Look at Windows. Even if you fix the compiler and linker, you still have non-reproducibility by design: the PE header contains a timestamp.
People also like to stick non-reproducible stuff into builds directly, like timestamps.
Compilers don't have any reason to lay down data in a specific order, so if they are threaded in the backend they just don't.
IDL tools might stick in the timestamp of when a file was generated, for convenience.
and on and on and on.
Once you make the sensible choice to include build time in the result you've broken reproducibility. Fixing this means tracking down every package that does this and removing the timestamp.
If one has reproducible builds, wouldn't a commit/tag from the version control system also do the job of traceability and reproducibility?
Would be nice if there was. I think this is the root of issues such as firmware with the same password/crypto keys across a whole product family instead of unique ones.
Thing is just that host + build time is what was traditionally used. There's no single commit you could use in cvs.