Lots of progress for Debian's reproducible builds (lwn.net)
190 points by meskio on Jan 26, 2015 | 33 comments



Debian does amazing amounts of system-wide initiatives. Off the top of my head, there are multiarch https://wiki.debian.org/Multiarch, clang rebuild http://clang.debian.net/, and automated code analysis https://qa.debian.org/daca/.


The reproducible builds talk at 31C3 also does a nice job of explaining some of the many possible attack vectors that make reproducible builds desirable, and many of the subtleties involved in making it work: http://media.ccc.de/browse/congress/2014/31c3_-_6240_-_en_-_...


This is a link shared by an LWN subscriber - usually, articles only become available for free 7 days after publication. If you read this article, please think about supporting LWN financially.


I actually did some work making debootstrap reproducible. Even if the 100 or so .debs it installs are built reproducibly, the chroot image resulting from debootstrap will not be reproducible byte-for-byte, due to the debootstrap shell script itself and the tools it calls.

Offhand, I remember that /etc/{passwd,group} are copied from the host machine by design. There is also a random seed file, to save entropy across reboots. And there is some nondeterminism in the dynamic linker cache AFAIK. And timestamps in logs.

If anyone is interested in this let me know.


Debian appears to be doing some work on that too:

https://wiki.debian.org/ReproducibleInstalls


Interesting, didn't know that. They mentioned logs like bootstrap.log and dpkg.log, which I noticed, but looking at my shell scripts now there is also nondeterminism/host influence in:

  etc/resolv.conf
  var/cache/ldconfig/aux-cache
  var/lib/urandom/random-seed
  etc/init.d/.depend.{stop,start,boot}
  etc/shadow
  etc/passwd- and family (with trailing hyphen)
  etc/apt/trustdb.gpg and other keys
(not exhaustive)
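
For illustration, scrubbing that list might look roughly like the sketch below. This is not the script described above; the chroot path, and the choice of what to delete versus truncate, are assumptions.

  # Sketch: scrub host-influenced files from a fresh debootstrap chroot.
  # $CHROOT and the delete-vs-truncate choices are illustrative.
  CHROOT=/srv/wheezy-chroot
  rm -f "$CHROOT/etc/resolv.conf" \
        "$CHROOT/var/cache/ldconfig/aux-cache" \
        "$CHROOT/var/lib/urandom/random-seed" \
        "$CHROOT"/etc/init.d/.depend.{stop,start,boot}
  # Truncate logs instead of deleting them, so dpkg still finds them.
  : > "$CHROOT/var/log/bootstrap.log"
  : > "$CHROOT/var/log/dpkg.log"
  # Clamp mtimes so later archiving doesn't capture build-time timestamps.
  find "$CHROOT" -newermt 2015-01-01 -exec touch -d 2015-01-01 {} +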


I'm interested! I actually have had to work around the /etc/{passwd,group} shenanigans for other reasons, interested to see what you did there.


Unfortunately I haven't published it yet, but I can describe what I did. I have like 3000 lines of shell script to make containers, and maybe 800-1000 lines are related to debootstrap.

I have a cron job running daily on multiple machines, doing a deterministic debootstrap of Debian Wheezy on i386 and amd64.

It basically wraps debootstrap, strips down the image a bit, and stamps out the nondeterminism. Every day it gives one checksum for i386 and one for amd64, which hold across multiple host machines (one is Debian Wheezy, the other is Ubuntu Trusty). So it is free from host influence.
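
For a rough idea of what such a wrapper looks like, here is a sketch (not the actual script): it assumes GNU tar >= 1.28 for --sort=name, and scrub_nondeterminism is a hypothetical helper standing in for the normalization step.

  # Sketch: build, scrub, archive deterministically, and checksum per arch.
  for arch in i386 amd64; do
    target="wheezy-$arch"
    debootstrap --arch="$arch" wheezy "$target" http://ftp.debian.org/debian
    scrub_nondeterminism "$target"   # hypothetical normalization helper
    tar --sort=name --numeric-owner --owner=0 --group=0 \
        --mtime='2015-01-26 00:00:00' -C "$target" -c . | sha256sum
  done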

The /etc/apt/sources.list just has the "wheezy" repo now (i.e. not wheezy-updates). So Debian 7.7 had one pair of checksums, and then on Jan 10 2015, I noticed Debian 7.8 was released. They changed that day, and have been stable/reproducible every day since.

Part of this is also mirroring the Release/Packages metadata daily and storing version history in Git. One nice thing I found out about Debian through doing this is that the Release file completely describes the input, since it's hashes all the way down (a "Merkle tree", basically like Git itself.) My scripts also make it so you can store versioned metadata in one tree, while keeping data immutable in "pool" (all this really requires is symlinks and file:// URLs for the repo).
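
To make the "hashes all the way down" point concrete, here is an illustrative sketch ("hello" is just an example package name, and the mirror path is hypothetical):

  # Release carries the checksum of each Packages index:
  grep 'main/binary-amd64/Packages$' Release
  # ...and each Packages stanza carries the pool path and checksum of a .deb:
  grep -A 30 '^Package: hello$' Packages | grep -E '^(Filename|SHA256):' | head -n 2
  # So archiving a single (signed) Release file pins the entire input.
  # A local versioned mirror can then be consumed via a file:// source, e.g.
  #   deb file:///srv/mirror/debian wheezy main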

What project were you working with in this area? I'm basically doing this to make reproducible builds of containers. I'm kind of surprised that Docker completely punts on this problem.


Thanks for sharing, that sounds great. Do you have a GitHub/Bitbucket/etc. account I could follow?

I've found myself automating Debian/kernel/boot loader/libs/app stack packages for some ARM hardware for which I'm an app developer. This is due to the outrageous situation of COM module vendors (and/or SoC manufacturers) thinking that kernel 3.0 and horribly ancient/unsupported/unpatched userlands are acceptable in 2014/2015.

I run our .deb builds (and the final debootstrap) in a bunch of Docker containers: the xapt/debcross toolchain, while awesome in itself, still has some quirks when building things with awkward/complex dependencies and normally requires a lot of hand-holding. So dockerizing the build env makes my life easier (some of it still has to be done from an ARM-native chroot via qemu+binfmts). I use Jenkins to automate and integrate with the rest of our build tools, but makefiles work too.
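
For readers unfamiliar with the qemu+binfmts approach mentioned above, the classic pattern looks roughly like the following sketch (it assumes qemu-user-static and binfmt-support on an amd64 host; paths and suite are illustrative):

  # Bootstrap an armhf chroot on amd64, then run the second stage (and later
  # builds) under qemu user emulation via binfmt_misc.
  sudo apt-get install qemu-user-static binfmt-support
  sudo debootstrap --arch=armhf --foreign wheezy armhf-chroot http://ftp.debian.org/debian
  sudo cp /usr/bin/qemu-arm-static armhf-chroot/usr/bin/
  sudo chroot armhf-chroot /debootstrap/debootstrap --second-stage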

The /etc/passwd et al. copying from the host is definitely the wrong thing to do in my circumstances as well; it's curious how many different failure modes chroot builds can have.


I didn't have anything related to this on my Github account, but I just uploaded the data files in case that is useful to anyone.

https://github.com/andychu/debian-wheezy-metadata

(NOTE: The repo takes up 300+ MB on my local disk)

I chatted on #debian IRC about this a little. It seemed like one person thought doing all the archs would take up too much space. So far doing it daily for over a year for two archs has been manageable. I think you simply have one repo per arch, instead of having multiple archs in one repo like I have here.

I will try to get the code up... if you think you will use it you can ping me on github. I think this should be in Debian itself, and actually supported by debootstrap, but so far I am just wrapping it rather than patching it. (I started off trying to patch it.)


It can be surprisingly difficult. Funnily enough, moving from svn to git in one project I know of probably did a lot of the necessary work to achieve this, by having to remove the reliance on $SVN tags and the pre/post-"build commits" which used to be part of the release process.

It's an interesting use-case for Docker as well: you can ship the build environment (or its Dockerfile describing it) for people to run builds under the same env as the official released build.


I had the same hope for Docker, that it would ease reproducible builds. It quickly became clear to me how much work still needs to be done for that to be realistic. I maintain a packaging of Meteor using Docker designed to increase reproducibility (https://registry.hub.docker.com/u/danieldent/meteor/). It does some checksums on the Meteor code, but the build process itself introduces entropy, and Dockerfile doesn't really have the primitives to make it easy to prevent that.

Many many build processes have hidden from-the-network dependencies which are yet another huge source of problems (both in terms of having high-availability for the build process and in terms of understanding exactly what is getting built).

All of the upstream Docker images (including the bottom base images which most people never look at carefully) would need to be built in a reproducible way. And the Docker code would need to actually verify checksums more carefully than it currently does.

Having projects like Debian doing work like this will make it a lot easier for everything else to become reproducible. Which in addition to the security benefits is also pretty useful when bug hunting - having a system that makes it clear which (if any) of your dependencies changed narrows the list of things which need to be checked when something breaks. Ubuntu Core, Snappy, and Nix are also on my list of things to watch in this area.


Ah, I should've thought twice about that - I'm not even attempting reproducible builds in my own work, but I've often seen my Dockerfiles have apt-get-induced shenanigans due to the particular day or half-broken mirror I happen to be running the build from.
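
One way to take the "particular day" out of apt is to point sources.list at snapshot.debian.org, which serves the archive as it existed at a given instant. A sketch (the timestamp is illustrative and must resolve to an existing snapshot):

  # Pin apt to a point-in-time mirror state instead of whatever today's
  # mirror happens to say.
  echo 'deb http://snapshot.debian.org/archive/debian/20150126T000000Z/ wheezy main' \
    > /etc/apt/sources.list
  # Release files that old fail the Valid-Until check, so tell apt to skip it:
  apt-get -o Acquire::Check-Valid-Until=false update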

NixOS is definitely on my list, although for different reasons (whilst I love using and babysitting Debian systems, orchestrating it is eating up time that could be better spent on something like NixOS).


Can you elaborate on what these pre/post-"build commits" did and why they were needed?

If these commits were used to adjust version numbers in source files, the following trick should eliminate them: Check out the release branch into a working copy, adjust the version numbers in the working copy, and then copy the working copy to the tag's URL (as in: cd working-copy; svn copy . ^/tags/1.0)
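
Spelled out as commands, the trick looks roughly like this (the repository URL and the version file are hypothetical):

  # Tag a release with the final version number without committing the bump
  # back to the branch itself.
  svn checkout http://svn.example.org/repo/branches/1.0.x wc && cd wc
  sed -i 's/^VERSION = .*/VERSION = 1.0/' version.mk   # hypothetical version file
  svn copy . ^/tags/1.0 -m "Release 1.0"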

If this was a maven project: The maven release plugin is doing this wrong and performs 3 commits to an SVN repository per release...


It was a bespoke build system but this aspect of it was similarly "wrong". IIRC pre-build commit, yes, was version-number related. It also gathered issues fixed in the release being built and updated change logs/release notes/upgrade info documentation automatically. IIRC the post-build commit helped confirm in the commit history that a particular build for release x.y.z was successful and the version number can now be incremented (occasionally it took multiple attempts for the release manager to build successfully). Of course, one could go through the tags but most of the developers liked seeing release management stuff in trunk/release branches.

I didn't mean to criticize SVN as being inherently incapable of reproducible builds (if anything that's harder to achieve with git, especially with hacks like the ones I've listed above), but the act of cleaning up the SVN repos and preparing for the migration to git, where a lot of our old SVN (and RCS!) habits would be problematic, also seems like the same kind of housecleaning you'd need to prepare for reproducible builds.


A little bit of related trivia: Lunar (J. Bobbio) worked on hOp, a GHC-based Haskell microkernel, so you can write drivers in Haskell. See https://github.com/dls/house. A knowledgeable fellow.


Can anyone comment on why all builds are not currently "reproducible"?

I mean, if a package is compiled on the same system, with the same compiler, with the same build script -- should it not produce the same output?


The biggest offender is timestamps - compiled binaries, documentation, archives, etc. often contain the time at which the file was built.
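
A minimal demonstration of the timestamp problem, using gzip (which records the input file's name and mtime in its header):

  # Identical content, different names/mtimes: the compressed bytes differ.
  printf 'hello\n' > a; printf 'hello\n' > b
  touch -d 2015-01-01 a
  touch -d 2015-01-02 b
  gzip -c a | sha256sum
  gzip -c b | sha256sum    # differs from the previous line
  # gzip -n omits the name and timestamp, restoring determinism:
  gzip -nc a | sha256sum
  gzip -nc b | sha256sum   # identical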

Other problems include non-deterministic filesystem order, randomized hash algorithms, and even the fact that Markdown processors mangle email addresses randomly. A highly vexing problem that I'm currently trying to solve is that libxslt implements the XSLT generate-id() function by taking the memory address of the XML node struct, which makes documentation generated with XSLT non-reproducible (memory addresses are non-deterministic because of address space randomization and randomized hash tables).

(I'm the author of strip-nondeterminism, a tool in our custom toolchain that attempts to normalize files after they're built.)


strip-nondeterminism "is a Perl module for stripping bits of non-deterministic information, such as timestamps and file system order, from files such as gzipped files, ZIP archives, and Jar files. It can be used as a post-processing step to make a build reproducible, when the build process itself cannot be made deterministic. It is used as part of the Reproducible Builds project."

Browse source @ https://anonscm.debian.org/cgit/reproducible/strip-nondeterm...

PS. non-deterministic filesystem order .. doesn't this go away if you use tmpfs?


Maybe you could fix the libxslt problem by using a pool allocator for the nodes. The pool would contain only the nodes, and you could use the offset in the pool as their ID, rather than the full virtual address. Just some food for thought.


That's a very interesting idea. It's possible to tell libxml2 to use a custom allocator, but unfortunately it will use that allocator for all objects, not just nodes, and thus would probably be affected by hash table randomization. The architecture dependence is also a problem.

My current solution is a patch that uses a hash table to map the memory address of a node to a counter that increments in a deterministic order. It's a massive hack and I hate it. I'm currently trying to assess what its performance impact is.

The proper solution would be for libxml2 to maintain a counter and assign every node a deterministic ID, but that would require extending the _xmlNode struct, which would break ABI compatibility because the struct is not opaque.


That probably isn't architecture independent.


Thanks for your work!

I've been wondering -- why do compiled binaries actually include timestamps? For human-readable stuff, I sort of understand, but for binaries I don't see why we'd want to include timestamps.


I guess it's related to debugging purposes. If you know when a binary was built, you know that a source file X with a change time later than the build time is different from the source file that was used to build the binary.


For a few examples, see the third paragraph of the article. Things like filesystem ordering, system time, ...

Also, the goal is to make it reproducible across machines, since one of the motivations is to enable everybody to verify that the binary distribution matches the source.


From memory, the Windows binary format supposedly contains a compilation timestamp, but I've never developed for Windows. And I know ELF doesn't contain a timestamp.

You can blame the compiler's random number seed for much non-determinism. I googled for this and found a good explanation at:

http://blog.mindfab.net/2013/12/on-way-to-deterministic-bina...

The summary is that anonymous namespaces actually have a name, that being a big integer, and they're all distinct, so each has a different one; gcc just picks a pseudorandom number for each anon namespace. They're anonymous to you, but not to the compiler. LOL, crazy but true. There's some folklore that once every 50 trillion years (maybe less often) linking two compiled files will fail at runtime, not compile time, because two namespaces were randomly assigned the same number, so they "crossed the streams".
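
GCC does expose a knob for this: -frandom-seed replaces the pseudorandom seed with one you supply. A common recipe (a sketch; the source path is hypothetical) is to derive the seed from the file name so it is stable across rebuilds:

  # Give GCC a deterministic per-file seed so anonymous-namespace symbol
  # names stop varying from build to build.
  seed=$(printf '%s' src/widget.cc | sha256sum | cut -d' ' -f1)
  g++ -frandom-seed="$seed" -c src/widget.cc -o widget.o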

Supposedly everyone knows some versions of GCC sometimes pick specific optimizations pseudorandomly. This is one of those "everyone knows and nobody has evidence" things. I'd welcome a link to actual evidence, like to actual code that sometimes compiles differently based on phase of moon or whatever. If I were more bored I'd make a github project specifically to manipulate gcc for fun, sounds amusing.

If you're compiling static libraries, AR is just a generic, although crude and ancient, file archiver and therefore contains timestamps, as explained in the link below. I don't think that's necessary for static library operation; it's just an artifact of the file format.

http://en.wikipedia.org/wiki/Ar_%28Unix%29
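
GNU ar has a deterministic mode for exactly this: the D modifier (the default in some newer binutils builds) zeroes the timestamps and UIDs/GIDs and uses consistent file modes in the member headers. For example (object file names are illustrative):

  # Default: member headers record the mtime/uid/gid of each object file.
  ar rcs libfoo.a foo.o bar.o
  # Deterministic mode: the D modifier normalizes those fields.
  ar rcsD libfoo.a foo.o bar.o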

There is also a funny failure mode where two systems with somewhat different libraries (perhaps one was upgraded after the other, or got a security patch, or got owned) will obviously produce different results when statically linking a binary from the same code with the same compiler. This is of the same class as optimizations getting a little too personal for your specific CPU family, such that two generic amd64 Intel boxes are not quite so identical as you'd think in both optimizations and runtimes. (Oh, edited to add: what if your virtualization is funky, such that GCC didn't think a certain instruction set was present in the virtual image? Then bare-metal compiles would have different optimizations than images running on that same bare metal, although this is totally theoretical and probably doesn't exist in reality.)


I love this kind of project, and I think it's one of the best things that can happen to Debian.

Also, openSUSE has had reproducible builds/packages for ages via OBS (http://build.opensuse.org), and now Factory/Tumbleweed has reproducible packages + automatic CI (using openQA: https://openqa.opensuse.org). Quite an achievement for a rolling distribution.


Baserock (http://wiki.baserock.org) may have a repeatable build of OpenEmbedded for automotive systems.


What's the relationship between Baserock and OpenEmbedded? Glancing through OpenEmbedded's wiki page, it seems to be a recipe/build system for embedded Linux. Baserock seems to be in the same league.


It's a distro/downstream stable version of OE, used in automotive, http://www.genivi.org/. There are several OE "distros", e.g. Angstrom, http://www.angstrom-distribution.org/ . There's also Yocto, the overall build system, https://www.yoctoproject.org/


I drank with the Yocto people in 2011. They absolutely live and breathe release engineering. On subjects like repeatable builds, I trust their opinion. At present, they don't appear to mention it at all on their docs or wiki.


Will this provide a guaranteed method for reproducible builds, or will it still be technically possible to create build scripts that produce different results (e.g., by pulling from /dev/random, or grabbing timing information from various sources, or by writing a multithreaded program whose threads all write to a single file)?


How could anyone or anything prevent you from building something nonreproducible? This is about making build processes that are intentionally reproducible... You seem to be asking if one could continue to do things as they currently are done, which given that the entire system is open source, of course you can. This can't possibly force all users to only make repeatable builds. This seems like such an odd question that I think I must be misunderstanding you.



