
Lots of progress for Debian's reproducible builds - meskio
https://lwn.net/SubscriberLink/630074/217c398c74495155/
======
sanxiyn
Debian does amazing amounts of system-wide initiatives. Off the top of my
head, there are multiarch
[https://wiki.debian.org/Multiarch](https://wiki.debian.org/Multiarch), clang
rebuild [http://clang.debian.net/](http://clang.debian.net/), and automated
code analysis [https://qa.debian.org/daca/](https://qa.debian.org/daca/).

------
christop
The reproducible builds talk at 31C3 also does a nice job of explaining some
of the many possible attack vectors that make reproducible builds desirable,
and many of the subtleties involved in making it work:
[http://media.ccc.de/browse/congress/2014/31c3_-_6240_-_en_-_...](http://media.ccc.de/browse/congress/2014/31c3_-_6240_-_en_-_saal_g_-_201412271400_-_reproducible_builds_-_mike_perry_-_seth_schoen_-_hans_steiner.html)

------
leonhandreke
This is a link shared by an LWN subscriber - usually, articles only become
available for free 7 days after publication. If you read this article, please
think about supporting LWN financially.

------
chubot
I actually did some work making debootstrap reproducible. Even if the 100 or
so .deb packages it depends on are built reproducibly, the chroot image
resulting from debootstrap will still not be reproducible byte-for-byte, due
to the debootstrap shell script itself and the tools it calls.

Offhand, I remember that /etc/{passwd,group} are copied from the host machine
by design. There is also a random seed file, to save entropy across reboots.
And there is some nondeterminism in the dynamic linker cache AFAIK. And
timestamps in logs.

If anyone is interested in this let me know.

~~~
xai3luGi
Debian appears to be doing some work on that too:

[https://wiki.debian.org/ReproducibleInstalls](https://wiki.debian.org/ReproducibleInstalls)

~~~
chubot
Interesting, didn't know that. They mentioned logs like bootstrap.log and
dpkg.log, which I noticed, but looking at my shell scripts now there is also
nondeterminism/host influence in:

    
    
      etc/resolv.conf
      var/cache/ldconfig/aux-cache
      var/lib/urandom/random-seed
      etc/init.d/.depend.{stop,start,boot}
      etc/shadow
      etc/passwd- and family (with trailing hyphen)
      etc/apt/trustdb.gpg and other keys

(not exhaustive)
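Differences like these can be surfaced by hashing every file in two chroots
built the same way and diffing the results; a minimal Python sketch (the
function names are illustrative, not part of any Debian tooling):

```python
import hashlib
import os

def tree_hashes(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    hashes = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                hashes[rel] = hashlib.sha256(f.read()).hexdigest()
    return hashes

def diff_trees(a, b):
    """Return relative paths whose content differs (or exists in one only)."""
    ha, hb = tree_hashes(a), tree_hashes(b)
    return sorted(p for p in ha.keys() | hb.keys() if ha.get(p) != hb.get(p))
```

Running debootstrap twice into separate directories and feeding both to
diff_trees would flag files such as var/lib/urandom/random-seed immediately.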

------
csirac2
It can be surprisingly difficult. Funnily enough, moving from svn to git in
one project I know of probably did a lot of the necessary work to achieve
this, by removing reliance on $SVN tags and the pre/post-"build commits"
which used to be part of the release process.

It's an interesting use case for Docker as well: you can ship the build
environment (or the Dockerfile describing it) so that people can run builds
under the same environment as the official release build.

~~~
DanielDent
I had the same hope for Docker, that it would ease reproducible builds. It
quickly became clear to me how much work still needs to be done for that to be
realistic. I maintain a packaging of Meteor using Docker designed to increase
reproducibility
([https://registry.hub.docker.com/u/danieldent/meteor/](https://registry.hub.docker.com/u/danieldent/meteor/)).
It does some checksums on the Meteor code, but the build process itself
introduces entropy, and Dockerfile doesn't really have the primitives to make
it easy to prevent that.

Many, many build processes have hidden from-the-network dependencies, which
are yet another huge source of problems (both in terms of having high
availability for the build process and in terms of understanding exactly what
is getting built).
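One common mitigation is to pin every fetched artifact to a known digest
before it enters the build; a minimal sketch (the expected hash would come
from a lockfile the maintainer keeps, not from the network):

```python
import hashlib

def verify_sha256(data: bytes, expected_hex: str) -> bytes:
    """Return data unchanged if its SHA-256 digest matches, else raise."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex:
        raise ValueError(f"checksum mismatch: got {actual}, "
                         f"expected {expected_hex}")
    return data
```

A build script would download a tarball, pass the bytes through
verify_sha256, and only then unpack them, so a changed mirror or a tampered
download fails loudly instead of silently altering the output.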

All of the upstream Docker images (including the bottom base images which most
people never look at carefully) would need to be built in a reproducible way.
And the Docker code would need to actually verify checksums more carefully
than it currently does.

Having projects like Debian doing work like this will make it a lot easier for
everything else to become reproducible. Which in addition to the security
benefits is also pretty useful when bug hunting - having a system that makes
it clear which (if any) of your dependencies changed narrows the list of
things which need to be checked when something breaks. Ubuntu Core, Snappy,
and Nix are also on my list of things to watch in this area.

~~~
csirac2
Ah, I should've thought twice about that - I'm not even attempting
reproducible builds in my own work, but I've often seen my Dockerfiles have
apt-get-induced shenanigans due to the particular day or half-broken mirror I
happen to be running the build from.

NixOS is definitely on my list, although for different reasons (whilst I love
using and babysitting Debian systems, orchestrating them is eating up time
that could be better spent on something like NixOS).

------
agumonkey
A little bit of related trivia: Lunar (J. Bobbio) worked on hOp, a GHC-based
Haskell microkernel that lets you write drivers in Haskell. See
[https://github.com/dls/house](https://github.com/dls/house). A knowledgeable
fellow.

------
Alupis
Can anyone comment on why all builds are not currently "reproducible"?

I mean, if a package is compiled on the same system, with the same compiler,
with the same build script -- should it not produce the same output?

~~~
agwa
The biggest offender is timestamps - compiled binaries, documentation,
archives, etc. often contain the time at which the file was built.

Other problems include non-deterministic filesystem order, randomized hash
algorithms, and even the fact that Markdown processors mangle email addresses
randomly. A highly vexing problem that I'm currently trying to solve is that
libxslt implements the XSLT generate-id() function by taking the memory
address of the XML node struct, which makes documentation generated with XSLT
non-reproducible (memory addresses are non-deterministic because of address
space randomization and randomized hash tables).

(I'm the author of strip-nondeterminism, a tool in our custom toolchain that
attempts to normalize files after they're built.)
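strip-nondeterminism itself is written in Perl; a minimal Python sketch of
the same idea for tar archives, clamping timestamps and ownership to fixed
values, might look like this (the zero epoch is an illustrative choice):

```python
import tarfile

EPOCH = 0  # fixed timestamp substituted for the actual build time

def normalize_tar(src_path, dst_path):
    """Rewrite a tar archive with deterministic mtimes, owners and order."""
    with tarfile.open(src_path) as src, tarfile.open(dst_path, "w") as dst:
        # Sort members by name so filesystem readdir order cannot leak in.
        for member in sorted(src.getmembers(), key=lambda m: m.name):
            member.mtime = EPOCH
            member.uid = member.gid = 0
            member.uname = member.gname = "root"
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
```

The same pattern applies to any container format that embeds metadata:
read, overwrite the nondeterministic fields, write back in a fixed order.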

~~~
jzwinck
Maybe you could fix the libxslt problem by using a pool allocator for the
nodes. The pool would contain only the nodes, and you could use the offset in
the pool as their ID, rather than the full virtual address. Just some food for
thought.
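As a toy model of the idea in Python (the real fix would have to be C inside
libxml2; the names here are illustrative):

```python
class NodePool:
    """Allocate nodes from a list; a node's ID is its stable pool offset."""
    def __init__(self):
        self._nodes = []

    def alloc(self, payload):
        """Append a node and return its offset, identical on every run."""
        self._nodes.append(payload)
        return len(self._nodes) - 1

    def generate_id(self, offset):
        """Deterministic ID derived from the offset, not a memory address."""
        return f"id{offset}"
```

Two runs that parse the same document allocate nodes in the same order, so
generate_id yields the same strings regardless of address space layout.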

~~~
agwa
That's a very interesting idea. It's possible to tell libxml2 to use a custom
allocator, but unfortunately it will use that allocator for all objects, not
just nodes, and thus the generated IDs would probably still be affected by
hash table randomization. The architecture dependence is also a problem.

My current solution is a patch that uses a hash table to map the memory
address of a node to a counter that increments in a deterministic order. It's
a massive hack and I hate it. I'm currently trying to assess what its
performance impact is.
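A rough Python model of that hack shows why the result is deterministic: IDs
depend only on the order in which nodes are first visited, never on the
addresses themselves (names are illustrative; a real patch must also cope
with node lifetimes):

```python
import itertools

class DeterministicIds:
    """Map each node, on first sight, to the next value of a counter."""
    def __init__(self):
        self._ids = {}                      # keyed by node identity (address)
        self._counter = itertools.count(1)

    def generate_id(self, node):
        key = id(node)                      # nondeterministic address...
        if key not in self._ids:
            self._ids[key] = next(self._counter)  # ...mapped to a stable number
        return f"id{self._ids[key]}"
```

Since document traversal visits nodes in the same order on every run, the
counter values, and hence the IDs, come out identical across runs.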

The proper solution would be for libxml2 to maintain a counter and assign
every node a deterministic ID, but that would require extending the _xmlNode
struct, which would break ABI compatibility because the struct is not opaque.

------
aplanas
I love this kind of project, and I think it's one of the best things that can
happen to Debian.

openSUSE has also had reproducible builds/packages for ages via OBS
([http://build.opensuse.org](http://build.opensuse.org)), and now
Factory/Tumbleweed has reproducible packages plus automated CI (using openQA:
[https://openqa.opensuse.org](https://openqa.opensuse.org)). Quite an
achievement for a rolling distribution.

------
walterbell
Baserock ([http://wiki.baserock.org](http://wiki.baserock.org)) may have a
repeatable build of OpenEmbedded for automotive systems.

~~~
desdiv
What's the relationship between Baserock and OpenEmbedded? Glancing through
OpenEmbedded's wiki page, it seems to be a recipe/build system for embedded
Linux. Baserock seems to be in the same league.

~~~
walterbell
It's a distro/downstream stable version of OE, used in automotive,
[http://www.genivi.org/](http://www.genivi.org/). There are several OE
"distros", e.g. Angstrom, [http://www.angstrom-distribution.org/](http://www.angstrom-distribution.org/). There's also
Yocto, the overall build system,
[https://www.yoctoproject.org/](https://www.yoctoproject.org/)

~~~
contingencies
I drank with the Yocto people in 2011. They absolutely live and breathe
release engineering. On subjects like repeatable builds, I trust their
opinion. At present, they don't appear to mention it at all on their docs or
wiki.

------
jml7c5
Will this provide a _guaranteed_ method for reproducible builds, or will it
still be technically possible to create build scripts that produce different
results (e.g., by pulling from /dev/random, or grabbing timing information
from various sources, or by writing a multithreaded program whose threads all
write to a single file)?

~~~
SwellJoe
How could anyone or anything prevent you from building something
nonreproducible? This is about making build processes that are intentionally
reproducible... You seem to be asking if one could continue to do things as
they currently are done, which given that the entire system is open source, of
_course_ you can. This can't possibly force all users to only make repeatable
builds. This seems like such an odd question that I think I must be
misunderstanding you.

