
Debian Buster will only be 54% reproducible, while we could be at 90% - JNRowe
https://lists.debian.org/debian-devel/2019/03/msg00017.html
======
hathawsh
I find it fascinating that, from a security standpoint, it's best for the
static representation of a program (the executable and the package) to be
perfectly predictable, while the dynamic representation of a program (once
it's loaded into memory and running) should be as unpredictable as possible
while maintaining correctness (using ASLR and other methods to slow down
attacks on vulnerable code). I guess running a program is like setting off an
explosive: you need to know exactly what you're handling before you release
its power.

Anyway, many kudos for this work!

~~~
zachrose
What do you mean by "as unpredictable as possible"?

From the perspective of an application developer I want that dynamic
representation to also be predictable.

~~~
adrianratnapala
You are probably talking about different kinds of representations.

hathawsh meant representation in the sense of memory layout and other low-
level details of how application state is encoded. You seem to be talking
about the programmer-visible abstractions built out of that layer. Behaviour
at that level has to be predictable in order to "maintain correctness", but
predictability at the lower level mostly helps maintain the correctness of
malware.

~~~
zachrose
Interesting. Out of curiosity, is there any correlation between programming
paradigms and the predictability of low-level details? I would imagine that
a “boring” C program with one big chunk of mutable state would be predictable
in this sense. What about programs developed in functional languages with
immutable data structures?

~~~
zamalek
> I would imagine that a “boring” C program with one big chunk of mutable
> state would be predictable in this sense.

It likely wouldn't. ASLR will choose a random load address for the module (.so
or .exe), which means that main() wouldn't be at a predetermined address. The
heap and stack would similarly have arbitrary offsets. Furthermore, any other
modules that get loaded would be at arbitrary addresses. Ultimately, even
"hello world" should become extremely randomized, provided ASLR is enabled for
the process.

ASLR prevents attacks arising from known addresses in virtual memory; without
it, addresses are reliably predictable (just run the program, attach a
debugger, and find the addresses that you care about).
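A rough way to see this from user space (a hypothetical Python sketch, not how
the mitigation is measured in practice): spawn two child processes and print
the address at which libc's `printf` is mapped in each. With ASLR enabled (the
default on modern Linux), the two addresses almost always differ between runs;
with it disabled they would match.

```python
# Launch two child interpreters and print the address of libc's printf
# in each process. Under ASLR, the mapping address varies per process.
import subprocess, sys

snippet = (
    "import ctypes;"
    "libc = ctypes.CDLL(None);"
    "print(hex(ctypes.cast(libc.printf, ctypes.c_void_p).value))"
)

addrs = [
    subprocess.run([sys.executable, "-c", snippet],
                   capture_output=True, text=True).stdout.strip()
    for _ in range(2)
]
print(addrs)  # two hex addresses; typically different under ASLR
```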

------
jopsen
Wow, I'm impressed by the consistent progress.

There are so many things based on Debian that adding reproducibility to Debian
is a massive security improvement.

I have a lot of respect for people working on this, it must be hard to respect
release freezes when you've been working on this for years.

------
JNRowe
Slightly fluid numbers aside, half the packages for Buster still sounds like
good progress to me.

~~~
e1ven
It's certainly the right direction! I applaud them for pushing this.

From the article, it looks like many of the others are reproducible on a code
level, but the release system is using older binaries, which haven't been
rebuilt yet.

It's too bad it won't be fixed in time for this release, but bodes very well
for future releases.

------
badfrog
What's the importance of reproducible builds?

Edit: found this nice overview by clicking a few links from the original post
[https://reproducible-builds.org/](https://reproducible-builds.org/)

~~~
convolvatron
one thing that I think got lost culturally in the move to linux was the
emphasis on open source. yes, not open source in some abstract warm fuzzy way,
but actual useful open source.

if I can't run the build process, and I don't really know which patch sets
have been applied on my system, which version of the source was used, or what
the dependencies are, doesn't that make it substantially more difficult to
debug and work with the resulting system?

it's really quite odd that the dominant open source platform settled on opaque
and unreproducible binaries as the distribution mechanism early on.

~~~
blihp
Assuming you're referring to Debian, you've missed the mark. Debian is more
committed to open source, in the RMS 'free software' sense, than any other
distro I'm aware of... it's the foundation of the distro. Any Debian user can
absolutely rebuild from source any[1] part of the system.

The problem that reproducible builds is attempting to solve is that my build,
while being 100% functional, might not be bit identical to yours or what
shipped with the distro for any number of _valid_ reasons.[2] The problem this
creates given the current world of increasingly sophisticated malware is that
I can't _prove_ a given build is identical to what was intended based on the
binaries. So it's an issue of 'we've always been giving you the source, now
we're going to ensure that it's possible to create bit-identical binaries'
rather than 'we were giving you free binaries, now we're giving you the
source'.
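The "proof" in question is just a byte-level hash comparison. A minimal Python
sketch (the file names and contents here are hypothetical stand-ins): if the
build is reproducible, a locally rebuilt package hashes identically to the
distributed binary.

```python
# Verify bit-identical builds by comparing SHA-256 digests of the
# locally built package against the distributed one.
import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate the two artifacts with demo files written for this sketch:
with open("my_build.deb", "wb") as f:
    f.write(b"identical package contents")
with open("distro_build.deb", "wb") as f:
    f.write(b"identical package contents")

assert sha256_of("my_build.deb") == sha256_of("distro_build.deb")
print("builds match bit-for-bit")
```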

If your argument is that one should always build from source, reproducible
builds is part of getting to the same effective result without the overhead
and headache that the vast majority of users are not willing to deal with.

[1] In the cases when this turned out not to be possible, holy hell tends to
get raised... see things like the Chromium BLOB issues
([https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786909](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786909)) etc. These issues tend to have a lot of noise
made and get resolved very quickly since it goes against the very core of what
Debian is intended to be. In cases where you have things that are binary-only
such as proprietary GPU drivers, these get segregated into a separate repo
with the general advice from Debian of 'you shouldn't use these'. This doesn't
tend to be realistic advice for many users given the state of many of the open
source drivers... but they've tried their best to steer people away while not
making the distro unusable for those who insist on using them.

[2] A trivial example would be baking the build date/machine name into the
binary... there's no expectation that one would ever reproduce that build
again even on the same machine.

~~~
mjw1007
I think the gripe is that it's never been as easy as it could be to say "OK,
I'm going to make a change in this here source file, then run the result on my
machine while keeping everything else the same".

For example if I'm reporting a bug and I have a proposed patch, it would be
nice to be able to say "I've been running with this patch for the last month
and it's been fine".

Ideally there'd be a coherent way to download the source for a particular
package, rebuild it, and have the system use the rebuilt result. Bonus marks
if it did something useful when a newer version of the package came along.

I think
[https://packages.debian.org/sid/dgit](https://packages.debian.org/sid/dgit)
is an attempt to make the download-and-rebuild part look more uniform, but I
think there's still work to be done to make it easy to substitute the result
for the distro package.

~~~
viraptor
What's missing from: "apt-get source the-package", add patches, "pbuilder
build the-package.dsc" ? (I'm assuming that pbuilder was already configured to
work)

~~~
EE84M3i
FWIW, as someone who touches Debian package building once in a blue moon, it
seems like there's wide variability in what the "right way" to do the actual
build step is, and that everyone has their own artisanal preference. pbuilder,
debuild, debootstrap, dpkg-buildpackage, sbuild, and fakeroot all leave me
scratching my head.

~~~
viraptor
If we're talking about reproducible packages, then I believe only pbuilder is
a valid answer here (maybe sbuild, I haven't used that one). debootstrap /
fakeroot / dpkg are parts of the pbuilder system, not full solutions.

But yeah, I noticed that Debian is not going out of its way to advertise one
blessed solution.

------
aruggirello
BTW I was just looking at diffoscope - developed as part of the "reproducible
builds" Debian project.

[https://try.diffoscope.org/](https://try.diffoscope.org/)

------
abdullahkhalids
We need a body/service that verifies that the code-binary combo being
distributed by a developer is actually reproducible, because most users don't
have the time or resources to verify each package they use. But it's useful to
society if distributed packages are reproducible. So a trusted third party
should provide this service.

I would envisage it working in the following way. The developer submits links
to their code and links to their package. The service builds from code and
checks if the final package matches what is being distributed by the
developer. If yes, it then publishes the hashes of the code and package, so
users can quickly check that they are using a reproducibly built package from
the correct source.

~~~
Foxboron
This is being actively worked on and has already seen quite a bit of work. The
project has also been submitted to Google Summer of Code under the Debian
project.

[https://ssl.engineering.nyu.edu/blog/2019-01-18-in-toto-pari...](https://ssl.engineering.nyu.edu/blog/2019-01-18-in-toto-paris)

[https://salsa.debian.org/reproducible-builds/debian-rebuilde...](https://salsa.debian.org/reproducible-builds/debian-rebuilder-setup)

[https://github.com/in-toto/apt-transport-in-toto](https://github.com/in-toto/apt-transport-in-toto)

[https://reproducible-builds.org/docs/sharing-certifications/](https://reproducible-builds.org/docs/sharing-certifications/)

------
iagooar
For those of us unfamiliar with the term "reproducible", yet in the business
of software - what does it mean and why is it desirable or needed?

~~~
scarejunba
Just wanted to add that besides the security gains there's a real performance
gain. If you've got a slow-to-build codebase where you've made a small
modification you can use a remote artifact cache so that all the parts that
don't change just get downloaded from when someone built it. Then you have
faster incremental compilation.

You can do this with Bazel and Buck and with a few assumptions and some
configuration even with Gradle.

~~~
chriswarbo
I think that's _slightly_ different: if a build is reproducible, it means that
the build products I get from my own machine are identical to those on a
remote machine or cache (e.g. they have the same SHA hash).

However, that doesn't speed anything up, since I don't know what the hash is
unless I actually do the build or download the file (and then hash the
result). If a cache used such hashes as IDs, I would only know which file to
fetch once I've already got it!

For such a cache to work there needs to be another mechanism for obtaining the
IDs. As an example, Nix caches using the hash of _build inputs_ (scripts,
tarball URLs + hashes, etc.), not the build outputs. Since the build inputs
are known before we do the build, we can combine these into a hash and query
the cache.
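A toy sketch of this input-addressed scheme (the function and field names are
my own for illustration, not Nix's actual store format): derive the cache key
from a canonical serialization of the build inputs, all of which are known
before the build runs.

```python
# Derive a cache key from build *inputs* (script, source URL + hash,
# dependency keys) rather than from the output bytes, so the key can be
# computed, and the cache queried, before ever building.
import hashlib, json

def cache_key(build_script, src_url, src_sha256, dep_keys):
    # Canonical serialization: same inputs always hash the same.
    blob = json.dumps(
        {"script": build_script,
         "src": [src_url, src_sha256],
         "deps": sorted(dep_keys)},
        sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

key = cache_key("./configure && make",
                "https://example.org/foo-1.0.tar.gz",
                "ab" * 32,
                ["zlib-key", "gcc-key"])
print(key[:16])  # query the binary cache with this key before building
```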

Since the whole point of using a cache is to avoid building things ourselves,
we also need a separate mechanism to trust/verify what the cache has sent us
(we could build it ourselves and compare, but then we might as well throw away
the cache entirely!). Nix allows certain GPG keys to be trusted, and checks
whether cached builds are signed.

Since caching mechanisms don't make use of reproducibility, and verification
mechanisms don't make use of reproducibility, such caches turn out not to
require byte-for-byte reproducibility. All that's required is that plugging in
the cached files gives a working result: in some cases that might be
practically the same as byte-for-byte reproducibility (e.g. a C library
compiled against certain ABIs at certain paths, etc.); others, like scripting
languages, might work despite all sorts of shenanigans (e.g. a Python file
might get converted into a different encoding; might get byte-compiled; might
get optimised; might even get zipped!)

~~~
maccam94
I think you missed something in the parent comment. Bazel can skip compiling
an output file if the hashes for its source code files + BUILD files have an
artifact in the remote (or local) cache. This requires reproducible builds or
else you could introduce build errors when your build environment changes.

~~~
chriswarbo
> Bazel can skip compiling an output file if the hashes for its source code
> files + BUILD files have an artifact in the remote (or local) cache.

This is what I mentioned in a sibling comment: we need some way to identify
binaries that doesn't rely on _their_ hash. Using the hash of their source
code files and build instructions is one way to identify things (Nix also does
this, as well as those of any dependencies (recursively)). A different
approach is to assign each binary an arbitrary name and version, which is what
Debian packages do; although this is less automatic and is more prone to
conflicting IDs.

> This requires reproducible builds or else you could introduce build errors
> when your build environment changes.

No, this only requires that builds are _robust_. For example, scripting
languages are pretty robust, since their "build" mostly just copies text files
around, and they look up most dependencies at run time rather than linking
them (or a hard-coded path) into a binary. Languages like C are more fragile,
but projects like autotools have been attempting to make their builds robust
across different environments for decades. In this sense, reproducibility is
just another approach to robustness.

Don't get me wrong, I'm a big fan of reproducibility; but caching build
artefacts is somewhat tangential (although not completely orthogonal).

------
tofflos
This seems to be the issue
[https://issues.apache.org/jira/browse/MNG-6276](https://issues.apache.org/jira/browse/MNG-6276)
for reproducible builds with Maven. I'm not sure what the current status is
given that all the issue links are closed.

Go in and vote for it anyway. I'm sure some of the Debian packages are written
in Java. ;-)

------
cmurf
I definitely understand the merit of making it super easy to verify, by simply
hashing the build ISO. But that also shifts the burden of forcing
deterministic results onto the build process. I wonder if, for a little more
complexity in verification, a lot less complexity would be needed on the build
side.

For example the ISO file system, typically ISO 9660 or UDF, will have a volume
UUID from a random number. Sure you can code a flag for mkfs to specify a
fixed number, that's easy. But then next that ISO typically contains a payload
in the form of a squashfs file. And quite a bit of work on squashfs has
happened to make sure file timestamps can be set to a known value. However, if
one build process uses xz level 3 compression, and another build process uses
xz level 7 compression, hashing will of course fail. The point is: be it inode
UUIDs, timestamps, or compression levels, there's a lot being measured that we
don't really care about, just to have a simple verification method at the back
end.
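The compression-level point can be demonstrated directly; here zlib stands in
for squashfs's xz (an illustration, not the actual image pipeline): the same
payload compressed at different levels decompresses to identical content, yet
hashes of the compressed bytes differ.

```python
# Same content, different compressor settings => different compressed
# bytes, so hashing the compressed image conflates "same content" with
# "same compression level".
import hashlib, zlib

payload = b"the same squashfs contents " * 1000

low  = zlib.compress(payload, level=1)
high = zlib.compress(payload, level=9)

assert zlib.decompress(low) == zlib.decompress(high) == payload
assert hashlib.sha256(low).digest() != hashlib.sha256(high).digest()
print("identical content, different compressed hashes")
```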

------
fizixer
Can anyone recommend a good way to install/maintain/remove new software on
distros like Debian or Ubuntu LTS? (By 'new' I mean versions newer than the
ones available through apt).

I never figured out a maintainable scheme. Some of my packages are in /opt,
some are in ~, some in ~/local, and some under GNU stow. And other than GNU
stow's half-assed uninstall method, it's always a pain in the neck when I have
to remove or upgrade any of these packages.

~~~
andrenth
One thing that works many times is to download the deb source package from a
newer distro and rebuild it in the desired system.

If there’s no newer package available you can try to reuse the debian
directory from the older source package to build from the newer source.

Sometimes this works on the first attempt, but if you need this for a lot of
packages, it'll be a lot of work.

------
crb002
Helps having Google as a client wanting reproducible builds.

------
getcrunk
Eli5?

------
peterkelly
One has to step back and ask: "What made you decide to introduce non-
determinism into your compilation process in the first place?"

~~~
pm215
Nobody sat down and said "yes, I'll make this nondeterministic". If you look
at the wiki page's sampling of different issues --
[https://wiki.debian.org/ReproducibleBuilds/Howto#Identified_...](https://wiki.debian.org/ReproducibleBuilds/Howto#Identified_problems.2C_and_possible_solutions)
\-- you'll see it's a mix of various things:

* output is deterministic but dependent on some aspect of the build environment (locale, hostname, etc)

* output is accidentally non-deterministic (eg "we put all the .html files into a tarball with a shell glob pattern, which gets you an order dependent on your filesystem implementation and the phase of the moon")

* a wide array of "output contains a timestamp" issues
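The glob-ordering and timestamp bullets above can be sketched in Python
(a simplified illustration, not what any particular package actually does):
sorting the member list and pinning metadata makes the archive bytes a
function of file contents alone.

```python
# Build a tarball deterministically: fixed member order (not filesystem
# glob order) and pinned timestamps/ownership, so two runs over the same
# files produce byte-identical archives.
import io, tarfile

def deterministic_tar(paths):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in sorted(paths):        # fixed order, not glob order
            info = tar.gettarinfo(path)
            info.mtime = 0                # pin the timestamp
            info.uid = info.gid = 0       # pin ownership
            info.uname = info.gname = ""
            with open(path, "rb") as f:
                tar.addfile(info, f)
    return buf.getvalue()

# Demo files written for this sketch:
for name, data in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
    with open(name, "wb") as f:
        f.write(data)

# Input order no longer matters:
assert deterministic_tar(["b.txt", "a.txt"]) == deterministic_tar(["a.txt", "b.txt"])
```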

"Let's put the timestamp into our version string/a generated file so the user
knows when it was built" is a really common thing, and it seemed like purely a
nice convenience feature until the concept of 100% binary-reproducible builds
became a current concern.

~~~
Karunamon
Curious: What's the usual fix for this class of problem? Is it possible to
flag certain bits of data as known-to-change and evaluate the rest of the
build in isolation?

Example: Say you have a version string that outputs a build time. Can you hash
the program with just that bit of string data marked as unknown (or that
string table entry replaced with a placeholder, as far as our verifier is
concerned) and verify the rest of the program is unchanged across builds?
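One way to implement what's described, as a sketch: mask the byte regions
known to vary (here a hypothetical `Built: <timestamp>` string) before
hashing, then compare the normalized hashes across builds. In practice the
reproducible-builds effort mostly went the other way and removed the varying
data at build time, since masking requires everyone to agree on exactly what
to mask.

```python
# Replace the known-variable field with a fixed placeholder, then hash
# the rest of the binary for comparison across builds.
import hashlib, re

def normalized_hash(binary):
    masked = re.sub(rb"Built: [0-9T:\-]+", b"Built: XXXX", binary)
    return hashlib.sha256(masked).hexdigest()

build1 = b"\x7fELF...code...Built: 2019-03-04T10:00:01...more code"
build2 = b"\x7fELF...code...Built: 2019-03-05T22:13:37...more code"

# The raw binaries differ, but the normalized hashes match:
assert normalized_hash(build1) == normalized_hash(build2)
```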

~~~
jefftk
Typically you just stop including the build time. It's a deterministic build,
so why would the build time matter?

~~~
lucb1e
I was going to say "because it's convenient", but I realized that you're
right: if it's identical anyway, the release date is as good a timestamp of
the code as you're going to get. If for some sysadmin reason the build time is
relevant, file metadata can tell you.

------
neatcoder
What software currently has 100% reproducible builds?

~~~
rvp-x
NetBSD is 100% reproducible.
[https://blog.netbsd.org/tnf/entry/netbsd_fully_reproducible_...](https://blog.netbsd.org/tnf/entry/netbsd_fully_reproducible_builds)
[https://tests.reproducible-builds.org/netbsd/netbsd.html](https://tests.reproducible-builds.org/netbsd/netbsd.html)

~~~
diffeomorphism
To be fair:

> 56 (100.0%) out of 56 built NetBSD files were reproducible

That is not really comparable to debian's 26475/28522 for buster
[https://tests.reproducible-builds.org/debian/reproducible.ht...](https://tests.reproducible-builds.org/debian/reproducible.html)

------
jwiley
My experience at Dave & Buster's is around 80% reproducible, so I think
progress is possible.

~~~
nodesocket
FYI, Debian codenames are all based off Toy Story characters. See
[https://unix.stackexchange.com/questions/222394/linux-debian...](https://unix.stackexchange.com/questions/222394/linux-debian-codenames)

------
trumped
Might be good progress but it still sounds very low to me as I didn't know
anything below 100% was possible... it sounds crazy to me (almost like
something that was introduced to be able to inject backdoors undetected).

~~~
jdblair
Every significant project I've worked on embedded the build host and build
time in the resulting executable or firmware image. This was along with other
static build information, like version number, compiler version and build
flags.

Once you make the sensible choice to include build time in the result you've
broken reproducibility. Fixing this means tracking down every package that
does this and removing the timestamp.
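The fix the Reproducible Builds project standardized for this is the
`SOURCE_DATE_EPOCH` environment variable: a build that wants to stamp a date
uses the pinned epoch instead of the current clock, so every rebuild embeds
the same time. A minimal sketch of the convention:

```python
# Honor SOURCE_DATE_EPOCH: use the pinned epoch when the packaging
# toolchain provides it, falling back to wall-clock time otherwise.
import os, time

def build_timestamp():
    epoch = os.environ.get("SOURCE_DATE_EPOCH")
    return int(epoch) if epoch is not None else int(time.time())

os.environ["SOURCE_DATE_EPOCH"] = "1552000000"  # set by the build system
print(build_timestamp())  # same value on every rebuild
```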

~~~
Fronzie
Why is including build time sensible?

If one has reproducible builds, wouldn't a commit/tag from the version control
system also do the job of traceability and reproducibility?

~~~
Gibbon1
What I've moved to is splatting that info into the binaries during the release
process. As far as I can tell there aren't standard tools to do that, though.
At least there weren't last time I looked.

Would be nice if there were. I think this is the root of issues such as
firmware with the same password/crypto keys across a whole product family
instead of unique ones.

