Just wanted to add that besides the security gains there's a real performance ga...

chriswarbo · on March 5, 2019

I think that's slightly different: if a build is reproducible, it means that the build products I get from my own machine are identical to those on a remote machine or cache (e.g. they have the same SHA hash).

However, that doesn't speed anything up, since I don't know what the hash is unless I actually do the build or download the file (and then hash the result). If a cache used such hashes as IDs, I would only know which file to fetch once I've already got it!

For such a cache to work there needs to be another mechanism for obtaining the IDs. As an example, Nix caches using the hash of build inputs (scripts, tarball URLs + hashes, etc.), not the build outputs. Since the build inputs are known before we do the build, we can combine these into a hash and query the cache.

Since the whole point of using a cache is to avoid building things ourselves, we also need a separate mechanism to trust/verify what the cache has sent us (we could build it ourselves and compare, but then we might as well throw away the cache entirely!). Nix allows certain GPG keys to be trusted, and checks whether cached builds are signed.

Since caching mechanisms don't make use of reproducibility, and verification mechanisms don't make use of reproducibility, such caches turn out not to require byte-for-byte reproducibility. All that's required is that plugging in the cached files gives a working result: in some cases that might be practically the same as byte-for-byte reproducibility (e.g. a C library compiled against certain ABIs at certain paths, etc.); others, like scripting languages, might work despite all sorts of shenanigans (e.g. a Python file might get converted into a different encoding; might get byte-compiled; might get optimised; might even get zipped!)

maccam94 · on March 5, 2019

I think you missed something in the parent comment. Bazel can skip compiling an output file if the hashes for its source code files + BUILD files have an artifact in the remote (or local) cache. This requires reproducible builds or else you could introduce build errors when your build environment changes.

chriswarbo · on March 7, 2019

> Bazel can skip compiling an output file if the hashes for its source code files + BUILD files have an artifact in the remote (or local) cache.

This is what I mentioned in a sibling comment: we need some way to identify binaries that doesn't rely on their hash. Using the hash of their source code files and build instructions is one way to identify things (Nix also does this, as well as those of any dependencies (recursively)). A different approach is to assign each binary an arbitrary name and version, which is what Debian packages do; although this is less automatic and is more prone to conflicting IDs.

> This requires reproducible builds or else you could introduce build errors when your build environment changes.

No, this only requires that builds are robust. For example, scripting languages are pretty robust, since their "build" mostly just copies text files around, and they look up most dependencies at run time rather than linking them (or a hard-coded path) into a binary. Languages like C are more fragile, but projects like autotools have been attempting to make their builds robust across different environments for decades. In this sense, reproducibility is just another approach to robustness.

Don't get me wrong, I'm a big fan of reproducibility; but caching build artefacts is somewhat tangential (although not completely orthogonal).

p1necone · on March 5, 2019

> I don't know what the hash is unless I actually do the build

Yes you do, this is the point of reproducible builds. The same source always produces the exact same binary.

chriswarbo · on March 6, 2019

No I don't. I can either:

- Do the build, then hash the result

- Download and hash a pre-built binary, but I have to trust whoever I get the file from

- Ask someone for the hash, but I have to trust them

If the build is reproducibile then I don't need to trust the second two options, since I can check them against the first. But in that case there's no point using the second two options, since I'm building it myself anyway.

If you take that quote in context, you'll see I'm talking about a (hypothetical) cache which uses the binaries' hashes as their IDs, i.e. to fetch a binary from the cache I need to know its hash.

In this scenario the first option is useless, since there's no point using a cache if we've already built the binaries ourselves.

The second two options in this scenario have another problem: how do we identify which binary we're talking about (either to download it, or to ask for its hash)? We need some alternative way of identifying binaries, for example Nix uses a hash of their source, Debian uses a package name + version number.

Yet if we're going to use some alternative method of identification, then we might as well cut out the middle man and have the cache use that method instead of the binaries' hashes!

The important point is that the parent was claiming that reproducible builds improve performance over non-reproducible builds because of caching. Yet nothing about such a cache requires that anything be reproducible! We can make a cache like this for non-reproducible builds too. Here are the three scenarios again:

- We're doing the build ourselves. Since we're not using the cache, it doesn't matter (from a performance perspective) whether our binary's hash matches the cached binary or not.

- We're downloading a pre-built binary. Since we must identify the desired binary using something other than its hash (e.g. the hash of its source), it doesn't matter what the binary's hash is, so it doesn't need to be reproducible. Pretty much all package managers work this way, it doesn't require reproducibility.

- We're asking someone for the hash, then fetching that from the cache. Again, we must identify what we're after using something other than the hash. The only thing we need for this scenario to work is that the hash we're given matches the one in the cache. That doesn't require reproducibility, it only requires knowing the hashes of whatever files happen to be in the cache. This is what we're doing whenever we download a Linux ISO and compare its hash to one given on the distro's Web site; no reproducibility needed.