Reproducible builds for Debian: a big step forward (qubes-os.org)
125 points by Foxboron 57 days ago | 55 comments



On the subject of reproducible Debian-based environments, I wrote apt2ostree[1]. It applies the cargo/npm lockfile idea to Debian rootfs images. From a list of packages we perform dependency resolution and generate a "lockfile" that contains the complete list of all packages, their versions, and their SHAs. You can commit this lockfile to git.

You can then install Debian or Ubuntu into a chroot just based on this lockfile and end up with a functionally reproducible result. It won't be completely byte identical as your SSH keys, machine-id, etc. will be different between installations, but you'll always end up with the same packages and package versions installed for a given lockfile.

This has saved us on a few occasions where an apt upgrade had broken the workflow of some of our customers. We could see exactly which package versions had changed in the git history and roll back the problematic package before working on fixing it properly. This is vastly better than the traditional `RUN apt-get install -y blah blah` you see in `Dockerfile`s. You know exactly what was installed before an update and exactly what is installed after, and you can rebuild old versions.

IMO it's also more convenient than debootstrap, as you don't need to worry about GPG keys, etc. when building the image. Dependency resolution and GPG key handling are done at lockfile-generation time, so the installation process can be much simpler. In theory it could be made such that only dpkg is required to do the install, rather than the whole of apt, but that's by-the-by.

apt2ostree itself is probably not interesting to most people, as it depends on ostree[2] and ninja, but I think the lockfile concept as applied to Debian repos could be of much broader interest.
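To make the lockfile idea concrete: a hypothetical entry per resolved package might look something like the following (for illustration only; see [1] for the real format):

  # one line per resolved package: name, exact version, hash of the .deb
  openssh-server 1:8.4p1-5 sha256:<hash-of-the-.deb>
  openssh-sftp-server 1:8.4p1-5 sha256:<hash-of-the-.deb>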

[1]: https://github.com/stb-tester/apt2ostree#lockfiles

[2]: https://ostreedev.github.io/ostree/


We all agree on how important Debian is to the whole tech community. Why is it so difficult to scale their snapshot service? Serving static files at scale is a solved problem. Am I missing something? Can't a cloud provider help them out?


It's very, very expensive. I don't know the details for Debian, but pypi.org, hosting Python packages, costs $800k/month: https://twitter.com/dstufft/status/1236331765846990848

I imagine Debian is also super expensive, so scaling it must not be easy. Every decision could mean thousands of dollars.


This lazy, constant pulling of dependencies by CI systems and containers is not very sustainable. PyPI should set up limits and make people use a caching proxy.


I worked at a place (briefly) where the CI process was pulling down 3-4GiB of container images for every run. Same repos, same sites, every check-in on every branch pushed to GitHub set up a blank environment and pulled everything down afresh. Then the build process yoinked dozens of packages, again with no cache. It must have consumed terabytes a day. Madness!


I am curious about how such a cache could be set up reliably.

1. Proxy inside the company network that intercepts every request to pypi? Doesn't work well due to https, I guess.

2. Replacing pypi in your project description with your own mirror? Might work, but at least slows down every build outside of your network. Also needs to be replicated for npm, cargo, maven, docker, ...

3. Start a userspace wrapper before starting the build that transparently redirects every request. That would be the best solution, IMO. But how do I intercept https requests of child processes? Technically, it must be possible, but is there such a tool?
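(For what it's worth, a partial version of 3. already exists without any wrapper tool: many, though not all, build tools honor the standard proxy environment variables plus a configurable CA bundle, so a local MITM-style caching proxy can work per process. A sketch, with placeholder paths:)

  export HTTPS_PROXY=http://127.0.0.1:3128        # local caching proxy
  export SSL_CERT_FILE=/etc/ci/proxy-ca.pem       # OpenSSL-based tools
  export REQUESTS_CA_BUNDLE=/etc/ci/proxy-ca.pem  # Python requests/pip
  export NODE_EXTRA_CA_CERTS=/etc/ci/proxy-ca.pem # Node.js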


> Also needs to be replicated for npm, cargo, maven, docker, ...

All these frameworks come with a tool to run an internal repo/mirror.

There are also commercial offerings (like Artifactory) that cover all these languages in a single tool.

For Python, just set PIP_INDEX_URL in the CI environment to point to the internal mirror and it will be used automatically. It's very easy.

By default the downloaded wheels are cached in ~/.cache/pip (on Linux). Isolated builds don't benefit from it if they start with a fresh filesystem on every run, so consider mounting a writable directory to use as a cache; the speedup is more than worth it.
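A minimal sketch of both, assuming a hypothetical internal mirror URL and a writable CI volume mounted at /ci-cache:

  export PIP_INDEX_URL=https://pypi.internal.example/simple/   # internal mirror
  export PIP_CACHE_DIR=/ci-cache/pip    # wheel cache survives fresh checkouts
  pip install -r requirements.txt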


Just so. It does take a little configuration. The system I was talking about had been band-aided since the company's two-person days, when I'm sure the builds were a fraction of the size. Good example of infra tech debt.


For 2.:

pip can be configured to use a local warehouse, and there are warehouses that transparently proxy to PyPI but cache any previously fetched result, e.g. https://pypi.org/project/proxypypi/

Since you control it, and it's read only from the outside, you can actually expose it even outside of your network.
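For example, baking it into pip's configuration so every build picks it up (the hostname is a placeholder):

  # /etc/pip.conf
  [global]
  index-url = https://pypi-cache.example.com/simple/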

But indeed, it must be replicated for npm, cargo, maven, docker...

There is a startup idea here :)


Use Nix instead. :P


Agreed. In fact, if you make more than 100 requests/minute from the same IP, you should get throttled. If you want out of it, you should pay.


Setting up your own proxy that covers pretty much everything (maven/npm/nuget/pypi/gems/docker/etc.) is not difficult and takes only a few hours of work. I went for Sonatype Nexus, but many (most?) cases can even be covered with an nginx caching proxy.
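For a single upstream, the nginx variant is roughly this (a rough sketch: TLS setup is omitted, and a full PyPI mirror would also need the index's file URLs rewritten to point at the cache):

  proxy_cache_path /var/cache/nginx/pkgs keys_zone=pkgs:10m max_size=50g inactive=30d;
  server {
    listen 80;
    location / {
      proxy_pass        https://files.pythonhosted.org;
      proxy_set_header  Host files.pythonhosted.org;
      proxy_cache       pkgs;
      proxy_cache_valid 200 30d;   # released artifacts are immutable
    }
  }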


This didn't go so well for Docker Inc. when they tried this with Docker Hub.


Doesn't matter; the PSF is a non-profit. If people start trying to avoid the throttle by using alternatives, it just saves money.


AIUI, PyPI doesn't _actually_ cost that much to host: it's all donated/sponsored hosting (with data largely served from Fastly), and the "cost", I'd expect, is what it would cost if they actually had to pay the normal published prices.


Is that with at least some attempt at building a CDN? Generally, cloud providers don't charge for traffic between hosts in the same availability zone. One could think about putting a mirror into each major availability zone of the various cloud providers. The main service would then only be used to issue HTTP redirects to the appropriate mirror, or, if none exists or the package isn't replicated there yet, just answer directly.

Even if such a system isn't built, with that kind of money on the table you could get a team of FAANG-scale developers to build it for you.


All the CI/CD build agents with no cache, and so on. This is a general problem for all of tech. For the web, caching is cheap, but as far as I know there is no equally cheap way to cache builds.

I think there needs to be a redesign of how dependencies work in most programming languages. Deterministic builds have been such a game changer, and I think the CPU-vs-bandwidth trade-off may be the next big area to explore when it comes to compiling code.


Isn't Debian opinionated about the freedom of the stack it stands on? Would the community be happy to build a dependency on a vendor?


Debian is opinionated about software freedom - thankfully.

But it's OK to accept donations in hardware or money as long as there are no strings attached.


Not to detract from the article, but given:

> It took 3-4 months to get 4.2 TB of data

Maybe send an email and ask someone to mail a couple of hard drives? (offering to reimburse hw, labour and shipping obviously).


Can someone please elucidate the _benefits_ of reproducible builds? Perhaps I am missing something trivial?

thank you kindly!


- it's a great way to be sure binaries aren't backdoored or built from unofficial sources

- in general if you want to do additional development on a piece of software, it's useful to validate that your build environment is set up correctly and can reproduce bit-exact binaries, before doing additional development

- it helps prevent silent environment-specific build dependencies creeping in, aka "works on my machine"


The last point is exceedingly useful. If you build lots of in-house packages, you may have forgotten what depends on what, and when you go to do a 'clean room' upgrade to a newer version of X, suddenly Y and Z don't work, or X must be built first. Of course these simpler things are handled by proper dependencies, but if you always build on a box designated "we build packages here", those lines can blur. Doing a from-scratch reproducible build enforces getting those things right from the start.


It ensures two important things.

The first is that it forces any maintainer who tries to backdoor a package to put that backdoor somewhere visible in the source or the build process. Hence it becomes possible to gain real confidence from an audit of a package.

This first goal is achieved very simply: you rebuild the package and check that the resulting binary is bit-for-bit identical. With reproducible builds, the same build process should lead to the same binary, so if someone tampered with the source code or the build process, it can be detected.

The second goal is to just have a consistent system which makes debugging easier.

A nice example of the importance of reproducible builds is the app Signal, which I mention because of your username. We trust the source code of Signal, but how are we supposed to be sure that the binaries offered by apt, the Google Play store, the App Store, or whatever source you have for Signal, are actually made from that source code? With reproducible builds, one person can rebuild the binary and check that the output is the same. This is much better than telling everyone to build Signal from source, because now the lazy people who trust the binaries get to benefit from the skeptical people who build from source.
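Once builds are reproducible, the check itself is a one-liner; a sketch (file names made up, and for Android in practice the store's signature block has to be stripped before comparing):

  sha256sum signal-from-store.apk signal-rebuilt-from-source.apk
  # identical hashes => the shipped binary matches the published source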


It's very important in FOSS, IMO.

There is a FOSS project called ungoogled-chromium, and building Chromium takes a long time (in my case, 8 hours). The problem here is that the Chromium build is not reproducible, and the repo author can't build Chromium for every platform and every version, as it's very expensive for him. So what he currently does is accept binaries from others.

But as a user, how can I trust the person who built the binaries? He might have tampered with them? He might have secretly inserted backdoors? Just because he has published binaries for, let's say, 10 years doesn't grant 100% trustworthiness, right? He might have been hacked, etc., too?

I think reproducible builds fix such problems. Other than the trust factor, I really don't think there is a significant advantage.



Very strange.

To confirm, I went to https://github.com/Eloston/ungoogled-chromium and it says: "NOTE: These binaries are provided by anyone who are willing to build and submit them. Because these binaries are not necessarily reproducible, authenticity cannot be guaranteed;"


I don’t follow your example. Isn’t compile(src, platform) → binary a one-way function?

If someone kindly hands the repo author a (binary, platform) pair how would the author ever verify the original src value, other than re-doing the 8 hour compile themselves?


> I don’t follow your example. Isn’t compile(src, platform) → binary a one way function?

Of course not! It's more like `compile(src, platform, compiler, compiler version, CPU features, current time, current date, your whole file system tree, compiler caches, ...) → binary`.

There are also downright evil build systems that fetch random sources from the internet; then practically the whole world becomes your function's closure. Even if you manage to fix or exclude these variables, you can still end up with something that's not reproducible at the bit level: for example, unordered data structures result in binary output that varies from execution to execution depending on the state of the compiler.
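A tiny demonstration of the "current time" input, assuming GCC (SOURCE_DATE_EPOCH is the reproducible-builds.org convention for pinning it):

  printf 'const char *built = __DATE__ " " __TIME__;\n' > ts.c
  gcc -c ts.c && sha256sum ts.o    # the hash changes from one build to the next
  # pinning the clock restores determinism (honored by GCC >= 7):
  SOURCE_DATE_EPOCH=0 gcc -c ts.c && sha256sum ts.o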

> If someone kindly hands the repo author a (binary, platform) pair how would the author ever verify the original src value, other than re-doing the 8 hour compile themselves?

If the build is bit-level deterministic, it produces a fixed output for a fixed input. So you simply hash the output, sign the hash, and ship it along with the build. The user then checks that the hash matches the binary they received, even from an untrusted source.
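In commands, the whole scheme is roughly this (a sketch using GPG; any signing tool works, and the file names are placeholders):

  # producer: hash the deterministic output and sign the hash
  sha256sum app.bin > SHA256SUMS
  gpg --detach-sign --armor SHA256SUMS
  # consumer: verify the signature, then the binary, fetched from any mirror
  gpg --verify SHA256SUMS.asc SHA256SUMS && sha256sum --check SHA256SUMS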


Isn't that what things like apt do?

My Packages file contains the following about a given package, which apt will verify:

  SHA256: 04cd079a0676438a8fe1bdf2d897927ead9aa689808f4209d41edadfc3f72f0e
  SHA1: 2073306c14d4d755d8a32dabce0623e0d9ca3dde
  MD5sum: f49c2e53ff64d2568f3cb8b6fcb4b112
The Release file has the MD5, SHA1 and SHA256 of the Packages files:

  9ff4da44240de168f7b98d1461d2d96c 65274 main/binary-amd64/Packages
  dc9007fe54d7f26d0922a377c81e45d9 17568 main/binary-amd64/Packages.gz
  fd2c8c93e1590e1fc5bc797b20b9ddd670171f1856326a8ba7816f2a6753b779 65274 main/binary-amd64/Packages
  06159def8d253ed6f3401cc797bf0f2743b4623a16f996815bf8ffde6f79bfaf 17568 main/binary-amd64/Packages.gz

And Release itself has a .gpg signature

  -----BEGIN PGP SIGNATURE-----
  Version: GnuPG v2
  
  iQEcBAABCAAGBQJhTyrcAAoJECF64EwcjMKfNvUH/RSnkcd41AlOHrrnsDc1P0YL
  MbV0RrhmHnQYCdO4VvE/C0BTOFstvF0bRDcnCHzEyaU2LvZ7pQFsOTgQqaNjuxSd
  NyVTIfhwg00AGsEN/MIUZOd2jZLHQQ8esDv0eOCcpO+q4UmBtP6unTCjmdkMekkx
  SGA/ClfNA1Ql48ZiXMAhn87Vpxvcl4IFtdOV5UZmMVGtMaX1bp7u8Ifz1bm2xbkx
....

Of course I still have to trust the person who controls the GPG signing key, but how does that differ from:

> If the build is bit-level deterministic it results in a fixed output given a fixed input. So, you simply hash the output, sign it and ship it along the build. The user will simply check if the hash matches the binary they received, even from an untrusted source.

Or is it that 3/4/10 independent people can independently build the same package, concur that the checksums of the built package are all the same, and be confident that none of them are being coerced or hacked? So it's a way of verifying the person building the package in the first place (any coercion/hacking would have to apply to everyone confirming the build, which won't be everyone using it, but would be a dozen people in a dozen jurisdictions).


> My Packages file contains the following about a given package that apt will verify

Those hashes and that signature confirm that the package is coming from a maintainer, but they can neither confirm nor deny that the package you received was actually built from the sources the maintainer claims to have used, simply because there's no guaranteed bit-level link between sources and output.

> Or is it that 3/4/10 independent people can independently build the same, concur that the checksums of the built package are all the same, and be confident that none of them are being coerced or hacked?

Yes, the improvement over traditional packages is that, given the package inputs (sources, dependencies, build instructions) and the output, you can be sure that the output is the result of building those inputs. You don't actually need multiple independent people: just rebuild the package and compare the hashes. Again, you could try to do this with APT and it will probably work for some very simple software (or for packages the Debian maintainers have managed to make reproducible), but not in general.
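A sketch of that check with standard Debian tooling (the package name is just an example; in practice you also have to recreate the original build environment):

  apt-get source hello            # fetch the sources the archive claims were used
  cd hello-*/
  dpkg-buildpackage -us -uc -b    # rebuild the binary package locally
  sha256sum ../hello_*.deb        # compare against the SHA256 in the Packages file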


If the maintainer has 10 people they 99% trust, and those people seem unlikely to collude, and they all produce builds with identical SHA-256 sums, that's much better than picking one build from one of those people.


Yes, it's a one-way function. And if the builds are reproducible, this function is deterministic.

The way to verify is to redo the entire compile process. But the advantage is that anyone can do this, so you can have other users in the community validating the builds.


The author can't verify and the user can't verify; that's why we need reproducible builds.

Without reproducible builds, sha256(binary_from_user1) != sha256(binary_from_user2).

For a smaller project, distributing the source code is usually the correct choice, as auditing source code is better than auditing binaries. But for a project like Chromium, I think reproducible builds are essential, because you can't force everybody to compile and waste hours, right?


The media libraries are what take up most of those 8 hours.


...which raises the question: shouldn't those be built independently as shared libraries, to avoid rebuilding the world?


The most important benefit is that if a build is reproducible, then a backdoored build is reproducible. The security benefits and security implications of being able to reproduce what bad guys have done are huge.

And by having this "threat" against the bad guys, which consists in saying: "If you poison the well, we'll find how you did it", suddenly the bad guys are having a much harder time being bad.


This is a bit more niche than what the answers already posted mention, but reproducible builds ensure reproducible output from said software. This is important in scientific software as — presuming you have a deterministic model — you don’t want the same inputs giving different outputs because of some weird build artefact.


I would turn that around: if your builds are irreproducible, how would you troubleshoot build errors?


They are important in the security world, to prove that the delivered executable corresponds to the source code.

E.g. nobody added a keylogger after publishing the source but before compiling the executable.



People already listed:

- security against backdoors

- validating build environment

- reproducing bugs

But there's more:

- legal liability: many bad build systems pull trees of dependencies from the Internet during the build. How can you prove that no license breach occurred in any of the dependencies? Debian explicitly verifies licensing in each package.

- prevent pulling dependencies from deleted (or hijacked!) repositories on the Internet

- reproducing performance improvements: non-reproducible builds can often lead to non-reproducible performance. Even things like the length of the local hostname can leak into a binary and affect memory alignment.
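On that last point: the build directory is an equally classic leak, easy to see with GCC, since debug info records the compilation directory (fixable with -ffile-prefix-map):

  printf 'int main(void) { return 0; }\n' > a.c
  mkdir -p /tmp/b1 /tmp/b2 && cp a.c /tmp/b1/ && cp a.c /tmp/b2/
  (cd /tmp/b1 && gcc -g -c a.c); (cd /tmp/b2 && gcc -g -c a.c)
  sha256sum /tmp/b1/a.o /tmp/b2/a.o   # differ: each object embeds its build path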


An article arguing that reproducible builds are a lot of effort and the benefits are not that big: https://blog.cmpxchg8b.com/2020/07/you-dont-need-reproducibl...


The main benefit of reproducible builds isn't security, but sanity. There is just no reason why your build process should produce different output for the same input. If it does, that just means it's broken in subtle ways and information is leaking into it that was never intended to be there.

The real benefit of reproducible builds, however, isn't for the individual piece of software itself, but for the software landscape as a whole, since with reproducible builds you have the whole dependency chain specified completely from top to bottom: no more hidden dependencies. And everything is fully automated, not just on your personal machine, but in a way that others can reproduce. That in turn will dramatically improve the ease with which users can build software (and change it), as it turns an hour-long hunt for dependencies into a single click.

To really see the fruits of this labor will still take some years, but it has the potential to pretty drastically reshape and improve the way FOSS software works (e.g. Nix Flakes) and actually allow users to make use of their freedom instead of giving up before they have even managed to build the software.


This blog post can essentially be summarized with an XKCD.

It positions itself with the following assertions:

> Q. If a user has chosen to trust a platform where all binaries must be codesigned by the vendor, but doesn’t trust the vendor, then reproducible builds allow them to verify the vendor isn’t malicious.

> I think this is a fantasy threat model. If the user does discover the vendor was malicious, what are they supposed to do? The malicious vendor can simply refuse to provide them with signed security updates instead, so this threat model doesn’t work.

This only works in the context of proprietary vendors, not in the context of FOSS distributions, where nothing can be denied because everything is freely distributed. You want the ability to verify the work done by packagers and build servers.

Next up is essentially the claim that "reproducible builds can't solve bugdoors, thus they're insufficient to solve any problems".

But this is essentially just an XKCD argument; https://xkcd.com/2368/

Reproducible builds are a nice property of any build system for multiple reasons. They're also part of the supply-chain security story, not the entire story on their own. As for how much effort it is? It's a lot. But considering that the core community of reproducible-builds people is below 50 people, and we are still able to get close to 88% reproducible builds in real-world distributions, that should show how achievable this goal is.

https://reproducible.archlinux.org/


> Which only works in the context of proprietary vendors and not in the context of FOSS distributions. Nothing can be denied as everything is freely distributed. You want to have the ability to verify the work done by packagers and build servers.

Indeed. Open source isn't a vendor; it's just volunteers from the Internet. If you look at the way people behave in nearly every other walk of life that isn't software, it seems amazing that we somehow managed to create such a beautiful, thriving gift economy. In order to keep it that way, we need reproducible builds, because they promote transparency.

It is, however, very, very expensive. Even if you manage to surgically remove all the dependencies on things like __TIME__, you still need to audit the code for things like iterating over hash tables. Many core libraries these days, such as the expat XML parser, will seed their data structures using /dev/random. You can qsort, but you might get snagged by the fact that it isn't a stable sort. If things don't work out, it helps to have deterministic execution too, so you can run the build and see what it does differently that causes it to produce different output. Sadly, with things like kernel-imposed memory randomization, that's easier said than done.

So the unfortunate reality is that open source isn't arcing towards determinism. The trend is very much the opposite: it's becoming more non-deterministic, especially in the last five years. So if having deterministic builds is something your org cares about, then you really need to hire an expert, and in many cases change the engineering culture too.


As another data point for achievability:

"bookworm [the next Debian stable release] on amd64 is 95.6% reproducible right now! "

https://isdebianreproducibleyet.com/


Sadly, they are not "real" numbers.

This is taken from the integration suite which Debian has been running for years. It represents checking out the code and building it twice; it does not test the packages Debian actually distributes. This inflates the number a little bit.

Holger explains this in a thread a few years back. https://lists.debian.org/debian-devel/2019/03/msg00017.html


Actual numbers for bullseye currently look like this:

https://debian.notset.fr/rebuild/results/bullseye_full.amd64...


Why are 28.8% of Debian stable packages currently "pending"?

The stats on reproducible-builds.org say that "29595 packages (95.7%) successfully built reproducibly in bullseye/amd64."[0] which may not be accurate for the reasons given in the grandparent post, but I note that the situation seems to have improved significantly[1] since that mailing list thread.

[0] https://tests.reproducible-builds.org/debian/bullseye/index_...

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900837#17


>Why are 28.8% of Debian stable packages currently "pending"?

They are separate systems. The CI/CD suite and the rebuilder are essentially not doing the same job.

The 28.8% pending might be because of slow rebuild times, as the integration system has had more time to build packages, with significantly more CPU power behind it.


I guess the release date was as recent as 2021-07-23, and it's conceivable that such a rebuilding project being run only with spare CPU cycles might take months (or might only have started recently).

As an interesting data point, I see that the number of pending packages is now 21.2%. Let's hope that most of these remaining packages (and those being retried) turn out to be reproducible.


Such rebuild systems are complicated and might not work 100% when you first start out. There can be multiple complete rebuilds done while removing edge-case bugs that either fail the build or introduce variance into the builds.


That's some helpful context, thank you.

As a further analysis, I note that, excluding the current "pending" builds, the "reproducible" segment accounts for 93% of all packages so far.

I don't know if it's reasonable to assume that the "pending" packages are a representative sample in terms of their reproducibility, or how likely the "retry" packages are to succeed, but I'm hopeful that in a few days the "reproducible" stat will pass 90% for real.


As predicted, the "reproducible" segment on that pie chart[0] is now at 90.1% with a few percent of packages still pending or being retried.

[0] https://debian.notset.fr/rebuild/results/bullseye_full.amd64...



