You can then install Debian or Ubuntu into a chroot just based on this lockfile and end up with a functionally reproducible result. It won't be completely byte identical as your SSH keys, machine-id, etc. will be different between installations, but you'll always end up with the same packages and package versions installed for a given lockfile.
This has saved us on a few occasions where an apt upgrade had broken the workflow of some of our customers. We could see exactly which package versions changed in git history and roll back the problematic package before working on fixing it properly. This is vastly better than the traditional `RUN apt-get install -y blah blah` you see in `Dockerfile`s: you know exactly what was installed before an update and exactly what is installed after, and you can rebuild old versions.
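Even without apt2ostree, plain apt supports pinning with the `pkg=version` syntax, so a lockfile-driven install can be sketched directly in a Dockerfile (package name and version string below are illustrative, not from a real lockfile):

```dockerfile
# Unpinned: installs whatever is current at build time, so rebuilds drift
RUN apt-get update && apt-get install -y curl

# Pinned from a lockfile: every rebuild installs exactly these versions
# (the version string is illustrative)
RUN apt-get update && apt-get install -y curl=7.88.1-10+deb12u5
```

The pinned form fails loudly if the mirror no longer carries that version, which is exactly the signal you want in CI.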
IMO it's also more convenient than debootstrap as you don't need to worry about gpg keys, etc. when building the image. Dependency resolution and gpg key stuff is done at lockfile generation time, so the installation process can be much simpler. In theory it could be made such that only dpkg is required to do the install, rather than the whole of apt, but that's by-the-by.
apt2ostree itself is probably not interesting to most people as it depends on ostree and ninja but I think the lockfile concept as applied to debian repos could be of much broader interest.
I imagine Debian is also super expensive to serve, and so scaling that must not be easy. Every decision could mean thousands of dollars.
1. Proxy inside the company network that intercepts every request to pypi? Doesn't work well due to https, I guess.
2. Replacing pypi in your project description with your own mirror? Might work, but at least slows down every build outside of your network. Also needs to be replicated for npm, cargo, maven, docker, ...
3. Start a userspace wrapper before starting the build that transparently redirects every request. That would be the best solution, IMO. But how do I intercept https requests of child processes? Technically, it must be possible, but is there such a tool?
All these frameworks come with a tool to run an internal repo/mirror.
There are also commercial offerings (like artifactory) that cover all these languages in a single tool.
For python, just set PIP_INDEX_URL in the CI environment to point to the internal mirror and it will be used automatically. It's very easy.
By default the downloaded wheels are cached under ~/.cache/pip (on Linux). Isolated builds don't benefit from this if they start with a fresh filesystem on every run; consider mounting a writable directory to use as the cache, the speedup is more than worth it.
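Both settings can also live in a `pip.conf` so every build picks them up without per-job environment tweaks; the URL and path below are placeholders for your own infrastructure (the same values are available as the `PIP_INDEX_URL` and `PIP_CACHE_DIR` environment variables):

```ini
# /etc/pip.conf (or ~/.config/pip/pip.conf); values are placeholders
[global]
index-url = https://pypi.internal.example.com/simple
cache-dir = /ci-cache/pip
```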
pip can be configured to use a local warehouse, and there are warehouses that transparently proxy to pypi but cache any previously fetched result, e.g. https://pypi.org/project/proxypypi/
Since you control it, and it's read only from the outside, you can actually expose it even outside of your network.
But indeed, it must be replicated for npm, cargo, maven, docker...
There is a startup idea here :)
Even if such a system isn't built, with that kind of money on the table you could get a team of FAANG-scale developers to build it for you.
I think there needs to be a redesign in how dependencies work in most programming languages. Deterministic builds have been such a game changer and I think that CPU vs bandwidth may be the next big area to explore when it comes to compiling code.
But it's OK to accept donations in hardware or money as long as there are no strings attached.
> It took 3-4 months to get 4.2 TB of data
Maybe send an email and ask someone to mail a couple of hard drives? (offering to reimburse hw, labour and shipping obviously).
thank you kindly!
- in general if you want to do additional development on a piece of software, it's useful to validate that your build environment is set up correctly and can reproduce bit-exact binaries, before doing additional development
- it helps prevent silent environment-specific build dependencies creeping in, aka "works on my machine"
The first is that it makes sure any maintainers who try to backdoor a package need to document that fact somewhere in the build process. Hence it becomes possible to get more confidence from an audit of a package.
This first goal is done very simply. You rebuild the package, and check that the resulting binary is bit-for-bit identical. With reproducible builds, the same build process should lead to the same binary. So if someone tampered with the source code or the build process, that can be detected if you have reproducible builds.
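The check itself is tiny; a minimal sketch in Python (file names are illustrative):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def builds_match(official: str, rebuilt: str) -> bool:
    """Reproducibility check: the rebuilt artifact must be
    bit-for-bit identical to the published one."""
    return sha256_of(official) == sha256_of(rebuilt)
```

Any mismatch, even a single flipped bit, changes the digest, so tampering with the source or the build process is detectable by anyone who can rebuild.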
The second goal is to just have a consistent system which makes debugging easier.
A nice example of the importance of reproducible builds is the app signal, which I mention because of your user-name. We trust the source code of signal, but how are we supposed to be sure that the binaries offered by apt, the google play store, the app store, or whatever source you have for signal, are actually made from the source code? With reproducible builds, one person can rebuild the binary and check to see that the output is the same.
This is much better than telling everyone to build signal from source, because now the lazy people who trust the binaries get to benefit from the skeptical people who build from source.
There is a FOSS project called ungoogled-chromium, and building Chromium takes a lot of time (in my case 8 hours). The problem here is that the Chromium build is not reproducible, and the repo author can't build Chromium for every platform and every version, as it's very expensive for him. So what he currently does is accept binaries from others.
But as a user, how can I trust the user who built the binaries? He might have tampered with them. He might have secretly inserted backdoors. Just because he has published binaries for, let's say, 10 years doesn't grant 100% trustworthiness, right? He might have been hacked, etc., too.
I think reproducible builds fix such problems. Other than the trust factor I really don't think there is a significant advantage.
To confirm I went to https://github.com/Eloston/ungoogled-chromium it says "NOTE: These binaries are provided by anyone who are willing to build and submit them. Because these binaries are not necessarily reproducible, authenticity cannot be guaranteed;"
If someone kindly hands the repo author a (binary, platform) pair how would the author ever verify the original src value, other than re-doing the 8 hour compile themselves?
Of course not! It's more like `compile(src, platform, compiler, compiler version, CPU features, current time, current date, your whole file system tree, compiler caches, ...) → binary`.
There are also absolutely evil build systems that fetch random sources from the internet; then practically the whole world becomes your function closure. Even if you manage to fix or exclude these variables, you could still end up with something that's not reproducible at the bit level: for example, unordered data structures result in binary output that varies from execution to execution depending on the state of the compiler.
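A minimal Python illustration of that last point: the iteration order of an unordered container leaks into the output unless you impose an order yourself.

```python
# Iterating a set of strings depends on the per-process hash seed
# (PYTHONHASHSEED), so naively serializing one is not reproducible.
symbols = {"main", "init", "helper", "cleanup"}

unstable = ",".join(symbols)        # order can differ between runs
stable = ",".join(sorted(symbols))  # fixed order: reproducible output
```

The same pattern (sort before emitting) is how many build tools were made reproducible: the fix is trivial once found, but finding every such spot in a large codebase is the expensive part.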
> If someone kindly hands the repo author a (binary, platform) pair how would the author ever verify the original src value, other than re-doing the 8 hour compile themselves?
If the build is bit-level deterministic it results in a fixed output given a fixed input. So, you simply hash the output, sign it and ship it along the build. The user will simply check if the hash matches the binary they received, even from an untrusted source.
My Release file contains the following entries about the Packages index, which apt will verify:
MD5Sum:
 9ff4da44240de168f7b98d1461d2d96c 65274 main/binary-amd64/Packages
 dc9007fe54d7f26d0922a377c81e45d9 17568 main/binary-amd64/Packages.gz
SHA256:
 fd2c8c93e1590e1fc5bc797b20b9ddd670171f1856326a8ba7816f2a6753b779 65274 main/binary-amd64/Packages
 06159def8d253ed6f3401cc797bf0f2743b4623a16f996815bf8ffde6f79bfaf 17568 main/binary-amd64/Packages.gz
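What apt does with those entries can be sketched in a few lines of Python: hash the downloaded index and compare it, along with its size, against the Release entry (the function name is mine, not apt's):

```python
import hashlib

def verify_index(release_entry: str, data: bytes) -> bool:
    """release_entry is a line like
    '<sha256-hex> 65274 main/binary-amd64/Packages' (digest, size, path).
    apt conceptually checks both the size and the digest of the
    downloaded index file against it."""
    digest, size, _path = release_entry.split()
    return len(data) == int(size) and hashlib.sha256(data).hexdigest() == digest
```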
And Release itself has a .gpg signature
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
Of course I still have to trust the person who controls the gpg signing key, but how is that different from this:
> If the build is bit-level deterministic it results in a fixed output given a fixed input. So, you simply hash the output, sign it and ship it along the build. The user will simply check if the hash matches the binary they received, even from an untrusted source.
Or is it that 3/4/10 independent people can independently build the same package, concur that the checksums of the built package are all the same, and be confident that none of them are being coerced or hacked? So it's a way of verifying the person building the package in the first place (any coercion/hacking would have to apply to everyone confirming the build, which won't be everyone using it, but could be a dozen people in a dozen jurisdictions).
Those hashes and signature confirm that the package is coming from a maintainer, but it can't confirm nor deny that the package you received was actually built from the sources the maintainer claims to have used. Simply because there's no guaranteed bit-level link between sources and output.
> Or is it that 3/4/10 independent people can independently build the same, concur that the checksums of the built package are all the same, and be confident that none of them are being coerced or hacked?
Yes, the improvement compared to traditional packages is that, given the package inputs (sources, dependencies, build instructions) and the output, you can be sure that the output is the result of building the inputs. You don't actually need multiple independent people: just rebuild the package and compare the hashes. Again, you could try to do this with APT and it will probably work for some very simple software (or some that the Debian maintainers managed to make reproducible), but not in general.
The way to verify is to redo the entire compile process. But the advantage is that anyone can do this, so you can have other users in the community validating the builds.
Without reproducible builds,
sha256(binary_from_user1) != sha256(binary_from_user2)
even when both built from the same source, so the community can't cross-check each other's binaries.
For smaller projects, distributing source code is usually the correct choice, as auditing source code is better than auditing binaries. But for a project like Chromium I think reproducible builds are essential, because you can't force everybody to compile and waste hours, right?
And by having this "threat" against the bad guys, which consists in saying: "If you poison the well, we'll find how you did it", suddenly the bad guys are having a much harder time being bad.
E.g. nobody added a keylogger after publishing the source but before compiling the executable.
- security against backdoors
- validating build environment
- reproducing bugs
But there's more:
- legal liabilities: many bad build systems pull trees of dependencies from the Internet during the build. How can you prove that no license breach occurred in any of the dependencies? Debian explicitly verifies licensing in each package.
- prevent pulling dependencies from deleted (or hijacked!) repositories on the Internet
- reproduce performance improvements: non-reproducible builds can often lead to non-reproducible performance. Even things like the length of the local hostname can leak into a binary and affect memory alignment.
The real benefit of reproducible builds, however, isn't for the individual software itself, but for the software landscape as a whole: with reproducible builds the whole dependency chain is specified completely from top to bottom, with no more hidden dependencies. And everything is fully automated, not just on your personal machine, but in a way that others can reproduce. That in turn will dramatically improve the ease with which users can build software (and change it), as it turns an hour-long hunt for dependencies into a single click.
To really see the fruits of this labor will still take some years, but it has the potential to pretty drastically reshape and improve the way FOSS software works (e.g. Nix Flakes) and actually allow users to make use of their freedom instead of giving up before they have even managed to build the software.
It positions itself with the following assertions:
> Q. If a user has chosen to trust a platform where all binaries must be codesigned by the vendor, but doesn’t trust the vendor, then reproducible builds allow them to verify the vendor isn’t malicious.
> I think this is a fantasy threat model. If the user does discover the vendor was malicious, what are they supposed to do? The malicious vendor can simply refuse to provide them with signed security updates instead, so this threat model doesn’t work.
Which only works in the context of proprietary vendors, not in the context of FOSS distributions: nothing can be denied, as everything is freely distributed. You want to have the ability to verify the work done by packagers and build servers.
Next up is essentially the claim that "reproducible builds can't solve bugdoors, thus they're insufficient to solve any problems".
But this is essentially just an XKCD argument; https://xkcd.com/2368/
Reproducible builds are a nice property of any build system for multiple reasons. They're also part of the supply-chain security story, not the entire story on their own. As for how much effort it is? A lot. But considering that the core community of reproducible-builds people is below 50 people, and we are still able to come close to 88% reproducible builds in real-world distributions, that should point out how achievable this goal is.
Indeed. Open source isn't a vendor; it's just volunteers from the Internet. If you look at the way people behave in nearly every other walk of life that isn't software, it seems amazing that we somehow managed to create such a beautiful, thriving gift economy. In order to keep it that way, we need reproducible builds, because they promote transparency.
It is however very, very expensive. Even if you manage to surgically remove all the dependencies on things like __TIME__, you still need to audit the code for things like iterating over hash tables. Many core libraries these days, such as expat's XML parsing, seed their data structures using /dev/random. You can qsort, but you might get snagged by the fact that it isn't a stable sort. If things don't work out, it helps to have deterministic execution too, so you can run the build and see what it does differently that causes it to produce different output. Sadly, with things like kernel-imposed memory randomization, that's easier said than done.
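For the __TIME__-style problems specifically, the Reproducible Builds project's SOURCE_DATE_EPOCH convention is the usual fix: instead of the wall clock, tools read a pinned timestamp from the environment. A minimal sketch in Python:

```python
import os
import time

def build_timestamp() -> int:
    """Honor SOURCE_DATE_EPOCH (the Reproducible Builds convention)
    when set, so repeated builds of the same source tree embed the
    same timestamp; otherwise fall back to the wall clock."""
    epoch = os.environ.get("SOURCE_DATE_EPOCH")
    return int(epoch) if epoch is not None else int(time.time())
```

Pinning the timestamp removes one variable; the hash-table and randomized-seed problems above each need their own fix.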
So the unfortunate reality is that open source isn't arcing towards determinism. The trend is very much the opposite: it's becoming more non-deterministic, especially in the last five years. So if having deterministic builds is something your org cares about, then you really need to hire an expert and, in many cases, change the engineering culture too.
"bookworm [the next Debian stable release] on amd64 is 95.6% reproducible right now! "
This is taken from the integration suite which Debian has been running for years. It represents checking out the code and building it twice; it does not check the packages Debian actually distributes. This inflates the number a little bit.
Holger explains this in a thread a few years back.
The stats on reproducible-builds.org say that "29595 packages (95.7%) successfully built reproducibly in bullseye/amd64." which may not be accurate for the reasons given in the grandparent post, but I note that the situation seems to have improved significantly since that mailing list thread.
They are separate systems; the CI/CD pipeline and the rebuilder are essentially not doing the same job.
The 28.8% pending might be because of slow rebuild times, as the integration system has had more time building packages, with significantly more CPU power behind it.
As an interesting data point, I see that the number of pending packages is now 21.2%. Let's hope that most of these remaining packages (and those being retried) turn out to be reproducible.
As a further analysis, I note that, excluding the current "pending" builds, the "reproducible" segment accounts for 93% of all packages so far.
I don't know if it's reasonable to assume that the "pending" packages are a representative sample in terms of their reproducibility, or how likely the "retry" packages are to succeed, but I'm hopeful that in a few days the "reproducible" stat will pass 90% for real.