Reproducible Git Bundles (baecher.dev)
187 points by pbhn on Dec 25, 2023 | 34 comments



Parallelism doesn't inherently have to break reproducibility. If the output depends on the scheduling order of the threads, then it will. But it's possible to use threads to farm out work while still doing the same work deterministically.
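A minimal illustration of the idea (a sketch, assuming GNU parallel is installed): -k keeps output in input order, so the result is byte-identical no matter how the jobs are scheduled.

  # same hash on every run, even though jobs finish in arbitrary order
  seq 1 100 | parallel -k 'echo {}' | sha256sum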


Strictly speaking in terms of system backups, this approach misses the git reflog, which is a crucial "get out of jail free" card on the (hopefully) rare occasion that you actually need it.
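If you go the bundle route and want that safety net too, one option (a sketch, not what the article does) is to grab the reflog files alongside the bundle, since they live as plain files under .git/logs:

  git bundle create repo.bundle --all
  tar -C .git -cf reflogs.tar logs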


The question is whether you can expect this format to stay stable and reproducible across git versions. Remember the fallout from git 2.38, when the output of 'git archive' changed. Although for this backup use case it would just mean that the first backup after a format change makes one full copy.


git bundles have a standardised format (defined in [1]), while git archives never had one and still don't. Git can still handle all previous bundle versions, and you can specify which version of the bundle format you want when creating a bundle.

So no, the git 2.38 issue should not be a problem for bundles, even if the format changes in the future.

1. https://git-scm.com/docs/gitformat-bundle
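Pinning the format explicitly looks roughly like this (a sketch; --version is described in [1], where version 2 is the older SHA-1 era format and version 3 adds capabilities):

  # create a bundle and insist on format version 2
  git bundle create --version=2 repo.bundle --all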


Standardized format != guaranteed reproducible. Git makes no promises that it'll keep PACK contents stable, just that they're guaranteed to "deflate" to the same contents.

Which is what the linked article discovered. Threading is a trivial way to discover this, but there are other ways PACK contents might differ across versions.


Yeah, even tar is not stable bit for bit across versions.


Absolute stability is not needed for this use case. Reproducibility within a given git version would be enough to mostly avoid redundant backups.
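A rough way to check that on your own machine (a sketch; assuming the threaded delta search mentioned above is the culprit, pack.threads=1 should make repeated runs on one machine and git version stable):

  git -c pack.threads=1 bundle create a.bundle --all
  git -c pack.threads=1 bundle create b.bundle --all
  cmp a.bundle b.bundle && echo "bit-for-bit identical"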


Why not just use tar or any other archive tool on the repository's .git folder? Unless your repository is an un-gc'd mess with millions of unpacked objects...
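Something like this, I mean (a sketch, assuming the usual layout with .git inside the working tree):

  # archive only the repository metadata, not the working tree
  tar -C /path/to/repo -czf repo-dotgit.tar.gz .git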


I think that is a fair alternative, but restoring the backup leaves the repository in a bit of a weird state, whereas a bundle can be cloned from nicely. Your way does have the property that it includes hooks and config, though (which may or may not be desirable).


There are very few/no operations that will put a git repository in a "weird" state.

Actually, any snapshot of a Git repository is consistent (due to its CAS nature), minus the index, which doesn't really need backing up anyway.


This is also non-deterministic between versions of tar, but I guess for this use case that would be fine. It's just not good for reproducible build systems when trying to recreate tarballs years later.
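With GNU tar you can pin down most of the varying bits (a sketch; still not guaranteed to be identical across tar versions):

  tar --sort=name --mtime='2023-01-01 00:00:00 UTC' --owner=0 --group=0 \
      --numeric-owner -cf repo.tar .git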


Does not meet the stated goal of the author:

> The naive solution of simply backing up the entire file-system tree is clearly not desirable since that would clutter the backup with useless build artifacts.

Build artifacts can be filtered out with tar --exclude patterns, but this is a language-dependent set that will require curation.
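E.g. (a hypothetical, far-from-complete exclude list; each ecosystem needs its own entries):

  tar --exclude='node_modules' --exclude='target' --exclude='build' --exclude='*.o' \
      -cf project.tar /path/to/project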


Wouldn't all these useless build artifacts be outside the .git folders?


git bundle is the answer to the question "I need to keep my development trees in sync on different computers, with no network connection between them": https://stackoverflow.com/q/3635952/308851. Buried in the comments is a great script automating the process: https://github.com/cxw42/git-tools/blob/master/wormhole. Alas, it's abandoned, but it's still pretty great.
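The core of that workflow, stripped down (a sketch, assuming a branch named main on both ends):

  # on machine A
  git bundle create sync.bundle main
  # carry sync.bundle over on a USB stick, then on machine B:
  git pull sync.bundle main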


An engaging post and a lovely piece of sleuthing. Thanks!


Feels like bundles are a bad idea for backups. Perhaps a --bare repo in a zip archive would be better.


>The naive solution of simply backing up the entire file-system tree is clearly not desirable since that would clutter the backup with useless build artifacts.

Just ignore the files in .gitignore and back up the entire file-system tree.

Don't be clever. This is a backup of source code that took many hours/days/weeks of effort to create. Since a git repo is mostly source code, it is not that big of a space hog.

Disk space is cheap. Time is not.
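One way to get roughly that effect, sketched using git's own ignore handling (you would still want to include .git itself in the archive; run this from the repository root):

  # tracked files, plus untracked files that aren't ignored
  git ls-files --cached --others --exclude-standard -z \
    | tar --null -T - -cf worktree.tar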


> Disk space is cheap. Time is not.

At the same time, this train of thought is why VSCode dumps 8 GiB into the ~/.config directory, which not only makes every backup take far longer to finish, but also costs time acquiring more and more backup disks, figuring out where to physically store them, etc. Better to spend more time up front getting the storage right, and save space and time for everyone who uses it later.


> The naive solution of simply backing up the entire file-system tree is clearly not desirable since that would clutter the backup with useless build artifacts.

`git rm -r --cached .`

> One solution is to create a fresh clone (with --mirror), but that will typically consist of many small files which isn't ideal for backups, either.

this is precisely what tar(1) was made for.
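I.e., something along these lines (a sketch):

  git clone --mirror /path/to/repo repo.git
  tar -czf repo-mirror.tar.gz repo.git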


Do you mean `git clean`?


[flagged]


Mostly for security, data integrity, and QA purposes: it ensures the same data can be rebuilt at any given time.

Let's say we have a recipe for a food that we really like and we want it to be the same all the time. We figured out the list of ingredients, we used it and it worked to our taste.

Two months later, we decided we want that food again. We followed the same recipe with the same ingredients and it turns out the food tasted wrong or bad.

What happened? The company that makes one of our ingredients decided to swap it for a cheaper version without telling anyone (like a different sugar alcohol, or a smaller ratio). Our recipe is no longer reproducible: the list is exactly the same, but some parts of it have changed without our intervention.

The same thing applies when building software: we want to make sure the dependencies, CI/CD tooling, terminal, etc. all match exactly, down to the last bit, so that we can always reproduce the same result.

One use case of this: we need to do a hotfix off a stable version, but our CI/CD system has rotated the old cache out, so we have to rebuild with the same dependencies. Some upstreams may have changed things in very subtle ways we didn't know about, which means we can easily introduce unintentional regressions without changing anything ourselves, despite building from a trusted/tested branch that was deployed in the past.

We saw this in the past when we had exactly the same code, rebuilt the release to test something, and for some odd reason saw regressions despite not having changed a single thing. It turned out that our CI/CD's compiler had been updated, which changed some behaviors when compiling the same code. Which, for security reasons, you do not want changed out from under you.

So, for security and quality purposes, it is important to have reproducible build systems that we can have confidence in.


Reproducibility does not necessarily require having perfectly matching bits.

To me it seems like md5sum is not the best tool to check whether two bundles are equivalent. In this case, going for bit-for-bit reproducibility does not seem to have much practical benefit.
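For instance, comparing what the bundles point at rather than their raw bytes (a sketch; for full --all bundles of the same repo, identical heads imply the same reachable history):

  # list-heads prints the refs a bundle carries
  diff <(git bundle list-heads a.bundle) <(git bundle list-heads b.bundle)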


> Reproducibility does not necessarily require having perfectly matching bits.

https://en.wikipedia.org/wiki/Reproducible_builds: "Reproducible builds, also known as deterministic compilation, is a process of compiling software which ensures the resulting binary code can be reproduced. Source code compiled using deterministic compilation will always output the same binary."


This comment is baffling.

Reproducibility in general is a key part of all sciences.

Practically speaking, it’s important from a security perspective. If you can make a secure configuration of some system you’d want to be 100% certain you can reproduce that same configuration on some other system. Deep and provable reproducibility doesn’t matter too much for most web servers, unless security is a top priority.


Your science comparison is also baffling, to be honest. A typical "non-reproducible" compile meets the scientific standard of "reproducible". Imagine if recreating an experiment were expected to give you the same data points, in the same order, with the same timestamps. The two definitions of "reproducible" are worlds apart.


What “scientific standard” are you talking about?

I think if it was possible to configure the universe (time included) in the exact same way, save for a single variable, that would be the _only_ way science experiments would be done.

We almost have the ability to do that with software, but we just don’t most of the time. That’s okay for many applications, but not all of them.


The standard that real-world experiments are held to in order to qualify as a "reproduction". Your hypothetical is interesting, but it's clearly a different standard from the one we actually use.


Do you know of an example of a security property that a reproducible build would be able to guarantee, and where pinning a dependency against a version identifier plus a digest of the release tarball wouldn't accomplish the same thing?


I think pinning a dependency against a version identifier and a digest of a tarball is a great way to ensure reproducibility.

You’re controlling which dependency you’re using very specifically.

Reproducibility like this protects against supply chain attacks.

This may be a bit contrived, but it could prevent a malicious package maintainer from releasing a modified version of OpenSSL that you depend on without you noticing the change in your dependencies.
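That kind of pinning is, in the end, just a digest check before you build, e.g. (hypothetical tarball name, placeholder digest):

  echo "<expected-sha256>  libdep-1.2.3.tar.gz" | sha256sum -c -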


It easily answers one of the most important questions: are these two things equal?

With reproducibility you just need to compare bytes. Without it, you need additional logic.


We're talking about software releases. Those have an identifier already: the version number. You can even pin that identifier to a hash over the release tarball, all without even beginning to consider reproducible builds.


In this case, it accomplishes efficient backups.


What's efficient (or less efficient) about backing up two identical files vs. two different files with comparable sizes?

If one has a backup scheme where files are stored in a content-addressable store, I can see the efficiency. But I have never heard of such a scheme in the context of backups.

If one has a backup scheme with automatic deduplication over arbitrary byte ranges (such as ZFS's deduplication feature), I can see the efficiency. But I have never seen anyone enable that feature in the context of backups due to its massive RAM usage requirements.


> If one has a backup scheme where files are stored in a content-addressable store, I can see the efficiency. But I have never heard of such a scheme in the context of backups.

One example of a system that does this is borgbackup. But even a simple "rsync with snapshots" setup benefits from identical files, since with common rsync options identical files are not transferred again (which means the data stays shared between snapshots).
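The "rsync with snapshots" trick, roughly (a sketch; --link-dest hardlinks unchanged files against the previous snapshot instead of storing them again):

  rsync -a --link-dest=/backups/2023-12-24 /data/ /backups/2023-12-25/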




