
Best way to do Linux clones for your CI - ScottWRobinson
https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html
======
indygreg2
Obviously not applicable to Linux since it is hosted on Git, but Mercurial has
this "clone bundles" feature built in.

If you e.g. `hg clone https://hg.mozilla.org/mozilla-central` to clone the main
Firefox
repo, your client will connect to a CDN to download the pre-generated bundle
then go back to the server to pull recent changes. Bitbucket also has the
feature deployed.

After I deployed this feature on hg.mozilla.org, server-side CPU load dropped
by like 90%. And with IP filtering to detect clients connecting from AWS, which
lets us serve URLs directly from S3 (as opposed to going through a CDN),
the total bill is like $20/mo for CI usage (S3 intra-region data transfer is
free). Literally thousands of hours of CPU-core time and >500 TB/mo offloaded
from the Mercurial servers.
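
The IP-filtering idea above could be sketched roughly as follows. This is only an illustration, not the hg.mozilla.org implementation: the CIDR blocks, the URLs, and the function name `bundle_url_for` are all hypothetical placeholders.

```python
import ipaddress

# Hypothetical CIDR blocks standing in for the address ranges of the AWS
# region that hosts the bucket (assumption: real ranges would be loaded
# from a published list, not hard-coded).
AWS_REGION_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

# Hypothetical URLs; the actual hosts are not given in the comment.
S3_BUNDLE_URL = "https://s3.example.com/bundles/repo.hg"
CDN_BUNDLE_URL = "https://cdn.example.com/bundles/repo.hg"


def bundle_url_for(client_ip: str) -> str:
    """Pick the bundle URL to advertise to a client based on its IP address."""
    addr = ipaddress.ip_address(client_ip)
    # Clients inside the cloud region fetch straight from S3 (intra-region
    # transfer); everyone else is pointed at the CDN.
    if any(addr in net for net in AWS_REGION_NETWORKS):
        return S3_BUNDLE_URL
    return CDN_BUNDLE_URL
```

A CI worker at `198.51.100.7` would get the S3 URL here, while any address outside the listed blocks falls back to the CDN.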

~~~
derefr
(Assuming you’re familiar with Git internals as well:) do you think there’s
anything in the architecture of Git preventing it from having this feature?
And, if not, should we Git developers push for it / work on a PR?

It’d make a lot of sense to integrate it if it’s possible, I think. The design
direction of Git is somewhat subservient to the needs of Linux (i.e. Git is a
thing Linus made to replicate BitKeeper, but it also was made to scratch—and
continues to scratch—many of the LKML itches re: their unique patch-management
workflows.)

So if Linux is having to do something re: “SCM ops” tooling, that could be
better solved in Git, then why not solve it in Git?

(If anyone who was responsible for this Kernel.org Git bundle setup wants to
chime in, that’d be interesting; I assume they likely considered making this a
Git thing first, so there might be a good reason why it’s not.)

~~~
stefan_
I think Google has something like that for the Android repositories. They are
always synced through Git and 70 GiB+ in size.

~~~
mugsie
yeah, the "repo" command they use has support for getting bundles, and then
pulling the newer commits from the gerrit server.
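
The "bundle first, then pull the newer commits" flow can be approximated with plain Git. The sketch below is not how `repo` implements it; it just assumes a bundle file has already been downloaded, and the helper names `bundle_clone_commands` and `bundle_clone` are made up for illustration.

```python
import subprocess


def bundle_clone_commands(bundle_path, remote_url, dest):
    """Return the git invocations for a bundle-seeded clone, in order."""
    return [
        # 1. Clone from the local bundle file; git treats it like a remote.
        ["git", "clone", bundle_path, dest],
        # 2. Repoint origin from the bundle file to the real server.
        ["git", "-C", dest, "remote", "set-url", "origin", remote_url],
        # 3. Fetch only what was committed after the bundle was generated.
        ["git", "-C", dest, "fetch", "origin"],
    ]


def bundle_clone(bundle_path, remote_url, dest):
    """Run the commands above, stopping at the first failure."""
    for argv in bundle_clone_commands(bundle_path, remote_url, dest):
        subprocess.run(argv, check=True)
```

Keeping the command list separate from the runner makes it easy to log or dry-run the steps in a CI job before actually executing them.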

------
phiresky
Git bundles are pretty nice, they are just a packfile [1], with a list of refs
prepended, that you can clone like a remote.

But this also means they have pretty inefficient compression, since every
object (or delta) is packed using zlib on its own. For packfiles this is good
because it allows random access, but for bundles it would be better to disable
this compression (set it to level 0) and compress the whole bundle instead:

    
    
        435M default.bundle
        412M default.bundle.bz2
        410M default.bundle.gz
        401M default.bundle.xz
    
        978M uncompressed.bundle
        290M uncompressed.bundle.bz2
        330M uncompressed.bundle.gz
        245M uncompressed.bundle.xz
    

That's 245MB vs. 401MB, roughly 40% space savings! The compressed file can't be
cloned directly, though; it has to be decompressed first.
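
The effect is easy to reproduce in miniature. The sketch below uses synthetic data, not a real packfile, and ignores Git's delta compression entirely. It only compares zlib applied to each small object separately against zlib applied once to the concatenation; `compare_compression` and the sample objects are made up for illustration.

```python
import zlib


def compare_compression(objects):
    """Return (per_object_total, whole_stream) compressed sizes in bytes."""
    # Each object compressed on its own, the way packfile entries are:
    # redundancy *between* objects cannot be exploited.
    per_object_total = sum(len(zlib.compress(obj)) for obj in objects)
    # One compression pass over everything: later objects can reference
    # similar content seen earlier in the stream.
    whole_stream = len(zlib.compress(b"".join(objects)))
    return per_object_total, whole_stream


# Synthetic stand-ins for many small, similar source-file blobs.
sample_objects = [
    ("int field_%d;\nstatic void helper_%d(void) { return; }\n" % (i, i)).encode()
    for i in range(500)
]

per_object_total, whole_stream = compare_compression(sample_objects)
print(per_object_total, whole_stream)
```

On inputs like these the whole-stream size comes out smaller, which is the same direction as the bundle numbers above, though the ratio on a real repository will differ.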

Produced using `git bundle create fname HEAD` on linux v3.0 after:

`git repack -a -d -F --depth=100 --window=100` (default compression level)

`git -c pack.compression=0 repack -a -d -F --depth=100 --window=100` (no zlib
compression)

[1]
https://github.com/git/git/blob/master/Documentation/technical/pack-format.txt

------
dhimes
I think CI means "cluster infrastructure", but the article refers to a "CI
infrastructure", so I'm not quite sure.

~~~
pageald
Continuous Integration

~~~
dhimes
Ah thank you.

