Unpacking Git packfiles (recurse.com)
101 points by chimeracoder on June 24, 2015 | 21 comments

Author here. I discovered this while working on a clean-room implementation of Git in pure Go. While there are a lot of references to packfiles online, surprisingly, the actual format of packfiles was rather underdocumented. Most resources just mention that they exist, and describe how to use `git verify-pack` to inspect a packfile, without explaining how to parse packfiles and apply deltas.

I decided to write this up to save others the trouble of having to reverse-engineer it from scratch!

Just wanted to say that your article is very nicely written. It is easy to follow and keeps the reader interested. Kudos. :)

Super minor nitpick:

"Since 0xFF (256) is the largest value that can fit into a single byte"

0xFF is 255 not 256

Yes, I link to a different version of that same file in the article (my link points to the version hosted on kernel.org, rather than GitHub). It provides a bit of high-level context, but by itself it doesn't provide enough detail to actually reimplement the corresponding Git functions.

Aside from being more terse and (IMHO) more difficult to read than prose with examples and non-ASCII diagrams, that file doesn't explain the context and motivation for packfiles, and it doesn't cover the parsing and application of deltas at all.

If you found that piece of documentation deficient while implementing a packfile parser, then it would be nice to update it to include those details that were lacking to help the next person to reimplement git.

Thanks for the suggestion! I'll take a look at the contribution process for the Git project.

Great article! Just one small correction: OFS_REF_DELTA should be OBJ_REF_DELTA.
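For anyone wondering where those type names show up: each entry in a packfile begins with a variable-length header. The first byte carries a 3-bit object type and the low 4 bits of the size, and the high bit of each byte says whether another size byte follows. Here is a minimal Python sketch of decoding it. The function name and the `OBJ_TYPES` table are my own for illustration, not taken from the article or from Git's source.

```python
# Object type codes used in packfile entry headers (0 and 5 are not valid types).
OBJ_TYPES = {
    1: "OBJ_COMMIT",
    2: "OBJ_TREE",
    3: "OBJ_BLOB",
    4: "OBJ_TAG",
    6: "OBJ_OFS_DELTA",
    7: "OBJ_REF_DELTA",
}


def parse_entry_header(buf: bytes, pos: int = 0):
    """Hypothetical helper: decode one entry header starting at buf[pos].

    Returns (type name, size, position of the first byte after the header).
    Raises KeyError for a type code that is not in OBJ_TYPES.
    """
    byte = buf[pos]
    pos += 1
    type_id = (byte >> 4) & 0x07   # bits 4-6: object type
    size = byte & 0x0F             # bits 0-3: lowest 4 bits of the size
    shift = 4
    while byte & 0x80:             # high bit set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return OBJ_TYPES[type_id], size, pos


# An OBJ_REF_DELTA entry with size 300 encodes as the two bytes 0xFC 0x12.
name, size, next_pos = parse_entry_header(bytes([0xFC, 0x12]))
```

Note that for delta entries the size is that of the delta data, and the header is followed by the base object reference before the compressed data starts; the sketch stops at the end of the type/size header.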

What was your motivation to implement from scratch, rather than using libgit2?

A combination of the following:

- Cross-compilation (trivial in Go, less so in C)

- A chance to learn about the really dark, thorny corners of Git

And for what it's worth, source control is Git's intended use case, but people do use it for other purposes as well (like managing personal media collections across multiple devices[0]). Git has become a protocol or a platform in addition to a VCS[1].

But there aren't very many FOSS clean-room implementations of Git, at least not this far down the chain (packfiles). One of the best ways to discover hidden implementation issues or oversights in a spec or existing documentation is to try and reimplement it, which has the effect of strengthening the platform itself in the long run.

[0] e.g. https://git-annex.branchable.com/

[1] Bitcoin is a bit of a hot-button topic, but Git is similar to Bitcoin in this regard: the tool itself is intended for financial transactions, but people have already started to use it for all sorts of unrelated use cases.

> But there aren't very many FOSS clean-room implementations of Git, at least not this far down the chain (packfiles).

I know of at least gogit (https://github.com/speedata/gogit), if only a little, because I've contributed to it once. I don't want to belittle your project, but I'd like to know: what are the differences between gitgo and gogit?

(I see at least one similarity: the name is extraordinarily unimaginative :)

> One of the best ways to discover hidden implementation issues or oversights in a spec or existing documentation...

There are issues. And there is the legendary "pack v4" (current pack versions are 2 or 3), but it is still a work in progress.
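To make the version number concrete: a packfile starts with a 12-byte header consisting of the 4-byte signature `PACK`, a 4-byte version, and a 4-byte object count, both big-endian. A rough Python sketch of checking it follows; `parse_pack_header` is a name I made up for illustration, and it only looks at the header, nothing after it.

```python
import struct


def parse_pack_header(buf: bytes):
    """Hypothetical helper: validate the 12-byte packfile header.

    Returns (version, object count) or raises ValueError.
    """
    if len(buf) < 12:
        raise ValueError("packfile header is 12 bytes; input is too short")
    # ">4sII": big-endian, 4-byte signature, two unsigned 32-bit integers.
    signature, version, count = struct.unpack(">4sII", buf[:12])
    if signature != b"PACK":
        raise ValueError("missing PACK signature")
    if version not in (2, 3):
        raise ValueError("unsupported pack version: %d" % version)
    return version, count


# Build a header in memory: version 2, five objects.
header = b"PACK" + struct.pack(">II", 2, 5)
version, count = parse_pack_header(header)
```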

Kudos. Just noting that one of libgit2's primary strengths is 100% cross-platform portability. Is this different from your cross-compilation goal? I'm not familiar with Go.

Yes, because Go generates static binaries and requires no toolchain. Once you link a C library, you lose these benefits and cross-compiling returns to the "normal" difficulty. In addition to that, Go code may be buggy but it's safe, while linking a C library means dropping this guarantee.

So in the Go ecosystem it is actually preferred to use pure Go libraries so as not to lose these benefits.

Is it really a clean-room implementation (in the GPL-enforcement sense) if you're looking at the git internal documentation?

I doubt it. A clean-room implementation is intended to demonstrate to a court that no knowledge of the copyrighted material was available to the implementers during the creation of the work. Any similarities to the "original" work must then be caused by functional constraints resulting from compatibility requirements, thus free from creative elements (not protected by copyright).

If the internal documentation includes any creative elements which are reflected by the git implementation, then the clean-room would be contaminated.

Interesting legal question. I'd bet on the "yes" side, but I don't have any arguments.

This ("Unpacking Git packfiles") was a CTF challenge a few weeks ago (at the Haxpo CTF in Amsterdam), except we weren't given the original repository; we only got a pcap dump of the traffic. Using `git unpack-objects` I was able to unpack them into object files (stored in .git/objects/xx/*), but even these were not readable. Eventually I found some zpipe command that did the trick. What a pain to do this with common tooling if you don't have the time to dive into the format and write a real unpacker.

Yes, the objects are zlib compressed. You can view them with something like:

    cat .git/objects/c0/fb67ab3fda7909000da003f4b2ce50a53f43e7 \
    | zlib-flate -uncompress; echo
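If you don't have `zlib-flate` installed, the same thing can be done with Python's standard `zlib` module. A loose object decompresses to a header of the form `<type> <size>`, a NUL byte, and then the content. This is a minimal sketch under that assumption; `parse_loose_object` is a hypothetical name, and the example builds a blob in memory rather than reading a real repository.

```python
import zlib


def parse_loose_object(raw: bytes):
    """Hypothetical helper: decompress a loose object into (type, size, body)."""
    data = zlib.decompress(raw)
    # The header and the content are separated by the first NUL byte.
    header, _, body = data.partition(b"\x00")
    obj_type, _, size_field = header.partition(b" ")
    size = int(size_field)
    if size != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode("ascii"), size, body


# Construct a blob the way it would be stored on disk, then parse it back.
content = b"hello\n"
raw = zlib.compress(b"blob %d\x00" % len(content) + content)
obj_type, size, body = parse_loose_object(raw)
```

To use it on a real object, read the file under `.git/objects/` in binary mode and pass the bytes to the function.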

I wonder how the packing process works. How does it find pairs of objects that compress well with delta encoding?


More detail than that would require reading the source code.
