Hacker News new | past | comments | ask | show | jobs | submit login
Introducing DGit (githubengineering.com)
262 points by samlambert on Apr 5, 2016 | hide | past | web | favorite | 47 comments

> Until recently, we kept copies of repository data using off-the-shelf, disk-layer replication technologies—namely, RAID and DRBD. We organized our file servers in pairs. Each active file server had a dedicated, online spare connected by a cross-over cable.

With all the ink spilled about creative distributed architectures, it's really humbling to see how far they grew with an architecture that simple.

I'm not surprised really. I never worked at GH scale, but anyway learned early that simple solutions just work. Want a db cluster? Why not active-passive M-M instead. Want a M-M setup? Why not dual server and block replication.

Complicated things fail in complicated ways (looking at you, mysql ndb cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case for spending time on a new, fancy clustering solution - and even better idea how to handle it's state / monitoring / consistency.

For the real use cases for the fancy clustering solution, the benefits can be huge... and complicated things built with failure in mind actually fail in quite simple ways. DGit is an example in itself. Sure it heavily leverages the complexity of git's versioning/branching/distributed vcs mechanism to get the job done, but compared to a SAN or RAID + DRBD, the failure scenarios are much more straightforward to deal with.

I agree. As long as you have people who can build it tailored to the service and handling all the errors. GH can afford doing that almost from scratch. Or if you're sure that this is exactly the solution you need and are familiar with the failure scenarios.

But from what I've seen in a few places, a lot of people jump to cluster solutions without either a real need or enough people to support it.

> But from what I've seen in a few places, a lot of people jump to cluster solutions without either a real need or enough people to support it.


The interesting bit here will be how they reconcile potentially conflicting changes between the replicas. It is pretty easy to replicate the data, because git is content-addressable and can go garbage collection - I think even rsync would work. The challenge is that when e.g. the master branch is updated, you essentially have a simple key-value database that you must update deterministically. I look forward to learning how github chose to solve that challenge!

rsync would only work well if there aren't any pack files (which reduce the disk space used, and access times). Pack files break the simplicity of the basic content-addressed store by compressing multiple objects together.

Right, but it doesn't really matter if you have objects stored twice (or more) I believe. The next GC will clean it up anyway.

I'm not sure how GH solved this, but consistently updating refs in that simple k-v store seems to be the main challenge.

I wonder whether the receive-pack operation offers a natural boundary for transactions?

The replica count seems to be three - allowing quorum with a single lost host - and the repository goes read-only when quorum is lost.

That doesn't do anything to resolve consistency issues. The only time that quorum, pinned to primary replicas (i.e. disallowing sloppy failovers), helps you with consistency is when all reads and writes to those replicas are funneled through a serialized reader/writer. Then that serialized process can ensure "Read Your Own Writes" consistency using majority quorum (again, pinned to primary replicas).

So they're either doing strongly coordinated writes to make sure replication changes are consistent (serializer, consensus, or chain), they're encoding other causal information in the data and have some method that deterministically picks which replica should dominate the others, or they're silently losing updates on conflicts.

It would be cool to know what they're doing.

The article said "Writes are synchronously streamed to all three replicas and are only committed if at least two replicas confirm success."

Somehow missed that. Thank you.

So quorum + 2PC is what that sounds like to me. Suffice it to say, that's not a safe protocol without some other system guarantees in place.

I always imagined they had some kind of huge database holding all of the git objects to avoid excessive duplication. Now it sounds like they duplicate objects for every trivial fork, times three! I fork everything I am interested in just to make sure it stays available, and thought that was only costing github an extra file link each time...


Under "Your own fork of Rails", you will see how it actually works. The answer to your question is "no, they don't store 3 copies of the same repo".

Awesome post, thanks for sharing! Great to see that they committed their optimizations upstream too!

You are wrong. Before they used a single git repository for all user repositories in a network. So the git objects in the original repository and all its forks were properly deduplicated. Deduplication across different networks does not make much sense.

Now, as far as overprovisioning, they had 4 times as much disk space provisioned as necessary: 2 disks in RAID in a single machine times 2 (hot spare). Now they only need 3x, for the three copies.

Objects don't need to be duplicated for forks. See the objects/info/alternates file. They're talking about redundancy for the object storage backend, which is independent of how many forks/clones/repositories. In fact, assuming they solve collision detection somehow, they could store all objects for all repositories in one object store and have all repos be thin wrappers with objects/info/alternates files forwarding to that object store repository...

Please open-source this!

It's relatively straightforward to build something like this using JGit - I blogged about how I did it a while ago: http://blog.justinsb.com/blog/2013/12/14/cloudata-day-8/

Compare this approach with Google's - github sticks to the git-compatible 'loose' repository format on vanilla filestorage, while google uses packs on top of their BigTable infrastructure, and requires changes to repo-access at the git-level [1].

[1] https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...

Interesting how Github is sounding like Google and Amazon. They're probably hitting the scale where it makes sense to build internal APIs and infrastructure abstractions to support their operations, eg. Bigtable and S3. In fact, DGit sounds like another storage abstraction like Bigtable and S3, albeit limited - eg. a git repo must be stored fully on a single server (based on my cursory reading of github's description of DGit), but in Bigtable, data is split into tablets that comprise the table might be stored on different places, which would allow higher utilization of resources.

Is there any plan to open source this?

"dgit" is not the best of names, since git itself is already distributed, as they note. It would have been more accurate to call it "replicated git", or "rgit". I guess they just wanted to be able to pronounce it "digit".

I agree.

I'll add that a person who pronounces git as JIT is probably a git. Dgit sounds like the git more than d JIT.

Is this what's been causing the very poor availability of GitHub today?


Unlikely. According to the article they've been rolling this out for months. It's not like they flipped a switch today to turn it on.

off-the-wall (and highly unlikely) suggestion: GH unleashed an aggressive Chaos Monkey today for the purpose of testing DGit's reliability claims in production.

Not to try to make this sound less awesome than it is, what happens in a proxy fails? Are the proxies now the weak point in GitHub's architecture?

The design of this seems very similar to GlusterFS, which has a very elegant design. It just acts as a translation layer for normal POSIX syscalls and forwards those calls to daemons running on each storage host, which then reproduces the syscalls on disk. This seems like very much the same thing except using git operations.

Thanks for the compliment. While I as a GlusterFS developer would like to see us get more credit for the ways in which we've innovated, this is not such a case. The basic model of forwarding operations instead of changed data was the basis for AT&T UNIX's RFS in 1986 and NFS in 1989. Even more relevantly, PVFS2 already did this in a fully distributed way before we came along. I'd like to make sure those projects' authors get due credit as well, for blazing the path that we followed.

I didn't say that GlusterFS was the first to do it, just always liked the simplicity of the design. It makes hacking on the internals extremely easily to reason about.

Glad you like it. We enjoy it too. :)

I don't understand why they just didn't use ceph. Ceph has all the features dgit was invented to solve.

Just is the worst word in the English language.

No, not really. I love my cousins over in Ceph-land, that's for sure, but asynchronous but fully ordered replication across data centers is not in their feature set. At the RADOS level replication is synchronous, so it's not going to work well with that kind of latency. At the RGW level it's async, but loses ordering guarantees that you'd need for something like this. Replicating the highest-level operations you can tap into - in this case git operations rather than filesystem or block level - is the Right Thing To Do.

Rados provides async APIs, couldn't you hook it to send git deltas between replicated copies? best of both worlds

is there a whitepaper actually showing how this works ?

I think your question may have been covered at the end:

>Over the next month we will be following up with in-depth posts on the technology behind DGit.

good lord. what a feat. tbh i'd have fucked this up.

ok i dont get this ... shouldn't server availability problems be solved using traditional server availability technologies, like cloud technologies

why mess with git

Like what?

I'm going to tentatively suggest this is one of those 'hard' problems that throwing buzz words like 'cloud technologies' at doesn't solve.

What replication tech would you imagine solves this issue of distributing hundreds of thousands of constantly updated repositories?

It's actually a relatively easy problem (compared to say a full POSIX filesystem) - I mentioned elsewhere in these comments a blog post where I implemented a fairly good solution. The objects are essentially immutable and content-addressable so you can get away with very relaxed semantics here. You need a reliable mapping of names to SHAs, but this is also comparatively easy (a key-value store).

For example, you can easily satisfy this with S3 and DynamoDB - I think the latest version of the cloudata project I was blogging about actually does that now.

They've addressed that in the post. To make git work comfortably especially on bigger repositories you need to have fast local access to all blobs. IIRC with something like libgit2 its relatively easy to implement what you describe but to make that perform while doing git log or diffs is completely different story.

It isn't clear that they wouldn't get that same performance by caching the blobs locally. In my experience, you would, with much higher reliability and scalability.

There's a classic talk by vmg that explains GitHub's philosophy really well: http://devslovebacon.com/conferences/bacon-2013/talks/my-mom... (it gets webscale around the 15 minute mark)

There's already a project called dgit, together with a package in Debian: https://lists.debian.org/debian-devel/2013/08/msg00461.html

Please address this before creating future hell for distributions.

Github's dgit is an internal infrastructure tool, it doesn't appear to have been open sourced.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact