With all the ink spilled about creative distributed architectures, it's really humbling to see how far they've grown with an architecture this simple.
Complicated things fail in complicated ways (looking at you, MySQL NDB Cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case before spending time on a new, fancy clustering solution - and an even better idea of how to handle its state / monitoring / consistency.
But from what I've seen in a few places, a lot of people jump to clustered solutions without either a real need or enough people to support them.
I wonder whether the receive-pack operation offers a natural boundary for transactions.
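git itself already exposes that boundary: with `git push --atomic`, the server applies all of a push's ref updates or none of them. A minimal sketch of that shape, with a hypothetical in-memory ref store standing in for whatever replicated store DGit actually uses:

```python
# Sketch: treating one receive-pack as an all-or-nothing transaction
# over the refs it touches. RefStore is a made-up in-memory stand-in,
# not GitHub's implementation.

class RefStore:
    def __init__(self):
        self.refs = {}  # ref name -> sha

    def transact(self, updates):
        """updates: list of (ref, expected_old_sha, new_sha).

        Validate every compare-and-swap before applying any of them,
        so a concurrent push can't leave the repo half-updated.
        """
        for ref, old, _ in updates:
            if self.refs.get(ref) != old:
                raise ValueError(f"lock failed on {ref}: stale old sha")
        for ref, _, new in updates:
            if new is None:
                self.refs.pop(ref, None)  # ref deletion
            else:
                self.refs[ref] = new

store = RefStore()
store.refs["refs/heads/master"] = "aaa111"
store.transact([("refs/heads/master", "aaa111", "bbb222"),
                ("refs/tags/v1.0", None, "bbb222")])  # new-ref creation
```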
So they're doing one of three things: strongly coordinated writes to keep replication changes consistent (a serializer, consensus, or chain replication); encoding extra causal information in the data with a rule that deterministically picks which replica dominates the others; or silently losing updates on conflicts.
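The middle option is usually some flavor of last-writer-wins. A toy sketch of the "deterministically pick a dominating replica" idea (field names made up):

```python
# Classic last-writer-wins: the (timestamp, replica_id) pair is the
# causal information carried with each write, and replica_id breaks
# ties so every node picks the same winner.

def dominant(a, b):
    """Return whichever versioned value wins, identically on every node."""
    return a if (a["ts"], a["replica"]) >= (b["ts"], b["replica"]) else b

v1 = {"ts": 1700000000, "replica": "r1", "value": "sha-aaa"}
v2 = {"ts": 1700000000, "replica": "r2", "value": "sha-bbb"}
assert dominant(v1, v2) == dominant(v2, v1) == v2  # "r2" > "r1" breaks the tie
```

Worth noting that LWW quietly drops the losing write, so the second and third options blur together unless the losing value is surfaced somewhere.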
It would be cool to know what they're doing.
That sounds like quorum + 2PC to me. Suffice it to say, that's not a safe protocol without some other system guarantees in place.
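To illustrate the gap: two writers can each reach a legitimate write quorum concurrently and leave the replicas disagreeing. A toy simulation (node names invented):

```python
# Naive quorum writes, W = 2 of 3, no versioning or consensus on order.
replicas = {"n1": None, "n2": None, "n3": None}

# Writer A reaches its quorum on n1, n2...
for node in ("n1", "n2"):
    replicas[node] = "A"
# ...while writer B concurrently reaches its quorum on n3, n2.
for node in ("n3", "n2"):
    replicas[node] = "B"

print(replicas)  # {'n1': 'A', 'n2': 'B', 'n3': 'B'}
# Every pair of nodes is a valid read quorum, yet {n1, n2} sees both
# A and B with nothing to order them by. That's the hole quorum + 2PC
# alone doesn't close without versioning or consensus on write order.
```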
Under "Your own fork of Rails", you will see how it actually works. The answer to your question is "no, they don't store 3 copies of the same repo".
As for overprovisioning: they previously had 4 times as much disk space provisioned as necessary: 2 disks in RAID in a single machine, times 2 for the hot spare. Now they only need 3x, one for each of the three copies.
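Spelling the arithmetic out:

```python
# Back-of-the-envelope on the provisioning claim above.
disks_per_machine = 2    # RAID-1 mirror
machines_per_repo = 2    # primary + hot spare
old_copies = disks_per_machine * machines_per_repo  # = 4
new_copies = 3                                      # three DGit replicas
print(old_copies / new_copies)  # ~1.33: the old scheme used a third more raw disk
```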
Interesting how GitHub is sounding like Google and Amazon. They're probably hitting the scale where it makes sense to build internal APIs and infrastructure abstractions to support their operations, e.g. Bigtable and S3. In fact, DGit sounds like another storage abstraction like Bigtable and S3, albeit more limited: a git repo must be stored fully on a single server (based on my cursory reading of GitHub's description of DGit), whereas in Bigtable the data is split into tablets, and the tablets that make up a table can be stored in different places, which allows higher utilization of resources.
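To make the placement difference concrete, here's a toy sketch (server names and the hashing scheme are invented, not GitHub's actual routing):

```python
import hashlib

SERVERS = ["fs1", "fs2", "fs3", "fs4"]

def place_whole(repo):
    # DGit-style (per my reading): the whole repo hashes to one server.
    h = int(hashlib.sha1(repo.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

def place_tablets(repo, n_tablets=4):
    # Bigtable-style: each key range (tablet) gets its own placement,
    # so one huge repo/table can spread across machines.
    return {t: SERVERS[int(hashlib.sha1(f"{repo}/{t}".encode())
                           .hexdigest(), 16) % len(SERVERS)]
            for t in range(n_tablets)}

print(place_whole("rails/rails"))    # one server holds everything
print(place_tablets("rails/rails"))  # tablets spread across servers
```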
I'll add that a person who pronounces git as JIT is probably a git. DGit sounds like "the git" more than "d JIT".
> Over the next month we will be following up with in-depth posts on the technology behind DGit.
Why mess with git?
I'm going to tentatively suggest this is one of those 'hard' problems that throwing buzzwords like 'cloud technologies' at doesn't solve.
What replication tech would you imagine solves this issue of distributing hundreds of thousands of constantly updated repositories?
For example, you can easily satisfy this with S3 and DynamoDB - I think the latest version of the cloudata project I was blogging about actually does that now.
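For the curious, a rough sketch of that shape with boto3 (bucket/table names and shas are placeholders, and error handling for ConditionalCheckFailedException is omitted):

```python
import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

old_sha = "aaa111"     # placeholder shas
new_sha = "bbb222"
packed_bytes = b"..."  # the packfile content

# Git objects are immutable and content-addressed, so a plain S3 put
# is idempotent and needs no coordination:
s3.put_object(Bucket="my-repo-objects", Key=f"objects/{new_sha}",
              Body=packed_bytes)

# Refs are the only mutable state; a DynamoDB conditional write gives
# the compare-and-swap that makes a ref update atomic:
ddb.put_item(
    TableName="my-repo-refs",
    Item={"ref": {"S": "refs/heads/master"}, "sha": {"S": new_sha}},
    ConditionExpression="#s = :old",
    ExpressionAttributeNames={"#s": "sha"},
    ExpressionAttributeValues={":old": {"S": old_sha}},
)
```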
Please address this before creating future hell for distributions.