
Introducing DGit - samlambert
http://githubengineering.com/introducing-dgit/
======
ianlevesque
> Until recently, we kept copies of repository data using off-the-shelf, disk-
> layer replication technologies—namely, RAID and DRBD. We organized our file
> servers in pairs. Each active file server had a dedicated, online spare
> connected by a cross-over cable.

With all the ink spilled about creative distributed architectures, it's really
humbling to see how far they grew with an architecture that simple.

~~~
viraptor
I'm not really surprised. I never worked at GH scale, but I learned early that simple solutions just work. Want a DB cluster? Why not active-passive instead of M-M. Want an M-M setup? Why not dual servers and block replication.

Complicated things fail in complicated ways (looking at you, MySQL NDB Cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case before spending time on a new, fancy clustering solution - and an even better idea of how to handle its state / monitoring / consistency.

~~~
cbsmith
For the _real_ use cases for the fancy clustering solution, the benefits can
be huge... and complicated things built with failure in mind actually fail in
quite simple ways. DGit is an example in itself. Sure, it heavily leverages the
complexity of git's versioning/branching/distributed VCS mechanism to get the
job done, but compared to a SAN or RAID + DRBD, the failure scenarios are much
more straightforward to deal with.

~~~
viraptor
I agree - as long as you have people who can build it tailored to the service and handle all the errors. GH can afford to do that almost from scratch. It also works if you're sure that this is exactly the solution you need and are familiar with the failure scenarios.

But from what I've seen in a few places, a lot of people jump to cluster
solutions without either a real need or enough people to support it.

~~~
cbsmith
> But from what I've seen in a few places, a lot of people jump to cluster
> solutions without either a real need or enough people to support it.

Totally.

------
justinsb
The interesting bit here will be how they reconcile potentially conflicting
changes between the replicas. Replicating the data itself is pretty easy, because
git is content-addressable and can do garbage collection - I think even rsync
would work. The challenge is that when e.g. the master branch is updated, you
essentially have a simple key-value database that you must update
deterministically. I look forward to learning how GitHub chose to solve that
challenge!
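The ref-update problem described above can be sketched with a compare-and-swap vote across replicas. This is purely illustrative - the `RefStore` class and majority rule here are my own assumptions, not GitHub's actual DGit protocol:

```python
class RefStore:
    """One replica's name -> SHA mapping (the 'key-value database')."""

    def __init__(self):
        self.refs = {}

    def cas(self, name, old, new):
        """Move `name` from `old` to `new` only if it still equals `old`."""
        if self.refs.get(name) != old:
            return False
        self.refs[name] = new
        return True


def update_ref(replicas, name, old, new):
    """Accept the update only if a majority of replicas agree on `old`."""
    votes = [r.cas(name, old, new) for r in replicas]
    if sum(votes) * 2 > len(replicas):
        return True
    # Roll back the replicas that voted yes so the group stays consistent.
    for r, voted in zip(replicas, votes):
        if voted:
            if old is None:
                r.refs.pop(name, None)
            else:
                r.refs[name] = old
    return False


replicas = [RefStore() for _ in range(3)]
assert update_ref(replicas, "refs/heads/master", None, "abc123")
# A stale writer who thinks master is still unborn loses the race:
assert not update_ref(replicas, "refs/heads/master", None, "def456")
```

The point of the sketch: because refs are the only mutable state, the hard consensus problem is confined to a tiny key-value update, while the object data stays trivially replicable.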

~~~
LukeShu
rsync would only work well if there aren't any pack files (which reduce disk
usage and access times). Pack files break the simplicity of the basic
content-addressed store by compressing multiple objects together.
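To illustrate why loose objects rsync so well: a loose object's path is a pure function of its contents, so copying one between replicas is naturally idempotent. A minimal sketch of how git derives that path:

```python
import hashlib
import os


def loose_object_path(content: bytes, kind: str = "blob") -> str:
    """Where git would store `content` as a loose object.

    The path depends only on the object's hash, so syncing loose
    objects between replicas can never conflict. (On disk the body
    is zlib-compressed, which doesn't change the path.) Pack files
    lose this property by bundling many objects into one file.
    """
    header = f"{kind} {len(content)}".encode() + b"\x00"
    sha = hashlib.sha1(header + content).hexdigest()
    return os.path.join("objects", sha[:2], sha[2:])


# Matches `echo hello | git hash-object --stdin`:
print(loose_object_path(b"hello\n"))
# objects/ce/013625030ba8dba906f756967f9e9ca394464a
```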

~~~
justinsb
Right, but I believe it doesn't really matter if you have objects stored twice
(or more). The next GC will clean them up anyway.

------
drewm1980
I always imagined they had some kind of huge database holding all of the git
objects to avoid excessive duplication. Now it sounds like they duplicate
objects for every trivial fork, times three! I fork everything I am interested
in just to make sure it stays available, and thought that was only costing
github an extra file link each time...

~~~
siong1987
[http://githubengineering.com/counting-objects/](http://githubengineering.com/counting-objects/)

Under "Your own fork of Rails", you will see how it actually works. The answer
to your question is "no, they don't store 3 copies of the same repo".
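The mechanism git uses for this is "alternates": a fork keeps only its own objects and falls back to a shared store for everything else. A rough sketch of the lookup behavior (the class names here are mine, not git's internals):

```python
class ObjectStore:
    """Toy model of a git object database with alternates fallback.

    Real git reads extra object directories from the
    objects/info/alternates file; here we just hold references.
    """

    def __init__(self, alternates=()):
        self.objects = {}                   # sha -> content, this repo's own
        self.alternates = list(alternates)  # shared stores to fall back on

    def get(self, sha):
        if sha in self.objects:
            return self.objects[sha]
        for alt in self.alternates:         # e.g. the shared "network" repo
            found = alt.get(sha)
            if found is not None:
                return found
        return None


shared = ObjectStore()
shared.objects["abc"] = b"rails source"
fork = ObjectStore(alternates=[shared])     # a fork stores only what differs
assert fork.get("abc") == b"rails source"
```

So a trivial fork costs roughly one alternates entry plus its refs, not a full copy of the object data.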

~~~
seanp2k2
Awesome post, thanks for sharing! Great to see that they committed their
optimizations upstream too!

------
petemill
Please open-source this!

~~~
justinsb
It's relatively straightforward to build something like this using JGit - I
blogged about how I did it a while ago:
[http://blog.justinsb.com/blog/2013/12/14/cloudata-day-8/](http://blog.justinsb.com/blog/2013/12/14/cloudata-day-8/)

------
rctay89
Compare this approach with Google's - GitHub sticks to the git-compatible
'loose' repository format on vanilla file storage, while Google uses packs on
top of their BigTable infrastructure and requires changes to repo access at
the git level [1].

[1]
[https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...](https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf)

Interesting how GitHub is sounding like Google and Amazon. They're probably
hitting the scale where it makes sense to build internal APIs and
infrastructure abstractions to support their operations, e.g. Bigtable and S3.
In fact, DGit sounds like another storage abstraction like Bigtable and S3,
albeit more limited - e.g. a git repo must be stored fully on a single server
(based on my cursory reading of GitHub's description of DGit), whereas in
Bigtable the data is split into tablets that make up the table and can be
stored in different places, which allows higher utilization of resources.

------
venantius
Is there any plan to open source this?

------
Camillo
"dgit" is not the best of names, since git itself is already distributed, as
they note. It would have been more accurate to call it "replicated git", or
"rgit". I guess they just wanted to be able to pronounce it "digit".

~~~
newjersey
I agree.

I'll add that a person who pronounces git as JIT is probably a git. Dgit
sounds like the git more than d JIT.

------
Artemis2
Is this what's been causing the very poor availability of GitHub today?

[https://status.github.com/messages](https://status.github.com/messages)

~~~
eridius
Unlikely. According to the article they've been rolling this out for months.
It's not like they flipped a switch today to turn it on.

------
ryao
Not to try to make this sound less awesome than it is, but what happens if a
proxy fails? Are the proxies now the weak point in GitHub's architecture?

------
mmckeen
The design of this seems very similar to GlusterFS, which has a very elegant
design. It just acts as a translation layer for normal POSIX syscalls and
forwards those calls to daemons running on each storage host, which then
reproduces the syscalls on disk. This seems like very much the same thing
except using git operations.
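The "forward the operation" model described here can be shown in a few lines: each replica re-executes the same high-level call instead of receiving the changed bytes after the fact. The `Coordinator`/`Replica` names are my own illustration, not GlusterFS or DGit code:

```python
class Replica:
    """One storage host that replays operations locally."""

    def __init__(self):
        self.refs = {}

    def apply(self, op, *args):
        # Dispatch the named operation to a local method.
        getattr(self, op)(*args)

    def set_ref(self, name, sha):
        self.refs[name] = sha


class Coordinator:
    """Translation layer: forwards each operation to every replica."""

    def __init__(self, replicas):
        self.replicas = replicas

    def set_ref(self, name, sha):
        # The same semantic operation goes to every host, rather than
        # shipping the resulting on-disk byte changes.
        for r in self.replicas:
            r.apply("set_ref", name, sha)


cluster = Coordinator([Replica(), Replica(), Replica()])
cluster.set_ref("refs/heads/master", "abc123")
assert all(r.refs["refs/heads/master"] == "abc123" for r in cluster.replicas)
```

The appeal is that each replica ends up in a valid state by construction, since it ran the same well-defined operation itself.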

~~~
notacoward
Thanks for the compliment. While I as a GlusterFS developer would like to see
us get more credit for the ways in which we've innovated, this is not such a
case. The basic model of forwarding operations instead of changed data was the
basis for AT&T UNIX's RFS in 1986 and NFS in 1989. Even more relevantly, PVFS2
already did this in a fully distributed way before we came along. I'd like to
make sure those projects' authors get due credit as well, for blazing the path
that we followed.

~~~
mmckeen
I didn't say that GlusterFS was the first to do it - I've just always liked the
simplicity of the design. It makes the internals extremely easy to reason
about and hack on.

~~~
notacoward
Glad you like it. We enjoy it too. :)

------
nwmcsween
I don't understand why they didn't just use Ceph. Ceph has all the features
DGit was invented to provide.

~~~
notacoward
No, not really. I love my cousins over in Ceph-land, that's for sure, but
asynchronous but fully ordered replication across data centers is not in their
feature set. At the RADOS level replication is synchronous, so it's not going
to work well with that kind of latency. At the RGW level it's async, but loses
ordering guarantees that you'd need for something like this. Replicating the
highest-level operations you can tap into - in this case git operations rather
than filesystem or block level - is the Right Thing To Do.

~~~
nwmcsween
RADOS provides async APIs; couldn't you hook into it to send git deltas between
replicated copies? Best of both worlds.

------
samuel1604
Is there a whitepaper actually showing how this works?

~~~
brazzledazzle
I think your question may have been covered at the end:

> Over the next month we will be following up with in-depth posts on the
> technology behind DGit.

------
glasz
good lord. what a feat. tbh i'd have fucked this up.

------
systems
OK, I don't get this... shouldn't server availability problems be solved using
traditional server availability technologies, like cloud technologies?

Why mess with git?

~~~
shadowmint
Like what?

I'm going to tentatively suggest this is one of those 'hard' problems that
throwing buzz words like 'cloud technologies' at doesn't solve.

What replication tech would you imagine solves this issue of distributing
hundreds of thousands of constantly updated repositories?

~~~
justinsb
It's actually a relatively easy problem (compared to, say, a full POSIX
filesystem) - I mentioned elsewhere in these comments a blog post where I
implemented a fairly good solution. The objects are essentially immutable and
content-addressable, so you can get away with very relaxed semantics here. You
need a reliable mapping of names to SHAs, but this is also comparatively easy
(a key-value store).

For example, you can easily satisfy this with S3 and DynamoDB - I think the
latest version of the cloudata project I was blogging about actually does that
now.
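The "very relaxed semantics" claim for the object side comes down to idempotence: when the key is the hash of the value, concurrent or duplicated writes can never disagree. A minimal sketch, using an in-memory dict as a stand-in for something like an S3 bucket keyed by SHA:

```python
import hashlib


class ContentStore:
    """Content-addressed blob store: key = SHA-1 of the value."""

    def __init__(self):
        self.data = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Re-writing an existing key is a no-op by construction:
        # the same key always maps to the same bytes, so races between
        # replicas are harmless.
        self.data[key] = content
        return key


store = ContentStore()
k1 = store.put(b"tree 0a1b...")
k2 = store.put(b"tree 0a1b...")   # duplicate write from another replica
assert k1 == k2 and len(store.data) == 1
```

Only the name-to-SHA mapping then needs real coordination (e.g. a conditional write in a key-value store), which is the comparatively easy part mentioned above.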

~~~
summner
They've addressed that in the post. To make git work comfortably, especially on
bigger repositories, you need fast local access to all blobs. IIRC, with
something like libgit2 it's relatively easy to implement what you describe, but
making that perform well while doing git log or diffs is a completely different
story.

~~~
justinsb
It isn't clear that they wouldn't get that same performance by caching the
blobs locally. In my experience, you would, with much higher reliability and
scalability.

------
rlpb
There's already a project called dgit, together with a package in Debian:
[https://lists.debian.org/debian-devel/2013/08/msg00461.html](https://lists.debian.org/debian-devel/2013/08/msg00461.html)

Please address this before creating future hell for distributions.

~~~
douglasfshearer
GitHub's DGit is an internal infrastructure tool; it doesn't appear to have
been open sourced.

