
Counting Objects - jssjr
http://githubengineering.com/counting-objects/
======
brian_cloutier
This is technical recruiting done right: a lucid walkthrough of a hard problem,
complete with links to the implementation. It must have taken at least a few
weeks of engineer-time to write; GitHub is awesome for making this public.

I wish they had talked a little more about the tradeoff they made. They
mentioned that splitting packfiles by fork was space-prohibitive, but they
ended up with a solution which must still take more space than the original
layout. (If the new heuristic refuses to use some objects as delta bases, some
options which would have provided the best compression are no longer available
to git.)

The performance win is incredible; how much space did they give up in the
process?
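
For anyone curious, here's a rough Python sketch of the kind of restriction I mean. The names and data model are entirely hypothetical (this is not GitHub's implementation): the idea is that a base is only usable for a delta if every fork that can reach the target can also reach the base.

```python
# Hypothetical sketch of a fork-aware delta-base filter; the names and
# data model are mine, not GitHub's code. Each object carries the set
# of forks whose refs can reach it, and a base is admissible for a
# target only if it is visible everywhere the target is.

def admissible_bases(target, candidates, reachable_forks):
    """reachable_forks maps object id -> set of fork names that reach it."""
    needed = reachable_forks[target]
    return [
        base for base in candidates
        # A delta against `base` is only safe if every fork that can
        # serve `target` also has `base` on hand to resolve the delta.
        if needed <= reachable_forks[base]
    ]

# Example: an object only in fork1 can't be delta'ed against a base that
# lives only in fork2, even if that base would have compressed best.
objects = {"a": {"fork1", "fork2"}, "b": {"fork2"}, "c": {"fork1"}}
print(admissible_bases("c", ["a", "b"], objects))  # ['a']
```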

~~~
omouse
This is an actually interesting problem that takes skill and knowledge to
solve, and it produces immense business value too (at least for GitHub).
This is the kind of stuff more software engineers should be working on. I'm
dying over here working on a small React component and some CSS layout
tinkering while my brain has graph traversal algorithms and bitmap indexes on
its mind and can't do anything about them.

~~~
mikekchar
Every large problem is a series of small problems. I've worked on large
complex systems and on small toy problems. It really isn't much different
except for one thing. With toy problems you can get away with making a lot of
mistakes because the interactions are simple. As the system becomes more and
more complex, your execution as a programmer becomes more and more important.
You have to think very deeply about the "why" and you have to be able to test
your assumptions well.

If you have ever played the game of Go, when you first start out the board is
empty. You have to place your stone somewhere, but actually it doesn't really
matter where. Over time, as you place more and more stones, your choices
become more and more important (and ironically, you have more and more
constraints). The weird thing is that, even though it doesn't matter where you
place those first stones, it becomes very important where the stones were
originally placed as the situation unfolds.

Programming is similar. When you first start a project, it really doesn't
matter what you do. Almost everything will work to one degree or another. But
the original decisions gain more and more weight as the project becomes more
and more complex (and you are faced with more and more compromises).
Eventually those original decisions can make or break your project, even
though it didn't matter at first what they were (of course, this is why
refactoring is so important --- but that's a different discussion).

My point (finally) is that even though you may be making simple changes on
simple systems, it doesn't have to stop you from understanding the
implications of your work should the project become larger. Take the
opportunity to polish your skills and make your "opening game" as perfect as
you can make it.

I agree that at some point every programmer must start working on complex
systems in order to grow. If you are at that point and your employer does not
offer complex problems, then maybe it is time to move. However, don't neglect
your "opening game". It is very, very important because, as I said at the
beginning, every large problem is a series of small problems.

~~~
omouse
I only recently started playing Go, and the similarity that you point out is
fantastic.

 _Take the opportunity to polish your skills and make your "opening game" as
perfect as you can make it_

Yep, exactly what I've been doing.

 _If you are at that point and your employer does not offer complex problems,
then maybe it is time to move._

Yep ;-)

------
rakoo
Not specific to the post, but:

> we're sending few objects, all from the tip of the repository, and these
> objects will usually be delta'ed against older objects that won't be sent.
> Therefore, Git tries to find new delta bases for these objects.

Why is this the case? git can send thin packs if the receiver already has the
objects, so why does it still need to find a full base to diff against? (Not
counting when initial base objects are from another fork -- I don't know if
that's often the case.)

On top of that, as far as I understood from the discussion about heuristics
([https://git.kernel.org/cgit/git/git.git/tree/Documentation/t...](https://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-heuristics.txt?id=HEAD)),
it seems like the latest objects are stored full and the earlier objects are
diffed against them (double benefit: you usually want access to the last
object, which is already full, and earlier objects tend to only remove stuff,
not add, because "stuff grows over time"). So if objects are still stored as
packs, things should already be in pretty good shape to be sent as-is... or
not?
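
To make that heuristic concrete, here's a toy Python sketch of the "newest stored whole, older versions as deltas off newer ones" layout. The prefix-copying delta is a deliberately naive stand-in for git's real binary delta format; only the shape of the chain is the point.

```python
def make_delta(base: bytes, target: bytes) -> tuple:
    """Naive stand-in for a binary delta: copy the longest shared prefix
    of `base`, then splice in whatever the target has after it."""
    n = min(len(base), len(target))
    i = 0
    while i < n and base[i] == target[i]:
        i += 1
    return (i, target[i:])

def apply_delta(base: bytes, delta: tuple) -> bytes:
    copy_len, tail = delta
    return base[:copy_len] + tail

# Newest version stored whole; older versions chained as deltas off it.
v3 = b"line1\nline2\nline3\n"   # newest, stored in full
v2 = b"line1\nline2\n"          # older versions shrink going backwards,
v1 = b"line1\n"                 # because "stuff grows over time"

d_v2 = make_delta(v3, v2)       # the delta base is the *newer* object
d_v1 = make_delta(v2, v1)

assert apply_delta(v3, d_v2) == v2
assert apply_delta(v2, d_v1) == v1   # tiny deltas, one full copy
```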

~~~
Anderkent
Their initial test case was a full clone, in which case you can't really send
a thin pack.

The problem, as I understand it, was that when you requested a full clone of
fork1/repo.git, they'd find all objects reachable from refs in that repo, but
git would by default delta some of those objects against objects from other
forks. When it noticed those base objects were not going to be sent to the
client, and the client didn't know about them either, git recovered by doing
expensive delta matching across the objects it was sending; and because it
hadn't traversed the graph beforehand, its heuristics weren't working
properly, so this took forever.
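
Here's a rough sketch of that decision as I understand it (my reading of the behaviour, not git's actual code):

```python
# My reading of the behaviour described above, not git's actual code:
# an on-disk delta is only reusable if its base is in the pack being
# sent or already on the client; otherwise git must re-deltify.

def plan_object(obj, on_disk_base, send_set, client_has):
    if on_disk_base is None:
        return "send-full"            # stored whole, just copy it out
    if on_disk_base in send_set:
        return "reuse-delta"          # cheap: stream bytes from disk
    if on_disk_base in client_has:
        return "reuse-delta-thin"     # thin pack: client resolves the base
    # The base lives only in another fork: neither side of the transfer
    # has it, so git falls back to an expensive delta search over send_set.
    return "recompute-delta"
```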

------
OJFord
This is awesome. I absolutely love <company>engineering blogs like this.

There's no real reason for companies to educate external
devs/hobbyists/students like this, but some do, and it's really awesome.

~~~
artursapek
The reason is recruiting, and it's effective.

~~~
OJFord
Because people read it and think "hey that sounds cool, I want to work there"?

~~~
sophacles
Pretty much.

There are second-order effects as well. Having a good tech blog that talks
about this stuff gains the company mindshare amongst developers. If, as in
the case of GitHub, the developers (etc.) are the customers, then the
translation to market share is pretty straightforward.

When developers (etc.) are not the target audience, they still get warm
fuzzies about the company, and it gains word of mouth; non-tech folks who
seek out a recommendation from their friends will hear about the company with
the good blog. Usually in the form of "Oh yeah, there are a couple of
alternatives, but the tech at $X is top notch, look hard at them".

~~~
OJFord
Hm yeah, that makes sense. So it's not the case that "[there are] no real
reasons" - but I still think it's awesome, and I learn a lot / see the
applications of stuff learned elsewhere in them.

------
delinka
Perhaps I'm a bit ignorant of git's storage and protocols, but what's the
purpose of this initial count? Seems to me it's traversing the tree twice -
once to count and once to send the objects across the network. So why not
traverse the tree, sending objects that need to be sent and ignoring the ones
that don't _instead_ of counting?

~~~
Tobu
You need to know which objects are reachable before you can delta compress
them (or reuse a packed delta chain). I guess you could restrict compression
antecedents to graph ancestors in your pack, but at the cost of not
compressing similar siblings (and blobs are leaves).
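
A minimal sketch of that reachability walk, with a made-up toy object model rather than git's actual structures:

```python
from collections import deque

def count_objects(refs, parents, children):
    """Walk everything reachable from the refs before any packing starts.

    parents: commit -> parent commits; children: commit/tree -> contained objects.
    """
    seen = set()
    queue = deque(refs)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(parents.get(oid, ()))
        queue.extend(children.get(oid, ()))
    return seen  # only now can delta bases be chosen from this set
```

As I read the post, the bitmap index essentially precomputes the result of this walk per commit, so "what's reachable" becomes a bitwise OR over stored bitmaps instead of a graph traversal.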

------
codezero
Hopefully this isn't too off-topic, but does anyone know what software was
used to make the charts in this post?

~~~
benihana
Judging by the watercolors I'm _guessing_ Paper by FiftyThree.

------
politician
The Azure team responsible for the Local Git implementation needs to read this
fantastic article.

I've been putting up with 10-minute deploys due to precisely this issue of
counting objects. It's slow because we don't use Local Git as our source-of-
record repository (because commits initiate a deployment step), so every
deploy involves a clean fetch into a new tmpdir.

At least now I know why our deploys are getting slower and slower.

------
alagappanr
I wonder why GitHub has a separate domain githubengineering.com for this blog
instead of a subdomain like engineering.github.com.

I notice that there is an inactive user account called engineering. If a User
Page were ever created by that account, it would be available at
engineering.github.io.

~~~
spb
GitHub Pages used to be served under github.com subdomains (they were switched
after GitHub realized this was a same-origin security hole), and said
username.github.com subdomains still redirect to github.io for backward
compatibility.

It's not unheard of for GitHub to rename inactive accounts, though: most
likely, they gave this its own domain for something like SEO purposes (as it's
content marketing).

------
0x400614
So does GitHub hire C engineers? Git is implemented in C.

~~~
tmm1
Yes, much of GitHub's core infrastructure is written in C.

~~~
0x400614
That's awesome. You don't find many people that enjoy writing in C. I assumed
it was Ruby.

~~~
scott_karana
Ruby MRI and YARV are both written in C, so to be a true Ruby hacker you need
some C chops. :-)

~~~
klodolph
Interesting. I've heard that thrown around before, but I've never heard
someone say the same thing about Python, even though the main Python
implementation is C.

~~~
scott_karana
Really? Lots of the most famous Python modules are partially implemented in
C(++) for performance reasons too; it's not exclusive to Ruby.

For example, NumPy is 56% C.
[https://github.com/numpy/numpy](https://github.com/numpy/numpy)

------
joshmlewis
The illustrations look like they were done in Paper. Does anyone know if
that's right or not?

~~~
maxmcd
Yep, looks like you are correct:
[http://news.fiftythree.com/post/118777260503/introducing-thi...](http://news.fiftythree.com/post/118777260503/introducing-think-kit)

------
u14408885
Could this technique be used to optimise graph traversal in a PostgreSQL
table? Or are recursive queries still the way to go?

------
base698
Not only is the tech achievement amazing (and something that will save
countless man-hours), it's incredibly well written to boot.

Amazing work.

------
jmount
"When you fork a repository on GitHub, we create a shallow copy of it. This
copy has no objects of its own, but it has access to all the objects of an
alternate ..."

So does this mean one could attack a GitHub repository by having a lot of
shill accounts clone it and add random objects (possibly having a performance
impact on the original)? I understand the engineering need for the use of
alternates, but I wonder about the lowered degree of isolation.
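
Roughly how I understand the alternates mechanism, as an illustrative sketch (the layout here is generic git loose-object storage, not necessarily GitHub's setup):

```python
from pathlib import Path

def read_object(oid: str, object_dirs: list) -> bytes:
    """Try each object store in order, like objects/info/alternates."""
    for objdir in object_dirs:
        candidate = Path(objdir) / oid[:2] / oid[2:]  # loose-object fan-out
        if candidate.exists():
            return candidate.read_bytes()
    raise KeyError("object %s not found in any store" % oid)

# A fork lists its own (initially empty) store first, then the shared
# network store, so objects pushed to *any* fork become visible whenever
# git walks or delta-compresses objects for the original.
```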

