

Linus Torvalds proposes a change to the Git commit object format - avar
http://www.spinics.net/lists/git/threads.html#161308

======
breckinloggins
OK, finally found a decent explanation of what a Git generation number
actually IS:

<http://www.spinics.net/lists/git/msg161165.html>

~~~
andrewflnr
So it's almost, but not quite, like the revision numbers everyone else has
always had?

~~~
iclelland
Yes -- almost, but not quite. If you and I each create a branch off of a
commit with gen #123, then, as I understand it, the subsequent commits in my
branch would be #124, #125, etc., and your commits in your branch would _also_
be #124, #125, etc.

Contrast this with CVS, where I would have 1.124.1.1, 1.124.1.2, etc., and you
would have 1.124.2.1, 1.124.2.2, or with Subversion, where I might get
revisions 125, 128, and 129, while the server gave your commits #124, 127 and
130, and someone else, on a totally different part of the project got #126.

As long as development proceeds linearly, on a single branch, then yeah, it's
about the same as revision numbers in a centralized RCS -- once you start
branching and merging, though, it represents a different concept entirely.
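
A minimal sketch of the branching case described above, assuming the rule discussed elsewhere in the thread (a root commit is generation 0, every other commit is one more than the highest generation among its parents). `Commit` and `make_commit` are made-up names for illustration, not anything from git itself:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Commit:
    name: str
    parents: Tuple["Commit", ...]
    generation: int


def make_commit(name: str, *parents: Commit) -> Commit:
    # Assumed rule: a root is generation 0; any other commit is one more
    # than the highest generation among its parents.
    gen = 1 + max((p.generation for p in parents), default=-1)
    return Commit(name, parents, gen)


# Build a linear history up to a shared commit with generation 123.
tip = make_commit("c0")
for i in range(1, 124):
    tip = make_commit(f"c{i}", tip)
base = tip

# Two people branch from the same commit and commit independently.
mine = [make_commit("mine-1", base)]
mine.append(make_commit("mine-2", mine[-1]))
yours = [make_commit("yours-1", base)]
yours.append(make_commit("yours-2", yours[-1]))

print(base.generation)                # 123
print([c.generation for c in mine])   # [124, 125]
print([c.generation for c in yours])  # [124, 125]
```

Both branches end up with the same numbers, so a generation number alone does not identify a commit the way a Subversion revision does.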

~~~
jbert
For a single repository, it does have a very similar interpretation to, say,
svn revnos.

You can speak of "revision #125 of a branch" in a specific repository. Which
is generally exactly what you need for human-to-human communication about
development.

"Can you see if that bug is in r125 of unstable?" "I've got all changes up to
r245 of prod"

I guess the confusing aspect would be if "r245 of prod" in the central server
was "r100 of prod" in my local repo because I haven't cloned the full history?

~~~
noste
It would appear to me that multiple commits in a branch can have the same
generation number (see the diagram in
<http://www.spinics.net/lists/git/msg161165.html> ). So unless your history is
linear, using generation numbers in human-to-human communication may get
confusing really fast.

~~~
jbert
In that diagram, I see 2 sequences of commits (two branches).

The 'original branch' (e.g. "unstable") goes: 0,1,2,5,6

Then there is a topic branch (e.g. "add-frob"), which goes: (0,1),2,3,4,(5).
Note that I consider the 'add-frob' branch ended at the merge commit, so
there is no "revno 6 of 'add-frob'").

I don't consider that merging 'add-frob' back into unstable means that "revno
2 in unstable" could mean commit D - I would call commit D "revno 2 in
add-frob".

Does that system work?

------
rlpb
What I like about git is that it stores only the minimum amount of
information, and this makes it easy to explain. A commit hash is a hash of
canonical information, not of derived information.

It seems really ugly to store derived information in a commit (specifically,
that the hash would be altered by it).

It seems that Jeff has said the same thing, but Linus disagrees. Vocally.

<http://www.spinics.net/lists/git/msg161336.html>

~~~
ryannielsen
From my understanding, they're essentially adding this as an additional bit of
information that's minimally required. The currently used timestamps are error
prone and thus will be replaced by generation numbers which are more robust.
They're still adhering to the principle of only storing the minimum amount of
information, they're just adding generation numbers to that set.

In fact, you could make the argument that _timestamps_ are the derived
information that git has been storing all along while generation numbers are
the canonical information which should have been stored from the beginning.
Generation numbers are a result of the state of the tree, while timestamps are
derived from the ambient (and potentially incorrect!) environment from which
the commit was made.

~~~
rlpb
Well, generation numbers can be determined by counting up through parent
commits. So they are derived information, it's just that that takes ages and
lots of disk seeks to count through.

Timestamps aren't really needed. They are information that is useful to us
that we want to store, just like the date in an email. Thus they are as
required as the names of the author and committer.

The reason for the discussion about the commit timestamps is (AIUI) a
heuristic optimisation that works because they happen to be there and happen
to (most of the time) be in order.

~~~
ryannielsen
By that definition of "derived information", the hash is "derived information"
since it's based off the changes made to source data (whatever that data may
be).

That said, point taken about the necessity of both generation numbers and
timestamps. But that invalidates the OP's comment about git storing "only the
minimum amount of information". It sounds like that's never been a hard
principle.

~~~
davvid
git does store "only the minimum amount of information".

Here's what Linus had to say about it:

> Generation numbers are _completely_ redundant with the actual structure

> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole history" to
that statement.

And we've _never_ parsed the whole history, because it's just too expensive
and doesn't scale. So right now we depend on commit dates with a few hacks.

So no, generation numbers are not at all redundant. They are fundamental. It's
why we had this discussion six years ago.

From: <http://www.spinics.net/lists/git/msg161348.html>

~~~
ryannielsen
Thanks for the background! (Seriously, not trying to be snarky.)

That info does support my original point that generational numbers probably
should have been stored from the start and timestamps are the more
"derivative" bit of information since it comes from the environment and not
the data itself.

Thus, rlpb's concern that storing generational numbers pollutes its design of
storing "only the minimum amount of information" isn't necessarily well
founded, since the generational number might be more minimal and correct than
the current timestamp. That was the aim of my original post: generational info
is fundamental, not extraneous derived info, and probably should have been stored
with commits in the first place.

------
mscarborough
I don't generally come across Linus' dev threads, and when I do it's usually in the
context of some linkbaity 'watch Linus smack this dude down' or something of
that nature.

This reads like a really productive thread from my limited understanding of
git internals. It's pretty cool how much good engineering thought is going
into this proposal.

Maybe that's why git rocks so hard.

~~~
jacknagel
Minor smackdown from this thread here:
<http://article.gmane.org/gmane.comp.version-control.git/177186>

------
gregschlom
Ah! I knew I was going to stumble upon Linus' signature "that's total and
utter bullshit" somewhere: <http://www.spinics.net/lists/git/msg161348.html>

------
pyre
I like the suggestion of storing the generation numbers in the pack index.
When you generate a pack you're already parsing the entire tree. That makes
more sense than requiring all future git objects to have 'generation numbers'
jammed into them. Especially because it introduces an incompatibility with
current git objects, which it would probably be best to avoid.

------
cypherpunks01
What operations would be sped up by having generation numbers?

I see Jeff King's message that they would make certain bounding traversals
faster, but when do bounding traversals need to be computed when I'm using git
day-to-day?

~~~
Rauchg
It's also about making git less error-prone than the current timestamp
approach seems to be.

------
nplusone
Change last name to 'Torvalds' (edit: name in title changed)

------
derrickpetzold
I was wondering how they got along without generation numbers for so long. It
was by comparing timestamps and those are unreliable because systems can be
misconfigured. How they are going to handle legacy repos with that problem I
still don't get. I am guessing that history is f'd.

~~~
mdwrigh2
New versions of git will actually go back and generate this information for
old commits. This will lead to git being slightly slower when in old
repositories until all the commits contain the generation information, but
that should happen fairly quickly.

~~~
derrickpetzold
I was talking about the case where the timestamps are off. I don't think there
is any way to fix that.

~~~
ajuc
You just go the whole way up to the root node counting parents (taking max
length when there are many routes), no problem with ambiguity. The problem is -
it's slow.
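
A rough sketch of that full walk, assuming the history is available as a plain mapping from commit name to parent names. The `HISTORY` graph and the function name are hypothetical; the graph only loosely follows the two-branch shape discussed earlier in the thread:

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest distance from each commit to a root, following parent links."""
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Resolve parents first; an explicit stack avoids deep recursion.
                stack.extend(pending)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=-1)
                stack.pop()
    return gen


# Hypothetical history: mainline A-B-C, topic branch D-E-F off B,
# merge M of C and F, then N on top of the merge.
HISTORY = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["B"], "E": ["D"], "F": ["E"],
    "M": ["C", "F"], "N": ["M"],
}

print(generation_numbers(HISTORY))
```

Here C and D both come out as generation 2, and the merge M is 5 because the longer route runs through the topic branch. Every commit has to be visited once, which is the "slow" part on a large repository.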

------
breckinloggins
Can someone explain what generation numbers are? Googling "git generation
numbers" pulls up mostly this discussion thread.

I'm assuming they're easy-to-remember incremental numbers tied to commit? Like
1, 2, 3, or tied to commit and branch, like master/1, etc.?

~~~
Kliment
Here is how I understand the problem.

At the moment, each commit stores a reference to the parent tree. By parsing
that tree and reading the entire history you can obtain a hierarchy of
commits. Because you need to order commits in many situations, reading the
entire history is extremely inefficient, so git uses timestamps to determine
the ordering of commits. This of course fails if the system clock on a given
machine is off. With a generation number, you can get an ordering locally from
the latest commits, without having to rely on timestamps or read the entire
tree.

When you have a commit with generation n, any later commits that include it
would have generation >n, so to tell the relation between commits, you only
need look as far back as n, and you can immediately get the order of any
intermediate commits. It has nothing to do with "easy to remember". It's about
making git more efficient and robust.
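
To illustrate the "only look as far back as n" idea, here is a small sketch of an ancestry check that stops descending once it reaches commits at or below the target's generation. It assumes the generation numbers are already stored somewhere; the function, the commit names, and the 1000-commit history are all made up for the example and are not git's actual traversal code:

```python
from typing import Dict, List, Tuple


def is_ancestor(
    ancestor: str,
    descendant: str,
    parents: Dict[str, List[str]],
    gen: Dict[str, int],
) -> Tuple[bool, int]:
    """Return (answer, number of commits visited)."""
    seen = set()
    stack = [descendant]
    visited = 0
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        visited += 1
        if commit == ancestor:
            return True, visited
        # Every ancestor of `commit` has a strictly smaller generation, so a
        # commit at or below the target's generation cannot lead to it.
        if gen[commit] <= gen[ancestor]:
            continue
        stack.extend(parents[commit])
    return False, visited


# Hypothetical history: a linear mainline c0..c999 plus one commit "side"
# branching off c998. Generations are assumed to be precomputed.
parents = {f"c{i}": ([f"c{i - 1}"] if i else []) for i in range(1000)}
gen = {f"c{i}": i for i in range(1000)}
parents["side"] = ["c998"]
gen["side"] = 999

print(is_ancestor("side", "c999", parents, gen))  # (False, 1)
print(is_ancestor("c990", "c999", parents, gen))  # (True, 10)
```

Without the generation check, the negative answer would require walking all the way back to c0 before giving up.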

------
Peaker
Why call it a "generation number" and not "depth"?

~~~
rs
Think they're using "generation" here in the context of number of parents
(yes, "depth" would work fine as well, but is a more general term)

~~~
pyre

      > number of parents
    

s/parents/ancestors/

But that statement only holds up on branches without any merge commits.
Because the actual algorithm does not just total up the number of ancestors.

------
macrael
I'd love an explanation of what "generation numbers" are.

~~~
pufuwozu
Looking at the diff, it seems like a generation number is just a number the
parents that a commit has. For example:

Commit a553af has no parents, it has generation number of 0.

Commit c464e0 is the next commit, it has generation number 1.

And so on. Branches count independently of one another. When commits have
multiple parents (e.g. merges), the generation number starts counting from the
previous maximum.

~~~
pyre

      > just a number the parents that a commit has
    

It's the max length from a commit to reach a root node in the graph. So any
time that you hit a fork in the path you take the longest route.

