
Git is a purely functional data structure (2013) - eisokant
https://blog.jayway.com/2013/03/03/git-is-a-purely-functional-data-structure/
======
BenoitEssiambre
I sometimes like to explain things the other way around: immutability is
version control for program state.

I rarely use functional programming but I certainly see its appeal for certain
things.

I think the concept of immutability confuses people. It really clicked for me
when I stopped thinking of it in terms of things not being able to change and
started instead to think of it in terms of each version of things having
different names, somewhat like commits in version control.

Functional programming makes explicit not only which variables you are
accessing, but which version of each.

It may seem like you are copying variables every time you want to modify them,
but really you are just giving different mutations different names. This
doesn't mean things are actually copied in memory. The compiler doesn't need
to keep every version. If it sees that you are not going to reference a
particular mutation, it might just physically overwrite it with the next
mutation. In the background, "var a=i, var b=a+j" might compile as something
like "var b = i; b+=j".
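
A minimal Python sketch of the "each version has its own name" view. The helper
`with_item` and the names `v1`, `v2`, `v3` are made up for illustration, and
Python really does copy the tuple here; the point is only the naming, not what
an optimizing compiler would do with the memory.

```python
def with_item(items: tuple, item) -> tuple:
    """Return a new tuple with item appended; the input is left untouched."""
    return items + (item,)

# Each "modification" yields a new value under a new name,
# somewhat like each commit getting its own identifier.
v1 = (1, 2)
v2 = with_item(v1, 3)
v3 = with_item(v2, 4)

print(v1)  # (1, 2)
print(v2)  # (1, 2, 3)
print(v3)  # (1, 2, 3, 4)
```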

~~~
runeks
> I think the concept of immutability confuses people.

I think it confuses people because it’s framed oddly. Immutability isn’t about
being unable to mutate state, it’s about no longer using containers
(registers) as variables, such that the equal operator actually means “equals”
as opposed to “store”.

In most programming languages, the equal operator works as a “store”
operation, which stores a value in a named container/register. In “immutable
by default” languages like Haskell, the equal operator actually means
“equals”, as in “is synonymous with”.

The essence of immutability is referencing values directly, through synonyms,
as opposed to storing them in named registers for later retrieval. When it’s
done this way, asking about immutability no longer makes sense: is the number
_3_ mutable? Can the number _3_ be mutated into _4_, or are they just two
distinct numbers?

~~~
canes123456
I think I understand the point you are making but this is way more confusing
for me. Partly this can be blamed on imperative languages and = vs ==.

~~~
giesch
In a procedural language:

x == 3 means: Take the value out of the box x. Is it 3?

x = 3 means: x is a box, put 3 in it. The 3 lasts forever, or until someone
changes it.

In a functional language:

x = 3 means: Take the value out of box x. Is it 3?

let x = 3 in [...] means: x is a box, put a 3 in it. The 3 lasts for only the
[...], but cannot be undone.

~~~
nybble41
Correction: In a functional language (specifically Haskell):

    x = 3 means: x is defined as 3.
    x == 3 means: Is x equal to 3?
    let x = 3 in [...] means: within the scope of [...], x is defined as 3.

There is no specific point where something is "put into the box" or "taken out
of the box". In fact, there is no "box". The functional programming concept of
"bindings" does not correspond to the imperative concept of "variables", not
even the oxymoron known as "constant variables". (If you want variables in
Haskell you need IORef or STRef, where the load and store operations are
explicit monadic actions.)

------
Edmond
Perhaps for those familiar with "functional data structures" such an analogy
is helpful but I find it easier to simply explain git for what it is without
adding more exotic nomenclature to it.

Git lets you do version control via full snapshots as opposed to just tracking
diffs (even though it does actually do this too behind the scenes).

You can think of a full snapshot as saving a copy of your project structure
every time you do a commit. The key trick is that git doesn't actually create
new copies of the content for each commit but simply maintains a tree
structure whose nodes are pointers (via hashing) to the content they
represent.

The complication with git is not in understanding the core concepts but in
knowing how best to apply them. There are all sorts of crazy workflows you
could implement by manipulating git pointers and their associated patches. As
with anything that is flexible, the difficulty comes in knowing how to
constrain yourself when using it.

~~~
pdonis
_> git doesn't actually create new copies of the content for each commit_

More precisely, it doesn't create new copies of content that you didn't
change. For example, if you have 100 files in your repo and you change one of
them and then commit, git creates a new copy of the content of the file you
changed--a new blob storing the new file content--and a new tree object that
references the new blob instead of the old one, plus the other 99 blobs that
store the contents of files you didn't change; the new commit object then
references the new tree object (plus the message and metadata). But git never
stores diffs between old and new content; it just creates a new blob every
time the content of a file changes.
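
The 100-file example can be sketched in Python as a toy content-addressed
store. This is only an illustration under simplifying assumptions: the helpers
`put`, `put_tree` and `put_commit` and the text layout of each object are
invented here and do not match git's real object format.

```python
import hashlib

store = {}  # object id -> content; an id always maps to the same content

def put(kind: str, payload: str) -> str:
    """Store an object under the SHA-1 of its content and return that id."""
    data = f"{kind}\n{payload}"
    oid = hashlib.sha1(data.encode()).hexdigest()
    store[oid] = data
    return oid

def put_tree(files: dict) -> str:
    """Store one blob per file, then a tree listing name -> blob id."""
    entries = sorted((name, put("blob", text)) for name, text in files.items())
    return put("tree", "\n".join(f"{name} {oid}" for name, oid in entries))

def put_commit(tree_id: str, parent, message: str) -> str:
    """Store a commit that references a tree and its parent commit."""
    return put("commit", f"tree {tree_id}\nparent {parent}\n\n{message}")

files = {f"file{i}.txt": f"content {i}" for i in range(100)}
c1 = put_commit(put_tree(files), None, "initial")
count_before = len(store)  # 100 blobs + 1 tree + 1 commit

# Change one file and commit again.
files2 = dict(files, **{"file7.txt": "changed"})
c2 = put_commit(put_tree(files2), c1, "edit file7")
new_objects = len(store) - count_before

print(new_objects)  # 3: one new blob, one new tree, one new commit
```

The 99 unchanged files hash to the ids they already had, so the second tree
simply points at the existing blobs.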

~~~
jdmichal
> But git never stores diffs between old and new content; it just creates a
> new blob every time the content of a file changes.

Git pack files compress objects by storing them as diffs going backwards.
That is, git stores the most recent state in full, then uses patches to go
backwards, because you're more likely to need a recent version in full than an
older one.

[https://git-scm.com/book/en/v2/Git-Internals-Packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)

~~~
emmelaich
This is true but packfiles are an implementation detail.

It's still useful and more accurate conceptually to consider every commit as a
complete snapshot of the state of the code at that point.

~~~
jdmichal
That can be said of every version control system. Restoration of state to any
given version is their defining feature. How they achieve that is always an
implementation detail, but those details can still be important and
interesting.

~~~
bbatha
Git commits are composed of all of the files in the commit, its parent, and
the commit message. This is an important guarantee that each checkout is valid
without the rest of the repo. It lets a lot of exotic implementations
guarantee consistency between them. If you're GitHub, you can distribute
commits across many servers. If you're Microsoft, you can build partial
checkouts for GVFS. It’s what allows Git LFS to keep many of git’s core
guarantees while making tradeoffs to improve areas where git is traditionally
weak.

------
kazinator
Git is a purely functional data structure, except for the mutating head
pointers, rewriting of tags, various state in the repo related to things like
on-going rebases, cherry picks, bisects, ... oh and the index which is one
object changing in-place (not to mention working tree, of course).

~~~
vickychijwani
While what you say is technically accurate, I think you're missing the bigger
picture here. The author's point still stands: git's commit history (arguably
the most important part of a version-control system) can be viewed as a
purely-functional data structure, and that view has practical benefits too. I
tried to explain more here:
[https://news.ycombinator.com/item?id=15892013](https://news.ycombinator.com/item?id=15892013)

------
rubenbe
I often recommend people to read "Git Internals". If you know how git works
internally, it's much easier to understand how it works and the reasons behind
it.

[https://git-scm.com/book/en/v1/Git-Internals](https://git-scm.com/book/en/v1/Git-Internals)

~~~
randomsearch
This suggests a poor abstraction.

~~~
goialoq
"Internals" is a poor choice of term. "Data structure" is a better term. Git
is "plumbing and porcelain". The plumbing is the core of git. Porcelain are
shortcuts. In general, Torvalds projects (Linux, Git) aren't big on
abstractions that maximize simplicity-of-use, they focus on doing complex
things correctly and quickly. Adding abstraction makes it hard to get details
correct and run quickly.

~~~
randomsearch
Agreed. Git seems like a good internal design with a terrible interface.

------
mannykannot
I think the author has a point in saying that learning Git by trying to map it
to Subversion is not the best way to do it, but I don't think analyzing it as
a functional data structure adds much insight. To me, it is easier to
understand when you look at its purpose, and how it solves the problems of
that domain - and the biggest difficulties of version control are on account
of the problem being essentially one of distributed, lockless concurrency,
something not mentioned in this article.

------
icc97
I found the explanation from the Immutable JS presentation easier to
understand when talking about Immutable data structures [0]

[0]: [https://youtu.be/I7IdS-PbEgI?t=5m7s](https://youtu.be/I7IdS-PbEgI?t=5m7s)

------
shurcooL
Even after 4 years, this remains my favorite, most influential article that
helped me understand and feel comfortable with git. It's just a very good
analogy.

------
dustingetz
for a real database that works like git, see
[http://www.datomic.com](http://www.datomic.com)

if git killed svn, datomic kills postgres

~~~
xj9
if it were open source maybe, but i'm definitely not going to switch (even
though i might want to) for licensing reasons. in fact, i have _more_
motivation to write a libre datomic clone than to pay cognitect anything for
their proprietary db.

~~~
dustingetz
do it!

~~~
greghendershott
Depending on the definition of "it", someone has:
[https://github.com/tonsky/datascript](https://github.com/tonsky/datascript)

~~~
dustingetz
datomic is a distributed system, datascript is an in memory data structure (it
is very cool though, i use it)

------
ioquatix
.... and one day I had the crazy idea to make a database on top of it:
[https://github.com/ioquatix/relaxo](https://github.com/ioquatix/relaxo)
because the underlying immutable data structure makes this quite feasible.

------
snissn
Is git a blockchain?

~~~
linschn
Short answer, yes. The current commit includes the hash of its parent(s), so
its own hash reflects the whole history, and one can not change the history
without also changing the current hash. Just like a block contains the hash of
the previous block.
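
A small Python sketch of why the tip hash reflects the whole history. The
functions `commit_id` and `chain` are hypothetical, SHA-256 is chosen only for
the sketch, and a real commit hashes a tree, metadata and possibly several
parents rather than a single string.

```python
import hashlib

def commit_id(parent_id: str, content: str) -> str:
    """The id covers the parent's id as well as the commit's own content."""
    return hashlib.sha256(f"{parent_id}|{content}".encode()).hexdigest()

def chain(contents):
    """Return the list of ids for a linear history with the given contents."""
    ids, parent = [], ""
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids

original = chain(["a", "b", "c"])
tampered = chain(["a", "B", "c"])  # same history with the middle commit altered

print(original[0] == tampered[0])  # True: history before the change is intact
print(original[2] == tampered[2])  # False: the tip id changes as well
```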

~~~
sparkie
That's a Merkle Tree. A blockchain is an application of a Merkle tree in which
each node contains transaction data, and a majority of clients agree that the
longest chain of blocks is the correct one.

Git also uses a Merkle-DAG, but it is not a blockchain.

~~~
mbrock
_a majority of clients agree that the longest chain of blocks is the correct
one_

If you squint, that's kind of true for Git repositories, too.

The version with the most "proof of work" on it is likely the master branch.

Of course the incentives are very different... but still, the similarity is
somewhat illuminating, I think.

------
doug1001
well the top-level git data structure is pretty close to eg, Scala's Vector,
which is an immutable container implemented as a tree with a high branching
factor of 32. Modification to such a vector, rebound to a new variable, relies
on structural sharing of the original
([http://www.codecommit.com/blog/scala/implementing-
persistent...](http://www.codecommit.com/blog/scala/implementing-persistent-
vectors-in-scala))
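
Structural sharing of this kind can be sketched in Python with nested tuples.
This is a toy fixed-depth tree with branching factor 4 rather than 32, and the
functions `build`, `get` and `assoc` are invented for the sketch; it is not how
Scala's Vector is actually implemented.

```python
BITS = 2               # branching factor 4 here; the comment above mentions 32
WIDTH = 1 << BITS
MASK = WIDTH - 1

def build(depth: int, fill=0):
    """Build a full tree of the given depth as nested tuples."""
    if depth == 0:
        return fill
    child = build(depth - 1, fill)
    return (child,) * WIDTH

def get(node, depth: int, index: int):
    """Walk from the root to the leaf, using BITS bits of index per level."""
    for level in range(depth - 1, -1, -1):
        node = node[(index >> (level * BITS)) & MASK]
    return node

def assoc(node, depth: int, index: int, value):
    """Return a new tree with index set to value, copying only one path."""
    if depth == 0:
        return value
    slot = (index >> ((depth - 1) * BITS)) & MASK
    new_child = assoc(node[slot], depth - 1, index, value)
    return node[:slot] + (new_child,) + node[slot + 1:]

DEPTH = 3                        # 4**3 = 64 slots
v1 = build(DEPTH)
v2 = assoc(v1, DEPTH, 37, "x")   # "modify" slot 37, bound to a new name

print(get(v1, DEPTH, 37))  # 0: the original is unchanged
print(get(v2, DEPTH, 37))  # x
print(v1[0] is v2[0])      # True: the untouched subtree is shared, not copied
```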

------
erikb
A data structure cannot be functional. I understand what he's trying to say,
and agree with most of it, but the word "functional" is purely wrong. What he
wants to say is "good". But not all "functional programming" is good, nor is
all good programming functional by necessity, despite what your local Lambda
The Ultimate nerd tries to tell you.

The Best, when it comes to data structures, is a Directed Acyclic Graph. For
instance your typical Linux filesystem is a DAG. But there's one problem with
DAGs: when they reach a certain complexity, human brains are not fit enough to
parse them anymore. (Programs still can, though.)

So in many circumstances at least a human programmer needs to take a look at
the state of your program and make assumptions about its correctness, which is
called debugging. And that's why in Good programs we often use Good data
structures instead of The Best.

Good data structures are key->value stores (which you may know as "hash
tables" or "dictionaries"), trees, and trees in a simplified special form:
lists, each of them being somewhat able to represent the other two, if one can
accept a performance hit and/or increased complexity in source code.
Dictionaries, trees, lists. That's it. And you do that in every programming
language that is at least a little bit interested in being Good.

So there's nothing special or functional about git's data structures, it's
just normal Good programming, and a few programmers who are so good at
programming that they don't even need to mention it anymore, they breathe good
programs.

Then of course to the normal bread-earning coder good programs are a rare
sight. But the reason is not that they are really rare, the reason is that
successful business doesn't really require Good programs to succeed. Mediocre
programs are good enough to earn their rent, and most of us spend most of our
coding hours to earn our rent.

All that being said, if you don't just want to make money, go and spend some
time studying git internals. It will teach you a lot more than most of your
teachers/professors taught you combined. Sadly the source code is written by
Linux gurus, who like to encrypt their source code with a very special key
that only people from their tribe can understand. But the Git Book is actually
good enough that you can study quite a lot of the internals from that book. I
also suggest writing your own git in your favorite programming language once,
to really understand it.

~~~
how_gauche
> A data structure cannot be functional. I understand what he's trying to say,
> and agree with most of it, but the word "functional" is purely wrong. What
> he wants to say is "good". But not all "functional programming" is good, nor
> is all good programming functional by necessity, despite what your local
> Lambda The Ultimate nerd tries to tell you.

Why is it that whenever functional programming comes up, "real programmers"
come out of the woodwork to bloviate, only to expose just how little they
actually know about the topic?

"Purely functional" in this case is a jargon term that relates to Okasaki's
1996 PhD thesis, available online
([https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf](https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf)).
It's a classification for data structures much in the way "regular" or
"context-free" or "Turing-complete" serve as classifications for grammars.

For a data structure to be "purely functional" (and no, article author does
not mean "good" here) simply means that it's implementable in a pure
functional language without mutation. Examples of non-functional data
structures would be ones where mutation is intrinsic to how its algorithms
work: traditional linear probing hash tables, union-find for graphs, etc.

By this technical criterion, the git object graph is clearly a purely
functional data structure, sorry.

~~~
erikb
See the other leaf of this purely functional comment tree for my answer.

~~~
erikb
Since there is even another branch now, I paste it:

I feel like I already said "I understand but I disagree". A round wheel is
more useful for a car than a triangular wheel. That doesn't mean it's a "car
wheel". It's just as good on a horse wagon or a bike.

------
fpoling
Git is not a purely functional data structure [1]

[1] man git-rebase

~~~
19870213
But git-rebase does not alter existing commits in the commit tree, it simply
creates a new branch (meaning new commits) on the tree.
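
A toy Python model of that claim. The `Commit` type and the `add` and `rebase`
helpers are hypothetical, a "commit" here is just a parent id plus a label, and
conflicts and merges are ignored.

```python
import hashlib
from typing import NamedTuple, Optional

class Commit(NamedTuple):
    parent: Optional[str]
    change: str

commits = {}   # id -> Commit; entries are added, never modified
branches = {}  # branch name -> commit id; this is the mutable part

def add(parent, change) -> str:
    """Create a commit whose id depends on its parent and its change."""
    cid = hashlib.sha1(f"{parent}:{change}".encode()).hexdigest()[:8]
    commits[cid] = Commit(parent, change)
    return cid

def rebase(branch: str, onto: str, base: str) -> None:
    """Replay the commits after base on top of onto, then move the branch."""
    todo, cid = [], branches[branch]
    while cid != base:
        todo.append(commits[cid].change)
        cid = commits[cid].parent
    new_tip = onto
    for change in reversed(todo):
        new_tip = add(new_tip, change)
    branches[branch] = new_tip

root = add(None, "root")
branches["master"] = add(root, "fix on master")
old_tip = add(add(root, "feature 1"), "feature 2")
branches["feature"] = old_tip

snapshot = dict(commits)
rebase("feature", onto=branches["master"], base=root)

print(all(commits[c] == snapshot[c] for c in snapshot))    # True
print(old_tip in commits, branches["feature"] != old_tip)  # True True
```

Every commit that existed before the rebase is still present and unchanged;
only the branch pointer moved.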

~~~
Simon_says
Alright wiseguy, git gc.

~~~
davidcuddeback
I don't understand what point you're trying to make. Unreachable data can be
garbage-collected in Haskell, ML, Clojure, Erlang, and many other functional
(and non-functional) programming languages. What about GC is supposed to
refute that a data structure is functional?

~~~
Simon_says
19870213's point was that a git rebase is functional because it doesn't change
existing commits, rather it only makes new ones and moves branch pointers.
This is correct. You can also say that git gc is functional because it's only
garbage collecting. But you can't have it both ways. The cumulative effect of
rebase followed by gc is to delete commits that have a branch pointing to
them.

~~~
rootlocus
> The cumulative effect of rebase followed by gc is to delete commits that
> have a branch pointing to them.

This is both false and irrelevant.

It's false because after the rebase, the branch you just rebased won't point
to the old commits. Unless you had other branches there, they would be
orphaned. Also, according to the "Notes" section of the git-gc documentation
[0]:

> git gc tries very hard not to delete objects that are referenced anywhere in
> your repository. In particular, it will keep not only objects referenced by
> your current set of branches and tags, but also objects referenced by the
> index, remote-tracking branches, refs saved by git filter-branch in
> refs/original/, or reflogs (which may reference commits in branches that
> were later amended or rewound).

Since the commit you were on before making the rebase is in the reflog, it
will actually not be GCed (yet) even though there are no branches pointing to
it.

It's irrelevant because, even if it was true, as long as there are no more
objects referencing that commit, it's perfectly eligible for gc. I don't
understand your argument that "you can't have it both ways".

[0] [https://git-scm.com/docs/git-gc](https://git-scm.com/docs/git-gc)
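
The reachability argument can be sketched in Python. This is a simplified
model, not git's implementation: the functions `reachable` and `gc` are
invented for the sketch, each commit has a single parent, and the reflog is
just a list of commit ids.

```python
def reachable(commits: dict, roots) -> set:
    """All commit ids reachable from roots by following parent links."""
    seen, stack = set(), list(roots)
    while stack:
        cid = stack.pop()
        if cid is not None and cid not in seen:
            seen.add(cid)
            stack.append(commits[cid])
    return seen

def gc(commits: dict, branches: dict, reflog: list) -> dict:
    """Keep only commits reachable from a branch or a reflog entry."""
    keep = reachable(commits, list(branches.values()) + reflog)
    return {cid: parent for cid, parent in commits.items() if cid in keep}

# commit id -> parent id; "feature" was rebased from f1 to f1_rebased
commits = {"root": None, "m1": "root", "f1": "root", "f1_rebased": "m1"}
branches = {"master": "m1", "feature": "f1_rebased"}

with_reflog = gc(commits, branches, reflog=["f1"])
without_reflog = gc(commits, branches, reflog=[])

print("f1" in with_reflog)     # True: still referenced by the reflog
print("f1" in without_reflog)  # False: nothing references it any more
```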

------
ianamartin
I find this conversation fascinating because there is so much disagreement on
the meaning of "functional" and "immutable".

What I've gathered so far from reading the article and the comments is that
some people who are in the know about a very specific paper agree that Git is
a purely functional data structure. And that others look at the ways you can
use Git and point a finger and say, "Look! It can be mutated! Therefore it
cannot be functional!" And the response to that is, "Don't be so technical
about how you define functional. Or immutable. You know it when you see it."

Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why is
this so difficult to get a handle on?

If a thing says immutable on the tin, and it's mutable, how is that purely
functional? I know, read the paper. I know. But still, it's a legit question.

It seems to me that a data structure so amazing as being purely functional
shouldn't be so easy to misunderstand as what we're seeing here. And it's
clearly being misunderstood. And not only by me.

~~~
how_gauche
> some people who are in the know about a very specific paper

This stuff isn't obscure just because you don't happen to know about it
already. Chris Okasaki's publications have been cited over 1000 times, mostly
for his PhD work -- those papers, especially on functional data structures and
amortization analysis in lazy languages, are considered foundational for a
whole research area in Computer Science.

> Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why
> is this so difficult to get a handle on?

Did you learn calculus overnight, or expect to understand a technical
conversation about differential equations and Cauchy sequences without taking
two years of classes or reading a couple of really thick books first? Why
should this be any different?

> If a thing says immutable on the tin, and it's mutable, how is that purely
> functional? .... It seems to me that a data structure so amazing as being
> purely functional shouldn't be so easy to misunderstand as what we're seeing
> here. And it's clearly being misunderstood. And not only by me.

Sigh. Explaining it properly would involve a tour through the untyped lambda
calculus, simply-typed lambda calculus and the Curry-Howard isomorphism, a
discussion on denotational vs. operational semantics (Hoare logic, functional
interpreters, type-preserving compilation, small-step vs large-step
operational semantics, etc).

TL;DR: "purely functional" is a description of the program's meaning (in a
technical sense), not its implementation.

~~~
ianamartin
Okay, so if a program means for an object to be immutable but it actually is
mutable then it’s still immutable if you explain it in terms of basic CS
theory.

Got it. Thanks for the attitude.

