
The Biggest and Weirdest Commits in Linux Kernel Git History - gary_bernhardt
https://www.destroyallsoftware.com/blog/2017/the-biggest-and-weirdest-commits-in-linux-kernel-git-history
======
curuinor
Clauset Shalizi Newman 2007 has not-nice things to say about the classic
physicist's idiot trick of fitting power law distributions by drawing a
straight line on a log-log graph: it's got huge bias.
[https://arxiv.org/abs/0706.1062](https://arxiv.org/abs/0706.1062)

However, the other difficult thing about power law distributions is that
confirming a distribution really is a power law can take an enormous amount
of data. So their critique is very strong, given how little data most studies
have. Often even computer systems, with their overflowing reams of data, do
not provide enough. Note that the paper I cited up there suggests MLE followed
by a Kolmogorov-Smirnov test, so it'll say a lot of things aren't power laws
that could well be.
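
A minimal sketch of the MLE-plus-KS idea from that paper, assuming the continuous power-law form and a hand-picked `x_min` (the paper also selects `x_min` by minimizing the KS distance and bootstraps a p-value, both omitted here). The function names are made up for illustration, and commit sizes are integers, so the continuous formula is only an approximation for them.

```python
import math


def fit_power_law_mle(samples, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail samples equal x_min")
    return 1.0 + len(tail) / log_sum


def ks_distance(samples, x_min, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    tail = sorted(x for x in samples if x >= x_min)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        # Fitted CDF of a continuous power law with exponent alpha.
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs((i + 1) / n - model))
    return worst
```

Unlike a least-squares line on a log-log plot, the estimate uses every tail sample with equal weight, and the KS distance gives a single number for how far the fitted curve strays from the data.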

Another way to look at it is from a more geometric point of view. For many
systems the metric entropy equals the sum of the positive Lyapunov exponents
(Pesin's identity), and as an "entropy" that quantity does have a lot of
commonalities with the other entropies. But to have positive Lyapunov
exponents is often to have chaotic dynamics, so one could conjecture
that the time series of commits and octopus merge sizes in kernel git history
is chaotic, so the evolution of the time series will be fractal in nature.

But it's also really fucking hard to confirm or deny that one, because there
are varied and strange definitions of chaos itself and the methods that have
been suggested to measure Lyapunov exponent in real systems are arcane and
difficult. You could try some synchronization methods, but they remain arcane
and crap. Fractal measurement methods are also shitty and full of dark magic.

One neat little trick might be to discretize the series, symbolic dynamics-
style (it's already discretized but discretize further, into like percentiles
or something) and run it through one of the dynamical machine learning dealies
to see if there's patterns. Not too much literature on that but it's a thing
that some randoes in like 2004 or something did.
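
A rough sketch of the discretization step only, assuming quantile bins as the "percentiles or something" and simple symbol-pair counts as a stand-in for whatever model gets fed afterwards. All names here are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def quantile_edges(values, bins):
    """Return bins-1 cut points that split the sorted values into equal-count bins."""
    ordered = sorted(values)
    n = len(ordered)
    # Heavily tied data (e.g. many size-1 commits) can repeat an edge,
    # which leaves some bins empty.
    return [ordered[(n * k) // bins] for k in range(1, bins)]


def symbolize(values, edges):
    """Map each value to the index of the bin it falls in."""
    return [bisect_right(edges, v) for v in values]


def transition_counts(symbols):
    """Count how often each symbol is followed by each other symbol."""
    return Counter(zip(symbols, symbols[1:]))
```

The transition counts are the raw material for a first-order Markov model; anything fancier would consume the same symbol sequence.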

~~~
captaincrowbar
There's an old saying among scientists: any set of data points will fit a
power law if plotted on log-log paper with a fat enough magic marker.

~~~
wopwopwop
That's a corollary of "no one will prove your statement wrong if you use
language ambiguous enough."

~~~
ddalex
> Maybe it was or it was not a bug that may have been impacting or not the
> staging or maybe production systems.

I've not been fired yet for saying that.

------
cpobrien
There is a mention of the 66 parent merge from Linus himself:

[http://marc.info/?l=linux-kernel&m=139033182525831](http://marc.info/?l=linux-kernel&m=139033182525831)

~~~
semi-extrinsic
> "Christ, that's not an octopus, that's a Cthulhu merge"

~~~
bigiain
Related - from Twitter recently:
[https://twitter.com/HenryHoffman/status/694184106440200192/p...](https://twitter.com/HenryHoffman/status/694184106440200192/photo/1)

------
geofft
Another interesting piece of trivia: the very first more-than-two-parent merge
in the kernel history is a mistake. The second and third parents are _the same
commit_.

    
    
        commit 13e652800d1644dfedcd0d59ac95ef0beb7f3165
        Merge: 4332bdd 88d7bd8 88d7bd8
        Author: David Woodhouse <dwmw2@shinybook.infradead.org>
        Date:   Sun May 8 13:23:54 2005 +0100
    
            Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
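
One way to hunt for merges like this is to scan a list of commits and their parents for repeats. A small sketch, assuming the input is text with one commit per line in the form "commit parent1 parent2 ..." (for example from `git rev-list --parents HEAD`); the function name is made up.

```python
def merges_with_repeated_parents(log_text):
    """Return (commit, parents) pairs where some parent appears more than once."""
    found = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # root commit or single-parent commit
        commit, parents = fields[0], fields[1:]
        if len(set(parents)) < len(parents):
            found.append((commit, parents))
    return found
```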

------
SEJeff
Some of my favorite commits come from Rusty Russell, who wrote the lguest toy
hypervisor documentation as a story:

[https://github.com/torvalds/linux/commit/f938d2c892db0d80d14...](https://github.com/torvalds/linux/commit/f938d2c892db0d80d144253d4a7b7083efdbedeb#diff-847230dec604827964905e0dfec81e42R1)

------
gsylvie
I don't like OP's definition of divergence. I prefer to take the size of the
diff along first-parent instead.

Here's how I would do it:

    
    
      time git log -m --first-parent --shortstat --pretty="%H" --min-parents=2 |
      grep -v '^$\|3e1dd193edefd2a806a0ba6cf0879cf1a95217da'                   |
      sed 's/.* file.* changed,//'       |
      sed 's/insertion.*,/+/'            |
      sed 's/deletion.*//'               |
      sed 's/insertion.*//'              |
      sed 's/^\ \(.*\)\ $/\$\(\(\1\)\)/' |
      xargs -d '\n' -L 2 echo echo       |
      bash                               |
      sort -k 2,2 -g                     
    

Note: I skip 3e1dd193edefd2a806a0ba6cf0879cf1a95217da because that commit has
no diff along first-parent, and thus screws up my xargs result (which depends
on every 2nd line having the --shortstat output).
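
The special case for that commit disappears if the output is parsed per commit instead of in fixed two-line pairs. A rough sketch, assuming the text comes from the same `git log -m --first-parent --shortstat --pretty="%H" --min-parents=2` invocation; the helper name `churn_per_commit` is hypothetical.

```python
import re

COMMIT_RE = re.compile(r"^[0-9a-f]{40}$")
INSERT_RE = re.compile(r"(\d+) insertions?\(\+\)")
DELETE_RE = re.compile(r"(\d+) deletions?\(-\)")


def churn_per_commit(log_text):
    """Return (commit, insertions + deletions) pairs, smallest first."""
    totals = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if COMMIT_RE.match(line):
            current = line
            totals[current] = 0  # a merge with no first-parent diff stays at 0
        elif current is not None and "changed" in line:
            for pattern in (INSERT_RE, DELETE_RE):
                match = pattern.search(line)
                if match:
                    totals[current] += int(match.group(1))
    return sorted(totals.items(), key=lambda item: item[1])
```

A merge with an empty diff simply reports 0 rather than shifting every later line out of step.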

Of course "--first-parent" doesn't guarantee that we're walking the mainline
(see: [https://developer.atlassian.com/blog/2016/04/stop-foxtrots-n...](https://developer.atlassian.com/blog/2016/04/stop-foxtrots-now/) ), but it _usually_ is.

On my laptop it takes 3 mins 30 seconds. Here are the 5 biggest merges by this
definition:

    
    
      099bfbfc7fbbe22356c02f0caf709ac32e1126ea 463702
      3f17ea6dea8ba5668873afa54628a91aaa3fb1c0 466320
      ce519e2327bff01d0eb54071e7044e6291a52aa6 500074
      7ea61767e41e2baedd6a968d13f56026522e1207 504965
      f063a0c0c995d010960efcc1b2ed14b99674f25c 569691
    

And here's "git show" for those 5:

    
    
      099bfbfc7fbb 2015-06-26T13:18:51-07:00 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
      3f17ea6dea8b 2014-06-08T11:31:16-07:00 Merge branch 'next' (accumulated 3.16 merge window patches) into master
      ce519e2327bf 2009-01-06T17:04:29-08:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
      7ea61767e41e 2009-09-16T08:11:54-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
      f063a0c0c995 2010-10-28T12:13:00-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6

~~~
gsylvie
This approach also helps with those orphan branches since it just calculates
the diff caused by bringing the orphan branch into the mainline.

Whereas OP's definition counts every line ever created on both sides (e.g.,
counts every line of code in the original project as well as the orphan at the
moment of the merge).

~~~
majewsky
If you want the diff caused by bringing a topic branch into master (i.e. its
unmerged parts), you can also do

    
    
      git diff master...topic
    

with three dots instead of two.

~~~
gsylvie
Or in this case (since we're processing merges one at a time to make the
calculation): git diff COMMIT^1...COMMIT^2 (aka, parent1...parent2).

I find the three-dot diff notation very useful, but confusing, and always have
to type "git help diff" first and stare at the docs a bit to find this bit:

    
    
      "git diff A...B" is equivalent to
               "git diff $(git-merge-base A B) B"
    

Also, I prefer first-parent since it gets the actual change to the mainline,
whereas "git diff parent1...parent2" might miss some changes with dirty merges
(e.g., conflict resolution). More info here:
[http://stackoverflow.com/a/41356308/5819286](http://stackoverflow.com/a/41356308/5819286)

------
kijin
> _" Christ, that's not an octopus, that's a Cthulhu merge"_

Perhaps git should throw a warning when you try to do an octopus merge with
more parents than an octopus has legs. If you really want to proceed, add the
--cthulhu option. The default behavior would be --no-cthulhu.

~~~
majewsky
Because Git totally needs more obscure command line options. /s

------
brongondwana
It only has one parent, but this would be the commit that I'm least proud of
(not in Linux, obviously):

[https://github.com/cyrusimap/cyrus-imapd/commit/fdc0eb3d09bc...](https://github.com/cyrusimap/cyrus-imapd/commit/fdc0eb3d09bcc2ce916d2790c98839a61d403937)

Showing 126 changed files with 14,128 additions and 20,617 deletions.

(ok, I'm pretty proud of reducing code size by 6k+ lines while improving lots
of stuff, but the commit is a shitshow)

------
userbinator
GitHub's logo always reminds me of the octopus merge; not sure if it was
chosen for this reason, but I think it's quite suitable.

~~~
azernik
Given that the name refers to octopus ("octocat"), it's very likely.

------
metrognome
I think Gary's commit counts are off:

    
    
      $ git log | wc -l
    

This should count the number of lines in the entire git log, including
metadata (not just commits). I think he means this:

    
    
      $ git log --oneline | wc -l
    

The number of commits for Rails should be closer to 61,000.
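
To make the difference concrete, here is a tiny sketch that counts both ways on text in git's default log format, where each commit starts with a "commit " header line and message lines are indented. The function names are made up; on a real repository, `git rev-list --count HEAD` gives the commit count directly.

```python
def count_log_lines(log_text):
    """What `git log | wc -l` measures: every line, metadata included."""
    return len(log_text.splitlines())


def count_commits(log_text):
    """Count only header lines; message lines are indented, so they never match."""
    return sum(1 for line in log_text.splitlines() if line.startswith("commit "))
```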

~~~
gary_bernhardt
That would explain why the commit frequency seemed so improbably high. I just
updated it with correct numbers. Thanks.

------
cperciva
_Octopuses are more common than you might expect_

The etymologically correct plural is _octopodes_. (Some people accuse
"octopodes" of being pedantic, but as I see it "pedantic" is just a euphemism
for "correct in a way I don't like".)

~~~
avar
'The standard pluralized form of "octopus" in the English language is
"octopuses" although the Ancient Greek plural "octopodes", has also been used
historically.'[1].

The book "Octopus: The Ocean's Intelligent Invertebrate"[2] which I have here
says this on the matter: "By the way, the plural of octopus isn't octopi,
because the word is Greek -- octopus to be exact -- not Latin. The Greek
plural would be octopodes, but we call them octopuses."

You're not being pedantic as it pertains to this word in the English language,
you're just wrong.

1\.
[https://en.wikipedia.org/wiki/Octopus#Etymology_and_pluraliz...](https://en.wikipedia.org/wiki/Octopus#Etymology_and_pluralization)

2\. [https://www.amazon.com/Octopus-Intelligent-Invertebrate-Rola...](https://www.amazon.com/Octopus-Intelligent-Invertebrate-Roland-Anderson/dp/1604690674)

~~~
cperciva
I didn't say _commonly used_ ; I said _etymologically correct_.

The word "octopus" comes from Greek, just like the word "corpus" comes from
Latin (and has the etymological plural "corpora").

~~~
StringEpsilon
How a word is affected by grammatical rules (such as pluralization) is not
determined by etymology (but it can be).

Here's the German conjugation of "mailen" (writing an e-mail), borrowed from
"to mail":

Ich maile, du mailst, er/sie/es mailt, wir mailen, sie mailen.

I don't know any loanwords that break English pluralization rules in German,
but for the reverse: The correct plural for "Kindergarten" would be
"Kindergärten" (not "kindergartens"), which I imagine some English speakers
would have problems with. And "Autobahnen" is rather unintuitive compared to
"autobahns".

~~~
germanier
> I don't know any loanwords that break English pluralization rules in German

The plural of 'Baby' is 'Babys' (instead of 'Babies' – though some people also
use that form) and 'Computer' doesn't change.

On the other hand, both 'Indices' and 'Indexe' are used and for 'Tempus' the
only plural is 'Tempora'.

------
smallnamespace
Slight article nitpick: a distribution that 'looks like a straight line' in a
log-log plot is often _not_ power-law distributed.

One could say that the distribution has a fat one-sided tail though.

~~~
gary_bernhardt
I thought about trying to clarify this, but "power law" is the term that's
been thrown around to describe this effect in software systems for many years.
Really, I only care about the fat-one-sided-ness (for its practical
implications); I don't care so much about the precise mathematical
formulation.

------
majewsky
I used octopus merges once for a deployment system that I built when my team
switched from SVN to Git. Since there were a lot of developers working on
different parts, we often needed to test multiple different changes
in parallel in the QA system.

I built a small web UI where developers could select and unselect development
branches, and it would octopus-merge all selected branches into the master
branch, and force-push that state onto the QA branch (and deploy it to QA, of
course). So QA would always be master + all development branches that were
currently being verified. By using a Github webhook, it would update the QA
system whenever master or one of the branches being verified was pushed to.
I'm not in that team anymore, but I think that deployment tool is still
humming along nicely.
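
A guess at what one rebuild of such a QA branch might look like, written as a function that only builds the git command lines. The branch and remote names are assumptions, not details from the tool described above. Passing several refs to a single `git merge` makes git use its octopus strategy, which refuses any merge that needs manual conflict resolution.

```python
def qa_rebuild_commands(selected_branches, base="master", qa_branch="qa", remote="origin"):
    """Return the git commands to rebuild the QA branch from base plus selected branches."""
    commands = [
        ["git", "fetch", remote],
        # Reset the local QA branch to the current tip of the base branch.
        ["git", "checkout", "-B", qa_branch, f"{remote}/{base}"],
    ]
    if selected_branches:
        refs = [f"{remote}/{name}" for name in selected_branches]
        # With a single ref this is an ordinary merge; with several it is an octopus.
        commands.append(["git", "merge", "--no-edit", *refs])
    # The QA branch is rebuilt from scratch each time, hence the force push.
    commands.append(["git", "push", "--force", remote, qa_branch])
    return commands
```

Each list can be handed to `subprocess.run(cmd, check=True)`; a non-zero exit from the merge step is the signal that the selected branches conflict.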

~~~
rwmj
There must have been many times this would fail because of merge conflicts?

------
tomatokiller
Has anyone asked Laxman Dewangan what he was up to with that initial commit
and merge thing?

~~~
geofft
It's common in a lot of Git UIs (GitHub, for instance) to prompt you to create
a README when you create a new repo. I think this is because when you're
_actually_ starting a project, older versions of git would get confused if you
cloned an empty repo, but using `git clone` to get a directory to work in is
much nicer than `git init` + do work + `git remote add origin` + `git push -u`
or whatever the command is. So, the hosting tool generates an initial commit
for you so that a clone works smoothly.

My guess is that Nvidia has some internal Git hosting tool or a private GitHub
account, and this developer clicked the most obvious "create a repo" button
and tried to push their local git clone of Linux to it. That push was rejected
because it would have clobbered the auto-generated commit, so they did a `git
pull` and a `git push`, i.e., they merged in the auto-generated commit.

------
behm
That was the worst diagram today. <1 Commits on the y-axis? Where would 30 be
on the x-axis? You can't tell with only 3 markers on a log axis.

~~~
gary_bernhardt
What were you trying to do, locate your favorite commit?

