The Biggest and Weirdest Commits in Linux Kernel Git History (destroyallsoftware.com)
394 points by gary_bernhardt 130 days ago | 51 comments

Clauset, Shalizi & Newman 2007 has not-nice things to say about the classic physicist's idiot trick of fitting power-law distributions by drawing a straight line on a log-log plot: it has huge bias. https://arxiv.org/abs/0706.1062

However, the other difficult thing about power-law distributions is that the dataset sizes needed to properly establish that something is a power-law distribution are occasionally incredibly large. So their critique is very strong, given the comparative lack of data. It is often the case that even computer systems, with their overflowing reams of data, still don't have enough. Note that the paper I cited up there suggests MLE and then a Kolmogorov–Smirnov test, so it'll say a lot of things aren't power laws that could well be.

Another way to look at it is from a more geometric point of view. The metric entropy of a generic system of variables is defined as the sum of the positive Lyapunov exponents, and as an "entropy" that quantity does have a lot in common with the other entropies. But to have positive Lyapunov exponents is often to have chaotic dynamics, so one could conjecture that the time series of commit and merge-octopus sizes in kernel git history is chaotic, and that the evolution of the time series will be fractal in nature.

But it's also really fucking hard to confirm or deny that one, because there are varied and strange definitions of chaos itself, and the methods that have been suggested for measuring Lyapunov exponents in real systems are arcane and difficult. You could try some synchronization methods, but they remain arcane and crap. Fractal measurement methods are also shitty and full of dark magic.

One neat little trick might be to discretize the series, symbolic-dynamics-style (it's already discrete, but discretize it further, into percentiles or something), and run it through one of the dynamical machine-learning dealies to see if there are patterns. There's not much literature on that, but it's a thing some randos did around 2004.

There's an old saying among scientists: any set of data points will fit a power law if plotted on log-log paper with a fat enough magic marker.

That's a corollary of "no one will prove your statement wrong if your language is ambiguous enough."

> Maybe it was or it was not a bug that may have been impacting or not the staging or maybe production systems.

I've not been fired yet for saying that.

This is my favorite bad TED talk, by a physicist using the power-law straight-line trick: https://www.ted.com/talks/sean_gourley_on_the_mathematics_of...

It was also published in Science. Sean also founded Quid.

I thought I was taking crazy pills when I saw it. I wondered how this guy could not know better, and how no one else around him could know either.

This is Richardson 1948 shit, and Richardson had the good excuse of doing all the work back when people did their computations by hand. This numbnuts is not only about 60 years late, but doesn't have that excuse.

Lots of people in complex systems who don't like reading, I find. Also a lot of people who aren't sufficiently cynical.

I removed the "power law" language from the post, other than a quick "this is often called a power law, which probably isn't correct" note. I don't want to get bogged down in statistics when all I really care about here is "it's fat and one sided".

Yeah, it's really hard to distinguish power-law from log-normal from statistical measurements, even in simulations when you know the generating process.

As somebody who has not much to do with statistics, what are the bad consequences of misidentifying a distribution? After skimming the paper, it seems that there are several distributions that are similar in the sense that it is hard to tell which one a given dataset actually follows. So naively it seems that it should not really matter which one one picks, if they are similar enough that it is hard to tell them apart to begin with.

Is picking the wrong distribution bad because it leads to wrong conclusions when trying to explain the mechanism causing the distribution? Because extrapolations may yield significantly different results? Or is it more specifically about log-log charts, where small deviations on a logarithmic scale can correspond to significant differences on a linear scale, so that a seemingly good fit is not actually a good fit?

Mean of the lognormal distribution: e^(μ + σ²/2)

Mean of the Pareto (power law) distribution, when the shape parameter α ≤ 1: infinite
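Spelled out with the usual parameter names (μ and σ for the lognormal; shape α and minimum x_m for the Pareto), the contrast is:

```latex
\mathbb{E}[X_{\text{lognormal}}] = e^{\mu + \sigma^2/2},
\qquad
\mathbb{E}[X_{\text{Pareto}}] =
\begin{cases}
  \dfrac{\alpha\, x_m}{\alpha - 1} & \text{if } \alpha > 1,\\[1ex]
  \infty & \text{if } \alpha \le 1.
\end{cases}
```

So two distributions that look nearly identical on a log-log plot can differ between having a finite mean and having no mean at all.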

You mean - pun not really intended - that making choices based on the mean of the supposed distribution, say for risk assessment purposes, may lead to wildly wrong conclusions? Or more generally, that characteristics of different distributions can be very different even if they fit a given dataset similarly well?

Both, actually. And wrong mechanistic statements as well.

For example, if you have this sort of multiplicative central limit theorem with no lower bound, where you imagine independent variables being multiplied together - the attractor of that dynamical process, the result of that universal phenomenon, is lognormal.

Add a lower bound and, bam, power law (Champernowne 1953, "A Model of Income Distribution"). So you're really making different claims about reality.
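The lognormal half of that is just the central limit theorem applied to logs. A sketch of the reasoning, where μ and σ² are the mean and variance of log Y_i:

```latex
X_n = \prod_{i=1}^{n} Y_i
\quad\Longrightarrow\quad
\log X_n = \sum_{i=1}^{n} \log Y_i
\;\xrightarrow{\text{CLT}}\;
\mathcal{N}(n\mu,\; n\sigma^2),
```

so X_n is asymptotically lognormal. The point of the Champernowne reference is that clamping the same multiplicative process at a lower (reflecting) boundary destroys this limit and produces a power-law tail instead.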

There is a mention of the 66-parent merge from Linus himself:


>Anyway, I'd suggest you try to limit octopus merges to ~15 parents or less to make the visualization tools not go crazy. Maybe aim for just 10 or so in most cases.

I'll file this under "problems I'm glad I don't have".

Whoa, you're right! He referenced it neither by 7-character short hash nor by full hash, and I didn't think to check intermediate hash lengths. I've updated the post to reference that email.

The kernel devs recommend using 12-character commit hash abbreviations: https://www.kernel.org/doc/html/latest/process/submitting-pa... (bottom of the section has a git config snippet)
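For reference, the equivalent one-liner (not necessarily verbatim from that page) is:

```shell
# Make git abbreviate object names to 12 hex digits by default,
# i.e. the same effect as a [core] abbrev = 12 snippet in .gitconfig.
git config --global core.abbrev 12
```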

> "Christ, that's not an octopus, that's a Cthulhu merge"

Never change, Torvalds, never change.

Another interesting piece of trivia: the very first more-than-two-parent merge in the kernel history is a mistake. The second and third parents are the same commit.

    commit 13e652800d1644dfedcd0d59ac95ef0beb7f3165
    Merge: 4332bdd 88d7bd8 88d7bd8
    Author: David Woodhouse <dwmw2@shinybook.infradead.org>
    Date:   Sun May 8 13:23:54 2005 +0100

        Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Some of my favorite commits come from Rusty Russell, who wrote the lguest toy hypervisor documentation as a story:


I don't like OP's definition of divergence. I prefer to take the size of the diff along first-parent instead.

Here's how I would do it:

  time git log -m --first-parent --shortstat --pretty="%H" --min-parents=2 |
  grep -v '^$\|3e1dd193edefd2a806a0ba6cf0879cf1a95217da'                   |
  sed 's/.* file.* changed,//'       |
  sed 's/insertion.*,/+/'            |
  sed 's/deletion.*//'               |
  sed 's/insertion.*//'              |
  sed 's/^\ \(.*\)\ $/\$\(\(\1\)\)/' |
  xargs -d '\n' -L 2 echo echo       |
  bash                               |
  sort -k 2,2 -g                     
Note: I skip 3e1dd193edefd2a806a0ba6cf0879cf1a95217da because that commit has no diff along first-parent, and thus screws up my xargs result (which depends on every 2nd line having the --shortstat output).

Of course "--first-parent" doesn't guarantee that we're walking the mainline (see: https://developer.atlassian.com/blog/2016/04/stop-foxtrots-n... ), but it usually is.

On my laptop it takes 3 mins 30 seconds. Here are the 5 biggest merges by this definition:

  099bfbfc7fbbe22356c02f0caf709ac32e1126ea 463702
  3f17ea6dea8ba5668873afa54628a91aaa3fb1c0 466320
  ce519e2327bff01d0eb54071e7044e6291a52aa6 500074
  7ea61767e41e2baedd6a968d13f56026522e1207 504965
  f063a0c0c995d010960efcc1b2ed14b99674f25c 569691
And here's "git show" for those 5:

  099bfbfc7fbb 2015-06-26T13:18:51-07:00 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
  3f17ea6dea8b 2014-06-08T11:31:16-07:00 Merge branch 'next' (accumulated 3.16 merge window patches) into master
  ce519e2327bf 2009-01-06T17:04:29-08:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
  7ea61767e41e 2009-09-16T08:11:54-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
  f063a0c0c995 2010-10-28T12:13:00-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6

This approach also helps with those orphan branches since it just calculates the diff caused by bringing the orphan branch into the mainline.

Whereas OP's definition counts every line ever created on both sides (e.g., counts every line of code in the original project as well as the orphan at the moment of the merge).

If you want the diff caused by bringing a topic branch into master (i.e. its unmerged parts), you can also do

  git diff master...topic
with three dots instead of two.

Or in this case (since we're processing merges one at a time to make the calculation): git diff COMMIT^1...COMMIT^2 (aka, parent1...parent2).

I find the three-dot diff notation very useful but confusing, and I always have to type "git help diff" first and stare at the docs a bit to find this bit:

  "git diff A...B" is equivalent to
           "git diff $(git-merge-base A B) B"
Also, I prefer first-parent since it gets the actual change to the mainline, whereas "git diff parent1...parent2" might miss some changes with dirty merges (e.g., conflict resolution). More info here: http://stackoverflow.com/a/41356308/5819286

> "Christ, that's not an octopus, that's a Cthulhu merge"

Perhaps git should throw a warning when you try to do an octopus merge with more parents than an octopus has legs. If you really want to proceed, add the --cthulhu option. The default behavior would be --no-cthulhu.

I think I might add a no-op '--no-cthulhu' flag to my next project, just because I like the ring of it so much.

Because Git totally needs more obscure command line options. /s

It only has one parent, but this would be the commit that I'm least proud of (not in Linux, obviously):


Showing 126 changed files with 14,128 additions and 20,617 deletions.

(ok, I'm pretty proud of reducing code size by 6k+ lines while improving lots of stuff, but the commit is a shitshow)

GitHub's logo always reminds me of the octopus merge; not sure if it was chosen for this reason, but I think it's quite suitable.

Given that the name refers to octopus ("octocat"), it's very likely.

I think Gary's commit counts are off:

  $ git log | wc -l
This should count the number of lines in the entire git log, including metadata (not just commits). I think he means this:

  $ git log --oneline | wc -l
The number of commits for Rails should be closer to 61,000.

That would explain why the commit frequency seemed so improbably high. I just updated it with the correct numbers. Thanks.

You can also use:

    git rev-list --count HEAD

The number is right, the command is just misprinted:

    titan:~/src/linux geofft$ git log --oneline 566cf87 | wc -l

Octopuses are more common than you might expect.

The etymologically correct plural is octopodes. (Some people accuse "octopodes" of being pedantic, but as I see it, "pedantic" is just a euphemism for "correct in a way I don't like".)

'The standard pluralized form of "octopus" in the English language is "octopuses" although the Ancient Greek plural "octopodes", has also been used historically.'[1].

The book "Octopus: The Ocean's Intelligent Invertebrate"[2] which I have here says this on the matter: "By the way, the plural of octopus isn't octopi, because the word is Greek -- octopus to be exact -- not Latin. The Greek plural would be octopodes, but we call them octopuses."

You're not being pedantic as it pertains to this word in the English language, you're just wrong.

1. https://en.wikipedia.org/wiki/Octopus#Etymology_and_pluraliz...

2. https://www.amazon.com/Octopus-Intelligent-Invertebrate-Rola...

I didn't say commonly used; I said etymologically correct.

The word "octopus" comes from Greek, just like the word "corpus" comes from Latin (and has the etymological plural "corpora").

But isn't etymology descriptive, instead of prescriptive?

And if it's prescriptive, how far do you want to go, etymologically? According to Wiktionary[1], the term originates from a Proto-Indo-European language (but doesn't give plural forms for those roots).

And isn't it also the case that once a word gets accepted into a language, it becomes a word in that language, no matter where it comes from? There are _several_ examples of such words in all the European languages. The word "common", for example, dates back to Latin "communis", and I'm pretty sure the adverbial form of that isn't "communisly", so why the exception for "octopus"?

[1] https://en.wiktionary.org/wiki/%CF%80%CE%BF%CF%8D%CF%82#Anci...

How a word is affected by grammatical rules (such as pluralization) is not necessarily determined by etymology (though it can be).

Here's the German conjugation of "mailen" (to write an e-mail), borrowed from "to mail":

Ich maile, du mailst, er/sie/es mailt, wir mailen, sie mailen.

I don't know any loanwords that break English pluralization rules in German, but for the reverse: the correct plural of "Kindergarten" would be "Kindergärten" (not "kindergartens"), which I imagine some English speakers would have problems with. And "Autobahnen" is rather unintuitive compared to "autobahns".

> I don't know any loanwords that break English pluralization rules in German

The plural of 'Baby' is 'Babys' (instead of 'Babies' – though some people also use that form) and 'Computer' doesn't change.

On the other hand, both 'Indices' and 'Indexe' are used and for 'Tempus' the only plural is 'Tempora'.

Yes, but "octopus" is an English word, and "etymologically correct" is not a subset of "correct".

The English word "journal" comes from French, but if you're talking in English about two systemd log files, they're "journals", not "journaux". (If you're talking in French, they are in fact "deux journaux".)

Slight article nitpick: a distribution that 'looks like a straight line' in a log-log plot is often not power-law distributed.

One could say that the distribution has a fat one-sided tail though.

I thought about trying to clarify this, but "power law" is the term that's been thrown around to describe this effect in software systems for many years. Really, I only care about the fat-one-sided-ness (for its practical implications); I don't care so much about the precise mathematical formulation.

I used octopus merges once for a deployment system that I built when my team switched from SVN to Git. Since there were a lot of developers working on different parts, we often needed to test multiple different changes in parallel in the QA system.

I built a small web UI where developers could select and unselect development branches, and it would octopus-merge all selected branches into the master branch, then force-push that state onto the QA branch (and deploy it to QA, of course). So QA would always be master plus all development branches that were currently being verified. By using a GitHub webhook, it would update the QA system whenever master or one of the branches being verified was pushed to. I'm not in that team anymore, but I think that deployment tool is still humming along nicely.
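A minimal sketch of that rebuild step, in a throwaway repo (branch names like `feature-a` and the `qa` branch name are made up; the real tool presumably drove this from the web UI):

```shell
#!/bin/sh
# Sketch of the QA-branch rebuild described above, in a throwaway repo.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email qa@example.com
git config user.name  qa-bot

echo base > app.txt; git add app.txt; git commit -qm 'base'
git branch -M master                      # normalize the branch name

# Two in-flight development branches, each with its own change.
git checkout -qb feature-a; echo a > a.txt; git add a.txt; git commit -qm 'feature a'
git checkout -q master
git checkout -qb feature-b; echo b > b.txt; git add b.txt; git commit -qm 'feature b'
git checkout -q master
echo more >> app.txt; git commit -qam 'mainline work'

# Rebuild QA: start fresh from master, octopus-merge every selected branch.
git checkout -qB qa master
git merge -q --no-edit feature-a feature-b

git rev-list --parents -n1 HEAD           # the merge commit and its 3 parents
# A real deployment would now force-push, e.g.: git push -f origin qa:qa
```

Rebuilding `qa` with `checkout -B` each time is what makes the force-push necessary: the branch is always a throwaway octopus of whatever is selected right now.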

There must have been many times when this failed because of merge conflicts?

Has anyone asked Laxman Dewangan what he was up to with that initial commit and merge thing?

It's common in a lot of Git UIs (GitHub, for instance) to prompt you to create a README when you create a new repo. I think this is because when you're actually starting a project, older versions of git would get confused if you cloned an empty repo, but using `git clone` to get a directory to work in is much nicer than `git init` + do work + `git remote add origin` + `git push -u` or whatever the command is. So, the hosting tool generates an initial commit for you so that a clone works smoothly.

My guess is that Nvidia has some internal Git hosting tool or a private GitHub account, and this developer clicked the most obvious "create a repo" button and tried to push their local git clone of Linux to it. That push was rejected because it would have clobbered the auto-generated commit, so they did a `git pull` and a `git push`, i.e., they merged in the auto-generated commit.
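That guess can be reproduced in a pair of throwaway repos (all names here are made up). One caveat: modern git refuses the pull step unless you pass `--allow-unrelated-histories`; git before 2.9 just did it silently, which fits the era of that commit.

```shell
#!/bin/sh
# Reproduce the guessed sequence: a hosted repo with an auto-generated
# initial commit, and an unrelated local history merged over it.
set -e
work="$(mktemp -d)"

# "Hosting tool": a bare repo seeded with an auto-generated README commit.
git init -q --bare "$work/hosted.git"
git clone -q "$work/hosted.git" "$work/seed" 2>/dev/null
( cd "$work/seed" &&
  git config user.email host@example.com && git config user.name host &&
  echo readme > README && git add README && git commit -qm 'Initial commit' &&
  git push -q origin HEAD:master )

# The developer's local repo, with its own unrelated root commit.
git init -q "$work/local"
cd "$work/local"
git config user.email dev@example.com
git config user.name dev
echo code > main.c; git add main.c; git commit -qm 'import source'

git remote add origin "$work/hosted.git"
# A plain "git push origin HEAD:master" would be rejected here
# (non-fast-forward: it would clobber the auto-generated commit), so:
git pull -q --no-edit --no-rebase --allow-unrelated-histories origin master

git rev-list --max-parents=0 HEAD   # two root commits, joined by the merge
```

The resulting history has two roots and a merge commit joining them, which matches the shape seen in the kernel repo.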

That was the worst diagram today. "<1" commits on the y-axis? Where would 30 be on the x-axis? You can't tell, with only 3 markers on a log axis.

What were you trying to do, locate your favorite commit?
