The Biggest and Weirdest Commits in Linux Kernel Git History (destroyallsoftware.com)
394 points by gary_bernhardt 130 days ago | 51 comments

Clauset, Shalizi & Newman 2007 has not-nice things to say about the classic physicist's idiot trick of fitting power-law distributions by drawing a straight line on a log-log plot: it has huge bias. https://arxiv.org/abs/0706.1062

However, the other difficult thing about power-law distributions is that the dataset sizes needed to properly establish that something is a power-law distribution are occasionally incredibly large. So their critique is very strong, given the comparative lack of data. It is often the case that even computer systems, with their overflowing reams of data, still don't have enough. Note that the paper I cited up there suggests MLE and then a Kolmogorov–Smirnov test, so it'll say a lot of things aren't power laws that could well be.

Another way to look at it is from a more geometric point of view. The metric entropy of a generic system of variables is defined as the sum of the positive Lyapunov exponents, and as an "entropy" that quantity does have a lot in common with the other entropies. But to have positive Lyapunov exponents is often to have chaotic dynamics, so one could conjecture that the time series of commit and merge-octopus sizes in kernel git history is chaotic, and that the evolution of the time series will be fractal in nature.

But it's also really fucking hard to confirm or deny that one, because there are varied and strange definitions of chaos itself, and the methods that have been suggested for measuring Lyapunov exponents in real systems are arcane and difficult. You could try some synchronization methods, but they remain arcane and crap. Fractal measurement methods are also shitty and full of dark magic.

One neat little trick might be to discretize the series, symbolic-dynamics-style (it's already discrete, but discretize it further, into percentiles or something), and run it through one of the dynamical machine-learning dealies to see if there are patterns. There's not much literature on that, but it's a thing some randos did around 2004.

There's an old saying among scientists: any set of data points will fit a power law if plotted on log-log paper with a fat enough magic marker.

That's a corollary of "no one will prove your statement wrong if your language is ambiguous enough."

> Maybe it was or it was not a bug that may have been impacting or not the staging or maybe production systems.

I've not been fired yet for saying that.

This is my favorite bad TED talk, by a physicist using the power-law straight-line trick: https://www.ted.com/talks/sean_gourley_on_the_mathematics_of...

It was also published in Science. Sean also founded Quid.

I thought I was taking crazy pills when I saw it. I wondered how this guy could not know better, and how no one else around him could know either.

This is Richardson 1948 shit, and Richardson had the good excuse of doing all the work back when people did their computations by hand. This numbnuts is not only about 60 years late, but doesn't have that excuse.

Lots of people in complex systems who don't like reading, I find. Also a lot of people who aren't sufficiently cynical.

I removed the "power law" language from the post, other than a quick "this is often called a power law, which probably isn't correct" note. I don't want to get bogged down in statistics when all I really care about here is "it's fat and one sided".

Yeah, it's really hard to distinguish power-law from log-normal from statistical measurements, even in simulations when you know the generating process.

As somebody who has not much to do with statistics, what are the bad consequences of misidentifying a distribution? After skimming the paper, it seems that there are several distributions that are similar in the sense that it is hard to tell which one a given dataset actually follows. So naively it seems that it should not really matter which one one picks, if they are similar enough that it is hard to tell them apart to begin with.

Is picking the wrong distribution bad because it leads to wrong conclusions when trying to explain the mechanism causing the distribution? Because extrapolations may yield significantly different results? Or is it more specifically about log-log charts, where small deviations on a logarithmic scale can correspond to significant differences on a linear scale, so that a seemingly good fit is not actually a good fit?

Mean of the lognormal distribution: e^(μ + σ²/2)

Mean of the Pareto (power law) distribution, when the shape parameter α ≤ 1: infinite
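Spelled out with the usual parameter names (μ and σ for the lognormal; shape α and minimum x_m for the Pareto), the contrast is:

```latex
\mathbb{E}[X_{\text{lognormal}}] = e^{\mu + \sigma^2/2},
\qquad
\mathbb{E}[X_{\text{Pareto}}] =
\begin{cases}
  \dfrac{\alpha\, x_m}{\alpha - 1} & \text{if } \alpha > 1,\\[1ex]
  \infty & \text{if } \alpha \le 1.
\end{cases}
```

So two distributions that look nearly identical on a log-log plot can differ between having a finite mean and having no mean at all.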

You mean - pun not really intended - that making choices based on the mean of the supposed distribution, say for risk assessment purposes, may lead to wildly wrong conclusions? Or more generally, that characteristics of different distributions can be very different even if they fit a given dataset similarly well?

Both, actually. And wrong mechanistic statements as well.

For example, if you have this sort of multiplicative central limit theorem with no lower bound, where you imagine independent variables being multiplied together - the attractor of that dynamical process, the result of that universal phenomenon, is lognormal.

Add a lower bound and, bam, power law (Champernowne 1953, "A Model of Income Distribution"). So you're really making different claims about reality.
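The lognormal half of that is just the central limit theorem applied to logs. A sketch of the reasoning, where μ and σ² are the mean and variance of log Y_i:

```latex
X_n = \prod_{i=1}^{n} Y_i
\quad\Longrightarrow\quad
\log X_n = \sum_{i=1}^{n} \log Y_i
\;\xrightarrow{\text{CLT}}\;
\mathcal{N}(n\mu,\; n\sigma^2),
```

so X_n is asymptotically lognormal. The point of the Champernowne reference is that clamping the same multiplicative process at a lower (reflecting) boundary destroys this limit and produces a power-law tail instead.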

There is a mention of the 66-parent merge from Linus himself:


>Anyway, I'd suggest you try to limit octopus merges to ~15 parents or less to make the visualization tools not go crazy. Maybe aim for just 10 or so in most cases.

I'll file this under "problems I'm glad I don't have".

Whoa, you're right! He referenced it neither by 7-character short hash nor by full hash, and I didn't think to check intermediate hash lengths. I've updated the post to reference that email.

The kernel devs recommend using 12-character commit hash abbreviations: https://www.kernel.org/doc/html/latest/process/submitting-pa... (bottom of the section has a git config snippet)
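For reference, the equivalent one-liner (not necessarily verbatim from that page) is:

```shell
# Make git abbreviate object names to 12 hex digits by default,
# i.e. the same effect as a [core] abbrev = 12 snippet in .gitconfig.
git config --global core.abbrev 12
```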

> "Christ, that's not an octopus, that's a Cthulhu merge"

Never change, Torvalds, never change.

Another interesting piece of trivia: the very first more-than-two-parent merge in the kernel history is a mistake. The second and third parents are the same commit.

    commit 13e652800d1644dfedcd0d59ac95ef0beb7f3165
    Merge: 4332bdd 88d7bd8 88d7bd8
    Author: David Woodhouse <dwmw2@shinybook.infradead.org>
    Date:   Sun May 8 13:23:54 2005 +0100

        Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Some of my favorite commits come from Rusty Russell, who wrote the lguest toy hypervisor documentation as a story:


I don't like OP's definition of divergence. I prefer to take the size of the diff along first-parent instead.

Here's how I would do it:

  time git log -m --first-parent --shortstat --pretty="%H" --min-parents=2 |
  grep -v '^$\|3e1dd193edefd2a806a0ba6cf0879cf1a95217da'                   |
  sed 's/.* file.* changed,//'       |
  sed 's/insertion.*,/+/'            |
  sed 's/deletion.*//'               |
  sed 's/insertion.*//'              |
  sed 's/^\ \(.*\)\ $/\$\(\(\1\)\)/' |
  xargs -d '\n' -L 2 echo echo       |
  bash                               |
  sort -k 2,2 -g                     
Note: I skip 3e1dd193edefd2a806a0ba6cf0879cf1a95217da because that commit has no diff along first-parent, and thus screws up my xargs result (which depends on every 2nd line having the --shortstat output).

Of course "--first-parent" doesn't guarantee that we're walking the mainline (see: https://developer.atlassian.com/blog/2016/04/stop-foxtrots-n... ), but it usually is.

On my laptop it takes 3 mins 30 seconds. Here are the 5 biggest merges by this definition:

  099bfbfc7fbbe22356c02f0caf709ac32e1126ea 463702
  3f17ea6dea8ba5668873afa54628a91aaa3fb1c0 466320
  ce519e2327bff01d0eb54071e7044e6291a52aa6 500074
  7ea61767e41e2baedd6a968d13f56026522e1207 504965
  f063a0c0c995d010960efcc1b2ed14b99674f25c 569691
And here's "git show" for those 5:

  099bfbfc7fbb 2015-06-26T13:18:51-07:00 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
  3f17ea6dea8b 2014-06-08T11:31:16-07:00 Merge branch 'next' (accumulated 3.16 merge window patches) into master
  ce519e2327bf 2009-01-06T17:04:29-08:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
  7ea61767e41e 2009-09-16T08:11:54-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
  f063a0c0c995 2010-10-28T12:13:00-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6

This approach also helps with those orphan branches since it just calculates the diff caused by bringing the orphan branch into the mainline.

Whereas OP's definition counts every line ever created on both sides (e.g., counts every line of code in the original project as well as the orphan at the moment of the merge).

If you want the diff caused by bringing a topic branch into master (i.e. its unmerged parts), you can also do

  git diff master...topic
with three dots instead of two.

Or in this case (since we're processing merges one at a time to make the calculation): git diff COMMIT^1...COMMIT^2 (aka, parent1...parent2).

I find the three-dot diff notation very useful but confusing, and I always have to type "git help diff" first and stare at the docs a bit to find this bit:

  "git diff A...B" is equivalent to
           "git diff $(git-merge-base A B) B"
Also, I prefer first-parent since it gets the actual change to the mainline, whereas "git diff parent1...parent2" might miss some changes with dirty merges (e.g., conflict resolution). More info here: http://stackoverflow.com/a/41356308/5819286

> "Christ, that's not an octopus, that's a Cthulhu merge"

Perhaps git should throw a warning when you try to do an octopus merge with more parents than an octopus has legs. If you really want to proceed, add the --cthulhu option. The default behavior would be --no-cthulhu.

I think I might add a no-op '--no-cthulhu' flag to my next project, just because I like the ring of it so much.

Because Git totally needs more obscure command line options. /s

It only has one parent, but this would be the commit that I'm least proud of (not in Linux, obviously):


Showing 126 changed files with 14,128 additions and 20,617 deletions.

(ok, I'm pretty proud of reducing code size by 6k+ lines while improving lots of stuff, but the commit is a shitshow)

GitHub's logo always reminds me of the octopus merge; not sure if it was chosen for this reason, but I think it's quite suitable.

Given that the name refers to octopus ("octocat"), it's very likely.

I think Gary's commit counts are off:

  $ git log | wc -l
This should count the number of lines in the entire git log, including metadata (not just commits). I think he means this:

  $ git log --oneline | wc -l
The number of commits for Rails should be closer to 61,000.

That would explain why the commit frequency seemed so improbably high. I just updated it with the correct numbers. Thanks.

You can also use:

    git rev-list --count HEAD

The number is right, the command is just misprinted:

    titan:~/src/linux geofft$ git log --oneline 566cf87 | wc -l

Octopuses are more common than you might expect.

The etymologically correct plural is octopodes. (Some people accuse "octopodes" of being pedantic, but as I see it, "pedantic" is just a euphemism for "correct in a way I don't like".)

'The standard pluralized form of "octopus" in the English language is "octopuses" although the Ancient Greek plural "octopodes", has also been used historically.'[1].

The book "Octopus: The Ocean's Intelligent Invertebrate"[2] which I have here says this on the matter: "By the way, the plural of octopus isn't octopi, because the word is Greek -- octopus to be exact -- not Latin. The Greek plural would be octopodes, but we call them octopuses."

You're not being pedantic as it pertains to this word in the English language, you're just wrong.

1. https://en.wikipedia.org/wiki/Octopus#Etymology_and_pluraliz...

2. https://www.amazon.com/Octopus-Intelligent-Invertebrate-Rola...

I didn't say commonly used; I said etymologically correct.

The word "octopus" comes from Greek, just like the word "corpus" comes from Latin (and has the etymological plural "corpora").

But isn't etymology descriptive, instead of prescriptive?

And if it's prescriptive, how far do you want to go, etymologically? According to Wiktionary[1], the term originates from a Proto-Indo-European language (but doesn't give plural forms for those roots).

And isn't it also the case that once a word gets accepted into a language, it becomes a word in that language, no matter where it comes from? There are _several_ examples of such words in all the European languages. The word "common", for example, dates back to Latin "communis", and I'm pretty sure the adverbial form of that isn't "communisly", so why the exception for "octopus"?

[1] https://en.wiktionary.org/wiki/%CF%80%CE%BF%CF%8D%CF%82#Anci...

How a word is affected by grammatical rules (such as pluralization) is not necessarily determined by etymology (though it can be).

Here's the German conjugation of "mailen" (to write an e-mail), borrowed from "to mail":

Ich maile, du mailst, er/sie/es mailt, wir mailen, sie mailen.

I don't know any loanwords that break English pluralization rules in German, but for the reverse: the correct plural of "Kindergarten" would be "Kindergärten" (not "kindergartens"), which I imagine some English speakers would have problems with. And "Autobahnen" is rather unintuitive compared to "autobahns".

> I don't know any loanwords that break English pluralization rules in German

The plural of 'Baby' is 'Babys' (instead of 'Babies' – though some people also use that form) and 'Computer' doesn't change.

On the other hand, both 'Indices' and 'Indexe' are used and for 'Tempus' the only plural is 'Tempora'.

Yes, but "octopus" is an English word, and "etymologically correct" is not a subset of "correct".

The English word "journal" comes from French, but if you're talking in English about two systemd log files, they're "journals", not "journaux". (If you're talking in French, they are in fact "deux journaux".)

Slight article nitpick: a distribution that 'looks like a straight line' in a log-log plot is often not power-law distributed.

One could say that the distribution has a fat one-sided tail though.

I thought about trying to clarify this, but "power law" is the term that's been thrown around to describe this effect in software systems for many years. Really, I only care about the fat-one-sided-ness (for its practical implications); I don't care so much about the precise mathematical formulation.

I used octopus merges once for a deployment system that I built when my team switched from SVN to Git. Since there were a lot of developers working on different parts, we often needed to test multiple different changes in parallel in the QA system.

I built a small web UI where developers could select and unselect development branches, and it would octopus-merge all selected branches into the master branch, then force-push that state onto the QA branch (and deploy it to QA, of course). So QA would always be master plus all development branches that were currently being verified. By using a GitHub webhook, it would update the QA system whenever master or one of the branches being verified was pushed to. I'm not in that team anymore, but I think that deployment tool is still humming along nicely.
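A minimal sketch of that rebuild step, in a throwaway repo (branch names like `feature-a` and the `qa` branch name are made up; the real tool presumably drove this from the web UI):

```shell
#!/bin/sh
# Sketch of the QA-branch rebuild described above, in a throwaway repo.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email qa@example.com
git config user.name  qa-bot

echo base > app.txt; git add app.txt; git commit -qm 'base'
git branch -M master                      # normalize the branch name

# Two in-flight development branches, each with its own change.
git checkout -qb feature-a; echo a > a.txt; git add a.txt; git commit -qm 'feature a'
git checkout -q master
git checkout -qb feature-b; echo b > b.txt; git add b.txt; git commit -qm 'feature b'
git checkout -q master
echo more >> app.txt; git commit -qam 'mainline work'

# Rebuild QA: start fresh from master, octopus-merge every selected branch.
git checkout -qB qa master
git merge -q --no-edit feature-a feature-b

git rev-list --parents -n1 HEAD           # the merge commit and its 3 parents
# A real deployment would now force-push, e.g.: git push -f origin qa:qa
```

Rebuilding `qa` with `checkout -B` each time is what makes the force-push necessary: the branch is always a throwaway octopus of whatever is selected right now.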

There must have been many times when this failed because of merge conflicts?

Has anyone asked Laxman Dewangan what he was up to with that initial commit and merge thing?

It's common in a lot of Git UIs (GitHub, for instance) to prompt you to create a README when you create a new repo. I think this is because when you're actually starting a project, older versions of git would get confused if you cloned an empty repo, but using `git clone` to get a directory to work in is much nicer than `git init` + do work + `git remote add origin` + `git push -u` or whatever the command is. So, the hosting tool generates an initial commit for you so that a clone works smoothly.

My guess is that Nvidia has some internal Git hosting tool or a private GitHub account, and this developer clicked the most obvious "create a repo" button and tried to push their local git clone of Linux to it. That push was rejected because it would have clobbered the auto-generated commit, so they did a `git pull` and a `git push`, i.e., they merged in the auto-generated commit.
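That guess can be reproduced in a pair of throwaway repos (all names here are made up). One caveat: modern git refuses the pull step unless you pass `--allow-unrelated-histories`; git before 2.9 just did it silently, which fits the era of that commit.

```shell
#!/bin/sh
# Reproduce the guessed sequence: a hosted repo with an auto-generated
# initial commit, and an unrelated local history merged over it.
set -e
work="$(mktemp -d)"

# "Hosting tool": a bare repo seeded with an auto-generated README commit.
git init -q --bare "$work/hosted.git"
git clone -q "$work/hosted.git" "$work/seed" 2>/dev/null
( cd "$work/seed" &&
  git config user.email host@example.com && git config user.name host &&
  echo readme > README && git add README && git commit -qm 'Initial commit' &&
  git push -q origin HEAD:master )

# The developer's local repo, with its own unrelated root commit.
git init -q "$work/local"
cd "$work/local"
git config user.email dev@example.com
git config user.name dev
echo code > main.c; git add main.c; git commit -qm 'import source'

git remote add origin "$work/hosted.git"
# A plain "git push origin HEAD:master" would be rejected here
# (non-fast-forward: it would clobber the auto-generated commit), so:
git pull -q --no-edit --no-rebase --allow-unrelated-histories origin master

git rev-list --max-parents=0 HEAD   # two root commits, joined by the merge
```

The resulting history has two roots and a merge commit joining them, which matches the shape seen in the kernel repo.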

That was the worst diagram today. "<1" commits on the y-axis? Where would 30 be on the x-axis? You can't tell, with only 3 markers on a log axis.

What were you trying to do, locate your favorite commit?
