The Biggest and Weirdest Commits in Linux Kernel Git History (2017) (destroyallsoftware.com)
326 points by swsieber 43 days ago | 77 comments



My startup's monorepo has 2 root commits as well. When the company was first starting out, my co-founder and I created independent git repos. I was writing OCR type research-y code and he was doing more traditional CRUD REST webserver things.

When it came time to pull things together, we thought it'd be fun to try and maintain the histories. So I added his repo as a remote and simply merged his unrelated history into mine.
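For the curious, that kind of merge can be reproduced in a scratch directory. A minimal sketch, with invented repo and file names; note that git 2.9 and later requires the --allow-unrelated-histories flag for this, which wasn't needed at the time:

```shell
# Sketch: merge two unrelated repos into one with two root commits.
# All paths/names are hypothetical.
set -e
scratch=$(mktemp -d); cd "$scratch"

for r in ocr-research crud-webserver; do
  git init -q "$r"
  ( cd "$r"
    git config user.name demo; git config user.email demo@example.com
    git symbolic-ref HEAD refs/heads/master
    echo "$r" > "$r.txt"; git add "$r.txt"; git commit -qm "initial $r" )
done

cd ocr-research
git remote add cofounder ../crud-webserver
git fetch -q cofounder
git merge -q --no-edit --allow-unrelated-histories cofounder/master

# The merged repo now has two root commits:
git log --max-parents=0 --oneline
```

Both original histories survive intact; the merge commit simply has two parents with no common ancestor.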

Fast forward and now we offer hoodies as swag to anyone who contributes to the repo. We personalize the hoodie with your git username, the truncated commit hash of your first commit, and the number of parent commits to your first commit.

Having 2 root commits means that both my cofounder and I have hoodies with a large 0 as number of parent commits. Just a nice way to commemorate this accident of history :)


Where I work, we have several dozen repos, one for each component, and are thinking about merging them into a monorepo soon. That will end up giving us a repo with several different root commits.

In fact, for at least one of our components, we've already done this. We had something that solved the same problem for a couple of different permutations of software versions and platforms. The person who had originally written it had used separate repos, with common code copied and pasted between them, and later on we merged the repos together and factored out the common code so it could be shared. So we already have at least one repo which has 4 root commits, I believe.

Actually, I just checked and it has 7 root commits. I'd forgotten that we had also had similar problems that needed to be solved for some other software, and had some interns who had done those in separate repos as well, which also eventually got merged in to share all of the common code:

  $ git log --max-parents=0 --pretty="format:%h %cd %s" --date=short
  666c0bf 2012-07-09 Initial tree.
  528ca3a 2012-06-12 first commit
  50f1b18 2012-06-12 first commit
  86d66e0 2011-11-17 initial
  76a7789 2011-03-21 initial commit
  58f7afc 2010-05-20 initial
  deb50b7 2009-10-29 initial commit


That's a fantastic idea. Do you have any photos of the hoodies?


Yep, here's one: https://i.imgur.com/LEI4BAA.jpg

It makes for a great conversation piece, even around non-tech crowds.


My company got to ~20 services in separate repos prior to moving them all into one monorepo, so we've ended up with 20 root commits and an enormous commit graph around the time when it all converged. We did do them all separately though - rather than as an octopus merge.

It has been particularly interesting to balance the different languages in the same repo. A few painful bits, but overall it has worked out really well.

We also build the entire thing into a single Docker container containing every service. You just start the particular one you want via the container command (the entrypoint is a little shell script that invokes the correct executable).
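Such an entrypoint can be tiny. A sketch, with invented service names and binary paths (the real script presumably differs):

```shell
# Write a minimal dispatching entrypoint (all names hypothetical).
# Docker would run it with the service name as the first argument,
# e.g. `docker run myimage api`.
cat > entrypoint.sh <<'EOF'
#!/bin/sh
cmd="$1"
[ "$#" -gt 0 ] && shift
case "$cmd" in
  api)    exec /app/bin/api-server "$@" ;;
  worker) exec /app/bin/worker "$@" ;;
  cron)   exec /app/bin/cron-runner "$@" ;;
  *)
    echo "usage: $0 {api|worker|cron}" >&2
    exit 64
    ;;
esac
EOF
chmod +x entrypoint.sh
```

Using `exec` replaces the shell with the service process, so the service becomes PID 1 in the container and receives signals directly.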


What's the advantage of having all that in one repository? And what's the advantage of provisioning every container with functionality it will never need?


I don't know about that container strategy, but the thing I like about monorepos in that situation is that you can enforce having every commit to master represent a state where all your services and tools work together.

Otherwise if you ever need to do a rollback or post mortem or similar you have to figure out what commit out of the 20 different projects was actually being used at the time.

Of course that comes at a cost and there are other ways to get the same benefit, but my experience is that this is an easy way to keep things consistent.


In my opinion (I work at said company) the best thing about the monorepo is that 1 feature that touches several parts of the infrastructure can be in a single PR, and therefore is much easier to review.


This. Where I work it's common for features to involve 3-4 repos and it makes testing and deploying a huge pain. This leads to people sticking code in the wrong place as a work around, defeating the purpose of splitting the code into services in the first place.


There are a few main benefits of a monorepo:

- much less admin, don’t need to maintain/update so many repos

- easier to coordinate changes between systems at the same time

- everything is on a single version of each library - you don’t have some services on one version, some on another, etc

Benefits of having a single Docker image:

- much faster (overall) build times - doing everything in one go is way faster than doing each one individually

- much faster deployment - only have to download the image once per host, rather than many variations

- most services have a similar set of the same base dependencies, so reduces duplication

- overall the difference in image size between including one service and including all of them is probably only 5-10MB anyway

Those are just the things which immediately come to mind. There are probably many others.


All of the items in your mono-container advantage list go away if you consider the alternative of having a bunch of separate images that share a common base image.

Image layers are immutable and get deduplicated/cached individually when stored or sent over the network. For example if you pull 10 images with a common base to the same host, the first image might be slow to download but after that docker only fetches the deltas between it and the rest.


I am well aware of that. That doesn’t change the significant decrease in admin time spent, or the improvement in usability for our engineers.


Pulling in an unrelated git repo's objects in order to integrate that software with yours is an everyday occurrence. I suspect lots of git repos have multiple root commits; some probably with many more than just two.


I had never heard of Octopus merges, and wondered if they're the reason why the GitHub mascot is Octocat. Turns out they are!

http://cameronmcefee.com/work/the-octocat/


I actually found this while researching octopus merges (mostly looking for background info - I'm working on a tool that traverses and replicates modified git repositories), and I realized that it was actually pretty interesting.


Yeah it's quite funny given the popularity of github how few people know about octopus merges. I only know about them because I built my own git graph GUI. I was about to hard code in two parents and then wondered if there could be more.


Yet Github never creates octopus merges.


They have one octopus and don’t want any more?


The mascot was from a stock illustration GitHub used on their 404 page, iirc. It grew too popular and GitHub soon licensed it for exclusive use.


Visualization of octopus merge with 66 parents:

https://imgur.com/gallery/oiWeZmm

Run this:

git log --graph --abbrev-commit --decorate --date=relative --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' 2cde51fbd0f3


The article was very interesting, but as a nit, the best-fit curves in the graphs were questionable.

"everything is linear if plotted log-log with a fat magic marker"


Check the legend.


Despite having used Git on a daily basis for more than 5 years, I still have trouble using it, and have to Google things almost every time I need something a bit out of the ordinary.


I recommend having a frontend like SourceTree for git - it uses bare git underneath, so the end result and even the state of your working copy will be just the same, but you'll have a better picture of what is happening. Since the working copy changes in an identical way to command-line git, it neatly works as scaffolding for your learning. You can use the command-line client for the stuff you are comfortable with and then resort to the GUI when you get into trouble. This way you can shortcut your learning, provided that you want to learn command-line git. Nothing wrong with staying with the GUI forever, though.


Toolchains are plagued by pseudo-GUIs - graphical overlays that don't actually abstract a CLI. Git is perhaps the most resistant thing ever to GUI-izing. I don't believe it can be done in a way that doesn't disappoint.


On the other hand, if the abstractions in the CLI tool are well chosen, it is good that the same ideas are employed in the GUI as well, to facilitate learning how the tool actually works.


Trouble is, a complicated CLI isn't an API to a data model. You can help the user visualize, but you have command line tools for that, too. You can further simplify some simple operations. But you can't build a real MVC GUI application.


I use GitKraken, and I love it. SourceTree kept crashing :(

Anyway, my comment is that I still need to use the CLI for some things, e.g. git merge squash, so while the GUI apps are good learning tools for git in general, you still need to Google things for CLI consumption if the GUI doesn't support it and you're not a git expert.


About the same length of time, and I still have to google it.

While technically git is great, usability-wise it's not (for me).

I've joked before that it's the source control version of Stockholm syndrome.


I know I'm not contributing to the discussion (sue me), but I just had to let you know that that last sentence made me chuckle.


I rarely use the git cli. Magit does almost everything and makes it much more enjoyable.


> (Update: it was an accident, which Linus responded to in his usual fashion.)

This jab at Linus is unfounded; he replied calmly and professionally.


As I wrote the last time this came up:

----

Those who only know Linus from his rants might be surprised that here "his usual fashion" means:

- Acknowledging that the root cause was Github's documentation being misleading.

- Not blaming the contributor for being misled by Github: "I can see why that documentation would make you think it's the right thing to do."

- Admitting that the ease with which the accident happened is a deficiency in Git's UI.

- CCing the Git maintainer to discuss improving Git to make it harder to do this by accident. (Which eventually led to the --allow-unrelated-histories flag being needed to do this kind of merge.)


Calling calm and professional response "his usual fashion" isn't a jab :)


All too often people call Linus' rare passionate rants "his usual fashion".


I had the photo of Linus flipping the bird to NVIDIA as my desktop wallpaper for quite a while at a previous job. Used it as a reminder not to commit something stupid (not to Linux - the internal project I was working on at the time).


> rare passionate rants

Wow, that's a really charitable way of spinning them.


His previous message in this thread was less polite:

http://lkml.iu.edu/hypermail/linux/kernel/1603.2/01890.html


That is still not ranting. Linus seems more angry at himself than the person who made the mistake.

Granted, there are some jabs at the user who made the mistake:

> Why did Laxman do that insane merge? Why did it get pulled back?
>
> You actually have to work at making shit like this, so I wonder what workflow you guys had to make that bad merge.

If you wanna put more spin on this, apparently this merge came from Nvidia.


I was just going to comment the same thing. A fairly polite comment to someone who’d added a new root commit to the kernel, acknowledging why that had seemed like a reasonable thing to do and proposing fixes to the people who could implement the fix. Seems ideal to me.


I'd love to read a book on git wizardry that goes beyond the git book.

Manishearth, a Mozillian and Rust contributor, wrote this post on splitting apart a repo. He's good at explaining all sorts of things.

https://manishearth.github.io/blog/2017/03/05/understanding-...


>> It's pulled, and it's fine, but there's clearly a balance between "octopus merges are fine" and "Christ, that's not an octopus, that's a Cthulhu merge".

This made me chuckle.


Just recently I started doing octopus merges regularly, though they won’t ever make it onto master. The situation is where you deploy from a particular ref in git, and have a beta environment; I’ll often want to have unrelated things that are sitting on beta for a while, either because they’re long-running work or because I’ve fixed something but the change hasn’t been reviewed and merged yet. My beta branch is then just an octopus merge of all the feature branches I’m working on at present and want included in the beta environment. That way I am not limited to just one thing on beta at a time, or worrying about maintaining a beta branch as well as the feature branch. It has been very liberating.

I’ve built a fairly simple tool that tracks which branches to merge into my beta branch and automates its regeneration, and I’m polishing it up so it is distributable and can just be a regular Git subcommand, git-managed-branch. What I have already is useful to me, and I think it’d be useful for many others as well.

The essence of the script I have at present boils down to this:

  git stash  # (if necessary)
  git checkout --no-track -B staging origin/master
  git merge --no-edit feature1 feature2 fix3 fix4
  git push --force
  git checkout -
  git stash pop  # (if necessary)


Now I wonder if the warning suggested by Linus in the linked mail was ever implemented. I also didn't know git wouldn't even complain.

Does anyone know? (Am on phone, slightly difficult to check).


Yes, it's implemented:

  --allow-unrelated-histories

    By default, git merge command refuses to merge histories
    that do not share a common ancestor. This option can be
    used to override this safety when merging histories of
    two projects that started their lives independently. As
    that is a very rare occasion, no configuration variable
    to enable this by default exists and will not be added.


See also the relevant lines in the function he proposed a patch for: https://github.com/git/git/blob/master/builtin/merge.c#L1401


I have used this many times while taking the opportunity to reorganise repos while migrating earlier version control systems to Git. I still need a crib sheet every time tho’!


This option was added in git v2.9.0, released on 2016-06-13.


What might have happened had you waited until you were off the phone?


He might have forgotten to check, or ask.

This reminds me of the last interaction I had on #lesswrong - I openly speculated about the UI of a system I didn't have access to, because I wanted to chat about it with people I generally respected. But some of those people just weren't in a good mood and lambasted me for not simply "googling it for the answer." Never mind that the answer isn't easily found, and never mind that that completely invalidates my whole reason for interpersonal interaction anyway...


Interpersonal interaction on #lesswrong is just wasted time that you could have spent praising the coming machine-god so it won't kill you.


I can relate. It seems that people today have no time for the social part of conversation and the creative effects of speculation. I often stop my friends when they reach for their smartphones in an argument or discussion, so that we can converse without proving facts.


From previous reading of Linus' rants, it sounds like they're pretty particular about keeping source code and git history clean. Why didn't they go back and reverse the new root created by the README.md repo?


His rants are also pretty particular about not rewriting published history. As he said on the linked message, "[...] I didn't notice the history screw-up until too late, [...]", so he probably already had pushed to the public "master" branch on the git.kernel.org servers. Once it's there, other kernel developers might already have pulled from it, so trying to rewrite the git history to remove the commit would only lead to an unholy mess (and the offending commit coming back) the next time he merges from them.


I have spent the last 2-3 years educating auditors and after quite some effort, they have learned to appreciate git. To the point where they are now starting to ask some of their other clients why _they_ are not doing something similar.

Auditors LOVE immutability. To be fair, git doesn't provide that, but it provides the next-best alternative: tamper detection. If anyone rewrites history, git will show that. The gitrefs between two points in time will not match if anyone has modified data or commits in the meantime. The auditors also have no problem looking at previous years' documents where they have recorded the relevant gitrefs at the time.
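That verification is cheap to script. A self-contained sketch (file names, commit messages, and the `merge-base --is-ancestor` approach are my own illustration, not necessarily what this team does), showing a recorded ref surviving normal work but flagging a rewrite:

```shell
# Sketch: tamper-evidence via recorded commit hashes (names hypothetical).
set -e
work=$(mktemp -d); cd "$work"
git init -q
git config user.name demo; git config user.email demo@example.com
echo "policy v1" > policy.md; git add policy.md; git commit -qm "policy v1"
echo "policy v2" > policy.md; git commit -qam "policy v2"

# Audit time: record the tip hash.
git rev-parse HEAD > audit-2017.ref

# Append-only work keeps the recorded commit an ancestor of HEAD:
echo "policy v3" > policy.md; git commit -qam "policy v3"
git merge-base --is-ancestor "$(cat audit-2017.ref)" HEAD && echo "history intact"

# But any rewrite of that history changes the hashes, and the check fails:
git reset -q --hard HEAD~1                      # drop v3
git commit -q --amend -m "policy v2, doctored"  # rewrite v2
git merge-base --is-ancestor "$(cat audit-2017.ref)" HEAD || echo "tamper detected"
```

The detection works because every commit hash covers the full tree and the parent hashes, so a rewrite anywhere upstream produces an entirely different lineage.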

This has gone so far that this year's policy review was a breeze. Our compliance documentation is maintained in a git repo, with all documents as markdown files. The final documents are simply compiled PDF and HTML artifacts.

In 2016, the auditors asked if we can provide snapshots of previous policy versions. In 2017, they already understood that we have everything in git, and knew to ask for clarifications as to when a particular change was done and who had signed it off. This year our auditors literally asked for the latest compliance documentation bundle from CI, all the individual commits, and the overall diff over the year.

Wall time spent for policy review: ~20 minutes.


How does someone go about looking for auditors who will be similarly receptive to this kind of enlightenment??

(Understanding that some effort will need to be put in)


As long as the audits are for purposes where there is actual competition[0], this is possible.

The important thing is to never treat compliance audits as box-ticking exercises. That's a never-ending, vicious cycle. In fact, many of the findings are simply different aspects of the same thing. You can pre-emptively work on this: identify what parts of requirements are essentially duplicates, and make improvements that satisfy all of them at once.

Then proudly flaunt them. When you can show to the auditors, in person, that you have considered the wider business implications and worked to understand the compliance requirements, you are on much better ground. That buys trust.

Then, educate the auditors when necessary. Show them in practice how something simple can provide a better trail and an improved experience. When possible, provide evidence in the format they initially ask for, but also in the format which is more suitable and more convenient. Auditors are humans. They just often are not aware of the leading edge, of what is possible. Show off solutions that are more convenient to both of you.

They will learn. They will be impressed by some things you do. Anything that makes their job easier, while satisfying the intent and spirit of the audit, will be an easy sell. They also believe in repeat business. Show repeatedly that you know what you are doing, and why you believe that your approach makes more sense (while delivering better audit trails).

Convenience is a strong currency.

0: There are some domains where a single company has essentially a "Royal Charter". These are much, much harder to deal with, because the monopolist has little need to employ personnel with proper technological understanding. They can also strong-arm and bully their customers at will, because there are no alternatives. Audits like these can very easily degrade into box-ticking bonanzas. My advice for these cases is: pick your fights. Double down on what you truly believe and give in on smaller, less disruptive items. Rinse and repeat on subsequent years.


Honestly his attitude in general seems to be "once live, it stays", no matter if it is git commits, "broken" APIs, or anything really (just check out his recent opinion on the C standard).


For a sampling of his rants about git history rewriting, see here: https://yarchive.net/comp/linux/git_rebase.html


Likely because that would mean rewriting public history, which I believe Linus is opposed to, particularly in the case of a huge project like the kernel.


I'm pretty adamant about not rewriting public history for _any_ project. Are other orgs pretty relaxed about rewriting git history?


I find rebasing can get so insane that most of the time, when I'm ready to submit a pull/merge request to master, I create a whole new branch from master, cherry-pick all my commits in order, rebase -i and squash them if I need to, and then push the branch and create a PR/MR.

I find this looks a lot cleaner in tools like Gitlab/Bitbucket as well.


Sounds like you're duplicating work? "git rebase -i master" does pretty much that: starting from master, it cherry-picks all the commits (in order by default, but because you're doing -i you can reorder them; likewise you can omit or amend) and then changes the ref of the branch you were working on to point at the resulting tip commit.

I don't think you need to explicitly check out a new branch and then cherry pick.


I.e. you already have not only a "whole new branch from master" but two of them: a remote tracking branch origin/master, and a local master related to it. You don't need a third one.


Is there a way to rebase and push without breaking everyone who has already pulled your repo? I have this issue continuously when using rebase on branches that have gitlab or github merge/pull requests. I need to create a different branch all the time.


You aren't supposed to rewrite history (which is what rebase does) on which other people are basing work. If people have merge/pull requests on your branch, then people are basing work on your branch, and you shouldn't rewrite it.

The best approach (which is what git.git itself uses) is to do the work on a separate branch, which is rebased often; once it's moved to the master branch, it's "frozen" and won't be rebased anymore. All pull requests are based on the master branch, so aren't affected by the constant rebases on the development branch. (Actually, all the work is done on topic branches, what's often rebased/rewritten is a sequence of merges of these topic branches into the master branch.)


That is what I meant. Master is stable; I have my feature or bug-fix branch which I already want to show. Nobody is forking from this branch, but if they pulled the branch before, it will break after my rebase.


Point people to a stable repository that doesn't have non-fastforward changes. Do your history rewrites between your local development repo(s) and a "staging" repo that outsiders do not know about. Only push good changes from "staging" to the public repo. (By "good" I don't mean bug-free code written from an irreproachably perfect design, but simply changes that are not so bad that the best way to fix them is to steamroll their history and rework them entirely.)

You can give people access to the staging repo, but then they have to acknowledge that they are willing to deal with some occasional chaos.


You can always define a new branch pointing to your new rebased ref and push/PR that to avoid having to force-push and messing everyone up.


I had a modular Laravel app where each module was a git submodule. Eventually we put all of those submodules into the same git repository, so we then had n+1 root commits for our n-module Laravel project.

It worked well until someone ran an accidental `git merge` of one of the submodules into the main project, which sewed a bunch of confusion. After that point the practice was banned where I worked. Nice to know that git added a flag to prevent that, though at this point git has also added a subtree command which I think removes the need for our hack.


*sow


Thanks.


(2017)


Thanks! Updated.


If I may- I think the first graph would look better as a histogram, with a smooth polynomial on top. It's hard to see the long-tailed shape in a scatter plot.


I probably will never do octopus merge in my entire career.


The gitflow concept requires one to verify that a branch can be merged into both targets.

hotfix/${version} needs to go into both develop and master, so you need that tiny octopus merge to verify before starting the process.

but yeah, incredibly rare.
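A sketch of that kind of throwaway trial merge (branch and file names are illustrative, not from any real gitflow tooling):

```shell
# Sketch: verify a hotfix merges cleanly with both master and develop
# by doing a throwaway octopus merge (all branch/file names hypothetical).
set -e
work=$(mktemp -d); cd "$work"
git init -q
git config user.name demo; git config user.email demo@example.com
git symbolic-ref HEAD refs/heads/master
echo base > app.txt; git add app.txt; git commit -qm "base"

git branch develop
git branch hotfix/1.0.1

git checkout -q develop
echo feature > feature.txt; git add feature.txt; git commit -qm "develop work"

git checkout -q hotfix/1.0.1
echo fix > fix.txt; git add fix.txt; git commit -qm "urgent fix"

git checkout -q master
echo docs > docs.txt; git add docs.txt; git commit -qm "master work"

# Throwaway branch off master; if the octopus succeeds, the hotfix
# merges cleanly alongside develop. The octopus strategy aborts on
# anything needing manual resolution, so it fails fast on conflicts.
git checkout -qb trial master
git merge -q --no-edit develop hotfix/1.0.1
git show -s --format="parents: %P" HEAD   # three parents on success
```

The trial branch (and here the whole scratch directory) is simply discarded afterwards; only the exit status of the merge matters.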



