Git Internals, Techniques, and Rewriting History (isquaredsoftware.com)
162 points by mxschumacher on Oct 21, 2019 | 58 comments


My colleague told me I'm the guy with his phone number in the git.txt file; I don't know if that's a compliment.

But whenever I try to explain Git to someone, I try to step away from the computer and just work out the problem on a whiteboard, using Post-its for branches/HEAD/tags and a marker to write down the commits and the commit tree. Preferably a permanent marker, to reflect the permanent nature of commits with regard to branches (e.g. rebasing keeps the original commits around). Also, I ditch files in favor of a picture of a cat, where changes are attributes like added body parts or toys, since most people like cats better than files.

Taking it away from the computer really helps to reduce the complexity involved, and thinking visually about the manipulations you make to the Git 'database' really helps in understanding the concepts, imho. Just go through it step by step as if you were the Git binary making the changes, and after a while the Git commands turn from hard-to-remember trivia into tools in your toolbox.


Conversely, this is a massive indictment of the git user interface.

People used to talk about "CASE tools": Computer Aided Software Engineering, by analogy with the CAD tools that replaced physical drawing tables once the technology got good enough. I joke that certain tools are "computer impaired software engineering", and git can certainly belong in that category at times.

What we have built is a tool with a good internal model but a user interface so awkward that it's easier to step away from the computer to the whiteboard in order to work out what to do.

We don't tolerate this for, say, word processors; people don't go to a piece of paper to lay out their headings and only then work out the incantations necessary to achieve that result.

(Has anyone attempted to make a "transparent porcelain" GUI which represents the "intuitive" internals in an actually explanatory way? The success of which could be measured by the reduced number of mistakes and amount of apprehension experienced by non-expert users?)


I don't think it's an indictment since it's a clear tradeoff. Yes git isn't intuitive or self explanatory for newcomers, but that's only because it's optimised to be maximally efficient for experienced users.

To me, this is a sensible decision. The target users of git will use it day in and day out for decades, so it makes sense to prioritise experienced usage. For onboarding new users there are plenty of good tutorials, and I think in the long run it makes sense to put the legwork in with those, rather than switch to a more intuitive but less efficient porcelain.


> Yes git isn't intuitive or self explanatory for newcomers, but that's only because it's optimised to be maximally efficient for experienced users.

But that implies that a trade-off has to be there, which it does not. If you design your features well, it can both be quite intuitive for newcomers while still affording efficiency for experts.

A great example is the staging area in git. It is a horrible feature that causes lots of inconsistent and unclear interactions with other commands, especially for newcomers. How to do it better? Just make the staging area a full-fledged commit. Considering that it's already pretty easy to edit a commit anyway, promoting the workflow of just polishing a commit before publication is more intuitive for users, and it means a question like "how do I see what the diff of the staging area by itself is?" becomes just "how do I see what the diff of a commit by itself is?" One of those questions I already know the answer to, the other I don't.


It is encouraging to see, in the work going into splitting `git checkout` into `git switch` and `git restore`, that there is growing pressure among git developers to make things more consistent for users and smooth over some of the roadblocks, including some of the confusion between the staging index and the working directory.


I disagree. Git is a complex solution to a complex problem (collaborative code management), so it's understandable that there are complexities in mastering Git. However, the UI of Git just doesn't make sense in some areas, especially to a newcomer. For a lot of people it's enough to understand the basic concepts with commands that make sense. But as it is right now, the UI turns this into hard-to-remember concepts and command arguments that belong more in a trivia quiz than in a user guide. Also see: https://gitless.com/#research


Mercurial solves the same problem but does it in a much more intuitive fashion without compromises for power users. Saying git is optimised for power users is just after-the-fact rationalization of git's horrible UI.


> it's optimised to be maximally efficient for experienced users.

Except it's not. It's an ad-hoc accretion of overlapping functions with unintuitive naming. An attempt is being made to rectify this with the "switch" command, but it's a long way from being optimized for anything at all.


We certainly tolerate it for word processors: it is called LaTeX, and a majority of CS papers are written in it. It is pretty complex, and you sometimes need the help of an outside expert, but I cannot imagine using Word instead for complex documents. Once you get past the learning curve you can be much more productive.

Hm.. replace a few words and you will get a paragraph about git.


In my experience, the only advantages LaTeX has over Word are mathematics typesetting and being able to be checked into source control. Word has a fair amount of powerful features for document layout, that are often much more easily discoverable than LaTeX (although there's no indication that "Replace All" means "replace all in selection" if you've selected it).


Word certainly has a number of powerful features, but every time I use it, I find it is really hard to have your document look consistent.

In LaTeX (or Markdown, or some other markup), it is pretty easy to enforce common style: font name, font size, indentation, space around figures, and so on.

In Word, not so much -- one bad copy-paste, and your document has a few words in an unusual font, or "11.8" pt characters, or "1.17" line spacing. This is especially bad if you are working together and some of your collaborators are using a non-English language setting -- you see a mix of "Heading 1" and "Titre 1", each configured separately.

I am pretty happy that now that I am no longer in academia, I don't need to deal with either Word or LaTeX -- it's all Markdown, JIRA's markup or, in extreme cases, Google Docs. Still, if I had to produce a fancy-looking document today, I'd probably go for LaTeX with a GUI like LyX -- good discoverability, and pretty clean .tex files on output.


> We don't tolerate this for, say, word processors; people don't go to a piece of paper to lay out their headings and only then work out the incantations necessary to achieve that result.

There's a reasonably significant set of people that do this sort of thing. The backlash against word processors in favor of simpler, focused writing tools or even simpler word processors is what I'd offer as evidence. Scott Hanselman's video series on "How to REALLY use Word." is another example. (Word has offered powerful stylesheet features since at least 1991 and very few people actually use them, even those that could benefit.)


Very true. But the word processor example is a little skewed. It depends on the audience and the value they want to get from the tool. Take LaTeX, for example, which is widely used for writing manuscripts despite being far from WYSIWYG.

Have a look at Gitless[0]. It's a layer on top of Git which simplifies/rearranges some of the concepts and commands of Git while still manipulating the Git DB in the same way (you can switch back and forth between Git and Gitless seamlessly).

[0] https://gitless.com/


Oh hey, that's my post! Glad someone found it interesting enough to submit.

Don't expect this to hit critical mass by now, but happy to answer questions on the rewriting thing if anyone has any.


Slide #39 seems to use a graph from https://marklodato.github.io/visual-git-guide/index-en.html#... , that is published under "Attribution-NonCommercial-ShareAlike 3.0 United States (CC BY-NC-SA 3.0 US)". This should be credited appropriately and your content should be shared alike.


Yeah, I grabbed a bunch of images from various sources for the slides. Unfortunately, I was in a hurry and didn't record all the places I got them originally.

I'll try to find time to go through and attribute things. Thanks for the heads-up!


In my book "credited appropriately" reads like, put the site down till you have done your homework!


Look, I'm at work. I agree that it needs to be updated, and will do it as soon as I have time.


Reading the approach you took for reformatting code, I was curious whether after you’d done it, it still seemed worth using the python+libgit2 approach? Do you know how long the shell approach would have taken without using libgit2? Did it save time overall to use the lower level approach, including the presumably extra coding time? The code formatter was probably the limiting run-time factor, no? When I’ve used filter-branch and BFG in the past, I had to run the whole batch multiple times before I got it correct, and before I was sure the result was safe... did you have to do that, run the filter multiple times? Would you take the same approach if you had to do it again from scratch? I’m curious if I should invest in this approach next time it comes up for me.


Yes, it was definitely worth it, for multiple reasons:

- I'm a lot more comfortable with Python than I am with shell scripting

- The Python+libgit2 approach _absolutely_ saved time. Being able to iterate through commits in 4-ish hours vs 24 hours for a complete run was huge, and yes, I did a lot of partial runs to test how the process was going.

- Yes, the code rewriting portion ended up as the bulk of the time. I could do a no-op iteration of the commit history in about 30 minutes, and the final processing was about 4.5 hours.

If I ever needed to do this again, yes, I'd use basically the same approach, just with the advantage of applying the knowledge I learned right away.


Cool post, I wish I could do something similar at work.

I'm wondering why you went through the pain of making an HTTP server to process JS files, instead of say putting the temporary file over a ramdisk and running the formatters in parallel over multiple files. Plus, even if it's slow, you only have to run it a few times right?


A RAM disk never even crossed my mind :) I know they exist and have used one in the past, but it didn't occur to me as a possible tool. Also, I was doing this on Windows, and I don't even know if you _can_ set up a RAM disk on Windows. (Edit: and of course, as I say that, someone literally just linked a Windows RAM disk FS driver upthread.)

And no, I was re-running this _dozens_ of times to get it right, especially when I was in the "run -> transform fails on broken JS code -> debug -> write regex -> rerun" portion.


Ah, makes sense that you'd want it to be fast then! Thanks for your answer.


I love how the first real slide goes straight into the fake manpage generator: https://git-man-page-generator.lokaltog.net/

"App Repo Size Issues:" yes, this is the Achilles heel. By being a full distributed system where every client has to carry a full copy of the history of everything, some common practices become unsustainable. I've had two employers that checked build artefacts into SVN, for example: do that with git and the repo becomes unusably large very quickly. Vendoring dependencies, a useful practice if your project is slow-moving, will also bloat the repo. They should be using "shallow clone" for Jenkins, but even that can be surprisingly large.

I've also been through the "apply BFG to repo" phase (very time-consuming, and blocks all commits while you're doing it!)

> New idea: run the formatter against every commit in the history, so that it looked as if the code was "always formatted the right way".

This is actually brilliant, for the reason they give - keeping "credit" assigned correctly to original commits.

> Determined it was okay if older commits were potentially "broken", as long as the latest commit runs and has all of our changes as of late 2018

I'm less OK with this, as the chances of an automatically introduced horrible bug which you can't trace seem rather high and you've wrecked any chance of using git-bisect! But if the "tip" is the only supported released version, I suppose it's less critical.

It's also interesting that most of the speed issues are addressed by re-architecting to avoid syncing to disk. If there was an easy Windows RAMdisk this might have made almost as much of a difference.


But there is an easy RAMdisk for Windows! http://www.ltr-data.se/opencode.html/#ImDisk


Agreed on all points.

We happen to be in a situation where we have very long development cycles, and also rarely need to go back and rebuild older versions. So, in our case we could get away with this.

And yes, the speed improvements came from a combination of iterating over Git commits in a single Python process vs running Git as an external command, and avoiding touching disk for everything except reading and writing blobs and commits.


The slides don't display well on mobile devices.


Sorry, mobile formatting was not something I was concerned about when I was putting these together. I presented them from my laptop, so that was all that mattered at the time.


Do I sense an upcoming fork? ;)


Seconded; the text does not wrap in portrait orientation on small breakpoints, and trying to scroll triggers the progression of slides.


I always found "rewriting history" misleading. Git commits are immutable, so you cannot "rewrite" the history. In fact, you are creating an "alternate history" (which the article mentions). While the difference appears subtle, understanding it takes away the fear of users who never use any "altering" commands because they think they might lose their changes or mess something up.


The most common scenario is that the original history is immediately discarded, by virtue of no longer having any refs pointing to it—though it lingers in the reflog for a while before being garbage collected. In that situation, I think it’s quite reasonable to call it rewriting history, because the original history no longer exists; for the term “alternate history” to have much meaning, the original must still exist, to be compared to.

I’m looking at the long-term picture here, rather than how you interact with it when in the process of rewriting history.

Clarifying that you’re replacing history rather than modifying it is useful, but I certainly have no beef with the expression “rewriting history”.
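
A minimal sketch of the point being argued here, as a toy model in Python rather than Git's real storage code: objects are addressed by a hash of their content, so an "amended" commit is a new object, and the original stays in the store until something garbage-collects it. The `store`, `refs`, and `write_commit` names are made up for illustration, and the commit body is heavily simplified (real commits also record a tree, author, and dates). Only the `<type> <size>\0<body>` SHA-1 header format is Git's own.

```python
import hashlib
from typing import Dict, Optional

def object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over b'<kind> <size>\\0' + body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

store: Dict[str, bytes] = {}   # object id -> body; entries are only ever added
refs: Dict[str, str] = {}      # branch name -> commit id; the only mutable part

def write_commit(message: str, parent: Optional[str]) -> str:
    """Store a simplified commit (hypothetical format) and return its id."""
    lines = ([f"parent {parent}"] if parent else []) + ["", message]
    body = "\n".join(lines).encode()
    oid = object_id("commit", body)
    store[oid] = body
    return oid

first = write_commit("add cat", None)
second = write_commit("add hat to cat", first)
refs["master"] = second

# "Amending" the tip writes a new commit and moves the ref; nothing is edited.
amended = write_commit("add hat to the cat", first)
refs["master"] = amended

print(amended != second)         # True: different content, different id
print(second in store)           # True: the original commit still exists
print(second in refs.values())   # False: but no ref points to it any more
```

Whether you call the last step "rewriting" or "replacing" is then a question of whether you look at the store or at the refs.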


You’re right in the case of rebase, and I like the point about making alternate histories, but I don’t mind the word ‘rewriting’ because the net effect is to modify the history that ‘master’ points to. I assume the downvotes are due to the article’s main point being about true history rewriting; using filter-branch and BFG isn’t creating an alternate, it’s truly re-writing.

Anyway, personally, I’d pick on the other word sometimes, and suggest we don’t call it rewriting ‘history’ when rebasing local changes before push. Maybe it’d be better to define “history” to mean things that are pushed and shared with others, and call commits that haven’t been shared something like “local changes”. Might be moot in this case since the article was mainly about rewriting the entire repo, so “history” is totally appropriate, but still perhaps nice to use terminology to distinguish between commits that have been shared from commits that haven’t, since that is the line across which one should be more careful with any kind of rewrite.


What are the best practices for extending git? Every example seems to be a shell script that calls out to git; is there a better approach?


What kind of functionality are you trying to add? For something simple, shelling out to git is probably OK; if you're writing something more complicated, many projects rely on libgit2. For really weird stuff you can probably just go and edit objects directly inside of .git, as long as you're careful.


I would like to extend git such that you can check out a commit, amend it, rebase all descendants onto the amended commit, and update any branches accordingly. Effectively like `rebase -i` but not modal.

This requires keeping some state around. libgit2 is a nice pointer, I'll check that out.

(hg users will recognize this as "restacking.")
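
A rough sketch of what that restacking operation has to do, written as a toy in-memory model in Python. It does not use Git or libgit2, ignores merge commits and conflicts, and the `commits`, `branches`, `commit`, and `restack` names are all hypothetical, so treat it as an illustration of the bookkeeping rather than an implementation.

```python
import hashlib
from typing import Dict, Optional

commits: Dict[str, dict] = {}    # commit id -> {"parent": id or None, "message": str}
branches: Dict[str, str] = {}    # branch name -> tip commit id

def commit(message: str, parent: Optional[str]) -> str:
    """Create an immutable toy commit whose id depends on parent and message."""
    cid = hashlib.sha1(f"{parent}\n{message}".encode()).hexdigest()
    commits[cid] = {"parent": parent, "message": message}
    return cid

def restack(old_id: str, new_message: str) -> Dict[str, str]:
    """Amend old_id, replay its descendants on top, and move affected branches.

    Returns a mapping from each rewritten commit id to its replacement.
    """
    rewritten = {old_id: commit(new_message, commits[old_id]["parent"])}
    progress = True
    while progress:  # repeat until no commit still sits on a rewritten parent
        progress = False
        for cid, data in list(commits.items()):
            if cid not in rewritten and data["parent"] in rewritten:
                rewritten[cid] = commit(data["message"], rewritten[data["parent"]])
                progress = True
    for name, tip in branches.items():
        if tip in rewritten:
            branches[name] = rewritten[tip]
    return rewritten

# Two branches share the commit being amended; a third is unaffected.
root = commit("add cat", None)
a = commit("add hat", root)
b = commit("add toy", a)
c = commit("add tail", a)
branches.update({"feature": b, "other": c, "base": root})

mapping = restack(a, "add red hat")
print(commits[branches["feature"]]["parent"] == mapping[a])  # True
print(branches["base"] == root)                              # True
```

The returned mapping is the "state" that has to be kept around: it records which old commit each new one replaces, which is what lets every branch be moved in one pass.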


You can just put any executable in your PATH with a name of the form git-WHATEVER.

e.g.

    $ echo echo hello world > ~/bin/git-hello-world
    $ chmod +x !$
    $ git hello-world
    hello world

You'll even get typo recognition:

    $ git hello-worl
    git: 'hello-worl' is not a git command. See 'git --help'.

    The most similar command is
       hello-world
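
The naming convention above can be approximated in Python, as a sketch only. Git itself also checks its built-in commands and its own exec path before falling back to PATH, and the `install_subcommand` and `resolve_subcommand` helpers here are hypothetical names, not part of any Git tooling. The example assumes a POSIX system, where an executable needs no file extension.

```python
import os
import shutil
import stat
import tempfile
from typing import Optional

def install_subcommand(bin_dir: str, name: str, script: str) -> str:
    """Write `script` to bin_dir/git-<name> and mark it executable."""
    path = os.path.join(bin_dir, f"git-{name}")
    with open(path, "w") as handle:
        handle.write(script)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path

def resolve_subcommand(name: str, search_path: str) -> Optional[str]:
    """Look for an executable called git-<name> in a PATH-style string."""
    return shutil.which(f"git-{name}", path=search_path)

# Use a throwaway directory as the only PATH entry for the lookup.
with tempfile.TemporaryDirectory() as bin_dir:
    installed = install_subcommand(
        bin_dir, "hello-world", "#!/bin/sh\necho hello world\n"
    )
    found = resolve_subcommand("hello-world", bin_dir)
    missing = resolve_subcommand("hello-worl", bin_dir)

print(found == installed)  # whether the exact name resolved to the installed file
print(missing)             # result of looking up the misspelled name
```

The script body is a shell one-liner here, but the same lookup would find a compiled binary or a script in any language with a suitable shebang line.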


You can add a section called alias to your ~/.gitconfig

    [alias]
        lol = !git --no-pager log --graph --decorate --abbrev-commit --all --date=local --pretty=format:\"%C(auto)%h%d %C(blue)%an %C(green)%cd %C(red)%GG %C(reset)%s\"

Then you can run "git lol" and get that cool output. That's one way to "extend" git.


Extending how? Are you looking for automation or new commands or something else?


git aliases

custom shell scripts

use a Git library in your language of choice and write something


I find the diagram on page 15 very good for getting an overall feel for the flow of basic git usage. If I were still teaching at a coding school, I'd hand it out to my students.


Isn't it a bit off? It shows "git diff" as a diff between staging and workspace when it should be between workspace and local repo.


Not quite: if you stage something, it no longer appears in "git diff" and you have to use "git diff --cached" to show it. One of those things which I trip over occasionally.

"git diff --cached" = staging vs. HEAD

"git diff" = workspace vs. staging (but when nothing is staged, staging == HEAD)

(I think?)


Yep, you are right, "but when nothing is staged, staging == HEAD" is exactly what I had in mind.


No, this slide is correct. git diff (without any arguments) shows you what you haven't staged. In other words, the difference between the working folder (which here is called "workspace") and the index.
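
To make the two comparisons concrete, here is a small toy model in Python. It treats HEAD, the index, and the workspace as plain dictionaries of path to content, which is a simplification (no Git involved), and the `git_diff` and `git_diff_cached` function names are just labels for which pair of snapshots each command compares.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # path -> file content

def changed_paths(old: Snapshot, new: Snapshot) -> List[str]:
    """Paths whose content differs between two snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

def git_diff(index: Snapshot, workspace: Snapshot) -> List[str]:
    """Plain `git diff`: workspace compared against the index."""
    return changed_paths(index, workspace)

def git_diff_cached(head: Snapshot, index: Snapshot) -> List[str]:
    """`git diff --cached`: index compared against HEAD."""
    return changed_paths(head, index)

head = {"cat.txt": "cat"}
index = dict(head)                       # nothing staged, so index == HEAD
workspace = {"cat.txt": "cat with hat"}  # an unstaged edit

before = (git_diff(index, workspace), git_diff_cached(head, index))

index["cat.txt"] = workspace["cat.txt"]  # roughly what `git add cat.txt` does

after = (git_diff(index, workspace), git_diff_cached(head, index))

print(before)  # (['cat.txt'], [])
print(after)   # ([], ['cat.txt'])
```

Before staging, the edit shows up only in plain `git diff`. After staging, it moves to `git diff --cached`, which is why a staged change seems to "disappear" from `git diff`.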


Whoops, I think you're right.

I skimmed over git diff; whenever I use it, I have to figure out all over again what exactly it compares. I just care to remember "it diffs".


Note: the presentation has UI flaws and does _not_ fit on a 3:2 screen, so you'll have to zoom to about 80% for it to be visible. This is apparent on slide 8.


Impossible to resize and read on my phone. It amazes me when a developer writes a presentation about a developer tool using a goopy JS-dependent infrastructure that apparently has never been tested on mobile.


As I said down-thread, my only immediate concern when I put the presentation together was presenting it from my own laptop. I appreciate that folks are wanting to read it on mobile, but I've got a lot of stuff on my plate (primarily work around maintaining Redux), and modifying my slides to work better on mobile has not been a priority.


Fair enough. Sorry about not reading before commenting.


No worries.

The complaints about it not being readable on mobile are legit. If I could wave a magic wand and make it all magically responsive, I would.

But, my original goal was simply to actually make the slides for the presentation, and show them during my talk. I do specifically use the Spectacle React/JS toolkit, as I _want_ to publish the slides online later for viewing on the web (one of several reasons why I don't make them in PowerPoint). But, part of the reason I can get away with that is that publishing them on my blog is just a matter of uploading the built assets and moving on, vs having to convert them from PowerPoint by hand or something.

I've got a ton of other priorities and tasks to deal with, and figuring out what's needed to make the slides well-formatted on mobile realistically isn't anywhere on that list. Honestly, this thread is the first time anyone's actually complained about that.

I'm not sure how much of the formatting issue is due to Spectacle's own styles, vs the typical slide layout that I have in there (which is mostly flexboxes with two items side-by-side). If anyone has some specific suggestions on how to alter the styles to make them work better, I could try to apply those and rebuild it.


Safari user here. For some reason the navigation arrows are missing and I can't get past the first page...


Sorry! I built it using the Spectacle web slides toolkit, and have only looked at it in Chrome and Firefox because that's what I use myself.


Would have been nice if we could scroll through the pages instead of having to use the arrow keys.


Is there any video archive of the presentation?


No, I only gave this as an internal brown bag talk at work.


We changed the URL from https://blog.isquaredsoftware.com/2019/10/presentation-hooks... to the slides which have the content.


Note that the original post links to an earlier article I wrote that goes into a lot further detail on the actual rewrite process:

https://blog.isquaredsoftware.com/2018/11/git-js-history-rew...



