Hacker News
Why SQLite Does Not Use Git (sqlite.org)
508 points by trextrex on April 10, 2018 | 608 comments

"Nobody really understands git" is the truest part of that. While hyperbolic, it really has a lot of truth.

It's always a bit frustrating when working with a team because everyone understands a different part of git and has slightly different ideas of how things should be done. I still routinely have to explain to others what a rebase is and others have to routinely explain to me what a blob really is.

In a team of even moderate size, teaching and learning git from each other is a regular task.

People say git is simple underneath, and if you just learn its internal model, you can ignore its complex default UI. I disagree. Even just learning its internal model leads to surprises all the time, like blobs: I keep forgetting why they aren't just called files.

The day I got over what I feel was the end of the steep part of the learning curve, everything made so much sense. Everything became easy to do. I've never been confused or unsure of what was going on in git since.

What git needs is a chair lift up that hill. A way to easily get people there. But I have no idea what that would look like. Lots of people try, few do very well at it.

The whole point about abstractions is you shouldn't need to understand the internals to use them. If the best defense of git is "once you understand the inner workings, it's so clear" then it is by definition a poor abstraction.

Who said it's supposed to be an abstraction? The point, theoretically, of something like Git is that the actual unvarnished model is clear enough that you don't need an abstraction. The problem IMO is that the commands are kind of random and don't map cleanly to the model.

Indeed, the worst offenders are in my opinion checkout, reset and pull.

They mix multiple only slightly related commands in one.

There are a couple of projects that try to tackle this problem by providing an alternative CLI (on top of git's own plumbing), like gitless and g2. I haven't used either of them myself, but would be interested in the experience of others.

Any interface means you'll build a mental model of the system you're manipulating. How else could you possibly know what you want to do and what commands to issue?

So given a mental model is inevitable, seems reasonable that that model should be the actual model.

You don't need to understand how media is encoded to watch a movie or listen to a song. You don't need to understand the on disk format of a Word document to write a letter. When writing a row to an SQL database I don't always understand how that software is going to record that data, but I do know I can use that SQL abstraction to get it back out.

> You don't need to understand how media is encoded to watch a movie or listen to a song.

I recall the time when mp3 was too demanding for many CPUs, so you had to convert to non-compressed formats. Today you do need to know that downloading non-compressed audio will cost you a lot of network traffic. Once performance is a concern, all abstractions have to be discarded.

Exactly, if you stick to the very basics with git, you can live a happy life never caring about the internals. If you however want to dig into the depths of Git and use all its power, I don’t get why people don’t think there would be an obvious learning curve.

Same exact thing above applies to so many things in software development, from IDEs, to code editors (Vim/Emacs/Sublime/etc), to programming languages, to deploy tools, the list goes on. There’s a reason software development is classified as skilled labor and not a low end job generally. You’re expected to have knowledge of, or be willing to learn a lot, to do your job.

The difference is that the video model abstracts over the encoding, the git model does not abstract over the storage model, it exposes it. git commands are operations on a versioned blob store.

It's not versioned.
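For readers following the "blob store" exchange above, here is a minimal Python sketch. It assumes only Git's documented object id scheme for blobs (SHA-1 over a `blob <size>\0` header followed by the content). The `BlobStore` class and its method names are hypothetical and exist only for illustration; real Git also compresses objects and stores trees and commits alongside blobs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id Git assigns to a blob with this content (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class BlobStore:
    """Hypothetical in-memory stand-in for an object database (blobs only)."""

    def __init__(self) -> None:
        self._objects = {}  # maps object id -> raw content

    def put(self, content: bytes) -> str:
        # The key is derived from the content alone: no filename, no timestamp.
        object_id = git_blob_id(content)
        self._objects[object_id] = content
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


store = BlobStore()
readme_id = store.put(b"hello\n")
copy_id = store.put(b"hello\n")          # same bytes, as if saved under another name
changed_id = store.put(b"hello, world\n")

print(readme_id == copy_id)     # identical content maps to one object
print(readme_id == changed_id)  # an edit produces a separate, unrelated object
```

This may help explain both the "why aren't they called files" question and the "it's not versioned" reply: in this model a blob carries no name and no history, so any notion of versions has to come from other objects that point at blobs.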

> So given a mental model is inevitable, seems reasonable that that model should be the actual model.

I think the longevity of SQL has proved there's value in non-leaky abstracted interfaces.

> I think the longevity of SQL has proved there's value in non-leaky abstracted interfaces.

How is sql non-leaky? To be proficient with sql you have to understand how results are stored on disk, how indexes work, how joins work, etc. To debug and improve queries you need to look at the query plan, which is the database exposing its inner workings to you.

You have to know about the abstractions an sql server sits on as well. Why is it faster if it's on an SSD instead of an HDD? Why does the data disappear if it's an in-memory DB?

> To be proficient with sql you have to understand how results are stored on disk, how indexes work, how joins work, etc

No, you don’t. As far as I know, the data is stored in discrete little boxes and indexes are a separate stack of sorted little boxes connected to the main boxes by spaghetti. This is the abstraction, it works, and I don’t need to know about btrees, blocksizes, how locks are implemented, or anything else to grok a database.

You've never had to look at a query plan that explains what the database is doing internally? If not then I wouldn't consider you proficient, or you've only ever worked with tiny data sets.

Have you created an index? Was it clustered or non-clustered? That's not a black box, that's you giving implementation details to the database.
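As a concrete illustration of the query-plan point, here is a small sketch using Python's standard `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`. The `users` table, the `idx_users_name` index, and the sample data are made up for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)

QUERY = "SELECT email FROM users WHERE name = ?"


def plan(connection: sqlite3.Connection) -> str:
    """Return the readable 'detail' column of SQLite's query plan for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("user42",)).fetchall()
    # The fourth column of each row holds the human-readable description.
    return "; ".join(row[3] for row in rows)


plan_before = plan(conn)  # no index on name yet
conn.execute("CREATE INDEX idx_users_name ON users (name)")
plan_after = plan(conn)   # same SQL text, different access path

print(plan_before)
print(plan_after)
```

The SQL text of the query never changes, which supports the "semantics are fully contained in SQL" side, while the plan output changing from a scan to an index search is the implementation detail the other side says you eventually have to read.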

I don’t think being a professional DBA managing an enterprise Oracle installation is isomorphic to the general populace that might use git.

There’s no question that knowing more will get you more, but I think for the question of “when will things go sideways and I need to understand internals to save myself”, one would be able to use a relational database with success longer than git, getting by on abstractions alone. Running a high-performance installation of either is really outside the scope of the original point.

Those things don't generally influence how you structure the query, though - you can choose to structure your query to fit the underlying structure better, or you can modify the underlying structure to better fit your data and the manipulations you are trying to perform.

Yes, most of us will have to do both at some point, but they can be thought of as discrete skills.

This isn't a bad analogy though. Git itself is similar - once you understand the graph-like nature of commits (which isn't all that complicated to begin with), it's generally not hard to skim through a repository and understand its history. Diffing etc. is also simple enough this way.

If, on the other hand, you are working to create said history (and devise/use an advanced workflow for that), it's very helpful if you understand the underlying concepts. Which also goes for designing database layouts - someone who doesn't understand the basics of the optimizer will inevitably run into performance problems, just as someone who doesn't understand Git's inner workings will inevitably bork the repository.

You don't need to know more than sql to manipulate the data. The semantics of your query are fully contained in sql.

You may need to go deeper and understand the underlying model if you want performance, but sticking to normal forms can make that unnecessary for a lot of people a lot of the time.

You can have a useful separation of work between a developer understanding/using sql and a DBA doing the DDL part and the optimization when needed.

You have a very high standard for what 'proficient' means, and yet a very low one.

That is, I am not proficient with relational databases, and I can handwave why an SSD is faster, and why data may disappear from an in-memory DB.

But I couldn't do an outer join without help. Nor do I know when I would want to do one.

Bob Martin wrote the essay at http://blog.cleancoder.com/uncle-bob/2017/12/09/Dbtails.html , in which he writes:

> Relational databases abstract away the physical nature of the disk, just as file systems do; but instead of storing informal arrays of bytes, relational databases provide access to sets of fixed sized records.

This isn't true. SQLite does not use fixed size records.

This suggests to me that a lot of people who consider themselves proficient with SQL don't know how the results are stored on disk, nor the difference between the SQL model and the actual implementation details, making them not proficient under your definition.

> That is, I am not proficient with relational databases, and I can handwave why an SSD is faster, and why data may disappear from an in-memory DB.

Because you know that information for other reasons as most people would. Just because the information is gained for other reasons does not make it irrelevant when using a database though.

> This isn't true. SQLite does not use fixed size records.

It's actually true of most/all modern databases these days. The point isn't knowing the exact structure the database uses to store its information (even though it can be useful) but knowing how efficiently it can find the information for any given request. Knowing when a database is doing an index lookup or a full table scan is very important and I wouldn't consider someone that can't make a reasonable guess to be proficient in sql. Many of these details are even exposed in the sql: when you create an index and decide if it's clustered or non-clustered, you're giving the database specific directions about how the data will be physically stored.

The fact that you need to know anything about how they do their work internally to be reasonably competent at using them makes them a leaky abstraction.

SQL leaks for complex queries and schemas if performance needs to be optimized. I argue virtually all abstractions leak heavily when performance is considered, some more than others. SQL leaks relatively little in comparison to some other technologies IME.

Also, SQL has well-established processes and formalisms to design schemas which generally result in solid performance by themselves. That's what RDBMS are around for, after all: enabling efficient and consistent record-oriented data manipulation. This is quite difficult to do correctly in reality; for example, if you write your own transaction mechanism for disk/solid-state storage, you are going to do it wrong. This is genuinely difficult stuff.

There is a ton of internals that SQL abstracts so well that very few DB programmers know or (have to) care about them. Things like commit and rollback protocols, checkpointing, on-disk layouts, I/O scheduling, page allocation strategies, caching etc.

You wrote "Just because the information is gained for other reasons does not make it irrelevant when using a database though."

Certainly. My comment, however, concerned what you meant by 'proficient', and not simple use.

You used the qualifier "all modern databases". Was that meant to imply that SQLite is not a modern database?

My point remains that there are many people who are proficient in SQL, and would do very well with SQLite, even without knowing the on-disk format.

That is why I disagree with your use of the term "proficient".

You seem to be talking about a different kind of leakiness. In my mind, there are two kinds: conceptual and performance leakiness. You are talking about the latter. Pretty much any non-trivial system on modern hardware leaks performance details. From what I understand, git's UI tries to provide a different model than the actual implementation but still leaks a lot of details of the implementation model.

It probably should be homomorphic to the actual model, but not the actual model. The map cannot be the terrain.

I disagree with that. The point of an abstraction is not having to know the implementation. Understanding the principles behind it will always lead to much better use of your abstraction.

I'd also say an abstraction could be carrying its weight even if it only reduces the amount you have to think about the implementation details when using it.

Leaky abstractions is how we get stuff like ORMs.

To be fair, most ORMs poorly implement the "leaky" principle. When implemented well, like with SQLAlchemy, the end result is a much nicer ORM.

In fact, one of the things in common among the ORMs that have left a bad taste in my mouth is that they all tried to abstract away SQL without leaking enough of it.

I have a different thesis:

Picking the ideal interface to abstract is critically important (and very hard).

In the case of ORMs, available solutions abstract the schema (tables, rows, fields), the objects, or use templates. My solution abstracted JDBC/ODBC. The only leak in my abstraction was missing metadata, which I was able to plug (with much effort!).

My notions for interfaces, modularity, abstractions are mostly informed by the book "Design Rules: The Power of Modularity". http://a.co/hXOGJq1

This might sound a little out of touch, but am I the only one who doesn't think git is that hard? It is a collection of named pointers and a directed acyclic graph. The internals aren't really important once you have that concept down.
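To make "named pointers and a directed acyclic graph" concrete, here is a toy Python model. It is only a sketch: the commit ids are made-up labels rather than real hashes, and the `history` and `commit` helpers are hypothetical names, not anything Git provides.

```python
# Each commit records only its parents; that alone defines the DAG.
commits = {
    "a1": [],            # root commit
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["b2"],        # a second line of work starting from b2
    "e5": ["c3", "d4"],  # merge commit with two parents
}

# Branches and tags are just names pointing at one commit each.
refs = {"master": "e5", "feature": "d4", "v1.0": "b2"}


def history(ref_name):
    """Return every commit reachable from a ref by following parent links."""
    seen = []
    stack = [refs[ref_name]]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.append(commit_id)
        stack.extend(commits[commit_id])
    return seen


def commit(ref_name, new_id):
    """Model committing on a branch: add a node, then move the pointer to it."""
    commits[new_id] = [refs[ref_name]]
    refs[ref_name] = new_id


commit("feature", "f6")
print(sorted(history("feature")))
print(sorted(history("master")))
```

In this model a branch is nothing more than a dictionary entry, and committing only adds a node and moves one pointer. Note also that links run from child to parent only.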

But what about the deal breaker in the article: a way to follow the descendants of a commit.

Took a few seconds with a search engine on "git descendants of a commit": https://stackoverflow.com/questions/27960605/find-all-the-di...

That said, I do feel some "porcelain" git commands are poorly named and operate inconsistently -- compared to the plumbing of the acyclic graph concepts which is good but limited.

So, in git, to show descendants of a commit, you use

    git rev-list --all --parents | grep "^.\{40\}.*<PARENT_SHA1>.*" | awk '{print $1}'

whereas in fossil, you use

    fossil timeline after <COMMIT>

I mean, one of these looks just a little more straightforward than the other, doesn't it?

Also, a cursory test in a local git repo just now showed that command seems to print out only immediate descendants--i.e., unless that commit is the start of a branch, it's only going to tell you the single commit that comes immediately after it, not the timeline of activity that fossil will--and all it gives you is the hash of those commit(s), with no other information.
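The difference between "immediate descendants" and "the whole timeline after a commit" can be sketched in Python over a toy history. The `parents` map below is hypothetical data in the child-then-parents shape that `git rev-list --parents` prints, and the helper names are invented for the example.

```python
from collections import defaultdict

# Hypothetical history: child -> list of parents.
parents = {
    "a1": [],
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["c3"],
    "e5": ["b2"],        # branch off b2
    "f6": ["d4", "e5"],  # merge
}


def children_index(parent_map):
    """Invert the parent links once, since commits do not store their children."""
    children = defaultdict(list)
    for child, commit_parents in parent_map.items():
        for parent in commit_parents:
            children[parent].append(child)
    return children


def immediate_children(parent_map, commit_id):
    """Commits that list commit_id as a parent (what a one-level match finds)."""
    return sorted(children_index(parent_map)[commit_id])


def all_descendants(parent_map, commit_id):
    """Walk child links transitively to collect every later commit."""
    children = children_index(parent_map)
    found = set()
    stack = [commit_id]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return sorted(found)


print(immediate_children(parents, "c3"))
print(all_descendants(parents, "c3"))
```

Because the stored links point backwards, any "what came after this?" query has to scan the history and invert the edges first, which is one plausible reason the one-liner above looks so roundabout.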

I use git myself, not fossil, but if this is something you really want in your workflow, fossil is a pretty clear win.

I mean, sure. He really wanted this feature in fossil, gave it a first class command line ui, and it's super easy.

How many other ways of looking at commits or trees are there, that are hard in git but impossible in fossil because the author didn’t feel like it?

I don't know why they have the need to retrieve the hash of the descendant commit, but usually what I'm doing is: I use a decent visual tool and just follow the branch (sourcetree).

You could alternatively use:

    git log --graph

    git log <COMMIT>..

`git log` stays in the current branch unless you give it the `--all` option. But when you give it the `--all` option, the limitation by `<COMMIT>..` no longer works. So not a solution.

    git log --all --ancestry-path ^<COMMIT>

You mean you didn't just read git's easy-to-follow, well-structured man pages or built-in help? /s

Half the time, when I know what I want, I both keep forgetting git flags and sub-commands - and struggle to find them in the man pages.

Like the fine:

    git diff --name-only  # I list only the files that are changed. But I'm not --list-files.

Diff lists files without --name-only though, so the flag specifies you want _only_ the filenames of those with a diff.

True, but it's inconsistent with eg grep.

Getting only the changed filenames is a fairly specialised operation. Often in normal use you can get away with a more generic operation that comes close to what you need, but is way more common, e.g.

    git diff --stat

If you're new to tech or you've got a different mental model of how version control works, getting across the gap to git is a challenge.

My current team are mostly controls engineers, working on PLCs. But the software we're now working with has its configurations tracked in git. These aren't dumb people, they're quite talented, but their education wasn't in CS, and "directed acyclic graph" is not a thing they have a mental model for.

No you're definitely not the only one. Git is one of the simplest and dumbest tools developers have at our disposal. People's inability to conceptualize a pretty straightforward graph is something no amount of shiny UI can ever fix.

I don't understand HN's hardon for hating Git.

Sure, and a piece table is a simple way to represent a file's contents. But if anyone wrote a shell or a text editor that required you to directly interact with the piece table to edit a file—instead of something sane—then they'd rightfully be called out on it. It wouldn't matter how much you argued about how simple the piece table is to understand, and it wouldn't matter how right you were about how simple the piece table is to understand. It's the wrong level of abstraction to expose in the UI.

The only thing Git can really fix is changing its command flags to be consistent across aliases/internal commands. That's about it. The whole point of an SCM is that graph that you want to move away from. People have asserted your claim many times but can't ever give specific things to fix about the "abstraction."

There are about 5/6 fundamental operations you do in git/hg. If that's too much then again, there's not an abstraction that is going to help you out.

See, you're trying to foist a position on me that isn't mine—that I'm scared of the essential necessities of source control. And you act as if source control were invented with Git. Neither of these is true.

> git/hg

Mercurial was a great solution to the same problem that Git set out to tackle, virtually free of Git's foibles. The tradeoff was a few minor foibles of its own, but a much better tool. It's a fucking shame that Git managed to suck all the air out of the room, and we're left with a far, far worse industry standard.

>Mercurial was a great solution to the same problem that Git set out to tackle, virtually free of Git's foibles.

No, Mercurial's design is fundamentally inferior to Git, and practically the entire history of Mercurial development is trying to catch up to what Git did right from the start. For example having ridiculous "permanent" branches -> somebody makes "bookmarks" plugin to imitate Git's lightweight branches -> now there are two ways to branch, which is confusing. No way to stash -> somebody writes a shelve plugin -> need to enable plugin for this basic functionality instead of being proper part of VCS. Editing local history is hard -> Mercurial Queues plugin -> it's still hard -> now I think they have something like "phases". In Git all of this was easy from the start.

Another simple thing. How to get the commit id of the current revision. Let's search stack overflow:


The top answer is `hg id -i`.

    $ hg id -i

The problem is, this answer is wrong! This simple command can execute for hours on a large enough repository, and requires write privileges to the repository! Moreover, it returns only a part of the hash. There's literally no option to display the full hash.

The "correct" answer is `hg parent --template '{node}'`. Except `hg parent` is apparently deprecated, so the actual correct way is some `hg log` invocation with a lot of arguments.

I would not call "hg log -r tip" a lot of arguments.

Also, on the git/hg debate, I feel I've had problems (like the stash your modification and redownload everything) more often with git than hg. I mean perhaps it says something about my capability to understand a directed acyclic graph, but hg seems less brittle when I'm using it.

I disagree with some of your comments. Is git stash really essential, or is it unneeded complexity? That's debatable; I never use it personally.

What I don't like in git is the loss of history associated with squashing commits. I would prefer having a 'summary' that would keep the full history but by default would be used like a single commit.

In git you can use merge commits as your "summary" and `--first-parent` or other DAG depth flags to `git log` (et al) to see only summaries first. From the command line you can easily add that to key aliases and not worry about it. I think that if GitHub had a better way to surface that in their UI (ie, default to `--first-parent` and have accordions or something to dive deeper), there would be a lot less squashing in git life. (Certainly, I don't believe in branch squashing.)

The DAG is already powerful enough to handle both the complicated details and the top-level summaries, it's just dumb that the UIs don't default to smarter displays.
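The "merge commits as summaries" idea can be sketched in Python as two traversals over the same toy history. The commit labels and helper names are hypothetical, and the sketch assumes every feature branch is merged with a real merge commit whose first parent is the mainline.

```python
# Hypothetical history; for a merge, the first listed parent is the mainline side.
parents = {
    "m0": [],
    "f1": ["m0"],        # feature A work
    "f2": ["f1"],
    "m1": ["m0", "f2"],  # merge of feature A into main
    "g1": ["m1"],        # feature B work
    "m2": ["m1", "g1"],  # merge of feature B into main
}


def first_parent_log(tip):
    """Follow only the first parent of each commit: the 'summary' view."""
    log = []
    current = tip
    while current is not None:
        log.append(current)
        commit_parents = parents[current]
        current = commit_parents[0] if commit_parents else None
    return log


def full_log(tip):
    """Follow every parent: the detailed view, branch commits included."""
    seen, stack = [], [tip]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents[current])
    return seen


print(first_parent_log("m2"))
print(sorted(full_log("m2")))
```

Both views come from the same stored graph, so nothing has to be squashed away to get the short one; the choice is purely in how the history is walked and displayed.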

(I find git stash essential given that `git add --interactive` is a painful UX compared to darcs and git doesn't have anything near darcs' smarts for merges when pulling/merging branches. Obviously, your mileage will vary.)

>you're trying to foist a position on me that isn't mine

I just said you can't give specifics on what to change, because there isn't much to change.

>And you act as if source control were invented with Git

No I'm not?

>and we're left with a far, far worse industry standard.

Yeah, we definitely should have gone with the system that can't do partial checkouts correctly or even roll things back. Branching name conflicts across remote repositories and bookmark fun! Git won for a reason, because it's good and sane at what it does.

That reason was called Linus and Linux kernel development.

The master can do no wrong.

No, the reason is mercurial sucked at performance with many commits at the time, and was extra slow when merging.

Lacked a few dubious features such as merging multiple branches at the same time too.

It has improved but git is still noticeably more efficient with large repositories. (An almost direct comparison is any operation on the Firefox repository vs. its git port.)

Mercurial has always been better than Git on Windows.

Those dubious features are so relevant to daily work that I didn't even know they existed.

Git's main target is Linux. Obviously. Performance on the truly secondary platform was not relevant, and the slowness there is mostly caused by slow lstat calls.

Mercurial instead uses an additional cache file, which is slower on Linux with big repos but happens to be faster on Windows.

And the octopus merge is used by kernel maintainers sometimes if not quite a lot. That feature is impossible to add in Mercurial as it does not allow more than two commit parents.

Which reinforces the position that git should have stayed a Linux kernel specific DVCS, as the Bitkeeper replacement it is, instead of forcing its use cases on the rest of us.

>Which reinforces the position that git should have stayed a Linux kernel specific DVCS

No it doesn't? People use octopus merges all the time, every single day.

Well, I only get blank stares when I mention octopus merges around here.

...as I get stares (okay, mostly of fear) if I point out that we need a branch in my workplace. What you can/can't do (sanely) with your tool shapes how you think about its problem space.

To emphasize that even more: Try to explain the concept of an ML-style sum type (i.e. a discriminated union in F#) to someone who only knows languages with C++-based type systems. You'll have a hard time even explaining why this is a good idea, because they will try to map it to the features they know (i.e. enums and/or inheritance hierarchies), and fail to get the upsides.
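For readers who have not met sum types, here is a rough Python approximation. It is only a sketch: the `Shape`, `Circle` and `Rectangle` names are invented, and Python does not enforce exhaustiveness at runtime the way an ML-family compiler checks a match expression, so the final `raise` stands in for that guarantee.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Circle:
    radius: float


@dataclass(frozen=True)
class Rectangle:
    width: float
    height: float


# A Shape is exactly one of these alternatives, each carrying its own data.
Shape = Union[Circle, Rectangle]


def area(shape: Shape) -> float:
    """Dispatch on which alternative we were given."""
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    if isinstance(shape, Rectangle):
        return shape.width * shape.height
    raise TypeError(f"not a Shape: {shape!r}")


print(area(Rectangle(2.0, 3.0)))
print(area(Circle(1.0)))
```

The upside that is hard to convey is that the set of alternatives is closed and each one carries different fields, which neither a plain enum nor an open inheritance hierarchy expresses directly.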

Easy, it is called std::variant, available since C++17.

Yeah, I guess. Except that std::variant is basically a glorified C union with all the drawbacks that entails.

But git didn't force its use on anybody, lol. If you need a scapegoat, try GitHub!

You wrote: People have asserted your claim many times but can't ever give specific things to fix about the "abstraction."

That seems like you made an assertion as well. I think there are counter-examples.

For example, the point of gitless is (quoting http://gitless.com/ ):

> Many people complain that Git is hard to use. We think the problem lies deeper than the user interface, in the concepts underlying Git. Gitless is an experiment to see what happens if you put a simple veneer on an app that changes the underlying concepts

Some commentary is at https://blog.acolyer.org/2016/10/24/whats-wrong-with-git-a-c... .

Many HN discussions as well, including https://news.ycombinator.com/item?id=6927485 .

> The whole point of an SCM is that graph that you want to move away from.

I think that's an exaggeration. For example, Darcs and Pijul aren't based around a "graph of commits" like Git is, they use sets of inter-dependent patches instead. I'm sure there are other useful ways to model DVCS too.

Whilst this is mostly irrelevant for Git users, you mentioned Mercurial so I thought I'd chime in :)

> The only thing Git can really fix is changing its command flags to be consistent across aliases/internal commands.

I mostly agree with this: Git is widespread enough that it should mostly be kept stable; anything too drastic should be done in a separate project, either an "overlay", or a separate (possibly Git-compatible) DVCS.

>For example, Darcs and Pijul aren't based around a "graph of commits" like Git is, they use sets of inter-dependent patches instead.

I said graph, I didn't say which graph. Both systems still use graphs. And still a graph you have to understand how to edit with each tool. The abstraction is still the same, and if you have problems with Git, you're going to have problems with either of those tools as well. The abstraction is not the problem, it's the developer's inability to conceptualize the model in their head.

Where is the exaggeration?

> I said graph, I didn't say which graph

You said "that graph" which, in context, I took to mean the git graph.

> Both systems still use graphs


> The abstraction is still the same

Not at all, since those graphs mean different things. Each makes some things easier and some things harder. For example, time is easy in git ("what did this look like last week?"). Changes are easy in Darcs ("does this conflict with that?"). Both tools allow the same sorts of things, but some are more natural than others. I think it's easy enough to use either as long as we think in its terms; learning to think in those terms may be hard. For git in particular, I think the CLI terminology doesn't help with that (e.g. "checkout").

> if you have problems with Git, you're going to have problems with either of those tools as well

Not necessarily. As a simple example, some git operations "replay" a sequence of commits (e.g. cherrypicking). I've often had sequences which introduce something then later remove it (bugs, workarounds, stubs, etc.). If there's a merge conflict during the "replay", I'll have to spend time manually reintroducing those useless changes, just so I can resume the "replay" which will remove them again.

From what I understand, in Darcs such changes would "cancel out" and not appear in the diff that we end up applying.

> Where is the exaggeration?

The idea that "uses a graph" implies "equally hard to use". The underlying datastructure != the abstraction; the semantics is much more important.

For example, the forward/back buttons of a browser can be implemented as a linked list; blockchains are also linked lists, but that doesn't mean that they're both the same abstraction, or that understanding each takes the same level of knowledge/experience/etc.

>The idea that "uses a graph" implies "equally hard to use".

What I'm getting at is that if you don't understand what the graph entails, and what you need to do to the graph, any system is going to be "hard to use." This idea that things should immediately make sense without understanding what you need to do or even what you're asking the system to do, is just silly.

I've never seen someone who understands git, darcs, mercurial, pijul, etc go "I totally understand how this data is being stored but it's just so hard to use!" I don't think that can be the case, because any of the graphs those applications choose to use have some shared cross section of operations:

* add

* remove

* merge

* reorder

* push

* pull

I see people confused about the above, because they don't understand what they're really asking the system to do. I don't think any abstraction is ever going to solve that.

Git does have a problem with its command line (or at least how inconsistent and ambiguous it can sometimes be), but you really should get past it after a week or two of using it. The rest is on you. If you know what you want/need to do, getting past the CLI isn't hard. People struggle with the former and so they think the latter is what's stopping them.

Could you please remove the thorniness and condescension from your posts? It breaks the guidelines and makes discussions worse.


Can you tell the other guy to not post false and disingenuous statements? Because I'm pretty sure that is what degrades discussions, not any tone I choose to exhibit. I highly encourage you to read the thread thoroughly. If I switched my position on git we wouldn't be having this discussion, as evidenced elsewhere in the thread where people are taking a notably blunter tone than I am just with the side with popular support on this forum.

I posted a bald statement. He replied directly with snide remarks and fallacies. Look at the timestamps and edits. I have every right to be annoyed and make it known that I am annoyed in my posts when the community refused to consistently adhere to guidelines.

Enforce guidelines that keep discussions rational, not because people don't want to be accosted in public for their misleading, emotionally bloated statements.

It doesn't matter what you're replying to. The guidelines always apply, so please follow them.

>The guidelines always apply

They are currently not being applied. Is it fair for me to point out how inconsistently the posts are being treated?


>Don't say things you wouldn't say face-to-face. Don't be snarky.

"Every day humans make me again realize that I love my dogs, and respect my dogs, more than humans. There are exceptions but they are few and far between." [2]

>Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize.

"Which sort of doesn't matter since everyone thinks GitHub is source management." [1]

>Please don't post shallow dismissals

"You all lost out on "the most sane and powerful" as a result." [1]

"Calling it a sane and powerful source control tool is just not supported by the facts, calling "the most ..." is laughable." [1]

"Calling Git sane just makes it clear that you haven't used a sane source management system." [1]

"Lots of people are too busy/whatever to know what they are missing, maybe that's you. It's not me" [3]

>When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

"Arguing with some random dude who thinks he knows more than me is not really fun." [3]

"Dude, troll much?" [4]

[1] https://news.ycombinator.com/item?id=16806588

[2] https://news.ycombinator.com/item?id=16807652

[3] https://news.ycombinator.com/item?id=16806877

[4] https://news.ycombinator.com/item?id=16807763

At least I had the decency to correct dishonest statements with vanilla citations in my posts, and they still got ignored.

Some of the things you quoted there are admittedly borderline, but you went much further across the nastiness line. Could you please just not do that? It isn't necessary, and it weakens whatever substantive points you have.

>borderline, but you went much further across the nastiness line.

I didn't insinuate that people are worth less than pets that I bought and own (who can't even choose who to be dependent on) because they don't agree with my perspectives over a piece of software. In what context would this be an acceptable statement to make face to face or in a public setting and you go "well, you know, it's kind of okay to say!"

I'm exceedingly interested in where I crossed that line in a considerable manner because that's one distant line to cross. Next time someone says something I perceive to be incorrect, or they get on my nerves for continually disagreeing with me, I'll be sure to tell them my dog is worth more than them since that's actively being allowed and has a precedent of moderator support.

And for the record, my tone is probably "abrasive" in this post because the above actions and outright blind eye towards outright lies and uncalled for statements is aggravating. I have a feeling you're not doing anything just because of who he is, and not because what he is saying is warranted or even accurate (it's definitely not, as I demonstrated across several different posts).

I've archived this thread so people are free to review my actions and moderator actions at a later date: https://web.archive.org/web/20180411062201/https://news.ycom...

I've said my piece.

Exactly, it was no longer mysterious for me after I had to prepare a written branching procedure for our team, starting from how to branch off, commit and rebase, through to doing resets and working with the reflog. While doing that I thoroughly read the official docs, examined lots of examples, and created a local repo with a couple of text files to test various commands. And then it became so clear and simple! Especially the reflog – so powerful!

So, my advice is to try to write some instructions for yourself for all the common cases you might run into during your work. It will not only help you realise what you actually need from git, but also will serve as a good cheat-sheet.

This looks like a good chair lift: http://gitless.com/

Pro Git's chapter on git internals does a good job of explaining some of the things going on under the hood.


I wrote a book that tries to :)


I start with simple examples and work up from there. It's based on training I've conducted at various companies, and avoids talk of Merkle trees or DAG.

Like `git help`? It has everything important grouped nicely and points you to even more subcommands.

I am not a git expert or anything, but I have helped resolve weird git issues for my teammates usually using a lot of Google and StackOverflow.

I just know 5 basic commands; pull, push, commit, branch, and merge. Never ran into any issues. People who run into issues are usually editing git log or doing something fancy with “advanced” commands. I have a feeling that these people get into trouble with git cause they issue commands without really knowing what those commands do or even what they want to achieve.

Start working in a repo with submodules and you suddenly have to understand a lot more and can get into trouble with no idea how you did it.

I use submodules every day, never had a problem with them. What do people complain about when it comes to them?

My mental model is basically that they're separate repos, and the main repo has a pointer to a commit in the submodule. Do your work that needs to be done for the submodule, push your changes, and then check out that new commit. Make a commit in the main repo to officially bump the submodule to that new commit. Done.

The annoying part is when you do a pull on the main repo, you have to remember to run git submodule update --recursive.
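For anyone who wants to see the "pointer to a commit" model rather than take it on faith, here is a minimal sketch. The repository names (`lib`, `app`) and the temp directory are made up for the demo, and the `protocol.file.allow` override is only there because newer git versions refuse to clone a submodule from a local path by default.

```shell
set -e
# Throwaway identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work="$(mktemp -d)"

# A small repository that will play the role of the submodule.
git init -q "$work/lib"
( cd "$work/lib" && echo lib > lib.txt && git add lib.txt && git commit -q -m 'lib v1' )

# The main repository.
git init -q "$work/app"
cd "$work/app"
echo app > app.txt
git add app.txt
git commit -q -m 'app v1'

# Add the submodule and record it in the main repo.
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
git commit -q -m 'add lib submodule'

lib_head="$(git -C lib rev-parse HEAD)"               # commit checked out in lib
mode="$(git ls-tree HEAD lib | awk '{print $1}')"      # entry mode in the main tree
pinned="$(git ls-tree HEAD lib | awk '{print $3}')"    # commit id the main repo records

echo "mode=$mode pinned=$pinned lib_head=$lib_head"
```

The main repo's tree holds a single entry for `lib` whose type is `commit`, and that id is the whole "pointer". Bumping the submodule means committing a new value for that one entry.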

Because you have the .gitmodules file, the .git/config file, the index, and .git/modules directory, each of which can get out of sync with the others.

If, for example, you add a submodule with the wrong url, then want to change the url, then you instinctively change .gitmodules. But that won't work, and it won't even nearly work.

If you add a submodule, then remove it, but not from all of those places, and try to add the submodule again (say, to a different path), then you also get weird errors.

If you add a submodule and want to move it to another directory then just no.

Oh and also one time a colleague ran into problems because he had added the repo to the index directly - with git add ..

Oh and let's talk about tracking submodule branches and how you can mess that up by entering the submodule directories and running commands...

Why do you want to bypass the tool at the first glance? Git submodule command has a way to update these urls...

Heh, good question.

But seriously, the fact that there is a .gitmodules file lulls you into a sense that that file is "the configuration file". If you don't know about these other files, then it's natural to edit .gitmodules. When you make errors, fixing them is pretty hard. There is no "git submodule remove x" or "git submodule set-url" or "git submodule mv".

For example, do you know, off the top of your head, how to get an existing submodule to track a branch?

How do you think someone who does not quite understand git would do it? Even with a pretty ok understanding of git internals, you can put yourself deep in the gutter. (Case in point: if you enter the submodule directory and push head to a new commit, you can just "git add submodule-directory" to point the submodule to the new commit. But if you were to change upstream url or branch or something else in the submodule, you're screwed. That's not intuitive by a long shot.)

Edit: git submodule sync is not enough by the way... You can fuck up your repo like crazy even if you sync the two configuration files.

Right, it’s not that hard, but there are some gotchas. The most common problem I see is the local submodule being out of sync with the remote superproject. Pushes across submodules are not atomic. Accidentally working from a detached head then trying to switch to a long out of date branch can be an issue, as can keeping multiple submodules synced to the head on the same branch. Recursive submodules are, as you mentioned, even more fun.

The same problem appears in any non monolithic project. In any SCM I know of.

Git subrepo or subtree are something of a solution, but neither is quite complete or easy to use.

In some other scms (P4 and SVN, partly hg) the answer is don't do that, which had a whole lot of its own problems.

Oh, so that's what you do!

Heh, I probably made it sound more complicated than it really is. Just think of it as a pointer that needs to be manually updated.

I'm comfortable with most advanced git stuff. I don't touch submodules.

> I don't touch submodules.

What's the alternative? Managing all dependencies by an external dependency manager does not exactly reduce complexity (if you're not within a closed ecosystem like Java + Maven that has a mature, de-facto standard dependency manager; npm might count, too).

It's absolutely not feasible for C++ projects; all projects that do this have horrible hacks upon hacks to fetch and mangle data and usually require gratuitous "make clean"s to untangle.

I use git sub-trees. Actually I love the thing. They give you a 'linear' history, and allow you to merge/pull/push into their original tree, keeping the history (if you require it).

Never heard of them (well, probably in passing); will look into it. Thanks!

Why isn't it feasible for C++ projects?

Oh you can fuck right off with submodules!

^^^ this comment is supposed to be humor, not douchebaggery, by the way. Easy on the downvotes.

I never had any problems in the past 6 years I've been using Git professionally. But then someone asked me what to do when Git prevents you from changing branches and, not knowing they had not staged their changes, I told them to stash or commit. They stashed and the changes were gone.

My point is, while your basic commands do the work, your habits and knowledge keep you from losing code like this without you knowing.

Why were the changes gone? Why couldn't they "git stash pop"?

Unstaged or untracked changes were gone. They couldn't get those back after pop. I can't remember which.

Untracked files are not stashed, that is true.

They're also not deleted by "git stash" though.
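This is easy to check in a scratch repo. A small sketch (file names are made up): a plain `git stash` leaves an untracked file sitting in the working tree, while `git stash -u` (short for `--include-untracked`) takes it along and restores it on pop.

```shell
set -e
# Throwaway identity so commits and stashes work without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$(mktemp -d)"
cd "$repo"
git init -q
echo one > tracked.txt
git add tracked.txt
git commit -q -m 'initial'

echo two >> tracked.txt       # change to a file git already tracks
echo new > untracked.txt      # file git has never been told about

git stash -q                  # plain stash: only the tracked change is saved
after_plain="$(cat untracked.txt)"         # untracked file is still here
tracked_after_plain="$(cat tracked.txt)"   # tracked file is back to the commit

git stash pop -q              # the tracked change comes back

git stash -q -u               # include untracked files this time
if [ -e untracked.txt ]; then gone=no; else gone=yes; fi
git stash pop -q              # both files are restored

echo "plain stash kept untracked file: $after_plain; stash -u removed it: $gone"
```

So the scenario where a stash "eats" untracked work does not seem to match what `git stash` does by default, though other steps in that session (a clean, a hard reset) could have.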

Is no one reading git's help pages before running a command the first time?

Not even once have I lost code I worked on with git. stash is a reliable companion across branches and large timespans.

I do like Git, most of the time, but really, not a single problem, in six years?

When using Git daily we never really did anything complicated, just a few feature branches per developer, commit, push, pull-request, merge. Basic stuff. We had Git crap out all the time. Never something that couldn't be fixed, but sometimes the fix was: copy your changes somewhere else, nuke your local repo, clone, copy changes in and then commit and continue as normal.

I’ve been using git since 2007 and never ever even wanted to try nuking a checkout and starting over to recover from anything, much less did so. (Did have a nameless terrible Java ide plug-in do it for me once.)

So you're not using checkout, reset, and diff?

Good point, forgot about checkout, diff, clone, blame, add, rm, rebase, init, and probably a few more.

Haven't used reset personally, though, except when trying to fix someone's repo.

Or fetch?

I think a lot of people ignore fetch and only ever pull.

I think it's one of the most important sources of my cognitive dissonance around git. It strengthens the illusion that a working directory is somehow related to a git store, which it really isn't.

You have a working directory/checkout - that can be: identical (apart from ignored files) to some version in git; or different.

If it's different, some or all changes can be marked for storing in the git repo - most commonly as a new commit.

It's a bit unfortunate that the repo typically is inside your work directory/checkout - under '.git' along with some files like hooks, that are not in the repo at all...

But you'd have to stash before pull. At least with my config where a pull will rebase automatically.

I use `git config pull.rebase true` too, but that doesn't mean you _have_ to stash first, just as rebase manually wouldn't - depends if there's a conflict.

Same is true of merge-based pull.

Except for saving some typing, is there any benefit to stash over local branches?

In other words, shouldn't git just fix ux for branches and rip out stash?

So "some typing" would be:

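    # git stash:
    prev_ref="$(git rev-parse --abbrev-ref HEAD)"  # remember the current branch
    git checkout -b wip-stash
    git add .
    git commit -m 'wip stuff'
    git checkout "$prev_ref"
    # git stash pop:
    git checkout wip-stash -- .   # copy the saved files back into the working tree
    git branch -D wip-stash       # delete the temporary branch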
It's quite a considerable saving. I suppose by "fix UX" you mean make it so the saving would be less anyway, but I think really they're just conceptually different:

    - branch: pointer to a line of history, i.e. a commit and inherently its ancestors
    - stash: a single commit-like dump of patches
If stashing disappeared from git tomorrow, I think I'd use orphan commits rather than branches to replace it.

pull == "just fuck my shit up"

EDIT: I guess I misread it. On reflection what I wrote really doesn't make sense so let me retract.

`cherry-pick` is just plucking a single commit and adding it to the commit history of the current branch, and `rebase` is what civilized people use when they don't want merge commits plaguing their entire code base.

merge is what civilized people who care about getting history and context in their repository use ;) ... I worked a lot in git using both rebase and merge workflows and I'll be darned if I understand the fear of the merge commit ... If work happened in parallel, which it often does, we have a way of capturing that so we can see things in a logical order ...

Polluting the master repo with a bunch of irrelevant commits isn't giving you context, it's giving you pollution. There's nothing to fear about merge commits. It's about wasting everyone's time by adding your 9 commits to fix a single bug to the history. I work on teams, and we care about tasks. The fact that your task took you 9 commits is irrelevant to me. What is relevant is the commit that shows you completed the task.

It's not really a fear of the merge commit. In a massively collaborative project, almost everything is happening in parallel, and most of that history is not important. The merge makes sense when there is an "official" branch in the project, with a separate effort spent on it. It's likely that people working on that branch rebase within the branch when collaborating, and then merge the branch as a whole when it is ready to join the mainstream.

Ah, you can learn the beauty of merge AND rebase at the same time then...

Here, to 'present' feature branches, we take a feature development branch with all the associated crud... Once it's ready to merge, the dev checks out a new 'please merge me' branch, resets (or rebase -i --autosquash) to the original head, and re-lays all the changes as a set of 'public' commits to the subsystems, with proper headings, documentation etc.

At the end, he has the exact same code as the dirty branch, but clean... So he merges --no-ff the dirty branch in (no conflicts, same code!) and then the maintainer can merge --no-ff that nice, clean branch in the trunk/master.

What it gives us is a real, true history of the development (the dirty branch is kept) -- and a nice clean set of commits that is easy to review/push (the clean branch).

Sometimes I want to take a subset of the commits out of a coworker's merge on staging to push to production, and then put all non-pushed commits on top of the production branch to form a new staging branch. I find having a linear history with no merges helpful for reasoning about conflict resolution during this process. What advantages do merged timelines give in this context?

What I like about merges is that they show you how the conflicts were resolved. You can see the two versions and the resolution, and you can validate it was resolved properly. With a rebase workflow you see the resolutions as if nothing else existed; you can't tell the difference between an intentional change and a bad resolution...

> merge is what civilized people who care about getting history and context in their repository use

> I'll be darned if I understand the fear of the merge commit

I apologize in advance for not adding much substance in this reply, but I agree too much to just upvote alone.

Just curious... are you working in a team using git workflow?

Yes, my direct team is small (4 devs), but the main repo we work on is used by 100+ devs. We use git workflow (new branch for each feature) for the main repo and github style workflow (clone and then submit PR) for some other repos.

The number 1 reason my team has not moved from Subversion to Git is we can't decide what branching model to use. Use flow, don't use flow, use this model, use that model, no, only a moron would use that model, use this one instead. Rebase, don't rebase, etc. No doubt people will say that it all depends on the project/team/environment/etc., but nobody ever says "If your project/team/environment/etc. look like this, then use this model." So we keep on using Subversion and figure that someday we will run across information that convinces us that it is the one true branching model.

I have another solution: just switch to mercurial. I switched some big projects to mercurial from svn many years ago. Migration was painless, tooling was similar but better, the interface is simpler than git, and haven't regretted it once.

This is the path I took for a few projects years ago when Google Code didn’t support git.

Switched to mercurial from svn and workflow was painless for the team. Interestingly, we slowly started adopting more distributed techniques like developer merges being common. With svn, I think I was the only one who could merge and it would be rare and added product risk.

Then after about a year of mercurial we switched to git and our brains had adapted. Our team was small, 5-10 people.

Somewhat relatedly, in 2002, I worked in a large team of 75 people or so with a large codebase of a few hundred thousand lines of active dev. It used Rational ClearCase and had “big merges” that happened once or twice a release with thousands of files requiring reconciliation. There was a team who did this so it was annoying to dev in, but largely I didn’t care.

Company went through layoffs and the team was down to one. He quit, the company couldn’t merge, so couldn’t release new software versions.

There was a big crisis so they went to the architects and pulled a few out of dev work. It turns out I was the one who could figure it out and dumb enough to admit it.

That sucked. It took a few weeks to sort out and modify our dev process to make merges easy and common. But it was not fun. Upside is we ended up not having any “non-programmer” op/configuration management people since the laid off/quit team were ClearCase users, who didn’t code.

Moral- don’t let people know you can do hard, mundane tasks.

I have converted all my mercurial repos to git and I have forgotten all mercurial now. It helps me feel less pain when I am forced to work in Git....

> but nobody ever says "If your project/team/environment/etc. look like this, then use this model."

Honestly, it's because a lot of it comes down to preference and what value you gain from using version control. It is very much like code style standards -- it doesn't matter what is in the standard so much as your teammates all using the same one.

If part of the blocker for your team is that no one is experienced enough with git to have a strong opinion, I'd be happy to brainstorm with you for an hour to learn about your current process and offer a tailored opinion.

Why not replicate whatever you are doing in Subversion in Git? You'll still be able to take advantage of the better merging algorithms, while maintaining whatever political momentum seems to be driving the team's decisions.

It really, really doesn't matter. That's one great thing about a distributed SCM.

We moved from SVN to Fossil and it has worked out great for us. The other option was Mercurial but it required Python.

If it is important to switch to Git, I suggest a technical leader, imbued with authority from management, make those decisions and just do it. However, I don't necessarily think a team should switch away from Subversion if it's working for them.

> everyone understands a different part of git and has slightly different ideas of how things should be done

This was a big problem that bugged me too, so for every team I've worked with I've created a few scripts for the team's most common version control operations.

Most devs, including me, are pretty lazy so they'd all rather run this script than go to Stack Overflow to figure out git arcana.

This helps standardize conventions too: Feature branches/linear DAGs/topic branches/dev branches/prod branches/whatever weird thing a team does they all just do that using the script so it's standardized.

“rebase” is just “pull before push”, right?

While I have no opinion on git, I can’t abide by all the precious chaotic mutant misuse, like git-flow.

I’d happily accept a subset of primitives, if only to disallow bad ideas. Kinda like Git vs SVN, C/C++ vs Java, flamethrower vs peanut butter.

Rebase is "rewind local changes" "pull" "replay local changes"

Basically it makes it so that all of the local-only commits are sequenced after any remote changes that you have not seen yet.


YZF is correct. In the context of pulling (i.e. "git pull --rebase") my description is correct. However in general rebasing branch X to Y that diverge from commit C is:

rewind branch Y to commit C; call the old tip of Y Y'

play all commits from C -> X on Y

play all commits from C -> Y' to branch Y.

You can rebase between two local branches. The rebase operation has nothing to do with pull or remote vs. local.
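A minimal sketch of that, with no remote anywhere in sight. The branch names are made up: `topic` plays the role of Y in the description above, `main` plays X, and the first commit is C.

```shell
set -e
# Throwaway identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch 'main' regardless of local defaults

echo base > base.txt
git add base.txt
git commit -q -m 'C: common base'

git checkout -q -b topic                # branch Y starts at C
echo y > y.txt
git add y.txt
git commit -q -m 'Y1: topic work'

git checkout -q main                    # branch X moves on independently
echo x > x.txt
git add x.txt
git commit -q -m 'X1: main work'

git checkout -q topic
old_tip="$(git rev-parse HEAD)"         # Y' (the tip before rebasing)
git rebase -q main                      # replay Y1 on top of X1

new_tip="$(git rev-parse HEAD)"
new_parent="$(git rev-parse HEAD~1)"
main_tip="$(git rev-parse main)"
subjects="$(git log --format=%s)"       # newest first
echo "$subjects"
```

After the rebase, `topic` sits directly on top of `main`, and the replayed commit has a new id because its parent changed. The old tip is still reachable through the reflog for a while.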

Yes. I thought we were in the context of git pull --rebase...

"pull" might be the first thing I'd throw out, if I thought there was any hope of fixing git ux. Then add a working merge --dry-run #do i have conflicts?.

I think a default of --ff-only would be fine for pull. This is great for when I'm merely a consumer of a project, and will never silently perform a merge or rebase.

Thanks (all) for the clarifications.

When explaining to others, I should probably say 'pull, reapply, then push'.

Perhaps 'rebranch' is a better word choice than 'rebase', to conceptually more closely match what's actually happening under the hood.

"rebase" is not just "pull before push", though.

It's pull then rewrite all your personal commits to be based on the latest tip from that pull.

rebase is simply(tm) replaying a sequence of commits (or diffs or patches for that matter) over some arbitrary base, hence re-base ...

rebase can do a lot more. Try `git rebase -i` to squash smaller commits, edit the commit msg, or even drop a commit before you push it to your colleagues.

Last time our devop did 20 commits to get something on elasticbeanstalk right, I squashed it all into just one clean commit that got merged into master branch.

It will help you to commit more often without worry until the moment you have to hand in your work.
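`git rebase -i` opens an editor, so it doesn't script well. For the simple case of folding the last N unpublished commits into one, a different command gets to the same end state: `git reset --soft` followed by a fresh commit. A rough sketch, with made-up commit messages:

```shell
set -e
# Throwaway identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$(mktemp -d)"
cd "$repo"
git init -q
echo 0 > app.txt
git add app.txt
git commit -q -m 'initial'

# Three small work-in-progress commits, the kind you make while experimenting.
for i in 1 2 3; do
    echo "$i" >> app.txt
    git commit -q -a -m "wip $i"
done
before="$(git rev-list --count HEAD)"

# Move the branch back three commits but keep all their changes staged...
git reset -q --soft HEAD~3
# ...and record them again as a single commit.
git commit -q -m 'deploy: configure environment'

after="$(git rev-list --count HEAD)"
content="$(cat app.txt)"
echo "commits before: $before, after: $after"
```

The file content is untouched; only the history is shorter. Like any history rewrite, this is only safe on commits nobody else has pulled yet.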

Rebase is a controversial history altering operation and makes it easy to paint yourself into a corner and get weird error messages or wrong results. It's very different from pull/merge.

History altering is only controversial on things that are published. There is nothing wrong with reordering, combining or splitting your local commits to give more clarity to what you are doing. Keeping this in mind will give you the freedom to commit frequently.

This confusion happens because many popular SCMs historically have the "commit" and "push" operation in a single step. Git keeps them separate.

There is no tracking by git on what is published, so it's easy to make the mistake of rebasing things that are published and shared by others. Then you will have a bad time later when you try to sync with others, possibly days later.

Um... git kind of does with remote tracking branches. You can also make it very obvious by your workflow? If you use local feature branches (which you should for juggling between development tasks, etc.), what you are working on vs what's upstreamed should be pretty clear. Sounds like you are not using local branches.

Not using local branches is another confusion caused by the perspective of historical/traditional SCMs (people thinking branches are the domain of a centralized server and are outside of their control.)

Often you want to push changes to a remote, but not yet merge or PR them to upstream.

Keeping "local feature branches" just on your dev machine is bad for many many reasons:

- you want to encourage low barrier cooperation in your team -> sharing changes

- you want changes to the CI pipeline early so the potentially slow testing machinery works in parallel with the developer

- you want to keep the team up to date on what changes you make

- you don't want to lose work if the machine/OS dies, or the developer leaves/becomes sick/goes on a 4 week vacation during which they forget their disk crypto password

So, in practice you can try to use rebase opportunistically, when by chance your WIP work is still unpushed because the change was only made very recently. This is error prone. Or you can rebase published branches explicitly, by destroying the original branches in the PR merge phase. But all this is a big bother if the purpose is just to beautify history and at the same time hide the real trial and error that went into making the changes.

Did you notice that y2kenny was talking about how, if you use local feature branches, then the remote tracking branches make it really clear what's been published vs not? The implicit meaning is that we should use local feature branches but also publish them to the repo while we're working on them.

But maybe to you, 'publish' means 'publish to master'? In that case I can assure you, they are not necessarily the same thing. I regularly work on a local feature branch, publish that branch to the shared repo, rebase it on top of master, then force-push to the shared tracking branch. When I'm done I merge it into master and don't rebase master on top of anything.

who said anything about not publishing?

I'm not sure if you are being serious? The answer is that published advice on rebase overwhelmingly warns against rebasing published code, and for good reason.

Who said anything about rebasing published code?

I LOVE rebase but when I run into merge conflicts I'd rather `rebase --abort` and leave that merge commit as it is. But those instances are rare and having a merged branch's commits nice and compact in the log makes me happy every time.

Nobody understands SVN or CVS either.

I discovered this supporting SVN servers for a whole bunch of developers.

I always found the mercurial ui super easy.

The error messages are clearer, it is multiplatform, all the advanced functionalities are there, a nice graphic interface exists.

I really do not understand why git won, apart from github.

What I find ironic is that github is massively popular as a central way to use a distributed version control system. The distributed nature only adds to the complexity and I am sure it is only used by a fraction of git users.

Yes...? What's surprising about using a central repo to collaborate? There needs to be a single source of truth for a coherent project, otherwise you're just going to have chaos.

The distributed nature of git led to the simple and secure contribution model of everyone working on their own repos and not needing to give write access to anyone else. This pretty directly led to an explosion of open source software.

Is there any really good tutorial on git that teaches the internal model? Ideally, it would illustrate each command and show the before and after of the internal objects.

https://learngitbranching.js.org/ is the best guide I've seen. It shows you the complete commit graph and all refs on that graph, and updates the graph when you type in commands. It covers and displays workflows involving remotes as well.

If you don't want the tutorial, you can go straight to the sandbox here: https://learngitbranching.js.org/?NODEMO

Indeed. When the article said "younger developers only know git" I immediately thought, no, they don't know anything. These people don't even know what a DAG is. Git was made for people who know these concepts. I've tried explaining git to people and they just don't understand. They just don't.

What's annoying is that git is just expected knowledge these days and having a github account is enough to claim it. There's not a good way to sell the fact that you're a bit more into it than that.

I've even said to git "experts" that branches should really be called refs and their eyes glaze over. It's difficult for me to understand what git is in their heads.

Why would you call branches refs? They don't point to specific files or commits.

I know you can target commits through them - which utilizes the ref syntax... But they're still not really referencing anything directly.

They're completely arbitrary and are just a feature to improve git's workflow.

I started naming branches 'post-its', as to me that's what they are, labels you place on the real 'branches' (the commit tree). You can take them off easily, move them, discard them, whatever you want. They are just volatile.

I should have said pointers. I didn't mean to overload existing git terminology. My point was just that they are pointers/references to some commit.

> They don't point to specific files or commits.

A branch points to the tip (last commit) of a particular timeline.

But they are also called symbolic refs in git terminology...

A symbolic ref is a ref that points to another ref instead of a ref that points to a commit. `HEAD` is a symbolic ref. (It should be your only symbolic ref.)

Unless it is detached. :)

That term makes sense.

But just as you wouldn't call a symlink to a zip archive a zip file itself, you also shouldn't call a branch a ref.

Hrm, but a ref is a file containing a hash, right? So if the hash is equivalent to the file, then surely a ref is equivalent to a symlink? A symbolic ref, in turn, should be a symlink to a symlink... Or something like that...

A ref points to an object. That object doesn't change unless the hashing algo was tricked.

A branch points to anything you want it to point to. It can be any ref you want and can be changed at will.

sha1 - object (e.g. 5a480efb...)

file with sha1 - ref (e.g. master)

file with ref - symbolic ref (e.g. HEAD)

right? Seeing as you can git update-ref branches, but you need to git symbolic-ref HEAD.
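A small sketch of that layering using plumbing commands, so it doesn't depend on whether refs happen to be stored as loose files or packed. The branch name `postit` is made up for the demo.

```shell
set -e
# Throwaway identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # HEAD names another ref, not a commit

echo a > a.txt
git add a.txt
git commit -q -m 'first'
first="$(git rev-parse HEAD)"

echo b >> a.txt
git commit -q -a -m 'second'
second="$(git rev-parse HEAD)"

head_target="$(git symbolic-ref HEAD)"           # the ref HEAD points at
branch_value="$(git rev-parse refs/heads/main)"  # the commit id that ref holds

# A branch is a movable label: create one on the first commit...
git update-ref refs/heads/postit "$first"
# ...then move it to the second. No commit is created or changed.
git update-ref refs/heads/postit "$second"
postit_value="$(git rev-parse refs/heads/postit)"

# Detaching HEAD makes it hold a commit id directly, so it stops being symbolic.
git checkout -q --detach
if git symbolic-ref -q HEAD >/dev/null; then detached=no; else detached=yes; fi

echo "HEAD -> $head_target -> $branch_value (detached afterwards: $detached)"
```

So both views in this subthread hold at once: a branch is a ref, and it is also a freely movable pointer. Only the commit object it points at is immutable.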

But it is a ref. It's an alias for the last commit of a particular timeline, as I said above.

So would you rather say a branch is a commit?

A branch is a pointer or symlink if you will.

> It's difficult for me to understand what git is in their heads.

In that case, they were thinking the git was you.

Git is the solution to the problem of doing distributed development on the Linux kernel. People who aren’t doing that, I wonder if they’re entirely clear in their own minds why they use it. I’m certainly not... other than that it’s just the default choice these days, the path of least resistance...

I'm a big fan of Fossil myself. But the SQLite people have something that I don't really have within the teams I operate in: the authority to dare to speak out against Git and not be laughed away like a hipster who is just trying to be different.

Hipster source code management? See this Rust project:



A bit like DARCS (also very hipster, in Haskell and has some math behind it), but then fast.



Oh and it uses a cool hi-perf storage lib (also in Rust, by the same devs):


    Pijul lets you describe your edits after you’ve made them, instead of beforehand.
Pardon my French, but about fuckin time.

On a big product, forensics matter. Not day to day, but often enough and if your metadata is rotten then you’re left with the oral history of the project as your only guide. And even that may not exist, depending on project structure.

Git has something similar called git-notes, but at the time I tried using it, it was really early days. No idea how well it is supported now. You could also make an annotated tag, which has its own "commit message", but it will show up with all other tags.

[1] https://git-scm.com/docs/git-notes
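For reference, a minimal sketch of what attaching a note looks like. The note text and the ticket id in it are invented for the demo. The thing to notice is that the commit id does not change when a note is added.

```shell
set -e
# Throwaway identity so commits and notes work without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo="$(mktemp -d)"
cd "$repo"
git init -q
echo v1 > f.txt
git add f.txt
git commit -q -m 'add f'

before="$(git rev-parse HEAD)"

# Attach a note to the existing commit (defaults to HEAD).
git notes add -m 'Reviewed after the fact: relates to hypothetical ticket DEMO-1'

after="$(git rev-parse HEAD)"      # same id: the commit itself is untouched
note="$(git notes show HEAD)"
echo "$note"
```

Notes live in their own ref (`refs/notes/commits` by default) and are not sent by a plain `git push` unless you push that ref too, which is part of why they stay an opt-in, manual process.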

Git notes is interesting but it’s a manual process.

When selecting technology I look for “a rising tide lifts all boats” situations and opt-in tools have limitations in that regard.

There’s a big gap between ‘can do’ and ‘will do’ and I feel like we downplay that frequently in our industry, and to our own peril.

Standalone, that sounds like a commit message - which you make after editing the code anyway. (And possibly tweak/update with git rebase before pushing)

In that section's context, it sounds like naming a branch after having already started on it. In which case, that seems to me the tiniest bit less useful than git's ability to rename branches (git branch -m oldname newname).

What am I missing?

Darcs was also prior art (certainly the first DVCS I ever encountered), which makes me more inclined to call them innovative than hipster :-)

That would make Haskell innovative instead of hipster as well.. :)

Woah woah woah, let’s not get ahead of ourselves! :-)

I haven't used Pijul, but I did use Darcs for several large production projects back when it was still a thing.

Darcs was magical -- in both senses of the word. It was incredible to see it figure out which patches depended on which, allowing a fluid exchange of changes between branches in a way that quickly becomes a nightmare in git. But it was also magical in that nobody really understood the internals. Not in the sense of git where the underlying data model is pretty simple, and the "version control" aspect is a (thin!) UX veneer on top, but in the sense that it was like quantum physics. When something went wrong, it was almost always impossible to fix. And with Darcs, things did go wrong, because it had bugs, specifically a certain dreaded "exponential conflict" edge case where, if it encountered an identical line change in two patches from different branches (or something like that, it's been more than 10 years), computation time went through the roof and the merge command almost never finished. At several points we had to start history from scratch to avoid spending an entire day fighting the conflict problem. Another thing with Darcs (and presumably Pijul) was that since it tracks patch inter-dependencies, you can rarely cherry-pick individual patches -- pulling out one patch tends to pull with it a whole string of related patches, all connected. Which is often what you want (git just fails horribly in such cases), but sometimes you do want to "forcibly cherry-pick" and manually fix, change identity be damned. I don't know if Pijul supports this.

It looks like Pijul fixes the conflict problem, but it still seems to keep the "quantum theory of patches" that requires an above-average developer to understand. If it has no bugs, then maybe the problem is moot, but in our industry, transparent, "self-repairable" tech seems to win in the long run over the esoteric, opaque and magical.

That said, it's clear that Darcs/Pijul has a vastly better UX, which I'm all for. Git's data model works remarkably well for what it does, but it's always been obvious to me that its "record snapshots and try to make sense of them after the fact" philosophy is a bit flawed. The article mentions branch history. And rename detection doesn't work well with how most people work, for example; it's a clever kind of lazy evaluation, but probably designed for Linux kernel devs, so not clever enough. Darcs had a patch type specifically for renames, and it worked very well.

Another thing I wish version control systems had was what you might call a high-level changelog. It would let you group and annotate commits after the fact, but without changing them. For example, you might want to group a bunch of patches as a single "feature" commit. Then you could make a "release" group that groups a bunch of feature commits. In other words, several levels of nesting, with each commit containing child commits and so on. Viewing the log should show only the highest-level groups, with the option to expand them visually so you can see what they contain. You should be able to group things like this after the fact without changing commit order, and you should be able to annotate the log (e.g. add more information to a commit message) without mutating the underlying patches. Git was on the verge of venturing into this territory with its (now discouraged) "merge commits" -- a high-level commit that represents a single logical merge but encapsulates multiple physical patches -- but that didn't go anywhere. The nice thing about a high-level history like this is that you could use it to drive release notes and change logs, and it would greatly aid in project management and issue tracking, because you could manage entire sets of commits by what issues or pull requests or milestones or whatever they relate to.
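To make the idea concrete, here is a rough sketch of what such a grouping layer might look like as a data structure. Everything in it (the `Group` class, the `render` function, the commit ids) is made up for illustration and is not a feature of any existing VCS; the point is only that groups refer to commits by id and never modify them.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical annotation layer: a titled group of commits or sub-groups."""
    title: str
    children: list = field(default_factory=list)  # Group objects or commit-id strings


def render(node, depth=0, max_depth=0):
    """Return log lines, expanding nested groups only down to max_depth."""
    if isinstance(node, str):
        # A plain commit id: the underlying commit is referenced, never changed.
        return ["  " * depth + node]
    lines = ["  " * depth + node.title]
    if depth < max_depth:
        for child in node.children:
            lines.extend(render(child, depth + 1, max_depth))
    return lines


release = Group("release 1.2", [
    Group("feature: search", ["a1b2", "c3d4"]),
    Group("feature: export", ["e5f6"]),
])

collapsed = render(release)               # only the top-level group
one_level = render(release, max_depth=1)  # release plus its features
expanded = render(release, max_depth=2)   # down to individual commits
```

The collapsed view is what you would feed into release notes; the expanded one is what you would want when debugging.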

> It looks like Pijul fixes the conflict problem, but it still seems to keep the "quantum theory of patches" that requires an above-average developer to understand. If it has no bugs, then maybe the problem is moot, but in our industry, transparent, "self-repairable" tech seems to win in the long run over the esoteric, opaque and magical.

The patch theory is complex, but it isn't that complex. Especially since there are plenty of alternate implementations out there of Operational Transforms (OTs) and Conflict Free Replicated Data Types (CRDTs), its relatives/cousins/descendants. In theory, any developer that can grok a blockchain or a Redis cache should be able to grok the patch theory.

Darcs suffered much more from being written in Haskell, I think, than from the actual complexity of its patch theory.

Pijul being written primarily in Rust maybe has a chance of also getting over that hump a bit easier than Darcs had. Though now it also has the uphill climb of competing against git's inertia.

> Git was on the verge of venturing into this territory with its (now discouraged) "merge commits"

Discouraged only by people that don't know `--first-parent` exists as a useful `git log` and other command arguments. The useful thing about a DAG is you can very easily slice it to create arbitrary "straight line" views. You don't have to constantly smash and squash history to artificially force your DAG into a straight line.
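For anyone who hasn't seen what a first-parent view does, here is a minimal sketch on a toy graph. The commit names and the `first_parent_history` helper are invented for the example; it only models the idea of following the first parent of each commit and is not how git implements it.

```python
# Toy commit graph: each commit maps to its ordered list of parents.
# By convention the first parent is the branch the merge was made on.
parents = {
    "A": [],
    "B": ["A"],
    "F1": ["A"],        # feature branch starts from A
    "F2": ["F1"],
    "M": ["B", "F2"],   # merge of the feature branch into the mainline
    "C": ["M"],
}


def first_parent_history(graph, tip):
    """Walk back from tip, following only the first parent of each commit."""
    history = []
    current = tip
    while current is not None:
        history.append(current)
        commit_parents = graph[current]
        current = commit_parents[0] if commit_parents else None
    return history


mainline = first_parent_history(parents, "C")
print(mainline)
```

The feature commits F1 and F2 are still in the graph; they just don't show up in this particular slice, so the merge commit M stands in for the whole feature.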

"....in our industry, transparent, "self-repairable" tech seems to win in the long run over the esoteric, opaque and magical."

Quoted for truth.

This is super, super cool - thanks for sharing!

Most welcome. It's one of those projects that I keep half an eye on because it is just waaay too fantastic while not being unfeasible.

Git is a use-case that is excellent for 90% of development. Sqlite is just an example where the use-case isn't necessarily ideal, not an indicator that it's "better" than git.

I’m a fossil fan.

I’d say that git is fine for 90% of development (or some arbitrarily large number), but so is fossil. I don’t even think that SQLite-in-git would necessarily be a deal-breaker that couldn’t be worked around (drh ‘sqlite can chime in here). The whole space (from personal projects to global collaboration) is diverse enough that there’s no talking about “better” without qualifying the situation, either.

Fossil is good for a large subset of work that can benefit from source control management, regardless of git.

What git definitely has is

1) scalability, which is probably of no consequence for 99% of the cases it is employed

2) network effect, for better AND worse

> drh ‘sqlite can chime in here

He already has

> With Git, it is very difficult to find the successors (descendants) of a check-in ... This is a deal-breaker, a show-stopper.

Someone still thinks in the single main branch mode. It is sometimes the main case but definitely not in git world.

This operation is not easy in any DAG. It involves:

- find all or desired branch tips

- walk backwards until hitting the desired checkin

- memoize already seen parents to not walk them multiple times
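A rough sketch of those steps, assuming the history is available as a plain mapping from each commit to its parents (the `find_descendants` name and the toy graph are made up for the example, not any VCS's API):

```python
from collections import defaultdict, deque


def find_descendants(parents, tips, target):
    """Return all commits reachable from `tips` that descend from `target`."""
    children = defaultdict(set)  # reverse index built during the backward walk
    seen = set()                 # memo so shared ancestors are walked only once
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit == target:
            continue             # no need to walk past the check-in we want
        for parent in parents[commit]:
            children[parent].add(commit)
            stack.append(parent)

    # Walk forward from the target through the reverse index.
    found = set()
    queue = deque([target])
    while queue:
        commit = queue.popleft()
        for child in children[commit]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


graph = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["C", "D"],  # merge
    "X": ["A"],       # unrelated branch
}
after_b = find_descendants(graph, ["E", "X"], "B")
after_x = find_descendants(graph, ["E", "X"], "X")
```

Note that branches which don't contain the target (like X above) still get walked all the way to the root, which is where the cost comes from on large repositories.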

Amusingly enough, git's scalability is also not that great (e.g. worse than mercurial last I checked).

The network effects are there, though.

Please provide the source about scalability. Also what kind of scalability?

Scalability to large numbers of files and large numbers of changesets.

For the large number of files, https://code.facebook.com/posts/218678814984400/scaling-merc... is one source. There's some earlier discussion of the same issue at https://news.ycombinator.com/item?id=3549679 that goes into some technical details.

There has been some work on git since then to address some of those issues (e.g. see https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g... ) but it's not clear to me that it helped enough to catch up to where Mercurial is for large repos.

For large numbers of changesets, just try running "log" or "annotate" on any file with a long history in git. I just did this simple experiment:

1) hg clone https://hg.mozilla.org/mozilla-central/

2) git clone https://github.com/mozilla/gecko-dev.git

3) (cd mozilla-central && time hg log dom/base/nsDocument.cpp)

4) (cd gecko-dev && time git log dom/base/nsDocument.cpp)

It's not quite apples to apples because the git repo there has some pre-mercurial CVS history in it. But note that I'm not even using --follow for git and the file _has_ been renamed after the mercurial repo starts, so git is actually finding fewer commits than mercurial is here.

Anyway, if I do the above log calls a few times to make sure the caches are warm, I end up seeing times in the 8s range for git and the 0.8s range (yes, 10x faster) for mercurial.

That all said, most repos do not have millions (or even hundreds of thousands) of changesets or files. So the scalability problems are not problems for most users of either VCS.

I have never once had to use the missing feature that was a dealbreaker for the SQLite guys (find the descendants of an arbitrary commit). I have no idea what they're doing if they super depend on something like that.

What the hell, speaking against git makes you a hipster now? I was called a hipster for proposing git over CSV and SVC.

If you were promoting git back in the days of SVN and are now moving on to something else because git is too popular, just cop to it, you're a hipster. :-P

As I'm not familiar with those, I have to ask: did you mean CVS and SVN?

Ack. I'm damn sorry, I just had a fully botched prod-update from hell and I'm on tilt.

Yes. CVS and SVN. The old and ugly ones.

You want old, ugly, slow, pain-in-the-ass and proprietary? Try ClearCase. Expensive. Drain on productivity (all commits were individual file based, meaning it was extraordinarily easy to miss a change; also had no concept of an ignore, so really easy to miss adding a new file). Also, very. fucking. slow. A moderately sized project (or VOB in ClearCase terminology) could take an hour to update. I've probably lost a year of my life waiting for ClearCase to complete. Also, it had a habit of just royally fucking up even trivial merges (dropping braces in C++ code, for example, or ignoring whitespace changes in python code for another).

I've never used Clearcase, but I worked for Sun around 2000 and one of the things I did was analysing kernel dumps of Solaris.

If we saw the Clearcase kernel module you can be sure that that was going to be the root cause of the crash. That thing seemed to be really terrible, and it wouldn't surprise me if the rest of the product was as bad.

I don't know if things have changed now, but to use Clearcase you needed a kernel module that provided a special filesystem that you did your work against.

> to use Clearcase you needed a kernel module that provided a special filesystem that you did your work against.

That... Augh, that's actually kind of a good idea, even if it was before its time. But FFS...

The idea is not terrible, but when the kernel module is so unstable it brings down the build machines with a kernel panic, I'd say the execution was somewhat lacking.

> CVS and SVN. The old and ugly ones.

In their respective times they were a big improvement. I believe CVS was the first client/server revision control system (why that feature was added was a horror story)

Never had a problem with SVN personally either. OpenBSD still uses CVS, for an entire operating system.

Which version control you use matters a lot less than having a sane development process.

Very much so. I remember jumping from RCS to CVS and it was a very positive change.

(RCS only worked on single files, moving to CVS allowed you to maintain a tree. Before RCS I was using SCCS!)

Hahah. Nothing to be sorry for, I was just curious. There are enough source systems out there, I wouldn't be surprised if there were a few I hadn't heard of.

See also RCS - the revision control system you discover when you mistype vi. (This frequently happened to people at school back in the SunOS/Solaris days.)

10 years ago?

I worked at a pretty well regarded and reputable organisation that still used SVN. (Pretty sure they still do).

If not for my yammering there wouldn't be any git in use by any team there. (Only a year ago)

All the code progressed fine. git would still have improved it but nobody had bothered to switch yet.

We still use SVN (but tfs git for greenfield projects). We are a small company and only 2 or 3 devs work on the same project at a time. Git itself doesn't solve any problems we have. We still deal with merges when someone goes nuts refactoring.

Fossil is a great replacement for that kind of use case. You can fully convert a git or a svn repository with fossil import, so migration is painless. It's also painless for users since fossil is very easy to pick up compared to git and the basic commands are quite similar to svn. Branching and merging is a breeze compared to svn, quite like git but easier. It's also nice for companies that want everything in one place, since all developer repos sync with a central repo by default. It's also distributed in the sense that everything gets replicated: every repo that's in sync with the central repo (it does this automatically at fossil update or fossil sync) contains the full project history.

As a mostly solo developer I haven't found a need to switch away from SVN to git or anything else; doing so would involve work I can't bill for and for no benefit to myself or existing clients. Some things in software and life in general are Fine Just As They Are.

I also did this until the day came when I wanted to branch and someone showed me how easy it was in git.

I still use SVN from time to time with my old repos, but if given an opening, I migrate it to git without hesitation.

"I also did this until the day came when I wanted to branch and someone showed me how easy it was in git."

Git is an incredibly powerful tool for managing a set of files over time, but if you just use the handful of basic commands, then I agree that the immediate big win is branching.

Personally, I found that moving from Subversion to Git fundamentally changed my work habits (for the better). I was a lone developer at the time, so the collaboration aspect wasn't really important.

I noticed that Git made it so easy to create a repository that I put everything into version control: not just application code, but random scripts and notes.

The other gain was that I learned to work in small, focused commits, because Git is so fast that committing often is not a burden. Once I made that change, the commit history became meaningful and useful in a way that Subversion never was: I could quickly revert code, and look back at individual commits for information.

"The other gain was that I learned to work in small, focused commits"


This is surprising?

My company uses SVN. SVN works fine so there isn't really a reason to spend the man hours migrating a crap ton of projects to Git. Before SVN existed we used CVS and we migrated to SVN from CVS about, idk, 15-ish years ago?

Yes, there are people who are surprised that many have the opinion that SVN is fine for a lot of projects and you don't need to migrate for the sake of it.

That's more or less what I was getting at.

More like 5 - 6 years ago.

So, not recent. But version control is one of the few infrastructural components that's allowed to have decades of churn, in my book. Software lives that long.

We use zip files distributed and synced by blockchain across BTC and Ether

I know this is a joke, but have you checked out mango? https://github.com/axic/mango

Lol, i mean, it's within the realm of plausibility. My secret sauce is the zipping. The SHA integrity and the zippy bits are decentralized separately and n-matrix distributed across multiple cryptos using a randomized forward time arrow only strategy. so, pretty psicc.

You should probably form a corp.

I recently tried to speak out against GitFlow...that was a mistake on my team.

See for example: http://endoflineblog.com/gitflow-considered-harmful "I remember reading the original GitFlow article back when it first came out. I was deeply unimpressed - I thought it was a weird, over-engineered solution to a non-existent problem. I couldn't see a single benefit of using such a heavy approach. I quickly dismissed the article and continued to use Git the way I always did (I'll describe that way later in the article). Now, after having some hands-on experience with GitFlow, and based on my observations of others using (or, should I say more precisely, trying to use) it, that initial, intuitive dislike has grown into a well-founded, experienced distaste. In this article I want to explain precisely the reasons for that distaste, and present an alternative way of branching which is superior, at least in my opinion, to GitFlow in every way."

And a summary of an alternative proposed there: http://endoflineblog.com/oneflow-a-git-branching-model-and-w... "As the name suggests, OneFlow's basic premise is to have one eternal branch in your repository. This brings a number of advantages (see below) without losing any expressivity of the branching model - the more advanced use cases are made possible through the usage of Git tags. While the workflow advocates having one long-lived branch, that doesn't mean there aren't other branches involved when using it. On the contrary, the branching model encourages using a variety of support branches (see below for the details). What is important, though, is that they are meant to be short-lived, and their main purpose is to facilitate code sharing and act as a backup. The history is always based on the one infinite lifetime branch."

I would have backed you up on that.


At work, I use fossil to manage my repository of scripts that I edit and use on different machines. (Strictly speaking, that could be done without an SCM, but I find it more convenient this way.)

On Windows, the fact that fossil comes as a single statically-linked executable that works without any special installation procedure[0] is really nice.

Also, I have to appreciate the builtin Wiki. I use it to keep a kind of diary of what I did and why, as well as gather helpful links I have come across over time.

[0] Other than putting the executable somewhere on your %PATH%, of course.

So, being a fan of Fossil would make you a Hipster then, no?

O_o I'll find my way out.

So is it a lack of humor or just missing the joke that's getting the negative score?

I haven't used Fossil, but just a comment on some of that page, in the order they're presented:

1. It's unclear to me what he means. Yes git doesn't store anything like a doubly linked list of commits, and thus finding the "next" commit is more expensive, but you can do this with 'git log --reverse <commit>..', and it's really snappy on sqlite.git.

It's much slower on larger repositories, but git could relatively easily grow the ability to maintain such a reverse index on the side to speed this up.

2. Yeah a lot of the index etc. is complex, but I wonder how something like "git add -p" works in Fossil. I assume not at all. Is there a way to incrementally "stage" merge conflicts? Much of that complexity comes with significant advantages.

3. This is complaining about two unrelated things. One is that GitHub by default isn't showing something like 'git log --graph' output, the other is that he's assuming that git treats the "master" branch magically.

Yeah GitHub and other viewers could grow some ability to special-case the main branch and say "..and this was merged into 'master'" at the top of that page, but in any case all the same info exists in git as well, so it's just a complaint about a specific web UI.

The "index" is a silly dongle in Git. One way to get rid of it would simply be to make it visible as a special "work in progress" top commit, visible in the history as a commit. "git add -p" would just hoard changes directly into the WIP commit, if it exists, otherwise create it first. Some sort of publish command would flip the WIP commit to retained status; then a "git add -p" would start a new WIP commit. "git push" would have some safeguard against pushing out a WIP commit.

The "--cached" option would go away; if you have a WIP commit on top, then "git diff" does "git diff --cached", and if you want the diff to the previous non-WIP commit, you just say so: "git diff HEAD^".

stashing wouldn't have the duality of saving the index and tree. It would just save the changes in the tree. Anything added is in the WIP commit; if you want to stash that, you just "git reset HEAD^" past it, and later look for it in your reflog, or via a temporary tag.

> The "index" is a silly dongle in Git.

Everyone thinks that until they need to use it for something. If all you do is a bunch of linear small changes with obvious implications and two-line commit messages, then the index is nothing but an extra step.

But at some point you're going to want to drop a thousand-line change (from some crazy source like a contractor or whatnot) on top of a giant source tree and split it up into cleanly separable and bisectable patches that your own team can live with. And then you'll realize what the index is for.

What I described supports staging small changes and turning them into individual commits. Just the staging area is a commit object rather than a gratuitously different non-commit object with different commands and command options to deal with it.

You'd still need separate commands, though. Commit vs. commit --amend. Add vs. add -p. Diff-against-grandparent vs. diff --cached. You just want different separate commands to achieve your goal, which is isomorphic to the index.

So sure: if you want a not-index which works like the index and has a bunch of 1:1 operations that map to the existing ones, then... great. I still don't see how that's much of an argument for getting rid of the index.

Well; yes; the tool can't read your mind whether you'd like to batch a new change with an existing one, or make a separate commit.

Remember that git, initially, didn't hide the index in the add + commit workflow! You had to "git add" and then "git commit". So the fact there is only "git commit" to do everything is because they realized that the index visibility is an anti-pattern and suddenly wanted to hide it from view.

Since the index is already hidden from view (and largely from the user's model of the system) in the add + commit workflow, we are not going to optimize the command set by turning the index into some other representation. That's not what this is about.

The aim is consistency elsewhere.

For instance, if the index is an actual commit, then if we abandon it somehow, like with some "git reset", it will be recorded in the reflog.

Currently, the index is outside of the commit object model, so it gets destroyed.

It's possible for a git index to have content which doesn't match the working tree; in that case when the index is lost with a git reset, that content is gone.

If the index is a commit, it can have a commit message. It can be tagged, etc.

Did you read parent comment? How is dealing with the thousand-line change not possible with what they described (hint, it's totally possible, no index needed)?

HOW is it possible?

Imagine I have a file with changes on lines 10 and 150. How do you commit just one without some form of index (or alike)?

> or alike

Well, that's the weasel word. In my grandparent comment, I proposed the "alike", didn't I?

Nowhere did I say, just remove the index from Git, but don't replace its functionality with any other representation or mechanism.

In git, we can do that today in such a way that the index is only temporarily involved:

  $ git commit --patch
  ... interactively pick out the change you want ...
  $ # now you have a commit with just that change

It is not some Law of Computer Science that the above scenario requires something called an "index", which is a big archive holding all of the files in the repo, where these changes are first "staged" before migrating into a commit.

The problem is not that Git supports staging partial changes. The problem is that Git has shoehorned a tool that "at some point you're going to want"—to help you deal with a rare occurrence—into the default workflow, forcing you to deal with the overhead of staging every time.

> to help you deal with a rare occurrence

..I actually use it with almost every commit, so I don't add reminder comments and debugging statements.

It's basically the overhead of typing -a ... so git commit -a rather than git commit. It's not such a big deal. It does take a while to get used to the git "pipeline" tbh, but when the rare occurrence happens you have this option; on source control systems without this option you just don't have it.

You're minimizing the overhead. `git commit -a` also won't help you with new or renamed files. So when you write about "the overhead of typing -a", what you really mean is the overhead of

1. typing `git commit`, checking the output, then typing `git commit -a`, or

2. typing `git commit`, then moving on with your life, and realizing minutes, hours, or days later that the changes you meant to include were not actually included, so you have to go back and add them if you're lucky, untangle them from whatever subsequent changes you were trying to make and/or do an interactive rebase if you're unlucky, and maybe face the prospect of doing a `git push --force` if you're really unlucky

Scale that up to several days or weeks to match the learning period and repeat for every developer who has to sit down and interact with it. That's the overhead we're talking about.

The article got it right; this is a monumental waste of human effort.

> Every developer has a finite number of brain-cycles

I've never used a version control system that didn't have to be notified about which files you would like to add. "vc commit" cannot simply pick up all files and put them under version control, because you have junk files all over the place: object files, editor backups, throwaway scratch and test files and so on.

But even when we use "git add", we are not aware of the index. The user can easily maintain a mental model that "git add" just puts files into some list of files to be pulled into version control when the next commit takes place. That is, until that silly user makes changes to the file after the git add, and those changes do not make it in because they forgot the "-a" on the commit.
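A toy model of that snapshot behaviour, in case it helps. This is a deliberately simplified sketch: `ToyRepo` and its methods are invented for illustration, and the real index does a lot more (deletions, modes, merge state), so don't read it as how git is implemented.

```python
class ToyRepo:
    """Simplified model: the index holds content as it was at `add` time."""

    def __init__(self):
        self.worktree = {}  # path -> current content on disk
        self.index = {}     # path -> content snapshot taken by add()
        self.commits = []   # each commit is a copy of the index

    def add(self, path):
        # Snapshot the file as it is right now, not as it will be later.
        self.index[path] = self.worktree[path]

    def commit(self, all_tracked=False):
        if all_tracked:
            # Like `commit -a` in this model: refresh paths already tracked,
            # but never pick up files that were never added.
            for path in list(self.index):
                if path in self.worktree:
                    self.index[path] = self.worktree[path]
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot


repo = ToyRepo()
repo.worktree["notes.txt"] = "v1"
repo.add("notes.txt")
repo.worktree["notes.txt"] = "v2"      # edited after the add
first = repo.commit()                  # plain commit: the edit is left out

repo.worktree["new.txt"] = "hello"     # brand new file, never added
second = repo.commit(all_tracked=True)
```

The first commit contains "v1", which is exactly the surprise described above; the second shows that "-a" catches the later edit but still ignores the file nobody added.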

I use an IDE with git integration. So I really never worry about most of this but I do also interact with the command line. When I create a new file in my IDE it asks me if I want to add it ...

I won't argue there is a relatively long learning period with git ... It helps if you have some experienced mentors in this area. But you get a lot of power for this...

> asks me if I want to add it.

Where "it" is just the snapshot of that file as it is now, not as it will be at commit time; then you have to add it again!

right, but at that point it is only git commit -a

At what point do you test these "cleanly separable" and "bisectable" patches? Do you do a second pass where you check out and build/test each of these commits?

It's pretty routine for a CI integration to test every patch, yeah. Not all do. (e.g. Gerrit-based systems generally do because the unit of review is a single patch, github likes to do integration on whole pull requests). It's certainly possible. I don't really understand your point. Are you arguing that it's preferable to dump a giant blob into your source control rather than trying to clean it up for maintainability?

No, I prefer small meaningful commits. I am not for or against the index. I have no problems switching my brain between git add -A or -p as necessary. Like you said, it happens too often that someone sends you a huge pile of code (C code, in my case). My first impulse is to build and run it immediately. For me, just compiling the code can take up to an hour sometimes. Running my full test suite takes even more hours.

At some point I am ready to craft this code into multiple commits. After my first git add -p and git commit, I don't know if HEAD is in a state where it even compiles. It takes further work and discipline to produce a whole series of good commits.

I think he is arguing that you should work on one feature at a time.

And I was saying that as a practical matter, you don't always get that option. Individual developers working on their own code don't need the index. But then quite frankly they don't need much more than RCS either (does anyone remember RCS? Pretend I said subversion if you don't).

Situations like integrating a big blob of messy changes happen all the time in real software engineering, and that's the use case for the git index.

I split work consisting of multiple changes in the same file just fine under CVS and Quilt.

I would convert the change to a unified diff, remove the change, and then apply the selected hunks out of that diff with patch. ("Selected" means making a copy of the diff, in which I take out the hunks I don't want to apply. Often I'd just have it loaded in Vim, and use undo to roll back to the original diff and remove something else.)

Using reversed diffs (diff -uR), I also used to selectively remove unwanted changes, similarly to "git checkout --patch".

This is basically what git is doing; it doesn't require the index. The index is just the destination where these selective hunks are being applied.
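The hand-editing step can be sketched in a few lines, assuming a single-file unified diff. The helper names and the sample diff are made up for the example; the output is meant to be fed to patch, which generally tolerates the line offsets left behind when earlier hunks are dropped.

```python
def split_hunks(diff_text):
    """Split a single-file unified diff into (header_lines, list_of_hunks)."""
    header, hunks = [], []
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            hunks.append([line])      # a new hunk starts at each @@ line
        elif hunks:
            hunks[-1].append(line)    # body line of the current hunk
        else:
            header.append(line)       # the ---/+++ file header
    return header, hunks


def select_hunks(diff_text, keep):
    """Return a diff containing only the hunks whose 0-based index is in keep."""
    header, hunks = split_hunks(diff_text)
    chosen = [hunk for i, hunk in enumerate(hunks) if i in keep]
    lines = header + [line for hunk in chosen for line in hunk]
    return "\n".join(lines) + "\n"


FULL_DIFF = """\
--- a/app.c
+++ b/app.c
@@ -10,1 +10,1 @@
-old line ten
+new line ten
@@ -150,1 +150,1 @@
-old line 150
+new line 150
"""

only_second = select_hunks(FULL_DIFF, keep={1})
print(only_second)
```

Apply the reduced diff, commit, then apply the rest: two commits out of one messy working copy, with no staging area involved.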

Is it really that many developers who don't split up their code into separate commits?

Personally, yes. Often I discover a commit won’t build so I do a little bit of interactive rebasing to move some dependent change into the same commit or an earlier one.

You would compose a series of local commits and, if you wanted, test them individually before pushing them. With a bunch of changes made to your files locally, you'd use tools like `git add -i` or `git add -p` to stage subsets of your changes to make those commits. As you finish building these commits, you would be left with a series of commits to your local branch, and no additional unstaged or uncommitted changes. You're "draining" the uncommitted changes into commits, part by part. Commands that manipulate the index are how you describe what you want to do to Git.

"Index" is probably not a helpful term. I think of them as simply "staged changes", that is, changes that will be committed to the repository when I run `git commit`, as distinct from local changes that will not be committed when I run `git commit`. With a Git repository checked out, just editing a file locally will not cause it to be included in a commit made with `git commit`. Rather, `git add` is how you describe that you want a change to be included in the "staged changes" that will be committed. You can add some files and not others, or even parts of a file and not other parts.

The need for this doesn't come up especially often, but it's really helpful when it does. One common case where this can come up is when you've been developing for a while locally, and you realize that your changes pertain to two different logical tasks or commits, and you want to break them up. Maybe one commit is "upgrade these dependencies" and the other is "add feature X". You started upgrading dependencies while building feature X, but the changes are logically unrelated, and now the dependency change is bigger than you expected and deserves to be reviewed on its own.

So with all of these changes in your workspace, you'll stage just the changes for "feature X" or "upgrade dependencies" and then run `git commit`. At this point, maybe you'll move this commit into its own branch or pull request in order to code review and ship it separately. (You might use `git stash` to save the remaining uncommitted changes while you do this.) Then you'll return to the remaining changes, which you will stage and commit as well. You've just gone from a bunch of unstructured, conflated changes to two or more separate commits on different branches/PRs (if that's what you want), that can be reviewed and shipped independently. You've gone from a massive change that's too big to review, to multiple bite-sized pieces.

These tools are also especially helpful if you, for any reason, need to manipulate source control history, such as breaking up one already-made commit into several commits, or simply modifying an existing commit. To do this, you would take that commit, apply it to the local workspace as if it's an unstaged change, and then, starting from the point in history before that commit was made, stage parts of the changes again and check them in. At this point, you can push the changes as a new branch, or even rewrite history by replacing your existing branch.

To give a use-case for this last capability, imagine that a developer accidentally checks in sensitive data as part of a big commit. Before (or even after) shipping the change, you realize this, so you want to go back and edit that commit, to remove the part of the change that checked in data while leaving the rest of the changes. You would describe these manipulations with the index as described in the previous paragraph.

You seem to miss the point that instead of staging changes out of your working directory and then committing them, you could just commit those changes out of your working directory. The extra step of staging is not needed.

Everywhere you said stage you could say commit (or amend) and then not need the extra step of committing afterwards.

> You seem to miss the point that instead of staging changes out of your working directory and then committing them, you could just commit those changes out of your working directory. The extra step of staging is not needed.

How would you describe this? One massive `git commit` with a ton of parameters? I don't see how it could work.

How would you describe "commit the first hunk of fileA (but not the second), and the second hunk of fileB (but not the first), and all of file C?". How do you "just commit those changes"? I believe you are missing how to actually describe this on the command line or with an API.

The index is absolutely needed. It's what allows you to build up a commit through a series of small, mutating commands like `git add fileC`, `git add -p fileA`. The value of the index is that you can build up your pending commit incrementally, while displaying what you've got with `git status`, then adding to it or removing from it.

What krupan and kazinator are saying is that you would use exactly the same commands as today, i.e. `git add -p` etc.

However, unlike today, those commands would automatically also do the equivalent of:

    if HEAD is marked as WIP
    then git commit --amend
    else git commit --special-wip-flag

As somebody who understands and uses the git index, I would wholly approve of this change.
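The proposal can be approximated with today's commands. In this sketch the helper name `wip_add` is hypothetical, and a commit subject of exactly "WIP" stands in for the "marked as WIP" flag, which git itself does not have.

```shell
# Hypothetical helper: stage the given paths, then fold them straight into a
# work-in-progress commit. A subject of exactly "WIP" stands in for the marker.
wip_add() {
  git add -- "$@" || return 1
  if [ "$(git log -1 --format=%s 2>/dev/null)" = "WIP" ]; then
    git commit -q --amend --no-edit   # HEAD is already the WIP commit: grow it
  else
    git commit -q -m "WIP"            # otherwise start a new WIP commit
  fi
}

repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"
echo "base" > a.txt && git add . && git commit -q -m "Initial commit"

echo "one" > b.txt && wip_add b.txt   # creates the WIP commit
echo "two" > c.txt && wip_add c.txt   # amends it

git log --format=%s
```

Nothing here stops the WIP commit from being pushed; the "do not publish" part of the idea would need support from git itself.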

Do you know how to use Git at all?

You can build a commit using multiple small "commit --amend --patch" commands. These use the index, but only in a fleeting, ephemeral way; changes go into the index and then immediately into a commit. They go into a new commit, or if you use --amend, into the existing top-most commit.

The index is a varnish onion.

Git has too many kinds of objects in its model which are all bags of files.

I do not require a staging area that is neither commit, nor tree.

Look at GIMP. In GIMP, some layer operations get staged: you get a temporary "floating layer". This gets committed with an "anchor" operation, IIRC. But it's the same kind of thing: a layer. It's not a "staging frame" or whatever, with its own toolbox menu of operations.

How do you build up the incremental changes? By staging them into your working directory from the one giant commit?

Huh? In the obvious way:

  $ git commit --patch 
  ... pick out changes: commit 1 ...

  $ git commit --patch
  ... pick out changes: commit 2 ...
  $ git commit --amend --patch
  ... pick out more changes into commit 2 ...

  $ git commit --patch
  ... pick out changes: commit 3 ...

There, now we have three commits with different changes and were never aware of any "index"; it was just used temporarily within the commit operation.

Oops, the last two should have been one! --> git rebase -i HEAD~2, then squash them.

The index is too fragile. You could spend 20 minutes carving out very specific changes to stage into the index. And then you do something wrong and that staging work is suddenly gone. Because it's not a commit, it's not in the reflog.

You want changes in proper commit objects as early as possible, then massage with interactive rebase, and ship.
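A small illustration of why a commit is the safer place for carefully selected work: after a mistaken reset, the reflog still points at it. The file name and messages are hypothetical.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"
echo "base" > notes.txt && git add . && git commit -q -m "Initial commit"

# Carefully selected work, saved as a commit rather than left in the index.
echo "carefully selected change" >> notes.txt
git commit -q -a -m "Careful work"
saved=$(git rev-parse HEAD)

# A mistake: the branch is reset and the commit drops out of its history.
git reset -q --hard HEAD^

# The reflog still records where HEAD was before the reset.
git reset -q --hard "HEAD@{1}"

git log --format=%s
```

Had the same work only been staged, the `git reset --hard` would have discarded it with no reflog entry to return to.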

Suppose HEAD points to a huge commit we would like to break up. One simple way:

   $ git reset HEAD^

Now the commit is gone and the changes are local. Then just do the above procedure: commit --patch, etc.

Thanks for 'you want changes in proper commit objects as early as possible'. I can't tell you how much time I have wasted with a botched index.

>Rather, `git add` is how you describe that you want a change to be included in the "staged changes" that will be committed. You can add some files and not others, or even parts of a file and not other parts.

That's largely outdated. For years now, git's commit command has been able to stage changes and squirrel them into the commit in (apparently to the user) one operation.

Only people who learned git ten years ago (and then stopped) still do "git add -p" and then a separate "git commit" instead of just "git commit --patch" and "git commit --amend --patch" which achieve the same thing.

I've had the same feeling about the index, except I want the opposite setup. Instead of the index becoming a commit, I want to always commit my working tree, and for there to be a special sort of stash for the stuff I don't want to commit just yet. I want to commit what's in my working tree so I can test before committing.

The way it would work is that when I realize I want to do a partial commit I'd stash, pare down the changes in my working tree (probably using an editor with visual diff), test what's in my working tree, commit, and then pop the stash.

I had hoped that this would already be doable with git, but it isn't, at least not in a straightforward way. The problem shows up when you try to apply that stash. You get loads of merge conflicts, because git considers the parent of the stash to be the parent of HEAD, and HEAD contains a bunch of the same changes.

I'm sure there's some workaround for this, but every time I've asked people always tell me to not bother testing before committing!

Hmm, I'm not sure exactly what you're after, but how about this:

    git commit --patch # Just commit the bits you want
    git stash # Stash the rest
    <do your tests>
    git stash pop
    <continue developing>

Yes, that works, but it means I'm committing with a dirty work tree, so I can't really test before I commit.

You're probably going to say I shouldn't test before every commit, but I rarely work in a branch where the bar is as low as "absolutely no testing required". I generally at least want my build to pass, or some smoke tests to pass, and I can't reliably verify either of those with a dirty work tree. And actually, the fact that all of the commits on my branch are effectively going to end up in master (unless I squash) makes me want to have even my feature branches fully tested.

> Yes, that works, but it means I'm committing with a dirty work tree, so I can't really test before I commit.

Ah, I see what you want to do now.

> You're probably going to say I shouldn't test before every commit

I have no business telling you what you should do. If you want to test before committing, your wish is my command:

    git stash --patch # Stash the bits you don't want to test
    <do your tests>
    git commit <options> # Commit the rest when the tests pass
    git stash pop
    <continue developing>

Interesting! I guess that works because the stash doesn't contain the changes I'm going to commit now, and so it doesn't conflict (at least in simple cases).
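A non-interactive sketch of the same stash-first workflow. It stashes a whole file with `git stash push -- <path>` instead of picking hunks with `--patch`; the file names and the stand-in "test" are hypothetical.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example" && git config user.email "example@example.com"
echo "v1" > ready.txt
echo "v1" > unfinished.txt
git add . && git commit -q -m "Initial commit"

echo "done" >> ready.txt
echo "half-written" >> unfinished.txt

# Set aside the change that is not ready (per file here; --patch works per hunk).
git stash push -q -- unfinished.txt

# The working tree now holds exactly what will be committed, so tests see it.
# (Hypothetical test step: replace with the project's real test command.)
grep -q "done" ready.txt

git commit -q -a -m "Finish ready.txt"
git stash pop -q
git status --short
```

The pop applies cleanly here because the stashed file was not touched by the commit; overlapping edits in the same file could still conflict.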

I never use `--patch` (even with `git add`). I prefer to use vim-fugitive, which lets me edit a diff of the index and my working tree. It looks like being able to do something similar with stashes is a requested, but not yet implemented, feature for vim-fugitive: https://github.com/tpope/vim-fugitive/issues/236

All I can say is I use

    git commit --patch

(and therefore `add --patch` is never even required!)

    git checkout --patch

to selectively revert hunks, and

    git stash --patch

to selectively stash. I couldn't be happier with this workflow!

This workflow can be yours, in 10 easy* payments:

  $ git stash
  $ git checkout stash@{0} -- ./
  $ $EDITOR # pare your changes down
  $ make runtests # let's assume they pass
  $ git add ./
  $ git commit
  $ git checkout stash@{0} -- ./
  $ make runtests # let's assume they pass
  $ git add ./
  $ git commit

*(In case it appears otherwise, this isn't actually supposed to be a defense of Git. Originally, this was 15 steps, but I edited it into something briefer and more straightforward.)

What you describe changes the name from "index" to "WIP commit", keeping the same semantics. Along the way, you now have a "commit" that doesn't behave like a commit, further adding to the potential for confusion. I strongly believe that things that behave differently should be named differently.

> What you describe changes the name from "index" to "WIP commit", keeping the same semantics.

I.e. you get it.

Importantly, the semantics is available through a common interface rather than a different design and implementation of the semantics for the index versus commits.

> you now have a "commit" that doesn't behave like a commit

Well, now; literally now you have a commit that doesn't behave like a commit: the index.

If a real commit is used for staging, it behaves much more like a commit. It's just attributed as do-not-publish so it doesn't get pushed out. Under this model, all commits have this attribute; it's just false for most of them. Thus, it isn't a different kind of commit.

> I strongly believe that things that behave differently should be named differently.

Things that do not behave completely differently can use qualified names, in situations when it matters:

"work-in-progress commit; tentative commit; ...."

For instance we use "socket" for both TCP and UDP communication handles, or both Internet and Unix local ones.

1. I actually ran it, and ironically, `git log --oneline --reverse` runs faster than `fossil descendants` on the sqlite repository for new commits, and just as fast on old commits. Perhaps fossil would do better on a large repository, but I doubt it. Git has many flaws, but "insufficient optimization (compared to alternatives)" is not one of them.

As branch names are not recorded in commits, it is pretty much impossible to say what commit was next in a branch.

That's because commits are not owned by branches. The same commit can be present in two or more separate branches with different successor commits in each. And this is a feature, not a bug! It's what allows you to ask the question "Where did these two branches fork?".
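A sketch of both points in a throwaway repository: one commit that sits on two branches, and the fork-point question answered with `git merge-base`. Branch and file names are hypothetical.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main   # fix the branch name for this sketch
git config user.name "Example" && git config user.email "example@example.com"

echo "1" > file.txt && git add . && git commit -q -m "Shared commit"
shared=$(git rev-parse HEAD)

# Branch off, then let both branches grow a different successor.
git branch topic
echo "2" >> file.txt && git commit -q -a -m "Main-only commit"
git checkout -q topic
echo "3" > other.txt && git add . && git commit -q -m "Topic-only commit"

# The shared commit belongs to both branches...
git branch --contains "$shared"

# ...and the fork point is the best common ancestor of the two tips.
fork=$(git merge-base main topic)
git log -1 --format=%s "$fork"
```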

Now, I don't know how Fossil answers that question. Maybe it's got some clever trick (like a separate "universal ID" vs. "per-branch ID" for each commit, maybe). Maybe SQLite doesn't need that and doesn't care. But it's not like this is a simple feature request. Git was designed the way it was for a reason, and some of us like it that way.

> Git was designed the way it was for a reason, and some of us like it that way.

For the record, the initial version of git was developed in haste as a replacement for bitkeeper, after Larry McVoy (who shows up here on HN once in a while, and always has great posts) got a little aggressive about the licensing.

BitKeeper was open-sourced a year or two back: https://github.com/bitkeeper-scm . I haven't had a chance to use it extensively yet, but I hear it still has a good selection of features that git either chose not to implement or didn't properly understand.

Don't get me wrong, I think git is pretty great and have been one of its primary advocates over SVN, etc., at most companies I've worked with (I think 2013 was the first time I came on board a team that was already using it), but its history is informative.

EDIT: upon scrolling, I see that Larry has already dropped by. Read his posts! https://news.ycombinator.com/item?id=16806588

For the record, in my opinion git is the worst SCM among the distributed ones and I truly believe it was only hype that carried it. The user interface is genuinely user hostile with so many commands and yet many commands have switches that change the behavior so much it very well might be a different command. And so forth.

Git is just a patch database management system that tries too hard to become a version control system. It is proof that this approach is fundamentally misguided.

However, mercurial offers both named branches (hg branch) and pointer branches (hg bookmark) and when I last used it, it seemed the consensus was shifting to one of two positions, (a) just use hg bookmark or (b) use named branches for long lived branches only (master/default, develop, release branches etc) and bookmarks for feature branches/dev use.

Have you looked at the --branches option for git log? It annotates which branch(es) a commit is part of unless you're talking about something else. There's obviously some cost but all this talk of "slow" is ignoring the fact that in practice on small repos (& yes - sqlite is small for git) you're not going to notice it. Also, you can limit your branches to those you're interested in to speed things up.

No, it's not impossible. It's quite easy. What it isn't is immutable. I.e. you can have a repo now with just "master", and I can push a new branch one for each commit in it, and make any such tool output useless garbage.

I.e. in some DAG implementations the branch would be a fundamental property at write time, in git it's just a small bit of info on the side.

What I'm referring to is that for the common case of something like the SQLite repository that uses branches consistently it's easy to extract info from git saying "this commit is on master, and on the LHS of any merge to it", or "this commit is on master, but also the branch-off-point for the xyz branch".
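For a repository that merges consistently, `git log --first-parent` gives roughly that "on master, left-hand side of merges" view. A sketch with hypothetical branch names (`main` standing in for master, `xyz` for the side branch):

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example" && git config user.email "example@example.com"

echo "1" > file.txt && git add . && git commit -q -m "Main: first"
git checkout -q -b xyz
echo "x" > xyz.txt && git add . && git commit -q -m "xyz: work"
git checkout -q main
echo "2" >> file.txt && git commit -q -a -m "Main: second"
git merge -q --no-ff -m "Merge xyz" xyz

# Follow only the first parent of each merge: the commits made on main itself.
git log --first-parent --format=%s main
```

This relies on convention rather than recorded branch names: it reflects which side each merge was made from, so it is only as reliable as the project's merge habits.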

The branch shown in the article is a perfect example of this. In that case all the same info exists to show the same sort of graph output in git (and there's even options to do that), it's just not being done by the web UI the author is using.

I can view all the branches simultaneously in fossil by opening a web page or by running `fossil timeline`.


The lack of an index (or something like it) was what made hg/svn unusable once I'd moved over to git. The ability to closely review what I'm committing (and not commit "every single change in the repo") is a must.
