Good news from GitHub: they have extracted the full list of SHA-1 before the forced push !
Many thanks to Nathan Witmer :-)
See below the full CSV with the SHA-1.
He created as well a branch named 'recovery' that points to the candidate point for restoring the master branch.
Hope this will help to sort out the remaining repos.
> Hi Luca.
> Oh man, that sinking feeling!
> But, no worries: I've gone through each of the repos listed above and done the following :
> * retrieved the SHA for the previous `master` before you force-pushed
> * created a branch called `recovery` pointing to each former master
> In some cases, these are the same.
> I can go further and reset the master refs to their former shas if you'd like, or you can recover these yourself. To do so, in each repo:
> $ git fetch
> $ git checkout master
> $ git reset --hard origin/recovery
> $ git push --force origin master
> I've attached a csv containing the shas (former master and forced master) for each branch, for your reference.
> Good luck!
I am having a really hard time understanding why a distributed VCS with hundreds of users needs the admin of the central repo to do anything for them.
Each distributed node only knows the ref the branch was pointing at the last time they fetched, so their certainty on what the branch looked like before the force push is much lower.
git reports it to you when you do the force push:
% git push github master --force
Total 0 (delta 0), reused 0 (delta 0)
+ 4d44b63...ad5b147 master -> master (forced update)
(4d44b63 is the previous commit; ad5b147 is what you've forced it to)
Ideally git itself should record this information somewhere, though.
Also I am not sure what he did to push to 150 remotes, and in that case this output would be much more tedious to piece together (and that is assuming he was able to capture all of it).
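For what it's worth, here is a minimal sketch of recording that information yourself: ask the remote where the branch points before forcing, log it, and use the logged SHA to undo the push. The temporary repos, the `forced-push.log` file name, and the demo identity are assumptions for illustration, and the undo only works while some clone still holds the old commits.

```shell
#!/bin/sh
# Sketch: note where the remote branch points before overwriting it.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
w="$tmp/work"
git init -q --bare "$tmp/central.git"   # stands in for the central repo
git init -q "$w"
git -C "$w" symbolic-ref HEAD refs/heads/master
git -C "$w" remote add origin "$tmp/central.git"

git -C "$w" commit -q --allow-empty -m "first"
git -C "$w" push -q origin master
git -C "$w" commit -q --amend --allow-empty -m "rewritten"   # local history now diverges

# Ask the remote directly, so a stale local tracking ref does not matter.
old_sha=$(git -C "$w" ls-remote origin refs/heads/master | cut -f1)
echo "master $old_sha" >> "$tmp/forced-push.log"

git -C "$w" push -q --force origin master
new_sha=$(git -C "$w" ls-remote origin refs/heads/master | cut -f1)

# Undo: point the remote branch back at the recorded commit.
git -C "$w" push -q --force origin "$old_sha:refs/heads/master"
restored_sha=$(git -C "$w" ls-remote origin refs/heads/master | cut -f1)

echo "was $old_sha, forced to $new_sha, restored to $restored_sha"
rm -rf "$tmp"
```

Running the same `ls-remote` step in a loop over every remote would give a per-repo log like the CSV mentioned above, without having to scrape the push output.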
However I disagree that force pushing is rare, I find it is a (sharp) tool I use most days (more often on private repos or branches).
I do like the idea of github providing some kind of web interface to view the reflog and reset branches to various points within it.
I think it is a truism that systems which allow users to do interesting and clever things must also allow users to do remarkably stupid and wrong things.
Rather than focusing on how to prevent a user from doing a silly thing like this, I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Unfortunately, all too often when a Bad Thing happens, the discussion tends to center around how to prevent this Bad Thing from happening again. This is often why those who make mistakes end up being demonized; their mistake has now limited the actions that every other user of the system can make, because the system will be modified to prevent other users from making the same mistake.
Agreed. From a usability perspective, undo is always best, especially undo that 1) the user knows about, 2) the user trusts, and 3) is fast. That encourages exploration and lets users fix mistakes.
> their mistake has now limited the actions that every other user of the system can make, because the system will be modified to prevent other users from making the same mistake.
I don't think this is a black and white issue. It's not about denying all users the ability to express X ever again. I think it's more about having systems that can tell if X is an unusual or heavyweight action and say, "Hey, you're about to do X, which impacts a lot of stuff and you've never done before. Are you sure that's what you mean?"
It's velvet rope permissions, not a locked door.
Yes, but git already does that, by having the user type "--force". Asking would eliminate the possibility of scripting the action.
...not that it would have helped in this case as he was probably using it heedlessly in a jenkins plugin.
-People that shouldn't have access
-People that do have access but can make mistakes or shouldn't have complete, unrestricted access
-Processes that shouldn't have access
-Processes that do have access going haywire
He is right, done properly, this probably shouldn't have been allowed to happen.
edit: it's almost readable now
I could certainly see the argument that the system should have made him more aware of the permissions he had. But it seems like instead he's arguing that he shouldn't have been given those permissions to begin with, which goes back to my point. Because one person was given many permissions and used them irresponsibly, now everyone's default permissions are more restrictive.
The issue is that a client may not have fetched for a while before performing the force push, so it may not have all the required refs for performing the rollback.
The server of course still has all the refs, so it shouldn't be too hard to add a command (run client side) to invoke the required server-side rollback.
It is of course possible to write a pre-receive (or update) hook that denies non-fastforward pushes to specific branches (a post-receive hook runs too late to reject anything), but really, git should include a config option out of the box to do this at a more granular level. A common workflow is the expectation that people should be able to force-push to their own topic branches all they like, but never to the default branch (except in extreme circumstances). Being able to easily set this via a config option would be very helpful in keeping working repos more "safe" from this kind of thing.
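A rough sketch of such a hook, assuming you control the server-side repo (which you don't on github.com). The protected branch name, the temporary repos, and the demo identity are all illustrative; the hook itself relies only on the standard pre-receive input of `old new ref` lines and on `git merge-base --is-ancestor`.

```shell
#!/bin/sh
# Sketch: a pre-receive hook that refuses history rewrites on master only.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
w="$tmp/work"
git init -q --bare "$tmp/central.git"
mkdir -p "$tmp/central.git/hooks"

cat > "$tmp/central.git/hooks/pre-receive" <<'EOF'
#!/bin/sh
# Hypothetical policy: topic branches may be rewritten, master may not.
protected=refs/heads/master
while read -r old new ref; do
  [ "$ref" = "$protected" ] || continue
  case "$old" in *[!0]*) ;; *) continue ;; esac   # all zeros: branch creation
  case "$new" in *[!0]*) ;; *) echo "refusing to delete $ref" >&2; exit 1 ;; esac
  if ! git merge-base --is-ancestor "$old" "$new"; then
    echo "refusing non-fast-forward push to $ref" >&2
    exit 1
  fi
done
EOF
chmod +x "$tmp/central.git/hooks/pre-receive"

git init -q "$w"
git -C "$w" symbolic-ref HEAD refs/heads/master
git -C "$w" remote add origin "$tmp/central.git"
git -C "$w" commit -q --allow-empty -m "first"
git -C "$w" push -q origin master master:refs/heads/topic
git -C "$w" commit -q --amend --allow-empty -m "rewritten"

# Rewriting the topic branch is allowed; rewriting master is rejected.
if git -C "$w" push -q --force origin master:refs/heads/topic 2>/dev/null
then topic_forced=yes; else topic_forced=no; fi
if git -C "$w" push -q --force origin master 2>/dev/null
then master_forced=yes; else master_forced=no; fi

echo "topic forced: $topic_forced, master forced: $master_forced"
rm -rf "$tmp"
```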
Isn't that exactly what a revision control system is for?
Unfortunately, some allow you to re-write all history.
The old history will possibly be garbage collected eventually, but for quite a while that history is still accessible.
I think the only missing piece is a simple (and obvious) client side 'undo/rollback that force push'.
The issue is that the git client may not have the history pre-force, so it may have insufficient information to do the rollback (at least in the current model), however the server still has all the refs (and they are listed in the reflog).
It's just a pain in the butt to restore it all, especially if in the meantime people started making new commits on top.
But if someone has commit privileges to a repo, they have the ability to mess it up, I don't see any real way around that. The UI of the client side tool(s) that made it so easy to make this mistake can perhaps be blamed.
Force push capabilities make it even easier to mess up than just plain commit privileges.
Yes, it seems counterintuitive that a version control system should ever let you lose previously committed data. "What's the point of version control, if there's a chance you won't be able to revert?" you might ask.
So at first it seems like we should establish the principle that you can't permanently delete committed data. (Unless you step outside the tools provided by the VCS and do something like hacking the contents of the .git directory.)
But sadly, real-world constraints get in the way of this principle.
Disk space is one such constraint. Sometimes, we need to bring a repo's disk footprint down to a reasonable size. This could happen if we committed a bunch of huge files that we no longer want, or if the repo is simply very old.
Another constraint is intellectual property, secrets, passwords, and such things. Sometimes you check in something that other people shouldn't see. Maybe it's an accident, or maybe you just don't realize the data is sensitive. You need a way to redact the contents of the repo, and that means deleting historical data.
The company I work for has at least 50 projects all using Git, and the only time we use --force is when we have to clean up a major mistake (e.g. a premature merge to release). I can count the number of times this has happened in the past 3 years on my hands.
Now I'm curious: How does it fit into people's everyday workflow?
(on teams that like rebase I tend to do this a lot)
Rebase and squash. Fix commit messages. Force push to PR-branch tidy.
Why? How does taking away force permission on repos he doesn't commit to prevent him from doing clever things?
But I don't think Bad Thing Prevention is necessarily to be avoided. Undo buttons are sweet, but not everything can be undone. An anecdote: During the first month of my first job in finance I accidentally deleted a few mappings in a database which put the entire company at risk to the tune of about $45million. Complete bone-head move, I was new to the game. It all turned out fine and I got off with a slap on the wrist, but my manager was flogged. When markets are involved, something like that has no undo button. Instead, we put some proper controls in place (tests, mostly) to make sure said mappings were always logically consistent. As far as I know, that brand of problem has not occurred in the 6 years since.
And yes, we all make mistakes. But by Git's nature, lo and behold, everybody has a copy.
Git already prohibits forced pushes if you use 'git init --shared' to create your repo.
Well, if the repositories were created with "git init --shared" then it wouldn't have been allowed.
I think it's a valid position to believe that this should be the default on github, although obviously in this case, there might be blame-shifting motivation to that position.
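To see what `--shared` does here, a small sketch (the temporary path is arbitrary): a repo initialised with `git init --shared` gets `receive.denyNonFastForwards` set in its config, which is the setting that makes the server refuse forced pushes.

```shell
#!/bin/sh
# Sketch: inspect the config that "git init --shared" writes.
set -eu
tmp=$(mktemp -d)
git init -q --bare --shared "$tmp/shared.git"

deny=$(git --git-dir="$tmp/shared.git" config --get receive.denyNonFastForwards)
echo "receive.denyNonFastForwards=$deny"
rm -rf "$tmp"
```

On an existing server-side repo that you administer, the same effect should be available by setting that key by hand with `git config receive.denyNonFastForwards true`.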
> Rather than focusing on how to prevent a user from doing a silly thing like this, I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Git does make it easy for the user to undo this. It provides the reflog for that. It also provides information in the output of the push command. (However, github does not allow users to access the remote reflog.)
In any case, you can blame github, and you can blame the developer, but at least we should all be clear that you can't blame git.
His insistence that he should not have had write access does come across as a lame excuse, but he has a point. E.g. take the other poster in the thread who was surprised to find that others have access to his project beyond the half a dozen he knew about - that to me indicates that something isn't quite right permissions-wise.
These systems can at least put up speed bumps ("are you SURE you want to do this potentially work-imploding operation?") for those sorts of things.
Normally, it would be difficult to do this even if you tried, no? 180 repos would usually be in 180 different clone directories; what can you do that somehow force pushes them all at once, even ones you didn't mean to be working on at all?
In the thread, the person who made the mistake seems to suggest it has something to do with a 'gerrit replication plug-in'? I am not familiar with gerrit. Anyone?
Mistakes will happen; but for a mistake to have this level of consequence took amplification, apparently by the 'gerrit replication plugin'? I'd be inclined to blame that software, rather than human error (or github permissions or lack thereof! If you have write access to a bunch of repos, of course you can mess them all up, no github permissions will ultimately change that. The curious thing remains how you manage to mess up a couple hundred of them in one click).
This is simply not the case. Force pushing is literally the only operation you can do via git protocols that has a destructive effect on the remote repo. Moreover, it is generally considered a best practice never to force push public branches. Thus, preventing forced pushes on branches that are considered public would, in fact, prevent any sort of real destruction to a repo. People could still push new garbage to it, but they can't make old history inaccessible.
You make it sound like force pushing is some sort of special operation, but it's actually just the same operation that's made with every commit. A branch is deleted, and then a new branch is made with that same name pointing to a new commit. It's just that with a normal push there's a check to see if there's a path from the previous HEAD to the new HEAD.
And another thing, it's also not directly destructive, the unreferenced commits don't immediately disappear. They just get cleaned up when git decides it's time for garbage removal. You can still check the commits out, if you know their sha.
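A small local sketch of that point, with a throwaway repo and a demo identity as assumptions: after the history is rewritten, the old commit is no longer on any branch, but it can still be looked up by SHA and given a new branch name until garbage collection removes it.

```shell
#!/bin/sh
# Sketch: an unreferenced commit is still reachable by its SHA.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
repo="$tmp/repo"
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m "original"
old_sha=$(git -C "$repo" rev-parse HEAD)

# Rewrite history: the branch now points at a different commit.
git -C "$repo" commit -q --amend --allow-empty -m "rewritten"

# The old object has not been deleted yet.
if git -C "$repo" cat-file -e "$old_sha^{commit}"
then still_there=yes; else still_there=no; fi

# Give it a name again so it is safe from garbage collection.
git -C "$repo" branch recovery "$old_sha"
recovered_msg=$(git -C "$repo" log -1 --format=%s recovery)

echo "old commit present: $still_there, recovery points at: $recovered_msg"
rm -rf "$tmp"
```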
The issue is pushing to multiple branches.
This is a local developer's Git issue, not a Github issue: http://stackoverflow.com/questions/948354/git-push-default-b... but one they've thankfully changed: http://stackoverflow.com/questions/13148066/warning-push-def...
Summary: By default Git push used to (they've changed it) push ALL your branches, not just your current one. It is pretty much the first thing I've historically reconfigured whenever starting on any new project. This guy didn't do it, and paid the cost.
Git lets you prevent this easily using the receive.denyNonFastForwards option, but GitHub won't let you control this option. So we're left just telling people not to do it and hoping that they listen and don't do it by accident. That seems like a shitty, precarious situation that could easily be fixed. One checkbox on the GitHub repo settings page and this story would never have happened.
You have to explicitly add the force flag, and if you choose to go down that path, then you better be sure you know what you're doing. If you don't, but still manually set that flag just to get past the error from your failed push? You deserve the rain of hellfire that will come down upon you if you blow away other people's commits.
I know about denyNonFastForwards, and unfortunately I work at a company that turns this on as a rule on ALL branches, which I don't like because I like having personal branches on the server (rather than on my personal git repo). As a result when I want to squash my commits I end up rebasing interactively, deleting the remote branch, and then creating it again.
[Note that using an explicit branch name, as has been suggested here, doesn't really help since you can easily accidentally force push a branch that way too.]
The fact that your company has a draconian blanket policy is unfortunate, but has no bearing on whether or not people should be allowed to opt in to setting denyNonFastForwards on specific branches.
Backing up your local feature branches is no reason to enable rewriting history on the main repository.
They do, in GitHub Enterprise. However, the option is not in the "public" Repo Settings page; it's in a special site-admin-only settings page.
So it seems possible that this same setting could be deployed to github.com, but it would need additional UI development.
One thing I sometimes do for repos where I have commit access, but don't want to push, is to clone the repository via HTTP. That way, any accidental attempts at pushing won't use my ssh key, and so GitHub will prompt me for my password.
Let's not be disingenuous. This is exactly what every git user does when they run `git clone`. You aren't railing against just multiple committers per public repo, you're railing against multiple clones per public repo. One user swapping between two client systems can make this error. Most public repos of any scale will have multiple committers to spread the maintenance workload.
Even Scott Chacon himself describes a simple, effective process called "GitHub Flow" -- just submitting a PR against a branch for merge against master on the same repo. This is a very common workflow on private repos in my experience, and a very easy workflow for small teams.
Making all this worse, the default config setting for `push.default` will only change to "simple" in Git 2.0 -- until then any new host/account that hasn't had push.default changed in .gitconfig is a bomb waiting for the first otherwise legitimate `git push --force` to go off. I've received the batsignal from folks on small teams who've otherwise done all the right things and still been bitten by this.
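For reference, a sketch of the reconfiguration being described. It sets the value on a throwaway repo so nothing on your machine changes; in practice you would more likely use `git config --global push.default simple` (or `upstream` or `current`, depending on taste).

```shell
#!/bin/sh
# Sketch: make a bare "git push" touch only the current branch.
set -eu
tmp=$(mktemp -d)
repo="$tmp/repo"
git init -q "$repo"

# "matching" was the old default: push every branch with a same-named remote branch.
git -C "$repo" config push.default simple
mode=$(git -C "$repo" config --get push.default)

echo "push.default=$mode"
rm -rf "$tmp"
```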
Fortunately, git has the reflog to save us from this kind of mess. It's just not readily accessible on GitHub, AFAICT.
core.logAllRefUpdates must be manually enabled on bare repos!
To repeat: the reflog is not available on bare remotes by default unless core.logAllRefUpdates has been manually set to "true". See the git-config documentation for that config setting.
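A sketch of what that buys you on a self-hosted bare repo (temporary paths and the demo identity are assumptions): with the setting on, the server keeps a reflog for the branch, so the pre-force value is still available as `master@{1}` after a forced push.

```shell
#!/bin/sh
# Sketch: enable the reflog on a bare repo and read the pre-force SHA from it.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
bare="$tmp/central.git"
w="$tmp/work"
git init -q --bare "$bare"
git --git-dir="$bare" config core.logAllRefUpdates true

git init -q "$w"
git -C "$w" symbolic-ref HEAD refs/heads/master
git -C "$w" remote add origin "$bare"
git -C "$w" commit -q --allow-empty -m "first"
first_sha=$(git -C "$w" rev-parse HEAD)
git -C "$w" push -q origin master

git -C "$w" commit -q --amend --allow-empty -m "rewritten"
git -C "$w" push -q --force origin master

# On the server side, the previous value of the branch is one reflog entry back.
prev=$(git --git-dir="$bare" rev-parse 'master@{1}')
echo "master before the forced push: $prev"
rm -rf "$tmp"
```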
(You could argue that stock git doesn't really support fancy permissions models either, but if you have access to all the hooks you can implement them, or you can use gitolite or Gitlab.)
That's exactly what I mean. How is this a difficult concept? Branches are lightweight, clones are heavyweight. There's really very little difference except that they can share branches of the same name that point to different refs, I guess.
The lead dev for that project just got hired by Facebook (actually, almost all of the core hg team is getting hired by FB), and the feature itself is like 85% complete.
Soon we will have a safe way to collaboratively edit history!
Mercurial won't take over the world, and it won't even reach the adoption level of Git, but that doesn't strike me as particularly terrible in any tangible way.
And then there's network effects - it's popular, so it gets built into things like ruby's bundler and now even visual studio. Hg may be nice, but that's an uphill battle, especially over something that's perhaps not even all that important (i.e. even using SVN isn't a development death-knell or anything).
Mercurial may let them continue growing in that way for a while, much like switching to Perforce would let them scale a single repo far beyond what either git or mercurial would tolerate, but it seems to me that they are delaying their problem (and likely digging themselves into quite a hole, depending on the unstated nature of their Mercurial modifications...)
Ultimately, being able to push a single repo into the dozens, hundreds, or even thousands of GB doesn't mean it is a good idea.
In other words, that those infinitely scaling systems already exist, and they are built on top of git (or hypothetically Hg, though I cannot think of any examples).
For one publicly visible example of the sort of thing that I am talking about, look into how Android development works. Android is 'in git', even though it is too large for a 'single' git repo.
(Note that due to the ways that code and responsibility are typically organized, most organizations are probably in a situation where migrating from a single monolith repo to many git repos would be a conceptually straightforward task, provided that they can break from the conceptual notion of having a 'single' repo. Atomic commits across repos are the primary pain-point, but you would be surprised how much that disappears as you grow used to working with many repos. Supporting a strong notion of versioned dependencies between packages goes a long way.)
I agree that atomic commits are a red herring. They're nice to have, but by the time you outgrow a single git repo, you also have projects with different release schedules, and once you have that, you have to deal with version skew anyway, and then you don't really need atomic commits.
I disagree that "those infinitely scaling systems already exist," though. Looking at Android, they had to build a nontrivial wrapper around git to make it work, and it's not totally transparent. You have to think about when to use 'repo' and when to use 'git', and where the boundaries between repos are.
There are huge benefits to having everything be in one giant pile of code and being able to import and sometimes modify code from far away parts of the tree with minimal overhead. The key is to let any directory be a "project" that you can refer to, without any arbitrary distinction between top-level project directories and others. This lets you do things like spin out a part of a project as a semi-independent library without moving files around or creating new repos.
There are some downsides too, of course. Google eventually broke from this model slightly by introducing components, which had issues of their own. Perhaps Facebook has done it better.
I think that's a shame, and it doesn't work very well either. Mercurial does the same thing incidentally (unless facebook have somehow solved that), so it's no better there.
Splitting repos is a pain; perhaps a necessary one, but hardly ideal. It just introduces a bunch of extra administration, and it reduces the power of your VCS primitives (such as branching and merging).
The way I interpreted the reddit post from the Facebook engineer is that they looked into customizing git to scale better for them, but the codebase wasn't to their liking, so they're going to customize mercurial's instead.
> The matter of customizing git came up and people looked at the code and decided it's pretty convoluted when compared to the Mercurial code.
So it's not like mercurial is able to scale in a way that git can't, it's that they plan to make mercurial scale in a way that git can't. (Any over/under on when they give up on that and decide to use Perforce's Git Fusion?)
On a slightly unrelated tangent, the tup build system supports inotify, which I appreciate.
Back in the day? It seems like git just came out; are a few years already ancient times? I don't get why they can't use git. Git is used to maintain the entire Linux kernel, and they want some rarely used system?
I wish people could make up their minds on what DVCS we should all be using and stick with it for 20 or 30 years.
They are presumably trying to use a single repo for a very large chunk of code. Large, not in the sense that the Linux kernel is large, but a repo one (maybe two or more) orders of magnitude larger. Multiple gigabytes large, with most of those gigabytes taken up by small objects.
Git does indeed start to fall down with these sorts of repos. The thing is, if you keep on growing like that, other VCSs will as well. Even with perforce you'll hit a wall eventually.
The solution is to restructure your project, breaking it into several different repos. Understandably, large organizations capable of creating large repos like this are typically going to be resistant to large changes like this brought on by what they interpret as limitations of tooling. Doing it requires restructuring code, training, and may require extensive modification to internal systems that work with your codebase.
My perspective is that this 'limitation' of git is really a symptom of an anti-pattern that should be corrected sooner, rather than later, before you find yourself in a really painful situation (read: three years later your repo is now 10-100 times as large, and you are discovering that continuing to scale with Perforce is becoming untenable).
I know this is a bit meta, but this seems to go against the entire spirit of git.
Btw, I'm pretty impressed by Facebook's open source efforts.
We have small teams and many mostly small repositories, so I am fetching something like <20 changes into each of <5 repositories every day. I have no complaints about performance or correctness. The main pain is interacting with build systems that quite naturally think in terms of Git hashes; rather than
hg diff -c 244 --stat
I have to write
hg diff -c 'gitnode("2a33c401")' --stat
Or whatever. Not a big deal, just mildly tedious.
Vive la résistance!
I looked at http://mercurial.selenic.com/wiki/Bookmarks/ (I already have a pretty good understanding of HG branches.)
Now I'm still not sure I got everything straight in my head:
The HG wiki says "Git, by contrast, has "branches" that are not stored in history, which is useful for working with numerous short-lived feature branches, but makes future auditing impossible."
Git branches are just commits on a separate path in the tree, right? As long as the branch is merged with git merge --no-ff (which can't be set as the default, yes, I know) -- then a git branch becomes a permanent part of the history due to the commit created during the merge. The fast-forward option is just that: optional. It's only useful as a history-rewriting tool, kind of the way git squash is useful.
I guess I'm asking, what does the HG wiki mean, in concrete terms? What is impossible with git?
Also, to be sure I understood the original comment well, I'll try to sum up the difference between HG branches and bookmarks. Is it correct?
HG branches copy-on-write the entire source tree. This means a HG branch cannot be changed if the history before the branch changes. HG bookmarks (and git branches) can change later if the history changes.
Maybe there's some logic behind it, but I doubt it, and I'm really not sure if GitHub completely thought it through.
Further, because git is a DVCS, so long as someone has a valid recent clone, the pain is much less than, say, a centralized repo being corrupted.
I don't really know what Jenkins is and what "git push --force" does.
'git push --force' pushes your copy of the repository to the remote, regardless of what the remote has. He basically told GitHub, "Hey here's the /real/ copy of the repository, the copy you have is totally wrong" and it overwrote all the history because his copy was a few months out of date.
One part of Git I never really liked.
If someone sets out to mess up a repo ON PURPOSE, he can do force push. Can he also somehow trigger GC to preclude recovery? Do something else?
What are the attack vectors?
However, I always do a --dry-run first on a force push just to verify that it's only going to update the one branch I expect it to.
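A sketch of that check, using throwaway repos and a demo identity as assumptions: `--dry-run` reports what the push would do without updating anything, and `--porcelain` puts one tab-separated line per ref on stdout, with a leading `+` marking a forced update.

```shell
#!/bin/sh
# Sketch: preview a forced push and confirm the remote is untouched.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
w="$tmp/work"
git init -q --bare "$tmp/central.git"
git init -q "$w"
git -C "$w" symbolic-ref HEAD refs/heads/master
git -C "$w" remote add origin "$tmp/central.git"
git -C "$w" commit -q --allow-empty -m "first"
git -C "$w" push -q origin master
git -C "$w" commit -q --amend --allow-empty -m "rewritten"

before=$(git -C "$w" ls-remote origin refs/heads/master | cut -f1)

# Preview only: nothing is sent to the remote.
plan=$(git -C "$w" push --dry-run --force --porcelain origin master)
forced=$(printf '%s\n' "$plan" | grep -c '^+' || true)

after=$(git -C "$w" ls-remote origin refs/heads/master | cut -f1)
echo "refs that would be force-updated: $forced"
rm -rf "$tmp"
```

If the count is higher than the one branch you expected, that is the moment to stop and check `push.default` and the refspec before running the real push.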
A modern alternative is to only allow a bot to write to the central repo. As an example, the rust guys use a bot: https://github.com/mozilla/rust/pull/10403
The student used Eclipse with its git integration. He tried to push, but it failed (non-fast-forward). Apparently, Eclipse tries to be helpful and says something like "If you want to push anyways, use the 'force' checkbox". Not knowing what a "force push" is, the student followed Eclipse's advice.
Thankfully, the distributed nature of git is very helpful. You can just force push again from another repo to undo the force push.
weird. I had that backwards. oh well, thanks all, I was wrong :)
i kinda prefer the upstream (names can be different but i set up an explicit link) option to simple