Good news from GitHub: they have extracted the full list of SHA-1s from before the forced push!
Many thanks to Nathan Witmer :-)
See below the full CSV with the SHA-1.
He also created a branch named 'recovery' in each repo, pointing to the candidate commit for restoring the master branch.
Hope this will help to sort out the remaining repos.
Luca.
> Hi Luca.
>
> Oh man, that sinking feeling!
>
> But, no worries: I've gone through each of the repos listed above and done the following:
>
> * retrieved the SHA for the previous `master` before you force-pushed
> * created a branch called `recovery` pointing to each former master
>
> In some cases, these are the same.
>
> I can go further and reset the master refs to their former shas if you'd like, or you can recover these yourself. To do so, in each repo:
>
> $ git fetch
> $ git checkout master
> $ git reset --hard origin/recovery
> $ git push --force origin master
>
> I've attached a csv containing the shas (former master and forced master) for each branch, for your reference.
>
> Good luck!
>
> Nathan.
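For anyone who wants to rehearse Nathan's four steps before touching a real repo, here is a minimal sandbox sketch. Everything in it is made up for illustration (the temp directory, `origin.git`, the `dev` working copy, the empty commits), and the `recovery` branch is created by hand to stand in for what GitHub support did server-side.

```shell
#!/bin/sh
# Sandbox rehearsal of the recovery steps; all names and paths are hypothetical.
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git

# A working copy with two commits on master, published to the sandbox remote.
git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git push -q origin master
good=$(git rev-parse HEAD)

# Simulate the accident: force push an older state over master.
git reset -q --hard HEAD~1
git push -q --force origin master

# Stand-in for the server-side fix: a 'recovery' branch at the former master.
git --git-dir="$work/origin.git" branch recovery "$good"

# The four steps from the email.
git fetch -q
git checkout -q master
git reset -q --hard origin/recovery
git push -q --force origin master

restored=$(git --git-dir="$work/origin.git" rev-parse refs/heads/master)
echo "master restored to $restored"
```

Note that the last step is itself a force push, so anything pushed to master after the accident would need to be merged or rebased back on top by hand.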
The central repository is the only point that knows (with 100% certainty) which ref the branch was set to before the force push.
Each distributed node only knows the ref the branch was pointing at the last time they fetched, so their certainty on what the branch looked like before the force push is much lower.
> The central repository is the only point that knows (with 100% certainty) which ref the branch was set to before the force push.
git reports it to you when you do the force push:
% git push github master --force
…
Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/foo/repo.git
 + 4d44b63...ad5b147 master -> master (forced update)
   ^         ^
   previous  what you've forced it to
Of course, if you lose the output, then yes, the reflog is the only thing that has it. But force pushing is so rare, and something that is (or should be) done with such care, that I'm puzzled as to how someone "accidentally" force pushed to >150 repos.
If losing this output is such a problem and force pushing is so rare, then Github could email this output with the two hashes to every developer on every force push.
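If you would rather not depend on terminal scrollback, one option is to ask the remote what the branch points at before you force it, and write that down yourself. A minimal sketch in a throwaway sandbox; the repo names and the `force-push.log` file are hypothetical, and `git ls-remote` is the only piece doing real work here.

```shell
#!/bin/sh
# Record the remote's SHA before and after a force push (sandbox only).
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git push -q origin master

# ls-remote prints "<sha><TAB><refname>"; keep the first field.
before=$(git ls-remote origin refs/heads/master | cut -f1)
echo "master was $before" >> "$work/force-push.log"

git reset -q --hard HEAD~1
git push -q --force origin master

after=$(git ls-remote origin refs/heads/master | cut -f1)
echo "master is now $after" >> "$work/force-push.log"
cat "$work/force-push.log"
```

With the old SHA on record, the previous state can be restored later as long as the server has not garbage collected the commit.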
I was assuming an after-the-fact situation, since that information is not saved anywhere client side.
Also I am not sure what he did to push to 150 remotes, and in that case this output would be much more tedious to piece together (and that is assuming he was able to capture all of it).
However, I disagree that force pushing is rare; I find it is a (sharp) tool I use most days (more often on private repos or branches).
I do like the idea of github providing some kind of web interface to view the reflog and reset branches to various points within it.
They didn't absolutely need to, but the alternative would be finding a user with the latest commit in their local git for every single repository, which could be 150 different people, then getting them all to push those changes... it's a bit more involved than just getting a github admin to put the repos back the way they were.
Although the responsible developer's reaction and attitude are both commendable, one element of his response annoyed me: his continued assertion that he should not have been allowed to do this thing that he did.
I think it is a truism that systems which allow users to do interesting and clever things must also allow users to do remarkably stupid and wrong things.
Rather than focusing on how to prevent a user from doing a silly thing like this, I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Unfortunately, all too often when a Bad Thing happens, the discussion tends to center around how to prevent this Bad Thing from happening again. This is often why those who make mistakes end up being demonized; their mistake has now limited the actions that every other user of the system can make, because the system will be modified to prevent other users from making the same mistake.
> I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Agreed. From a usability perspective, undo is always best. Especially undo that the user: 1) knows about, 2) trusts, and 3) is fast. That encourages exploration and lets users fix mistakes.
> their mistake has now limited the actions that every other user of the system can make, because the system will be modified to prevent other users from making the same mistake.
I don't think this is a black and white issue. It's not about denying all users the ability to express X ever again. I think it's more about having systems that can tell if X is an unusual or heavyweight action and say, "Hey, you're about to do X, which impacts a lot of stuff and you've never done before. Are you sure that's what you mean?"
> I don't think this is a black and white issue. It's not about denying all users the ability to express X ever again. I think it's more about having systems that can tell if X is an unusual or heavyweight action and say, "Hey, you're about to do X, which impacts a lot of stuff and you've never done before. Are you sure that's what you mean?"
Yes, but git already does that, by having the user type "--force". Asking would eliminate the possibility of scripting the action.
How far does that go, though? `git push --force --yes --yes-I-am-absolutely-sure --by-pushing-this-commit-I-agree-I-am-paying-attention`. Seems like `--force` is a succinct way to cover those two parts.
That's why I have an easy-to-type password to unlock my local private key, and I set my key manager to never cache the password. It is always nice to have that last chance to review your changes to the VCS.
...not that it would have helped in this case as he was probably using it heedlessly in a jenkins plugin.
From my reading of his post, the one big failure of the system is that he had access he didn't think he had. This is sort of like me typing `find . -print0 | xargs -0 rm -rf` in the root directory and thinking "it's OK because it will only delete files I have permission to delete" but not realizing that I'm root.
I could certainly see the argument that the system should have made him more aware of the permissions he had. But it seems like instead he's arguing that he shouldn't have been given those permissions to begin with, which goes back to my point. Because one person was given many permissions and used them irresponsibly, now everyone's default permissions are more restrictive.
More systems need to implement undo (and scalable undo). Hiding behind "are you sure? (Y/N)" type of prompts and saying "Well, you really should've checked that thrice before hitting Enter" is not good enough.
Git provides all the pieces, we just need to put them together in the right way.
The issue is that a client may not have fetched for a while before performing the force push, so it may not have all the required refs for performing the rollback.
The server of course still has all the refs, so it shouldn't be too hard to add a command (run client side) to invoke the required server-side rollback.
Being able to write custom update hooks also allows you to do a lot of neat things (in this case, an update hook could prevent non-fastforward pushes).
You can already do this with git config, but unfortunately it's a giant hammer that affects the entire repo at once.
It is of course possible to write an update (or pre-receive) hook that denies non-fastforward pushes to specific branches (a post-receive hook runs too late to reject anything), but really, git should include a config option out of the box to do this at a more granular level. A common workflow is the expectation that people should be able to force-push to their own topic branches all they like, but never to the default branch (except in extreme circumstances). Being able to easily set this via a config option would be very helpful in keeping working repos more "safe" from this kind of thing.
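For what it's worth, a per-branch rule like this fits in a few lines of hook. Below is a rough sketch, not a hardened implementation: it uses the `update` hook, which git runs once per ref before the ref is changed, so it can still reject the push. It guards only `refs/heads/master`, and the sandbox around it (temp dir, `origin.git`, `dev`, the `topic` branch) is hypothetical.

```shell
#!/bin/sh
# Sandbox: an update hook that refuses non-fast-forward pushes to master only.
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
mkdir -p origin.git/hooks
cat > origin.git/hooks/update <<'EOF'
#!/bin/sh
# Called by git as: update <refname> <old-sha> <new-sha>
refname=$1
old=$2
new=$3
[ "$refname" = "refs/heads/master" ] || exit 0   # other branches: anything goes
case "$old" in *[!0]*) ;; *) exit 0 ;; esac        # all-zero old SHA: branch creation
case "$new" in *[!0]*) ;; *) echo "deleting master is not allowed" >&2; exit 1 ;; esac
# Fast-forward means the old tip is an ancestor of the new tip.
if git merge-base --is-ancestor "$old" "$new"; then exit 0; fi
echo "non-fast-forward push to master rejected" >&2
exit 1
EOF
chmod +x origin.git/hooks/update

git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git push -q origin master master:topic

# Rewind locally, then try to force both branches.
git reset -q --hard HEAD~1
if git push -q --force origin master:topic 2>/dev/null; then topic_result=accepted; else topic_result=rejected; fi
if git push -q --force origin master 2>/dev/null; then master_result=accepted; else master_result=rejected; fi
echo "topic: $topic_result, master: $master_result"
```

This only helps where you control the server-side hooks, which is exactly what hosted GitHub repos do not give you.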
Ideally personal topic branches, not yet ready for world consumption, would be pushed but to a different repo. This would allow the main repo to have a global "No force pushes!" setting, and would prevent branch name collisions (how many developers have a scratch branch sitting around named 'scratch'?)
That's fine for some projects, but not every project follows that flow. One example: companies using GitHub Enterprise internally generally don't make every regular contributor to a project have their own fork (or at least they shouldn't, as it creates needless extra overhead). In that case, having some extra granularity around denying non-fastforward pushes to some branches but not others would be a helpful addition.
> I think it is a truism that systems which allow users to do interesting and clever things must also allow users to do remarkably stupid and wrong things.
> Rather than focusing on how to prevent a user from doing a silly thing like this, I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Isn't that exactly what a revision control system is for?
In the case of git 'rewriting history' is not really the problem, and it leaves the old history intact (you cannot change git history, but you can create an alternative history).
The old history will possibly be garbage collected eventually, but for quite a while that history is still accessible.
I think the only missing piece is a simple (and obvious) client side 'undo/rollback that force push'.
The issue is that the git client may not have the history pre-force, so it may have insufficient information to do the rollback (at least in the current model), however the server still has all the refs (and they are listed in the reflog).
Yes, the ability to roll back history more easily than paging through `git reflog` would be great (in particular, if you rebase often you end up with a lot of cruft).
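Locally in git you have the HEAD@{n} notation to step back n steps through the reflog, so after a bad rebase a `git reset --hard HEAD@{n}`, with n pointing at the entry just before the rebase started, should be all you need. (A rebase writes several entries to HEAD's reflog, so n is usually larger than 1.)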
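A small sandbox sketch of undoing a rebase this way. Rather than counting entries in HEAD's reflog, it uses the branch's own reflog: HEAD is detached while the rebase runs, so the branch gets a single entry for the whole operation and `topic@{1}` is the tip from just before it. The repo, file names and branch name `topic` are all hypothetical.

```shell
#!/bin/sh
# Sandbox: rebase a topic branch, then undo the rebase via the branch reflog.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"

echo base > base.txt && git add base.txt && git commit -q -m "base"
git checkout -q -b topic
echo topic > topic.txt && git add topic.txt && git commit -q -m "topic work"
before=$(git rev-parse topic)

git checkout -q master
echo more > more.txt && git add more.txt && git commit -q -m "more on master"

git checkout -q topic
git rebase -q master              # topic's commit is rewritten with a new SHA
rebased=$(git rev-parse topic)

git reset -q --hard 'topic@{1}'   # back to the pre-rebase tip
undone=$(git rev-parse topic)
echo "before=$before rebased=$rebased undone=$undone"
```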
Right. So if your revision control system -- your failsafe in the event of unpredicted mistakes -- is configured such that you can lose data permanently, I think it is fair to ask whether anything can be changed to improve how that tool operates. This doesn't mean you can't also ask what else might be done to avoid any unfortunate repetition of the same problem situation, of course.
Nothing was or could have been lost permanently. The commits were all still there, it's just that none of them were in the commit history for any branches anymore. And the entire history including sufficient info to restore the state prior to the `push -f` is all there in the reflog.
It's just a pain in the butt to restore it all, especially if in the meantime people started making new commits on top.
But if someone has commit privileges to a repo, they have the ability to mess it up, I don't see any real way around that. The UI of the client side tool(s) that made it so easy to make this mistake can perhaps be blamed.
Yes, it seems counterintuitive that a version control system should ever let you lose previously committed data. "What's the point of version control, if there's a chance you won't be able to revert?" you might ask.
So at first it seems like we should establish the principle that you can't permanently delete committed data. (Unless you step outside the tools provided by the VCS and do something like hacking the contents of the .git directory.)
But sadly, real-world constraints get in the way of this principle.
Disk space is one such constraint. Sometimes, we need to bring a repo's disk footprint down to a reasonable size. This could happen if we committed a bunch of huge files that we no longer want, or if the repo is simply very old.
Another constraint is intellectual property, secrets, passwords, and such things. Sometimes you check in something that other people shouldn't see. Maybe it's an accident, or maybe you just don't realize the data is sensitive. You need a way to redact the contents of the repo, and that means deleting historical data.
What sort of workflow are you using when you need to use --force every day?
The company I work for has at least 50 projects all using Git and the only time we use --force is when we have to clean up a major mistake (eg a premature merge to release). I can count the number of times this has happened in the past 3 years on my hands.
It's used often in a team where you each have your own forks. It's much less risky in such a case, of course. Basically, I push to my fork, then later I pull-with-rebase from the mainline, then I push again to my fork. Oops, commit hashes have changed because of my pull-with-rebase. So I need to either push to a different branch or force push.
I'm not sure a "problem" exists here. Keeping your personal topic branch rebased on top of the main development branch is a perfectly reasonable workflow, and quite comparable to what you'd do with Mercurial patch queues.
And in a team where everybody's used to that, and force is only ever used on topic branches where that's the expected 'done thing', it really doesn't cause a problem.
(on teams that like rebase I tend to do this a lot)
Why would you do this, rather than pull into a different branch name, fix it there, and then merge it in? Is there an advantage to doing it by altering your own repo?
Yes, but you can pull into a new local branch, do the history edit there, and then push that new branch, can't you? Not that I'm sure it would be worth it.
If you're working in an organization that employs pre-merge code review, you may have to modify your topic branch several times before it's ready to be merged into the main development branch. Creating a new branch every time so that you can push without force seems pointless in this case and increases confusion for everyone involved in the process.
I'm pleased that the reaction in this thread is generally with understanding towards the developer. We've all had facepalm moments like this, and probably not as publicly.
But I don't think Bad Thing Prevention is necessarily to be avoided. Undo buttons are sweet, but not everything can be undone. An anecdote: During the first month of my first job in finance I accidentally deleted a few mappings in a database which put the entire company at risk to the tune of about $45million. Complete bone-head move, I was new to the game. It all turned out fine and I got off with a slap on the wrist, but my manager was flogged. When markets are involved, something like that has no undo button. Instead, we put some proper controls in place (tests, mostly) to make sure said mappings were always logically consistent. As far as I know, that brand of problem has not occurred in the 6 years since.
That said, sometimes Bad Thing Prevention is something that should have been implemented long ago, and was hitherto unrecognized. My canonical example is increased regulations and requirements for entry when an untrained student was seriously injured in a university machine shop; training is something that can be attained, just as "being the one at the top of the organization with force push access" is something that can be attained. There's no loss of abilities, just a more-optimal mapping of abilities to the people who know how to use them.
Given that Git is such an integral part of the workflow, I find it unacceptable that people who can't handle Git well are able to push to the central repo. Git is distributed; if you're using it like it's SVN you're doing it wrong.
And yes, we all make mistakes. But by Git's nature, lo and behold, everybody has a copy.
> his continued assertion that he should not have been allowed to do this thing that he did.
Well, if the repositories were created with "git init --shared" then it wouldn't have been allowed.
I think it's a valid position to believe that this should be the default on github, although obviously in this case, there might be blame-shifting motivation to that position.
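This is easy to check on a self-hosted repo: initializing a shared repository sets `receive.denyNonFastForwards` for you. A minimal sketch; the temp directory and the name `shared.git` are made up.

```shell
#!/bin/sh
# Sandbox: a repo created with --shared refuses non-fast-forward pushes by config.
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare --shared shared.git
deny=$(git --git-dir=shared.git config receive.denyNonFastForwards)
echo "receive.denyNonFastForwards = $deny"
```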
> Rather than focusing on how to prevent a user from doing a silly thing like this, I think a well-designed tool would easily allow the user to undo the silly thing he just did.
Git does make it easy for the user to undo this. It provides the reflog for that. It also provides information in the output of the push command. (However, github does not allow users to access the remote reflog.)
In any case, you can blame github, and you can blame the developer, but at least we should all be clear that you can't blame git.
Well I think one needs both permission control and easy undo options.
His insistence that he should not have had write access does come across as a lame excuse, but he has a point. E.g. take the other poster in the thread who was surprised to find that others have access to his project beyond the half a dozen he knew about: that to me indicates that something isn't quite right permissions-wise.
There is a very simple undo, just force push the branch you just deleted? It's not like force pushing is some terrible destructive thing, it just puts the repo in a state people don't expect.
That's some facepalm right here. That said, I do appreciate the reaction of the developer, taking responsibility for his mistake, starting remedial actions right away and discussing preventive measures for the future.
And ditto the reaction of the community. No aggressive denunciations or vindictive chat - just calm and helpful troubleshooting. +1 for the culture they have created.
Let the programmer who hasn't done something dumb with a Git repo cast the first stone. I know I'm guilty as can be and surely I'm going to transgress in the future. It does kind of show a little bit of a need for a more formal permissions system for Git. Most organizations have something like this in place as policy but that doesn't prevent mistakes.
He means none of them are being jerks - for instance a community of jerks might have responded to the announcement of a pretty inconvenient mistake by abusing the guy who made the mistake.
Nobody's yet commented on... what is it you can do to force push to ~180 different git repos at once?
Normally, it would be difficult to do this even if you tried, no? 180 repos would usually be in 180 different clone directories; what can you do that somehow force pushes them all at once, even ones you didn't mean to be working on at all?
In the thread, the person who made the mistake seems to suggest[1] it has something to do with a 'gerrit replication plug-in'? I am not familiar with gerrit. Anyone?
Mistakes will happen; but for a mistake to have this level of consequence took amplification, apparently by the 'gerrit replication plugin'? I'd be inclined to blame that software, rather than human error (or github permissions or lack thereof! If you have write access to a bunch of repos, of course you can mess them all up, no github permissions will ultimately change that. The curious thing remains how you manage to mess up a couple hundred of them in one click).
> If you have write access to a bunch of repos, of course you can mess them all up, no github permissions will ultimately change that.
This is simply not the case. Force pushing is literally the only operation you can do via git protocols that has a destructive effect on the remote repo. Moreover, it is generally considered a best practice never to force push public branches. Thus, preventing forced pushes on branches that are considered public would, in fact, prevent any sort of real destruction to a repo. People could still push new garbage to it, but they can't make old history inaccessible.
Deleting a branch is also a destructive operation last time I checked. Force push could also be implemented by deleting the branch and pushing a different SHA1 into the same name.
Yes, that's true. Protecting a branch from deletion would be another nice feature and could be tied to denying forced pushes. Both can be construed as the branch being "public".
Disabling force push is a hack. Git does not track branch history; the idea that a branch points to something that can somehow be merged into a branch you have is formalized nowhere. Branches don't have hashes, they're just strings that point to commits.
You make it sound like force pushing is some sort of special operation, but it's actually just the same operation that's made with every commit. A branch is deleted, and then a new branch is made with that same name pointing to a new commit. It's just that with a normal push there's a check to see if there's a path from the previous HEAD to the new HEAD.
And another thing, it's also not directly destructive, the unreferenced commits don't immediately disappear. They just get cleaned up when git decides it's time for garbage removal. You can still check the commits out, if you know their sha.
Maybe a high-profile incident like this will finally convince GitHub to let you disallow force pushing to specific branches in repos – like your project's master branch. I asked about this years ago and iirc was told there were no plans to implement such controls.
Summary: By default Git push used to (they've changed it) push ALL your branches, not just your current one. It is pretty much the first thing I've historically reconfigured whenever starting on any new project. This guy didn't do it, and paid the cost.
No, the issue is force pushing. While it's nice that git changed the default behavior to be far less dangerous, people with push access can still force push – be it one branch at a time in newer gits or all branches in older ones. It's all fine and well to tell them to change their settings and/or ask them please don't force push, but people make mistakes. Hell, I want to prevent myself from accidentally doing this.
Git lets you prevent this easily using the receive.denyNonFastForwards option, but GitHub won't let you control this option. So we're left just telling people not to do it and hoping that they listen and don't do it by accident. That seems like a shitty, precarious situation that could easily be fixed. One checkbox on the GitHub repo settings page and this story would never have happened.
You have to explicitly add the force flag, and if you choose to go down that path, then you better be sure you know what you're doing. If you don't, but still manually set that flag just to get past the error when your push failed? You deserve the rain of hellfire that will come down upon you if you blow away other people's commits.
I know about denyNonFastForwards, and unfortunately I work at a company that turns this on as a rule on ALL branches, which I don't like because I like having personal branches on the server (rather than on my personal git repo). As a result when I want to squash my commits I end up rebasing interactively, deleting the remote branch, and then creating it again.
I guess you've never written an automated tool that works with remote repositories and sometimes has to force them. It's quite possible to accidentally force push a branch that way – and that appears to be precisely what happened here. Sure, he maybe could have been more careful and avoided the mistake, but GitHub could incredibly easily allow people to configure repos so that this mistake can't be made. Your argument boils down to "don't make mistakes", whereas my point is that in this case it would be so easy for GitHub to give repo admins the ability to prevent this entire class of mistakes.
[Note that using an explicit branch name, as has been suggested here, doesn't really help since you can easily accidentally force push a branch that way too.]
The fact that your company has a draconian blanket policy is unfortunate, but has no bearing on whether or not people should be allowed to opt in to setting denyNonFastForwards on specific branches.
Where I work they have a very effective strategy for dealing with this problem: there's a magic 'backup' remote you can push to that is unique to each user and allows --force.
Backing up your local feature branches is no reason to enable rewriting history on the main repository.
After my first bad force push (silly git push default on an unfamiliar machine) I now 'enforce' the `git push remote from_branch:to_branch` syntax on myself when force pushing. I have also adopted the habit of always accompanying any --force with a --dry-run.
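Roughly what that habit looks like, in a throwaway sandbox (all names hypothetical). The dry run prints the same `+ old...new (forced update)` summary line a real force push would, but leaves the remote untouched, which the final `ls-remote` confirms.

```shell
#!/bin/sh
# Sandbox: explicit source:destination refspec plus --dry-run before forcing.
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git push -q origin master
good=$(git rev-parse HEAD)

# Rewind locally, then *preview* the force push with an explicit refspec.
git reset -q --hard HEAD~1
git push --force --dry-run origin master:master

# The remote still points at the original tip.
after_dry=$(git ls-remote origin refs/heads/master | cut -f1)
echo "remote master after dry run: $after_dry"
```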
I totally agree. There are legitimate reasons you want people to force push to a branch / pull request. But there are very few legitimate reasons to force push to master. And in those scenarios it should be tightly restricted and/or the account owner involved.
Christopher Orr on the list suggested the following, which makes a great deal of sense to me:
One thing I sometimes do for repos where I have commit access, but don't want to push, is to clone the repository via HTTP. That way, any accidental attempts at pushing won't use my ssh key, and so GitHub will prompt me for my password.
You can also just override the push URL for any remote, then have a separate remote for pushes. So you'd have like "origin" for pulls only, and "pushorigin" for pushes.
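A minimal sketch of that setup, run in a sandbox. The remote names follow the comment above; the bogus push URL `no-pushing-from-origin` is an arbitrary placeholder, and any string that is not a reachable repository would do.

```shell
#!/bin/sh
# Sandbox: "origin" is fetch-only, pushes must go through "pushorigin".
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"

# Fetches keep using the real URL; pushes via "origin" now point nowhere.
git remote set-url --push origin no-pushing-from-origin
# A second, deliberately named remote carries the real push URL.
git remote add pushorigin "$work/origin.git"
push_url=$(git config remote.origin.pushurl)

git commit -q --allow-empty -m "first"
if git push -q origin master 2>/dev/null; then via_origin=accepted; else via_origin=rejected; fi
if git push -q pushorigin master 2>/dev/null; then via_pushorigin=accepted; else via_pushorigin=rejected; fi
echo "origin: $via_origin, pushorigin: $via_pushorigin"
```

The point is that a push now needs the extra remote name typed out, so a stray `git push` or `git push origin` simply fails.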
Switch to Bitbucket. It is easy to selectively disable history rewriting and branch deletion on a per branch basis. I've been told that Github can also do this, but you have to email them.
Alternatively use git the way it was meant to be used, as a distributed source control system. It appears that they had 678 people sharing access to these repos instead of working off their own forks. I'm amazed they haven't stepped on each other's toes previously with that structure.
> Alternatively use git the way it was meant to be used, as a distributed source control system.
Let's not be disingenuous. This is exactly what every git user does when they run `git clone`. You aren't railing against just multiple committers per public repo, you're railing against multiple clones per public repo. One user swapping between two client systems can make this error. Most public repos of any scale will have multiple committers to spread the maintenance workload.
Even Scott Chacon himself describes a simple, effective process called "GitHub Flow"[1] -- just submitting a PR against a branch for merge against master on the same repo. This is a very common workflow on private repos in my experience, and a very easy workflow for small teams.
Making all this worse, the default config setting for `push.default` will only change to "simple" in Git 2.0 -- until then any new host/account that hasn't had push.default changed in .gitconfig is a bomb waiting for the first otherwise legitimate `git push --force` to go off. I've received the batsignal from folks on small teams who've otherwise done all the right things and still been bitten by this.
Fortunately, git has the reflog[2][3] to save us from this kind of mess. It's just not readily accessible on GitHub, AFAICT.
Update: there's a critical caveat to remote reflogs:
core.logAllRefUpdates must be manually enabled on bare repos!
To repeat: the reflog is not available on bare remotes by default unless core.logAllRefUpdates has been manually set to "true". See [1] for documentation on that config setting.
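To make the caveat concrete, here is a sandbox sketch (hypothetical names throughout) of a bare remote with `core.logAllRefUpdates` switched on. After a force push, the server-side reflog still names the previous tip as `refs/heads/master@{1}`.

```shell
#!/bin/sh
# Sandbox: enable the reflog on a bare repo, force push, read back the old tip.
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
git --git-dir=origin.git config core.logAllRefUpdates true

git init -q dev && cd dev
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"
git config user.email "dev@example.invalid"
git remote add origin "$work/origin.git"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git push -q origin master
good=$(git rev-parse HEAD)

git reset -q --hard HEAD~1
git push -q --force origin master

# Server side: entry @{1} of master's reflog is the pre-force tip.
previous=$(git --git-dir="$work/origin.git" rev-parse 'refs/heads/master@{1}')
git --git-dir="$work/origin.git" reflog show refs/heads/master
echo "pre-force master was $previous"
```

The setting only records updates made after it is enabled, so it has to be in place before the accident.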
Agree 99%, but it's more "use GitHub" than "use git" as it's meant to be used. GH supports certain development modes very well, fork + pull request being the best, but having zero support for ACLs on branches or files makes others dangerous. As the devs in question are finding out (and as I have found out previously on a much much smaller scale, so no schadenfreude here).
(You could argue that stock git doesn't really support fancy permissions models either, but if you have access to all the hooks you can implement them, or you can use gitolite or GitLab.)
I agree that git doesn't strictly enforce a team workflow, but 'pull requests' predated github and many use a pull request based workflow without github. Having a local fork, a remotely accessible fork, and asking the repo maintainer(s) to pull from your remote repo to merge into the release repo doesn't require github and many projects use that structure without using github at all, most prominently, the linux kernel. `git request-pull` is a built in command after all.
Considering that forks are basically heavyweight branches, I don't get your point.... it's distributed just as much whether they merge from another branch or another repository.
I think forks aren't anything like branches in Git. Forks are clones. By your logic, every clone is just a "heavy duty branch" which doesn't compute for me. Ya know?
> By your logic, every clone is just a "heavy duty branch" which doesn't compute for me. Ya know?
That's exactly what I mean. How is this a difficult concept? Branches are lightweight, clones are heavyweight. There's really very little difference except that they can share branches of the same name that point to different refs, I guess.
git can support centralised systems just as well as subversion or any other vcs. You'll note from the thread that, had they been using a real git repository, a quick look at the reflog would have fixed this. Not to mention that disabling forced pushes is available in their enterprise version of github as well as in standard git.
Ah true, it appears from some SO responses that you can contact support and they can restore it. But there appears to be no way to access the server's reflog through github, or is there?
The lead dev for that project just got hired by Facebook (actually, almost all of the core hg team is getting hired by FB), and the feature itself is like 85% complete.
Soon we will have a safe way to collaboratively edit history!
It's such a shame git's pretty much won the DVCS wars - hg has so many neat features and a much cleaner user interface - but it seems that's just no longer relevant, given the difference in adoption.
That's quite a strong statement. Mercurial is less popular, but why would it not be relevant? There are great tools for it, great websites, and it's being very actively developed. Plenty of open source projects use it, and there's little downside to choosing it for your own projects.
Mercurial won't take over the world, and it won't even reach the adoption level of Git, but that doesn't strike me as particularly terrible in any tangible way.
Well, I don't think the VCS matters as much as the community does. And even if hg were clearly better (and git still does have some virtues, so it's not all that cut and dried), then running an open-source project that's not easily open to git users may be worse than missing a VCS with some cool features.
And then there's network effects - it's popular, so it gets built into things like ruby's bundler and now even visual studio. Hg may be nice, but that's an uphill battle, especially over something that's perhaps not even all that important (i.e. even using SVN isn't a development death-knell or anything).
Mercurial may let them continue growing in that way for a while, much like switching to Perforce would let them scale a single repo far beyond what either git or mercurial would tolerate, but it seems to me that they are delaying their problem (and likely digging themselves into quite a hole, depending on the unstated nature of their Mercurial modifications...)
Ultimately, being able to push a single repo into the dozens, hundreds, or even thousands of GB doesn't mean it is a good idea.
Why wouldn't it be a good idea, assuming we had tools capable of scaling indefinitely? We currently split large codebases into "repos" that are effectively isolated even though they're logically connected, and it would often be useful to do, say, atomic commits across them. This seems like a limitation of our current tools, not an intrinsically good thing.
The hypothetical infinitely scaling tool would just inevitably be functionally equivalent to splitting repos. Basically all you would be doing is swapping around terminology. Instead of "here is Foo, our great big git system which stores thousands of repos" you would be saying "here is our Foo repo, it stores thousands of projects". Objections that git cannot be used because a git repo cannot hold as much data as a Perforce repo are more or less based on an improperly strict mapping of terminology ("we have a single perforce repo, so we must have a single git repo").
In other words, those infinitely scaling systems already exist, and they are built on top of git (or hypothetically Hg, though I cannot think of any examples).
For one publicly visible example of the sort of thing that I am talking about, look into how Android development works. Android is 'in git', even though it is too large for a 'single' git repo.
(Note that due to the ways that code and responsibility are typically organized, most organizations are probably in a situation where migrating from a single monolith repo to many git repos would be a conceptually straightforward task, provided that they can break from the conceptual notion of having a 'single' repo. Atomic commits across repos are the primary pain-point, but you would be surprised how much that disappears as you grow used to working with many repos. Supporting a strong notion of versioned dependencies between packages goes a long way.)
Thanks for the pointer to Android, I hadn't looked at how that was organized before.
I agree that atomic commits are a red herring. They're nice to have, but by the time you outgrow a single git repo, you also have projects with different release schedules, and once you have that, you have to deal with version skew anyway, and then you don't really need atomic commits.
I disagree that "those infinitely scaling systems already exist," though. Looking at Android, they had to build a nontrivial wrapper around git to make it work, and it's not totally transparent. You have to think about when to use 'repo' and when to use 'git', and where the boundaries between repos are.
There are huge benefits to having everything be in one giant pile of code and being able to import and sometimes modify code from far away parts of the tree with minimal overhead. The key is to let any directory be a "project" that you can refer to, without any arbitrary distinction between top-level project directories and others. This lets you do things like spin out a part of a project as a semi-independent library without moving files around or creating new repos.
There are some downsides too, of course. Google eventually broke from this model slightly by introducing components, which had issues of their own. Perhaps Facebook has done it better.
So what you're saying is: if you manually manage the versions of your dependencies, your _version_ control system works?
I think that's a shame, and it doesn't work very well either. Mercurial does the same thing incidentally (unless facebook have somehow solved that), so it's no better there.
Splitting repos is a pain; perhaps a necessary one, but hardly ideal. It just introduces a bunch of extra administration, and it reduces the power of your VCS primitives (such as branching and merging).
> I think that's a shame, and it doesn't work very well either. Mercurial does the same thing incidentally (unless facebook have somehow solved that), so it's no better there.
The way I interpreted the reddit post from the Facebook engineer is that they looked into customizing git to scale better for them, but the codebase wasn't to their liking, so they're going to customize mercurial's instead.
> The matter of customizing git came up and people looked at the code and decided it's pretty convoluted when compared to the Mercurial code.
So it's not like mercurial is able to scale in a way that git can't, it's that they plan to make mercurial scale in a way that git can't. (Any over/under on when they give up on that and decide to use Perforce's Git Fusion?)
I remember reading about this. It would really help if Git supported inotify; there is no reason to stat every file on git status/diff/etc. I remember a mailing list thread about this in the past, but I don't think it ever got off the ground. I know there was a lot of discussion on how to support the feature cross-platform.
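To make the cost concrete, here is a minimal sketch of stat-based change detection. It is not git's actual implementation: `snapshot`, `changed_files` and `demo` are hypothetical names, and the "index" is just a dict. The point is only that every check walks and stats the whole tree, whereas an inotify-style watcher would have the kernel report the changed paths.

```python
import os
import tempfile


def snapshot(root):
    """Record (mtime_ns, size) for every file under root, like a tiny index."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat call per file: cost grows with tree size
            index[os.path.relpath(path, root)] = (st.st_mtime_ns, st.st_size)
    return index


def changed_files(root, index):
    """Re-stat the whole tree and report paths whose metadata differs."""
    current = snapshot(root)
    return sorted(
        path
        for path in current.keys() | index.keys()
        if current.get(path) != index.get(path)
    )


def demo():
    """Create three files, modify one, and return what the scan reports."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            with open(os.path.join(root, f"file{i}.txt"), "w") as fh:
                fh.write("original")
        index = snapshot(root)
        with open(os.path.join(root, "file1.txt"), "w") as fh:
            # Different size, so the change is seen whatever the mtime resolution.
            fh.write("modified contents")
        return changed_files(root, index)
```

Even though only one file changed, `changed_files` still had to stat all of them, which is the overhead being complained about on very large trees.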
On a slightly unrelated tangent, the tup build system supports inotify, which I appreciate.
>> "I give a ton of credit to Linus for having created git back in the day."
Back in the day? It seems like git just came out; are just a few years already ancient times? I don't get why they can't use git? Git is used to maintain the entire Linux kernel and they want some system rarely used?
I wish people could make up their minds on what DVCS we should all be using and stick with it for 20 or 30 years.
> "I don't get why they can't use git? Git is used to maintain the entire Linux kernel and they want some system rarely used?"
They are presumably trying to use a single repo for a very large chunk of code. Large, not in the sense that the Linux kernel is large, but a repo one (maybe two or more) orders of magnitude larger. Multiple gigabytes large, with most of those gigabytes taken up by small objects.
Git does indeed start to fall down with these sorts of repos. The thing is, if you keep on growing like that, other VCSs will as well. Even with perforce you'll hit a wall eventually.
The solution is to restructure your project, breaking it into several different repos. Understandably, large organizations capable of creating large repos like this are typically going to be resistant to large changes like this brought on by what they interpret as limitations of tooling. Doing it requires restructuring code, training, and may require extensive modification to internal systems that work with your codebase.
My perspective is that this 'limitation' of git is really a symptom of an anti-pattern that should be corrected sooner, rather than later, before you find yourself in a really painful situation (read: three years later your repo is now 10-100 times as large, and you are discovering that continuing to scale with Perforce is becoming untenable).
Linus introduced Linux in 1992. Linus introduced git several years ago primarily for Linux. git is used to maintain code that's over 2 decades old. My comment seems to me to stand.
You're right, I should have been more precise: They seem to be switching to Mercurial because they think it's easier to customize Mercurial in order to address scaling issues they're having. (And I'd guess that some of those customizations are going to end up in a future Mercurial release.)
Btw, I'm pretty impressed by Facebook's open source efforts.
HgGit is very slow; and it's hard to use as a real git replacement - I suspect, anyhow. I use it from time to time when I need some specific feature (like mercurial's lovely history query language) that's tricky in git, but I've never tried it as a real replacement, so maybe it's workable nevertheless - have you tried using hggit to really collaborate on a repo actively?
I use it as a real Git replacement, every day at the office.
We have small teams and many mostly small repositories, so I am fetching something like <20 changes into each of <5 repositories every day. I have no complaints about performance or correctness. The main pain is interacting with build systems that quite naturally think in terms of Git hashes; rather than
I think of Mercurial being to Git as FreeBSD is to Linux, Greece was to imperial Rome, and Britain is to America. Less populous, less dominant, but providing a source of creativity and a sophisticating influence.
It's mainly a naming confusion with Git branches. HG branches resemble SVN ones - a copy of the working tree performed from the place of branching. HG 'bookmarks' are Git 'branches' - just references to a commit.
Now I'm still not sure I got everything straight in my head:
The HG wiki says "Git, by contrast, has "branches" that are not stored in history, which is useful for working with numerous short-lived feature branches, but makes future auditing impossible."
Git branches are just commits on a separate path in the tree, right? As long as the branch is merged with git merge --no-ff (which can't be set as the default, yes, I know) -- then a git branch becomes a permanent part of the history due to the commit created during the merge. The fast-forward option is just that: optional. It's only useful as a history-rewriting tool, kind of the way squashing (git merge --squash) is useful.
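For anyone following along, the fast-forward versus --no-ff difference can be sketched on a toy commit graph. This is a simplified model, not git's code: `Commit`, `is_ancestor` and `merge` are made-up names for illustration.

```python
import itertools

_ids = itertools.count(1)


class Commit:
    """A node in a toy commit graph; parents point backwards in history."""

    def __init__(self, message, parents=()):
        self.id = next(_ids)
        self.message = message
        self.parents = tuple(parents)


def is_ancestor(older, newer):
    """True if `older` is reachable from `newer` by following parent links."""
    stack, seen = [newer], set()
    while stack:
        commit = stack.pop()
        if commit is older:
            return True
        if commit.id not in seen:
            seen.add(commit.id)
            stack.extend(commit.parents)
    return False


def merge(target_tip, feature_tip, no_ff=False):
    """Return the new tip of the target branch after merging the feature branch."""
    if not no_ff and is_ancestor(target_tip, feature_tip):
        # Fast-forward: the branch pointer simply moves, no commit is created,
        # so nothing in history records that a separate branch existed.
        return feature_tip
    # Otherwise a merge commit with two parents records the branch topology.
    return Commit("Merge feature", parents=(target_tip, feature_tip))


base = Commit("base")
f1 = Commit("feature work 1", [base])
f2 = Commit("feature work 2", [f1])

ff_tip = merge(base, f2)                 # pointer moves to f2
noff_tip = merge(base, f2, no_ff=True)   # new commit with parents (base, f2)
```

In the fast-forward case the resulting history is a straight line; with `no_ff=True` the two-parent merge commit is what keeps the branch visible afterwards.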
I guess I'm asking, what does the HG wiki mean, in concrete terms? What is impossible with git?
Also, to be sure I understood the original comment well, I'll try to sum up the difference between HG branches and bookmarks. Is it correct?
HG branches copy-on-write the entire source tree. This means a HG branch cannot be changed if the history before the branch changes. HG bookmarks (and git branches) can change later if the history changes.
Nothing is impossible, it works like you described. In very simple terms, following Git philosophy rewriting history is a feature, according to HG it's a bug (technically, it is somehow possible AFAIK, but not sure if it's used around for anything except critical fix scenarios). It's just that. Hence the term 'branch' in HG refers to something permanent and 'bookmarks' provide the lightweight functionality. At its roots they have a different approach to managing history (back when I used HG more I didn't understand all the noise around rewriting Git history, obsession with keeping it clean and linear, etc.).
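A rough way to picture the permanent-versus-lightweight distinction, as a simplified model and not Mercurial's actual storage format (`Changeset`, `bookmarks` and `branches_in_history` are names invented for this sketch): a named branch is a label recorded inside each changeset, while a bookmark is a movable pointer kept outside history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Changeset:
    node: str
    parent: Optional["Changeset"]
    branch: str  # the named branch is part of the changeset itself


# Bookmarks live outside history: a name that points at a changeset.
bookmarks = {}

c1 = Changeset("c1", None, "default")
c2 = Changeset("c2", c1, "feature-x")

bookmarks["wip"] = c2   # create a bookmark
bookmarks["wip"] = c1   # move it: no changeset is altered
del bookmarks["wip"]    # delete it: history keeps no trace of it


def branches_in_history(tip):
    """Walk back from `tip` and list the branch label stored in each changeset."""
    names = []
    while tip is not None:
        names.append(tip.branch)
        tip = tip.parent
    return names
```

After the bookmark is gone, `branches_in_history(c2)` still reports the branch labels, which is roughly what the wiki's auditing remark is getting at.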
I can only imagine. Good luck! At least everyone's commending your community on keeping its cool and making the best of it; so hey, that looks professional.
I'm not sure if it contributed to this fiasco, but GitHub's team system is a bit backwards and screwed up. You create teams of people, that automatically have either pull, push/pull, or administrative permissions on every repository they touch. Then, when you go to the 'collaborators' section, you simply add those teams of people, without being able to see what access level you are awarding to that specific repository! It makes no sense. Either the permissions should be RIGHT next to the team name in the drop down or (much better) GitHub should decouple groups from the permissions they have, and permissions should be awarded on a per-repository basis. That way you are granting X group of people Y permissions on Z repository. You are always explicitly granting a permission level to a specific group of people, and no permissions are automatically added that you haven't explicitly signed off on for that repository.
Maybe there's some logic behind it, but I doubt it, and I'm really not sure if GitHub completely thought it through.
We've been using the prevent force push hook for Atlassian's Stash on my team to avoid situations like this. If you really need to force push, you can get around it by deleting the branch and re-creating the branch.
Ouch, that hurts. On the plus side, they are looking into process fixing so it doesn't happen again. Not only that, the flamewar level is zero (at this point). Major props to the Jenkins/Cloudbees people!
Further, because git is a DVCS, so long as someone has a valid recent clone, the pain is much less than, say, a centralized repo being corrupted.
Jenkins is build server software. Its existence is related to the story only because it was a Jenkins developer who did the force push.
'git push --force' pushes your copy of the repository to the remote, regardless of what the remote has. He basically told GitHub, "Hey here's the /real/ copy of the repository, the copy you have is totally wrong" and it overwrote all the history because his copy was a few months out of date.
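As a toy illustration of that (a simplified model with linear history only; `Remote` and its methods are invented for this sketch, not how a git server is written): a normal push is refused unless the new tip descends from the current one, while a forced push moves the branch pointer regardless.

```python
class Remote:
    """Toy remote: a commit graph (child -> parent) plus branch pointers."""

    def __init__(self, parents, refs):
        self.parents = dict(parents)
        self.refs = dict(refs)

    def _is_ancestor(self, older, newer):
        """Follow parent links from `newer` looking for `older`."""
        while newer is not None:
            if newer == older:
                return True
            newer = self.parents.get(newer)
        return False

    def push(self, branch, new_tip, history, force=False):
        """Update `branch` to `new_tip`; `history` is the pusher's commit graph."""
        self.parents.update(history)
        old_tip = self.refs.get(branch)
        if old_tip is not None and not force and not self._is_ancestor(old_tip, new_tip):
            raise ValueError("rejected (non-fast-forward)")
        self.refs[branch] = new_tip
        return old_tip, new_tip  # like the "old...new" pair git prints


# The remote has commits a <- b <- c, with master at c.
remote = Remote({"a": None, "b": "a", "c": "b"}, {"master": "c"})
stale_clone = {"a": None}  # an out-of-date clone that only knows commit a

try:
    remote.push("master", "a", stale_clone)
    rejected = False
except ValueError:
    rejected = True  # the plain push is refused

# The forced push succeeds and silently drops b and c from the branch.
previous, forced = remote.push("master", "a", stale_clone, force=True)

# Knowing the previous tip is what makes recovery possible. In this toy case
# the stale tip is an ancestor of it, so restoring is an ordinary fast-forward.
remote.push("master", previous, {})
```

Note that b and c were never deleted from the toy graph by the forced push; only the branch pointer moved, which is why the previous SHA is the one piece of information worth keeping.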
Force pushes are dangerous but turning them off completely prevents some valid use cases like pruning problematic IP or removing large binaries accidentally committed to a git repo. For those reasons, CollabNet ships with a Gerrit plugin that does not prevent history rewrites but records them in immutable backup refs, sends emails, and logs them properly. Recovery can be done on a self service basis and administrators can still delete those backup refs forever. The plugin also treats branch deletions like that.
More details can be found on http://luksza.org/2012/cool-git-stuff-from-collabnet-potsdam...
An unfortunate accident. Makes me wonder though -- what is the worst a developer with full git commit access can do? (without having access to the actual server)
If someone sets out to mess up a repo ON PURPOSE, he can do force push. Can he also somehow trigger GC to preclude recovery? Do something else?
Sometimes I have had to do a git push --force, usually when I want to amend a commit I pushed just a few minutes before and I know that no one else has used it.
However, I always do a --dry-run first on a force push just to verify that it's only going to update the one branch I expect it to.
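That check can be scripted. The sketch below assumes the output format of `git push --force --dry-run --porcelain`, where each ref line is a flag, `from:to`, and a summary separated by tabs, and `+` marks a forced update. `forced_refs` is a hypothetical helper and `SAMPLE` is illustrative text, not captured output.

```python
def forced_refs(porcelain_output):
    """Return the remote refs that a push would force-update."""
    refs = []
    for line in porcelain_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip the "To <url>" and "Done" lines
        flag, refspec, _summary = parts
        _src, _sep, dst = refspec.partition(":")
        if flag == "+":  # "+" is the flag for a forced update
            refs.append(dst)
    return refs


# Illustrative sample only; real output comes from running git yourself.
SAMPLE = (
    "To https://github.com/foo/repo.git\n"
    "+\trefs/heads/master:refs/heads/master\t4d44b63...ad5b147 (forced update)\n"
    "=\trefs/heads/docs:refs/heads/docs\t[up to date]\n"
    "Done\n"
)

expected = {"refs/heads/master"}
unexpected = [ref for ref in forced_refs(SAMPLE) if ref not in expected]
```

If `unexpected` is non-empty, the real push would rewrite more branches than intended and is worth stopping to look at.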
This kind of situation has been one of my concerns about deploying git internally in a large corporate environment. What steps are appropriate to block this kind of update on a central repo? I don't see a human patch manager scaling.
An easy way to do this is to use the fork/pull request part with a build bot to push commits to the actual master once approved. That way nobody has access to the main repo, all they can do is pull request in, and you can do whatever you want to your own fork.
We often have inexperienced students who get push permission to our central repos. So far they have all behaved reasonably; however, we had one accident.
The student used Eclipse with its git integration. He tried to push, but it failed (non-fast-forward). Apparently, Eclipse tries to be helpful and says something like "If you want to push anyways, use the 'force' checkbox". Not knowing what a "force push" is, the student followed Eclipse's advice.
Thankfully, the distributed nature of git is very helpful. You can just force push again from another repo to undo the force push.
Yes! Eclipse's JGit is awful. I did a project with 9 others inexperienced in Git. In the final crunch phase, they managed to delete commits via force push multiple times a day! JGit also completely messed up the history with merge commits, instead of doing fetch and rebase.
We use Gitolite (http://gitolite.com/gitolite/index.html). It can be configured to block (force) pushes for repositories, branches or persons, among other things.
The number one feature I wish github would add is a setting to prevent force pushing to master, but allow a forced push to all other branches. This would allow us to use a branching model like http://scottchacon.com/2011/08/31/github-flow.html but we would still be able to rebase branches after code review, without the danger of rewriting history on the master branch. You can do something similar with client side hooks, but it would be nice if it was supported server side.
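On a server you control, that policy can be approximated with a pre-receive hook. What follows is a minimal sketch, not a hardened implementation: it assumes SHA-1 object names (40 hex digits), and `PROTECTED`, `rejected_updates` and `main` are names chosen for illustration. Git feeds the hook one `<old> <new> <ref>` line per updated ref on stdin, and a non-zero exit refuses the push.

```python
import subprocess
import sys

ZERO = "0" * 40                     # all-zero id marks creation or deletion
PROTECTED = {"refs/heads/master"}   # assumption: only master is protected


def git_is_ancestor(old, new):
    """Ask git whether `old` is an ancestor of `new` (exit status 0 means yes)."""
    result = subprocess.run(["git", "merge-base", "--is-ancestor", old, new])
    return result.returncode == 0


def rejected_updates(lines, is_ancestor=git_is_ancestor):
    """Return the protected refs that would be deleted or rewritten."""
    rejected = []
    for line in lines:
        if not line.strip():
            continue
        old, new, ref = line.split()
        if ref not in PROTECTED:
            continue  # other branches may still be rebased and force-pushed
        if old == ZERO:
            continue  # creating the branch rewrites nothing
        if new == ZERO or not is_ancestor(old, new):
            rejected.append(ref)  # deletion or non-fast-forward update
    return rejected


def main():
    """Entry point for the hook; a real hook script would call sys.exit(main())."""
    bad = rejected_updates(sys.stdin)
    for ref in bad:
        print(f"refusing non-fast-forward update of {ref}", file=sys.stderr)
    return 1 if bad else 0
```

The ancestry check is passed in as a parameter so the policy can be exercised without a repository; inside a real hook the default `git_is_ancestor` runs against the receiving repo.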
So when do we do the obvious and start running version control on our version control history? Git the git? Revert our VC mistakes as easily as we revert our code ones?
It would be nice if GitHub gave us UI control over force-push behavior.
If you open a ticket they'll manually flip this repository setting on or off for you, but still... Maybe this is the wake up call: it's dangerous and doesn't scale?
Yes, unless others have pushed local commits based on outdated clones in the meantime; then someone will have to rebase and rework (or just blindly merge and repush)
I predict this will happen a LOT more often with git 2.0 changing the default push behavior to pushing all branches. If you've ever --force pushed over a temporary remote branch (e.g. to show something to a coworker), 2.0 will run the risk of overriding everything. Hooray!
But these were lots of repos and not just forks, by the look of it - If that's the case, I'm guessing some scripting was involved, so the defaults wouldn't have mattered.