It seems very unwise to have essentially the manager of your whole deployment process in the hands of an entity which you not only have no control over, but which has regularly been targeted in attacks by nation-state-level actors because of its role as a code hosting platform.
EDIT: GH hasn't been targeted regularly, but it has been historically, so this is a plausible thing that might happen again.
We use GitHub heavily at my work, and at past jobs as well. But at the same time it's not the ONLY way we can work. If GitHub has an outage, our CD will shut down, but we can still run all tests locally, we can still push to the server directly, and we can still push code between ourselves manually and review it.
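In a pinch, plain git over SSH covers the "push code between ourselves" part. A minimal sketch, assuming you can reach a coworker's machine over SSH (the hostname and path here are made up):
# Add a coworker's working copy as a remote
$ git remote add alice ssh://alice-laptop/home/alice/src/ourproject
# Fetch their branches and review them locally
$ git fetch alice
$ git log -p alice/feature-x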
Sure, it's a hiccup in our day when it goes down, but it's not like the entire company grinds to a halt. And the alternative of maintaining servers and systems to replicate all of that would take significantly more time, potentially cost more, and is probably more likely to go down.
People often say this, but I wonder if it's true. Your self-hosted setup wouldn't be handling the same load as GitHub itself, so perhaps the issues that GitHub experiences would rarely happen to a self-hosted version.
We also have a full copy on our CI server (which is hosted on another service, so still not "in-house"), and we have a copy of at least the master branch (and all its history) on our production boxes (which also gets pulled into our whole backup system there).
In a disaster recovery scenario, that's more than enough for me.
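(For anyone who wants the same safety net: a bare mirror plus a cron'd fetch is basically all that CI-server copy amounts to. The repo name and backup path below are just examples.)
# One-time: take a full mirror of every branch, tag and ref
$ git clone --mirror git@github.com:ourorg/ourrepo.git /backups/ourrepo.git
# From cron: keep the mirror current
$ cd /backups/ourrepo.git && git remote update --prune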
Sure, if github blinked out of existence we would probably be at a fraction of our normal productivity for a while until we fully recover everything and find new workflows, but the risk vs reward there is well within the margins of what I'd consider acceptable for a company like us.
That's not really enough. You need more than "all the code is around here somewhere"; you need a plan with specific steps that have been tested.
Furthermore, the problem with these disaster scenarios is that there are much more dangerous possibilities than your account being deleted. Someone with admin access could insert a back door or sell your source code to someone else. That's honestly scarier.
It's like you said: smaller shops probably think it's over-planning and overkill, but we do indeed have hundreds of projects that are "mission critical" but might not have been touched in over a year.
For me it's actually the opposite, except for one company that had only an internal SCM and no cloud stuff.
I've used AWS for 10 years, and in the last 5 I've never seen it just go down randomly. Even if it did, you have the room to redeploy with a few clicks (assuming you back your data up regularly) instead of waiting on something you can't control.
You seem to overestimate the ops work a bit.
Where you host it matters less, because you can simply take your backup and server scripts and run them at a different provider any time you want.
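In the simplest case that's little more than the following, with backup.tar.gz and setup-server.sh standing in for whatever your actual backup and provisioning script happen to be:
# Copy the backup to the new provider's box and provision it
$ scp backup.tar.gz newhost:
$ ssh newhost 'tar xzf backup.tar.gz && ./setup-server.sh'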
Welcome to the SaaS world, where you offload institutional knowledge and hiring to a 3rd party and roll the dice.
Same reasons we don't run our application and supporting infrastructure in a cloud provider, really. Apart from the complexity of our particular network, which has a heavy dependency on mobile and satellite links, we cannot guarantee our customers our availability if we don't control the compute, storage, SAN fabric, and as much of the network infrastructure as possible.
Also, while they haven't updated this blog post for a while, their status page has been very up-to-date and informative: https://status.github.com/messages
Is that satire? It said 2-hour ETA 5 hours ago and the last update was over two hours ago.
>12:56 British Summer Time
>The majority of restore processes have completed. We anticipate all data stores will be fully consistent within the next hour.
I wonder what the total cost of this ordeal must be. Surely in the tens of millions.
"We are validating the consistency of information across all data stores. Webhooks and Pages builds remain paused."
Which is a bit scary. Half my requests appear to hit some storage which is still many hours behind.
They should be seeing that...
"Try adding a remote called "github" instead:
$ git remote add github
# push master to github
$ git push github master
# Push my-branch to github and set it to track github/my-branch
$ git push -u github my-branch
# Make some existing branch track github instead of origin
$ git branch --set-upstream other-branch github/other-branch"
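(If you go this route, `git remote -v` confirms both remotes are set up, and `git branch -vv` shows which upstream each branch tracks.)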
Every company faces problems like this. GitLab infamously lost their entire production database -- and I think we can agree that's more serious. Knee-jerk reactions to incidents will leave you without any "trusted" services, because mistakes happen to everybody at some point. Bitbucket has certainly had its own fair share of downtime.
What?? The link you posted says
> Database data such as projects, issues, snippets, etc. created between January 31st 17:20 UTC and 23:30 UTC has been lost. Git repositories and Wikis were not removed as they are stored separately.
> It's hard to estimate how much data has been lost exactly, but we estimate we have lost at least 5000 projects, 5000 comments, and roughly 700 users. This only affected users of GitLab.com, self-hosted instances or GitHost instances were not affected.
How is that "their entire production database"? You make it sound so much worse than it was. While it was a horrible incident, they did not lose their whole production database.
That was the final, last-ditch backup too; something like 5 out of 6 of the planned backup mechanisms weren't actually working and nobody realised.
So you're right, they didn't lose it, but they came pretty damn close!
As a reference to the above poster.
You do realize that every cloned tree can be a git "repo", regardless of the machine it's on, right? GH and GL surround the repo with some other things (bug reports etc) to rope you in but if you are cloning from one to the other you already aren't migrating that stuff as well, so it's not really clear what additional value that provides.
Is there something I'm not seeing?
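For what it's worth, moving the repo itself from one to the other really is just a mirror clone and push (the project paths below are made up):
$ git clone --mirror git@github.com:someuser/project.git
$ cd project.git
$ git push --mirror git@gitlab.com:someuser/project.git
Issues, PRs, wiki and so on are exactly the part that doesn't come along.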
If the new data is not presented, users will typically retry, which may result in duplicated new content.
It may not be in GitHub's best interest to describe much more than that because which database, which tables, and how they were partitioned can be secret performance sauce.
Postgres partitioning: https://www.postgresql.org/docs/9.1/static/ddl-partitioning....
Redis partitioning: https://redis.io/topics/partitioning
For similar reasons that you suppose they "don't use proper distributed algorithms" (which seems an overly harsh way to put it), I simply presumed the logical entity in question is still a "database", even in the case of a network partition problem. I don't think it was either Postgres or Redis in this case, but they are simple enough examples to illustrate the overall problem.
Either way, my point still seems to stand: we aren't likely to get more information about the specifics of the partition failure, because both the logical layer (which "database") and the physical layer (which datacenters/"network") are internal details that GitHub may not be able to describe publicly much beyond what they already have.
Yup. You notice when they're down. Whereas when your self-hosted git server goes offline for a couple of hours, nobody else notices.
It's a fair bet that developers are more likely to notice and complain about software services being down on the internet.
Or, to put it bluntly, if your CI works like this, it's contributing to climate change.
You're ignoring half of the problem. If you don't receive events from GitHub because they're down, your CI doesn't work either -- dependency caching doesn't matter at that point.
I can recall only 2 incidents this year. I think that's not too bad considering the level of traffic they have to contend with.
Still not bad, but 2 incidents like this a year is usually considered unacceptable for infrastructure service providers.
More often than what is considered a standard 99.99% uptime SLA? (about an hour per year.)
You seem to be making it out like a couple of days a year of lost developer productivity is no big deal.
That said, these things happen and you should probably check your workflows if you're all that blocked by GitHub being down.
GitHub SLA is 99.95% and apparently exclusive to Business Cloud customers.
So on top of your usual problems with keeping a cloud service up and running, you also have that git IO problem to contend with. And to rub salt in the wound, that wrinkle also makes it difficult to fully adopt many "standard" cloud architectures or vendors (such as AWS) which work for non-IO-heavy applications: you always have this major part of your infrastructure with a special requirement holding you back at least partially (and that can hurt your availability even for related services which are not IO-heavy).
(That said, it's hard to guess whether that was the problem, a contributing factor, or unrelated entirely based on the details provided here.)
source: I work at Atlassian (though not on the Bitbucket team) and occasionally chat to current and former Bitbucket devs on this topic.
But really, do they have that many problems?
It’s easy to change the remote origin of a git repository. It’s not hard to migrate a project to Gitlab etc, and duplicate all the technical features of Github.
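Concretely, it's something like this, assuming the GitLab project already exists at the made-up path below:
$ git remote set-url origin git@gitlab.com:ourorg/ourproject.git
$ git push -u origin master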
What’s really hard is replacing the social graph. If you’re a large project with a lot of contributors, onboarding everyone is not going to be easy. About as easy as convincing all your friends to stop using Facebook.
Basically the only things I was going to do on Github today :)
After one or two wrong guesses it just shows that you’re making it up.
However, it's still intermittently failing as of 14 UTC, so I haven't managed a Maven release build yet.
As I initialized it with a readme and an ignore file, I had to clone it. Cloning only succeeded by doing `watch git clone` and waiting a few minutes. But it worked.
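(For anyone else stuck at that step: a retry loop is a bit more direct than `watch`, since it stops at the first success. The repo path is just an example.)
# Retry the clone every 30 seconds until it succeeds
$ until git clone git@github.com:ourorg/ourrepo.git; do sleep 30; done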
Netlify has caused me to stop using GitHub Pages and between the clownshoes outage reports and the security issue I am now a GitLab user.
This is GitHub’s jump the shark episode. :(
https://gitlab.com/explore/projects is a live feed of project activity, suitable for code surfing. It also sorts by stars and trending.
GitHub reports 28,337,706 users as of 2018-06-05. Let's assume 50% of these are active. Let's also assume that, due to the unavailability of GH, around 2 usable hours per developer are lost, and that each developer contributes around 50 US$ per hour.
This means this outage has cost us users: 28,337,706 × 0.5 × 2 × 50 ≈ 1.42 billion US$.
Perhaps not use MySQL for such critical systems?
My team and I pushed tons of code to origin today (JST btw), but we were almost at a standstill as far as merging to master and closing out branches. GitHub being down had a huge effect on the process -- review, merging and CI success. Merging was our big one, since we have protected branches via GitHub -- master; CI was second, since we received no webhook events. So either we wait it out (~8 hours) or we throw out our process and do something different until they fix it. The latter didn't seem like a reasonable course of action.
github != git
It's easy to say, "well, they're down, that doesn't affect git", but the reality is that a lot of orgs don't just use git. They use GitHub. That fact envelops a lot of process, routine, infra, schedule, money, etc. Luckily, it's only been one day. But developer time has definitely been lost. You can't reasonably say otherwise if you use GitHub.
edit: I will say I don't fully agree with the GP. Blaming the downtime on MySQL is silly, and coming up with dollars lost is over the top. Things happen. That comment was a weird attack on a reasonably stable database. GitHub had a bad day; it happens.
Your comment disappeared. I think you had a good point. And it'd be cool if it could be a real thing, but...
I don't disagree. But our team realized that we rely almost too much on GitHub; we just decided to put up with it. Is there a solution that doesn't depend on running your own "GitHub"? My infra lead and I had the usual, fun tongue-in-cheek chat that began with... them: "maybe it's time we switch to GitLab", me: "can we be up tomorrow?" We've had that same discussion many times before.
In the end, it comes down to process. If you buy into the features, PRs, CI hooks, etc., then it's really hard to just say "well, we can maintain and replicate the alternative for the 0.1% edge case". Otherwise, you might as well just use that and not GitHub, GitLab, etc. It's hard to decouple from GitHub. They do that by nature -- dev still continues; they are a piece of the process puzzle. I think abstracting them away just complicates things unnecessarily.
May I ask what makes your company feel secure about using cloud-hosted solutions? I mean, can't a disgruntled employee easily clone the git repo to his own GitHub account? I suppose they could do that anyway, just by copying the repo somewhere, but having an entire company's secret code in the cloud just seems to remove too many of the barriers protecting it.
I'd estimate a couple of orders of magnitude less...
I'd guess the average loss at 10 minutes (most of it being "huh, wonder what's up with github").
Unfortunately, it's easier to downvote than to come up with a better estimate of the total cost of a 13-hour GitHub downtime.
Doesn't happen in practice and usually the whole thing is just blamed on "process" with subsequent "process changes".
I helped manage a hosted DVCS and CI system in my previous job, and do you know what we would've called an outage that happened at 7:00PM and had recovered partially before start of business?
Wait for the RCA to come out before throwing any stones.