GitHub Major Service Outage
204 points by DeepWinter on May 31, 2017 | 81 comments
See https://status.github.com



A bit weird. GitHub says they fixed it, but CircleCI still considers it an outage:

> Monitoring

> May 31, 2017 3:08 PM

> GitHub have declared the outage resolved and we are starting to see incoming GitHub hooks. Builds are being triggered again. However we are still seeing failures with the GitHub API. This continues to prevent our webapp from fetching data from GitHub. We are monitoring the situation and will ensure sufficient capacity for when their service resumes normal operations.


Maybe fixing the problem is different than recovering? Like stopping blood loss vs slowly replacing the blood.


Good thinking on GitHub's part not using github.com/github/status to host the content of status.github.com. Amazon, take notice.


Nitpick: They would still fail for DNS issues with *.github.com, so a domain like githubstatus.com would be even more resilient.


Nitpick: www.githubstatus.com is more flexible, potentially more resilient.


Nitpick: www.statushub.com is less obviously related, and therefore won't be attacked in tandem. If they want to go all the way, maybe just something like www.wesellnikesdiscount.com.


Were they doing this before amazon had that major outage?


Yes. They've been hosting it outside of their production infrastructure for several years.


Seeing an interesting thing where my github issue comments just posted are apparently posted a short time in the future. http://imgur.com/a/eQSc9

Doesn't seem to break anything, but it is a bit curious. May not be new though...I just happened to notice it today.


I've seen that, but I've chalked it up to clock skew on the client. It only seems to happen on one of my machines, and only after a couple months of uptime.


Ah, yes. I hadn't looked at it, but now I see that fixed date/times are in the html source, and the "x hours ago" messages are rendered in the browser.

Fairly compact implementation too: https://gist.github.com/anonymous/33710bd9c7175a645dd0d72d1a...


I've noticed this behavior with a lot of services.

I can only chalk it up to something like clock drift between the processing node and the database server.

Irritatingly I can't remember which site it was but I posted something somewhere a couple days ago and immediately after hitting enter the site marked what I'd said as submitted "a few seconds from now". I never fail to be amused that the fuzzy time library being used has code specifically designed to handle this edge case scenario. :D


I don't think it's really an edge case. Probably one of the main uses, actually.

Sure, if you use it to show comment age, you shouldn't ever see it, but I'm sure they fully support using it for countdowns, too.

EDIT: it's the 4th example under relative time for Moment.js (https://momentjs.com/).


I can totally agree about relative timestamping in both directions (past+future) - my argument is more about the UX of situations where you're canonically referring to a past event.

So as not to spam with my reply to a similar comment, I'll link it: https://news.ycombinator.com/item?id=14452335


Git timestamps can be set with a different timezone, so technically GitHub can analyze the timestamp (which they do) and compare it with the actual clock time for that timezone. There's only a small problem though: the latency to the clock-check server must be low, otherwise we would be 1 second ahead by the time we get a response, and then another second or two after the webpage renders.

So from a UX standpoint we should simply discount a time drift of +-5 seconds and just say "blah blah just now"; anything larger might warrant a warning to the author (GitHub could reject the commit and ask for confirmation). Re-editing a commit just to change the timestamp is a hassle, though, so it's more of a "please make sure your computer's time is synced next time."


Wow, interesting. This info is definitely filed away, thanks.

I don't quite remember but I think I might've been commenting on something on GitHub when I saw the time glitch.

Initially for a moment I thought "why not just have OCD local NTP tracking?" but then I realized that time glitching around (even at the millisecond level) can be disastrous. One way to solve this is to obsess about keeping up to date with NTP, but instead of updating the time, update a global reference to the offset. Then your time server is simply (system date)+-(saved offset), which should be super fast. And of course this can run on the node generating the HTML.
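
The "save the offset instead of stepping the clock" idea could be sketched roughly like this. It is only a sketch under assumptions: `OffsetClock` and `update_offset` are hypothetical names, and the reference time is assumed to come from some trusted external source (an NTP query, for example) that is not shown here.

```python
import time


class OffsetClock:
    """Keeps a correction offset instead of changing the system clock."""

    def __init__(self, system_clock=time.time):
        self._system_clock = system_clock  # injectable so it can be faked in tests
        self._offset = 0.0                 # seconds to add to the local time

    def update_offset(self, reference_time):
        """Record how far the local clock is from a trusted reference time."""
        self._offset = reference_time - self._system_clock()

    def now(self):
        """Corrected time: (system date) + (saved offset)."""
        return self._system_clock() + self._offset


# Usage with a fake local clock that is 2.5 seconds behind the reference.
clock = OffsetClock(system_clock=lambda: 1000.0)
clock.update_offset(reference_time=1002.5)
print(clock.now())  # 1002.5
```

Reading the time is then one addition, so it stays cheap enough to run on the node generating the HTML.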


If you look at "man git-commit-tree"

       While parent object ids are provided on the command line, author and committer information is taken from the following environment variables, if set:

           GIT_AUTHOR_NAME
           GIT_AUTHOR_EMAIL
           GIT_AUTHOR_DATE
           GIT_COMMITTER_NAME
           GIT_COMMITTER_EMAIL
           GIT_COMMITTER_DATE

       (nb "<", ">" and "\n"s are stripped)

       In case (some of) these environment variables are not set, the information is taken from the configuration items user.name and user.email, or, if not present, the environment variable EMAIL, or, if that is
       not set, system user name and the hostname used for outgoing mail (taken from /etc/mailname and falling back to the fully qualified hostname when that file does not exist).

       A commit comment is read from stdin. If a changelog entry is not provided via "<" redirection, git commit-tree will just wait for one to be entered and terminated with ^D.

    DATE FORMATS
       The GIT_AUTHOR_DATE, GIT_COMMITTER_DATE environment variables support the following date formats:

       Git internal format
           It is <unix timestamp> <time zone offset>, where <unix timestamp> is the number of seconds since the UNIX epoch.  <time zone offset> is a positive or negative offset from UTC. For example CET (which is
           2 hours ahead UTC) is +0200.

       RFC 2822
           The standard email format as described by RFC 2822, for example Thu, 07 Apr 2005 22:13:13 +0200.

       ISO 8601
           Time and date specified by the ISO 8601 standard, for example 2005-04-07T22:13:13. The parser accepts a space instead of the T character as well.

               Note
               In addition, the date part is accepted in the following formats: YYYY.MM.DD, MM/DD/YYYY and DD.MM.YYYY.

Edit: see this repo and screenshots:

* https://github.com/yeukhon/demos/commits/master/git-date

* https://github.com/yeukhon/demos/blame/a9fc9dfe6d35c5ffe14af...

* https://github.com/yeukhon/demos/commits/a9fc9dfe6d35c5ffe14...

You can see I got "3 minutes ago" (using local time) and then "4 hours ago", because I set the timestamp manually in the most recent commits. So yes, you can set the time in the future or the past.
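
To make the "Git internal format" from the man page excerpt concrete, here is a small sketch that parses a `<unix timestamp> <time zone offset>` string. `parse_git_internal_date` is a hypothetical helper written for illustration, not part of git or any library, and it assumes a well-formed `+HHMM` / `-HHMM` offset.

```python
from datetime import datetime, timedelta, timezone


def parse_git_internal_date(value):
    """Parse '<unix timestamp> <time zone offset>' into an aware datetime."""
    seconds_text, offset_text = value.split()
    sign = -1 if offset_text[0] == "-" else 1
    digits = offset_text.lstrip("+-")
    offset = sign * timedelta(hours=int(digits[:2]), minutes=int(digits[2:4]))
    return datetime.fromtimestamp(int(seconds_text), tz=timezone(offset))


# The same instant as the man page's RFC 2822 example,
# "Thu, 07 Apr 2005 22:13:13 +0200", written in the internal format.
print(parse_git_internal_date("1112904793 +0200").isoformat())
# 2005-04-07T22:13:13+02:00
```

A string in this form is what GIT_AUTHOR_DATE or GIT_COMMITTER_DATE would carry, which is why a commit can claim any time, past or future.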


Very interesting. Thanks for the demo and references!


Message queues and eventual consistency. Unless your request requires something "atomicy", 200/201 response should be a sign of "got the message, will get to work on it when we can".


201 is "request has been fulfilled, resource has been created". It is explicitly not "we will get to it when we can". You are thinking of 202, "request has been accepted for processing", for asynchronous request processing.
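
The distinction can be sketched in a few lines. This is a toy under assumptions, not any real framework's API: `create_resource`, `pending_jobs`, and `resources` are made-up names standing in for a handler, a work queue, and a data store.

```python
from http import HTTPStatus
from queue import Queue

pending_jobs = Queue()  # hypothetical in-memory work queue
resources = {}          # hypothetical in-memory data store


def create_resource(resource_id, payload, asynchronous):
    """Return the status code that matches what actually happened."""
    if asynchronous:
        # The work is only queued, so the honest answer is 202 Accepted.
        pending_jobs.put((resource_id, payload))
        return HTTPStatus.ACCEPTED
    # The resource exists by the time we respond, so 201 Created applies.
    resources[resource_id] = payload
    return HTTPStatus.CREATED


print(int(create_resource("a", {"title": "now"}, asynchronous=False)))   # 201
print(int(create_resource("b", {"title": "later"}, asynchronous=True)))  # 202
```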


I assure you, 201 is used all the time to say something has been done when it's only been queued, regardless of what the RFC says. This is based on real world integration experience.

HTTP response status codes rarely align with RFC guidelines.


A lot of people back up to tape and never test the tapes, too. Doesn't mean bad practice should be expected.


Disagree. Plan defensively.


Actually, most of the "relative time" JS libraries can be used for countdown as well ("this event is scheduled in two hours"). Accidentally getting a timestamp that's supposed to be in the past shouldn't break it :)


Very true.

I just think there should be an intent argument specifiable to the library to indicate that the duration in question refers to an event that has happened in the past.

In such a scenario, the library should mark the duration as happening "just now" and possibly flag a warning or raise an exception.

The reason I say this is that, humanly speaking, "Your message was sent 23 seconds from now" is amusing at best to developers who know what's happening (negative time delta) and linguistically confusing to general users ("was sent" vs "from now").
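
A rough sketch of what such an intent argument might look like. This is not Moment.js's API or any real library's; `humanize_delta` and `past_only` are invented for illustration, and this version only clamps to "just now" rather than also warning or raising.

```python
def humanize_delta(seconds_from_now, past_only=False):
    """Render a signed offset in seconds as a fuzzy relative time.

    Negative values are in the past, positive values are in the future.
    With past_only=True, a positive value (most likely clock skew) is
    reported as "just now" instead of "... from now".
    """
    if past_only and seconds_from_now > 0:
        return "just now"
    magnitude = abs(int(seconds_from_now))
    if magnitude < 5:
        return "just now"
    # Pick the largest unit that fits; "second" always matches here.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60), ("second", 1)):
        if magnitude >= size:
            count = magnitude // size
            label = f"{count} {unit}{'' if count == 1 else 's'}"
            break
    return f"{label} ago" if seconds_from_now < 0 else f"{label} from now"


print(humanize_delta(23))                  # 23 seconds from now
print(humanize_delta(23, past_only=True))  # just now
print(humanize_delta(-7200))               # 2 hours ago
```

Countdown callers leave `past_only` off and keep the future wording; comment-age callers turn it on and never show "from now".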


As others have said, Github's postmortems are always great.

But frankly, I'd rather they have better uptime. Every couple months is too much. I pay them. My work pays them.

If their CEO is serious about zero downtime, how about he offers his paying customers a credit for time they cannot access the service?


> I pay them. My work pays them.

Hmm. You pay them to uphold a contract. What does that contract say about SLAs and availability? Probably the same as the TOS that I agreed to when paying and those specifically say:

    GitHub does not warrant that the Service will meet your requirements; 
    that the Service will be uninterrupted, timely, secure, or error-free; 
    that the information provided through the Service is accurate, reliable or 
    correct; that any defects or errors will be corrected; that the Service will 
    be available at any particular time or location; or that the Service is free 
    of viruses or other harmful components. You assume full responsibility and 
    risk of loss resulting from your downloading and/or use of files, 
    information, content or other material obtained from the Service.

If you negotiate, you might get better terms and guarantees, for example with GitHub Enterprise. You might also have to pay substantially more for those.

I understand, it sucks when github is down. But we all get what we pay for and we all don't want to pay for more. And yes, I do have clients that meticulously mirror all their dependencies from outside sources and spend significant money on this - money that pays off in exactly these situations.


Huh? You don't have to use Github Enterprise (self-hosted) to get an SLA. Github Business, which is hosted on github.com has a 99.95% uptime SLA: https://github.com/pricing

An upgrade from Team to Business is "only" a 2.3x price bump per dev. I have no experience with this though, my team is still on the Team plan and thus suffered from the outage today.


Huh indeed. 99.95% uptime -- AKA: three and a half nines. My quick math tells me that 99.95% uptime equates to a downtime of ~4:23/yr. If github is down for an hour once every few months, I'd say they're likely well within their stated SLA.
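
For anyone who wants to check that quick math, a small sketch (the function name is made up, and it assumes a 365-day year):

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime an availability target permits over a period."""
    return period_hours * (1 - availability_percent / 100)


hours = allowed_downtime_hours(99.95)
whole_hours = int(hours)
minutes = round((hours - whole_hours) * 60)
print(f"{whole_hours}h {minutes}m per year")  # 4h 23m per year
```

So an hour of downtime every few months stays under the yearly budget, as long as the SLA is measured over the whole year rather than per month.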


They are? https://status.github.com/messages/2017-01-18 has a bunch of major service outages and no link to any post-mortem.

The vague rumour always seems to be 'DDoS attack I guess' but there's very little in the way of formal reporting as far as I can tell...


My problem with a credit is that it never even comes close to what I'm losing in income. An ISP is an excellent example. I might get a $10 credit for 24 hours of downtime. I'm charging slightly more per hour than that... /s

Maybe switch to bitbucket or other competition for a while?


If the price of the service working (or not working) is disproportionately large compared to the price of your lost business, that's a problem at your end: you needed to calculate the risk vs return for redundancy.

Eg, if your Internet costs $100/mo, but you'd lose $100/hour when it's down during business hours, buy a fallback connection from a competing ISP. ;)


> a competing ISP

Wow! That actually exists in some places? ;-)

Infrastructure so often becomes a monopoly. I can't pay a competing bridge service to drive to work quicker, I can't pay a competing gas company to deliver gas via different pipelines to my house. And I can't pay a competing electric company that uses different wires.

I actually am lucky enough to live in a city where there are many competing high speed ISPs. But guess what? I've paid for fallback connections in the past and when one goes down, the other goes down, so I go out to lunch and see the guys working on the wires in the cabinet down the street. The wires that both my ISPs share. I suppose I could get a satellite ISP? That latency. True redundancy for infrastructure is actually very expensive in most cases.


switched to gitlab a few weeks ago, didn't notice the outage until I saw this thread.


I use VSTS, it's actually improved much over the years!


Gitea's pretty good if you just need a GitHub-like experience (eg good UX) git repo host:

https://gitea.io




Funny thing is Github sometimes makes more sales after an outage because clients want to upgrade to the enterprise edition to host on their own servers.


A little funny, like a Silicon Valley episode: I saw the news yesterday of the GitHub CEO saying "our goal is zero downtime", and now it's down.


Perhaps an issue with punctuation?

Goal: zero downtime.

vs

Goal zero: downtime.


  Works on contingency?
  No, money down!


Zero downtime is always a goal and never achieved for any complex service. They literally all go down sooner or later.


You sure it isn't zero downtime deployment? But I thought Github runs infrastructure globally? I remember some outages were caused by DDoS, and some were software bugs / bad config.

Probably a good idea to do rolling deployments. I would be surprised if they don't, given the kind of top engineering team they are running.


Business idea: github hosting failover. You'd probably need a modified git client, but if you can't push/pull/whatever from github, it transparently fails over to your service which will sync up with github once they've recovered.

Even better idea: github should stop failing.


When expanded to see the monthly trend it shows 99.6% availability.

Serious question: are there enough people who would pay for that 0.4% to support a business?


GitLab has both push and pull mirroring; I wonder if it would be possible to use them together to accomplish this.


A couple times I have configured multiple "remotes" for my local git repo and pushed to both, e.g. GitHub + Google Cloud Source Repository, or even just a bare repo on a VPS.


Why not just use Github Enterprise then?


As of right now...

On the one hand, I see "Everything operating normally." at the top in green, and no flags or alerts.

On the other hand, the charts look good, but "App server availability" looks interesting, the right edge of the chart is pretty much at 0%.


MEAN WEB RESPONSE TIME - 262ms

98TH PERC. WEB RESPONSE TIME - 1134ms

4.3x?


And the 98th percentile is still faster than the 50th percentile of GitLab...


Apparently it's resolved. I'd like to read their postmortem on it. They write those extremely well


github daily availability history:

https://apps.axibase.com/chartlab/25f38b08/2/


Looks like their cdn is having problems as well now, seeing timeouts when trying to download archives.


Was getting random downtime when I tried opening a certain project and its wiki ~1 hour ago. Nothing big, and a couple of refreshes fixed it; just a minor annoyance.


The outage seems to be resolved as of 8:58 EDT


Yep. Now I wonder what the issue was. Ghost in the shell?


web looks fine but repositories are not so responsive


feel free to make a better one


We detached this subthread from https://news.ycombinator.com/item?id=14452709 and marked it off-topic.


This is the biggest cop-out of a reply. I hate it. OP has already stated:

>I pay them. My work pays them.

Github is raking in oodles of cash and they STILL can't keep their service up without going down, quoted from OP, "[e]very couple months".

It's not about "making a better one", nor is it about paying for the fancier/premium features; it's about the uninterrupted service, which Github keeps failing to provide.


I fail to see what the problem is. If you don't like their service, then switch to a competitor, or just set up your own git server. It's like this for any vendor: if you don't like the product or service you're getting, you can either bitch and complain endlessly, or you can look for alternatives. One of these choices is more productive than the other.


I think hindsight is 20/20. They obviously did not think they needed to find a competing service; nor build one as they were paying for one. Given that they have been inconvenienced by downtime they may switch or make their own.

> Feel free to make a better one

The GitLab team did that already so I use their service. ;)


It's called GitLab. Not 100% uptime but better (and constantly improving)


It will take a long time before GitLab can even begin to regain my trust from their missing-backups outage.

Problems are to be expected. But as great as it is that they had multiple levels of backup, none of them worked. They hadn't even been tested.


We have a pretty detailed post-mortem on the timeline of events for the outage, as well as actions taken after it to prevent anything similar in the future (with links to related issues). You can check it out in https://about.gitlab.com/2017/02/10/postmortem-of-database-o...


They actually gained my trust from that outage.


If you paid for GitLab there was no missing-backups outage...


Do you have any evidence that Github does any better?


They haven't lost my data so far and they had to restore the production database at least once in their history. All circumstantial, but we'll have to wait and see.


They did not lose that much data, only a few hours.


Gitlab does not have feature parity with Github. I personally also stumbled upon a bizarre bug that doesn't allow a friend to add me to any of his private repos. Asked on IRC and nothing much came of it unfortunately.


Hiya,

Which features would you like to see in GitLab? We'd love to talk about it. You could also open a feature proposal issue at https://gitlab.com/gitlab-org/gitlab-ce/issues.


Err... you're responding to someone who very clearly said they hit a (serious for them) bug which needs fixing. That's definitely not a feature proposal. :D


Do you have data to support that claim about uptime?


GitLab's public service had a major outage a few months ago:

https://about.gitlab.com/2017/02/10/postmortem-of-database-o...


GitLab, really ?


I'm very happy I switched. I even posted here on their last outage. That was a stressful 15 minutes.

Not perfect, just working loads better for me and my teams


Git access to repos seems to be working for me (pull, push).


I guess they just rolled out their new DNS infrastructure (https://githubengineering.com/dns-infrastructure-at-github/) :-P



