GitHub Outage (github.com)
420 points by philip1209 on Jan 28, 2016 | 281 comments

Sorry, I think I caused this. =[

    bower jquery#1.11.3                       not-cached git://github.com/jquery/jquery-dist.git#1.11.3
    bower jquery#1.11.3                          resolve git://github.com/jquery/jquery-dist.git#1.11.3
    bower foundation#~5.5.2                       cached git://github.com/zurb/bower-foundation.git#5.5.3
    bower foundation#~5.5.2                     validate 5.5.3 against git://github.com/zurb/bower-foundation.git#~5.5.2
    bower ember#^2.3.0                           ECMDERR Failed to execute "git ls-remote --tags --heads git://github.com/components/ember.git", exit code of #128 
    fatal: remote error:
Mid bower install. Rly srry guys!!! =[

So... you were doing a production redeploy, and it crashed?

The URL "git://github.com/components/ember.git" [1] suggests this is an internal GitHub Bower build log, but your post history doesn't mention anything about GitHub (let alone whether you work there), so I'm not 100.00% sure.

[1] https://webcache.googleusercontent.com/search?q=cache:3e00jl...

Assuming this is, in fact, a GH Bower log, the first thing that came to mind was that this architecture isn't (and possibly should be) using a dual-silo approach: when you upgrade, the upgrade gets loaded into a new blank namespace/environment, tested, and if it worked (passes CI test coverage or something like that), the main entry point is switched to the new environment (maybe with a web server restart or config rehash) and the old environment gets purged (possibly after a trial period). The current stack looks quite akin to "click this button to flash the new firmware and DO NOT UNPLUG your device or you'll brick it."
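The flip itself can be tiny. A hedged sketch of the dual-silo idea, with invented paths (a temp dir stands in for /srv/app, an `echo` stands in for checkout plus passing CI):

```shell
# Sketch of the dual-silo flip, with invented paths: build the new
# release in a blank directory, gate it on tests, then swap the symlink
# the web server serves from. Old releases stick around for rollback.
set -e
APP=$(mktemp -d)                 # stand-in for /srv/app
NEW="$APP/releases/$(date +%s)"
mkdir -p "$NEW"
echo "v2" > "$NEW/build"         # stand-in for checkout + passing CI
ln -sfn "$NEW" "$APP/current"    # entry-point switch in one step
```

The old environment is never touched during the upgrade, so a failed build leaves production serving the previous release.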

But then I realized... wait. You guys have like... isn't it like, a few dozen RoR worker boxes? Was this crash on the inbound router or something? xD

It's all good though; consider this "curious criticism" - like constructive criticism, but with extra sympathy. And hey, GH's never broken on my watch before (not that I need it atm... hopefully); this is interesting =P

Hate to be the bearer of bad news, but ikawe was joking about the bower thing.

Source: I work at GitHub.

> Source: I work at GitHub.

So, you're saying you broke it.

Everybody, over here, gang up on this guy.

Shit happens sometimes. Who cares who broke it- what matters is that they're fixing it. I hope GitHub makes public an official reason tomorrow. Not looking to blame anyone- I just want to know what they're doing internally to make sure it does not happen again... adaptation. Even if it's from some external hack or DDoS attack- how can they plan on building redundancies in? I am pretty sure GitHub's dev team is talented- I'm being such an anti-troll.

My comment was quite joking in nature.

> I hope GitHub makes public an official reason tomorrow

I do too, but mostly because I find post-mortems quite interesting to read.

If your services are at some 95% uptime or lower, you're doing something (several things) very wrong, and it's probably not that interesting to me.

Getting from 95% -> 99% you probably did some interesting things there.

Going into multiple .9s beyond that, you're likely doing quite a bit of interesting stuff, but what I find most interesting is the failure analysis: not just where you went wrong, but what your incorrect assumptions were and WHY they were wrong. "We believed X could never fail because of Y, and even if X did fail, it would not cause production impact because of Z!"
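The jumps between those tiers are easier to feel as downtime budgets; a back-of-the-envelope (pure arithmetic, nobody's real numbers):

```shell
# Downtime budget per year at each uptime tier (minutes/year):
for pct in 95 99 99.9 99.99; do
    awk -v p="$pct" 'BEGIN {
        mins = (100 - p) / 100 * 365.25 * 24 * 60
        printf "%6.2f%% uptime -> %7.0f minutes of downtime/year\n", p, mins
    }'
done
```

95% is weeks of outage per year; each extra nine divides the budget by ten, which is why the engineering behind each tier looks so different.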

Ah, glad I didn't presume it was a valid message! I've heard people joke about this kind of thing exactly this way in the past.

But now my "technical breakdown info" box has no tidbits in it. :P

I'm glad you're back up now (sortakinda - what sort of traffic are you sustaining right now? :D), but a rough idea of what asploded would be really cool to know about.

Speaking of which, I'd like to take a moment to make a strong point about the fact that disaster-recovery situations don't get blogged about enough. Vague "we fixed it" datapoints get buried in status update logs like it's something to hide and hope nobody brings up.

In situations like these, the only constructive perspective is for everyone to accept that something went horribly wrong and not make a fuss about it, and if such a mentality can be established, this creates an environment within which we can share technical breakdowns of "we found ourselves in XYZ position and then we did these thirty highly specific things in heroically record time to be up and running again", and I think sharing this type of info would potentially be more educational than setup tutorials or the "we switched to X and it improved Y by 1400%" type things the Net's full of. Sure, you'd have to generalize and probably give a lot of backstory about infrastructure, but it's becoming trendy (in a sense) for companies to describe their operations in precisely this way, so it's not completely nonviable.

(Note my use of the angle of "we found ourselves in XYZ position" - maybe a small highlight of what led up to the disaster would be included (worth considering if the information would be educational), maybe not. In a blog context, moderating comments to keep the discussion on-track and constructive may be necessary, but IMO would be worth it.)

I missed the joke. Either way- hope you get everything up and running soon!

Perhaps you missed the joke?

That's what you get for using bower and jquery. The react/webpack gods are angry.

or the react/webpack gods are laughing...

I once ran rm -rf in the production mysql data directory.

Shit happens.

I did that too. Destroyed our Zabbix database. Neither that Zabbix server, nor the other one monitoring the server I destroyed, could alert us that anything had gone wrong for over an hour. I finally realized it when I couldn't log in...

I was able to painstakingly rebuild the server after 9 hours without anyone noticing. To this day it's one of my biggest fuck-ups and proudest accomplishments.

Are you saying that once you break something, you should break your monitoring as well, but do it very quickly since it may be too late? :)

Yeah, if you're going to break your monitoring solution, you'd better shoot it in the head and vaporize the body.

When I worked at an ecommerce company, the boss (who was also a programmer) did that too - on our master engine: not just website orders, but eBay, Amazon, everything historical, inventory, RMAs, etc. OOPS indeed... Yep, shit happens. We recovered from it - it was a hassle and shut down everything, but it was one of those whoops moments. Happens to the best of us.

Been there. Done that. No backups.

Recovered all the data by using open file handles in /proc/.

Not a fun two hours.

Shit happens. Live and learn.

Have you written a blog post about that /proc/ trick? sounds like an interesting read

Here's an article that explains it: http://archive09.linux.com/feature/58142

I don't blog, but, what teraflop posted is basically correct.
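For the curious, the trick can be demonstrated in miniature on Linux (filenames are made up; `tail -f` stands in for the process that still has the deleted file open):

```shell
# Miniature demo (Linux): while any process still holds the deleted
# file open, the bytes stay reachable via /proc/<pid>/fd/<n>.
set -e
echo "precious data" > /tmp/demo_db
tail -f /tmp/demo_db > /dev/null &     # stand-in for the mysqld process
HOLDER=$!
sleep 1
rm /tmp/demo_db                        # the "oops" moment
# The fd still points at the (now unlinked) inode; copy it back out:
FD=$(ls -l "/proc/$HOLDER/fd" | awk '/demo_db/ {print $9}')
cp "/proc/$HOLDER/fd/$FD" /tmp/recovered_db
kill "$HOLDER"
```

The data is only recoverable until the last process holding the file exits, which is why you never restart mysqld after a mistake like this.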

I'm going to put that on a T-shirt

You are the human manifestation of "the chaos monkey"

For anyone who feels they might fall under the same category (I do!), represent: https://teespring.com/human-chaos-monkey

Oh! I see! So you're the guy who DDoSed GitHub by downloading Ember!

I thought the point of heavy client-side frameworks was to take load off the server.

Can someone explain this to people who don't use bower?

He was just cloning stuff and it failed mid-way.

Maybe by the time you switch to npm, github will come back up?

Cool that you have time to update us. :)

Relevant: GitTorrent: A Decentralized GitHub (http://blog.printf.net/articles/2015/05/29/announcing-gittor...)

The repo is at ... https://github.com/cjb/GitTorrent, so just clone that and ... oh ...

Fixed that for you!

http://gittorrent.org/ and `git clone git://gittorrent.org/gittorrent`

From the site:

  First we connect to GitHub to find out what the latest 
  revision for this repository is, so that we know what we
  want to get. GitHub tells us it’s 5fbfea8de... Then we 
  go out to the GitTorrent network.
So yeah, wouldn't have saved you.

The article continues and covers solutions to that.

Every time this happens people make clever remarks about how Git is distributed but we're all depending on GitHub for so much that we defeat the purpose. But once GitHub comes back up, everyone just gets back to work, trusting and relying on it as much as ever. Eventually it goes down again, and we come back to complain. Convenience is the only thing that we seem to value. (I'm no different, which makes my comment completely hypocritical.)

> clever remarks about how Git is distributed but we're all depending on GitHub for so much that we defeat the purpose

Yeah, but they'd be wrong about that purpose.

Distributed systems used by people (eg. email, BitTorrent) always have their major hubs (Gmail, The Pirate Bay). That's understandable: no product reaches critical mass without a main stream. The strength of a distributed system isn't that it has no points of failure: it's that, in the event of a significant failure in an established node (eg. TPB's downtime at the end of 2014), the community can retarget around a new solution (eg. KickassTorrents) at the point at which the inconvenience of the downtime outweighs the inconvenience of switching habits, without a significant dip in service associated with the switch.

In contrast, a truly centralized system like BitKeeper would just outright block progress if the central node were to deny access (which is, of course, exactly what led to devs like Linus Torvalds changing their focus for a few months, so that they could scale back up to full kernel production once they had constructed a workable alternative, in Git).

The bigger problem here is the number of build tools relying only on GitHub as a source of truth. We need an abstraction that distributes storage so that there is no single point of failure.


So I had to install Gems from RubyGems - not that big a deal - and I had to look them up on RubyGems since Google gives me GitHub first, but that's OK too. But then all the documentation seems to be on... GitHub. Except not - it's also on RDoc. (Though RDoc kinda sucks compared to GitHub...)

Pretty big win imho, Rails got a few more points with that. :D

Now I'm wondering how NodeJS is faring at this...

We have npm as our single source of truth, although you could host your own npm, or use a tool like sinopia [0] which makes pretty reasonable tradeoffs while being usable. Instead of asking you to replicate all of npm, it just keeps local copies of your packages, and if a package isn't found it'll hit npm.

[0] https://github.com/rlidwka/sinopia
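The setup is roughly this - an untested sketch going off sinopia's defaults (registry on localhost:4873, falling through to npmjs.org on cache misses); the package name is just an example:

```shell
# Hedged sketch, using sinopia's documented defaults; "lodash" is
# only an example package.
npm install -g sinopia
sinopia &                                   # local registry + cache daemon
npm config set registry http://localhost:4873/
npm install lodash                          # cached locally after first fetch
```

After the first fetch, installs of cached packages keep working even while the upstream registry (or GitHub) is down.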

Companies and people don't learn from their mistakes. Almost everything is on GitHub nowadays, which makes it a SPOF even though Git itself is distributed. More companies should host their projects on premises. There are good open source alternatives to GitHub: Apache Allura, Fossil, GitBucket, GitLab, Phabricator, and Redmine.

... GitHub Enterprise ...

It's not open source and it's expensive.


Yes, but at least you can continue to work while it is down. You'll be missing on the collaboration part (and bug trackers, and stuff).

If you are desperate, you can also call another developer and add another remote pointing somewhere else. So Git is a win anyway.

Try that with a "SubversionHub"...

Even if convenience is not a priority for you, it almost certainly will be for people you want/need to interact with.

Only that I don't expect data loss since it is distributed and even if GitHub gets wiped clean I can just happily push up all my stuff from the local repositories and get on with my life.

If you listen closely you can hear the sound of continuous integration builds around the world breaking

If you listen closely you can hear the sound of all bower-dependent builds on earth failing in synchrony

why only bower-dependent?

I'm guessing the parent doesn't mean only bower, but bower uses github as the source of packages/code (I don't use bower, so I'm only going from memory). On a build that downloads everything 'fresh' it's not going to be able to get sources that are only on github.

Bower certainly isn't the only thing that does it. It's a trend that's becoming more and more common.

I was in a go tutorial a couple of years ago where the presenter was including directly from github.

And sitting in a ruby talk when the rubygems compromise happened just after everyone had been told to go upgrade a gem because of flaws in a specific gem.

It's not unique, but it is terrible.

Unlike most other package managers, bower doesn't actually host packages. It just finds them at Git URLs, which are almost invariably on GitHub.

As GitHub queues up webhooks, we're bracing for impact at buddybuild! :)

You shouldn't notice a webhook deluge because the site isn't generating events. I'm watching our webhook services though and will let you know if that changes.

Hi Kyle!

It looks like webhooks are wedged.. no?

Everything should be A-OK now. If not, hit up github.com/contact :)

But no one can do anything to generate new webhooks.

pushes have been working for the past 10-15 minutes... pulls for 5-10 minutes.

Which is, in a sense, a bit disturbing.

Yup, a coworker was wondering why a build failed, a minute later I saw this.

Well, it's not like we invented binary and source packages two decades ago so that people wouldn't need to re-download and re-build their build-time dependencies every time they start a compilation.

Oh wait, we have.

Yep, but nobody ever includes them. Nice stuff to automate - gotta check if there is already DevOpsy stuff to fix that, and if there isn't, got me a new weekend project...

It is obviously gonna be hosted on GitHub. Oh, the irony...

I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened.

Nomen est omen Mr. Chewbacha?

I was going to stay up for a middle-of-the-night release (DB migrations, bleh). Instead, posted https://status.github.com/ into slack and am packing up and going to the bar.

Can you imagine what would happen if the same happened to Slack? No communication in thousands of companies.

This is new. Looks like github is still "working" when it's down.

  $ git push origin master
  Counting objects: 5, done.
  Delta compression using up to 8 threads.
  Compressing objects: 100% (5/5), done.
  Writing objects: 100% (5/5), 433 bytes | 0 bytes/s, done.
  Total 5 (delta 4), reused 0 (delta 0)
  remote: Unexpected system error after push was received.
  remote: These changes may not be reflected on github.com!
  remote: Your unique error code: 4fce1b2367b5304dd3761538b8fd0c23
  To git@github.com:myrepo/myrepo.git
     a62b7f1..e88431a  master -> master
  $ git push origin master
  Everything up-to-date
Note: Values are fake, but message is real.

The Git backend is different from the GitHub frontend, unsurprisingly.

I suppose not. Glad to see something like this still work:

  $ git clone git@github.com:influxdata/influxdb-ios.git
  Cloning into 'influxdb-ios'...
  remote: Counting objects: 10, done.
  remote: Compressing objects: 100% (8/8), done.
  remote: Total 10 (delta 1), reused 10 (delta 1), pack-reused 0
  Receiving objects: 100% (10/10), done.
  Resolving deltas: 100% (1/1), done.
  Checking connectivity... done.
Even though this doesn't:

  $ wget https://github.com/influxdata/influxdb-ios
  --2016-01-27 20:03:39--  https://github.com/influxdata/influxdb-ios
  Resolving github.com...
  Connecting to github.com||:443... connected.
  HTTP request sent, awaiting response... 503 Service Unavailable
  2016-01-27 20:03:39 ERROR 503: Service Unavailable.

You can also git clone over ssh. That's kinda equivalent to what the git@github.com form does.

    $ git clone ssh://git@github.com/influxdata/influxdb-ios
    Cloning into 'influxdb-ios'...
    remote: Counting objects: 10, done.
    remote: Compressing objects: 100% (8/8), done.
    remote: Total 10 (delta 1), reused 10 (delta 1), pack-reused 0
    Receiving objects: 100% (10/10), done.
    Resolving deltas: 100% (1/1), done.
    Checking connectivity... done.

It's down now.

  jim% git clone git@github.com:pfsense/pfsense.git
  Cloning into 'pfsense'...
  fatal: remote error: 
    GitHub is offline for maintenance. See     
  http://status.github.com for more info.

I was trying to copy a library down that's not available via composer, and found that this worked:

git clone git://github.com/MunGell/Codeigniter-TwitterOAuth.git .

where git clone https://... failed. YMMV, of course.

If you're talking about PHP's Composer, you can include arbitrary repos with it; you don't have to depend on Packagist at all.

Good point, I always forget about that. Thanks.

Yeah, got that too. Maybe time to go to bed.

Developers: "I can't get any work done because GitHub is down!"

Linus Torvalds: [facepalm]

I honestly can't get much work done now, I've been looking for RabbitMQ auth plugin sample code and every link I'm clicking now shows me a unicorn.

Google the examples, then use google cache

Thank God we're on Bitbucket! Right, guys? Anyone?

Ha. We're on BB too. Lucky for us, there hasn't been a problem with BB service for, oh, about 6 hours [1]. :)

1. https://bitbucket.statuspage.io/

One of my clients is on BB, they have frequent service issues. I need to switch them over to github.

Regardless, most dependencies are on GitHub which breaks bower install for most people. It's crazy how much infrastructure relies on this single point of failure.

Someone should make a distributed version control system.

What would that look like? Can you describe it? It's not very useful to say stuff without providing some sort of useful idea. You can't just say "someone should write some sort of stupid content tracker, and give it some random three-letter combination that is pronounceable," or whatever.

Completely agree. But whoever makes it, they should make it free and open source and designed to handle everything from small to very large projects with speed and efficiency.

Additionally, it should be easy to learn and have a tiny footprint with lightning fast performance. It should outclass SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, convenient staging areas, and multiple workflows.

> it should be easy to learn

That would be a significant upgrade.

Can't tell if you're serious or not, but I suppose that would be a properly implemented git.

It's sarcasm: `$ man git` gives "git - the stupid content tracker".

You mean like IPFS? [1] People are working on it.

[1] https://ipfs.io

Their source code pointing to Github is symbolic of exactly how this problem hasn't been solved yet.

I didn't say it was solved, but they are very close. IPFS is essentially a global p2p git repo, with a cryptographically controlled branch namespace.

That sounds very cool. I was just pointing out how divorced these ideas still are from the way we actually program and implement things on the web.

In an ideal world, a link to a source code repository (or a link to anything for that matter...) would never fail because there would be automatic mirrors to at least provide read-only access to it.

It sort of forces one to ask whether this is the result of fundamental mistakes within HTTP / DNS itself. It's not realistic for a web designer to put time into making external links fault tolerant by running some query to switch the routing link to an available node.

Content addressed links at least make this possible. With http you have to reach a particular server. If that server is down, the link is broken. With ipfs all you need is at least one machine on the entire network to be serving a file and the link will work and as an added bonus you can verify the hash of what you get so a mitm attack on that link is impossible.
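That verification property is easy to sketch in shell, with sha256 standing in for IPFS's content hash (all the strings below are made up):

```shell
# Content addressing in miniature: the "link" *is* a hash of the bytes,
# so any peer can serve them and the receiver can verify what it got.
addr=$(printf 'library source code' | sha256sum | cut -d' ' -f1)

# "Fetch" the bytes from some untrusted mirror, then check them:
fetched='library source code'
if [ "$(printf '%s' "$fetched" | sha256sum | cut -d' ' -f1)" = "$addr" ]; then
    echo "verified: content matches the link"
else
    echo "rejected: mirror served tampered bytes"
fi
```

Since the link commits to the content rather than to a server, a mirror can only serve the right bytes or be detected - it can never silently substitute others.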

Very interesting. Do you think ipfs could be used to mimic a server to host a static website that could be loaded by a browser?

You mean like this?


:-) IPFS includes an optional http gateway to give the traditional web access to IPFS before browsers support the IPFS protocol natively.

Would this be useful?

    # pre-push (there is no post-push)
    # Update in several places
    git push bitbucket master
    git push gitlab master
    # ...

Or a local cache of dependencies

What a useless comment.

Aren't npm installs actually hosted on npm? This shouldn't affect npm installs.

That's right. Except I of course got unlucky.

    npm i semantic-ui
    npm ERR! fetch failed https://github.com/derekslife/wrench-js/tarball/156eaceed68ed31ffe2a3ecfbcb2be6ed1417fb2
    npm WARN retry will retry, error on last attempt: Error: fetch failed with status code 503

So, in this case, it looks like github isn't acting as a version control server, but rather a static server... It turns out that github isn't just a single point of failure for some application, but rather multiple points of failure for slightly different reasons...

It is possible to install npm packages that are hosted on GitHub: http://stackoverflow.com/a/17509764/889864

Oh wow, color me surprised. I edited my comment.

I changed jobs recently- previously used Github, now on BB- glad I'm not at my old job, I'm sure several people's phones are blowing up from auto-emergency fail measures.

Yeah, our time to shine has finally come!

We use Bitbucket in our ~30 member academic robotics research lab, it's wonderful because they are nice enough to supply us with as much as we need (private repos, teams, etc.) for free!

Whoo hoo! I agree!

Damn right :-)

If the internet would hire people to advertise for them, it would bring marketing to the next level. Take, for example, if BitBucket hired someone to go online and compare the two: GitHub and BitBucket. Of course the comparison would have to disclose that they work for BitBucket, but it could be a very thoughtful response.

We would get these detailed as fuck responses on why one project (BitBucket) is worth more to the developer than GitHub instead of these little jabs at why it is better. I want thoughtful responses.

If there was a duplicate response, the employee that responds to various threads online can link them to their answers.

This isn't even directed at you @ntaylor. I think it's an exploitable marketing strategy. A clear example of this is Katie from PornHub. /u/Katie_PornHub (or whatever the user name is) posts on reddit, which in turn gets more interest in PornHub. PornHub is basically using reddit for free advertising.

Basically, I want to know WHY something is better than something else, and that WHY with as much detail as possible and a lot of thought put into it. Give me a pros and cons list between the two. Anything but two good things about it.

- - -

Holy fuck, this would just add a human element to advertising. You hire humans to serve ads to people. Whenever "GitHub" is mentioned and you have a competing, better service, their only job is to advertise your service to anyone having problems with the other service.

In cases like these, it always seems to happen naturally. Humans are willingly advertising for free, when they could be paid for it.

The downfall of the internet:

  > We make up the ad network
  > It carries us far;
  > however---giants must fall.

Thinking about this more, this could be automated to some degree. You could have a bot that listens and responds when key words appear somewhere online. You could use Google's API for when its bot finds something of interest to you, "as it happens".

This would only get you so far. What if a user has a question? How will you help them out if it's a bot? The solution?

Write APIs for every site that you want to use. You can either scrape every site for responses that fit your criteria, or let Google do the digging for you with the "as it happens" notifications, which would indicate that you have to scrape that site for the information.

If the information you get from the site is something you want to respond to, you can.

- - -

Holy fucking shit.

I just described a system that would allow for you to use all websites as inboxes for chat messages.

- - -


With this early model, we would have one program on our machine that would understand X websites. It can parse user comments directed at us from a website (A). It can send responses back to A, and so on.

Now, users on Program X can chat with theoretically any person on any website (A, B, ...), but users of website A can only chat with other people on website A OR people using Program X.

Why doesn't everyone use program X? Because it does not exist yet.

"We're investigating a significant network disruption effecting all http://github.com services." - https://twitter.com/githubstatus/status/692508939792039936

> effecting all http://github.com services

Guess they're too busy working on the issue to notice the misuse of "effecting" instead of "affecting."

Never too busy to correct grammar.


Troubleshooting checklist:

1. Commit fails

2. Try again, see if it fails twice

3. Check internet connectivity

4. Try github in browser

5. Try github in a different browser

6. Go to HN to see if it's down for everyone

7. Write snarky comment

8. Go try again...

Why would your commit fail?

First noticed it when making a wiki commit, actually. Edited the story slightly for artistic license. :)

I think he meant that the commit should work fine, but the push would fail.

Luckily, git is a distributed control system and we all remembered to:

git remote add backup <bitbucket or gitlab url>

git push backup

I'd actually like a gitolite-like system that takes my pushes and replicates them among Gitlab/Github/Bitbucket/repo.or.cz. I'm sure it's possible with hooks, but every time I get around to looking into it, GitHub is back up

Now that GitHub is back up again, it looks like a cron version of https://help.github.com/articles/duplicating-a-repository/ would get me some of the way there.
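For the poor man's version today: git lets one remote carry several push URLs, so a single push mirrors everywhere. A self-contained sketch with local bare repos standing in for GitHub/Bitbucket/GitLab (all paths and identity values are invented):

```shell
# `git remote set-url --add --push` gives one remote several push URLs,
# so one `git push` updates every mirror. Local bare repos stand in for
# the real hosts here.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/mirror-a.git"
git init -q --bare "$tmp/mirror-b.git"
git init -q "$tmp/work" && cd "$tmp/work"
git config user.email you@example.com && git config user.name "You"
echo hello > file && git add file && git commit -qm init
git remote add all "$tmp/mirror-a.git"
git remote set-url --add --push all "$tmp/mirror-a.git"
git remote set-url --add --push all "$tmp/mirror-b.git"
git push -q all HEAD:refs/heads/master   # lands in both mirrors
```

With the real hosts you'd substitute the GitHub/Bitbucket/GitLab SSH URLs for the local paths; the single-push behavior is the same.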

Fortunately git is distributed and lets you work offline. https://git-scm.com/about/distributed

I'm trying to read some webpages hosted on GitHub.

Try adding www instead of hitting the apex. If not, github.io URLs should still work.

For the impatient. I would make a gist but...

  while true; do curl -s https://status.github.com/api/status.json | grep good && tput bel || echo -n .; sleep 1; done

Here is a version which uses OSX's say command for a spoken announcement of "github is up again". It also checks once every minute instead of once a second.

    while true; do curl -s https://status.github.com/api/status.json | egrep 'good|minor' && say -r 160 "github is up again" || echo -n .; sleep 60; done
Edit: accept a status of minor as an indication of up-ness.

say "github" does not get the pronunciation right. say "git hub" is the way to go!

I'm not sure how ddos'ing github would help with the outage.

*runs script*

Judging by the fact that their status page is still up I'd say that the service is pretty independent of the rest of the site. If my one-liner gets popular enough to do damage to Github when China can't then I will be extremely proud of myself. :D

Ah, I like yours better for GitHub specifically, but for any website, my .bashrc got this as its latest addition tonight:

  # watch for a website to come back online
  # example: github down? do `mashf5 github.com`
  function mashf5() {
      watch -d -n 5 "curl --head --silent --location $1 | grep '^HTTP/'"
  }

The status page update seems like some ironic machine rant "The status is still red at the beginning of the day" (http://take.ms/50Tox)

Hubot has assumed control...the singularity is upon us!!!

Yeah, I am confused by the existence of a status report for a time 3.5hrs in the future. Guess they're just trying to get ahead of the game and set some realistic expectations.

Maybe it's GMT mistakenly labeled as PST?

Weird. It was labeled EST for me.

It is.

Status page seems to be back: https://status.github.com/

No word on twitter yet: https://twitter.com/githubstatus

went to push, didn't work, tried the webpage, saw the outage, now I'm on HN

same here


bundle install here

sadly, same

ok this is actually really annoying now

What `zsh` (running with `thefuck`) thinks of your comment on OS X:

git:(master) ok this is actually really annoying now

zsh: command not found: ok

git:(master) fuck

look this is actually really annoying now [enter/↑/↓/ctrl+c]

look: is: No such file or directory

git:(master) fuck

> No fucks given

That's kind of perfect.

Nice! And in this case the perfect is not the enemy of the good!

And 100,000 people's work grinds to a halt.

As a workaround, I can add all of you to my company's GitHub Enterprise install. It's only ~$2,500/year per 10 users, so I can just expense it.

Github is down. Post github.com on HackerNews, that'll help them. =P

To be fair, GitHub has scaled to the point where traffic probably isn't their concern.

This is reminding me of last year's major Facebook outage. If I recall correctly, that outage was a bug in service discovery that took down all data centers (a CLI accepted a negative value where the ZooKeeper variant treated it as an unsigned int, then all service discovery went down). I feel like service discovery is the biggest point of failure at large companies, and it would explain why services across so many different domains and systems went down.

Except that when Facebook is down, productivity increases, and when GitHub is down, productivity decreases :)

*sigh* Many moons ago when I was a starry-eyed lad just learning to use Git, I remember all the cheerful sentences in the Git book like "Unlike SVN, with Git you can work even when the server is down!"

The more things change, the more they stay the same.

Well… you can still work with Git when the server is down. Just not GitHub.

That was never the main design consideration of Git, but you nonetheless can make commits and branches when the server is down. Go try it.
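A throwaway repo proves the point - nothing below touches any server (identity values are placeholders):

```shell
# Commits and branches are purely local operations; no network needed.
set -e
cd "$(mktemp -d)"
git init -q offline-demo && cd offline-demo
git config user.email you@example.com
git config user.name "You"
echo "work continues" > notes.txt
git add notes.txt
git commit -qm "committed while GitHub was down"
git checkout -qb experiment        # branching is local too
```

Everything lands in the local `.git` directory; the push to a remote can happen whenever the hub comes back.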

Git is not Github. If hotmail.com goes down, it doesn't take down email worldwide with it.

You can continue to work. For example I skipped pushing to Github but I can push to Heroku. No problem. I'll push my code when it is up again.

git != github. If you're worried, you can always push to another remote in the meantime. Does svn do that?

I see you got your centralization in my decentralized protocol.

I'm still working over here. I mean I took a quick break to read HN. But no productivity problems. Why can't you keep working?

Fiddling around with several deployment and configuration management options I just came to the conclusion that there is too much code between me and the systems.

I would like to have one bash script that I could start in such worst case scenarios that tests everything and tells me exactly what the problem is.

Just imagine everything is down and you can not trust your dashboards.

What would you put into such a script? What would you test for?
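A possible skeleton might look like this (the hosts and the repo URL are placeholders): check each layer separately, so the output itself says where things break.

```shell
# Layered sanity checks; replace the hosts/repo with your own deps.
check() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then echo "OK   $desc"; else echo "FAIL $desc"; fi
}
check "DNS resolves github.com"   getent hosts github.com
check "HTTPS reaches github.com"  curl -fsS --max-time 5 https://github.com
check "repo reachable over git"   timeout 10 git ls-remote https://github.com/user/repo.git HEAD
```

The ordering matters: if DNS fails, every later FAIL is expected, so you can stop reading at the first broken layer.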

Does this have something to do with the constant DoS attacks they've been getting from a certain country for hosting certain open source projects that defy the censorship authority of said certain country?

Russia? China? Is it forbidden to state the country?

You have been banned from /r/Pyongyang.

exactly why I host all my repos on my own dedicated server with my own backup, and then mirror/sync to github

and also why I do include my dependencies in my projects

having a build tool fetching dependencies dynamically on github or whatever is not a convenience, it's a PITA

just sayin' - even extremely reliable hosted services can fail, so do own your repositories. Going back to write some code ;)

Status page is not responding either:


Seemed to work for me, although it took a while to load and said "All systems operational"

edit: Now says "19:32 Eastern Standard Time: We're investigating connectivity problems on github.com."

edit2: "19:47 Eastern Standard Time: We're investigating a significant network disruption affecting all github.com services."

Yeah me too, status page says it's OK but the website looks like this:



    [mason@IT-PC-MACPRO ~]$ cd Code/rollerball
    [mason@IT-PC-MACPRO rollerball (master)]$ git pull
    remote: Internal Server Error.
    fatal: unable to access 'https://masonmark@github.com/RobertReidInc/rollerball.git/': The requested URL returned error: 500
    [mason@IT-PC-MACPRO rollerball (master)]$

Yeah, you guys were way too fast. The status page needs some minutes to update. :)

They also send out update on their @githubstatus twitter: https://twitter.com/githubstatus

Nothing for this yet, though.

EDIT: There it is: https://twitter.com/githubstatus/status/692505376554618883

Luckily that came back up 5 or 10 minutes after the outage began. And they updated it with a down notice.

A side project of mine called StatusGator [1] monitors status pages and alerts you when something is posted on them. I built it to handle the use case that you're trying to diagnose an issue only to remember to check the status pages of your dependencies and notice that they are already working on it.

It's pretty useful for those services that keep their pages up to date. But not very useful in cases like this when it's not updated.

1. https://statusgator.io

Apparently they can see into the future and know for a fact that they'll still be having issues in a few hours.

  January 28, 2016
  00:00 EST The status is still red at the beginning of the day

  January 27, 2016
  20:02 EST We're continuing to investigate a significant network disruption affecting all github.com services.

... or you're in a different timezone. It's 11am on the 28th where I am.

It says EST there. It's wrong. Looks like it's reporting times in UTC but marking them EST (for me at least).

Under/over on how many man hours will be spent by engineering teams world wide discussing their github single point of failure issue?

Just the other day there was a discussion about "go get" and what would happen in the case of a Github outage. Sigh...

scary...

    remote: Unexpected system error after push was received.
    remote: These changes may not be reflected on github.com!
    remote: Your unique error code: 7527a6e1bbc9fe126d51c97feac7b4e3
    remote: Unexpected system error after push was received.
    remote: These changes may not be reflected on github.com!
    remote: Your unique error code: 7527a6e1bbc9fe126d51c97feac7b4e3

seems pushes are going into oblivion, but the client side gets a return code that the push was successful.

Yeah, I got this as well, and no changes on github.com. So much for creating this release tonight. :/ I hope they fix this; otherwise this push went into oblivion while the client thinks all is well.
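For anyone wanting to check whether a push actually landed, one way is to ask the remote for its ref with `git ls-remote` and compare it against your local HEAD. A runnable sketch, using a local bare repo to stand in for github.com:

```shell
# Verify a push really reached the remote by comparing SHAs.
# The bare repo below stands in for github.com (illustrative setup).
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git -c user.name=demo -c user.email=d@example.com \
    commit -q --allow-empty -m "release"
git push -q origin HEAD
# Ask the remote what it thinks the branch points at:
branch=$(git rev-parse --abbrev-ref HEAD)
remote_sha=$(git ls-remote origin "refs/heads/$branch" | cut -f1)
local_sha=$(git rev-parse HEAD)
[ "$remote_sha" = "$local_sha" ] && echo "push confirmed" || echo "push MISSING"
```

Unlike trusting the push's exit code, this round-trips through the server, so it would catch the "accepted but not reflected" failure mode people are reporting.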

I was getting plain-old 500 errors, which is what brought me here

Github's security team should clarify what's going on.

Two little birdies have tweeted that there was an electrical fire at a key datacenter.

We had networking issues on Rackspace earlier today...

They're also on Prolexic right now, so...big DDoS, likely state-sponsored?

How can you spot Prolexic? I just checked their DNS/IPv4 and it seems to be pointing to regular hosting

    $ traceroute github.com
    traceroute to github.com (, 30 hops max, 60 byte packets
     9  level3-pni.iad1.us.voxel.net (  17.609 ms  15.057 ms  10.113 ms
    10  unknown.prolexic.com (  9.186 ms  9.462 ms  9.315 ms
    11  unknown.prolexic.com (  17.753 ms  17.767 ms  18.851 ms
    12  unknown.prolexic.com (  9.922 ms  9.542 ms unknown.prolexic.com (  11.471 ms
    13 (  13.569 ms (  9.660 ms (  13.150 ms
    14  github.com (  9.051 ms  8.833 ms *

traceroute github.com

Reminds me of the talk where a Github guy says, basically, we just push, push, push, and if something breaks we hear about it on twitter.

That was Zach Holman. He called it TDD - twitter driven development.

And their status page says 100% operational (as updated 5 minutes ago).

Help! The site is back, but I just pushed a new branch that isn't showing on the site, meaning I can't create a pull request.

    circuitry git:(feature/middleware) git push origin feature/middleware
    Counting objects: 18, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (18/18), done.
    Writing objects: 100% (18/18), 5.01 KiB | 0 bytes/s, done.
    Total 18 (delta 9), reused 0 (delta 0)
    To git@github.com:kapost/circuitry
     * [new branch]      feature/middleware -> feature/middleware
This branch is not appearing in the repo: https://github.com/kapost/circuitry/branches

We're still recovering; please give the site a bit of time to come back! Not everything can be expected to work until we've gone green on status.github.com. Thanks

Got it, thanks!

Does anyone know if an extremely controversial new piece of software was recently pushed?

No, but there's recently been a lot of talk about the Hidden Tear ransomware source being taken off of GitHub soon. Given that the author has already been blackmailed and all the drama surrounding that, this is a possibility.

There was an error in ionic caused by Github's error, but I cannot create a new issue :P

    Downloading: https://github.com/driftyco/ionic-app-base/archive/master.zip
    x Invalid response status: https://github.com/driftyco/ionic-app-base/archive/master.zip (503)
    Error Initializing app: [object Object]
    errorHandler had an error [TypeError: Cannot read property 'error' of undefined]
    TypeError: Cannot read property 'error' of undefined

US west side on http://map.norsecorp.com/ looks like fireworks. I wonder if that's where Github's servers are

This was basically my train of thought:

"Damn! I'm tired of fighting installing cordova/ionic. WTF is happening now? Oh Github is down... okay that's new"

Why do they have an angry unicorn on their outage page?

They use unicorn[1] as a HTTP server

[1] http://unicorn.bogomips.org/

Huh, TIL. I always assumed that they meant that the errors themselves are unicorns because they rarely happen and even when they do, you usually don't see them.

First time I saw that logo was on this Android ROM, I wonder which used it first?


(they're using a different but similar image now, but they used to use that exact same logo)

Ha, thanks :)

Maybe they are using gunicorn?

That's a python server - based on the ruby server "Unicorn"

Ahh, didn't know of Unicorn. Today I learned, thanks!

they want to be a 'unicorn' in VC speak

If people are going to depend on github being up for their CI workflows, etc., there should be serious effort expended at the network and cache layer to be suitably reliable. It's probably fine to not be able to do developer-level actions for hours, but even a 5m outage in deploying other systems is unacceptable for most businesses.
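One way to put that cache layer in front of git traffic (my suggestion, not something the thread specifies) is git's `url.<base>.insteadOf` rewriting, pointing GitHub URLs at an internal mirror. The mirror host below is made up:

```shell
# Rewrite github.com fetch URLs to an internal mirror.
# git-mirror.internal is a hypothetical host, used for illustration only.
set -e
export HOME=$(mktemp -d)   # scratch HOME so this example doesn't touch real config
git config --global url."https://git-mirror.internal/github/".insteadOf \
    "https://github.com/"
# From now on, `git clone https://github.com/foo/bar` would actually contact
# https://git-mirror.internal/github/foo/bar instead.
git config --global --get-regexp 'url.*insteadof'
```

CI boxes configured this way keep deploying from the mirror even when github.com itself is unreachable.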

Initially I was Ctrl-R spamming; now I'm just pondering what a complete apocalypse would look like.

Time to go to the park!

Some "Github Pages" are down

Interestingly, our (minimal) traffic to github shifted transit carriers when this outage happened.

Between 2016-01-28 00:39:47+00 and 2016-01-28 00:43:26+00 there was a flurry of BGP updates that caused that.

I'm not sure on the exact timing of the outage, this could either be a symptom or a cause.

Static HTML pages are loading. Example, my personal website: http://peterburk.github.io

But my website loads the content from .MD files via raw.githubusercontent, and those appear to be down.

Actually, I'm pretty dumb. Can't move the repo to bitbucket since Github is down.

If you have a local copy of the repo, you definitely could.

True. If it's not back in a few hours, I'll have to resort to that.
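With any local clone, `git push --mirror` moves the whole repo (every branch and tag) in one command. A sketch, with a local bare repo standing in for a Bitbucket URL:

```shell
# Re-home a repo from a local clone: --mirror pushes all refs at once.
# The bare repo stands in for e.g. git@bitbucket.org:you/repo.git (made up).
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/bitbucket.git"       # stand-in for the new host
git init -q "$tmp/repo" && cd "$tmp/repo"
git -c user.name=demo -c user.email=d@example.com \
    commit -q --allow-empty -m "v1"
git tag v1
git push -q --mirror "$tmp/bitbucket.git"     # branches + tags, all in one push
```

Note that this only moves git history; issues, pull requests, and wiki content live outside the repo and would not come along.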

Well that's the last time I ever write a deployment that assumes Github is working...

I heavily rely on Github # tags to document my codebase. Failures like this make me wonder if there is any way to decouple my codebase from Github while preserving all the comments and issues on commits I've built up.

Good thing we've got mirrors, and don't depend on GitHub for releases. Still, we interact with our customers through GitHub and use it heavily in our workflow. Productivity is impacted if not stopped.

If you were stuck like me looking up an OSS github project, http://archive.org/web/ has github pages.

Github is back online! Github is back online! Github is back online! Github is back online! Github is back online! Github is back online! Github is back online! Github is back online!

I wonder how many origin2->bitbucket are being created right now.

Doesn't look like DDoS, maybe they deployed some dodgy code?

...graph is shocking, it's like somebody pulled the plug out.

likely a data center problem

I think I'm setting some kind of Ctrl-R land speed record.

Heh, talk about a negative feedback loop. DDOS propagating more DDOS....

That would mean a positive feedback loop. https://en.wikipedia.org/wiki/Positive_feedback_loop

(Positive here is used the way behavioral psychology uses it in operant conditioning, where positive punishment means adding a punishing stimulus. Both clash with our everyday sense of positive as good and beneficial :/ https://en.wikipedia.org/wiki/Operant_conditioning)

Carrying on with your analysis: since an uninterrupted positive feedback loop ends in explosion, I guess that would mean (for this outage) either github's fallback status page going down, or the parts of the network carrying that traffic going down once they could no longer handle the increased load.

It would probably be better to rework these control mechanisms and reroute, rewire, or entirely change these processes.


    [Oh My Zsh] Would you like to check for updates? [Y/n]: y
    Updating Oh My Zsh
    Username for 'https://github.com':

> January 28, 2016

> 00:00 EST The status is still red at the beginning of the day

They seem to have some time issues as well, since it's not yet January 28th in EST.

Perhaps it's in a different timezone to EST

That would be odd since all other times are (correctly) reported in EST.

Always funny when a DVCS is used more like a centralized VCS, as with github. It goes down and you have all the same failings as if you were just using cvs/svn.

Seems like as good a time as any to head home then.

Please create a new branch before emergency commits.

Weird: Github is down, but I was still able to successfully pull new code from Github that didn't exist in my local repo...

I wish they'd say more than

> We're investigating a significant network disruption affecting all github.com services.

I would prefer that they focus on getting back up and publish a post-mortem after the fact.

The best approach to this was shown recently by Slack, where they had a huge team of people on social media while their engineers figured out the outage.

It's not like Github only has 2 engineers.


It looks like they stopped updating the availability percentages on the status page? ¯\_(ツ)_/¯

This is why I run my own GitLab server as a mirror! Or just in case GitHub gets hit by a bus.
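One low-effort way to keep such a mirror current (a sketch of the general technique, not the commenter's setup) is giving one remote several push URLs, so a single `git push` updates both hosts. Both "hosts" below are local bare repos standing in for GitHub and a self-hosted GitLab:

```shell
# Make one `git push` hit two hosts via multiple push URLs on one remote.
# The two bare repos stand in for github.com and a self-hosted GitLab.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/github.git"
git init -q --bare "$tmp/gitlab.git"
git init -q "$tmp/repo" && cd "$tmp/repo"
git -c user.name=demo -c user.email=d@example.com \
    commit -q --allow-empty -m "wip"
git remote add origin "$tmp/github.git"
# A remote may carry several push URLs; every push then updates all of them.
git remote set-url --add --push origin "$tmp/github.git"
git remote set-url --add --push origin "$tmp/gitlab.git"
git push -q origin HEAD                      # lands on both "hosts"
```

Fetches still come from the first URL only; the extra URLs apply to pushes.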

If this happens often enough maybe my company will start using their Enterprise install

Github is down in San Francisco

Even getting read-only access would be nice. I just want to look at some source.

On the upside, they gained lots of Twitter followers...

Welp. Remember to vendor your Go dependencies, folks.
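For anyone unfamiliar, vendoring here just means copying dependency source into your own repo (the Go 1.5-era `vendor/` layout) so a build never needs to reach github.com. A sketch with an entirely made-up package path:

```shell
# Old-school vendoring sketch: copy a dependency from GOPATH into ./vendor
# and commit it. github.com/fake/dep is a made-up package for illustration.
set -e
tmp=$(mktemp -d)
GOPATH="$tmp/go"
mkdir -p "$GOPATH/src/github.com/fake/dep"
echo 'package dep' > "$GOPATH/src/github.com/fake/dep/dep.go"
git init -q "$tmp/proj" && cd "$tmp/proj"
mkdir -p vendor/github.com/fake
cp -r "$GOPATH/src/github.com/fake/dep" vendor/github.com/fake/
git add vendor
git -c user.name=demo -c user.email=d@example.com \
    -c core.safecrlf=false commit -q -m "vendor github.com/fake/dep"
```

With the dependency committed under `vendor/`, a GitHub outage can no longer break your builds, only your updates.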

Too bad I'm not teaching someone to use git and/or github right now. Perfect opportunity for a practical joke. "Oh, great, look... you broke GitHub with that last command."

Google Code was turned off, and now Github?

Can't update to a newer Go devel version; can't fetch Go packages, sigh ...

Why can't you update to a new Go devel version? The code on github is just a mirror of https://go.googlesource.com/go

Because this is my workflow for it:

    $ brew reinstall go --head
    ==> Reinstalling go
    ==> Cloning https://github.com/golang/go.git
    Updating /Library/Caches/Homebrew/go--git
    Username for 'https://github.com': ^C

Ugh. It has also killed my productivity this last hour, because an API I was supposed to be writing against is only hosted on GitHub.

