Someone is lying here in my opinion. I hope I'm proven wrong because this is a terrible excuse for the company to make.
EDIT: And as someone else pointed out, their IP addresses could still be pinged, which further undermines the routing-issue explanation. More than likely high traffic crashed one or more routers (THIS I have seen happen) and the live/saved configs didn't match. I'd put more money on something like that happening if it was router related.
"Routing" is a rather generic term when it comes to large networks, and everything from border routers, firewalls, load balancers, and switches actually perform routing.
Especially (as I've mentioned in another post) when you add fault tolerance / failover configurations to the mix.
Routing failure doesn't have to be an all-or-nothing thing. There are a number of ways in which I can see ICMP echo packets working but other traffic not, especially when you include complexities of source routing, load balancing, failover, etc.
Even something as "simple" as a poisoned ARP cache in a single box could screw up the entire internal network and cause the problems they've had, and still be considered a "routing issue".
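To make the ARP point concrete, here's a toy Python model of a cache (all IPs and MACs are made up) showing how a single corrupted entry silently misdirects traffic without raising any error anywhere:

```python
# Toy model of an ARP cache: one stale/poisoned entry silently
# misdirects every frame for that IP. Hosts and MACs are invented.
arp_cache = {
    "10.0.0.1": "aa:aa:aa:aa:aa:01",   # gateway
    "10.0.0.53": "aa:aa:aa:aa:aa:35",  # internal DNS box
}

def next_hop_mac(dst_ip: str) -> str:
    """Look up the layer-2 destination for an IP, as a host/switch would."""
    return arp_cache[dst_ip]

# Normal operation: traffic for the DNS box reaches the right NIC.
assert next_hop_mac("10.0.0.53") == "aa:aa:aa:aa:aa:35"

# One corrupted entry -- no error is raised anywhere; frames just
# start arriving at the wrong (or a dead) interface.
arp_cache["10.0.0.53"] = "de:ad:be:ef:00:00"
print("DNS traffic now silently goes to", next_hop_mac("10.0.0.53"))
```

The failure mode is exactly the nasty kind described above: everything "works" at the layers you can easily observe, while one box's traffic quietly goes nowhere useful.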
You are correct that I don't know the details of their internal network and I never said otherwise, just that the chain of events and their claims don't necessarily match up!
After that, coming up with an automated process for migrating what must be a shit-ton of zone information to another system must have taken some time. I have no idea what their specific solution was, but I'm fairly confident it wasn't just a matter of copying over a few zone files. They'd probably have to do SOME sort of ETL (extract / transform / load) process that would take time to develop and test, never mind run.
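Purely as an illustration of what such an ETL step might look like (the zone format here is simplified, and the target record shape is invented), something along these lines:

```python
# Minimal sketch of the kind of ETL step a zone migration implies:
# parse records out of a simplified BIND-style zone file and reshape
# them for a hypothetical target provisioning API.
def extract(zone_text: str):
    """Yield (name, ttl, rtype, rdata) tuples from simplified zone lines."""
    for line in zone_text.splitlines():
        line = line.split(";")[0].strip()   # drop comments and blanks
        if not line:
            continue
        name, ttl, _cls, rtype, rdata = line.split(None, 4)
        yield name, int(ttl), rtype, rdata

def transform(record):
    """Reshape a tuple into the dict a hypothetical bulk-load API expects."""
    name, ttl, rtype, rdata = record
    return {"fqdn": name, "ttl": ttl, "type": rtype, "value": rdata}

zone = """
www.example.com.  3600 IN A     192.0.2.10  ; web
example.com.      3600 IN MX    10 mail.example.com.
"""
batch = [transform(r) for r in extract(zone)]
print(batch[0])  # {'fqdn': 'www.example.com.', 'ttl': 3600, 'type': 'A', 'value': '192.0.2.10'}
```

Even a sketch this small hints at the real work: SOA/TTL edge cases, escaped data, record types with multi-field rdata, and validating millions of rows before you dare load them.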
And I can't remember the last time I gave technical information to a PR person who actually got it 100% technically correct. ;)
My intention wasn't to shit on your point or anything, or in any way defend GoDaddy and their screwup; I just think it's a bit unrealistic to try to infer detailed information from a PR release.
In the end, it was technical, they screwed up, and I doubt they'd ever release a proper, detailed post-mortem of what happened.
I think perhaps the takeaway here is to not trust what is being said, go with your gut... and move any services off GoDaddy ;). It would be nice if, like Google or Amazon, they would release a real post-mortem. Even if it's an internal 'uh-oh', I trust companies that are willing to admit to mistakes.
Possible but highly unlikely. GoDaddy is "old school", which means they will release as little info as necessary and move on. They aren't interested in the hacker community. Their primary market is SMBs.
The way it's described now, it was a weakness in their infrastructure, and I wonder whether it's even possible to prevent it from happening again.
GoDaddy has plenty to lose by f-ing up. And to my knowledge (as a somewhat small competitor; I'm just pointing that out so my thoughts are taken in context) they have a fairly robust system (anecdotally) for the amount of data they manage. My issues with GoDaddy (as a competitor) were always on the sales side: the constant upselling of things you don't need, etc. Technically, I really didn't have any issues with them.
Our interim CEO confirmed that affected users are receiving a full month refund.
I want to mention a couple of things, though. The first is that the blame is being placed (at least reading slightly into the PR release) on a 'technology' failure. That is fairly distinct from human error.
The other thing is that, if it was human error, how did the chain of events occur without a second pair of eyes or similar, such that the outage lasted as long as it did?
Third, why was the DNS changed to Verisign? That is still the biggest outstanding question I think in terms of their claimed outage reports. I should also mention that I do have skin in this game as plenty of customers were running at least SOMETHING through Godaddy and were yelling in this direction for stuff breaking...
That said I agree that Godaddy's handling and RFO doesn't smell right.
Routers can "crash" for different reasons, but rarely due to high traffic. If you really wanted to fuck with someone, you'd make one BGP change. Only newbs use DDoSes. (Which, Anonymous being newbs, would be their MO, but it's unlikely they could DDoS a connectionless resource record database.)
If you go with 'router tables' being the culprit, they probably had a core router that maxed out its RAM when they swapped another router into place, but they had already moved the part of the network that housed DNS by the time the routers synced and RAM filled up from too many BGP lists to sort. You can still ping the 'hosts' (which I'm almost certain are a hardware load balancer, not an actual DNS host) while the DNS traffic goes nowhere, because the backend DNS services were moved. It would take a couple of hours to unfuck all of that.
I think he's implying "until yesterday".
I've used GoDaddy for 10 years or so, and this is the first DNS outage I can remember.
That's an impressive record, imo.
(Personally I've largely switched off GoDaddy to Namecheap now, but not because of uptime, rather because I grew tired of the sucky GoDaddy website and sucky APIs)
Basically they're guaranteeing 99.999% uptime across some interval and you get some kind of compensation if/when they fail to meet it.
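For scale, "five nines" is a tiny downtime budget; a quick back-of-envelope in Python:

```python
# What an N-nines availability SLA actually budgets for downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600

for nines, label in [(0.99, "two nines"), (0.999, "three nines"),
                     (0.9999, "four nines"), (0.99999, "five nines")]:
    allowed = MINUTES_PER_YEAR * (1 - nines)
    print(f"{label}: ~{allowed:.1f} minutes of downtime per year")

# Five nines works out to roughly 5.3 minutes per year, so a
# multi-hour outage burns through decades' worth of that budget.
```

Which is why a single day like this one retroactively wrecks any "throughout our history" five-nines claim.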
I just misread their statement. They probably meant the up-time before this incident.
"Throughout our history, we have provided 99.999% uptime in our DNS infrastructure. This is the level our customers expect from us and the level we expect of ourselves. We have let our customers down and we know it."
I wouldn't be surprised to hear that GoDaddy's corporate culture wouldn't respond well to someone admitting to a mistake this damaging.
Everyone there is on pins and needles at this point. Since Silver Lake's investment in the company, many hatchets have dropped on jobs, and it's really the only decent tech firm in Phoenix to work at.
My guess is that there is some hiney covering going on with this explanation, and the interim CEO has little cause to care too much about responsibility, since he'll likely be out before year's end anyway.
I really feel for the folks who work there. Many, many talented people who don't have an inch of room to make a mistake. When I was last there, they had just released an internal communication about the new company motto: "It won't fail because of me." Horrible, horrible, backward-ass culture.
Sorry to hear this, as any reasonable corporate structure understands 'shit' happens in the tech universe. Fix it as expeditiously as possible and move on, putting in place as much protection as possible to prevent it in the future.
Do you think we'll see a true outage report or is this unlikely?
I've never known Go Daddy to make any public statements without testing them first. Whatever story gets released will likely be the one with the highest conversion rate.
> it's really the only decent tech firm in Phoenix to work at.
> Horrible, horrible, backward-ass culture.
what part of the company is good if not the culture?
So should you opt to go into tech, GoDaddy is the largest and best-paying, and has fairly decent benefits. Unless you were a Senior level professional elsewhere, it would likely be in your best interest to work at GD.
It's got tons of middle management, which in itself isn't a bad thing, except that everyone's fighting to own the creation of a product, but no one wants to be accountable for it should it not go well.
Essentially this creates an environment of fear that works against innovation, accountability, and iteration.
That isn't good for standards/best practices, but it does mean that teams have a lot of leeway to do things pretty much however they want. My team has a ton of flexibility in how we code, test, and deploy, though we have to work within some red tape imposed by other teams that have their own way of doing things.
In my day to day work I'm actually pretty free to work how I want without being pushed by people up the food chain. That could be because the product I work on is a supplementary product and not a 'core' product, though.
Leave now. It will be a black mark on your resume if you stay.
It's funny you say this. This is exactly the reason why I left, and exactly what I told HR when I left. Unsurprisingly, they told me that others who had recently left had expressed similar sentiments.
I've just switched to DNSMadeEasy - for anyone concerned about the time involved, they have some cool timesavers like templates you can apply to all of your domains at once. Really makes a difference not having to set up the Google Apps entries individually on 20+ domains.
> It was BGP related and more details should be posted today
So Anonymous0wn3r or whoever was just claiming responsibility for something they had no hand in? The router tables just corrupted themselves?
Yeah, with a name/handle like Anonymous0wn3r they sound very trustworthy. If they claimed they had a hand in something, it must be true.
Not sure if those tweets were actually from before it happened.
* GoDaddy: 10:35 AM - 10 Sep 12
* AnonymousOwn3r: 11:57 AM - 10 Sep 12
So, AnonymousOwn3r does not seem to have announced the attack before it happened.
Links to tweets:
Gee, why would an anonymous person on the internet lie?
I'm withholding any judgement on internal vs. external involvement until this series of events is defined (doubt it ever will be).
BGPlay (http://bgplay.routeviews.org/) does not show anything abnormal or misconfigured in the BGP default-free table (what the Internet sees). And while there could have been iBGP issues, as others have stated, there was (intermittent) connectivity by IP during the outage.
It's both bullshit PR and, more importantly, disinformation to save face. Why?
A security breach would instill customer fear and generate negative press. Customers would leave in droves.
A DoS/DDoS displays that GoDaddy has inadequate infrastructure while competitors such as CloudFlare actually have adequate infrastructure. Furthermore, why would a company that pisses off the Internet be appealing to anyone? Again, it will generate negative press, and customers will leave in droves.
Spreading disinformation by claiming it was either human error or an equipment fault? From a company perspective this is actually the best option. Just provide generous service credit to your customers: you may generate positive press, you will gain customer goodwill, and you will regain their confidence. This is GoDaddy's best option.
Until they provide actual details with proof that it was a misconfiguration or hardware fault, I will continue to call bullshit. Too many factors don't add up, especially the publicly available data which monitors the BGP DFT on the Internet.
The two conjectures that seem plausible so far are an SQL injection in their web interface for DNS and/or a DoS/DDoS attack.
Just at the IP level they have to deal with (at the edges and across substantial WANs) BGP - a notoriously ugly and fragile protocol. Internal routing protocols such as OSPF are equally ugly and prone to breakage. Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their LAN, out their backup T1. Other issues are BGP flapping, resulting in scary percentages of lost traffic. This doesn't even cover trickier stuff like routing loops...
Other considerations in big routers are things like ASN identifiers and peering points. Considerations like traffic cost, SLAs, and QoS all go into traffic balancing on such routers. MPLS clouds complicate (and oddly enough simplify) these things as well.
There are also important issues like Anycast, CDN and NAT that largely rely on router tricks and add to the complication.
Finally, on top of all this, is the security concern - you can't just throw a firewall in front of it, since many firewall concerns are routing concerns and therefore must also be handled in the router.
All these layers interact and affect each other. Any given machine can only handle so much traffic and so many decisions, so something that is drawn as a single router on a networking chart may actually be several boxes cascaded to handle the complexity.
Oh yeah, and switches are getting progressively smarter with other rules and weirdnesses that provide horribly leaky abstractions that shouldn't matter to the upstream router, but turn out to add issues to the configuration and overall complexity on top of it all.
> Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their LAN, out their backup T1.
This is what route filters are for. If you peer with someone and they advertise 0.0.0.0/0 or something equally ridiculous, and you accept this as a valid route then you deserve to fail (and then given a firm stare if you then proceed to advertise it to other peers).
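A sketch of the kind of sanity filter being described, in Python (the peer's allowed block here is a documentation prefix, purely hypothetical):

```python
import ipaddress

# Sketch of the checks a BGP import filter encodes: drop the default
# route and absurdly large announcements, and only accept prefixes
# inside the blocks this peer is entitled to announce.
PEER_ALLOWED = [ipaddress.ip_network("198.51.100.0/22")]  # hypothetical peer allocation

def accept_route(prefix: str) -> bool:
    net = ipaddress.ip_network(prefix, strict=False)
    if net.prefixlen < 8:          # catches 0.0.0.0/0 and similar nonsense
        return False
    return any(net.subnet_of(allowed) for allowed in PEER_ALLOWED)

assert not accept_route("0.0.0.0/0")       # "route the whole internet through me"
assert not accept_route("10.0.0.0/8")      # space the peer has no business announcing
assert accept_route("198.51.100.0/24")     # a legitimate more-specific of the peer's block
print("filter behaves as expected")
```

Real filters also cap prefix counts, check AS paths, and so on - but even this trivial allowlist would have stopped the "route everything through our T1" story above.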
A similar fail on the part of Telstra (http://bgpmon.net/blog/?p=554) was to blame for much of Australia dropping off the map earlier this year.
> Just at the IP level they have to deal with (at the edges and across substantial WANs) BGP - a notoriously ugly and fragile protocol.
I take offence at this. 80% of BGP-related issues are due to misconfiguration by a given party, 19% are due to bad or missing route filters, and the other 1% is due to bugs in router software. The actual implementation of BGPv4, originally designed back in the early '90s, isn't completely without its issues and behavioural quirks (I'm looking at you, route flaps), but the theory/algorithm behind it is a work of art, and it has coped amazingly with explosive growth - growth that's only going to increase with IPv6.
Without it, there would be no HN.
Doing the wrong thing to router table(s) is the network equivalent of "sudo rm -rf /".
Excluding static routes (which are then usually advertised to other peers), *routing* tables are dynamically built and only exist in non-persistent memory.
Having up-to-date backups of router *configuration* is another matter entirely.
At a first level, routers will cache information about which machines are on which port. This enables them to avoid forwarding every packet to every port (reducing network load). This is normally done using MAC addresses.
On more expensive routers, the router can understand IGMP (multicast) and route based on multicast joins. This is an optimization over simple broadcast, because not every machine on the network needs to see these packets, but multiple machines might want to.
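A toy Python model of that first-level caching (a MAC learning table): learn which MAC lives on which port from source addresses, flood when the destination is unknown, forward directly once it's learned.

```python
# Toy MAC learning table, as a switch/router port cache.
class LearningSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mac_table: dict[str, int] = {}      # MAC -> port

    def frame_in(self, src_mac: str, dst_mac: str, in_port: int) -> list[int]:
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port        # learn/refresh the source
        out = self.mac_table.get(dst_mac)
        if out is None:                          # unknown destination: flood
            return [p for p in range(self.num_ports) if p != in_port]
        return [out]

sw = LearningSwitch(num_ports=4)
print(sw.frame_in("mac-A", "mac-B", in_port=0))  # [1, 2, 3]  flood: B unknown
print(sw.frame_in("mac-B", "mac-A", in_port=2))  # [0]        A was learned
print(sw.frame_in("mac-A", "mac-B", in_port=0))  # [2]        B now known
```

The table is exactly the kind of soft state that goes stale or gets corrupted: nothing crashes, the box just floods or misdelivers until the cache is rebuilt.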
It gets really complex really fast, and even the slightest lapse in attention to extreme detail can be disastrous and a nightmare to troubleshoot.
For instance, I once had a router reboot for some unknown reason, and the firmware/config wasn't properly flashed, so it reverted to the last point release of the firmware. That was enough to cause the failover heartbeat to be constantly triggered, and the master/slave routers just kept failing over to each other and corrupted all the ARP caches. Makes for a lot of fun when things work and then stop working in an inconsistent manner.
It is a little sad that they didn't try rebooting the routers for so many hours.
This assumes they're being literal and "corrupt" means "a bit randomly flipped" instead of using "corrupt" in the figurative sense of "operator error".
Can you be more specific?
Which other domain names did you try?
Also, I believe some parts of the world were unaffected by the outage.
I would guess a large majority of GoDaddy customers would not even know this outage occurred. They are "casual" domain name registrants and in some cases "casual" website operators. They registered some names and then never did anything with them. Or they operate a website but it's very low traffic and they rarely think about it. That is only a guess.
I don't know the details of the environment, but even in smaller systems I've worked on there is a fair bit of hardware separation between various network segments. Complete failure on one part would not affect the others.
For that matter, even a slight corruption in some ARP caches, or stale internal tables, etc., could cause the problems they had... it's not just a complete failure that could cause problems.
And "routing" is such a generic term, when it could really be any number of feature sets that failed; load balancing, source routing configs, etc.