Someone is lying here in my opinion. I hope I'm proven wrong because this is a terrible excuse for the company to make.
EDIT: And as someone else pointed out... their IP addresses could still be pinged, which further undercuts the routing-issue claim. More than likely high traffic crashed one or more routers (THIS I have seen happen) and the live/saved configs didn't match. I'd put more money on something like that happening if it was router-related.
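For what it's worth, that live-vs-saved drift is the kind of thing you can spot in seconds after the fact if you keep config snapshots. A minimal sketch (filenames and file contents are hypothetical) using Python's difflib:

```python
# Hedged sketch of the "live vs saved config drift" failure mode: if a
# router reboots under load and comes back on a stale startup config, a
# plain diff of the two snapshots shows exactly what was lost.
# Filenames and contents are hypothetical.
import difflib

running = open("running-config.txt").read().splitlines()
startup = open("startup-config.txt").read().splitlines()

for line in difflib.unified_diff(startup, running,
                                 fromfile="startup-config",
                                 tofile="running-config", lineterm=""):
    print(line)
```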
"Routing" is a rather generic term when it comes to large networks, and everything from border routers, firewalls, load balancers, and switches actually perform routing.
Especially (as I've mentioned in another post) when you add fault tolerance / failover configurations to the mix.
Routing failure doesn't have to be an all-or-nothing thing. There are a number of ways in which I can see ICMP echo packets working but other traffic not, especially when you include complexities of source routing, load balancing, failover, etc.
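To make that concrete: a host can answer ICMP echo while DNS queries to it go nowhere, so ping success tells you almost nothing. A quick sketch (the nameserver IP and zone name are placeholders, and the ping flags are Linux-specific):

```python
# Minimal sketch: a box can be ICMP-reachable while its DNS service is
# dead or misrouted. 192.0.2.0/24 is the TEST-NET documentation range.
import socket
import struct
import subprocess

NAMESERVER = "192.0.2.53"   # hypothetical nameserver address
QUERY_NAME = "example.com"  # hypothetical zone to test

def icmp_reachable(host):
    """One ICMP echo via the system ping binary (Linux flags)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def dns_answers(server, name, timeout=2.0):
    """Hand-rolled A-record query over UDP; True if any reply comes back."""
    # Header: id, flags=RD, 1 question, 0 answer/authority/additional RRs.
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(header + question, (server, 53))
        sock.recvfrom(512)
        return True
    except OSError:  # timeout, ICMP port unreachable, etc.
        return False
    finally:
        sock.close()

print("ICMP:", icmp_reachable(NAMESERVER))            # can be True...
print("DNS :", dns_answers(NAMESERVER, QUERY_NAME))   # ...while this is False
```

The hand-rolled UDP query is just to keep it stdlib-only; dig against the same server would show the same split.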
Even something as "simple" as a poisoned ARP cache in a single box could screw up the entire internal network and cause the problems they've had, and still be considered a "routing issue".
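One classic symptom of that, if you're curious, is several IPs suddenly claiming the same MAC. A rough sketch that reads the Linux ARP table (purely illustrative; not suggesting this is how anyone diagnosed anything):

```python
# One symptom of a poisoned (or just stale) ARP cache: multiple IPs
# resolving to the same MAC. Reads /proc/net/arp on Linux. It's a hint,
# not proof -- proxy-ARP setups legitimately do this too.
from collections import defaultdict

def arp_table(path="/proc/net/arp"):
    """Map each MAC address to the list of IPs currently claiming it."""
    macs = defaultdict(list)
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            ip, mac = fields[0], fields[3]
            if mac != "00:00:00:00:00:00":  # skip incomplete entries
                macs[mac].append(ip)
    return macs

for mac, ips in arp_table().items():
    if len(ips) > 1:
        print(f"suspicious: {mac} claimed by {', '.join(ips)}")
```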
You are correct that I don't know the details of their internal network and I never said otherwise, just that the chain of events and their claims don't necessarily match up!
After that, coming up with an automated process for migrating what must be a shit-ton of zone information to another system must have taken a while. I have no idea what their specific solution was, but I'm fairly confident it wasn't just a matter of copying over a few zone files. They'd probably have to do SOME sort of ETL (extract / transform / load) process that would take time to develop and test, never mind run.
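Pure speculation on my part as to what that looks like, but even the "extract" step for a BIND-style zone file is real work once you multiply it by millions of zones. A toy sketch (record grammar heavily simplified; real zones need a full RFC 1035 parser):

```python
# Toy "extract" step of a hypothetical zone-migration ETL pipeline, just
# to show why bulk migration is more than copying files: every record
# gets parsed, validated, and re-emitted in whatever format the target
# system wants. Zone snippet below is made up.
import re

RECORD = re.compile(
    r"^(?P<name>\S+)\s+(?P<ttl>\d+)\s+IN\s+"
    r"(?P<type>A|AAAA|NS|MX|CNAME|TXT)\s+(?P<data>.+)$"
)

def extract(zone_text):
    """Parse simple one-line records; ignores $ORIGIN, $TTL, multi-line RRs."""
    records = []
    for line in zone_text.splitlines():
        line = line.split(";")[0].strip()  # drop comments
        m = RECORD.match(line)
        if m:
            records.append(m.groupdict())
    return records

sample = """
; hypothetical zone snippet
www.example.com.  3600 IN A  192.0.2.10
example.com.      3600 IN MX 10 mail.example.com.
"""
for rec in extract(sample):
    print(rec)
```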
And I can't remember the last time I gave technical information to a PR person who actually got it 100% technically correct. ;)
My intention wasn't to shit on your point or anything, or in any way defend GoDaddy and their screwup; I just think it's a bit unrealistic to try to infer detailed information from a PR release.
In the end, it was technical, they screwed up, and I doubt they'd ever release a proper, detailed post-mortem of what happened.
I think perhaps the takeaway here is to not trust what is being said, go with your gut... and move any services off GoDaddy ;). It would be nice if, like Google or Amazon, they released a real post-mortem post. Even if it's an internal 'uh-oh', I trust companies that are willing to admit to mistakes.
Possible but highly unlikely. GoDaddy is "old school", which means they will release as little info as necessary and move on. They aren't interested in the hacker community. Their primary market is SMBs.
The way it's described now, this was a weakness in their infrastructure, and I wonder whether it's even possible to prevent it from happening again.
GoDaddy has plenty to lose by f-ing up. And to my knowledge (as a somewhat small competitor; I'm just pointing that out so my thoughts are taken in context) they have a fairly robust system (anecdotally) for the amount of data they manage. My issues with GoDaddy (as a competitor) were always on the sell side: the constant pushing of things you don't need, etc. Technically I really didn't have any issues with them.
Our interim CEO confirmed that affected users are receiving a full month refund.
I want to mention a few things though. The first is that the blame is being placed (at least reading slightly into the PR release) on a 'technology' failure. That is fairly distinct from human error.
The other thing is that if it was human error, how did the chain of events unfold without a second pair of eyes catching it, such that the outage lasted as long as it did?
Third, why was the DNS changed to Verisign? That is still the biggest outstanding question, I think, in terms of their claimed outage reports. I should also mention that I do have skin in this game, as plenty of customers were running at least SOMETHING through GoDaddy and were yelling in this direction when stuff broke...
That said I agree that Godaddy's handling and RFO doesn't smell right.
Routers can "crash" for different reasons, but atypically due to high traffic. If you really wanted to fuck with someone you make one BGP change. Only newbs use DDoS's. (Which, Anonymous being newbs, would be their MO, but unlikely they could DDoS a connectionless resource record database)
If you go with 'router tables' being the culprit, they probably had a core router that maxed out its RAM when another router was added in its place, and by the time the routers synced (and RAM filled from too many BGP tables to sort) they had already moved the part of the network that housed DNS. You can still ping the 'hosts' (which I am almost certain are a hardware load balancer, not an actual DNS host) while the DNS traffic goes nowhere, because the backend DNS services were moved. It would take a couple of hours to unfuck all of that.
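Back-of-envelope on the RAM angle, with made-up but era-plausible numbers (table size, per-entry cost, and peer count are all guesses):

```python
# Back-of-envelope for how a core router runs out of RAM when it suddenly
# holds extra BGP views. Every number below is illustrative, not measured.
ROUTES_PER_FULL_TABLE = 420_000     # rough IPv4 table size circa 2012
BYTES_PER_RIB_ENTRY = 250           # varies wildly by platform; a guess
PEERS_WITH_FULL_FEED = 8            # each peer's view kept in Adj-RIB-In
ROUTER_RAM_BYTES = 1 * 1024**3      # a 1 GB control-plane card

rib_bytes = ROUTES_PER_FULL_TABLE * BYTES_PER_RIB_ENTRY * PEERS_WITH_FULL_FEED
print(f"RIB memory: {rib_bytes / 1024**2:.0f} MiB "
      f"of {ROUTER_RAM_BYTES / 1024**2:.0f} MiB available")
# -> roughly 801 of 1024 MiB before any churn or reconvergence overhead
```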