GoDaddy outage caused by corrupted router tables (godaddy.com)
100 points by jakeludington on Sept 11, 2012 | 90 comments



I find this extremely suspicious (i.e., knowing routers, I call bullshit). The change to the Verisign anycast DNS service which I noted yesterday in another thread... brought godaddy.com back up, yet did not result in bringing other DNS services back up.

Someone is lying here in my opinion. I hope I'm proven wrong because this is a terrible excuse for the company to make.

EDIT: And as someone else pointed out... their IP addresses could be pinged which further goes to disprove a routing issue. More than likely high traffic crashed one or more routers (THIS I have seen happen) and the live/saved configs didn't match. I'd put more money on something like this happening if it was router related.


Unless you have intimate knowledge of their network topology, and know the specifics of where those pinged IPs live in that topology, and what routes were used to provide DNS results, you can't say that it wasn't a routing issue.

"Routing" is a rather generic term when it comes to large networks, and everything from border routers, firewalls, load balancers, and switches actually perform routing.

Especially (as I've mentioned in another post) when you add fault tolerance / failover configurations to the mix.

Routing failure doesn't have to be an all-or-nothing thing. There are a number of ways in which I can see ICMP echo packets working but other traffic not, especially when you include complexities of source routing, load balancing, failover, etc.

Even something as "simple" as a poisoned ARP cache in a single box could screw up the entire internal network and cause the problems they've had, and still be considered a "routing issue".

$0.02


None of that is necessarily incorrect... but per their news release 'corrupted router data tables' (their words) were the issue. I can't read too much into that, but that still doesn't change the fact that DNS wasn't resolving for a while after they made their Verisign change (for clients), yet their website was resolved with this change.

You are correct that I don't know the details of their internal network and I never said otherwise, just that the chain of events and their claims don't necessarily match up!


I can imagine that they'd understandably work to get their own site/etc up and running first as the priority, as a manual "hack". After all, it's the main page everyone would be going to for information on what's going on.

After that, coming up with an automated process for migrating what must be a shit-ton of zone information to another system must have taken some time. I have no idea what their specific solution was, but I'm fairly confident in the fact that it wasn't just a matter of copying over a few zone files. They'd probably have to do SOME sort of ETL (extract / transform / load) process that would take some time to develop and test, never mind run.

And I can't remember the last time I gave technical information to a PR person who actually got it 100% technically correct. ;)

My intention wasn't to shit on your point or anything, or in any way defend Go-Daddy and their screwup, I'm just thinking it's a bit unrealistic to try and infer detailed information from a PR release.

In the end, it was technical, they screwed up, and I doubt they'd ever release a proper, detailed post-mortem of what happened.


Heh, yeah. It is a bit difficult to interpret PR speak (and I have had to correct our guy before).

I think perhaps the takeaway from here is to not trust what is being said, go with your gut... and move any services off GoDaddy ;). Would be nice if like Google or Amazon they would release a real post-mortem post. Even if it's an internal 'uh-oh' I trust companies that are willing to admit to mistakes.


"Would be nice if like Google or Amazon they would release a real post-mortem post."

Possible but highly unlikely. Godaddy is "old school" which means they will release as little info as necessary and move on. They aren't interested in the hacker community. Their primary market is SMBs.


I don't see it as defending GoDaddy at all, quite the opposite. I would be more reassured if it was an unexpected massive DDoS which they weren't prepared for but one which they might prepare for in the future.

The way it's described now is a weakness in their infrastructure, and I wonder whether it's possible to prevent this from happening again.


"The way it's described now is a weakness in their infrastructure"

Godaddy has plenty to lose by f-ing up. And to my knowledge (as a somewhat small competitor; I'm just pointing that out so my thoughts are taken in context) they have a fairly robust system (anecdotal) for the amount of data they manage. My issues with godaddy (as a competitor) were always on the sell side, the issues of constantly selling you things you don't need etc. Technically I really didn't have any issues with them.


While it could indeed be a routing issue, who's to say that it wasn't caused intentionally by the guy in the tweets? It would be in GoDaddy's interests to cover that up and fix whatever exploit he used to get in, instead of admitting a security breach.


What we're being told internally matches the public response. Reading between the lines, it sounds like there may have been human error involved, but that's just speculation.

Our interim CEO confirmed that affected users are receiving a full month refund.


And if it was human error, that's fine! Stuff happens and I can certainly say that I've done my share of human error.

I want to mention a couple things though. The first is that the blame is being placed (at least reading slightly into the PR release) on a 'technology' failure. That is fairly distinct from human error.

The other thing is that if it was human error, how did the chain occur without a second pair of eyes or similar, such that the outage lasted more than a bit of time?

Third, why was the DNS changed to Verisign? That is still the biggest outstanding question I think in terms of their claimed outage reports. I should also mention that I do have skin in this game as plenty of customers were running at least SOMETHING through Godaddy and were yelling in this direction for stuff breaking...


Router bugs aren't unheard of. There was the Juniper MX bug that caused multiple outages for Level3 & Time Warner Cable. That was supposedly just a bad pattern of route injections and withdrawals.

That said, I agree that Godaddy's handling and RFO don't smell right.


Do you know how BGP works? There are easily 50 different ways routing problems can cause outages like this. More than likely there was a compound failure which can cause all kinds of retarded behavior, including different networks getting different kinds of traffic, to say nothing of a plain old network service on a single net being down.

Routers can "crash" for different reasons, but atypically due to high traffic. If you really wanted to fuck with someone you make one BGP change. Only newbs use DDoS's. (Which, Anonymous being newbs, would be their MO, but unlikely they could DDoS a connectionless resource record database)


shrug I know how BGP works, yes. I think though from the symptoms of the issue it is going to end up being (if Godaddy is telling the truth about not being hacked), as you say, a compound failure. The exact cause of such a failure is left up to a truthful, full account of the outage being released. Further into this thread someone reported that an engineer from their side is going to release more information, so we'll see soon who gets the prize :P.


I'm familiar with BGP. I'm unfamiliar with how BGP has anything to do with me being able to ping their IP, but not get a response on UDP/53 or TCP/53 with any data in it.
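A rough sketch of how to check the symptom described here: ping only exercises ICMP, so a host can answer echo requests while nothing answers on port 53. The Python below hand-builds a minimal A-record query in the RFC 1035 wire format and checks whether any reply comes back over UDP. The function names, the fixed query ID, and the 192.0.2.53 address in the usage comment are placeholders chosen for illustration, not anything specific to GoDaddy's setup.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of `name`."""
    # Header: ID, flags (only "recursion desired" set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def answers_dns_over_udp(server_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server_ip` sends back a reply with a matching ID on UDP/53."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # A valid reply has at least a 12-byte header and echoes the query ID.
    return len(reply) >= 12 and reply[:2] == query[:2]


# Usage (placeholder address from the documentation range, so expect False):
# answers_dns_over_udp("192.0.2.53", "godaddy.com")
```

If ping succeeds against an address while a check like this times out, that only shows ICMP and DNS are being treated differently somewhere on the path; it does not say which device is responsible.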


Off the top of my head? One network they multihome had a weird packet loss only experienced by DNS and other services, so they tried to cut the routes over to the second network, but packets were still getting sent to the first network (which had DNS disabled but ICMP enabled on the hosts) and further router fuckage prevented them from switching back easily. Hell, they probably just couldn't get their BGP to propagate once they made the first change.

If you go with 'router tables' being the culprit, they probably had a core router that maxed out its RAM when they added another router in place, but they had already moved a part of the network that housed DNS by the time the routers synced and RAM filled from too many BGP lists to sort. You can still ping 'hosts' (which are, I am almost certain, a hardware load balancer and not an actual DNS host) while the DNS traffic is going nowhere because the backend DNS services were moved. Would take a couple hours to unfuck all of that.


It would certainly be in GoDaddy's interests to claim a technical fault, rather than admit to a hack, which implies lax security.


How can they claim 99.999% uptime, when they just had several hours of service outage? I'm not sure how long they've been providing DNS hosting, but by the most generous assumption this would be the entire 15 years of their existence. 99.999% allows them about 1.3 hours of outage in 15 years.


> "Throughout our history, we have provided 99.999% uptime in our DNS infrastructure"

I think he's implying "until yesterday".

I've used GoDaddy for 10 years or so, and this is the first DNS outage I can remember.

That's an impressive record, imo.

(Personally I've largely switched off GoDaddy to Namecheap now, but not because of uptime, rather because I grew tired of the sucky GoDaddy website and sucky APIs)


I think what they're trying to say is that 99% of the time, they have 99.999% uptime.


Silly: 99% of the time they have 100% uptime.


60% of the time it works every time?


Was responding to the silly parent post, claiming similar nonsense.


Yes, wasn't criticizing, just enjoying.


That's not how SLAs work.

Basically they're guaranteeing 99.999% uptime across some interval and you get some kind of compensation if/when they fail to meet it.


I don't believe they have a 99.999% SLA. That was their historical up-time.

I just misread their statement. They probably meant the up-time before this incident.

"Throughout our history, we have provided 99.999% uptime in our DNS infrastructure. This is the level our customers expect from us and the level we expect of ourselves. We have let our customers down and we know it."


No, it would allow them 131 hours of downtime in 15 years: ( 365 days * 24 hours/day * 15 yrs ) * .001 = 131.4


99.999% uptime == 0.001% downtime. 0.001% is a factor of 0.00001, not 0.001. Remember, % means 1/100th.
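The arithmetic in this sub-thread can be checked with a few lines of Python. This is only a sketch: it assumes 365-day years (as the comments above do), and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap days, matching the thread's arithmetic


def allowed_downtime_hours(uptime_percent: float, years: float) -> float:
    """Hours of downtime permitted by an uptime percentage over `years` years."""
    # A percentage has to be divided by 100 to become a fraction.
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return HOURS_PER_YEAR * years * downtime_fraction


five_nines_15y = allowed_downtime_hours(99.999, 15)

# The 131-hour figure comes from treating 0.001% as the fraction 0.001.
mistaken_15y = HOURS_PER_YEAR * 15 * 0.001

print(f"99.999% over 15 years: {five_nines_15y:.2f} hours")
print(f"with the factor-of-100 slip: {mistaken_15y:.1f} hours")
```

The first figure matches the "about 1.3 hours" quoted earlier; the second reproduces the 131-hour number, which is off by exactly the factor of 100.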


Based on a long history of working in datacenters I'd bet someone misconfigured something and later claimed it was "corrupted" to save their ass - happens all the time. It's just so simple to make very confusing and damaging mistakes in a complicated network.

I wouldn't be surprised to hear that GoDaddy's corporate culture wouldn't respond well to someone admitting to a mistake this damaging.


Ex-GoDaddy employee here:

Everyone there is on pins and needles at this point. Since Silverlake's investment in the company, many hatchets have dropped on jobs, and it's really the only decent tech firm in Phoenix to work at.

My guess is that there is some hiney covering going on with this explanation, and the interim CEO has little cause to care too much about responsibility, since he'll likely be out before year's end anyway.

I really feel for the folks who work there. Many, many talented people who don't have an inch to make a mistake. When I was last there, they had just released an internal communication about the new company motto: "It won't fail because of me." Horrible, horrible, backward-ass culture.


Ughh. That sounds about what I expected unfortunately.

Sorry to hear this as any reasonable corporate structure understands 'shit' happens in the tech universe. Fix it as expediently as possible and move on, putting in place as much protection as possible to prevent it in the future.

Do you think we'll see a true outage report or is this unlikely?


Based on the OP's statements about the company culture, is there an incentive for employees to be honest about their role in the outage? Do you think that enough Very Good Engineers stuck around post-acquisition to perform an independent root-cause analysis?

I've never known Go Daddy to make any public statements without testing them first. Whatever story gets released will likely be the one with the highest conversion rate.


can you clarify what you mean by:

> it's really the only decent tech firm in Phoenix to work at.

and then:

> Horrible, horrible, backward-ass culture.

what part of the company is good if not the culture?


There are many tech (and non-tech) companies in the Phoenix area. Your choice, as always, is to work at a place where the importance of the technology and the talent of your engineers and designers is recognized, or work in PetSmart's IT/Web department (where you're constantly fighting budget and recognition battles).

So should you opt to go into tech, GoDaddy is the largest and best-paying, and has fairly decent benefits. Unless you were a Senior level professional elsewhere, it would likely be in your best interest to work at GD.

That said, it's also a company run by bean counters and marketing. Most (read: all) important decisions regarding what choices the company makes go through a wringer that includes assessing how much direct money comes from an innovation or change. If test A yields or saves $1 and test B yields or saves $2, test B wins, no matter how poor a choice it is in terms of user experience, customer care or any other metrics that relate to long-term customer retention.

It's got tons of middle management, which in itself isn't a bad thing, except that everyone's fighting to own the creation of a product, but no one wants to be accountable for it, should it not go well.

Essentially this creates an environment of fear against innovation, accountability and iteration.


I've only been here less than a year so I haven't been exposed to that. What I've observed is that the culture is very compartmentalized in the sense that there are lots of small teams that each have a specific product they work on. Each team basically has its own culture and way of doing things.

That isn't good for standards/best practices, but it does mean that teams have a lot of leeway to do things pretty much however they want. My team has a ton of flexibility on how we code, test, and deploy, though we have to work within some red tape imposed by other teams that have their own way of doing their things.

In my day to day work I'm actually pretty free to work how I want without being pushed by people up the food chain. That could be because the product I work on is a supplementary product and not a 'core' product, though.


Classic engineer. I'm talking about across a product. It's true that within your core group, you are given leeway, but the buck stops there.


The fact that it is a tech company in Phoenix?


possibly the subcultures that emerge to complain about the backward-ass mainstream within the company, i.e. having something to gripe about with your coworkers is a bonding force.


> "It won't fail because of me."

Leave now. It will be a black mark on your resume if you stay.

Seriously.


> It will be a black mark on your resume if you stay.

It's funny you say this. This is exactly the reason why I left, and exactly what I told HR when I left. Unsurprisingly, they told me that others who had recently left had expressed similar sentiments.


Good move. Having a nose for when an organization will work for you among good organizations. (It will work against you in bad organizations "What?! You didn't stay to help turn it around?" SLP is most likely going to strip out whatever they can and then find somebody to unload it on.)


> Ex-GoDaddy employee here


I should have been more clear that I was writing to those that were not ex.


The DNS is designed to provide resiliency to these kinds of problems by providing the ability to list multiple NS records located in different networks. It is standard practice for top-level domain operators and other high-activity domains to place their name servers in different networks to guard against these kinds of issues. When companies put all their name servers in the same network, they are removing the diversity benefit and creating a single point of failure. Domain operators should take this as a cautionary tale that they shouldn't have all their eggs in one basket and make sure a single network failure couldn't take all their name servers offline.
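The diversity check described above can be sketched in a few lines of Python. Everything concrete here is hypothetical: the name servers, the documentation-range addresses, and the hand-written prefix-to-AS table stand in for data you would normally pull from DNS and from a BGP routing table.

```python
import ipaddress

# Hypothetical table: which announced prefix is originated by which AS.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}


def origin_asn(ip: str):
    """Return the AS number whose prefix contains `ip`, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for prefix, asn in PREFIX_TO_ASN.items():
        if addr in prefix:
            return asn
    return None


def has_network_diversity(ns_ips: dict) -> bool:
    """True if the name servers sit in at least two different known networks."""
    asns = {origin_asn(ip) for ip in ns_ips.values()}
    asns.discard(None)  # addresses we cannot place do not count as diversity
    return len(asns) > 1


# All name servers in one network: a single routing failure can hide them all.
same_network = {"ns1.example.com": "192.0.2.1", "ns2.example.com": "192.0.2.2"}
# Name servers spread over two networks with separate routing policies.
two_networks = {"ns1.example.com": "192.0.2.1", "ns2.example.net": "198.51.100.1"}

print(has_network_diversity(same_network), has_network_diversity(two_networks))
```

This only looks at the network layer; it says nothing about other shared failure modes such as pushing the same bad zone data to every server.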


There are many other single points of failure besides network failure, such as pushing the wrong configuration. In fact it seems to me that it would be rather rare for a multi-homed datacenter to have a network failure.


If you have a routing issue, whether it is due to "corruption" or misconfiguration, having some of your name servers on an entirely different network (i.e. a different AS) with a different routing policy is not going to be affected.


I have to wonder how many extra customers the various third-party DNS services have gained as a direct result of this.

I've just switched to DNSMadeEasy - for anyone concerned about the time involved, they have some cool timesavers like templates you can apply to all of your domains at once. Really makes a difference not having to manually set up individually the entries for Google Apps on 20+ domains.


Semi-OT: Why is it so hard to find yesterday's highly-rated GoDaddy outage discussion? Neither sorting by relevance nor recency nor points will find it. Or maybe there wasn't one?


I don't know why it is hard but I picked it up from my browser history: http://news.ycombinator.com/item?id=4500993


Your confirmation of my sanity is appreciated :-)


someone has hidden it, no idea why though.


hnsearch.com. I love it.


On the outages mailing list[1], Mike Dob (GoDaddy Network Engineering Manager) just added more details, saying:

> It was BGP related and more details should be posted today

[1] http://puck.nether.net/mailman/listinfo/outages


"The service outage was not caused by external influences. It was not a "hack" and it was not a denial of service attack."

So Anonymous0wn3r or whoever was just claiming responsibility for something they had no hand in? The router tables just corrupted themselves?


> So Anonymous0wn3r or whoever was just claiming responsibility for something they had no hand in?

Yeah, with a name/handle like Anonymous0wn3r they sound very trustworthy. If they claimed they had a hand in something, it must be true.


I don't trust Anonymous0wn3r but I distrust GoDaddy just the same. Of course those aren't the only two answers, it could have been external and not Anonymous0wn3r as well.


But didn't he announce the attack before-hand?


Got a link on that?


http://techcrunch.com/2012/09/10/godaddy-outage-takes-down-m...

Not sure if those tweets were actually from before it happened.


I searched for those tweets mentioned in the Techcrunch article (first from GoDaddy and first from AnonymousOwn3r on this topic) and the time stamps are:

* GoDaddy: 10:35 AM - 10 Sep 12

* AnonymousOwn3r: 11:57 AM - 10 Sep 12

So, AnonymousOwn3r does not seem to have announced the attack before it happened.

Links to tweets:

* https://twitter.com/GoDaddy/status/245213898683318272

* https://twitter.com/AnonymousOwn3r/status/245234582205652992


> So Anonymous0wn3r or whoever was just claiming responsibility for something they had no hand in?

Gee, why would an anonymous person on the internet lie?


"We have determined the service outage was due to a series of internal network events that corrupted router data tables."

I'm withholding any judgement on internal vs external involvement till this series of events is defined. (doubt it ever will be)


Their engineer claims it was an issue with BGP (http://permalink.gmane.org/gmane.org.operators.isotf.outages...).

BGPlay (http://bgplay.routeviews.org/) does not show anything abnormal or misconfigured in the BGP default-free table (what the Internet sees). While there could be iBGP issues, like others have stated, there was (intermittent) connectivity by IP during the outage.

It's both bullshit PR and more importantly spreading disinformation to save face. Why?

A security breach would instill customer fear and generate negative press. Customers would leave in droves.

A DoS/DDoS displays that GoDaddy has inadequate infrastructure while competitors such as CloudFlare actually do have it. Furthermore, why would a company that pisses off the Internet be appealing to anyone? Again it will generate negative press, and customers will leave in droves.

Spreading disinformation by claiming it was either a human error or equipment fault? From a company perspective this is actually the best option. Just provide generous service credit to your customers, you may generate positive press, you will gain customer goodwill and regain their confidence. This is GoDaddy's best option.

Until they provide actual details with proof that it was a misconfiguration or hardware fault, I will continue to call bullshit. Too many factors don't add up, especially the publicly available data which monitors the BGP DFT on the Internet.

The two conjectures that seem plausible so far are the SQL injection in their web interface for DNS and/or a DoS/DDoS attack.


I don't know much about hardware at all, but aren't routers fairly simple, time tested pieces of hardware? Can they really corrupt en-masse in this way?


Routers at this level aren't just scale-ups of your home wifi/nat box. They aren't even scale-ups of the simple IP routers for a basic IT data-closet that manages subnets and whatnot (already much more complex by dealing with vlans and subnets and dmz and vpn issues). At the level of a big networking company they are truly complex beasts.

Just at the IP level they have to deal with (at the edges and across substantial WANS) BGP - a notoriously ugly and fragile protocol. Internal routing protocols such as OSPF are equally ugly and prone to breakage. Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their lan, out their backup T1. Other issues are BGP flapping, resulting in scary percentages of lost traffic. This doesn't even cover trickier stuff like routing loops...

Other considerations in big routers are things like ASN identifiers and peering points. Considerations like traffic cost, SLAs and QoS all go in to traffic balancing on such routers. MPLS clouds complicate (and oddly enough simplify) these things as well.

There are also important issues like Anycast, CDN and NAT that largely rely on router tricks and add to the complication.

Finally, on top of all this, is the security concern - you can't just throw a firewall in front of it, as many firewall issues are routing issues, therefore must also be present in the router.

All these layers interact and affect each other. Any given machine can only handle so much traffic and so many decisions, so something that is drawn as a single router on a networking chart may actually be several boxes cascaded to handle the complexity.

Oh yeah, and switches are getting progressively smarter with other rules and weirdnesses that provide horribly leaky abstractions that shouldn't matter to the upstream router, but turn out to add issues to the configuration and overall complexity on top of it all.


Network geek here:

> Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their lan, out their backup T1.

This is what route filters are for. If you peer with someone and they advertise 0.0.0.0/0 or something equally ridiculous, and you accept this as a valid route, then you deserve to fail (and to be given a firm stare if you then proceed to advertise it to other peers).

A similar fail on the part of Telstra (http://bgpmon.net/blog/?p=554) was to blame for much of Australia dropping off the map earlier this year.
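To make the route-filter idea concrete, here is a minimal Python sketch of an inbound filter. It is not real router configuration: the prefix-length bounds and the per-peer allow list are assumed policy values picked for illustration, and the addresses come from the documentation ranges.

```python
import ipaddress

# Assumed policy bounds for IPv4 announcements accepted from a peer.
MIN_PREFIX_LEN = 8   # rejects 0.0.0.0/0 and other absurdly broad routes
MAX_PREFIX_LEN = 24  # rejects very specific routes that bloat the table


def accept_route(prefix: str, peer_allowed: list) -> bool:
    """Accept `prefix` only if it is sanely sized and the peer may announce it."""
    net = ipaddress.ip_network(prefix)
    if not MIN_PREFIX_LEN <= net.prefixlen <= MAX_PREFIX_LEN:
        return False
    # The announcement must fall inside a prefix registered for this peer.
    return any(net.subnet_of(allowed) for allowed in peer_allowed)


# Hypothetical peer that is only supposed to announce one /24.
peer_prefixes = [ipaddress.ip_network("203.0.113.0/24")]

for announced in ("203.0.113.0/24", "0.0.0.0/0", "198.51.100.0/24"):
    verdict = "accept" if accept_route(announced, peer_prefixes) else "reject"
    print(announced, verdict)
```

The default route fails the length check, and the third prefix fails because it is not on the peer's list, which is the "bad or missing route filters" case the comment is complaining about.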

Also:

> Just at the IP level they have to deal with (at the edges and across substantial WANS) BGP - a notoriously ugly and fragile protocol.

I take offence at this. 80% of BGP related issues are due to misconfiguration by a given party, 19% is due to bad or missing route filters and the other 1% is due to bugs in router software. The actual implementation of BGP v4, originally designed back in the early 90's, isn't completely without its issues and behavioural quirks (I'm looking at you, route flaps) but the theory/algorithm behind it is a work of art, and has coped amazingly with explosive growth, and growth that's only going to increase with IPv6. Without it, there would be no HN.


Thanks for your views on BGP - a lot of my knowledge of it comes from post-incident beers with our network grey-beards when I worked at an ISP, so my views are probably somewhat biased. There is nothing like a network explosion at an ISP to get people ranting about BGP, but I'll admit that the ranting is from a place of anger and frustration and largely venting rather than a fair technical assessment.


...and if you don't have up to date backups of your router tables, it will take a long time to recover from an "oops".

Doing the wrong thing to router table(s) is the network equivalent of "sudo rm -rf /".


router tables are stationary woodworking machines in which a vertically oriented spindle of a woodworking router protrudes from the machine table and can be spun at speeds typically between 3000 and 24,000 rpm

excluding static routes (which are then usually advertised to other peers), routING tables are dynamically built and only exist in non-persistent memory.

having up to date backups of router configuration is another matter entirely


uhm... no.


Thanks for the great information, appreciate it.


Routers can get quite complicated.

At a first level, routers will cache information about what machines are on which port of the router. This enables them to not forward a packet to every port on the router (reducing network load). This is normally done using MAC Addresses.
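The per-port caching described above is usually called MAC learning, and it is typically a layer-2 (switch or bridge) function. A toy Python sketch of the idea, with a made-up class name and shortened MAC addresses:

```python
class LearningBridge:
    """Toy model of MAC learning: remember which port each source MAC was seen on."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame should be sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the one it came in on.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is on the same segment as the sender: nothing to forward.
            return []
        return [out_port]


bridge = LearningBridge(ports=[1, 2, 3])
first = bridge.handle_frame(1, "aa:aa", "bb:bb")   # bb:bb unknown, so flood
reply = bridge.handle_frame(2, "bb:bb", "aa:aa")   # aa:aa was learned on port 1
second = bridge.handle_frame(1, "aa:aa", "bb:bb")  # bb:bb is now known on port 2
print(first, reply, second)
```

A stale or wrong entry in a table like this sends frames to the wrong port, which is one way "things work and then stop working" without any link actually being down.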

On more expensive routers, the router can understand IGMP, or multicast and route based on multicast joins. This enables an optimization on simple broadcasts because not every machine on the network needs to see these packets, but multiple machines might want to.


Never mind when you start adding fault-tolerance and failover setups/configs to the mix.

It gets really complex really fast, and even the slightest lapse in attention to extreme detail can be disastrous and a nightmare to troubleshoot.

For instance, I once had a router reboot for some unknown reason, and the firmware/config wasn't properly flashed, so it reverted to the last point release of the firmware. That was enough to cause the failover heartbeat to be constantly triggered, and the master/slave routers just kept failing over to each other and corrupted all the ARP caches. Makes for a lot of fun when things work and then stop working in an inconsistent manner.


They're not that simple due to aggressive speed optimizations like caching frequently used routing information in hardware registers. If the cache gets corrupted due to electronic bogon flux, bad things happen. These events are rare (like once every billion operating hours) so you can imagine it's hard for the router vendors to find & fix them all.

It is a little sad that they didn't try rebooting the routers for so many hours.


> It is a little sad that they didn't try rebooting the routers for so many hours.

This assumes they're being literal and "corrupt" means "a bit randomly flipped" instead of using "corrupt" in the figurative sense of "operator error".


For anyone interested, the person who claimed responsibility for this is tweeting about GoDaddy's response: https://twitter.com/AnonymousOwn3r/status/245568841160196096


I'm amazed at how many seemingly non-technical people are cheering this guy on, or at least seem to think it's "cool" what he's doing.


Just checked his twitter feed, looks like he's attempting to claim source from an old open source project as part of the claimed hack...credibility gone.


Not a good week for claiming credit...


"yet did not result in bringing other DNS services backup"

Can you be more specific?

Which other domain names did you try?

Also, I believe some parts of the world were unaffected by the outage.

I would guess a large majority of GoDaddy customers would not even know this outage occurred. They are "casual" domain name registrants and in some cases "casual" website operators. They registered some names and then never did anything with them. Or they operate a website but it's very low traffic and they rarely think about it. That is only a guess.


This doesn't do anything to explain why it was out for so long. I guess I should expect this type of thing from GoDaddy though, they are mainly a consumer company.


This statement makes a lot of sense. I found it a bit suspicious that Anonymous Own3r tweeted: "When i do some DDOS attack i like to let it down by many days, the attack for unlimited time, it can last one hour or one month". That sounds like he actually has no control over what is happening and is making a statement that is impossible to disprove.


What is their 99.9% SLA liability going to be?


Zero. .001 of 1 year is just under 9 hours of downtime.

http://www.wolframalpha.com/input/?i=.001+*+1+year


Nothing, since 99.9% allows them nearly 9 hours of downtime a year.


Didn't they have more than 9 hours of outage?


I wasn't really paying attention to the outage, but if it was indeed a routing issue, then you shouldn't have been able to reach any godaddy ip address. ICMP/traceroutes would have failed and shown the error.


Why would you say that? Large networks have a ton of routers in them, and even a lot of switches provide routing functionality.

I don't know the details of the environment, but even in smaller systems I've worked on there is a fair bit of hardware separation between various network segments. Complete failure on one part would not affect the others.

For that matter, even a slight corruption in some ARP caches, or stale internal tables, etc., could cause the problems they had... it's not just a complete failure that could cause problems.

And "routing" is such a generic term, when it could really be any number of feature sets that failed; load balancing, source routing configs, etc.


They haven't said where the router issue was. I'd be surprised if their public facing DNS servers are actually storing the data as well - more likely the actual records are in a SQL, LDAP or similar backend on their internal network - which will have a huge internal infrastructure, including many, many routers.


You're forgetting internal routing; e.g. unreachable database server.



