
GoDaddy outage caused by corrupted router tables - jakeludington
http://www.godaddy.com/newscenter/release-view.aspx?news_item_id=410
======
druiid
I find this extremely suspicious (i.e., knowing routers, I call bullshit). The
change to the Verisign anycast DNS service, which I noted yesterday in another
thread, brought godaddy.com back up, yet did not result in bringing other
DNS services back up.

Someone is lying here in my opinion. I hope I'm proven wrong because this is a
terrible excuse for the company to make.

EDIT: And as someone else pointed out, their IP addresses could be pinged,
which further undercuts the routing-issue explanation. More than likely, high
traffic crashed one or more routers (THIS I have seen happen) and the live and
saved configs didn't match. I'd put more money on something like that
happening if it was router related.

~~~
nettdata
Unless you have intimate knowledge of their network topology, and know the
specifics of where those pinged IPs live in that topology and what routes
were used to provide DNS results, you can't say that it wasn't a routing
issue.

"Routing" is a rather generic term when it comes to large networks, and
everything from border routers, firewalls, load balancers, and switches
actually perform routing.

That's especially true (as I've mentioned in another post) when you add fault
tolerance / failover configurations to the mix.

Routing failure doesn't have to be an all-or-nothing thing. There are a number
of ways in which I can see ICMP echo packets working but other traffic not,
especially when you include complexities of source routing, load balancing,
failover, etc.

Even something as "simple" as a poisoned ARP cache in a single box could screw
up the entire internal network and cause the problems they've had, and still
be considered a "routing issue".
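
As a toy sketch of that last failure mode (hypothetical addresses; a minimal
Python model, obviously not their actual setup):

    # One stale/poisoned ARP entry black-holes a single service while
    # everything else on the segment keeps answering pings.
    arp_cache = {
        "10.0.0.1":  "aa:bb:cc:00:00:01",   # gateway: correct MAC
        "10.0.0.53": "de:ad:be:ef:00:00",   # DNS box: poisoned, wrong MAC
    }

    def next_hop_mac(dst_ip):
        # Frames are addressed to whatever MAC the cache holds, right or wrong.
        return arp_cache.get(dst_ip, "(broadcast an ARP request)")

    print("10.0.0.1  ->", next_hop_mac("10.0.0.1"))   # pings still answer
    print("10.0.0.53 ->", next_hop_mac("10.0.0.53"))  # DNS traffic vanishes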

$0.02

~~~
druiid
None of that is necessarily incorrect... but per their news release,
'corrupted router data tables' (their words) were the issue. I can't read too
much into that, but it still doesn't change the fact that DNS for client
domains wasn't resolving for a while after they made their Verisign change,
yet their own website resolved with that change.

You are correct that I don't know the details of their internal network and I
never said otherwise, just that the chain of events and their claims don't
necessarily match up!

~~~
nettdata
I can imagine that they'd understandably prioritize getting their own site up
and running first, as a manual "hack". After all, it's the main page everyone
would be going to for information on what's going on.

After that, coming up with an automated process for migrating what must be a
shit-ton of zone information to another system must have taken some time. I
have no idea what their specific solution was, but I'm fairly confident it
wasn't just a matter of copying over a few zone files. They'd probably have to
do SOME sort of ETL (extract / transform / load) process that would take time
to develop and test, never mind run.
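
For a sense of the shape of that work, a minimal Python sketch (the source
format, field names, and load step are entirely hypothetical):

    # Extract records from the old system, transform them into the target
    # system's schema, and load them: the ETL pass described above.
    src_records = [
        ("example.com.",     3600, "A",  "192.0.2.10"),
        ("www.example.com.", 3600, "A",  "192.0.2.10"),
        ("example.com.",     3600, "MX", "10 mail.example.com."),
    ]

    def transform(rec):
        # Normalize one record into the (hypothetical) target schema.
        name, ttl, rtype, value = rec
        return {"fqdn": name.rstrip("."), "ttl": ttl, "type": rtype,
                "data": value}

    def load(batch):
        for row in batch:
            print(row)  # stand-in for real API calls / bulk import

    load([transform(r) for r in src_records])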

And I can't remember the last time I gave technical information to a PR person
who actually got it 100% technically correct. ;)

My intention wasn't to shit on your point or anything, or in any way defend
GoDaddy and their screwup; I just think it's a bit unrealistic to try to
infer detailed information from a PR release.

In the end, it was technical, they screwed up, and I doubt they'd ever release
a proper, detailed post-mortem of what happened.

~~~
druiid
Heh, yeah. It is a bit difficult to interpret PR speak (and I have had to
correct our guy before).

I think perhaps the takeaway here is to not trust what is being said, go
with your gut... and move any services off GoDaddy ;). Would be nice if, like
Google or Amazon, they would release a real post-mortem. Even if it's an
internal 'uh-oh', I trust companies that are willing to admit to mistakes.

~~~
larrys
"Would be nice if like Google or Amazon they would release a real post-mortem
post."

Possible but highly unlikely. GoDaddy is "old school", which means they will
release as little info as they can get away with and move on. They aren't
interested in the hacker community. Their primary market is SMBs.

------
highfreq
How can they claim 99.999% uptime when they just had several hours of service
outage? I'm not sure how long they've been providing DNS hosting, but the
most generous assumption would be the entire 15 years of their existence.
99.999% allows them about 1.3 hours of outage in 15 years.
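
The arithmetic, as a quick Python check (taking a year as 365.25 days):

    hours_per_year = 365.25 * 24
    print((1 - 0.99999) * hours_per_year)       # ~0.09 hours (~5.3 min) a year
    print((1 - 0.99999) * hours_per_year * 15)  # ~1.3 hours over 15 years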

~~~
mikedouglas
I think what they're trying to say is that 99% of the time, they have 99.999%
uptime.

~~~
JoeAltmaier
Silly: 99% of the time they have 100% uptime.

~~~
dllthomas
60% of the time it works every time?

~~~
JoeAltmaier
Was responding to the silly parent post, claiming similar nonsense.

~~~
dllthomas
Yes, wasn't criticizing, just enjoying.

------
staunch
Based on a long history of working in datacenters, I'd bet someone
misconfigured something and later claimed it was "corrupted" to save their ass
- happens all the time. It's just so easy to make very confusing and
damaging mistakes in a complicated network.

I wouldn't be surprised to hear that GoDaddy's corporate culture wouldn't
respond well to someone admitting to a mistake this damaging.

~~~
dclowd9901
Ex-GoDaddy employee here:

Everyone there is on pins and needles at this point. Since Silver Lake's
investment in the company, many hatchets have dropped on jobs, and it's really
the only decent tech firm in Phoenix to work at.

My guess is that there is some hiney covering going on with this explanation,
and the interim CEO has little cause to care too much about responsibility,
since he'll likely be out before year's end anyway.

I really feel for the folks who work there. Many, many talented people who
don't have an inch of room to make a mistake. When I was last there, they had
just released an internal communication about the new company motto: "It won't
fail because of me." Horrible, horrible, backward-ass culture.

~~~
citricsquid
can you clarify what you mean by:

> it's really the only decent tech firm in Phoenix to work at.

and then:

> Horrible, horrible, backward-ass culture.

what part of the company is good if not the culture?

~~~
dclowd9901
There are many tech (and non-tech) companies in the Phoenix area. Your choice,
as always, is to work at a place where the importance of the technology and
the talent of your engineers and designers is recognized, or work in
PetSmart's IT/Web department (where you're constantly fighting budget and
recognition battles).

So should you opt to go into tech, GoDaddy is the largest and best-paying, and
has fairly decent benefits. Unless you were a senior-level professional
elsewhere, it would likely be in your best interest to work at GD.

That said, it's also a company run by bean counters and marketing. Most (read:
all) important decisions the company makes go through a wringer that includes
assessing how much _direct_ money comes from an innovation or change. If test
A yields or saves $1 and test B yields or saves $2, test B wins, no matter how
poor a choice it is in terms of user experience, customer care, or any other
metric that relates to long-term customer retention.

It's got tons of middle management, which in itself isn't a bad thing, except
that everyone's fighting to own the creation of a product, but no one wants to
be accountable for it should it not go well.

Essentially, this creates an environment of fear that works against
innovation, accountability, and iteration.

~~~
Osiris
I've been here less than a year, so I haven't been exposed to that. What
I've observed is that the culture is very compartmentalized, in the sense that
there are lots of small teams that each have a specific product they work on.
Each team basically has its own culture and way of doing things.

That isn't good for standards/best practices, but it does mean that teams have
a lot of leeway to do things pretty much however they want. My team has a ton
of flexibility in how we code, test, and deploy, though we have to work within
some red tape imposed by other teams that have their own ways of doing
things.

In my day to day work I'm actually pretty free to work how I want without
being pushed by people up the food chain. That could be because the product I
work on is a supplementary product and not a 'core' product, though.

~~~
dclowd9901
Classic engineer. I'm talking about how things work across a product. It's
true that within your core group you are given leeway, but the buck stops
there.

------
kijeda
DNS is designed to provide resilience against these kinds of problems by
allowing multiple NS records located in different networks. It is standard
practice for top-level domain operators and other high-activity domains to
place their name servers in different networks to guard against exactly these
kinds of issues. When companies put all their name servers in the same
network, they remove that diversity benefit and create a single point of
failure. Domain operators should take this as a cautionary tale: don't keep
all your eggs in one basket, and make sure a single network failure can't
take all your name servers offline.
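
A rough way to audit your own domains for this, sketched with the dnspython
package (resolve() is the dnspython 2.x API; a shared /24 is a crude stand-in
for "same network"):

    import dns.resolver

    def ns_networks(domain):
        # Collect the /24 of each name server's address as a rough proxy
        # for which network it lives in.
        nets = set()
        for ns in dns.resolver.resolve(domain, "NS"):
            for addr in dns.resolver.resolve(str(ns.target), "A"):
                nets.add(".".join(str(addr).split(".")[:3]))
        return nets

    nets = ns_networks("example.com")
    if len(nets) < 2:
        print("warning: all name servers appear to share one network:", nets)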

~~~
andreasvc
There are many other single points of failure besides network failure, such as
pushing the wrong configuration. In fact it seems to me that it would be
rather rare for a multi-homed datacenter to have a network failure.

~~~
kijeda
If you have a routing issue, whether it is due to "corruption" or
misconfiguration, name servers on an entirely different network (i.e., a
different AS) with a different routing policy are not going to be affected.

------
mootothemax
I have to wonder how many extra customers the various third-party DNS services
have gained as a direct result of this.

I've just switched to DNSMadeEasy - for anyone concerned about the time
involved, they have some cool timesavers like templates you can apply to all
of your domains at once. It really makes a difference not having to set up
the Google Apps entries individually on 20+ domains.
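
The same idea is easy to script against any provider with an API. A minimal
Python sketch, where create_record is a hypothetical stand-in for whatever
API or bulk-import format your provider offers, and where the MX host list
should be checked against Google's current documentation:

    GOOGLE_MX = [
        (1,  "aspmx.l.google.com."),
        (5,  "alt1.aspmx.l.google.com."),
        (5,  "alt2.aspmx.l.google.com."),
        (10, "aspmx2.googlemail.com."),
        (10, "aspmx3.googlemail.com."),
    ]

    def create_record(domain, rtype, value, priority=None):
        print(domain, rtype, priority, value)  # stand-in for a real API call

    for domain in ["example.com", "example.net"]:  # your 20+ domains here
        for prio, host in GOOGLE_MX:
            create_record(domain, "MX", host, priority=prio)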

------
SilasX
Semi-OT: Why is it so hard to find yesterday's highly-rated GoDaddy outage
discussion? Neither sorting by relevance nor recency nor points will find it.
Or maybe there wasn't one?

~~~
mongol
I don't know why it is hard but I picked it up from my browser history:
<http://news.ycombinator.com/item?id=4500993>

~~~
SilasX
Your confirmation of my sanity is appreciated :-)

------
oasisbob
On the outages mailing list[1], Mike Dob (GoDaddy Network Engineering Manager)
just added more details, saying:

> It was BGP related and more details should be posted today

[1] <http://puck.nether.net/mailman/listinfo/outages>

------
hbz
"The service outage was not caused by external influences. It was not a "hack"
and it was not a denial of service attack."

So Anonymous0wn3r or whoever was just claiming responsibility for something
they had no hand in? The router tables just corrupted themselves?

~~~
sendos
> _So Anonymous0wn3r or whoever was just claiming responsibility for something
> they had no hand in?_

Yeah, with a name/handle like Anonymous0wn3r they sound very trustworthy. If
they claimed they had a hand in something, it must be true.

~~~
andreasvc
But didn't he announce the attack beforehand?

~~~
46Bit
Got a link on that?

~~~
andreasvc
[http://techcrunch.com/2012/09/10/godaddy-outage-takes-
down-m...](http://techcrunch.com/2012/09/10/godaddy-outage-takes-down-
millions-of-sites/?utm_source=dlvr.it&utm_medium=twitter)

Not sure if those tweets were actually from before it happened.

~~~
sendos
I searched for the tweets mentioned in the TechCrunch article (the first from
GoDaddy and the first from AnonymousOwn3r on this topic), and the timestamps
are:

* GoDaddy: 10:35 AM - 10 Sep 12

* AnonymousOwn3r: 11:57 AM - 10 Sep 12

So, AnonymousOwn3r does not seem to have announced the attack before it
happened.

Links to tweets:

* <https://twitter.com/GoDaddy/status/245213898683318272>

* <https://twitter.com/AnonymousOwn3r/status/245234582205652992>

------
mbell
"We have determined the service outage was due to a series of internal network
events that corrupted router data tables."

I'm withholding judgement on internal vs. external involvement until this
"series of events" is defined. (I doubt it ever will be.)

------
dingdingpop
Their engineer claims it was an issue with BGP
([http://permalink.gmane.org/gmane.org.operators.isotf.outages...](http://permalink.gmane.org/gmane.org.operators.isotf.outages/4279)).

BGPlay (<http://bgplay.routeviews.org/>) does not show anything abnormal or
misconfigured in the BGP default-free table (what the Internet sees). And
while there could have been iBGP issues, as others have stated, there was
(intermittent) connectivity by IP during the outage.

It's bullshit PR and, more importantly, disinformation spread to save face.
Why?

A security breach would instill customer fear and generate negative press.
Customers would leave in droves.

A DoS/DDoS would show that GoDaddy lacks adequate infrastructure while
competitors such as CloudFlare actually have it. Furthermore, why would a
company that pisses off the Internet be appealing to anyone? Again, it would
generate negative press, and customers would leave in droves.

Spreading disinformation by claiming it was human error or an equipment
fault? From a company perspective this is actually the best option. Just
provide _generous_ service credits to your customers: you may generate
positive press, and you will gain customer goodwill and regain their
confidence.

Until they provide actual details with proof that it was a misconfiguration or
hardware fault, I will continue to call bullshit. Too many factors don't add
up, especially the publicly available data from monitors of the Internet's
BGP default-free table.

The two conjectures that seem plausible so far are an SQL injection in their
web interface for DNS and/or a DoS/DDoS attack.

------
TomGullen
I don't know much about hardware at all, but aren't routers fairly simple,
time-tested pieces of hardware? Can they really corrupt en masse in this way?

~~~
sophacles
Routers at this level aren't just scale-ups of your home wifi/NAT box. They
aren't even scale-ups of the simple IP routers in a basic IT data closet that
manage subnets and whatnot (already much more complex, what with VLAN, subnet,
DMZ, and VPN issues). At the level of a big networking company they are truly
complex beasts.

Just at the IP level they have to deal with (at the edges and across
substantial WANs) BGP - a notoriously ugly and fragile protocol. Internal
routing protocols such as OSPF are equally ugly and prone to breakage. Many
are the tales of some small company misconfiguring their edge routers slightly
(say, a one-character typo) and having the entire internet route through their
T1, across their LAN, and out their backup T1. Other issues are BGP flapping,
resulting in scary percentages of lost traffic. This doesn't even cover
trickier stuff like routing loops...
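
As a toy model of that class of accident (hypothetical documentation
prefixes; real BGP layers path attributes, filters, and policy on top of
longest-prefix match):

    import ipaddress

    announcements = [
        ("203.0.113.0/24", "rightful owner"),
        # The typo/leak: a more-specific of someone else's block.
        ("203.0.113.0/25", "small site's T1"),
    ]

    def best_route(dst):
        ip = ipaddress.ip_address(dst)
        matches = [(ipaddress.ip_network(pfx), who)
                   for pfx, who in announcements
                   if ip in ipaddress.ip_network(pfx)]
        # Longest-prefix match wins; more-specifics attract the traffic.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

    print(best_route("203.0.113.10"))   # -> small site's T1 (the /25 leak)
    print(best_route("203.0.113.200"))  # -> rightful owner (outside the /25)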

Other considerations in big routers are things like AS numbers and peering
points. Considerations like traffic cost, SLAs, and QoS all go into traffic
balancing on such routers. MPLS clouds complicate (and, oddly enough,
simplify) these things as well.

There are also important issues like anycast, CDNs, and NAT that largely rely
on router tricks and add to the complication.

Finally, on top of all this, is the security concern - you can't just throw a
firewall in front of it, as many firewall concerns are routing concerns and
therefore must also be handled in the router.

All these layers interact and affect each other. Any given machine can only
handle so much traffic and so many decisions, so something that is drawn as a
single router on a networking chart may actually be several boxes cascaded to
handle the complexity.

Oh yeah, and switches are getting progressively smarter with other rules and
weirdnesses that provide horribly leaky abstractions that shouldn't matter to
the upstream router, but turn out to add issues to the configuration and
overall complexity on top of it all.

~~~
gvb
...and if you don't have _up to date_ backups of your router tables, it will
take a _long_ time to recover from an "oops".

Doing the wrong thing to router table(s) is the network equivalent of "sudo rm
-rf /".

~~~
darkr
router tables are stationary woodworking machines in which a vertically
oriented spindle of a woodworking router protrudes from the machine table and
can be spun at speeds typically between 3,000 and 24,000 rpm

excluding static routes (which are then usually advertised to other peers),
routING tables are dynamically built and only exist in non-persistent memory.

having up-to-date backups of router configuration is another matter entirely

------
kevincennis
For anyone interested, the person who claimed responsibility for this is
tweeting about GoDaddy's response:
<https://twitter.com/AnonymousOwn3r/status/245568841160196096>

~~~
stef25
I'm amazed at how many seemingly non-technical people are cheering this guy
on, or at least seem to think what he's doing is "cool".

------
xtdx
Not a good week for claiming credit...

------
wethesheeple
"yet did not result in bringing other DNS services backup"

Can you be more specific?

Which other domain names did you try?

Also, I believe some parts of the world were unaffected by the outage.

I would guess a large majority of GoDaddy customers would not even know this
outage occurred. They are "casual" domain name registrants and in some cases
"casual" website operators. They registered some names and then never did
anything with them. Or they operate a website but it's very low traffic and
they rarely think about it. That is only a guess.

------
oomkiller
This doesn't do anything to explain why it was out for so long. I guess I
should expect this type of thing from GoDaddy, though; they are mainly a
consumer company.

------
goldeneye
This statement makes a lot of sense. I found it a bit suspicious that
AnonymousOwn3r tweeted: "When i do some DDOS attack i like to let it down by
many days, the attack for unlimited time, it can last one hour or one month",
which sounds like he actually has no control over what is happening and is
making a statement that is impossible to disprove.

------
ww520
What is their 99.9% SLA liability going to be?

~~~
RKearney
Nothing, since 99.9% allows them nearly 9 hours of downtime a year.
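
Quick check in Python:

    print((1 - 0.999) * 365.25 * 24)  # ~8.77 hours of allowed downtime a year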

~~~
ww520
Didn't they have more than 9 hours of outage?

------
overworkedasian
I wasn't really paying attention to the outage, but if it was indeed a routing
issue, then you shouldn't have been able to reach any GoDaddy IP address.
ICMP/traceroutes would have failed and shown the error.

~~~
nettdata
Why would you say that? Large networks have a ton of routers in them, and even
a lot of switches provide routing functionality.

I don't know the details of the environment, but even in smaller systems I've
worked on there is a fair bit of hardware separation between various network
segments. Complete failure on one part would not affect the others.

For that matter, even a slight corruption in some ARP caches, or stale
internal tables, etc., could cause the problems they had... it's not only a
complete failure that could do it.

And "routing" is such a generic term, when it could really be any number of
feature sets that failed; load balancing, source routing configs, etc.

