
DNSSEC Outage on www.cloudflare.com – 2019-03-21 - wglb
https://ianix.com/pub/dnssec-outages/20190321-www.cloudflare.com/
======
prdonahue
This blip was caused by configuration changes we made to our marketing site
(www), which is distinct from any control plane (api/dash) or edge services
provided to our customers.

Specifically, we were breaking out our "www" subdomain as a separately managed
zone, and some of the configuration wasn't properly migrated.

This resolution failure was regrettable, but there was no impact to the
service we provide for our customers.
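
For reference, this kind of breakage is visible from the outside because
validating resolvers SERVFAIL while non-validating ones keep answering. A
quick sketch using the dnspython library (an illustration of the failure
signature, not a reproduction of the incident):

    import dns.resolver

    # Google's 8.8.8.8 validates DNSSEC, so during the incident it would
    # have returned SERVFAIL for www.cloudflare.com while non-validating
    # resolvers answered normally.
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = ["8.8.8.8"]
    try:
        answers = r.resolve("www.cloudflare.com", "A")
        print([a.address for a in answers])
    except dns.resolver.NoNameservers:
        print("SERVFAIL -- the signature of a broken DNSSEC chain")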

------
ohyeshedid
Somewhere tptacek is in tears, and laughing.

~~~
masklinn
[https://twitter.com/tqbf/status/1116112320567087104](https://twitter.com/tqbf/status/1116112320567087104)

------
dsl
We really need to push back against centralization on a small number of
providers. As we have seen with previous Cloudflare issues and the big Dyn
incident, a single issue can basically make the internet unusable.

~~~
therealmarv
I'm also really worried about that. (It's just a feeling, but..) for a
service as big as Cloudflare, I think it may not be as failure-proof and
redundantly clustered as it should be for a massive DNS provider.

I myself pay for a smaller DNS provider like CloudDNS, which offers more
redundancy in NS servers (4 instead of Cloudflare's 2), rather than going
with Cloudflare's free tier.

~~~
rdl
Cloudflare is one control plane, but it's 1) anycast 2) over 100 locations 3)
multiple servers per site. The SPOF with Cloudflare is either your account
getting hacked, or (much less likely) a CF control plane/configuration outage.

Someone could probably build a better DNS provider than Cloudflare, at least
for their specific site, but it would be difficult. I don't know of any other
DNS provider I'd consider more robust, particularly against high load, DoS,
etc.

(Disclaimer: worked for Cloudflare 2014-2016)

~~~
zzzcpan
This is a rather weird claim. Getting your account hacked is the least of
your worries when every single thing is maintained by a single organization,
a single team, on a single autonomous system, with lots of reliance on single
hardware and software components. This is basically three-nines-level stuff,
with hours of downtime per year on average. It's relatively easy to do better
than that with a do-it-yourself CDN that uses DNS to route across multiple
independent hosting providers for edge nodes, getting to the four- or
five-nines level. (And DNS by itself is not worth talking about, because it
has built-in redundancy; anything will always be more reliable than a
centralized DNS provider.)
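
A minimal sketch of that DNS-routed multi-provider idea: health-check edge
nodes hosted at independent providers and publish only the healthy ones as A
records. The IPs below are placeholders from the documentation ranges, and
the actual record update would go through whatever API your DNS provider
exposes (not shown):

    import socket

    # Hypothetical edge nodes at independent hosting providers
    # (placeholder IPs from the TEST-NET documentation ranges).
    EDGE_NODES = {
        "provider-a": "192.0.2.10",
        "provider-b": "198.51.100.20",
        "provider-c": "203.0.113.30",
    }

    def healthy(ip, port=443, timeout=3):
        """A node is healthy if it accepts a TCP connection."""
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Publish only the healthy nodes as A records; resolvers then
    # round-robin clients across the surviving providers.
    alive = [ip for _, ip in EDGE_NODES.items() if healthy(ip)]
    print("A records to publish:", alive)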

~~~
viraptor
> with lots of reliance on single hardware and software components

Are you saying you have inside knowledge and that's the architecture? (I'm
guessing no)

~~~
peterwwillis
I've never seen any company use completely different software or hardware for
each redundant instance of a service, so it's a pretty good bet CloudFlare
isn't doing this either.

~~~
close04
That would imply a SPOF. One component of any kind that can take down the
whole thing. The only such “component” I can think of is a human, not a
technical one.

~~~
zzzcpan
I meant a single kind of component, not one thing handling everything.

If you use Juniper routers, for example, a Juniper-specific software or
hardware problem, or a mistaken assumption about Juniper specifics, could
take down all routers and an entire network.

~~~
close04
Usually such an issue wouldn’t take out _every_ component of a type unless
user error is involved, like deploying the wrong policy to all devices. Which
again comes back to humans being the weakest link. I can’t really imagine a
particular hardware problem that would kill _all_ components at once. That
being said, I probably haven’t experienced all possible problems so far :).

~~~
peterwwillis
Sure, humans are the weakest link, who cares? Regardless of the cause of the
failure, catastrophic cascades are more likely if the systems and operations
involved are homogeneous. That's why evolution prefers diversity.

~~~
close04
_Natural_ evolution. Because it follows vastly different rules. I still fail
to see any example of how one “cascade failure” takes down all the routers in
an infrastructure like Cloudflare’s. I can think of a dozen ways a
non-homogeneous system can fail (like incompatible updates and configs,
confusion between component types, etc.). In the end you usually end up with
massive expenditure that introduces just as many problems as it solves.

You may be onto something that nobody else spotted but so far the arguments
are too generic (“failure”) or incompatible with the premise (rules of natural
evolution applied to very artificial systems).

Don’t get me wrong, I’m happy to learn something new and understand a
different point of view but so far the arguments don’t help me.

~~~
zzzcpan
I actually used that example because they had this exact problem in the
past; here's the link:
[https://blog.cloudflare.com/todays-outage-post-mortem-82515/](https://blog.cloudflare.com/todays-outage-post-mortem-82515/)

~~~
peterwwillis
And Google had a similar issue recently where they didn't test their firmware
upgrade auto rollback, so a boned firmware upgrade bricked a region until they
could manually recover it. If they had rolled out to all regions at once,
_and_ all the devices were the same, it would have been chaos.

Often the only thing preventing mass upgrades from becoming mass outages is
some part of the network just happens to work differently.

~~~
close04
Well... sure. Just as long as you keep in mind that a "not really tested
[something]" will very easily get replicated on multiple vendor devices. It's
not like they did it because they hate Vendor1 but will make sure to test all
the corner cases with Vendor2. Some scenarios will by design ensure you can't
have all of them fail because they are not susceptible to the same failure
modes. But to compensate you introduce additional failure modes (especially
human error) by using a heterogeneous environment. Not to mention the expense.

I was actually hoping for an interesting human-independent failure. A
cosmic-ray-induced bit flip that would only occur on a particular device. :)
Human error can overcome any setup.

------
bluejekyll
"why should a DS record for www.cloudflare.com matter?!" \- that was my exact
question when I looked at the output from the DNSSEC Debugger.

Now I'm curious as to why www has NS records...

(DS records are used to tie DNSKEYs to a child zone from the parent zone).
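
For anyone who wants to check a chain like this by hand, here's a small
sketch using the dnspython library; it recomputes each parent-side DS digest
from the child's DNSKEYs and flags any DS with no match (it assumes both
record sets exist; resolve() raises otherwise):

    import dns.dnssec
    import dns.name
    import dns.resolver

    name = dns.name.from_text("www.cloudflare.com")

    # The DS lives in the PARENT zone and commits to a DNSKEY the child
    # zone must serve; a DS with no matching DNSKEY makes the zone bogus.
    ds_rrset = dns.resolver.resolve(name, "DS")
    dnskey_rrset = dns.resolver.resolve(name, "DNSKEY")

    digests = {1: "SHA1", 2: "SHA256", 4: "SHA384"}

    for ds in ds_rrset:
        algo = digests.get(ds.digest_type)
        matched = any(
            algo and dns.dnssec.make_ds(name, key, algo) == ds
            for key in dnskey_rrset
        )
        print(f"DS key_tag={ds.key_tag}:",
              "matches a child DNSKEY" if matched else "NO match -> bogus")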

~~~
prdonahue
It has NS records because that's how the owner of cloudflare.com (within the
Cloudflare Dashboard) delegates control of www.cloudflare.com to another team.
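
(The delegation is easy to see with an NS query; a quick sketch, again with
dnspython, assuming the delegation is still in place:)

    import dns.resolver

    # The zone cut means www.cloudflare.com answers NS queries with its
    # own nameserver set, separate from cloudflare.com's.
    for ns in dns.resolver.resolve("cloudflare.com", "NS"):
        print("parent:", ns.target)
    for ns in dns.resolver.resolve("www.cloudflare.com", "NS"):
        print("child: ", ns.target)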

~~~
bluejekyll
Ah. That makes a lot of sense, thanks.

------
iofiiiiiiiii
The title gives the impression that this was an outage of Cloudflare's DNSSEC
service, whereas in reality the Cloudflare website was down due to a DNSSEC
error. Ideally the title should be clarified to make it obvious that
Cloudflare customers were not affected.

------
localhostdotdev
The time doesn't match exactly, but this might be it:
[https://www.cloudflarestatus.com/incidents/3k7gtg4fml1p](https://www.cloudflarestatus.com/incidents/3k7gtg4fml1p)
([https://www.cloudflarestatus.com/history](https://www.cloudflarestatus.com/history)
for the list)

------
gscott
The Cloudflare Android 1.1.1.1 VPN app was down most of the day.

