
Cloudflare outage on July 17, 2020 - tomklein
https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/
======
QuentinM
Head of DevOps at a major financial exchange where latency & resiliency are at
the heart of our business, and yes, we pay Cloudflare millions. I see two
things here:

# Just be ready

Most definitely not the first time Cloudflare has had trouble; just like any
other system, it will fail eventually. If you're complaining about the outage,
ask yourself: why weren't you prepared for this eventuality?

Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to,
say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare
seems problematic, flip it to 0% Cloudflare / 100% internal to bypass
Cloudflare's infrastructure completely. This should be tested periodically to
ensure that your backends are able to scale and take the load without shedding
it due to the lack of a CDN.
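
Something like the sketch below, assuming your zone lives with a DNS provider that supports weighted records (Route 53 via boto3 is used purely as an illustration; the zone ID, hostnames, and targets are made up):

    import boto3  # any provider with weighted records works; Route 53 is just an example

    route53 = boto3.client("route53")

    def weighted_cname(set_id, weight, target):
        # One weighted CNAME record set; the short TTL is what lets a flip take effect quickly.
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",  # hypothetical hostname
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        }

    def set_cdn_split(zone_id, cloudflare_weight, origin_weight):
        # Repoint www between the CDN and the internal load balancer.
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [
                weighted_cname("cdn", cloudflare_weight, "www.example.com.cdn.cloudflare.net."),
                weighted_cname("origin", origin_weight, "origin-lb.example.com."),
            ]},
        )

    # Normal operation: 99% Cloudflare, 1% origin. During a Cloudflare incident: 0% / 100%.
    # set_cdn_split("Z0HYPOTHETICAL", 99, 1)
    # set_cdn_split("Z0HYPOTHETICAL", 0, 100)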

# Management practices

Cloudflare's core business is networking. It actually embarrasses me to see
that Cloudflare YOLO'd a BGP change in a Juniper terminal, without peer review
and without a proper administration dashboard exposing safeguarded operations,
a simulation engine, and so on. In particular, re-routing traffic / bypassing
PoPs must be a frequent task at their scale; how can that not be automated so
as to avoid human mistakes?

If you look at the power rails of serious data centers out there, you will
quickly notice that those systems, although built 3x for the purpose of still
being redundant during maintenance periods, are heavily safeguarded and
automated. While technicians often have to replace power elements, the
maintenance access is highly restricted, with unsafe functions tiered behind
physical restrictions. An example of a common function that's safeguarded is
the automatic denial of an input command that would shift electrical load onto
lines beyond their designed capacity - which could happen by mistake if the
technician made a bad assumption (e.g. the load-sharing line is up while it's
actually down) or if the assumption was invalidated since the last check (e.g.
the load-sharing line was up when checked but went down later - milliseconds
before the input, even).

~~~
deathanatos
> _and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1%
> your internal load balancer. The minute Cloudflare seems problematic, make
> it 0% 100% to bypass Cloudflare’s infrastructure completely._

Except if you're using CF for DNS service, this wouldn't have worked, as both
CF's website & DNS servers were impacted by the outage.

~~~
txcwpalpha
That can't be possible, CF's website explicitly says that their DNS is "always
available" with "unparalleled redundancy and 100% uptime"! ;)

In all seriousness, I wonder if they are going to have to change the marketing
on the site now...

[https://www.cloudflare.com/dns/](https://www.cloudflare.com/dns/)

~~~
esperent
If they are truthful it will now probably change to 99.997% uptime or similar.
I expect that's still good compared to many DNS providers.

~~~
belorn
Some DNS providers only have one name server, which is naturally bad.

However, it is a bit bad of CF that a single configuration error can bring all
the slave servers down. It means that they have no redundancy in terms of BGP
mistakes. Customers of CF who want to avoid this would benefit from adding an
additional slave server outside of CF's hands.

Zonemaster (a DNS sanity-checking tool) actually complains about CF-hosted
domain names because of the lack of AS redundancy. The outage yesterday
demonstrates nicely why that is a concern and why one should care.
[https://zonemaster.iis.se/?resultid=7d1fab165987e195](https://zonemaster.iis.se/?resultid=7d1fab165987e195)

~~~
jontro
Same goes for Route 53 too, unfortunately.

------
redm
CloudFlare is a good company and everyone has outages. IMHO the post-mortems
they post are not only some of the best I've read from a big company, but they
are produced quickly.

I only wish they could update cloudflarestatus.com more quickly. Shouldn't
there be some mechanism to update it immediately when there is an incident?
When the entire internet knows you're down and your status page says "All
Systems GO!", it reflects very poorly on them.

~~~
aeyes
Their comment regarding the late update here:
[https://news.ycombinator.com/item?id=23878496](https://news.ycombinator.com/item?id=23878496)

> Because of the way we securely connect to StatusPage.io from most locations
> where our team is based. The traffic got blackholed in ATL, keeping us from
> updating it.

~~~
nostrebored
Circular dependencies like this are not a good look when the core of your
business is networking...

~~~
aeyes
Yep, I was joking with our rep that them using Cloudflare Access for their
internal services sounds like a problem waiting to happen.

Guess I wasn't wrong; they might even have lost access to internal monitoring
systems, which is pretty unfortunate in such a situation. If you ask them about
Cloudflare Access, they will happily tell you that it was built for internal
tool access and that they use it for everything; only later did they go on to
sell it as a product.

~~~
dylan604
An example of where dogfooding your own product is actually a bad idea?

~~~
microcolonel
While they could dogfood the product, status monitoring systems should be
separated from your bread-and-butter product's failures. If you are in the
business of messing with BGP, then the BGP that controls the routes that let
you report outages should not be the same one you are messing with regularly,
or at least, there should be redundancy.

------
eastdakota
Here's our blog post detailing what happened today and the mitigations we've
put in place to ensure it doesn't happen again in the future:
[https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/](https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/)

~~~
txcwpalpha
I mean this in only a slightly judgemental way: What kind of change
management/testing is going on over there at CF? This is not the first time
that someone at CF has made a hasty config change that brought down a
significant part of the network.

Now I'll admit that proper change management won't catch _every_ issue, but
the issue described in this post seems like something that should have been
caught. It's a little worrisome that a company that so much of the internet
relies on is apparently playing fast and loose with major config changes. The
changes you describe in the post-mortem sound like they will fix the immediate
problem and possibly prevent future occurrences of this exact same problem,
but what about making broader changes about how you deploy these things?

On top of that, this single config change brought down not only the CDN, but
_both_ 1.1.1.1 and the secondary 1.0.0.1. Was this type of failure never
tested, or...? What's the point in having both if they both go down at the
same time?

~~~
rohansingh
What makes you think that whoever would be responsible for approving the
change would somehow know better and catch the issue?

~~~
teraflop
The fact that such a critical problem can only be detected by a human noticing
it is, itself, part of the problem. Having multiple people look at the change
instead of just one is a little bit better, but it's still just a band-aid.
Ideally, there would be automated processes in place that could prevent and/or
mitigate this kind of thing. (If we were talking about software instead of
configuration, how would you view an organization that required every commit
to be reviewed, but had no automated tests?)

One possibility would be to parse the config file and do a rudimentary
simulation, taking into account traffic volumes, and warn the user if the
result would be expected to overload any routes.
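
A toy sketch of that kind of check (link names, capacities, and traffic figures are entirely made up):

    # Project per-link load for a proposed routing change and refuse the commit
    # if any link would exceed its capacity.
    LINK_CAPACITY_GBPS = {"atl-backbone": 100.0, "iad-backbone": 400.0}

    def projected_load(route_to_link, traffic_gbps):
        # Sum the traffic each prefix would place on the link the change sends it to.
        load = {link: 0.0 for link in LINK_CAPACITY_GBPS}
        for prefix, link in route_to_link.items():
            load[link] += traffic_gbps.get(prefix, 0.0)
        return load

    def validate_change(route_to_link, traffic_gbps):
        # Return a warning for every link projected to be overloaded.
        return [
            f"{link}: projected {gbps:.0f} Gbps exceeds {LINK_CAPACITY_GBPS[link]:.0f} Gbps"
            for link, gbps in projected_load(route_to_link, traffic_gbps).items()
            if gbps > LINK_CAPACITY_GBPS[link]
        ]

    proposed = {"198.51.100.0/24": "atl-backbone", "203.0.113.0/24": "atl-backbone"}
    observed = {"198.51.100.0/24": 90.0, "203.0.113.0/24": 80.0}
    for warning in validate_change(proposed, observed):
        print("refusing commit:", warning)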

Another possibility would be to do something a bit smarter than just
instantly, blindly propagating a change to every peer at the same time. If the
bad routes had been incrementally rolled out to other routers, there might
have been time for someone to notice that the metrics looked abnormal before
Atlanta became overloaded. (I don't know whether that's feasible with BGP, or
if it would require a smarter protocol, but it seems like it would be worth
looking into.)
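
A rough sketch of that staged-rollout idea (router names, the batch size, and both placeholder functions are invented; in practice they would be real config-push and monitoring calls):

    import time

    ROUTERS = ["atl1", "atl2", "iad1", "iad2", "ord1", "ord2"]

    def error_rate(router):
        # Placeholder for a real metrics query (e.g. packet loss or 5xx rate).
        return 0.001

    def apply_change(router, change):
        # Placeholder for pushing the routing change to one router.
        print(f"applying {change!r} to {router}")

    def staged_rollout(change, batch_size=2, max_error_rate=0.01):
        # Push the change to a few routers at a time and stop if metrics regress,
        # instead of propagating it everywhere at once.
        baseline = {r: error_rate(r) for r in ROUTERS}
        for i in range(0, len(ROUTERS), batch_size):
            for router in ROUTERS[i:i + batch_size]:
                apply_change(router, change)
            time.sleep(60)  # let metrics settle before judging the batch
            if any(error_rate(r) > max(baseline[r], max_error_rate) for r in ROUTERS):
                print("metrics regressed; halting rollout for human review")
                return False
        return True

    # staged_rollout("hypothetical routing change")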

Finally, it seems like if a config change is made to a router and it
immediately crashes, there should be systems that help to correlate those two
events, so that it doesn't take half an hour to identify and revert the
change.

------
katzgrau
I was on a call with an investor and an employee mouthed to me, silently,
"everything is down!"

Immediate hot flash.

After I got off the call (thank god he had an appointment to run to), I
checked it out. Our internal dashboards were all green so we realized it was a
DNS issue pretty quickly.

Since we couldn't get into Cloudflare we searched Twitter and realized it was
their issue and I stopped worrying.

One of the benefits of CF and other major internet vendors is that when
they're down, you can kind of shrug it off and tell your customers to wait a
bit. Not so if you're using a smaller/unknown CDN company.

~~~
jzawodn
Why did you stop worrying?

~~~
ksec
You have a bigger name to blame. It is much easier to tell a customer that
Amazon is down, or Cloudflare is down, than to explain something the customer
doesn't want to hear.

And if everyone else on AWS / CF is down as well, then it is no one's fault.
We all keep calm and just wait it out.

~~~
LaserToy
Your customer doesn't care, as they don't have a business relationship with
Cloudflare. Your fault is in not having a plan B.

~~~
peteretep
I was CTO at a SaaS company with a whole host of companies you've heard of
relying on us as a core part of their business. They absolutely would and did
accept "AWS was down, soz" as an excuse, whereas they were much spikier about
fuckups where we were to blame.

Tangential: nobody got fired for buying IBM

~~~
LaserToy
I was in eng leadership at the largest gaming network. Our customers (gamers)
absolutely didn't care about the reasons they couldn't play. It was always our
fault, as we were responsible for the tech choices, not them. And I believe
that's the way it should be. If I'm selling you a service, it is my
responsibility to make sure it is reliable and not to blame it on something
that is under my control (like cloud providers). Of course I can't fix the
local internet going down, but I can make sure I'm not married to one vendor.

~~~
katzgrau
I'd argue that you're probably holding yourself to a standard that is more or
less unachievable in such an interdependent world. It's idealistic, and
idealism is a square peg in the funny-shaped hole of reality.

Taking accountability and having backup plans are extremely important, but you
simply can't remove every last shred of dependence. You eventually have to
accept that there are things that are out of your control and may take you by
surprise despite best efforts.

~~~
zzzcpan
In web and online infrastructure, pretty much nothing is out of your control
except for two things: the ISPs people use and the domain name registrar you
use for your domain name. And even domain name registrar centralization can be
mitigated by having multiple domains from multiple registrars, promoting
different domains to different users, and having backup communication channels
to inform users about new domains in case something happens.

Other than that it's your choice whether to make your infrastructure dependent
on a bunch of unreliable centralized SPOFs from big corporations or build
highly available infrastructure relying on servers from many different
providers running your own DNS servers with DNS routing, failover, etc. You
will definitely beat Cloudflare's availability this way many times over.

~~~
katzgrau
And you will still be exposed to being blindsided by something out of your
control. It's really only in your control if you can think of it and plan for
it ahead of time. And there will certainly be things that we don't consider.
You can call that a failure, but it happens all the time and it's reality.

What if a political event impacts you, for instance? A pandemic? A storm
taking out a major data center? A weird Linux kernel edge case that only
happens beyond a certain point in time? That only sounds ridiculous because it
hasn't happened, but weird things like that happen all the time. There are so
many unseen possibilities.

I understand that might sound unreasonable or facetious or like I'm expanding
the scope.

The point is, the more confident you are that you've built something with no
SPOF, the more exposed you are to the risk of one, because one probably does
exist.

~~~
zzzcpan
Honestly, you are not making any sense. This is not how engineering works. If
you design for resilience, you get more resilience, and you build confidence
as you see the evidence of how the system works in the real world.
Furthermore, with resilience you always have to cover all risks; it's just
that you don't immediately reach the fine granularity of decisions that avoid
triggering failover to servers in different countries; you improve granularity
as you learn from actual operations and modify your designs accordingly.

I remember when I first deployed a DNS-routed system: it was too reactive,
constantly jumping between servers; monitoring was too sensitive; it didn't
wait for servers to stabilize before returning them into the mix, and
exponential backoff was taking servers out for far too long. But even given
all that, it was still able to avoid outages caused by data center failures
and connectivity problems.

~~~
katzgrau
It does make sense, and it's paradoxical, I know.

> If you design for resilience, you get more resilience and you build
> confidence as you see the evidence how the system works in real world.

You simply can't foresee or eliminate all risk. This is referred to as "the
turkey problem." It's not my idea, but one I certainly subscribe to.

[https://www.convexresearch.com.br/en/insights/the-turkey-problem-2](https://www.convexresearch.com.br/en/insights/the-turkey-problem-2)

~~~
katzgrau
> The whole idea behind resilience is to cover unforeseeable risks

Speaking of things that don't make sense... if it's unforeseeable, one will
have a difficult time adequately preparing for it

~~~
zzzcpan
It's not difficult, it's just different. It's the difference between
predicting that a truck might crash into a data center and building a concrete
wall around it, and designing a system in such a way that users only ever
resolve to servers that are currently available, regardless of what happened
to some of them in a data center that had a truck crash into it.

~~~
katzgrau
... and after you've solved the truck problem, you have a potentially infinite
list of other things to plan for, some of which you will not foresee. And of
course, there's probably an upper bound on the time you can spend preparing
for such things.

Famous to the point of being a cliche, the Titanic was thought to be
unsinkable, and I would have had a similarly hard time convincing the
engineers behind the ship's design to believe otherwise.

The level of confidence you're displaying in predicting the unforeseeable is
something you may want to take a deeper look at.

~~~
zzzcpan
You are missing the point. Solving the truck problem is exactly what you
shouldn't do, well, at least not until your system is resilient. Because it
could be something entirely different: it could be law enforcement raiding a
data center, and your wall around it won't protect it from them. So instead
you approach the system in terms of what it has to rely on and all the
possible states of the things it has to rely on. Which maps to a very small
number of decisions, like whether a server is available or not. If it's not
available, it really doesn't matter which of the infinite things that could
happen to it or to the data center it is in actually did; you simply don't
return it to users if it's not available, and you have enough independent
servers in enough independent data centers to return to users to achieve a
specific availability. It's really not difficult.

I understand that most of those leetcode corporations don't care much about
resilience, and are likely even incapable of producing highly reliable
systems, which may give you a false impression that reliability is something
of an unachievable fantasy. But it's not; it's something we have enough
research on and can do really well today if needed. We are not in the Titanic
era anymore.

I have high confidence in these things (not in "predicting the
unforeseeable"), because I've done them myself. My edge infrastructure has had
something like half an hour of downtime total over many years, almost a decade
already.

------
cflewis
Compare this to Facebook's SDK "postmortem" and you can tell which company
cares more about its customers.

~~~
notwhereyouare
They actually did a postmortem?

~~~
john-shaffer
I think this is the "postmortem" he's referring to:
[https://developers.facebook.com/blog/post/2020/07/13/bug-now-resolved-fb-ios-sdk-outage-causing-disruption-third-party-ios-apps/](https://developers.facebook.com/blog/post/2020/07/13/bug-now-resolved-fb-ios-sdk-outage-causing-disruption-third-party-ios-apps/)

> Last week we made a server-side code change that triggered crashes for some
> iOS apps using the Facebook SDK. We quickly resolved the issue without
> requiring action from developers.

~~~
tpmx
I'm pretty sure at least half a billion people were affected, and this is the
best they can come up with...

------
combatentropy
It seems these major infrastructure outages are always from a configuration
change. I remember Google had a catastrophic outage a few years ago, and the
postmortem said it all began as a benign configuration update that snowballed
worldwide. In fact, I tried googling for it and found the postmortem of a more
recent outage, also from a configuration change.

Some seasoned sysadmin will say to me, "Of course it's always from a
configuration change. What else could it be?" I don't know, it seems like
there are other possible causes. But in today's superautomated
infrastructures, maybe config files are the last soft spot.

~~~
jeffbee
There's always a lot of initiative to get rid of unsafe manual network config
changes right after a major outage, but the difficulty of automating such
changes is surprisingly high, and the rate at which the initiative decays is
also surprisingly high.

~~~
clipjokingly
Is it possible to test config changes in a simulated version of the network?

------
Rapzid
> We saw traffic drop by about 50% across our network. Because of the
> architecture of our backbone this outage didn’t affect the entire Cloudflare
> network and was localized to certain geographies.

I'm not even sure... Is that second sentence supposed to signal some sort of
success? Dropping 50% of your traffic isn't isolated. If you're gonna try to
spin it, at least bury the damn lede. Further:

> The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago,
> Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt,
> Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto
> Alegre. Other locations continued to operate normally.

Locations with THEIR equipment, but certainly not all "affected" locations. I
live 4 hours from Dallas and can assure you that I was impacted. That coverage
is like.. Most of the United States, Europe, Brazil and who knows how much of
South America? Oh right, 50% of their traffic!

~~~
dannyw
I appreciate the transparency of the 50% figure, rather than some generic spin
like “some connectivity was degraded”.

------
spenczar5
Wow, BGP brings down globally-used DNS. It’s like a perfect lesson in weak
points of the modern web’s accidental design.

~~~
dilyevsky
If you own a prefix and announce some bad routes that cause all your traffic
to be blackholed due to a misconfiguration on your end, I don't see how it's
BGP's fault.

------
rexarex
Network Engineering: Invisible when you’re killing it at your job, instantly
the enemy when you make a mistake.

------
dahfizz
We need to do something about BGP.

Just in the past year, Verizon, IBM, Apple, and now Cloudflare have seen
outages from BGP misconfiguration. The Verizon issue took down a significant
part of the internet.

BGP is a liability to society. We need something which doesn't constantly
cause widespread outages.

~~~
tyingq
Any replacement would also need the ability to route traffic, and would be
subject to similar risks. A "pre-push" testing simulator might be easier than
throwing out BGP.

~~~
Cyph0n
I recall watching a Microsoft talk where they explain how they do exactly
this.

------
simonswords82
In the last three years we have hosted our enterprise software on Azure, and
the only outages we've had have been caused by mistakes or issues at
Cloudflare. Azure has been rock solid but our customers don't understand that
and assume that we're "just down", which impacts our SLAs.

During the most recent outage a few weeks ago, Azure were available to discuss
the issue by phone. I wish I could say the same for Cloudflare.

I would be interested to hear from anybody who knows of a good alternative to
Cloudflare. I'm completely fed up with them.

~~~
coder543
If you're so happy with Azure, why aren't you using Azure CDN?
[https://azure.microsoft.com/en-us/services/cdn/](https://azure.microsoft.com/en-us/services/cdn/)

Obviously, AWS and GCP offer their own CDN systems as well. (CloudFront and
Cloud CDN, respectively)

There are tons of third party CDNs as well.

Unless by "good alternative" you mean that you're on Cloudflare's free plan,
and hoping to find someone else who will willingly soak up huge amounts of
bandwidth for free?

Cloudflare is one of the only services I know of that offers this, but it's
hard to complain about a short outage every once in a while when you're paying
nothing or very little. The Cloudflare customers who are paying quite a bit
are surely upset.

~~~
simonswords82
My head of product is looking at the Azure offerings and how they compare with
what we have from Cloudflare as we speak.

No we are not using Cloudflare's free plan.

I was simply interested to know if anybody else had recommendations for a
Cloudflare alternative.

------
bogomipz
>"We are making the following changes:

Introduce a maximum-prefix limit on our backbone BGP sessions - this would
have shut down the backbone in Atlanta, but our network is built to function
properly without a backbone. This change will be deployed on Monday, July 20.

Change the BGP local-preference for local server routes. This change will
prevent a single location from attracting other locations’ traffic in a
similar manner. This change has been deployed following the incident."

It should be noted that configuring prefix limits for your BGP peers is kind
of BGP 101. It's mentioned in every "BGP Best Practices" type document.[1]
It's there for exactly this purpose: to prevent router meltdown and resource
exhaustion. For a company that blows its own horn as much as these folks do
about their network, this is embarrassing.
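
As a toy illustration of why that limit contains the blast radius (names and numbers are made up; real routers implement this natively as a prefix-limit with teardown on the session):

    # Toy model of a maximum-prefix limit: a session that receives more prefixes
    # than configured gets torn down instead of attracting traffic it cannot carry.
    class BgpSession:
        def __init__(self, peer, max_prefixes):
            self.peer = peer
            self.max_prefixes = max_prefixes
            self.routes = set()
            self.established = True

        def receive(self, prefixes):
            if not self.established:
                return
            self.routes.update(prefixes)
            if len(self.routes) > self.max_prefixes:
                # Drop the session rather than absorb a route leak.
                self.established = False
                self.routes.clear()

    session = BgpSession("backbone-to-atl", max_prefixes=100)
    session.receive([f"198.18.{i}.0/24" for i in range(256)])  # simulated leak
    print(session.established)  # False: the site drops off the backbone instead of melting down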

I think it's worth mentioning that it was this time last year when Verizon
bungled their own BGP configuration and brought down parts of the internet.
When that incident occurred, Cloudflare's CEO was front and center excoriating
them for accepting routes without basic filtering [2]. This is the exact same
class of misconfiguration that befell them yesterday.

[1] [https://team-cymru.com/community-services/templates/secure-bgp-template/](https://team-cymru.com/community-services/templates/secure-bgp-template/)

[2]
[https://twitter.com/eastdakota/status/1143182575680143361?la...](https://twitter.com/eastdakota/status/1143182575680143361?lang=en)

------
rob-olmos
Commented on the original outage: Hopefully with this outage Cloudflare will
provide non-Enterprise plans a CNAME record, allowing us to not use Cloudflare
DNS and more quickly bypass Cloudflare if the need arises.

~~~
zamadatix
If Cloudflare DNS is down how would that CNAME resolve? In the case that
record is not hosted by Cloudflare DNS then why does it matter if it's a CNAME
or not? Sorry, probably not familiar with the offering you're referring to.

~~~
toast0
The idea is that you host, in your own DNS,

foo.example.org. CNAME blah.customer.cloudflare.whatever. with a TTL of, like,
5 minutes.

Then when Cloudflare goes down, you switch that to your origin server, or your
static "system is down" page or something. Most of your traffic moves within 5
minutes, and when you're satisfied Cloudflare is working again, you move
traffic back.

If you've delegated your domain to Cloudflare, you can switch your delegation
at your registrar, and a lot of TLDs update their servers pretty quick, but
the TTL is usually at least a day, so you'll be waiting a while for traffic to
move.

~~~
usr1106
But now you have an additional point of failure. You would need a very
reliable DNS provider to host that CNAME. It increases your flexibility when
CF has an issue, but it does not necessarily increase the reliability of your
site.

~~~
toast0
Yes, but if all your fancy names are simply cnames, you can use normal zone
transfers to copy between servers, and use servers from multiple providers.
Most recursive resolvers will retry requests against multiple delegated name
servers until they get one that responds (or they all fail to respond). It
adds some delay, so you wouldn't want the servers to be down often, but it's
tolerable.

------
iso947
2 years ago, about 1 in 3 people in the UK were watching England in the World
Cup.

Towards the end of the game, a CDN the BBC used crashed, taking a million
people’s live streams offline.

Traditional TV with its 20 million plus viewers worked fine.

A 15 minute global outage during the World Cup or Super Bowl is not acceptable
in the world of boring old TV

Meanwhile github has been down how many times this year?

IT is still a terrible industry of cowboys. It's just hidden under the veneer
of abstraction, microservices and outsourcing. Other industries like the
national grid or water or radio have outages that affect a local geographic
area or a limited number of people, but they are far more distributed than the
modern internet. It's ironic that a network designed to survive nuclear war
can't survive a typo.

[https://m.huffingtonpost.co.uk/entry/bbc-iplayer-crashes-in-final-minutes-of-england-v-sweden-in-the-world-cup_uk_5b40e196e4b09e4a8b2d88fc](https://m.huffingtonpost.co.uk/entry/bbc-iplayer-crashes-in-final-minutes-of-england-v-sweden-in-the-world-cup_uk_5b40e196e4b09e4a8b2d88fc)

------
maxdo
What is funny: you go to Cloudflare's customers page, check all of these
companies' status pages, all down, and none of them admits it's due to a
third-party cloud provider, e.g. Cloudflare. In most cases it was a
"performance issue". It's so silly... you're in an interconnected world. It's
OK that your major cloud provider went down...

------
itsjloh
Why did it take so long for a status page to be published I wonder?

From the timeline in the blog post, the issue with Atlanta was fixed between
21:39 and 21:47, but a status page update wasn't published until 21:37.
Everything had been broken for over 20 minutes at that stage, with lots of
people already posting about it or other status pages reflecting issues. See
[https://twitter.com/OhNoItsFusl/status/1284239769548005376](https://twitter.com/OhNoItsFusl/status/1284239769548005376)
or
[https://twitter.com/npmstatus/status/1284235702540984321](https://twitter.com/npmstatus/status/1284235702540984321)

Without an accurate status page, it leaves businesses pointing the finger
everywhere, wondering whether it's their hosting provider having issues, their
CDN, their DNS provider, etc.

~~~
eastdakota
Because of the way we securely connect to StatusPage.io from most locations
where our team is based. The traffic got blackholed in ATL, keeping us from
updating it. An employee in our Austin office was finally able to use his
Google Fi phone and connect through a PoP that wasn’t connected to our
backbone so didn’t have traffic blackholed. Something we’ll address going
forward.

~~~
itsjloh
Damn that sucks, sounds like a stressful Friday evening for all involved.
Thanks for taking the time to answer.

------
superkuh
You mean the service that everyone is centralizing in caused problems because
everyone centralized in it? Pikachu shock face. If you're a web or network
dev, act responsibly. Don't just pick Cloudflare because everyone else does.
Don't pick Cloudflare _because_ everyone else does.

------
eric_khun
I got the same issue today when I was on-call. Took me 1 hour to figure out it
was Cloudflare.

I'm currently working on a project to monitor the entire 3rd-party stack you
use for your applications. Hit me up if you want access; I'll give free access
for a year+ to some folks to get feedback.

~~~
dubcanada
Not to take away from your project, but check out
[https://statusgator.com/](https://statusgator.com/)

------
iJohnDoe
Cloudflare is experiencing a bit of karma and a bit of Murphy’s Law.

They slung some mud not long ago and now it has come back to bite them. They
were a bit righteous about their reliability. However, anyone in this game
long enough knows it's only a matter of time before shit goes down. If they
didn't have any graybeards over there to tell them this, then hopefully they
earned some gray.

Stuff was down long enough for Google's and OpenDNS's caches to expire, and to
take down DigitalOcean in some respects.

Thankfully CF can afford to learn and make improvements for the future. Not
all organizations are that lucky.

~~~
bnkamalesh
> They slung some mud not long ago...

Oh, at whom?

~~~
bogomipz
[https://twitter.com/eastdakota/status/1143182575680143361?la...](https://twitter.com/eastdakota/status/1143182575680143361?lang=en)

------
devy
This is not the first time human error in BGP routing configuration has taken
down a significant portion of the Internet. Is there any kind of configuration
validator that could be implemented to prevent and catch this type of error? I
am fairly sure this won't be the last time we hear about human error in BGP
routing config taking the Internet down.

Or is BGP intrinsically an unsafe protocol, without built-in protections
against this sort of human mistake?

~~~
dcow
BGP is a fitting name for such a distinct plane of computation. Indeed the
border where any remaining physical concerns are cut loose and reality melts,
receding behind a shroud of gateways, giving way to the vast expanse of
cyberspace. Traffic whirls past. Raw elemental ether flows with abundance in
this region. Any wizard who happens to experience even brief exposure to it
normally considers themselves lucky. But to be enlisted to serve as a warden
of the border, whether punishment or honor, is a responsibility most high.
BGP is either the final, or the primal, abstraction depending on which side of
the gates you most intimately inhabit. And the task of maintaining it a
meticulous and manual art.

~~~
solotronics
We live and die by the finite state machine, hallowed be the Protocol.

------
apaprocki
It would be interesting to classify network outages and determine the number
that involved practices that would be obviated by a standard VCS / release
process like found in software. Routers/firewalls seem to be a particular pain
point everywhere.

------
jlgaddis
I guess it's easy to become complacent when you're a networking expert at
Cloudflare and likely making several of these ad-hoc, on-the-fly config
changes every single day, but it's always good to remember why Juniper
introduced the automatic rollback feature.

Of course, this particular outage would not have been _prevented_ even if they
had used

      # commit confirmed

as it can't stop you from screwing up but it almost certainly would have
limited the duration of the outage to ~10 minutes (plus a minute or two,
perhaps, for the network to recover after the rollback) -- and it could have
been shorter than that had they used "commit confirmed 3", for example.

Even as a lowly network engineer working on networks much, much smaller than
Cloudflare's, for pretty much my entire professional career, it's been my
standard practice to start off pretty much _ANY_ change -- no matter how
trivial -- with

      # commit confirmed 3

or

      # reload in 3

or similar, depending on what type of gear I was working on (and, of course,
assuming the vendor supported such a feature).

This applies even when making changes that are so "simple" that they just
"can't" go wrong or have any unexpected or unintended effects.

In fact, it applies _ESPECIALLY_ in those cases! It's when you let your guard
down that you'll get hit.

---

Fortunately, all that was necessary (I assume) to recover in this case was to

      # rollback

to the previous configuration. Then, the correct configuration could be made.
That still had to be done manually, however, and resulted in a 27 minute
outage instead of what could have been a 5 or 10 minute outage.

I would hope that Cloudflare has full out-of-band access to all of their gear
and are able to easily recover from mistakes like this. If they had lost
access to the Atlanta router and weren't able to log in and revert the
configuration manually, this outage could have lasted much, much longer.

------
mathattack
Did this sink any eCommerce websites?

~~~
synunlimited
Shopify went down so all of those storefronts were down.

~~~
mathattack
Wow - so real dollars!

------
jitbit
Everyone has outages, CloudFlare is a decent company making a good product.

BUT

What's interesting here is that so many non-CloudFlare services went down
(including even AWS, partially), caused by the DNS outage - because every
sysadmin and his mom uses 1.1.1.1 as their DNS.

~~~
KozmoNau7
And that is precisely why you shouldn't try to centralize like that and send
everything to and through the same vendor.

Decentralization is what makes the internet function in a robust fashion.

------
sm2i
looks like Tesuto was a good buy

[https://investors.fastly.com/news/news-details/2020/Fastly-Achieves-100-Tbps-of-Edge-Capacity-Milestone/default.aspx](https://investors.fastly.com/news/news-details/2020/Fastly-Achieves-100-Tbps-of-Edge-Capacity-Milestone/default.aspx):

> By emulating networks at scale, Tesuto’s technology can be used to create
> sandbox environments that simulate the entire Fastly network, providing a
> view into the potential impact of a deployment cycle before it is put into
> production.

------
kissgyorgy
It's fine that they made mistakes and there was an outage; shit happens to
everybody.

What is really scary, though, is that half of the internet stopped working.
That's not ok!

------
bamboozled
It's amazing to me that in this situation, humans still had to intervene and
update a configuration file.

I'm surprised stuff like this doesn't happen more often, and that there isn't
at least a well-tested, automated remediation step in place which also
validates the change prior to going live.

I get they may be busy solving other issues, but it's interesting this isn't a
more foolproof procedure given the huge impact a mistake can have.

------
microcolonel
I wish they would go into why the rather complete outage was not visible on
cloudflarestatus.com. I fully understand that mistakes can be made, but I'm
really not pleased with how hard it was to tell whether I was experiencing a
localized issue. During the entire outage, cloudflarestatus.com displayed "all
systems operational" for me, once I accessed it with a functioning DNS
resolver.
------
tristor
Fun, I got hit by this since I use cloudflared behind my Pi-hole. I was able
to troubleshoot the issue, localized the cause to Cloudflare, found the
partial outage in various regions and assumed I was affected, and switched to
using Level 3 DNS temporarily. I'm glad to see it's back up, and this is a
great retrospective.

~~~
KozmoNau7
Set up Unbound instead of cloudflared, and switch to using smaller trusted DNS
services. Don't put all your eggs in one basket.

~~~
tristor
I am intentionally using cloudflared in order to have DOH between me and
Cloudflare to protect my DNS traffic from snooping by my ISP for marketing/ad
purposes.

~~~
KozmoNau7
I forgot to add that the main reason for setting up Unbound is to use DNS over
TLS.

I am not suggesting ditching DoH and going back to unencrypted DNS.

There are a number of small, independent, trusted DNS providers who support
DoT. UncensoredDNS is one that I would absolutely recommend.

------
Fritsdehacker
I don't understand why Cloudflare is making direct configuration changes to a
router like this. If these are changes that are made regularly, why not use a
tool to make them? You can then ensure that only certain changes are possible,
preventing simple mistakes like this.

------
MayeulC
Hmm, duckduckgo and qwant also seem to have some trouble, time to head for
[https://searx.space/](https://searx.space/) I guess?

------
cosud
Fascinating to see that airport codes are used as data center names. I know at
least one other company which does that, but I thought that was something
peculiar to them.

~~~
rospaya
I was always a fan of that. Airport codes are unique and almost never change,
even after the city/country changes.

------
badrabbit
Is there any work being done to replace BGP or current IGPs? Wondering if
modern computing and memory capacity and algorithms could be used to make more
fail-safe protocols.

~~~
infinisil
SCION is such an effort: [https://www.scion-architecture.net/](https://www.scion-architecture.net/)

------
kj4ips
Random question:

The Cloudflare resolvers definitely went down (1.1.1.1 and 1.0.0.1); do we
know if authoritative DNS did?

------
H8crilA
A large portion of some network is down? Oh, right, it's BGP. It's always BGP.

------
exabrial
What configuration error? Was this human or automatic? What was done to
mitigate?

------
m3kw9
It shows that the system is pretty fragile.

