
Route Leak Impacting Cloudflare - xPaw
https://www.cloudflarestatus.com/incidents/46z55mdhg0t5
======
dang
[https://news.ycombinator.com/item?id=20267790](https://news.ycombinator.com/item?id=20267790)
is a more recent thread on this.

------
jgrahamc
This appears to be a routing problem. All our systems are running normally but
traffic isn't getting to us for a portion of our domains.

 _1128 UTC update_ Looks like we're dealing with a route leak and we're
talking directly with the leaker and Level3 at the moment.

 _1131 UTC update_ Just to be clear this isn't affecting all our traffic or
all our domains or all countries. A portion of traffic isn't hitting
Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

 _1134 UTC update_ We are now certain we are dealing with a route leak.

@dang etc.: could someone update the title to reflect the status page title, "Route Leak Impacting Cloudflare"?

 _1147 UTC update_ Staring at internal graphs, it looks like global traffic is
now at 97% of expected, so impact is lessening.

 _1204 UTC update_ This leak is more widespread than just Cloudflare.

 _1208 UTC update_ Amazon Web Services now reporting external networking
problem [https://status.aws.amazon.com/](https://status.aws.amazon.com/)

 _1230 UTC update_ We are working with networks around the world and are
observing network routes for Google and AWS being leaked as well.

 _1239 UTC update_ Traffic levels are returning to normal.

~~~
holycowzer
Thanks for the updates. I wish I could get this information somewhere other
than hacker news though. :(

~~~
jgrahamc
The team is updating the status page but not with granular detail because
they'd have to spend time discussing what to say. I'm giving you the blow by
blow.

~~~
portman
I have a year-old startup and this is the first major Internet outage we've
had to deal with... was really awesome to have your play-by-play and
definitely changed our incident response (for the better!). Thank you so much.

------
nikisweeting
There's one thing I don't understand about this all, it looks like Allegheny
Technologies Incorporated (AS396531, a suspected original leaker) was
originally announcing 192.92.159.0/24.

How the heck did their peers not manage to filter a sudden announcement for a
range big enough to snag both 8.8.8.8 and 1.1.1.1? Do upstreams really allow a
tiny /24 AS to randomly announce a /4 and get away with it? Or am I
misunderstanding something fundamental about how BGP routes are allowed to
propagate?

~~~
kbirkeland
Leaking a /4 into BGP would do basically nothing unless the originator was
originally advertising a /4. IP forwarding is based on the longest-prefix
match. Since allocations are sized from /8 to /24, anybody actually
advertising their space would not get hijacked by a /4. The leaker would just
get traffic destined toward non-advertised networks.
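The longest-prefix rule is easy to demonstrate with Python's `ipaddress` module. This is a toy routing table, not real BGP behavior: even with a covering /4 present, the more specific /24 wins.

```python
import ipaddress

# Toy routing table: prefix -> next hop. The /4 stands in for a
# hypothetical huge leaked route; the /24 is the legitimate,
# more-specific announcement.
routes = {
    ipaddress.ip_network("0.0.0.0/4"): "leaker",
    ipaddress.ip_network("1.1.1.0/24"): "legitimate origin",
}

def next_hop(addr):
    """Forward per longest-prefix match: the most specific route wins."""
    ip = ipaddress.ip_address(addr)
    matching = [net for net in routes if ip in net]
    return routes[max(matching, key=lambda net: net.prefixlen)]

print(next_hop("1.1.1.1"))  # the /24 beats the covering /4
print(next_hop("9.9.9.9"))  # only the leaked /4 matches
```

Note that traffic for addresses nobody legitimately advertises (9.9.9.9 in this toy table) is exactly what falls through to the leaker, as described above.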

~~~
nikisweeting
Then my next question is: If they didn't leak a massive range, then why was it
a big problem? I assume if they leaked a bad /24 it surely wouldn't be enough
to take down Cloudflare and Google for everyone... no? Did they just leak tons
of bad /24s or was it something else?

~~~
kingbirdy
My understanding is they had an optimizer that broke the /4 down into /24s
and those got announced
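That deaggregation step can be sketched with Python's `ipaddress` module (a toy illustration with a made-up /20, not how any particular optimizer works): splitting a covering route yields more-specific /24s that, once leaked into BGP, win longest-prefix match against the owner's own shorter announcements.

```python
import ipaddress

# Hypothetical covering route held by a route optimizer.
covering = ipaddress.ip_network("104.16.0.0/20")

# Deaggregate into /24s, as a misconfigured optimizer might before
# those internal routes escape into global BGP.
more_specifics = list(covering.subnets(new_prefix=24))

print(len(more_specifics))   # 2**(24-20) = 16 subnets
print(more_specifics[0])     # 104.16.0.0/24
```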

~~~
nikisweeting
Aha! That was the missing piece in my understanding, it all makes sense now!
<3 You're the only person out of the ~5 people I asked who explained that bit.

------
emilstahl
“AS396531 "Allegheny Technologies Incorporated" is leaking a better-reachable
route for AS13335 "Cloudflare, Inc." towards AS701 "Verizon Business/UUnet"
explaining the current LSE going on.”

[https://twitter.com/OhNoItsFusl/status/1143117619106652160](https://twitter.com/OhNoItsFusl/status/1143117619106652160)

~~~
mbell
> AS396531 - Allegheny Technologies Incorporated

That appears to be a steel/alloys company. Why are they operating BGP
equipment?

~~~
Hikikomori
Why not? Pretty much everyone that needs a redundant internet connection (dual
ISP) does it.

~~~
mbell
> Why not?

It seems silly to me that an end-user company not providing any network
services, with only a 256-IP block, has the ability to break a significant
portion of the internet with a configuration mistake. There are several ways
to set up dual ISPs and routing that don't involve such risk.

~~~
wbl
You need BGP and provider independent space for your two ISPs to both announce
your space. What's the alternative approach?

~~~
mbell
Don't rely on a single IP routing through multiple ISPs, use DNS.

~~~
kazen44
What? This statement makes no sense from a networking perspective.

This issue still exists if you break up your IP space; it just makes it far
harder to manage.

------
jgrahamc
Final update from me. This was a widespread problem that affected 2,400
networks (representing 20,000 prefixes) including us, Amazon, Linode, Google,
Facebook and others.

[https://twitter.com/bgpmon/status/1143149817473847296](https://twitter.com/bgpmon/status/1143149817473847296)

------
jgrahamc
We've written this incident up: [https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-...](https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/)

~~~
btown
Great article! A couple missing periods at the ends of paragraphs FYI.

I'm curious why so much of this lies on Verizon's shoulders. Couldn't DQE and
Allegheny have implemented the exact same best practices that Verizon should
have, so it never leaked to Verizon's level? And to the extent non-Verizon
subscribers were affected, couldn't _their_ ISPs have implemented the same
best practices in distrusting Verizon? Is Verizon directly responsible for
routing that much of global traffic?

~~~
foota
I'm not very knowledgeable on network routing, so be warned.

But I think at some point a network peering with Verizon trusts it to route
things, i.e., if I as an ISP always go through Verizon to deliver traffic to
Cloudflare, then the route they take is out of my hands.

As for downstreams adding mitigation, ideally this would happen, but I would
think you should place blame in proportion to resources and criticality. A
ten-person ISP won't necessarily do everything right, and it shouldn't matter
much if they don't, since they're a small part of the internet.

------
jbergstroem
As an aside, there is something to be said about the size of Cloudflare when
global network problems are reported as being Cloudflare issues.

------
taf2
I’m pretty sure 1.1.1.1 for DNS is impacted by this. Initially I thought my
WiFi was having issues this morning, until I realized it must be DNS. After
switching 1.1.1.1 out, everything besides the Cloudflare sites is normal again.

------
nikisweeting
What's weird is that 8.8.8.8 is also intermittently down for me. Are other
people having issues with Google DNS too?

[https://i.imgur.com/3ySmVLW.png](https://i.imgur.com/3ySmVLW.png)

~~~
RKearney
Google rate limits ICMP to 8.8.8.8. It’s not meant to be used as your personal
“is the internet up” test.

~~~
mindslight
Seriously? If true that's an awfully quick bait and switch, even for Google.

~~~
ceejayoz
How is it bait and switch?

8.8.8.8 was never marketed as a "ping me to see if the Internet is up"
service, as far as I know. Just as a fast, public DNS server.

~~~
mindslight
An important use of well-known, easy-to-type IP addresses is when you're
mucking around to figure out whether your upstream network is working. I could
see it if they attempted to set a new standard by just not responding to ICMP
at all (although turning around an ICMP echo takes less work than a DNS
lookup...), but responding intermittently is actively harmful.

~~~
ceejayoz
You haven't identified the "bait" bit of the bait and switch. At no point has
Google promised to respond to pings on 8.8.8.8, nor are they obliged to _ever_
do so. Rejecting ICMP isn't "a new standard".

~~~
mindslight
The promise is implicit when competing for mindshare with 4.2.2.2. Typing an
IP address into a router setup is quite infrequent, compared to "let's check
connectivity by ping x.x.x.x". Setting expectations that 8.8.8.8 can fill this
role is the bait.

As I said, it's much easier to respond to a ping than even a cached DNS query.
Or it would also be consistent to simply never respond to ping.

Now obviously in the modern "you get nothing for nothing" world, Google is
able to violate whatever expectations they'd like. But "rate limiting" in a
way that makes basic ping(8)s look flaky, especially on a service that will be
used for debugging, is downright nasty and deserves to be shouted from the
rooftops (iff it's true).

~~~
chimeracoder
> The promise is implicit when competing for mindshare with 4.2.2.2. Typing an
> IP address into a router setup is quite infrequent, compared to "let's check
> connectivity by ping x.x.x.x". Setting expectations that 8.8.8.8 can fill
> this role is the bait.

4.2.2.2 is not even meant to be used as a public DNS server (and has at times
hijacked DNS requests to remind people of that). So it's weird to use
4.2.2.2 to criticize Google for blocking ICMP on their actually-public DNS
server.

~~~
mindslight
Sure, that's Level 3's official position. Unofficially, everyone uses it and
there is clearly someone inside making the deliberate decision to keep it
publicly available. [https://www.tummy.com/articles/famous-dns-server/](https://www.tummy.com/articles/famous-dns-server/)

As I said, the crux of the problem isn't Google's "blocking", but rather
making it intermittent. Obviously it's well within their rights to play
_whatever_ games they want - drop every other packet, vary the latency based
on your IP, duplicate packets, or make it appear some queue occasionally holds
your packets for 3 seconds. It's also within their rights to redirect all DNS
lookups to an April Fool's page. And to do any of this selectively based on
how many different Google services you use.

But that is not what any user expects, and in the end that's all protocols are
- expectations. To me, the pushback I've gotten here fits right in with
Surveillance Valley's general attitude of shirking responsibility with some
fine print disclaimer, knowing full well what the constructive situation is.
"I'm just going to go like this [spinning arms], it's not my fault if you walk
into me".

If you can't see how people would expect to be able to reliably ping 8.8.8.8,
or how intermittently dropping pings causes confusion (as in the original
comment above), then I can't help you.

~~~
Hello71
there are lots of services that are available to the public, but intended only
for a specific set of people. if you go to the local supermarket and take a
few dozen bags without buying anything, that's immoral and illegal. nobody
will stop you from stealing the 1 cent bags, but that doesn't mean that it's
OK. in this case, they have specifically put up signs saying "bags for paying
customers only". if you continue to regularly go in and take bags without
paying, that is theft, both legally and morally.

your argument boils down to "it is convenient for me, and I see other people
stealing bags too".

~~~
mindslight
What in ze hell?

1. It is straightforward to restrict a DNS server so that it only answers
specific networks. This doesn't even need to be close to comprehensive to get
the message across. Level 3's (née BBN's) _intent_ is to continue to respond
to the wider Internet community, regardless of what their ambient PR says.
Likely for similar reasons that they run a looking glass.

2. The frequency and magnitude of your scenario makes it a straw man. A more
worthwhile example is someone using a business's bathroom without buying
anything. Yet most places don't really care as in the end it balances out, and
we're all humans that have needs that can't be fully met by commercial
provisions. The major concern is people who mess up the bathroom, paying or
not.

3. While a common touchstone, _theft_ does not apply, as nothing has been
taken. Perhaps unjust enrichment. But given that anybody using 4.2.2.2 to
answer production DNS queries is harming themselves with additional latency
more than anything is "taken" from Level3, that's a stretch too.

Have we really become so full of corporate bullshit that we're stuck analyzing
things in its myopic paradigm? I thought this was _Hacker_ News?

PS I notice 77.77.77.77 also responds to pings and DNS queries. Should I
expect to get a bill for their services? Because I'd much rather just relish
the feeling of a fleeting shared purpose with someone halfway around the world
in a vastly different culture.

------
shacharz
Hi Shachar from Peer5 here, we're operating a MultiCDN. Cloudflare is actually
one of the best performing CDNs. All CDNs encounter issues small to big -
that's why using multiple providers and intelligently routing between them is
critical for high resilience.

~~~
shacharz
Right now we're seeing issues in the following ASNs: AS9541, AS59257, AS38264,
AS132165, AS23888, AS55714, AS45773, AS45669, AS9260, AS58895, AS17557,
AS38547, AS38193, AS135407, AS23966, AS7590, AS136525.

~~~
nikisweeting
Are you seeing ASN 396531 as the original leaker?

------
throwawayflower
The main internet and phone service provider of the Netherlands is down. Even
the emergency number (112, our equivalent of 911) is down. Almost everyone is
unreachable. The whole telephone network is disrupted.

I wonder if it's related to this? This kind of BGP incident can also be a
deliberate malicious attack. Perhaps this:
[https://en.wikipedia.org/wiki/BGP_hijacking](https://en.wikipedia.org/wiki/BGP_hijacking)

~~~
throwawayflower
Oh, and the country's train and public transport infrastructure is
experiencing some major problems too due to the phone service outage.

~~~
x86_64Ubuntu
You have to wonder if these outages aren't the result of hostile states laying
the groundwork and testing the viability of certain attacks.

~~~
nikisweeting
Heh I think based on BGP's track-record, if a state-level actor wanted to mess
up everyone's BGP routes they wouldn't have to try very hard...

------
jgrahamc
@dang etc. Be good if someone changed the title here. 2,400 networks were
affected (including parts of Cloudflare, Google, Amazon, Linode, Facebook,
...).

------
colinodell
This seems like a partial outage, likely region-based. We have a large number
of sites routed through Cloudflare and I can access all of them from home, but
our HTTP monitoring software reports the sites as down.

------
robbiemitchell
Unless it’s a total coincidence, it looks like this is affecting some Amazon
services, including Sagemaker notebooks and Echo devices.

------
dajonker
"Cloudflare is observing network performance issues." not just performance
issues, our entire website is unavailable because of it.

edit: availability has been alternating between available and unavailable

~~~
monkin
It's not their fault as accidents happen to everyone. You should be prepared
for such a scenario and for many others too.

~~~
dajonker
You are right, accidents happen to anyone. I cannot really be prepared for
Cloudflare to go down though. What are my alternatives? Turn it off and route
traffic to our servers directly? The DNS propagation takes longer than it just
took for our website to be available again.

~~~
jasongill
It shouldn't - Cloudflare keeps the TTL for their cache-enabled records very
low (like 300 seconds).

If you just log in to Cloudflare and click the "orange cloud" icon on the DNS
tab, which points the domain back directly to your origin, you'll see the site
up within a couple minutes.

~~~
yjftsjthsd-h
Is 300 very low? I've occasionally seen 60 in the wild.

~~~
jasongill
It's very low compared to 24 hours, which used to be the most common setting
and (among other factors) was a big part of the "DNS propagation takes
forever" mentality.

------
foobarbazetc
It’s definitely all countries, just for a specific range of anycast IPs.

Our CloudFlare stuff isn’t even pingable. Sometimes it’ll return an echo from
a far away DC.

It’s been like this for over an hour now and your status page doesn’t even
acknowledge it apart from “Network performance issues”.

~~~
nikisweeting
It was updated a few minutes ago confirming that it's a route leak.

~~~
foobarbazetc
Yeah but they just wasted an hour of everyone’s lives trying to figure out WTF
was going on at 3:34am.

(The average CF user has no idea what a route leak is, tbh.)

~~~
corobo
"Everyone" speak for yourself, middle of the workday here :P

------
nikisweeting
One of the weirder leaks I've seen, 8.8.8.8 and 1.1.1.1 are both down for me,
but everything else is working fine.

~~~
cityzen
8.8.8.8 is google’s DNS, though.

~~~
nikisweeting
Exactly, that's why it's weird. It would have to be a huge range leaked to get
both 8.8.8.8 and 1.1.1.1; surprising that a peer didn't filter it before it
worked its way up the chain.

------
pgt
What is a route leak?

~~~
nikisweeting
The Border Gateway Protocol is what network providers use to announce which IP
ranges they can route traffic for. The problem is it's almost totally
unauthenticated, so rogue ISPs and network operators can suddenly take over
parts of the internet by "leaking" routes for ranges they shouldn't be able to
control.

They do this by announcing something like "send me all traffic for 1.1.1.1 -
1.1.1.255", and if their peers don't verify it, they'll just start routing
that traffic to them. Peer by peer, the route then propagates to a larger
portion of the internet, and as routers learn the new bad route, more of the
traffic to those IPs gets sent to the incorrect network.
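That peer-by-peer spread can be sketched as a flood through a toy AS graph (all names are made up). The one thing the sketch captures is that a single filtering upstream stops the leak at the first hop:

```python
from collections import deque

# Toy AS-level topology (names made up): who peers with whom.
peers = {
    "leaker":   ["transit1"],
    "transit1": ["leaker", "transit2", "isp_a"],
    "transit2": ["transit1", "isp_b"],
    "isp_a":    ["transit1"],
    "isp_b":    ["transit2"],
}

def propagate(origin, filters=frozenset()):
    """Flood a route announcement from `origin` through the graph.
    ASes in `filters` reject the announcement and don't re-announce it,
    like a real prefix filter would."""
    accepted = {origin}
    queue = deque([origin])
    while queue:
        asn = queue.popleft()
        for peer in peers[asn]:
            if peer not in accepted and peer not in filters:
                accepted.add(peer)
                queue.append(peer)
    return accepted

# Unfiltered, a bogus announcement reaches every AS in the graph.
print(sorted(propagate("leaker")))
# If the immediate upstream filters its customer, the leak goes nowhere.
print(sorted(propagate("leaker", filters={"transit1"})))
```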

------
kristofferR
I realized that it would be an issue like this when
downforeveryoneorjustme.com didn't load either.

------
sudhirj
Isn't HN on Cloudflare? How are we reading about a CF outage on a site that
runs behind CF?

~~~
srushtika
Doesn't seem so
[http://www.doesitusecloudflare.com/?url=https%3A%2F%2Fnews.y...](http://www.doesitusecloudflare.com/?url=https%3A%2F%2Fnews.ycombinator.com%2F)

~~~
has-ams
[http://www.doesitusecloudflare.com/?url=cloudflare.com](http://www.doesitusecloudflare.com/?url=cloudflare.com)

~~~
jakejarvis
I just put in my own Cloudflare-enabled website and it came back negative as
well.

------
jcalabro
This is affecting far more than just Cloudflare.

------
fluxsauce
I was going back and forth about whether to post this: we're clearly
experiencing some sort of partial outage. New Relic Synthetics shows
everything offline (except for Sydney, for whatever reason), but the actual
webserver logs indicate healthy traffic and the sites load when I visit them.
"Cloudflare is observing" is pretty opaque.

~~~
nikisweeting
We're seeing the same situation with our alerts.

------
Dunning-Kruger
The current IT stack needs a do-over. These outages already happen by accident
often because of human error. Imagine the damage a state actor could inflict
by targeting these large data centers. I hope that some of the newer
decentralized cloud startups like Dfinity or Storj take over.

~~~
forgottenpass
>The current it stack needs a do-over.

"The network is unreliable" is a rule of thumb that was drilled into my head
in network programming class.

It always has been, it always will be. Doesn't matter if it's the internet or
the link between your computer and a device sitting on your desk. And it
doesn't matter what the tech is.

Making the internet more resilient only increases the severity of the failure
when organizations that don't understand the risk they're taking on experience
network outages.

The network is unreliable.

------
tomcam
It’s caused my site monitoring via PagerDuty to go insane, with texts sent
every few minutes.

------
nnx
Seems to be BGP/routing related, some networks can access CloudFlare networks
normally.

------
lordelph
We've been evaluating Cloudflare mainly for doing failovers faster than DNS.
This morning I ran some tests to generate graphs to show the typical delay
incurred in preparation for a show-and-tell with some key people.

I started seeing delays of up to 300 seconds! At best there was a 1-second
delay. I wondered if I was going to have to present "Why we've decided not to
go with Cloudflare!"

Any longtime Cloudflare users comment on how rare an event this sort of thing
is? It _seems_ rare from eyeballing the recent alert history.

~~~
shdon
Things like this are not unique to CF and actually originate from outside
their network. It does happen every once in a while, but I have far more
confidence in CF's ability to resolve it than my own. They have the clout in
the industry, the connections and the expertise to deal with this kind of
thing. I've been with CF since late 2011 and am quite satisfied with their
services.

~~~
lordelph
That's a good point, and I must admit I didn't know what a route leak was or
that it could inflict this kind of damage. I appreciate now it's not
CloudFlare's fault, and my hat is off to the CTO for posting more detail here.

On the plus side, I did get to test the "Pause CloudFlare" button in a real-
world scenario!

------
jiveturkey
Interesting, isn't it, that when it's a US-based steel plant, it's a route
leak, but when it's China Telecom, it's a route hijack.

The description of it as a leak AFAICT seems to be due to CF getting first
dibs on the announcement[†] and positioning it as such. However, I firmly
believe that had the general tech press gotten ahead of it first, it still
would have been treated much more generously than we treat China's leaks.

[†] grin

------
sudhirj
How would that kind of disruption happen? Someone else also anycasting the CF
IP addresses?

~~~
arghwhat
BGP route/prefix leaks. BGP is the protocol that deals with routing across the
various internet backbones (known in the protocol as "autonomous systems",
identified by an AS number).

On that protocol, the various systems broadcast what prefixes they can route,
which then affects the rest of the networks' routing decisions.

By error or malice, a system can report a prefix it cannot or should not
route, causing other systems to start routing traffic across it. This will
either just cause weird routes (such as ones going through certain suspicious
countries), cause poor performance, or cut off connectivity entirely for the
affected traffic.

~~~
nikisweeting
At 3-4 major leaks per year it seems like we should probably fix BGP one of
these days...

~~~
purerandomness
The way I understand it, it's not BGP; it's mostly human error, or malicious
intent.

The protocol is fine.

~~~
nikisweeting
An unauthenticated protocol that allows unsigned routes to be blindly accepted
is not a good protocol; that's why Cloudflare has been pushing RPKI for a
while: [https://blog.cloudflare.com/rpki/](https://blog.cloudflare.com/rpki/)
[https://blog.cloudflare.com/rpki-details/](https://blog.cloudflare.com/rpki-details/)

~~~
Hikikomori
It has authentication and requires explicit configuration to form a neighbor
relationship.

BGP was designed for operators to implement a routing policy. In most
implementations it allows everything by default with no modifications to route
metadata, so if you do not set up your policy correctly you'll have issues
like this.

~~~
nikisweeting
It has authentication for only _one hop_. If routes propagated all the way up
the chain with signatures, it would be much easier to block/limit bad AS
behavior.

~~~
Hikikomori
Your peering relationship is only for one hop. What it lacks is prefix/path
validation, not authentication.

~~~
nikisweeting
But authentication of every advertised range all the way up the chain would
allow upstream providers to easily differentiate valid large prefix
announcements that were made intentionally (e.g. a big ISP announcing some
routes) from crazy nonsense announced by an unknown party that isn't a big
ISP. We definitely need prefix filtering, but there also needs to be some
easily verifiable source of identity tied to each announcement to automate
the process of accepting and rejecting large prefix announcements.
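The origin-validation half of that idea is what RPKI provides today. A simplified sketch with toy ROAs, loosely following RFC 6811's valid/invalid/unknown states and ignoring path validation entirely:

```python
import ipaddress

# Hypothetical ROAs: (covering prefix, max prefix length, authorized origin AS).
# The AS numbers are the well-known ones for Cloudflare and Google,
# used purely as illustration.
roas = [
    (ipaddress.ip_network("1.1.1.0/24"), 24, 13335),   # Cloudflare
    (ipaddress.ip_network("8.8.8.0/24"), 24, 15169),   # Google
]

def validate(prefix, origin_as):
    """Simplified origin validation: 'invalid' if some ROA covers the
    prefix but the origin AS or prefix length doesn't match;
    'unknown' if no ROA covers the prefix at all."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_net, max_len, roa_as in roas:
        if net.subnet_of(roa_net):
            covered = True
            if origin_as == roa_as and net.prefixlen <= max_len:
                return "valid"
    return "invalid" if covered else "unknown"

print(validate("1.1.1.0/24", 13335))      # valid: correct origin
print(validate("1.1.1.0/24", 396531))     # invalid: announced by the wrong AS
print(validate("203.0.113.0/24", 64500))  # unknown: no ROA on file
```

A router enforcing "drop invalid" with this kind of check would have rejected the leaked more-specifics at the first validating hop, which is exactly the argument for RPKI deployment.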

------
ussrlongbow
Experiencing around 60% traffic drop on customer's sites.

------
peterwwillis
Hey CloudFlare: this page is dependent on ajax.googleapis.com, and if js is
disabled, googletagmanager.com. (Also, weirdly, they still have a link to
Google+ posts?)

------
filistar
Does anyone know which global sites were or still are unavailable because of
the Cloudflare crash? Maybe some media sites?

~~~
vdfs
WPEngine is down in all regions
[https://wpenginestatus.com](https://wpenginestatus.com)

------
steelaz
Disabling HTTP proxy and leaving "DNS only" option in CloudFlare DNS settings
solved the problem for us.

~~~
hugoromano
not very safe for some users.

------
markplindsay
This morning I'm finding out just how many of our supporting services rely on
Cloudflare as well.

------
ipmb
Seeing ~60% drop in traffic here.

------
shacharz
Is the specific IP range of the leak known?

------
filistar
Does anyone know which global sites were unavailable because of the Cloudflare
crash?

~~~
nikisweeting
You're not going to be able to get a solid list; this is a different category
of problem than something like Cloudbleed, and even then the list wasn't
solid. This issue is affecting AWS, Cloudflare, Cloudflare DNS, Google DNS,
and the tens of thousands of other services that depend on them, but it's
region-specific and will break different things for different users as the
leak propagates.

~~~
cntlzw
One source:
[https://twitter.com/atoonk/status/1143143943531454464](https://twitter.com/atoonk/status/1143143943531454464)

90 AS13335 Cloudflare, Inc.
18 AS7018 AT&T Services, Inc.
8 AS63949 Linode, LLC
8 AS2828 MCI Communications Services, Inc. d/b/a Verizon Business
6 AS26769 Bandcon
6 AS16509 Amazon.com, Inc.
4 AS6428 CDM
4 AS2914 NTT America, Inc.
2 AS9808 Guangdong Mobile Communication Co.Ltd.
2 AS6939 Hurricane Electric LLC
2 AS62904 Eonix Corporation
2 AS55081 24 SHELLS
2 AS54113 Fastly
2 AS46606 Unified Layer
2 AS45899 VNPT Corp
2 AS4246 New Jersey Institute of Technology
2 AS3257 GTT Communications Inc.
2 AS27695 EDATEL S.A. E.S.P
2 AS22781 Strong Technology, LLC.
2 AS20473 Choopa, LLC
2 AS16625 Akamai Technologies, Inc.
2 AS12129 123.Net, Inc.

~~~
nikisweeting
I think it's like 2.4k ASNs at this point, each with 10s-1000s of IPs. I guess
you can make a list from that, but it's going to be as unreliable as the
Cloudbleed list was. It's also not always easy to do reverse hostname lookups
from the IPs to get the site names.

