
Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline - steveklabnik
https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/
======
londons_explore
Cloudflare managed to get an in-depth blog post, one which has incident
details, points blame to other parties, and makes some really quite aggressive
(for corporate blog posts) claims, all during an incident, and they did all
that in _8 hours_.

I'm impressed. At most other similar size companies, this would take 4 days.
And in something like Amazon, it would be 2 weeks of approvals, editing, and
review before a watered down version with all specifics removed is published.

~~~
jawns
Regarding those really aggressive claims, I was a bit shocked by that as well.

Either Cloudflare has some pre-existing beef with Verizon and is using this as
an opportune moment to dump on them ... or Tom Strickx (who wrote the blog
post) had his beauty rest interrupted early this morning to deal with
Verizon's screw-up and was not having it.

~~~
bogomipz
>"Either Cloudflare has some pre-existing beef with Verizon and is using this
as an opportune moment to dump on them"

Indeed. And that's not going to help them or their customers in the least the
next time they need Verizon's cooperation to resolve an issue. You would never
see this type of behavior on the NANOG mailing list which has been on the
front line of communications between ISPs and providers for BGP issues since
the beginning of the commercial internet. It is very much a "community" with
reciprocal respect and professionalism, things this blog post was devoid of.

~~~
rayvd
[https://mailman.nanog.org/pipermail/nanog/2019-June/101614.h...](https://mailman.nanog.org/pipermail/nanog/2019-June/101614.html)

~~~
lima
Weird response by the Verizon employee.

> _You guys have repeatedly accused them of being dumb without even speaking
> to anyone yet from the sounds of it._

Not for lack of trying...

> _Should they have been easier to reach once an issue was detected? Probably.
> They’re certainly not the first vendor to have a slow response time though.
> Seems like when an APAC carrier takes 18 hours to get back to us, we write
> it off as the cost of doing business._

It wasn't a slow response, it was _no_ response. And either is unacceptable
for a tier 1 carrier.

> _But this industry is one big ass glass house. What’s that thing about
> stones again?_

And other carriers are actively working to change that - including, in
particular, Cloudflare.

------
empath75
My favorite outage when I worked for a voip company was when one of our tech
support people told a new customer that she needed to ‘add our ip address to
your router’, meaning add it to the firewall whitelist, but she repeated that
verbatim to the telco tech who misunderstood and then escalated her way up the
chain at a major telco until some engineer with the wrong rights said ‘fuck
it’ and updated bgp to route all of our traffic down her T-1 line.

That was a fun conference call, and listening to the lady on the phone I could
see how the engineer got to that point.

------
tssva
AS701 was the UUNET/Worldcom AS for the US and eventually the US/Canada
network. Verizon bought Worldcom in 2006 and they became Verizon Business.

From the late 90s to early 2000s I worked for UUNET/Worldcom as an engineer in
the network planning and design group. I worked in the international group but
among other things we were responsible for the build out of AS701 into Canada,
the exchange sites where AS701 connected to the various other international
UUNET ASes and the PoPs where dedicated circuits for international customers
who wished to connect directly to AS701 would be terminated. The point being
that I am familiar with how AS701 was operated at that time.

UUNET's reputation at the time might not have been sterling due to the
business decision to basically be a safe haven for spammers but from a
technical standpoint the network was operated at a high standard. The basic
BGP filtering referenced in the article was certainly in place at the time and
if this had happened then heads would have rolled.

~~~
w8vY7ER
As they should, this is day-one BGP configuration stuff. Nothing advanced,
nothing especially technical or time consuming. Speaks to a gross degree of
professional negligence on the part of AS701. Really disappointing to see this
degree of ignorance.

~~~
jrockway
Networking infrastructure / tools are in the stone age compared to what we
have in the software world. I am inclined to cut them some slack.

------
jboy55
Here's a shoutout to all the on-calls who woke up this morning to deal with
"someone else's problem". I think everyone who woke up gets to, at least,
order a "fancy coffee" and send the bill to Verizon.

~~~
teejmya
Amen. We need a support group.

"Hi, I'm Teejmya, and I was on call last night"

~~~
eastdakota
Yes, apologies and thank you!

If you email me (matthewatcloudflaredotcom) your shirt size, preference for
men's or women's cut, and your postal address with the subject line:

"Verizon BGP Leak On Call Support Group"

I'll send you a Cloudflare tshirt. Least we can do.

~~~
dreamcompiler
I think somebody should make some "I Fixed the Internet after Verizon Broke
It. 20190624" T-shirts.

~~~
IntelMiner
"Verizon broke the internet and all I got was this lousy T-shirt"

~~~
hn20180220
I would need a bunch of them

------
munificent
_> All of the above suggestions are nicely condensed into MANRS (Mutually
Agreed Norms for Routing Security)_

Whoever came up with that name and acronym deserves an award.

~~~
mffnbs
Wow, thanks for pointing that out, I missed it on my once-over.

------
Scoundreller
I appreciate how sympathetic Cloudflare is to the root-cause party because
they answered the phone and undid what they shouldn’t have done.

(If my understanding is correct, they shouldn’t have told Verizon about the
better routing, while Verizon should have known better)

~~~
saxonww
I read this as Allegheny's fault, actually. DQE published to Allegheny (DQE's
customer), who in turn re-published to Verizon (Allegheny's other provider).
While most of the 'prevention' section talks about what Verizon didn't do, it
doesn't seem to mention that Allegheny should not have re-published the DQE-
published routes up to Verizon.

It's startling that Verizon doesn't appear to have any leak mitigations in
place, but I feel like Allegheny is getting a pass here because they are
small, or something.
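
To make the leak concrete, here is a minimal Python sketch of the usual
"valley-free" export rule of thumb: routes learned from a customer may be
announced to anyone, while routes learned from a peer or a provider should
only be passed down to customers. The `Rel` enum and `may_export` function are
hypothetical names for illustration, not any router's actual policy language.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship of a BGP neighbor, from the local AS's view."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Return True if a route learned from one neighbor may be sent to another.

    Routes learned from customers can go to everyone. Routes learned from
    a peer or a provider may only be passed down to customers.
    """
    if learned_from is Rel.CUSTOMER:
        return True
    return export_to is Rel.CUSTOMER


# The situation described above: Allegheny learned routes from one provider
# (DQE) and re-announced them to its other provider (Verizon).
print(may_export(Rel.PROVIDER, Rel.PROVIDER))  # False: this is the leak
print(may_export(Rel.CUSTOMER, Rel.PROVIDER))  # True: ordinary transit
```

Under this rule the customer in the middle should never have exported those
routes upward, and the upstream provider can apply the same logic in reverse
by only accepting a customer's own prefixes.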

~~~
steveklabnik
Allegheny is a steel company. I don’t think most people expect them to have
the same responsibility for internet health as Verizon, even though they are a
$4B company.

(I’m from Pittsburgh, my grandfather and a bunch of my relatives worked for
this company for decades. I’ve been kinda giggling about this intersection of
my past and my present all day. I don’t work in the parts of Cloudflare that
deal with this kind of thing; I’m glad my co-workers were on top of it.)

~~~
Scoundreller
I had to do a double take because they seem to be selling themselves as a
technology company nowadays.

I guess “steel” in your name isn’t good for share prices.

~~~
steveklabnik
Tech or healthcare, that’s the way of things now.

------
ziddoap
Thank you for the summary.

And, a sincere thank you for not mincing words when it comes to something as
important as this.

> _However, against numerous best practices outlined below, Verizon’s lack of
> filtering turned this into a major incident that affected many Internet
> services such as Amazon, Fastly, Linode and Cloudflare._

> _IRR filtering would not have increased Verizon's costs or limited their
> service in any way. Again, the only explanation we can conceive of why it
> wasn't in place is sloppiness or laziness._

In an attempt to find any statement given by Verizon, I found that The
Register was able to get this amazing statement:

"Verizon sent us the following baffling response to today's BGP cockup: "There
was an intermittent disruption in internet service for some [Verizon] FiOS
customers earlier this morning. Our engineers resolved the issue around 9am
ET."" [1]

[1][https://www.theregister.co.uk/2019/06/24/verizon_bgp_misconf...](https://www.theregister.co.uk/2019/06/24/verizon_bgp_misconfiguration_cloudflare/)

~~~
yingw787
Verizon's response seems very non-committal and it appears this type of
incident may happen again if they don't take any action. Are there ways for
companies like Google or Cloudflare to work around ISPs like Verizon without
affecting ISP customers, or is this a blocker? Was the 10% of the re-routed
traffic from Cloudflare 100% of the traffic from Verizon to Cloudflare?

~~~
eastdakota
It's worse than that. BGP provides the "map" of the Internet. That map is
relayed from network to network. So, as a result, Verizon announcing a bad
route can mess up the map not just for them but for any other network that
connects to them (directly or indirectly).

We're actually fortunate at Cloudflare because of our scale and wide-spread
interconnection. That limited the impact more than it would have for a
smaller, less-connected network. The crazy thing about BGP is that any router
can announce that it's responsible for a block of IP addresses and, if it's
trusted enough, that's what the map of the Internet will reflect.

The long term solution is for networks to implement and enforce RPKI. AT&T,
for instance, implemented RPKI and we did not see any drop in traffic to their
network today.

Verizon not only didn't implement RPKI, which would be the best-of-breed
approach, but also didn't do even basic route filtering. It's as if a trusted
traffic cop (Verizon) overheard from a random passing motorist that the main
road was closed and, as a result, directed all traffic off a pier and into the
ocean.

More about RPKI if you're interested:
[https://blog.cloudflare.com/rpki/](https://blog.cloudflare.com/rpki/)
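
For a rough idea of what enforcing RPKI means in practice, here is a small
Python sketch of route origin validation in the style of RFC 6811. It is an
illustration under simplifying assumptions, not how any particular router or
validator implements it: the `Roa` class, the `validate` function and the
`ROAS` table are hypothetical, and the prefixes and AS numbers come from the
ranges reserved for documentation.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Sequence


@dataclass(frozen=True)
class Roa:
    """Simplified ROA: origin_as may announce prefix, up to max_length."""
    prefix: str
    max_length: int
    origin_as: int


def validate(prefix: str, origin_as: int, roas: Sequence[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    announced = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        if announced.version != roa_net.version:
            continue  # never compare IPv4 against IPv6
        if not announced.subnet_of(roa_net):
            continue  # this ROA says nothing about the announced prefix
        covered = True
        if roa.origin_as == origin_as and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none: reject. No ROA at all: unknown.
    return "invalid" if covered else "not-found"


# Hypothetical ROA table: AS64500 may originate 198.51.100.0/24, nothing longer.
ROAS = [Roa("198.51.100.0/24", 24, 64500)]

print(validate("198.51.100.0/24", 64500, ROAS))  # valid
print(validate("198.51.100.0/25", 64500, ROAS))  # invalid (too specific)
print(validate("198.51.100.0/24", 64511, ROAS))  # invalid (wrong origin)
print(validate("192.0.2.0/24", 64500, ROAS))     # not-found
```

A network that drops "invalid" routes would discard the second and third
announcements no matter which neighbor relayed them. A more-specific than the
ROA allows is the kind of route a BGP optimizer tends to produce.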

~~~
psim1
The amount of posturing and blaming in Cloudflare's response is breathtakingly
unprofessional. If the article was just a few sentences longer, you could have
squeezed in a few more statements of blame. We know, they messed up. But
Cloudflare isn't making itself look any better by rolling the bus over Verizon
again and again.

~~~
txcwpalpha
I 100% agree with you, though I am unsurprised that HN is downvoting you. HN
seems to revere Cloudflare bigtime despite the fact that Cloudflare often uses
HN as their own corporate PR platform. I _absolutely loathe_ Verizon and I'll
be the first to line up for a good public lashing of US ISPs, but even I feel
like this blog post is unnecessarily unprofessional.

What strikes me the most is that this whole "event" would have hardly even
registered on anyone's radar (it affected less than 10% of their traffic
during early hours of the morning. I saw one single news article about it,
buried on The Verge, but other than that nothing), _except_ for the fact that
Cloudflare's CTO was on HN this morning fanning the flames of the one thread
about it. It's like they dug their own hole drawing attention to the
"Cloudflare outage" headline, and now they're overcompensating by going to
drastic measures to blame someone else.

And now they keep harping on the fact that Verizon still hasn't responded?
Sure, part of that is probably the fact that Verizon is a giant corporation
that doesn't want to bother with this stuff, but the other part is that this
"event" was hardly even big enough of a deal to register on VZ's PR team's
radar, no matter how much CF whines about it.

This blog post (and the accompanying HN comments from Cloudflare execs) just
scream "immature company" to me. There's a reason that Cloudflare is the one
making this blog post and devoting CEO time to it while the established
behemoth is just going about their business as usual.

~~~
SpicyLemonZest
The context (which isn't obvious, and I don't blame you for not knowing it) is
that the Internet is held together by spit and duct tape. The only reason it
works at all is that major participants are good actors, in the sense that:

1\. They implement basic precautions to prevent dumb things from going wrong.

2\. They're available 24/7, to immediately respond to and remediate whatever
does go wrong.

3\. Both of the above are core obligations, which supersede any questions of
public relations or maturity or higher-ups not wanting to be bothered.

If Verizon can't be trusted to properly operate their network, that's an
immediate threat to the health of the Internet, and many people do need to be
made aware of it. It's not just Cloudflare being salty because their customers
yelled at them.

~~~
txcwpalpha
I know the context, but that's irrelevant here. Whatever the cause, a root
cause analysis pointing back to CF is nice for CF to help solve the situation,
and is even nice to have for us tech enthusiasts here on HN (though it should
still maintain professionalism). But for customers and decision makers at
companies that might be looking at considering purchasing Cloudflare, you know
what I _don't_ care about? Whose fault it was. There are multiple buckets of
companies here:

1\. Cloud providers that were affected enough to apparently devote not
insignificant CEO and CTO time to it (Cloudflare)

2\. Cloud providers that were affected but seemingly not enough for it to even
register as anything more than a blip on their status tracker (Google, AWS,
etc)

3\. Cloud providers that weren't affected

As a potential customer thinking about buying services from one of these
companies, which one do you think I am going to go with? It certainly won't be
CF. And if I am already engaged with CF, I want to know _what CF is going to
do to mitigate this situation in the future_ , and _no_ , pointing fingers
like a child and saying "it wasn't our fault!" doesn't count.

Cloudflare can't really control Verizon's actions that led to this situation,
but they _can_ control how they respond to it and mitigate it. They had an
opportunity to stand up as a leader and improve the internet (which is
literally their company motto). As you pointed out, the internet working
correctly is a matter of companies working together as good actors, and
getting these companies to work together via good, strong relationships is a
part of that.

Did Cloudflare do that? Nah. Instead, they made a petty blog post and their
CEO is on Twitter telling Verizon they should be ashamed. I don't know exactly
what his goal there was, but I assume it has something to do with hoping
they'll be better in the future (if that's not his goal, then it really is
just petty finger pointing). And if Cloudflare's CEO's method of getting
people to improve their work is to _publicly shame them_ , I really feel bad
for anyone who works under him.

~~~
SpicyLemonZest
To be frank, your post makes it clear that you don't know the context. CF
simply cannot do anything on their own to mitigate the problem where Verizon
constructs bad BGP routes to Cloudflare IPs and then advertises those routes
to third parties. The only mitigation possible is to contact whoever's
advertising the bad routes and get them to stop.

~~~
txcwpalpha
Have you read Cloudflare's multiple blog posts regarding BGP? Did you read the
tweets from their directors talking about how other customers were unaffected
by the event today because of the mitigations put in place? Did you even do
the simplest Google about BGP protocols and the plans in place to prevent this
from happening in the future?

If you're going to try to impose yourself as the gatekeeper of "knowing the
context", you should probably know it yourself. Saying CF "simply cannot do
anything" is narrow minded at best, and completely wrong otherwise. In fact,
in _this very blog post_ linked in the OP, Cloudflare talks about taking steps
to mitigate BGP issues in the future. That's great, if only it wasn't also
paired with a childish finger pointing session.

~~~
ubercow13
AT&T customers were unaffected because of mitigations put in place by AT&T
that _Verizon_ hasn't put in place. The steps you refer to in the blog post
are ones that _Verizon_ has to take.

~~~
txcwpalpha
Yes, and? Cloudflare themselves are the ones pushing their own company as
"leaders" in this field, and being a "leader" does not mean "pointing fingers
and trying to avoid blame whenever something bad happens". If they fancy
themselves leaders regarding BGP, as said on their website, then they need to
actually act like leaders.

And as I've said multiple times now, Cloudflare _was_ in a great position here
to stand themselves up as a strong leader on this topic to start working
together with other companies (a la Verizon) to start to make real headway to
fix the BGP problem. As other commenters have noted, the internet is entirely
built on multiple organizations acting in good faith towards one another.
Verizon failed to do that, and Cloudflare's response _also failed to do that_.
I said it in another comment, but I'll also say it here: publicly berating the
people that you are supposedly taking a leadership position over _is not good
leadership_. This entire episode is going to do _nothing_ to encourage Verizon
to work closely with CF to fix this issue. In fact, I imagine it will do the
exact opposite.

Today was a display of incompetence from Verizon, and a display of bad
leadership by Cloudflare. I have no idea why any objective-minded person would
be applauding Cloudflare for this. As I mentioned elsewhere, I would normally
love a good public bashing of Verizon, but not when it comes at the cost of
professionalism and progress.

~~~
Dylan16807
Companies respond just fine to public scrutiny, caused by them being
rightfully and loudly blamed. The way you lead people isn't the same as the
way you lead companies.

Verizon was acting _so_ badly that it's clear the pure friendly approach was
doing absolutely nothing. And I'm sure Cloudflare is willing to give very real
and pleasant engineering help if desired.

If Verizon doesn't want to talk to Cloudflare, that's fine too. This is not a
problem that requires active cooperation. They just have to do their job.

~~~
txcwpalpha
>rightfully and loudly blamed

There is an enormous difference between assigning fault in a good faith
attempt to find a root cause/solution, and casting unnecessary, unprofessional
insults such as "Verizon's team should be ashamed of themselves". One is
productive, and the other is just being a dick.

>The way you lead people isn't the same as the way you lead companies.

Yes, it certainly is. A company is an organization of _people_ , after all.
You don't get to eschew professionalism and start throwing around insults just
because a group of people has decided to attach an additional label over their
heads.

And just to put an even finer point on it, Matthew Prince's tweets about the
issue were not targeted at Verizon "the company". He specifically attacked
Verizon's NOC and its team members. Despite everything, this isn't a faceless,
soulless corporation that's having insults hurled at them. He specifically
went after a specific group of _people_ and publicly shamed them. And then he
has the gall to shame them even more for not immediately chomping at the bit
to help someone who just aggressively insulted them.

Ask yourself: if Matthew Prince had sent a tweet berating team members from
his own company, telling them they should be ashamed of themselves, and spent
the rest of the day commenting on the internet insulting their competence,
would you still be saying he is a good leader? Or even a good CEO? Of course
not. It's Leadership 101 that insulting your team members isn't a good
leadership style. And that doesn't change just because Prince isn't the one
signing the Verizon team's paychecks.

> This is not a problem that requires active cooperation.

This is clearly not the opinion of those at Cloudflare that are loudly kicking
their feet and whining that Verizon didn't devote enough resources to actively
cooperate with Cloudflare's troubleshooting today.

~~~
Dylan16807
> A company is an organization of people

Blaming a specific team can get too personal. Blaming an entire company is
more about the decision-making structure, and is close to as impersonal as you
can get. It's really not the same as blaming a person.

> This is clearly not the opinion of those at Cloudflare that are loudly
> kicking their feet and whining that Verizon didn't devote enough resources
> to actively cooperate with Cloudflare's troubleshooting today.

They didn't notice, acknowledge, or fix the problem. That's different from a
lack of resources devoted to active cooperation. Heck, two messages of "on it"
and "it's fixed" would be a pleasant level of "active cooperation", and that
takes only a minute or two.

~~~
txcwpalpha
> Blaming a specific team can get too personal.

And yet blaming a specific team is exactly what they did.

>They didn't notice, acknowledge, or fix the problem. That's different from a
lack of resources devoted to active cooperation. Heck, two messages of "on it"
and "it's fixed" would be a pleasant level of "active cooperation", and that
takes only a minute or two.

Sure, I'm not defending Verizon's inaction. My point is that regardless of the
level of the cooperation, _some_ cooperation is clearly still required. And
now because of Cloudflare's hostility towards Verizon after this incident, I
wouldn't be surprised if Verizon is much less inclined to participate in _any_
cooperation. That not only seems counterproductive to Cloudflare's goal, it's
also bad for all of us that use the internet.

~~~
Dylan16807
> And yet blaming a specific team is exactly what they did.

In this specific case, just blaming "Verizon", it was not personal. (There are
a variety of things that can be classified under "blaming a team" so I can't
give it a blanket okay/not okay.)

Knowing it's the NOC team, as an amorphous blob of nameless people, is not
getting too personal.

Just because something can be traced to a team doesn't mean that shaming the
company is the same as shaming specific people from that team.

Going down that road would declare _everything_ as personal, and that's really
not how things work.

> I wouldn't be surprised if Verizon is much less inclined to participate in
> any cooperation.

The public pressure should be stronger than any pettiness, and if it's not
then the solution is to let even more people know it was Verizon's fault.

~~~
txcwpalpha
>In this specific case, just blaming "Verizon", it was not personal.

That isn't what they did. They specifically called out teams, which according
to what you just said, is too personal.

[https://twitter.com/eastdakota/status/1143182575680143361](https://twitter.com/eastdakota/status/1143182575680143361)

> The teams at @verizon and @noction should be incredibly embarrassed at their
> failings this morning ... It’s networking malpractice that the NOC at
> @verizon has still not replied to messages

Not only does he specifically call out the NOC, he also calls out teams. It is
very obvious which "the teams" he is referring to, and "the NOC" is indeed a
specific team. In other comments he also calls out Verizon's support team.

This wasn't the case of "tracing it back to a team". CF's CEO specifically
addressed them and told them to be ashamed of themselves. That's personal, and
it's also being a dick to boot. Was there _anything_ in this situation that
was gained by Prince calling these people out in these tweets? Would it not
have been just as effective at calling out Verizon (while being less
unprofessional and less personally malicious) if those tweets had been less
vitriolic?

> The public pressure should be stronger than any pettiness, and if it's not
> then the solution is to let even more people know it was Verizon's fault.

So the solution to pettiness is more pettiness? Why does CF have a license to
be petty but VZ apparently does not?

~~~
Dylan16807
> according to what you just said, is too personal

That is not what I said!

I said it _can_ be, and then I clarified with: There are a variety of things
that can be classified under "blaming a team" so I can't give it a blanket
okay/not okay.

I see the tweet. I call this case not personal. He's pointing the blame at
large groups inside someone else's opaque company.

If you're pointing at a blob of 100+ people (like you said, support is also
being blamed) then you're not making it personal.

> Was there anything in this situation that was gained by Prince calling these
> people out in these tweets?

People know what company to blame (a good thing), but nobody outside that
company even knows _how many teams_ , let alone specifics about the people on
those teams (an acceptable thing). Overall positive.

> Would it not have been just as effective at calling out Verizon (while being
> less unprofessional and less personally malicious) if those tweets had been
> less vitriolic?

Being less vitriolic would not make it more or less personally targeted.

I'm not sure if the vitriol _helped_ exactly but I think Verizon did enough to
deserve it that there's no need to berate Cloudflare for the vitriol itself.

> Why does CF have a license to be petty but VZ apparently does not?

Presuming I even agree with your definition of pettiness, the problem is not
the pettiness itself, but the actions they take or don't take.

It's not terrible for VZ to be petty as long as they still fix their broken
equipment.

~~~
txcwpalpha
>If you're pointing at a blob of 100+ people (like you said, support is also
being blamed) then you're not making it personal.

Ahh, I see. So it's okay that he was offensive and insulting, because he was
offensive and insulting to many people? It wouldn't have been okay if he was
offensive and insulting to only a handful of people, but because it was more
than that, it's okay? Is this some weird perversion of "one death is a
tragedy, 1000 deaths is a statistic"?

He isn't pointing the blame at a large group inside an "opaque" company. He's
insulting _people_. The people at Verizon will know full well that he is
talking to them. People that work with the Verizon NOC will know full well
that _those specific people_ are being insulted by this CEO. The fact that it
was personally directed at multiple people doesn't make it any less personal,
it just makes it personal to more people, no matter how much you move the
goalposts.

> I'm not sure if the vitriol helped exactly but I think Verizon did enough to
> deserve it that there's no need to berate Cloudflare for the vitriol itself.

So it didn't _help_ to berate Verizon, but it was still okay because they
"deserved it"? And then you don't apply the same logic to Cloudflare
themselves? There absolutely is a need to berate Cloudflare for their
unnecessary use of vitriol, _especially_ if you're telling me the bar for
berating someone is as low as "it didn't help but that's okay".

It's clear at this point that you're moving goalposts and adjusting your own
principles in some weird attempt to defend Cloudflare. Cloudflare did nothing
positive here, and your attempt to justify their vitriol and maliciousness is
telling.

~~~
Dylan16807
> Is this some weird perversion of "one death is a tragedy, 1000 deaths is a
> statistic"?

Nah. If you deliver 100 insults to 100 people, that's terrible. But if you
deliver one insult to a vague blob of 100 people, that barely registers. The
amount of insult directed at any specific person is tiny. That's why I'm not
bothered by it.

> it just makes it personal to more people

No.

> no matter how much you move the goalposts

Really?

Someone disagrees with you so they must be moving goalposts?

Do better than that. I've been consistent on what I consider personal.

Also, I think you're too focused on vitriol. You can single people out and
cause them harm while using the nicest and most polite language in the world.
The way you target and your underlying meaning is far more important than your
choice of words.

> So it didn't help to berate Verizon, but it was still okay because they
> "deserved it"? And then you don't apply the same logic to Cloudflare
> themselves? There absolutely is a need to berate Cloudflare for their
> unnecessary use of vitriol, especially if you're telling me the bar for
> berating someone is as low as "it didn't help but that's okay".

Let's put it this way. I regard "impersonal beration" as one tenth the crime
of "being obviously and extremely negligent with equipment that can break the
internet". And I'm willing to forgive vitriol when it's deserved and
impersonal.

You don't forgive that, and want to say Cloudflare acted somewhat badly? Okay,
sure.

You want to claim they are failing as a leader, overcompensating with drastic
childish measures to blame someone else for something they could and should
have mitigated themselves? I completely disagree.

~~~
dang
Please don't break the site guidelines. Also, please don't do these intense
tit-for-tat arguments with another user. They don't help, they lower the
signal/noise ratio, and they bore everyone else. I know it's hard (believe me
I know how hard it is), but at some point someone needs to be the first to let
go.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
Dylan16807
That's fair. While I haven't had a _huge_ number of arguments like this, I can
only name one or two that resolved successfully. I'll leave things earlier.

------
woliveirajr
> One of our network engineers made contact with DQE Communications quickly
> and after a little delay they were able to put us in contact with someone
> who could fix the problem. DQE worked with us on the phone to stop
> advertising these “optimized” routes to Allegheny Technologies Inc. We're
> grateful for their help. Once this was done, the Internet stabilized, and
> things went back to normal.

It's funny how we still have to use the _phone_ to help fix an internet
routing problem, even if _phone_ no longer literally means the old black
curly-wire equipment.

~~~
azernik
Gotta have a channel that's out-of-band with the internet to fix problems with
the internet.

~~~
jedberg
Nowadays with voip “the phone” isn’t as out of band as we’d like.

~~~
heywire
Is it time to put an HF ham radio rig in each major provider’s office? I
shudder thinking of a major outage where even phone communication can’t take
place.

I’m only half joking.

Edit: and maybe we just give Verizon a toy walkie talkie

~~~
kortex
You joke, but cascade failure ain't no laughing matter, either. Emergency
plans aren't for fair weather, they are for sh!tstorms. I don't want to say
"we are too reliant on the internet" because it's cliche and connectivity is
just part of growing the modern world. But we sure as heck need several layers
of backup plans in case things go sideways.

I, for one, hope there is a secret society of HAMs lurking as mild-mannered
employees at every telco and ISP, ready to wire things back together when they
short out.

~~~
minou
There is an IRC server (several servers, actually) + channel that a variety of
network operators are on that has existed since the very early 2000s for these
sorts of events. 414 users on it now, most people's nicknames include their ASN
to make it easier to find each other.

Unfortunately, Verizon is one of those networks that isn't present there. But
many other networks are represented there and it provides a direct path to
those who have config/enable access on some of the largest networks out there.
Cuts out the having to go via formal escalation paths and NOC groups that
require a trouble ticket before you can engage them.

------
eropple
Long ago, in the mists of time when I was a wee lad, the internet was a
simpler place. There were seven buttons around the world, all pressed down by
volunteers. If any four of them were released, the world would end. “The
world” was defined as “the internet”, and at the time that meant “the world”
was defined as “men with beards and suspenders and real opinions about Star
Trek”, and so that wasn’t so bad.

Today in 2019, “the world” is defined as “you know, _the world_”, and there
are seven _million_ buttons being held down all over the world.

If any four of them are released, the world ends.

We have made mistakes, is what I’m saying.

(I once had call to explain to nontechnical people how and why the internet is
the way it is and why my ops crews tend to be full of people who are a little
too calm about things being constantly on fire. This was my best crack at it.)

~~~
inopinatus
Back in that oft-forgotten age I was privileged to know, work, and chill with
the brave volunteers (and, subsequently, paid RIR staff) holding down the
buttons. It’s worth mentioning that even back then there dwelt in the west a
large ugly troll whose name was AS701 (AS701 was not alone, either, having two
hideous siblings, AS702 and AS703, that lived in other climes). Everyone
tiptoed around the beast, because when angered it would flap its routes and
there would be a great wailing and severe packet loss. The brave volunteers
tried many times to tame the awful creature and I’m very sorry to see that it
is still fucking everyone’s announcements even today.

~~~
Gaelan
For the lazy:

AS701 Verizon Business/UUnet

AS702 Verizon Business/UUnet Europe

AS703 Verizon Business/UUnet ASPAC

------
icedchai
I remember the early days of the Internet, when I could log in to an ISP
router running BGP, with a blazing fast T1 to an early tier 1 Internet
provider. We could literally announce any route we wanted, no filtering. We
used to regularly black hole spammers, then turn them back on an hour or two
later.

~~~
nikisweeting
Please write about your life / these times! Or link me to your blog/writing if
you already have. <3

~~~
icedchai
It's not terribly interesting. No blog, sorry.

------
avip
CloudFlare continues to raise the bar in terms of communicating technical
issues to the public. Thank you for yet another enlightening writeup.

------
dreamcompiler
It's clear that Verizon broke the Internet this morning through incompetent
BGP management, and they could do it again. Who holds them accountable?

~~~
Uehreka
_glances over at Ajit Pai_

Nobody.

~~~
driverdan
It doesn't need government intervention. It needs other companies to hold them
accountable.

~~~
taildrop
Absolutely. If you are a Verizon service provider customer, please call them
and register your feelings on this matter. Make sure you let them know in no
uncertain terms that you are considering switching providers based on their
lack of following best practices.

~~~
ep103
And will do so, once another provider becomes available in your area.

------
OBLIQUE_PILLAR
I hope we remember this thread the next time Shady Russians or Crafty Chinese
make a BGP mistake and are accused of Active Measures.

~~~
xen2xen1
This is an obvious mistake made by a steel company. The BGP "errors" that make
tons of traffic go to Russia or China OFTEN WORK COMPLETELY. Some have worked
so well no one noticed for months. A large section of a country's traffic
being successfully hijacked to another country is far, far different from an
error. Accidents usually break things. What Russia and China do is too
successful to be an accident.

------
dang
[https://news.ycombinator.com/item?id=20262214](https://news.ycombinator.com/item?id=20262214)
is the earlier thread on this.

~~~
jgrahamc
Thanks. And thanks for the help today updating the title on the original post
as the real cause came to light!

------
cesarb
> For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21
> and 104.20.8.0/21. [...] The prefixes Cloudflare announces are signed for a
> maximum size of 20. RPKI then indicates any more-specific prefix should not
> be accepted, no matter what the path is.

Did RPKI help reduce the scope of this incident, by stopping propagation of
these faulty routes earlier than otherwise? Or did it have no effect in this
case?
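
For anyone who wants to see the check the quote describes, here is a rough sketch of RPKI route origin validation using Python's standard `ipaddress` module. The `rov_state` function and the ROA tuple layout are made up for illustration, and the single ROA below is an assumption built from the quoted numbers (the /20 with a maximum length of 20, with AS13335 as the origin), not a dump of real RPKI data.

```python
import ipaddress


def rov_state(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'.

    `roas` is a list of (prefix, max_length, asn) tuples. This layout is a
    hypothetical simplification for illustration only.
    """
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # A ROA covers the announcement if the announced prefix sits inside it.
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            # Valid only if the origin matches AND the prefix is not longer
            # than the maximum length the ROA allows.
            if origin_asn == roa_asn and net.prefixlen <= max_length:
                return "valid"
    # Covered by some ROA but matching none of them: invalid.
    return "invalid" if covered else "not-found"


# Assumed ROA, built from the numbers quoted in the blog post.
roas = [("104.20.0.0/20", 20, 13335)]

for announced in ("104.20.0.0/20", "104.20.0.0/21", "104.20.8.0/21"):
    print(announced, rov_state(announced, 13335, roas))
```

Under those assumptions the /20 comes out valid and both /21s come out invalid, which is why a network that drops invalid routes would not have picked up the leaked more-specifics.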

~~~
symfoniq
Anecdotal, but:

The article notes that AT&T has implemented RPKI, and a client mentioned to me
that he wasn't having problems accessing Cloudflare-hosted infrastructure via
his AT&T phone. The rest of his employees were having major issues though via
the municipal fiber service provider.

~~~
eastdakota
Yup. There was no notable impact to AT&T traffic because they rejected the
routes because they're filtering based on RPKI. Here's a Tweet from
Cloudflare's Head of Network showing the AT&T vs. Verizon graphs:
[https://twitter.com/Jerome_UZ/status/1143276134907305984](https://twitter.com/Jerome_UZ/status/1143276134907305984)

------
qaq
One would think Cloudflare team would have a direct line of communication to
all tier 1 Internet providers.

~~~
eastdakota
We thought we did. And tried both public and private lines of communication —
without reply. Still waiting.

~~~
avip
Hats off to Verizon for treating everyone equally without discrimination.

~~~
ohithereyou
"We don't care. We don't have to. We're the phone company." \- Verizon,
probably

It's good to know that Net Neutrality isn't dead.

------
unethical_ban
I am surprised that CF is as aggressive toward Verizon in public as they are.
Once you start breaking the Internet for stupid reasons, though, you probably
deserve it.

I know very little about BGP operations; I did not know that there was PKI and
route validation like they described in the article.

~~~
fencepost
They'd probably be less aggressive if they'd been able to reach _anyone_ there
or had received any response (none as of 8 hours after the incident). As noted
above in this discussion the CF team _thought_ they had appropriate contact
information for all top tier carriers, and I suspect they do have what Verizon
would call the appropriate contact info. Not much they can do if Verizon
ghosts them, though.

I guess they could take steps to null route everything to/from Verizon to see
if they could get someone's attention that way.

------
kosmet
Thank you Verizon for giving me a recent example to talk about while teaching
BGP in Computer Networks class. My favorite example will still be Pakistan
routing all YouTube traffic to itself while trying to restrict access to it.
It is a little bit old though:
[https://www.cnet.com/news/how-pakistan-knocked-youtube-offli...](https://www.cnet.com/news/how-pakistan-knocked-youtube-offline-and-how-to-make-sure-it-never-happens-again/)
[http://web.mit.edu/6.02/www/s2012/handouts/youtube-pt.pdf](http://web.mit.edu/6.02/www/s2012/handouts/youtube-pt.pdf)

------
camgunz
Verizon's lucky it's a blog post that doesn't mince words, rather than a
lawsuit.

~~~
clvx
Can they bring a lawsuit? Has Verizon broken their SLA? Is there a manual to
mitigate all the edge cases? What about being aware internally that this had
to be improved, but it was delayed due to bureaucracy?

~~~
MrStonedOne
>edge cases?

This is not an edge case, allowing downstream networks to broadcast routes for
networks they do not own is a very well known security and operational issue
with operating an ISP. Massive parts of the internet went down in the 90s to
teach us this lesson.

Likewise, bureaucracy does not excuse not fixing an issue that's existed since
the 90s, and not deploying any 1 of 3 mitigation tricks (let alone all 3).

Negligence causing damage from lost sales/traffic is sue-able.

The case would basically revolve around whether or not V had an obligation to
prevent this from happening, and whether or not they were grossly negligent in
that obligation.

In my view the answer is yes.

~~~
henryfjordan
Who has standing though?

A Verizon customer might say "my internet was down" but there is 100% a clause
in their contract about outages and SLAs.

Any company that lost sales likely doesn't have a contract with them, so what
are they going to sue for? "Verizon didn't carry my 1s and 0s for free this
morning"? Person A on the freeway having an accident and causing Person B to
sit in traffic and miss their sales meeting isn't liable for that...

Maybe their peers (other telcos) have more standing because they couldn't
deliver to their customers as a result but they of course all have a clause in
their contracts about outages and SLAs that means ultimately they lost no
money so there are no damages.

And this is why we need government regulation, either to break up the Telcos
or nationalize them

------
djhworld
Very interesting and easy to understand overview, thanks.

I'm not familiar with this side of networking, but it sounds to me the "BGP
Optimiser" product was left largely to its own devices and automated a
configuration change without any explicit approval from a human operator (I
could be wrong)

With the protocol being prone to problems like leaky routes and sloppy peers
accepting them, is it really wise to leave these BGP optimiser products
running without some level of supervision?

EDIT: of course I guess the human operator might wave the change through too
without fully appreciating the problem...

~~~
linsomniac
I think the reality of it is that these BGP optimizers really can't be human
checked. There's just too much it is doing, and for them to be really
beneficial they need to respond quickly to network path congestion. I would be
surprised if overseeing such a system could be done with fewer than 6 full
time people, as a WAG.

... Which is why you should be really sure that these optimized routes never
leak! And on top of it, Verizon should never have accepted those
announcements.

------
ralph84
Nothing is ever new in IT. A similar incident happened 22 years ago. Buggy
and/or misconfigured gear disaggregates and reannounces routes and creates a
blackhole.

[https://en.wikipedia.org/wiki/AS_7007_incident](https://en.wikipedia.org/wiki/AS_7007_incident)

------
kevinreedy
> The RPKI framework that we implemented and deployed globally last year is
> designed to prevent this type of leak. It enables filtering on origin
> network and prefix size. The prefixes Cloudflare announces are signed for a
> maximum size of 20. RPKI then indicates any more-specific prefix should not
> be accepted, no matter what the path is.

Does RPKI prevent Cloudflare from announcing additional /22 routes during an
incident like this? Any network with RPKI implemented would reject the /22s,
but those who ignore it should pick them up over the leaked /21s.
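
A small sketch of why the /22s would win on networks that don't validate: routers forward on the longest matching prefix. The `best_route` helper and the route labels are hypothetical, and this ignores everything else BGP looks at (AS path, local preference and so on).

```python
import ipaddress


def best_route(destination, routes):
    """Return the (prefix, label) pair with the longest prefix covering destination.

    `routes` is a list of (prefix, label) tuples. The labels are made up for
    this example.
    """
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, label in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, label))
    if not matches:
        return None
    # Longest-prefix match: the most specific covering route wins.
    net, label = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), label


table = [
    ("104.20.0.0/20", "legitimate"),
    ("104.20.0.0/21", "leaked"),
    ("104.20.8.0/21", "leaked"),
]
print(best_route("104.20.1.1", table))

# Adding a hypothetical emergency /22 makes it the most specific match.
table.append(("104.20.0.0/22", "emergency"))
print(best_route("104.20.1.1", table))
```

With only the first three routes the leaked /21 is chosen over the legitimate /20. Once the /22 is added it takes over for the addresses it covers.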

~~~
eastdakota
We could break our prefixes into smaller routes, but 1) the Internet's routers
have limited memory; 2) we have a lot of routes; and 3) we want to be good
Internet citizens.

If every network announced all their routes as /24s — the smallest route
generally accepted over the public Internet — the routing table would be a
giant mess and would overwhelm many routers' ability to store them.

That said, after today we are thinking about ways that, in case of an
emergency, we could break the routes down to be more specific than whatever is
leaking. Given how broadly peered we are, Cloudflare's network will be as
protected as anyone's. However, that's not really a good solution for the
Internet generally. Better that we all implement and enforce RPKI.
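
To put a number on the routing-table point, here is a quick back-of-the-envelope sketch. The `deaggregated_count` helper is hypothetical, and the /20 used is just the example prefix from the blog post.

```python
import ipaddress


def deaggregated_count(prefix, new_length=24):
    """Number of routes needed if `prefix` is announced as /new_length pieces."""
    net = ipaddress.ip_network(prefix)
    if new_length < net.prefixlen:
        raise ValueError("new_length must not be shorter than the prefix itself")
    # Each extra bit of prefix length doubles the number of routes.
    return 2 ** (new_length - net.prefixlen)


print(deaggregated_count("104.20.0.0/20"))  # one /20 becomes 16 /24s
```

One /20 turns into 16 table entries when announced as /24s, and that multiplier applies to every aggregate a network holds.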

~~~
hn20180220
Kudos for a CEO that understands the ins and outs of Internet routing, making
me want to join CF's neteng team

------
jepler
Is Verizon's engineer on vacation, or did he just call in sick today?

------
hyperion2010
Over a decade ago one of my friends was banned from the Sheffield Uni network
for playing around with BGP and knocking the whole campus offline. One kind of
has to wonder whether Verizon can suffer the same consequences simply by
collective action on the part of other affected parties.

~~~
azernik
Nope - because said collective action would probably involve denying service
to tens of millions of Verizon customers.

~~~
MrStonedOne
They could do better

~~~
henryfjordan
Can they? How many choices of ISP do you have? I have 2, Spectrum (Time
Warner) or some no-name for the same price with 1/10th the speed. I could not
practically switch off Time Warner if I wanted to without moving. I imagine
that Verizon customers are in the same boat.

------
foota
Amusing callout to pager duty in the screenshot of the call-log :)

~~~
cabaalis
Which is more interesting, philosophically? The internet had a problem due to
a single issue, or that the internet's problem was fixed due to a single
person calling various people on a cell phone?

(device interactions versus human interactions..)

------
qwerty456127
Every time I see the BGP abbreviation it's about a huge fuck-up. Either
somebody hijacks routes intentionally or something like this happens.

~~~
avip
Like any other critical underlying infrastructure of our lives, it's taken for
granted and ignored until it breaks.

------
flatiron
I work from home in NJ and I knew something was screwy this morning. I wish
there was a place I could have checked to see it was this issue. I rebooted
pretty much everything in my house.

~~~
BonesJustice
I’m in NJ and had issues this morning as well. Figured it was a local outage
and promptly forgot about it after I left for work. I never expected it was
something so serious. Crazy.

------
VectorLock
"Why is PagerDuty calling me before I've had my coffee?" _Call declined._

~~~
avip
Well, you just lost yourself a T-shirt Sir.
[https://news.ycombinator.com/item?id=20269076](https://news.ycombinator.com/item?id=20269076)

------
stackzero
I don't fully understand all the networking protocols involved but who/what is
responsible to manage this kind of failure in the network? How should this
ideally be handled besides cloudflare engineers calling someone?

------
himangshuj
[https://blog.codinghorror.com/coding-for-violent-psychopaths...](https://blog.codinghorror.com/coding-for-violent-psychopaths/)

------
InTheArena
I don’t think Cloudflare is going to get any business from Verizon anytime
soon.

This may be as professional a “Hey Verizon, you don’t know what the fsck
you are doing” as I’ve seen.

~~~
nikisweeting
They're both big enough that they probably can't live without each other,
which makes for an interesting relationship because they can probably swear at
each other all day long and nothing will come of it.

------
stunt
What are the concerns around RPKI?

I'd like to know why some companies aren't implementing it if it solves
similar problems. What kind of criticism does it receive?

------
dentemple
So how many phone calls do you think Allegheny Technologies or that ISP in
Pennsylvania received this morning?

------
nodesocket
I saw a few of my static sites that are hosted on Cloudflare and a few other
auxiliary 3rd party services I use flapping back and forth on PagerDuty, but
not all of my Cloudflare sites triggered down.

Is this just because the unaffected Cloudflare sites were not within the CIDR
range affected?

------
wil421
>It doesn't cost a provider like Verizon anything to have such limits in
place. And there's no good reason, other than sloppiness or laziness, that
they wouldn't have such limits in place.

I see you haven’t had much experience with large Telcos. They are all like
this.

------
kossTKR
I still get a Cloudflare 1020 error here:
[http://shadow.tech](http://shadow.tech) (my location : Scandinavia ) Are
these sites waiting for some kind of propagation or cache busting? It's a
pretty large gaming service.

~~~
judge2020
Works from the ATL DC, what is the airport code that shows up on
[https://cloudflare-test.judge.sh/#shadow.tech](https://cloudflare-test.judge.sh/#shadow.tech)?
Might be a local [maybe routing] issue with CF -> shadow's web server.

~~~
AnssiH
Not working here either (Finland), that page shows HEL for me.

shadow.tech shows "Error 1020 Ray ID: 4ec1c24b2a945b25 • 2019-06-24 21:22:45
UTC Access denied What happened? This website is using a security service to
protect itself from online attacks."

~~~
judge2020
Oh, 1020 access denied means some firewall rule (any combination of rules [0]
and `if`/`or` statements) or access controls (such as blocking IP ranges and
countries) blocked access to the website. These are always set up by the site
operator, so this isn't caused by any CF issues or the earlier routing issues.

I guess best course of action for those who want to access the site is to
tweet them
[https://twitter.com/shadow_official](https://twitter.com/shadow_official)
with your Ray ID.

0: [https://developers.cloudflare.com/firewall/cf-firewall-rules...](https://developers.cloudflare.com/firewall/cf-firewall-rules/fields-and-expressions/)

------
antihero
Would this have affected stuff in the UK? All sorts of sites like MealPal were
inaccessible this morning for a bit

~~~
ethbro
I wouldn't be surprised if various other transit providers slurped the bogus
routes up from Verizon (without filtering), so it's certainly possible.

------
mvkel
Has Verizon responded in any capacity yet?

------
MBCook
Can someone explain why the optimizer would split one route into two? Wouldn’t
it be more optimized to coalesce routes whenever possible?

~~~
teraflop
If an ISP has multiple physical connections that it could use to reach
Cloudflare's network, it makes sense to distribute the traffic that's
addressed to different IPs across different links, instead of using a single
route that sends all the traffic over one link and leaves the others idle.
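
A toy sketch of that idea, assuming two uplinks with made-up names. This is not how any particular optimizer product works, just the basic move of cutting one prefix into more-specifics and pinning each to a different link.

```python
import ipaddress


def split_across_links(prefix, links):
    """Split `prefix` into two halves and assign them to links round-robin.

    `links` is a list of hypothetical link names.
    """
    net = ipaddress.ip_network(prefix)
    # prefixlen_diff=1 yields the two more-specific halves of the prefix.
    halves = net.subnets(prefixlen_diff=1)
    return {str(half): links[i % len(links)] for i, half in enumerate(halves)}


print(split_across_links("104.20.0.0/20", ["link-a", "link-b"]))
```

The generated /21s only make sense inside the network that created them, which is why they have to be kept from leaking to anyone else.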

~~~
MBCook
Good point. Thanks.

------
ggg2
They automated a system that relies solely on trust? ...China hackers are
surely having a great day.

------
bogomipz
From the post:

>"It doesn't cost a provider like Verizon anything to have such limits in
place. And there's no good reason, other than sloppiness or laziness, that
they wouldn't have such limits in place."

Is "sloppiness or laziness" really the only possible attribution here? I'm not
a big fan of Verizon but I'm a big fan of civility and empathy, two qualities
which your blog post lacks. Outages are a really unfortunate fact of life.
We've seen them recently with Google, AWS, Dyn - all companies where technical
competency is generally not questioned. It's quite possible the cause of
this outage was some "perfect storm" scenario such as an eBGP router rebooted
and came up with a stale or incorrect config. "Perfect storm" scenarios even
happen at companies with very rigorous engineering cultures as we saw with the
most recent Google outage.

Your attempt to shame an organization without knowing all the details reeks of
immaturity and pettiness. Ditto for your willingness to turn this into yet
another Cloudflare marketing opportunity. Have you forgotten about your own
Cloudbleed incident? How would you feel if a security company took that as
an opportunity to shame you for "sloppiness or laziness"? Or some other
company's CEO was offering to send people "Cloudbleed Support Group" T-Shirts
on HN as your own CEO is doing in this thread?

Lastly RPKI isn't a silver bullet, RPKI authorities can also be misconfigured
and attacked[1][2]. This happened with the LACNIC incident in 2013[2]. It's
also worth mentioning that RPKI potentially creates new threats[2]. But again
it seems more important to you to use this as a marketing opportunity and
promote yourself while throwing someone else under a bus while uttering pithy
summations.

Also from your post:

>"And, in particular, we're looking at you Verizon — and still waiting on your
reply."

Although Verizon is the 400lb gorilla in the room, their NOC and network
engineers are still regular people with kids and families and feelings. They
are also people who have had a really shit day today. Why you can't extend
just a bit of human compassion and feel compelled to try to shame is quite
inexplicable.

You may think that your blog post was a marketing coup but I see it as a
massive failure in both leadership and civility.

As a thought exercise maybe Cloudflare leadership could think about how they
would like the community to react the next time they are at fault.

[1]
[https://www.cs.bu.edu/~goldbe/papers/hotRPKI.pdf](https://www.cs.bu.edu/~goldbe/papers/hotRPKI.pdf)

[2]
[https://www.cs.bu.edu/~goldbe/papers/sigRPKI_full.pdf](https://www.cs.bu.edu/~goldbe/papers/sigRPKI_full.pdf)

~~~
shakna
Cloudflare reached out multiple times in multiple ways to Verizon, to attempt
to resolve the situation.

More than eight hours on, after utilising everything from what they were told
was a Tier 1 support line to Twitter, they have nothing.

Even if we're kind to Verizon about the network failure, which was a global
issue, they haven't done anything or said anything to suggest that Cloudflare
should be treating them kindly in any way.

Not even a "we're aware, we're handling it".

Ghosting one of the world's largest (as in utilised) companies is not wise for
administrative, technical or PR reasons.

Verizon have shown a complete lack of leadership.

~~~
bogomipz
Have you ever worked for a Tier 1 ISP during a big outage? There is not enough
personnel bandwidth in a NOC for everyone to get an individual response.

>"Ghosting one of the world's largest (as in utilised) companies is not wise
for administrative, technical or PR reasons"

Oh the Cloudflare marketing machine. Largest by "utilized"? What does that
even mean? Cloudflare is not a Tier 1, a Tier 2, or a major eyeball network.
They are pretty far down in the pecking order despite what your marketing
department wants us to believe. There's always some fuzzy stat isn't there?

Being too inundated to respond to everyone on the day of outage is a human
resource problem, plain and simple. The fact that you have taken this so
personally is kind of embarrassing. What this blog post, the opportunistic
marketing ploy and finger pointing have shown is a complete lack of maturity
on your part. You want to call out Verizon for their behavior yet your own
behavior is unnecessarily aggressive.

~~~
shakna
> The fact that you have taken this so personally is kind of embarrassing.

What? I have said nothing personal.

> What this blog post, the opportunistic marketing ploy and finger pointing
> have shown is a complete lack of maturity on your part.

Ah. You seem to be confused. I am not affiliated with Cloudflare, and have not
worked with Cloudflare at any point in time.

