Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline (cloudflare.com)
863 points by steveklabnik 22 days ago | 274 comments



Cloudflare managed to get out an in-depth blog post, one which has incident details, points blame at other parties, and makes some really quite aggressive (for corporate blog posts) claims, all during an incident, and they did all that in 8 hours.

I'm impressed. At most other similarly sized companies, this would take 4 days. And at somewhere like Amazon, it would be 2 weeks of approvals, editing, and review before a watered-down version with all specifics removed is published.


Regarding those really aggressive claims, I was a bit shocked by that as well.

Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them ... or Tom Strickx (who wrote the blog post) had his beauty rest interrupted early this morning to deal with Verizon's screw-up and was not having it.


The sequence of events went a bit like this:

Team in London started working the problem and called in reinforcements from elsewhere;

Upper management (me and one other person) got involved as it was serious/not resolved fast;

I spoke with the network team in London who seemed to have a good handle on the problem and how they were working to resolve but we decided to wake a couple of other smart folks up to make sure we had all the best people on it;

Problem got resolved by the diligence of an engineer in London getting through to and talking with DQE;

Some people went back to bed;

Tom worked on writing our internal incident report so that details were captured fast and people had visibility. He then volunteered to be point on writing the public blog (1415 UTC);

Folks in California woke up and got involved with the blog. Ton of people contributed to it from around the world with Tom fielding all the changes and ideas;

Very senior people at Cloudflare (including legal) signed off and we posted (1958 UTC).

No one had an axe to grind with Verizon. We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout.

Shout out to the Support team who handled an additional 1,000 support requests during the incident!


Thank you! Cloudflare's response is appropriate.

The incident itself and the lack of response (for HOURS) from Verizon's side is absolutely unacceptable. It's 2019; filtering ALL of your customers' routes according to - at least - the IRR (including the legacy ones connected to the old router in the closet) and having a responsive 24/7 NOC contact in PeeringDB are a matter of course.

Proper carriers like NTT go above and beyond simple IRR filtering nowadays with things like peerlock (http://instituut.net/~job/peerlock_manual.pdf).

AT&T uses RPKI and was completely unaffected: https://twitter.com/Jerome_UZ/status/1143276134907305984


So basically we should avoid all Verizon Enterprise Product.

EdgeCast used to be my favourite CDN. Not sure how they are doing now.


Keep in mind that Verizon is a huge organization


Awesome. Thanks for this timeline and for the team being absolutely amazing.

I love the shaming of Verizon without the sugar coat. Divisive for sure, but a welcomed one.


Tom is based in London. So he had a good night's sleep and was well rested.


We don’t know that he wasn’t up late fixing another bug that we thankfully never saw, and that his life hasn’t been like a season of 24 this past day.


We don't know either if he's an alien coming from a planet that has a shorter day and his circadian cycle is being disrupted by being on Earth, but Occam would suggest no.


If Cloudflare is correct on the technical part (no idea, I don't have enough knowledge to evaluate) then they are completely within their rights to call out Verizon on not doing the right thing. It's not "aggressive" at all. They don't mean harm to Verizon, they just point out Verizon is not doing their job correctly, and should step up and fix it. And not having somebody to talk to for 8 hours at a provider that is capable of downing a significant part of the internet is also something worth calling out. It's not some mom-and-pop shop where the owner can just hang a sign on the door and go to the beach, it's a major provider that should always have open channels to communicate on things like this.

>"Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them"

Indeed. And that's not going to help them or their customers in the least the next time they need Verizon's cooperation to resolve an issue. You would never see this type of behavior on the NANOG mailing list, which has been on the front line of communications between ISPs and providers for BGP issues since the beginning of the commercial internet. It is very much a "community" with reciprocal respect and professionalism, things this blog post was devoid of.


> You would never see this type of behavior on the NANOG mailing list which has been on the front line of communications between ISPs and providers.

What element of the blog post are you referring to? NANOG often speaks in jargon and obtuse-professional speak, but with large routing leaks there are always strong opinions expressed. It's been this way going back well over a decade.

Another counterexample: go search the NANOG archives for opinions on AWS, EC2, and SES. You won't find much reciprocal respect - you'll find a bunch of unabashed criticism on how AWS operates, and how that affects the internet.

This is a clash of cultures. Cloudflare knows their customers expect a fast, accurate, transparent explanation. NANOG participants are used to an environment where their dirty laundry isn't aired in public to the point where they get calls from reporters asking about it.

Cloudflare is walking a tight line where they're trying to accurately explain to a lay audience what happened to their customers. They can't assume their audience knows what AS 701 is, or BCP 38, or the DFZ, or the prior harm that BGP optimizers have been known to cause.

A "professional" NANOG thread would touch on all of that, it just wouldn't be pieced together under a single byline for a mass audience.


"the next time they need Verizon's cooperation to resolve an issue" According to the Cloudflare post, they didn't get Verizon's cooperation to fix a partly-Verizon-caused issue this time, so what do they have to lose?



Weird response by the Verizon employee.

> You guys have repeatedly accused them of being dumb without even speaking to anyone yet from the sounds of it.

Not for lack of trying...

> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.

It wasn't a slow response, it was no response. And either is unacceptable for a tier 1 carrier.

> But this industry is one big ass glass house. What’s that thing about stones again?

And other carriers are actively working to change that - including, in particular, Cloudflare.


[flagged]


I think what lima is saying is the Verizon employee basically says "Why didn't you call us for comment before publicly complaining that we never answer our phones?"


Cloudflare is not a tier 1 carrier if you go by the strict definition of the term, just like Google isn't - but it's one of the largest content networks, responsible for a significant percentage of internet traffic, with a global carrier-level backbone. Google and Cloudflare even tend to have better internal routing than most tier 1 providers.


They are not a carrier, period! They don't sell transit. A Tier 1 carrier does settlement-free peering. They are a CDN. And no, they most certainly do not have a "global carrier-level backbone." Argot or whatever they are calling their product is not an actual backbone with dedicated fiber, submarine cable etc. There would be no reason for them to invest in fiber and lightwave gear as they are an edge network, full stop. Your comment shows a lack of understanding of how the internet actually works.


What's up with the Verizon employee comparing AS701 to APAC carriers? That’s a super harsh thing to publicly say about your employer.


I’d say that Verizon’s lack of cooperation was devoid of any respect or professionalism.


Yes, very much professionalism including such recent email threads titled "Russian Anal Probing"


Cloudflare's bet is essentially that they can control so much of the internet infrastructure that they can behave however they like and we all simply have to deal with it.


> behave however they like and we all simply have to deal with it

So basically what Verizon did by looking at BCP194 and saying “nah, too much bother”??


Not really. You don't have to be a massive player to screw things up with BGP.


No doubt.

But it’s 2019 and I can’t muster up much sympathy for a tier 1 who can’t get inbound filters and a responsive NOC implemented correctly - things which were table stakes in 2009.


2009? You are being generous. I'm pretty sure when I was managing BGP announcements for my small ISP in 1999 route filtering was a thing.


Exactly. All these comments about how rude Cloudflare is forget that this style of public shaming of ASes that can’t perform basic hygiene on their own network has been the norm rather than the exception for over twenty years. And further, all the surprise that Cloudflare was quick to report - here’s the deal: BGP doesn’t lie. The second something is wrong, everyone knows who did it. There’s zero mystery. It’s not like some grand caper that takes months of investigations. Operators basically have one rule - don’t leak bad routes to everyone else. That’s pretty much the only rule that’s a constant. When you break it your karma goes to zero, everyone dumps on you, but life goes on.


You don't need to be a massive player to initiate the screwup, but you kind of need a massive player like Verizon to amplify it for you. That's why the onus on them should be greater.


There’s no excuse for the personal slander here - “his beauty rest interrupted” is a mean-spirited, inappropriate way to discuss Cloudflare’s post. Please don’t insult people unnecessarily here.


I read it as a reasonable and slightly humorous way to acknowledge that the author is a human being and may have been (justifiably?) in a bad mood writing the post because lack of sleep does that. No idea of how sensible that suggestion is or isn't in this case but didn't really think it was derogatory. A human being is not a machine. Sanity-sleep might be a better term for it but nobody uses that whereas beauty-sleep is in common, albeit ironic, usage.

Let's all take our sleep seriously, yeah? :-)


Uh, I think the idiom just means “got woken up early.” I don’t see an attack there.


Likewise, don't insinuate people are ugly or don't deserve sleep.


Blog posts are their speciality


(deleted)

Yeah, that wasn’t contributing. Everyone has bad days.


Your profile mentions:

> Chief Architect, Information Security, Akamai Technologies. I do not speak for my employer.

Probably best to disclose this more directly in comments on topics related to competitors.


Why? Comments don't typically disclose "I am a corporate spambot" and yet such things exist, so you should probably read every comment as though it has some angle / vested interest. At least brians was nice enough to mention their bias in their profile. Most cases won't be so easy. Making people "opt in" to disclosing biases will just make it easier for the real bad actors to slip through.


brians seems like a good actor, which is why I assumed they might appreciate the feedback.

I agree there will always be bad actors on hn, which is too bad. I think the moderators try hard to combat it, which I am grateful for.


Hot take from Akamai.


Strong words for an outage that affected "Cloudflare, Amazon, Linode, Google, Facebook, and others"


My favorite outage when I worked for a voip company was when one of our tech support people told a new customer that she needed to ‘add our ip address to your router’, meaning add it to the firewall whitelist, but she repeated that verbatim to the telco tech who misunderstood and then escalated her way up the chain at a major telco until some engineer with the wrong rights said ‘fuck it’ and updated bgp to route all of our traffic down her T-1 line.

That was a fun conference call, and listening to the lady on the phone I could see how the engineer got to that point.


AS701 was the UUNET/Worldcom AS for the US and eventually the US/Canada network. Verizon bought Worldcom in 2006 and they became Verizon Business.

From the late 90s to early 2000s I worked for UUNET/Worldcom as an engineer in the network planning and design group. I worked in the international group but among other things we were responsible for the build out of AS701 into Canada, the exchange sites where AS701 connected to the various other international UUNET AS's and the PoP's where dedicated circuits for international customers who wished to connect directly to AS701 would be terminated. The point being that I am familiar with how AS701 was operated at that time.

UUNET's reputation at the time might not have been sterling due to the business decision to basically be a safe haven for spammers but from a technical standpoint the network was operated at a high standard. The basic BGP filtering referenced in the article was certainly in place at the time and if this had happened then heads would have rolled.


As they should, this is day-one BGP configuration stuff. Nothing advanced, nothing especially technical or time consuming. Speaks to a gross degree of professional negligence on the part of AS701. Really disappointing to see this degree of ignorance.


Networking infrastructure / tools are in the stone age compared to what we have in the software world. I am inclined to cut them some slack.


They most likely upgraded the equipment since the build out and never put the filtering back.


I for one remember UUNET being expensive transit, but rock fucking solid.


Here's a shoutout to all the on-calls who woke up this morning to deal with "someone else's problem". I think everyone who woke up gets to, at least, order a "fancy coffee" and send the bill to Verizon.


Woke up repeatedly this morning to PagerDuty after working late last night. Threw numerous wailing electronics at wall. Later, ordered several “fancy coffees”. Will be sending bill to Verizon for two phones, a pager, a wall, and three fancy coffees.

Edit/Disclaimer: Yes, this is a joke. I woke up to several serious alarms just as the problem was starting. Luckily, I thought to check Cloudflare’s status page from my phone around the second or third time PagerDuty called me. I saw a preliminary notice from them indicating that they were observing networking issues. At that point, I decided I’d rather watch the world burn from my bed than my computer, so I “scheduled” maintenance for a few hours and went back to bed. Our whole infrastructure had split by that point, but there was nothing I could do about it.


Amen. We need a support group.

"Hi, I'm Teejmya, and I was on call last night"


Yes, apologies and thank you!

If you email me (matthewatcloudflaredotcom) your shirt size, preference for men's or women's cut, and your postal address with the subject line:

"Verizon BGP Leak On Call Support Group"

I'll send you a Cloudflare tshirt. Least we can do.


I think somebody should make some "I Fixed the Internet after Verizon Broke It. 20190624" T-shirts.


"Verizon broke the internet and all I got was this lousy T-shirt"


I would need a bunch of them



This is so awesome.


"I'm Coldreactor and I was traumatized by the massive number of calls last night"


“Or I would have been if routing was working correctly”

There’s a silver lining to everything.


I needed to burn some vacation time before EoY when it expires, so I started taking Mondays off once a month. Today was such a day. But I was supposed to be on-call, so I traded a day with a teammate.

I went off-call at 8:30am EST. Then, while the internet burned down, I slept in and played video games.


> All of the above suggestions are nicely condensed into MANRS (Mutually Agreed Norms for Routing Security)

Whoever came up with that name and acronym deserves an award.


Wow, thanks for pointing that out, I missed it on my once-over.


I appreciate how sympathetic Cloudflare is to the root-cause party because they answered the phone and undid what they shouldn’t have done.

(If my understanding is correct, they shouldn’t have told Verizon about the better routing, while Verizon should have known better)


I don't think your understanding is correct. I think they're supposed to be allowed to "optimize" the BGP routes that they advertise to their customer (Allegheny). I'm unclear on whether Allegheny should have relayed that advertisement to Verizon, but it's clear that Verizon should definitely not have then broadcast that to everyone else.


I read this as Allegheny's fault, actually. DQE published to Allegheny (DQE's customer), who in turn re-published to Verizon (Allegheny's other provider). While most of the 'prevention' section talks about what Verizon didn't do, it doesn't seem to mention that Allegheny should not have re-published the DQE-published routes up to Verizon.

It's startling that Verizon doesn't appear to have any leak mitigations in place, but I feel like Allegheny is getting a pass here because they are small, or something.
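The export rule being argued about here can be written down in a few lines. This is only a sketch of the conventional "valley-free" policy (routes learned from a provider or peer are passed on to customers only), not anyone's actual router configuration; the `Rel` enum and `may_export` function are hypothetical names made up for illustration.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship of a BGP neighbor, seen from our own AS."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Conventional valley-free export rule (illustrative only).

    Routes learned from a customer may be announced to any neighbor.
    Routes learned from a peer or a provider may only be announced
    to customers, never to another peer or provider.
    """
    if learned_from is Rel.CUSTOMER:
        return True
    return export_to is Rel.CUSTOMER


# The case discussed in the thread: a route learned from one provider
# (DQE) and re-announced to another provider (Verizon).
leak_allowed = may_export(Rel.PROVIDER, Rel.PROVIDER)
print("provider -> provider export allowed:", leak_allowed)
```

Under this rule a multihomed customer re-announcing one provider's routes to its other provider is exactly the case that gets refused, which is why the leak counts as a policy violation at the customer as well as a filtering failure upstream.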


Allegheny is a steel company. I don’t think most people expect them to have the same responsibility for internet health as Verizon, even though they are a $4B company.

(I’m from Pittsburgh, my grandfather and a bunch of my relatives worked for this company for decades. I’ve been kinda giggling about this intersection of my past and my present all day. I don’t work in the parts of Cloudflare that deal with this kind of thing; I’m glad my co-workers were on top of it.)


I had to do a double take because they seem to be selling themselves as a technology company nowadays.

I guess “steel” in your name isn’t good for share prices.


Tech or healthcare, that’s the way of things now.


No customer should be able to bring down the internet due to misconfiguration. This is all on Verizon, imho.

Any ISP worth their salt has route filtering on any customer connections, nobody should be able to announce prefixes they don't own if the ISP is doing their job properly.
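For anyone wondering what "route filtering on customer connections" amounts to, here is a minimal sketch using Python's standard `ipaddress` module. The registered prefix list, the `max_extra_bits` knob, and the function names are assumptions for illustration; a real deployment would generate router prefix-lists from IRR data rather than run code like this.

```python
import ipaddress


def build_filter(registered):
    """Parse the prefixes a customer has registered (e.g. in an IRR)."""
    return [ipaddress.ip_network(p) for p in registered]


def accept(announced, allowed, max_extra_bits=0):
    """Accept an announcement only if it falls inside a registered prefix.

    max_extra_bits is a hypothetical knob: how many bits more specific
    than the registered prefix the customer is allowed to announce.
    """
    net = ipaddress.ip_network(announced)
    for reg in allowed:
        if net.version != reg.version:
            continue  # subnet_of() raises TypeError on mixed IPv4/IPv6
        if net.subnet_of(reg) and net.prefixlen <= reg.prefixlen + max_extra_bits:
            return True
    return False


# Documentation prefixes (RFC 5737) stand in for real address space.
customer_filter = build_filter(["192.0.2.0/24"])
print(accept("192.0.2.0/24", customer_filter))     # the customer's own prefix
print(accept("198.51.100.0/24", customer_filter))  # someone else's prefix
print(accept("192.0.2.0/25", customer_filter))     # unregistered more-specific
```

With `max_extra_bits=0` the filter also rejects more-specific slices of the customer's own space, which is the stricter of the two reasonable defaults.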


Thank you for the summary.

And, a sincere thank you for not mincing words when it comes to something as important as this.

>However, against numerous best practices outlined below, Verizon’s lack of filtering turned this into a major incident that affected many Internet services such as Amazon, Fastly, Linode and Cloudflare.

>IRR filtering would not have increased Verizon's costs or limited their service in any way. Again, the only explanation we can conceive of why it wasn't in place is sloppiness or laziness.

In an attempt to find any statement given by Verizon, I found that The Register was able to get this amazing statement:

"Verizon sent us the following baffling response to today's BGP cockup: "There was an intermittent disruption in internet service for some [Verizon] FiOS customers earlier this morning. Our engineers resolved the issue around 9am ET."" [1]

[1]https://www.theregister.co.uk/2019/06/24/verizon_bgp_misconf...


Verizon's response seems very non-committal and it appears this type of incident may happen again if they don't take any action. Are there ways for companies like Google or Cloudflare to work around ISPs like Verizon without affecting ISP customers, or is this a blocker? Was the 10% of the re-routed traffic from Cloudflare 100% of the traffic from Verizon to Cloudflare?


It's worse than that. BGP provides the "map" of the Internet. That map is relayed from network to network. So, as a result, Verizon announcing a bad route can mess up the map not just for them but for any other network that connects to them (directly or indirectly).

We're actually fortunate at Cloudflare because of our scale and wide-spread interconnection. That limited the impact more than it would have for a smaller, less-connected network. The crazy thing about BGP is that any router can announce that it's responsible for a block of IP addresses and, if it's trusted enough, that's what the map of the Internet will reflect.

The long term solution is for networks to implement and enforce RPKI. AT&T, for instance, implemented RPKI and we did not see any drop in traffic to their network today.

Verizon not only didn't implement RPKI, which would be the best-of-breed approach, but also didn't do even basic route filtering. It's as if a trusted traffic cop (Verizon) overheard from a random passing motorist that the main road was closed and, as a result, directed all traffic off a pier and into the ocean.

More about RPKI if you're interested: https://blog.cloudflare.com/rpki/


I do love that the CEO of Cloudflare is throwing technical shade at Verizon and others here and on Twitter for being useless.


This is the whole reason CF will never be seen as professional. Incidents like this, blog posts that've clearly not been anywhere close to a PR department, the whole CEO-blocking-sites-he-doesn't-like incident.. no wonder Verizon ignored their emails.

CF is great if you need "free" protection for a pet project, not really anything more.


Do you really think that "throwing shade" is what the internet needs? Is "throwing shade" an admirable quality in someone who is supposed to be demonstrating leadership? Anyone who has worked as a network engineer for a major ISP knows the internet is quite brittle. During my entire time in that profession I can't remember a time when attempting to shame people was used to resolve a routing issue or to improve relations in order to resolve future routing issues.


Shame is one of the most effective tools in influencing human behavior, and from the sounds of this post and the other coverage on the incident, Verizon has earned far more ire than is directed at them in this blog post.

A lot of people seem to conflate speaking professionally with speaking like a doormat. Verizon, specifically the team in charge of this system, fucked up. There are varying levels to that of course; if you mess up the fonts in the end of month report to your super and he calls you a fucking idiot, he's probably an unbalanced person in need of mental help. If on the other hand you knock dead 15% of GLOBAL Internet traffic out of sheer laziness, I'd say you've earned more than a few 'go fuck yourself's.


How about leaking session tokens and other sensitive data for millions of people during "Cloudbleed"? Were you advocating public shaming for Cloudflare then? Was that also "sheer laziness" and did they earn "more than a few" of your "go fuck yourself's"?


I fail to see how Cloudbleed and this event are the same. Cloudbleed was caused by a Cloudflare bug, true, but it wasn't caused by outright laziness (which this incident clearly was). Furthermore, unlike Verizon's distinct lack of communication regarding this incident, Cloudflare has generally been very good about reporting and communicating with the community.


> outright laziness (which this incident clearly was)

I don't think it was clearly laziness. It could have been a configuration mistake.


Indeed and we saw that as a cause recently for a major Google outage. However that likely possibility of a bad config or edit doesn't fit the narrative Cloudflare is spinning here - that Verizon is simply dumb and lazy.

Clearly Verizon has inbound prefix filtering in place otherwise this would be a common occurrence for AS 701 and it is most certainly not. And it's quite surprising and sad to see how willing people are to just blindly parrot Cloudflare here and pile on. This of course was the desired outcome of the blog post.


They are not the same, nor was I implying they were the same. The point is that oversights and genuine mistakes happen. However, Cloudflare wants to characterize it as "sloppiness and laziness" when it's someone else. And even here you are "parroting" them when neither you nor they actually know the details, do you? Clearly Verizon has prefix filtering in place in other places or this would be a regular occurrence for them, and it is not. Cloudflare is pretty far down the list in importance during an outage that affected many companies - other Tier 1s, 2s etc. Just because Cloudflare didn't get a response does not mean Verizon wasn't communicating. I know two network engineers who were in contact with Verizon yesterday during the outage. But again you seem intent on just "piling on" after reading a one-sided and self-serving Cloudflare blog post.


Are you able to comment on how a company like yourself, or the other companies which were affected, can pursue anything with Verizon? Or is Verizon free to continue with bad practices and have a repeat of this issue?


Very nice writeup on RPKI! I don't know anything about network engineering, but it appears that RPKI will distribute trust from ISPs to RIRs (Regional Internet Registries) like ARIN and RIPE. As I understand it, the RIR will sign your IP allocation with RPKI, which means fat-fingering on your side will result in the ISP not finding you as it takes BGP announcement and RIR confirmation for the ISP to acknowledge your IP. Again, very nice and understandable writeup :)

I guess this does shift the burden of trust from an ISP to the RIR, and the blog post mentions international law as RIR and ISP memberships can be part of different countries and only RIRs would know who has what IP address since only they are TAs (which empowers certain governments over others). So I guess the debate is whether the pain of BGP route leaks and such is greater than the stress of another country having your RPKI entry.

I guess we'll just have to see how badly Verizon messes up in the future.


RPKI uses CAs at the RIRs because the RIRs are who make the IP allocations and have a relationship with the IP holders and can (at least in theory) authenticate the holders.

Just as a RIR could issue a certificate for your IPs to someone else, they could change WHOIS, which is how IP delegations are generally cross referenced.

You're welcome to accept (or propagate) someone's advertisements without RPKI in case of some dispute with their RIR, but expect to get called out for it if the routes are bogus and you don't answer your NOC phone or email or twitters.

Actually, I don't think Cloudflare was even calling Verizon out for not doing RPKI, which is fairly new and has costs; it was more for not limiting prefix counts. A small customer should probably be limited to 2N + 4 prefixes, where N is the average number of prefixes they've advertised over the past 30 days; or they should have to put their prefixes in a portal or something.
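As a rough illustration of that 2N + 4 rule of thumb, here is a small sketch. The function names and the sample numbers are hypothetical, and the multiplier and headroom are just the values suggested in this comment, not an industry standard.

```python
import math
from statistics import mean


def max_prefix_limit(daily_counts, multiplier=2, headroom=4):
    """Compute a per-session prefix limit as multiplier * N + headroom.

    daily_counts: prefixes the customer advertised on each of the
    last 30 days (any non-empty history works for this sketch).
    """
    if not daily_counts:
        raise ValueError("need at least one day of history")
    return math.ceil(multiplier * mean(daily_counts)) + headroom


def should_tear_down(current_count, limit):
    """True when the session exceeds its limit and should be shut down."""
    return current_count > limit


# Hypothetical small customer that normally announces 5 prefixes.
history = [5] * 30
limit = max_prefix_limit(history)
print("limit:", limit)
print("normal day trips the limit:", should_tear_down(6, limit))
print("leak trips the limit:", should_tear_down(20000, limit))
```

The appeal of a count-based limit is that it needs no registry data at all: it only catches the sudden flood, not a single bogus prefix, so it complements per-prefix filtering rather than replacing it.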

Filtering customer advertisements with IRRs is also pretty normal.

But really, you gotta answer the phone. The steel guys answered the phone.


The IRR is also controlled by a few entities that would be vulnerable to government intervention, but that's the tool we currently rely on.

RPKI roots trust at the RIRs, and that is a vulnerability, but any government intervention would end that trust and end the use of the RIRs as trust anchors. It's pretty unlikely to ever be used that way.

Disclaimer: I co-authored some of the drafts for RPKI and helped implement RPKI systems at an RIR.


It's critical for the internet to work. Actually, some types of emergency services rely on it.

So this level of negligence is dangerous. Shouldn't there be criminal charges? Or at least some kind of legal action.


Note that RPKI won’t prevent outages caused by route leaks (because the leak has a valid signed origin).
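A sketch of why that is: RPKI origin validation (in the style of RFC 6811) looks only at the prefix and the origin AS, never at the path the announcement took. The `ROA` tuple and `validate` function below are simplified, hypothetical stand-ins for a real validator, and the prefixes and AS numbers come from the documentation ranges.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    """Simplified Route Origin Authorization: prefix, max length, origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(prefix, origin_asn, roas):
    """Return 'valid', 'invalid' or 'not-found' for one announcement.

    valid:     some covering ROA matches the origin AS and the announced
               prefix is no longer than that ROA's max_length.
    invalid:   at least one ROA covers the prefix, but none matches.
    not-found: no ROA covers the prefix at all.
    """
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [ROA("192.0.2.0/24", 24, 64500)]

# A leaked route keeps its original origin AS, so it still validates,
# no matter which networks it was wrongly propagated through.
print(validate("192.0.2.0/24", 64500, roas))
# A more-specific longer than max_length fails even with the right origin.
print(validate("192.0.2.0/25", 64500, roas))
# Address space with no ROA is simply unknown to the validator.
print(validate("198.51.100.0/24", 64501, roas))
```

So origin validation only helps against a leak when the leaked announcements are more specific than the ROA's maximum length or carry the wrong origin; a leak of the exact registered prefix passes untouched.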


This is a great call out, BTW.

The fact that the original AS Origin is included here makes this even more weaponized.

Brings it back to why doesn't the Noction platform "dirty" the injected announcements. For example, throwing out some Private ASNs or ASNs of "tier 1" providers to prevent those announcements from ever getting propagated around.


Your comments indicate to me that you don't really understand RPKI beyond a marketing value because if you did you wouldn't be throwing it around as some silver bullet. RPKI although a step in the right direction is also susceptible to misconfiguration and attack by hostile entities[1]. Additionally outages as bad as any BGP misconfiguration are also possible if an RIR's RPKI repo becomes unavailable. This has already happened. See:

https://www.arin.net/vault/announcements/2018/20181024_updat...

and

https://www.ripe.net/support/service-announcements/service-a...

There are also issues such as broken legacy ROAs: https://blog.apnic.net/2018/10/16/cleaning-up-roas-inconsist...

And the list goes on. Please stop with the hype and hand waving.

[1] https://www.cs.bu.edu/~goldbe/papers/sigRPKI_full.pdf


The amount of posturing and blaming in Cloudflare's response is breathtakingly unprofessional. If the article was just a few sentences longer, you could have squeezed in a few more statements of blame. We know, they messed up. But Cloudflare isn't making itself look any better by rolling the bus over Verizon again and again.


I was thinking the exact same thing, right up until I got to the part where they still haven't responded 8 hours later (to say nothing of apologizing), and played no role in fixing the problem (DQE did that, apparently).

We all make mistakes. It's unreasonable to expect 100% uptime from anyone. But if you operate a service that so many people are relying on, and you make billions of dollars in profit each year (we're not talking about an unpaid volunteer open-source maintainer here), you absolutely have a responsibility to at least try to help fix it when there's a problem. It's brazenly irresponsible to go radio silent while your customer's other vendor fixes the problem.


This problem has been known for decades. As of April 2019, 56.1% of the world's population has internet access. Do you think it is acceptable for a major transit ISP to have no basic filters in 2019 let alone implement RPKI?

Can you explain what part of the Cloudflare statement you consider to be posturing? Even on a cursory review, the BGP announcements referenced in the article are pretty clear. Facts are facts regardless of how the message is delivered.


I can't seem to muster much sympathy for any publicly-traded US ISP not performing technical due diligence.

If they can afford to lobby against non-profit competition and for local monopolies, they should damn well be able to staff a NOC for this type of issue.


I don’t think Verizon is a victim here. They’re big enough to have figured it out. They didn’t, so they’re being held to account.


A better question:

If they are a malicious/malfeasant actor, can non-Verizon ASNs partition Verizon off the internet until they fix their shit?


That's a dangerous precedent to set - because then others will jump in and demand knocking other countries (think Cuba, Venezuela, Iran, Syria or any other country on the US sanction list) offline.

IIRC the only cases where this has happened were when a couple of self-proclaimed "bulletproof hosters" were booted off of their uplinks, but even this wasn't a direct partition of the Internet.


That would just result in Verizon customers being unable to access stuff, which isn't good either. Their users don't have much say in the matter, and due to internet monopolies in the US, Verizon may even be the only option for some people. They literally cannot even vote with their wallet, and thus any type of repercussion for Verizon would mostly be affecting the users.


I think your definition of unprofessional needs recalibrating if it applies more to the people calling out professional negligence than to the people committing it.


What the hell do you think cloudflare was supposed to have done here?


Blame is due where blame is due.


I think it makes them look pretty fine to very good


I 100% agree with you, though I am unsurprised that HN is downvoting you. HN seems to revere Cloudflare bigtime despite the fact that Cloudflare often uses HN as their own corporate PR platform. I absolutely loathe Verizon and I'll be the first to line up for a good public lashing of US ISPs, but even I feel like this blog post is unnecessarily unprofessional.

What strikes me the most is that this whole "event" would have hardly even registered on anyone's radar (it affected less than 10% of their traffic during early hours of the morning. I saw one single news article about it, buried on The Verge, but other than that nothing), except for the fact that Cloudflare's CTO was on HN this morning fanning the flames of the one thread about it. It's like they dug their own hole drawing attention to the "Cloudflare outage" headline, and now they're overcompensating by going to drastic measures to blame someone else.

And now they keep harping on the fact that Verizon still hasn't responded? Sure, part of that is probably the fact that Verizon is a giant corporation that doesn't want to bother with this stuff, but the other part is that this "event" was hardly even big enough of a deal to register on VZ's PR team's radar, no matter how much CF whines about it.

This blog post (and the accompanying HN comments from Cloudflare execs) just scream "immature company" to me. There's a reason that Cloudflare is the one making this blog post and devoting CEO time to it while the established behemoth is just going about their business as usual.


The context (which isn't obvious, and I don't blame you for not knowing it) is that the Internet is held together by spit and duct tape. The only reason it works at all is that major participants are good actors, in the sense that:

1. They implement basic precautions to prevent dumb things from going wrong.

2. They're available 24/7, to immediately respond to and remediate whatever does go wrong.

3. Both of the above are core obligations, which supersede any questions of public relations or maturity or higher-ups not wanting to be bothered.

If Verizon can't be trusted to properly operate their network, that's an immediate threat to the health of the Internet, and many people do need to be made aware of it. It's not just Cloudflare being salty because their customers yelled at them.


I know the context, but that's irrelevant here. Whatever the cause, a root cause analysis pointing back to CF is nice for CF to help solve the situation, and is even nice to have for us tech enthusiasts here on HN (though it should still maintain professionalism). But for customers and decision makers at companies that might be considering purchasing Cloudflare, you know what I don't care about? Whose fault it was. There are multiple buckets of companies here:

1. Cloud providers that were affected enough to apparently devote not insignificant CEO and CTO time to it (Cloudflare)

2. Cloud providers that were affected but seemingly not enough for it to even register as anything more than a blip on their status tracker (Google, AWS, etc)

3. Cloud providers that weren't affected

As a potential customer thinking about buying services from one of these companies, which one do you think I am going to go with? It certainly won't be CF. And if I am already engaged with CF, I want to know what CF is going to do to mitigate this situation in the future, and no, pointing fingers like a child and saying "it wasn't our fault!" doesn't count.

Cloudflare can't really control Verizon's actions that led to this situation, but they can control how they respond to it and mitigate it. They had an opportunity to stand up as a leader and improve the internet (which is literally their company motto). As you pointed out, the internet working correctly is a matter of companies working together as good actors, and getting these companies to work together via good, strong relationships is a part of that.

Did Cloudflare do that? Nah. Instead, they made a petty blog post and their CEO is on Twitter telling Verizon they should be ashamed. I don't know exactly what his goal there was, but I assume it has something to do with hoping they'll be better in the future (if that's not his goal, then it really is just petty finger pointing). And if Cloudflare's CEO's method of getting people to improve their work is to publicly shame them, I really feel bad for anyone who works under him.


To be frank, your post makes it clear that you don't know the context. CF simply cannot do anything on their own to mitigate the problem where Verizon constructs bad BGP routes to Cloudflare IPs and then advertises those routes to third parties. The only mitigation possible is to contact whoever's advertising the bad routes and get them to stop.
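The reason the victim can't simply out-announce a leak is longest-prefix matching: routers forward on the most specific matching route, so a leaked more-specific prefix beats the legitimate covering one wherever it is accepted. Here is a minimal sketch of that selection rule, with a hypothetical `best_route` helper, a made-up routing table, and documentation prefixes standing in for real ones.

```python
import ipaddress


def best_route(dest, table):
    """Return the label of the most specific route covering dest, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, label) for net, label in table if addr in net]
    if not matches:
        return None
    # Longest prefix wins, regardless of who announced it.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Illustrative table: one legitimate /24 and two leaked halves of it.
LEGIT_ONLY = [(ipaddress.ip_network("203.0.113.0/24"), "legitimate origin")]
WITH_LEAK = LEGIT_ONLY + [
    (ipaddress.ip_network("203.0.113.0/25"), "leaked more-specific"),
    (ipaddress.ip_network("203.0.113.128/25"), "leaked more-specific"),
]

print(best_route("203.0.113.10", LEGIT_ONLY))
print(best_route("203.0.113.10", WITH_LEAK))
```

In this toy table the destination follows the legitimate route until the two /25s appear, at which point every address in the /24 is pulled toward the leak. Only the networks that accept or originate the more-specifics can undo that.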


Have you read Cloudflare's multiple blog posts regarding BGP? Did you read the tweets from their directors talking about how other customers were unaffected by the event today because of the mitigations put in place? Did you even do the simplest Google about BGP protocols and the plans in place to prevent this from happening in the future?

If you're going to try to impose yourself as the gatekeeper of "knowing the context", you should probably know it yourself. Saying CF "simply cannot do anything" is narrow minded at best, and completely wrong otherwise. In fact, in this very blog post linked in the OP, Cloudflare talks about taking steps to mitigate BGP issues in the future. That's great, if only it wasn't also paired with a childish finger pointing session.


AT&T customers were unaffected because of mitigations put in place by AT&T that Verizon hasn't put in place. The steps you refer to in the blog post are ones that Verizon has to take.


Yes, and? Cloudflare themselves are the ones pushing their own company as "leaders" in this field, and being a "leader" does not mean "pointing fingers and trying to avoid blame whenever something bad happens". If they fancy themselves leaders regarding BGP, as said on their website, then they need to actually act like leaders.

And as I've said multiple times now, Cloudflare was in a great position here to stand themselves up as a strong leader on this topic to start working together with other companies (a la Verizon) to start to make real headway to fix the BGP problem. As other commenters have noted, the internet is entirely built on multiple organizations acting in good faith towards one another. Verizon failed to do that, and Cloudflare's response also failed to do that. I said it in another comment, but I'll also say it here: publicly berating the people that you are supposedly taking a leadership position over is not good leadership. This entire episode is going to do nothing to encourage Verizon to work closely with CF to fix this issue. In fact, I imagine it will do the exact opposite.

Today was a display of incompetence from Verizon, and a display of bad leadership by Cloudflare. I have no idea why any objective-minded person would be applauding Cloudflare for this. As I mentioned elsewhere, I would normally love a good public bashing of Verizon, but not when it comes at the cost of professionalism and progress.


Companies respond just fine to public scrutiny, caused by them being rightfully and loudly blamed. The way you lead people isn't the same as the way you lead companies.

Verizon was acting so badly that it's clear the pure friendly approach was doing absolutely nothing. And I'm sure Cloudflare is willing to give very real and pleasant engineering help if desired.

If Verizon doesn't want to talk to Cloudflare, that's fine too. This is not a problem that requires active cooperation. They just have to do their job.


>rightfully and loudly blamed

There is an enormous difference between assigning fault in a good faith attempt to find a root cause/solution, and casting unnecessary, unprofessional insults such as "Verizon's team should be ashamed of themselves". One is productive, and the other is just being a dick.

>The way you lead people isn't the same as the way you lead companies.

Yes, it certainly is. A company is an organization of people, after all. You don't get to eschew professionalism and start throwing around insults just because a group of people has decided to attach an additional label over their heads.

And just to put an even finer point on it, Matthew Prince's tweets about the issue were not targeted at Verizon "the company". He specifically attacked Verizon's NOC and its team members. Despite everything, this isn't a faceless, soulless corporation that's having insults hurled at them. He specifically went after a specific group of people and publicly shamed them. And then he has the gall to shame them even more for not immediately chomping at the bit to help someone who just aggressively insulted them.

Ask yourself: if Matthew Prince had sent a tweet berating team members from his own company, telling them they should be ashamed of themselves, and spent the rest of the day commenting on the internet insulting their competence, would you still be saying he is a good leader? Or even a good CEO? Of course not. It's Leadership 101 that insulting your team members isn't a good leadership style. And that doesn't change just because Prince isn't the one signing the Verizon team's paychecks.

> This is not a problem that requires active cooperation.

This is clearly not the opinion of those at Cloudflare that are loudly kicking their feet and whining that Verizon didn't devote enough resources to actively cooperate with Cloudflare's troubleshooting today.


> A company is an organization of people

Blaming a specific team can get too personal. Blaming an entire company is more about the decision-making structure, and is close to as impersonal as you can get. It's really not the same as blaming a person.

> This is clearly not the opinion of those at Cloudflare that are loudly kicking their feet and whining that Verizon didn't devote enough resources to actively cooperate with Cloudflare's troubleshooting today.

They didn't notice, acknowledge, or fix the problem. That's different from a lack of resources devoted to active cooperation. Heck, two messages of "on it" and "it's fixed" would be a pleasant level of "active cooperation", and that takes only a minute or two.


> Blaming a specific team can get too personal.

And yet blaming a specific team is exactly what they did.

>They didn't notice, acknowledge, or fix the problem. That's different from a lack of resources devoted to active cooperation. Heck, two messages of "on it" and "it's fixed" would be a pleasant level of "active cooperation", and that takes only a minute or two.

Sure, I'm not defending Verizon's inaction. My point is that regardless of the level of the cooperation, some cooperation is clearly still required. And now because of Cloudflare's hostility towards Verizon after this incident, I wouldn't be surprised if Verizon is much less inclined to participate in any cooperation. That not only seems counterproductive to Cloudflare's goal, it's also bad for all of us that use the internet.


> And yet blaming a specific team is exactly what they did.

In this specific case, just blaming "Verizon", it was not personal. (There are a variety of things that can be classified under "blaming a team" so I can't give it a blanket okay/not okay.)

Knowing it's the NOC team, as an amorphous blob of nameless people, is not getting too personal.

Just because something can be traced to a team doesn't mean that shaming the company is the same as shaming specific people from that team.

Going down that road would declare everything as personal, and that's really not how things work.

> I wouldn't be surprised if Verizon is much less inclined to participate in any cooperation.

The public pressure should be stronger than any pettiness, and if it's not then the solution is to let even more people know it was Verizon's fault.


>In this specific case, just blaming "Verizon", it was not personal.

That isn't what they did. They specifically called out teams, which according to what you just said, is too personal.

https://twitter.com/eastdakota/status/1143182575680143361

> The teams at @verizon and @noction should be incredibly embarrassed at their failings this morning ... It’s networking malpractice that the NOC at @verizon has still not replied to messages

Not only does he specifically call out the NOC, he also calls out teams. It is very obvious which "the teams" he is referring to, and "the NOC" is indeed a specific team. In other comments he also calls out Verizon's support team.

This wasn't the case of "tracing it back to a team". CF's CEO specifically addressed them and told them to be ashamed of themselves. That's personal, and it's also being a dick to boot. Was there anything in this situation that was gained by Prince calling these people out in these tweets? Would it not have been just as effective at calling out Verizon (while being less unprofessional and less personally malicious) if those tweets had been less vitriolic?

> The public pressure should be stronger than any pettiness, and if it's not then the solution is to let even more people know it was Verizon's fault.

So the solution to pettiness is more pettiness? Why does CF have a license to be petty but VZ apparently does not?


> according to what you just said, is too personal

That is not what I said!

I said it can be, and then I clarified with: There are a variety of things that can be classified under "blaming a team" so I can't give it a blanket okay/not okay.

I see the tweet. I call this case not personal. He's pointing the blame at large groups inside someone else's opaque company.

If you're pointing at a blob of 100+ people (like you said, support is also being blamed) then you're not making it personal.

> Was there anything in this situation that was gained by Prince calling these people out in these tweets?

People know what company to blame (a good thing), but nobody outside that company even knows how many teams, let alone specifics about the people on those teams (an acceptable thing). Overall positive.

> Would it not have been just as effective at calling out Verizon (while being less unprofessional and less personally malicious) if those tweets had been less vitriolic?

Being less vitriolic would not make it more or less personally targeted.

I'm not sure if the vitriol helped exactly but I think Verizon did enough to deserve it that there's no need to berate Cloudflare for the vitriol itself.

> Why does CF have a license to be petty but VZ apparently does not?

Presuming I even agree with your definition of pettiness, the problem is not the pettiness itself, but the actions they take or don't take.

It's not terrible for VZ to be petty as long as they still fix their broken equipment.


>If you're pointing at a blob of 100+ people (like you said, support is also being blamed) then you're not making it personal.

Ahh, I see. So it's okay that he was offensive and insulting, because he was offensive and insulting to many people? It wouldn't have been okay if he was offensive and insulting to only a handful of people, but because it was more than that, it's okay? Is this some weird perversion of "one death is a tragedy, 1000 deaths is a statistic"?

He isn't pointing the blame at a large group inside an "opaque" company. He's insulting people. The people at Verizon will know full well that he is talking to them. People that work with the Verizon NOC will know full well that those specific people are being insulted by this CEO. The fact that it was personally directed at multiple people doesn't make it any less personal, it just makes it personal to more people, no matter how much you move the goalposts.

> I'm not sure if the vitriol helped exactly but I think Verizon did enough to deserve it that there's no need to berate Cloudflare for the vitriol itself.

So it didn't help to berate Verizon, but it was still okay because they "deserved it"? And then you don't apply the same logic to Cloudflare themselves? There absolutely is a need to berate Cloudflare for their unnecessary use of vitriol, especially if you're telling me the bar for berating someone is as low as "it didn't help but that's okay".

It's clear at this point that you're moving goalposts and adjusting your own principles in some weird attempt to defend Cloudflare. Cloudflare did nothing positive here, and your attempt to justify their vitriol and maliciousness is telling.


Please don't break the site guidelines. Also, please don't do these intense tit-for-tat arguments with another user. They don't help, they lower the signal/noise ratio, and they bore everyone else. I know it's hard (believe me I know how hard it is), but at some point someone needs to be the first to let go.

https://news.ycombinator.com/newsguidelines.html


> Is this some weird perversion of "one death is a tragedy, 1000 deaths is a statistic"?

Nah. If you deliver 100 insults to 100 people, that's terrible. But if you deliver one insult to a vague blob of 100 people, that barely registers. The amount of insult directed at any specific person is tiny. That's why I'm not bothered by it.

> it just makes it personal to more people

No.

> no matter how much you move the goalposts

Really?

Someone disagrees with you so they must be moving goalposts?

Do better than that. I've been consistent on what I consider personal.

Also, I think you're too focused on vitriol. You can single people out and cause them harm while using the nicest and most polite language in the world. The way you target and your underlying meaning is far more important than your choice of words.

> So it didn't help to berate Verizon, but it was still okay because they "deserved it"? And then you don't apply the same logic to Cloudflare themselves? There absolutely is a need to berate Cloudflare for their unnecessary use of vitriol, especially if you're telling me the bar for berating someone is as low as "it didn't help but that's okay".

Let's put it this way. I regard "impersonal beration" as one tenth the crime of "being obviously and extremely negligent with equipment that can break the internet". And I'm willing to forgive vitriol when it's deserved and impersonal.

You don't forgive that, and want to say Cloudflare acted somewhat badly? Okay, sure.

You want to claim they are failing as a leader, overcompensating with drastic childish measures to blame someone else for something they could and should have mitigated themselves? I completely disagree.


Please don't break the site guidelines. Also, please don't do these intense tit-for-tat arguments with another user. They don't help, they lower the signal/noise ratio, and they bore everyone else. I know it's hard (believe me I know how hard it is), but at some point someone needs to be the first to let go.

https://news.ycombinator.com/newsguidelines.html


That's fair. While I haven't had a huge number of arguments like this, I can only name one or two that resolved successfully. I'll leave things earlier.


As far as mitigations, you're demanding they magically fix something they don't control. It's kind of infuriating, really. Stop what you're doing and reassess, please.


>Did Cloudflare do that?

Yes?

>Cloudflare has decided that it's high-time we took a leadership role to finally secure BGP routing

etc.

https://blog.cloudflare.com/rpki/

>their CEO is on Twitter telling Verizon they should be ashamed

Yes, well

>I'll be the first to line up for a good public lashing of US ISPs


There's a huge difference between saying you're going to be a leader and, y'know, actually being a leader. And there's an even huger difference between that and being an effective leader. I follow Cloudflare and eastdakota a lot. He clearly has the capability to be an effective leader (he is a CEO after all), and I personally admire him. However, in this particular situation, publicly berating the people that he is supposedly taking a "leadership role" over does not a good leader make.


So your definition of successful and mature business is a management team that doesn’t give 2f?

Wow


> What strikes me the most is that this whole "event" would have hardly even registered on anyone's radar (it affected less than 10% of their traffic during early hours of the morning.

It wasn't just CloudFlare who were affected. And the time of day is completely irrelevant, I live in Australia and was affected by this during evening peak time. Some very popular services (eg: discord) were completely knocked offline.

I think you're underestimating the impact of this event.


Please remember that this was early in the morning if and only if you happen to live in US time zones. There are many parts of the world where the sun shines when it is dark in the US.


I, in New Zealand, was well aware due to the number of inaccessible websites. It was a significant event not just on many people's radar, but actively preventing them from doing everyday things.

That said, I do think the tone of this blog post may have been taken a bit far.


You should go back and read what I wrote: https://news.ycombinator.com/item?id=20262316

I didn't fan flames. There was already a link to our status page on the front page of HN. While the event was happening I gave short updates by editing a comment here.

Also, your "affected less than 10% of their traffic during early hours of the morning" is incredibly parochial and seems to ignore the fact that people use the Internet world over.


>While the event was happening I gave short updates by editing a comment here.

It is disingenuous to only state you edited "a comment" there. You posted 10 comments in that thread, with at least another 10 edits. Of the top 5 comments, three of them are yours. On HN, each time you make a comment and people upvote your comment, it contributes to ranking the post higher on HN's front page. I fully understand that you were probably just trying to be communicative, but unintentionally or not, you did "fan the flames" by drawing additional attention to the issue.


So, you'd propose I sit back and leave a story with incomplete information on HN's front page and say nothing?

True that I posted other comments but they are short and don't say much. The real action was the main top comment.


Absolutely not! I appreciate your communication during the incident, and I definitely don't mean to discourage any participation in threads or reduce communication. I'm just pointing out that it did, if unintentionally, draw more attention to the issue.

What I don't appreciate is your company's unprofessional response re: Verizon after the issue ended, but that's been discussed elsewhere.


This happens all the time to lesser degrees and is a fact of life on the Internet. No action will be taken.


Back in the suspender wearing neckbeards days, the answer was simple... blackhole all Verizon routes.


> One of our network engineers made contact with DQE Communications quickly and after a little delay they were able to put us in contact with someone who could fix the problem. DQE worked with us on the phone to stop advertising these “optimized” routes to Allegheny Technologies Inc. We're grateful for their help. Once this was done, the Internet stabilized, and things went back to normal.

It's funny how we still have to use the phone to help fix an internet routing problem, even if "phone" doesn't literally mean the old black curly-wire equipment.


Gotta have a channel that's out-of-band with the internet to fix problems with the internet.


I'd love to know how out-of-band telephone networks (landline and mobile) actually are any more. A surprising amount goes via SIP on public IPs.


Nowadays with voip “the phone” isn’t as out of band as we’d like.


Is it time to put an HF ham radio rig in each major provider’s office? I shudder thinking of a major outage where even phone communication can’t take place.

I’m only half joking.

Edit: and maybe we just give Verizon a toy walkie talkie


You joke, but cascade failure ain't no laughing matter, either. Emergency plans aren't for fair weather, they are for sh!tstorms. I don't want to say "we are too reliant on the internet" because it's cliche and connectivity is just part of growing the modern world. But we sure as heck need several layers of backup plans in case things go sideways.

I, for one, hope there is a secret society of HAMs lurking as mild-mannered employees at every telco and ISP, ready to wire things back together when they short out.


There is an IRC server (several servers, actually) + channel that a variety of network operators are on that has existed since the very early 2000s for these sorts of events. 414 users on it now, most people's nicknames include their ASN to make it easier to find each other.

Unfortunately, Verizon is one of those networks that isn't present there. But many other networks are represented there and it provides a direct path to those who have config/enable access on some of the largest networks out there. Cuts out the having to go via formal escalation paths and NOC groups that require a trouble ticket before you can engage them.


You can't use the amateur bands for commercial purposes though.


Long ago, in the mists of time when I was a wee lad, the internet was a simpler place. There were seven buttons around the world, all pressed down by volunteers. If any four of them were released, the world would end. “The world” was defined as “the internet”, and at the time that meant “the world” was defined as “men with beards and suspenders and real opinions about Star Trek”, and so that wasn’t so bad.

Today in 2019, “the world” is defined as “you know, the world“, and there are seven million buttons being held down all over the world.

If any four of them are released, the world ends.

We have made mistakes, is what I’m saying.

(I once had call to explain to nontechnical people how and why the internet is the way it is and why my ops crews tend to be full of people who are a little too calm about things being constantly on fire. This was my best crack at it.)


Back in that oft-forgotten age I was privileged to know, work, and chill with the brave volunteers (and, subsequently, paid RIR staff) holding down the buttons. It’s worth mentioning that even back then there dwelt in the west a large ugly troll whose name was AS701 (AS701 was not alone, either, having two hideous siblings, AS702 and AS703, that lived in other climes). Everyone tiptoed around the beast, because when angered it would flap its routes and there would be a great wailing and severe packet loss. The brave volunteers tried many times to tame the awful creature and I’m very sorry to see that it is still fucking everyone’s announcements even today.


For the lazy:

AS701 Verizon Business/UUnet

AS702 Verizon Business/UUnet Europe

AS703 Verizon Business/UUnet ASPAC


uunet , alternet and ? now it's verizon eh


There's no need to fear, for seven men with keys, masters of the universe, well versed universal programming linguistically... and if there is a big blackout.. up to 6 of them can back out!

https://youtu.be/Odw9Md9Lm6g?t=164


You came up with this analogy when Lost was still airing, no?

Although BGP probably makes about the same amount of sense to most people.



Hoisting my pitchfork a bit, but the internet might be better off without hierarchical DNS. I certainly wouldn't call that "the world ending."


So we run into the age-old problem of "who decides". Also, how do we prevent fragmentation when there is disagreement.


Content-addressable schemes seem to be pretty effective in their respective niches. You lose the semantic component of dns, though. Perhaps you could add some sort of local name pinning.

If we imagine the internet is going to keep expanding at anywhere near its historical rate it seems like we might have to let go of the idea of letting a single entity universally control a namespace.


Freedom isn't free. Web of trust.

Inconvenient, but that's a price I'm willing to pay for a network that empowers users rather than commercial interests.


Other than "not enough people are interested" what is stopping you or any group of people from using such a decentralized system as your primary name resolver today? I.e. if it's not in the web of trust use existing DNS as a fallback and watch it grow. I'm not sure I'd trust such a system to prevent banksite.com from being hijacked but I don't need to for you to.


Needs more blockchain



The flaw here is that very few of the people who use and enjoy that internet are willing to pay that same price.

All is not lost though. You can always opt out and run your own DNS or use hosts files. Then you can have the internet you want and everyone else can have the internet they want.


Hmm. My issue with that is a majority consensus could decide that, for instance, .gov domains should no longer be relegated to the American government. Same issue as with a crypto 51% takeover.


> Hmm. My issue with that is a majority consensus could decide that, for instance, .gov domains should no longer be relegated to the American government. Same issue as with a crypto 51% takeover.

I'm sure other countries would like to use .gov.


They are entitled to their opinion; I saw a lot of this on the .amazon thread yesterday. However, right now, it's used by America only. There are some perks to inventing the internet.

My point is I don't like a system in which the majority can decide they don't like you having something and take it. For instance, what if people decided they didn't like facebook, so decided to seize its domain?


It's used by America only because America made the rules. That's not a great justification for the system as it stands.

Are there any examples of longstanding institutions that are _not_ beholden to the will of the many? All things fall if you can get enough people to revolt.


Web of trust is not consensus.

Fragmenting and an inability to formulate a standard answer when communicating are serious issues with using it, but a takeover is not.

If Chinese users would rather trust some Chinese government entity to resolve URLs, then that's what they'd get. At the same time as other users might get something else.


Wow I've got some beachfront property in Nevada to sell you if you think the "web of trust" actually addressed any credible threat model.


Not a threat model, but gives me freedom to decide whom I trust.


That's precisely what it doesn't do. It's a transitive trust relationship.


I did not. Never saw it either. ;)


I remember the early days of the Internet, when I could log in to an ISP router running BGP, with a blazing fast T1 to an early tier 1 Internet provider. We could literally announce any route we wanted, no filtering. We used to regularly black hole spammers, then turn them back on an hour or two later.


Please write about your life / these times! Or link me to your blog/writing if you already have. <3


It's not terribly interesting. No blog, sorry.


CloudFlare continues to raise the bar in terms of communicating technical issues to the public. Thank you for yet another enlightening writeup.


It's clear that Verizon broke the Internet this morning through incompetent BGP management, and they could do it again. Who holds them accountable?


glances over at Ajit Pai

Nobody.


It doesn't need government intervention. It needs other companies to hold them accountable.


Oh geez looks like the other companies aren't holding them accountable and the government shouldn't hold them accountable so I guess they just aren't accountable for causing broad swathes of economic damage because they were lazily managing their networks.

The system works!


Absolutely. If you are a Verizon service provider customer, please call them and register your feelings on this matter. Make sure you let them know in no uncertain terms that you are considering switching providers based on their lack of following best practices.


And will do so, once another provider becomes available in your area.


I did this yesterday. I called and expressed strong concern over Verizon's incident response and BGP security in general. I said I would no longer even want to be a FiOS customer, let alone a business customer, if I know that Verizon doesn't do basic prefix filtering on their BGP peers/customers.

I was careful not to berate/blame the T1 support people who have no clue what BGP is or even that an incident happened, but I tried to express the severity of the issue well enough that they would escalate a serious complaint to the network infra team.


I think that may be a little too subtle for the parent, but well done.


And enjoy their fantastic Super Complaint Automation Tasker system, which takes your valuable feedback and "$1 > /dev/null"


> Who holds them accountable?

Anyone working in their NOCs with thoughts of working elsewhere? Having a predetermination of "stupid and/or lazy" because of your workplace can't help employment prospects.


> Who holds them accountable?

> Anyone working in their NOCs with thoughts of working elsewhere? Having a predetermination of "stupid and/or lazy" because of your workplace can't help employment prospects.

Some people select workplaces based on the notion that they are lazy.


I hope we remember this thread the next time Shady Russians or Crafty Chinese make a BGP mistake and are accused of Active Measures.


This is an obvious mistake made by a steel company. The BGP "errors" that make tons of traffic go to Russia or China OFTEN WORK COMPLETELY. Some have worked so well no one noticed for months. That's far, far different from an error: a large section of a country's traffic being successfully hijacked to another country. Accidents usually break things. What Russia and China do is too successful to be an accident.


https://news.ycombinator.com/item?id=20262214 is the earlier thread on this.


Thanks. And thanks for the help today updating the title on the original post as the real cause came to light!


> For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. [...] The prefixes Cloudflare announces are signed for a maximum size of 20. RPKI then indicates any more-specific prefix should not be accepted, no matter what the path is.

Did RPKI help reduce the scope of this incident, by stopping propagation of these faulty routes earlier than otherwise? Or did it have no effect in this case?
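
For anyone who wants to see the maxLength check from the quote concretely, here is a minimal Python sketch of route origin validation. It is only an illustration, not how any particular router implements it: the `Roa` class and `validate` function are made-up names, the three result states follow the usual valid / invalid / not-found convention, and AS13335 as the authorized origin is an assumption that the quoted text does not state.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Roa:
    """One Route Origin Authorization: prefix, maximum length, authorized origin."""
    prefix: ipaddress.IPv4Network
    max_length: int
    origin_asn: int


def validate(route: str, origin_asn: int, roas: Iterable[Roa]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for an announced IPv4 route."""
    announced = ipaddress.ip_network(route)
    covered = False
    for roa in roas:
        # A ROA is relevant only if its prefix covers the announced route.
        if announced.subnet_of(roa.prefix):
            covered = True
            if origin_asn == roa.origin_asn and announced.prefixlen <= roa.max_length:
                return "valid"
    # Covered by some ROA but matched by none: invalid. Covered by no ROA: not-found.
    return "invalid" if covered else "not-found"


# Assumed ROA for the prefix in the quote: signed for a maximum length of 20.
ROAS = [Roa(ipaddress.ip_network("104.20.0.0/20"), max_length=20, origin_asn=13335)]

for route in ("104.20.0.0/20", "104.20.0.0/21", "104.20.8.0/21"):
    print(route, validate(route, 13335, ROAS))
```

With this ROA the /20 validates and both /21s come back invalid, whatever origin they carry. That only helps on networks that actually drop invalid routes.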


Anecdotal, but:

The article notes that AT&T has implemented RPKI, and a client mentioned to me that he wasn't having problems accessing Cloudflare-hosted infrastructure via his AT&T phone. The rest of his employees were having major issues though via the municipal fiber service provider.


Yup. There was no notable impact to AT&T traffic because they rejected the routes because they're filtering based on RPKI. Here's a Tweet from Cloudflare's Head of Network showing the AT&T vs. Verizon graphs: https://twitter.com/Jerome_UZ/status/1143276134907305984


It didn't. Virtually no major operator rejects invalids. However some do more-or-less strict prefix+ASN filtering.


One would think Cloudflare team would have a direct line of communication to all tier 1 Internet providers.


We thought we did. And tried both public and private lines of communication — without reply. Still waiting.


The lack of communication is the bit that justified blog scorching for me.

If everyone was working in good faith, professional courtesy would be called for.

But not being able to get someone on the phone? Wtf?

Glad the original provider was responsive and able to resolve. And hats off to your poor reliability engineers today.


I love the inclusion of the "help4u@verizon.com" address in the email attempting to get their attention on a Severity 0 event. It was worth a shot!


We tweeted at @Verizon and @VerizonSupport also.


Could always announce a few of their routes and wait for them to call you I guess? :)


Arrange to drop all traffic from Verizon

Then wait just as long to pick up the phone

The sheer amount of calls taken would cost them money, the only thing they actively seem to notice


Brutal. Should do that for those that don't meet a voted-upon (no vetoes allowed) deadline for implementing the mitigation(s).


So what's the over/under on Verizon actually picking up?


Hats off to Verizon for treating everyone equally without discrimination.


"We don't care. We don't have to. We're the phone company." - Verizon, probably

It's good to know that Net Neutrality isn't dead.


'Can you hear me now?'

......

'Can you hear me now?'

......

'Can you hear me now?'

6 hours later, when the 9-5p NOC wakes up....

'Can you hear me now?'

Oh, you were talking. Whoops!


Unbelievable.

Thanks for the excellent write-up.


Thank you all at Cloudflare for your attention to detail. I'm currently rethinking my stance of years believing you were crimeflare. I do hope Verizon responds and everyone can learn from this.


Makes me wonder whether Cloudflare will start running quarterly drills with all their alerting peers.


I am surprised that CF is as aggressive toward Verizon in public as they are. Once you start breaking the Internet for stupid reasons, though, you probably deserve it.

I know very little about BGP operations; I did not know that there was PKI and route validation like they described in the article.


They'd probably be less aggressive if they'd been able to reach anyone there or had received any response (none as of 8 hours after the incident). As noted above in this discussion the CF team thought they had appropriate contact information for all top tier carriers, and I suspect they do have what Verizon would call the appropriate contact info. Not much they can do if Verizon ghosts them, though.

I guess they could take steps to null route everything to/from Verizon to see if they could get someone's attention that way.


... Sometimes you just have to call a spade a spade. I know a little about BGP, having run it for a small provider in the distant past, and Verizon should have never been in a position to cause this.


More details on RPKI if you're interested: https://blog.cloudflare.com/rpki/


Almost nobody rejects invalid routes. RPKI is basically a research project given that 85% of routes don't even have ROAs. See https://rpki-monitor.antd.nist.gov/


Thank you Verizon for giving me a recent example to talk about while teaching BGP in Computer Networks class. My favorite example will still be Pakistan routing all YouTube traffic to itself while trying to restrict access to it. It is a little bit old though: https://www.cnet.com/news/how-pakistan-knocked-youtube-offli... http://web.mit.edu/6.02/www/s2012/handouts/youtube-pt.pdf


Verizon's lucky it's a blog post that doesn't mince words, rather than a lawsuit.


Can they file a lawsuit? Has Verizon broken their SLA? Is there a manual to mitigate all the edge cases? What about being aware internally that this had to be improved, but it was delayed due to bureaucracy?


>edge cases?

This is not an edge case, allowing downstream networks to broadcast routes for networks they do not own is a very well known security and operational issue with operating an ISP. Massive parts of the internet went down in the 90s to teach us this lesson.

Likewise, bureaucracy does not excuse not fixing an issue that's existed since the 90s, and not deploying any 1 of 3 mitigation tricks (let alone all 3).

Negligence causing damage from lost sales/traffic is sue-able.

The case would basically revolve around whether or not V had an obligation to prevent this from happening, and whether or not they were grossly negligent in that obligation.

In my view the answer is yes.


Who has standing though?

A Verizon customer might say "my internet was down" but there is 100% a clause in their contract about outages and SLAs.

Any company that lost sales likely doesn't have a contract with them, so what are they going to sue for? "Verizon didn't carry my 1s and 0s for free this morning"? Person A on the freeway having an accident and causing Person B to sit in traffic and miss their sales meeting isn't liable for that...

Maybe their peers (other telcos) have more standing because they couldn't deliver to their customers as a result but they of course all have a clause in their contracts about outages and SLAs that means ultimately they lost no money so there are no damages.

And this is why we need government regulation, either to break up the telcos or nationalize them.


Anybody can sue. And it'll still cost Verizon real money to fight those suits, even if they truly don't have liability.


Tortious interference or something, maybe?


Very interesting and easy to understand overview, thanks.

I'm not familiar with this side of networking, but it sounds to me like the "BGP Optimiser" product was left largely to its own devices and automated a configuration change without any explicit approval from a human operator (I could be wrong).

With the protocol being prone to problems like leaky routes and sloppy peers accepting them, is it really wise to leave these BGP optimiser products running without some level of supervision?

EDIT: of course I guess the human operator might wave the change through too without fully appreciating the problem...


I think the reality of it is that these BGP optimizers really can't be human checked. There's just too much it is doing, and for them to be really beneficial they need to respond quickly to network path congestion. I would be surprised if overseeing such a system could be done with fewer than 6 full time people, as a WAG.

... Which is why you should be really sure that these optimized routes never leak! And on top of it, Verizon should never have accepted those announcements.


The issue is that, as stated in the article, things like this could be prevented by proper IRR filtering.
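
To make "IRR filtering" a bit more concrete, here is a rough Python sketch of the idea: a provider builds an allow-list from what a customer has registered and drops anything else the customer announces. Everything here is hypothetical: the `REGISTERED` set stands in for route objects pulled from an IRR database, the prefixes and AS number come from the documentation ranges, and exact-match only is a simplification (real filters often allow some more-specifics).

```python
import ipaddress

# Hypothetical registry data for one customer: (prefix, origin ASN) pairs.
REGISTERED = {
    ("198.51.100.0/24", 64500),
    ("203.0.113.0/24", 64500),
}


def accept(prefix: str, origin_asn: int, registered=REGISTERED) -> bool:
    """Accept an announcement only if it exactly matches a registered route."""
    announced = ipaddress.ip_network(prefix)
    return any(
        announced == ipaddress.ip_network(reg_prefix) and origin_asn == reg_asn
        for reg_prefix, reg_asn in registered
    )


# The customer's own route passes; a more-specific of someone else's space does not.
print(accept("198.51.100.0/24", 64500))
print(accept("104.20.0.0/21", 13335))
```

The point of the sketch is that the check is applied at the provider's edge, so a leak from the customer never propagates further.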


Nothing is ever new in IT. A similar incident happened 22 years ago. Buggy and/or misconfigured gear disaggregates and reannounces routes and creates a blackhole.

https://en.wikipedia.org/wiki/AS_7007_incident


> The RPKI framework that we implemented and deployed globally last year is designed to prevent this type of leak. It enables filtering on origin network and prefix size. The prefixes Cloudflare announces are signed for a maximum size of 20. RPKI then indicates any more-specific prefix should not be accepted, no matter what the path is.

Does RPKI prevent Cloudflare from announcing additional /22 routes during an incident like this? Any network with RPKI implemented would reject the /22s, but those who ignore it should pick them up over the leaked /21s.
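
The reasoning behind the /22 idea is longest-prefix matching: a router that holds several routes covering a destination forwards along the most specific one. A small Python sketch, assuming a toy routing table where the labels "origin" and "leak" are hypothetical markers rather than real next hops:

```python
import ipaddress


def best_route(destination: str, table: dict) -> str:
    """Pick the label of the most specific prefix in `table` that contains `destination`."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), label)
        for prefix, label in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        raise LookupError(f"no route to {destination}")
    # Longest prefix length wins, regardless of which route is "legitimate".
    return max(candidates, key=lambda item: item[0].prefixlen)[1]


# Labels are hypothetical; they only mark where each route came from.
leaked = {
    "104.20.0.0/20": "origin",
    "104.20.0.0/21": "leak",
    "104.20.8.0/21": "leak",
}

# Emergency response: the origin announces every /22 inside its /20.
emergency = dict(leaked)
for subnet in ipaddress.ip_network("104.20.0.0/20").subnets(new_prefix=22):
    emergency[str(subnet)] = "origin"

print(best_route("104.20.5.1", leaked))
print(best_route("104.20.5.1", emergency))
```

With only the leaked /21s in the table the lookup follows the leak; once the /22s are present it goes back to the origin. Note that all four /22s are needed to cover the whole /20.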


We could break our prefixes into smaller routes, but 1) the Internet's routers have limited memory; 2) we have a lot of routes; and 3) we want to be good Internet citizens.

If every network announced all their routes as /24s — the smallest route generally accepted over the public Internet — the routing table would be a giant mess and would overwhelm many routers' ability to store them.

That said, after today we are thinking about ways that, in case of an emergency, we could break the routes down to be more specific than whatever is leaking. Given how broadly peered we are, Cloudflare's network will be as protected as anyone's. However, that's not really a good solution for the Internet generally. Better that we all implement and enforce RPKI.
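
As a back-of-the-envelope illustration of the routing table concern, this short Python sketch counts how many routes a single prefix turns into when it is split up. The function name is made up for the example.

```python
import ipaddress


def deaggregated_count(prefix: str, new_prefix: int = 24) -> int:
    """Number of routes produced by splitting `prefix` into `new_prefix`-sized pieces."""
    network = ipaddress.ip_network(prefix)
    if new_prefix < network.prefixlen:
        raise ValueError("new_prefix must not be shorter than the original prefix")
    # Each extra bit of prefix length doubles the number of routes.
    return 2 ** (new_prefix - network.prefixlen)


print(deaggregated_count("104.20.0.0/20"))      # 16
print(deaggregated_count("104.20.0.0/20", 22))  # 4
```

So one /20 announced as /24s costs every router 16 entries instead of 1, and that multiplies across every prefix a network holds.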


Kudos for a CEO that understands the ins and outs of Internet routing. Makes me want to join CF's neteng team.


Kudos for not deaggregating routes into /24s like many other major ISPs do nowadays.


And you believe the internet optimizer wouldn't have added /23s and /24s.... why?


Is Verizon's engineer on vacation, or did he just call in sick today?


Over a decade ago one of my friends was banned from the Sheffield Uni network for playing around with BGP and knocking the whole campus offline. One kind of has to wonder whether Verizon can suffer the same consequences simply by collective action on the part of other affected parties.


Nope - because said collective action would probably involve denying service to tens of millions of Verizon customers.


They could do better


Can they? How many choices of ISP do you have? I have 2, Spectrum (Time Warner) or some no-name for the same price with 1/10th the speed. I could not practically switch off Time Warner if I wanted to without moving. I imagine that Verizon customers are in the same boat.


Amusing callout to pager duty in the screenshot of the call-log :)


Which is more interesting, philosophically? The internet had a problem due to a single issue, or that the internet's problem was fixed due to a single person calling various people on a cell phone?

(device interactions versus human interactions..)


Every time I see the BGP abbreviation it's about a huge fuck-up. Either somebody hijacks routes intentionally or something like this happens.


Like any other critical underlying infrastructure of our lives, it's taken for granted and ignored until it breaks.


Not trying to defend a legacy protocol, but it's one of these cases where basic infrastructure is not newsworthy when it works just fine, like with plumbing.


I work from home in NJ and I knew something was screwy this morning. I wish there was a place I could have checked to see it was this issue. I rebooted pretty much everything in my house.


I’m in NJ and had issues this morning as well. Figured it was a local outage and promptly forgot about it after I left for work. I never expected it was something so serious. Crazy.


"Why is PagerDuty calling me before I've had my coffee?" Call declined.


Well, you just lost yourself a T-shirt Sir. https://news.ycombinator.com/item?id=20269076


I don't fully understand all the networking protocols involved but who/what is responsible to manage this kind of failure in the network? How should this ideally be handled besides cloudflare engineers calling someone?



I don’t think Cloudflare is going to get any business from Verizon anytime soon.

This may be as professional a “Hey Verizon, you don’t know what the fsck you are doing” as I’ve seen.


They're both big enough that they probably can't live without each other, which makes for an interesting relationship because they can probably swear at each other all day long and nothing will come of it.


What are the concerns around RPKI?

I'd like to know why some companies aren't implementing it if it solves similar problems. What kind of criticism does it receive?


So how many phone calls do you think Allegheny Technologies or that ISP in Pennsylvania received this morning?


I saw a few of my static sites that are hosted on Cloudflare and a few other auxiliary 3rd party services I use flapping back and forth on PagerDuty, but not all of my Cloudflare sites triggered down.

Is this just because the unaffected Cloudflare sites were not within the CIDR range affected?


>It doesn't cost a provider like Verizon anything to have such limits in place. And there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place.

I see you haven’t had much experience with large Telcos. They are all like this.


I still get a Cloudflare 1020 error here: http://shadow.tech (my location : Scandinavia ) Are these sites waiting for some kind of propagation or cache busting? It's a pretty large gaming service.


Works from the ATL DC, what is the airport code that shows up on https://cloudflare-test.judge.sh/#shadow.tech ? Might be a local [maybe routing] issue with CF -> shadow's web server.


Not working here either (Finland), that page shows HEL for me.

shadow.tech shows "Error 1020 Ray ID: 4ec1c24b2a945b25 • 2019-06-24 21:22:45 UTC Access denied What happened? This website is using a security service to protect itself from online attacks."


Oh, 1020 access denied means some firewall rule (any combination of rules [0] and `if`/`or` statements) or access controls (such as blocking IP ranges and countries) blocked access to the website. These are always set up by the site operator, so this isn't caused by any CF issues or the earlier routing issues.

I guess best course of action for those who want to access the site is to tweet them https://twitter.com/shadow_official with your Ray ID.

0: https://developers.cloudflare.com/firewall/cf-firewall-rules...


Has Verizon responded in any capacity yet?


Would this have affected stuff in the UK? All sorts of sites like MealPal were inaccessible this morning for a bit


I wouldn't be surprised if various other transit providers slurped the bogus routes up from Verizon (without filtering), so it's certainly possible.

