I'm impressed. At most other similarly sized companies, this would take 4 days. And at somewhere like Amazon, it would be 2 weeks of approvals, editing, and review before a watered down version with all specifics removed is published.
Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them ... or Tom Strickx (who wrote the blog post) had his beauty rest interrupted early this morning to deal with Verizon's screw-up and was not having it.
Team in London started working the problem and called in reinforcements from elsewhere;
Upper management (me and one other person) got involved as it was serious/not resolved fast;
I spoke with the network team in London who seemed to have a good handle on the problem and how they were working to resolve but we decided to wake a couple of other smart folks up to make sure we had all the best people on it;
Problem got resolved by the diligence of an engineer in London getting through to and talking with DQE;
Some people went back to bed;
Tom worked on writing our internal incident report so that details were captured fast and people had visibility. He then volunteered to be point on writing the public blog (1415 UTC);
Folks in California woke up and got involved with the blog. Ton of people contributed to it from around the world with Tom fielding all the changes and ideas;
Very senior people at Cloudflare (including legal) signed off and we posted (1958 UTC).
No one had an axe to grind with Verizon. We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout.
Shout out to the Support team who handled an additional 1,000 support requests during the incident!
The incident itself and lack of response (for HOURS) from Verizon's side is absolutely unacceptable. It's 2019, filtering ALL of your customers' routes according to - at least - the IRR (including the legacy ones connected to the old router in the closet) and having a responsive 24/7 NOC contact in PeeringDB are a matter of course.
Proper carriers like NTT go above and beyond simple IRR filtering nowadays with things like peerlock (http://instituut.net/~job/peerlock_manual.pdf).
AT&T uses RPKI and was completely unaffected: https://twitter.com/Jerome_UZ/status/1143276134907305984
EdgeCast used to be my favourite CDN. Not sure how they are doing now.
I love the shaming of Verizon without the sugar coat. Divisive for sure, but a welcomed one.
Indeed. And that's not going to help them or their customers in the least the next time they need Verizon's cooperation to resolve an issue. You would never see this type of behavior on the NANOG mailing list which has been on the front line of communications between ISPs and providers for BGP issues since the beginning of the commercial internet. It is very much a "community" with reciprocal respect and professionalism, things this blog post was devoid of.
What element of the blog post are you referring to? NANOG often speaks in jargon and obtuse-professional speak, but with large routing leaks there are always strong opinions expressed. It's been this way going back well over a decade.
Another counterexample: go search the NANOG archives for opinions on AWS, EC2, and SES. You won't find much reciprocal respect - you'll find a bunch of unabashed criticism on how AWS operates, and how that affects the internet.
This is a clash of cultures. Cloudflare knows their customers expect a fast, accurate, transparent explanation. NANOG participants are used to an environment where their dirty laundry isn't aired in public to the point where they get calls from reporters asking about it.
Cloudflare is walking a tight line where they're trying to accurately explain to a lay audience what happened to their customers. They can't assume their audience knows what AS 701 is, or BCP 38, or the DFZ, or the prior harm that BGP optimizers have been known to cause.
A "professional" NANOG thread would touch on all of that, it just wouldn't be pieced together under a single byline for a mass audience.
> You guys have repeatedly accused them of being dumb without even speaking to anyone yet from the sounds of it.
Not for lack of trying...
> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.
It wasn't a slow response, it was no response. And either is unacceptable for a tier 1 carrier.
> But this industry is one big ass glass house. What’s that thing about stones again?
And other carriers are actively working to change that - including, in particular, Cloudflare.
So basically what Verizon did by looking at BCP194 and saying “nah, too much bother”??
But it’s 2019 and I can’t muster up much sympathy for a tier 1 who can’t get inbound filters and a responsive NOC implemented correctly - things which were table stakes in 2009.
Let's all take our sleep seriously, yeah? :-)
Yeah, that wasn’t contributing. Everyone has bad days.
> Chief Architect, Information Security, Akamai Technologies. I do not speak for my employer.
Probably best to disclose this more directly in comments on topics related to competitors.
I agree there will always be bad actors on hn, which is too bad. I think the moderators try hard to combat it, which I am grateful for.
That was a fun conference call, and listening to the lady on the phone I could see how the engineer got to that point.
From the late 90s to early 2000s I worked for UUNET/Worldcom as an engineer in the network planning and design group. I worked in the international group but among other things we were responsible for the build out of AS701 into Canada, the exchange sites where AS701 connected to the various other international UUNET AS's and the PoP's where dedicated circuits for international customers who wished to connect directly to AS701 would be terminated. The point being that I am familiar with how AS701 was operated at that time.
UUNET's reputation at the time might not have been sterling due to the business decision to basically be a safe haven for spammers but from a technical standpoint the network was operated at a high standard. The basic BGP filtering referenced in the article was certainly in place at the time and if this had happened then heads would have rolled.
Edit/Disclaimer: Yes, this is a joke. I woke up to several serious alarms just as the problem was starting. Luckily, I thought to check Cloudflare’s status page from my phone around the second or third time PagerDuty called me. I saw a preliminary notice from them indicating that they were observing networking issues. At that point, I decided I’d rather watch the world burn from my bed than my computer, so I “scheduled” maintenance for a few hours and went back to bed. Our whole infrastructure had split by that point, but there was nothing I could do about it.
"Hi, I'm Teejmya, and I was on call last night"
If you email me (matthewatcloudflaredotcom) your shirt size, preference for men's or women's cut, and your postal address with the subject line:
"Verizon BGP Leak On Call Support Group"
I'll send you a Cloudflare tshirt. Least we can do.
There’s a silver lining to everything.
I went off-call at 8:30am EST. Then, while the internet burned down, I slept in and played video games.
Whoever came up with that name and acronym deserves an award.
(If my understanding is correct, they shouldn’t have told Verizon about the better routing, while Verizon should have known better)
It's startling that Verizon doesn't appear to have any leak mitigations in place, but I feel like Allegheny is getting a pass here because they are small, or something.
(I’m from Pittsburgh, my grandfather and a bunch of my relatives worked for this company for decades. I’ve been kinda giggling about this intersection of my past and my present all day. I don’t work in the parts of Cloudflare that deal with this kind of thing; I’m glad my co-workers were on top of it.)
I guess “steel” in your name isn’t good for share prices.
Any ISP worth their salt has route filtering on any customer connections, nobody should be able to announce prefixes they don't own if the ISP is doing their job properly.
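To make the filtering idea concrete, here is a minimal sketch of a per-customer prefix filter. It assumes the ISP already holds a list of prefixes the customer has registered (for example from IRR route objects); the function name and the exact-match policy are illustrative assumptions, not how any particular carrier implements it.

```python
import ipaddress

def filter_customer_routes(announced, registered):
    """Split a customer's announcements into accepted and rejected lists.

    `registered` is a hypothetical list of prefixes the customer is known
    to hold (e.g. derived from IRR route objects). Only exact matches are
    accepted in this sketch; anything else is dropped at the edge.
    """
    registered_nets = {ipaddress.ip_network(p) for p in registered}
    accepted, rejected = [], []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        if net in registered_nets:
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    return accepted, rejected


# Documentation prefixes (RFC 5737) stand in for real address space.
accepted, rejected = filter_customer_routes(
    announced=["192.0.2.0/24", "203.0.113.0/24"],
    registered=["192.0.2.0/24"],
)
```

With a filter like this on the customer session, a prefix the customer never registered is rejected before it can be propagated upstream.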
And, a sincere thank you for not mincing words when it comes to something as important as this.
>However, against numerous best practices outlined below, Verizon’s lack of filtering turned this into a major incident that affected many Internet services such as Amazon, Fastly, Linode and Cloudflare.
>IRR filtering would not have increased Verizon's costs or limited their service in any way. Again, the only explanation we can conceive of why it wasn't in place is sloppiness or laziness.
In an attempt to find any statement given by Verizon, I found that The Register was able to get this amazing statement:
"Verizon sent us the following baffling response to today's BGP cockup: 'There was an intermittent disruption in internet service for some [Verizon] FiOS customers earlier this morning. Our engineers resolved the issue around 9am ET.'"
We're actually fortunate at Cloudflare because of our scale and wide-spread interconnection. That limited the impact more than it would have for a smaller, less-connected network. The crazy thing about BGP is that any router can announce that it's responsible for a block of IP addresses and, if it's trusted enough, that's what the map of the Internet will reflect.
The long term solution is for networks to implement and enforce RPKI. AT&T, for instance, implemented RPKI and we did not see any drop in traffic to their network today.
Verizon not only didn't implement RPKI, which would be the best-of-breed approach, but also didn't do even basic route filtering. It's as if a trusted traffic cop (Verizon) overheard from a random passing motorist that the main road was closed and, as a result, directed all traffic off a pier and into the ocean.
More about RPKI if you're interested: https://blog.cloudflare.com/rpki/
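For readers wondering what "enforcing RPKI" means in practice, here is a rough sketch of route origin validation in the style of RFC 6811. The `VRP` tuple, the function name, and the example ASNs (taken from the documentation range) are assumptions for illustration; real routers get validated ROA payloads from a validator rather than from a hard-coded list.

```python
import ipaddress
from typing import NamedTuple

class VRP(NamedTuple):
    """Validated ROA payload: who may originate a prefix, and how specific."""
    prefix: str
    max_length: int
    asn: int

def validate_origin(prefix, origin_asn, vrps):
    """Return 'valid', 'invalid', or 'not-found' for an announcement."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp in vrps:
        roa_net = ipaddress.ip_network(vrp.prefix)
        # Skip VRPs of another address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by some ROA but matched none: invalid. No covering ROA: not-found.
    return "invalid" if covered else "not-found"


vrps = [VRP("192.0.2.0/24", 24, 64500)]
```

Note the `max_length` check: an over-specific announcement such as a /25 carved out of a /24 ROA comes back `invalid` even when the origin AS is correct, so a network dropping invalids would reject that kind of more-specific route.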
CF is great if you need "free" protection for a pet project, not really anything more.
A lot of people seem to conflate speaking professionally with speaking like a doormat. Verizon, specifically the team in charge of this system, fucked up. There are varying levels to that of course; if you mess up the fonts in the end of month report to your super and he calls you a fucking idiot, he's probably an unbalanced person in need of mental help. If on the other hand you knock dead 15% of GLOBAL Internet traffic out of sheer laziness, I'd say you've earned more than a few 'go fuck yourself's.
I don't think it was clearly laziness. It could have been a configuration mistake.
Clearly Verizon has inbound prefix filtering in place otherwise this would be a common occurrence for AS 701 and it is most certainly not. And it's quite surprising and sad to see how willing people are to just blindly parrot Cloudflare here and pile on. This of course was the desired outcome of the blog post.
I guess this does shift the burden of trust from an ISP to the RIR, and the blog post mentions international law as RIR and ISP memberships can be part of different countries and only RIRs would know who has what IP address since only they are TAs (which empowers certain governments over others). So I guess the debate is whether the pain of BGP route leaks and such is greater than the stress of another country having your RPKI entry.
I guess we'll just have to see how badly Verizon messes up in the future.
Just as a RIR could issue a certificate for your IPs to someone else, they could change WHOIS, which is how IP delegations are generally cross referenced.
You're welcome to accept (or propagate) someone's advertisements without RPKI in case of some dispute with their RIR, but expect to get called out for it if the routes are bogus and you don't answer your NOC phone or email or twitters.
Actually, I don't think Cloudflare was even calling Verizon out for not doing RPKI, which is fairly new and has costs. It was more for not limiting prefix counts: a small customer should probably be limited to 2N + 4 prefixes, where N is the average number of prefixes they've advertised over the past 30 days, or they should have to put their prefixes in a portal or something.
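As a back-of-the-envelope sketch of that rule of thumb (the function names and the 30-sample history are my own assumptions, and 2N + 4 is just the heuristic suggested above, not a standard):

```python
import math

def max_prefix_limit(daily_counts):
    """Limit = 2N + 4, where N is the average of recent daily prefix counts."""
    if not daily_counts:
        raise ValueError("need at least one day of history")
    average = sum(daily_counts) / len(daily_counts)
    return math.ceil(2 * average + 4)

def should_drop_session(current_count, daily_counts):
    """True if the customer is announcing more prefixes than the limit allows."""
    return current_count > max_prefix_limit(daily_counts)


# A customer that normally announces about 10 prefixes gets a limit of 24.
history = [10] * 30
limit = max_prefix_limit(history)
```

A customer that suddenly announces thousands of prefixes trips the limit immediately, while ordinary growth stays comfortably under it.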
Filtering customer advertisements with IRRs is also pretty normal.
But really, you gotta answer the phone. The steel guys answered the phone.
RPKI roots trust at the RIRs, and that is a vulnerability, but any government intervention would end that trust and end the use of the RIRs as trust anchors. It's pretty unlikely to ever be used that way.
Disclaimer: I co-authored some of the drafts for RPKI and helped implement RPKI systems at an RIR.
So this level of negligence is dangerous. Shouldn't there be criminal charges? Or at least some kind of legal action.
The fact that the original AS Origin is included here makes this even more weaponized.
Brings it back to why doesn't the Noction platform "dirty" the injected announcements. For example, throwing out some Private ASNs or ASNs of "tier 1" providers to prevent those announcements from ever getting propagated around.
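A rough sketch of why "dirtying" the AS path would contain such routes: many networks already refuse paths that contain their own ASN (BGP loop detection), private ASNs, or a protected big-carrier ASN arriving from the wrong neighbor (the peerlock idea mentioned earlier). The function names and the `protected_asns` set are hypothetical; the private ASN ranges are the ones reserved in RFC 6996.

```python
# Private-use ASN ranges reserved by RFC 6996 (16-bit and 32-bit).
PRIVATE_ASN_RANGES = [(64512, 65534), (4200000000, 4294967294)]

def is_private_asn(asn):
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)

def accept_customer_path(as_path, local_asn, protected_asns):
    """Decide whether to accept a path learned from a customer session.

    `protected_asns` is a hypothetical set of large-carrier ASNs that
    should never appear behind a customer.
    """
    if local_asn in as_path:
        return False  # standard BGP loop detection
    if any(is_private_asn(asn) for asn in as_path):
        return False  # private ASNs should not appear in the global table
    if any(asn in protected_asns for asn in as_path):
        return False  # a customer should not be transiting a protected carrier
    return True
```

An optimizer-generated route whose path carried a private ASN would be dropped by the first neighbor applying a check like this, instead of spreading further.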
There are also issues such as broken legacy ROAs:
And the list goes on. Please stop with the hype and hand waving.
We all make mistakes. It's unreasonable to expect 100% uptime from anyone. But if you operate a service that so many people are relying on, and you make billions of dollars in profit each year (we're not talking about an unpaid volunteer open-source maintainer here), you absolutely have a responsibility to at least try to help fix it when there's a problem. It's brazenly irresponsible to go radio silent while your customer's other vendor fixes the problem.
Can you explain what part of the Cloudflare statement you consider to be posturing? Even on a cursory review, the BGP announcements referenced in the article are pretty clear. Facts are facts regardless of how the message is delivered.
If they can afford to lobby against non-profit competition and for local monopolies, they should damn well be able to staff a NOC for this type of issue.
If they are a malicious/malfeasant actor, can non-Verizon ASNs partition Verizon off the internet until they fix their shit?
IIRC the only cases where this has happened was when a couple of self-proclaimed "bulletproof hosters" were booted off of their uplinks, but even this wasn't a direct partition of the Internet.
What strikes me the most is that this whole "event" would have hardly even registered on anyone's radar (it affected less than 10% of their traffic during early hours of the morning; I saw one single news article about it, buried on The Verge, but other than that nothing), except for the fact that Cloudflare's CTO was on HN this morning fanning the flames of the one thread about it. It's like they dug their own hole drawing attention to the "Cloudflare outage" headline, and now they're overcompensating by going to drastic measures to blame someone else.
And now they keep harping on the fact that Verizon still hasn't responded? Sure, part of that is probably the fact that Verizon is a giant corporation that doesn't want to bother with this stuff, but the other part is that this "event" was hardly even big enough of a deal to register on VZ's PR team's radar, no matter how much CF whines about it.
This blog post (and the accompanying HN comments from Cloudflare execs) just scream "immature company" to me. There's a reason that Cloudflare is the one making this blog post and devoting CEO time to it while the established behemoth is just going about their business as usual.
1. They implement basic precautions to prevent dumb things from going wrong.
2. They're available 24/7, to immediately respond to and remediate whatever does go wrong.
3. Both of the above are core obligations, which supersede any questions of public relations or maturity or higher-ups not wanting to be bothered.
If Verizon can't be trusted to properly operate their network, that's an immediate threat to the health of the Internet, and many people do need to be made aware of it. It's not just Cloudflare being salty because their customers yelled at them.
1. Cloud providers that were affected enough to apparently devote not insignificant CEO and CTO time to it (Cloudflare)
2. Cloud providers that were affected but seemingly not enough for it to even register as anything more than a blip on their status tracker (Google, AWS, etc)
3. Cloud providers that weren't affected
As a potential customer thinking about buying services from one of these companies, which one do you think I am going to go with? It certainly won't be CF. And if I am already engaged with CF, I want to know what CF is going to do to mitigate this situation in the future, and no, pointing fingers like a child and saying "it wasn't our fault!" doesn't count.
Cloudflare can't really control Verizon's actions that led to this situation, but they can control how they respond to it and mitigate it. They had an opportunity to stand up as a leader and improve the internet (which is literally their company motto). As you pointed out, the internet working correctly is a matter of companies working together as good actors, and getting these companies to work together via good, strong relationships is a part of that.
Did Cloudflare do that? Nah. Instead, they made a petty blog post and their CEO is on Twitter telling Verizon they should be ashamed. I don't know exactly what his goal there was, but I assume it has something to do with hoping they'll be better in the future (if that's not his goal, then it really is just petty finger pointing). And if Cloudflare's CEO's method of getting people to improve their work is to publicly shame them, I really feel bad for anyone who works under him.
If you're going to try to impose yourself as the gatekeeper of "knowing the context", you should probably know it yourself. Saying CF "simply cannot do anything" is narrow minded at best, and completely wrong otherwise. In fact, in this very blog post linked in the OP, Cloudflare talks about taking steps to mitigate BGP issues in the future. That's great, if only it wasn't also paired with a childish finger pointing session.
And as I've said multiple times now, Cloudflare was in a great position here to stand themselves up as a strong leader on this topic to start working together with other companies (a la Verizon) to start to make real headway to fix the BGP problem. As other commenters have noted, the internet is entirely built on multiple organizations acting in good faith towards one another. Verizon failed to do that, and Cloudflare's response also failed to do that. I said it in another comment, but I'll also say it here: publicly berating the people that you are supposedly taking a leadership position over is not good leadership. This entire episode is going to do nothing to encourage Verizon to work closely with CF to fix this issue. In fact, I imagine it will do the exact opposite.
Today was a display of incompetence from Verizon, and a display of bad leadership by Cloudflare. I have no idea why any objective-minded person would be applauding Cloudflare for this. As I mentioned elsewhere, I would normally love a good public bashing of Verizon, but not when it comes at the cost of professionalism and progress.
Verizon was acting so badly that it's clear the pure friendly approach was doing absolutely nothing. And I'm sure Cloudflare is willing to give very real and pleasant engineering help if desired.
If Verizon doesn't want to talk to Cloudflare, that's fine too. This is not a problem that requires active cooperation. They just have to do their job.
There is an enormous difference between assigning fault in a good faith attempt to find a root cause/solution, and casting unnecessary, unprofessional insults such as "Verizon's team should be ashamed of themselves". One is productive, and the other is just being a dick.
>The way you lead people isn't the same as the way you lead companies.
Yes, it certainly is. A company is an organization of people, after all. You don't get to eschew professionalism and start throwing around insults just because a group of people has decided to attach an additional label over their heads.
And just to put an even finer point on it, Matthew Prince's tweets about the issue were not targeted at Verizon "the company". He specifically attacked Verizon's NOC and its team members. Despite everything, this isn't a faceless, soulless corporation that's having insults hurled at them. He specifically went after a specific group of people and publicly shamed them. And then he has the gall to shame them even more for not immediately chomping at the bit to help someone who just aggressively insulted them.
Ask yourself: if Matthew Prince had sent a tweet berating team members from his own company, telling them they should be ashamed of themselves, and spent the rest of the day commenting on the internet insulting their competence, would you still be saying he is a good leader? Or even a good CEO? Of course not. It's Leadership 101 that insulting your team members isn't a good leadership style. And that doesn't change just because Prince isn't the one signing the Verizon team's paychecks.
> This is not a problem that requires active cooperation.
This is clearly not the opinion of those at Cloudflare that are loudly kicking their feet and whining that Verizon didn't devote enough resources to actively cooperate with Cloudflare's troubleshooting today.
Blaming a specific team can get too personal. Blaming an entire company is more about the decision-making structure, and is close to as impersonal as you can get. It's really not the same as blaming a person.
> This is clearly not the opinion of those at Cloudflare that are loudly kicking their feet and whining that Verizon didn't devote enough resources to actively cooperate with Cloudflare's troubleshooting today.
They didn't notice, acknowledge, or fix the problem. That's different from a lack of resources devoted to active cooperation. Heck, two messages of "on it" and "it's fixed" would be a pleasant level of "active cooperation", and that takes only a minute or two.
And yet blaming a specific team is exactly what they did.
>They didn't notice, acknowledge, or fix the problem. That's different from a lack of resources devoted to active cooperation. Heck, two messages of "on it" and "it's fixed" would be a pleasant level of "active cooperation", and that takes only a minute or two.
Sure, I'm not defending Verizon's inaction. My point is that regardless of the level of the cooperation, some cooperation is clearly still required. And now because of Cloudflare's hostility towards Verizon after this incident, I wouldn't be surprised if Verizon is much less inclined to participate in any cooperation. That not only seems counterproductive to Cloudflare's goal, it's also bad for all of us that use the internet.
In this specific case, just blaming "Verizon", it was not personal. (There are a variety of things that can be classified under "blaming a team" so I can't give it a blanket okay/not okay.)
Knowing it's the NOC team, as an amorphous blob of nameless people, is not getting too personal.
Just because something can be traced to a team doesn't mean that shaming the company is the same as shaming specific people from that team.
Going down that road would declare everything as personal, and that's really not how things work.
> I wouldn't be surprised if Verizon is much less inclined to participate in any cooperation.
The public pressure should be stronger than any pettiness, and if it's not then the solution is to let even more people know it was Verizon's fault.
That isn't what they did. They specifically called out teams, which according to what you just said, is too personal.
> The teams at @verizon and @noction should be incredibly embarrassed at their failings this morning ... It’s networking malpractice that the NOC at @verizon has still not replied to messages
Not only does he specifically call out the NOC, he also calls out teams. It is very obvious which "the teams" he is referring to, and "the NOC" is indeed a specific team. In other comments he also calls out Verizon's support team.
This wasn't the case of "tracing it back to a team". CF's CEO specifically addressed them and told them to be ashamed of themselves. That's personal, and it's also being a dick to boot. Was there anything in this situation that was gained by Prince calling these people out in these tweets? Would it not have been just as effective at calling out Verizon (while being less unprofessional and less personally malicious) if those tweets had been less vitriolic?
> The public pressure should be stronger than any pettiness, and if it's not then the solution is to let even more people know it was Verizon's fault.
So the solution to pettiness is more pettiness? Why does CF have a license to be petty but VZ apparently does not?
That is not what I said!
I said it can be, and then I clarified with: There are a variety of things that can be classified under "blaming a team" so I can't give it a blanket okay/not okay.
I see the tweet. I call this case not personal. He's pointing the blame at large groups inside someone else's opaque company.
If you're pointing at a blob of 100+ people (like you said, support is also being blamed) then you're not making it personal.
> Was there anything in this situation that was gained by Prince calling these people out in these tweets?
People know what company to blame (a good thing), but nobody outside that company even knows how many teams, let alone specifics about the people on those teams (an acceptable thing). Overall positive.
> Would it not have been just as effective at calling out Verizon (while being less unprofessional and less personally malicious) if those tweets had been less vitriolic?
Being less vitriolic would not make it more or less personally targeted.
I'm not sure if the vitriol helped exactly but I think Verizon did enough to deserve it that there's no need to berate Cloudflare for the vitriol itself.
> Why does CF have a license to be petty but VZ apparently does not?
Presuming I even agree with your definition of pettiness, the problem is not the pettiness itself, but the actions they take or don't take.
It's not terrible for VZ to be petty as long as they still fix their broken equipment.
Ahh, I see. So it's okay that he was offensive and insulting, because he was offensive and insulting to many people? It wouldn't have been okay if he was offensive and insulting to only a handful of people, but because it was more than that, it's okay? Is this some weird perversion of "one death is a tragedy, 1000 deaths is a statistic"?
He isn't pointing the blame at a large group inside an "opaque" company. He's insulting people. The people at Verizon will know full well that he is talking to them. People that work with the Verizon NOC will know full well that those specific people are being insulted by this CEO. The fact that it was personally directed at multiple people doesn't make it any less personal, it just makes it personal to more people, no matter how much you move the goalposts.
> I'm not sure if the vitriol helped exactly but I think Verizon did enough to deserve it that there's no need to berate Cloudflare for the vitriol itself.
So it didn't help to berate Verizon, but it was still okay because they "deserved it"? And then you don't apply the same logic to Cloudflare themselves? There absolutely is a need to berate Cloudflare for their unnecessary use of vitriol, especially if you're telling me the bar for berating someone is as low as "it didn't help but that's okay".
It's clear at this point that you're moving goalposts and adjusting your own principles in some weird attempt to defend Cloudflare. Cloudflare did nothing positive here, and your attempt to justify their vitriol and maliciousness is telling.
Nah. If you deliver 100 insults to 100 people, that's terrible. But if you deliver one insult to a vague blob of 100 people, that barely registers. The amount of insult directed at any specific person is tiny. That's why I'm not bothered by it.
> it just makes it personal to more people
> no matter how much you move the goalposts
Someone disagrees with you so they must be moving goalposts?
Do better than that. I've been consistent on what I consider personal.
Also, I think you're too focused on vitriol. You can single people out and cause them harm while using the nicest and most polite language in the world. The way you target and your underlying meaning is far more important than your choice of words.
> So it didn't help to berate Verizon, but it was still okay because they "deserved it"? And then you don't apply the same logic to Cloudflare themselves? There absolutely is a need to berate Cloudflare for their unnecessary use of vitriol, especially if you're telling me the bar for berating someone is as low as "it didn't help but that's okay".
Let's put it this way. I regard "impersonal beration" as one tenth the crime of "being obviously and extremely negligent with equipment that can break the internet". And I'm willing to forgive vitriol when it's deserved and impersonal.
You don't forgive that, and want to say Cloudflare acted somewhat badly? Okay, sure.
You want to claim they are failing as a leader, overcompensating with drastic childish measures to blame someone else for something they could and should have mitigated themselves? I completely disagree.
>Cloudflare has decided that it's high-time we took a leadership role to finally secure BGP routing
>their CEO is on Twitter telling Verizon they should be ashamed
>I'll be the first to line up for a good public lashing of US ISPs
It wasn't just Cloudflare who were affected. And the time of day is completely irrelevant: I live in Australia and was affected by this during evening peak time. Some very popular services (e.g. Discord) were completely knocked offline.
I think you're underestimating the impact of this event.
That said, I do think the tone of this blog post may have been taken a bit far.
I didn't fan flames. There was already a link to our status page on the front page of HN. While the event was happening I gave short updates by editing a comment here.
Also, your "affected less than 10% of their traffic during early hours of the morning" is incredibly parochial and seems to ignore the fact that people use the Internet world over.
It is disingenuous to only state you edited "a comment" there. You posted 10 comments in that thread, with at least another 10 edits. Of the top 5 comments, three of them are yours. On HN, each time you make a comment and people upvote your comment, it contributes to ranking the post higher on HN's front page. I fully understand that you were probably just trying to be communicative, but unintentionally or not, you did "fan the flames" by drawing additional attention to the issue.
True that I posted other comments but they are short and don't say much. The real action was the main top comment.
What I don't appreciate is your company's unprofessional response re: Verizon after the issue ended, but that's been discussed elsewhere.
It's funny how we still have to use the phone to help fix an internet routing problem, even if "phone" no longer literally means the old black curly-wire equipment.
I’m only half joking.
Edit: and maybe we just give Verizon a toy walkie talkie
I, for one, hope there is a secret society of HAMs lurking as mild-mannered employees at every telco and ISP, ready to wire things back together when they short out.
Unfortunately, Verizon is one of those networks that isn't present there. But many other networks are represented there and it provides a direct path to those who have config/enable access on some of the largest networks out there. Cuts out the having to go via formal escalation paths and NOC groups that require a trouble ticket before you can engage them.
Today in 2019, “the world” is defined as “you know, the world”, and there are seven million buttons being held down all over the world.
If any four of them are released, the world ends.
We have made mistakes, is what I’m saying.
(I once had call to explain to nontechnical people how and why the internet is the way it is and why my ops crews tend to be full of people who are a little too calm about things being constantly on fire. This was my best crack at it.)
AS701 Verizon Business/UUnet
AS702 Verizon Business/UUnet Europe
AS703 Verizon Business/UUnet ASPAC
Although BGP probably makes about the same amount of sense to most people.
If we imagine the internet is going to keep expanding at anywhere near its historical rate it seems like we might have to let go of the idea of letting a single entity universally control a namespace.
Inconvenient, but that's a price I'm willing to pay for a network that empowers users rather than commercial interests.
All is not lost though. You can always opt out and run your own DNS or use hosts files. Then you can have the internet you want and everyone else can have the internet they want.
I'm sure other countries would like to use .gov.
My point is I don't like a system in which the majority can decide they don't like you having something and take it. For instance, what if people decided they didn't like facebook, so decided to seize its domain?
Are there any examples of longstanding institutions that are _not_ beholden to the will of the many? All things fall if you can get enough people to revolt.
Fragmenting and an inability to formulate a standard answer when communicating are serious issues with using it, but a takeover is not.
If Chinese users would rather trust some Chinese government entity to resolve URLs, then that's what they'd get. At the same time as other users might get something else.
The system works!
I was careful not to berate/blame the T1 support people who have no clue what BGP is or even that an incident happened, but I tried to express the severity of the issue well enough that they would escalate a serious complaint to the network infra team.
Anyone working in their NOCs with thoughts of working elsewhere? Having a predetermination of "stupid and/or lazy" because of your workplace can't help employment prospects.
> Anyone working in their NOCs with thoughts of working elsewhere? Having a predetermination of "stupid and/or lazy" because of your workplace can't help employment prospects.
Some people select workplaces based on the notion that they are lazy.
Did RPKI help reduce the scope of this incident, by stopping propagation of these faulty routes earlier than otherwise? Or did it have no effect in this case?
The article notes that AT&T has implemented RPKI, and a client mentioned to me that he wasn't having problems accessing Cloudflare-hosted infrastructure via his AT&T phone. The rest of his employees were having major issues though via the municipal fiber service provider.
If everyone was working in good faith, professional courtesy.
But not being able to get someone on the phone? Wtf?
Glad the original provider was responsive and able to resolve. And hats off to your poor reliability engineers today.
Then wait just as long to pick up the phone
The sheer number of calls taken would cost them money, the only thing they actively seem to notice.
It's good to know that Net Neutrality isn't dead.
'Can you hear me now?'
6 hours later, when the 9-5p NOC wakes up....
Oh, you were talking. Whoops!
Thanks for the excellent write-up.
I know very little about BGP operations; I did not know that there was PKI and route validation like they described in the article.
I guess they could take steps to null route everything to/from Verizon to see if they could get someone's attention that way.
This is not an edge case: allowing downstream networks to announce routes for networks they do not own is a very well-known security and operational issue with operating an ISP. Massive parts of the internet went down in the 90s to teach us this lesson.
Likewise, bureaucracy does not excuse not fixing an issue that's existed since the 90s, and not deploying any 1 of 3 mitigation tricks (let alone all 3).
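For anyone who hasn't seen what "filter your customer's routes" means in practice, here is a rough Python sketch of the idea. It is not any vendor's config or anyone's actual policy: the allow list, the prefixes (private-range placeholders), the function name `accept_from_customer`, and the choice to accept more-specifics up to /24 are all assumptions for illustration.

```python
import ipaddress

# Hypothetical allow list for one customer, e.g. built from IRR route objects.
# These are private-range placeholders, not real allocations.
CUSTOMER_ALLOWED = [
    ipaddress.ip_network("10.20.0.0/22"),
    ipaddress.ip_network("10.30.8.0/24"),
]


def accept_from_customer(prefix, allowed=CUSTOMER_ALLOWED, max_len=24):
    """Accept an announcement only if it sits inside a registered prefix
    and is no more specific than max_len (an assumed policy choice)."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_len:
        return False
    # subnet_of() requires matching IP versions, so check that first.
    return any(net.version == a.version and net.subnet_of(a) for a in allowed)


if __name__ == "__main__":
    for p in ["10.20.0.0/22", "10.20.1.0/24", "10.99.0.0/21", "10.20.0.0/25"]:
        print(p, "accept" if accept_from_customer(p) else "reject")
```

The point is only that a prefix the customer has no registration for (the `10.99.0.0/21` case) gets dropped at the edge instead of being passed on to the rest of the Internet.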
Negligence causing damage from lost sales/traffic is sue-able.
The case would basically revolve around whether or not V had an obligation to prevent this from happening, and whether or not they were grossly negligent in that obligation.
In my view the answer is yes.
A Verizon customer might say "my internet was down" but there is 100% a clause in their contract about outages and SLAs.
Any company that lost sales likely doesn't have a contract with them, so what are they going to sue for? "Verizon didn't carry my 1s and 0s for free this morning"? Person A on the freeway having an accident and causing Person B to sit in traffic and miss their sales meeting isn't liable for that...
Maybe their peers (other telcos) have more standing because they couldn't deliver to their customers as a result but they of course all have a clause in their contracts about outages and SLAs that means ultimately they lost no money so there are no damages.
And this is why we need government regulation, either to break up the Telcos or nationalize them
I'm not familiar with this side of networking, but it sounds to me like the "BGP Optimiser" product was left largely to its own devices and automated a configuration change without any explicit approval from a human operator (I could be wrong).
With the protocol being prone to problems like leaky routes and sloppy peers accepting them, is it really wise to leave these BGP optimiser products running without some level of supervision?
EDIT: of course I guess the human operator might wave the change through too without fully appreciating the problem...
... Which is why you should be really sure that these optimized routes never leak! And on top of it, Verizon should never have accepted those announcements.
Does RPKI prevent Cloudflare from announcing additional /22 routes during an incident like this? Any network with RPKI implemented would reject the /22s, but those who ignore it should pick them up over the leaked /21s.
If every network announced all their routes as /24s — the smallest route generally accepted over the public Internet — the routing table would be a giant mess and would overwhelm many routers' ability to store them.
That said, after today we are thinking about ways that, in case of an emergency, we could break the routes down to be more specific than whatever is leaking. Given how broadly peered we are, Cloudflare's network will be as protected as anyone's. However, that's not really a good solution for the Internet generally. Better that we all implement and enforce RPKI.
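To make the /22 versus /21 question concrete, here is a toy Python model of the two mechanisms involved: RPKI origin validation against a ROA with a maxLength, and longest-prefix-match route selection. Everything in it is made up for illustration: the prefixes are private-range placeholders, AS64496 is a documentation ASN, the single ROA and the labels are hypothetical, and it assumes the leaked more-specific keeps the original origin AS. Real BGP best-path selection has many more steps than this.

```python
import ipaddress

# Hypothetical ROA set: (prefix, max_length, origin ASN).
ROAS = [(ipaddress.ip_network("10.40.0.0/20"), 20, 64496)]


def rpki_state(prefix, origin_asn, roas=ROAS):
    """Classify a route as valid / invalid / not-found against the ROAs."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_net, max_len, asn in roas:
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            if asn == origin_asn and net.prefixlen <= max_len:
                return "valid"
    # Covered by a ROA but no match on origin and length means invalid.
    return "invalid" if covered else "not-found"


def best_route(dest, routes, enforce_rpki):
    """Pick the longest matching prefix, optionally dropping RPKI-invalid routes."""
    addr = ipaddress.ip_address(dest)
    candidates = []
    for prefix, origin, label in routes:
        net = ipaddress.ip_network(prefix)
        if addr not in net:
            continue
        if enforce_rpki and rpki_state(prefix, origin) == "invalid":
            continue
        candidates.append((net.prefixlen, label))
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


ROUTES = [
    ("10.40.0.0/20", 64496, "normal"),     # matches the ROA
    ("10.40.0.0/21", 64496, "leaked"),     # more specific than maxLength
    ("10.40.0.0/22", 64496, "emergency"),  # even more specific
]

if __name__ == "__main__":
    print(best_route("10.40.1.1", ROUTES, enforce_rpki=True))
    print(best_route("10.40.1.1", ROUTES, enforce_rpki=False))
```

In this toy setup a network that enforces RPKI discards both more-specifics and stays on the normal /20, while a network that ignores RPKI follows whichever announcement is most specific. That is the trade-off the question is getting at: emergency more-specifics only help on networks that don't validate, unless the ROA's maxLength is set to allow them.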
(device interactions versus human interactions..)
This may be as professional a “Hey Verizon, you don’t know what the fsck you are doing” as I’ve seen.
I'd like to know why some companies aren't implementing it if it solves similar problems. What kind of criticism does it receive?
Is this just because the unaffected Cloudflare sites were not within the CIDR range affected?
I see you haven’t had much experience with large Telcos. They are all like this.
shadow.tech shows "Error 1020 Ray ID: 4ec1c24b2a945b25 • 2019-06-24 21:22:45 UTC
This website is using a security service to protect itself from online attacks."
I guess the best course of action for those who want to access the site is to tweet them at https://twitter.com/shadow_official with your Ray ID.