Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them ... or Tom Strickx (who wrote the blog post) had his beauty rest interrupted early this morning to deal with Verizon's screw-up and was not having it.
Team in London started working the problem and called in reinforcements from elsewhere;
Upper management (me and one other person) got involved as it was serious/not resolved fast;
I spoke with the network team in London, who seemed to have a good handle on the problem and how they were working to resolve it, but we decided to wake up a couple of other smart folks to make sure we had all the best people on it;
Problem got resolved by the diligence of an engineer in London getting through to and talking with DQE;
Some people went back to bed;
Tom worked on writing our internal incident report so that details were captured fast and people had visibility. He then volunteered to be point on writing the public blog (1415 UTC);
Folks in California woke up and got involved with the blog. Ton of people contributed to it from around the world with Tom fielding all the changes and ideas;
Very senior people at Cloudflare (including legal) signed off and we posted (1958 UTC).
No one had an axe to grind with Verizon. We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout.
Shout out to the Support team who handled an additional 1,000 support requests during the incident!
The incident itself and the lack of response (for HOURS) from Verizon's side are absolutely unacceptable. It's 2019: filtering ALL of your customers' routes according to - at least - the IRR (including the legacy ones connected to the old router in the closet) and having a responsive 24/7 NOC contact in PeeringDB are a matter of course.
Proper carriers like NTT go above and beyond simple IRR filtering nowadays with things like peerlock (http://instituut.net/~job/peerlock_manual.pdf).
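To make the peerlock idea concrete, here is a rough Python sketch of the decision it encodes: a route whose AS path contains a "protected" big-network ASN is only accepted if it arrives directly from that network (or from a neighbor explicitly authorized to carry it). This is an illustration of the logic only, not the router configuration from the manual; the function name, the `allowed_via` mapping, and the ASNs (taken from the documentation range 64496-64511) are all made up for the example.

```python
from typing import Dict, Iterable, Optional, Set


def peerlock_accepts(
    neighbor_asn: int,
    as_path: Iterable[int],
    protected: Set[int],
    allowed_via: Optional[Dict[int, Set[int]]] = None,
) -> bool:
    """Return True if a route learned from neighbor_asn passes the check.

    protected   -- ASNs of large networks we have a peerlock arrangement with
    allowed_via -- hypothetical map: protected ASN -> neighbors permitted to
                   send us paths that contain it
    """
    allowed_via = allowed_via or {}
    for asn in as_path:
        if asn not in protected:
            continue
        came_direct = neighbor_asn == asn
        came_via_authorized = neighbor_asn in allowed_via.get(asn, set())
        if not (came_direct or came_via_authorized):
            # A protected network shows up behind someone who should not be
            # providing transit for it: treat the route as a likely leak.
            return False
    return True


if __name__ == "__main__":
    protected = {64496}
    # Customer 64500 re-announces a path that goes through 64496: rejected.
    print(peerlock_accepts(64500, [64500, 64496, 64510], protected))
    # The same path learned directly from 64496: accepted.
    print(peerlock_accepts(64496, [64496, 64510], protected))
```

The point is that this check looks at the AS path, not the prefix, so it can catch a leak even when every prefix in it would pass an IRR-based prefix filter.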
AT&T uses RPKI and was completely unaffected: https://twitter.com/Jerome_UZ/status/1143276134907305984
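For anyone wondering what "uses RPKI" means in practice, below is a minimal Python sketch of route origin validation in the style of RFC 6811: a route is "valid" if some covering ROA matches its origin ASN and its prefix length is within the ROA's maxLength, "invalid" if it is covered but nothing matches, and "not-found" if no ROA covers it. The `ROA` tuple, the function name, and the prefixes and ASNs (documentation ranges) are assumptions for illustration; real deployments get validated ROA data from a validator rather than a hand-written list.

```python
import ipaddress
from typing import Iterable, NamedTuple


class ROA(NamedTuple):
    prefix: str       # e.g. "203.0.113.0/24"
    max_length: int   # longest prefix length the holder authorizes
    asn: int          # authorized origin AS


def validate_origin(route_prefix: str, origin_asn: int, roas: Iterable[ROA]) -> str:
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so check first.
        if route.version != roa_net.version:
            continue
        if not route.subnet_of(roa_net):
            continue
        covered = True
        # AS 0 in a ROA never matches any origin.
        if roa.asn != 0 and roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    roas = [ROA("203.0.113.0/24", 24, 64500)]
    print(validate_origin("203.0.113.0/24", 64500, roas))    # legitimate announcement
    print(validate_origin("203.0.113.0/25", 64500, roas))    # more-specific beyond maxLength
    print(validate_origin("198.51.100.0/24", 64501, roas))   # no covering ROA
```

The second case is the relevant one for leaks of more-specific prefixes: if the ROA's maxLength does not allow the longer prefix, a network that drops invalids never installs it, regardless of who sent it.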
EdgeCast used to be my favourite CDN. Not sure how they are doing now.
I love the shaming of Verizon without the sugarcoating. Divisive for sure, but a welcome one.
Indeed. And that's not going to help them or their customers in the least the next time they need Verizon's cooperation to resolve an issue. You would never see this type of behavior on the NANOG mailing list, which has been on the front line of communications between ISPs and providers for BGP issues since the beginning of the commercial internet. It is very much a "community" with reciprocal respect and professionalism, things this blog post was devoid of.
What element of the blog post are you referring to? NANOG often speaks in jargon and obtuse-professional speak, but with large routing leaks there are always strong opinions expressed. It's been this way going back well over a decade.
Another counterexample: go search the NANOG archives for opinions on AWS, EC2, and SES. You won't find much reciprocal respect - you'll find a bunch of unabashed criticism on how AWS operates, and how that affects the internet.
This is a clash of cultures. Cloudflare knows their customers expect a fast, accurate, transparent explanation. NANOG participants are used to an environment where their dirty laundry isn't aired in public to the point where they get calls from reporters asking about it.
Cloudflare is walking a tight line where they're trying to accurately explain to a lay audience what happened to their customers. They can't assume their audience knows what AS 701 is, or BCP 38, or the DFZ, or the prior harm that BGP optimizers have been known to cause.
A "professional" NANOG thread would touch on all of that, it just wouldn't be pieced together under a single byline for a mass audience.
> You guys have repeatedly accused them of being dumb without even speaking to anyone yet from the sounds of it.
Not for lack of trying...
> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.
It wasn't a slow response, it was no response. And either is unacceptable for a tier 1 carrier.
> But this industry is one big ass glass house. What’s that thing about stones again?
And other carriers are actively working to change that - including, in particular, Cloudflare.
So basically what Verizon did by looking at BCP 194 and saying “nah, too much bother”?
But it’s 2019 and I can’t muster up much sympathy for a tier 1 who can’t get inbound filters and a responsive NOC implemented correctly - things which were table stakes in 2009.
Let's all take our sleep seriously, yeah? :-)