
Cloudflare managed to get an in-depth blog post, one which has incident details, points blame to other parties, and makes some really quite aggressive (for corporate blog posts) claims, all during an incident, and they did all that in 8 hours.

I'm impressed. At most other similarly sized companies, this would take 4 days. And at somewhere like Amazon, it would be 2 weeks of approvals, editing, and review before a watered-down version with all specifics removed is published.




Regarding those really aggressive claims, I was a bit shocked by that as well.

Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them ... or Tom Strickx (who wrote the blog post) had his beauty rest interrupted early this morning to deal with Verizon's screw-up and was not having it.


The sequence of events went a bit like this:

Team in London started working the problem and called in reinforcements from elsewhere;

Upper management (me and one other person) got involved as it was serious/not resolved fast;

I spoke with the network team in London, who seemed to have a good handle on the problem and how they were working to resolve it, but we decided to wake a couple of other smart folks up to make sure we had all the best people on it;

Problem got resolved by the diligence of an engineer in London getting through to and talking with DQE;

Some people went back to bed;

Tom worked on writing our internal incident report so that details were captured fast and people had visibility. He then volunteered to be point on writing the public blog (1415 UTC);

Folks in California woke up and got involved with the blog. Ton of people contributed to it from around the world with Tom fielding all the changes and ideas;

Very senior people at Cloudflare (including legal) signed off and we posted (1958 UTC).

No one had an axe to grind with Verizon. We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout.

Shout out to the Support team who handled an additional 1,000 support requests during the incident!


Thank you! Cloudflare's response is appropriate.

The incident itself and lack of response (for HOURS) from Verizon's side is absolutely unacceptable. It's 2019, filtering ALL of your customers' routes according to - at least - the IRR (including the legacy ones connected to the old router in the closet) and having a responsive 24/7 NOC contact in PeeringDB are a matter of course.
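
To make the IRR point concrete, here is a minimal sketch of what a per-customer prefix filter does. Everything in it is an assumption for illustration: the `IRR_ROUTE_OBJECTS` table stands in for data a real operator would pull from an IRR database, the ASNs and prefixes are documentation-reserved values, and exact-match acceptance is a simplification (real filters are usually generated by tooling and may allow some more-specifics).

```python
import ipaddress

# Hypothetical stand-in for route objects registered in an IRR.
# ASNs and prefixes are documentation-reserved values, not real networks.
IRR_ROUTE_OBJECTS = {
    64496: ["192.0.2.0/24", "198.51.100.0/24"],
}


def build_prefix_filter(customer_asn):
    """Return the set of prefixes this customer has registered."""
    registered = IRR_ROUTE_OBJECTS.get(customer_asn, [])
    return {ipaddress.ip_network(p) for p in registered}


def accept_announcement(customer_asn, prefix):
    """Accept an announcement only if it exactly matches a registered prefix.

    Exact matching is a simplifying assumption; it means unregistered
    more-specifics (e.g. a /25 carved out of a registered /24) are dropped.
    """
    return ipaddress.ip_network(prefix) in build_prefix_filter(customer_asn)
```

With a filter like this on the customer session, a leaked more-specific or a prefix belonging to someone else never gets past the first upstream, regardless of how it was generated.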

Proper carriers like NTT go above and beyond simple IRR filtering nowadays with things like peerlock (http://instituut.net/~job/peerlock_manual.pdf).
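
Roughly, the idea behind this kind of AS-path filtering is: if a route arrives on a customer session and its AS path contains a big transit network, the customer is almost certainly leaking it. The sketch below is an assumption-laden simplification of that idea, not the configuration from the linked manual; `PROTECTED_ASNS` and `accept_from_customer` are hypothetical names, and the ASN list is only illustrative.

```python
# Illustrative set of large transit networks (Verizon 701, NTT 2914,
# AT&T 7018). A real deployment would maintain its own list.
PROTECTED_ASNS = {701, 2914, 7018}


def accept_from_customer(as_path):
    """Reject a customer-learned route whose AS path contains a protected ASN.

    A customer should only announce its own routes and those of its own
    customers, so a large transit network appearing anywhere in the path
    suggests a leak.
    """
    return not any(asn in PROTECTED_ASNS for asn in as_path)
```

For example, `accept_from_customer([64496, 64497])` passes, while a path such as `[64496, 2914, 64500]` is dropped because the customer appears to be re-announcing routes learned from a protected network.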

AT&T uses RPKI and was completely unaffected: https://twitter.com/Jerome_UZ/status/1143276134907305984
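
For anyone unfamiliar with how RPKI filtering decides what to drop, here is a rough sketch of route origin validation as described in RFC 6811. The `ROA` tuple, the `ROAS` list and the `validate` function are hypothetical names for illustration, and the prefixes and ASNs are documentation-reserved values; a real router gets this data from a validator rather than a hard-coded list.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    """A simplified Route Origin Authorization (illustrative only)."""
    prefix: str
    max_length: int
    asn: int


# Hypothetical ROA set using documentation-reserved values.
ROAS = [ROA("192.0.2.0/24", 24, 64496)]


def validate(prefix, origin_asn, roas=ROAS):
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Only ROAs of the same address family that cover the route count.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches if the origin ASN agrees and the route
        # is no more specific than the ROA's maximum length.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none: invalid. No coverage: not-found.
    return "invalid" if covered else "not-found"
```

A network that drops "invalid" routes would discard both a wrong-origin announcement and a more-specific that exceeds the ROA's maximum length, while routes with no covering ROA are unaffected.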


So basically we should avoid all Verizon Enterprise products.

EdgeCast used to be my favourite CDN. Not sure how they are doing now.


Keep in mind that Verizon is a huge organization


Awesome. Thanks for this timeline and for the team being absolutely amazing.

I love the shaming of Verizon without the sugar coating. Divisive for sure, but a welcome one.


Tom is based in London. So he had a good night's sleep and was well rested.


We don’t know that he wasn’t up late fixing another bug that we thankfully never saw, and that his life hasn’t been like a season of 24 this past day.


We don't know either if he's an alien coming from a planet that has a shorter day and his circadian cycle is being disrupted by being on Earth, but Occam would suggest no.


If Cloudflare is correct on the technical part (no idea, I don't have enough knowledge to evaluate) then they are completely within their rights to call out Verizon on not doing the right thing. It's not "aggressive" at all. They don't mean harm to Verizon, they just point out Verizon is not doing their job correctly, and should step up and fix it. And not having somebody to talk to for 8 hours at a provider that is capable of downing a significant part of the internet is also something worth calling out. It's not some mom-and-pop shop where the owner can just hang a sign on the door and go to the beach, it's a major provider that should always have open channels to communicate on things like this.


>"Either Cloudflare has some pre-existing beef with Verizon and is using this as an opportune moment to dump on them"

Indeed. And that's not going to help them or their customers in the least the next time they need Verizon's cooperation to resolve an issue. You would never see this type of behavior on the NANOG mailing list, which has been on the front line of communications between ISPs and providers for BGP issues since the beginning of the commercial internet. It is very much a "community" with reciprocal respect and professionalism, things this blog post was devoid of.


> You would never see this type of behavior on the NANOG mailing list which has been on the front line of communications between ISPs and providers.

What element of the blog post are you referring to? NANOG often speaks in jargon and obtuse-professional speak, but with large routing leaks there are always strong opinions expressed. It's been this way going back well over a decade.

Another counterexample: go search the NANOG archives for opinions on AWS, EC2, and SES. You won't find much reciprocal respect - you'll find a bunch of unabashed criticism on how AWS operates, and how that affects the internet.

This is a clash of cultures. Cloudflare knows their customers expect a fast, accurate, transparent explanation. NANOG participants are used to an environment where their dirty laundry isn't aired in public to the point where they get calls from reporters asking about it.

Cloudflare is walking a tight line where they're trying to accurately explain to a lay audience what happened to their customers. They can't assume their audience knows what AS 701 is, or BCP 38, or the DFZ, or the prior harm that BGP optimizers have been known to cause.

A "professional" NANOG thread would touch on all of that, it just wouldn't be pieced together under a single byline for a mass audience.


"the next time they need Verizon's cooperation to resolve an issue" According to the Cloudflare post, they didn't get Verizon's cooperation to fix a partly-Verizon-caused issue this time, so what do they have to lose?



Weird response by the Verizon employee.

> You guys have repeatedly accused them of being dumb without even speaking to anyone yet from the sounds of it.

Not for lack of trying...

> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.

It wasn't a slow response, it was no response. And either is unacceptable for a tier 1 carrier.

> But this industry is one big ass glass house. What’s that thing about stones again?

And other carriers are actively working to change that - including, in particular, Cloudflare.


[flagged]


I think what lima is saying is the Verizon employee basically says "Why didn't you call us for comment before publicly complaining that we never answer our phones?"


Cloudflare is not a tier 1 carrier if you go by the strict definition of the term, just like Google isn't - but it's one of the largest content networks, responsible for a significant percentage of internet traffic, with a global carrier-level backbone. Google and Cloudflare even tend to have better internal routing than most tier 1 providers.


They are not a carrier, period! They don't sell transit. A Tier 1 carrier does settlement-free peering. They are a CDN. And no, they most certainly do not have a "global carrier-level backbone." Argot or whatever they are calling their product is not an actual backbone with dedicated fiber, submarine cable, etc. There would be no reason for them to invest in fiber and lightwave gear as they are an edge network, full stop. Your comment shows a lack of understanding of how the internet actually works.


What's up with the Verizon employee comparing AS701 to APAC carriers? That’s a super harsh thing to publicly say about your employer.


I’d say that Verizon’s lack of cooperation was devoid of any respect or professionalism.


Yes, very much professionalism including such recent email threads titled "Russian Anal Probing"


Cloudflare's bet is essentially that they can control so much of the internet infrastructure that they can behave however they like and we all simply have to deal with it.


> behave however they like and we all simply have to deal with it

So basically what Verizon did by looking at BCP194 and saying “nah, too much bother”??


Not really. You don't have to be a massive player to screw things up with BGP.


No doubt.

But it’s 2019 and I can’t muster up much sympathy for a tier 1 who can’t get inbound filters and a responsive NOC implemented correctly - things which were table stakes in 2009.


2009? You are being generous. I'm pretty sure when I was managing BGP announcements for my small ISP in 1999 route filtering was a thing.


Exactly. All these comments about how rude Cloudflare is forget that this style of public shaming of ASes that can’t perform basic hygiene on their own network has been the norm rather than the exception for over twenty years. And further, all the surprise that Cloudflare was quick to report - here’s the deal: BGP doesn’t lie. The second something is wrong, everyone knows who did it. There’s zero mystery. It’s not like some grand caper that takes months of investigations. Operators basically have one rule - don’t leak bad routes to everyone else. That’s pretty much the only rule that’s a constant. When you break it your karma goes to zero, everyone dumps on you, but life goes on.


You don't need to be a massive player to initiate the screwup, but you kind of need a massive player like Verizon to amplify it for you. That's why the onus on them should be greater.


There’s no excuse for the personal slander here - “his beauty rest interrupted” is a mean-spirited, inappropriate way to discuss Cloudflare’s post. Please don’t insult people unnecessarily here.


I read it as a reasonable and slightly humorous way to acknowledge that the author is a human being and may have been (justifiably?) in a bad mood writing the post because lack of sleep does that. No idea of how sensible that suggestion is or isn't in this case but didn't really think it was derogatory. A human being is not a machine. Sanity-sleep might be a better term for it but nobody uses that whereas beauty-sleep is in common, albeit ironic, usage.

Let's all take our sleep seriously, yeah? :-)


Uh, I think the idiom just means “got woken up early.” I don’t see an attack there.


Likewise, don't insinuate people are ugly or don't deserve sleep.


Blog posts are their speciality


(deleted)

Yeah, that wasn’t contributing. Everyone has bad days.


Your profile mentions:

> Chief Architect, Information Security, Akamai Technologies. I do not speak for my employer.

Probably best to disclose this more directly in comments on topics related to competitors.


Why? Comments don't typically disclose "I am a corporate spambot" and yet such things exist, so you should probably read every comment as though it has some angle / vested interest. At least brians was nice enough to mention their bias in their profile. Most cases won't be so easy. Making people "opt in" to disclosing biases will just make it easier for the real bad actors to slip through.


brians seems like a good actor, which is why I assumed they might appreciate the feedback.

I agree there will always be bad actors on hn, which is too bad. I think the moderators try hard to combat it, which I am grateful for.


Hot take from Akamai.


Strong words for an outage that affected "Cloudflare, Amazon, Linode, Google, Facebook, and others"



