Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline (cloudflare.com)
863 points by steveklabnik 61 days ago | 274 comments



Can someone explain why the optimizer would split one route into two? Wouldn’t it be more optimized to coalesce routes whenever possible?


If an ISP has multiple physical connections that it could use to reach Cloudflare's network, it makes sense to distribute the traffic that's addressed to different IPs across different links, instead of using a single route that sends all the traffic over one link and leaves the others idle.
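To make that concrete, here is a rough sketch of how splitting one prefix into two more-specific halves lets a router spread traffic across links. The prefix is a documentation range and the link names are made up for illustration; none of it is taken from the actual incident. Routers prefer the longest matching prefix, which is also why leaked more-specifics can pull traffic away from the legitimate covering route.

```python
import ipaddress

# Hypothetical covering prefix (192.0.2.0/24 is a documentation range).
covering = ipaddress.ip_network("192.0.2.0/24")

# Split it into two more-specific halves: 192.0.2.0/25 and 192.0.2.128/25.
halves = list(covering.subnets(prefixlen_diff=1))

# Hypothetical routing table: each half is pinned to a different link,
# and the original covering route is kept as a fallback.
routes = {
    covering: "link-a",
    halves[0]: "link-a",
    halves[1]: "link-b",
}

def pick_link(address, table):
    """Return the link for the longest prefix that contains the address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]

print(pick_link("192.0.2.10", routes))
print(pick_link("192.0.2.200", routes))
```

With only the /24 in the table, both addresses would go out the same link; with the two /25s, the upper half of the range moves to the second link.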


Good point. Thanks.


They automated a system that relies solely on trust? ...China hackers are surely having a great day.


From the post:

>"It doesn't cost a provider like Verizon anything to have such limits in place. And there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place."

Is "sloppiness or laziness" really the only possible attribution here? I'm not a big fan of Verizon but I'm a big fan of civility and empathy, two qualities which your blog post lacks. Outages are a really unfortunate fact of life. We've seen them recently with Google, AWS, Dyn - all companies where technical competency is generally not questioned. It's quite possible the cause of this outage was some "perfect storm" scenario, such as an eBGP router that rebooted and came up with a stale or incorrect config. "Perfect storm" scenarios even happen at companies with very rigorous engineering cultures, as we saw with the most recent Google outage.

Your attempt to shame an organization without knowing all the details reeks of immaturity and pettiness. Ditto for your willingness to turn this into yet another Cloudflare marketing opportunity. Have you forgotten about your own Cloudbleed incident? How would you feel if a security company took that as an opportunity to shame you for "sloppiness or laziness"? Or if some other company's CEO was offering to send people "Cloudbleed Support Group" T-Shirts on HN, as your own CEO is doing in this thread?

Lastly, RPKI isn't a silver bullet: RPKI authorities can also be misconfigured and attacked[1][2]. This happened with the LACNIC incident in 2013[2]. It's also worth mentioning that RPKI potentially creates new threats[2]. But again it seems more important to you to use this as a marketing opportunity and promote yourself while throwing someone else under a bus while uttering pithy summations.

Also from your post:

>"And, in particular, we're looking at you Verizon — and still waiting on your reply."

Although Verizon is the 400lb gorilla in the room, their NOC and network engineers are still regular people with kids and families and feelings. They are also people who have had a really shit day today. Why you can't extend just a bit of human compassion, and instead feel compelled to try to shame them, is quite inexplicable.

You may think that your blog post was a marketing coup but I see it as a massive failure in both leadership and civility.

As a thought exercise maybe Cloudflare leadership could think about how they would like the community to react the next time they are at fault.

[1] https://www.cs.bu.edu/~goldbe/papers/hotRPKI.pdf

[2] https://www.cs.bu.edu/~goldbe/papers/sigRPKI_full.pdf


Cloudflare reached out multiple times in multiple ways to Verizon, to attempt to resolve the situation.

More than eight hours on, after utilising everything from what they were told was a Tier 1 support line to Twitter, they have nothing.

Even if we're kind to Verizon about the network failure, which was a global issue, they haven't done anything or said anything to suggest that Cloudflare should be treating them kindly in any way.

Not even a "we're aware, we're handling it".

Ghosting one of the world's largest (as in utilised) companies is not wise for administrative, technical or PR reasons.

Verizon have shown a complete lack of leadership.


Have you ever worked for a Tier 1 ISP during a big outage? There is not enough personnel bandwidth in a NOC for everyone to get an individual response.

>"Ghosting one of the world's largest (as in utilised) companies is not wise for administrative, technical or PR reasons"

Oh the Cloudflare marketing machine. Largest by "utilized"? What does that even mean? Cloudflare is not a Tier 1, a Tier 2, or a major eyeball network. They are pretty far down in the pecking order despite what your marketing department wants us to believe. There's always some fuzzy stat isn't there?

Being too inundated to respond to everyone on the day of an outage is a human resource problem, plain and simple. The fact that you have taken this so personally is kind of embarrassing. What this blog post, the opportunistic marketing ploy and finger pointing have shown is a complete lack of maturity on your part. You want to call out Verizon for their behavior yet your own behavior is unnecessarily aggressive.


> The fact that you have taken this so personally is kind of embarrassing.

What? I have said nothing personal.

> What this blog post, the opportunistic marketing ploy and finger pointing have shown is a complete lack of maturity on your part.

Ah. You seem to be confused. I am not affiliated with Cloudflare, and have not worked with Cloudflare at any point in time.


This is a pretty rude response to someone who doesn’t even work for cloudflare.


> regular people with kids and families and feelings.

This is just an appeal to emotion. No-one is even calling out any individual people. With a company of this scale and responsibility, individuals shouldn't even come into the discussion, and there should be multiple levels of redundancy. Verizon, collectively, is being shamed.

Verizon should be compared to a power plant, not a SaaS provider or some 3-person dev shop.


>"This is just an appeal to emotion."

Not at all, it's an appeal to civility. The statement that Cloudflare made with "there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place" is the appeal to emotion here.

>"No-one is even calling out any individual people."

No, that is a very clear attempt to call out a specific group of people who work in the network engineering department.


It was gross negligence that led to this. Nothing more.


Do you work for Verizon's NOC or Network Engineering department then? You have inside knowledge that it was negligence? Because I provided a specific scenario where it would not be "gross negligence."


> Because I provided a specific scenario where it would not be "gross negligence."

No, you didn't. You provided a vague conjecture for how the initial cause of the problem might not have been gross negligence, but offered no hypothesis for why Verizon isn't answering the Red Phone.


Yes I did. It's the part where I clearly state it's possible that a router that went offline or rebooted came up with a stale or incorrect config. This actually happens occasionally. I've been on both ends of it. Clearly Verizon has inbound prefix filtering in place, as this is not some common occurrence for AS 701.

Verizon was in contact with people yesterday. I have spoken to two people from two other carriers who were in touch with them. And you are just parroting the idea that because Cloudflare didn't get a response, Verizon wasn't responding to anyone, period. And that's just not true. The fact that you think there's some red phone on which just anyone can call the NOC and magically speak to someone during a major outage shows you have no practical experience with the things you are commenting on and criticizing.


> Do you work for Verizon's NOC or Network Engineering department then

If that's what it takes to be able to decide, then I guess we can safely declare that it wasn't negligence, because no-one that fits that description will ever publicly admit to it.


No one has forgotten Cloudbleed. It's something we talk about internally and every single day I look at a report that shows me status of software running around the world so that I never, ever again let a piece of software running on our edge crash and leak information.


The point wasn't whether or not Cloudbleed was forgotten but rather acting with some civility towards others when these mistakes do happen. Your suggestion that "there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place" is absurd. Misconfigurations and mistakes are a fact of life; they happen to everyone. To suggest that AS 701, which is old enough to have a 3 digit ASN, somehow doesn't use any ingress prefix filtering as a matter of course is disingenuous at best. The cause is quite likely a misconfiguration on a single interface on a single router. I think you know this though.


Yeah, because everyone was so civil towards me and Cloudflare when Cloudbleed happened.


(You handled both admirably well, I would just stop responding to this person.)

bogomipz 60 days ago [flagged]

I'm sorry is holding a differing opinion and discussing that in a civil manner such an issue for you that you feel compelled to step in and offer prescriptive advice? Long live the filter bubble I guess huh?


Would you please stop posting like this to HN? At a certain point it only weakens your case, and you passed that point a while ago.

https://news.ycombinator.com/newsguidelines.html


Mistakes happen and CloudFlare's response to the memory leak was excellent.



