1128 UTC update Looks like we're dealing with a route leak and we're talking directly with the leaker and Level3 at the moment.
1131 UTC update Just to be clear this isn't affecting all our traffic or all our domains or all countries. A portion of traffic isn't hitting Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.
1134 UTC update We are now certain we are dealing with a route leak.
@dang etc.: could someone update the title to reflect the status page "Route Leak Impacting Cloudflare"
1147 UTC update Staring at internal graphs, it looks like global traffic is now at 97% of expected, so the impact is lessening.
1204 UTC update This leak is more widespread than just Cloudflare.
1208 UTC update Amazon Web Services now reporting external networking problem https://status.aws.amazon.com/
1230 UTC update We are working with networks around the world and are observing network routes for Google and AWS being leaked as well.
1239 UTC update Traffic levels are returning to normal.
So, the first alert to the status page took 20 minutes.
At some point you reach a company that's large enough that they must cooperate, because they want to remain in the business of being an actual responsible ISP.
And then there's Verizon, who can safely ignore any ISP etiquette because they have a de-facto monopoly.
But 50% of our traffic has gone!
Hopefully you are still working on it!
How the heck did their peers not manage to filter a sudden announcement for a range big enough that it managed to snag both 1.1.1.1 and 8.8.8.8? Do upstreams really allow a tiny /24 AS to randomly announce a /4 and get away with it? Or am I misunderstanding something fundamental about how BGP routes are allowed to propagate?
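For intuition on the scale, here's a quick Python sketch (taking the two resolver addresses above at face value): a /4 spans a sixteenth of the entire IPv4 space, so a single leaked /4 would cover both at once.

    import ipaddress

    # A /4 spans 0.0.0.0 - 15.255.255.255, a sixteenth of the IPv4
    # space, so one leaked /4 announcement covers both resolvers.
    leak = ipaddress.ip_network("0.0.0.0/4")
    for addr in ("1.1.1.1", "8.8.8.8"):
        print(addr, ipaddress.ip_address(addr) in leak)  # True for both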
That appears to be a steel/alloys company. Why are they operating BGP equipment?
It seems silly to me that an end-user company that provides no network services and only has a 256-address IP block has the ability to break a significant portion of the internet with a configuration mistake. There are several ways to set up dual ISPs and routing that don't involve such risk.
This issue still exists if you break up your IP space; it just makes it far harder to manage.
It doesn't surprise me at all that they are still a part of infrastructure, somehow.
Update: sorry, I may have been wrong. Hard to see clearly in the fog of BGP.
I'm curious why so much of this lies on Verizon's shoulders. Couldn't DQE and Allegheny have implemented the exact same best practices that Verizon should have, so it never leaked to Verizon's level? And to the extent non-Verizon subscribers were affected, couldn't their ISPs have implemented the same best practices in distrusting Verizon? Is Verizon directly responsible for routing that much of global traffic?
But I think at some point a network peering with Verizon trusts it to route things; i.e., if I as an ISP always go through Verizon to deliver traffic to Cloudflare, then the route it takes is out of my hands.
As for downstreams adding mitigation, ideally this would happen, but I would think you should place blame proportionally to resources and criticality. A ten-person ISP won't necessarily do everything right, and it shouldn't matter whether they do, since they're a small part of the internet.
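For what it's worth, the "best practices" in question largely mean filtering announcements, with RPKI origin validation as the emerging standard. A minimal sketch of the idea in Python; the ROA entry is illustrative rather than pulled from the real RPKI (AS 13335 is Cloudflare):

    import ipaddress

    # Toy ROA table: (prefix, max length, authorized origin AS).
    # Real validators fetch these from the RPKI; this entry is illustrative.
    ROAS = [(ipaddress.ip_network("104.16.0.0/12"), 24, 13335)]

    def validate(prefix, origin_as):
        net = ipaddress.ip_network(prefix)
        covered = False
        for roa_net, max_len, roa_as in ROAS:
            if net.subnet_of(roa_net):
                covered = True
                if origin_as == roa_as and net.prefixlen <= max_len:
                    return "valid"
        return "invalid" if covered else "unknown"

    print(validate("104.16.0.0/16", 13335))  # valid: right origin, allowed length
    print(validate("104.16.0.0/16", 64512))  # invalid: covered, but wrong origin AS

Had Verizon been validating origins along these lines, the leaked routes would likely have been dropped at its edge.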
I've been seeing it for about 10h now.
Update for datapoint: I'm in Bloomington, IN, on AT&T DSL.
8.8.8.8 was never marketed as a "ping me to see if the Internet is up" service, as far as I know. Just as a fast, public DNS server.
As I said, it's much easier to respond to a ping than even a cached DNS query. Or it would also be consistent to simply never respond to ping.
Now obviously in the modern "you get nothing for nothing" world, Google is able to violate whatever expectations they'd like. But "rate limiting" in a way that makes basic ping(8)s look flaky, especially on a service that will be used for debugging, is downright nasty and deserves to be shouted from the rooftops (iff it's true).
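Tangentially: if ICMP to a resolver really is rate limited, probing it with an actual DNS query is the more faithful liveness check anyway. A rough sketch, hand-rolling a minimal A query over UDP (assumes port 53 is reachable from your network):

    import socket, struct, time

    def dns_probe(server, name="example.com", timeout=2.0):
        # Minimal DNS query: header (ID, RD flag, 1 question), then the
        # QNAME as length-prefixed labels, QTYPE=A, QCLASS=IN.
        header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
        qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split("."))
        question = qname + b"\x00" + struct.pack(">HH", 1, 1)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            start = time.monotonic()
            sock.sendto(header + question, (server, 53))
            sock.recvfrom(512)  # any well-formed reply counts as "alive"
            return time.monotonic() - start
        except socket.timeout:
            return None
        finally:
            sock.close()

    for i in range(5):
        rtt = dns_probe("8.8.8.8")
        print(f"probe {i}: " + ("timeout" if rtt is None else f"{rtt * 1000:.1f} ms"))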
4.2.2.2 is not even meant to be used as a public DNS server (and has at times hijacked DNS requests to remind people of that). So it's weird to use 4.2.2.2 to criticize Google for blocking ICMP on their actually-public DNS server.
As I said, the crux of the problem isn't Google's "blocking", but rather making it intermittent. Obviously it's well within their rights to play whatever games they want - drop every other packet, vary the latency based on your IP, duplicate packets, or make it appear some queue occasionally holds your packets for 3 seconds. It's also within their rights to redirect all DNS lookups to an April Fool's page. And to do any of this selectively based on how many different Google services you use.
But that is not what any user expects, and in the end that's all protocols are - expectations. To me, the pushback I've gotten here fits right in with Surveillance Valley's general attitude of shirking responsibility with some fine print disclaimer, knowing full well what the constructive situation is. "I'm just going to go like this [spinning arms], it's not my fault if you walk into me".
If you can't see how people would expect to be able to reliably ping 8.8.8.8, or how intermittently dropping pings causes confusion (as in the original comment above), then I can't help you.
your argument boils down to "it is convenient for me, and I see other people stealing bags too".
1. It is straightforward to restrict a DNS server so that it only answers specific networks. This doesn't even need to be close to comprehensive to get the message across. Level 3's (née BBN's) intent is to continue to respond to the wider Internet community, regardless of what their ambient PR says. Likely for similar reasons that they run a looking glass.
2. The frequency and magnitude of your scenario makes it a straw man. A more worthwhile example is someone using a business's bathroom without buying anything. Yet most places don't really care as in the end it balances out, and we're all humans that have needs that can't be fully met by commercial provisions. The major concern is people who mess up the bathroom, paying or not.
3. While a common touchstone, theft does not apply, as nothing has been taken. Perhaps unjust enrichment. But given that anybody using 4.2.2.2 to answer production DNS queries is actually harming themselves with additional latency more than anything "taken" from Level3, that's a stretch too.
Have we really become so full of corporate bullshit that we're stuck analyzing things in its myopic paradigm? I thought this was Hacker News?
PS I notice 114.114.114.114 also responds to pings and DNS queries. Should I expect to get a bill for their services? Because I'd much rather just relish the feeling of a fleeting shared purpose with someone halfway around the world in a vastly different culture.
I wonder if it's related to this: https://en.wikipedia.org/wiki/BGP_hijacking. That article does say this kind of BGP incident can be a deliberate malicious attack.
edit: availability has been alternating between available and unavailable
If you just log in to Cloudflare and click the "orange cloud" icon on the DNS tab, which turns off proxying and points the domain directly back at your origin, you'll see the site come up within a couple of minutes.
Our CloudFlare stuff isn’t even pingable. Sometimes it’ll return an echo from a faraway DC.
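If you'd rather script that than click, the same toggle is exposed in Cloudflare's v4 API as the proxied flag on a DNS record. A rough sketch with placeholder IDs and token; check the current API docs before relying on it:

    import requests

    # Placeholders: fill in your own zone ID, record ID, and API token.
    API = "https://api.cloudflare.com/client/v4"
    ZONE_ID = "your_zone_id"
    RECORD_ID = "your_record_id"
    HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

    def set_proxied(proxied):
        # Toggle the orange-cloud (proxied) flag on one DNS record.
        resp = requests.patch(
            f"{API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers=HEADERS,
            json={"proxied": proxied},
        )
        resp.raise_for_status()
        return resp.json()

    set_proxied(False)  # point the record straight at the origin during the incident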
It’s been like this for over an hour now and your status page doesn’t even acknowledge it apart from “Network performance issues”.
(The average CF user has no idea what a route leak is, tbh.)
They do this by announcing something like "send me all traffic for 192.0.2.0 - 192.0.2.255", and if their peers don't verify it, they'll just start routing that traffic to them. Peer by peer, the route then propagates to a larger portion of the internet, and as routers learn the new bad route, more of the traffic to those IPs gets sent to the wrong network.
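A toy illustration of why an accepted bad route wins: forwarding uses longest-prefix match, so a leaked more-specific prefix silently attracts the traffic. Addresses below are from the RFC 5737 documentation range, purely illustrative:

    import ipaddress

    # Toy forwarding table: prefix -> next hop.
    routes = {ipaddress.ip_network("192.0.2.0/23"): "legitimate-peer"}

    def lookup(dst):
        addr = ipaddress.ip_address(dst)
        matches = [net for net in routes if addr in net]
        # Longest-prefix match: the most specific route wins.
        return routes[max(matches, key=lambda n: n.prefixlen)] if matches else "default"

    print(lookup("192.0.2.10"))  # -> legitimate-peer
    # An unverified, more-specific announcement is accepted...
    routes[ipaddress.ip_network("192.0.2.0/24")] = "leaking-peer"
    print(lookup("192.0.2.10"))  # -> leaking-peer: the leak now attracts the traffic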
Anyway, no, HN is not on Cloudflare, at least at the moment.
Maybe it's turned on selectively for high traffic / bot situations?
"The network is unreliable" is a rule of thumb that was drilled into my head in network programming class.
It always has been, it always will be. Doesn't matter if it's the internet or the link between your computer and a device sitting on your desk. And it doesn't matter what the tech is.
Making the internet more resilient only increases the severity of the eventual failure: organizations that don't understand the risk they're taking on are hit that much harder when a network outage does come.
The network is unreliable.
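The coding corollary: assume every network call can fail and bound it. A minimal sketch in Python; the URL, attempt count, and backoff schedule are arbitrary:

    import time
    import urllib.request

    def fetch_with_retries(url, attempts=4, timeout=5.0):
        # Assume every network call can fail: bound it with a timeout,
        # retry a few times, and back off exponentially between tries.
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except OSError:  # covers URLError, timeouts, connection resets
                if attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # 1s, 2s, 4s: let transient faults clear

    data = fetch_with_retries("https://example.com/")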
I started seeing delays of up to 300 seconds! At best there was a 1-second delay. I wondered if I was going to have to present "Why we've decided not to go with Cloudflare!"
Any longtime Cloudflare users comment on how rare an event this sort of thing is? It seems rare from eyeballing the recent alert history.
On the plus side, I did get to test the "Pause CloudFlare" button in a real-world scenario!
I've been a Cloudflare user for 7+ years and a Cloudflare Enterprise user for 2 years. Before joining Enterprise, Cloudflare would suffer some kind of global or localized network outage (that impacted our operation) about once or twice a year. Most localized ones don't really get reflected on the status page properly. After joining Enterprise, this is actually the first observed incident we've encountered so far.
Though it might not be a Cloudflare-only thing because funny thing is... Verizon Fios is also down for everyone I've talked to this morning.
It's saved me lots of time and energy.
The description of it as a leak, AFAICT, seems to be due to CF getting first dibs on the announcement[†] and positioning it as such. However, I firmly believe that had the general tech press gotten ahead of it first, it would still have been treated much more generously than we treat China leaks.
Under that protocol, the various systems announce which prefixes they can route, which then informs the other networks' routing decisions.
By error or malice, a system can announce a prefix it cannot or should not route, causing other systems to start routing traffic across it. The result is anything from weird routes (such as ones going through certain suspicious countries) to poor performance to no connectivity at all for the traffic routed that way.
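Here's a tiny simulation of the "other systems start routing traffic across it" step, under the assumption that nobody filters; the peering graph is entirely made up:

    from collections import deque

    # Made-up peering graph: each AS naively re-advertises every route
    # it hears to all of its peers, with no filtering anywhere.
    peers = {
        "leaker":   ["upstream"],
        "upstream": ["leaker", "tier1-A", "tier1-B"],
        "tier1-A":  ["upstream", "isp-1"],
        "tier1-B":  ["upstream", "isp-2"],
        "isp-1":    ["tier1-A"],
        "isp-2":    ["tier1-B"],
    }

    def propagate(origin):
        # Breadth-first spread of one announcement across the graph.
        learned, queue = {origin}, deque([origin])
        while queue:
            for peer in peers[queue.popleft()]:
                if peer not in learned:
                    learned.add(peer)
                    queue.append(peer)
        return learned

    print(propagate("leaker"))  # every AS in the graph ends up accepting the leak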
Eventually some governments will have to get involved...
The protocol is fine.
BGP was designed for operators to implement a routing policy. In most implementations it allows everything by default with no modifications to route metadata, so if you do not set up your policy correctly you'll have issues like this.
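As a sketch of what "setting up your policy" means in practice, the per-peer prefix-list idea boils down to this (peer names and prefixes are made up):

    import ipaddress

    # Per-peer prefix lists: only accept announcements the peer is
    # expected to originate.
    ALLOWED = {"customer-A": [ipaddress.ip_network("198.51.100.0/24")]}

    def accept(peer, prefix):
        net = ipaddress.ip_network(prefix)
        return any(net.subnet_of(ok) for ok in ALLOWED.get(peer, []))

    print(accept("customer-A", "198.51.100.0/24"))  # True: on the list
    print(accept("customer-A", "0.0.0.0/4"))        # False: the leak is dropped at the edge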
What you are describing is a protocol problem.
Such a system would also confine human error to a smaller set of possible faults.
Care to explain that one?
It's a route leak, which can affect any number of ISPs, because the BGP protocol is totally unauthenticated.
Count  ASN       Organization
   90  AS13335   Cloudflare, Inc.
   18  AS7018    AT&T Services, Inc.
    8  AS63949   Linode, LLC
    8  AS2828    MCI Communications Services, Inc. d/b/a Verizon Business
    6  AS26769   Bandcon
    6  AS16509   Amazon.com, Inc.
    4  AS6428    CDM
    4  AS2914    NTT America, Inc.
    2  AS9808    Guangdong Mobile Communication Co. Ltd.
    2  AS6939    Hurricane Electric LLC
    2  AS62904   Eonix Corporation
    2  AS55081   24 SHELLS
    2  AS54113   Fastly
    2  AS46606   Unified Layer
    2  AS45899   VNPT Corp
    2  AS4246    New Jersey Institute of Technology
    2  AS3257    GTT Communications Inc.
    2  AS27695   EDATEL S.A. E.S.P
    2  AS22781   Strong Technology, LLC.
    2  AS20473   Choopa, LLC
    2  AS16625   Akamai Technologies, Inc.
    2  AS12129   123.Net, Inc.