Hacker News
Cloudflare Network Performance Issues (cloudflarestatus.com)
631 points by drcongo 15 days ago | 310 comments



Cloudflare: Your status page showed "all systems operational" for over 20 minutes while your primary domain was returning a 502 error. Please change this to update automatically, many other engineering teams depend on you.

https://i.imgur.com/qHBM2JW.png


Sadly this reminds me of AWS outages too where the same applies. How is it that hundreds of developers know there's an issue before AWS do, or Cloudflare in this instance. See my blog post on similar AWS uptime reporting issues at https://www.ably.io/blog/honest-status-reporting-aws-service.

At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare did: https://status.ably.io/incidents/647

We have machine-generated incidents created automatically when error rates increase beyond a certain point, stating "Our automated systems have detected a fault; we've been alerted and are looking at it". See https://status.ably.io/incidents/569 for example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
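A minimal sketch of that threshold-triggered idea, assuming a hypothetical status-page API behind `create_incident()` (Ably's actual pipeline is not public):

```python
# Rolling-window error-rate monitor that opens a status-page incident
# automatically. create_incident() is a placeholder for a real
# status-page API call (e.g. POST /incidents).
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=1000, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request errored
        self.threshold = threshold
        self.incident_open = False

    def record(self, is_error):
        """Record one request outcome; return a new incident if the
        error rate just crossed the threshold, else None."""
        self.outcomes.append(is_error)
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.threshold and not self.incident_open:
            self.incident_open = True
            return self.create_incident(rate)
        return None

    def create_incident(self, rate):
        return {
            "status": "investigating",
            "message": "Our automated systems have detected a fault; "
                       f"error rate is {rate:.1%}. We've been alerted "
                       "and are looking at it.",
        }
```

The key design point is that the incident opens on observed error rates, not on a human noticing.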


Heh, I am reminded of when the control plane at AWS went down... and we had a custom autoscaling config that would query for the number of instances running and scale appropriately... but when the AWS API died... we kept getting zero running instances...

So our system thought none were running and so it kept launching instances....

These were spot instances and thus only cost like $0.10 per hour...

But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...

We had to reboot the entire system. Which took forever.

The only good thing was that this happened at 11am, so all team members were online and available... and then AWS refunded all costs.
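The failure mode in this story generalizes: the bug is mapping "the API call failed" onto "zero instances running". A sketch of the buggy logic next to a safer variant, with `list_instances` standing in for the real cloud API (hypothetical names):

```python
# Contrast a count function that treats an API outage as "nothing
# running" with one that fails closed by reusing the last good reading.

def running_count_buggy(list_instances):
    try:
        return len(list_instances())
    except Exception:
        return 0  # BUG: an API outage now looks like zero instances

def running_count_safe(list_instances, last_known):
    try:
        return len(list_instances())
    except Exception:
        return last_known  # fail closed: keep the last good reading

def scale_decision(running, desired=10):
    """Number of instances to launch this cycle."""
    return max(0, desired - running)
```

With the buggy version, every cycle during an API outage launches `desired` more instances; with the safe version, the autoscaler simply holds steady.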

---

The other fun time was when a newbie dev checked in AWS creds to git - but he created the 201st repo (we had only paid for 200) -- and as it was the next repo which wasn't paid for, it was by default public - thus slurped up by bots asap - which then used the AWS creds to launch bitcoin mining bots in every single region around the globe. Like 1700 instances.

The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...


> but he created the 201st repo (we had only paid for 200)

That's an odd choice of a failure mode.

> AWS still refunded all costs...

Yeah, they should. It was their silly design choice that led to disclosure of secrets, after all.

What kind of failure mode is that, even? Failing to create the repo would have led to a better user experience for sure.

Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?


>Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?

Exactly! What a stupid UX design.


>It was their silly design choice that led to disclosure of secrets after all.

Wait, was the 200 private repos issue an AWS thing or a GitHub/GitLab/whatever thing?

What AWS product has a concept of private/public repos and limits on how many of the former you can get for a certain price?


It was a git thing.

Never post aws secrets to git.
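One common mitigation is scanning commits for key material before they leave your machine. A toy sketch of what tools like git-secrets or gitleaks do; the regex here covers only AWS access key ID prefixes, which is a simplification:

```python
# Minimal pre-commit secret scan. AWS access key IDs are 20 characters
# starting with a known prefix (e.g. AKIA for long-term keys, ASIA for
# temporary ones).
import re

AWS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def find_aws_keys(text):
    """Return every string in text that looks like an AWS access key ID."""
    return AWS_KEY_RE.findall(text)

def precommit_check(files_contents):
    """files_contents: {path: text}. Returns paths that look leaky."""
    return [path for path, text in files_contents.items()
            if AWS_KEY_RE.search(text)]
```

Wired into a pre-commit hook, a non-empty result would abort the commit before the creds ever reach a (possibly public) remote.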


>How is it that hundreds of developers know there's an issue before AWS do

Trust me, they know.

They know about problems we never find out about, too.


At least in the case with AWS, unfortunately there's business involved - because of their uptime guarantee, incidents that would be called downtime by a purely technical team are left as "operational" or "partly degraded". Otherwise, they might have to shell out millions or tens of millions.


You have to provide "your request logs that document the errors and corroborate your claimed outage" for the AWS Compute SLA https://aws.amazon.com/compute/sla/


slightly off-topic: Nice and clean, yet detailed enough status page. I like it. :)


Thanks :)


On the OT subject, the status logo is blurry (at least on a 4k monitor), whereas on the main website it's nice and crisp.


Ok, thanks for the heads up. Will raise a status website issue.


Are you sure it wasn't cached? For us it was showing correctly the incident...


If cloudflarestatus.com isn't serving text content as Cache-Control: no-cache, that's itself an egregious bug.


It serves Cache-Control: max-age=0, private, must-revalidate. Same thing as far as I can tell, maybe with wider support.
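The two header values being compared can be checked mechanically. A small sketch of a Cache-Control parser showing why both force revalidation before a cached copy is reused:

```python
# Parse a Cache-Control header into a directive dict, then ask whether
# a cache must revalidate with the origin before reusing the response.

def parse_cache_control(value):
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name] = arg or True
    return directives

def must_revalidate_before_use(value):
    d = parse_cache_control(value)
    # "no-cache" means store-but-revalidate; "max-age=0, must-revalidate"
    # makes the response stale immediately and forbids serving it stale.
    return ("no-cache" in d
            or (d.get("max-age") == "0" and "must-revalidate" in d))
```

Both of the headers discussed above make this return True, which is why they behave the same in practice for a status page.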


Their status page is hosted on CloudFront, which perhaps adds an unintended caching layer.


Is it? From my end it looks like a plain ec2 instance from statuspage.io.


CloudFront supports Cache-Control: no-cache


CloudFront itself can cache the content regardless (if its headers are showing a HIT).


It might have been cached on the server but was not cached on my clients: I did a forced refresh and used a private browser window but it only showed the “almost everything is good” view throughout the duration of the incident.

The actual incident page was correct.


Can confirm that we had the same view. Operational while not working.


All I saw was a vague message at the top, all services were marked green and operational throughout.


They should derive whether something is down from how much traffic the status page gets, and have alerts tied to that as well. I think it would be pretty accurate.


We saw an update within 5 minutes of our sites going down. Are you doing caching somewhere?


Interesting, I was command+shift+R refreshing and also tried from a VPN in another region. Perhaps our CF-hosted sites returned 502s in my region sooner than yours, causing me to check the status page sooner.


Seems unlikely; I don't use Cloudflare and have never visited the status page before. I noticed several 502 pages this morning, searched for "cloudflare status page", and saw "all systems operational".


It isn't in their best interest to have a reliable status page. They want the status page to contain information about failures only after people know about issues through other means. For the same reason, cloud providers don't provide rebates on outages unless you ask for them explicitly.

I think the only way is to have the status page operated by an independent 3rd party, but I don't think there's a viable business model for someone to provide such a service. Perhaps there might even be a risk of lawsuits against you.


It was showing correctly for me the first time I checked it.


Why even have a status page if it doesn't reflect reality?


Once cloudflare.com came back I decided to check out their business SLA, and it's not very encouraging:

> For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes

- https://www.cloudflare.com/business-sla/

So assuming an outage affects 100% of your users (this one seems like it did, but that's not clear), they only refund the time the service was offline? According to pingdom this outage lasted ~25 minutes, so that's 25/(31 * 24 * 60) = .056% of our bill, roughly 11 cents.

It sounds like you just don't pay for the time the service didn't work, which isn't much of a guarantee, that's just expected (of course you shouldn't pay for services not provided). Most SLAs for critical services have something like under 99.99% uptime you get 10% of your bill back, under 99.5% you get 20% back, under 99% you get 50% back. (*Numbers completely made up to demonstrate the concept.)

Am I misreading this? Morning coffee hasn't kicked in yet so maybe I am.
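Running the quoted formula through code gives the same answer as the arithmetic above (assuming a 31-day month and an outage affecting all customers):

```python
# The Business SLA formula as quoted:
#   Service Credit = (Outage minutes * Affected Customer Ratio)
#                    / Scheduled Availability minutes

def business_sla_credit(outage_minutes, affected_ratio=1.0,
                        scheduled_minutes=31 * 24 * 60):
    """Credit as a fraction of the monthly bill."""
    return (outage_minutes * affected_ratio) / scheduled_minutes

# A 25-minute outage affecting everyone in a 31-day month:
credit = business_sla_credit(25)  # ~0.00056, i.e. about 0.056% of the bill
```

So on a $200/month bill the credit for this outage would indeed be on the order of 11 cents.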


This doesn't surprise me at all - SLAs are wildly overrated. No SLA will cover damages incurred by lost business due to an outage. What you likely want is some kind of third-party insurance for downtime caused by outages out of your control - but I'm not even sure this exists.


I guess you’ve never worked in enterprise. SLAs for critical systems very frequently incur payback in excess of the billing on outages.


But not consequential losses which is what the parent mentions.


I'm definitely not suggesting CF should cover losses. Sorry if I gave that impression. That would effectively require them to be an insurance company since they'd have to investigate claims, and possibly charge customers differently based on risk. (i.e. you don't want to bill a customer $200 per month if 10 minutes of downtime could lose $20 million in sales.)

I mean something like Amazon EC2's SLA (https://aws.amazon.com/compute/sla/) where credits are proportional to downtime, but not 1:1. i.e. they credit 100% for >= 5% downtime. With Cloudflare's SLA, 5% downtime (1.5 days in a month) would only give you a 5% credit.


There are different types of contingent business interruption insurance available. If you're large enough (like Fortune 500), you can negotiate the terms of your policy.


This does exist. It’s just a matter of who pays for it.


Also, the standard SLA you get will be wildly different from the bespoke contracts negotiated by enterprises. Just depends on your spend.


This type of insurance does exist. Speak to your broker.


They reserve the best SLA for Enterprise, naturally.

https://www.cloudflare.com/plans/enterprise/

> 100% uptime and 25x Enterprise SLA

> In the rare event of downtime, Enterprise customers receive a 25x credit against the monthly fee, in proportion to the respective disruption and affected customer ratio.


Two years missing revenue for enterprise customers, ouch that's going to hurt.


25x the downtime, so about 10 hours' worth of charges if the downtime was 25 minutes.


you missed the:

>in proportion to the respective disruption and affected customer ratio.

so no they aren't losing 2 yrs of revenue


That's the business SLA. The Enterprise one is 100% uptime with a 25x payback so those are the ones they are focusing on keeping up. We were with Verizon before CloudFlare and their SLA was a similar pay back for outage. I think this is pretty typical, what service did you have that had a different setup for the SLA?


Amazon EC2 has an SLA like I mentioned: https://aws.amazon.com/compute/sla/


AWS CloudFront SLA refund/credit policy appears to work as you describe: https://aws.amazon.com/cloudfront/sla/

Monthly Uptime:

   * 99.0% <= uptime < 99.9%:  10% service credit
   * 95.0% <= uptime < 99.0%:  25% service credit
   *  0.0% <= uptime < 95.0%: 100% service credit
But CloudFlare's Enterprise SLA (25x credit) is similar or maybe even a little bit better (because you get to 100% at 96% instead of 95%). Of course when you are doing an Enterprise deal you can negotiate for whatever terms are mutually acceptable as long as you're willing to pay.

In any case, the function of the credit policy is to ensure there is enough pain for the provider to put in place the quality/reliability practices, processes and code to protect themselves from losses. IMO most sustainable businesses pull in much more revenue per hour than 25x their CDN cost.
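The two credit models discussed here can be compared directly. A sketch using the CloudFront-style tiers quoted above and a proportional 25x credit capped at a full refund (tier boundaries taken from the comment, not from either provider's current terms):

```python
# Tiered vs. proportional SLA credits, as fractions of the monthly bill.

def cloudfront_style_credit(uptime_pct):
    if uptime_pct >= 99.9:
        return 0.0
    if uptime_pct >= 99.0:
        return 0.10
    if uptime_pct >= 95.0:
        return 0.25
    return 1.00

def enterprise_25x_credit(uptime_pct):
    """25x the downtime fraction, capped at a full refund."""
    downtime_fraction = (100.0 - uptime_pct) / 100.0
    return min(1.0, 25 * downtime_fraction)
```

Under the 25x model the credit reaches 100% of the bill at 96% uptime, which is the "gets to 100% at 96% instead of 95%" point made above.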

It would be interesting to know how CloudFlare's infra and processes differentiate free, Business and Enterprise customers.


25/(31 * 24 * 60) == 0.056%, not 5.6%.


Fixed, thanks. Like I said about the coffee. :)


That may be the public SLA for people who create an account without private contracts. Large businesses never use the standard SLA terms.


I left Cloudflare for AWS a long time ago despite CF's affordability since they didn't seem to care that much about uptime or quality. Their frontend was corrupting response bodies + caching response bodies (in retrospect, this was probably pre-discovery cloudbleed) and there was no way to get a response or help with it.


not sure what cloudflare costs but compared to the amount our business lost in those 30min, it probably does not matter much.


Throughout this outage, https://www.cloudflarestatus.com/ continued to show green, with almost all services marked 'operational' and only a vague, cryptic message about users in this region being affected:

Investigating - Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region.

It seemed very much like a global outage affecting all services. Is this status page not automatically updated with service status, or is it just manually updated by humans? Even if manually updated, surely when posting that status message, the status of all the services should be set to degraded?


This is not my experience; I received many updates, both through email and updates to the Cloudflare status page, throughout the incident, except for possibly the first 10 minutes.


You're probably talking about the yellow note at the top of the page (which is still there, with a little more detail now). That was updated and is fine.

I'm talking about the service indicators for each service lower down the page, which remained green throughout and appear to be just decoration, not an actual indication of service status, they all said operational throughout the incident (I reloaded a few times).

In particular I'm thinking of the Cloudflare Sites and Services section.


A great example of why you shouldn't transfer your domain to Cloudflare Registrar if you're also using their CDN. Those who have transferred their domains cannot change DNS servers to mitigate the outage.


You can set up your domains using their CNAME method. You do not have to delegate your entire domain to them. https://support.cloudflare.com/hc/en-us/articles/36002061511...

Together with a short TTL we were able to recover without relying on their dashboards.


Only if you are a paid customer. The free service does not allow this, and this is why I do not use Cloudflare for personal use.


For personal use you also probably don't need the high availability of switching over your domain the moment they are having problems?


One thing: not every free or Pro plan on Cloudflare is personal use.

I'm running the web servers, official wiki, and game external resource portal for the most active open source video game on GitHub through Cloudflare, and we might not want our 60-million-requests-a-month website to go down when Cloudflare does.

Because I can tell you right now our $300-a-month budget (which, mind you, covers 7 game servers that can each handle 100 connected players) can't take the $80 hit just to make Cloudflare not a single point of failure.


If you don’t count this as a personal or hobby project, it’s a community project. They don’t have a pricing plan for that, so you either have to go Pro or go to some other provider who gives you this much for free. Why should a company give you even more Pro features for free if you are already getting a lot of things for free?


Pro ($20) does not give you the CNAME feature; Business ($100) does.


He’s not arguing that they should. He’s just pointing out the need for your own plan B.


Ah, it used to be an Enterprise-only feature. We have access to that, but didn't realise it's now available to Business. Perhaps time for them to consider offering it more widely!


No good if your Registrar (Namecheap) is behind CF as well!


That would seem like a solid reason to avoid Namecheap.

Edit: Wait, do they use Cloudflare?

  $ dig namecheap.com +short
  198.54.117.250
whois:

  CIDR:           198.54.112.0/20
  NetName:        NAMEC-4
  Organization:   Namecheap, Inc. (NAMEC-4)
  Updated:        2015-11-13
https://whois.arin.net/rest/net/NET-198-54-112-0-1.html


They don't seem to use CF's nameservers, but try the www. subdomain:

    $ kdig +short www.namecheap.com
    www.namecheap.com.cdn.cloudflare.net.
    104.16.99.56
    104.16.100.56


I got a Cloudflare captcha just yesterday when logging in to Namecheap.


Changing your NS records at the registry could help, but keep in mind most TLDs are serving NS records with 1-2 day TTLs, so you'll still see a lot of traffic going to the old server.

If this is something you want to be able to mitigate, you really need to be running a separate DNS infra from your hosting/CDN and use short-TTL CNAMEs to delegate hostnames to the CDN. This becomes a big challenge if you host on an apex domain (e.g. example.org instead of www.example.org), so don't do that.


I use namecheap to manage my name servers. It's also down.


Yeah, I'm regretting that very choice now.


The problem is not just about using Cloudflare as your registrar, really. Even if you have a different registrar, the Cloudflare model is to have the NS (nameserver) records set up to point to Cloudflare, which then in turn resolves the DNS. You cannot really use Cloudflare without that setup. Changes to nameservers at a registrar level are rarely quick, at least not quick enough to mitigate a disaster like this. It's why we've used two completely different domains at Ably (ably.io and ably-realtime.com) for all services we provide.

We wrote about a strategy to circumvent this sort of thing a little while back: https://www.ably.io/blog/routing-around-single-point-of-fail.... Given two incidents in a matter of weeks, I think a revisit of that article, in light of the fact that most businesses operate on a single domain, would be useful :)


We were able to recover by submitting an API call rather than using their UI. It was slow, but the API requests worked.


I'm guessing that they're going to be back up again before those changes can "propagate", given the relatively high TTLs most TLDs run with.


Really good point. I debated switching, and this exact reason is why I didn't. It's one thing I didn't want to have in the same bucket.


They have now released an initial statement [1]:

For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.

This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.

[1] https://blog.cloudflare.com/cloudflare-outage/


Ahh the "we test in prod" method


Yeah, except that wasn't meant to be the way things work.


> Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide.

Sounds like backtracking. If so, I'll bet there's a conversation happening about switching to re2.

edit: hmmm, https://github.com/cloudflare/lua-re2

I'm also curious why this rollout wasn't staged.
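For anyone unfamiliar with backtracking blowup, the textbook shape looks like this (the actual Cloudflare rule hadn't been published at this point, so this is illustrative only):

```python
# The classic catastrophic-backtracking pattern. The nested quantifiers
# mean that on a NON-matching input, a backtracking engine tries every
# way of splitting the 'a's between the inner and outer '+', which is
# exponential in the input length.
import re

EVIL = re.compile(r"(a+)+$")

def is_match(s):
    return EVIL.fullmatch(s) is not None

# is_match("a" * 20) succeeds immediately, while is_match("a" * 20 + "b")
# forces on the order of 2^20 backtracking attempts before failing, and
# each extra 'a' doubles the work. RE2-style engines (like the lua-re2
# binding linked above) run the same pattern in linear time because they
# never backtrack.
```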


Sorry, I would have posted this myself but was too busy.


It will be interesting to hear later whether the process was followed. Hard to believe they deploy changes to every production location at once.


Good.

I'm tired of people not learning that trusting a single gateway with 50% of the internet is bad.

Yes, I know, free DDOS protection. There has to be another way of doing this, some mesh based DDOS protection or so.


I agree that the centralization of the internet is troubling, but Cloudflare is solving a systemic problem that nobody else is tackling. The DDoS problem was not being solved for anyone except for enterprise customers until Cloudflare came along and to this day there is very little competition in this space. Your post makes it sound like making a "mesh based DDoS" system is somehow trivial. Who is going to pay for this? How does it work? How do you ensure latency is not atrocious? Why hasn't someone made this already? Cloudflare at least has a financial model that can be sustained, and it doesn't include harvesting all of our personal data.

Without CF, many websites would not stay on-line during an attack. And they would cease to exist because many of those places would never be able to afford DDoS protection. I know so many sites, including ones I run, that I would not be able to keep on the public internet without CF DDoS protection. There really is no real competition in this space.

I think we need to consider the fact that while this outage does take a lot of sites off-line at once, it is temporary, and it is still extremely rare. And the alternative is potentially that many websites would cease to exist at all, period, without something like Cloudflare existing.


> Your post makes it sound like making a "mesh based DDoS" system is somehow trivial.

It does not. "There has to be" ends in "!" and is an expression of a wish.

> Who is going to pay for this?

All of us. Everyone. I keep circling back to the idea that we should all join a protection ring, so we'd all share "cost" on this.

> How do you ensure latency is not atrocious?

> Why hasn't someone made this already?

The first one is very technical, and the answer is most probably by localisation and by only actually turning it on when needed.

For the second: because it's a very hard technical problem and there is basically no money in it. Business value maybe, but it would need people and companies to collaborate, it would probably need committee level decisions, and so far, nobody wanted to deal with this.

Or at least that's my theory.

EDIT Maybe dat:// will eventually become a viable option, and with that, due to the distributed nature, this kind of DDOS protection is sort of built in.


This seems like a case for putting more of the internet through a single gateway. Having my downtime correlated with everyone else's means users will be more forgiving, because they'll perceive it as "the Internet's down" rather than "lkbm's site is broken". (We saw this with CloudFlare. Some users were pissed, and others jumped in with "it's not their fault; AWS is down". That doesn't happen when our stuff specifically goes down.)

We need to decentralize the Internet, but this occurrence is not the reason. It's an argument to keep consolidating.


Centralizing more of the Internet to help people align their excuses is probably the worst reasoning I've ever read


But maybe this is how "the collective" works?


It's a horribly short-sighted, irresponsible and dangerous attitude. When your service goes down on its own, its users can switch to some backup process temporarily. If half the internet goes down, they are screwed. How much they're screwed depends on how they're using your services at the moment, which you most likely can't even know.


Nobody Gets Fired For Buying IBM.


> Nobody Gets Fired For Buying IBM...

... mainframes for business critical applications.

As for buying IBM Cloud services, it's a bit fuzzier.


Who cares whose fault it is? In the end, you're not able to provide service to your users and you potentially lose money.


Business risk cares. SLAs, SLOs, SLIs - they are all about this: being able to direct the blame.


The argument being made above only holds water if Cloudflare has worse uptime than smaller providers, and if that's because they're big.

This is often the case when monopolies realize they're in a position where they can get away with sucking, but I don't believe that to be the case with Cloudflare yet.


That's a strange perspective. Users only care that a service works and there is a long history of them blaming whatever is most visible.

If your site is down, it's your fault in their eyes, not some core infrastructure provider's.


CYA buddy, CYA

(sarcasm!)


I get your point. But I'm not sure I can ever grok the mindset of someone who thinks, "good, that'll teach 'em". Not sure I'd ever want to work with that individual.


For the kind of coworker I’d like, a grumpy old cynic beats the peppy corporate cheerleaders any day of the week.


I feel like that's a false dichotomy.


Fair point. I'm wondering, though, how else you make people understand something like this?


Education and positive reinforcement.

Teach people the right way to do things, and then make that behavior self-rewarding.


What if that behavior isn't self-rewarding and is actually expensive? Not using Cloudflare, Google Maps, Analytics et al. means you need to use something else, need to spend attention points somewhere, need to pay for the services. Very few people will do that because "it's the right thing to do".


I didn't say it was easy or obvious. If it were, I would be out of a job.


This is important to recognize. Victim blaming is a highly destructive practice.


Assigning blame for technical business decisions to the people who made those decisions is victim blaming?


Targeting the process or behavior that led to an event is far more effective than targeting the person that triggered it. Do not conflate the two.


I don't think there is another cheap and easy-to-administer way, or more people would be doing that. Also, CF is nice for the slight protection it provides against obvious bots and mass hacking attempts, and for decreasing requests for static page resources. And it also hides your IP somewhat (as long as your website doesn't leak its own IP, i.e. doesn't send emails except through Received-path-scrubbing services), meaning if you're careful you can run a website from your home on a Raspberry Pi without any issues.


I agree on principle, but as an end user, I must selfishly disagree. I find life after Cloudflare to be better than before it.

There's a Cloudflare location right next to me, and that improves latency across many of the websites I use, and makes their public DNS service the fastest.

I just wish they had active competition in the same space. CDNs have network locations nearby, but they don't offer an easy UX for relatively inexperienced website owners, and DDoS protection services usually have fewer network locations than content CDNs.


With a lot of these things, if people can't agree to make a standard system that does all of it just as well, then there's going to be a big company that does so. We have had this with e-mail, signed exchanges, and DDoS protection, and will probably have it with many other things unless people pull themselves together, at least after these proprietary solutions are created, and create better alternatives.


Their free CDN is my biggest draw to their service: I can handle millions of requests per day with a $5 VM, sane caching headers, unoptimized code, and Cloudflare free tier.


It is slightly ironic that the best defence for a distributed problem is to build a single point of failure.


9.6% not 50%


9.6% including providers like digitalocean. I'm skeptical about that 9.6%.


So it seems... Even https://cloudflare.com/ itself is down


More importantly, their admin dashboard is down. It's impossible to bypass their "orange cloud" proxies and send traffic directly to our hosting. That they can't flip a switch and have their nameservers send dash.cloudflare.com to a separate piece of redundant infrastructure is mind-boggling.


We were able to flip this switch on our services through their API, as we have our Cloudflare config in Terraform.

The API wasn't working perfectly, but with some retries we were able to change the config for our domains.


change the nameserver at the registrar to someone else


By the time that change propagates, Cloudflare will be back up.


Yes, but then next time you will be able to control your DNS.


Though, only if you're using a short TTL. I'm not arguing against your position in general though.


Agreed. The first thing I tried to do was log in to Cloudflare to disable proxying. But I can't.


Yep, I'm seeing 502's all over the web as of a few minutes ago.


Great, now we can't even disable HTTP proxy to allow traffic directly to AWS :/


Point your DNS back to your actual host? Might be the only short term solution, though DNS propagation times kinda rule that out :/


That's a no-go if you transferred your domains to Cloudflare Registrar.


Something something all your eggs in one basket :)


Seems like even other registrars might rely on Cloudflare (e.g. Namecheap) so now people have to continuously ensure there’s no cross-pollination between their infra providers...


I think the only option here would be to change our name servers at registrar level to point to AWS and recreate all DNS records there, but then you have to deal with name server propagation.


We use our own service.


EDIT: the static pages load but trying to log in just times out. I think the static cache is back up but the rest of it is still down.

Seems to be back up and running for me.


So cloudflare.com runs on Cloudflare?


Turtles all the way down.

But in all frankness, if Cloudflare's own site did not run off of Cloudflare's infrastructure, why would anybody trust them with their websites?


Ironic isn't it.


For the site itself that's totally fine, and the status page is separate as it should be, too.


This is fine-ish...

In case of network issues affecting their proxy, being able to change your configuration to allow direct traffic would be really nice.


Fair. Some service recently had their status page hosted on the same server that went down, that was funny.


I guess it's like diversifying one's assets - you put your status page elsewhere than on your own infrastructure.


Discord's status page is also down haha


I'm quite surprised Discord's status page is behind Cloudflare. I thought they were using statuspage.io.


They are, but that subdomain is proxied through Cloudflare. If they'd set it to just DNS then it would have still worked.


Certainly what I would use if I was building cloudflare.com...



That number matches what I am seeing on StatusGator: Of the 438 status pages we monitor, 52 of them are showing some kind of warn or down notice right now. That's almost 12%.

Though some of them might not be because of Cloudflare, the ones I spot checked all do appear related. Medium, DigitalOcean, Shopify, CodeShip, Pingdom, and many more. The impact is staggering.


Add to that all the sites using resources hosted on Cloudflare's CDN.


9.6% of websites that have one of those CDNs


Aah, I misread. So 7.5% of all websites.


How about a weighted percentage of the Alexa top 1m?


Update - Cloudflare has implemented a fix for this issue and is currently monitoring the results.

Description: Major outage impacted all Cloudflare services globally. We saw a massive spike in CPU that caused primary and secondary systems to fall over. We shut down the process that was causing the CPU spike. Service restored to normal within ~30 minutes. We’re now investigating the root cause of what happened.


https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr

(edit: thank you mods for changing the submission URL)


oh wow. that's a long list.


aka the whole world.


Don't want to go off topic, but if I want to prevent my website from going down because of stuff like this in the future, will having backup DNS entries solve the problem?

I know DNS will fall back if it can't reach a service, but would a 502 trigger that?


Yeah, you'll want to keep your domain registered through a different registrar, and if CF goes down you can update your DNS name servers to point from CF to something like AWS Route 53.

This has a few drawbacks, like making sure your Route 53 configuration is identical to your CF config, ensuring your origin servers can cope with the additional load if CF caching isn't available, and the DNS propagation time required for the name servers to update.

During the last outage, we were able to get into the CF dashboard and simply disable the proxy which allowed our clients to access our origin servers directly but this time we can't even get into the Dashboard.


Yeah, if I had access to the DNS records this would be easy, but like you said, even the dashboard is down.

Ideally, I'd want something where if Cloudflare goes down, I don't have to change anything, but... 502 isn't going to trigger that without some work on my part.

Meh.


If you received an HTTP 502 then DNS must've already resolved. Browsers typically do a DNS lookup, then try establishing a TCP connection to one of the returned hosts. It's only if it can't establish a TCP connection to a host that it will (sometimes) try another host from the DNS response.
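A rough sketch of that behaviour in Python (a simplification of what real browsers actually do; the fallback only kicks in when the TCP connect itself fails, never on a 502):

```python
import http.client
import socket

def fetch_like_a_browser(host, port, path="/"):
    """Resolve all A/AAAA records, then try addresses in order.

    A new address is tried only when the TCP connection itself fails;
    an HTTP error status such as 502 is a *successful* exchange, so no
    further addresses are attempted -- mirroring typical browser behavior.
    (IPv6 literal handling is glossed over here.)
    """
    last_err = None
    for family, _, _, _, sockaddr in socket.getaddrinfo(
        host, port, proto=socket.IPPROTO_TCP
    ):
        try:
            conn = http.client.HTTPConnection(sockaddr[0], port, timeout=5)
            conn.request("GET", path, headers={"Host": host})
            return conn.getresponse().status  # returned even if it's 502
        except OSError as e:
            last_err = e  # connect failed; fall through to the next address
    raise last_err or OSError("no addresses to try")
```

So a Cloudflare edge answering the TCP handshake and then serving a 502 looks, to this logic, like a perfectly healthy host.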


No, you'd need to use an application-aware DNS provider like Route 53 that could detect the failure.

Backup DNS entries won't help; what you need is to use multiple DNS providers and add them all to your NS records.


I think this is the correct answer.

Since I'm mostly thinking about static sites, I could also have something local that pings the site, and if it goes down, it could update my nameservers to point elsewhere on its own.

Probably more trouble than it's worth though.
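For what it's worth, the health-check half of that is trivial; the hard, registrar-specific part is the switch itself. A sketch (Python; `switch_nameservers` is a stand-in for whatever API your registrar offers, entirely hypothetical):

```python
import time
import urllib.error
import urllib.request

FAILURES_BEFORE_FAILOVER = 3  # debounce transient blips

def site_is_healthy(url, fetch=urllib.request.urlopen):
    """True if the site answers below the 5xx range.

    Note: the real urlopen raises HTTPError (a URLError subclass) for
    4xx/5xx, so those land in the except branch and count as down too.
    """
    try:
        with fetch(url, timeout=10) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

def switch_nameservers():
    # Placeholder: call your registrar's API here to repoint NS records
    # at a standby DNS provider. There's no standard API for this; it is
    # exactly the "work on my part" bit.
    print("failing over to backup nameservers")

def watch(url, interval=60):
    failures = 0
    while True:
        failures = 0 if site_is_healthy(url) else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            switch_nameservers()
            return
        time.sleep(interval)
```

And of course you'd still be waiting out NS propagation plus cached TTLs after the switch, which is why it's of limited use for a 30-minute outage.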


This is a good question.

I'm not sure if the browser receives all A/AAAA records from the syscall, or just one. I guess that if the browser has the whole list, and the error is in the 500 range, it could retry a different IP but I'm not sure if browsers do this.


DNS isn't handled by the kernel; it's handled by the resolver in the C library runtime, and that does return a list of addresses (I think getaddrinfo actually has you iterate through them in C anyway).


What would solve it is having your DNS have a low lifetime and then changing the DNS to point to not-Cloudflare. It would still be down for some users as long as the old (Cloudflare) DNS is cached, though.


Make sure that DNS provider offers an API and that none of their infrastructure is hosted on CF. :D


A problem with this though is that some registrars take hours to propagate, by the time you have it switched it will have already likely been resolved. If you spread that across hundreds of customers, you'd have a bad time.


No, I don't believe it would.


No, DNS doesn't work like this; you can't have backup records.


Not necessarily backup records, but you can add multiple A/AAAA records. There is no guaranteed order though.

It is possible to have multiple A/AAAA records pointing to different load balancers, but I don't know how browsers would deal with this.


Browsers would deal with it just fine (assuming the site is down hard and not responding with errors). It's some of the API tools and old libraries that may not; they would need retry logic that mimics the browser cycling through multiple A records. OTOH, API tools that have retry logic would just keep trying until the errors clear up. A browser will stop retrying once something responds, unless there was JavaScript running in memory that had retry logic.

If there are errors, the site would need to be modified to not respond if broken and unable to proxy to a working origin. Perhaps CF have not coded their proxies in this manner.
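That retry logic, reduced to a sketch (Python; `do_request` stands in for whatever HTTP client call the tool actually makes):

```python
import time

def retry_on_5xx(do_request, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry an HTTP call while it returns a 5xx status, with exponential
    backoff; return the first non-5xx response (or the final 5xx one).

    `do_request` is any callable returning an object with a `.status`
    attribute -- keeping at it until the errors clear up, rather than
    giving up on the first bad gateway.
    """
    resp = do_request()
    for attempt in range(1, attempts):
        if resp.status < 500:
            break
        sleep(base_delay * (2 ** (attempt - 1)))  # 0.5s, 1s, 2s, ...
        resp = do_request()
    return resp
```

During an outage of this scale you'd also want a jittered backoff and a circuit breaker, or everyone's retries pile onto the recovering edge at once.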


The lookup is essentially random. If you point DNS at 2 IPs and one of those goes down, then (without going into detail) half of your requests will fail.


Yes, that is correct.

But it does make it possible for a browser to retry a different IP if a 500-range error or timeout occurs.

But like I said in my other post: I don't know if browsers actually do this.


I just changed a setting in my CloudFlare account... did I break everything?


Oi! Change it back!


Well, they can’t now :p


Touché.

On one hand I think "Maybe I should diversify my infrastructure."

And on the other I think "But one of the biggest upsells was convenience."

And it's fortunate I don't have a third hand, because I'd be thinking "Oh crap oh crap I just migrated a client website to LightSail + CloudFlare saying how super awesome and robust it would be."

But it's okay now because it looks like everything is back up!


No


After trolling Verizon, I'm curious what reasons they'll come up with for this outage.


Maybe this is payback by Verizon...


Care to elaborate how they were "trolling" Verizon?


They came off that way in this blog post:

https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-...


That didn't seem like trolling – just a public call for Verizon to follow internet best practices. Given that most large ISPs treat failures as a PR exercise, that's probably necessary.


Agreed.

They reached out to Verizon privately, a Tier 1 carrier with expectations and responsibilities as a good netizen, and got no response.

They attempted to reach out through Verizon's public forms of communication and got a bullshit irrelevant CS response despite requesting escalation.

They then called out Verizon before the community as a whole.

They don't have the luxury of waiting for a well prepared letter from some Verizon lawyers. Modern day customer expectations don't allow for it. You may call it trolling, but all I saw was a company asking another company to stop pissing in the public pool.


So what should the ideal redundancy plan be here? If you can't log into the CDN provider and they are down do you you just have a second one ready (and paid for) and then log into your registrar and be ready to switch to that secondary CDN provider in this scenario? Or is there some sort of load balancing / routing solution between CDN's that I don't know about / understand?


If you use Cloudflare nameservers, you have to change to new nameservers, wait for that to propagate, and then wait for clients' cached record TTLs to expire. So it will be a major disruption no matter what you do.


What about the API? If that is up, you could script it.


Can you use your own nameservers that delegate to Cloudflare or Akamai 50% each, then adjust? Is there a service more suited for this than R53?


If you're using them for TLS certs then it's an even bigger problem unless you have them provisioned elsewhere.


Unless you need EV, you can just pull some wildcards from Let's Encrypt (as long as you don't use pubkey pinning). No need to automate as it's just a one-off.


The Cloudflare DNS-over-HTTPS resolver was serving up 502 errors as well, though the standard port 53 UDP resolver was working. This event definitely made me regret choosing Cloudflare as my sole DoH server.


Hear hear Mozilla!


Either downforeveryoneorjustme.com is itself served by Cloudflare too, or has been hugged to death.


Yep, we are 100% on CF workers :)

Here is a quick image of the peak downtime on downforeveryoneorjustme.com:

https://ibb.co/PZ9BMRc


Thanks for the fantastic work on the service, and thanks for sharing those stats!


It returns "502 Bad Gateway / cloudflare", so I assume it's not the HN hug...


Also, https://downdetector.co.uk/problems/cloudflare is currently detecting the outage in very direct fashion!


Not impressed that it's serving an error page claiming the underlying host is to blame (this one from Discord):

Error 502 Bad gateway

You - Browser - Working

Sydney - Cloudflare - Working

storage.googleapis.com - Host - Error

What happened? The web server reported a bad gateway error.


Seems fitting that Cloudflare spoke so aggressively against Verizon[0][1] last week and then this incident happens to them. I will be interested to read the postmortem on this situation. I really like Cloudflare but you should be careful not to jinx yourself with blogs posts like that.

[0]: https://blog.cloudflare.com/the-deep-dive-into-how-verizon-a... [1]: https://web.archive.org/web/20190628223129/https://blog.clou...


When you can't update your DNS because your registrar uses CloudFlare also....


You could use a dedicated dns provider, such as AWS


Any workarounds or solutions? I'm an on-call engineer with lots of questions coming in. I'm not sure what I can do apart from moving the domain off Cloudflare, but DNS propagation would take a few hours and by then Cloudflare might be up again.


Outages can always happen, when they do with companies like this, at least you'll know that some of the best people out there are working on the issue and that it will be resolved asap.

CloudFlare has proven in the past to be a very capable party; I don't think panicking now and trying to move everything away is a smart move. Also, a few people have been saying that even if you wanted to, the site to do so is not reachable, so that would be a challenge as well.


Panicking was not the plan. Asking for advice was.


My suggestion is wait. I wouldn't even consider flipping my sites over to another DNS unless the outage begins lasting over four hours or so. A lot of top sites use Cloudflare and this sort of outage is extremely rare for them (I can't remember a time when Cloudflare's own site and dashboard were taken offline).

Then again, it depends on the priority of your site. But there are tons of top sites on Cloudflare and I bet a lot of those places don't have plans for emergency switching over to another DNS provider / CDN on short notice as it's often a fairly disruptive change, especially now that more frontend logic for a site is implemented alongside the CDN/LB.


1.1.1.1 DNS is down, and I'm seeing a lot of 502s on Cloudflare sites.


I just changed my dns-over-https script to use quad9:

    # sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://cloudflare-dns.com/dns-query -b 1.1.1.1:53
    sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://9.9.9.9/dns-query
The bootstrap 1.1.1.1:53 was working but the DoH dns-query url was not. Hence my visit here :)

Edit: Now fixed, it seems. Quick work, if it stays up!


That's a biggie...


The 1.1.1.1 vpn still seems to be working


Not here it's not (NE US).


dns is working in London UK.


I thought I was going insane, googling it returned nothing, but trusty old hacker news has my back


Holy crap, Cloudflare is down, and seemingly everything it covers. Major sites such as DigitalOcean are all down, and there's no way to easily disable Cloudflare since their own site is down.



The outage appears to have ended between that post and your post.


Yeah, not a long outage, but a big one.


Single point of failure - we all shouldn’t trust CF alone anymore...


Any CDN is a single point of failure and limits your availability to as low as three nines. Anycast-based CDNs like Cloudflare are much less reliable than DNS-based CDNs, which can do orders of magnitude better.


Why do you think anycast based CDNs are worse than DNS based ones?


They rely on a single network infrastructure as opposed to many independent networks with independent edge nodes where isolating faults is rather trivial in comparison.


I’m eagerly waiting to read their blog post: "The Configuration Mistake That Almost Broke the Internet"


What will we do without our favourite "Enterprise MiTM" SaaS provider?


Holy shit. This impacts nearly everything.


Noticed that draw.io was down, along with 1.1.1.1 and all other Cloudflare-backed sites... I'm off for today.


Cleartext DNS 1.1.1.1 is still up though, thankfully.


Looks like everyone is impacted. We are seeing 502 Bad Gateway across multiple domains and regions on https://taskade.com


Back up and running!


Damn... getting a bunch of alerts and can't even open Pingdom either... also running on CloudFlare https://my.pingdom.com/newchecks/checks#


npm and yarn are not working either... there goes JavaScript land...


Yes, seems to have impacted all CF sites. (UK here)

Status was just updated here, but it was showing everything as operational for a while: https://www.cloudflarestatus.com/


Seems like it is clearing up. Our site is back up. DigitalOcean and Patreon are up as well.


Tried to go to some usual places where people discuss outages and met with outage. Ouch.


A possible downstream effect of this: Pingdom appears to be alerting VERY late, at least for us. I'm guessing with 8% of the web affected, their alerting systems aren't prepared for this many simultaneous alerts.


Several sites behind CloudFlare are returning 502 errors for me as well (France).



This is frustrating: can't access CF to turn off CF to make sites accessible. There should be an emergency admin/dashboard access path to turn off protection for cases like this.


Also getting this in the UK. Not completely down; bits and pieces of Medium come through, but very slowly and incomplete. I'm also unable to access npm.


It’s better that this happens now than later. I am confident CF will put protections in place to prevent this from happening again, but also put a switch in place to provide an instant fix the next time something like this happens.

It does suck to have a service down for a bit, but what CF offers, at the price point is pretty incredible.

Good luck to CF, and I wish you the best with coming up with a robust future-proof solution.


Can’t wait to see Verizon’s blog post about this one :)


And I'm sure that everybody hitting F5 to see if it's back is causing no problem at all, no no. Waiting anxiously for the writeup!


Also fun to notice that you have to agree to their privacy policy to receive updates, which is hosted on their website, which is down


>We are working to mitigate impact to Internet users in this region.

>This incident affects: North America (Ashburn, VA, United States - (IAD), Atlanta, GA, United States - (ATL), Boston, MA, United States - (BOS), Buffalo, NY, United States - (BUF), Calgary, AB, Canada - (YYC), Charlotte, NC, United States - (CLT), Chicago, IL, United States - (ORD), Columbus, OH, United States - (CMH), Dallas, TX, United States - (DFW), Denver, CO, United States - (DEN), Detroit, MI, United States - (DTW), Houston, TX, United States - (IAH), Indianapolis, IN, United States - (IND), Jacksonville, FL, United States - (JAX), Kansas City, MO, United States - (MCI), Las Vegas, NV, United States - (LAS), Los Angeles, CA, United States - (LAX), McAllen, TX, United States - (MFE), Memphis, TN, United States - (MEM), Miami, FL, United States - (MIA), Minneapolis, MN, United States - (MSP), Montgomery, AL, United States - (MGM), Montréal, QC, Canada - (YUL), Nashville, TN, United States - (BNA), Newark, NJ, United States - (EWR), Norfolk, VA, United States - (ORF), Omaha, NE, United States - (OMA), Phoenix, AZ, United States - (PHX), Pittsburgh, PA, United States - (PIT), Portland, OR, United States - (PDX), Queretaro, MX, Mexico - (QRO), Richmond, Virginia - (RIC), Sacramento, CA, United States - (SMF), Salt Lake City, UT, United States - (SLC), San Diego, CA, United States - (SAN), San Jose, CA, United States - (SJC), Saskatoon, SK, Canada - (YXE), Seattle, WA, United States - (SEA), St. 
Louis, MO, United States - (STL), Tampa, FL, United States - (TPA), Toronto, ON, Canada - (YYZ), Vancouver, BC, Canada - (YVR), Tallahassee, FL, United States - (TLH), Winnipeg, MB, Canada - (YWG)), Middle East (Amman, Jordan - (AMM), Baghdad, Iraq - (BGW), Baku, Azerbaijan - (GYD), Beirut, Lebanon - (BEY), Doha, Qatar - (DOH), Dubai, United Arab Emirates - (DXB), Kuwait City, Kuwait - (KWI), Manama, Bahrain - (BAH), Muscat, Oman - (MCT), Ramallah - (ZDM), Riyadh, Saudi Arabia - (RUH), Tel Aviv, Israel - (TLV)), Asia (Bangkok, Thailand - (BKK), Cebu, Philippines - (CEB), Chengdu, China - (CTU), Chennai, India - (MAA), Colombo, Sri Lanka - (CMB), Dongguan, China - (SZX), Foshan, China - (FUO), Fuzhou, China - (FOC), Guangzhou, China - (CAN), Hangzhou, China - (HGH), Hanoi, Vietnam - (HAN), Hengyang, China - (HNY), Ho Chi Minh City, Vietnam - (SGN), Hong Kong - (HKG), Hyderabad, India - (HYD), Islamabad, Pakistan - (ISB), Jinan, China - (TNA), Karachi, Pakistan - (KHI), Kathmandu, Nepal - (KTM), Kuala Lumpur, Malaysia - (KUL), Lahore, Pakistan - (LHE), Langfang, China - (NAY), Luoyang, China - (LYA), Macau - (MFM), Manila, Philippines - (MNL), Mumbai, India - (BOM), Nanning, China - (NNG), New Delhi, India - (DEL), Osaka, Japan - (KIX), Phnom Penh, Cambodia - (PNH), Qingdao, China - (TAO), Seoul, South Korea - (ICN), Shanghai, China - (SHA), Shenyang, China - (SHE), Shijiazhuang, China - (SJW), Singapore, Singapore - (SIN), Suzhou, China - (SZV), Taipei - (TPE), Tianjin, China - (TSN), Tokyo, Japan - (NRT), Ulaanbaatar, Mongolia - (ULN), Wuhan, China - (WUH), Wuxi, China - (WUX), Xi'an, China - (XIY), Yerevan, Armenia - (EVN), Zhengzhou, China - (CGO), Zuzhou, China - (CSX)), Africa (Cairo, Egypt - (CAI), Casablanca, Morocco - (CMN), Cape Town, South Africa - (CPT), Dar Es Salaam, Tanzania - (DAR), Djibouti City, Djibouti - (JIB), Durban, South Africa - (DUR), Johannesburg, South Africa - (JNB), Lagos, Nigeria - (LOS), Luanda, Angola - (LAD), Maputo, MZ - (MPM), 
Mombasa, Kenya - (MBA), Port Louis, Mauritius - (MRU), Réunion, France - (RUN), Kigali, Rwanda - (KGL)), Latin America & the Caribbean (Asunción, Paraguay - (ASU), Bogotá, Colombia - (BOG), Buenos Aires, Argentina - (EZE), Curitiba, Brazil - (CWB), Fortaleza, Brazil - (FOR), Lima, Peru - (LIM), Medellín, Colombia - (MDE), Mexico City, Mexico - (MEX), Panama City, Panama - (PTY), Porto Alegre, Brazil - (POA), Quito, Ecuador - (UIO), Rio de Janeiro, Brazil - (GIG), São Paulo, Brazil - (GRU), Santiago, Chile - (SCL), Willemstad, Curaçao - (CUR)), Oceania (Auckland, New Zealand - (AKL), Brisbane, QLD, Australia - (BNE), Melbourne, VIC, Australia - (MEL), Perth, WA, Australia - (PER), Sydney, NSW, Australia - (SYD)), and Europe (Amsterdam, Netherlands - (AMS), Athens, Greece - (ATH), Barcelona, Spain - (BCN), Belgrade, Serbia - (BEG), Berlin, Germany - (TXL), Brussels, Belgium - (BRU), Bucharest, Romania - (OTP), Budapest, Hungary - (BUD), Chișinău, Moldova - (KIV), Copenhagen, Denmark - (CPH), Dublin, Ireland - (DUB), Düsseldorf, Germany - (DUS), Edinburgh, United Kingdom - (EDI), Frankfurt, Germany - (FRA), Geneva, Switzerland - (GVA), Gothenburg, Sweden - (GOT), Hamburg, Germany - (HAM), Helsinki, Finland - (HEL), Istanbul, Turkey - (IST), Kyiv, Ukraine - (KBP), Lisbon, Portugal - (LIS), London, United Kingdom - (LHR), Luxembourg City, Luxembourg - (LUX), Madrid, Spain - (MAD), Manchester, United Kingdom - (MAN), Marseille, France - (MRS), Milan, Italy - (MXP), Moscow, Russia - (DME), Munich, Germany - (MUC), Nicosia, Cyprus - (LCA), Oslo, Norway - (OSL), Paris, France - (CDG), Prague, Czech Republic - (PRG), Reykjavík, Iceland - (KEF), Riga, Latvia - (RIX), Rome, Italy - (FCO), Saint Petersburg, Russia - (LED), Sofia, Bulgaria - (SOF), Stockholm, Sweden - (ARN), Tallinn, Estonia - (TLL), Thessaloniki, Greece - (SKG), Vienna, Austria - (VIE), Vilnius, Lithuania - (VNO), Warsaw, Poland - (WAW), Zagreb, Croatia - (ZAG), Zürich, Switzerland - (ZRH)).

Quite a wide region, innit?


I went to post the list in Slack and got a 'too many characters' error...


"This incident affects... EVERYONE!" (insert Gary Oldman Screaming GIF here)


Some people on Twitter are reporting that it's due to a DDoS. http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&... seems to indicate Iran is the source?


The date on your link is February 21st.


Right you are, heh. Not sure why it jumped back to that date. False alarm.


It's not.


This seems like FUD: the dates are wrong, and according to that report Iran is the one being attacked.


Wrong date on that page. That's Dec. 2, not Jul. 2


I'm beginning to see the issues disappearing for me on the East Coast of the US, as of 10:14 EDT.


"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw."

So, it was a dirty regex. I can't even be mad.
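For anyone who hasn't seen catastrophic backtracking in action, the textbook demonstration (this is not Cloudflare's actual rule, just the classic nested-quantifier pattern) shows how a failed match goes exponential:

```python
import re
import time

# Nested quantifiers: on a failed match the engine explores roughly
# 2^n ways of splitting the run of 'a's between the inner and outer groups.
EVIL = re.compile(r"^(a+)+$")

def time_match(n):
    s = "a" * n + "!"  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    result = EVIL.match(s)
    return result, time.perf_counter() - start

_, fast = time_match(5)
_, slow = time_match(22)
print(f"n=5: {fast:.6f}s  n=22: {slow:.4f}s")  # roughly doubles per extra 'a'
```

One such expression evaluated on every request is enough to pin every core, which is presumably how a single WAF rule took out both primary and secondary systems; engines in the RE2 family guarantee linear-time matching and avoid this failure class entirely.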


Wonder how we can use Cloudflare and have a fallback plan in place for situations like this. What would be a good architecture for this? So far I've read that it would be good to keep the registrar out of Cloudflare and use them as a CDN only. What else?



This article, entitled "Now Running on Cloudflare", seems to be down:

https://seankilleen.com/2016/12/now-running-on-cloudflare/


I'm getting '502 Bad Gateway' on all Cloudflare sites here in London too.

