At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare did: https://status.ably.io/incidents/647
We have machine generated incidents created automatically when error rates increase beyond a certain point stating "Our automated systems have detected a fault, we've been alerted and looking at it". See https://status.ably.io/incidents/569 for example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
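For illustration, the trigger logic behind such machine-generated incidents can be very simple. This is a sketch; the threshold and function names are hypothetical, not Ably's actual system:

```python
# Hypothetical sketch of status-page automation: open an incident when the
# rolling error rate crosses a threshold. Names are illustrative only.
ERROR_RATE_THRESHOLD = 0.05  # open an incident above 5% request errors

def check_and_alert(error_rate, open_incident):
    """Open a machine-generated status-page incident when the rolling
    error rate crosses the threshold. Returns the incident, or None."""
    if error_rate > ERROR_RATE_THRESHOLD:
        return open_incident(
            title="Our automated systems have detected a fault",
            body="We've been alerted and are looking at it.",
        )
    return None
```

The hard part in practice is tuning the threshold so transient blips don't page anyone, not the trigger itself.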
So our system thought none were running and kept launching instances...
These were SPOT instances and thus only cost like $0.10 per hour...
But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...
We had to reboot the entire system. Which took forever.
The only good thing was that this happened at 11am - so all team members were online and available... and then AWS refunded all costs.
The other fun time was when a newbie dev checked AWS creds into git - but he had created the 201st repo (we had only paid for 200) -- and as the next repo wasn't paid for, it was public by default - thus slurped up by bots ASAP - which then used the AWS creds to launch bitcoin mining bots in every single region around the globe. Like 1,700 instances.
The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...
That's an odd choice of a failure mode.
> AWS still refunded all costs...
Yeah, they should. It was their silly design choice that led to the disclosure of secrets, after all.
What kind of failure mode is that, even? Failing to create the repo would have made for a better user experience, for sure.
Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?
Exactly! What a stupid design UX.
Wait, was the 200 private repos issue an AWS thing or a GitHub/GitLab/whatever thing?
What AWS product has a concept of private/public repos and limits on how many of the former you can get for a certain price?
Never post aws secrets to git.
Trust me, they know.
They know about problems we never find out about, too.
The actual incident page was correct.
I think the only way is to have the status page operated by an independent third party, but I don't think there's a viable business model for someone to provide such a service. There might even be a risk of lawsuits against you.
> For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes
So assuming an outage affects 100% of your users (this one seems like it did, but that's not clear), they only refund the time the service was offline? According to Pingdom this outage lasted ~25 minutes, so that's 25/(31 * 24 * 60) = 0.056% of our bill, roughly 11 cents.
It sounds like you just don't pay for the time the service didn't work, which isn't much of a guarantee, that's just expected (of course you shouldn't pay for services not provided). Most SLAs for critical services have something like under 99.99% uptime you get 10% of your bill back, under 99.5% you get 20% back, under 99% you get 50% back. (*Numbers completely made up to demonstrate the concept.)
Am I misreading this? Morning coffee hasn't kicked in yet so maybe I am.
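For reference, the quoted formula works out as described. A quick sketch, using the numbers from the comment above:

```python
def prorata_credit(outage_minutes, affected_ratio, scheduled_minutes):
    """Service Credit fraction per the quoted SLA:
    (Outage Period minutes * Affected Customer Ratio)
        / Scheduled Availability minutes."""
    return (outage_minutes * affected_ratio) / scheduled_minutes

# A 31-day month has 31 * 24 * 60 = 44,640 scheduled minutes.
# A ~25-minute outage affecting 100% of customers:
fraction = prorata_credit(25, 1.0, 31 * 24 * 60)
print(f"{fraction:.3%}")  # -> 0.056% of the monthly bill
```

So yes: under this formula the credit is strictly pro-rata, i.e. you just don't pay for the downtime itself.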
I mean something like Amazon EC2's SLA (https://aws.amazon.com/compute/sla/) where credits are proportional to downtime, but not 1:1. i.e. they credit 100% for >= 5% downtime. With Cloudflare's SLA, 5% downtime (1.5 days in a month) would only give you a 5% credit.
> 100% uptime and 25x Enterprise SLA
> In the rare event of downtime, Enterprise customers receive a 25x credit against the monthly fee, in proportion to the respective disruption and affected customer ratio.
>in proportion to the respective disruption and affected customer ratio.
so no they aren't losing 2 yrs of revenue
* 99.0% <= uptime < 99.9%: 10% service credit
* 95.0% <= uptime < 99.0%: 25% service credit
* 0.0% <= uptime < 95.0%: 100% service credit
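The tiered structure above amounts to a step function rather than a 1:1 refund. A sketch of the lookup, using those tiers:

```python
def tiered_credit(uptime_pct):
    """Credit fraction of the monthly bill for the tiers listed above."""
    if uptime_pct < 95.0:
        return 1.00   # 100% service credit
    if uptime_pct < 99.0:
        return 0.25
    if uptime_pct < 99.9:
        return 0.10
    return 0.0

# Contrast with a purely pro-rata SLA: 5% downtime (95% uptime, ~1.5 days
# in a 31-day month) pays out 25-100% here, but only ~5% under a 1:1 formula.
```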
In any case, the function of the credit policy is to ensure there is enough pain for the provider to put in place the quality/reliability practices, processes, and code to protect themselves from losses. IMO most sustainable businesses pull in much more revenue per hour than 25x their CDN cost.
It would be interesting to know how CloudFlare's infra and processes differentiate free, Business and Enterprise customers.
Investigating - Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region.
It seemed very much like a global outage affecting all services. Is this status page not automatically updated with service status, or is it just manually updated by humans? Even if manually updated, surely when posting that status message, the status of all the services should be set to degraded?
I'm talking about the service indicators for each service lower down the page, which remained green throughout and appear to be just decoration, not an actual indication of service status; they all said operational throughout the incident (I reloaded a few times).
In particular I'm thinking of the Cloudflare Sites and Services section.
Together with a short TTL we were able to recover without relying on their dashboards.
I'm running the web servers, official wiki, and external game resource portal for the most active open source video game on GitHub through Cloudflare, and maybe we might not want our 60-million-requests-a-month website to go down when Cloudflare does.
Because I can tell you right now that our $300-a-month budget (which, mind you, covers 7 game servers that can each handle 100 connected players) can't take the $80 hit just to make Cloudflare not a single point of failure.
Edit: Wait, do they use Cloudflare?
$ dig namecheap.com +short
Organization: Namecheap, Inc. (NAMEC-4)
$ kdig +short www.namecheap.com
If this is something you want to be able to mitigate, you really need to be running separate DNS infrastructure from your hosting/CDN and use short-TTL CNAMEs to delegate hostnames to the CDN. This becomes a big challenge if you host on an apex domain (e.g. example.org instead of www.example.org), so don't do that.
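The delegation pattern described here might look like this in a BIND-style zone file (all names and addresses illustrative): your zone lives on DNS infrastructure independent of the CDN, and a short-TTL CNAME hands the hostname off to the CDN, so you can repoint it quickly during a CDN outage:

```
; zone example.org, served by a DNS provider independent of the CDN
example.org.      3600  IN  A      192.0.2.10        ; apex can't be a CNAME
www.example.org.  60    IN  CNAME  example.cdn.net.  ; short TTL: repoint fast
```

The apex restriction (a CNAME can't coexist with the SOA/NS records at the zone root) is exactly why hosting on example.org rather than www.example.org makes this escape hatch hard.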
We wrote about a strategy to circumvent this sort of thing a little while back https://www.ably.io/blog/routing-around-single-point-of-fail.... Given two incidents in a matter of weeks, I think a revisit of that article in light of most businesses who operate on a single domain would be useful :)
For about 30 minutes today, visitors to Cloudflare sites received 502 errors caused by a massive spike in CPU utilization on our network. This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.
This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write, performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.
Sounds like backtracking. If so, I'll bet there's a conversation happening about switching to re2.
edit: hmmm, https://github.com/cloudflare/lua-re2
I'm also curious why this rollout wasn't staged.
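For anyone unfamiliar with why backtracking is the prime suspect, here's a minimal illustration in Python. This is the textbook pathological pattern, not Cloudflare's actual WAF rule (which hasn't been published at this point):

```python
import re
import time

# Classic catastrophic backtracking: nested quantifiers. A backtracking
# engine (Python's re, Lua patterns/PCRE, etc.) explodes on near-misses;
# RE2 is immune by construction because it compiles to an automaton and
# runs in time linear in the input.
pattern = re.compile(r'^(a+)+$')

def time_no_match(n):
    """Time the engine failing to match 'a'*n + 'b'. The trailing 'b'
    forces it to try every way of splitting the run of a's between the
    inner and outer quantifier -- exponentially many ways."""
    s = 'a' * n + 'b'
    start = time.perf_counter()
    result = pattern.match(s)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed

fast = time_no_match(10)  # effectively instant
slow = time_no_match(20)  # each extra 'a' roughly doubles the work
```

That doubling per character is why a rule that passed on test inputs can still pin every CPU once real-world traffic hits it, and why swapping to re2 comes up.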
I'm tired of people not learning that trusting a single gateway with 50% of the internet is bad.
Yes, I know, free DDOS protection. There has to be another way of doing this, some mesh based DDOS protection or so.
Without CF, many websites would not stay on-line during an attack. And they would cease to exist because many of those places would never be able to afford DDoS protection. I know so many sites, including ones I run, that I would not be able to keep on the public internet without CF DDoS protection. There really is no real competition in this space.
I think we need to consider the fact that while this outage does take a lot of sites off-line at once, it is temporary, and it is still extremely rare. And the alternative is potentially that many websites would cease to exist at all, period, without something like Cloudflare existing.
It does not. "There has to be" ends in "!" and is an expression of a wish.
> Who is going to pay for this?
All of us. Everyone. I keep circling back to the idea that we should all join a protection ring, so we'd all share "cost" on this.
> How do you ensure latency is not atrocious?
> Why hasn't someone made this already?
The first one is very technical, and the answer is most probably localisation, plus only actually turning it on when needed.
For the second: because it's a very hard technical problem and there is basically no money in it. Business value maybe, but it would need people and companies to collaborate, it would probably need committee level decisions, and so far, nobody wanted to deal with this.
Or at least that's my theory.
Maybe dat:// will eventually become a viable option, and with that, due to the distributed nature, this kind of DDOS protection is sort of built in.
We need to decentralize the Internet, but this occurrence is not the reason. It's an argument to keep consolidating.
... mainframes for business critical applications.
As for buying IBM Cloud services, it's a bit fuzzier.
This is often the case when monopolies realize they're in a position where they can get away with sucking, but I don't believe that to be the case with Cloudflare yet.
If your site is down, it's your fault in their eyes, not some core infrastructure provider's.
Teach people the right way to do things, and then make that behavior self-rewarding.
There's a Cloudflare location right next to me, and that improves latency across many of the websites I use, and makes their public DNS service the fastest.
I just wish they had active competition in the same space. CDNs have network locations nearby, but they don't offer an easy UX for relatively inexperienced website owners, and DDoS protection services usually have fewer network locations than content CDNs.
The API wasn't working perfectly, but with some retries we were able to change the config for our domains.
Seems to be back up and running for me.
But in all frankness, if Cloudflare's own site did not run off of Cloudflare's infrastructure, why would anybody trust them with their websites?
In case of network issues affecting their proxy, being able to change your configuration to allow direct traffic would be really nice.
Though some of them might not be because of Cloudflare, the ones I spot checked all do appear related. Medium, DigitalOcean, Shopify, CodeShip, Pingdom, and many more. The impact is staggering.
Major outage impacted all Cloudflare services globally. We saw a massive spike in CPU that caused primary and secondary systems to fall over. We shut down the process that was causing the CPU spike. Service restored to normal within ~30 minutes. We’re now investigating the root cause of what happened.
(edit: thank you mods for changing the submission URL)
I know DNS will fall back if it can't reach a service, but would a 502 trigger that?
This has a few drawbacks like making sure your Route53 configuration is identical to your CF config, ensuring your origin servers can cope with the additional load if CF caching isn't available and the DNS propagation time required for the Name Servers to update.
During the last outage, we were able to get into the CF dashboard and simply disable the proxy which allowed our clients to access our origin servers directly but this time we can't even get into the Dashboard.
Ideally, I'd want something where if Cloudflare goes down, I don't have to change anything, but... 502 isn't going to trigger that without some work on my part.
Backup DNS entries won't help; what you need is to use multiple DNS providers and add them all to your NS records.
Since I'm mostly thinking about static sites, I could also have something local that pings the site, and if it goes down, it could update my nameservers to point elsewhere on its own.
Probably more trouble than it's worth, though.
I'm not sure if the browser receives all A/AAAA records from the syscall, or just one. I guess that if the browser has the whole list, and the error is in the 500 range, it could retry a different IP but I'm not sure if browsers do this.
It is possible to have multiple A/AAAA records pointing to different loadbalancers, but I don't know how browsers would deal with this.
If there are errors, the site would need to be modified to not respond when broken and unable to proxy to a working origin. Perhaps CF have not coded their proxies in this manner.
But it does make it possible for a browser to retry a different IP if a 500 range error, or timeout occurs.
But like I said in my other post: I don't know if browsers actually do this.
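As far as I know, browsers do fall back to alternate addresses on connection-level failures (the resolver does hand them the full list), but generally not after receiving a complete 5xx response. What the comments above describe, sketched as client logic (the probe callback is a stand-in for issuing the real HTTP request, which would also need the right Host header/SNI):

```python
import socket

def healthy_address(host, probe):
    """Walk every address getaddrinfo returns for `host` and return the
    first one whose probe() gives a non-5xx status. `probe(addr)` stands
    in for issuing the HTTP request to that specific address. Returns
    None if every address fails."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    for family, _, _, _, sockaddr in infos:
        addr = sockaddr[0]
        try:
            status = probe(addr)
        except OSError:      # connection refused, timeout, reset, ...
            continue         # try the next A/AAAA record
        if status < 500:
            return addr
    return None
```

Note this only helps if the records actually point at independent load balancers; multiple A records all terminating in the same failed proxy fleet buy you nothing.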
On one hand I think "Maybe I should diversify my infrastructure."
And on the other I think "But one of the biggest upsells was convenience."
And it's fortunate I don't have a third hand, because I'd be thinking "Oh crap oh crap I just migrated a client website to LightSail + CloudFlare saying how super awesome and robust it would be."
But it's okay now because it looks like everything is back up!
They reached out to Verizon privately, a Tier 1 carrier with expectations and responsibilities as a good netizen, and got no response.
They attempted to reach out through Verizon's public forms of communication and got a bullshit irrelevant CS response despite requesting escalation.
They then called out Verizon before the community as a whole.
They don't have the luxury of waiting for a well prepared letter from some Verizon lawyers. Modern day customer expectations don't allow for it. You may call it trolling, but all I saw was a company asking another company to stop pissing in the public pool.
Here is a quick image of the peak downtime on downforeveryoneorjustme.com:
Error 502 Bad gateway
You - Browser - Working
Sydney - Cloudflare - Working
storage.googleapis.com - Host - Error
The web server reported a bad gateway error.
CloudFlare has proven in the past to be a very capable party; I don't think panicking now and trying to move everything away is a smart move. Also, a few people have been saying that even if you wanted to, the site to do so is not reachable, so that would be a challenge as well.
Then again, it depends on the priority of your site. But there are tons of top sites on Cloudflare and I bet a lot of those places don't have plans for emergency switching over to another DNS provider / CDN on short notice as it's often a fairly disruptive change, especially now that more frontend logic for a site is implemented alongside the CDN/LB.
# sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://cloudflare-dns.com/dns-query -b 22.214.171.124:53
sudo /usr/local/bin/dnsproxy -l 127.0.0.153 -p 53 -u https://126.96.36.199/dns-query
Edit: Now fixed, it seems. Quick work, if it stays up!
Status was just updated here, but it was showing everything as operational for a while: https://www.cloudflarestatus.com/
It does suck to have a service down for a bit, but what CF offers, at the price point is pretty incredible.
Good luck to CF, and I wish you the best with coming up with a robust future-proof solution.
>This incident affects: North America (Ashburn, VA, United States - (IAD), Atlanta, GA, United States - (ATL), Boston, MA, United States - (BOS), Buffalo, NY, United States - (BUF), Calgary, AB, Canada - (YYC), Charlotte, NC, United States - (CLT), Chicago, IL, United States - (ORD), Columbus, OH, United States - (CMH), Dallas, TX, United States - (DFW), Denver, CO, United States - (DEN), Detroit, MI, United States - (DTW), Houston, TX, United States - (IAH), Indianapolis, IN, United States - (IND), Jacksonville, FL, United States - (JAX), Kansas City, MO, United States - (MCI), Las Vegas, NV, United States - (LAS), Los Angeles, CA, United States - (LAX), McAllen, TX, United States - (MFE), Memphis, TN, United States - (MEM), Miami, FL, United States - (MIA), Minneapolis, MN, United States - (MSP), Montgomery, AL, United States - (MGM), Montréal, QC, Canada - (YUL), Nashville, TN, United States - (BNA), Newark, NJ, United States - (EWR), Norfolk, VA, United States - (ORF), Omaha, NE, United States - (OMA), Phoenix, AZ, United States - (PHX), Pittsburgh, PA, United States - (PIT), Portland, OR, United States - (PDX), Queretaro, MX, Mexico - (QRO), Richmond, Virginia - (RIC), Sacramento, CA, United States - (SMF), Salt Lake City, UT, United States - (SLC), San Diego, CA, United States - (SAN), San Jose, CA, United States - (SJC), Saskatoon, SK, Canada - (YXE), Seattle, WA, United States - (SEA), St. 
Louis, MO, United States - (STL), Tampa, FL, United States - (TPA), Toronto, ON, Canada - (YYZ), Vancouver, BC, Canada - (YVR), Tallahassee, FL, United States - (TLH), Winnipeg, MB, Canada - (YWG)), Middle East (Amman, Jordan - (AMM), Baghdad, Iraq - (BGW), Baku, Azerbaijan - (GYD), Beirut, Lebanon - (BEY), Doha, Qatar - (DOH), Dubai, United Arab Emirates - (DXB), Kuwait City, Kuwait - (KWI), Manama, Bahrain - (BAH), Muscat, Oman - (MCT), Ramallah - (ZDM), Riyadh, Saudi Arabia - (RUH), Tel Aviv, Israel - (TLV)), Asia (Bangkok, Thailand - (BKK), Cebu, Philippines - (CEB), Chengdu, China - (CTU), Chennai, India - (MAA), Colombo, Sri Lanka - (CMB), Dongguan, China - (SZX), Foshan, China - (FUO), Fuzhou, China - (FOC), Guangzhou, China - (CAN), Hangzhou, China - (HGH), Hanoi, Vietnam - (HAN), Hengyang, China - (HNY), Ho Chi Minh City, Vietnam - (SGN), Hong Kong - (HKG), Hyderabad, India - (HYD), Islamabad, Pakistan - (ISB), Jinan, China - (TNA), Karachi, Pakistan - (KHI), Kathmandu, Nepal - (KTM), Kuala Lumpur, Malaysia - (KUL), Lahore, Pakistan - (LHE), Langfang, China - (NAY), Luoyang, China - (LYA), Macau - (MFM), Manila, Philippines - (MNL), Mumbai, India - (BOM), Nanning, China - (NNG), New Delhi, India - (DEL), Osaka, Japan - (KIX), Phnom Penh, Cambodia - (PNH), Qingdao, China - (TAO), Seoul, South Korea - (ICN), Shanghai, China - (SHA), Shenyang, China - (SHE), Shijiazhuang, China - (SJW), Singapore, Singapore - (SIN), Suzhou, China - (SZV), Taipei - (TPE), Tianjin, China - (TSN), Tokyo, Japan - (NRT), Ulaanbaatar, Mongolia - (ULN), Wuhan, China - (WUH), Wuxi, China - (WUX), Xi'an, China - (XIY), Yerevan, Armenia - (EVN), Zhengzhou, China - (CGO), Zuzhou, China - (CSX)), Africa (Cairo, Egypt - (CAI), Casablanca, Morocco - (CMN), Cape Town, South Africa - (CPT), Dar Es Salaam, Tanzania - (DAR), Djibouti City, Djibouti - (JIB), Durban, South Africa - (DUR), Johannesburg, South Africa - (JNB), Lagos, Nigeria - (LOS), Luanda, Angola - (LAD), Maputo, MZ - (MPM), 
Mombasa, Kenya - (MBA), Port Louis, Mauritius - (MRU), Réunion, France - (RUN), Kigali, Rwanda - (KGL)), Latin America & the Caribbean (Asunción, Paraguay - (ASU), Bogotá, Colombia - (BOG), Buenos Aires, Argentina - (EZE), Curitiba, Brazil - (CWB), Fortaleza, Brazil - (FOR), Lima, Peru - (LIM), Medellín, Colombia - (MDE), Mexico City, Mexico - (MEX), Panama City, Panama - (PTY), Porto Alegre, Brazil - (POA), Quito, Ecuador - (UIO), Rio de Janeiro, Brazil - (GIG), São Paulo, Brazil - (GRU), Santiago, Chile - (SCL), Willemstad, Curaçao - (CUR)), Oceania (Auckland, New Zealand - (AKL), Brisbane, QLD, Australia - (BNE), Melbourne, VIC, Australia - (MEL), Perth, WA, Australia - (PER), Sydney, NSW, Australia - (SYD)), and Europe (Amsterdam, Netherlands - (AMS), Athens, Greece - (ATH), Barcelona, Spain - (BCN), Belgrade, Serbia - (BEG), Berlin, Germany - (TXL), Brussels, Belgium - (BRU), Bucharest, Romania - (OTP), Budapest, Hungary - (BUD), Chișinău, Moldova - (KIV), Copenhagen, Denmark - (CPH), Dublin, Ireland - (DUB), Düsseldorf, Germany - (DUS), Edinburgh, United Kingdom - (EDI), Frankfurt, Germany - (FRA), Geneva, Switzerland - (GVA), Gothenburg, Sweden - (GOT), Hamburg, Germany - (HAM), Helsinki, Finland - (HEL), Istanbul, Turkey - (IST), Kyiv, Ukraine - (KBP), Lisbon, Portugal - (LIS), London, United Kingdom - (LHR), Luxembourg City, Luxembourg - (LUX), Madrid, Spain - (MAD), Manchester, United Kingdom - (MAN), Marseille, France - (MRS), Milan, Italy - (MXP), Moscow, Russia - (DME), Munich, Germany - (MUC), Nicosia, Cyprus - (LCA), Oslo, Norway - (OSL), Paris, France - (CDG), Prague, Czech Republic - (PRG), Reykjavík, Iceland - (KEF), Riga, Latvia - (RIX), Rome, Italy - (FCO), Saint Petersburg, Russia - (LED), Sofia, Bulgaria - (SOF), Stockholm, Sweden - (ARN), Tallinn, Estonia - (TLL), Thessaloniki, Greece - (SKG), Vienna, Austria - (VIE), Vilnius, Lithuania - (VNO), Warsaw, Poland - (WAW), Zagreb, Croatia - (ZAG), Zürich, Switzerland - (ZRH)).
Quite a wide region, innit?
So, it was a dirty regex. I can't even be mad.