Post Mortem on Cloudflare Control Plane and Analytics Outage (cloudflare.com)
486 points by eastdakota 10 months ago | 231 comments



Interesting choice to spend the bulk of the article publicly shifting blame to a vendor by name and speculating on their root cause. Also an interesting choice to publicly call out that you're a whale in the facility and include an electrical diagram clearly marked Confidential by your vendor in the postmortem.

Honestly, this is rather unprofessional. I understand and support explaining what triggered the event and giving a bit of context, but the focus on your postmortem needs to be on your incident, not your vendor's.

Clearly, a lot went wrong and Flexential needs to do their own postmortem, but Cloudflare doesn't need to make guesses and do it for them, much less publicly.


If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It might also be an effort to get out in front of the story before someone else does the speculating.

In any case, with at least three parties involved, with multiple interconnected systems… if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

Edit to add: I for one am grateful for the information Cloudflare is sharing.


>If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It's been 2 days. I doubt PGE or Flexential even have root caused it yet, and even if they have, good communication takes time.

You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

You also don't publicly share what "Flexential employees shared with us unofficially" (quote from the article) - what a great way to burn trust with people who probably told you stuff in confidence.

>if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

They can do all of that without smearing people on their company blog. In fact, they can do all of that without even knowing what happened to PGE/Flexential, because per their own admission they were already supposed to be anticipating this, but failed at it. Power outages and data center issues are a known thing, and are exactly why HA exists. HA which Cloudflare failed at. This post-mortem should be almost entirely about that failure rather than speculation about a power outage.


> You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

1. When you’re paying them the kind of money I imagine they’re paying and they don’t reply for 2 days, yeah, that’s crazy if true. I’d expect a client of this size could talk to an executive on their personal number.

2. Telling the facts as you know them to be, especially regarding very poor communication, isn’t a smear.


They aren't telling the facts as they know them. Cloudflare themselves say that the information in the article is "speculation" (the article literally uses that term).

Publicly casting blame based on speculation isn't something you do to someone that you want to have a good working relationship with, no matter how much money you pay them.


That's not true. This is behaviour that would be enough for me to pull the plug working with this DC as this is more than unacceptable.


> if you want to have a good working relationship with

What are you disagreeing with OP about?

He is talking about how to behave if you continue the relationship, not whether to continue it.


The post you're replying to is pointing out that multiple days without reporting out a preliminary root cause analysis is so absurdly below the expected level of service here that it would prompt them to reconsider using the service at all.

2 days is outrageous here, I have to imagine whoever thinks that is acceptable is approaching this from the perspective of a company whose downtime doesn't affect profits.


If you actually worked with datacenters you'd understand that what PGE and Flexential did is unacceptable as well


Agreed. DC sends us notifications any time power status changes. We had a dark building event once, due actually to some similar sounding thing: power fail over caused some arc fault in HV that took out the fail over switchgear. We received updates frequently.

UPS failing early sounds like it may be a battery maintenance issue.


We have no idea what their contract is. But two business days without a reply isn’t exactly a long time. Especially if they are conducting their own investigation and reproduction steps.


> But two business days without a reply isn’t exactly a long time

What???? We have 4 hour boots on the ground support with Supermicro and that's a few thousand dollars a year lol.

That doesn't make any sense for a customer as big as CF.


My impression from reading the writeup is that CF did receive support and communication from Flexential during the event (although not as much communication as they would have liked), but hasn't received confirmation from Flexential about certain root cause analysis things that would be included in a post-mortem.

Two days without support communications would be a long time, but my original comment about the two day period is about the post-mortem. It's totally reasonable IMO for a company to take longer than two days to gather enough information to correctly communicate a post-mortem for an issue like this, and IMO it's unreasonable for CF to try to shame Flexential for that.


Especially since it shouldn't matter why the DC failed — Cloudflare's entire business model is selling services allegedly designed to survive that. 99% of the fault lies with Cloudflare for not being able to do their core job.


In all fairness the rest of the article is about that


So why spend so much time trying to shift blame to the vendor? They could've just started the article with something like:

> Due to circumstances beyond our control the DC lost all power. We are still working with our vendors to investigate the cause. While such a failure should not have been possible, our systems are supposed to tolerate a complete loss of a DC.


I don't think I read it as charged as you did

Here's what happened, here's what went wrong, here's what we did wrong, here's our plans to avoid it happening again

Seems like a standard post mortem tbh


Because a small handful of decisions probably led to the Clickhouse and Kafka services still being non-redundant at the datacenter level, which added up to one mistake. But a small handful of mistakes were made by the vendor. Calling out each one of them was bound to take up more page space.

The order in which they list the mistakes would be a fair point to make, though, in my opinion. They hinted at a mistake they made in their summary, but don't actually tell us point blank what it was until they've told us all the mistakes that their vendor made. I'd argue that was either done to make us feel some empathy for Cloudflare as victims of the vendor's mistakes, which misleads us somewhat, or it was done that way because it was genuinely embarrassing for the author to write and subconsciously they want us to feel some empathy for them anyway. Or some combination of the two. Either way, I'll grant that I would have preferred to hear what went wrong internally before hearing what went wrong externally.


Slightly less than half, and the bottom half, so that people just skimming over it will mostly remember the DC operators' problems, not Cloudflare's own. This is very deliberately manipulative.


It is of course possible they've shuffled things around since this was posted but it seems that the first part addresses their system failings.

5th paragraph to the 9th are Cloudflare's "we buggered up" before they get to the power segment. They then continue with the "this is our fault for not being fully HA" after the power bit.

Each to their own, I'm going to read it as a regular old post mortem on this one.


Yeah I agree. The data center should be able to blow up without causing any problems. That's what Cloudflare sells and I'm surprised a data center failure can cause such problems.

Going into such depths on the 3rd party just shows how embarrassing this is for them.


You are way off here. This is 100% on Flexential: they have a 100% power SLA, which means the power will always be available, right? They also clearly hadn't performed any checks on the circuit breakers, and this is a NEWER facility for them. They also didn't even have HALF of the 10 hours for the batteries to charge the generators. They also DEFINITELY should have fully moved to generators during this maintenance; they clearly couldn't because they were MORE than likely assisting PGE. Cloudflare's CEO is right on here: you pay for data center services to be fully redundant. They have 18MW at this location and, from what I can see, they have (2) feeds? That I can't find. Do they? If (1) feed goes down, the 2N they have should kick in, and with generators there should be NO issues.


As far as I'm aware, this is the initial post-mortem to describe the events that took place.

And yes, that also means the initial event description is what they know so far.

Highly likely there will be another one https://twitter.com/eastdakota/status/1720688383607861442?t=...


I actually disagree, and think that the post mortem clearly defines that there were things that were disappointing that happened with the vendor, _as well as_ things that were disappointing that happened internally. I don't think that it's unfair to point out everything in an event that happened; I do think it would be unfair to ignore all the compounding issues that were in the power of the vendor, and just swallow all of the blame for an event, when a huge reason that businesses even go through vendors at all is to have an entity responsible for a certain set of responsibilities that the business in question doesn't feel they have the expertise to do themselves. Which implies a relationship built on trust, and it's fair to call out when trust is lost.

And even though Cloudflare did put some of the blame, as it were, on the vendor, the post mortem recognizes that Cloudflare wasn't doing their due diligence on their vendor's maintenance and upkeep to verify that the state of the vendor's equipment is the same as the day they signed on. And that's ignoring a huge focus of the post mortem where they admit guilt at not knowing or not changing the fact that Kafka and Clickhouse were only in that datacenter.

Furthermore, we do not know that Cloudflare didn't get the vendor's blessing to submit that diagram to their post mortem. You're assuming they didn't. But for what it's worth as someone that has worked in datacenters, none of this is all that proprietary. Their business isn't hurt because this came out. This is a fairly standard (and frankly simplified for business folk) diagram of what any decently engineered datacenter building would operate like. There's no magic sauce in here that other datacenter companies are going to steal to put Flexential out of business. If you work for a datacenter company that doesn't already have any of this, you should write a check to Flexential or their electrical engineers for a consultancy.

And finally, the things that Cloudflare speculated on were things like, to paraphrase, "we know that a transformer failed, and we believe that its purpose was to step down the voltage that the utility company was running into the datacenter." Which, if you have basic electrical engineering knowledge, just makes sense. The utility company is delivering 12470 volts, of course that needs to be stepped down, somewhere along the way, probably multiple times, before it ends up coming through the 210 volt rack PDUs. I'm willing to accept that guess in the absence of facts from the vendor while they're still being tight lipped.

However, that's not to say I'm totally satisfied by this post mortem either. I am also interested in hearing what decisions led to them leaving Kafka and Clickhouse in a state of non-redundancy (at least at the datacenter level) or how they could have not known about it. Detail was left out there, for sure.


That isn't a voltage change where you'd use multiple transformers in sequence generally, let alone if it's at the same site for the main/primary feed. A redundant feed counts the same, just to be clear, it's more that some low-power/"control plane of the electrical switchyard" applications may use a lower voltage if conveniently available, even if that means a second transformation step from the generators/grid to the load.

That said, the existence of the 480V labeled intermediary does suggest they have a 277/480 V outside system, and a 120/208 V rack-side system.
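For what it's worth, each of those pairs is just the line-to-neutral and line-to-line voltage of a balanced three-phase wye system, which differ by a factor of √3. A minimal sketch to check the arithmetic (the function name is mine, and this says nothing about how the facility is actually wired):

```python
import math


def line_to_line(line_to_neutral: float) -> float:
    """Line-to-line voltage of a balanced three-phase wye system."""
    return math.sqrt(3) * line_to_neutral


# The two nominal systems mentioned above.
for phase_voltage in (277.0, 120.0):
    print(f"{phase_voltage:.0f} V line-to-neutral -> "
          f"{line_to_line(phase_voltage):.0f} V line-to-line")
# 277 V line-to-neutral -> 480 V line-to-line
# 120 V line-to-neutral -> 208 V line-to-line
```

Both pairs round to the familiar nominal figures, which is why they are written as 277/480 V and 120/208 V.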


It's replies like these that make companies not want to share detailed postmortems. It's not crazy for many things in an incident to go wrong and for >0 of them to be external. It would be negligent for Cloudflare to not explicate what went wrong with the vendor which, I would note, reflects poorly on them: who picked the vendor? If anything, I would have liked to hear more on how Cloudflare ended up with a subpar vendor.

(none of this takes away from the mistakes that were wholly theirs that shouldn't have happened and that they should fix)


> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

> The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.

> A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure.

So the root cause for the outage was that they relied on a single data center. I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.


> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!

> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.

Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?


> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?

This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.


I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.


Your operational reviews must be lacking at AWS then (surprise surprise), because there are so many instances where something will be released in alpha yet the documentation will still be outdated, stale and incorrect LOL.


I think you misunderstand what's being talked about in this thread. "Operations" in this context has nothing to do with external-facing documentation, and instead refers to the resilience of the service and ensuring it doesn't for example, stop working when a single data center experiences a power outage.


"It stopped working because you did XYZ which you shouldn't have done despite it not being documented as something you shouldn't do" isn't different to a customer than a data center going down. For example, I'm sure the EKS UI was really resilient which meant little when random nodes dropped from a cluster due to the utter crap code in the official CNI network driver. My point wasn't that every cloud provider released alpha level software by the same definition but that by a customer's definition they all released alpha level software and label it GA.


> This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.

GCP run multi-year betas of services and features, so I'm doubtful there were still things not ironed out for GA. Do you have some examples?


Having worked at companies with varying degrees of autonomy, in my experience a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.


> Complete and utter management failure

Too strong. A failure certainly, but painting this as the worst possible management failure is kind of silly.


To be honest if you take the circumstances and them spending half of their post-mortem blaming the vendor, it does look like a total shitshow.


I’m going to leave out some details but there was a period of time where you could bypass cloudflare’s IP whitelisting by using Apple’s iCloud relay service. This was fixed but to my knowledge never disclosed.


There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here, the issue was "mostly" solved. Until they were called out on it by Google Project Zero team...

"Cloudflare Reverse Proxies Are Dumping Uninitialized Memory" - https://news.ycombinator.com/item?id=13718752


There still exist many bypasses that work in a lot of cases. There's even services for it now. Wouldn't be surprised if that or similar was a technique employed.


Saw.t


And the top comment on the other HN post called it: https://news.ycombinator.com/item?id=38113503


And that this was unironically written in the same post mortem: “We are good at distributed systems.”

There’s a lack of awareness there.


Well, they did distribute their systems. Some were in the running DC, some were not ;)


Their uptime was eventually consistent


haha. The control plane was eventually consistent after 3 days


They are good at systems that are distributed; they are very bad at ensuring systems they sell their customers are distributed.


They distributed the faults across all their customers....


Good != infallible


> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

It's amazing that they don't have standards that mandate all new systems to use HA from the beginning.


> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Absolute lack of faith in cloudflare rn.

This is amateur hour stuff.

It's especially egregious that these are new services that were rolled out without HA.


?

Tbh. As far as I can see, their data plane worked at the edge.

Cloudflare released a lot of new products and the ones that were affected were: streams, new image upload and logpush.

Their control plane was bad though. But since most products worked, that's more redundancy than most products.

The proposed solution is simple:

- GA requires to be in the high availability cluster

- test entire DC outages
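Both bullets can be automated in spirit. A minimal sketch, assuming a hypothetical deployment inventory (the service and data center names below are placeholders, not Cloudflare's): a product passes the GA gate only if it still has a replica after any single data center is removed.

```python
from typing import Dict, List, Set

# Hypothetical data centers and deployment inventory, for illustration only.
DATA_CENTERS: List[str] = ["DC-A", "DC-B", "DC-C"]
DEPLOYMENTS: Dict[str, Set[str]] = {
    "service-old": {"DC-A", "DC-B", "DC-C"},
    "service-new": {"DC-C"},
}


def survives_outage(deployed_in: Set[str], failed_dc: str) -> bool:
    """True if at least one replica remains once failed_dc is gone."""
    return len(deployed_in - {failed_dc}) > 0


def ga_ready(deployed_in: Set[str]) -> bool:
    """GA gate: the service must survive the loss of any single DC."""
    return all(survives_outage(deployed_in, dc) for dc in DATA_CENTERS)


def outage_drill(failed_dc: str) -> List[str]:
    """Names of services that would go dark if failed_dc lost power."""
    return sorted(
        name for name, dcs in DEPLOYMENTS.items()
        if not survives_outage(dcs, failed_dc)
    )


for dc in DATA_CENTERS:
    print(dc, "down ->", outage_drill(dc) or "nothing lost")
```

A real drill would cut power or traffic rather than consult an inventory, since an inventory can miss hidden dependencies.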


The combination of "newer products" and then having "our Stream service" as the only named service in the post-mortem is very odd, since Stream is hardly a "newer product". It was launched in 2017[1] and went GA in 2018[2]. If after 5 years it still didn't have a disaster recovery procedure I find it hard to believe they even considered it.

[1]: https://blog.cloudflare.com/introducing-cloudflare-stream/

[2]: https://www.cloudflare.com/press-releases/2018/cloudflare-st...


From what I was reading on the status page & customers here on HN, WARP + Zero Trust were also majorly affected, which would be quite impactful for a company using these products for their internal authentication.

It's not just streams, image upload & Logpush.


Those customers were impacted until the DC was back up ( 1-2 hours?) On the config plane.

The data plane ( which I mentioned) had no issues.

It's literally in the title what was affected: "Post Mortem on Cloudflare Control Plane and Analytics Outage"

Eg. The status page mentioned the healthchecks not working, while everything was fine with it. There were just no analytics at that time to confirm that.

Source: I watched it all happen in the cloudflare discord channel.

If you know anyone that is claiming to be affected on the data plane for the services you mentioned, that would be an interesting one.

Note: I remember emails were also more affected though.


> Those customers were impacted until the DC was back up ( 1-2 hours?) On the config plane.

Which was still like ~12+ hours, if we check the status page.

>Eg. The status page mentioned the healthchecks not working, while everything was fine with it. There were just no analytics at that time to confirm that.

What good is a status page that's lying to you? Especially since CF manually updates it, anyway?

>Source: I watched it all happen in the cloudflare discord channel.

Wow, as a business customer I definitely like watching some Discord channel for status updates.


?

This wasn't about status updates going to discord only.

There is literally a discussion section on the discord, named: #general-discussions

Not everything was clear in the discord either (e.g. the healthchecks were discussed there); that's not something you want to copy-paste into the status updates...

The priority for cloudflare seemed to be getting everything back up. And what they thought was down was always mentioned in the status updates.


Oh, I just looked it up and I thought you meant that CF engineers were giving real time updates there. That's not the case.

However, I still fail to see your argument regarding Zero Trust and not being impacted. The status page literally mentioned that the service was recovered on Nov 3, so I don't understand what you mean by:

>The data plane ( which I mentioned) had no issues.

There's literally a section with "Data plane impact" all over the status page, and ZT is definitely in the earlier ones. And this is given the fact that status updates on Nov 2 were very sparse until power was restored.


We don't use zero trust atm. So, I can't know for sure.

What I mentioned, was what I've seen passing by in the channel at the time.

I also saw no incoming help requests for zero trust tbh ( did some community help)


This was a short downtime. Big companies should build their own gateway, though; small ones can only wait and rely on CF.


> Tbh. As far as I can see, their data plane worked at the edge.

Arguable, it's best to think of the edge as a buffering point in addition to processing. Aggregation has to happen somewhere, and that's where shit hit the fan.


? That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.

Cloudflare's data lives in the edge and is constantly moving.

The only thing not living in the edge ( as was noticed), is stream, logpush and new image resize requests ( existing ones worked fine) from the data plane


>That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.

You're being loose in your usage of 'data'. No one is talking about cached copies of an upstream, but you probably are.

Read the post mortem a bit more closely. They explicitly state that the control plane(s) source of truth lives in core, and that logs aggregate back to core for analytics and service ingestion. Think through the implications on that one.


That’s my interpretation as well. There is one central brain, and “the edge” is like the nervous system that collects signals, sends it to the brain, and is _eventually consistent_ with instructions/config generated by the brain.
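That mental model can be sketched in a few lines. This is a toy, not Cloudflare's design: the class names and the version counter are assumptions for illustration. The edge keeps serving its last synced config while the core is unreachable, but no new changes can be published until the core is back.

```python
class Core:
    """Toy 'brain': the single source of truth for configuration."""

    def __init__(self) -> None:
        self.version = 0
        self.config: dict = {}
        self.online = True

    def publish(self, config: dict) -> None:
        if not self.online:
            raise RuntimeError("control plane unavailable")
        self.version += 1
        self.config = dict(config)


class EdgeNode:
    """Toy edge node: serves traffic from its last synced config."""

    def __init__(self) -> None:
        self.version = 0
        self.config: dict = {}

    def sync(self, core: Core) -> int:
        # Pull only when the core is reachable and has something newer;
        # otherwise keep the last known config.
        if core.online and core.version > self.version:
            self.version = core.version
            self.config = dict(core.config)
        return self.version

    def serve(self, key: str):
        return self.config.get(key)


core, edge = Core(), EdgeNode()
core.publish({"rule": "allow"})
edge.sync(core)

core.online = False        # the core data center loses power
edge.sync(core)            # no-op: nothing reachable to pull from
print(edge.serve("rule"))  # still served from the edge's copy
```

In this toy, the data plane survives a core outage while the control plane does not, which is the distinction being argued over in this subthread.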


> I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.

So do they.


[flagged]


Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently cloudflare is getting the blame, which is presumably part of what they're paying for.


>Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently cloudflare is getting the blame, which is presumably part of what they're paying for.

The issue is fixed now. But as I mentioned CloudFlare still has a shit captcha, and the one for disabilities was broken as I mentioned.


Yep, it's easy to spot folks who have never configured Cloudflare's WAF when they suggest Cloudflare is blocking their browser of choice instead of the website itself.


As someone who was slightly affected by this outage, I personally also find this post-mortem to be lacking.

75% of the post-mortem talks about the power outage at PDX-04 and blames Flexential. Okay, fair - it was a bit of a disaster what was happening there judging from the text.

But by the end of November 2 (UTC), power was fully restored. It still took ~30 hours according to the post-mortem for Cloudflare to fully recover service. This was longer than the outage, and the text just states that too many services were dependent on each other. But I wish they'd go into more detail here on why the operation as a whole took that long. Are there any take-aways from the recovery process, too? Or was it really just syncing data from the edges back to the "brain" that took this long?

Also one aspect I am missing here is the lack of communication - especially to Enterprise customers. Cloudflare support was basically radio silent during this outage except for the status page. Realistically, they couldn't do much anyway. But at least any attempt at communication would be appreciated - especially for Enterprise customers, and even more especially after the post-mortem blames Flexential for a lack of communication.

While I like Cloudflare since it's a great product, I think there are still a few more conclusions CF should take away from this incident.

That being said, glad you managed to recover, and thanks for the post-mortem.


I'm not that surprised at the relative lack of detail, given how quickly they released this; I'm surprised they published this much info so quickly. Calling it a postmortem is a bit of a misnomer, though. I'd expect a full postmortem to have the kind of detail you mention.


> In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.

This paragraph similarly leaves out juicy details. Exactly what services fail if logging is down? Were they built that way inadvertently? Why did no one notice?
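One way to catch this class of problem ahead of time is to walk the dependency graph and flag anything an HA service transitively needs that runs in only one data center. A rough sketch, assuming you have an inventory like the one below; only Kafka, ClickHouse and PDX-04 come from the post mortem, while the other service names, the DC-A/DC-B labels and the dependency edges are made up for illustration:

```python
from collections import deque
from typing import Dict, List, Set

# Where each service runs. Kafka/ClickHouse in PDX-04 only is from the
# post mortem; everything else here is hypothetical.
DEPLOYED_IN: Dict[str, Set[str]] = {
    "kafka": {"PDX-04"},
    "clickhouse": {"PDX-04"},
    "analytics-api": {"DC-A", "DC-B", "PDX-04"},
    "dashboard": {"DC-A", "DC-B", "PDX-04"},
    "dns-config": {"DC-A", "DC-B", "PDX-04"},
}

# Hypothetical direct dependencies between services.
DEPENDS_ON: Dict[str, List[str]] = {
    "analytics-api": ["kafka", "clickhouse"],
    "dashboard": ["analytics-api"],
    "dns-config": [],
}


def single_dc_dependencies(service: str) -> Set[str]:
    """Transitive dependencies of `service` that run in fewer than two DCs."""
    seen = {service}
    queue = deque([service])
    risky: Set[str] = set()
    while queue:
        current = queue.popleft()
        for dep in DEPENDS_ON.get(current, []):
            if dep in seen:
                continue
            seen.add(dep)
            queue.append(dep)
            if len(DEPLOYED_IN.get(dep, set())) < 2:
                risky.add(dep)
    return risky


for name in ("dashboard", "dns-config"):
    print(name, "->", sorted(single_dc_dependencies(name)) or "no single-DC deps")
```

In this made-up graph the dashboard looks highly available on paper but still inherits the single-DC risk through the analytics service, which is the shape of failure the quoted paragraph describes.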


> Also one aspect I am missing here is the lack of communication - especially to Enterprise customers.

They blame Flexential for lack of communication, but were the first ones not saying anything.


Even "we don't know why our data center is failing, but we're sending a team over to physically investigate now" would have been A+ communication in the moment.


Everything was on the status page since the start?

DC related updates:

> Update - Power to Cloudflare’s core North America data center has been partially restored. Cloudflare has failed over some core services to a backup data center, which has partially remediated impact. Cloudflare is currently working to restore the remaining affected services and bring the core North America data center back online. Nov 02, 2023 - 17:08 UTC

> Identified - Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.

We will keep providing regular updates until the issue is resolved, thank you for your patience as we work on mitigating the problem. Nov 02, 2023 - 13:40 UTC


As an enterprise customer, I would expect a CSM reaching out to us informing us about the impact, getting into more details about any restoration plans and potentially even ETAs or rough prioritization to resolution on them.

In reality, Cloudflare's support team was essentially completely unavailable on Nov 2, leaving only the status page. And for most of the day, the updates on the status page were very sparse except "we are working on it", and "We are still seeing gradual improvements and working to restore full functionality.".

Yet clearer status updates were only given starting on Nov 3. However, I still don't think I heard anything from support or a CSM during that time.


?

1) Were you affected on the data plane? Which product?

As far as I can tell, while the outage was in the core DCs, the impact was minor.

2) Both examples were exactly from 2 November. Not 3 November.

3) What method of support did you try? I thought that their support was impacted ( email?).

The status page explicitly mentioned to get in contact with your account manager for some config changes on some products, if you wanted changes.

4) I have never heard of Enterprise customers being contacted by a cloud company during an outage.

Which company does that? Do you have an example?

5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".

Since most customers weren't affected and some others were minorly impacted.

There is not a single cloud company that does that.

Feel free to correct me if I'm wrong...


for us as an enterprise customer for many years:

ssl for saas -> custom hostnames are not working for new domains or changes to current ones. also page rules -> redirects are not working for new rules or changes to current rules. which are game-stoppers for our business.

we contacted via enterprise email support + ccing our managers and assigned engineers.

first they tried to tell us the product was working and sent us some details on how to do this and that; a couple of hours later they understood the issue was bigger than they thought and said "the product is affected by api outage".

then in another email we asked them when this could be solved, but the only answer we got was "please follow status page for the updates".

and after a day, ssl for saas & ssl services took their places on the status page. for a day nobody noticed whether it was working or not except customers.

so as we understand from these emails, even the team internally had no idea what was working and what was not!


>1) Were you affected on the data plane? Which product?

No, but we needed to make urgent changes.

>2) Both examples were exactly from 2 November. Not 3 November.

Both messages contain no clear messages about remediation and co. They also didn't state clearly which products were failed over. I noticed that at this point I could at least login to the dashboard, but most stuff was still severely broken, and I had no idea whether changes with the few semi-functional components were actually applied or not.

Updates to single products with a more clear status were given only at the end of November 2nd (UTC).

(Also one of the messages states data centres - not just data center. Not sure what happened there).

>3) What method of support did you try? I thought that their support was impacted ( email?).

Emergency line + contacting our CSM. The emergency line was shut down and replaced with voice mail (WTF?), and our CSM did not reply at all (or the message somehow made it to the wrong person, I'll find out next week, I guess).

So in our case, the communication was essentially non-existent, even though I raised a support case (or wanted to).

>4) I have never heard of Enterprise customers being contacted by a cloud company during an outage. Which company does that? Do you have an example?

I can remember Datadog reaching out to us for their 2023-03-08 incident. Not sure if it was just our CSM being nice or someone did a support request on another communication channel, but looking back in history that came without asking + the post mortem. Same case when stuff happens such as vulnerabilities in one of their packages, they reach out to us proactively and notify us.

To be fair, this is a bit of a wishlist and definitely not necessary for a 30-minute hiccup, but for a 2 day outage... I don't know.

At the bare minimum, I'd expect at least their support team to be replying and not shutting down the communication channels.

>5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".

I don't know... At least at the time I raise an urgent support case about an issue, I expect to be kept up-to-date.

> Since most customers weren't affected and some others were minorly impacted.

What does it mean they were not affected? Yes, their core service was still functioning (thank god - after all they advertise a 100% (!) SLA on that), but you can see on the same Discord channel you mentioned people failing to renew TLS certificates, people who couldn't make Vercel deployments and more. So it did affect quite a bunch of downstream customers in their products, and they might also sell SLAs to their customers...

I cannot really comment on whether that just affected us, or if other customers had better support experiences here.

But I expect better in terms of communication here. Doesn't have to be as outreaching as I did in my last message, but stuff like shutting down the emergency line and not giving any comment is not really acceptable for an Enterprise contract.


Just mentioning from "the other side".

We are a service provider in (mostly) Europe.

Our policy (playbook) in case of an issue is to update the status page as quickly as possible, and customers can subscribe via RSS.

There was one issue in the past where we wanted to inform the clients. But it's not easy, as only some were impacted and we decided against it.

5 minutes later (it was out of our hands) it was solved...

Our playbook is to update the status page as soon as possible to inform the clients something is up and we are aware.

There shouldn't be too much info on it, since sometimes you just aren't 100% sure about what exactly is going on.

We also decided that we don't want to provide durations on it, since you then create a commitment that's possibly dependent on external factors.

Tbh. I can completely understand the approach from Cloudflare here. With an issue, support is overwhelmed. That's why you use the status page ASAP.

Technical details go in the post-mortem, when we can be sure whether any data was lost (normally nothing is lost, but it's possible we need to requeue some actions).

=> This is when we can contact our clients and bring them up to date.

Depending on the SLA it's included or, e.g., paid extra (a lot of the time an external provider fails and we can fix something from our end, e.g. resending some data).


I've got no knock on the status page. Cloudflare is disappointed in the lack of notification from their data center provider, and Cloudflare customers are disappointed in the lack of notification from their service provider.

Instead of defending what was done and calling that good enough, Cloudflare should use this as an opportunity to commit to reevaluating the strategy for customer outreach during major service failures. If that's what Cloudflare expects from its service providers, that's what Cloudflare should provide to its customers.


?

You want Cloudflare to update every customer for an issue that they probably aren't affected by (except when changing things)?

Who even does that when you've got so many customers?

That's exactly why the status page is there:

https://www.cloudflarestatus.com/

The DC obviously didn't have any means to update their customers.


I don't want that. Cloudflare's customers want that. Cloudflare was embarrassed and needs to listen to the feedback they're receiving.


There is literally not a single cloud company doing that.

Even those that had complete outages.


I think they just wanted a quick post-mortem. I'm sure they will add more to the blog later in the year when they implement mitigations.


I love how thorough Cloudflare's post mortems are. Reading the frank, transparent explanations is like a breath of fresh air compared to the obfuscation of nearly every other company's comms strategy.

We were affected but it’s blog posts like these that make me never want to move away. Everyone makes mistakes. Everyone has bad days. It’s how you react afterwards that makes the difference.


I would generally agree with you, but this post mortem was 75% blaming Flexential even though it took them almost two days to recover after power was restored. The power outage should have been a single paragraph and then pivoted - DC failures happen, it's part of life. Failing to properly account for and recover from it is where the real learnings for Cloudflare are.


It was more of an incident report. The efforts to get back online were mostly around Flexential, so it makes sense to dive into their failings. That said, it is clear there were major lapses of judgement around the control plane design since they should be able to withstand an earthquake. That they don't have regular disaster recovery testing of the control plane and its dependencies seems crazy. I wonder if it is more that they hoped to eliminate some of those dependencies and replace them with in-house technology, and hedged their bets on the risk.


> Everyone makes mistakes. Everyone has bad days.

The issue is when you start having bad days every other day though. We use and depend on CloudFlare Images heavily, it has now been down more than 67 hours over the last 30 days (22h on October 9th, 42h Nov 2 - Nov 4 and a sprinkle of ~hour long outages in between). That's 90.6% availability over the last month.

Transparency is a great differentiator between providers that are fighting in the 99.9% availability range, but when you are hanging on for dear life to stay above the one 9 availability, it doesn't matter.
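For anyone who wants to check the arithmetic, here is a minimal sketch, assuming a 30-day window (720 hours) and the downtime figure quoted above. The function names are made up for illustration. Exactly 67 hours works out to roughly 90.7%, so the 90.6% quoted is consistent with "more than 67 hours".

```python
def availability_percent(downtime_hours: float, window_days: int = 30) -> float:
    """Availability over the window, as a percentage."""
    total_hours = window_days * 24
    return 100.0 * (1 - downtime_hours / total_hours)


def downtime_budget_hours(target_percent: float, window_days: int = 30) -> float:
    """Hours of downtime permitted in the window for a given availability target."""
    total_hours = window_days * 24
    return total_hours * (1 - target_percent / 100.0)


# 67 hours down in a 30-day window
print(f"availability: {availability_percent(67):.2f}%")

# For comparison: how little downtime a 99.9% target leaves in the same window
print(f"99.9% budget: {downtime_budget_hours(99.9) * 60:.0f} minutes")
```

The second number is the useful one for comparing providers: a three-nines target leaves well under an hour per month, which is why a multi-day outage dominates everything else.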


They are a younger company than these other providers. Microsoft, Google, and AWS had their own growing pains and disasters. Remember when Microsoft deleted all the data (contacts, photos, etc.) off all their customers' Danger phones by accident and had no backup? Talk about naming their product a self-fulfilling prophecy.


Cloudflare is 14 years old and Cloudflare Stream, the "newer services they didn't have time to make HA" is 6 years old today.


they are 14 years old at this point. aws has what, four years on them?


AWS was the public release of tooling that Amazon had been building for almost 20 years at that point.

Similar story for GCP.

All three of them had decades of institutional knowledge and procedures in place around running big services by the time Cloudflare was founded.


> AWS was the public release of tooling that Amazon had been building for almost 20 years at that point.

No, even at the onset AWS was an entirely-from-the-ground-up build. The only thing it could even be argued to sit on top of was the extremely crufty VMs and physical loadbalancers from the original Prod at that point, and those things were not doing anybody any favors.


No they didn't. Amazon was 12 years old when AWS launched. Google was 10 years old when GCP launched.


Cloudflare is fourteen years old


I agree, but I also think that for security purposes they should leave out extraneous detail. Also, I know they want to hold their suppliers accountable, but I would hold off pointing fingers. It doesn't really improve behavior, and it makes incentives worse.

I really appreciate that they're going to fix the process errors here. But as they suggested, there's a tension between moving fast and being sure. This is typically managed like the weather, buying rain jackets afterwards (not optimal). I'd be curious to see how they can make reliability part of the culture without tying development up in process.

Perhaps they can model the system in software, then use traffic analytics to validate their models. If they can lower the cost of reliability experiments by doing virtual experiments, they might be able to catch more before roll-out.


> I also think that for security purposes they should leave out extraneous detail

Disagree completely, it's the frank detail that makes me trust their story.


Maybe, but I think that their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing on what might have happened. Instead, state the facts you know and move onto your response and lessons learned.


Yeah, that part really rubbed me the wrong way. If this was a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.


I prefer to have their informed speculation here.

Has Flexential provided a similarly detailed, public root cause analysis? If so, maybe we can refer to it. If not, how do you expect us to read it?


It’s only been a couple of business days, and it’s likely that they themselves will need root cause from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.


I expect them to start reporting out what they know immediately, and update as they learn more. If they're not doing that, and indeed haven't reported anything in days, that is a huge failure.

Imagine if the literal power company failed, and took days to tell people what was going on. You can see why people are reading the postmortem that exists, rather than the one that doesn't.


Cloudflare vowed to be extremely transparent since the start of their existence. I'm very happy with the fact they have managed to keep this a core company value under extreme growth. I hope it continues after they reach a stable market cap. It isn't like Google that vowed not to be evil until they got big enough to be susceptible to antitrust regulation and negative incentives related to ad revenue.


What "security purposes"? Good security isn't based on ignorance of a system, it is based on the system being good. We create a self-fulfilling prophecy when we hide security practices, because then very few will properly implement their security. Openness is necessary for learning.


> know they want to hold their suppliers accountable

They do both. They stated what their problem was and they stated their due diligence in picking a DC

> While the PDX-04’s design was certified Tier III before construction and is expected to provide high availability SLAs

They stated the core issue: innovating fast, which led to not requiring new products to be in the high availability cluster.

Which is also a fix.

From Cloudflare's POV, part of what originally made it worse is the lack of communication by the DC.

Which is an issue, if you want to inform clients.


It's weird that upon reading this post, I have less confidence in Cloudflare. They basically browbeat Flexential for behaving unprofessionally, which, yes, they probably did. However, the fact that this causes entire systems that people rely on to go down is a massive redundancy failure on Cloudflare's part; you should be able to nuke one of these datacentres and still maintain services.

Very worrying is they start by stating their intended design:

> Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon

You need way more geographic dispersion than that; this control plane is used by people across the world. We are still on the intended design, not the flawed implementation by the way, which is wild to me.

> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

I don't understand why this would ever be done in this way. If Cloudflare is making a new product for consumers shouldn't redundant design be at the forefront here? I am surprised that it was even an option. For the record I do use Cloudflare for certain systems and I use it because I assume it has great failovers if events like this occur making me not have to worry about these eventualities, but now I will be reconsidering this, how do I actually know my cloudflare workers are safe from these design decisions?

> When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services.

Yeah, I'll bet; it's because Cloudflare's core design is not redundant.

Really disappointed in this blog post trying to shift the blame to Flexential when this slapdash architecture should be the main problem on show. As a customer I don't care if Flexential disappears in an earthquake tomorrow, I expect Cloudflare to handle it gracefully.


I'm also a bit surprised about Hillsboro. FEMA is assuming that when (not if) The Big One hits, everything west of I-5 is going to be toast.

Is placing the entirety of such a critical cluster in a known earthquake and tsunami zone a good idea? It looks like their disaster recovery to Europe didn't really work either...


Yeah. Moreover, looking at the map, the DCs around Hillsboro are terrifyingly close to each other.

By the way, assuming an ideal control plane (in contrast to data plane) would be 3 DCs at a distance of about 20-40 miles, are there any mitigation techniques so that a seismic event which destroys a single DC doesn't also sever the comms between the remaining two?


Is the Hillsboro thing about latency?


That may well be part of it, some people were talking about the impact of latency in the outage thread [1].

[1] https://news.ycombinator.com/item?id=38113952
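To put rough numbers on the latency angle, here is a back-of-the-envelope sketch. It assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in vacuum) and ignores routing detours, switching and queueing, so these are lower bounds only. The 8,000 km Oregon-to-Europe figure is an illustrative assumption, not something from the article.

```python
FIBER_KM_PER_MS = 200.0   # assumed propagation speed in fiber, km per millisecond
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a straight-line path."""
    return 2 * distance_km / FIBER_KM_PER_MS


# DCs 20-40 miles apart, as discussed upthread
for miles in (20, 40):
    km = miles * KM_PER_MILE
    print(f"{miles} miles: at least {min_round_trip_ms(km):.2f} ms round trip")

# An assumed ~8,000 km path between Oregon and a European site
print(f"8000 km: at least {min_round_trip_ms(8000):.0f} ms round trip")
```

Sub-millisecond round trips make synchronous active-active replication cheap; tens of milliseconds per write is a very different design problem, which may be part of why the three sites sit so close together.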


> However, we had never tested fully taking the entire PDX-04 facility offline.

That is a painful lesson, but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster.

You can point fingers at the facility operators, but at the end of the day you have to be able to recover from a dc going completely offline and maybe never coming back. Mother Nature may wipe it off the face of the earth.


This is a fair point. Imagine there had been a serious fire like OVH suffered or flooding that destroyed the data center. Would Cloudflare have been able to recover?


That's not what happened here. Their edge worked fine.

Business was mostly running as usual.

The OVH outage was immediate downtime.


Most likely, yes. They have enough customer lock-in that enough customers would stick with them even if it took them a week to rebuild everything in other DCs.


> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.

I liked this - the human element is underemphasised often in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes.

I don’t know how it would work for an org of Cloudflare’s size, but I know we have plans for a significant outage for staff to work/sleep in shifts, to try to avoid that problem as well.

Issue there is that you need a way to hand over the current state of the outage to new staff as they wake up/come online.


I’m curious, have these plans ever been tested in a real incident?

Like Mike Tyson says, everyone has a plan until they get punched in the face.


The biggest key to implementing these types of plans is that when the shit hits the fan, you send a third of the people home - so they can come back in 10-20 hours and relieve those who are still there.

If you don't do that, you're still going to be scrambling.


Somewhat amazed at the structure of this article: after first discussing the third-party for 75% of the blog post, the first-party recovery efforts were detailed in considerably fewer paragraphs. It’s promising to see a path forward mentioned but I can’t help but wonder why this was published instead of currently acknowledging their failure/circumstances and later on publishing a complete post-mortem after the dust fully settles (i.e. without speculation).


To make sure their stonk doesn’t drop at market open next week. Investors will read this (or get the sound bites) and shrug it off as some vendor issue rather than a deep issue that will require months of rework (millions of dollars and thus impacting earnings).


It’s called “shifting the blame”.


Poor doc: You had a high availability 3 data center setup that utterly failed. Why spend the first third of the document blaming your data center operator? The management of the data center facility is outside of your control. You gambled that not appropriately testing your high-availability setup (under your control) would not have consequences. You should absolutely discuss the DC management with your operator, but that's between you and them and doesn't belong in this post mortem.


Wow they REALLY buried this important part didn't they! This took a ton of scrolling:

"Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04."

Bingo, there we have it.


There’s also the part where the disaster recovery site apparently fell over under the load (which, OK, is a thing that might happen) and they needed to code up limits on the fly (and that is not OK; I don’t have the slightest idea how one might test this, but if you’re building a “disaster” site it seems like you’d need to figure it out):

> When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services. We implemented rate limits to get the request volume under control.

This seems not to be mentioned in the bullet points at the end of the text (which are otherwise reasonable).

And now I’m curious—how do you design cold failover when the system is complex enough to be metastable[1] and you can’t afford to test it on live traffic? I can guess which techniques you could use to build it, it’s the design and testing part (knowing the techniques actually work in your situation) that’s the problem.

One other thing that seems to have gone completely unmentioned:

> Beginning on Thursday, November 2, 2023, at 11:43 UTC Cloudflare's control plane and analytics services experienced an outage. [... W]e made the call at 13:40 UTC to fail over to Cloudflare's disaster recovery sites located in Europe.

Why did the decision take so long? I can imagine it can’t be made lightly, but two hours seems like too much hesitation, even if there was an expectation that power would be restored imminently for most of that time. There has to be a (predetermined?) point when you hit the switch regardless of any promises. Was it really set that far?

[1] http://charap.co/metastable-failures-in-distributed-systems/
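For readers unfamiliar with the "rate limits" part of that quote, here is a minimal token-bucket sketch of the general technique. The article does not say how Cloudflare's limits were implemented, so this is an assumption-laden illustration only; the class name, parameters and the injectable clock are all hypothetical.

```python
import time


class TokenBucket:
    """Admit at most `capacity` requests in a burst, refilling at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.clock = clock          # injectable so the demo can use a fake clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller should reject, e.g. with HTTP 429


# Simulate a herd of 1000 queued retries arriving at the same instant.
fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=5, clock=lambda: fake_now[0])
admitted = sum(bucket.allow() for _ in range(1000))
print(f"admitted {admitted} of 1000 requests in the initial burst")
```

The limiter only protects the server side. Clients that retry immediately on rejection keep the herd alive, so this is usually paired with exponential backoff and jitter on the caller's end.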


In my experience, power is the most common data center failure there is. Often it's the redundant systems that cause the failure.


Nobody cares why a data centre died. It's like complaining one of your nodes in a kubernetes cluster has died, or one of your disks in a raid.

The problem here, which is 100% Cloudflare, is that their systems were not resilient across geography.


And that's completely unrelated to my comment but thanks for the insight


Yep, past the part where they spent a long time blaming the data center and power company.


What does PDX-04 mean here? Not familiar with how data centers work.


PDX is the airport code for Portland, Oregon, USA. It's the fourth Portland data center.


Read the damn article! It's explained at the top.


Nah, if only the data center would've stayed up this wouldn't have been a problem. It's clearly on the data center. /s


Not criticism, just remarks:

> While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA).

I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA. But most companies really struggle with this, especially with ensuring that customers are never exposed to a/b/p software by default, requiring opt-in from the customer, allowing the customer to easily opt-out, and ensuring that using a/b/p software never endangers GA features they depend on. Building that out, if it's even on a company's internal Platform/DevX backlog, is usually super far down as a "wishlist" item. So I'm super interested to see what Cloudflare can build here and whether that can ever get exposed as part of their public Product portfolio as well.

> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.

Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.

Between the Pages outage and the API outage happening in one week, I was considering selling my NET stock, but reading a postmortem like this reminds me why I invested in NET in the first place. Thanks Matt.


> I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA.

Speaking from personal experience, what you're claiming as 'good', for CF meant SRE (usually core, but edge also suffered) got stuck trying to fix a fundamentally broken design that was known to be faulty, and repeatedly called faulty, but forced through.

Nothing about this is desirable or will end well.

This reckoning was known and raised by multiple SREs nearly a decade before this occurred, and there were multiple near misses in the last few years that were ignored.

The part that's probably funny- and painful- for ex-CF SRE is that the company will do a hard pivot and try to rectify this mess. It's always harder to fix after, rather than building for, and they've ignored this for a long while.


I'm not sure if you understood my argument? I'm arguing that it's fine to ship a "fundamentally broken design" as long as the company makes abundantly clear that such software is shipped as-is, without warranty of any kind, MIT-license-style. Ramming that kind of software through to GA without unanimous sign-off from all stakeholders (infra/ops, sec, privacy/legal, etc.) is fundamentally unacceptable under such a model. Maybe there's an argument to be made that such a model is naïve, that in practice the gatekeepers for GA will always be ignored or overruled, but I would at least prefer to think that such cases are examples of organizational dysfunction rather than a problem with the model itself, which tries to balance between giving Product the agility it needs to iterate on the product, Infra/Sec/Legal concerns that really only apply in GA, and Ops (SRE) understanding that you can't truly test anything until it's in production; the same production where GA is.


> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.

>> Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.

On the other hand, when a company dogfoods its own products you end up in a dependency hell like the one AWS apparently is in, where a single Lambda cell hitting full capacity in us-east-1 breaks many services in all regions.

I'm sure there is a right way to manage end to end dependencies for 100% of your services past, present, and future but increasingly I'm of the opinion that it's not possible in our economic system to dedicate enough resources to maintain such a dependency mapping system since that takes away developer time from customer facing products that show up in the bottom line. You just limp along and hope that nothing happens that takes out your whole product.

Maybe companies whose core business is a money printing machine (ads) can dedicate people to it but companies whose core business is tech probably don't have the spare cash.


> Security

Security is what keeps a single service getting breached from causing the whole company to get breached.

> Privacy/Legal

Cloudflare doesn't get indemnification from the law just because a customer agrees to mutually break the law.


They don't always know it, but all large systems are moving gradually towards a dependency management system with logic rules that cover "everything": physical, logical, human and administrative dependencies. Every time something new that isn't covered is discovered, new rules and conditions are added. You can do it with manual checklists, multiple rule checkers, or put everything together.

I suspect that in the end it's just easier to put everything into a single declarative formal verification system and see if a new change to the system passes, a transition between configurations passes, etc.
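A toy version of one such rule, as a sketch: given a declared dependency graph and the sites each service runs in, flag any service that is supposed to be highly available but transitively depends on something running in fewer than two sites. The service names and data below are hypothetical, and a real system would need far more than this single rule.

```python
from collections import deque


def single_site_dependencies(deps, sites, ha_services):
    """Map each HA service to its transitive dependencies that run in < 2 sites."""
    findings = {}
    for service in ha_services:
        seen = {service}
        queue = deque([service])
        offenders = []
        while queue:                              # breadth-first walk of the graph
            current = queue.popleft()
            for dep in deps.get(current, ()):
                if dep in seen:
                    continue
                seen.add(dep)
                queue.append(dep)
                if len(sites.get(dep, ())) < 2:   # undeclared counts as non-redundant
                    offenders.append(dep)
        if offenders:
            findings[service] = sorted(offenders)
    return findings


# Hypothetical declarations: "api" looks redundant until you follow its dependencies.
deps = {"api": ["auth"], "auth": ["legacy-config"], "legacy-config": []}
sites = {
    "api": {"dc-a", "dc-b", "dc-c"},
    "auth": {"dc-a", "dc-b", "dc-c"},
    "legacy-config": {"dc-a"},
}
print(single_site_dependencies(deps, sites, ha_services=["api", "auth"]))
```

Run as a pre-deploy check, a rule like this fails the change instead of the outage. The hard part is keeping the declared graph honest, which is the point made above about rules accumulating over time.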


This is such an interesting way of putting it. I think this has been the subconscious reason I've been gravitating towards defining _everything_ I manage personally (and not yet at work) with Nix. It's not quite to the extent you're talking about here, of course, but in a similar vein at least.


> Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon. The three data centers are independent of one another, each have multiple utility power feeds, and each have multiple redundant and independent network connections. The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted, while still close enough that they could all run active-active redundant data clusters.

If the three data centers are all around Hillsboro, Oregon, an earthquake could probably take out all three simultaneously.


> Hillsboro, Oregon, an earthquake could probably take out all three simultaneously.

Is it west of I5?

(yes)

Oh yeah, they all gone.

Cascadia Subduction Zone - https://pnsn.org/outreach/earthquakesources/csz


Wikipedia's entry for Hillsboro: "Elevation 194 ft (60 m)"

Between that, and being ~50 miles inland - I'd say there's ~zero threat of Cascadia quakes or tsunamis directly knocking out those DC's. (Yeah, larger-scale infrastructure and social order could still be killers.)

OTOH - Mt. St. Helens is about 60 miles NNE of Hillsboro. If that really went boom, and the wind was right...how many cm's of dry volcanic ash can the roofs of those DC's bear? What if rain wets that ash? How about their HVAC systems' filters?


I was never worried about the tsunami. Okay, maybe not gone, but I wouldn't say it would be operational.

https://www.oregon.gov/oem/Documents/Cascadia_Rising_Exercis...

50% of roads and near 75% of bridges damaged on the west coast and the I5 corridor.

Refer to PDF page #93 where over 70% of power generation is highly damaged on the I5 corridor and 60% in the coastal areas with 0% undamaged.

Highly damaged - "Extensive damage to generation plants, substations, and buildings. Repairs are needed to regain functionality. Restoring power to meet 90% of demand may take months to one year."

"In the immediate aftermath of the earthquake, cities within 100 miles of the Pacific coastline may experience partial or complete blackout. Seventy percent of the electric facilities in the I-5 corridor may suffer considerable damage to generation plants, and many distribution circuits and substations may fail, resulting in a loss of over half of the systems load capacity (see Table 22). Most electrical power assets on the coast may suffer damage severe enough as to render the equipment and structures irreparable"


Good backup generators at their colos could handle the lack of utility power for days to weeks. More & better generators could be hauled in and connected.*

The two big problems I'd see would be (1) Social Order and (2) Internet Connectivity. DC's are not fortresses, and internet backbone fibers/routers/etc. are distributed & kinda fragile.

*After all the large-scale power outages & near-outages of recent decades, Cloudflare has no excuse if they lack really-good backup generators at critical facilities. And with their size, Cloudflare must support enough "critical during major disaster" internet services to actually get such generators.


And most of their SREs. Spending 30 hours to recover from the worst natural disaster in recorded history is slightly different than recovering from a ground fault on a single transformer.


There’s also the Portland Hills and Gales Creek fault zones to worry about.


They would then go into very detailed description how tectonic activity caused the outage.


That's why DRC exists, right?


>While there were periods where customers were unable to make changes to those services, traffic through our network was not impacted.

They're just going to straight up lie like that? We definitely weren't able to get "traffic through [their] network" through the outage at many different random points.

So if the CF team is under the impression traffic was not impacted, dig deeper.


I think overall Cloudflare did a decent job on this. Clearly the DC provider cocked up big time here, but Cloudflare kept running fine for the vast majority of customers globally. No system is perfect and it’s only apocalyptic scenarios like this where the vulnerabilities are exposed - and they will now be fixed. Hope the SRE guys got some rest after all that stress.


> Clearly the DC provider cocked up big time here

Actually, this is the CF version, maybe Flexential will come out with a different one.

BTW, if you design a system to survive a DC failure, you cannot blame the DC failure.


Well, this is definitely NOT a blameless post-mortem!

Obviously I'm joking because they are blaming an external company (Flexential) to which they are surely paying big money for the DC space.


I wonder if CF execs are aiming to use this to get out of their long-term contract with them?


> We are a relatively large customer of the facility, consuming approximately 10 percent of its total capacity.

I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?


CF is probably a lot smaller than you realize, especially per data center. As someone who works at a different CDN, I am guessing they only have a few hundred machines per data center around the world. That is way too small to be able to run your own DC.


The part you're missing is that their business is not actually that large (<$1 billion of revenue in 2022 - still deep in the red - and 3k employees).


You thought that they built > 300 DCs?

Colo is much more flexible, cheaper and quicker to start. Definitely since they sit close to the end-user on the data plane.


> You thought that they built > 300 DCs?

I have no idea how many DCs they have or operate in. Where does "300" come from?

> Colo is much more flexible, cheaper and quicker to start. Definitely since they sit close to the end-user on the data plane.

I understand that, but it has the disadvantage of reduced control and observability - particularly in the event of an outage such as that described in the blog post.

I kind of assumed that top-tier cloud platforms like AWS/Azure/GCP operate out of dedicated DCs, and that CF are similar because of their well-known scale of operations. Since my original comment has been downvoted†, someone presumably thinks it was a naive or trivial question - although I don't understand why.

(† I don't much care about downvotes, but I do take them to be a signal.)


Probably most of us follow Cloudflare a bit more closely.

They want DC's close to every big city. I think most of us knew that they can't launch > 300 DC's in such a short amount of time.

The large number of DCs is mentioned a lot (social networks, blogs, here).

There is a distinction between e.g. AWS / Azure / ..., which work with a couple of big DCs, while Cloudflare is spread across many more locations.

Your comment did make me realize it may not be that clear from an outsider's viewpoint, though (FYI, I'm an outsider too).



> I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?

Google for one has both. Some GCP regions [0] are in colos, while others are in places where we already had datacenters [1]. We also use colo facilities for peering (and bandwidth offload + connection termination).

I'm under the impression that most AWS Cloudfront locations are also in colo facilities.

[0] https://cloud.google.com/about/locations

[1] https://www.google.com/about/datacenters/locations/


I'm a little surprised too; I figured they would have their own DCs for their core control plane servers. Colos for their 300+ PoPs makes sense, though.


I asked a sales rep once about services going out and how that would affect CF For Teams. They said it would be virtually impossible for CF to go down because of all their data centers around the world. Paraphrasing, “if there’s an outage, there’s definitely something going wrong with the internet.”

And here we are. My trust in them has hit zero.


FWIW, I'm a Cloudflare Enterprise customer and we had zero downtime. Only thing that was temporarily unavailable was the cloudflare dashboard.

I feel like a lot of people in this thread are commenting under the impression that all of Cloudflare was down for 24 hours when in reality I wouldn't be surprised if a lot of customers were unaffected and unaware of the incident.

I wouldn't even have known of the outage had it not been for HN..


2nd this. We had zero downtime on anything in production. The only reason we knew is because we are actively standing up a transition to R2 and ran into errors configuring buckets.


Trust? Random sample from the last 60 days...

Cloudflare outage – 24 hours now - https://news.ycombinator.com/item?id=38112515

Cloudflare Dashboard Logins Failing - https://news.ycombinator.com/item?id=38112230

Ask HN: Cloudflare Workers are down? - https://news.ycombinator.com/item?id=38074906

Cloudflare API, dashboard, tunnels down - https://news.ycombinator.com/item?id=38014582

Cloudflare Intermittent API Failures for Cloudflare Pages, Workers and Images - https://news.ycombinator.com/item?id=37819045

Cloudflare Issues with 1.1.1.1 public resolver and WARP - https://news.ycombinator.com/item?id=37762731

Cloudflare – Network Performance Issues - https://news.ycombinator.com/item?id=37604609

Cloudflare Issues Passing Challenge Pages - https://news.ycombinator.com/item?id=37336743


Why would you trust a sales rep?

Even honest engineers cannot foresee the exact cascading effects of such outages. Sales reps are paid neither to be competent on such issues nor to be honest.


While debatably unprofessional to blame your vendor, I found this read to be fascinating. I'm sure there are blog posts that detail how data centers work and fail but it's rare to get that cross over from a software engineering context. It puts into perspective what it takes for an average data center of this class to fail: power outage, generator failure, and then battery loss.


I think what it really does is emphasise how common it is for crap to hit the fan when things go wrong - even with the best laid plans.

The DC almost certainly advertises the redundant power supplies, generator backups and battery failover in order to get the customers. But probably doesn't do the legwork or spend the money to make those things truly reliable. It's a bit like having automated backups - but never testing them and discovering they're empty when they're really needed.


I'm ultimately glad this happened because it very effectively helps illustrate how we are assigning a centralized gatekeeper to the internet at the infrastructure level and why it's a bad thing.


Contrary to others here, I find the postmortem a bit lacking.

The TLDR is that CF runs in multiple data centers, one went down, and the services that depend on it went down with it.

The interesting question would be why those services did depend on a single data center.

They are pretty vague about it

    Cloudflare allows multiple teams to innovate quickly. As such,
    products often take different paths toward their initial alpha.

If I was the CEO, I would look into the specific decisions of the engineers and why they decided to make services depend on just one data center. That would make an interesting blog post to me.

Designing a highly available system and building a company fast leads to interesting tradeoffs. The details would be interesting.


> why they decided to make services depend on just one data center

In my experience, no engineers really decided to make services depend on just one data center. It happened because the dependency was overlooked. Or it happened because the dependency was thought to be a "soft dependency" with graceful degradation in case of unavailability but the graceful degradation path had a bug. Or it happened because the engineers thought it had a dependency on one of multiple data centers, but then the failover process had a bug.

Reminds me of that time when a single data center in Paris for GCP brought down the entire Google Cloud Console albeit briefly. Really the same thing.
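
To make the "soft dependency" failure mode above concrete, here is a minimal Python sketch. It assumes a service that reads a setting from a remote dependency and falls back to a local cache when that dependency is unreachable. The names `get_setting`, `fetch_remote`, `local_cache`, and `unreachable` are hypothetical and not taken from Cloudflare's systems. The point is only that the fallback branch is code too, and it tends to run for the first time during a real outage unless a test forces it.

```python
def get_setting(key, fetch_remote, local_cache):
    """Return (value, degraded) for key, preferring the remote dependency."""
    try:
        value = fetch_remote(key)
    except ConnectionError:
        # Degraded path: the dependency is unreachable, so serve the last
        # known value if there is one. This branch rarely runs in production,
        # which is why it needs its own test.
        if key in local_cache:
            return local_cache[key], True
        # No cached value: the "soft" dependency turns out to be a hard one.
        raise
    local_cache[key] = value
    return value, False


def unreachable(key):
    """Hypothetical stand-in for a dependency in a data center that lost power."""
    raise ConnectionError("dependency unavailable")


cache = {}
print(get_setting("rate_limit", lambda key: 100, cache))  # healthy path: (100, False)
print(get_setting("rate_limit", unreachable, cache))      # degraded path: (100, True)
```

Injecting `unreachable` in a test is the cheap version of a failover drill: it exercises the degradation path on purpose, including the case where the cache is empty and the call still fails.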


> In my experience, no engineers really decided to make services depend on just one data center.

Partially true in this case; I can't speak to modern CF (or won't, moreso) but a large amount of internal services were built around SQL db's, and weren't built with any sense of eventual consistency. Usage of read replicas was basically unheard of. Knowing that, and that this was normal, it's a cultural issue rather than an "oops" issue.

Flipping the whole DC data sources is a sign of what I'm describing; FAANG would instead be running services in multiple DC's rather than relying on primary/secondary architecture.


Dunno about that, I've read similar internal postmortems at the FAANG I worked at.


Everywhere I've worked requires a DR drill per service, but I've never seen anything where the whole company shuts down a DC at once across all services.

But probably we should. It's an immensely larger coordination problem, but frankly, it's probably the more common failure mode.


The FAANG I worked at did this back in 2016-18, so that what happened to CloudFlare didn't happen to them.


Isn't this sentence a bit further down more clear?

> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

and

> It [PDX-04] is also the default location for services that have not yet been onboarded onto our high availability cluster.


>I would look into the specific decisions of the engineers and why they decided to make services depend on just one data center

And the product team defining requirements

And IT/governance/architecture teams for not properly cataloging dependencies

And the sales and marketing team not clearly articulating what they're selling (a beta/early access product that's not HA)


I experienced it myself within the last 24 hours. New D1 & Hyperdrive deployment was not working. It would spew out internal errors & timeouts.

Both are non-GA products, and the point is that non-GA are not part of the HA cluster (yet)


This is a good reminder to myself to transfer domains registered on Cloudflare to another provider and only use Cloudflare for DNS or vice versa. I was effectively locked out of making any changes to domains registered and DNS hosted on Cloudflare during the entire outage due to a single point of failure (Cloudflare) on my part.


Classic distraction maneuver. This postmortem is a prime example of tech porn that diverts attention from the main issue: many at Cloudflare didn't do their job properly.


They really threw the electricity power provider under the bus there.


The electricity provider is fine, it's Flexential that looks incredibly opaque and non-communicative in a stressful situation.

While Cloudflare should have been better prepared for this, it seems to be amateur hour in that particular Portland data-center. Other customers (Dreamhost, etc) were impacted too, and I can't imagine they don't also have some very pointed questions.


Sure, but DreamHost recovered fully within 12 hours [1], Cloudflare took almost 2 days [2]

[1] https://www.dreamhoststatus.com/pages/incident/575f0f6068263...

[2] https://www.cloudflarestatus.com/incidents/hm7491k53ppg


A lot of mud-slinging on here about HA setup and CF's dealing with the problem but I can only assume people are armchair experts with no real experience of HA at the scale of CF.

"So the root cause for the outage was that they relied on a single data center.". No. Root cause was that data centre operator didn't manage the outage properly and didn't have systems in place in which case they could have avoided it + some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.

"Cloudflare has a shit reputation in my eyes, because their terrible captchas". You don't like one product so they have a shit reputation? Enough said.

"but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster." If you have ever had to do this, you know that it is never a good feeling. On-paper, yes, you should try your DR but in reality, even if it works, you lose data, you get service blips, you get a tonne of support calls and if it doesn't work, it might not even rollback again. On top of that, it isn't a case of just disconnecting something, most problems are more complicated. System A is available but not system B. Routers get a bad update but are still online, and on top of all of that, you would need some way to know that everything is still working and some problems don't surface for hours or until traffic volume is at a certain level etc. If you trust that a data centre can stay online for long periods of time and that you would then be able to migrate things at a reasonable rate if it doesn't, then you have to trust that to an extend.

All in all, CF are not attempting to blame someone, even though a lot is down to Flexential. The last paragraph of the first section says, "To start, this never should have happened...I am sorry and embarrassed for this incident and the pain that it caused our customers and our team."

Well done CF


> some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.

I mean you're contradicting yourself in the same sentence. Had CloudFlare had such a system in place that would allow that particular center to fail, there would be no outages in the service. The truth is that they didn't account for it, and because they missed it, that center became a single point of failure, which is what brought the whole CloudFlare service down. The power outage was just a trigger to discover a weakness in their system design and not a root cause.


Couldn’t agree more.

Many of these comments sound like they’re coming from some mythical alternate universe where bugs don’t exist and people and orgs have 100% flawless execution every time.

It reminds me a little of someone sitting at a sports bar yelling about a “stupid” play or otherwise criticizing a 0.0001% athlete who is playing at a level they can’t possibly fathom.

Monday Morning quarterbacking.


> Throughout the incident, Cloudflare's network and security services continued to work as expected.

This seems misleading. Their own status page said the data plane was impacted across many services.


> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning

A minor point, but this feels like not the most efficient way to manage an emergency, compared with having some form of staggered shifts or another approach rather than having everyone pile on. If a lot of knowledge resides in specific individuals, so that they are vital to an effort like this and cannot be substituted, then that seems like a risk in its own right.


Why was the very first step not to fail over to Europe?


They did after two hours. After the first they assumed the generators would be back but then ran into the breaker issue which caused the full day delay.


My question too, although possibly failing over first seemed like the greater risk. BTW, is there any unexpected GDPR implication of that? Assuming that failing over means restoring US backups in the EU.


iirc the GDPR prohibits storing EU data in non-EU servers, not vice-versa


But it does mean all that data is now required to be handled in a compliant fashion


Why would any supplier want to do business with Cloudflare now? You have 1.8MW of datacenter space to lease, you have a few interested parties, how could you not view Cloudflare as a huge reputational risk? Why even do the business? Why not lease that space to someone else? Moreover, why renew the existing deals with Cloudflare?

Does Cloudflare have a plan to move 200+ racks in Oregon if that supplier decides just not to renew that deal whenever it comes up next? Are Cloudflare claiming they were able to build a technical plan which gets their architecture away from this site being a SPOF before the deal is up to renew, or is the CEO making a gamble here again?

Cloudflare have demonstrated their willingness to create reputational issues for suppliers by publicly shaming two of them recently, and here, in only about 2 days from incident. One interpretation of this blog would be Cloudflare are a very unreasonable customer and one who is willing to post incomplete or informal information from their suppliers. Cloudflare also chose to focus the first half of a lengthy postmortem on blaming the supplier and only then on their own culpability for the outage, despite it clearly being a shared responsibility.

One of the diagrams Cloudflare have posted is clearly marked "Proprietary and Confidential". Do Cloudflare have permission to post that? It's not clearly stated that they do. Should other suppliers expect when the sh*t hits the fan that any sensitive information they've shared will be part of a blog?

Most of the "Lessons and Remediation" section is stuff Cloudflare could have worked on at any point in advance of a major incident, and Cloudflare's senior management have quite clearly chosen not to prioritize that work until today, when forced to by this major incident.

When signing large deals, Cloudflare will frequently have to complete 'Supplier Disclosures', and they are also making claims through industry-standard certifications [1] like ISO, SOC and FedRAMP. Most of those will ask questions about the disaster recovery and business continuity plans and Cloudflare will have (repeatedly) attested they are adequate, something that this blog clearly demonstrates was a misrepresentation of their true capabilities.

Will there be an SEC disclosure coming out of this considering it could have material impacts on the business, which is publicly traded? Was there any requirement that the SEC disclosure come first, or be concurrent with a blog?

[1] https://www.cloudflare.com/trust-hub/compliance-resources/


You can run ClickHouse cluster across multiple datacenters. It will survive the failure of a single datacenter while being available for writes and reads, and the failure of two out of three datacenters while being available for reads. It works well when RTT between datacenters is less than 30 ms. If they are more distant, it will still work, but you will notice a quite high latency on INSERTs due to the distributed consensus.

I've run a ClickHouse cluster with hundreds of bare-metal machines distributed across three datacenters in two countries at my previous job. It survived power failures (multiple), a flood (once), and network connectivity issues (regular). This cluster was used for logging and analytics :)
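
The availability rule described above (one of three datacenters down: reads and writes; two of three down: reads only) is what a majority quorum gives you. A minimal Python sketch, assuming one replica and one coordination node per datacenter, that writes need a strict majority of coordination nodes, and that any surviving replica can serve reads. This is a toy model for illustration, not ClickHouse's API, and `availability` is a hypothetical helper name.

```python
def availability(total_dcs: int, failed_dcs: int) -> dict:
    """Model read/write availability under a majority-quorum assumption."""
    if not 0 <= failed_dcs <= total_dcs:
        raise ValueError("failed_dcs must be between 0 and total_dcs")
    alive = total_dcs - failed_dcs
    majority = total_dcs // 2 + 1  # strict majority of coordination nodes
    return {
        # Assumption: INSERTs need consensus, so a majority must be reachable.
        "writes": alive >= majority,
        # Assumption: any surviving replica can answer SELECTs.
        "reads": alive >= 1,
    }


for failed in range(4):
    print(failed, availability(3, failed))
```

With three datacenters the majority is two, so losing one keeps writes available and losing two leaves reads only. The same arithmetic shows why a two-datacenter layout gains nothing for writes: its majority is also two, so a single failure blocks INSERTs.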


After reading this, we will be moving off Cloudflare. The post shows a lack of maturity and professionalism.


This is karma for their passive-aggressive blog post about Okta's security incident one week earlier


The guy I genuinely feel sorry for is "an unaccompanied technician who had only been on the job for a week".

Regardless of any and all corporate spin on the issue, a newbie was dumped into an event at the worst possible time.

I really really really hope he gets a decent bit of counselling to make sure that he is fully aware that the issue with the data centre had NOTHING to do with him unplugging the coffee maker to plug in his recharger for his iPhone. Absolutely nothing at all.


Whenever we design and build a data center (in addition to disaster recovery, business continuity, etc.), we always test all possible scenarios of power interruptions and never assume that the power utility will get in touch with us in advance of pre-planned maintenance, including the outcome AND the time to restore, etc. It’s kind of shocking that you just assumed the UPS would last 10 min but never put it to the test, especially with the size of CF!


Some days things just go badly. The only thing you can change is how you respond. Well done to you and the team for getting through this.

I for one am and will always be a cloudflare customer


For me it's basically summed up as "we didn't test turning the power off" and making sure things worked the way we planned.

Yes it is hard and very expensive to do these types of tests. And doing it regularly is even more $$$ and time.

Like most customers, we seem to be okay with a cheap price hidden behind a facade of "high availability", since I don't really want to pay for true HA. If I knew the real cost, it would be too expensive.


> It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators.

Are the data centers compensated or anything for this? I'd imagine generator-only might cost more in terms of fuel and wear-and-tear/maintenance/inspections.

edit:

> DSG allows the local utility to run a data center's generators to help supply additional power to the grid. In exchange, the power company helps maintain the generators and supplies fuel

Interesting.


I'm not very well versed in this space but I've been told Progressive Insurance in Cleveland, OH has a similar (sounding) agreement. According to PGE's website, they basically pay for everything https://portlandgeneral.com/save-money/save-money-business/d...


Taking the positive outlook …

Ensuring all of their services are fully distributed is now top of mind at CF.

Ultimately, customers win if CF executes.


At what time were you notified Matt?


Of the incident? Someone on my team called me about 30 minutes after it started. It was challenging for me to stay on top of because it was also the same day as our Q3 earnings call. But team kept me informed throughout the day. I helped where I could. And they handled a very difficult situation very well. That said, lots we can learn from and improve.


What I find bizarre is that the Cloudflare share price jumped when the outage happened!

Having read the post mortem, I do not think it could have been handled any better. I think the decision to extend the outage in order to provide rest was absolutely correct.

I always enjoy reading these reports from Cloudflare as they are the best in the business.


I was surprised we didn't get a single question about it from an analyst or investor, either formally on the Q3 call or on any callbacks we did after. One weird phenomenon we've seen — though not so much in this case because the impact wasn't as publicly exposed — is that investors after we've had a really bad outage say: "Oh, wow, I didn't fully appreciate how important you were until you took down most of the Internet." So… ¯\_(ツ)_/¯


There's a class of investor (and their trade bots presumably) that sees outrage over a service outage as proof the provider is now mission critical, hence able to "extract value" from the market.


Did you rebuild all the servers from scratch?


Coincidental timing?


Someone should make a web series about this incident. It will be a nice story to tell. Name: Modern Day Disaster. Directed by: Matthew Prince. Releasing on: 25th December on Netflix. Based on a true story.


Shockingly poor choices by CF.

We should absolutely blame them, just as "victims" of ransomware should be blamed. Hardening against system failure is the same process as security hardening.


Is it common for CloudFlare to publish documents or diagrams from their vendors clearly marked "Proprietary and Confidential"?


To preface: I am not qualified to talk about this in the slightest.

> Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04

What do you mean you discovered? How could you not know? Surely when you were setting this high availability cluster up years ago and migrating services over, you double checked that all crucial dependencies had also been moved, right? And surely, since you had been "implementing" this for four years now, you've TESTED what would happen if one of the three DCs went completely offline, right???


> And surely, since you had been "implementing" this for four years now, you've TESTED what would happen if one of the three DCs went completely offline, right???

They discussed this. They had been running tests where they disabled the high availability cluster in any of (and of two of) the three DCs. That test didn't involve disabling the rest of the (non-HA) services from PDX-04 DC (oops).
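
The gap described above (HA services that quietly depend on something living only in PDX-04) is the kind of thing a dependency audit can flag before a drill does. A minimal Python sketch, assuming you already have a service-to-dependencies map and a service-to-datacenters map. All service names and the `DC-B` / `DC-C` labels are made up for illustration; only PDX-04 comes from the post, and nothing here reflects Cloudflare's actual tooling.

```python
from collections import deque


def single_dc_dependencies(deps, locations, ha_services):
    """Map each HA service to its transitive dependencies that run in fewer than two DCs."""
    findings = {}
    for service in sorted(ha_services):
        seen = set()
        risky = set()
        queue = deque(deps.get(service, ()))
        while queue:  # breadth-first walk over transitive dependencies
            dep = queue.popleft()
            if dep in seen:
                continue
            seen.add(dep)
            # A dependency with no known location is flagged as well,
            # since an uncatalogued service is itself a risk.
            if len(locations.get(dep, ())) < 2:
                risky.add(dep)
            queue.extend(deps.get(dep, ()))
        if risky:
            findings[service] = sorted(risky)
    return findings


# Hypothetical inventory: everything is in three DCs except config-store.
all_dcs = {"PDX-04", "DC-B", "DC-C"}
deps = {"dashboard": ["api"], "api": ["auth", "config-store"]}
locations = {
    "dashboard": all_dcs,
    "api": all_dcs,
    "auth": all_dcs,
    "config-store": {"PDX-04"},
}
print(single_dc_dependencies(deps, locations, {"dashboard", "api"}))
```

In this made-up inventory both `dashboard` and `api` get flagged because of `config-store`, even though `dashboard` never references it directly. A static check like this only covers dependencies someone has catalogued, so it complements a real power-off test rather than replacing it.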


appreciate the status updates and the quick report. as a person that handles the tiniest datacenter, it's impossible to predict every potential event. best you can do is to recover as quickly as possible and learn the lesson.

believing that this doesn't or can't happen to another vendor is being naive.

it has happened to all of them and it'll happen again. can only hope it's super rare.


I truly and genuinely hope that this incident and the many before will drive away customers from Cloudflare's monopoly


True, not snippy: I found interesting that their automated billing emails seemed to arrive right on time.


HA, maximum redundancy


So, well-known failures. The same happened with one of our DCs 3 years ago, when they powered the city and the whole DC failed.


Cloudflare Inc share price has increased 15% as a result. How does it work?


TLDR: Major flooding, riots, earthquake, asteroid, nuke goes off in Portland and Cloudflare is down because they decided they could put their entire control infrastructure in a single location for ease of use.

Centralising the web is such a great move.


Consider modes of failure.


So the TL;DR is that they took a risk and never tested their high availability setup.


How you


hot take: HN is way too biased and sympathetic towards the provider whenever an outage like this happens.


Not a lot of measured takes here. It seems to be either “eh we get it, comms could have been better” or “they’re idiots”.

The first group of people have been to war. The second have not.


I am really upset about this situation on behalf of CF. However, why don't they think about generating their own electricity with renewable energy sources?


> why don't they think about generating their own electricity with renewable energy sources?

How exactly do you imagine that working while inside a data center operated by a third party?

It's not like they let you stick some solar panels on the roof and run an extension cord to your rack.


And why not run their own DC?

It doesn't change anything fundamentally. A complex product is only as good as the weakest link. I have worked with various employers, some world leaders at the time. All of them had seriously weak links.



