Worst Case (tbray.org)
226 points by robin_reala 54 days ago | 106 comments

This is a bizarre analysis. Public legal risk is absolutely the last imaginable threat to us-east-1, short of aliens abducting it. The U.S. security apparatus depends on AWS and would never allow it, Wall Street would never allow it, never mind the fact that Amazon itself would leverage every tool at its disposal to protect its reputation for reliability. The politicians involved in this scenario might seek to remove Amazon's competitive advantages, or fine them, but the people who understand what AWS even is would never consider a move to shut down a datacenter.

Both the "enemy action" and "operational failure" scenarios are much bigger risks than this article makes them out to be. Every non-aligned nation-state offensive cyber team has a knockout of us-east-1 at the top of their desired capabilities. I'm sure efforts range from recruiting Amazon employees to preparing physical sabotage to hoarding 0days in the infrastructure. There's no reason to think one of them wouldn't rock the boat if geopolitics dictated.

Operational failure is probably the most likely. AWS might have a decade of experience building resilience, but some events happen on longer timescales. A bug that silently corrupts data before checksumming and replication, and doesn't get noticed until almost every customer is borked; a vendor ships bad ECC RAM that fails after six months in the field and is already deployed to 10,000 servers; etc. Networking is hard, and an extended outage on the order of a week isn't completely impossible. How many customer systems can survive a week of downtime? How many customer businesses can?

> Amazon itself would leverage every tool at its disposal to protect its reputation for reliability.

This is a joke, right? The _real_ degradation map of us-east-1 over the last 5 years looks significantly worse than that of my non-UPS-backed home PC in Sweden.

Personally I'm not looking at us-east-1 as reliable at all; they even suffered a "hard drive crash" https://www.bleepingcomputer.com/news/technology/amazon-aws-...

I assume the author isn't asking "What if there was a 30 minute outage of us-east-1" (just wait for it to come back) or "What if there was an outage of a single AZ in us-east-1" (just spin things up in a different AZ)

Rather, they're asking "What if there was a 30 day outage of us-east-1" - so anyone who isn't multi-region or multi-cloud loses everything, including backups, AMIs, and control plane access.

(FWIW I agree with people disagreeing with the worry levels in the article - a solar storm last seen in 1859 is more likely than a software bug? Ha!)

Please don't mischaracterise the point: The solar storm is not more likely than a software bug. The effect of the solar storm is more likely than a similar effect by a software bug (particularly in the context of AWS).

Realistically, if AWS is down for 1+ day in that region, you will be wondering very much whether they're coming back in 1 more hour, 1 more day, 1 month, or at all.

Yeah, it's kind of "incredible" that they'd be gone for so long, but I've been on AWS during long outages before, and it's never certain how long it will go on or whether data will be kept.

We assume too much of our providers, I think, and I believe that's what the post is about. If you already _know_ AWS will be down for 30 days then you can make an informed decision based on that.

For me, there's no difference between 8 hours or 80, if you're down more than 8hrs I'm going to activate my redundancies. But: I have redundancies. Many people don't.

Curious: has AWS us-east-1 ever been down for more than 8 hours at a time?

"It hasn't happened since 1859" has nothing to do with whether it's likely to happen soon. You can get heads 162 times in a row when flipping a coin; it's not likely to happen, but it's possible. In this case the chance of it happening is not 50%, so us going ~162 years since the last one isn't that weird. That doesn't mean the chance is 0% - we could lose that coin flip this year.
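A quick back-of-the-envelope sketch of that difference (the 1%-per-year storm rate below is purely an assumed number for illustration, not a measured probability):

```python
# 162 heads in a row is astronomically unlikely, but a rare annual
# event accumulates real odds over 162 years.
p_heads_162 = 0.5 ** 162                      # effectively never (~1.7e-49)

p_yearly = 0.01                               # ASSUMED chance of a Carrington-class storm per year
p_at_least_one = 1 - (1 - p_yearly) ** 162    # chance of at least one in 162 years

print(f"{p_heads_162:.2e}")
print(f"{p_at_least_one:.2f}")                # ~0.80 under this assumption
```

The point being that "none observed since 1859" is entirely consistent with a small-but-real yearly probability.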

This is related to Tim's mention of climate change - lately, '100-year' and '10-year' storms are happening more often. It's possible the changes to our climate have turned those 100-year storms into 10-year storms, but we may not know until the observations and climate models catch up.

The solar storm would be more likely to, as you said, take down the region for a significant period of time. A software bug is more likely to happen, but less likely to have that big of an impact.

Surprisingly, this was published (though maybe not written) just after FB locked out all their employees with a bad config...

Yeah, the relative probability of scenarios may be wrong. Maybe even some of them are totally bogus. This doesn't undermine the main point of the article though.

I think the main point is that there's quite a lot of eggs in that basket and we should see this as a problem. Any single organization can think they have contingency plans for a big cloud region going permanently down. The problem is that when all of them try to execute their plans at once it won't work.

While your analysis is sound, I wouldn't disregard a Carrington type of event. It can happen again and we never know where. None of our current electronic infrastructure is hardened to handle this kind of solar storm/EMP. IPFS is a good direction for mitigating these kinds of centralised data risks.

Diversifying one's cloud/server provider is a good thing! Or simply don't rely only on the cloud. Storage devices are cheap nowadays; just keep a local backup and/or one in a different geographical location.

It needs to be put in context. If a Carrington type event takes place then much of the tech we take for granted will be offline. If you do GPS navigation it doesn't matter if route planning is offline because the satellites themselves might be broken. It doesn't matter if you sell makeup because the delivery drivers, planes, and boats will be unable to navigate adequately to deliver anything. It doesn't matter if you stream videogames because many ISPs may well be offline.

Off-topic (and I'm totally not an expert, please correct me if I'm wrong) but the real risk of a Carrington type of event is for the power grid, especially the destruction of equipment at the ends of power lines.

The idea is that those large transformers cannot be mass-produced and could be completely destroyed.

The cool hack is that physical disconnection in time avoids those damages.

At small scale the difference of potential is not enough to fear much physical damage.

Uh, IPFS doesn't have replication? It's garbage in terms of reliability as far as I can see.

Even 20 year old BitTorrent is a better option if that's the risk you're considering

> Bear in mind that Republicans hate Amazon because of Bezos’s Washington Post and because the whole tech industry is (somewhat correctly) perceived as progressive

Do Republicans actually hate Amazon all that much or do they just go on Fox News or Twitter and proclaim that they do? As much as they might complain about certain corporations, they don't seem to be at the top of the hit list.

I don’t think they hate Amazon, but they certainly don’t appreciate WaPo or NYT or fill-in-the-blank left-leaning media (read: most). How that translates to shutting down an Amazon data center just kind of shows how deranged people get when it comes to politics.

Trump seems to at once despise Bezos for bailing out the WP, but also admires and fears him, since Bezos' money makes Trump look poor in comparison, and the working conditions in Amazon warehouses and anti-union activity reflect the kind of autocratic values that Trump admires. It's pretty clear that Trump would love to be Jeff's friend, which gives Amazon even more leverage to survive any right-wing populist attack.

This section destroyed the author's credibility altogether in my mind. Seems the entire article was a setup to make a political attack.

The best case scenario for an operational failure is the loss of a single AZ, but many of the scenarios you described are things that could impact an entire cloud vendor simultaneously. As others have pointed out (in discussing the Carrington event and the recent Facebook outage), it's not a matter of if it will happen, but a matter of when. And then it's just a question of duration and scope of the impact.

At this point around half of the world's leasable compute is concentrated in fewer than 100 facilities, the locations of which can easily be found with a google search. Using public satellite imagery you can identify network connection points as well as follow power transmission lines. In a wartime scenario, these are industrial targets with astounding strategic value, a tiny geographic footprint, and limited collateral damage in terms of human life. The Nagasaki and Hiroshima of the future could simply be kinetic attacks against a couple of datacenters. I'm alarmed that nobody is prepared for this and the industry zeitgeist seems to be to continue the consolidation of our economies into the cloud.

I agree datacenters would clearly be targets, but our IT infrastructure is hardly alone in being a relatively concentrated strategic target.

If kinetics are in play, said actor could also destroy our oil refining and pipeline systems. Taking out a few dozen large baseload power generation facilities would have a massive impact on the grid.

Regarding "enemy action" - I wonder what percentage of AWS/GCP/Azure employees are also employed by foreign security services? Is it 0%, 1%, 5% ?

You have to assume it's >0%, for both foreign and domestic, and I'm certain the cloud providers assume that too.

I would love to see Wall Street prevent a Coronal Mass Ejection.

I was handed a policy document at my previous job that said: "In the event <CEO of provider> spits at <CEO of ours>, we must be able to migrate in 2 billing cycles".

The reason isn't the spit; it's to prevent being so locked in that the provider can rent-seek. But there's also an element of agility in that. If you can turn to AWS and say "Give us a good deal or we walk" and... you know, you actually can walk... then they're more likely to give you that deal.

This is the same reason broadband is fairly inexpensive in Sweden: the building will negotiate as a unit with an ISP and get a group deal. Bulk buying in this way gives a lot of negotiating power. So for a publisher to be able to move all titles to a new provider in 2 cycles: that's negotiating power.

If you can save 2 months of developer time in a year by switching providers, it behooves you to do it; and cloud resources are _expensive_. I recently specced out an instance for GitLab which is currently living on a VPS, and it went from $30 to nearly $800 on cloud. (Though admittedly that $800 variant would be more highly available.)

One thing you can also do is try to script a rebuild of your entire infrastructure to such an extent that you can make total rebuilds from scratch a normal part of your operating procedures.

Hypothetically, if you had 1 big script that could rebuild your entire org inside Azure/AWS within 4 hours, why wouldn't you consider doing this as a normal part of every weekend? If a meteor hits us-east-1, you wouldn't even have to think about it. It would just be another Saturday, perhaps with a quick region toggle in the redeploy script parameters.

Another advantage of weekly orbital nukings of infra is that it makes it very difficult for persistent threats to... persist.

Clearly, some resources cannot be re-allocated with such frequency. Domain registrations and maybe TLS certs being the biggest examples. Getting away from needing static IP addresses in your cloud provider can also remove a lot of shackles.
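A minimal sketch of such a rebuild driver. The layer names, region toggle, and step bodies here are all illustrative; in real life each step would shell out to your IaC tooling (Terraform, CloudFormation, etc.):

```python
from typing import Callable, List, Tuple

def rebuild(region: str, steps: List[Tuple[str, Callable[[str], None]]]) -> List[str]:
    """Run ordered rebuild steps, stopping on the first failure so a
    half-built environment is never silently left behind."""
    completed = []
    for name, step in steps:
        step(region)              # in real life: terraform apply, data restore, etc.
        completed.append(name)
    return completed

# Hypothetical layers, ordered by dependency: network first, data last.
log: List[str] = []
steps = [
    ("network", lambda r: log.append(f"vpc in {r}")),
    ("compute", lambda r: log.append(f"asg in {r}")),
    ("data",    lambda r: log.append(f"db restore in {r}")),
]
completed = rebuild("us-west-2", steps)   # region toggle: not us-east-1 today
print(completed)
```

The value isn't the script itself; it's that running it every weekend keeps it honest, the same way a backup is only real once a restore has been tested.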

I wish more businesses operated this way. Doesn’t mean you can’t use AWS managed offerings of open-API/open-source products (e.g. EKS, RDS, S3).

The strategy provides for healthy friction and drives customer obsession by the vendor.

I don't. In reality it can be incredibly difficult and expensive to switch all your infrastructure over to a new cloud provider, and hybrid / multi cloud is expensive and complicated to design and operate.

> multi cloud is expensive and complicated to design and operate.

Depending on your scale, not being multi-cloud can be more expensive. i.e. You can save more money than it costs you to migrate (by getting a discount from your current provider.) When negotiating, you need to be able to walk away, for real.

> The reason isn't the spit, it's to prevent being so locked in that the provider can rent seek;

As a point of minor pedantry, your cloud provider cannot "rent seek" just because you stepped into a vendor lock-in situation; that's not what the phrase "rent seeking" means. An example of actual rent seeking would be if the US government legally required you to use AWS if you were doing business in some highly regulated sector. Being locked into a particular vendor's offering is definitely not a great situation, and the vendor can most definitely use the leverage to their own profit; it's just that this is not rent-seeking.

Unless that migration is constantly tested, it's not going to happen in two billing cycles.

Cloud providers aren't Comcast. You don't get discounts for threatening to move, because both providers and customers know the costs are impossibly high.

Which is why a lot of their pitch is that you'll never be in a situation where you'd have to move (some of course are better here; one cloud is trying very hard to overcome their parent company's reputation), otherwise no sane executive would sign up.

The discounts you do get from clouds come from promising to move more work to them.

Two billing cycles is two months. For a small company, or a medium-sized company with "modern" apps and app deployment processes, that's somewhere between a nothing-burger and an inconvenient weekend (during which the engineers would get paid OT, hopefully!). For large companies with >$100k/month spend in AWS and sizeable legacy application assets, that's a death march.

I'm the last person to claim any knowledge in this area, but isn't this analysis completely missing the possibility of a more accidental, fundamental, lower-level OS or networking bug that cascades? The whole stack presumably is built on the same underlying frameworks and TCP/UDP/IP protocols, so if there's a precarious update or an unexposed time- or memory-contingent bug, that'd surely be quite damaging? I know he speaks of resilience and says that "they’ve seen everything imaginable that could go wrong", but that just seems like hubris, no?

Also, with the FB BGP disaster we saw an example of how their resilience/red-team/etc. learnings failed to highlight how hard recovery from a real-world outage would be with respect to something as basic as building access. Plus the hilarious fact that widespread network outages make difficult the kinds of cross-location/timezone communication that would be needed in order to collaborate and apply fixes. FB teams experienced this apparently. The tools they rely on to communicate obviously relied on the assumption that such a network failure could never occur. They were left relying on non-internet comms and non-FB platforms.

To claim that Amazon is simply too experienced to let such things occur seems quite arrogant and naive.

This type of failure could conceivably knock us-east-1 down, but I think it could be pretty easy to recover from (precarious update? roll back!). I think Bray is considering total obliteration of us-east-1.

Yeh fair. I wonder what duration of outage would cause a spiral of the mentioned economic and social effects. I'm sure a fix made within a few days would not set the course of the economy on a different path but it's interesting to consider what duration or degree of an outage would.

Though there's a lot of food for thought in this article, it's obviously not written by a security expert.

He vastly underestimates the "Enemy Action" risk. It won't be a DDoS attack that takes down us-east-1; it would be a Stuxnet level of direct targeting, using a combination of internal knowledge and nonlinear physical properties. And while he's right that professional hackers would have no interest in any Amazon data services (the time investment/reward is just not there), he gives at the very beginning of the article a clear reason for state actors to go after it - it would cause a lot of economic damage, which is how wars are fought these days.

Not saying it has any chance of happening anytime soon, but his idea of "Enemy Action" is limited.

At any rate, it's bad business to put your eggs into one basket, and while it's less friction to use CloudFormation, DynamoDB, or anything that locks you into AWS - it's far better to make a system that can be deployed anywhere, at least when you start getting big enough that it matters.

I think it would be the mother lode for a ransomware gang. They would have many extortion opportunities. Pay us to get access back to your servers, pay us or we delete your data, pay us or we leak your internal data, pay us or we delete/leak your customers' data.

I'm sure that AWS has some of the greatest cybersecurity out there. But the massive potential payoff makes it worth trying some easy attacks against them. Spending millions of dollars of labor to research and pull off an attack is likely only for nation states, but ransomware gangs should be walking by and testing the locks all day, every day.

Sure, but you never want to be US public enemy number 1. Any large-scale attacks on these big DCs themselves would be treated as national security threats/terrorist attacks since, as laid out here, a large chunk of the US economy is reliant on us-east-1. No matter how much money you get, you'd have to be in Russia and sponsored by the state itself to carry out these attacks if you wanted to remain free for longer than a few months (in which it could be considered an act of war).

I guess the best way to do this without attempting a total shutdown of the dc (while still making off with $xx millions) would be to select a thousand customers, encrypt the hard drives that make up their data redundancy (live, backup, and sharded copies of the data), then ransom that. The only way this doesn't work is if they have all of it in a tape backup, but depending on how much you encrypt, that might be impractical for them to restore if it would cause significant downtime for those customers - and that could be mitigated by selecting petabytes of super-recent data that likely hasn't been backed up to tape yet.

Some of the ransomware actors would not be afraid of being labeled "US public enemy number 1" - for example, North Korea is running some operations, and they would really like to extract a hefty ransom in addition to hurting USA as Amazon's revenue is something like 10x the North Korean GDP.

That would be a declaration of war, which is why ransomware by some cash-strapped group of hackers is generally not an attack vector: taking us-east-1 offline would be seen as terrorism, and the US would dedicate enormous resources to bringing such actors to justice. It'll always be easier to attack random medium-large companies' office ops, which are likely manned by zero or underskilled IT security personnel (at least in current_year). Even for some place like Russia, the attackers would either need to be state-sponsored, or Russia would avoid war by performing a rare non-treaty-bound extradition.

I specifically used NK as an example because it is already doing ransomware attacks (though not on the same scale), and while perhaps it might technically/legally be treated as a "declaration of war", it is obviously not being treated that way. This would not be a novel thing; this would be more of the same, just a bit larger target and larger impact. You could also look at all the other cases of state-sponsored malware causing damage; while technically those might be considered acts of war, the precedent is that none of the cases have ever been treated by the victimized countries as such in practice. E.g. perhaps Iran complained about Stuxnet diplomatically, but it's not something that escalated to "kinetic action".

And even if it would, so what? It's not like USA is lacking some casus belli to attack NK; the major factors of whether some military action is worthwhile or not would stay the same after such a hack. This would work to deter Russia, who wants to be integrated in trade, but countries which already are isolated and/or already treated as hostile (for example, Iran) wouldn't care; if USA wanted a war there, then refraining from such a hack would not prevent it, and if USA doesn't consider a war there as profitable, then doing some hacks would not be treated as a larger threat than e.g. nuclear weapons development, so it wouldn't even be a significant escalation in the current bad relationships.

I imagine any significant state actor already has multiple employees working inside AWS. I'm not an expert on state security apparatuses or anything, but that just seems so potentially valuable that I can't fathom why they wouldn't.

This is all built on Quinn's assertion that us-east1 outage means a global depression, but if anything us-east1 has a well-deserved reputation for being the flakiest AWS region and there have been repeated severe outages, most recently in 2017 (S3 basically down) and Nov 2020 (Kinesis and all services relying on it down):


Sure, we haven't had a total outage where everything is offline for days on end, but I don't rate any of the scenarios as being particularly likely. Even things like geomagnetic storms and nuclear EMP would mostly impact power transmission, not the data centers themselves, especially if there's advance notice.

> (...) if anything us-east1 has a well-deserved reputation for being the flakiest AWS region and there have been repeated severe outages, most recently in 2017 (S3 basically down) and Nov 2020 (Kinesis and all services relying on it down)

I'm not sure S3 or Kinesis outages, or any other higher-level service and serverless offerings, count as outages in this context, especially as they are quickly recoverable.

I believe the scenario involves something like, say, all EC2 dying or no traffic getting in or out for a few weeks. Think Katrina but on us-east-1.

S3 is about as low level as it gets, if S3 is having a bad day a lot of things will come to a grinding halt.

Yet while S3 going down in 2017 generated a lot of headlines, there was no systemic impact like (say) the stock market tanking. Which leads me to think that most (not all, but most) really critical stuff will have backups outside us-east1.

There's a sentence in here that really gets to the value of AWS for businesses:

> So while your customers and employees are going to be mad, they’re also going to be distracted from worrying about your downtime.

This is incredibly important for business. If we go with AWS and it goes down, then it's big news and our customers blame AWS. If we choose a different provider, even one which is more reliable than AWS, then our customers blame us.

This reasoning obviously doesn't apply to all systems e.g. air-traffic control or stock exchanges, but it does for the vast majority of businesses, and certainly any business that would host their systems in the cloud in the first place.

There's something to that. OTOH I can think of businesses that earned customer loyalty by better planning for availability, e.g. H-E-B supermarkets early in the pandemic. Transcending blame-oriented thinking can get rewarded.

It's the new "nobody got fired for choosing IBM".

This seems to miss a couple of fairly likely scenarios.

1) us-east-1 goes down, and a whole bunch of important AWS backplane goes down with it, so it's impossible to bring everything back up again in other regions.

2) us-east-1 goes down, and the thundering herd stomps all over the other regions making it impossible to use them.

3) either scenario 1 or 2 happen, and another thundering herd of people who've made plans to stand everything up in Azure/GCP take those down as well.

As far as we know publicly, AWS regions are very separated, and even if us-east-1 is entirely dead, the impact on the others should be minimal and mostly cosmetic (I hope they've fixed the status page that was hosted in S3 in us-east-1).

I’ve had problems in ap-southeast-2 not being able to provision resources when us-east-1 is having problems. Stuff that’s running works fine. Turning on new stuff, not so much. From memory auto scaling has failed me at least once in that situation too… or maybe that was spot fleets not rebalancing?

The understanding is that the services which you use in the "global" region are hosted in us-east-1: CloudFront, Route 53, IAM. So outages in us-east-1 can cripple a lot more than you'd think. Your EC2 probably still runs, but I would not count on AutoScaling or being able to log in and do admin tasks. I'm sure AWS has also thought about it, and the good part about us-east-1 is that it has 5 or more zones, so it's harder to take it down.

Certain API endpoints are not considered part of the availability guarantees. You might be able to verify and create IAM tokens, but not create new IAM roles, new S3 buckets etc. AWS sometimes advises that these operations shouldn't be done automatically, but not everybody listens.

Well, at the end of the day in this situation it is still humans that thought up the idea to separate the regions. I presume it is also humans who want to keep us-east-1 alive. Neither of those conditions (us-east-1 going down or the cross-DC contamination) is intentional.

If us-east-1 goes dead because of a condition that humans failed to anticipate, then it stands to reason that the other regions are actually dependent on use1 in a way humans didn't anticipate.

As I understand it, in this case a scenario is a reason why us-east-1 goes down, not the consequences.

I tend to think that, in response to the title rather than the body of the article, the actual worst case scenario for single-source failure in the tech world is that an invasion of Taiwan or something else jeopardizes TSMC's manufacturing. This would easily be a COVID-sized event. I know there's a lot of news about countries trying to spin up semiconductor plants locally to help defend sovereignty against this possibility, but no one is anywhere near ready for it to happen.

The two natural disasters I’m most worried about are:

1. A Carrington-scale solar storm. This was talked about in the article. We have not had any significant solar storms affect earth in the last 150 years and basically all of our electrical, computer and communications infrastructure was built since then. Grid operators are not optimistic about the idea of starting up the entire electrical grid from nothing as it has never all gone down before.

2. A repeat of the recent historic activity in the New Madrid seismic zone. This is a place where historical accounts talk about aftershocks lasting for months but in more recent times it was not so seismically active so eg building codes don’t sufficiently prepare for this. Fortunately there is not that much economic activity there but there is still a lot.

I guess number 3 would be the “big one” in San Francisco. Though the fire risk isn’t so bad nowadays I think.

Read A Dangerous Place by Marc Reisner for a pretty fascinating scenario of a major quake on the SF Bay Area's other major fault system, the Hayward Fault.


Its impact on both the Bay region and the Delta, and by extension much of the Central Valley and Los Angeles, would be immense. As has been noted for the past 50 years, a major event is long overdue.

I've long been concerned with the likelihood of a serious pandemic, something even more transmissible than covid, with a CFR 10-100x higher (1-10%).

> If it were me in my ideal world, I’d have copies of everything stored in S3 because of its exceptional durability; I sincerely believe there is no safer place on the planet to save data. Then I’d have a series of scripts that would rehydrate all my databases and config from S3, reconfigure all my code, and fire up my applications. I’d test this script regularly; any more than a few weeks untested and I’d lose confidence that it’d work.

for the vast majority of applications, this advice buried at the end is all you really need to do to survive anything other than total planetary annihilation (by then i'd worry about customer "churn", to put it bluntly). "all" is a strong word of course, i've never really worked at any place that tested backup recovery to that degree of regularity.
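A sketch of what "test the rehydration script regularly" can look like at its simplest: restore from a manifest and verify checksums. A local directory stands in for S3 here (the real thing would use boto3 downloads instead of file copies; all names are illustrative):

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def sha256(p: Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

def restore(backup_dir: Path, target_dir: Path) -> bool:
    """Rehydrate files listed in manifest.json and verify checksums.
    In real life backup_dir would be an S3 bucket read via boto3."""
    manifest = json.loads((backup_dir / "manifest.json").read_text())
    for name, digest in manifest.items():
        dst = target_dir / name
        shutil.copy(backup_dir / name, dst)
        if sha256(dst) != digest:      # corrupted backup: fail loudly
            return False
    return True

# Simulated regular restore drill
with tempfile.TemporaryDirectory() as tmp:
    backup, target = Path(tmp) / "backup", Path(tmp) / "target"
    backup.mkdir(); target.mkdir()
    (backup / "app.cfg").write_text("db_host=replica")
    (backup / "manifest.json").write_text(
        json.dumps({"app.cfg": sha256(backup / "app.cfg")}))
    ok = restore(backup, target)
    print("restore verified:", ok)
```

Running something like this on a schedule is what turns "I'd lose confidence that it'd work" into an alert in your inbox.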

I have written a few documents outlining some risks and possible mitigations against AWS problems.

It's not just "Russia nuked North Virginia". That's far from the most likely failure scenario, in my estimation. I consider "billing problem causes Amazon to close our account" or "trusted employee with top-level access deletes everything in our account right down to Glacier storage" or "employee with access severely violates AWS T&Cs and gets us booted off" as being far more realistic threats, while effectively having the same consequences as a couple of ICBMs with MIRVs targeted at every datacenter in NV...

In any of those cases, expecting to rehydrate from S3 isn't an option.

Of course, nobody wants to incur the costs associated with "How could we keep business continuity without AWS?" because it gets very expensive very quickly for anything non trivial.

> Of course, nobody wants to incur the costs associated with "How could we keep business continuity without AWS?" because it gets very expensive very quickly for anything non trivial.

I think that mostly depends on how much data you generate in a day. If sending backups out increases your bandwidth use by 5% then it's pretty easy to throw that into cold storage on google or azure or local or all three.

It's not just the data (that's an easy enough problem to solve).

It's the infrastructure.

Even if you've only got a reasonably simple platform, say some redundant EC2 app servers behind a load balancer and a multi-AZ RDS database, with some S3 storage and a CloudFront distribution serving static assets - you've probably also got Route53 DNS hosting, ACM SSL certs, deployment AMIs, CloudWatch monitoring/alerting, and a bunch of other "incidental" but effectively proprietary AWS stuff - because it's there.

How do you get all that stood up "right now" in Azure or GCP or DigitalOcean or wherever, unless you've already put the time/effort into making that happen?

How many "single points of failure" are locked inside your AWS account? (For my stuff, Route53 is the thing that keeps me up at night occasionally. If we lost access to our domains registered/hosted in AWS, we'd need to pick new domain names and update all our apps...)

I guess it depends on how much you need "right now".

It doesn't take crazy amounts of effort to set up app servers, a load balancer, a database, and S3-compatible storage somewhere else.

If you had one person working on that two days a month, you could keep a warm fallback system ready to go. That requires keeping a map of what your cloud services are actually doing, which is a good idea anyway.

Interesting threat vector. Would the CEO having sole access to a Cloudflare R2 clone satisfy at least data durability? Since full continuity without AWS is probably impossible for most real-life scenarios.

If the CEO has sole access to it, how do the backups get there?

Having different read/write permissions is an important durability consideration. In this case the CEO has read privs (not write) and the archive utilities have write privs, not read.

There are other problems of course, CEO being hit by a bus being the most obvious.

> In this case the CEO has read privs (not write) and the archive utilities have write privs, not read.

Not sure you could really make this work.

The archive utilities would also need to be barred from overwriting or deleting backups. If that's automated, who configures it? And how do you do differential backups without read privileges?

The archive utility, when run as the backup user, cannot read the file but it can read file metadata or have a database of backup state so it knows where to send differential backups.

The archive utility, run as the recover user, can only read the file.
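In AWS terms, the backup user's side of that split could look roughly like the IAM policy below; the bucket name and action list are illustrative, and a real setup would also lock down bucket-policy changes. The recover user's policy is the mirror image (Allow s3:GetObject, no s3:PutObject):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteOnlyBackups",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::example-backups/*"
    },
    {
      "Sid": "NeverReadOrDelete",
      "Effect": "Deny",
      "Action": ["s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::example-backups/*"
    }
  ]
}
```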

Ehhhh. I had a good reminder that S3 is only as good as its supporting interfaces a few years ago. I needed to create some test buckets with production data for dev work while I was getting a project off the ground. Sure, I could have made a new, separate read-only username just for the task and all that, but it was a simple task and I was crunched for time, so I logged into the web GUI with the account admin login, made the copy, ran my tests, and deleted it. No problem, I thought. I got a VM a couple of days later from their support team saying they had reason to believe they’d lost some of my data: apparently the GUI delete had a (quickly detected, fixed, and proactively addressed) bug that was deleting entries with the same relative key in every vaguely similarly named bucket, including my production bucket. Luckily it was the beginning of the transition and I still had local copies.

Anyone could validly make the argument that this is user error and sloppy ops work, but that’s almost always the case in data loss events... be it unverified backups, abusing root, etc.

I still think there’s value in diversifying both in vendor and physical location.

The article mentions that us-east-1 is the largest region. That's not the only reason why its failure would be so catastrophic. It is the original AWS region and a snowflake among regions. That in itself doesn't have a clear direct consequence--what does is that AWS itself depends on many services specifically in us-east-1. Losing us-east-1 could mean failure of many services outside of us-east-1 as well.

If you were planning a multi-datacenter strategy, it should also minimize the number of AWS offerings used, or expand to be multi-cloud.

This article reminds me in particular of my long-open question about how to prepare for solar storms. But now I have a new question. If the way to prepare my electronics for a solar storm is to put them in a cardboard- and aluminium-foil-lined faraday cage -- what happens to all the hard disks in the cloud? Do any cloud providers have offline tape backups in faraday cages?

This isn't just about recovering after how many ever days or weeks without power. It's whether you have any data left _to_ restore. Somebody tell me I don't need to worry.

Solar storms don't work like that. It's not the movie-like "electronics blow up left and right" thing; it's a "quasi-DC currents in the ground" thing.

What that means is that the power grid is designed for AC currents, and now there will be DC currents flowing through it, induced by the flexing of the Earth's magnetic field under the storm. That is very likely to saturate the cores in large scale transformers, causing them to blow up.

No more power grid. And these are not things you have spares for or can ship from China. You need to make new ones, in a country with spotty power and complete supply chain breakdown.

So your electronics would be fine, just out of power. Faraday cages are for the nuclear bomb type EMP events.

The writer talks about all the possible ways this "worst case" would happen, but there's not much about the "worst case" itself?

I'd like to know more about the nuances of this blanket statement:

>The best case outcomes closely resembled a global depression.

I read an article a few years back about a company who'd had their AWS root account hacked and held for ransom [1] and even though they had backups and snapshots and multi-AZ replication the attacker could destroy the primary and the backups at the same time, because they were all in the same AWS account.

If the writers are imagining something so severe it could trigger a global depression, they're probably thinking of something with that kind of impact hitting everyone in the region.

Such an event would be very unlikely, you would hope - but I'm sure you could pull it off with the resources of a nation-state and half a dozen sleeper agents.

[1] https://www.infoworld.com/article/2608076/murder-in-the-amaz...

> even though they had backups and snapshots and multi-AZ replication the attacker could destroy the primary and the backups at the same time, because they were all in the same AWS account.

This was back in 2014, but AWS now has ways to fix this (if you do insist on keeping stuff in AWS) with Object Lock[0]. The expense is that, with object lock in a compliance state, the only way to delete it is to close the AWS account (which is why MFA delete[1] is recommended).

0: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object...

1: https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiF...
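For reference, a default-retention Object Lock configuration (the document passed to `aws s3api put-object-lock-configuration`) is small; the 30-day period here is an illustrative choice, and COMPLIANCE mode means not even the root user can shorten it once set:

```json
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Days": 30
    }
  }
}
```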

I was hoping for more of an exploration of how to deal with the worst case, rather than the admittedly arbitrary set of "worst case scenarios".

At the end, there are all of two paragraphs on how you should use multiple regions, use CloudFormation/Terraform, and store the data in S3. Surely there is more to it, and I'd hope Tim would focus on that side of the story more?

Of course, the analysis is fine in its own light if you are trying to establish the risk of us-east-1 really going down — though I'd like to read such an analysis from someone willing to insure your operations against such a failure — but an analysis covering more of the survival phase would equally apply to other AWS regions and other cloud providers.

"How to deal" is addressed by the "backup in production" active-active model described toward the end of the article.

Personally, I think war is a likely risk. If an adversary wanted to take out the US, the most likely attacks would be asymmetrical ones designed to take out the US economy backing the US military. There would be coordinated attacks at critical nodes like this one.

I don't know if this would be a conventional weapon launched from abroad, a conventional weapon launched from within (if I were an adversary, I might start to secretly build conventional weapons within US factories), a cyberattack, infiltration/sabotage, or something else. But I can think of a dozen ways one could do it with nation-state level resources, for a fraction of the cost of the damage done.

A disgruntled employee (or a few) seems the most likely scenario for something that could cause an extended outage.

> A disgruntled employee (or a few) seems the most likely scenario for something that could cause an extended outage.

That's highly unlikely given that a region like us-east-1 comprises about 6 availability zones, which are essentially independent data centers that are physically separate.

Thus for your scenario to take place, you would need each availability zone to have a single point of failure, and at least a team of 6 saboteurs with privileged access to each of the 6 data centers working together.

Even so, AWS 101 regarding well architected services states that reliability is achieved by having multiple regional deployments.

Looking at the recent FB outage (which happened by accident) shows that it may well be possible.

I don't have specific numbers, but I think each AZ comprises many data centers. us-east-1 is a very large region, and I don't think any single AZ fits in one building.

Enough people have access to all 6 physically. I'm sure Bezos can walk in unannounced to any facility.

Single point of failure can be software written by said disgruntled employee.

It seems like a vanishingly small chance that us-east-1 will have a total meltdown across its 6 AZ's.

But us-east-1 appears to be the most problematic region (instance launch issues, API issues, etc), so for that reason alone, I wouldn't build a new deployment there. If I wanted proximity to the east coast, I'd probably go with us-east-2 (Ohio) with a backup in ca-central-1.

I'm not worried about surviving a meteor strike or Carrington Event, not much point in keeping a company alive when its employees and customers are struggling to survive.

> It seems like a vanishingly small chance that us-east-1 will have a total meltdown across its 6 AZ's.

Yeah - it's not the sort of thing that you expect to happen often, it's not like it went down twice last month or anything...


(We have a policy here to never use us-east-1 - it's got to be the least reliable AWS region by quite a substantial margin.)

> Sophie Schmieg is a high-level cryptography/security Googler, and Knows What She’s Talking About. She refers to the Carrington Event, a major solar storm (“Coronal Mass Ejection” they say) that happened in 1859, and severely disrupted the world’s telegraph system for about eight hours. This is an example of a Solar proton event. If/when one happens, it’s going to seriously suck for astronouts and for anyone who who depends on aerial radio-frequency communications.

TL;DR: the data center itself is probably quite safe from whatever the author meant here.

This is not entirely correct. The author confuses different things (and misspells the word "astronaut").

* A solar energetic particle event (SEP) is the most dangerous thing for astronauts, in part because there's going to be very little advance warning, but also because these high-energy particles can penetrate the inadequate shielding of a spacecraft and fry electronics (physically disrupting microchips) or damage biological tissues. But that's a non-issue for the rest of us: ground-level enhancements due to SEP events are rare.

* The Carrington event's main source was a (probably double) coronal mass ejection, or CME. This is a different phenomenon from an SEP. You can think of it this way: a CME is like a dam breaking down and you getting flooded by huge amounts of relatively slowly moving water masses, while an SEP is like getting shot at by a water cannon accelerating a relatively tiny stream to supersonic speeds. But the thing is that the particles in the CME again do not penetrate the magnetosphere and reach the atmosphere directly. Rather, the increased pressure of the solar wind causes changes in the magnetosphere's shape, which (a) "squeezes" field lines to lower latitudes, (b) causes something called "reconnection" in the magnetosphere tail, essentially slinging particles coasting along on the magnetic field lines in the tail back to the Earth (and causing auroras as they slam into the atmosphere), and (c) strengthens the ring current, which causes a depression in the magnetic field, and this (changing magnetic field) in turn induces substantial currents in extended conductors such as telegraph lines or metal pipelines. But I imagine that this is going to be harmless to electronics; the main vulnerability is actually the power lines feeding the data centers: volatility of the magnetic field during even the biggest storms is on the order of several hundred nanoTeslas.

Source: am a space scientist.

How would you, in theory, shield your infrastructure from a CME?

”Could AWS survive a political move that forced a us-east-1 shutdown? I would be watching which PACs I donated money to…”

This feels a little out of touch. Oh, the privilege! Pity the poor plebes who only get to choose by casting votes, instead of by picking which fundraiser luncheons to attend!

The labor analysis with Amazon instantly caving in to a strike is also hilarious

Their history shows they do not do that at all for business even more vital to them than AWS - it could be weeks until the zone is up, and disgruntled enough workers may be willing to sabotage the infrastructure as well so you cannot even migrate.

> business even more vital to them than AWS

AWS is totally the most vital biz to Amazon. It's the cash cow, and the one with tight contracts to meet.

> disgruntled enough workers may be willing to sabotage the infrastructure as well so you cannot even migrate.

Fast way for a union to get no support from the internet-using population.

Who else gets slightly suspicious when reading this:

> I’m not going to say this could never happen. But I’d be shocked. AWS has been doing cloud at scale for longer than anyone, they have the most experience, and they’ve seen everything imaginable that could go wrong, most things multiple times, and are really good at learning from errors.

I thought the vast majority of failures seem to be born out of software and operational errors and almost all of the other proposed problems seem to be solvable by avoiding these?

I run a globally redundant setup: GCP in Europe, Iowa, and Asia, with fallback to AWS in Asia, IONOS in the central US, and 2x home fiber in the EU.

It costs about $30/month for the cloud part, and $80/month for the home fibers that I need anyway.

Problem solved!

Great, if it's encrypted. Wouldn't feel great if any of those nation-states could pull a hard drive out and read it all.

I'm surprised Tim didn't talk about squirrels or rodents eating through wires, as that's definitely taken out segments of datacenters and interconnects.

> "because the whole tech industry is (somewhat correctly) perceived as progressive"

Proceeds to describe various Republicans as "guttersnipes" and "goons" while envisioning an utterly ridiculous scenario in which they forcibly shut down us-east-1 for political gain.

People are welcome to their political opinions, but comments like these do not lend authors an air of credibility. In fact, they do the opposite.

It's a personal blog, he doesn't have to pretend to be neutral. Also, he is playing with his cards laying face-up:

> But then, I’m on the respectable left of the Canadian political spectrum, which makes me a raving Commie by US standards.

Sigh. This article would be so much more compelling without the childish political name-calling.

Is anyone else reading this and going "Yes, ha ha ha! Yes!", or am I just a sicko?

>> suppose Trump wins the Republican nomination for 2024, and runs on a rabble-rousing campaign of Revenge For The Steal, and explicitly rallies the Proud Boys, Oath Keepers, Sovereign Citizens, Three Percenters, Groypers, and police unions, telling them, “We can’t lose in a fair election, so if we do, let’s not let them steal it again.”

This is exactly what's gonna happen. We're living the last 3 years of normal life; enjoy it while you can.

All it takes is one large drone with a briefcase nuke to take out everything. Or if they wanted to do it more covertly with a more plausible cover story, a weather balloon. It wouldn’t even have to take out the infrastructure with physical damage, the EMP from the airburst detonation from anywhere in the vicinity (5-500km up, although directly above at 1-5km would be ideal) would be enough to fry most circuits in the complex.

So something like this would be eminently possible for any sufficiently-motivated nation-state coughCHINAcough to pull off.

Why would it be in China's interest to do this? The Chinese politburo would definitely not like a world war right now; they know that it would halt the flow of global trade and would also fuck them up immediately. Unless they are planning to mobilize their entire country towards war (which is not happening), I'm not seeing this happen in the near future.

(Unless a major economic depression happens, leading to economic instability and unrest in both China and the US, which leads them into nationalism and military mobilization and eventually war... History repeats itself yada yada.)

The AZs are generally spaced far enough apart to withstand this. While rumor was that this is specifically done with your scenario in mind, it may also be a happy coincidence for grid and other infrastructure resilience.

This is covered by the “you’ll have bigger problems” argument.
