Both the "enemy action" and "operational failure" scenarios are much bigger risks than this article makes them out to be. Every non-aligned nation-state offensive cyber team has a knockout of us-east-1 at the top of its list of desired capabilities. I'm sure efforts range from recruiting Amazon employees to preparing physical sabotage to hoarding 0-days in the infrastructure. There's no reason to think one of them wouldn't rock the boat if geopolitics dictated.
Operational failure is probably the most likely. AWS might have a decade of experience building resilience, but some events happen on longer timescales. A bug that silently corrupts data before checksums and replication kick in and doesn't get noticed until almost every customer is borked; a vendor ships bad ECC RAM that fails after six months in the field and is already deployed to 10,000 servers; etc. Networking is hard, and an extended outage on the order of a week isn't completely impossible. How many customer systems can survive a week of downtime? How many customer businesses can?
This is a joke, right? The _real_ degradation map of us-east-1 over the last 5 years looks significantly worse than my non-UPS-backed home PC in Sweden.
Personally I'm not looking at us-east-1 as reliable at all; they even suffered a "harddrive crash" https://www.bleepingcomputer.com/news/technology/amazon-aws-...
Rather, they're asking "What if there was a 30 day outage of us-east-1" - so anyone who isn't multi-region or multi-cloud loses everything, including backups, AMIs, and control plane access.
(FWIW I agree with people disagreeing with the worry levels in the article - a solar storm last seen in 1859 is more likely than a software bug? Ha!)
Yeah, it's kind of "incredible" that they'd be gone for so long, but I've been on AWS during long outages before, and it's never certain how long it will go on or whether data will be preserved.
We assume too much of our providers, I think, and I believe that's what the post is about. If you already _know_ AWS will be down for 30 days, then you can make an informed decision based on that.
For me, there's no difference between 8 hours and 80; if you're down more than 8 hours, I'm going to activate my redundancies. But: I have redundancies. Many people don't.
This is related to Tim's mention of climate change - lately, '100-year' and '10-year' storms are happening more often. It's possible the changes to our climate have turned those 100-year storms into 10-year storms, but we may not know until the observations and climate models catch up.
I think the main point is that there's quite a lot of eggs in that basket and we should see this as a problem. Any single organization can think they have contingency plans for a big cloud region going permanently down. The problem is that when all of them try to execute their plans at once it won't work.
Diversifying one's cloud/server provider is a good thing! Or simply don't rely only on the cloud. Storage devices are cheap nowadays; just keep a local backup and/or one in a different geographic location.
The idea is that those large transformers cannot be mass-produced and could be completely destroyed.
The cool hack is that physically disconnecting them in time avoids that damage.
At small scales the potential difference isn't enough to cause much physical damage.
Even 20-year-old BitTorrent is a better option if that's the risk you're considering.
Do Republicans actually hate Amazon all that much or do they just go on Fox News or Twitter and proclaim that they do? As much as they might complain about certain corporations, they don't seem to be at the top of the hit list.
At this point around half of the world's leasable compute is concentrated in fewer than 100 facilities, the locations of which can easily be found with a Google search. Using public satellite imagery you can identify network connection points as well as follow power transmission lines. In a wartime scenario, these are industrial targets with astounding strategic value, a tiny geographic footprint, and limited collateral damage in terms of human life. The Nagasaki and Hiroshima of the future could simply be kinetic attacks against a couple of datacenters. I'm alarmed that nobody is prepared for this and that the industry zeitgeist seems to be to continue consolidating our economies into the cloud.
If kinetics are in play, said actor could also destroy our oil refining and pipeline systems. Taking out a few dozen large baseload power generation facilities would have a massive impact on the grid.
The reason isn't the split itself, it's to prevent being so locked in that the provider can rent-seek; but there's also an element of agility in that. If you can turn to AWS and say "Give us a good deal or we walk" and... you know, you actually can walk... then they're more likely to do that.
This is the same reason broadband is fairly inexpensive in Sweden: the building will negotiate as a unit with an ISP and get a group deal. Bulk buying in this way gives a lot of negotiating power. So for a publisher to be able to move all titles to a new provider in 2 cycles: that's negotiating power.
If you can save 2 months of developer time in a year by switching providers, it behooves you to do it; and cloud resources are _expensive_. I recently specced out an instance for GitLab which is currently living on a VPS, and it went from $30 to nearly $800 on cloud. (Though admittedly that $800 variant would be more highly available.)
Hypothetically, if you had 1 big script that could rebuild your entire org inside Azure/AWS within 4 hours, why wouldn't you consider doing this as a normal part of every weekend? If a meteor hits us-east-1, you wouldn't even have to think about it. It would just be another Saturday, perhaps with a quick region toggle in the redeploy script parameters.
Another advantage of weekly orbital nukings of your infra is that it makes it very difficult for persistent threats to... persist.
Clearly, some resources cannot be re-allocated with such frequency. Domain registrations and maybe TLS certs being the biggest examples. Getting away from needing static IP addresses in your cloud provider can also remove a lot of shackles.
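For the curious, the weekend rebuild really can be a thin wrapper: re-run your infrastructure-as-code against a parameterised region, then smoke-test the result. A rough Python sketch of that shape (the Terraform working directory, variable name and health-check URL are hypothetical, not from the comment above):

    import subprocess
    import sys
    import urllib.request

    def rebuild(region: str, workdir: str = "infra/") -> None:
        # Re-apply the infrastructure-as-code with the region as a variable.
        subprocess.run(["terraform", "init"], cwd=workdir, check=True)
        subprocess.run(
            ["terraform", "apply", "-auto-approve", f"-var=region={region}"],
            cwd=workdir,
            check=True,
        )

    def smoke_test(url: str) -> bool:
        # Minimal health check against the freshly rebuilt stack.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    if __name__ == "__main__":
        target = sys.argv[1] if len(sys.argv) > 1 else "us-east-2"
        rebuild(target)
        print("healthy" if smoke_test("https://example.com/healthz") else "unhealthy")

The point is less the script itself and more that running it every Saturday is what keeps it honest.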
The strategy provides for healthy friction and drives customer obsession by the vendor.
Depending on your scale, not being multi-cloud can be more expensive. i.e. You can save more money than it costs you to migrate (by getting a discount from your current provider.) When negotiating, you need to be able to walk away, for real.
As a minor point of pedantry, your cloud provider cannot "rent seek" just because you stepped into a vendor lock-in situation; that's not what the phrase "rent seeking" means. An example of actual rent seeking would be the US government legally requiring you to use AWS if you were doing business in some highly regulated sector. Being locked into a particular vendor's offering is definitely not a great situation, and the vendor can most definitely use the leverage to their own profit; it's just that this is not rent-seeking.
Cloud providers aren't Comcast. You don't get discounts for threatening to move, because both providers and customers know the switching costs are impossibly high.
Which is why a lot of their pitch is that you'll never be in a situation where you'd have to move (some of course are better here; one cloud is trying very hard to overcome their parent company's reputation), otherwise no sane executive would sign up.
The discounts you do get from clouds are for promising to move more work to them.
Also, with the FB BGP disaster we saw an example of how their resilience/red-team/etc. learnings failed to highlight how hard recovery from a real-world outage would be with respect to something as basic as building access. Plus the hilarious fact that widespread network outages make it difficult to do the kinds of cross-location/timezone communication needed to collaborate and apply fixes. FB teams apparently experienced this. The tools they rely on to communicate obviously relied on the assumption that such a network failure could never occur. They were left relying on non-internet comms and non-FB platforms.
To claim that Amazon is simply too experienced to let such things occur seems quite arrogant and naive.
The article vastly underestimates the "Enemy Action" risk. It won't be a DDoS attack that takes down us-east-1; it would be Stuxnet-level direct targeting, using a combination of internal knowledge and nonlinear physical properties. And while the author is right that professional hackers would have no interest in any Amazon data services (the time investment/reward is just not there), he gives, at the very beginning of the article, a clear reason for state actors to go after it: it would cause a lot of economic damage, which is how wars are fought these days.
Not saying it has any chance of happening anytime soon, but his idea of "Enemy Action" is limited.
At any rate, it's bad business to put all your eggs into one basket, and while it's less friction to use CloudFormation, DynamoDB, or anything else that locks you into AWS, it's far better to build a system that can be deployed anywhere, at least once you start getting big enough that it matters.
I'm sure that AWS has some of the greatest cybersecurity out there. But the potential massive cash opportunities mean attackers will try the easy attacks against them anyway. Spending millions of dollars of labor to research and pull off an attack is likely only for nation states, but ransomware gangs should be walking by and testing the locks all day, every day.
I guess the best way to do this without attempting a total shutdown of the DC (while still making off with $xx millions) would be to select a thousand customers, encrypt the hard drives that make up their data redundancy (live, backup, and sharded copies of the data), then ransom that. The only way this doesn't work is if they have all of it on tape backup, but depending on how much you encrypt, it might be impractical for them to restore if it would cause significant downtime for those customers - and that could be mitigated by selecting petabytes of super-recent data that likely hasn't been backed up to tape yet.
And even if it would, so what? It's not like the USA is lacking a casus belli to attack NK; the major factors determining whether some military action is worthwhile would stay the same after such a hack. This would work to deter Russia, which wants to be integrated in trade, but countries that are already isolated and/or already treated as hostile (for example, Iran) wouldn't care. If the USA wanted a war there, then refraining from such a hack would not prevent it, and if the USA doesn't consider a war there profitable, then doing some hacks would not be treated as a larger threat than e.g. nuclear weapons development, so it wouldn't even be a significant escalation in the current bad relationships.
Sure, we haven't had a total outage where everything is offline for days on end, but I don't rate any of the scenarios as being particularly likely. Even things like geomagnetic storms and nuclear EMP would mostly impact power transmission, not the data centers themselves, especially if there's advance notice.
I'm not sure S3 or Kinesis outages, or those of any other higher-level service and serverless offerings, count as outages in this context, especially as they are quickly recoverable.
I believe the scenario involves something like, say, all EC2 dying or no traffic getting in or out for a few weeks. Think Katrina but on us-east-1.
Yet while S3 going down in 2017 generated a lot of headlines, there was no systemic impact like (say) the stock market tanking. Which leads me to think that most (not all, but most) really critical stuff has backups outside us-east-1.
> So while your customers and employees are going to be mad, they’re also going to be distracted from worrying about your downtime.
This is incredibly important for business. If we go with AWS and it goes down, then it's big news and our customers blame AWS. If we choose a different provider, even one which is more reliable than AWS, then our customers blame us.
This reasoning obviously doesn't apply to all systems e.g. air-traffic control or stock exchanges, but it does for the vast majority of businesses, and certainly any business that would host their systems in the cloud in the first place.
1) us-east-1 goes down, and a whole bunch of important AWS backplane goes down with it, so it's impossible to bring everything back up again in other regions.
2) us-east-1 goes down, and the thundering herd stomps all over the other regions making it impossible to use them.
3) either scenario 1 or 2 happen, and another thundering herd of people who've made plans to stand everything up in Azure/GCP take those down as well.
If us-east-1 goes dead because of a condition that humans failed to anticipate, then it stands to reason that the other regions are actually dependent on us-east-1 in a way humans didn't anticipate.
1. A Carrington-scale solar storm. This was talked about in the article. We have not had a solar storm of that magnitude affect Earth in the last 150 years, and basically all of our electrical, computer and communications infrastructure was built since then. Grid operators are not optimistic about the idea of starting up the entire electrical grid from nothing, as it has never all gone down before.
2. A repeat of the historic 1811-1812 activity in the New Madrid seismic zone. This is a place where historical accounts talk about aftershocks lasting for months, but in more recent times it has not been so seismically active, so e.g. building codes don't sufficiently prepare for this. Fortunately there is not that much economic activity there, but there is still a lot.
I guess number 3 would be the “big one” in San Francisco. Though the fire risk isn’t so bad nowadays I think.
Its impact on both the Bay region and the Delta, and by extension much of the Central Valley and Los Angeles, would be immense. As has been noted for the past 50 years, a major event is long overdue.
For the vast majority of applications, this advice buried at the end is all you really need to survive anything other than total planetary annihilation (by then I'd worry about customer "churn", to put it bluntly). "All" is a strong word of course; I've never really worked at any place that tested backup recovery with that degree of regularity.
It's not just "Russia nuked Northern Virginia". That's far from the most likely failure scenario, in my estimation. I consider "billing problem causes Amazon to close our account" or "trusted employee with top-level access deletes everything in our account right down to Glacier storage" or "employee with access severely violates AWS T&Cs and gets us booted off" as far more realistic threats, while having effectively the same consequences as a couple of ICBMs with MIRVs targeted at every datacenter in NV...
In any of those cases, expecting to rehydrate from S3 isn't an option.
Of course, nobody wants to incur the costs associated with "How could we keep business continuity without AWS?" because it gets very expensive very quickly for anything non-trivial.
I think that mostly depends on how much data you generate in a day. If sending backups out increases your bandwidth use by 5% then it's pretty easy to throw that into cold storage on google or azure or local or all three.
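The mechanics of that are almost embarrassingly small; here's a hedged boto3 sketch of pushing the same backup artifact to S3 and to a second S3-compatible provider via a custom endpoint (bucket names, keys, endpoint and credentials are all made up for illustration):

    import boto3

    # Primary copy stays in AWS S3; second copy goes to any S3-compatible API.
    aws = boto3.client("s3")
    other = boto3.client(
        "s3",
        endpoint_url="https://objects.example-provider.com",
        aws_access_key_id="OTHER_PROVIDER_KEY",
        aws_secret_access_key="OTHER_PROVIDER_SECRET",
    )

    def replicate_backup(path: str, key: str) -> None:
        # Same artifact, two independent providers.
        aws.upload_file(path, "my-primary-backups", key)
        other.upload_file(path, "my-offsite-backups", key)

    replicate_backup("/backups/db-2024-01-01.dump", "db/2024-01-01.dump")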
It's the infrastructure.
Even if you've only got a reasonably simple platform - say some redundant EC2 app servers behind a load balancer and a multi-AZ RDS database, with some S3 storage and a CloudFront distribution serving static assets - you've probably also got Route53 DNS hosting, ACM SSL certs, deployment AMIs, CloudWatch monitoring/alerting, and a bunch of other "incidental" but effectively proprietary AWS stuff - because it's there.
How do you get all that stood up "right now" in Azure or GCP or DigitalOcean or wherever, unless you've already put the time/effort into making that happen?
How many "single points of failure" are locked inside your AWS account? (For my stuff, Route53 is the thing that keeps me up at night occasionally. If we lost access to our domains registered/hosted in AWS, we'd need to pick new domain names and update all our apps...)
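(One cheap mitigation is to regularly export the zone data out of the account so the DNS layout could at least be recreated elsewhere; a rough boto3 sketch, with the output filename being arbitrary:)

    import json
    import boto3

    r53 = boto3.client("route53")

    def export_zones(outfile: str = "route53-export.json") -> None:
        # Dump every hosted zone and its record sets so the DNS layout can be
        # recreated at another DNS host if the account becomes unreachable.
        dump = []
        for zone in r53.list_hosted_zones()["HostedZones"]:
            records = []
            paginator = r53.get_paginator("list_resource_record_sets")
            for page in paginator.paginate(HostedZoneId=zone["Id"]):
                records.extend(page["ResourceRecordSets"])
            dump.append({"zone": zone["Name"], "records": records})
        with open(outfile, "w") as f:
            json.dump(dump, f, indent=2, default=str)

    export_zones()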
It doesn't take crazy amounts of effort to set up app servers, a load balancer, a database, and S3-compatible storage somewhere else.
If you had one person working on that two days a month, you could keep a warm fallback system ready to go. It takes some effort to keep a map of what your cloud services are actually doing, but that's a good idea anyway.
There are other problems of course, CEO being hit by a bus being the most obvious.
Not sure you could really make this work.
The archive utilities would also need to be barred from overwriting or deleting backups. If that's automated, who configures it? And how do you do differential backups without read privileges?
The archive utility, run as the recover user, can only read the file.
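Something along these lines, as a hedged sketch; the user, bucket and policy names are hypothetical, the point is just that the recover identity gets list/read and nothing else:

    import json
    import boto3

    iam = boto3.client("iam")

    # The "recover" user may list and read the backup bucket, nothing else:
    # no PutObject, no DeleteObject, so it cannot overwrite or remove backups.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::my-backup-bucket",
                "arn:aws:s3:::my-backup-bucket/*",
            ],
        }],
    }

    iam.put_user_policy(
        UserName="recover",
        PolicyName="backups-read-only",
        PolicyDocument=json.dumps(policy),
    )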
Anyone could validly make the argument that this is user error and sloppy ops work — but that's almost always the case in data loss events... be it unverified backups, abusing root, etc.
I still think there’s value in diversifying both in vendor and physical location.
If you were planning a multi-datacenter strategy, it should also minimize the number of AWS offerings used, or expand to be multi-cloud.
This isn't just about recovering after how many ever days or weeks without power. It's whether you have any data left _to_ restore. Somebody tell me I don't need to worry.
What that means is that the power grid is designed for AC currents, and now there will be DC currents flowing through it, induced by the flexing of the Earth's magnetic field under the storm. That is very likely to saturate the cores in large scale transformers, causing them to blow up.
No more power grid. And these are not things you have spares for or can ship from China. You need to make new ones, in a country with spotty power and complete supply chain breakdown.
So your electronics would be fine, just out of power. Faraday cages are for the nuclear bomb type EMP events.
I'd like to know more about the nuances of this blanket statement:
>The best case outcomes closely resembled a global depression.
If the writers are imagining something so severe it could trigger a global depression, they're probably thinking of something with that kind of impact hitting everyone in the region.
Such an event would be very unlikely, you would hope - but I'm sure you could pull it off with the resources of a nation-state and half a dozen sleeper agents.
This was back in 2014, but AWS now has a way to fix this (if you do insist on keeping stuff in AWS) with Object Lock. The expense is that, with Object Lock in compliance mode, the only way to delete the data before its retention period expires is to close the AWS account (which is why MFA delete is recommended).
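For reference, a rough boto3 sketch of what setting that up looks like (bucket name, key and the 30-day retention window are illustrative):

    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")

    # Object Lock must be enabled at bucket creation time.
    s3.create_bucket(Bucket="my-locked-backups", ObjectLockEnabledForBucket=True)

    # Default retention: COMPLIANCE mode, 30 days. Nobody (root included) can
    # delete or overwrite locked object versions until retention expires.
    s3.put_object_lock_configuration(
        Bucket="my-locked-backups",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
        },
    )

    # Individual objects can also carry an explicit retain-until date.
    s3.put_object(
        Bucket="my-locked-backups",
        Key="db/2024-01-01.dump",
        Body=b"backup bytes here",
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=30),
    )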
At the end, there's a whole two paragraphs on how you should use multiple regions, use CloudFormation/Terraform, and store the data in S3. Surely there is more to it, and I'd hope Tim would focus on that side of the story more?
Of course, the analysis is fine in its own light if you are trying to establish the risk of us-east-1 really going down — though I'd like to read such an analysis from someone willing to insure your operations against such a failure — but an analysis covering more of the survival phase would equally apply to other AWS regions and other cloud providers.
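To gesture at what that survival-phase side could cover: the S3 part alone is mostly versioning plus a replication rule into a second region, roughly like this sketch (bucket names and the IAM role ARN are placeholders; the role has to let S3 read the source and write the destination):

    import boto3

    s3 = boto3.client("s3")

    # Replication requires versioning on both source and destination buckets.
    for bucket in ("app-data-use1", "app-data-usw2"):
        s3.put_bucket_versioning(
            Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
        )

    # Replicate everything from the us-east-1 bucket into the us-west-2 bucket.
    s3.put_bucket_replication(
        Bucket="app-data-use1",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication",
            "Rules": [{
                "ID": "to-usw2",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::app-data-usw2"},
            }],
        },
    )

The data side is the easy half; the compute, DNS, and control-plane pieces are where the real work hides.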
I don't know if this would be a conventional weapon launched from abroad, a conventional weapon launched from within (if I were an adversary, I might start to secretly build conventional weapons within US factories), a cyberattack, infiltration/sabotage, or something else. But I can think of a dozen ways one could do it with nation-state level resources, for a fraction of the cost of the damage done.
That's highly unlikely given that a region like us-east-1 comprises about 6 availability zones, which are essentially independent data centers that are physically separate.
Thus for your scenario to take place, you would need each availability zone to have a single point of failure, and at least a team of 6 saboteurs with privileged access to each of the 6 data centers working together.
Even so, AWS 101 regarding well-architected services states that reliability is achieved by having multiple regional deployments.
But us-east-1 appears to be the most problematic region (instance launch issues, API issues, etc), so for that reason alone, I wouldn't build a new deployment there. If I wanted proximity to the east coast, I'd probably go with us-east-2 (Ohio) with a backup in ca-central-1.
I'm not worried about surviving a meteor strike or Carrington Event, not much point in keeping a company alive when its employees and customers are struggling to survive.
Yeah - it's not the sort of thing that you expect to happen often, it's not like it went down twice last month or anything...
(We have a policy here to never use us-east-1 - it's got to be the least reliable AWS region by quite a substantial margin.)
TL;DR: the data center itself is probably quite safe from whatever the author meant here.
This is not entirely correct. The author confuses different things (and misspells the word "astronaut").
* A solar energetic particle (SEP) event is the most dangerous thing for astronauts, partly because there's going to be very little advance warning, but also because these high-energy particles can penetrate the inadequate shielding of a spacecraft and fry electronics (physically disrupting microchips) or damage biological tissue. But that's a non-issue for the rest of us: ground-level enhancements due to SEP events are rare.
* The Carrington event's main source was a (probably double) coronal mass ejection, or CME. This is a different phenomenon from an SEP. You can think of it this way: a CME is like a dam breaking and you getting flooded by huge amounts of relatively slowly moving water, while an SEP is like getting shot at by a water cannon accelerating a relatively tiny stream to supersonic speeds. But the thing is that the particles in a CME again do not penetrate the magnetosphere and reach the atmosphere directly. Rather, the increased pressure of the solar wind changes the magnetosphere's shape, which (a) "squeezes" field lines to lower latitudes, (b) causes something called "reconnection" in the magnetosphere's tail, essentially slinging particles coasting along the magnetic field lines in the tail back toward the Earth (and causing auroras as they slam into the atmosphere), and (c) strengthens the ring current, which causes a depression in the magnetic field, and this changing magnetic field in turn induces substantial currents in extended conductors such as telegraph lines or metal pipelines. But I imagine this is going to be harmless to electronics; the main vulnerability is actually the power lines feeding the data centers: the variation of the magnetic field during even the biggest storms is on the order of several hundred nanotesla.
Source: am a space scientist.
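To put a crude, purely illustrative number on the "substantial currents in extended conductors" point (the loop size and rate of change below are assumed for the example, not measurements):

    \mathcal{E} \approx A \,\frac{dB}{dt}
      \approx (100\ \mathrm{km})^2 \times \frac{500\ \mathrm{nT}}{60\ \mathrm{s}}
      \approx 10^{10}\ \mathrm{m^2} \times 8\times 10^{-9}\ \mathrm{T/s}
      \approx 80\ \mathrm{V}

A few tens of volts of quasi-DC bias across very low-resistance transmission lines is already enough to drive the transformer core saturation mentioned upthread, even though the field change itself is only hundreds of nanotesla.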
This feels a little out of touch. Oh, the privilege! Pity the poor plebes who only get to choose by casting votes, instead of by picking which fundraiser luncheons to attend!
Their history shows they do not do that at all, even for businesses more vital to them than AWS - it could be weeks until the zone is back up, and sufficiently disgruntled workers may be willing to sabotage the infrastructure as well, so you cannot even migrate.
AWS is totally the most vital business to Amazon. It's the cash cow, and the one with tight contracts to meet.
> disgruntled enough workers may be willing to sabotage the infrastructure as well so you cannot even migrate.
Fast way for a union to get no support from the internet-using population.
> I’m not going to say this could never happen. But I’d be shocked. AWS has been doing cloud at scale for longer than anyone, they have the most experience, and they’ve seen everything imaginable that could go wrong, most things multiple times, and are really good at learning from errors.
I thought the vast majority of failures were born of software and operational errors, and almost all of the other proposed problems seem solvable by avoiding those?
It costs about $30/month for the cloud part, and $80/month for the home fibers that I need anyway.
Proceeds to describe various Republicans as "guttersnipes" and "goons" while envisioning an utterly ridiculous scenario in which they forcibly shut down us-east-1 for political gain.
People are welcome to their political opinions, but comments like these do not lend authors an air of credibility. In fact, they do the opposite.
> But then, I’m on the respectable left of the Canadian political spectrum, which makes me a raving Commie by US standards.
This is exactly what's gonna happen. We're living the last 3 years of normal life; enjoy it while you can.
So something like this would be eminently possible for any sufficiently-motivated nation-state coughCHINAcough to pull off.
(Unless a major economic depression happens, leading to economic instability and unrest in both China and the US, which leads them into nationalism and military mobilization and eventually war... History repeats itself yada yada.)