Is Amazon's cloud service too big to fail? (fnlondon.com)
182 points by azureel on Aug 2, 2017 | 156 comments

This is (I was surprised) a pretty good article. Financial services are regulated, and based on recent experience, regulators are concerned with systemic risk. Most industries do not have anyone responsible for worrying about this kind of thing.

It seems reasonable to start worrying about the fragility potentially introduced by these massive internet infrastructure companies.

If you wanted to blow something up to make the west suffer, an AWS datacenter would probably be a pretty good target. I wonder at what point that becomes a legitimate national security concern, and the government steps in to provide protection.

Yes, this goes to show that there are ~zero terrorists in the US. The US has an almost infinite supply of soft targets, but since 9/11 there have been no attacks of any consequence. There have been a few minor attacks like Boston bombers, but those have all been laughably amateurish.

The US has no terrorism problem. In 1979, the IRA killed the Queen's uncle-in-law, and in 1984 they blew up a hotel where Thatcher was staying. That's a serious terrorism problem! What the US has is nothing by comparison. We were unlucky on 9/11, and ever since we've been distorting our foreign policy out of unreasonable fear.

legitimate national security concern, and the government steps in to provide protection

There are a LOT of far softer targets that go unprotected. A terrorist attack on a sewage plant for a major city would be far more devastating than knocking out a few websites.

I feel like the point of the article we're commenting on is that AWS is far from "a few websites".

You can live without Netflix much longer than you can without a flushing toilet, is the point I'm making. Yet we don't have armed guards patrolling the sewage works... It's a matter of priority how finite security personnel are deployed.

"Yet we don't have armed guards patrolling the sewageworks"

There's probably a moderate amount of unseen security in larger areas surrounding water/sewage plants.

Also, we've dealt with regional natural disasters before. Drought, fire, flooding, etc - we have experience mobilizing resources and redistributing as needed.

We've never dealt with an extended outage of one or more AWS data centers for days at a time. How many govt/university systems would be unable to function because of direct or indirect dependencies on AWS-related services?

S3 going down for, let's say, 4 days would wreak havoc on so many projects and systems I know of.

I'm pretty sure most people have no clue how much of their data and systems functionality is reliant on AWS-related services.

How about whatever is running on AWS GovCloud?

Water outages happen often enough - they distribute water manually with tanker trucks. When sewage is overwhelmed it overflows into a river and they tell people to not go for a swim. https://en.wikipedia.org/wiki/Sanitary_sewer_overflow

GovCloud has stuff that would be bad if it goes down, but most institutions that use it have a DR plan.

If AWS went down hard, it would be more than just "I can't watch Netflix".

Some people wouldn't be able to do their computer work or send and receive emails; others might not receive their paychecks or be able to pay bills.

That still ranks lower than "sewage treatment gone". Maslow's pyramid still applies.

The sewage plant hits both physiological needs and safety needs (it's a pretty dire health threat quickly). AWS has weak tendrils to financial safety, but mostly affects layers above that.

The old historical lesson applies. People will say they're rebelling for various reasons. But at the end of the day it's for land, water, or food.

Not receiving your paycheck could be worse. There are ways to deal with water outages; worst case, you let sewers overflow into rivers, as discussed in another comment. If salaries at large companies went unpaid, that would be far more severe for a lot of people.

>would be more than just "I can't watch Netflix".

And, I think even more than we can imagine. It's one thing to count the number of services that are direct customers/dependents of AWS, and we really don't know how deep that goes.

But, add to that the non-AWS based services that directly or indirectly rely on AWS-based services.

I used to work at a company that had a large cage in a local data center - one of the other cages in the same room was apparently for the local police force (we weren't supposed to know this!).

It wouldn't surprise me if those police systems have moved, or are moving, to Azure or AWS data centers.

Police would probably still work fine without data centers for a few days. As long as they can communicate and 911 works, they can do most of their job in the short run.

It's much worse if benefits or paychecks don't get paid. For many people that means that they won't be able to buy groceries as soon as a few days later. Not the standard HN crowd, but many people depend on regular paychecks and can be in trouble if it comes even one day late.

As long as they can communicate

Funny thing with the police: in the 60s there was flooding and they couldn't coordinate a response, so radio amateurs had to step in and provide comms. Those enthusiasts are still around under the name RAYNET. The police eventually got their act together and invested in a system called TETRA, which is self-contained and independent of any other infrastructure, so they would never be caught with their pants down again.

Now they've ditched TETRA to save money and they just run over the 4G network like anyone else...

But if you attack a massive sewage treatment plant and completely disable it, you can affect maybe... 10 million people, if we're being generous?

If you hit AWS, just Netflix alone is going to affect 100 million people. There are definitely a few payroll providers with some reliance on it, so people all over are going to stop receiving their paycheques. Many businesses will be temporarily disabled.

On a national level we've got a lot more eggs concentrated in the AWS baskets than we do the sewage treatment baskets.

The thing is the dispatch system for those security guards may well be running on AWS or other cloud - once they become uberized, even more likely.

To the extent that this article is correct, financial services would be affected.

Knocking out all of AWS is very different from knocking out a single data center.

Especially because AWS regions are broken up into multiple availability zones (data centres in the same area). So taking out a single data centre won't do much if AWS customers have correctly designed their systems for high availability (i.e. having redundant instances in other AZs/regions with their data backed up elsewhere).

A single availability zone can be spread across many physical data centres.

According to this Rackspace article, the largest AZ has 5 data centres.


Take this for what it is - I'm a software developer who works in the cloud, not a cloud expert.

The abstraction behind AZs is that every AZ counts for at least 1 data centre. So every region has at least 2 AZs, and every AZ means at least 1 data centre (or 5, per your link).

This just means it's harder to take out a whole region by destroying individual data centres. Since most regions consist of 2-5 AZs, and AZs can consist of 5+ data centres, that means destroying dozens of data centres.
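To make the parent's arithmetic explicit, here's a trivial sketch. The AZ and data-centre counts are illustrative assumptions from this thread, not published AWS figures:

```python
# Rough back-of-envelope: how many physical data centres would an
# attacker need to destroy to take out an entire AWS region?
# Counts below are assumptions from the discussion, not AWS data.

def dcs_to_destroy(azs_per_region, dcs_per_az):
    """Every data centre in every AZ must go down to kill the region."""
    return azs_per_region * dcs_per_az

# A small region: 2 AZs of ~5 DCs each; a large one: 5 AZs of ~5 DCs.
print(dcs_to_destroy(2, 5))  # smallest case: 10 data centres
print(dcs_to_destroy(5, 5))  # larger case: 25 data centres
```

Even the minimal case already requires coordinated destruction of around ten separate physical sites.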

That's a big if. The S3 outage in February this year showed that there are a lot of sites that aren't designed for this.

The S3 outage was across the whole US-EAST-1 region, not just one data centre.

I know my company and a few others that are redundant within a region (ie if one AZ goes down), but not if a whole region goes down.

This all assumes that "taking out" datacenters is a physical/hardware operation.

When you widen the potential attack surface to include software vulnerabilities, unauthorized access, process flaws and other "soft" vectors, a much wider--possibly coordinated--attack that is potentially far more crippling can be imagined.

The question is how load would behave if more than 1 AZ fails for several days. If 2 out of 5 AZs in US-East are down for a week, everyone would distribute to the other 3 (and probably on all three to be sure). I'm not sure if they have enough spare capacity to handle that. The AZ model is designed for single failures and failures that are a few hours, not days.

Of course just speculation, I have no knowledge about DR plans at Amazon.
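The load-redistribution worry above can be quantified with a one-liner. All figures are illustrative; real AZ capacities and load shares aren't public:

```python
# If a region's load is spread evenly over N AZs and some fail, the
# survivors absorb the displaced traffic. Illustrative numbers only.

def load_per_surviving_az(total_azs, failed_azs, load_per_az=1.0):
    total_load = total_azs * load_per_az
    return total_load / (total_azs - failed_azs)

# 2 of 5 AZs down: each survivor runs at ~1.67x normal load, meaning
# every AZ would need ~67% spare capacity to absorb it gracefully.
print(round(load_per_surviving_az(5, 2), 2))
```

Whether that much headroom exists region-wide for days at a time is exactly the open question.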

Granted, it was only for a few hours, but they had a huge outage in February this year (https://aws.amazon.com/message/41926/). This affected pretty much all us-east customers (which are most AWS customers), so it should be possible to extrapolate from there. Although I think a lot of customers learned the importance of replication that day, so the next outage might not be as bad.

I get what you're saying.

However, we actually have evidence that a sewage plant for a major western city can suffer a sudden catastrophic failure[1] and it won't even make it above the fold.

[1] https://projects.seattletimes.com/2017/west-point/

A great soft target is the electricity grid. The lead times for manufacturing large transformers, for example, can be many months, and the grid usually doesn't have as much redundancy as you would expect. A semi-coordinated attack using dynamite to take out a number of pylons along main HV lines (easily accessible because they're out in rural areas), combined with a few sticks thrown over fences at transformers, could massively impact the US economy for months.

The Economist had a pretty good article about that. It dealt with electromagnetic pulses, as those could take out several at once:


Wouldn't you have to blow up at least all the datacenters in a region to make an impact?

There are probably some people careless enough that destroying a single data centre would damage them. Most serious contenders (e.g. My employer) have fully redundant active-active setups across regions, so you'd have to engineer a massive outage to take out real shops.

Of course, if you completely remove an entire AWS region you might induce very damaging stresses on other regions as people fail over.

I think the number of applications running on AWS in a non-HA manner is very high, so a single datacenter going offline would have some impact.

All you need to blow is a few cables.

Why blow up anything or damage any cables? Hack the computer of an Amazon employee and do your damage there. The last S3 outage was because someone fat-fingered a script; imagine what someone who really wanted to mess stuff up could do.

This. Could've saved myself a comment elsewhere on this thread.

But, yeah, I'm kinda' surprised that this HN crowd in particular is so focused on hardware vectors.

And ultimately it leads you back to targeting the IT operations centre for the business, where they provision equally to, say, both AWS and Azure for redundancy. At that point you can knock out all of the business's cloud capacity.

Well, you'd need to disable redundant ISP or power cables to all datacenters in a region, and that would probably be pretty easy to recover from in a day or two.

I imagine Amazon has some on site security measures as well.

It would be interesting to see how important services would cope with their main region going down for more than a few days.

Not that I think it is wise to discuss optimising a terrorist activity on a public forum, though it is interesting from a threat analysis point of view.

You don't need to sever both power and data cables. For power, the datacentre should be able to cope for a while, particularly if it has access to fuel deliveries. Data cables should be both easier to sever and more difficult to recover from (not least because of having to identify which cable has been cut where).

Most of a country's cable infrastructure runs along railway tracks, sewers, etc. This means thousands of kilometers of cable even for a small country, and it is impossible to secure everything. It is impossible to sever everything either, but you don't need to sever every single cable. As long as you sever enough of the backbone, the other cables will be overloaded.

So I don't think it would take that much effort for a network of terrorists to create havoc in the communications of a country for at least a few days to a week. And the consequences for the economy could be pretty dire. We saw with BA what happens when their datacentre goes offline: their whole fleet is grounded. I imagine the consequences of a country-wide outage could be pretty dramatic. It's unlikely anyone would die, but you could really dent the GDP.

But if that's your methodology, you'd need to cut several cables in different locations simultaneously. A single cable doesn't take too long to fix (a couple hours, tops) and there are probably several redundant backups to handle bandwidth in the interim.

Then all that's needed to repair is a few cables.

fight club update anyone?

Amazon's own data centers are so tiny (compared to most) that taking one out would almost be a waste of time. Even one of their large colocations would probably be a waste.

If you wanted that type of destruction, and be noticed, you would need a city leveler type event in a data center heavy area.

Most people wouldn't even notice. They'd have to blow up a region and at that point, you have bigger problems.

I really hate the "too big to fail" meme and I strongly agree with Bernie in that if you are too big to fail you are too big to exist.

That should be the priority.

I don't know what Bernie's specific plans were/are but it's not unusual. At least, at first blush, this is a lot of people's reactions. Too-big-to-fail = danger = make it smaller.

Realistically, this has not been the solution implemented (in the EU and US, at least). In the EU it is even more crucial, as the "solutions" to this problem are applied to state finances as well as financial institutions.

In terms of policies, there are two competing approaches: (1) Reduce the size of "too-big-to-fail" institutions. (2) Regulate them more heavily (or some other strategy) so that they will not fail. In the EU, this is being applied to states, not just financial institutions. Rules that (supposedly) reduce catastrophic risk.

Almost all serious policy proposals are in the no. 2 category: tighten regulation, reduce the risk of failure. But tighter regulation tends to produce stronger incumbents and larger average company size, so by doing 2 you are probably doing the opposite of 1.

As I said, I don't know what Bernie's proposal is or how mature it is as a policy (as opposed to a politician's statement). It would be notable if a left-wing politician proposed loosening bank regulations, though definitely not impossible or unreasonable.

Paul Volcker's solution was that any time a bank is so big that we need to bail it out to avoid systemic problems, then the bank should be broken up. We can't always see all the problems in advance, but we can break them up in the aftermath.

Agreed; it seems like the software approach as well. You wouldn't want a class or a piece of code to be "too big to fail"; you'd refactor it into smaller pieces which can be reviewed more easily.

It's really a basic engineering principle. Smaller and more distributed systems are more reliable.

This sounds like the microkernel / monolith debate. Except monoliths kind of won.

It depends on what you want to optimize for. micro is good for some things, mono is good for other things. Neither is inherently "better", it just depends on your goal

> I really hate the "too big to fail" meme and I strongly agree with Bernie in that if you are too big to fail you are too big to exist.

That seems like the most reasonable response. And yet, since the great recession, our policy has been "make 'too big to fail' even bigger".

The problem is that the banks have become too powerful for anyone to challenge. A Teddy Roosevelt type of political leader can't exist today.

Corporate entities were radically more powerful in Teddy Roosevelt's day, far beyond what they are today. They were almost entirely unchained in regards to economic power, whereas today there is hyper regulation (tens of thousands of pages of it, including direct Fed control over the banking system).

Standard Oil was as powerful as the US Government in that era, as were the railroads. JP Morgan was far more powerful than the US Treasury. Cornelius Vanderbilt - pre Roosevelt - all by himself had greater financial capability at his peak than the US Government at the time.

The difference is that back then there was widespread and growing fear of the combinations and would-be monopolies. Today, Americans are relatively unconcerned by Microsoft (desktop), Google (search, Android), Walmart & Amazon (retail), Intel (microprocessors), Facebook (social), Cisco, Boeing, etc.

Why leave it at American companies? The Dutch East India Company was far more powerful and bigger than any company seen today.

Beyond that, it was more powerful than many (maybe most) modern governments. It was an international power unto itself.

Here is an SMBC comic making fun of this phenomenon. http://www.smbc-comics.com/index.php?id=3794

Why is every arrangement of characters now a "meme"? That word has moved beyond devoid of meaning; at this point it's like a black hole of nothingness of a word.

> A meme (/ˈmiːm/ MEEM) is an idea, behavior, or style that spreads from person to person within a culture.

Seems like it's fitting here, does it not? Certain banks being "too big to fail" is an idea passed among persons within our culture.

kind of like the idea of a meme (/ˈmiːm/ MEEM) is a meme (/ˈmiːm/ MEEM).

> Financial services are regulated and based on recent experience, they're concerned with systemic risk. Most industries do not have anyone responsible for worrying about this kind of thing.

I'd say that most industries do not have anyone responsible for worrying about it high enough in the management chain.

If your architecture means your system goes down when AWS is down, then the question becomes: can you replace AWS with something better that you can build, have the means and time to build, can keep running, and can get enough momentum in terms of sheer customer base to fund the upkeep of the platform?

If you can't build/run a better AWS replacement then it's a mute point, isn't it?

Then the question turns into: if you can't build a better AWS, can you architect your application to handle AWS failures? AWS itself lets you handle many kinds of failures at the AZ/DC level. Are you using that? For global AWS outages, can you have a skeleton, survival-critical system running on GCP or Azure?

Have you thought about outages that would be out of your control and out of AWS's control e.g. malware, DDoS, DNS, ISP, Windows/Android/iOS/Chrome/Edge zero day? How are you going to handle outages due to those issues?

If you are prepared to handle outages (communication, self-preservation, degraded mode, offline mode) then can a serious AWS outage be managed just like those outages?
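The "skeleton system on another cloud" idea above boils down to failover between endpoints. A crude sketch, with made-up URLs and the health check injected so it stays self-contained:

```python
# Crude multi-provider failover: try endpoints in priority order and
# serve from the first healthy one. URLs are placeholders; a real
# health check would hit each endpoint over HTTP with a timeout.
ENDPOINTS = [
    "https://api.example.com",     # hypothetical primary, on AWS
    "https://api-dr.example.net",  # hypothetical skeleton DR stack elsewhere
]

def pick_endpoint(is_healthy, endpoints=ENDPOINTS):
    """is_healthy(url) -> bool is injected so the sketch is testable offline."""
    for url in endpoints:
        if is_healthy(url):
            return url
    raise RuntimeError("no healthy endpoint")

# Simulate the primary being down: only the DR endpoint reports healthy.
print(pick_endpoint(lambda url: "dr" in url))
```

In practice this logic usually lives in DNS failover or a global load balancer rather than application code, but the priority-ordered health check is the same idea.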

irrelevant points are "moot", not "mute"

I think you mean "moo".

It's like a cow's opinion, you know, it just doesn't matter. It's "moo".

Have I been living with him for too long, or did that all just make sense?

Moot points aren't really irrelevant; on the contrary, they're perhaps the most relevant, as non-moot points are already settled.


Woah, Prof Brians updated his layout


Thanks. :)

I think most people probably don't need an entire AWS replacement, though. If you're running an e-commerce platform, you could run it on a VPS. It would be harder - you'd have to do your own server management, figure out your own deploy strategy, run your own load balancers, do your own security, backups, etc. - but that's not "rebuild AWS from scratch" harder.

Even at a smaller scale it is a little nerve-wracking to be so reliant on one provider. If AWS tanks, there's a fair amount of code that'd need to be changed just to switch over to Azure or GCE. Failover with, e.g., email providers is easy enough, but the entire cloud stack (for lack of a better term) is a completely different ballgame.

It is one of the issues with choosing a cloud provider and taking their stack. They are hoping that once you've bought into their way of doing things, swapping to a competitor who can offer a similar service cheaper is too costly. Lock-in used to be considered bad, but something changed with cloud providers, and ops/developers don't seem to care as much anymore.

Maybe because pricing by, say, Amazon is published on their web site and is therefore the same for everyone? Whereas before, when you were with one supplier, they could make a specific price for you and leverage their position to make you pay more? Dunno...

I'd be very surprised if big users of AWS or Azure pay the rates advertised on the public web sites.

Right, nobody pays /more/ than the published prices.

And prospective users can look at the published prices, and see that historically they've gone down more than they've gone up (although obviously that trend could reverse).

So people think they're safe from the risk of Amazon quadrupling their bill over night.

Of course, vendor lock-in can have other negative effects, but apparently people aren't worried about them, or at least think AWS is no worse than the alternatives.

Well, if you aren't in the US there is forex risk - Azure certainly increased their prices after the Brexit vote's decrease of GBP against the USD.

Doing any kind of business in non-USD currency is itself not a great move.

Like most large business, AWS has a sales department and sales engineers that behave like you expect.

Rates are absolutely negotiable.

Our AWS rep is nice and cheery. He'll come into our office twice a year and bring sales engineers to hear about our upcoming projects. There's one lead developer on our team who keeps imagining systems that use half a dozen AWS services for "big data". The AWS dudes always end up talking to him the most and they definitely bait him with various pitches and, of course, feed his ego. Good thing that he's so disorganized and delayed that he never has a chance to waste company money on all that bullshit.

Having negotiated both, AWS is negotiable within a range that is much tighter than the range for enterprise sales of Cisco, EMC, etc gear. (Or AWS has better negotiators, but I've never gotten a call "Hey, Qx is about to end and I need to hit my numbers, so is there anything we can pull forward" from an AWS rep.)

They could show everyone different prices (like plane tickets)

There are some open source implementations of parts of the APIs of cloud providers that might help someone a bit when trying to migrate. For example, Minio [1] [2] implements the AWS S3 v4 API.

[1]: https://news.ycombinator.com/item?id=12392081

[2]: https://minio.io/

We[1] have seen a fair number of requests for managed services as devs claim that "we don't have the time or skills to maintain these open source components" (quoting verbatim from a request on Intercom). I don't think this is about having open source substitutes in all the cases. Personally not a fan of how/where the build vs. buy debate is playing out here.

[1]: https://hasura.io

"we don't have the time or skills to maintain these open source components" translates to "we don't want to install a monitoring solution that restarts a service when it fails, or think about design in regard to component failure".

It's such a poor argument. I was a developer long before AWS appeared and I've used so many open sources packages that were profoundly reliable. In many cases it just takes a daemon restart. And while it's not exciting to set up some of that stuff, it's far more tolerable than writing a CloudFormation template.

I warn other developers at my company about this. When new projects spin up they're often very excited about using new Amazon services and will make any excuse to choose an AWS product over a stable open source solution. If I were a manager, I'd be very worried over the vendor lock-in.
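The "monitoring solution that restarts a service when it fails" from the comment above is not much code. A minimal sketch; in production you'd reach for systemd, supervisord, or monit rather than rolling your own:

```python
# A minimal "restart it when it dies" watchdog loop. `run_service` is
# any callable that blocks until the service exits and returns its
# exit code (e.g. a subprocess wrapper). Sketch only, not production.
import time

def watchdog(run_service, max_restarts=5, backoff=2.0):
    restarts = 0
    while True:
        code = run_service()             # blocks until the service exits
        if code == 0:
            return "clean exit"          # deliberate shutdown, stop here
        if restarts >= max_restarts:
            return "gave up"             # persistent failure, page a human
        restarts += 1
        time.sleep(backoff * restarts)   # linear backoff before restarting

# Wiring it to a real daemon (hypothetical command) would look like:
#   import subprocess
#   watchdog(lambda: subprocess.call(["my-daemon", "--foreground"]))
```

The backoff and restart cap matter: a tight restart loop on a crashing service is its own outage.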

I don't understand the preference for AWS over open source in many cases. Their services are "reliable", but they often have minute restrictions that will eventually bite you. You also end up having to pay for something you could get for free. Why use SNS/SQS when there are free pubsub/message buses out there? Most of the other devs justify this with the argument of not having to maintain the software themselves. "But RabbitMQ might crash! We don't have to worry about that with AWS!"

Anyway, I typically minimize the AWS services I use (S3, EC2, ECS) so I don't dread the day AWS blows up or, more likely, some VP or exec says we're moving to GCP/Azure because we got a better deal.

>Why use SNS/SQS when there are free pubsub/message buses out there?

Free is never really free. There's always a tradeoff in engineering time and money when you choose to run your own stack instead of paying to use a stable, well-established service. Oftentimes running your own will be cheaper overall, but you have to do that cost-benefit comparison for yourself.

You're also forgetting that if you set up something on your own you also have all the hardware concerns as well. You need to procure hosts, provision them properly, deploy them, monitor them, scale them, fix them. That infrastructure cost doesn't go to zero but it is significantly reduced using a cloud provider.
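The cost-benefit comparison the parent describes is just arithmetic once you put numbers on it. All figures below are made up for illustration; plug in your own:

```python
# "Free is never free": self-hosting trades a monthly managed-service
# bill for engineering time. Returns months until self-hosting pays
# off, or None if it never does. All inputs are illustrative.

def breakeven_months(managed_cost_mo, selfhost_infra_mo,
                     setup_hours, ops_hours_mo, hourly_rate):
    setup = setup_hours * hourly_rate
    monthly_saving = managed_cost_mo - (selfhost_infra_mo
                                        + ops_hours_mo * hourly_rate)
    if monthly_saving <= 0:
        return None                      # self-hosting never pays off
    return setup / monthly_saving

# e.g. a $500/mo managed queue vs $100/mo of VMs plus 2 ops-hours at
# $100/hr, with 40 hours of setup work:
print(breakeven_months(500, 100, 40, 2, 100))  # 20.0 months
```

The interesting part is how sensitive the answer is to the ops-hours estimate, which is exactly the number teams tend to get wrong.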

I'm not arguing against cloud platforms in general; just the irrational use of very specialized services they offer. I can run a containerized service that uses open source packages on any of the cloud computing platforms. Now if I used Athena, SQS/SNS, DynamoDB, ELB, Lambda, EC2 that would make me very nervous, and I see other devs designing these stacks all the time. I guess I shouldn't care as much, because I'm not going to be the one to migrate that when the company gets a better deal from another platform service.

> "But RabbitMQ might crash! We don't have to worry about that with AWS!"

I can confirm that not only can RabbitMQ get into an unusable state, it will do so extremely rapidly and with little warning unless you sit an engineer or two on it to monitor and manage the incoming/dead letter rates.

AWS provides a lot of features that are exclusive to their platform and can't be drop-in replaced on other providers like Azure or GCE: ELB, EFS, S3, ASGs, etc. They'd need to be replaced at the application level for other platforms. That could be a huge commitment for a decent-sized system.

I don't know about ELB, EFS and ASG but:

- S3 has a public protocol and many 3rd party providers support it (OpenIO, Scality, Ceph, Minio, etc),

- EFS could be replaced with something like DRBD or GlusterFS, or DigitalOcean's block storage or Google Cloud's networked disks.

- ELB could be replaced easily with similar services from other providers [1] if you use Kubernetes (I don't know if all have a LoadBalancer type though)

I would be more concerned about firewall/vpc rules, because I have no idea how those could be migrated without risk of forgetting some. Lock-in seems not that high in the end though and even less so if you use an open source container orchestration stack because they abstract most of these things away.

[1] https://kubernetes.io/docs/tasks/access-application-cluster/...

One good way is to have automated tests which make sure that those rules actually work. What if all your AWS rules suddenly get deleted. How are you supposed to know if you have not forgotten any.
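One way to write that kind of test is to keep the expected rules in code and diff them against what the provider's API reports. A sketch with the API call stubbed out (a real version would call something like EC2's describe_security_groups); the rules themselves are made up:

```python
# Drift check: expected firewall rules live in version control, and a
# test diffs them against the rules the cloud API actually reports.
# Rules are modeled as (protocol, port, cidr) tuples for simplicity.

EXPECTED = {
    ("tcp", 443, "0.0.0.0/0"),   # public HTTPS
    ("tcp", 22, "10.0.0.0/8"),   # SSH from internal network only
}

def check_rules(actual_rules):
    """Return (missing, unexpected) rule sets relative to EXPECTED."""
    missing = EXPECTED - actual_rules
    extra = actual_rules - EXPECTED
    return missing, extra

# Simulate the API suddenly "losing" the SSH rule:
missing, extra = check_rules({("tcp", 443, "0.0.0.0/0")})
print(sorted(missing), sorted(extra))
```

Run on a schedule, this catches both deleted rules and the quietly-added "temporary" ones.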

Terraform (https://www.terraform.io), which we use, is a neat way to abstract this configuration data. We keep the configuration files in git, and can do GitHub pull requests over our whole infrastructure and apply the configuration with confidence (to change the existing setup or re-create it from scratch). This works for multiple cloud providers and is great for security purposes (all changes are auditable, no configuration drift).

Heap has a great blog post on Terraform : https://heap.engineering/terraform-gotchas/
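To make the parent's workflow concrete, a minimal hypothetical Terraform fragment (resource name and CIDR are made up) of the kind that would live in git and go through pull-request review:

```hcl
# Illustrative only: a security-group rule kept under version control.
# Every change goes through a pull request, and `terraform plan`
# shows the exact diff before anything is applied.
resource "aws_security_group" "web" {
  name = "web"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because the file is the source of truth, `terraform plan` also surfaces configuration drift: anything changed by hand in the console shows up as a pending diff.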

Essentially, a declarative configuration for infra is what you are getting at. You can take this further, using containers and orchestration tech, to abstract your application behind a declarative configuration, making it infra-agnostic (as it should be, IMO). Obviously, not getting locked into any cloud provider's services is a prerequisite for this. Check out such an implementation here [1] (full disclosure: I work here).

[1]: https://docs.hasura.io/0.14/ref/project-configuration-and-st...

It's amazing how the promise of "decentralized" internet has turned into centralized datacenters.

P2P networks, each computer being a "data store" on the internet, no one entity can control data, etc to modern day centralized cloud where a couple of players control so much.

There has been a cultural shift. In the early 2000s, the idea of storing your data somewhere else would have been weird. But now, people don't care about keeping their data in Apple's, Google's, etc. data centers.

I think it has to do with the fact that computer/internet illiterate people are now the majority whereas in the 90s/early 2000s, it was generally the computer literate on the internet.

I was pretty befuddled when my company IT switched from self hosted storage to commercial cloud accounts for incredibly sensitive info.

I think the reasoning was cloud accounts are easier for the masses than mapping a drive and accessing over VPN

Also someone else to blame when the sensitive info is exposed.

This goes to beyond having a plan-B for hosting your own stuff somewhere else. Think about all the 3rd party services you are depending on. Then think about how many dependencies those services have. How many trace back to Amazon on some level?

The connections that could cause problems may not be obvious. For example, a network provider running into trouble because a ticketing or monitoring system that depends on Amazon does not work. Or a hardware supplier unable to ship spare parts for your on-premise SAN because their logistics company runs into trouble due to issues at Amazon.

Personally, as a dev, I'd place AWS's service somewhere in the middle between PayPal (shit, not sure why they're popular) and Stripe (damn, that was fast and easy), having used them both.

Their support is alright, although you often have to pay for it. But AWS docs are atrocious and remind me of university textbooks written by professors who like creating pseudo-scientific-sounding jargon. Mixed with their huge array of features, that makes AWS quite uncomfortable to use even for people with intermediate AWS experience (the built-some-apps-with-AWS-before kind of people).

I can see that there could be more specialized services like Firebase (which is built on Google Cloud) that should be built on AWS for the users. Firebase is a breeze to use and very responsive and I've used it to build real-time chat apps in a couple days.

It took me three reads of the first couple of paragraphs to realise that "snowball" and "snowmobile" were actually hardware products that you can touch. Tech news publishers need to do a jargon check and use appropriate punctuation, formatting or something to call out terms that 90% of readers would not have come across.

Maybe it's because I saw your comment before reading, but I had no problem understanding the first few paragraphs.

The author states that a "snowball" is a grey suitcase with 50 TB of HDD space inside, and a "snowmobile" is a massive 18-wheeler with what I would assume is petabytes of storage.

It's probably because it's 5 in the morning here :-) But looking at Amazon's own references to the appliances, they always capitalise the names. I guess what I can only assume was intentional obscuring of what are probably trademarks made it read poorly to me.

Really? It's explicitly stated in the very first paragraph, in the second sentence:

> Not the lumps of mush and ice that children chuck at each other, but Amazon’s portable information storage devices, big grey suitcases that hold huge amounts of data.

Capitalizing it might have helped, though.

When I was working as a contractor for one of the big banks, whose dev was concentrated in Canary Wharf, they weren't able to successfully complete disaster recovery testing on their primary database cluster for two years in a row. I just don't remember whether it was department-wide or bank-wide.

Basically, every 6 months the DR testing failed, and that was accepted as harsh reality. After seeing how they work inside, I don't think that moving their infrastructure to AWS/Azure/Google is the worst that could happen.

disc: Currently working at Amazon, but not at AWS.

Why did they not redo the DR testing until it worked? Normally you iterate tests and bugfixes over and over until it works. Otherwise, what's the point of the test? Being confident that your stuff does not work at all?

It was a bank-wide activity with a defined schedule, etc.

That's why I think containerization and orchestration will be useful; open source orchestrators can standardize the infrastructure and make switching seamless. That way the infrastructure remains a commodity.

Except you can't containerize the huge amounts of data you are storing can you?

What would be great is the equivalent of the ACME protocol for cloud service providers. That will take a while and shouldn't happen until the offering matures and stabilises. But in an ideal world you wouldn't tie your application to a specific cloud provider. You should be able to lift and shift to another provider.

Which I think is a merit of using VMs as opposed to individual services.

But in an ideal world you wouldn't tie your application to a specific cloud provider.

You can do that easily if you just treat clouds merely as hosted hypervisors and think entirely in terms of VMDKs. But this doesn't make commercial sense to do at least in the short term - you need to utilise the layered services you are paying for anyway or you might as well just run your own DC.

It still makes sense for its elastic properties (from which EC2 got its name). You can't rent half a DC for an hour, but you can spawn generic instances from VMDKs on different providers with a fairly small abstraction layer.
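As a toy illustration of that small abstraction layer (every name here is hypothetical; real implementations would wrap each provider's SDK):

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Minimal provider-agnostic interface: boot a VM from a portable image."""
    @abstractmethod
    def spawn(self, image_id: str, size: str) -> str:
        """Return an instance identifier, or raise if the provider is down."""

class FakeProvider(CloudProvider):
    """Stand-in for a real provider SDK, used here only to show the shape."""
    def __init__(self, name: str, healthy: bool = True):
        self.name, self.healthy = name, healthy

    def spawn(self, image_id: str, size: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} unavailable")
        return f"{self.name}:{image_id}:{size}"

def spawn_anywhere(providers, image_id, size="small"):
    """Try providers in order, falling back when one is unavailable."""
    for p in providers:
        try:
            return p.spawn(image_id, size)
        except RuntimeError:
            continue
    raise RuntimeError("no provider available")

print(spawn_anywhere([FakeProvider("aws", healthy=False),
                      FakeProvider("gce")], "app-image-v1"))
# prints "gce:app-image-v1:small"
```

The hard part in practice isn't this interface, it's keeping the images portable and the data reachable, as the sibling comments point out.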

Your data still needs to live somewhere and giant VMDKs being copied around aren't a reasonable solution, I'd argue.

ACME protocol?

Developed by Let's Encrypt, it helps solve the too-big-to-fail problem with CAs. Once CAs adopt it (which looks like it may happen), you will have a common protocol to create and renew certificates across CAs.

Cloud services are concentrated by nature, built with the same cloned DNA. Of course there is systemic risk when so much is concentrated in fewer physical locations running on the same code.

Think cloned bananas vs. Panama disease, but with computers. http://www.bbc.com/news/uk-england-35131751

This does worry me. If there is a shortage of resources suddenly or a DC fire that takes out a region, then what?

We have contingency against this via our own infrastructure but I worry about organisations who don't have any.

> Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.

Source: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...

One region isn't going to be affected by a fire. And AWS has dozens of regions. They're even managed as separate units by separate people. You'll notice there's never been a large, multi-region outage of AWS.


Some of the traditional apps we host are vulnerable to hypervisor failure, be that at the rack, DC, or region level.

Hardware always fails. That's why AWS has so many availability zones, regions and services that let you take easy advantage of HA across them.

>This does worry me. If there is a shortage of resources suddenly or a DC fire that takes out a region, then what?

Then some businesses will be out for a few hours / days.

No big deal.

From WWII to 9/11 to Katrina (and whatever regional stuff we have), we have been through much worse than that in modern history.

The solution is pretty simple: AWS/Azure need to provide on-premise versions of their cloud. You'd probably get stuck on a particular version, but better than nothing.

That's pretty much what Azure Stack is:


There might well be a commercial niche for providing Azure Stack hosting in non-Microsoft data centers.

I think there is a massive market for 100% cloud-compatible local deployments. In my personal experience every .Net shop I've seen would love to be incorporating more Azure goodness locally, but can't as they're cloud specific techs which bump into the realities of deployment and maintenance.

Personally, I think MS crapped the bed a little by taking Azure Stack off of commodity hardware and onto a combined hardware/software solution. Being able to deploy Azure-compatible solutions piecemeal locally would be a massive boon to governments, healthcare operations, and anyone working on a more thorough migration to the cloud.

Most of the EU, for example, has privacy regulation that makes cloud hosting impossible in some situations. Having a 'local Azure' would make it highly reasonable to have all apps architected around Azure's components and technology. Without local deployment, though, you're kind of stuck with each foot in a different canoe... Hybrid infrastructures are highly favorable for DevOps and multi-party development scenarios.

From Scott Guthrie:

"So if the performance is dropping, do you call the server manufacturer, do you call the networking manufacturer, do you call the load balancer manufacturer, do you call the storage manufacturer? They typically point the finger at the other guy and you spend weeks and months trying to debug and get your cloud to work."


We can all relate to that. A "cloud" is sufficiently complex that vendor blaming is an almost guaranteed outcome.

OpenStack is still going strong.

I think about this problem every now and then for my own business, but not sure what the right answer is. Supporting multiple clouds requires more involved management of some pieces of infrastructure (e.g., DNS + healthchecks, DB replication), which introduces another point of failure.

How do people who need to have more nines of availability manage this issue with cloud providers? (EC2 and RDS promise 3.5 nines per AZ, but I imagine outages are somewhat correlated across zones)
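Back-of-the-envelope on those numbers (a sketch that assumes AZ failures are fully independent, which, as you note, they aren't quite):

```python
def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction."""
    return (1 - availability) * 365 * 24

single_az = 0.9995                  # ~3.5 nines, per the SLA figure above
two_az = 1 - (1 - single_az) ** 2   # up if either AZ is up, independence assumed

print(round(downtime_hours_per_year(single_az), 2))  # ~4.38 hours/year
print(round(downtime_hours_per_year(two_az), 4))     # ~0.0022 hours/year
```

Correlated outages push the real two-AZ number well above that idealized figure, which is exactly why the independence assumption is the interesting question.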

For people who need more 9s of availability on a single cloud provider, you have to start going multi-region. AWS takes region isolation/independence very seriously, and along with geographic independence this gives you effectively two entirely independent clouds which just so happen to have the exact same APIs. Some of the (really great) Netflix blog posts [0] have talked about multi-region services.

If you do go multi-cloud, I would be wary of picking regions that are located very close to each other. While you'll obviously get independent code and (likely) independent deployments, you're still susceptible to issues correlated with the physical location.

[0] https://medium.com/netflix-techblog/global-cloud-active-acti...

Very, very few businesses should be architecting to ensure higher than 99.95% availability, IMO. (Less than 4.5 hours of downtime per year.)

Users are patient enough to give you a pass if you're down that amount (especially if you're down that amount while 1/3rd of the internet is also down).

Our largest e-commerce retail site does over $1BB/yr in fairly high-margin sales and still targets "only" 99.95% availability (generally it exceeds that with actual results, but we don't target higher than that). It's a hybrid of on-prem and cloud services backing that, migrating towards the cloud, but will never be 100% cloud as we own and run factories with on-prem equipment.

(I know you asked "how" and I answered "whether", but I thought it relevant.)

Hasn't anyone heard of disaster recovery plans? I used to work at a medium-sized insurance company, and every year we had a project to update our disaster recovery plans, including for our main in-house datacenter going down. If it was a critical system, you'd better have a plan to get it back up in about 4 hours. And those were business-critical; we didn't have any life-critical systems.

What's the disaster plan for "DynamoDB doesn't exist any more"? There is literally nothing else like it in the world. I don't know of an idiot proof queue system that can handle the scales SQS can take either.



Yes and no. By design it's not big, it just seems big. With realistic RPO and RTO targets, anyone can fail over to other regions. And if you aren't leveraging multiple AZs within a single region, you need to rethink how you are using AWS.

The very nature of AWS requires Amazon to build in capabilities to handle failover. But, as they say at Amazon, "everything fails, always".

Is it possible for AWS to have a multi-region outage - as in is there anything connecting them that could bring them all (or several) down at once?

(Apart from the result of a botched patching or update to the core software stack that was done worldwide at the same time and hopefully never happens).

A cascading electrical grid failure? I don't know if there are any interconnects between the regions with the DC's, but if there were that might be a concern. Though at that stage, presumably most of the US is without power, hence not so much need for AWS.

I think each DC has at least two power sources and probably a backup generator. I think that's why cloud providers have been so reluctant to open in Africa, diversified power is apparently a problem.

Well I guess a nationwide power outage will have bigger implications than Netflix going down...

Yeah, it might take down Netflix and YouTube! Then what are we supposed to do while we wait for the electricity to come back on?!

Just bringing down us-east was enough to cause quite a bit of trouble recently: https://aws.amazon.com/message/41926/

That would go against a core principle at aws which is to have every region completely isolated.

Also, deployments are designed to be exponential and no region should ever have a cross region dependency.

Unless you work at amazon, you can't know that.

It looks very separated from the outside, but I've worked at so many companies that appeared incredibly competent externally yet had "snowflake" servers keeping things ticking over. Given Bezos's treatment of workers, I have absolutely no confidence that everything is as cleanly engineered as they claim.

My memory may be wrong, but I thought several regions were affected by the recent S3 outage? Also, I suspect that if us-east-1 went down completely that that would have a debilitating effect on the others.

Well given that you can manage services across all regions from a single web interface, I assume someone compromising this web interface would be able to control and bring stuff down across all regions.

Do you think there's a global world admin web console?

Well as a customer how do you interact with it? Do you have one admin interface per region with its own URL or do you just go on an AWS homepage, login and control everything from there?

> Do you have one admin interface per region

This ^^^ When I go to login, it redirects me to: https://us-east-1.signin.aws.amazon.com

And for example the S3 console is on the URL https://console.aws.amazon.com/s3/home?region=us-east-1

Of course this doesn't mean I can't easily switch to a different region, or for that matter control all my regions from a single interface, but there is some distinction.

Separate end points. Also, AWS recommends that you might want multiple different accounts for more isolation [1]

[1] https://aws.amazon.com/answers/account-management/aws-multi-...

I meant truly admin, like Amazon-intranet admin.

Yes, there are ways to bring down all of their arch at once, but you'd have to get through a lot of barriers to do it.

A major solar flare and coronal mass ejection? It wouldn't just be Amazon that was affected, though.

No. Hence them rolling out new features region by region.

Nothing is too big to fail. Society needs to be able to adapt and maintain a level of patience during transition times, i.e., be patient while everyone moves from Amazon's failed cloud to a new tool.

If Amazon's cloud service disappeared today, it would be chaos for a week or two, but most people should recover (as long as they have backups).

I'd wager most people's database backups live in AWS as well.

Plus, some people have huge, huge datasets. It could easily take weeks to migrate to, say, GCE, or to your own hosted servers. In the latter case, it would also necessitate a pretty large up-front investment.
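Rough arithmetic on the "weeks to migrate" point, assuming a sustained 10 Gbit/s link (a generous, hypothetical figure) and ignoring retries and throttling:

```python
def transfer_days(bytes_total: float, gbit_per_s: float) -> float:
    """Days needed to move bytes_total over a sustained link of the given speed."""
    seconds = bytes_total * 8 / (gbit_per_s * 1e9)
    return seconds / 86400

# one petabyte over 10 Gbit/s
print(round(transfer_days(1e15, 10), 1))  # ~9.3 days
```

This is why the article's Snowmobile trucks exist: past a certain scale, shipping disks beats the wire.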

For articles where the headline is a question, the answer is always "no".
