
Is Amazon's cloud service too big to fail? - azureel
https://www.fnlondon.com/articles/is-amazons-cloud-service-too-big-to-fail-20170801
======
dalbasal
This is (I was surprised) a pretty good article. Financial services are
regulated and based on recent experience, they're concerned with systemic
risk. Most industries do not have anyone responsible for worrying about this
kind of thing.

It seems reasonable to start worrying about the fragility potentially
introduced by these massive internet infrastructure companies.

~~~
peteretep
If you wanted to blow something up to make the west suffer, an AWS datacenter
would probably be a pretty good target. I wonder at what point that becomes a
legitimate national security concern, and the government steps in to provide
protection.

~~~
mediascreen
Wouldn't you have to blow up at least all the datacenters in a region to make
an impact?

~~~
cm2187
All you need to blow is a few cables.

~~~
bluedino
Why blow up anything or damage any cables? Hack the computer of an Amazon
employee and do your damage there. The last S3 outage was because of someone
fat-fingering a script; imagine what someone who really wanted to mess stuff
up could do.

~~~
unclebucknasty
This. Could've saved myself a comment elsewhere on this thread.

But, yeah, I'm kinda' surprised that this HN crowd in particular is so focused
on hardware vectors.

~~~
martyvis
And ultimately it leads you back to targeting the IT operations centre for the
business where they provision equally to say both AWS and Azure for
redundancy. At that point you can knock all of the biz cloud capacity off.

------
AmIFirstToThink
If your architecture means your system goes down if AWS is down, then the
question becomes can you replace AWS with something better that you can build,
have the means to build, have time to build, can keep running, can get enough
momentum in terms of sheer size of customer base to fund the upkeep of the
platform?

If you can't build/run a better AWS replacement then it's a mute point, isn't
it?

Then the question turns into: if you can't build a better AWS, can you architect
your application to handle AWS failures? AWS itself lets you handle many kinds
of failures at AZ/DC level. Are you using that? For global AWS outages, can
you have a skeleton, survival-critical system running on GCP or Azure?

Have you thought about outages that would be out of your control and out of
AWS's control e.g. malware, DDoS, DNS, ISP, Windows/Android/iOS/Chrome/Edge
zero day? How are you going to handle outages due to those issues?

If you are prepared to handle outages (communication, self-preservation,
degraded mode, offline mode) then can a serious AWS outage be managed just
like those outages?
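
A minimal sketch of the degraded-mode idea above, assuming three hypothetical backends (`primary`, `secondary`, `degraded`) that stand in for the main provider, a skeleton deployment on another provider, and a cached read-only fallback. None of these names come from a real SDK, and the outage is simulated.

```python
from typing import Callable, List, Sequence, Tuple


class AllBackendsDown(Exception):
    """Raised when every configured backend has failed."""


def call_with_failover(
    backends: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each backend in order; return (name, result) from the first success."""
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllBackendsDown("; ".join(errors))


# Hypothetical backends, for illustration only.
def primary() -> str:
    raise ConnectionError("simulated provider outage")


def secondary() -> str:
    return "full response from skeleton deployment"


def degraded() -> str:
    return "cached read-only response"


served_by, response = call_with_failover(
    [("primary", primary), ("secondary", secondary), ("degraded", degraded)]
)
print(served_by, "->", response)
```

The ordering of the list is the whole policy here: the caller decides which fallbacks exist and how far service may degrade before it gives up.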

~~~
savoytruffle
irrelevant points are "moot", not "mute"

~~~
darkr
I think you mean "moo".

It's like a cow's opinion, you know, it just doesn't matter. It's "moo".

~~~
aidenn0
Have I been living with him for too long, or did that all just make sense?

------
barsonme
Even at a smaller scale it is a little nerve-wracking to be so reliant on
one provider. If AWS tanks there's a fair amount of code that'd need to be
changed just to switch over to Azure or GCE. Failover with, e.g., email
providers is easy enough, but the entire cloud stack (for lack of better
terms) is a completely different ballgame.

~~~
patta54
I warn other developers at my company about this. When new projects spin up
they're often very excited about using new Amazon services and will make any
excuse to choose an AWS product over a stable open source solution. If I were
a manager, I'd be very worried over the vendor lock-in.

I don't understand the preference for AWS over open source in many cases.
Their services are "reliable", but they often have minute restrictions that
will eventually bite you. You also end up having to pay for something you
could get for free. Why use SNS/SQS when there are free pubsub/message buses
out there? Most of the other devs justify this with the argument of not having
to maintain the software themselves. "But RabbitMQ might crash! We don't have
to worry about that with AWS!"

Anyway, I typically minimize the AWS services I use (S3, EC2, ECS) so I don't
dread the day AWS blows up or, more likely, some VP or exec says we're moving
to GCP/Azure because we got a better deal.

~~~
plandis
You're also forgetting that if you set up something on your own you also have
all the hardware concerns as well. You need to procure hosts, provision them
properly, deploy them, monitor them, scale them, fix them. That infrastructure
cost doesn't go to zero but it is significantly reduced using a cloud
provider.

~~~
patta54
I'm not arguing against cloud platforms in general; just the irrational use of
very specialized services they offer. I can run a containerized service that
uses open source packages on any of the cloud computing platforms. Now if I
used Athena, SQS/SNS, DynamoDB, ELB, Lambda, EC2 that would make me very
nervous, and I see other devs designing these stacks all the time. I guess I
shouldn't care as much, because I'm not going to be the one to migrate that
when the company gets a better deal from another platform service.
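
One way to limit the lock-in described above is to code against a small interface of your own and keep vendor calls behind it. This is only a sketch: `MessageQueue`, `InMemoryQueue`, and `drain` are hypothetical names, and the in-memory backend stands in for whatever SQS or RabbitMQ adapter you would actually write.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, List, Optional


class MessageQueue(ABC):
    """Hypothetical minimal queue interface the application codes against."""

    @abstractmethod
    def send(self, body: str) -> None:
        ...

    @abstractmethod
    def receive(self) -> Optional[str]:
        ...


class InMemoryQueue(MessageQueue):
    """Stand-in backend; a vendor adapter would implement the same two methods."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def send(self, body: str) -> None:
        self._items.append(body)

    def receive(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


def drain(queue: MessageQueue) -> List[str]:
    """Application logic: depends only on the interface, not on a vendor."""
    handled = []
    while (message := queue.receive()) is not None:
        handled.append(message.upper())
    return handled


queue = InMemoryQueue()
queue.send("order-1")
queue.send("order-2")
print(drain(queue))
```

The catch is that the interface can only expose what every backend supports, so vendor-specific features are exactly what you give up.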

------
jpalomaki
This goes beyond having a plan B for hosting your own stuff somewhere else.
Think about all the 3rd party services you are depending on. Then think about
how many dependencies those services have. How many trace back to Amazon on
some level?

The connections that could cause problems may not be obvious. For example, a
network provider running into trouble because a ticketing or monitoring system
that depends on Amazon does not work. Or a hardware supplier not being able to
ship spare parts for your on-premise SAN because a logistics company runs into
trouble due to issues at Amazon.

------
forkLding
Personally, as a dev who has used both, I find AWS's service somewhere between
PayPal (shit, not sure why they're popular) and Stripe (damn, that was fast and
easy).

Their support is alright, although you often have to pay for it. But the AWS
docs are atrocious and remind me of university textbooks written by professors
who like creating pseudo-scientific-sounding jargon. Mixed with the huge array
of features, that makes them uncomfortable to use even for people with
intermediate AWS experience (the built-some-apps-with-AWS-before kind of people).

I can see that there could be more specialized services like Firebase (which
is built on Google Cloud) that should be built on AWS for the users. Firebase
is a breeze to use and very responsive and I've used it to build real-time
chat apps in a couple days.

------
martyvis
It took me three reads of the first couple of paragraphs to realise that
"snowball" and "snowmobile" were actually hardware products that you can
touch. Tech news publishers need to do a jargon check and use appropriate
punctuation, formatting or something to call out terms that 90% of readers
would not have come across.

~~~
cdolan
Maybe it's because I saw your comment before reading, but I had no problem
understanding the first few paragraphs.

The author states that a "snowball" is a grey suitcase with 50 TB of HDD space
inside, and a "snowmobile" is a massive 18 wheeler with what I would assume is
petabytes of storage.

~~~
martyvis
It's probably because it's 5 in the morning here :-) But looking at Amazon's
own references to the appliances, they always capitalise the name. I guess
what I can only assume was intentional obscuring of what are probably
trademarks made it read poorly to me.

------
galkk
When I was working as a contractor for one of the big banks, whose development
was concentrated in Canary Wharf, they weren't able to successfully complete
disaster recovery testing on their primary database cluster for 2 years in a
row. I just don't remember whether it was department-wide or bank-wide.

Basically, every 6 months DR testing failed, and it was accepted as harsh
reality. After seeing how they work inside, I don't think that moving
their infrastructure to AWS/Azure/Google is the worst that could happen.

disc: Currently working at Amazon, but not at AWS.

~~~
Kenji
Why did they not redo the DR testing until it worked? Normally you iterate
tests and bugfixes over and over until it works. Otherwise, what's the point
of the test? Being confident that your stuff does not work at all?

~~~
galkk
It was a bank-wide activity with a defined schedule etc.

------
jondubois
That's why I think containerization and orchestration will be useful; open
source orchestrators can standardize the infrastructure and make switching
seamless. That way the infrastructure remains a commodity.

~~~
lukeholder
Except you can't containerize the huge amounts of data you are storing, can
you?

------
cm2187
What would be great is the equivalent of the ACME protocol for cloud service
providers. That will take a while and shouldn't happen until the offering
matures and stabilises. But in an ideal world you wouldn't tie your
application to a specific cloud provider. You should be able to lift and shift
to another provider.

Which I think is a merit of using VMs as opposed to individual services.

~~~
gaius
_But in an ideal world you wouldn't tie your application to a specific cloud
provider._

You can do that easily if you just treat clouds merely as hosted hypervisors
and think entirely in terms of VMDKs. But this doesn't make commercial sense
to do at least in the short term - you need to utilise the layered services
you are paying for anyway or you might as well just run your own DC.

~~~
icebraining
It still makes sense for its elastic properties (from which EC2 got its name).
You can't rent half a DC for an hour, but you can spawn generic instances from
VMDKs on different providers with a fairly small abstraction layer.

~~~
gaius
Your data still needs to live somewhere and giant VMDKs being copied around
aren't a reasonable solution, I'd argue.

------
acd
Cloud services are concentrated by nature and built with the same cloned DNA.
Of course that is a systemic risk, with so much of it concentrated in fewer
physical locations running on the same code.

Think cloned bananas vs. fungus disease, but with computers.
[http://www.bbc.com/news/uk-england-35131751](http://www.bbc.com/news/uk-england-35131751)

------
cjsuk
This does worry me. If there is a shortage of resources suddenly or a DC fire
that takes out a region, then what?

We have contingency against this via our own infrastructure but I worry about
organisations who don't have any.

~~~
kondro
One region isn't going to be affected by fire. And AWS have dozens of regions.
They're even managed as separate units by separate people. You'll notice
there's never been a large, multi-region outage of AWS.

~~~
cjsuk
Yet.

Some of the traditional apps we host are vulnerable to hypervisor failure, be
that rack, DC or region.

~~~
kondro
Hardware always fails. That's why AWS has so many availability zones, regions
and services that let you take easy advantage of HA across them.

------
blazespin
The solution is pretty simple: AWS/Azure need to provide on-premise versions
of their cloud. You'd probably get stuck with a particular version, but
better than nothing.

~~~
arethuza
That's pretty much what Azure Stack is:

[https://azure.microsoft.com/en-gb/overview/azure-stack/](https://azure.microsoft.com/en-gb/overview/azure-stack/)

There might well be a commercial niche for providing Azure Stack hosting in
non-Microsoft data centers.

~~~
bonesss
I think there is a massive market for 100% cloud-compatible local deployments.
In my personal experience every .Net shop I've seen would love to be
incorporating more Azure goodness locally, but can't as they're cloud specific
techs which bump into the realities of deployment and maintenance.

Personally, I think MS crapped the bed a little by taking Azure Stack off of
commodity hardware and onto a combined hardware/software solution. Being able
to deploy Azure-compatible solutions piece-meal locally would be a massive
boon to governments, healthcare operations, and anyone working on a more
thorough migration to the cloud.

Most of the EU, for example, has privacy regulation that makes cloud hosting
impossible in some situations. Having a 'local Azure' would make it highly
reasonable to have all apps architected around Azure's components and technology.
Without the local deployment though you're kinda stuck with each foot in a
different canoe... Hybrid infrastructures are highly favorable to DevOps and
multi-party development scenarios.

~~~
Delphiza
From Scott Guthrie:

"So if the performance is dropping, do you call the server manufacturer, do
you call the networking manufacturer, do you call the load balancer
manufacturer, do you call the storage manufacturer? They typically point the
finger at the other guy and you spend weeks and months trying to debug and get
your cloud to work."

[https://www.theregister.co.uk/2017/07/10/interview_with_micr...](https://www.theregister.co.uk/2017/07/10/interview_with_microsofts_scott_guthrie/?page=2)

We can all relate to that. A "cloud" is sufficiently complex that vendor
blaming is an almost guaranteed outcome.

------
fovc
I think about this problem every now and then for my own business, but not
sure what the right answer is. Supporting multiple clouds requires more
involved management of some pieces of infrastructure (e.g., DNS +
healthchecks, DB replication), which introduces another point of failure.

How do people who need to have more nines of availability manage this issue
with cloud providers? (EC2 and RDS promise 3.5 nines per AZ, but I imagine
outages are somewhat correlated across zones)
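
For a rough feel of what the nines mean, here is a back-of-the-envelope sketch. It reads "3.5 nines" as 99.95%, which is an assumption, and it treats zone failures as fully independent, which is the best case. Correlated outages would make the real combined figure worse than what this prints.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(availability: float, replicas: int) -> float:
    """Availability of N replicas, assuming failures are fully independent."""
    return 1.0 - (1.0 - availability) ** replicas


per_zone = 0.9995  # assumed reading of "3.5 nines"
for zones in (1, 2, 3):
    total = combined_availability(per_zone, zones)
    minutes = downtime_minutes_per_year(total)
    print(f"{zones} zone(s): {total:.8f} available, ~{minutes:.3f} min/year down")
```

Under these assumptions a single zone allows a bit over four hours of downtime a year, and the second zone buys far more than the third does.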

~~~
dastbe
for people who need more 9s of availability on a single cloud provider, you
have to start going multi-region. aws takes region isolation/independence very
seriously, and along with geographic independence gives you effectively two
entirely independent clouds which just so happen to have the exact same APIs.
Some of the (really great) Netflix blog posts[0] have talked about multi-
region services.

If you do go multi-cloud, I would be wary of picking regions that are located
very close to each other. While you'll obviously get independent code and
(likely) independent deployments, you're still susceptible to issues
correlated with the physical location.

[0] [https://medium.com/netflix-techblog/global-cloud-active-acti...](https://medium.com/netflix-techblog/global-cloud-active-active-and-beyond-a0fdfa2c3a45)

------
sharemywin
Hasn't anyone heard of disaster recovery plans? I used to work at a medium
sized insurance company and every year we had a project to update our disaster
recovery plans, including for our main in-house datacenter going down. If it
was a critical system you'd better have a plan to get it back up in like 4
hours. And those were business critical; we didn't have any life-critical
systems.

~~~
YawningAngel
What's the disaster plan for "DynamoDB doesn't exist any more"? There is
literally nothing else like it in the world. I don't know of an idiot proof
queue system that can handle the scales SQS can take either.

~~~
darkr
Cassandra?

Rabbit?

------
nogbit
Yes and no. By design it's not big, it just seems big. With relative RPO and
RTO anyone can failover to other regions. And if you aren't leveraging
multiple AZs within a single region you need to rethink how you are using
AWS.

The very nature of AWS requires Amazon to build in capabilities to handle
failover. But, as they say at Amazon, "everything fails, always".

------
smegel
Is it possible for AWS to have a multi-region outage - as in is there anything
connecting them that could bring them all (or several) down at once?

(Apart from the result of a botched patching or update to the core software
stack that was done worldwide at the same time and hopefully never happens).

~~~
askvictor
A cascading electrical grid failure? I don't know if there are any
interconnects between the regions with the DC's, but if there were that might
be a concern. Though at that stage, presumably most of the US is without
power, hence not so much need for AWS.

~~~
smegel
Well I guess a nationwide power outage will have bigger implications than
Netflix going down...

~~~
yjftsjthsd-h
Yeah, it might take down Netflix _and_ YouTube! Then what are we supposed to
do while we wait for the electricity to come back on?!

------
jriot
Nothing is too big to fail. Society needs to be able to adapt and maintain a
level of patience during transition times, i.e., be patient while moving to a
new tool when Amazon's cloud fails.

------
zeep
If Amazon's cloud service disappeared today, it would be chaos for a
week or two but most people should recover (as long as they have backups).

~~~
pavel_lishin
I'd wager most people's database backups live in AWS as well.

Plus, some people have huge, huge datasets. It could easily take weeks to
migrate to, say, GCE, or to your own hosted servers. In the latter case, it
would also necessitate a pretty large up-front investment.
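
To put a number on "weeks", a quick estimate of raw transfer time. The dataset size, link speed, and 80% efficiency factor below are illustrative assumptions, not measurements, and the estimate ignores egress throttling, verification, and re-indexing on the other side.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(
    dataset_bytes: float, link_bits_per_second: float, efficiency: float = 0.8
) -> float:
    """Days needed to move a dataset over a link at a given usable fraction."""
    usable_bits_per_second = link_bits_per_second * efficiency
    return dataset_bytes * 8 / usable_bits_per_second / SECONDS_PER_DAY


PETABYTE = 1e15  # bytes, decimal definition
for gbit in (1, 10, 100):
    days = transfer_days(PETABYTE, gbit * 1e9)
    print(f"1 PB over {gbit} Gbit/s: about {days:.1f} days")
```

With these assumptions a petabyte takes well over a week even on a dedicated 10 Gbit/s link, before any of the actual migration work starts.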

------
nhumrich
For articles where the headline is a question, the answer is always "no".

------
amerine
No.

[https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...](https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines)

