It seems reasonable to start worrying about the fragility potentially introduced by these massive internet infrastructure companies.
The US has no terrorism problem. In 1979, the IRA killed the Queen's uncle-in-law, and in 1984 they blew up a hotel where Thatcher was staying. That's a serious terrorism problem! What the US has is nothing by comparison. We were unlucky on 9/11, and ever since we've been distorting our foreign policy out of unreasonable fear.
There are a LOT of far softer targets that go unprotected. A terrorist attack on a sewage plant for a major city would be far more devastating than knocking out a few websites.
There's probably a moderate amount of unseen security in larger areas surrounding water/sewage plants.
Also, we've dealt with regional natural disasters before. Drought, fire, flooding, etc - we have experience mobilizing resources and redistributing as needed.
We've never dealt with an extended outage of one or more AWS data centers for days at a time. How many govt/university systems would be unable to function because of direct or indirect dependencies on AWS-related services?
S3 going down for, let's say, 4 days would wreak havoc on so many projects and systems I know of.
I'm pretty sure most people have no clue how much of their data and systems functionality is reliant on AWS-related services.
Water outages happen often enough - they distribute water manually with tanker trucks. When sewage is overwhelmed it overflows into a river and they tell people to not go for a swim. https://en.wikipedia.org/wiki/Sanitary_sewer_overflow
Some people wouldn't be able to do their computer work or send and receive emails; others might not receive their paychecks or be able to pay bills.
The sewage plant hits both physiological needs and safety needs (it's a pretty dire health threat quickly). AWS has weak tendrils to financial safety, but mostly affects layers above that.
And I think even more than we can imagine. It's one thing to count the services that are direct customers/dependents of AWS, and even there we don't really know how deep it goes.
But add to that the non-AWS-based services that directly or indirectly rely on AWS-based services.
It wouldn't surprise me if those police systems have moved, or are moving, to Azure or AWS data centers.
It's much worse if benefits or paychecks don't get paid. For many people that means that they won't be able to buy groceries as soon as a few days later. Not the standard HN crowd, but many people depend on regular paychecks and can be in trouble if it comes even one day late.
Funny thing with the police: in the 60s there was flooding and they couldn't coordinate a response, so radio amateurs had to step in and provide comms. Those enthusiasts are still around under the name RAYNET. The police eventually got their act together and invested in a system called TETRA, which is self-contained and independent of any other infrastructure, so they would never be caught with their pants down again.
Now they've ditched TETRA to save money and they just run over the 4G network like anyone else...
If you hit AWS, Netflix alone is going to affect 100 million people. There are definitely a few payroll providers with some reliance, so people all over are going to stop receiving their paycheques. Many businesses will be temporarily disabled.
On a national level we've got a lot more eggs concentrated in the AWS baskets than we do the sewage treatment baskets.
According to this rackspace article, the largest AZ has 5 data centres.
The abstraction behind AZs is that every AZ counts as at least 1 data centre. So every region has at least 2 AZs, and every AZ means at least 1 data centre (or 5, as in your link).
This just means it's harder to take out a whole region by destroying individual data centres. Since most regions consist of 2-5 AZs, and AZs can consist of 5+ data centres, that means destroying dozens of data centres.
I know my company and a few others are redundant within a region (i.e. if one AZ goes down), but not if a whole region goes down.
When you widen the potential attack surface to include software vulnerabilities, unauthorized access, process flaws and other "soft" vectors, a much wider--possibly coordinated--attack that is potentially far more crippling can be imagined.
Of course just speculation, I have no knowledge about DR plans at Amazon.
However, we actually have evidence that a sewage plant for a major western city can suffer a sudden catastrophic failure and it won't even make it above the fold.
Of course, if you completely remove an entire AWS region you might induce very damaging stresses on other regions as people fail over.
But, yeah, I'm kinda' surprised that this HN crowd in particular is so focused on hardware vectors.
I imagine Amazon has some on site security measures as well.
It would be interesting to see how important services would cope with their main region going down for more than a few days.
You don't need to sever both power and data cables. For power, the datacentre should be able to cope for a while, particularly if it has access to fuel deliveries. Data cables should be both easier to sever and more difficult to recover from (not least identifying which cable has been cut, and where).
Most of a country's cable infrastructure runs along railway tracks, sewers, etc. That means thousands of kilometres of cable even for a small country, and it is impossible to secure it all. It is impossible to sever it all either, but you don't need to sever every single cable: as long as you sever enough of the backbone, the remaining cables will be overloaded.
So I don't think it would take that much effort for a network of terrorists to create havoc in a country's communications for at least a few days to a week. And the consequences for the economy could be pretty dire. We saw with BA what happens when their datacentre goes offline: their whole fleet was grounded. I imagine the consequences of a country-wide outage could be pretty dramatic. It's unlikely anyone would die, but you could really dent the GDP.
If you wanted that type of destruction, and to be noticed, you would need a city-leveler-type event in a data-center-heavy area.
That should be the priority.
Realistically, this has not been the solution implemented (in the EU & US, at least). In the EU, it is even more crucial, as the "solutions" to this problem are applied to state finances as well as financial institutions.
In terms of policies, there are two competing approaches: (1) Reduce the size of "too-big-to-fail" institutions. (2) Regulate them more heavily (or some other strategy) so that they will not fail. In the EU, this is being applied to states, not just financial institutions. Rules that (supposedly) reduce catastrophic risk.
Almost all serious policy proposals are in the no. 2 category: tighten regulation, reduce the risk of failure. But tighter regulation tends to mean stronger incumbents and larger average company size, so by doing 2 you are probably doing the opposite of 1.
As I said, I don't know what Bernie's proposal is or how mature it is as a policy (as opposed to a political statement). It would be notable if a left-wing politician proposed loosening bank regulations, though definitely not impossible or unreasonable.
That seems like the most reasonable response. And yet, since the great recession, our policy has been "make 'too big to fail' even bigger".
The problem is that the banks have become too powerful for anyone to challenge. A Teddy Roosevelt type of political leader can't exist today.
Standard Oil was as powerful as the US Government in that era, as were the railroads. JP Morgan was far more powerful than the US Treasury. Cornelius Vanderbilt - pre Roosevelt - all by himself had greater financial capability at his peak than the US Government at the time.
The difference is that back then there was widespread and growing fear of the combinations and would-be monopolies. Today, Americans are relatively unconcerned about Microsoft (desktop), Google (search, Android), Walmart & Amazon (retail), Intel (microprocessors), Facebook (social), Cisco, Boeing, etc.
Seems like it's fitting here, does it not? Certain banks being "too big to fail" is an idea passed among persons within our culture.
I'd say that most industries do not have anyone responsible for worrying about it high enough in the management chain.
If you can't build/run a better AWS replacement, then it's a moot point, isn't it?
Then the question becomes: if you can't build a better AWS, can you architect your application to handle AWS failures? AWS itself lets you handle many kinds of failures at the AZ/DC level. Are you using that? For global AWS outages, can you have a skeleton, survival-critical system running on GCP or Azure?
Have you thought about outages that would be out of your control and out of AWS's control e.g. malware, DDoS, DNS, ISP, Windows/Android/iOS/Chrome/Edge zero day? How are you going to handle outages due to those issues?
If you are prepared to handle outages (communication, self-preservation, degraded mode, offline mode) then can a serious AWS outage be managed just like those outages?
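To make the failure-handling part concrete, here is a rough sketch of what a degraded read path could look like; everything in it (bucket names, regions, the cache path) is hypothetical: try the primary region, fall back to a replica bucket in a second region, and finally serve possibly-stale data from a local cache.

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # Hypothetical buckets: a primary in us-east-1 and a replica kept in sync
    # (e.g. via cross-region replication) in eu-west-1.
    PRIMARY = ("us-east-1", "myapp-data-primary")
    REPLICA = ("eu-west-1", "myapp-data-replica")

    def read_object(key):
        for region, bucket in (PRIMARY, REPLICA):
            try:
                s3 = boto3.client("s3", region_name=region)
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (ClientError, EndpointConnectionError):
                continue  # that region is unreachable or erroring, try the next
        # Degraded mode: serve possibly-stale data from a local cache.
        with open("/var/cache/myapp/" + key, "rb") as f:
            return f.read()

None of it is hard, but it only helps if it's actually designed in and exercised before the outage happens.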
It's like a cow's opinion, you know, it just doesn't matter. It's "moo".
Woah, Prof Brians updated his layout
And prospective users can look at the published prices, and see that historically they've gone down more than they've gone up (although obviously that trend could reverse).
So people think they're safe from the risk of Amazon quadrupling their bill overnight.
Of course, vendor lock-in can have other negative effects, but apparently people aren't worried about them, or at least think AWS is no worse than the alternatives.
Rates are absolutely negotiable.
It's such a poor argument. I was a developer long before AWS appeared, and I've used so many open source packages that were profoundly reliable. In many cases, recovery just takes a daemon restart. And while it's not exciting to set up some of that stuff, it's far more tolerable than writing a CloudFormation template.
I don't understand the preference for AWS over open source in many cases. Their services are "reliable", but they often have minute restrictions that will eventually bite you. You also end up having to pay for something you could get for free. Why use SNS/SQS when there are free pubsub/message buses out there? Most of the other devs justify this with the argument of not having to maintain the software themselves. "But RabbitMQ might crash! We don't have to worry about that with AWS!"
Anyway, I typically minimize the AWS services I use (S3, EC2, ECS) so I don't dread the day AWS blows up or, more likely, some VP or exec says we're moving to GCP/Azure because we got a better deal.
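For what it's worth, the happy-path send code barely differs between the two; the real difference is who carries the pager. A rough sketch (queue names, account ID and URLs are made up):

    import boto3
    import pika

    # SQS: managed by AWS, pay per request, no broker of your own to operate.
    sqs = boto3.client("sqs", region_name="us-east-1")
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
        MessageBody='{"order_id": 42}',
    )

    # RabbitMQ: free and self-hosted, but you own the broker's uptime.
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="orders", durable=True)
    channel.basic_publish(exchange="", routing_key="orders", body='{"order_id": 42}')
    conn.close()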
Free is never really free. There's always a tradeoff in engineering time and money when you choose to run your own stack instead of paying to use a stable, well-established service. Oftentimes running your own will be cheaper overall, but you have to do that cost-benefit comparison for yourself.
I can confirm that not only can RabbitMQ get into an unusable state, it will do so extremely rapidly and with little warning unless you sit an engineer or two on it to monitor and manage the incoming/dead letter rates.
- S3 has a public protocol and many 3rd party providers support it (OpenIO, Scality, Ceph, Minio, etc),
- EFS could be replaced with something like DRBD or GlusterFS, or DigitalOcean's block storage or Google Cloud's networked disks.
- ELB could be replaced easily with similar services from other providers if you use Kubernetes (I don't know if all of them have a LoadBalancer type though)
I would be more concerned about firewall/vpc rules, because I have no idea how those could be migrated without risk of forgetting some. Lock-in seems not that high in the end though and even less so if you use an open source container orchestration stack because they abstract most of these things away.
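On the S3 point, the protocol really is the portable part: the same client code can talk to AWS or to a self-hosted MinIO/Ceph gateway just by swapping the endpoint. A minimal sketch (endpoint, bucket and credentials are placeholders):

    import boto3

    # Same code path for AWS S3 and any S3-compatible store (MinIO, Ceph RGW, etc.);
    # only the endpoint and credentials change.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.internal:9000",  # drop this line for real AWS S3
        aws_access_key_id="PLACEHOLDER",
        aws_secret_access_key="PLACEHOLDER",
    )
    s3.put_object(Bucket="backups", Key="db/latest.dump", Body=b"...")
    print(s3.get_object(Bucket="backups", Key="db/latest.dump")["Body"].read())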
Heap has a great blog post on Terraform: https://heap.engineering/terraform-gotchas/
From P2P networks, with each computer being a "data store" on the internet and no one entity able to control data, to the modern-day centralized cloud where a couple of players control so much.
There has been a cultural shift. In the early 2000s, the idea of storing your data somewhere else would have been weird. But now, people don't think twice about keeping their data in Apple's/Google's/etc.'s data centers.
I think it has to do with the fact that computer/internet illiterate people are now the majority whereas in the 90s/early 2000s, it was generally the computer literate on the internet.
I think the reasoning was that cloud accounts are easier for the masses than mapping a drive and accessing it over VPN.
The connections that could cause problems may not be obvious. For example, a network provider running into trouble because a ticketing or monitoring system that depends on Amazon does not work. Or a hardware supplier unable to ship spare parts for your on-premise SAN because their logistics company runs into trouble due to issues at Amazon.
Their support is alright, although you often have to pay for it. The AWS docs, however, are atrocious: they remind me of university textbooks written by professors who like inventing pseudo-scientific-sounding jargon. Combined with the huge array of features, that makes AWS uncomfortable to use even for people with intermediate experience (the built-some-apps-with-AWS-before kind of people).
I can see room for more specialized services like Firebase (which is built on Google Cloud) being built on top of AWS for its users. Firebase is a breeze to use and very responsive, and I've used it to build real-time chat apps in a couple of days.
The author states that a "snowball" is a grey suitcase with 50 TB of HDD space inside, and a "snowmobile" is a massive 18-wheeler with what I would assume is petabytes of storage.
> Not the lumps of mush and ice that children chuck at each other, but Amazon’s portable information storage devices, big grey suitcases that hold huge amounts of data.
Capitalizing it might have helped, though.
Basically, every 6 months the DR test would fail, and this was accepted as a harsh reality. After seeing how they work inside, I don't think moving their infrastructure to AWS/Azure/Google is the worst that could happen.
disc: Currently working at Amazon, but not at AWS.
Which I think is a merit of using VMs as opposed to individual services.
You can do that easily if you just treat clouds merely as hosted hypervisors and think entirely in terms of VMDKs. But this doesn't make commercial sense to do at least in the short term - you need to utilise the layered services you are paying for anyway or you might as well just run your own DC.
Cloned bananas vs fingers disease but computers.
We have contingency against this via our own infrastructure but I worry about organisations who don't have any.
Some of the traditional apps we host are vulnerable to hypervisor failure, be that at the rack, DC or region level.
Then some businesses will be out for a few hours / days.
No big deal.
From WWII to 9/11 to Katrina (and whatever regional stuff we have), we have been through much worse than that in modern history.
There might well be a commercial niche for providing Azure Stack hosting in non-Microsoft data centers.
Personally, I think MS crapped the bed a little by taking Azure Stack off of commodity hardware and onto a combined hardware/software solution. Being able to deploy Azure-compatible solutions piecemeal locally would be a massive boon to governments, healthcare operations, and anyone working on a more thorough migration to the cloud.
Most of the EU, for example, has privacy regulation that makes cloud hosting impossible in some situations. Having a 'local Azure' would make it highly reasonable to have all apps architected around Azure's components and technology. Without the local deployment option, though, you're kinda stuck with each foot in a different canoe... Hybrid infrastructures are highly favorable to DevOps and multi-party development scenarios.
"“So if the performance is dropping, do you call the server manufacturer, do you call the networking manufacturer, do you call the load balancer manufacturer, do you call the storage manufacturer? They typically point the finger at the other guy and you spend weeks and months trying to debug and get your cloud to work."
We can all relate to that. A "cloud" is sufficiently complex that vendor blaming is an almost guaranteed outcome.
How do people who need to have more nines of availability manage this issue with cloud providers? (EC2 and RDS promise 3.5 nines per AZ, but I imagine outages are somewhat correlated across zones)
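For scale, a quick back-of-the-envelope (taking the 99.95% figure at face value and, optimistically, assuming fully independent zones):

    HOURS_PER_YEAR = 365.25 * 24

    single_az = 0.9995                     # "3.5 nines" per AZ
    two_az = 1 - (1 - single_az) ** 2      # active-active across 2 AZs

    print("one AZ:  ~%.1f hours down/year" % ((1 - single_az) * HOURS_PER_YEAR))
    print("two AZs: ~%.0f seconds down/year" % ((1 - two_az) * HOURS_PER_YEAR * 3600))
    # ~4.4 hours vs ~8 seconds -- but only if failures really are independent,
    # which correlated, region-wide events tend to break.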
If you do go multi-cloud, I would be wary of picking regions that are located very close to each other. While you'll obviously get independent code and (likely) independent deployments, you're still susceptible to issues correlated with the physical location.
Users are patient enough to give you a pass if you're down that amount (especially if you're down that amount while 1/3rd of the internet is also down).
Our largest e-commerce retail site does over $1BB/yr in fairly high-margin sales and still targets "only" 99.95% availability (generally it exceeds that with actual results, but we don't target higher than that). It's a hybrid of on-prem and cloud services backing that, migrating towards the cloud, but will never be 100% cloud as we own and run factories with on-prem equipment.
(I know you asked "how" and I answered "whether", but I thought it relevant.)
The very nature of AWS requires Amazon to build in capabilities to handle failover. But, as they say at Amazon, "everything fails, always".
(Apart from the result of a botched patching or update to the core software stack that was done worldwide at the same time and hopefully never happens).
Also, deployments are designed to be exponential and no region should ever have a cross region dependency.
It looks very separated from the outside, but I've worked at so many companies that appeared incredibly competent externally yet had "snowflake" servers keeping things ticking over. Given Bezos' treatment of workers, I have absolutely no confidence that everything is as cleanly engineered as they claim.
This ^^^ When I go to log in, it redirects me to: https://us-east-1.signin.aws.amazon.com
And for example the S3 console is on the URL https://console.aws.amazon.com/s3/home?region=us-east-1
Of course this doesn't mean I can't easily switch to a different region, or for that matter control all my regions from a single interface, but there is some distinction.
Plus, some people have huge, huge datasets. It could easily take weeks to migrate to, say, GCE, or to your own hosted servers. In the latter case, it would also necessitate a pretty large up-front investment.