I of course don't know the tradeoffs involved in running your system, but I know that in a lot of my situations the simplicity of a single AZ with a straightforward failover option is usually the right tradeoff.
To put some numbers to that: I’ve been running a relatively well-trafficked website in multiple AZs since 2011. We had ~20 minutes of downtime when they had a network routing issue for us-east-1 and a few hours of degraded service when S3 had a region-wide outage. I haven’t added up the number of single-AZ outages during that period, but based on the RSS feeds I think it’s a good bit more relative to the very modest additional cost.
If Netflix can go down when AWS goes down, so can your app. AWS outages impact so much of the Internet that people will just accept it.
Of course here I am with a site on Heroku which uses AWS... impacted by the AWS outage... fielding questions about why I didn't pick AWS if Heroku suffers outages like this. Can't please them all.
So your statement relies on customer ignorance.
We got lucky this time: RDS, ELB, EC2 and Lightsail instances all in US-East-1 across multiple accounts and no issues (knock on wood, and I understand we'll get it next time). Especially happy as I had a neural net training job that had been running for two days, and it's still going; losing that would have been depressing. Phew.
It seems like a lot of businesses are chasing this ideal of perfection and end up much worse off than if they had just stuck whatever application on a single server in a semi-reliable part of the world.
Specific providers aside, there's more complexity involved in a large cloud provider's infrastructure and much more that can go wrong as a result. Having a code update or an orchestration issue at your infrastructure provider be a potential cause of a major outage is a huge and unnecessary risk. You don't need that much scale: just enough resources to fill up a few whole physical machines, for a few hundred dollars a month. Add some globally distributed BGP Anycast DNS and database replication and you have enough redundancy to withstand most of the worst major infrastructure failures.
I would understand if AWS were super simple and convenient, but these days the learning curve seems far greater than setting up the bare metal solution described above, while it's almost an order of magnitude more expensive for the equivalent amount of resources.
How did we end up here? Does brand recognition just trump all technical and economic factors, or what am I missing?
Disclaimer: I run a bare metal hosting provider
I trust any of the big cloud providers to do these things more reliably than I can. In particular, if I'm going to replicate a database across data centers within a region (availability zones as the big cloud providers call them), I'm quite sure that a managed database service will be more reliable than my own hand-configured cluster.
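In fact, the multi-AZ replication part is close to a one-flag affair on the managed side. A minimal boto3 sketch; the instance identifier, class, and credentials are all made-up placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # MultiAZ=True provisions a synchronously replicated standby in a second AZ,
    # with failover handled by AWS instead of a hand-configured cluster.
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",       # placeholder identifier
        Engine="postgres",
        DBInstanceClass="db.m5.large",           # placeholder instance class
        AllocatedStorage=100,
        MasterUsername="example_admin",          # placeholder credentials
        MasterUserPassword="change-me",
        MultiAZ=True,
    )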
The CPU model is non-standard, but at 10 cores and 2.90GHz it is effectively a slightly higher-clocked version of the E5 2660v3 (10 cores, 2.60GHz). The first Google result for a 2660v3 dedicated server with an order page that lets you adjust the options (13 usable IPs, 64GB RAM, a minimum of 120GB SSD storage) comes out to $275.
And this is based on a whole-box-to-whole-box comparison. The cost of individual instances at AWS equivalent to one of those boxes can be much higher depending on the type and size.
And it's incredibly rare you'd be limited to a single rack of course.
Cloud has many tangible benefits, but "amount of compute available" is not one of them. Time to acquire compute, though, is. (and, obviously, management of the resources/datacenter operations).
Cloud is almost never cheaper, even factoring in salaries. It's just very convenient if you're small enough not to have people providing compute very well internally. (and, internally, people tend to understaff/underfund the teams that would do the same job as cloud operators are doing)
- We don’t want to maintain our own build servers. We just use CodeBuild with one of the prebuilt Docker containers or create a custom one. When we push our code to Github, CodeBuild brings up the Docker container, runs the build and the unit tests and puts the artifacts on S3.
- We don’t want to maintain our own messaging, sftp, database, scheduler, load balancer, auth, object storage, servers etc. We don’t have to, AWS does that.
- We don’t want to manage our own app and web servers. We just use Lambda and Fargate and give AWS a zip file/docker container and it runs it for us.
- We need to run a temporary stress test environment or want to do a quick proof of concept: we create a CloudFormation template, spin up the entire environment, run our tests with different configurations and kill it. When we want to spin it up again, we just run the template.
We don’t have to pay for a staff of people to babysit our infrastructure; between our business support contract with AWS and the MSP, we can get all of the troubleshooting and busywork support as needed.
I’m a software engineer/architect by title, but if you look at my resume from another angle, I would be qualified to be an AWS architect. I just don’t enjoy the infrastructure side that much.
That's fair, that's totally your right.
However, you're talking about absolute cost, and unfortunately your examples weave through true and false quite frenetically.
> - We don’t want to maintain our own build servers. We just use CodeBuild with one of the prebuilt Docker containers or create a custom one. When we push our code to Github, CodeBuild brings up the Docker container, runs the build and the unit tests and puts the artifacts on S3.
Like all things business, "want" and "cost" are different, in this case, depending on your size of course, it could easily be cheaper to have a dedicated "build engineer" maintaining a build farm. This is how the majority of people do it. (I work in the video games industry, it's _MUCH_ cheaper to do it this way for us)
> - We don’t want to maintain our own messaging, sftp, database, scheduler, load balancer, auth, object storage, servers etc. We don’t have to, AWS does that.
Again, those are "wants"; TCO can be much lower out of the cloud. But again, it depends on scale (as in, lower scale is cheaper on cloud, not larger scale).
> - We don’t want to manage our own app and web servers. We just use Lambda and Fargate and give AWS a zip file/docker container and it runs it for us.
I mean, 1 sysadmin can automate/orchestrate literally thousands of webservers.
>- We need to run a temporary stress test environment or want to do a quick proof of concept: we create a CloudFormation template, spin up the entire environment, run our tests with different configurations and kill it. When we want to spin it up again, we just run the template.
Yes, this is a real strength of cloud.
> We don’t have to pay for a staff of people to babysit our infrastructure; between our business support contract with AWS and the MSP, we can get all of the troubleshooting and busywork support as needed.
Yes, but you are paying "overhead" for all of that, and not having talented engineers on your payroll who understand your business-critical systems is, in my opinion, foolish.
I've dealt with vendor support and it's incredibly hit and miss, and it's much more "miss" when you're a smaller customer to the vendor. Of course, this is anecdotal.
Cheaper to have a dedicated build engineer than using CodeBuild? I just looked at my bill for August; my startup has $50K/mo in AWS spend across 4 regions in US/EU/Asia. We use CodeBuild to build and deploy all of our infra from GitHub, including a ton of EC2 for our dedicated apps.
Guess how much my bill for CodeBuild was in August? 17 cents!
$0.06 Asia Pacific (Singapore)
$0.07 Asia Pacific (Tokyo)
$0.02 EU (Frankfurt)
$0.00 EU (Ireland)
$0.02 US West (Oregon)
I'd like to see you hire a build engineer for $0.17. AWS services are dirt cheap because they let you automate all of the stuff that would otherwise require dedicated engineers, while you focus on your business, or what differentiates you.
That’s $80K to $100K. You can buy a lot on AWS for that price...
That’s another $100K to $200K....
That’s yet another $100K. You’re up to at least $250K - $500K In salaries.
I am one of the “talented engineers”; that’s why I mentioned I could go out and get a job tomorrow as an AWS architect. My resume is very much buzzword compliant with what it would take, from the development, DevOps, and netops sides, to hold my own in a small to medium size company. I just find that side of the fence boring, so we outsource the boring work, or the “undifferentiated heavy lifting”.
As that side got to be too much for me, we hired one dedicated sysadmin-type person to coordinate between what he does himself, our clients, and our MSP.
I’m actually trotted out as the “infrastructure architect” to our clients even though my official title and day-to-day work is a developer. I haven’t embarrassed us yet.
I agree completely. If it’s something complex, I either do it myself or have very detailed requirements on what our needs are. But honestly, the more managed services you use, the less you have to do that part.
We can do a lot with a half million dollars a year on AWS.
As I was implying, you’re just outsourcing your ops. At scale, you end up spending significantly more than you expect.
Also, when I put my software architect hat on (and take my infrastructure hat off), it’s a lot quicker to get things done just to ask our MSP to open an empty account in our AWS Organization, spin up the entire infrastructure, pilot it, get it approved, audited, and then run the same template in the production account without having to wait on a change request, approvals, pre-approval security audits, etc.
I’m also not advocating all in on cloud. With a Direct Connect from your colo to your cloud infrastructure, it sometimes makes sense to have a hybrid solution: everything from using your cloud infrastructure as a cold DR standby to using it for greenfield development where a team doesn’t need to be shackled by change requests, committees, etc.
Cloud is great for speed of deployment. But once something is built, stable, and has a predictable load, it’s a huge cost saving to bring it in-house. Many don’t, probably because they’ve used some cloud-only technologies or fear the migration path will take time.
So you just continually line Bezos's pockets instead of using the cost savings to remain liquid.
In what world is that a lot of downtime?
Usually you strive for "five 9's" in infrastructure; obviously there's a lot of wiggle room depending on the business case. But reliability for individual components gets exponentially harder with each 9 after the first 2.
99.96% uptime for a datacenter is shockingly low once you take connection issues into account (i.e., the number of successful inbound packets vs. unsuccessful ones, not just served requests). For context, my company has around 15 datacenters around the world which routinely hit five 9's, with only a few incidents of datacenters being down for 2-3 minutes during a particularly bad ISP outage.
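For a rough sense of what those figures mean, here is the back-of-the-envelope downtime math (plain Python, nothing provider-specific):

    # Downtime per year implied by a given availability figure.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.9996, 0.99999):
        downtime = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.4%} uptime ~= {downtime:.1f} minutes of downtime per year")

    # 99.9600% uptime ~= 210.2 minutes of downtime per year
    # 99.9990% uptime ~= 5.3 minutes of downtime per year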
The overwhelming majority of degradations are related to bad code being deployed. But since overall reliability is the product of all components' availability, it follows that permitting more outages is less preferable, especially since they affect all, or at least the majority of, components in a given region.
In every company I have worked for, the number of outages caused by bugs and other post-deployment issues was already above that number.
IMO you should do everything you can as multi-AZ, and if there are some services that are harder to do and you don't need it, then put them in a single AZ.
The thing is, if you keep everything in a single AZ it will be much harder to change when this requirement becomes important.
For simpler sites, having multi-region failover (even if it's a manual failover and you lose a few in-flight transactions) is much easier to build.
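To illustrate how lightweight the manual option can be: at the DNS level, failing over can be as little as repointing one record. A rough boto3 sketch; the zone ID and hostnames are made up, and it assumes you already run a warm standby in the other region:

    import boto3

    route53 = boto3.client("route53")

    # Repoint the public hostname at the standby region's endpoint.
    # With a low TTL set ahead of time, most clients follow within minutes.
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE123",                  # placeholder hosted zone ID
        ChangeBatch={
            "Comment": "manual failover to the standby region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "standby.us-west-2.example.com"}],
                },
            }],
        },
    )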
1. People don't realize how much they love and depend on you until you're gone.
2. Keeps you on your toes, it's easy to get complacent when everything just runs along happily for months and years on end.
I do wish there was a way to train users that millions of them reloading constantly as service ramps back up doesn't accelerate the ramp-up time, though. ;)
When I design services these days, I try to design them so these failure scenarios are constantly exercised. E.g., if I care about multi-AZ resiliency, I try to design it so that it’s forced to fail over to other AZs all the time. Or at the least, write tests for the scenario. Exceptional behavior or code paths are dangerous.
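As a sketch of what "constantly exercised" can mean in practice: something as crude as the following, run on a schedule, keeps a fleet honest about failover. The tag key, service name, and AZ are hypothetical; this is a chaos-style drill, not anyone's production tooling:

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def kill_one_instance_in_az(az: str, service_tag: str) -> None:
        """Terminate one random running instance of a service in the given AZ,
        forcing the rest of the fleet to absorb the loss."""
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "availability-zone", "Values": [az]},
                {"Name": "tag:service", "Values": [service_tag]},   # hypothetical tag key
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if instance_ids:
            ec2.terminate_instances(InstanceIds=[random.choice(instance_ids)])

    kill_one_instance_in_az("us-east-1a", "checkout")   # hypothetical service name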
tbh, I can't recall a support request from a customer that was caused by our infrastructure vendor and not our product.
I'm interested in any evidence to back up my impression if anyone has bothered to do the proper data gathering.
(Aside, stink eye on whoever made a breaking change over a holiday weekend, if this turns out not to be random.)
Working with multiple regions is cost-friendly on AWS. You should put in the time and learn how that stuff is done; it’s not as complicated as you think.
I use Lambda all of the time and I’m definitely not afraid of the “lock in” boogeyman, but I always architect my lambdas to make moving away from Lambda to either Fargate or just an EC2 instance as easy as possible.
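Concretely, that mostly means keeping the handler as a thin adapter over plain functions. A minimal sketch; the function names and event shape are made up:

    # Core logic knows nothing about Lambda, so it runs the same from
    # Fargate, an EC2-hosted web framework, or a plain unit test.
    def process_order(order: dict) -> dict:
        # ... business logic here ...
        return {"status": "ok", "order_id": order["id"]}

    # The Lambda entry point is just an adapter over the event format.
    def lambda_handler(event, context):
        return process_order(event["detail"])   # assumes an EventBridge-style event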
On another note, it’s just as easy to architect your regular old EC2 instances running stateless servers to be AZ-failure resilient. Just set up an autoscaling group with a min/max of 1 and configure it to work across multiple AZs.
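Roughly like this with boto3 (the group name, launch template, and subnet IDs are placeholders; the three subnets are assumed to sit in three different AZs):

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # A min/max of 1 across subnets in three AZs: if the instance's AZ goes
    # down, the ASG replaces it in one of the healthy AZs.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="stateless-web",                    # placeholder name
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300,
    )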
Also, Lambda comes with its own set of limitations: a maximum runtime of 15 minutes, cold start times, temporary storage of only half a gig, limited CPU/memory options, etc.
EDIT T16:31Z: It appears Heroku has failed over their dashboard, but dynos are still failing to come online. We had assumed that they had multi-region failovers for their customers. Incredibly disappointing.
Also Heroku: we auto restart your dyno every day!
So now my app is down because Heroku forced it to restart, and none of our hourly employees can work :|
As a PaaS, I would think that they would run a high-availability cluster across at least two regions so that they would have a mechanism in place for events like these. I know it's expensive, but if you charge 250 for 2.5GB of RAM I believe you would have enough money to cover it. I also think, as you hinted, that they should separate services across different regions.
> 10:47 AM PDT We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most instances but still have 1.5% of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue.
And if there is an issue for some reason, failing over to another AWS region is a lot easier than having to fail over to another cloud provider. Outside of a very small number of cases, building for multi-cloud is a lot of unnecessary work.
> We are investigating connectivity issues affecting some single-AZ RDS instances in a single Availability Zone in the US-EAST-1 Region.
Who cares about stateless, that's a solved problem.
> Deploy across multiple cloud providers including AWS EC2, Kubernetes
The only difference between then and now is that we’re online (seemingly) at every waking minute expecting a hundred different services to be functional at any given moment.
In the 80s if a university campus internet connection went down, only that university was affected. Now, when a single AWS availability zone goes down, a much wider swath of users is impacted. Such consolidation / centralization shows a disregard for the spirit of the early internet and design considerations that went into it.
Again, maybe I'm full of shit. Lots of people here seem to think so.
I've avoided that region and I can't remember the last time I had downtime caused by Amazon.
Well there’s your problem, people. Use multiple AZs.
Technically your Lambda never runs “inside your VPC”, but it’s a colloquialism that everyone understands.
I don't work for any of the entities mentioned.
My client's Heroku instances are online, thankfully.
Can anyone here speak to their experience with the Ohio region? I'm considering leaning on that more and more.
I've been told that the "a" AZ you get was the least populated at object creation time (i.e. the first time you make an object that lives in an AZ), but I don't know how valid that is.
Is there any way for the owners of the instances to reach them?
That has led to us-east-1 being the largest AWS region by far, also comprising the largest number of availability zones (6) of all AWS regions.
If we move anywhere, it's going to be completely out of AWS and into on-prem or some bare metal provider. Hopping regions hoping to win at some reliability-metric game is not a good way to run a business IMO.
Designing truly resilient and available applications with DB servers that replicate across continents is hard.
I guess partitioning can help, but then isn't it just turning the DB servers into pizzas of master-slave where the Hawaiian slice is master only in Hawaii, and slave everywhere else?
aws ec2 describe-availability-zones --region us-east-1 --output text
Glad to know that it wasn't anything personal over any Hacker News gags I've done.
If your entire service just went down as soon as this happened, Congratulations! You didn't deploy in multiple regions or think about a failsafe/fallback option that redirects from your affected service or instance.
An outage like this happens how often?
Edit: Looks like this is affecting a single AZ... so a bit of a different situation, but I would agree that if you're not capable of surviving a single-AZ outage in 2019, then your engineering team should be replaced.
My engineers are all React and CSS web developers. They don't know anything about multi tenant data resiliency. But they can make a real pretty "system down" page.
Without data, it doesn't even cost significantly more than running in a single region nowadays, if you are willing to go serverless. As serverless stuff (FaaS, ...) is pay-for-what-you-use and the provider handles the scaling automatically behind the curtain, you can easily deploy to multiple regions without much additional cost.
With data, you of course have the cost of storing the data multiple times in the different regions (or of coming up with some kind of sharding) and of solving the consistency challenges that come with that, but at least services like DynamoDB and S3 offer cross-region replication out of the box nowadays, and you don't have to provision any capacity like you used to (thanks to DynamoDB auto scaling and so on).
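For example, with DynamoDB the cross-region part is roughly this (a boto3 sketch of the global tables API; it assumes identically named tables with streams enabled already exist in each listed region, and the table name is a placeholder):

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Link the per-region tables into one global table; AWS then replicates
    # writes between the regions behind the scenes.
    dynamodb.create_global_table(
        GlobalTableName="sessions",                  # placeholder table name
        ReplicationGroup=[
            {"RegionName": "us-east-1"},
            {"RegionName": "eu-west-1"},
            {"RegionName": "ap-southeast-1"},
        ],
    )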
Once you have your application running in multiple regions you can direct users to the closest one, so they enjoy lower latencies.
I believe for a lot of applications running cross-region just makes a lot of sense as it offers various benefits.
Always CYA, guys... you will pay for it if you don't.