It is very common for AWS to completely run out of entire classes of instance types e.g. all of R5 or all of M5. And when that happens your cluster will die.
What you want to do is split your cluster into at least two node groups, e.g. Core and Task:
Core: On-Demand for all of your critical and management apps e.g. monitoring. Task: Spot for your random, ephemeral jobs that aren't a big deal if they need to be re-run.
So for a Spark cluster for example you would pin your driver to the Core nodes and let the executors run on the Task nodes.
Combined with the fact that you can have an ASG with multiple instance types, this means that you can be reasonably certain you'll never run out of capacity unless AWS runs out of every single instance type you have requested, terminates your Spot instances, and you can't launch any more On-Demand ones.
(and even so, set a minimum percentage of On-Demand in AutoSpotting to ensure you maintain at least some capacity)
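For anyone who wants to see what "an ASG with multiple instance types" plus a floor of On-Demand capacity looks like, here is a rough sketch that only builds the policy dictionary. The key names follow my understanding of the `MixedInstancesPolicy` request shape for EC2 Auto Scaling, so check them against the current docs before relying on them. The helper name, the launch template name and the instance types are made up for illustration. Note this uses the native ASG On-Demand percentage, not AutoSpotting's own setting.

```python
def mixed_instances_policy(template_name, instance_types,
                           on_demand_base=1, on_demand_pct=20):
    """Build a MixedInstancesPolicy-style dict (hypothetical helper).

    on_demand_base: instances that are always On-Demand.
    on_demand_pct:  share of On-Demand above that base, 0-100.
    """
    if not instance_types:
        raise ValueError("need at least one instance type")
    if not 0 <= on_demand_pct <= 100:
        raise ValueError("on_demand_pct must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,  # assumed to exist already
                "Version": "$Latest",
            },
            # One override per instance type the group is allowed to launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Illustrative Task node group: mostly Spot, spread over several families.
task_policy = mixed_instances_policy(
    "task-nodes", ["r5.xlarge", "r5a.xlarge", "m5.2xlarge"])
```

The resulting dict is the sort of thing you would pass as the `MixedInstancesPolicy` argument when creating the group. The more instance families you list, the less likely a single capacity shortage takes out the whole Task group.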
This is more common than you think.
Internally cloud providers schedule instance types on real hardware, and running out of an instance type likely means they have run out of capacity, and only a tiny amount exists in fragmentation. To access that tiny remainder, they'll terminate spot instances and migrate live users (which they have to do very slowly) to make space for a few more of whichever instance types make most business sense (which varies depending on the mix of real hardware and existing instance types).
It takes someone like AWS a good few weeks, sometimes months, to provision new actual hardware.
It isn't uncommon for big users to be told they'll be given a service credit if they'll move away from a capacity constrained zone.
For business reasons they might decide not to do that though - your small instance might mean they have to say no to a big allocation later.
Instead they just delay your instance starting and hope other instances moving around opens up a more suitable location for it.
There's an entire paper on the topic: https://dl.acm.org/doi/10.1145/2797211
To set matters straight, AutoSpotting pre-dates the new AutoScaling mixed instance types functionality by a couple of years and it (intentionally) doesn't make use of it under the hood for reliability reasons related to failover to on-demand. To avoid any race conditions, AutoSpotting currently ignores any groups configured with mixed instances policy.
In the default configuration AutoSpotting implements a lazy/best-effort on-demand->spot replacement logic with built-in failover to on demand and to different spot instance types. To keep costs down, it is only triggered when failing to launch new spot instances (for whatever reason, including insufficient spot capacity).
What we do is iterate in increasing order of the spot price until successfully launching a compatible spot instance (roughly at least as large as the original from CPU/Memory/disk perspective but cheaper per hour). If all compatible spot instances fail to launch, the group keeps running the existing on-demand capacity. We retry this every few minutes until we eventually succeed.
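To make that loop concrete, here is a minimal Python sketch of the idea. This is not AutoSpotting's actual code (the project is written in Go). The `InstanceType` fields, `compatible`, `replace_with_spot` and the `try_launch` callback are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    disk_gib: float
    spot_price: float       # current hourly spot price
    on_demand_price: float  # hourly on-demand price


def compatible(candidate: InstanceType, original: InstanceType) -> bool:
    """At least as large as the original, but cheaper per hour."""
    return (candidate.vcpus >= original.vcpus
            and candidate.memory_gib >= original.memory_gib
            and candidate.disk_gib >= original.disk_gib
            and candidate.spot_price < original.on_demand_price)


def replace_with_spot(original: InstanceType,
                      catalog: Iterable[InstanceType],
                      try_launch: Callable[[InstanceType], bool],
                      ) -> Optional[InstanceType]:
    """Try compatible spot types from cheapest to most expensive.

    Returns the type that launched, or None if every attempt failed,
    in which case the on-demand instance simply keeps running and the
    caller retries a few minutes later.
    """
    candidates = sorted((c for c in catalog if compatible(c, original)),
                        key=lambda c: c.spot_price)
    for candidate in candidates:
        if try_launch(candidate):
            return candidate
    return None
```

In a real setup `try_launch` would wrap the actual launch request and return False on errors such as insufficient capacity. Keeping it as a callback makes the ordering and fallback logic easy to test without touching AWS.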
There's currently no failover to multiple on-demand instance types (this is a known limitation), but this could be implemented with reasonable effort.
We're also working on significantly improving the current replacement logic to address a bunch of edge cases with a significant architectural change (making use of instance launch events). I'm very excited about this improvement and looking forward to having this land, hopefully within a few weeks.
At the end of the day, unlike most tools in this space (including AWS offerings) AutoSpotting is an open source project, so if anyone is interested in helping implement any of these improvements (or maybe others), while at the same time getting experience with Go and using the AWS APIs, which are nowadays very valuable skills, you're more than welcome to join the fun.
If you don't mind I'd like to get some feedback/feature ideas from users like you.
Please get in touch with me on https://gitter.im/cristim
The mixed capacity ASGs currently run at decreased capacity when failing to launch spot instances. AutoSpotting will automatically failover to on-demand capacity when spot capacity is lost and back to spot once it can launch it again.
Another useful feature is that it most often requires no configuration of older on-demand ASGs, because it can just take them over and replace their nodes with compatible spot instances.
This makes it very popular for people who run legacy infrastructure that can't be tampered with for whatever reasons, as well as for large-scale rollouts on hundreds of accounts. Someone recently deployed it on infrastructure still running on EC2 Classic started in 2008 or so that wasn't touched for years.
Another large company deployed it with the default opt-in configuration against hundreds of AWS accounts owned by as many teams, many with legacy instances running for years. It would normally take them years to coordinate as a mass migration but it just took them a couple of months to migrate to spot. The teams could opt-in and try it out on their application or opt-out known sensitive workloads. A few weeks later they centrally switched the configuration to opt-out mode, converting most of their infrastructure to spot literally overnight and saving lots of money with very little configuration effort and very little disruption to the teams.
If you want to learn more about it have a look at our FAQ at https://autospotting.org/faq/index.html
It's also the most prominent open source tool in this space. Most competition consists of closed-source, commercial (and often quite expensive) tools so if you're currently having any issues or missing functionality, anyone skilled enough can submit a fix or improvement pull request.
If those don't answer your questions feel free to reach out to me and I'll do my best to explain further.
It sounds a bit hinky, but it tends to leave you with the number of instances you want running without having to determine what percentage of the ASG should be on demand or spot — especially with the possibility of not being able to start new spot instances if they’ve been terminated.
There are other strategies to avoid this as well:
Using multiple types of instances, of different sizes, in different availability zones, because price spikes are different for each combination.
Using bigger instances than usual, because bigger = more stable / less likely to get evicted.
And so even our instance fleets were failing to provision capacity.
Unless you paid for a reservation, you're SOL.
For the past few weeks I've been having this exact problem with GPU instances, where on-demand requests would kill my spotfleet instances in the middle of a job, I'd spin up again, get killed again a few minutes later.
The only way to fix it without rewriting a bunch of stuff was to blacklist that instance type from my spotfleet requests.
It definitely wasted a bunch of my time though, which seems to be increasingly common with newer AWS APIs.
The whole service becomes even more ridiculous if a company is already operating EC2 instances or even kubernetes clusters on EC2 spot instances (the topic of this discussion), because more often than not these workloads can be already covered by the cluster's unused capacity.
I somewhat understand why enterprises want to use it, but why are small startups using it so much and then complaining about the cost?
Nowadays when we have high speed internet, and a lot of things are containerized, it is so simple to change hosting partners. Just pick one that doesn't cost an arm and a leg and move to a different one if it didn't fit very well.
I have used linux containers for 10 years now and changed hosting a few times, each time reducing costs even more. Yes, it is a bit of manual labour, but if you have someone with sysadmin/devops skills, it is easily doable.
I agree with you, and that's why I try to get the point of view of those who actually decide to adopt AWS. They aren't crazy or stupid, and as AWS is the world's leading cloud provider then it's highly doubtful that the decision is irrational and motivated by ignorance.
So far, the main fallacy with regards to companies picking AWS is that cost is relevant. It isn't. AWS might overcharge a lot, but truth of the matter is that for any sizeable corporation it's irrelevant if they spend 200€ or 400€ on their cloud infrastructure. It's far less than a salary and it's even less than the bill for some office utilities. So once the infrastructure foot is in the door then why would management worry about cost? What they do care about is uptime and development speed, because that has direct impact on productivity, and thus value extracted from salaries. If a particular service provider enables you to show tangible results in no time at all (see spin up a database or message broker or workflow in less than nothing) then they don't mind paying a premium that matches, say, their air conditioning bill.
Then, at some point, that dynamic changes as you have a better understanding of your needs, the bills start to build up and the architecture is in less flux. You might also have a bigger team and can afford to start allocating more resources to operation. That is the point when it might make sense to migrate over to self-managed.
Then at the same time, you have the scalability, which might be more of a key point for even larger organizations.
I think building somewhat cloud-agnostic to ease friction of provider migration is good, regardless, but do so pragmatically and look at the APIs from a service perspective.
Kubernetes? All the bigger providers have alternatives and you can run your own. Fargate? You're going to have to do some rewrites. MemoryStore? Just swap in another Redis instance. BigTable? Highly GCP specific. etc.
Not saying there aren't a lot of companies who choose the wrong provider for the wrong reasons, but it can also be part of a conscious strategy. Also nobody got fired for IBM and so on.
I completely agree, and I had this discussion with my direct manager in the past. Yet, even if the potential savings are significant, managers might not be too keen on investing in switching your infrastructure. Running your own infrastructure is risky, and although top managers enjoy lower utility bills they don't enjoy the sight of having a greater risk of suffering downtime, especially if the downtime is self-inflicted and affects business-critical services.
So, if this transition doesn't go perfectly smoothly... The people signing for the migration to self-hosting services might be risking their whole career on a play that, at best, only brings some short-term operational cost savings. Does this justify a move whose best case scenario is equivalent to an AWS discount?
Frankly the whole discussion mostly highlights that these are things you need to think about upfront before you're fully committed.
Things started improving when we said ‘f-it we are not moving out of Postgres, let us at least use the best features of PG’
There is a similar problem when trying to use AWS with the constant thought about moving out of AWS at some point.
Have you ever done this? It's exorbitantly hard to migrate off of a cloud provider, and few ever do.
> I think building somewhat cloud-agnostic to ease friction of provider migration is good
Of course, that's not always an option if the system is already built, but it's definitely a good approach.
If you don’t have heavy AD and policies in place even better.
Part of the problem is that a huge proportion of the people I come across who chose AWS used this exact argument. Part of the problem with that argument is that none of the big guys are paying the list prices (unless they're not doing their jobs; I've seen the kind of discounts available once you get into even the few hundred k/month range and tell your account manager you're considering moving), and a lot of them also used the same line of thinking.
It pulls in a lot of people who pick AWS for all the wrong reasons.
> AWS might overcharge a lot, but truth of the matter is that for any sizeable corporation it's irrelevant if they spend 200€ or 400€ on their cloud infrastructure.
The ones I used to deal with used to be more like a 3x-10x cost difference on bills in the 10k-100k/month range. I agree with you that if the difference is ~200/month, then who cares. But a lot of much bigger companies burn money this way. Often because they started off with a 200/month difference, and then never made it a point to re-evaluate as their costs grew.
The difference isn't always that bad, but especially bandwidth hungry services are ridiculously expensive on AWS (to the point where if people really badly want to stay on AWS and spend a lot on bandwidth, a quick and dirty fix is to rent servers to use as caches in front of their AWS setup)
I'm not saying people shouldn't use AWS. But as you point out, the right use case for AWS is when you don't mind the cost, and pick it for convenience, and there's the warm fuzzy feeling of knowing you can hire people "off the street" who know how AWS works.
AWS is the luxury option. Sometimes you want the luxury option.
But it worries me how many startups build in ways that end up locking them into a provider that for some of them multiplies their per user cost by anything from 2x to 10x. When I evaluate startup pitches today, I often ask whether or not they have thought this through. It doesn't matter so much that they're on AWS - that might well be a fine choice. What matters is whether it was a conscious decision, and they've done at least a superficial attempt at modelling the costs both for AWS and some alternatives, rather than just picked it by default.
However, I'd argue if you pick the right tools from the start you can leverage AWS relatively inexpensively... But that's hard without enough cloud knowledge in the industry yet, and consultants are (generally) terrible at this.
The main advantage is you can pay that crazy $200/month for a scalable database without paying $5,000+/month (burdened) for the guy that can build it and maintain it for you. A developer can handle connecting and writing code for clusters more easily than they can learn how to build a scalable database -- and this is just an example, replace DBs with some other function you might hire a person or a team to do.
You'll have a hard time finding an AWS consultant that's specialized (or inclined) to help you set up your infrastructure so that you don't use or need to use AWS. Not only is there no need for that sort of service, it would actually kill the goose that lays their golden eggs.
Odds are that you could find consultants that are specialized in some other cloud service provider, and aren't experienced enough in AWS to be in a position to smoothly migrate services out of it.
Some setups we made cloud agnostic enough that when we finally got migrations approved we were able to do zero downtime migrations by splitting the setup between providers temporarily. That incidentally was the best way of getting people to migrate: You make the case for resilience and flexibility, then argue for a test run for a month or so, and then all it takes is for them to compare the bills.
I even made offers several times to migrate customers off AWS for a proportion of what I estimated they'd save over the next 3-6 months. None of them ever took me up on it when they realized just how much that'd add up to vs. the fixed time based offer I gave them, but it was a useful sales tool to demonstrate that I was willing to stand behind my estimates. One customer slashed their hosting bill 90% by getting off AWS (they were bandwidth heavy, and we cut their bandwidth costs by 98%; AWS outbound transfer is ridiculous).
[as I've said elsewhere, AWS has its uses, but keeping cost down is not one of them... Ironically one of the good uses for AWS is to keep the cost of a dedicated setup down: Being able to "spill" over onto AWS (or other cloud) instances in the case of unexpected events lets you operate far closer to the wire than you otherwise would dare on a dedicated environment, even if you rarely use the capability; doing so also allows for more easily spinning up additional test/dev environments etc]
The biggest reason I can see why you don't find more consultants offering those services, is that a surprising proportion of people hire consultants to give them backing to do what they already want done ("see, the AWS consultant says I was right to want to use this AWS service") rather than to genuinely give independent advice. If you're not comfortable being repeatedly told "yes, but here's why we'll be ignoring the professional advice we ostensibly hired you for" and being very careful about forcefully presenting opinions backed by evidence that don't match the hiring manager's preconceptions, the pool of contracts rapidly shrinks.
In interesting cases, cost is _very_ significant. And it's not a 2x factor, IIRC it can easily be 10x.
I am assuming nothing. I personally had this very same debate with a program manager of a company whose bread and butter was doing a lot of computation. In the eyes of upper level management, arguing about spending 20€ or 100€ on a cloud service provider is akin to debating which brand of detergent the company should buy. It's irrelevant.
> But in that case, a company could just rent a server at some ISP and be done with it.
That's where you get it all wrong. Hiring bare metal services does nothing to your ability to scale, either to meet demand or to develop/test/try out new services, nor does it help you use higher-level managed services. Everything you have to do or manage by yourself is a productivity hit, and that productivity hit is measured as a percentage of your entire payroll, which eclipses how much you pay to your cloud service provider.
... although if it's outbound bandwidth - if you can take care of your compute, maybe it's possible to purchase outgoing bandwidth at an hourly resolution, rather than "whatever your wires can send us", and have your flexibility.
I was doing some testing of Serverless (the framework) for a personal project. I wanted to do it in Google functions + Database, but even for the most basic of the examples GCS wouldn't work; I spent 3 hours fiddling around. I moved on to AWS lambda + Dynamo and was done in 1 hour.
Also AWS support is simply amazing. Considering Google's history of bad support, I wouldn't consider it for anything serious.
A few weeks ago, I had a two-node Elasticsearch cluster (evidently my mistake...though IDK why AWS can operate high availability two-node transaction RDS clusters no problem but not ES).
One node went down, only manual intervention by AWS support could fix it, automatic backups were completely inaccessible (since they rely on the cluster being up? ... WTF), and it took many hours for support to reset the cluster.
I eventually just said "screw it," and spent a couple rebuilding the cluster from scratch.
GitHub is amazing for developer happiness, but CodeCommit is secure and integrates seamlessly with so many AWS services that I can live without all the bells and whistles of GitHub.
You can get this Business Support free for a year if you join StartupSchool and get 3K USD credits with the business support.
If you can't afford it, maybe you are just hobbying around, but AWS offers lots of ways to support you until you have sustainable revenue.
I always hear people on the internet talking about AWS being crazy expensive, but from SFBA it looks really damn cheap. Would I rather give $thousands to AWS or $hundredthousands to an internal specialist who’s likely gonna say my company is too small and boring to keep them interested anyway?
AWS wins that math every day. And that’s the market they target. Why wouldn't they?
I LIKE AWS. I think it can be a great choice for many companies and use-cases. But the idea that AWS is for everyone, that it's less expensive for everyone when you factor in the TCO, simply isn't correct.
Correct. I specifically said it’s perfect if your labor cost is higher than your AWS cost
Another good case is indie hackers whose project likely won’t ever make it out of the free tier
For $1500/mo you can get a half rack (5Kv) and 50 Mbps internet. A couple of $5k switches and 6-8 $10k servers and you're well under $125k, plus you probably don't need $10k servers or $5k switches and can find cheaper hosting. I realize you said a full rack, but that's probably overkill and could be done for $350k or less. Once you have the servers/switches your fixed costs are relatively low.
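A quick sanity check of the half-rack numbers, as a small Python sketch. The 12-month horizon is my own assumption (the comment doesn't say over what period the $125k is counted), and the function name is made up.

```python
def first_year_cost(monthly_hosting, switches, switch_price,
                    servers, server_price, months=12):
    """One-off hardware spend plus `months` of colocation fees."""
    hardware = switches * switch_price + servers * server_price
    return hardware + monthly_hosting * months


# Figures from the comment: $1500/mo half rack, two $5k switches,
# and 6 to 8 servers at $10k each.
low = first_year_cost(1500, 2, 5_000, 6, 10_000)
high = first_year_cost(1500, 2, 5_000, 8, 10_000)
print(f"first year: ${low:,} to ${high:,}")  # first year: $88,000 to $108,000
```

So even at the upper end, hardware plus a full year of hosting stays below the $125k mentioned, under those assumptions.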
Still, those $100 monthly we paid for aws support were worth their weight in gold.
But for a business that requires cloud hosting with support? There aren't many places on the planet where $100 is a prohibitive cost for a business that's willing to spend at minimum that much for hosting.
In short, it's the cost of doing business.
There's a tier of services at AWS:
s3/ec2/ecs/rds/elasticache - Flagships, will almost always work except in weird edge cases. Everyone uses these.
Niche Stuff - If you need it you'll know. It'll generally at least be an 80% tool (think athena/firehose/aws waf/etc).
Stuff with shitty pricing - Stuff they obviously only built for feature parity with GCP/Microsoft (think EKS/Managed SFTP/Cloud Active Directory)
Broken shit you use once, it screws you, and you never use it again - Usually it's either because the underlying open source tool is built in a fashion that isn't appropriate for a shared services environment or it's because someone at AWS has 'opinions' (think Elasticsearch/Cloudformation)
For example, I wanted to replicate GCS to another hosted block store so that we could have a backup of our systems outside of our GCP account (they have been known to lock accounts on small businesses and not be very helpful in fixing it, GCS itself has been as stable as S3 in my experience).
Anyhow, I really wanted to use Backblaze B2 service for this purpose. Unfortunately, they don’t have the kinds of security controls or third party audits our industry requires, and their sales team indicated it wasn’t on the roadmap. I appreciate that honesty, but it’s one more reason the major players have a leg up.
They amortize the cost of compliance to the point where you don’t see it. For a long time AWS charged a lot extra for it, but GCP did not and was cheaper anyway. Now AWS gives compliance away for “free” as well.
It often leaves me wondering about other startups... how secure are they, really? I know my industry is onerous, but a lot of it is just “common sense” security-wise. Why should my browsing history or e-commerce purchases be any different from my medical records, when there are ways to use the former to reverse engineer the latter?
I might have missed it in the search, but seems Hetzner is probably still working on it. Nobody here likes to deploy k8s although the tools to do that are super sexy these days.
But certainly for SEA they don't make sense, unless you e.g. have any subsets of functionality that are bandwidth intensive but not latency sensitive (e.g. I used to manage a network that was split between UK, Germany and New Zealand, and we used Hetzner for the German footprint and put anything where latency didn't matter, like bulk e-mailing, there, while customer facing stuff was all in their respective country). For that to be worthwhile you need a quite significant volume though.
Of course, the flexibility of k8s does come with complexity, so YMMV.
Our services are highly elastic, and can vary from processing 8 million events/day to 600 million events/day. Same as with our users, which are mainly active during work hours, with some running night shifts, and then fewer running weekends.
We are probably the prime-case for cloud, since elasticity is where you save cost by going away from dedicated hosting.
As for availability; our customers are highly dependent on us processing their data live, and them being able to monitor, get alarms and react on their data. They rely on us to notify whichever technician needs to fix their production line when it's stopped, since they are losing money for every minute the line is not running (true for almost any manufacturing company).
This means that we need a lot of redundancy, and these things are built-in to almost all AWS offerings.
There's a case for dedicated servers still, I'll agree on that, but we are definitely benefiting from the cloud.
Let's say that, for a unit of computing work, the cloud price is N times that of non-cloud. I'm being a bit vague here for generality; plus, when not on a cloud, the cost structure is different, but bear with me.
Ok. Now, your load varies by a factor of up to 133x. But unless your peak load is over N times your average, it is still cheaper for you to keep machines which support the peak load and have a bunch of idle time.
* Can do other computational work (e.g. experiments) during off hours without impacting system responsiveness.
* Can perhaps put some machines to sleep, or other power-saving measures, during off hours.
* You have to take care of more system and cluster administration work than on the cloud.
Above N, the cloud makes sense. Close to N - not sure. Well below N - doesn't seem to make sense.
You'll need to tell us that N is low enough.
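The break-even argument can be written down in a few lines of Python. This is a toy model under my own simplifying assumptions: dedicated hardware is provisioned for the peak and paid for every hour, cloud capacity is paid only for what is used at N times the unit price, and admin overhead is ignored. The function names and the load profile are made up.

```python
def dedicated_cost(hourly_load, unit_price=1.0):
    """Machines sized for the peak, paid for around the clock."""
    return max(hourly_load) * len(hourly_load) * unit_price


def cloud_cost(hourly_load, n, unit_price=1.0):
    """Pay only for what is used, at N times the dedicated unit price."""
    return sum(hourly_load) * n * unit_price


def peak_to_average(hourly_load):
    return max(hourly_load) / (sum(hourly_load) / len(hourly_load))


# Toy day: 20 quiet hours at 1 unit of load, 4 busy hours at 10 units.
load = [1] * 20 + [10] * 4
print(peak_to_average(load))                       # 4.0
print(cloud_cost(load, 3) < dedicated_cost(load))  # True: N=3 is below 4
print(cloud_cost(load, 5) < dedicated_cost(load))  # False: N=5 is above 4
```

In this model the cloud only wins when the peak-to-average ratio exceeds N, which is exactly the comparison being made above.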
This is exacerbated because the alternative is not either-or - the most cost effective system is often dedicated hosting + the ability to spin up cloud instances to take the peaks.
Doing so lets you provision your dedicated servers to run at much higher utilization most of the time, and makes pure cloud setups look far more expensive, and most people with setups like that end up needing the cloud instances very rarely.
That's not to say there aren't exceptions with genuinely massive peaks, but even then it doesn't take a huge base load before a hybrid system starts to have the potential to bring substantial savings.
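A rough way to see the hybrid effect is a toy cost model in Python. The assumptions are mine: a fixed dedicated base paid for every hour, anything above it served by cloud instances at N times the unit price, and no operational overhead. The function name and load profile are hypothetical.

```python
def hybrid_cost(hourly_load, base_capacity, n, unit_price=1.0):
    """Dedicated base paid every hour, overflow bought from the cloud."""
    dedicated = base_capacity * len(hourly_load) * unit_price
    overflow = sum(max(0, load - base_capacity) for load in hourly_load)
    return dedicated + overflow * n * unit_price


# Toy day: 20 quiet hours at 1 unit of load, 4 busy hours at 10 units,
# with cloud capacity costing 3x the dedicated price per unit.
load = [1] * 20 + [10] * 4
n = 3

pure_dedicated = hybrid_cost(load, base_capacity=max(load), n=n)
pure_cloud = hybrid_cost(load, base_capacity=0, n=n)
hybrid = hybrid_cost(load, base_capacity=1, n=n)
print(pure_dedicated, pure_cloud, hybrid)  # 240.0 180.0 132.0
```

With a base of zero the model degenerates to pure cloud, and with a base equal to the peak it is pure dedicated, so the same function covers all three setups. Even in this toy case with a 10x peak, the hybrid comes out cheapest.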
It translates to about 7k rps averaged over a full day. Assuming it's all in one region during the daytime, that's 14k rps over 12 hours, or 42k rps over 4 hours. It's well within a couple of high-powered servers even in the worst case.
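The arithmetic, spelled out in Python. I'm assuming the starting point is the 600 million events/day peak figure mentioned earlier in the thread, since the comment doesn't restate it.

```python
EVENTS_PER_DAY = 600_000_000  # assumed: the peak daily volume quoted upthread


def average_rate(events, window_hours):
    """Average events per second if all events land within the window."""
    return events / (window_hours * 3600)


for hours in (24, 12, 4):
    rate = average_rate(EVENTS_PER_DAY, hours)
    print(f"{hours:>2} h window: {rate:,.0f} events/s")
# 24 h window: 6,944 events/s
# 12 h window: 13,889 events/s
#  4 h window: 41,667 events/s
```

These are averages within each window, so real bursts inside the window could still be higher.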
This combined with how unsexy most of the VPS or dedicated hosting providers look makes the cloud providers a seemingly good choice for startups.
And people here working in startup environments are not aware how much hardware from internal IT departments costs. Internal IT departments are generally way less efficient than the cutthroat cloud providers. AWS/Azure/etc. hosting costs are peanuts compared to what most banks or other big enterprises pay for their internal hosting.
On top of that, internal IT response times are frequently horrible. Any developer worth his pay generally loves cloud providers and abhors IT departments. They're slow to deploy hardware or VMs, update firewall rules, install software, etc.
The biggest issue is not that people want to avoid internal IT and/or want elasticity, but the number of people who never even do a proper cost comparison beyond the top tier most expensive providers.
Most places outside tech hubs put massive emphasis upon being able to swap labor and are technology laggards because their switching costs are so high due to lack of velocity in their work in technology in the first place (and also why waterfall or spiral makes more sense for their efforts rather than Agile probably even today). The irony here is that strong AWS professionals are not cheap in any way, but neither is being dead in the water because your 60+ year old guru graybeard IT guy that managed the racks faithfully for 12+ years retires suddenly.
And yet they're handing out $25k-$100k in credits like candy to startups to prevent them from going to Google cloud or Azure.
Another tell is that the ways to "prove" start-up status under most of the big providers' terms for cloud credit align more around "did you get VC funding?" than "are you small?" due to the longer lead time nature of bootstrapped start-ups. Most don't give you credits if you've been around 2+ years and soured on another provider. Maybe I just sucked at it but I was bootstrapping before and had trouble getting decent credit then.
> they paid top $ for IT admins, which by the way were very capable.
Most companies don't pay top dollar, so they have to scrape by with the leftovers. As a result Amazon for sure beats them in sysadmining efficiency.
That's not true at all. Cloud services are very attractive to good old fashioned companies, because they express the cost of using computational resources as a simple monthly bill, just like any other utility. Your manager goes through the monthly bills and he sees, say, electricity, water, cloud infrastructure, cleaning, office supplies. That's it. No need to know what a server is or what a spot instance is. If the costs fluctuate smoothly and within boundaries then you simply don't worry about them.
I have lost count of the number of startups I've evaluated that have gotten $25k-$100k in credits from AWS or Google Cloud, or both. When I was doing consulting I once got paid to first set up an environment on AWS despite them knowing it was too expensive for them, then to migrate from AWS to Google Cloud, then to migrate from Google Cloud to Hetzner - all because my fees for doing the setups and migrations were far lower than the free credits heaped on them by AWS and Google.
There are certainly reasons why people in bigger companies like them as well, though from what I see most of the time it is because they can slip the cost under the nose of the manager one level up in a way that is harder to do with more clearly quantified contracts. It's boiling frogs - as long as the cost just slowly creeps up, it doesn't get queried.
Hetzner is indeed excellent if you are a cost-conscious European client that is willing to absorb the cost of self-managed bare-metal or quasi-bare metal servers and cares about saving on bandwidth costs. However, if your goal is to provide a world-wide service then you are compelled to look elsewhere. Even competing European service providers such as OVH or Scaleway fare better than Hetzner in this domain.
And Hetzner works just fine for most small startups even if/when they have a US audience - the latency is perfectly manageable, though in those kinds of situations you will certainly want to expand into other regions later.
The main point, though, is that even when you scale, nothing stops you from using Hetzner for Europe, and e.g. OVH for US, and indeed filling in with things like AWS when there are needs you can't otherwise meet.
In reality very few startups get to a scale where this starts to matter, and overoptimizing for it at the cost of infra spend at the start is a great way of running out of money - seen that happen way too many times.
The reason in both cases is MSFT/AWS gave them massive up front "free" credits, which meant there was no incentive to impose internal controls on usage or put in place a culture of conservation. AWS doesn't think twice before dropping $20k on cloud credits to anyone who wrote a mobile app.
At the company with the unreasonable bill, it's not even a SaaS provider. It literally just runs a few very low traffic websites for user manuals, some CI boxes and some marketing sites. The entire thing could be run off three or four dedicated boxes 95% of which would be CI, I'm sure of it. Yet this firm spends several tech salaries worth of money on Azure every quarter. The bill is eyebleeding.
The problem is everyone who wanted one got a VM. You got VMs being used for nothing except running a single test program that the owner then forgot about. People used expensive hosted cloud services instead of installing Postgres because "their time is more valuable" etc. Free credits created a culture in which the company just institutionally forgot that servers cost money. When the free credits ran out it was invisible to the engineers using the services and simply became another (opaque) line item for accountants to worry about the burn rate. In the rare cases engineers decided to "optimise" they did it by spending lots of time and effort on using Kubernetes so stuff would scale down at night.
There was an abortive attempt to move off the cloud. This was unfortunately stymied by incompetence by both the firm and the chosen hosting provider. It got some boxes from OVH and then didn't pay the bills for them, so the boxes were simply yanked and deleted. All the setup time was lost. In another case OVH allocated machines in datacenters that were far apart but they needed to be close together. In another case the machines were delivered but the hardware on one was faulty. Of course this stuff is a one-time hit and avoidable with non-broken processes, but it empowers those who just want to keep learning Azure/Kubernetes/Docker so they can put it on their CV.
The other firm is much smaller and has much more reasonable costs, they also get more for it (e.g. use Heroku which automates a lot of stuff).
Having observed these two different companies, I resolved up front that if/when I go back to running my own business I will not use cloud services, even if they offer free credits. I'll be taking pride in finding ways to drive server costs down as far as possible. Perhaps even hosting non-HA-required servers at my home given we have duplex gigabit fiber now. Yes, my time is valuable and all that, but establishing a culture up front in which people use the resources they need and not more is even more valuable.
So in our case, credits got us into AWS, but instead of treating it like free money, we spent a lot of time improving efficiency and reducing our bill so when the credits ran out we could run our system for an affordable rate.
I take pride in reducing costs in the cloud. I don't argue cloud is the only way, but there's a lot of use cases where it just makes sense -- CloudFront+S3 for static websites, for example.
 until the bill shows up ;-)
Actually -- to be honest, we're in the golden age of infrastructure. I love building on cloud. I don't have to worry about the fiddly bits as much and when I do things are often documented either in official docs or in blog posts and things.
It's sad to say but people are not rewarded by going with a perceived riskier option in an effort to save a couple 100k.
You go with what works and is proven, and with what will make your life the easiest. If AWS goes down, then it's much easier to explain why. If "Big Bill's Lowcost Cloud Solutions" provider goes down, then your entire judgement will be questioned. Even if Big Bill's Cloud solution is equally or more capable and only half the cost.
AWS failed? Great, you have half of the internet down as well to cover your ass. No one gets the blame and everyone continues with their work.
Of course this depends on culture and regional reputation. For example you would have no problem if your Startup in France is using OVH.
Considering that software is known to be high margin, there's lots and lots of companies for which hosting can be painful, but not threatening.
And if you don't, AWS can be pretty cost-competitive because your developers can handle a lot of the "infrastructure" that you used to need a dedicated sysadmin (or a team of) to handle.
Don't get me wrong, I wouldn't likely spend my own money on them, but it does beat having to try to find an on-call infrastructure person and a reliable data centre with proper routes, and then developers/devops people who can scale, monitor, manage, and maintain queueing services, message gateways, object storage, databases, firewalls, load balancers, containers, VMs, network attached storage, etc.
Also, don't underestimate the value of things like CloudFormation. Being able to make an API call and have an entire cluster configured, with load balancing, backups, multi-AZ redundancy, CDNs, etc., is pretty potent.
AWS might be expensive, but it gives you access to a lot of things that you might not otherwise have, even if you do have a sysadmin/devops guy.
There's a critical core infrastructure - I would argue that it is:
* some sort of centralized management platform, even if the platform is just a set of deploy scripts ( 1 instance ),
* Origin DNS servers ( 3 instances - one per AZ ),
* HTTP/HTTPS entry points ( 3 instances - one per AZ ),
* A couple of database servers ( say a primary and a backup - 2 instances ),
* I would add a job server, though it probably could be collapsed into the management platform.
You need to have fixed IP addresses for this ( you really do not want to deal with service discovery at this level ) and you want it to be at a provider that won't ever make you renumber, or preferably a provider that does not ever break fundamental things like IP address assignment or run out of resources.
One's "internet presence" disappears when this dies. Running all of this as a core workload on AWS/GCP/Azure is a no-brainer - it will cost about $100/mo, be nearly insta-rebuildable and re-deployable, and a couple of configuration files in git would take care of bringing the beachhead up.
At this point your core is up and everything else becomes service specific. This is the point where costs become a consideration but drive to low cost should not come at the expense of tooling. If by embracing ephemeral resources and existing tooling one can cut the base AWS/GCP/Azure price down by 80% most people would think it is a win over having to invest into building tooling to make containers stable ( as in always behave in a predictable manner ) on a provider that can shave off 95% of the costs.
The biggest issue with the cloud providers is the cost of IO and the cost of bandwidth, which scale linearly with the workload, but that's an issue for a very specific subset of customers that should be hiring real ops people.
Because so many others use it, if you care about latency to the counter-party server it's often a nice way to ensure low latency (by using the same availability zone).
What cloud provider has a similar feature set?
Sure, if you want to meddle around with VMs and containers, you can use pretty much any provider, but if you want to go a step further there isn't much left
- our clients (airlines) would very much prefer we use AWS over the smaller, lesser known offering.
- AWS offer free promotional credits for startups.
- AWS when utilized right (requires work) is not much more expensive than traditional hosting.
Internal IT departments or hosting providers meant for highly regulated environments can charge 5-10x that for a VM of a similar size :-)
And what about scaling?
That's precisely the problem. Bare metal hosts are unbeatable in cost, but fixed costs render them too expensive for a startup. Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.
Then, when both of those costs become irrelevant, you already have the entire team trained and experienced in using a cloud service provider.
I know that first hand for having done both, and seeing how the AWS systems consistently earned me more money, and how rarely I had to deal with the bare metal (I've done anything from actual own hardware in colos to renting servers from places like Hetzner; when I was handling servers in two separate colos I spent on average a day a year in each of them, the rest was done by "remote hands" at the data centre)
> That's precisely the problem. Bare metal hosts are unbeatable in cost, but fixed costs render them too expensive for a startup. Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.
Actually buying hardware is too expensive. But coloing leased hardware or renting on a month by month basis from a dedicated hosting provider costs about the same when you amortize over a three year period unless you're physically located somewhere with cheap land. E.g. I work out of London, and when I was doing this we eventually deprecated our own hardware in favour of renting from Hetzner because colo space in London was so much more expensive that the savings on the actual hardware couldn't make up for it.
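To make the amortization argument concrete, here is a rough sketch of the comparison. All the prices are made-up placeholders, not real quotes from any provider; the only thing taken from the comment above is the three year amortization window.

```python
def owned_monthly(hardware_price: float, colo_per_month: float, months: int = 36) -> float:
    """Monthly cost of owned/leased hardware in a colo, amortized over `months`."""
    return hardware_price / months + colo_per_month


# Hypothetical figures, for illustration only.
HARDWARE_PRICE = 3600.0   # one server, paid up front
RENT_PER_MONTH = 160.0    # comparable rented dedicated server

cheap_colo = owned_monthly(HARDWARE_PRICE, colo_per_month=50.0)
pricey_colo = owned_monthly(HARDWARE_PRICE, colo_per_month=200.0)

for label, cost in [("owned, cheap colo", cheap_colo),
                    ("owned, pricey colo", pricey_colo),
                    ("rented", RENT_PER_MONTH)]:
    print(f"{label:>20}: {cost:.2f} per month")
```

With numbers like these, the cheap-colo case lands close to the rental price, while the expensive-colo case is clearly worse, which is the London situation described above. Plug in your own quotes before drawing any conclusion.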
> Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.
Or you architect it properly from the start. I've done zero downtime migrations between AWS, GCE and Hetzner. I've had systems that tied in cloud instances, VMs running on our own hardware, containers running on dedicated instances, and VMs on rented hardware, all tied into a single system. If you run everything in containers anyway, all you need to make that happen is a simple orchestration system and a reliable network overlay, and an architecture that ensures reliable replication of your data.
Once you've done that, you're free to pick and choose and migrate services as you please depending on cost and need, and it really is not that hard to get working - you already do most of the necessary planning if you are setting up a reproducible cloud setup anyway.
The reward being that once you're in, they are too big and slow to ever move away from your product.
And since the OP was talking about Kubernetes, and you're talking about "lock-in", another reason I love AWS is it is not a lock-in device. I can use any service of AWS's by itself, without being forced to use anything else or do something in a proprietary way. All the interfaces are a command-line or REST API with JSON data formats. They are all designed to operate by themselves, so you can provide replacement components at any time, hosted anywhere.
On the other hand, you can't just use one part of K8s, because you have to at least set up and manage an entire cluster first. And there's dozens of services K8s simply does not have, and other hosting providers don't have.
Furthermore, just because someone has a managed k8s doesn't make it less lock-in or less work. With AWS you don't need to use a cluster of anything. With k8s you are signing yourself up to tons of complex services and specific design and operation paradigms. With AWS you have no such inherent restrictions.
K8s is inherently more complex and difficult to use than AWS services, which aren't even a good comparison because they are so simple by comparison.
This is, in my experience, quite a false statement when talking about AWS. You are either a (team of) 10x engineer(s) capable of anything or as you grow in people and services you will need to use a good chunk of your developer time managing AWS, or hire a dedicated person (sysadmin).
Just because businesses hire dedicated backend people doesn't mean they actually need to, they just feel better about it when they do, because that's how it was always done before.
AWS is established and relatively straightforward. I found the user experience to be pretty seamless. Maybe there are better alternatives but I've found Google Cloud Platform and IBM Cloud to be absolutely miserable and confusing (the latter more so). WRT Google Cloud Platform I think the main sin is the UI/UX, and if I had to foot the bill I would give them another consideration. I had very limited (but good) experience with DigitalOcean and Heroku, but those just don't have the infrastructure at scale to compete with AWS in my opinion.
What do you think is the best alternative to AWS?
Examples: Twitter became far more successful than their architecture and infrastructure could handle back then.
Pokemon Go became an insanely huge success far exceeding their worst case scenario. There's a postmortem/whitepaper on Google Cloud Engine about that IIRC.
That being said, it also makes sense to apply some basic optimizations to reduce costs by a significant margin, as long as the work involved in optimizing pays off. A few days to cut costs by 50% makes sense, a few months to save a few dollars doesn't.
For people who know very well? I think it’s not unlike interpersonal dynamics. Codependency looks very different from outside versus inside. It’s difficult to tell why your friend keeps investing in this bozo, coming up with ever more elaborate ways to manage them and their moods.
And over lunch a mutual friend will discuss how they might be better off with someone else, and once in a great while someone will ask if maybe So-and-so might not be better off alone.
AWS is a dishonest partner. They have decided that you not knowing how much this is going to cost you is a good thing, and give no evidence that they are willing to change. This is who they are. Do the good things about them make up for that kind of bad thing? I don’t think so. I think you should look for another partner. Maybe just something light, not a serious relationship. Or maybe try some alone time and see what that’s like.
If you need 50 engineers to be able to manage 200ish servers, well, uh, you just don't have a ton of options.
Obviously not everyone falls into that second category but plenty do.
Who needs 50 engineers for 200 servers? My IT team manages 750 servers in 5 co-locations + 200 end users + DBA responsibilities with 5. If we grow above 1000 we might need to hire a 6th.
I have seen this as the outcome of a bubble which keeps repeating that "having to maintain is bad", "servers are bad" or "this is old".
The bill of the cloud vendor is already approved and the 3rd party apps don't need to have a separate approval process.
1. Nobody gets fired for picking AWS.
2. Plentiful amount of talent that understands AWS and AWS patterns.
I'd argue it's most expensive for small (10-50 person... ish) businesses -- too small to hire a LOT of overhead staff, but not small enough that you're still in POC/MVP stage.
A lot of ideas can be tested in AWS essentially free. However, if you pick the wrong tools or don't anticipate future bills, you can end up with additional costs and need to re-engineer slightly.
That said, it works best with scalability concerned workloads. If you only need 10,000 machines for a few hours/days (research workloads), or your load is much heavier just 9-5pm, or you need vast constant scalability.
My org has 19,500 VMs in AWS right now -- they are student VMs running a mix of OS's. While many of them are offline for days or weeks at a time, we can easily start 5,000 of them in a minute or two if needed (though we usually don't see more than 400/minute). Sure, we can run these in a large virtualization system of our own, but we're only 3 developers, a single system admin, and one cloud focused person (though we do cross train a lot)... So AWS allows us to scale quickly without as much concern about the underlying architecture, purchasing equipment, or dealing with hardware failures or warranty/replacement claims. We can spend our 9-5 building, improving, and optimizing the system (including reducing costs). With this setup, nearly anyone in our team has the capability to research a new component if we really want to add it (like websocket support) and play around (in a sandbox), without getting others involved until it is ready for a POC demo... Then we can have a new feature in production in days/weeks despite needing new equipment.
All that said, I'm not a cloud-is-the-only-way evangelist... If you don't need any of that, you can fairly easily make a cloud agnostic system with containers... Using ECS for many typical use cases is rather portable to other container systems. If you're making larger lambda packages that can handle many paths (i.e. ELB/API Gateway Lambda proxy), they can move with less work than 100 purpose built functions.
I'd argue most people unhappy with AWS have leadership with goals or ideas sold to them that are silly (lift and shift)... Or staff that are not trained on the cloud provider so they don't know the right considerations for building in the provider they pick. I'd wager many developers don't even know how to use DynamoDB properly for the first year or two they use it. It's hard to say it's AWS's fault if you have a large bill because your developers 'SELECT * FROM users' every time you look up a single user... and AWS scales to support the inefficient scan (how does a DB provider know if your application logic only uses one value?).
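A toy illustration of the scan-versus-lookup point, for anyone who hasn't been bitten by it yet. This is not the DynamoDB API; `ToyTable` is a hypothetical in-memory stand-in that just counts how many items each access pattern touches. Real billing is in read capacity units rather than item counts, but the shape of the problem is the same: a scan reads everything, even if you only keep one row.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory key-value table that meters items read."""
    items: dict = field(default_factory=dict)
    items_read: int = 0

    def put(self, key, value):
        self.items[key] = value

    def get(self, key):
        # Key lookup: touches exactly one item.
        self.items_read += 1
        return self.items.get(key)

    def scan(self, predicate):
        # Full scan: every item is read (and metered), even if only one matches.
        matches = []
        for key, value in self.items.items():
            self.items_read += 1
            if predicate(key, value):
                matches.append(value)
        return matches


table = ToyTable()
for i in range(10_000):
    table.put(f"user-{i}", {"id": i, "name": f"name-{i}"})

table.items_read = 0
by_scan = table.scan(lambda k, v: v["id"] == 42)
scan_cost = table.items_read

table.items_read = 0
by_key = table.get("user-42")
get_cost = table.items_read

print(f"scan read {scan_cost} items, key lookup read {get_cost}")
```

Both calls return the same user; the only difference is how much of the table was read to get there, and that difference grows with the table.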
Anyway, I hope this gives some insight on why orgs use AWS.
On the other hand, you can purchase reserved instances and get ~60% cost savings with zero engineering work. It's worth thinking long and hard about whether the cost of engineering time is worth that next 20%.
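A quick back-of-the-envelope sketch of that trade-off. The 60% figure and the extra 20% come from the comment above; the monthly bills and the engineering cost are invented for illustration, and the function name is just something I made up.

```python
def months_to_break_even(monthly_bill: float,
                         engineering_cost: float,
                         reserved_saving: float = 0.60,
                         tuned_saving: float = 0.80) -> float:
    """Months until the extra saving beyond reserved instances repays the work."""
    extra_per_month = monthly_bill * (tuned_saving - reserved_saving)
    if extra_per_month <= 0:
        raise ValueError("tuned_saving must exceed reserved_saving")
    return engineering_cost / extra_per_month


# Hypothetical: the same 30k of engineering effort at two very different bill sizes.
small_shop = months_to_break_even(monthly_bill=2_000, engineering_cost=30_000)
big_shop = months_to_break_even(monthly_bill=100_000, engineering_cost=30_000)

print(f"small shop breaks even after {small_shop:.1f} months")
print(f"big shop breaks even after {big_shop:.1f} months")
```

The same effort that pays for itself in weeks on a large bill can take years on a small one, and that is before counting the ongoing operational cost of the more complicated setup.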
There's also a lot of useful ground in between "critical state, must never be lost (like a database)" and "can handle being terminated with 2 minutes notice". A service that can be re-created if necessary but takes 10 minutes to start up is really scary if run on spot instances, but can still be pretty useful.
The problem with spot autoscaling is that if everyone does it, it stops working. Everyone in us-east-1 gets most of their traffic from 9am EST to 5pm EST, because everyone hosts close to their users, their users are human, and humans are diurnal. Most "batch" workloads that people have also follow the same cycle; they're working on something in the office and want results now, not tomorrow morning. So they run their batch jobs during the day. If you can figure out how to get your traffic spikes from 9pm until 5am, spot instances are going to be great -- millions of CPUs are sitting idle waiting for your novel workload. But if your customers are working 9-5 jobs like you, and you care about latency enough to host close to them... you're competing for instances with every other computer user in the region.
If you're transferring between EC2 instances in peered VPCs meaning you can still use private IPs, it's more like $20/TB.
I personally think it will breed huge organizational problems if things like CI are slow. "I'll get a cup of coffee while this runs" and then you come back and forget what you were going to release. Soon it becomes "let's get another change into this build before we release" and then it's "well, it's been six months since we've released anything, what do we do." You have to start fast and stay fast if you want to keep developers productive. So saving a couple bucks on computers that are half a world away can end up being a huge expense if you're not careful.
As other comments mention, you also have to be careful about transfer costs. In the CI case, getting your source code into the CI server is cheap, but getting the containers out is going to cost you, especially if you don't make an effort to optimize them. For batch data processing jobs, the same applies; getting the result out is cheap, but getting the data in is going to be a lot of transfer. (If you were using Small Data, you could just run the job on your laptop, after all.)
The speed of computers half a world away is not great either. I remember updating some Samsung drivers once, which were served out of a Korean AWS region instead of CloudFront... and the downloads were glacially slow. Their website is the same way. I couldn't believe how a multinational corporation could push bits at me so slowly. When you're reading their documentation all day, or tweaking drivers, you notice it, and you start to think "next time I'm going to buy Intel". (Compare Samsung's SSD website with McMaster-Carr's website. What site do you hope to interact with again in the future?)
Anyway, you get a bill for compute resources, and you don't get a bill for unhappy employees context-switching all day, so I see why people want to craft clever schemes to save pennies on their compute costs. But be careful. Not every cost is charged directly to your credit card.
Unless you have a legacy app you know must be around for 3 years and there is zero effort planned to refactor it.
e.g. if the infra cost goes down by 10,000 USD per month, they may say it's not worth it because they are paying more than that for 1 developer in the US per month.
With spot instances, you don’t need to do this planning, assuming you can fall back to reserved quickly enough.
We did mix reserved instances and spot instances for our production workload. The worst case scenario will be reserved + on-demand.
We have done the same - our bills went down, but not by as much as 80%; I think closer to 50%. But it took a fair bit of developer time, and we now have a lot of Kubernetes related problems to deal with. I guess those will smooth out over time, but I don't think anyone ever factors in this stuff when they claim great savings. Developer time ain't cheap.
On a plus note, running multiple small boxes via Kubernetes does give you a more high availability system. If one instance goes down, there will still be another one available, so it's not all negative.
A great in-between is to simply have a backup server ready to go in a few minutes' time. Super simple compared to an orchestrated container system.
Of course, client projects that spec out a certain number of 9's must be done just so, but can also be billed accordingly.
It does seem to be an example of a "standard" architecture done well. Our application has a tiny fraction of the traffic and it struggles with some things.
Here's a series of blog posts with a lot more detail by one of their devs: https://nickcraver.com/blog/archive/
They're very optimized and can serve all of their traffic on a single webserver, redis instance and SQL server.
The nice side effect is they did migrate to .NET Core
And as another commenter said, another side effect was they got fault tolerant infrastructure. So not every minute they spent on migrating to "spot" instances is dedicated to that.
We basically only run jobs on the spot instances that don't need to run instantaneously, so it's really cheap for us.
I’m literally emotionally drained after unsuccessfully working with k8s after 2+ weeks.
It’s incredibly over complicated and documentation is all over the place. I had a large write up of my experiences but those are lost and I don’t have the energy to retype all of that.
I simply wanted to utilize k8s to help provide some auto scaling and redundancies for a 10 year old service I run.
After 2 weeks of deep diving on this topic and getting essentially nowhere, even with the help of a friend that does this for his day job and him waving his hands not being able to help, I’m reluctantly done.
The technology is just not ready. It’s too complicated. The documentation isn’t sufficient. Sure you can document every nut and bolt, but if you can’t create simple patterns for people to follow you lose. There’s too much change going on between versions.
At my last 2 companies, they each had a team of 2-10 people working on implementing kubernetes. After over a year at each company, no significant progress had been made on k8s. Sure some stuff was migrated over but no significant services were running with it.
Not wanting you to go to the pain of trying to recreate your original post, but out of interest, what kind of things were the primary areas of pain in your work?
We had one full time senior person doing the move (me) and 6 others coming in and out of the project through the year as their capacity allowed.
One of my clusters is running 5000+ containers with ease. Not huge by other company standards but big enough.
I fully acknowledge this will only work in certain scenarios and for certain workloads, eg not ideal for long running/cache/database style services.
The indirect effect of building on a system like this is that the recovery mechanisms get tested on a regular basis instead of just on the odd day when things fail.
Spot instances are like a natural chaosmonkey mode, with money being saved and forcing you to build failure tolerance, retries & circuit breakers early in dev.
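For anyone wondering what "retries & circuit breakers" looks like in its smallest form, here is a minimal sketch. It is not any particular library's API; the class and function names are made up, the thresholds are arbitrary, and a production version would want jitter, logging and thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being failed fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn, retrying with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Typical use would be something like `retry(lambda: breaker.call(fetch_from_worker))`, where `fetch_from_worker` is whatever call can fail when a spot node disappears. The retry absorbs a single lost node, and the breaker stops you hammering a dependency that is actually down.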
This precludes one from becoming complacent with spot instances that rarely go away.
"A lot more stable" isn't really a desirable characteristic of ephemeral compute capacity. In my experience, the less frequently the instances went away, the more complacent the operators became.
Preemptible instances are stable in the sense that you know they're going away within 24 hours and must be prepared for that.
Spot used to be that way and the price was very sensitive, but AWS tweaked it so that it's more stable.
To the point that, after a year or 2 of running spot instances, we don't feel the difference between spot and on-demand that much. We got complacent.
As you mention, definitely don’t do this where persistence is paramount (cache that is expensive to backfill upon recovery, database, etc) but it’s just fine for transient workloads or workloads you can rapidly and safely preempt and resume.
Of course, at the end of the day rarely is something ever truly stateless.
Well, one question to ask yourself when considering going down this route is whether it makes more sense to move all the statefulness into managed services, like Aurora, BigTable, S3, etc.
That drastically simplifies life. Now the only infrastructure directly managed by you are stateless workloads that can easily be self-healed, rolled back, scaled up/down, etc. Managed DBs are more expensive than running your own DB, but most likely the cost savings of moving the rest of the infrastructure to spot/preemptible outweighs this difference.
My plan is to use dedicated servers for most of the load and some elastic capacity at peak loads if necessary.
Also, this worked out so well for our use case because we were using .NET Framework at the time, so the cost saving was huge.
A lot has changed since then.
Also, this strategy is not limited to AWS; similar types of instances are also available on Azure, GCP, etc.
This is NOT for production use (that’s what the managed AKS service is for), but I like to tinker with the internals, so I keep an instance running permanently with a few toy services.
The whole idea was from the Spotinst blog. Thanks a lot! I just glued all the open source projects together with some changes here and there. If the idea hadn't worked, I would def have considered using Spotinst.
However, every cost saving was important for our startup back then. We were a small shop in Southeast Asia where a senior engineer cost merely $1000 a month. I was thinking maybe I could save the cut from Spotinst too :)
I don't understand how something you pay for every month is being considered a sunk cost? Am I missing an up front charge, or does the writer not understand what it means?
There isn’t anything about preemptible to “cause networking issues”. We may have had a general networking outage, if that’s what you experienced, but we don’t additionally make networking worse for preemptible. You’re likely to get shot, but we don’t adjust throttles or anything.
Anyway, yes. It caused us problems. Attempts to connect to machines outside of GCP would fail repeatedly due to the NA dropping the SYN. This would mean our connect() call would timeout repeatedly. There was an ICMP response I could see with tcpdump:
ICMP host aaa.bbb.ccc.ddd unreachable - admin prohibited filter, length 68