Hacker News
We use Kubernetes and spot instances to reduce EC2 billing up to 80% (tuananh.net)
514 points by talonx on Feb 25, 2020 | 293 comments

I hope people don't go and take this advice and just run everything on Spot, as that is a mistake.

It is very common for AWS to completely run out of entire classes of instance types e.g. all of R5 or all of M5. And when that happens your cluster will die.

What you want to do is split your cluster into minimum two node groups e.g. Core and Task:

Core: On-Demand for all of your critical and management apps, e.g. monitoring. Task: Spot for your random, ephemeral jobs that aren't a big deal if they need to be re-run.

So for a Spark cluster for example you would pin your driver to the Core nodes and let the executors run on the Task nodes.
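The routing rule described above can be sketched as a tiny scheduling helper; the node-group names, label key, and workload names here are made up for illustration, not from any real cluster:

```python
# Sketch: route workloads to node groups by criticality.
# "core" = On-Demand nodes (critical/management apps),
# "task" = Spot nodes (ephemeral, re-runnable jobs).
# Group names, label key, and workload names are hypothetical.

NODE_GROUPS = {
    "core": {"lifecycle": "on-demand"},
    "task": {"lifecycle": "spot"},
}

def node_selector(workload):
    """Return the Kubernetes-style nodeSelector a pod of this kind should carry."""
    critical = {"monitoring", "ingress", "spark-driver"}
    group = "core" if workload in critical else "task"
    return {"node-group": group}

# A Spark driver is pinned to the On-Demand group...
assert node_selector("spark-driver") == {"node-group": "core"}
# ...while executors may land on interruptible Spot nodes.
assert node_selector("spark-executor") == {"node-group": "task"}
```

In a real cluster the same split is usually expressed with node labels plus `nodeSelector` (or affinity/taints) on the pod specs.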

Shout-out for "AutoSpotting", which transparently re-launches a regular On-Demand ASG as spot instances, and will fall back to regular instances: https://github.com/AutoSpotting/AutoSpotting/

Combined with the fact that you can have an ASG with multiple instance types: https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups...

This means you can be reasonably certain you'll never run out of capacity unless AWS runs out of every single instance type you have requested, terminates your Spot instances, and you can't launch any more On-Demand ones.

(and even so, set a minimum percentage of On-Demand in AutoSpotting to ensure you maintain at least some capacity)
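The multi-type ASG with an On-Demand floor can be sketched as the `MixedInstancesPolicy` structure one would pass to boto3's `create_auto_scaling_group`; the template name, instance types, and numbers below are placeholder values, not recommendations:

```python
# Sketch of an ASG MixedInstancesPolicy (boto3 shape); values are
# hypothetical examples, not from the article.

mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {"LaunchTemplateName": "my-template"},
        # Several interchangeable types, so one pool drying up
        # doesn't take the whole group down.
        "Overrides": [
            {"InstanceType": t} for t in ("m5.xlarge", "m5a.xlarge", "r5.xlarge")
        ],
    },
    "InstancesDistribution": {
        # Keep a floor of On-Demand capacity even if all Spot pools vanish.
        "OnDemandBaseCapacity": 2,
        "OnDemandPercentageAboveBaseCapacity": 25,
        "SpotAllocationStrategy": "lowest-price",
    },
}

assert len(mixed_instances_policy["LaunchTemplate"]["Overrides"]) == 3
```

The `OnDemandBaseCapacity` / `OnDemandPercentageAboveBaseCapacity` pair is what implements the "minimum percentage of On-Demand" idea above.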

> runs out of every single instance type you have requested, terminates your Spot instances, and you can't launch any more On-Demand ones.

This is more common than you think.

Internally, cloud providers schedule instance types onto real hardware, so running out of an instance type likely means they have run out of capacity overall, with only a tiny amount left in fragmentation. To access that tiny remainder, they'll terminate spot instances and live-migrate users (which they have to do very slowly) to make space for a few more of whichever instance types make the most business sense (which varies depending on the mix of real hardware and existing instance types).

It takes someone like AWS a good few weeks, sometimes months, to provision new actual hardware.

It isn't uncommon for big users to be told they'll be given a service credit if they'll move away from a capacity constrained zone.

Is there a similar concept to airline upgrading? Better that than denying a paying customer boarding. Surely there must be spare capacity somewhere in the datacentre, with slightly better specs.

Yes - they totally do that. If there is only space for a large instance, but you want a small one, they fit your small one in the free capacity, and there is now space for someone else to fit another small one next to it.

For business reasons they might decide not to do that though - your small instance might mean they have to say no to a big allocation later.

Instead they just delay your instance starting and hope other instances moving around opens up a more suitable location for it.
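The trade-off described above is essentially a bin-packing problem; here is a toy first-fit sketch (not any provider's actual placement logic):

```python
# Toy first-fit placement: hosts have free "slots" of capacity. A small
# instance can land in a hole big enough for a large one, but doing so
# may block a later large allocation -- the trade-off described above.

def place(free_slots, size):
    """Put an instance of `size` on the first host with room.
    Returns the host index, or None if it fits nowhere."""
    for i, free in enumerate(free_slots):
        if free >= size:
            free_slots[i] -= size
            return i
    return None

hosts = [4, 16]               # free capacity per host, arbitrary units
assert place(hosts, 8) == 1   # the large hole takes the medium instance
assert hosts == [4, 8]
assert place(hosts, 16) is None  # a big allocation is now blocked
```

Real schedulers weigh exactly this: accepting the small request now versus keeping the hole intact for a more valuable allocation later.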

There's an entire paper on the topic: https://dl.acm.org/doi/10.1145/2797211

The AutoSpotting author here, always feels great to see my little pet project mentioned by happy users. Thank you for making my day!

To set matters straight, AutoSpotting pre-dates the new AutoScaling mixed instance types functionality by a couple of years and it (intentionally) doesn't make use of it under the hood for reliability reasons related to failover to on-demand. To avoid any race conditions, AutoSpotting currently ignores any groups configured with mixed instances policy.

In the default configuration AutoSpotting implements a lazy/best-effort on-demand->spot replacement logic with built-in failover to on demand and to different spot instance types. To keep costs down, it is only triggered when failing to launch new spot instances (for whatever reason, including insufficient spot capacity).

What we do is iterate over instance types in increasing order of spot price until we successfully launch a compatible spot instance (roughly at least as large as the original in CPU/memory/disk terms, but cheaper per hour). If all compatible spot instance types fail to launch, the group keeps running the existing on-demand capacity. We retry this every few minutes until we eventually succeed.
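The loop described above could be sketched roughly like this; the data shapes and `try_launch` callback are stand-ins for illustration, not AutoSpotting's real code:

```python
# Rough sketch of the replacement loop: walk candidate spot types in
# increasing price order, keep only those at least as large as the
# original but cheaper than its on-demand price, and stop at the first
# successful launch. Data and launch function are hypothetical.

def pick_spot_replacement(original, candidates, try_launch):
    compatible = [
        c for c in candidates
        if c["vcpu"] >= original["vcpu"]
        and c["mem_gb"] >= original["mem_gb"]
        and c["spot_price"] < original["od_price"]
    ]
    for c in sorted(compatible, key=lambda c: c["spot_price"]):
        if try_launch(c):
            return c      # replaced with cheaper spot capacity
    return None           # keep the on-demand instance; retry later

original = {"vcpu": 4, "mem_gb": 16, "od_price": 0.20}
candidates = [
    {"type": "a", "vcpu": 4, "mem_gb": 16, "spot_price": 0.05},
    {"type": "b", "vcpu": 8, "mem_gb": 32, "spot_price": 0.08},
]
# Simulate the cheapest pool being out of capacity:
chosen = pick_spot_replacement(original, candidates,
                               lambda c: c["type"] != "a")
assert chosen["type"] == "b"
```

Returning `None` corresponds to the "keep running the existing on-demand capacity and retry later" branch.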

There's currently no failover to multiple on-demand instance types (this is a known limitation), but this could be implemented with reasonable effort.

We're also working on significantly improving the current replacement logic to address a bunch of edge cases, with a significant architectural change (making use of instance launch events). I'm very excited about this improvement and looking forward to having it land, hopefully within a few weeks.

At the end of the day, unlike most tools in this space (including AWS offerings), AutoSpotting is an open source project. So if anyone is interested in helping implement any of these improvements (or maybe others), while at the same time getting experience with Go and the AWS APIs, which are nowadays very valuable skills, you're more than welcome to join the fun.

Thanks for the shout-out, really appreciate it.

If you don't mind I'd like to get some feedback/feature ideas from users like you.

Please get in touch with me on https://gitter.im/cristim

ASG, per the blog-post you linked to, now supports starting both on-demand and spot instances, so what's the use of AutoSpotting?

The author of AutoSpotting here; this is often asked and I'm happy to clarify it.

The mixed capacity ASGs currently run at decreased capacity when failing to launch spot instances. AutoSpotting will automatically failover to on-demand capacity when spot capacity is lost and back to spot once it can launch it again.

Another useful feature is that it most often requires no configuration of older on-demand ASGs, because it can just take them over and replace their nodes with compatible spot instances.

This makes it very popular for people who run legacy infrastructure that can't be tampered with for whatever reasons, as well as for large-scale rollouts on hundreds of accounts. Someone recently deployed it on infrastructure still running on EC2 Classic started in 2008 or so that wasn't touched for years.

Another large company deployed it with the default opt-in configuration against hundreds of AWS accounts owned by as many teams, many with legacy instances that had been running for years. It would normally have taken them years to coordinate such a mass migration, but it took just a couple of months to migrate to spot. The teams could opt in and try it out on their application, or opt out known-sensitive workloads. A few weeks later they centrally switched the configuration to opt-out mode, converting most of their infrastructure to spot literally overnight and saving lots of money with very little configuration effort and very little disruption to the teams.

If you want to learn more about it have a look at our FAQ at https://autospotting.org/faq/index.html

It's also the most prominent open source tool in this space. Most competition consists of closed-source, commercial (and often quite expensive) tools so if you're currently having any issues or missing functionality, anyone skilled enough can submit a fix or improvement pull request.

Where can I read about some of these more impressive use cases you describe?

Have a look at https://github.com/AutoSpotting/AutoSpotting or the FAQ section on https://autospotting.org

If those don't answer your questions feel free to reach out to me and I'll do my best to explain further.

It replaces on-demand instances in-place. If there are no spot instances available, it will leave the on-demand ones running. If a spot instance gets killed, it will start again as on-demand.

It sounds a bit hinky, but it tends to leave you with the number of instances you want running without having to determine what percentage of the ASG should be on demand or spot — especially with the possibility of not being able to start new spot instances if they’ve been terminated.

Yes. We ran into this early on. Setting your bid price (the maximum price you are willing to pay for the resource you are spinning up) higher than the average spot price does not protect you against instance termination. Even if you set your spot bid at the current on-demand price (you are only charged the current spot market price), your instance will be terminated if there is no capacity available in the spot instance pool. For example, say you spin up a spot EC2 instance with 100 GB of RAM, with the bid price set to the on-demand price. If someone else spins up a 100 GB on-demand instance and there is no capacity left in the spot pool, your instance will be terminated.
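The mechanics above can be summarized in a toy model (an illustration, not AWS's actual reclamation logic): the bid only caps what you're willing to pay, while termination is driven by capacity:

```python
# Toy model of spot billing/termination as described above.
# Hypothetical sketch, not AWS's real algorithm.

def spot_status(bid, market_price, pool_has_capacity):
    """Return (state, hourly_charge) for a hypothetical spot instance."""
    if not pool_has_capacity:
        return ("terminated", 0.0)    # capacity reclaimed; bid is irrelevant
    if market_price > bid:
        return ("terminated", 0.0)    # outbid by the market
    return ("running", market_price)  # charged the market rate, never the bid

# Bidding the on-demand price doesn't save you when the pool is empty:
assert spot_status(bid=0.50, market_price=0.10, pool_has_capacity=False) == ("terminated", 0.0)
# While running, you pay the market price, not your bid:
assert spot_status(bid=0.50, market_price=0.10, pool_has_capacity=True) == ("running", 0.10)
```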

author here: this is why i suggest using a mix of reserved instances and spot instances. worst case scenario, everything runs on the small number of reserved + on-demand instances.

there are other strategies to avoid this as well

using multiple instance types, of different sizes, in different availability zones, because price spikes differ for each combination

using bigger instances than usual, because bigger = more stable / less likely to get evicted
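The diversification above amounts to enumerating independent (type, AZ) pools, each with its own price history; the types and zones below are arbitrary examples:

```python
# Sketch: spreading across (instance type, AZ) combinations gives several
# independent spot pools instead of one. Types and AZs are arbitrary examples.

from itertools import product

types = ["m5.2xlarge", "r5.2xlarge", "m5.4xlarge"]   # mixed families and sizes
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

pools = [{"type": t, "az": z} for t, z in product(types, zones)]
assert len(pools) == 9   # 9 independent pools instead of 1
```

Losing all nine pools at once is far less likely than losing a single one, which is the whole point of the strategy.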

Why not use spot fleets to solve this? You can specify multiple acceptable instance types.

I can't speak for the other regions but in ap-southeast-2 we had a couple of incidents in 2019 where there was no Spot capacity for ANY instance types between say 64GB and 512GB of RAM.

And so even our instance fleets were failing to provision capacity.

Requiring a minimum of 64GB of RAM seems like quite a specialized computational need, so if it's mission-critical that you have at least one instance of this type up at all times, then maybe you could fully reserve that one instance while depending on spot for scaling?

this is why the ability to fall back to on-demand instances is critical.

When there are no spot instances, it's common that on-demand instances can't be started either.

Unless you paid for a reservation, you're SOL.

Fwiw, Product@SpotInst here. Curious what makes you arrive at this conclusion? At SpotInst we have customers who migrate between Spot <--> RI <--> OD everyday.

that's for you to decide: what percentage of your total compute should be reserved instances. it's always a trade-off. in that case, the percentage of cost saving is lower but availability is higher.

It seems weird that AWS wouldn't just do this automatically, right?

Yes and no. Because they do not want to disclose how much capacity they have and on demand costs more, customers would be left wondering if AWS unfairly decided to run on demand when spot could have been used instead.

I'm also puzzled why people wouldn't just use spot fleet with on-demand as backup.

because at the time when we did this (2016-2017), kops did not support spot fleet. worst case scenario, you get a mix of reserved + on-demand instances.

I don't think I've ever seen spotfleet not satisfy a request with all the same instance type. When the instance gets terminated, it seems to fill again with the same instance type as before (and then get killed again in busy periods).

For the past few weeks I've been having this exact problem with GPU instances, where on-demand requests would kill my spotfleet instances in the middle of a job, I'd spin up again, get killed again a few minutes later.

The only way to fix it without rewriting a bunch of stuff was to blacklist that instance type from my spotfleet requests.

It definitely wasted a bunch of my time though which seems to be increasingly more common with newer AWS APIs.

Fwiw, Product@SpotInst here. Might I suggest that you give us a try. We ward off against this specific scenario by calculating a Spot score based on Instance Type/AZ and deliver flexibility with a choice of different instance types.

because at the time when we did this (2016-2017), kops did not support spot fleet

Fwiw, I lead Product for Ocean at SpotInst. Your analysis is _spot_ on :). At SpotInst, we predict interruptions by calculating a Spot Market Score which is based on Instance Family, Type and AZ (using historic data). This allows us to spread pods/workloads across a variety of spot instance types while delivering an SLA. As an additional data point, we have a mix of stateless/stateful workloads running 24x7 on our platform. For critical/management workloads/apps, RI could make a lot of sense and we can dynamically buy/sell those on your behalf, delivering a fully-managed Spot <--> RI <--> OD solution where needed.

Fargate spot is out now and pretty nice, since it is just running a container of a certain size and can decide where it is run it seems like it would be less likely to run out of capacity as well.

The main problem with AWS Fargate, as with most of AWS's offerings, is cost. More precisely, granularity. At best, Fargate charges you per 0.25 vCPU and per 0.5 GB of RAM (rounded up) per second, up to a ceiling of 2 GB of RAM. If your containers have spiky workloads that require more than 2 GB of RAM, the granularity gets coarser and you are charged per 1 GB of RAM (rounded up) per second. That means a spiky workload can lead you to spend more than a month's worth of an EC2 instance with 2 GB of RAM (say, a t3.small with 2 vCPUs and 2 GB of RAM).
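The rounding the comment describes can be sketched as a small calculation; note this models the commenter's description of the granularity, not AWS's official price list:

```python
import math

# Sketch of the billing granularity as described in the comment above
# (not official AWS Fargate pricing): resources round up to 0.25 vCPU
# and 0.5 GB steps, with memory above 2 GB rounding in 1 GB steps.

def billable_resources(vcpu, mem_gb):
    vcpu_billed = math.ceil(vcpu / 0.25) * 0.25
    step = 0.5 if mem_gb <= 2 else 1.0
    mem_billed = math.ceil(mem_gb / step) * step
    return vcpu_billed, mem_billed

# A container that briefly spikes to 2.1 GB is billed as if it used 3 GB:
assert billable_resources(0.3, 2.1) == (0.5, 3.0)
assert billable_resources(0.25, 1.7) == (0.25, 2.0)
```

That jump from 2.1 GB used to 3 GB billed is the coarseness the comment is complaining about.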

The whole service becomes even more ridiculous if a company is already operating EC2 instances or kubernetes clusters on EC2 spot instances (the topic of this discussion), because more often than not these workloads can already be covered by the cluster's unused capacity.

Product@SpotInst here. Just fyi, we have customers launch Spark workloads that require 300G RAM and we are able to efficiently deliver this capacity by ensuring that the underlying infrastructure is shared with other pods that are not memory bound. We bin-pack, scale up and down quickly, while allowing you to have warmed-up spare capacity when needed to ensure that your pods/tasks are serviceable immediately.

I spent months in the summer of 2018 not able to deploy azure gs4/5’s in north or west Europe - such a pain

are you using, or have you built, any dashboards to track this sort of stuff? It seems like deprioritizing certain types of work to save money might warrant some sort of forecasting info. Should I rely on something happening today or not?

Why are people so obsessed by AWS? It is one of the most expensive hosting solutions that tries hard to lock you into their ecosystem.

I somewhat understand why enterprises want to use it, but why are small startups using it so much and then complaining about the cost?

Nowadays when we have high speed internet, and a lot of things are containerized, it is so simple to change hosting partners. Just pick one that doesn't cost an arm and a leg and move to a different one if it didn't fit very well.

I have used linux containers for 10 years now and changed hosting a few times, each time reducing costs even more. Yes, it is a bit of manual labour, but if you have someone with sysadmin/devops skills, it is easily doable.

> Why are people so obsessed by AWS? It is one of the most expensive hosting solutions that tries hard to lock you into their ecosystem.

I agree with you, and that's why I try to get the point of view of those who actually decide to adopt AWS. They aren't crazy or stupid, and as AWS is the world's leading cloud provider then it's highly doubtful that the decision is irrational and motivated by ignorance.

So far, the main fallacy with regard to companies picking AWS is the idea that cost is what matters. It isn't. AWS might overcharge a lot, but the truth is that for any sizeable corporation it's irrelevant whether they spend 200€ or 400€ on their cloud infrastructure. It's far less than a salary, and even less than the bill for some office utilities. So once the infrastructure foot is in the door, why would management worry about cost? What they do care about is uptime and development speed, because those have a direct impact on productivity, and thus on the value extracted from salaries. If a particular service provider enables you to show tangible results in no time at all (e.g. spinning up a database, message broker, or workflow in next to no time), then they don't mind paying a premium that matches, say, their air conditioning bill.

For a startup, it can work out like this: start out on AWS/GCP/Azure in the initial phase, when you want to optimize for velocity in pushing out new functionality and services. When you start to require several message queues, different data stores, dynamic provisioning, and high availability, you save a lot on setup and maintenance; the initial cost of getting your own private cloud up and running, and doing so stably, is not to be underestimated. Especially when you're still exploring and haven't figured out the best technologies for your long-term needs.

Then, at some point, that dynamic changes as you gain a better understanding of your needs, the bills start to build up, and the architecture is in less flux. You might also have a bigger team and can afford to start allocating more resources to operations. That is the point when it might make sense to migrate over to self-managed.

Then at the same time, you have the scalability, which might be more of a key point for even larger organizations.

I think building somewhat cloud-agnostic to ease friction of provider migration is good, regardless, but do so pragmatically and look at the APIs from a service perspective.

Kubernetes? All the bigger providers have alternatives and you can run your own. Fargate? You're going to have to do some rewrites. MemoryStore? Just swap in another Redis instance. BigTable? Highly GCP-specific. Etc.

Not saying there aren't a lot of companies who choose the wrong provider for the wrong reasons, but it can also be part of a conscious strategy. Also, nobody ever got fired for buying IBM, and so on.

> Then, at some point, that dynamic changes as you gain a better understanding of your needs, the bills start to build up, and the architecture is in less flux. You might also have a bigger team and can afford to start allocating more resources to operations. That is the point when it might make sense to migrate over to self-managed.

I completely agree, and I had this discussion with my direct manager in the past. Yet, even if the potential savings are significant, managers might not be too keen on investing on switching your infrastructure. Running your own infrastructure is risky, and although top managers enjoy lower utility bills they don't enjoy the sight of having a greater risk of suffering downtime, specially if the downtime is self-inflicted and affects business-critical services.

So, if this transition doesn't go perfectly smoothly... the people signing off on the migration to self-hosted services might be risking their whole career on a play that, at best, only brings some short-term operational cost savings. Does that justify a move whose best-case scenario is equivalent to an AWS discount?

For sure, there are a lot of factors to consider here. For heavy workloads where you need solid I/O, the cost for the same performance on baremetal vs VMs on cloud can be >4x so you could even afford to have a duplicate setup on another network for redundancy and still be saving.

Moving workloads in-house does happen. But, in general, you're right. It's hard to advocate for a near-term expensive (in time and money) and at least somewhat risky (expect some nights and weekends crises) migration for possibly some longer term cost benefit (assuming you've accounted for all the costs). Which BTW neither you nor your manager may still be around to take credit for. And also BTW is at least somewhat counter to what companies are doing in general, for better or worse, and which execs will probably rightly see as a potential distraction from whatever the company is trying to accomplish.

Frankly the whole discussion mostly highlights that these are things you need to think about upfront before you're fully committed.

Back in the day, when I was part of a startup, the DB guy was all into making us write "provider-agnostic" SQL in case we ever wanted to switch to MySQL or Oracle. We were actually using Postgres. This was a nightmare.

Things started improving when we said ‘f-it we are not moving out of Postgres, let us at least use the best features of PG’

There is a similar problem when trying to use AWS with the constant thought about moving out of AWS at some point.

Yeah, this is a bit what I mean with doing it pragmatically - at least when you choose provider-specific services know that either 1) you have an idea of how you would migrate it or 2) it is a conscious decision to leverage a USP. By taking the provider-agnostic paradigm to the extreme, you have the least common denominator, getting none of the upsides.

> Start out on AWS/GCP/Azure in the initial phase when you want to optimize for velocity in terms of pushing out new functionality and services.

Have you ever done this? It's exorbitantly hard to migrate off of a cloud provider, and few ever do.

I agree that migrating off a cloud provider can be very hard. However, architecting your system with portability in mind can help a lot, as rumanator points out:

> I think building somewhat cloud-agnostic to ease friction of provider migration is good

Of course, that's not always an option if the system is already built, but it's definitely a good approach.

I have, and it's only hard if things aren't dockerized, imo. If everything is in something like Cloud Foundry it will be much harder.

If you don’t have heavy AD and policies in place even better.

+1 on containerization, and I would add that controlling your container orchestration service (particularly ingress and security) is another key factor. Whether someone uses Docker Swarm or Kubernetes, this setup enables them to redeploy their entire application in the blink of an eye, regardless of which cloud service provider they use.

If your entire infra is dockerized, you are in the vast minority and should probably be discarded as an outlier.

> They aren't crazy or stupid, and as AWS is the world's leading cloud provider then it's highly doubtful that the decision is irrational and motivated by ignorance.

Part of the problem is that a huge proportion of the people I come across who chose AWS used this exact argument. Part of the problem with that argument is that none of the big guys are paying the list prices (unless they're not doing their jobs; I've seen the kind of discounts available once you get into even the few hundred k/month range and tell your account manager you're considering moving), and a lot of them also used the same line of thinking.

It pulls in a lot of people who pick AWS for all the wrong reasons.

> AWS might overcharge a lot, but truth of the matter is that for any sizeable corporation it's irrelevant if they spend 200€ or 400€ on their cloud infrastructure.

The ones I used to deal with used to be more like a 3x-10x cost difference on bills in the 10k-100k/month range. I agree with you that if the difference is ~200/month, then who cares. But a lot of much bigger companies burn money this way. Often because they started off with a 200/month difference, and then never made it a point to re-evaluate as their costs grew.

The difference isn't always that bad, but especially bandwidth hungry services are ridiculously expensive on AWS (to the point where if people really badly want to stay on AWS and spend a lot on bandwidth, a quick and dirty fix is to rent servers to use as caches in front of their AWS setup)

I'm not saying people shouldn't use AWS. But as you point out, the right usecase for AWS is when you don't mind the cost, and pick it for convenience, and there's the warm fuzzy feeling of knowing you can hire people "off the street" who knows how AWS works.

AWS is the luxury option. Sometimes you want the luxury option.

But it worries me how many startups build in ways that end up locking them into a provider that for some of them multiplies their per user cost by anything from 2x to 10x. When I evaluate startup pitches today, I often ask whether or not they have thought this through. It doesn't matter so much that they're on AWS - that might well be a fine choice. What matters is whether it was a conscious decision, and they've done at least a superficial attempt at modelling the costs both for AWS and some alternatives, rather than just picked it by default.

For any start-up expecting to work with enterprise customers there is no choice but to support AWS, and it has been this way for at least 3 years now. This doesn't mean you must use AWS for your entire footprint (with Azure a close second and GCP essentially irrelevant for most large non-tech customers), but you will need a POP in most regions that the F500 works with strategically. Any enterprise tech founder who utters the word cloud should know this as table stakes to compete for the foreseeable future.

That absolutely makes sense. But you can achieve that by either being able to deploy the client specific bits to AWS for clients that absolutely insist, or by simply deploying proxies and picking and choosing on a per service basis whether or not it makes sense to deploy it to AWS or proxy it to your own infra.

Sure, but there's a fun catch with this strategy. I'm familiar with some companies that refuse to work with vendors that use AWS for production hosting of their footprint because it would be funding their competitor. There's no such thing as a company that contractually requires all of production in AWS though, in contrast, not even AWS.

wrt aws, there's also the option of building out with serverless tech, which is a dramatic cost reduction (not paying for idle) and ideally scales with usage to match your business model/revenue. cloud portability suffers, but I've found that for an HA setup it's a dramatic cost savings (10x) while getting traction for a service. I've seen that transformation bear out in many enterprise companies. As an example, I built out https://awsapichanges.info and run it for less than a dollar a month with fargate spot, s3, etc. Just saying ymmv depending on how you want to build/design your app's use of infra resources.

It really isn't a dramatic cost reduction for most people, as most people simply don't have sufficiently large difference between base load and peak. I love the concept of serverless, but I'm not seeing it as a cost savings measure for the most part, but as a simplification of architecture.

I basically agree...

However, I'd argue if you pick the right tools from the start you can leverage AWS relatively inexpensively... But that's hard without enough cloud knowledge in the industry yet, and consultants are (generally) terrible at this.

The main advantage is you can pay that crazy $200/month for a scalable database without paying $5,000+/month (burdened) for the guy who can build and maintain it for you. A developer can learn to connect to and write code against a cluster more easily than they can learn to build a scalable database themselves -- and this is just an example; replace DBs with some other function you might hire a person or a team for.

> However, I'd argue if you pick the right tools from the start you can leverage AWS relatively inexpensively... But that's hard without enough cloud knowledge in the industry yet, and consultants are (generally) terrible at this.

You'll have a hard time finding an AWS consultant who specializes in (or is inclined toward) helping you set up your infrastructure so that you don't use, or need to use, AWS. Not only is there little demand for that sort of service, it would actually kill the goose that lays their golden eggs.

Odds are that you could find consultants that are specialized in some other cloud service provider, and aren't experienced enough in AWS to be in a position to smoothly migrate services out of it.

I used to do that, and used to actively recommend customers not to use AWS in most cases, or to use hybrid setups, as while my billable hours tended to be higher for AWS setups, the demand is high enough that it's not worth making bad recommendations to milk a client vs. showing them that you can help them secure substantial savings.

Some setups we made cloud agnostic enough that when we finally got migrations approved we were able to do zero downtime migrations by splitting the setup between providers temporarily. That incidentally was the best way of getting people to migrate: You make the case for resilience and flexibility, then argue for a test run for a month or so, and then all it takes is for them to compare the bills.

I even, several times, made offers to migrate customers off AWS for a proportion of what I estimated they'd save over the following 3-6 months. None of them ever took me up on it once they realized just how much that would add up to vs. the fixed time-based offer I gave them, but it was a useful sales tool to demonstrate that I was willing to stand behind my estimates. One customer slashed their hosting bill 90% by getting off AWS (they were bandwidth-heavy, and we cut their bandwidth costs by 98%; AWS outbound transfer is ridiculous).

[as I've said elsewhere, AWS has its uses, but keeping cost down is not one of them... Ironically one of the good uses for AWS is to keep the cost of a dedicated setup down: Being able to "spill" over onto AWS (or other cloud) instances in the case of unexpected events lets you operate far closer to the wire than you otherwise would dare on a dedicated environment, even if you rarely use the capability; doing so also allows for more easily spinning up additional test/dev environments etc]

The biggest reason I can see why you don't find more consultants offering those services is that a surprising proportion of people hire consultants to give them backing for what they already want done ("see, the AWS consultant says I was right to want to use this AWS service") rather than to get genuinely independent advice. If you're not comfortable being repeatedly told "yes, but here's why we'll be ignoring the professional advice we ostensibly hired you for", and being very careful about forcefully presenting evidence-backed opinions that don't match the hiring manager's preconceptions, the pool of contracts rapidly shrinks.

I didn't say avoid AWS... I was more trying to point out that consultants and whatnot with certifications and "experience" are great at regurgitating information but often lack personal experience and never see the results of their work months down the line. They suggest using tools based on their understanding, which frequently comes with minimal hands-on experience... e.g. suggesting a bunch of DynamoDB tables with several indexes each to get a serverless database, while ignoring concepts like duplicating data, leveraging hash/range keys, and avoiding scans (as an example).

You're assuming the organization/company does not do a lot of computation. If that's true, then yes, cost is not much of a factor. But in that case, a company could just rent a server at some ISP and be done with it.

In interesting cases, cost is _very_ significant. And it's not a 2x factor, IIRC it can easily be 10x.

> You're assuming the organization/company does not do a lot of computation.

I am assuming nothing. I personally had this very same debate with a program manager at a company whose bread and butter was doing a lot of computation. In the eyes of upper management, arguing about spending 20€ or 100€ on a cloud service provider is akin to debating which brand of detergent the company should buy. It's irrelevant.

> But in that case, a company could just rent a server at some ISP and be done with it.

That's where you get it all wrong. Renting bare metal does nothing for your ability to scale, either to meet demand or to develop/test/try out new services, nor does it help you use higher-level managed services. Everything you have to do or manage yourself is a productivity hit, and that productivity hit is measured as a percentage of your entire payroll, which eclipses how much you pay your cloud service provider.

100€ on cloud service is not "a lot of computation".

Computing itself is the cheap part, especially on spot instances. It's the outbound bandwidth that'll kill you

Ok, so s/computing/computing, storage and network I\/O/ . Same argument.

... although if it's outbound bandwidth, and you can take care of your own compute, maybe it's possible to purchase outgoing bandwidth at an hourly resolution, rather than "whatever your wires can send us", and keep your flexibility.

Because it works?

I was doing some testing of Serverless (the framework) for a personal project. I wanted to do it with Google functions + a database, but even for the basic examples GCS wouldn't work; I spent 3 hours fiddling around. I moved on to AWS Lambda + Dynamo and was done in 1 hour.

Also, AWS support is simply amazing. Considering Google's history of bad support, I wouldn't consider it for anything serious.

> Also AWS support is simply amazing.

A few weeks ago, I had a two-node Elasticsearch cluster (evidently my mistake... though I don't know why AWS can operate high-availability two-node transactional RDS clusters no problem, but not ES).

One node went down, only manual intervention by AWS support could fix it, automatic backups were completely inaccessible (since they rely on the cluster being up? ... WTF), and it took many hours for support to reset the cluster.

I eventually just said "screw it" and spent a couple of days rebuilding the cluster from scratch.

AWS has over 175 services and is continually improving. I would say most new services don't live up to the promise at launch, but quarter after quarter you see dramatic improvement, and for some services it's year after year. What makes AWS so valuable is not the individual services but the fact that you can string them together, which outweighs the feature set of any single service.

GitHub is amazing for developer happiness, but CodeCommit is secure and seamlessly integrates with so many AWS services that I can live without all the bells and whistles of GitHub.

Elasticsearch is on the older side of those 175, since it launched in 2015. [1]

[1] https://aws.amazon.com/blogs/aws/new-amazon-elasticsearch-se...

I've never had aws hand over anything under a tenuous legal argument. Github bent over and found the first reason they could. I'll pay my money to AWS first, because it doesn't compromise my business.

This. AWS support is wildly expensive. I actually encourage most companies to find a smaller partner with strong infrastructure for this exact reason.

AWS Business Support is only $100 USD per month, where you can call in 24/7 and be connected with a Cloud Support Engineer within 5-15 minutes. If you don't like the engineer, hang up, call again, and get someone different. That's incredibly inexpensive; it saves me hours, or the cost of hiring more devs.

You can get this Business Support free for a year if you join Startup School and get $3K USD in credits along with the business support.

If you can't afford it, maybe you are just hobbying around, but AWS offers lots of ways to support you until you have sustainable revenue.

Last I checked it was a percentage of your spend.

I think you may not appreciate how much $100 a month is for many people around the globe.

That may be. But then you're definitely in the category where your time isn't really worth anything, and screwing around for a few days doing DIY on software/hardware/etc. is a better solution than paying someone to do something for you. That's fine, but you're really describing paid support generally.

This right here

I always hear people on the internet talking about AWS being crazy expensive, but from SFBA it looks really damn cheap. Would I rather give $thousands to AWS or $hundredthousands to an internal specialist who’s likely gonna say my company is too small and boring to keep them interested anyway?

AWS wins that math every day. And that's the market they target. Why wouldn't they?

AWS is extremely expensive once you get to any meaningful size. If you save on infrastructure you pay for it on enterprise support, engineers or consultants and/or bandwidth.

I LIKE AWS. I think it can be a great choice for many companies and use cases. But the idea that AWS is for everyone, that it's less expensive for everyone when you factor in the TCO, simply isn't correct.

> But the idea that AWS is for everyone, that it's less expensive for everyone when you factor in the TCO, simply isn't correct.

Correct. I specifically said it’s perfect if your labor cost is higher than your AWS cost

Another good case is indie hackers whose project likely won’t ever make it out of the free tier

This just isn't true; in fact, quite the opposite. If you just have a website you are working on, AWS is complex and expensive. When you need to handle things at scale, AWS is far better than other clouds and an order of magnitude cheaper than rolling your own racks.

What's meaningful? It'll cost you half a million to build a full rack, power it, and give it connectivity for a year. That's a lot of spot instances if you don't have a constant load.

> It'll cost you a half million to build a full rack, power it and give it connectivity for a year.

For $1500/mo you can get a half rack (5kW) and 50 Mbps internet. A couple of $5k switches and 6-8 $10k servers and you're well under $125k, plus you probably don't need $10k servers or $5k switches and can find cheaper hosting. I realize you said a full rack, but that's probably overkill and could be done for $350k or less. Once you have the servers/switches, your ongoing costs are relatively low.
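Running the comment's own numbers (all of them the commenter's estimates, not real quotes), the year-one tally works out like this:

```python
# Year-one cost of the half-rack setup described above, taking the upper
# end of "6-8 $10k servers". All figures are the commenter's estimates.
hosting = 1500 * 12        # half rack + 50 Mbps for a year
switches = 2 * 5_000
servers = 8 * 10_000
year_one = hosting + switches + servers
print(year_one)            # → 108000, comfortably under the $125k ceiling

# After year one the hardware is paid off, so the recurring cost drops to:
print(hosting)             # → 18000 per year
```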

Part of it is that people see the AWS or support bill or consultant bill or whatever. But they don't really see the cost of DIY whether initial cost or ongoing support. The cost of build (on many dimensions) vs. buy is often underestimated. That's not to say you should never build--no, you shouldn't use AWS for everything--but it's important to understand the real costs.

I ran the tech team of a startup in a third-world country (bullet loans of $100 to people); we were cheap and it was always a problem convincing finance to buy technology instead of throwing more cheap labour at things.

Still, those $100 monthly we paid for AWS support were worth their weight in gold.

For a person anywhere on earth? Sure. But 99.99999% of those people don't need on-demand cloud hosting support.

But for a business that requires cloud hosting with support? There aren't many places on the planet where $100 is a prohibitive cost for a business that's willing to spend at minimum that much for hosting.

In short, it's the cost of doing business.

They once tried to get me to sign up for $100/month to fix an issue on their side. I refused and 3 days later my problem was magically resolved. AWS is starting to mimic the poor customer treatment I used to only associate with the Amazon marketplace.

Elasticsearch is a bad example. Their elasticsearch offering has always been wack.

There's a tier of services at AWS:

s3/ec2/ecs/rds/elasticache - Flagships, will almost always work except in weird edge cases. Everyone uses these.

Niche Stuff - If you need it you'll know. It'll generally at least be an 80% tool (think athena/firehose/aws waf/etc).

Stuff with shitty pricing - Stuff they obviously only built for feature parity with GCP/Microsoft (think EKS/Managed SFTP/Cloud Active Directory)

Broken shit you use once, it screws you, and you never use it again - Usually it's either because the underlying open source tool is built in a fashion that isn't appropriate for a shared services environment or it's because someone at AWS has 'opinions' (think Elasticsearch/Cloudformation)

It is not recommended to run two node ES clusters. The AWS service shouldn't allow that configuration.

Yeah, that was my mistake.

I once used a much larger EC2 than I needed and got billed a few hundred. After explaining the situation they refunded all of it.

We do it because those alt vendors don’t have the security or compliance options offered by AWS/GCP/Azure.

For example, I wanted to replicate GCS to another hosted block store so that we could have a backup of our systems outside of our GCP account (they have been known to lock accounts on small businesses and not be very helpful in fixing it, GCS itself has been as stable as S3 in my experience).

Anyhow, I really wanted to use Backblaze B2 service for this purpose. Unfortunately, they don’t have the kinds of security controls or third party audits our industry requires, and their sales team indicated it wasn’t on the roadmap. I appreciate that honesty, but it’s one more reason the major players have a leg up.

They amortize the cost of compliance to the point where you don’t see it. For a long time AWS charged a lot extra for it, but GCP did not and was cheaper anyway. Now AWS gives compliance away for “free” as well.

It often leaves me wondering about other startups... how secure are they, really? I know my industry is onerous, but a lot of it is just “common sense” security-wise. Why should my browsing history or e-commerce purchases be any different from my medical records, when there are ways to use former to reverse engineer the latter?

I used to get an 18k monthly bill on AWS; moved everything to DO and now it costs me 8k a month. Billing is so much easier to understand now as well. It was a k8s-to-k8s cluster transfer, so the migration wasn't very painful.

And even DO is still quite expensive (though the cheaper alternatives do tend to have limitations - e.g. if you're ok with Europe only Hetzner tends to be far cheaper, but that doesn't work for everyone).

Thanks, but the majority of the target audience is in SEA. Also, we considered moving to DO only once they had their k8s offering ready.

I might have missed it in my search, but it seems Hetzner is probably still working on it. Nobody here likes to deploy k8s themselves, although the tools to do that are super sexy these days.

Hetzner deploys new features very slowly; that's the main issue with them if you're not prepared to roll your own.

But certainly for SEA they don't make sense, unless you e.g. have subsets of functionality that are bandwidth intensive but not latency sensitive (e.g. I used to manage a network that was split between the UK, Germany and New Zealand, and we used Hetzner for the German footprint and put anything where latency didn't matter, like bulk e-mailing, there, while customer-facing stuff was all in its respective country). For that to be worthwhile you need quite significant volume, though.

You could get that 8k/mo bill down to $500/month with an $8000 hardware purchase and well-thought-out colocation.

And support work. Saving 20k a month is one engineer in expensive markets.

If you use k8s, you can shop around more easily.

Of course, the flexibility of k8s does come with complexity, so YMMV.

Two main reasons: Elasticity and Availability.

Our services are highly elastic and can vary from processing 8 million events/day to 600 million events/day. Same with our users, who are mainly active during work hours, with some running night shifts, and fewer running weekends.

We are probably the prime-case for cloud, since elasticity is where you save cost by going away from dedicated hosting.

As for availability: our customers are highly dependent on us processing their data live, and on being able to monitor, get alarms, and react to their data. They rely on us to notify whichever technician needs to fix their production line when it's stopped, since they're losing money for every minute the line is not running (true for almost any manufacturing company).

This means that we need a lot of redundancy, and these things are built-in to almost all AWS offerings.


There's a case for dedicated servers still, I'll agree on that, but we are definitely benefiting from the cloud.

I'm no cloud expert, but I'm not sure your argument is convincing.

Let's say that, for a unit of computing work, the cloud price is N times the non-cloud price. I'm being a bit vague here for generality; plus, when not on a cloud, the cost structure is different, but bear with me.

Ok. Now, your load varies by a factor of up to 133x. But unless your peak load is over N times your average, it is still cheaper for you to keep machines which support the peak load and have a bunch of idle time.

Extra benefits:

* Can do other computational work (e.g. experiments) during off hours without impacting system responsiveness.

* Can perhaps put some machines to sleep, or other power-saving measures, during off hours.

Extra detriments:

* You have to take care of more system and cluster administration work than on the cloud.

Above N, the cloud makes sense. Close to N - not sure. Well below N - doesn't seem to make sense.

You'll need to tell us that N is low enough.
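The break-even rule above can be sketched numerically; the premium N and the loads below are illustrative assumptions, not measurements:

```python
# Self-hosting must be provisioned for peak load; cloud is billed roughly
# in proportion to average load at an N-times per-unit premium. So cloud
# wins when peak > N * average, as the comment argues.

def cheaper_on_cloud(peak_load: float, avg_load: float, n: float) -> bool:
    return n * avg_load < peak_load

# A huge peak-to-average swing (normalized: average 1, peak 133),
# with an assumed 5x cloud premium:
print(cheaper_on_cloud(peak_load=133, avg_load=1, n=5))    # → True
# A nearly flat load with the same 5x premium:
print(cheaper_on_cloud(peak_load=1.2, avg_load=1, n=5))    # → False
```

With a flat load, idle-but-owned capacity beats paying the cloud premium; the question is whether N for your provider is really below your peak-to-average ratio.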

My rule of thumb is that you need peaks shorter than ~6 hours for most "normal" setups before AWS becomes viable from a cost perspective. Of course, that depends on how significant the peaks are. Most consumer websites with an international audience, for example, rarely have significant enough peaks.

This is exacerbated because the alternative is not either-or: the most cost-effective system is often dedicated hosting plus the ability to spin up cloud instances to take the peaks.

Doing so lets you provision your dedicated servers to run at much higher utilization most of the time, which makes pure cloud setups look far more expensive, and most people with setups like that end up needing the cloud instances very rarely.

That's not to say there aren't exceptions with genuinely massive peaks, but even then it doesn't take a huge base load before a hybrid system starts to have the potential to bring substantial savings.
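A toy cost model of this hybrid idea, with made-up prices, shows why the combination can beat both pure options:

```python
# Dedicated servers carry the base load and are paid for the whole month;
# cloud instances absorb only the hours where demand exceeds the base.
# Both prices below are illustrative assumptions.
DEDICATED_COST = 100     # per server-month, self-hosted
CLOUD_COST = 400         # per equivalent server-month, on-demand cloud

def monthly_cost(hourly_demand: list, base_servers: int) -> int:
    burst_hours = sum(max(0, d - base_servers) for d in hourly_demand)
    hours = len(hourly_demand)
    return base_servers * DEDICATED_COST + burst_hours * CLOUD_COST // hours

# A 720-hour month: steady load of 10 servers, one 6-hour spike to 30.
demand = [10] * 714 + [30] * 6
print(monthly_cost(demand, base_servers=10))   # → 1066 (hybrid)
print(monthly_cost(demand, base_servers=0))    # → 4066 (pure cloud)
print(monthly_cost(demand, base_servers=30))   # → 3000 (pure dedicated)
```

The hybrid runs the dedicated boxes near full utilization and only pays the cloud premium for the rare spike hours, which is exactly the "dedicated base + cloud burst" setup described above.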

Also, to keep in mind: despite its name, 600M shouldn't blow anyone's socks off in upkeep.

It translates to about 7k rps averaged over a day. Assuming it all lands in one region during daytime, that's 14k rps over 12 hours, or 42k rps over 4 hours. It's well within a couple of high-powered servers even in the worst case.
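The conversion is easy to check:

```python
# 600M events/day as requests per second, averaged over different
# active windows (full day, 12-hour daytime, 4-hour worst case).
events = 600_000_000
print(round(events / (24 * 3600)))   # → 6944, i.e. ~7k rps over a full day
print(round(events / (12 * 3600)))   # → 13889, ~14k rps over 12 hours
print(round(events / (4 * 3600)))    # → 41667, ~42k rps over 4 hours
```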

Because the cloud providers advertise to tech startups. They are told that the scalability of the cloud is needed, because YOUR startup can become an overnight success and can’t handle the traffic.

This combined with how unsexy most of the VPS or dedicated hosting providers look makes the cloud providers a seemingly good choice for startups.

I'm not sure this is the only reason. Clouds are super, super attractive to stodgy, boring enterprises.

And people here working in start up environments are not aware how much hardware from internal IT departments costs. Internal IT departments are generally way less efficient than the cutthroat cloud providers. AWS/Azure/etc. hosting costs are peanuts compared to what most banks or other big enterprises pay for their internal hosting.

On top of that, internal IT response times are frequently horrible. Any developer worth his pay generally loves cloud providers and abhors IT departments. They're slow to deploy hardware or VMs, update firewall rules, install software, etc.

The main competition to AWS is not (or should not be) internal IT, but other cloud providers outside the big 2-3 (Digital Ocean etc.), and dedicated and managed hosting providers. I can usually get a server in less than 24h from a provider like Hetzner, but if I need capacity urgently I can also temporarily spin up cloud instances there.

The biggest issue is not that people want to avoid internal IT and/or want elasticity, but the number of people who never even do a proper cost comparison beyond the top tier most expensive providers.

AWS hasn't been selling itself as a competitor to other clouds for like a decade; it sells against in-house IT, because its market entrance _is_ technology-lagging companies. The first demo use cases for, say, SQS back in 2008 were literally failing over databases into the cloud. It wasn't really until Azure that there was any truly serious attempt to compete cloud-to-cloud at enterprise scale. For all the HN hype of the various vendors out there, they're all small fries collectively compared to the many billions of the cloud gorillas like Alibaba, Azure, and AWS. DO, Linode, Vultr, etc. are simply not intended for the bureaucracy-laden companies that are the lion's share of the global IT market, and are often not even allowed under enterprise vendor and purchasing agreements (I've had this problem before with several customers' contracts/legal, where it would have killed 7+ figure deals).

Most places outside tech hubs put massive emphasis on being able to swap labor, and are technology laggards because their switching costs are so high due to lack of velocity in their technology work in the first place (which is also why waterfall or spiral probably makes more sense for their efforts than Agile, even today). The irony here is that strong AWS professionals are not cheap in any way, but neither is being dead in the water because the 60+ year old graybeard IT guru who managed the racks faithfully for 12+ years retires suddenly.

> AWS hasn't been selling itself as a competitor to other clouds for like a decade

And yet they're handing out $25k-$100k in credits like candy to startups to prevent them from going to Google cloud or Azure.

"Selling" is what I was getting at. When looking at the tail end of the funnel every other vendor is giving away drugs to hook kids when they're young. Tail distribution dealers can't afford to do this in contrast.

Another tell is that the ways to "prove" startup status under most of the big providers' terms for cloud credits align more with "did you get VC funding?" than "are you small?", due to the longer lead times of bootstrapped startups. Most don't give you credits if you've been around 2+ years and soured on another provider. Maybe I just sucked at it, but I was bootstrapping before and had trouble getting decent credits then.

I have strong evidence of the opposite. I worked at a hedge fund (unnamed for privacy); they paid top dollar for IT admins, who, by the way, were very capable. An incursion into AWS taught them a hard lesson: they were nearly 4-5x more efficient than AWS at similar hosting capacity. Moving to GCP was their only option to maintain elasticity at relatively "acceptable" prices.

My "anecdata" is stronger than your "anecdata", I'd say.

> they paid top $ for IT admins, which by the way were very capable.

Most companies don't pay top dollar, and as a result they have to scrape by with the leftovers. So Amazon for sure beats them in sysadmin efficiency.

> Because the cloud providers advertise to tech startups.

That's not true at all. Cloud services are very attractive to good old fashioned companies, because they express the cost of using computational resources as a simple monthly bill, just like any other utility. Your manager goes through the monthly bills and he sees, say, electricity, water, cloud infrastructure, cleaning, office supplies. That's it. No need to know what a server is or what a spot instance is. If the costs fluctuate smoothly and within boundaries then you simply don't worry about them.

> That's not true at all.

I have lost count of the number of startups I've evaluated that have gotten $25k-$100k in credits from AWS or Google Cloud, or both. When I was doing consulting I once got paid to first set up an environment on AWS despite them knowing it was too expensive for them, then to migrate from AWS to Google Cloud, then to migrate from Google Cloud to Hetzner - all because my fees for doing the setups and migrations were far lower than the free credits heaped on them by AWS and Google.

There are certainly reasons why people in bigger companies like them as well, though from what I see, most of the time it is because they can slip the cost under the nose of the manager one level up in a way that is harder with more clearly quantified contracts. It's boiling frogs: as long as the cost just slowly creeps up, it doesn't get queried.

I am a bit skeptical of your claims. Without any detail: if you're talking about companies whose business justifies $100k in credits from AWS, then you're not talking about normal businesses with, at best, a local presence. If your company has a global presence then Hetzner doesn't quite cut it, as they only have data centers in Germany and Finland[1], and the added latency of accessing their data centers from outside Europe easily adds 100-200ms to each request.

Hetzner is indeed excellent if you are a cost-conscious European client that is willing to absorb the cost of self-managed bare-metal or quasi-bare-metal servers and cares about saving on bandwidth costs. However, if your goal is to provide a worldwide service then you are compelled to look elsewhere. Even competing European service providers such as OVH[2] or Scaleway[3] fare better than Hetzner in this domain.

[1] https://www.hetzner.com/unternehmen/rechenzentrum/

[2] https://us.ovhcloud.com/about/company/data-centers

[3] https://www.scaleway.com/en/datacenter/

I'm talking about tiny startups.

And Hetzner works just fine for most small startups even if/when they have a US audience; the latency is perfectly manageable, though in those kinds of situations you will certainly want to expand into other regions later.

The main point, though, is that even when you scale, nothing stops you from using Hetzner for Europe, and e.g. OVH for US, and indeed filling in with things like AWS when there are needs you can't otherwise meet.

In reality, very few startups get to a scale where this starts to matter, and overoptimizing for it at the cost of infra spend at the start is a great way of running out of money; I've seen that happen way too many times.

And Hetzner has cloud/VPS/etc offerings in Germany & Finland: https://www.hetzner.com/cloud

It seems some consulting companies prefer to push AWS over other service providers. AWS has the advantage of a lot of professional certifications and a big catalog of managed services to meet all kinds of needs. Cost is often mostly dev time for a lot of software running on it. It does pretty well at saving developers' time.

Also: free cloud credits. Don't underestimate that. I know of two startups (one that I work at). Both are hooked on the cloud, albeit one has an insane cloud bill and one has a very reasonable bill.

The reason in both cases is that MSFT/AWS gave them massive up-front "free" credits, which meant there was no incentive to impose internal controls on usage or put in place a culture of conservation. AWS doesn't think twice before dropping $20k in cloud credits on anyone who wrote a mobile app.

At the company with the unreasonable bill, it's not even a SaaS provider. It literally just runs a few very low-traffic websites for user manuals, some CI boxes and some marketing sites. The entire thing could be run off three or four dedicated boxes, 95% of which would be CI, I'm sure of it. Yet this firm spends several tech salaries' worth of money on Azure every quarter. The bill is eye-bleeding.

The problem is that everyone who wanted a VM got one. You had VMs being used for nothing except running a single test program the owner then forgot about. People used expensive hosted cloud services instead of installing Postgres because "their time is more valuable", etc. Free credits created a culture in which the company just institutionally forgot that servers cost money. When the free credits ran out, it was invisible to the engineers using the services and simply became another (opaque) line item for accountants to worry about in the burn rate. In the rare cases where engineers decided to "optimise", they did it by spending lots of time and effort on Kubernetes so stuff would scale down at night.

There was an abortive attempt to move off the cloud. This was unfortunately stymied by incompetence by both the firm and the chosen hosting provider. It got some boxes from OVH and then didn't pay the bills for them, so the boxes were simply yanked and deleted. All the setup time was lost. In another case OVH allocated machines in datacenters that were far apart but they needed to be close together. In another case the machines were delivered but the hardware on one was faulty. Of course this stuff is a one-time hit and avoidable with non-broken processes, but it empowers those who just want to keep learning Azure/Kubernetes/Docker so they can put it on their CV.

The other firm is much smaller and has much more reasonable costs, they also get more for it (e.g. use Heroku which automates a lot of stuff).

Having observed these two different companies, I resolved up front that if/when I go back to running my own business I will not use cloud services, even if they offer free credits. I'll be taking pride in finding ways to drive server costs down as far as possible. Perhaps even hosting non-HA-required servers at my home given we have duplex gigabit fiber now. Yes, my time is valuable and all that, but establishing a culture up front in which people use the resources they need and not more is even more valuable.

Counterpoint: my org got a large credit pool, but our development and engineering teams were closely communicating with relevant leadership to explain costs and processes. We stretched the credits out over 3 years, despite spinning up/down nearly a hundred thousand VMs. Our infrastructure team still controlled access and setup, but it was a lot easier to say "director, Bob wants $200/month for this project, approved?" and then create it in a way that the costs were tagged to that project. Slower than Bob having direct access, but much faster than traditional systems, since the VM was up in minutes and could be put in a sandbox if needed.

So in our case, credits got us into AWS, but instead of treating it like free money, we spent a lot of time improving efficiency and reducing our bill so when the credits ran out we could run our system for an affordable rate.

I take pride in reducing costs in the cloud. I don't argue cloud is the only way, but there's a lot of use cases where it just makes sense -- CloudFront+S3 for static websites, for example.

Fair enough. Perhaps it's unfair to blame the cloud providers for these situations. Badly run companies can waste money in a lot of ways. Just seems like cloud frequently crops up as one of those ways.

But..but..BUT!!! IT'S FREEEEEeeeeeeee!!!!!!!!!!

Nobody ever got fired for buying IB... er, AWS. [1]

[1] until the bill shows up ;-)

Actually -- to be honest, we're in the golden age of infrastructure. I love building on cloud. I don't have to worry about the fiddly bits as much and when I do things are often documented either in official docs or in blog posts and things.

IBM was the 90s. The 2000s was VMWare.

No one got fired for choosing AWS... nor blamed for an outage caused by AWS.


It's sad to say, but people are not rewarded for going with a perceived riskier option in an effort to save a couple hundred k.

You go with what works and is proven, and with what will make your life the easiest. If AWS goes down, then it's much easier to explain why. If "Big Bill's Low-cost Cloud Solutions" goes down, then your entire judgement will be questioned, even if Big Bill's cloud solution is equally or more capable and only half the cost.

Yep, it's risk aversion at the individual level. The solution you fought for breaks once? That's associated with you, personally. AWS breaks three times in the same period, or ends up costing a ton more than expected, or whatever? Well, it's an "industry standard", so no one gets the blame, even if it gets bad enough that a change (how about Azure!) is initiated.

Exactly. Try telling your management to switch to DO or Linode, which already hold the top spots in second-tier cloud hosting. Who is going to take the risk of failing for savings that are, in the grand scheme of things, negligible?

AWS failed? Great, you have half of the internet down as well to cover your ass. No one gets the blame and everyone continues with their work.

Of course this depends on culture and regional reputation. For example you would have no problem if your Startup in France is using OVH.

Not sure about DO, but Linode will bounce your servers whenever they feel like it. They usually give notice.

But some were forced to close as they were unable to afford the bills.

B2B SaaS hosting costs are often in the low single-digit percents of revenue.

Considering that software is known to be high-margin, there are lots and lots of companies for which hosting can be painful, but not threatening.

> if you have someone with sysadmin/devops skills, it is easily doable.

And if you don't, AWS can be pretty cost-competitive because your developers can handle a lot of the "infrastructure" that you used to need a dedicated sysadmin (or a team of) to handle.

Don't get me wrong, I wouldn't likely spend my own money on them, but it does beat having to try to find an on-call infrastructure person and a reliable data centre with proper routes, and then developers/devops people who can scale, monitor, manage, and maintain queueing services, message gateways, object storage, databases, firewalls, load balancers, containers, VMs, network attached storage, etc.

Also, don't underestimate the value of things like CloudFormation. Being able to make an API call and have an entire cluster configured, with load balancing, backups, multi-AZ redundancy, CDNs, etc., is pretty potent.

AWS might be expensive, but it gives you access to a lot of things that you might not otherwise have, even if you do have a sysadmin/devops guy.

Because fundamentally there are tiers of service that one needs to use:

There's a critical core infrastructure; I would argue it is:

* some sort of centralized management platform even if platform is just a set of deploy scripts ( 1 instance ),

* Origin DNS servers ( 3 instances - one per AZ)

* HTTP/HTTPS entry points ( 3 instances - one per AZ) ,

* A couple of database servers (say primary and backup - 2 instances).

* I would add job server though it probably could be collapsed into the management platform.

You need fixed IP addresses for this (you really do not want to deal with service discovery at this level), and you want it to be at a provider that won't ever make you renumber - preferably one that never breaks fundamental things like IP address assignment, or runs out of resources.

One's "internet presence" disappears when this dies. Running all of this as a core workload on AWS/GCP/Azure is a no-brainer: it will cost about $100/mo, be nearly insta-rebuildable and re-deployable, and a couple of configuration files in git would take care of bringing the beachhead up.

At this point your core is up and everything else becomes service-specific. This is the point where costs become a consideration, but the drive to low cost should not come at the expense of tooling. If by embracing ephemeral resources and existing tooling one can cut the base AWS/GCP/Azure price down by 80%, most people would think that's a win over having to invest in building tooling to make containers stable (as in, always behave in a predictable manner) on a provider that can shave off 95% of the costs.

The biggest issues with the cloud providers are the cost of IO and the cost of bandwidth, which scale linearly with the workload, but that's an issue for a very specific subset of customers, who should be hiring real ops people.

I use it when needing vast geographical spread - e.g. nodes that are closest to exchanges or some such.

Because so many others use it, if you care about latency to the counter-party server it's often a nice way to ensure low latency (by using the same availability zone).

I didn't have the impression people were choosing AWS for costs alone.

What cloud provider has a similar feature set?

Sure, if you want to meddle around with VMs and containers, you can use pretty much any provider, but if you want to go a step further there isn't much left

What's a popular feature set that only AWS offers and that you cannot easily roll your own solution for?

There were several reasons in our case. There might be more for other startups. Here are some:

- our clients (airlines) would very much prefer we use AWS over the smaller, lesser known offering.

- AWS offers free promotional credits for startups.

- AWS, when utilized right (which requires work), is not much more expensive than traditional hosting.

- Etc...

Those are the most expensive computers I ever saw: https://aws.amazon.com/ec2/dedicated-hosts/pricing/

Maybe you haven't seen enough computers? :-p

Internal IT departments or hosting providers meant for highly regulated environments can charge 5-10x that for a VM of a similar size :-)

So that will be 50-100x more expensive than Hetzner?


I completely agree. Most startups can do with cheaper or even bare-metal hosting for years before having to use highly scalable solutions like AWS.

How much would it cost to get someone dedicated to manage your bare metal infrastructure? In this day and age, where using ansible or terraform is considered close to the metal, how much would it cost you to manage your own server?

And what about scaling?

That's precisely the problem. Bare metal hosts are unbeatable in cost, but fixed costs render them too expensive for a startup. Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.

Then, when both of those costs become irrelevant, you already have the entire team trained and experienced in using a cloud service provider.

As someone who has been that "someone dedicated to managing your bare metal infrastructure": it will typically cost less than that "someone dedicated to ensuring your AWS setup is correct and works, and to handling all the regular config changes".

I know that first-hand, having done both, and from seeing how the AWS systems consistently earned me more money and how rarely I had to deal with the bare metal (I've done everything from own hardware in colos to renting servers from places like Hetzner; when I was handling servers in two separate colos I spent on average a day a year in each of them, the rest was done by "remote hands" at the data centre).

> That's precisely the problem. Bare metal hosts are unbeatable in cost, but fixed costs render them too expensive for a startup. Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.

Actually buying hardware is too expensive. But colocating leased hardware or renting on a month-by-month basis from a dedicated hosting provider costs about the same when you amortize over a three-year period, unless you're physically located somewhere with cheap land. E.g. I work out of London, and when I was doing this we eventually deprecated the own hardware in favour of renting from Hetzner because colo space in London was so much more expensive that the savings on the actual hardware couldn't make up for it.

> Then, when fixed costs start to become irrelevant, you need to factor in the cost of rearchitecting your solution.

Or you architect it properly from the start. I've done zero-downtime migration between AWS, GCE and Hetzner. I've had systems that tied together cloud instances, VMs running on our own hardware, containers running on dedicated instances, and VMs on rented hardware, all in a single system. If you run everything in containers anyway, all you need to make that happen is a simple orchestration system, a reliable network overlay, and an architecture that ensures reliable replication of your data.

Once you've done that, you're free to pick and choose and migrate services as you please depending on cost and need, and it really is not that hard to get working - you already do most of the necessary planning if you are setting up a reproducible cloud setup anyway.

It's more complicated than that. Sometimes the decision is not just in the engineers' hands. E.g. our clients (airlines) would very much prefer we use AWS over a smaller, lesser-known offering.

What justification do the airlines give for placing demands on your own suppliers?

It's pretty common for Fortune 500 type companies to ask for all sorts of intrusive terms they perceive as a benefit to them. Indemnity, long net payment terms, penalties for missed SLA, etc. They could be specifying AWS in this case to reduce latency and/or avoid the internet as a dependency. Maybe the client side is on AWS already, or they already have high bandwidth / dual path via AWS direct connect, and don't wish to do that for a different provider.

The reward being that once you're in, they are too big and slow to ever move away from your product.

"We want it because it makes our lives (specifically, probably some paperwork) easier. Do you still want our money or not?"

I'm not familiar with that part. Also, salespeople can make all kinds of promises to close the deal.

Not necessarily. StackOverflow runs on a couple of servers.

You get what you pay for. You can pay a lot of money and have almost no worries (and lots of free time), or pay a little money and have lots of worries (and no free time).

And since the OP was talking about Kubernetes, and you're talking about "lock-in", another reason I love AWS is it is not a lock-in device. I can use any service of AWS's by itself, without being forced to use anything else or do something in a proprietary way. All the interfaces are a command-line or REST API with JSON data formats. They are all designed to operate by themselves, so you can provide replacement components at any time, hosted anywhere.

On the other hand, you can't just use one part of K8s, because you have to at least set up and manage an entire cluster first. And there are dozens of services that K8s simply does not have, and that other hosting providers don't have either.

Baloney. All the major clouds offer managed k8s. And using k8s in no way precludes you from integrating with services that are not on k8s (of course though, k8s can run anything you can containerize). So AWS has vastly more "lock-in" debt than k8s.

To replace an AWS service, you just plug in a different service integration, or use a literal API-compatible clone. To replace the use of any part of K8s with something non-K8s, you basically have to replace all your use of anything in K8s. So the debt is much higher with k8s because it takes much more engineering work to get off it, and it's not compatible with anything but itself.

Furthermore, just because someone has a managed k8s doesn't make it less lock-in or less work. With AWS you don't need to use a cluster of anything. With k8s you are signing yourself up to tons of complex services and specific design and operation paradigms. With AWS you have no such inherent restrictions.

K8s is inherently more complex and difficult to use than AWS services, which aren't even a good comparison because they are so simple by comparison.

> You get what you pay for. You can pay a lot of money and have almost no worries (and lots of free time), or pay a little money and have lots of worries (and no free time).

This is, in my experience, quite a false statement when talking about AWS. Either you are a (team of) 10x engineer(s) capable of anything, or, as you grow in people and services, you will need to spend a good chunk of your developer time managing AWS, or hire a dedicated person (a sysadmin).

There are literally complete turn-key solutions in the AWS Marketplace for most things any business needs to do, which you don't need to be an engineer to use. A monkey who can read a walk-through with screenshots can build clusters of apps and CI/CD pipelines with AWS. And there are thousands of companies you can pay to help manage your AWS resources.

Just because businesses hire dedicated ops people doesn't mean they actually need to; they just feel better about it when they do, because that's how it was always done before.

Here's my two cents as a developer who has only paid the cost of AWS once for a one-off project (in retrospect I could've gotten more value with similar processing power with a different service).

AWS is established and relatively straightforward. I found the user experience to be pretty seamless. Maybe there are better alternatives, but I've found Google Cloud Platform and IBM Cloud to be absolutely miserable and confusing (the latter more so). WRT Google Cloud Platform, I think the main sin is the UI/UX, and if I had to foot the bill I would give them another consideration. I had very limited (but good) experience with DigitalOcean and Heroku, but those just don't have the infrastructure at scale to compete with AWS in my opinion.

What do you think is the best alternative to AWS?

I wonder if a lot of startups think "What if we're an overnight success and we have to scale up fast?". This is of course what Amazon makes sure to use in their marketing as well.

Examples: Twitter became far more successful than their architecture and infrastructure could handle back then.

Pokemon Go became an insanely huge success far exceeding their worst case scenario. There's a postmortem/whitepaper on Google Cloud Engine about that IIRC.

The white paper on Pokemon Go ramping up on GCP after launch:


It's all relative. For most businesses, productivity is more important than the cost of infrastructure because it creates more revenue to cover the cost, leading to a net profit.

That being said, it also makes sense to apply some basic optimizations to reduce costs by a significant margin, as long as the work involved in optimizing pays off. A few days to cut costs by 50% makes sense; a few months to save a few dollars doesn't.
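That rule of thumb is easy to sanity-check with a quick payback calculation (all figures below are hypothetical):

```python
def payback_days(monthly_bill, savings_fraction, eng_day_cost, eng_days_spent):
    """How many days of realized savings it takes to recoup the engineering effort."""
    daily_savings = monthly_bill * savings_fraction / 30
    return eng_day_cost * eng_days_spent / daily_savings

# Three days of work that halves a $20k/month bill pays for itself in about a week:
fast = payback_days(20_000, 0.50, 800, 3)    # ~7.2 days
# Two months of work shaving 3% off a $1k/month bill never realistically pays off:
slow = payback_days(1_000, 0.03, 800, 60)    # decades
```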

I have absolutely no idea how much my company is paying for AWS. I expect I will soon, but it's a little alarming that this data is not readily available to me. I think I am not the only one in that boat.

For people who know very well? I think it’s not unlike interpersonal dynamics. Codependency looks very different from outside versus inside. It’s difficult to tell why your friend keeps investing in this bozo, coming up with ever more elaborate ways to manage them and their moods.

And over lunch a mutual friend will discuss how they might be better off with someone else, and once in a great while someone will ask if maybe So-and-so might not be better off alone.

AWS is a dishonest partner. They have decided that you not knowing how much this is going to cost you is a good thing, and give no evidence that they are willing to change. This is who they are. Do the good things about them make up for that kind of bad thing? I don’t think so. I think you should look for another partner. Maybe just something light, not a serious relationship. Or maybe try some alone time and see what that’s like.

I am not sure they are obsessed with AWS. I think they are obsessed with scalability (both technical and financial), elasticity (turning off capacity that you do not use), security compliance (required by many industries), reliability, availability, and support in case something goes sideways. Containers and Linux are orthogonal to this.

If you need a few servers, then yeah, AWS is super pricey and locked in.

If you need 50 engineers to be able to manage 200ish servers, well, uh, you just don't have a ton of options.

Obviously not everyone falls into that second category but plenty do.

> If you need 50 engineers to be able to manage 200ish servers, well, uh, you just don't have a ton of options.

Who needs 50 engineers for 200 servers? My IT team manages 750 servers in 5 co-locations + 200 end users + DBA responsibilities with 5. If we grow above 1000 we might need to hire a 6th.

It's also a very common skill. A lot of people are familiar with AWS. What money you would save by using an alternative, you will lose on training new recruits, or mistakes made from lack of expertise.

Much of the knowledge carries over to other vendors that are just Linux. What AWS-specific feature are people so desperate to use that they get themselves locked in?

We moved to GCP from AWS because we thought it'd be cheaper. Turns out it wasn't that much cheaper at all. But Google paid for the transition so there's that...

Yes, and this is sad.

I have seen this as the outcome of a bubble which keeps repeating that "having to maintain is bad", "servers are bad" or "this is old".

AKA marketing.

According to what I think: accounting.

The bill of the cloud vendor is already approved, and the 3rd-party apps don't need a separate approval process.

You'll never get fired for suggesting AWS. And IAM is the killer feature.

I think it's probably the new "no one got fired for buying IBM"

Because AWS is not just server hosting company.

What do you suggest instead of AWS?

Two reasons come to mind:

1. Nobody gets fired for picking AWS.

2. A plentiful supply of talent that understands AWS and AWS patterns.

AWS can be expensive, but it comes down to engineering and needs.

I'd argue it's most expensive for small (10-50 person... ish) businesses -- too small to hire a lot of overhead staff, but not so small that you're still in the POC/MVP stage.

A lot of ideas can be tested in AWS essentially free. However, if you pick the wrong tools or don't anticipate future bills, you can end up with additional costs and need to re-engineer slightly.

That said, it works best for scalability-sensitive workloads: you only need 10,000 machines for a few hours/days (research workloads), your load is much heavier just 9-5, or you need vast, constantly changing capacity.

My org currently has 19,500 VMs in AWS right now -- they are student VMs running a mix of OSes. While many of them are offline for days or weeks at a time, we can easily start 5,000 of them in a minute or two if needed (though we usually don't see more than 400/minute). Sure, we could run these in a large virtualization system of our own, but we're only 3 developers, a single system admin, and one cloud-focused person (though we do cross-train a lot)... So AWS allows us to scale quickly without as much concern for the underlying architecture... purchasing equipment... dealing with hardware failures or warranty/replacement claims. We can spend our 9-5 building, improving, and optimizing the system (including reducing costs). With this setup, nearly anyone on our team has the capability to research a new component if we really want to add it (like websocket support) and play around (in a sandbox), without getting others involved until it is ready for a POC demo... Then we can have a new feature in production in days/weeks despite needing new equipment.

All that said, I'm not a cloud-is-the-only-way evangelist... If you don't need any of that, you can fairly easily make a cloud-agnostic system with containers... Using ECS for many typical use cases is rather portable to other container systems. If you're making larger Lambda packages that can handle many paths (i.e. ELB/API Gateway Lambda proxy), they can move with less work than 100 purpose-built functions.

I'd argue most people unhappy with AWS have leadership with silly goals or ideas sold to them (lift and shift)... Or staff who are not trained on the cloud provider, so they don't know the right considerations for building on the provider they pick. I'd wager many developers don't even know how to use DynamoDB properly for the first year or two they use it. It's hard to say it's AWS's fault if you have a large bill because your developers 'SELECT * FROM users' every time you look up a single user... and AWS scales to support the inefficient scan (how is a DB provider to know that your application logic only uses one value?).

Anyway, I hope this gives some insight on why orgs use AWS.

You can do a lot with lambda while keeping your costs down.

On the one hand, building fault tolerant infrastructure that can, as a side effect, work painlessly on spot instances is great.

On the other hand, you can purchase reserved instances and get ~60% cost savings with zero engineering work. It's worth thinking long and hard about whether the cost of engineering time is worth that next 20%.

There's also a lot of useful ground in between "critical state, must never be lost (like a database)" and "can handle being terminated with 2 minutes notice". A service that can be re-created if necessary but takes 10 minutes to start up is really scary if run on spot instances, but can still be pretty useful.
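One way to make that middle ground workable is to watch for the termination notice and start draining early. A minimal sketch, assuming EC2's spot `instance-action` metadata endpoint; the fetch function is injected so the logic can be exercised without a live metadata service:

```python
import json
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_termination(fetch):
    """Return the pending action dict, or None if no notice is posted.

    `fetch` takes a URL and returns (status_code, body); a 404 means
    no termination is currently scheduled.
    """
    status, body = fetch(METADATA_URL)
    if status == 200:
        return json.loads(body)  # e.g. {"action": "terminate", "time": "..."}
    return None

def http_fetch(url):
    """Real fetch against the instance metadata service."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""

def watch(fetch, drain, poll=lambda: time.sleep(5)):
    """Poll until a notice appears, then drain (you get ~2 minutes)."""
    while check_termination(fetch) is None:
        poll()
    drain()  # stop accepting new work, finish or hand off what you can
```

On a 200 response you have at most about two minutes, so a service with a 10-minute startup should at least stop accepting new work and hand off what it can rather than hope to restart in time.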

Back when I used AWS, I went the reserved instances route. The pricing is pretty okay, and you are guaranteed to have machines at busy periods of the day.

The problem with spot autoscaling is that if everyone does it, it stops working. Everyone in us-east-1 gets most of their traffic from 9am EST to 5pm EST, because everyone hosts close to their users, their users are human, and humans are diurnal. Most "batch" workloads that people have also follow the same cycle; they're working on something in the office and want results now, not tomorrow morning. So they run their batch jobs during the day. If you can figure out how to get your traffic spikes from 9pm until 5am, spot instances are going to be great -- millions of CPUs are sitting idle waiting for your novel workload. But if your customers are working 9-5 jobs like you, and you care about latency enough to host close to them... you're competing for instances with every other computer user in the region.

Most batch workloads are not that latency-sensitive. You can send them to the other side of the world, where everyone is asleep right now.

A thing to keep in mind is AWS's hefty outbound traffic cost, at $90/TB. So shipping the data overseas might not be affordable either.

Even for inter-AWS region transfers?

It depends on how the data is being transferred. In general, the answer is yes because you'd be using public IPs.

If you're transferring between EC2 instances in peered VPCs meaning you can still use private IPs, it's more like $20/TB.
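At the rates quoted in this thread, the gap adds up quickly for batch-sized datasets (a rough sketch; actual pricing varies by region and direction):

```python
# Back-of-envelope transfer costs at the figures quoted above ($/TB).
INTERNET_EGRESS_PER_TB = 90   # public-IP egress
PEERED_VPC_PER_TB = 20        # inter-region between peered VPCs, private IPs

def transfer_cost(terabytes, rate_per_tb):
    return terabytes * rate_per_tb

# Moving 50 TB of batch input to a far-away region:
public = transfer_cost(50, INTERNET_EGRESS_PER_TB)   # $4,500
peered = transfer_cost(50, PEERED_VPC_PER_TB)        # $1,000
```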


That's not been my experience, actually. The first CI service I used was hosted entirely in Europe, and when we needed to ssh in to debug something, the keystroke latency was maddening. We eventually unsubscribed and just bought a big EC2 instance in us-east with Jenkins running on it. It cost approximately 100x more, but our productivity was high and frustration low. Well worth it.

I personally think it will breed huge organizational problems if things like CI are slow. "I'll get a cup of coffee while this runs" and then you come back and forget what you were going to release. Soon it becomes "let's get another change into this build before we release" and then it's "well, it's been six months since we've released anything, what do we do." You have to start fast and stay fast if you want to keep developers productive. So saving a couple bucks on computers that are half a world away can end up being a huge expense if you're not careful.

As other comments mention, you also have to be careful about transfer costs. In the CI case, getting your source code into the CI server is cheap, but getting the containers out is going to cost you, especially if you don't make an effort to optimize them. For batch data processing jobs, the same applies; getting the result out is cheap, but getting the data in is going to be a lot of transfer. (If you were using Small Data, you could just run the job on your laptop, after all.)

The speed of computers half a world away is not great either. I remember updating some Samsung drivers once, which were served out of a Korean AWS region instead of CloudFront... and the downloads were glacially slow. Their website is the same way. I couldn't believe how a multinational corporation could push bits at me so slowly. When you're reading their documentation all day, or tweaking drivers, you notice it, and you start to think "next time I'm going to buy Intel". (Compare Samsung's SSD website with McMaster-Carr's website. What site do you hope to interact with again in the future?)

Anyway, you get a bill for compute resources, and you don't get a bill for unhappy employees context-switching all day, so I see why people want to craft clever schemes to save pennies on their compute costs. But be careful. Not every cost is charged directly to your credit card.

What about using something like mosh for latency?

Note, AWS savings plans make this even easier


^ Came here to say the above. The majority of the time, a Savings Plan is actually what you want.

Unless you have a legacy app you know must be around for 3 years, with zero effort going into refactoring it.

Any idea of when this product was introduced?

So is this blog post redundant?

No. Spot gives deeper discounts than savings plans do.

And you can stack them.

I did something like this on GKE with preemptible instances, which are guaranteed to go away at least once every 24 hours. I had separate node groups for stateless and stateful workloads. My clusters were roughly a 50/50 mix of each. It worked out pretty well and yielded some decent cost savings.
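For reference, that split mostly comes down to node labels. A minimal sketch of pinning a stateless workload onto the preemptible pool via GKE's `cloud.google.com/gke-preemptible` node label (the workload name and image are hypothetical):

```yaml
# Stateless workload: allowed to run only on the cheap, preemptible pool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-worker          # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: stateless-worker
  template:
    metadata:
      labels:
        app: stateless-worker
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      containers:
      - name: worker
        image: example/worker:latest   # hypothetical image
```

Stateful workloads simply omit the selector (or use an anti-affinity rule) so they land on the standard pool.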

Preemptible instances are worse than spot. We observed spot instances that lasted a year for us.

That presumes that an instance that lasts longer is "better".

Yes, preemptible instances are expected to be expendable, I don't see this as a bad thing.

I believe what he meant is that by being shut down more frequently, we are forced to build a more resilient system. Otherwise, we would become complacent.

Yeah, we were wondering the same thing... How much does this cost in engineering hours? It's like asking someone who smokes... they always forget/lie about at least 50% of the cigarettes... And even if the engineering time was minimal compared to reserved instances, will that be true for other teams?

I hear all this about engineering-hour costs, but only because engineering in the US is so expensive. We're in Southeast Asia, where a senior engineer cost only $1,000 a month back in 2016.

Isn't cost of living also much lower in Southeast Asia?

Sure, but AWS infrastructure costs pretty much the same no matter where you live.

I believe he was talking about how the engineering cost compares with the infrastructure savings you get from migrating.

E.g. if the infra cost goes down by 10,000 USD per month, they may say it's not worth it because they pay more than that for one US developer in a month.

Reserved instances and savings plans are predicated on predicting usage for 1 to 3 years in advance.

With spot instances, you don't need to do this planning, assuming you can fall back to on-demand quickly enough.

This is why I love Google Cloud over AWS these days: I can get that cost saving while doing none of the work and making none of the commitment. It's usually the commitment businesses hate, imho.

why not both :)

We did mix reserved and spot instances for our production workload. The worst-case scenario is reserved + on-demand.
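The blend can be sketched as a weighted average; the discount figures below are hypothetical but roughly in line with typical reserved (~60% off) and spot (~70% off) pricing:

```python
def blended_hourly(on_demand_rate, fractions, discounts):
    """Weighted average hourly rate for a fleet split across purchase options.

    fractions: share of capacity per option (must sum to 1);
    discounts: fraction off the on-demand rate per option.
    """
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(on_demand_rate * (1 - discounts[k]) * f
               for k, f in fractions.items())

# 40% reserved, 50% spot, 10% on-demand fallback on a $0.10/hr instance:
rate = blended_hourly(
    0.10,
    {"reserved": 0.4, "spot": 0.5, "on_demand": 0.1},
    {"reserved": 0.60, "spot": 0.70, "on_demand": 0.0},
)
# rate ≈ $0.041/hour vs $0.10 on-demand, i.e. roughly 59% saved
```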

And how much developer time did it cost?

We have done the same. Our bills went down, but not by as much as 80%; I think closer to 50%. But it took a fair bit of developer time, and we now have a lot of Kubernetes-related problems to deal with. I guess those will smooth out over time, but I don't think anyone ever factors this stuff in when they claim great savings. Developer time ain't cheap.

On a plus note, running multiple small boxes via Kubernetes does give you a more highly available system. If one instance goes down, there will still be another one available, so it's not all negative.

I think most people over-estimate the need for availability anyway. It's tempting to build the very best you can, when in reality a business really needs a good-enough solution. A whole ton of software can be down every weekend if needed.

A great in-between is to simply have a backup server ready to go in a few minutes' time. Super simple compared to an orchestrated container system.

Of course for client projects that spec out a certain number of 9's must be done just so, but can also be billed accordingly.

It goes back to the fact that Stack Overflow itself runs on a single beefed-up machine for all its traffic (with a backup machine, of course). What does this company do that needs so many instances? And they use the same tech too (.NET). Instead of thinking about that, people always over-engineer for "scale" to compensate for bad code.

Do you have any idea how big their actual database is? And how many clients they serve?

It does seem to be an example of a "standard" architecture done well. Our application has a tiny fraction of the traffic and it struggles with some things.

It's all public: https://stackexchange.com/performance

Here's a series of blog posts with a lot more detail by one of their devs: https://nickcraver.com/blog/archive/

They're very optimized and can serve all of their traffic on a single webserver, redis instance and SQL server.

Being a compiled program vs something in Python or Ruby probably makes the majority of the difference!

> And how much developer time did it cost?

The nice side effect is they did migrate to .NET Core

And as another commenter said, another side effect was they got fault tolerant infrastructure. So not every minute they spent on migrating to "spot" instances is dedicated to that.

What were the challenges, or what challenges do you still have? Currently my company runs on GKE, and auto-scheduling, even onto "spot" instances (called preemptible on GKE), is a breeze.

We basically only run jobs on the spot instances that don't need to run instantaneously, so it's really cheap for us.

Large uploads and downloads filling up memory and crashing pods. When we had one large machine it didn't have this problem.

When that developer leaves you will still be saving 50%. It should pay for itself over the long haul.

Sigh. I had a very long and detailed reply typed out on my phone about the travails of dealing with Kubernetes over the last 2 weeks. Then Safari decided to reload the page and it all got lost.

I’m literally emotionally drained after unsuccessfully working with k8s after 2+ weeks.

It’s incredibly over complicated and documentation is all over the place. I had a large write up of my experiences but those are lost and I don’t have the energy to retype all of that.

I simply wanted to utilize k8s to help provide some auto scaling and redundancies for a 10 year old service I run.

After 2 weeks of deep diving on this topic and getting essentially nowhere, even with the help of a friend who does this for his day job (and who ended up throwing up his hands, unable to help), I'm reluctantly done.

The technology is just not ready. It’s too complicated. The documentation isn’t sufficient. Sure you can document every nut and bolt, but if you can’t create simple patterns for people to follow you lose. There’s too much change going on between versions.

At my last 2 companies, they each had a team of 2-10 people working on implementing kubernetes. After over a year at each company, no significant progress had been made on k8s. Sure some stuff was migrated over but no significant services were running with it.

You are definitely right that it is a fast moving target and hence can be frustrating to work with at the moment, particularly if you are trying to get it running on-prem. It is still relatively early days, and there is plenty of distillation to come, before an easy predictable set of patterns emerge.

Not wanting you to go through the pain of trying to recreate your original post, but out of interest, what kinds of things were the primary areas of pain in your work?

I think this depends on where you run Kubernetes and how you set it up. We moved from AWS + ECS to GCP + Kubernetes. We use terraform to set it all up, so that consistency in tooling/patterns helps.

We had one full-time senior person doing the move (me) and 6 others coming in and out of the project through the year as their capacity allowed.

One of my clusters is running 5000+ containers with ease. Not huge by other company standards but big enough.

This is quite a neat strategy, leveraging elastic compute costs and kubernetes "self-healing". I'm surprised I haven't heard more about this kind of technique before.

I fully acknowledge this will only work in certain scenarios and for certain workloads, e.g. not ideal for long-running/cache/database-style services.

> leveraging elastic compute costs and kubernetes "self-healing"

The indirect effect of building on a system like this is that the recovery mechanisms get tested on a regular basis instead of just on the odd day when things fail.

Spot instances are like a natural chaosmonkey mode, with money being saved and forcing you to build failure tolerance, retries & circuit breakers early in dev.

GCP has a similar instance type called "preemptible", they're not quite as cheap as spot, but they don't "dry up" and they're guaranteed to go down every 24 hours.

This precludes one from becoming complacent with spot instances that rarely go away.

https://cloud.google.com/kubernetes-engine/docs/how-to/preem... states that preemptible VMs aren't guaranteed to be available.

You're right. Spot instances are a lot more stable than preemptible; we've seen spot instances that lasted a year for us.

This is the second response of this sort that I'm replying to.

"A lot more stable" isn't really a desirable characteristic of ephemeral compute capacity. In my experience, the less frequently the instances went away, the more complacent the operators became.

Preemptible instance are stable in the sense that you know they're going away within 24 hours and must be prepared for that.

true that.

Spot used to be that way, and the price was very volatile. But AWS tweaked the pricing model so that it's more stable.

To the point that, after a year or two of running spot instances, we didn't feel much difference between spot and on-demand. We got complacent.

Yeah but Kubernetes "self-healing" a lot of times isn't. Or it just trips over itself (stuck pods, stuck cronjobs, etc).

This is run of the mill for job schedulers (Nomad, Mesos/Marathon, Globus); it's just more accessible now, with k8s, containers, and VM spot pricing, than it was historically.

As you mention, definitely don’t do this where persistence is paramount (cache that is expensive to backfill upon recovery, database, etc) but it’s just fine for transient workloads or workloads you can rapidly and safely preempt and resume.

If your instance is stateless, and your app can easily self-heal, there’s lots of computing paradigms you can explore. Serverless/function is also an option.

Of course, at the end of the day rarely is something ever truly stateless.

Eh, I'm currently running an old version of DCOS which does a pretty terrible job of scaling down. Much more than just a scheduler is required to make this work well.

Yeah, your fleet can become fragmented.

> not ideal for long running/cache/database style services.

Well, one question to ask yourself when considering going down this route is whether it makes more sense to move all the statefulness into managed services, like Aurora, BigTable, S3, etc.

That drastically simplifies life. Now the only infrastructure directly managed by you are stateless workloads that can easily be self-healed, rolled back, scaled up/down, etc. Managed DBs are more expensive than running your own DB, but most likely the cost savings of moving the rest of the infrastructure to spot/preemptible outweighs this difference.

What kind of services are you running that require you to scale up and down? Why not get one or two dedicated servers and run everything on them? The post had no numbers, but I'm pretty sure you would come out even cheaper if you used dedicated servers, even managed ones.

I'm running a workload that takes about 3 minutes to run per request (compute heavy MVP), which means that a big surge of users once a day at peak would require a LOT of dedicated servers to serve in time.

My plan is to use dedicated servers for most of the load and some elastic capacity at peak loads if necessary.

I have a similar use case in mind. TBF, I plan to just keep a small EC2 instance up for this need and assemble a nice PC at home to catch up with the queue for heavy workloads: the cost is so much cheaper and I get a PC as well! Worst case, I spin up one more worker with more specs if the queue gets long. Sounds like less effort than doing all this scaling work for an MVP, and counterproductive when I'd rather spend more time on my actual logic.

That's my initial plan as well, until the requirements get too high for my PC :)

author here: please note that this is what we did in 2016-2017, when kops (what we use to provision) did not support spot fleets yet.

also, this worked out so well for our use case because we were using .NET Framework at the time, so the cost saving was huge.

a lot has changed since then.

Also, this strategy is not limited to AWS; similar types of instances are also available on Azure, GCP, etc...

How much would it cost to host this on bare-metal and co-lo servers, I wonder. Probably orders of magnitude less, but only if your ops costs are very low. If your developers have a DevOps culture, it's doable.

our clients (airlines) would very much prefer we use AWS over the smaller, lesser known offering.

Very interesting, seeing as airlines don't move fast tech-wise and usually have their own DCs.

If you want to play with an equivalent (barebones, spot instance) K3s Azure setup I use, the template is available here:


This is NOT For production use (that’s what the managed AKS service is for), but I like to tinker with the internals, so I keep an instance running permanently with a few toy services.

Thanks, this would be useful to spin up a k3s cluster for testing.

Btw, Spotinst (we, I work there) released Ocean in 2018/2019, which is the K8s equivalent of our EC2 solution (Elastigroup), and Eco, which is our AWS reserved instance recommendation product. I won't start a whole marketing speech, but I'll just say that Spotinst moved away from being just a "cost saving solution" and is now a more rounded cloud management solution (ease of use, cost monitoring and insights).


The whole idea was from the Spotinst blog. Thanks a lot! I just glued all the open-source projects together with some changes here and there. If the idea hadn't worked, I would def have considered using Spotinst.

However, every cost saving was important for our startup back then. We were a small shop in Southeast Asia, where a senior engineer cost merely $1,000 a month. I was thinking maybe I could save the cut from Spotinst too :)

Now that is the cut-throat cost saving strategy we like to see! :)

.NET core has been a godsend in making .NET an interoperable option for cloud architectures. I'm continually impressed with the open source embrace from Microsoft.

> Now, the biggest sunk cost are obviously RDS and EC2.

I don't understand how something you pay for every month is being considered a sunk cost? Am I missing an up front charge, or does the writer not understand what it means?

This is pretty common, databricks for example uses regular instances for the driver and spot instances for workers by default.

I suspect doing that will break down in the case of large Spark shuffles though?

It's the default but you can change it. Most people will appreciate the cost savings

Love this thread. If you are using K8s and want to reduce both the time you spend managing compute infra and the associated cloud costs (whether for AWS, Azure or GCP), a DIY Spot Instance or Preemptible VM approach is certainly possible, but will require a lot of setup work. Imagine handling multiple autoscaling groups for multiple Spot Instance types (an absolute necessity to diversify interruption risk), dealing with slow autoscaling or classic autoscaling that only considers cpu/mem and not actual pod requirements, having no easy way to create a buffer of spare nodes for high-priority workloads that suddenly need capacity, or identifying over-provisioned machine sizes (based on incorrect pod requests) which greatly exceed the actual needs of your pods. As an alternative, you can try Spotinst's Ocean product (yes, I work there) for K8s and ECS, where not only is your infra management simplified, but you can easily reduce your cloud compute cost by 80%.
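The "actual pod requirements" point matters because summing cpu/mem can undercount nodes when pod requests don't pack evenly. A simplified, illustrative first-fit sketch of the kind of check a requirements-aware autoscaler performs (names and numbers are made up):

```python
def nodes_needed(pods, node_cpu_m, node_mem_mi):
    """First-fit-decreasing estimate of nodes required for pending pods.
    pods: list of (cpu_millicores, memory_MiB) requests. Simplified sketch."""
    nodes = []  # remaining (cpu, mem) headroom per provisional node
    for cpu, mem in sorted(pods, reverse=True):
        for i, (c, m) in enumerate(nodes):
            if cpu <= c and mem <= m:
                nodes[i] = (c - cpu, m - mem)
                break
        else:
            if cpu > node_cpu_m or mem > node_mem_mi:
                raise ValueError("pod cannot fit on this instance type at all")
            nodes.append((node_cpu_m - cpu, node_mem_mi - mem))
    return len(nodes)

# Three 2500m pods on 4-core nodes: naive cpu sum says 7500/4000 -> 2 nodes,
# but each node only fits one such pod, so you actually need 3.
print(nodes_needed([(2500, 2048)] * 3, node_cpu_m=4000, node_mem_mi=8192))
```

A cpu-average autoscaler gets exactly this wrong, which is why packing against real pod requests is worth the extra machinery.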

I also did this with GCP preemptible instances, and it worked great for a while until I found out one random day that networking issues may also occur in addition to your instances shutting off within 24 hours. On sandbox clusters, though, it's been very smooth for over half a year. Highly recommend.

We use preemptibles for our CI fleet and it's great. We can run a hundred instances at full tilt boogie for 8 hours a day and the nodepool downscales to zero while we sleep. It's a no brainer if your controller (and use case) can handle preemption gracefully.
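"Handle preemption gracefully" on GCP typically means watching the metadata server's `instance/preempted` flag and draining before the node goes away. A hedged sketch with the HTTP fetch injected so the loop runs outside a VM; the endpoint path is GCP's preemption flag as best I recall, and the drain logic is hypothetical:

```python
import time

# GCE metadata flag that flips to "TRUE" when the VM is being preempted
# (requests to it need a "Metadata-Flavor: Google" header in production).
PREEMPTED_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

def watch_for_preemption(drain, fetch, poll_seconds=5, max_polls=None):
    """Poll the metadata flag; call drain() once when preemption is signalled.
    `fetch` is injected so the loop is testable without a metadata server."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if fetch(PREEMPTED_URL).strip() == "TRUE":
            drain()  # e.g. stop accepting CI jobs, flush logs, requeue work
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

In production `fetch` would be something like `requests.get(url, headers={"Metadata-Flavor": "Google"}).text`, and `drain` whatever your CI controller needs to requeue in-flight jobs.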

What CI software do you use? I played around with spot instances and Jenkins, and it was quite a poor experience.

We use Jenkins to invoke Tekton pipelines (https://github.com/tektoncd/pipeline) with a wrapper we wrote. The pipeline runs, outputs junit to a bucket and we pull it back and give it to jenkins. Was a bit of a lift to get working out of the gate but it's been mostly smooth (and flexible and cheap) since then.

Disclosure: I work for Google Cloud (and on preemptible VMs).

There isn’t anything about preemptible to “cause networking issues”. We may have had a general networking outage, if that’s what you experienced, but we don’t additionally make networking worse for preemptible. You’re likely to get shot, but we don’t adjust throttles or anything.

Thanks for letting me know. Overall, we use them for highly available, non-mission-critical tasks anyway, but now that you say this, it gives me more confidence in recommending it. It was strange; it only affected one of the two GKE clusters (both had preemptible nodes). I tried debugging and found that it wasn't able to even ping any external IPs (which is why I thought maybe preemptible nodes also might have networking issues). There weren't any notices about general networking outages within the last ±48 hours (I subscribe to the GCP outages RSS via Slack). I ended up manually rebuilding the nodes (temporarily with non-preemptible nodes) and it went away. As soon as healthchecks were failing, I created a debug container, kubectl exec -it bashed in, and saw that it wasn't able to connect to the internet. It could ping/telnet internal database IPs just fine. It does warn me that it's in beta so, again, maybe you're working through some things, but I thought you should know in case other people have experienced this.

Did your instances have external IPs? We had connectivity issues when using NAT due to it dropping SYN packets. We just gave everything that needed to talk to external IPs their own external IPs.

Glad you found a workaround, but please file a support ticket for this. If you don’t have support: (a) consider it and/or (b) send me an email with your info (contact in profile) and I’ll file a bug.

We actually did have a dedicated Googler who oversaw our move from AWS to GCP. I asked him about it but never got a reply.

Anyway, yes. It caused us problems. Attempts to connect to machines outside of GCP would fail repeatedly due to the NAT dropping the SYN. This would mean our connect() call would time out repeatedly. There was an ICMP response I could see with tcpdump:

ICMP host aaa.bbb.ccc.ddd unreachable - admin prohibited filter, length 68
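When a NAT eats SYNs like this, the client-side symptom is exactly a connect() timeout, so the usual band-aid while the routing gets fixed is a retry with backoff. A generic, illustrative wrapper (not from the thread):

```python
import time

def retry(fn, attempts=4, base_delay=0.5,
          retry_on=(TimeoutError, ConnectionError)):
    """Retry a flaky call with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # 0.5s, 1s, 2s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))

# e.g. retry(lambda: socket.create_connection((host, 443), timeout=5))
```

This masks occasional dropped SYNs but obviously doesn't fix the NAT itself, hence the sibling comments' workaround of attaching external IPs.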
