
We use Kubernetes and spot instances to reduce EC2 billing up to 80% - talonx
https://tuananh.net/2020/02/20/the-story-behind-my-talk-cloud-cost-optimization-at-scale/
======
threeseed
I hope people don't go and take this advice and just run everything on Spot,
as that would be a mistake.

It is very common for AWS to completely run out of entire classes of instance
types, e.g. all of R5 or all of M5. And when that happens, your cluster will
die.

What you want to do is split your cluster into a minimum of two node groups,
e.g. Core and Task:

Core: On-Demand for all of your critical and management apps, e.g. monitoring.
Task: Spot for your random, ephemeral jobs that aren't a big deal if they need
to be re-run.

So, for a Spark cluster for example, you would pin your driver to the Core
nodes and let the executors run on the Task nodes.
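
A minimal sketch of what that pinning can look like on Kubernetes, assuming
the node groups expose a label such as `node-group` (the label and image
names here are made up):

```yaml
# Hypothetical sketch: pin a Spark driver to the on-demand "core" group.
# Assumes nodes were labeled at provision time, e.g. node-group=core for
# On-Demand nodes and node-group=task for Spot nodes.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
spec:
  nodeSelector:
    node-group: core           # never schedule the driver on spot capacity
  containers:
    - name: driver
      image: my-spark:latest   # assumed image name
```

Executors can then target `node-group: task` with the same mechanism.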

~~~
someone13
Shout-out to "AutoSpotting", which transparently re-launches a regular
On-Demand ASG on spot instances and will fall back to regular instances:
[https://github.com/AutoSpotting/AutoSpotting/](https://github.com/AutoSpotting/AutoSpotting/)

Combined with the fact that you can have an ASG with multiple instance types
([https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/](https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/)),
this means you can be reasonably certain you'll never run out of capacity
unless AWS runs out of every single instance type you have requested,
terminates your Spot instances, and you can't launch any more On-Demand ones.

(and even so, set a minimum percentage of On-Demand in AutoSpotting to ensure
you maintain at least some capacity)
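
For context, a minimal CloudFormation sketch of such a mixed-instances group
(the launch template and subnet references are placeholders):

```yaml
# Hypothetical sketch of an ASG mixing On-Demand and Spot capacity.
Resources:
  MixedASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "20"
      VPCZoneIdentifier: !Ref SubnetIds          # assumed parameter
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 2                  # always keep 2 On-Demand nodes
          OnDemandPercentageAboveBaseCapacity: 25  # 25% On-Demand above the base
          SpotAllocationStrategy: lowest-price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref LaunchTemplate  # assumed resource
            Version: !GetAtt LaunchTemplate.LatestVersionNumber
          Overrides:                               # diversify across types
            - InstanceType: m5.xlarge
            - InstanceType: m5a.xlarge
            - InstanceType: r5.xlarge
```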

~~~
ignoramous
ASG, per the blog post you linked to, now supports starting both on-demand
and spot instances, so what's the use of AutoSpotting?

~~~
alien_
The author of AutoSpotting here. This gets asked often, and I'm happy to
clarify it.

The mixed-capacity ASGs currently run at decreased capacity when they fail to
launch spot instances. AutoSpotting will automatically fail over to on-demand
capacity when spot capacity is lost, and back to spot once it can be launched
again.

Another useful feature is that it most often requires no configuration of
older on-demand ASGs, because it can just take them over and replace their
nodes with compatible spot instances.

This makes it very popular with people who run legacy infrastructure that
can't be tampered with for whatever reason, as well as for large-scale
rollouts across hundreds of accounts. Someone recently deployed it on
infrastructure still running on EC2 Classic, started in 2008 or so, that
hadn't been touched in years.

Another large company deployed it with the default opt-in configuration
against hundreds of AWS accounts owned by as many teams, many with legacy
instances that had been running for years. A mass migration would normally
have taken them years to coordinate, but it took them just a couple of months
to migrate to spot. Teams could opt in and try it out on their application,
or opt out known sensitive workloads. A few weeks later they centrally
switched the configuration to opt-out mode, converting most of their
infrastructure to spot literally overnight and saving lots of money with very
little configuration effort and very little disruption to the teams.
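
Concretely, opting in an existing group is tag-driven; a rough sketch with
made-up resource names (see the FAQ for the exact tag semantics):

```yaml
# Hypothetical sketch: opting a legacy ASG into AutoSpotting via a tag.
Resources:
  LegacyASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "10"
      AvailabilityZones: !GetAZs ""                   # assumed; any existing config works
      LaunchConfigurationName: !Ref OldLaunchConfig   # assumed resource
      Tags:
        - Key: spot-enabled               # AutoSpotting's opt-in marker
          Value: "true"
          PropagateAtLaunch: false
```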

If you want to learn more about it, have a look at our FAQ at
[https://autospotting.org/faq/index.html](https://autospotting.org/faq/index.html)

It's also the most prominent open-source tool in this space. Most of the
competition consists of closed-source, commercial (and often quite expensive)
tools, so if you're currently having any issues or missing functionality,
anyone skilled enough can submit a fix or improvement as a pull request.

~~~
616c
Where can I read about some of these more impressive use cases you describe?

~~~
alien_
Have a look at
[https://github.com/AutoSpotting/AutoSpotting](https://github.com/AutoSpotting/AutoSpotting)
or the FAQ section on [https://autospotting.org](https://autospotting.org)

If those don't answer your questions, feel free to reach out to me and I'll
do my best to explain further.

------
Jnr
Why are people so obsessed with AWS? It is one of the most expensive hosting
solutions, and it tries hard to lock you into its ecosystem.

I somewhat understand why enterprises want to use it, but why are small
startups using it so much and then complaining about the cost?

Nowadays, when we have high-speed internet and a lot of things are
containerized, it is so simple to change hosting partners. Just pick one that
doesn't cost an arm and a leg, and move to a different one if it doesn't fit
well.

I have used Linux containers for 10 years now and changed hosting a few times,
each time reducing costs even more. Yes, it is a bit of manual labour, but if
you have someone with sysadmin/devops skills, it is easily doable.

~~~
rumanator
> Why are people so obsessed with AWS? It is one of the most expensive hosting
> solutions, and it tries hard to lock you into its ecosystem.

I agree with you, and that's why I try to get the point of view of those who
actually decide to adopt AWS. They aren't crazy or stupid, and as AWS is the
world's leading cloud provider, it's highly doubtful that the decision is
irrational and motivated by ignorance.

So far, the main fallacy in discussions of companies picking AWS is the idea
that cost is relevant. It isn't. AWS might overcharge a lot, but the truth of
the matter is that for any sizeable corporation it's irrelevant whether they
spend 200€ or 400€ on their cloud infrastructure. It's far less than a salary,
and it's even less than the bill for some office utilities. So once the
infrastructure foot is in the door, why would management worry about cost?
What they do care about is uptime and development speed, because those have a
direct impact on productivity, and thus on the value extracted from salaries.
If a particular service provider enables you to show tangible results in no
time at all (e.g. spinning up a database, message broker, or workflow in next
to no time), they don't mind paying a premium that matches, say, their air
conditioning bill.

~~~
Legogris
For a startup, it can work out like this: start out on AWS/GCP/Azure in the
initial phase, when you want to optimize for velocity in pushing out new
functionality and services. When you start to require several message queues,
different data stores, dynamic provisioning, and high availability, you save
a lot on setup and maintenance - the initial cost of getting your own private
cloud up and running, and doing so stably, is not to be underestimated.
Especially when you're still exploring and haven't figured out the best
technologies for you long-term.

Then, at some point, that dynamic changes: you have a better understanding of
your needs, the bills start to build up, and the architecture is in less
flux. You might also have a bigger team and can afford to start allocating
more resources to operations. That is the point when it might make sense to
migrate over to self-managed.

At the same time, there's scalability, which might be more of a key point for
even larger organizations.

I think building somewhat cloud-agnostically to ease the friction of provider
migration is good regardless, but do so pragmatically and look at the APIs
from a service perspective.

Kubernetes? All the bigger providers have alternatives, and you can run your
own. Fargate? You're going to have to do some rewrites. MemoryStore? Just
swap in another Redis instance. BigTable? Highly GCP-specific. Etc.

Not saying there aren't a lot of companies who choose the wrong provider for
the wrong reasons, but it can also be part of a conscious strategy. Also,
nobody ever got fired for buying IBM, and so on.

~~~
rumanator
> Then, at some point, that dynamic changes: you have a better understanding
> of your needs, the bills start to build up, and the architecture is in less
> flux. You might also have a bigger team and can afford to start allocating
> more resources to operations. That is the point when it might make sense to
> migrate over to self-managed.

I completely agree, and I have had this discussion with my direct manager in
the past. Yet, even if the potential savings are significant, managers might
not be too keen on investing in switching your infrastructure. Running your
own infrastructure is risky, and although top managers enjoy lower utility
bills, they don't enjoy the sight of a greater risk of suffering downtime,
especially if the downtime is self-inflicted and affects business-critical
services.

So, if this transition doesn't go perfectly smoothly... the people signing
off on the migration to self-hosted services might be risking their whole
career on a play that, at best, only brings some short-term operational cost
savings. Does this justify a move whose best-case scenario is equivalent to
an AWS discount?

~~~
ghaff
Moving workloads in-house does happen. But, in general, you're right. It's
hard to advocate for a near-term expensive (in time and money) and at least
somewhat risky (expect some nights-and-weekends crises) migration for
possibly some longer-term cost benefit (assuming you've accounted for all the
costs) - which, BTW, neither you nor your manager may still be around to take
credit for. It's also, BTW, at least somewhat counter to what companies are
doing in general, for better or worse, and execs will probably rightly see it
as a potential distraction from whatever the company is trying to accomplish.

Frankly the whole discussion mostly highlights that these are things you need
to think about upfront before you're fully committed.

~~~
harikb
Back in the day, when I was part of a startup, the DB guy was all into making
us write “provider-agnostic” SQL in case we ever wanted to switch to MySQL or
Oracle. We were actually using Postgres. This was a nightmare.

Things started improving when we said ‘f-it we are not moving out of Postgres,
let us at least use the best features of PG’

There is a similar problem when trying to use AWS with the constant thought
of moving off AWS at some point.

~~~
Legogris
Yeah, this is a bit of what I mean by doing it pragmatically - at least when
you choose provider-specific services, know that either 1) you have an idea
of how you would migrate off them, or 2) it is a conscious decision to
leverage a USP. By taking the provider-agnostic paradigm to the extreme, you
get the lowest common denominator and none of the upsides.

------
juliansimioni
On the one hand, building fault-tolerant infrastructure that can, as a side
effect, work painlessly on spot instances is great.

On the other hand, you can purchase reserved instances and get ~60% cost
savings with zero engineering work. It's worth thinking long and hard about
whether the cost of engineering time is worth that next 20%.

There's also a lot of useful ground in between "critical state, must never be
lost (like a database)" and "can handle being terminated with 2 minutes'
notice". A service that can be re-created if necessary but takes 10 minutes
to start up is really scary to run on spot instances, but can still be pretty
useful.

~~~
NikolaeVarius
Note that AWS Savings Plans make this even easier:

[https://aws.amazon.com/savingsplans/](https://aws.amazon.com/savingsplans/)

~~~
40acres
Any idea when this product was introduced?

~~~
NikolaeVarius
End of 2019: [https://aws.amazon.com/blogs/aws-cost-management/reinvent-round-up-savings-plans-cost-categories-and-more/](https://aws.amazon.com/blogs/aws-cost-management/reinvent-round-up-savings-plans-cost-categories-and-more/)

~~~
40acres
So is this blog post redundant?

~~~
QuinnyPig
No. Spot gives deeper discounts than savings plans do.

~~~
chillydawg
And you can stack them.

------
collyw
And how much developer time did it cost?

We have done the same - our bills went down, but not by as much as 80%; I
think closer to 50%. But it took a fair bit of developer time, and we now
have a lot of Kubernetes-related problems to deal with. I guess those will
smooth out over time, but I don't think anyone ever factors this stuff in
when they claim great savings. Developer time ain't cheap.

On a plus note, running multiple small boxes via Kubernetes does give you a
more highly available system. If one instance goes down, there will still be
another one available, so it's not all negative.

~~~
ramraj07
Goes back to the fact that Stack Overflow itself runs on a single beefed-up
machine for all their traffic (with a backup machine, of course). What does
this company do that needs so many instances? And they use the same tech too
(.NET). Instead of thinking about that, people always over-engineer for
"scale" to compensate for bad code.

~~~
collyw
Do you have any idea how big their actual database is? And how many clients
they serve?

It does seem to be an example of a "standard" architecture done well. Our
application has a tiny fraction of the traffic and it struggles with some
things.

~~~
manigandham
It's all public:
[https://stackexchange.com/performance](https://stackexchange.com/performance)

Here's a series of blog posts with a lot more detail by one of their devs:
[https://nickcraver.com/blog/archive/](https://nickcraver.com/blog/archive/)

They're very optimized and can serve all of their traffic on a single web
server, Redis instance, and SQL Server.

------
geuis
Sigh. I had a very long and detailed reply typed out on my phone about the
travails of dealing with Kubernetes over the last 2 weeks. Then Safari
decided to reload the page and it all got lost.

I’m literally emotionally drained after 2+ weeks of unsuccessfully working
with k8s.

It’s incredibly overcomplicated and the documentation is all _over the
place_. I had a large write-up of my experiences, but it’s lost and I don’t
have the energy to retype all of it.

I simply wanted to use k8s to provide some auto-scaling and redundancy for a
10-year-old service I run.

After 2 weeks of deep-diving on this topic and getting essentially nowhere,
even with the help of a friend who does this for his day job - who ended up
waving his hands, unable to help - I’m reluctantly done.

The technology is just not ready. It’s too complicated. The documentation
isn’t sufficient. Sure, you can document every nut and bolt, but if you can’t
create simple patterns for people to follow, you lose. There’s too much
change going on between versions.

At my last 2 companies, they each had a team of 2-10 people working on
implementing Kubernetes. After over a year at each company, no significant
progress had been made on k8s. Sure, some stuff was migrated over, but no
significant services were running on it.

~~~
stewartm
You are definitely right that it is a fast-moving target and hence can be
frustrating to work with at the moment, particularly if you are trying to get
it running on-prem. It is still relatively early days, and there is plenty of
distillation to come before an easy, predictable set of patterns emerges.

Not wanting you to go through the pain of trying to recreate your original
post, but out of interest, what kinds of things were the primary areas of
pain in your work?

------
PudgePacket
This is quite a neat strategy, leveraging elastic compute costs and
Kubernetes "self-healing". I'm surprised I haven't heard more about this kind
of technique before.

I fully acknowledge this will only work in certain scenarios and for certain
workloads, e.g. not ideal for long-running/cache/database-style services.

~~~
gopalv
> leveraging elastic compute costs and kubernetes "self-healing"

The indirect effect of building on a system like this is that the recovery
mechanisms get tested on a regular basis instead of just on the odd day when
things fail.

Spot instances are like a natural chaos-monkey mode: money gets saved, and
you're forced to build failure tolerance, retries, and circuit breakers early
in development.

~~~
halbritt
GCP has a similar instance type called "preemptible". They're not quite as
cheap as spot, but they don't "dry up", and they're guaranteed to go down
every 24 hours.

This precludes one from becoming complacent the way you can with spot
instances that rarely go away.

~~~
tuananh
You're right. Spot instances are a lot more stable than preemptibles; we've
seen spot instances last a year for us.

~~~
halbritt
This is the second response of this sort that I'm replying to.

"A lot more stable" isn't really a desirable characteristic of ephemeral
compute capacity. In my experience, the less frequently the instances went
away, the more complacent the operators became.

Preemptible instances are stable in the sense that you know they're going
away within 24 hours and you must be prepared for that.

~~~
tuananh
True that.

Spot used to be that way, and the price was very sensitive, but AWS tweaked
it so that it's more stable.

To the point that, after a year or two of running spot instances, we didn't
feel much difference between spot and on-demand. We got complacent.

------
z3t4
What kind of services are you guys running that require you to scale up and
down? Why not get one or two dedicated servers and run everything on them!?
The post had no numbers, but I'm pretty sure you would come out even cheaper
if you used dedicated servers, even managed ones.

~~~
tpxl
I'm running a workload that takes about 3 minutes per request (a
compute-heavy MVP), which means that a big surge of users once a day at peak
would require a LOT of dedicated servers to serve in time.

My plan is to use dedicated servers for most of the load and some elastic
capacity at peak loads if necessary.

~~~
ramraj07
I have a similar use case in mind. TBF, I plan to just keep a small EC2
instance up for this need and assemble a nice PC at home to catch up with the
queue for heavy workloads; the cost is so much cheaper, and I get a PC as
well! Worst case, I spin up one more worker with better specs if the queue
gets long. It sounds like less effort than doing all this scaling work for an
MVP, and counterproductive when I'd rather spend time on my actual logic.

~~~
tpxl
That's my initial plan as well, until the requirements get too high for my PC
:)

------
tuananh
Author here: please note that this is what we did in 2016-2017, when kops
(what we used for provisioning) did not yet support spot fleets.

Also, this worked out so well for our use case because we were using .NET
Framework at the time, so the cost savings were huge.

A lot has changed since then.

Also, this strategy is not limited to AWS; similar instance types are
available on Azure, GCP, etc.

------
sgt
How much would it cost to host this on bare metal and co-lo servers, I
wonder. Probably orders of magnitude less, but only if your ops costs are
very low. If your developers have a DevOps culture, it's doable.

~~~
tuananh
Our clients (airlines) would very much prefer we use AWS over smaller,
lesser-known offerings.

~~~
aiisjustanif
Very interesting, seeing as airlines don’t move fast tech-wise and usually
have their own DCs.

------
rcarmo
If you want to play with an equivalent (barebones, spot instance) K3s Azure
setup I use, the template is available here:

[https://github.com/rcarmo/azure-k3s-cluster](https://github.com/rcarmo/azure-k3s-cluster)

This is NOT for production use (that’s what the managed AKS service is for),
but I like to tinker with the internals, so I keep an instance running
permanently with a few toy services.

~~~
tuananh
Thanks. This would be useful for spinning up a k3s cluster for testing.

------
Thristle
BTW, Spotinst (we, I work there) released Ocean in 2018/2019, which is the
K8s equivalent of our EC2 solution (Elastigroup), and Eco, which is our AWS
reserved-capacity recommendation product. I won't launch into a whole
marketing spiel, but I'll just say that Spotinst has moved beyond being just
a "cost saving solution" and is now a more rounded cloud management solution
(ease of use, cost monitoring, and insights).

~~~
tuananh
Hi,

The whole idea came from the Spotinst blog. Thanks a lot! I just glued all
the open-source projects together with some changes here and there. If the
idea hadn't worked, I would definitely have considered using Spotinst.

However, every cost saving was important for our startup back then. We were a
small shop in Southeast Asia, where a senior engineer cost merely $1,000 a
month. I was thinking maybe I could save Spotinst's cut too :)

~~~
Thristle
Now that is the cut-throat cost saving strategy we like to see! :)

------
saym
.NET Core has been a godsend in making .NET an interoperable option for cloud
architectures. I'm continually impressed by Microsoft's embrace of open
source.

------
jimnotgym
> Now, the biggest sunk cost are obviously RDS and EC2.

I don't understand how something you pay for every month can be considered a
sunk cost. Am I missing an up-front charge, or does the writer not understand
what the term means?

------
khc
This is pretty common; Databricks, for example, uses regular instances for
the driver and spot instances for workers by default.

~~~
johnc1231
I suspect doing that will break down in the case of large Spark shuffles
though?

~~~
khc
It's the default, but you can change it. Most people will appreciate the cost
savings.

------
zdawg
Love this thread. If you are using K8s and want to reduce both the time you
spend managing compute infra and the associated cloud costs (whether on AWS,
Azure, or GCP), a DIY Spot Instance or Preemptible VM approach is certainly
possible, but it will require a lot of setup work. Imagine:

- handling multiple autoscaling groups for multiple Spot Instance types - an
absolute necessity to diversify interruption risk;

- dealing with slow autoscaling, or classic autoscaling that only considers
CPU/memory and not actual pod requirements;

- having no easy way to create a buffer of spare nodes for high-priority
workloads that suddenly need capacity;

- identifying over-provisioned machine sizes (based on incorrect pod
requirements) which greatly exceed the actual needs of your pods.

As an alternative, you can try Spotinst's Ocean product (yes, I work there)
for K8s and ECS, where not only is your infra management simplified, but you
can easily reduce your cloud compute cost by 80%.

------
johnjungles
I also did this with GCP preemptible instances, and it worked great for a
while, until I found out one random day that networking issues may also occur
in addition to your instances being shut off within 24 hours. On sandbox
clusters, though, it's been very smooth for over half a year. Highly
recommend.

~~~
transect
We use preemptibles for our CI fleet and it's great. We can run a hundred
instances at full-tilt boogie for 8 hours a day, and the node pool downscales
to zero while we sleep. It's a no-brainer if your controller (and use case)
can handle preemption gracefully.

~~~
__float
What CI software do you use? I played around with spot instances and Jenkins,
and it was quite a poor experience.

~~~
transect
We use Jenkins to invoke Tekton pipelines
([https://github.com/tektoncd/pipeline](https://github.com/tektoncd/pipeline))
with a wrapper we wrote. The pipeline runs, outputs JUnit results to a
bucket, and we pull them back and hand them to Jenkins. It was a bit of a
lift to get working out of the gate, but it's been mostly smooth (and
flexible and cheap) since then.
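
A minimal sketch of what such a Tekton Task could look like; the images,
bucket name, and test command here are assumptions, not their actual setup:

```yaml
# Hypothetical two-step Task: run tests, then push the JUnit XML to a bucket.
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: run-tests
spec:
  workspaces:
    - name: source                         # the checked-out repo
  steps:
    - name: test
      image: node:14                       # assumed build image
      workingDir: $(workspaces.source.path)
      script: |
        npm ci
        npm test                           # assumed to emit junit.xml
    - name: upload-results
      image: google/cloud-sdk:slim
      workingDir: $(workspaces.source.path)
      script: |
        # park the report where the Jenkins wrapper can fetch it later
        gsutil cp junit.xml gs://ci-results/$(context.taskRun.name)/junit.xml
```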

------
varelaz
Well, I don't think that using spot instances is such a good idea here.

1) What if all of them go down - would that make the application unusable? I
recall one or two weeks on AWS when all spots were down because of capacity
problems.

2) You cannot rely on requests completing on spot instances; they can go up
and down whenever. For background jobs that's fine, you can retry, but for
UX?

And if you need this approach to save costs instead of migrating to bare
metal, which is usually ~10 times cheaper than AWS, then something is wrong
with your architecture and you are locked in.

~~~
tuananh
> save costs instead of migrating to bare metal,

Like I mentioned in other comments: due to our use case, bare metal is out of
the question, since the clients (airlines) would very much prefer we use AWS
over smaller, lesser-known offerings.

~~~
varelaz
We recently migrated from AWS to Hetzner. On AWS we used combined clusters
with on-demand + spot instances. After migrating to Hetzner we doubled
capacity and still saved around half. AWS is very good for flexible
workloads, but if the load and process are constant and well defined, bare
metal hosting is much more effective. And there are a lot of well-known bare
metal hosting providers on the market.

~~~
tuananh
if workload is constant, nothing beats baremetal in term of cost.

------
rasikjain
Good to see .NET Core in the article, and Microsoft embracing more
open-source solutions. Thanks to the interoperability, applications can be
deployed at lower cost on containers.

------
prchaudhari_007
We are using a similar strategy. We have 4 different node groups: 2 with
on-demand instance types and 2 with spot instance types. It has been working
really well for us. We are doing some further optimizations to reduce costs
and improve the performance of our stack. Also, we have scheduled Jenkins
jobs which kill the staging infra every evening and start it up in the
morning; we also keep staging completely down over the weekends.
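
For reference, plain AWS scheduled actions can express that kind of calendar
directly on an ASG, without Jenkins in the loop; a rough sketch (the group
reference and times are made up):

```yaml
# Hypothetical sketch: staging down every evening, back up weekday mornings.
# No weekend scale-up action means staging stays down over the weekend.
Resources:
  StagingScaleDown:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref StagingASG   # assumed resource
      DesiredCapacity: 0
      Recurrence: "0 19 * * *"                # 19:00 UTC, every day
  StagingScaleUp:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref StagingASG
      DesiredCapacity: 3
      Recurrence: "0 7 * * MON-FRI"           # 07:00 UTC, weekdays only
```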

------
AtlasBarfed
"We don’t have any real ops guy as you think these days. Whenever we need
something setup, we just have to page someone from India team to create
instances for us and then proceed to set them up ourselves."

Whiskey Tango Foxtrot. That is insane. This is a vain attempt to constrain
costs?

------
tra3
I’m still trying to decide on an orchestrator for my deployments. Any reason
Swarm wouldn’t work instead of k8s?

~~~
sk0g
Docker is pretty much abandoning Swarm, though, so I wouldn't recommend that
at least.

[https://www.bretfisher.com/is-swarm-dead-answered-by-a-docker-captain/](https://www.bretfisher.com/is-swarm-dead-answered-by-a-docker-captain/)

~~~
thrtythreeforty
What's a good replacement for Swarm that isn't as complex as Kubernetes? I
like the platform abstraction and flexibility of k8s; it's just really heavy
for many of my use cases.

~~~
manigandham
I'd advise learning the basics of Kubernetes anyway. Managed offerings like
GKE take away all of the operational burden, so you can just deploy your app
with minimal setup - usually just one or a handful of YAML files.
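
For example, a small stateless service really can be a single file like this
(names and image are placeholders):

```yaml
# Hypothetical minimal deploy: one file, one app, three replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginxdemos/hello   # any web image works here
          ports:
            - containerPort: 80
---
# Expose it; on GKE this provisions a cloud load balancer automatically.
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
    - port: 80
      targetPort: 80
```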

------
tkyjonathan
I reduced my RDS bill by 66% by keeping the database on EC2 and using Percona
Server.

------
kim0
Hello fellow hackers. I've worked on some Terraform automation for running
EKS on spot instances. If you find that interesting, let me know; I can help
run it in production if needed.

------
jijji
I've found that rewriting an application from C# to something quicker like PHP
or Go negates the whole need to do this kind of stuff because the C#
application is 10x faster written in PHP or 20x faster written in Go. I love
it when you change one line of a shared lib in C# and it takes an hour for the
application to recompile.

~~~
oblio
1\. Compilation time != execution time.

2\. PHP is ridiculously slow compared to C#. You're doing wrong (or whoever
created the application did it wrong). Go is comparable, but not really
faster.

[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/fastest/csharp.html)

[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/fastest/php.html)

[https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/fastest/go.html)

~~~
strumpy123
Yeah, benchmarking all the stuff that's not really needed for a real web app.

~~~
oblio
1. Not everything is a web app.

2. That stuff is deep inside the web frameworks you use, in some way or
another.

3. There you go, bud:
[https://www.techempower.com/benchmarks/](https://www.techempower.com/benchmarks/)
The first non-micro framework is ASP.NET Core.

Personally, I don't think performance is everything. Spring (Java) is #221 in
that ranking, and I'd bet it's used in almost as many places as everything
above it combined.

