Hacker News
Ask HN: How did you significantly reduce your AWS cost?
100 points by kacy on Oct 30, 2017 | 75 comments
One of the most obvious things is to better utilize your Reserved Instances, but what are some things that are not so obvious? Do you use any tools? Thanks!

I'm a consultant in AWS, mainly doing cost optimization.

My favorite that most people miss is the S3 VPC endpoint. Putting an S3 endpoint in your VPC gives any traffic to S3 its own internal route, so it isn't billed like public traffic. I've seen several reductions in the $20-50k a month range from this one trick.

Otherwise, stop running bare-metal processes in the cloud. It's dumb. For instance, I see people operating a file gateway server (with five-figure annual licensing costs) on EC2 using EBS storage. This is a perfect use case for replacement with S3, with Lambda code running in response to whatever file gateway event you're monitoring.

Lastly, you need to really challenge people's conceptions about when resources need to be in use. Does a dev/test site need to be running the 16 hours that people are not at work? Of course not, except when people are staying late. So you create incentives to turn it off, or you run it on something like ECS or kubernetes with software to stop containers if they're not actively used in a 15 minute window (and then the cluster can scale down).
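A minimal sketch of that shutdown policy; the office hours and the 15-minute idle window here are assumptions, not the parent's exact numbers:

```python
from datetime import datetime

# Hypothetical "turn it off when idle" policy: stop dev/test capacity
# outside office hours unless someone is actively using it.
OFFICE_START, OFFICE_END = 8, 18   # assumed working hours
IDLE_WINDOW_MIN = 15               # stop if unused this long

def should_stop(now: datetime, minutes_since_last_use: float) -> bool:
    """Stop outside office hours, unless recently used (late workers)."""
    in_office_hours = OFFICE_START <= now.hour < OFFICE_END
    recently_used = minutes_since_last_use < IDLE_WINDOW_MIN
    return not in_office_hours and not recently_used

print(should_stop(datetime(2017, 10, 30, 2, 0), 60))   # 2am, idle: True
print(should_stop(datetime(2017, 10, 30, 2, 0), 5))    # 2am, in use: False
print(should_stop(datetime(2017, 10, 30, 10, 0), 60))  # mid-day: False
```

The "recently used" check is what avoids punishing people who stay late while still letting the cluster scale down at night.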

Great post. Something I’d add is if you can use tags and reporting to incur chargeback pain in internal budgets, it’s a great motivator to not waste internal resources. We’ve saved six figures a month in AWS spend by doing this.
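The chargeback roll-up is simple once line items carry tags. A hedged sketch, assuming cost records already exported from billing (the records and the 'team' tag are made up for illustration):

```python
from collections import defaultdict

# Hypothetical exported line items; in practice these come from the
# detailed billing report or Cost Explorer.
line_items = [
    {"service": "AmazonEC2", "cost": 812.50, "tags": {"team": "search"}},
    {"service": "AmazonRDS", "cost": 430.00, "tags": {"team": "search"}},
    {"service": "AmazonEC2", "cost": 95.25,  "tags": {}},  # untagged!
    {"service": "AmazonS3",  "cost": 120.00, "tags": {"team": "mobile"}},
]

def chargeback(items):
    totals = defaultdict(float)
    for item in items:
        # Untagged spend lands in a visible bucket so someone owns it.
        totals[item["tags"].get("team", "UNTAGGED")] += item["cost"]
    return dict(totals)

print(chargeback(line_items))
# {'search': 1242.5, 'UNTAGGED': 95.25, 'mobile': 120.0}
```

Surfacing the UNTAGGED bucket is half the value: it creates pressure to tag everything, which is what makes the chargeback pain land in the right budget.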

Tactically: spot instances, reserved instances, auto-scaling groups, deleting data that we didn't need.

Strategically: I've only ever seen it work by making it somebody's job to save money on Amazon, ahead of feature velocity. Find a tool that analyses where you're spending [0], pick the biggest bucket of spend you don't really understand, and drill down until you're sick of looking at it. Make some optimizations around how you're spending that money, rinse, repeat.

Every org I know has done a first pass that's pretty effective, where you buy a bunch of reserved instances for the compute power you need. A couple of the companies I've worked with stalled out at that point. The others figured out real cost / benefit models for real projects and did things like move data from S3 to Glacier, or from hosted MySQL to RDS, or from Cassandra to DynamoDB.

There are only so many free-lunch cuts you can make, so it's worth your while to start down a path that allows you to consider trade-offs, which includes empowering somebody to lobby for them.

[0] Seems nice, haven't used it: https://github.com/Teevity/ice. I know the folks at http://cloudhealthtech.com, they're building a pretty solid business if you're willing to spend money to save money.

The HFT Guy has a good blog and did some posts on this. Basically Google Cloud is much cheaper than AWS so you should look at that. There are also other alternatives including IBM SoftLayer which I've never looked at before.

"Run an entire tech company in the cloud, or run only a single [big] project requiring more than 10 servers? Google Compute Engine

Run less than 10 servers, for as little cost as possible? Digital Ocean

Run only beefy servers ( > 100GB RAM) or have special hardware requirements? IBM SoftLayer or OVH"



I like his summary; however, on the low end, I found OVH more affordable than DigitalOcean, depending on your needs. DigitalOcean is $20/month for 2 GB of RAM, while OVH is $28/month for 7 GB with roughly similar specs for the rest (SSD, etc.). Note that OVH also offers a VPS SSD plan with 8 GB for $13.50/month; however, it doesn't automatically renew (at least in the Canadian region), so you have to pay your bill manually each month. The $28/month option is under their public cloud offering and supposedly does auto-renew.

You can also scale up memory at OVH without scaling up the rest of the server, so you can get to 15 GB of RAM for $34/month (at DigitalOcean, that is closer to $120/month, or $160/month with much better specs for the rest of the server). But if all you need is more RAM...

Digital Ocean is a lot more user-friendly though than OVH. If you need good support, DO is probably a better bet.

On the low end, Linode is also an option: 2GB/$10.

Don't forget Vultr

Cutting out idle time is where you'll make the savings. So use Lambda and EMR where appropriate, and use spot by default.

We stopped storing our IoT raw data in databases. We still need to search it, but now we store only metadata in the database (we know what we will search by, so we can make appropriate metadata) and store the raw data in S3. So any searches are in the DB, using the DB for what it is good at. This means that our storage / database cost approaches just the S3 cost, because our metadata is ~0.001 the size of the raw data. It took our product from "we are going to shut this thing down" to making money.

Got the idea from an AWS talk here https://youtu.be/7Px5g6wLW2A. Blew my mind. I coded up all the changes in a day. Took way longer to move the data than to code it.

This is one of the ways to use AWS 'the right way'. Without serious optimization, using AWS as an IaaS provider is going to cost more.

AWS provides a ton of building block primitives you can use to build with at a price point better than you can do it yourself. If you just try to do it yourself using their IaaS offerings 24/7 (ec2, vpc, etc) then you're in for a bad time.

How slow is it to retrieve the data from S3, though?

It is slow, but most of the time, we don't use the raw data - we use the metadata. It worked great for our app. We got excited and tried to implement some on-the-fly data summaries where we would have to touch a bunch of files every time we wrote anything. That just didn't work because of the speed.

Let's be specific on speed. Most of the time, the MB or two needed for a plot is a fraction of a second, which our customers can deal with. That retrieval is a single object from S3, the way we organize things.

Having said that, the talk I linked to has some great advice - use the "folder" structure to write data so you don't search, you just use the naming scheme to do a direct object read. In addition, we can keep most reads to a single object, which is fairly fast.

As is always true, you will need to test to see if it fits your speed needs. But even with just the naming scheme and meta data in the database to eliminate all searches on S3, the speed works for us.
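The "naming scheme instead of searching" idea boils down to deriving the S3 key deterministically, so a read is a single GET rather than a LIST. A sketch with invented prefix and device names:

```python
from datetime import datetime

# Hypothetical key scheme: one object per device per hour. The database
# stores only small metadata rows; the key is recomputed at query time,
# never searched for on S3.
def raw_data_key(device_id: str, ts: datetime) -> str:
    return f"raw/{device_id}/{ts:%Y/%m/%d/%H}.json.gz"

key = raw_data_key("sensor-042", datetime(2017, 10, 30, 14, 0))
print(key)  # raw/sensor-042/2017/10/30/14.json.gz
```

Because the key is a pure function of (device, hour), a plot covering one device and one time window touches one or a handful of objects, which is what keeps the retrieval within a fraction of a second.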

Yes, S3 is very slow. Maybe that's to persuade clients to move to their DB solutions.

After cleaning up all our unneeded instances and optimising our architecture, we finally realised that our developer stacks and test environments didn’t need to run 24/7, which has so far been the biggest single cost saving we’ve implemented. By only running stacks from 8am-8pm then stopping everything, we’re getting a 50% saving on our instance costs. If a developer needs a stack they can still launch a new one or start their existing one. We also moved a few test stacks to use single AZ modes instead of multi AZ for another significant saving.

Dialing down our snapshots in non production environments was also a great help to cut our costs.
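For what it's worth, the weekly arithmetic on an 8am-8pm schedule like the one above (the weekend variant is my assumption, not what the parent did):

```python
# Duty-cycle savings from stopping instances on a schedule.
daily_saving = 1 - (12 * 7) / (24 * 7)    # 8pm-8am off, every day
weekday_saving = 1 - (12 * 5) / (24 * 7)  # additionally off Sat/Sun

print(f"{daily_saving:.0%}, {weekday_saving:.0%}")  # 50%, 64%
```

The 50% figure matches running 12 of every 24 hours; adding weekends to the off-schedule pushes it to roughly 64% for teams that don't work Saturdays.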

We wrote about how we analyzed our AWS usage https://segment.com/blog/spotting-a-million-dollars-in-your-..., and then some of optimizations we made to cut down costs https://segment.com/blog/the-million-dollar-eng-problem/.

This is an awesome breakdown!

I had lots of DNS zones in Route 53 that didn't really need to be there. So I moved a bunch of them to ClouDNS, which supports ALIAS records to have apex domains point to CloudFront distributions. Seems to work.

Also I realized that DynamoDB autoscaling is relatively expensive for a large number of tables, because it sets up multiple CloudWatch Alarms for each table. So I turned off autoscaling for tables that don't really need it.

I have started to replace DynamoDB tables with Cloud Directory instances in cases where I don't want to have the fixed cost of tables sitting idle (Cloud Directory is charged by request). But this takes a lot of work to do retroactively and Cloud Directory doesn't have a nice CRUD UI, and of course it only fits projects that work well with a graph database.

These are not big costs in absolute $$$, but for personal projects they can be significant when you want to avoid piling up fixed monthly costs to keep services running.
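Rough arithmetic for the DynamoDB autoscaling alarms point above. Both numbers are assumptions (about $0.10 per CloudWatch alarm per month past the free tier, and roughly 4 alarms created per table); check your own bill for exact figures:

```python
# Assumed prices, not authoritative.
ALARM_PRICE = 0.10       # $/alarm/month after the free tier
ALARMS_PER_TABLE = 4     # typical read+write upper/lower alarms

def autoscaling_alarm_cost(num_tables: int) -> float:
    return num_tables * ALARMS_PER_TABLE * ALARM_PRICE

print(autoscaling_alarm_cost(50))  # fifty idle hobby tables add up
```

For a personal project with many mostly-idle tables, the alarms can easily cost more than the provisioned capacity they're watching.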

For RDS, it is much cheaper to pay for more storage than more IOPS, at least if you are under 3k IOPS.

IE, you will save a bundle if you are currently buying 3k provisioned IOPS vs just paying for 1 TB of storage.

More here: http://blog.textit.in/why-buying-provisioned-iops-on-rds-may...
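Illustrating the linked post's math. All prices here are era/region assumptions (RDS gp2 around $0.115/GB-month with a 3 IOPS/GB baseline; provisioned IOPS storage around $0.125/GB-month plus about $0.10 per IOPS-month), so plug in current numbers before deciding:

```python
# Assumed monthly prices; not authoritative.
GP2_GB, PIOPS_GB, PIOPS_IOPS = 0.115, 0.125, 0.10

def gp2_monthly(gb):
    # 1000 GB of gp2 also yields ~3000 baseline IOPS (3 IOPS/GB).
    return gb * GP2_GB

def piops_monthly(gb, iops):
    return gb * PIOPS_GB + iops * PIOPS_IOPS

# 1 TB gp2 vs. 100 GB with 3000 provisioned IOPS: same IOPS, very
# different bills.
print(round(gp2_monthly(1000), 2), round(piops_monthly(100, 3000), 2))
```

Under these assumptions, over-provisioning storage to ride the gp2 IOPS baseline is roughly a third of the cost of buying the IOPS directly.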

This applies to EBS in general: if your application is not IOPS-intensive, use large gp2 volumes instead. We got our stress testing wrong and chose provisioned volumes, then had to revert to general-purpose volumes in production.

You may find FittedCloud automated/transparent EBS optimization useful - https://www.fittedcloud.com/solutions/aws-ebs-optimization/

Shameless plug here. Working on an side hustle to help you monitor and measure your AWS spend: http://www.cloudforecast.io

You'll get a daily report in your inbox that breaks down your costs, with weather symbols indicating the health of your spend. Think of it like the weather you check every morning before you start your day. The forecasting in our report is pretty accurate too!

Hit me up if anyone is interested in helping us out and providing product feedback. We're offering a 60-day free trial for our first users so we can evolve from a reporting tool into a tool that helps you save on AWS costs.

Would love to hook you all up: hello@cloudforecast.io

This post has been super helpful to figure out what we should start thinking about and what to build next!

I think the first option is to go to AWS and tell them that. They can discount prices and have staff to help you reduce your bills. I'd start there, actually.

How come every time I try, the response from AWS is abysmal? I've been told this many times and have tried to reach out, but it's always a pitiful case of broken telephone.

It depends on your budget. They start caring above roughly $10K/month, I think.

Also, if you're a startup, you can usually get $100K of free credit.

Rather than spot instances, use spot fleets and structure your bid such that you arbitrage across different node types and data centers that are equivalent for you.

We cut our spot expenses by 20%, and they were already about 70% cheaper than on-demand.

No matter how we played with the price calculator, Google was twice as expensive as our spot fleet algorithm.

Spot fleets are really powerful but they take a lot of parameters and are quite hard to get configured right.

But running standalone spot requests is indeed much worse.

Autoscaling groups come much more naturally to people; it's just that the native spot implementation is unreliable.

I mentioned before in this thread that I implemented a tool called autospotting that allows your on-demand autoscaling groups to be automatically converted into a sort of spot fleet, by replacing their members with the best available spot instances.

You get the best of both worlds: easy configuration based on your group settings, so you don't have to worry about bid prices, instance types, and weights. The bid price is automatically set to the on-demand hourly price of your original instances. The instance type is likewise determined from the original type, so you can get any other type that's at least as large, even from a different generation; selection is currently based on the lowest price, so you automatically get a variety of types without explicit configuration. When no spot instance types are priced below the on-demand price, the group happily keeps running the initial on-demand instances until some eventually become available at a better price.

Gradually these spot instances are launched and attached to the existing group, and on demand instances are terminated to keep the capacity constant.

Unlike the spot fleets, this supports out of the box anything that is backed by autoscaling groups, such as Elastic Beanstalk, Kubernetes clusters built using kops, environments managed by CodeDeploy, and so on.

And unlike spot fleets, the migration to spot and back is a matter of setting a tag on the group, so you can easily revert back to the original group configuration if you decide to.
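The selection rule described above can be sketched in a few lines. Prices, sizes, and the candidate list here are made up; real tools pull them from the spot price feed:

```python
# Pick the cheapest spot type at least as large as the original,
# bidding no more than the on-demand price (hypothetical data).
def pick_spot(original_size, on_demand_price, candidates):
    eligible = [c for c in candidates
                if c["size"] >= original_size and c["price"] < on_demand_price]
    return min(eligible, key=lambda c: c["price"]) if eligible else None

candidates = [
    {"type": "m4.large",  "size": 2, "price": 0.031},
    {"type": "m3.large",  "size": 2, "price": 0.027},  # older gen, cheaper
    {"type": "c4.xlarge", "size": 4, "price": 0.062},
]

# Replacing an on-demand m4.large (assumed $0.10/hr):
print(pick_spot(2, 0.10, candidates))  # m3.large wins on price
# If nothing beats on-demand, keep the original instances:
print(pick_spot(2, 0.02, candidates))  # None
```

Returning `None` in the second case corresponds to the fallback behavior described above: the group simply keeps its on-demand instances until spot prices improve.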

Can you provide some details about your spot fleet algorithm ?

Buy your own hardware. Especially for dev work, having your own rack in the office really is cheap, and it's kinda nice. The other benefit is that it limits the budget to what we have, rather than allowing anyone to create more servers...

1. Use RIs (like you said) and make sure your utilization is high.

2. Look at any under utilized EC2 instances and make them smaller. If they are very under utilized, consider moving what they do to Lambda, which is very cheap.

3. S3 for everything storage. Only use EBS volumes when you need to.

4. Someone mentioned multiple availability zones, and it's a great tip. Turning multi-AZ on doubles the price for RDS, and it's probably not needed for anything but critical instances.

5. Make sure unused instances are terminated, not stopped.

6. For very expensive systems, use spot instances to strategically reduce your costs when EC2 capacity is cheaper.

7. Try to keep your instance types down to a few. This will give you fewer headaches when you go to purchase reserved instances.

What helped when we first got started was terminating test/development environments more quickly. We'd often launch a full environment to test a theory or do a POC without touching the regular development, staging, and production environments. Those environments would often run a little longer than they were supposed to. We've put automated systems in place that detect inactivity and notify us.

This only helped cut our bill early on, before we launched and started spending more and more on our production environment, back when these kinds of things still made up a significant share of our bill.

I use AWS and online.net. I use the latter for 8 core bare metal servers that cost me 16euro a month. AWS for the rest (at work)

We need a couple of things:

1. A nice easy way to distribute load through cheap bare metal providers with peak load taken by cloud. Kubernetes is nearly there.

2. Somebody to build an LCD display that sits in the office and shows the projected cost of the cloud deployment at current usage. It's so easy to throw up a 40-node cluster and not think about the cost, especially where I currently work, a government-funded project where the taxpayer will pony up.

Over the years we have done a number of things, probably the most impactful was auto-scaling on our core engines. Whenever we scale down it's just like scooping money out.

Amazon Lightsail instances can provide cheaper bandwidth. The $5/month subscription includes 1TB, while 1TB of outgoing data on EC2 would cost $90 ($0.09 per GB).

(This is such a big difference that I don't really understand it.)

Blueprint your Footprint

Observe your AWS footprint. Dump it as CloudFormation, or hook up a tool like http://hava.io or http://www.visualops.io/ to get a spatial representation of your infrastructure.

They both have powerful costing tools. Use this data in conjunction with your baseline performance metrics (memory/CPU/disk consumption) and see if there are any broad savings to be made by refactoring your platform.

If you have been steadily adding components (and their SG dependencies), it's likely there is no clear understanding of the end-to-end scope of your AWS footprint.

From all this data comes your Blueprint. You choose what's necessary; VPC, ELB, AZ, RDS, S3, R53 and so on.

Use something that is not the AWS console to define and capture a new, blueprinted environment and deploy it. See if it does the job by running your availability and performance suite on it.

Start doing Blue/Green Deploys so your costs are reflective of your application.

Shameless plug: if you are using autoscaling and have stateless/12factor components, you would like to try spot instances but don't feel like spending time and much effort with a migration to spot fleets, just give autospotting a try.

It will continuously replace your group's on-demand instances with the best-priced compatible (at least as big and identically configured) spot instance types, with minimal configuration changes performed by simply tagging the group. The group configuration is unchanged, so it will keep launching on-demand instances when scaling out or when replacing outbid spot instances.

The tool is open source and available on github and can be set up in a few minutes: https://github.com/cristim/autospotting

> How did you significantly reduce your AWS cost?

We moved to Google Cloud Platform.

Snark aside, we're continuing to be pretty happy about that decision, in addition to paying a lot less.

AWS Trusted Advisor gives great metrics on underused resources... Sadly, it's part of Premium support.


By switching to Hetzner GPUs (99 euro/month for a server with an NVIDIA 1080)

Serverless / Zappa: hosting my APIs for a 10x decrease in monthly costs.


We are in the process of converting over to this as well

Being extremely aggressive about cleaning up/terminating/scheduling dev environments has led to tremendous cost savings. Monitoring those instances and downsizing them has helped a lot too.

These are usually the quickest and most common solutions to cost problems that I've seen. After that, you look for "minor gains": storage (S3 instead of EBS, Glacier instead of S3 for infrequently accessed data), using reserved instances, moving to RDS (though reliance on the underlying OS can make this difficult, and many databases have such dependencies), autoscaling, running ALBs/ELBs instead of EC2 instances running HAProxy, CloudWatch alarms for things that exceed cost parameters, etc.

One of the biggest cost issues that I’ve seen is people treat AWS like an internal PaaS cloud (like vSphere) and move their ways of maintaining bare-metal hardware/software into AWS (i.e. “lift and shift”). It saves time in that you don’t have to actually think about how your application works in a cloud world and gives managers that “I did the impossible and moved us to the Cloud in three months” badge/bonus, but it costs SO MUCH MORE to run and really does a disservice to what AWS can do. (Cynically, I guess this creates another carrot: “I did the impossible and saved us tons of money by fixing the cost explosion from lift and shift!”)

Breaking that culture down is paramount to efficient usage of any cloud in a way that won’t break the bank.

There's no one silver bullet, as always, you just have to get really good at using Cost Explorer.

Project 1: We reduced our EC2 bill by about $10k a month by moving to spot instances for our Continuous Integration servers

Project 2: We saved another $10k a month by moving to CloudFront for our downloads, instead of using S3 directly

Project 3: Finding and eliminating waste from years of tech debt. Saved $5k a month so far.

As we go we need to spend more and more time to find the same amount of cost savings, it's a law of diminishing returns.

Spot instances. Spot instances. Spot instances.

This, this, and this again.

I'll also add, do not keep large EBS volumes around. If you need to store data forever, get it moved to S3.

If I had to significantly reduce my AWS cost, I'd turn to Corey Quinn of "Last Week in AWS" (https://lastweekinaws.com) who fixes AWS bills for a living (https://www.linkedin.com/in/coquinn/)

Docker. ECS, Kubernetes, Mesos, whatever floats your boat. Everything on RIs. Then start playing with spot-instances.

I didn't see this one mentioned.

Use VPC end points for S3 to save cost.

If you are using NAT gateways and your app talks to S3 (this probably applies to DynamoDB as well), you could be saving money by routing that traffic through a VPC endpoint, where bandwidth is free, rather than through the NAT gateway, which is billed.
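Back-of-envelope numbers for the NAT-vs-endpoint point. Prices are assumptions from this era (NAT gateway around $0.045/hr plus about $0.045 per GB processed; the S3 gateway endpoint has no hourly or per-GB charge):

```python
# Assumed NAT gateway prices; the S3 gateway endpoint costs nothing
# per GB, so the endpoint side of the comparison is simply $0.
NAT_HOURLY, NAT_PER_GB = 0.045, 0.045

def nat_monthly_cost(gb_through_nat, hours=730):
    return hours * NAT_HOURLY + gb_through_nat * NAT_PER_GB

print(round(nat_monthly_cost(10_000), 2))  # 10 TB/month through the NAT
```

At 10 TB/month of S3 traffic, that's several hundred dollars a month of NAT data-processing charges that a gateway endpoint would eliminate entirely.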

This is totally just an idea, and not something I have ever seen put into practice (I've not worked anywhere that used AWS), but depending on your spend, I'd imagine that hiring someone specifically skilled at objective performance measurement and optimization could pay for itself.

We developed a continuous scalability testing technology that lets you meet the concurrency and throughput requirements of your microservices build, before they ship to production.

Our community uses this information to reduce the waste at the software, middleware and AWS level. e.g. configure Auto Scaling groups to scale upon reaching a custom ELB Target Trigger equal to your measured concurrency.

I wrote an article about this specific ELB use case a couple weeks ago, at https://medium.com/@stacktical/how-to-lose-money-with-the-ne...

Maybe it can give you some ideas.

- For consumer compute, consider serverless (API Gateway and Lambda); this addresses the problem of idle/pre-prod environments.
- Data at rest: consider S3, retention policies, and Glacier.
- Set aggressive retention policies on CloudWatch Logs, and watch out for unnecessary CloudWatch metrics (I still prefer statsd because of the percentiles).
- If you use Kinesis, monitor in/out bytes, provision shards as needed, and control retention.
- Watch out for cross-region/hybrid (on-prem) stacks; data transfer is very expensive and not obvious.
- Use reserved instances.
- Remove unused EBS volumes.
- Watch out for high-IO EBS volumes.
- Use CloudFront to cache static content.
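One concrete example of a retention policy from the list above: an S3 lifecycle rule that tiers data to Glacier and eventually expires it. The bucket prefix and day counts are placeholders; the dict is the shape boto3's `put_bucket_lifecycle_configuration` expects:

```python
# Hypothetical lifecycle rule: move raw data to Glacier after 30 days,
# delete it after a year.
lifecycle = {
    "Rules": [{
        "ID": "tier-then-expire",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}

# Applied with boto3 (not run here):
# s3.put_bucket_lifecycle_configuration(Bucket="my-bucket",
#                                       LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["Transitions"][0]["StorageClass"])
```

A rule like this makes the retention policy enforce itself, instead of relying on someone remembering to clean up.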

Shameless side project plug

I’m working on https://www.dormantbear.com which is an aws scheduler to turn instances on and off for certain times of the day.

I use it for my own VPN box which I only require during office hours.

The main commercial use case is for turning off staging environments when not needed during out of office hours. For setups that are fixed and don’t spin up test environments on the fly.

For security, you should provide AWS creds that can access only your staging boxes, with permission to list instances and turn them on and off, even though they are symmetrically encrypted.

The service is free as I have not written billing or marketing pages yet.

* Use spot instances whenever possible, especially for development. This can sometimes cost more time than it's worth, but it can turn out quite useful, especially for broad QA testing across many nodes.

* We (Userify[1] plug: SSH key management) identified our most expensive activity (in our case, ~70% of our bill was simply bandwidth) and then switched to external or third-party resources where possible, but..

* After we went through all of that effort, we mentioned it to our account manager who worked with us to resolve it; wish we'd just started there and saved a good bit of time!

* Switched historical real-time data storage to archival mode (ie a quick script to extract old data, compress, and save to S3). This saved a lot in Redis/Elasticache memory fees which were growing extremely quickly.

* Disabled detailed monitoring on autoscaled instances where we didn't need higher-resolution data.

* Removed NAT instances from VPCs where they weren't needed.

* Found and retired old snapshots and volumes (those are hard to figure out, but the exercise is worth pursuing, since they seem to multiply like rabbits). Far less expensive but easier to track down are unused Elastic IPs and Route 53 zones.

* Don't launch an instance without tagging it, if at all possible. You should use tags to group instances (and other resources) and provide some way to track down the instance owner. Often people are reluctant to turn something off because they don't know if it's still in use. If it's critical, it should be clearly labeled so.

* Look through seldom-used accounts and regions for left-over instances that are still running. Turns out there's often a lot of these.

* Look at CPU/RAM/IO etc for larger instances and decide if it can be reduced without affecting user experience. This is especially effective with dev or internal boxes.

* Stop (but do not terminate!) all dev/non-critical boxes outside normal business hours. This can be done with a Route 53 record for the box (to avoid needing an EIP for each one) and a tag that indicates the box can be stopped when needed. It's amazing how many project boxes just keep running and costing money. The same applies when you're not sure whether something unused-looking can be terminated safely: stop it, see if anyone screams, and add a tag with your contact info. (Corollary: make sure critical boxes have uptime monitors, CloudWatch alarms, etc.!)

* Don't front-end static resources with CloudFront if latency and scalability won't be issues for them; skipping it reduces bandwidth/transaction/region costs by at least half.

* Try to avoid cross-region (or cross-account in the same region) or even cross-AZ bandwidth charges. Avoid multi-AZ builds unless critical infra.

* Watch the per-transaction fees, especially on things like Lambda, but also on frequently updated or written items in S3.

* Pay attention to those tiny fees that come along with certain types of AWS technology that can add up huge over time. (esp DynamoDB and Lambda; however, these can be awesome for fast-to-develop low-utilization projects)

* Use classic ELBs instead of ALBs if you have a lot of short sessions rather than fewer long-running ones. Also become familiar with NLBs[2], which are very useful in some circumstances.

* NATs can be a very significant (especially in bandwidth) and hard-to-surface cost... Delete them if you don't need them.

* Learn CloudFormation[3] and use it to build clearly labeled dev and production stacks. The great thing about this is that when you're done, you can just delete the dev stack (which cascade deletes all of the dependencies) and launch a prod stack in a different account. CloudFormation has gotten a LOT easier to use with YAML support, and a LOT faster at spinning up. Start with something small and work up from there.

* Keep the prod account separate (and backed up as offline as possible, preferably on your premises, or in another cloud). Only let a few trusted people manage prod, and anything outside of prod is fair game to get turned off. Ideally, give each dev their own sandbox account, and use tools like Userify (plug!) and third-party IAM roles.

* Lastly, if costs are really too high for you, also look at AWS Trusted Advisor, third-party tools like Cloudability or Teevity Ice, and of course talk with your account manager after gathering data about where your biggest costs are. They have a lot of power to help you and they often really do have your best interests at heart.

1. https://userify.com

2. http://docs.aws.amazon.com/elasticloadbalancing/latest/netwo...

3. https://aws.amazon.com/cloudformation/

The thing with AWS is that you have to first measure your costs in order to reduce them - which isn't always obvious before the bill arrives.

What are some of the big frustrations people have found when it comes to monitoring these costs?

That depends a lot on your application. That said, I'd like to add CloudFront: the more you respond from the CDN cache, the fewer requests hit your backend, and that can be huge. It is not unusual to offload 80-90% of traffic.

When requests do hit the backend, if it's Lambda functions, you can also save a lot by ensuring you don't have wasted resources. The same idea applies to databases; DynamoDB is very cost-effective. Usually, the less you manage or provision by hand, the more you save.

You don't use AWS in the first place. Most projects are better off going with bare metal and using spot instances for spikes. That will cut your costs dramatically.

This blog post is straightforward and useful: https://www.datawire.io/4-simple-strategies-for-managing-you...

tl;dr: Configure your reserved instances to be region-scoped; buy in a popular region; pay-as-you-go not upfront; sell your excess reserved instances (like you mentioned)

You may want to check out FittedCloud (www.fittedcloud.com). It identifies potential cost-saving opportunities using machine learning and provides optimization recommendations as actionable advisories for most AWS resources; these are one click away from automating the recommendation. There's a free version with limited capabilities and a paid version with full capabilities.

Maybe 'engage' with the AWS sales team?

Like I am doing...I have £300K of annual cloud and dedi hosting costs spread over several providers so I reached out to the sales team last week. I'm sure they'll be rushing to call me any day now - I just hope it's not this afternoon as I'll be on a booked call with Google.

Depending on your spend, the specialists over at www.quinnadvisory.com should be able to sort you out quickly.

When you can't use reserved instances... switching to GCP (with their automatic discount) saved us ~50%.

So, aside from using reserved instances...

1. Autoscale with triggers based on utilization.
2. Don't use a full OS if you don't need one; consider containers that are tightly targeted to your needs.
3. Managed services help; you don't need to build everything on your own.

1. Use Cloudflare as a CDN in front of S3 to reduce traffic costs.
2. Use Cloudflare's firewall instead of AWS WAF: Cloudflare only charges for passed traffic, while AWS WAF charges for both passed and blocked traffic (I have confirmed this with the AWS billing department).

Cross-AZ fees can add up!

Big instances with lots of containers, and putting everything in CloudFormation/Terraform so that you can nuke non-prod environments at certain times.

Check your CloudFront and data transfer usage. We had bingbot over-crawling a number of our URLs, resulting in high CloudFront costs.

Small gain: deleted all snapshots that had accumulated over 6 months across 100 instances (generated by CloudWatch rules).

I compressed some data stored in S3 in a lossy way to 2% of its original size, saved a bundle.

What did you lose? We've seen some interesting things, like lowering the precision of timestamps and GPS coordinates having a large impact, because zip doesn't work well on essentially random numbers.

Never thought of that, something to keep in the very (very) far back of my mind for future usage
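For anyone curious, the precision effect is easy to demonstrate on synthetic data (the 4-decimal cutoff, roughly 11 m of GPS precision, is an arbitrary assumption):

```python
import gzip
import json
import random

# Full-precision floats are mostly incompressible random digits;
# truncated ones compress far better.
random.seed(1)
points = [(37.0 + random.random() * 0.01, -122.0 + random.random() * 0.01)
          for _ in range(5000)]

full = json.dumps(points).encode()
lossy = json.dumps([(round(lat, 4), round(lon, 4))
                    for lat, lon in points]).encode()

print(len(gzip.compress(full)), len(gzip.compress(lossy)))
```

The lossy version is both shorter before compression and dramatically more compressible, since gzip can exploit the repeated prefixes once the random low-order digits are gone.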

Easiest, quickest and most flexible way is with www.parkmycloud.com or similar tools.

Use Windows EC2 instances only as a last resort.

Have a clear retention policy. You must differentiate between hot and cold data, it will help you minimize the size and thus cost of your high-performance cluster and the lower performance one.

Re-architect for the cloud.

I hate how people think this kind of garbage is worse than maintaining your own stack...

Some great bits on here about effective cloud strategies

