Strategically: I've only ever seen it work by making it somebody's job to save money on Amazon, ahead of feature velocity. Find a tool that analyses where you're spending, pick the biggest bucket of spend you don't really understand, and drill down until you're sick of looking at it. Make some optimizations around how you're spending that money, rinse, repeat.
Every org I know has done a first pass that's pretty effective, where you buy a bunch of reserved instances for the compute power you need. A couple of the companies I've worked with stalled out at that point. The others figured out real cost / benefit models for real projects and did things like move data from S3 to Glacier, or from hosted MySQL to RDS, or from Cassandra to DynamoDB.
There are only so many free-lunch cuts you can make, so it's worth your while to start down a path that allows you to consider trade-offs, which includes empowering somebody to lobby for them.
Seems nice, haven't used it: https://github.com/Teevity/ice. I know the folks at http://cloudhealthtech.com; they're building a pretty solid business if you're willing to spend money to save money.
My favorite that most people miss is the S3 VPC endpoint. Putting an S3 endpoint in your VPC gives any traffic to S3 its own internal route, so it's not billed like public traffic. I've seen several reductions in the 20-50k a month range with this one trick.
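For reference, a gateway endpoint for S3 is a single API call; here's a minimal boto3 sketch (the VPC and route table IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Gateway endpoints for S3 are free; S3 traffic then stays on AWS's
    # internal network instead of going out through a NAT gateway.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",            # placeholder
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
    )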
Otherwise, stop running bare-metal processes in the cloud. It's dumb. For instance, I see people operating a file gateway server (with five-figure-a-year licensing costs) on EC2 using EBS storage. This is a perfect use case for replacement with S3, with Lambda code running in response to whatever file gateway event you're monitoring.
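As a rough sketch of that pattern (not the original poster's setup), a Lambda function subscribed to S3 "object created" notifications can do whatever the gateway server was doing per file:

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Invoked by an S3 "ObjectCreated:*" event notification.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            process(obj["Body"].read())  # hypothetical per-file processing

    def process(data):
        ...  # whatever the file gateway event handler used to do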
Lastly, you need to really challenge people's conceptions about when resources need to be in use. Does a dev/test site need to be running for the 16 hours that people are not at work? Of course not, except when people are staying late. So you create incentives to turn it off, or you run it on something like ECS or Kubernetes with software that stops containers if they haven't been actively used in a 15-minute window (and then the cluster can scale down).
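A sketch of that idle-shutdown idea on ECS, assuming the dev site sits behind an ALB so activity can be read from its RequestCount metric (cluster, service, and load balancer names are placeholders):

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    ecs = boto3.client("ecs")

    def scale_down_if_idle(cluster, service, alb_resource_label):
        # Sum ALB request count over the last 15 minutes.
        now = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName="RequestCount",
            Dimensions=[{"Name": "LoadBalancer", "Value": alb_resource_label}],
            StartTime=now - timedelta(minutes=15),
            EndTime=now,
            Period=900,
            Statistics=["Sum"],
        )
        requests = sum(p["Sum"] for p in stats["Datapoints"])
        if requests == 0:
            # Nobody has touched the dev site in 15 minutes: scale to zero
            # so the cluster's autoscaling can shrink underneath it.
            ecs.update_service(cluster=cluster, service=service, desiredCount=0)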
"Run an entire tech company in the cloud, or run only a single [big] project requiring more than 10 servers? Google Compute Engine
Run less than 10 servers, for as little cost as possible? Digital Ocean
Run only beefy servers ( > 100GB RAM) or have special hardware requirements? IBM SoftLayer or OVH"
You can also scale up memory at OVH without scaling up the rest of the server. So you can get to 15 GB of RAM for $34/month (at Digital Ocean, that is closer to $120/month, or $160/month with much better specs for the rest of the server). But if all you need is more RAM...
Digital Ocean is a lot more user-friendly than OVH, though. If you need good support, DO is probably a better bet.
Got the idea from an AWS talk here https://youtu.be/7Px5g6wLW2A. Blew my mind. I coded up all the changes in a day. Took way longer to move the data than to code it.
AWS provides a ton of building-block primitives you can build with at a better price point than you can achieve yourself. If you just try to do it yourself using their IaaS offerings 24/7 (EC2, VPC, etc.), then you're in for a bad time.
Let's be specific on speed. Most of the time, the MB or two needed for a plot takes a fraction of a second, which our customers can deal with. That retrieval is a single object from S3, the way we organize things.
Having said that, the talk I linked to has some great advice - use the "folder" structure to write data so you don't search, you just use the naming scheme to do a direct object read. In addition, we can keep most reads to a single object, which is fairly fast.
As is always true, you will need to test to see if it fits your speed needs. But even with just the naming scheme and metadata in the database to eliminate all searches on S3, the speed works for us.
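To make the pattern concrete (our real scheme differs, this is just the shape of it): encode the lookup dimensions in the key, so a plot request is a single GetObject and never a ListObjects:

    import boto3

    s3 = boto3.client("s3")

    def plot_key(customer_id, sensor_id, day):
        # The "folder" structure *is* the index: no searching, just build
        # the key from what the request already knows.
        return f"plots/{customer_id}/{sensor_id}/{day:%Y/%m/%d}.json"

    def fetch_plot(bucket, customer_id, sensor_id, day):
        key = plot_key(customer_id, sensor_id, day)
        obj = s3.get_object(Bucket=bucket, Key=key)  # single object read
        return obj["Body"].read()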
Dialing down our snapshots in non production environments was also a great help to cut our costs.
Also, I realized that DynamoDB autoscaling is relatively expensive for a large number of tables, because it sets up multiple CloudWatch alarms for each table. So I turned off autoscaling for tables that don't really need it.
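For reference, table autoscaling is just a pair of Application Auto Scaling targets plus the alarms their policies create, so turning it off is roughly this (table name is a placeholder):

    import boto3

    aas = boto3.client("application-autoscaling")

    # Removing the scalable targets tears down the scaling policies and
    # their CloudWatch alarms, which is where the per-table cost comes from.
    for dimension in ("dynamodb:table:ReadCapacityUnits",
                      "dynamodb:table:WriteCapacityUnits"):
        aas.deregister_scalable_target(
            ServiceNamespace="dynamodb",
            ResourceId="table/my-low-traffic-table",  # placeholder
            ScalableDimension=dimension,
        )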
I have started to replace DynamoDB tables with Cloud Directory instances in cases where I don't want to have the fixed cost of tables sitting idle (Cloud Directory is charged by request). But this takes a lot of work to do retroactively and Cloud Directory doesn't have a nice CRUD UI, and of course it only fits projects that work well with a graph database.
These are not big costs in absolute $$$, but for personal projects they can be significant when you want to avoid piling up fixed monthly costs to keep services running.
I.e., you will save a bundle if you are currently buying 3k provisioned IOPS versus just paying for 1 TB of storage.
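Back-of-the-envelope, using approximate us-east-1 list prices (io1 around $0.125/GB-month plus $0.065 per provisioned IOPS-month, gp2 around $0.10/GB-month with a 3 IOPS/GB baseline; check current pricing, these drift):

    # Rough comparison only; the prices are assumptions, not quotes.
    io1_cost = 1000 * 0.125 + 3000 * 0.065   # ~ $320/month for 1 TB + 3k PIOPS
    gp2_cost = 1000 * 0.10                   # ~ $100/month, and 1 TB of gp2
                                             # already has a 3000 IOPS baseline
    print(io1_cost, gp2_cost)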
You'll get a daily report in your inbox that breaks down your costs, with weather symbols indicating their health. Think of it like the weather you check every morning before you start your day. The forecasting in our report is pretty accurate, too!
Hit me up if anyone is interested in helping us and providing product feedback. We're offering a 60-day free trial for our first users so we can evolve from a reporting tool into a tool that helps you save on AWS costs.
Would love to hook you all up: firstname.lastname@example.org
This post has been super helpful to figure out what we should start thinking about and what to build next!
Also, if you're a startup, you can usually get $100K of free credits.
We cut our spot expenses by 20%, and they were already cheaper than on-demand by about 70%.
No matter how we played with the price calculator, Google was twice as expensive as our spot fleet algorithm.
But running standalone spot requests is indeed much worse.
Autoscaling groups come much more naturally to people; it's just that the native spot implementation is unreliable.
I mentioned before in this thread that I implemented a tool called autospotting that allows your on-demand autoscaling groups to be automatically converted into a sort of spot fleet, by replacing their members with the best available spot instances.
You get the best of both worlds: easy configuration based on your group settings, so you don't have to worry about bid prices, instance types and weights. The bid price is automatically set to the on-demand hourly price of your original instances. The instance type is also determined automatically from the original instance type, so you can get any other type that's at least as large, even from a different generation; the selection is currently based on the lowest price, so you'll pretty much automatically get various types without explicit configuration. When no spot instance types are priced lower than the on-demand price, the group will happily run the initial on-demand instances until some eventually become available for a better price.
Gradually these spot instances are launched and attached to the existing group, and on demand instances are terminated to keep the capacity constant.
Unlike the spot fleets, this supports out of the box anything that is backed by autoscaling groups, such as Elastic Beanstalk, Kubernetes clusters built using kops, environments managed by CodeDeploy, and so on.
And unlike spot fleets, the migration to spot and back is a matter of setting a tag on the group, so you can easily revert back to the original group configuration if you decide to.
2. Look at any underutilized EC2 instances and make them smaller. If they are very underutilized, consider moving what they do to Lambda, which is very cheap. (A sketch of the utilization check is at the end of this comment.)
3. S3 for everything storage. Only use EBS volumes when you need to.
4. Someone mentioned multiple availability zones and it's a great tip. Turning it on is 2x the price for RDS, and it's probably not needed for anything but critical instances.
5. Make sure unused instances are terminated, not stopped.
6. For very expensive systems, use spot instances to strategically reduce your costs when EC2 instances are cheaper.
7. Try to keep your instance types down to a few types. This will give you fewer headaches when you go to purchase reserved instances.
These tips only helped cut our bill before we launched, when this kind of thing still made up a significant share of it; once we started spending more and more money on our production environment, they mattered less.
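The utilization check from item 2 can be as simple as this sketch, flagging instances whose average CPU has stayed low for two weeks (the 10% threshold and 14-day window are arbitrary assumptions):

    import boto3
    from datetime import datetime, timedelta, timezone

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    def underutilized_instances(cpu_threshold=10.0, days=14):
        now = datetime.now(timezone.utc)
        reservations = ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )["Reservations"]
        for reservation in reservations:
            for instance in reservation["Instances"]:
                # Daily average CPU over the lookback window.
                points = cloudwatch.get_metric_statistics(
                    Namespace="AWS/EC2",
                    MetricName="CPUUtilization",
                    Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                    StartTime=now - timedelta(days=days),
                    EndTime=now,
                    Period=86400,
                    Statistics=["Average"],
                )["Datapoints"]
                if points and max(p["Average"] for p in points) < cpu_threshold:
                    yield instance["InstanceId"], instance["InstanceType"]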
We need a couple of things:
1. A nice easy way to distribute load through cheap bare metal providers with peak load taken by cloud. Kubernetes is nearly there.
2. Somebody to build an LCD display that sits in the office and shows the projected cost of the cloud deployment at current usage. It's so easy to throw up a 40-node cluster and not think about the cost, especially where I currently work, a government-funded project where the taxpayer will pony up.
Observe your AWS footprint. Dump it as CloudFormation, or hook up a tool like http://hava.io or http://www.visualops.io/ to get a spatial representation of your infrastructure.
They both have powerful costing tools. Use this data in conjunction with your baseline performance metrics (memory/CPU/disk consumption) and see if there are any broad savings to be made by refactoring your platform.
If you have been steadily adding components (and their SG dependencies), it's likely there is no clear understanding of the end-to-end scope of your AWS footprint.
From all this data comes your Blueprint. You choose what's necessary: VPC, ELB, AZ, RDS, S3, R53 and so on.
Use something that is not the AWS console to define and capture a new, blueprinted environment and deploy it. See if it does the job by running your availability and performance suite on it.
Start doing Blue/Green Deploys so your costs are reflective of your application.
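For example, once the Blueprint lives in a CloudFormation template, spinning up (and later tearing down) a candidate environment for your availability and performance suite is just a few calls; a sketch with placeholder template and stack names:

    import boto3

    cfn = boto3.client("cloudformation")

    # Deploy a disposable copy of the blueprinted environment.
    with open("blueprint.yaml") as f:
        cfn.create_stack(
            StackName="blueprint-candidate",
            TemplateBody=f.read(),
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
    cfn.get_waiter("stack_create_complete").wait(StackName="blueprint-candidate")

    # ...run your availability/performance suite against it, then tear it down.
    cfn.delete_stack(StackName="blueprint-candidate")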
It will continuously replace your group's on-demand instances with the best-priced compatible (at least as big and identically configured) spot instance types, with minimal configuration changes performed by simply tagging the group. The group configuration is unchanged, so it keeps launching on-demand instances when scaling out or when replacing outbid spot instances.
The tool is open source and available on github and can be set up in a few minutes: https://github.com/cristim/autospotting
We moved to Google Cloud Platform.
Snark aside, we're continuing to be pretty happy about that decision, in addition to paying a lot less.
(This is such a big difference that I don't really understand it.)
These are usually the quickest and most common solutions to cost problems that I’ve seen. After that, you look for “minor gains:” storage (S3 instead of EBS, Glacier instead of S3 for infrequently accessed data), using reserved instances, moving to RDS (though reliance on the underlying OS can make this difficult, and many databases have it), autoscaling, running ALBs/ELBs instead of EC2 instances running HAProxy, CloudWatch alarms for things that exceed cost parameters, etc.
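The Glacier piece of that, for instance, is a one-off lifecycle rule rather than a migration project; a sketch with placeholder bucket name, prefix, and day counts:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-archive-bucket",  # placeholder
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-infrequent-data",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Move objects to Glacier after 90 days, expire after 5 years.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }]
        },
    )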
One of the biggest cost issues that I’ve seen is people treat AWS like an internal PaaS cloud (like vSphere) and move their ways of maintaining bare-metal hardware/software into AWS (i.e. “lift and shift”). It saves time in that you don’t have to actually think about how your application works in a cloud world and gives managers that “I did the impossible and moved us to the Cloud in three months” badge/bonus, but it costs SO MUCH MORE to run and really does a disservice to what AWS can do. (Cynically, I guess this creates another carrot: “I did the impossible and saved us tons of money by fixing the cost explosion from lift and shift!”)
Breaking that culture down is paramount to efficient usage of any cloud in a way that won’t break the bank.
Project 1: We reduced our EC2 bill by about $10k a month by moving to spot instances for our Continuous Integration servers
Project 2: We saved another $10k a month by moving to CloudFront for our downloads, instead of using S3 directly
Project 3: Finding and eliminating waste from years of tech debt. Saved $5k a month so far.
As we go we need to spend more and more time to find the same amount of cost savings, it's a law of diminishing returns.
I'll also add, do not keep large EBS volumes around. If you need to store data forever, get it moved to S3.
Use VPC end points for S3 to save cost.
If you are using NAT gateways and your app talks to S3 (this probably applies to DynamoDB as well), you could be saving $$ by running that traffic through a VPC endpoint, where the bandwidth is free, instead of through the NAT gateway, where it's paid.
Our community uses this information to reduce the waste at the software, middleware and AWS level. e.g. configure Auto Scaling groups to scale upon reaching a custom ELB Target Trigger equal to your measured concurrency.
I wrote an article about this specific ELB use case a couple weeks ago, at https://medium.com/@stacktical/how-to-lose-money-with-the-ne...
Maybe it can give you some ideas.
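As a sketch of the Auto Scaling configuration mentioned above (the target value would come from your measured per-instance concurrency; the group and resource label are placeholders), a target-tracking policy on ALB request count per target looks like:

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",  # placeholder
        PolicyName="track-requests-per-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>, placeholder
                "ResourceLabel": "app/web-alb/0123456789abcdef/targetgroup/web-tg/0123456789abcdef",
            },
            # Set this to the concurrency each instance can actually sustain.
            "TargetValue": 1000.0,
        },
    )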
I’m working on https://www.dormantbear.com, which is an AWS scheduler to turn instances on and off at certain times of the day.
I use it for my own VPN box which I only require during office hours.
The main commercial use case is for turning off staging environments when not needed during out of office hours. For setups that are fixed and don’t spin up test environments on the fly.
For security, you should provide AWS creds that can access only your staging boxes, with permission to list instances and turn them on and off, even though the creds are symmetrically encrypted.
The service is free as I have not written billing or marketing pages yet.
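Under the hood, this kind of off-hours scheduling boils down to something like the following (not dormantbear's actual code, just the shape of it; the "Schedule" tag name is hypothetical):

    import boto3

    ec2 = boto3.client("ec2")

    def set_office_hours_state(start: bool):
        # Find instances opted in via a (hypothetical) "Schedule" tag.
        reservations = ec2.describe_instances(
            Filters=[{"Name": "tag:Schedule", "Values": ["office-hours"]}]
        )["Reservations"]
        ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if not ids:
            return
        if start:
            ec2.start_instances(InstanceIds=ids)   # e.g. 08:00 on weekdays
        else:
            ec2.stop_instances(InstanceIds=ids)    # e.g. 19:00 every day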
* We (Userify plug: SSH key management) identified our most expensive activity (in our case, ~70% of our bill was simply bandwidth) and then switched to external or third-party resources where possible, but..
* After we went through all of that effort, we mentioned it to our account manager who worked with us to resolve it; wish we'd just started there and saved a good bit of time!
* Switched historical real-time data storage to archival mode (ie a quick script to extract old data, compress, and save to S3). This saved a lot in Redis/Elasticache memory fees which were growing extremely quickly.
* Disabled detailed monitoring on autoscaled instances where we didn't need higher-resolution data.
* Removed NAT instances from VPC's where they weren't needed.
* Found and retired old snapshots and volumes (they're hard to track down, but this exercise is worth pursuing, since they seem to multiply like rabbits; a starting-point sketch is at the end of this comment). Far less expensive but easier to track down are unused Elastic IPs and Route 53 zones.
* Don't launch an instance without tagging it, if at all possible. You should use tags to group instances (and other resources) and provide some way to track down the instance owner. Often people are reluctant to turn something off because they don't know if it's still in use. If it's critical, it should be clearly labeled so.
* Look through seldom-used accounts and regions for left-over instances that are still running. Turns out there's often a lot of these.
* Look at CPU/RAM/IO etc for larger instances and decide if it can be reduced without affecting user experience. This is especially effective with dev or internal boxes.
* Stop (but do not terminate!) all dev/non-critical boxes when outside normal business hours. This can be done with a route 53 record for the box (to prevent needing EIP for each one) and a tag that indicates the box can be stopped when needed. It's amazing how many project boxes just keep running and costing $. This is also the case if you're not sure if something looks unused and can be terminated safely; stop it and see if anyone screams and add a tag with your contact info. (corollary: make sure critical boxes have uptime monitors, cloudwatch, etc!)
* Don't front-end static resources with CloudFront if latency and scalability are not going to be an issue for them; this reduces bandwidth/tx/region costs by at least half.
* Try to avoid cross-region (or cross-account in the same region) or even cross-AZ bandwidth charges. Avoid multi-AZ builds unless critical infra.
* Watch the per-transaction fees, especially on things like Lambda, but also on frequently updated or written items in S3.
* Pay attention to those tiny fees that come along with certain types of AWS technology that can add up huge over time. (esp DynamoDB and Lambda; however, these can be awesome for fast-to-develop low-utilization projects)
* Use classic ELBs instead of ALB's if you have a lot of short sessions instead of fewer long-running sessions. Also become familiar with NLB's, which are very useful in some circumstances.
* NAT's can be a very significant (esp in bandwidth) and hard-to-surface cost... Delete them if you don't need them.
* Learn CloudFormation and use it to build clearly labeled dev and production stacks. The great thing about this is that when you're done, you can just delete the dev stack (which cascade deletes all of the dependencies) and launch a prod stack in a different account. CloudFormation has gotten a LOT easier to use with YAML support, and a LOT faster at spinning up. Start with something small and work up from there.
* Keep the prod account separate (and backed up as offline as possible, preferably on your premises, or in another cloud). Only let a few trusted people manage prod, and anything outside of prod is fair game to get turned off. Ideally, give each dev their own sandbox account, and use tools like Userify (plug!) and third-party IAM roles.
* Lastly, if costs are really too high for you, also look at AWS Trusted Advisor, third-party tools like Cloudability or Teevity Ice, and of course talk with your account manager after gathering data about where your biggest costs are. They have a lot of power to help you and they often really do have your best interests at heart.
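For the snapshot/volume hunt mentioned above, a starting point is just listing what's unattached or old (the one-year cut-off is an arbitrary assumption):

    import boto3
    from datetime import datetime, timedelta, timezone

    ec2 = boto3.client("ec2")

    # Volumes not attached to anything are pure cost.
    unattached = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]

    # Snapshots you own that are older than a year deserve a second look.
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    old_snapshots = [
        s for s in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
        if s["StartTime"] < cutoff
    ]

    print(len(unattached), "unattached volumes,", len(old_snapshots), "old snapshots")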
What are some of the big frustrations people have found when it comes to monitoring these costs?
When it hits the backend, if it's Lambda functions, you can also save a lot by making sure you don't have wasted resources. The same idea applies to databases; DynamoDB is very cost-effective. Usually, the less you manage or provision by hand, the more you save.
tl;dr: Configure your reserved instances to be region-scoped; buy in a popular region; pay-as-you-go not upfront; sell your excess reserved instances (like you mentioned)
Like I am doing...I have £300K of annual cloud and dedi hosting costs spread over several providers so I reached out to the sales team last week. I'm sure they'll be rushing to call me any day now - I just hope it's not this afternoon as I'll be on a booked call with Google.
1. Autoscale with triggers based on utilization.
2. Don't use a full OS if you don't need one; consider containers that are extremely targeted to your needs only.
3. Services help, you don't need to build everything on your own.