It is unfortunate that cost management isn’t something most engineers keep an eye on regularly. Spinning up unnecessary resources, not cleaning up resources once they’re no longer needed, writing inefficient code, etc. all quickly add up to hundreds of thousands of dollars per month in big companies.
I once found a “test” db cluster from an engineer who hadn’t worked at the company for 3 years. We were paying $300k yearly for it before discounts. It took me a literal click to shut it down. And I’m not proud of it, but I had to send out an org-wide email on the savings achieved (corporate politics :shrug:).
The huge achievement of Amazon was designing a system, and selling it to people, where developers no longer had to pre-approve spending. Previously developers were hamstrung by purchase order requirements; it could take weeks to authorize a single computer. Now the pendulum has swung in the other direction: developers can spend unlimited amounts of company money without realizing it, billed in arrears.
And in many cases this is a huge net win! After all, there's another way to waste company money invisibly: design a process which requires meetings and waiting while work is held up.
Overall good points, but don't forget that pre-approval processes resulted in asking for resources that exceeded near-term needs, and once approved, ongoing costs were rarely fully reviewed. I have personal experience with "enterprise" clients making a huge, months-long process to get server resources, reminding us that changes would take 30+ days. When the project was over and we did everything we could to let them know that the servers could be spun down or put to other uses, we got back an "ok thanks!" only to find them still running our project code YEARS later. This is infra that was costing them about 1 engineer FTE per year, not even a $10/mo toy env.
Yes but it is up to a company to control its spending. It must have a process and policy in place to deal with this. It's not Amazon's fault if it hasn't.
That's the whole ploy with Agile, isn't it? In the classical SDLC or Waterfall paradigm, everything was pre-approved and signed off, not just the cost or billing but even the software design itself. Any change in the process and the designers had to raise a change request. Agile changed all of that and now we know how bad things can get with that.
No, Agile is about tight development loops. When weaponized/used by large corporations it often turns into a Stay-Puft man of sorts, but as someone who _hates_ processes normally, I actually kinda like it, and Kanban when done well.
It basically helps keep things clean with decomposition and doesn't necessarily hamstring older devs as much while giving a good guideline for younger devs to work in. All things considered, it seems like not a bad system to me, and the team customizing the process to their own needs is nice as well.
There's a million ways for it to go wrong, but it's not too terrible on the whole I thinks. <3 :"))))
There is no such thing as "no process"; it is something you always have whether you talk about it or not. The often-heard "I hate process" is counterfactual then - what it really means is "I hate process that I see as intrusive/wasteful/whatever".
The positives you are listing are what comes from looking at how things are actually done and doing some of it a bit more thoughtfully.
The common negatives often come from stakeholders outside of development injecting their needs ... sometimes this is unavoidable (e.g. regulatory), sometimes it is just political, but either way there are better and worse ways to do it.
But see https://news.ycombinator.com/item?id=39092563 ; the cost of pre-approval can be pretty large, as is the cost of change once you discover that you've frozen in the wrong thing. Agile benefits both consulting situations where the client genuinely doesn't understand their own needs, and also startups where you're building a new product and need to rapidly iterate in response to market feedback.
(Escalating costs of pre-approval, and the need to design around every possible objection, are a big part of why physical infrastructure costs so much more in the West!)
I found that the problem happens mostly when companies
1. Don't ask developers how much something costs. Engineers love optimisation; getting as much as possible out of a system for cheap is great fun.
2. Lock down the UI, so devs can't even find out how much things cost. That's my current situation. Why block the billing dashboard, then expose it through billing dashboard tools that are not really any better, and in many ways worse?
It's rhetorical really, as I know why. Terrible architecture from "enterprise". Stick everything in a single account so it's hard to figure out how much of the spend is yours. All 3000 databases, and make sure your k8s cluster is 5 8XL boxes so no one can scale down excess capacity.
> 2. Lock down the UI, so devs can't even find out how much things cost. That's my current situation. Why block the billing dashboard, then expose it through billing dashboard tools that are not really any better, and in many ways worse?
This is so true. Billing transparency is very important.
In the past, I had a case like this: a dev accidentally enabled a backup policy for a test database with no retention. FinOps thought the DB backup was important and ignored it. The dev had no access to billing and had no idea what was creeping up the bill.
> getting as much as possible out of a system for cheap is great fun
In certain circumstances, absolutely, however it's extremely aggravating to be in the position of being constantly pestered to ship features faster without the authority to overprovision some of the infrastructure the software runs on.
Waking up in the middle of the night because we saved money by allocating too little disk for the primary database or because the latest release included new dependencies that increased memory usage and the OOMKiller is picking off web servers like a wolf in the lamb's pen, or we're just swapping our way to hell while web requests 502...eh. Not for me.
More visibility into costs, though, absolutely agreed. Engineers should know that when they turn on some new cool serverless gizmo and then forget about it, it's costing $ each month.
I didn't mean that engineers love having no control over their systems, I just see the labours of love that get posted here about getting nginx to throw out 1000 pages a second on an Atari 800, or getting LLMs designed for $2000 GPUs running on a phone.
The question should be: we currently cost $X a month and we need to halve it because [reason], what can we do to bring it down? Which might be reducing hardware, or maybe something else, might be both. Puzzles can be fun.
I get paid the same whether or not I spend time going around saving the company money. It’s not like they’re going to share the savings; the shareholders get their juice, management gets their juice, and I’m the clown who went out of my way for them. Who cares.
I do software not cost management
Oh, and with inflation that’s awesome. You know how many interviews I’ve had with engineers saying they were at a place for 10 years, but the company had a raise cap of 2%, then saying that they wouldn’t hire me because I couldn’t give assurances to be a like-minded clown making corporate rich.
This is fun, but these policies are based on the fact that your managers don't care about you and in fact prefer you remain as poor as they can legally make you. So any kind of reasoning along these lines ("well this makes sense") just is not real and won't ever become real.
I consistently tried to push for cost management at my last job, but the product manager just wanted to push new features he could show off to management above him. We let costs inflate to ridiculous levels despite my constant discussion around the topic. Software engineers ultimately had no say in the matter.
I was laid off this week in a mass layoff because the company doesn’t have enough money to pay all of us anymore. It’s disappointing to see, and I wonder how many other teams ignored these optimizations and how much unnecessary total cost it all summed to.
I had to implement a 2nd deploy for a QA environment, and my first question to the infrastructure team was “won’t this be costly? Is there a better way to handle this?” They shrugged off the cost and said they would optimize my deploy once I was done with the initial implementation. 6 months later their optimization was to undo all the work, not because it wasn’t a good implementation but because it revealed how much non-optimization had gone into the QA environment before I even touched it. A lot of cost is probably due to the “we just taped these two things together” strategy for lower environments.
In the organization that I work in, costs are transparent to everyone involved and most people are aware of the need to keep costs as low as possible.
One of the downsides with this approach is that engineers/developers are not very good business people and don't really understand the notion of "the cost of doing business". And from time to time we have issues with "but it costs $70 more per month", and spend $1000 to optimize those $70 :)
In the end, even with some of the wrinkles mentioned above, it helps and saves money when costs are transparent and readily available for anyone.
I think "engineer" isn't really the correct word to use for the artisans who build much of the tooling used by most companies.
An engineer either wears a striped hat and drives a train, or went to a credentialed school and passed a bunch of tests and is allowed to sign documents that state "this thing, if built this way, won't collapse and kill people."
It is expected that an engineer can predict with reasonable accuracy the expense and timeline of a project, and how to maintain the resulting thing, without resorting to voodoo like "scrum velocity." In large part that's because engineers stick to doing things that are well understood and predictable, and if there's risk they resolve the risk before undertaking the project. (Is there bedrock over here upon which to build a foundation? I don't know; let's find out first!). Sure, there are engineering disasters even today -- buildings that unexpectedly lean over and door/wall things that unexpectedly fly off the side of airplanes, but those are typically organizational / process problems not "engineering doesn't work" problems.
“engineers stick to doing things that are well understood and predictable”
I’m calling BS on this. If this were true, we’d still be a ground-bound species. Engineering has been and will always be about creating something electrical, mechanical, computerized, or all of the above, that solves a problem. Understood or not. Engineers are not oracles. They cannot predict whether a tower built in Italy will eventually begin to lean due to erosion. They cannot predict that a steel beam rated for 300T of force will break at 180T. They cannot predict a rogue developer removing a package from underneath their dependency tree.
You can give estimates all you want but you are still guessing.
If engineers were as you say they are, we would never have delays, we would never have traffic jams, we would never have crap software, we would never have flight.
“Engineering is the art of modelling materials we do not wholly understand, into shapes we cannot precisely analyse so as to withstand forces we cannot properly assess, in such a way that the public has no reason to suspect the extent of our ignorance.”
- Dr. AR Dykes
Engineers manage risk and cost. They certainly make mistakes, like those couple buildings that are famously leaning over in SF and NYC, or the Citicorp Center, where they got the wind shear loads wrong and had to hot-patch the building.
But looking at the malarkey that goes on in "software engineering" or whatever -- clearly not engineering, at least not where I've seen it.
Engineering: a process of repeatably solving an understood problem predictably.
Craft: a process of solving an understood problem.
Science: a process of solving a problem without an exactly understood outcome.
Art: a process of working.
These are all made-up definitions.
I'd expect a software engineer to give me a system that locally caches and verifies distribution artifacts and validates changes -- a craftsperson who gives me a tool chain that yeets goo from the internet and builds on that without validation is not, in fact, an engineer. They could be quite practiced at the art of building working systems, but they're not managing risk....
What makes software engineering special is the systems are more complex and are cheaper to test and break. You get a completely different engineering culture when you can roll back a bad change after seeing it fail during the canary push. That, and what's usually on the line is money, not life. I'd feel a lot better making a $1M mistake than making a mistake that killed someone.
> An engineer [] went to a credentialed school and passed a bunch of test and is allowed to sign documents that state "this thing, if built this way, won't collapse and kill people."
Ahhh - that old craptacular definition. You completely ignore mechanical engineers, chemical engineers, electrical & electronics engineers. Not all engineers make bridges.
Secondly, the implied cause and effect even within civil engineering is a fantasy. Signatures on documents by credentialed engineers doesn't prevent disasters as you noted: Bridges fall down, buildings burn. Read the engineering reports on civil engineering disasters, and look at the consequences for the engineers involved.
You do some handwaving about organizational/process problems, but actually that is the key to safe engineering. Organisations deliver engineering projects and they do it across jurisdictional borders using insurance and liability and with a variety of other means that work: "signatures don't prevent disasters".
Lockheed Martin's skunk-works and SpaceX are real engineering. Any good definition of engineering needs to encompass an extremely wide variety of activities.
I would like to know the psychology behind why people wish to believe credentialed signatures are so powerful? Maybe a cross between two concepts #1: "that individual engineers run the world" and #2: "that retributive punishment of individuals works as a deterrent". I think concept #1 comes from the egotist idea of most engineer-types that we are the center of everything (I need a whole article to explain the concept). I think concept #2 is related to beliefs about the value of incarceration and also punishment beliefs derived from religion (especially in the USA where prisons are not fixing problems?).
Edit: issue #3: the idea that we should make rules about what words mean. It takes a certain worldview to think words should be defined rather than evolve (or worse that words should be part of a justice system)
> Ahhh - that old craptacular definition. You completely ignore mechanical engineers, chemical engineers, electrical & electronics engineers. Not all engineers make bridges.
I suppose you've got an engineering degree in pedantic engineering? Engineers manage cost and risk. The skunk-works stuff is marginally "science" not "engineering" given the relatively large budgets and relative lack of "we know this works." Cern is similarly an enormous engineering enterprise in that it's a huge stack of "we know this works" in service of "we're not sure what this will do"
A discussion of how "software engineers" deliver projects with neither cost nor risk as part of the process implies, to me, that they're not engineers.
You are the one trying to push your definition of engineering.
I provided counter-examples that show engineering encompasses a lot more than your definition.
I simply don't understand why anyone thinks writing software is somehow uniquely not "real" engineering. Somehow we are indoctrinated to believe that it isn't but all the evidence seems to show software engineering is a valid description.
I have no lack of experience watching the fuck-ups made by electronics engineers, or the fuck-ups made by mechanical engineers. You appear to want to define engineering only as certified civil engineering. And I've seen enough of their fuck-ups too, with signatures. In fact I'll ask my bridge engineer friend from uni about it! Unfortunately my bridge building grandad is dead so I can't ask him.
The vast majority of people I encounter with some "engineering" title, in software (or the related "Architect") are in fact not trained as engineers or architects, in any field.
A site reliability "engineer" or a software "engineer" is not an actual engineer just because they've got that in their title or job description. If I were to hire a "chemical engineer" position and instead hired a chemist, or a mechanic, or a rando who's cooked meth, I may end up with things working okay, or I might end up with a serious mess, even if those people I hire call themselves "engineers" (but in fact have no formal training as such).
I'm not sure to what degree credentials matter, but do credentials matter more than "not a god damned bit" ?
I'm not saying the title makes you "not an idiot" -- people gonna people -- but attention to "cost" and "risk" is (theoretically) one of the distinguishing characteristics of engineering training vs ... "mather" or "programmer" or "philosopher".
Yeah, the debasement of meaning is annoying - vice-president is one I hate. Another one that surprises me from the US is "licensed nail technician".
I have a bachelor of engineering title I can use with my name, but that is another distinct type of bullshit.
In New Zealand one relevant legal certification is CPEng which you can apply for after receiving your degree and working for a few years: https://www.engineeringnz.org/join-us/cpeng/ And apparently our government agreed in 2022 to introduce a new licensing regime for engineers doing safety-critical work.
But in an international world, how relevant are certified individuals? When I purchase a stove from a US brand and it catches fire, there needs to be other liability/retribution/corrective systems to deal with the problem. It matters little to me who signed off on the product in the US.
Can I import custom structural steel beams? How many New Zealanders have signed off on this steel construction: https://ccc.govt.nz/the-council/future-projects/major-facili... We need a new stadium because the last one broke. Unfortunately it wasn't insured due to some cockup at the city council (which I suspect had zero retribution on the people that cocked up - I wonder if they signed bits of paper?).
Over-credentialisation is a problem too - where is the right balance? The shift to everyone needing credentials is fucked. My friends (nurses, teachers) literally weep at the absolute trash they have to "learn" for their credential. I also vividly remember the crap I needed to disgorge to get my degree.
I don't know what the answer is, but I honestly believe most credentials are pointless waste and adding more credentials is not actually effective. Neither do I believe that the anarchy of libertarian free markets is a workable answer.
And that's how insignificant the costs of cloud providers are in the grand scheme of things. It's a lot of money to a bootstrapping startup, but for the vast majority of these cloud providers' customers, it's a rounding error that's easily forgotten for 3 years.
And that's precisely why you and your little bootstrapper or indie firm should not be using globocloud: you do not have mountains of cash to piss away. Bare metal is trending again. And in this downturn, it's no wonder why. Smaller companies are getting smarter and more efficient. They've decided to chase money instead of cargo cults.
Globocorps burning cash on globocloud is not a signal for small fish to do the same - it's a signal to do the polar opposite. You're not going to become like them by copying what they're doing now. It will not work for you. Globocloud isn't successful because they shovel cash into AWS's shredder, they shovel cash into the shredder because they're successful.
You'd be surprised. I've seen AWS bills well into the 9 figures. It's just that fixing expensive designs is, in itself, quite expensive, and many of those very large corps have hiring practices that don't allow them to compete for the top of the market. Sometimes there are tens, if not hundreds, of millions in savings a year, but corporate sclerosis makes it very difficult for broad cost-saving initiatives to be identified and approved.
It's the same issue in any large organization: Large levels of success somewhere allow for large levels of waste somewhere else, but often the waste is not required for the success to exist: The success just makes the organization complacent.
Well, it’s definitely fathomable. Does my employer have cost control baked into their processes, tooling, and culture, or are they rushing me to get projects out the door leaving barely enough time to make sure they’re production-ready?
Most places I’ve worked had no formal production readiness review before launching infrastructure.
> It is unfortunate that cost management isn’t something most engineers keep an eye out for on a regular basis.
That’s because they were explicitly told not to worry about costs for the last 10 years, so the majority of ICs at this point have never had to do it in their entire careers.
What I've observed is that people don't really keep track of what they are spending. I like to set up weekly newsletters that show costs and whether there has been a decrease or increase. In bigger corps, you also should have team-based tagging of resources so that specific teams see exactly what they are spending. At the very least, managers will look each week and be like "why did costs increase this week? What's going on?" even if the engineers don't care. "What gets measured gets managed," as they say.
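Something like this is enough to drive that kind of weekly report - a rough boto3 sketch, assuming a hypothetical "team" cost-allocation tag and read access to Cost Explorer:

    import boto3
    from datetime import date, timedelta

    # Hypothetical setup: resources carry a "team" cost-allocation tag.
    ce = boto3.client("ce")
    end = date.today()
    start = end - timedelta(days=7)

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )

    # Sum the week's spend per team so the newsletter can show week-over-week deltas.
    totals = {}
    for day in resp["ResultsByTime"]:
        for group in day["Groups"]:
            team = group["Keys"][0]  # comes back as "team$<value>"
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[team] = totals.get(team, 0.0) + amount

    for team, amount in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{team}: ${amount:,.2f}")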
I tend to have the opposite problem. I obsess over the cost of things, and am pretty bashful about bringing it up to my manager, and he's always surprised that scaling some resources doesn't cost more. But I learned the hard way as a contractor about letting these resources run crazy and had to pay out of pocket, so I have PTSD about it, which is why I'm vigilant.
This is entirely artificial: I now work at a company where we know very clearly what our infrastructure costs. Yes, we know the exact costs (what was negotiated, not what is on the public pages).
And we celebrate cost slashing as much as feature delivery and other stuff.
But this is entirely a management problem: at my previous job, only one manager (skip-level manager from my point of view) knew what exactly were we paying for infrastructure.
That moron wouldn't share that information with us engineers managing the infrastructure, of course, so there were a lot of infrastructure choices that didn't really make sense according to the public prices but (I guess?) made sense according to a price sheet we didn't know.
So we didn't know what we were spending, didn't have the basic data to estimate the price of a new solution or a new service, and didn't have the data to determine how much we would be saving by making changes (optimizing stuff etc).
I fought that battle for a bit but then I just said "GFYS, I'm not going to have fights with you so that you can save money" and let go. Later I left the company completely.
Former colleagues tell me it's even worse now: there are consultants from the cloud provider involved, they know the pricing deals, and whenever the topic comes up the manager shushes the consultant so that the engineers don't hear the prices.
tl;dr: it's an entirely artificial problem, and it's most likely a cultural/management problem.
edit: and I'm not even talking about incentives, as somebody else has correctly pointed out.
I recently helped save $150k per year by deleting node_modules.
I noticed that one of our S3 buckets had high data transfer costs, a bucket that our app downloads HTML+JS assets from when we push out a new release. I downloaded the "directory" of files for our latest release and saw it was mostly node_modules. I checked the code and confirmed that, yes, if this file exists in the bucket then it'll be downloaded by the user. I wrote a quick Python script to list out each directory that had this problem, and a quick Slack message to the appropriate team later, we discovered the specific commit that was the cause, a change to our CI that inadvertently uploaded that directory when we wanted to ignore it.
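The script itself was nothing fancy - roughly this shape, with the bucket name and prefix layout obviously being placeholders here:

    import boto3

    BUCKET = "example-release-assets"  # placeholder name

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    offenders = {}  # release prefix -> bytes of node_modules accidentally uploaded
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if "/node_modules/" in key:
                release = key.split("/node_modules/", 1)[0]
                offenders[release] = offenders.get(release, 0) + obj["Size"]

    for release, size in sorted(offenders.items(), key=lambda kv: -kv[1]):
        print(f"{release}: {size / 1e9:.2f} GB of node_modules")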
A few months later, I checked the billing metrics: the effect was an average reduction of $12,500 per month for this bucket, or around $150k per year, or 4% of our bill. Not bad for one hour of work. Over the course of a quarter I reduced our bill by over $1m, or around 30% of our bill.
I might write a blog post explaining how to go about something like that. A lot of people are not familiar with tools like Trusted Advisor which can easily tell you if you have, for example, unused EC2 instances that can be terminated.
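If you don't have the support plan that Trusted Advisor's cost checks require, a rough do-it-yourself version of its low-utilization EC2 check can look like this (the 2% threshold and 14-day window are arbitrary assumptions, and low CPU is only a hint, not proof, that an instance is unused):

    import boto3
    from datetime import datetime, timedelta, timezone

    ec2 = boto3.client("ec2")
    cw = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=14)

    running = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]

    for res in running:
        for inst in res["Instances"]:
            iid = inst["InstanceId"]
            datapoints = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": iid}],
                StartTime=start,
                EndTime=end,
                Period=86400,  # one datapoint per day
                Statistics=["Average"],
            )["Datapoints"]
            if datapoints and max(d["Average"] for d in datapoints) < 2.0:
                print(f"{iid} ({inst['InstanceType']}) averaged <2% CPU for 14 days")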
Not sure yet, but probably nothing. I completely understand the expectations written in this thread to receive something in return, but I've given this thought and I'm not sure how to do this in a fair way in this situation. First, I was given dedicated time that quarter to work on cost savings and other people weren't, if I received a bonus is that fair to other people who didn't have the same opportunity? Not to mention the possibility of people abusing this process.
I would be happy to receive some extra cash, don't get me wrong, but I work for non-monetary benefits as well, and I have received some of those as part of this work. If I worked at a company with a different culture and I was being punished for doing the work, I would demand some bonus.
Just curious, do you feel the same way when someone chooses a job that pays less but has a better work life balance, or a shorter commute, or more opportunities for growth, or similar?
I was thinking about this recently. I work at a large company with untold millions in AWS spend. I'm 100% confident that I could shave a few thousand (maybe even tens or hundreds of thousands) off the bill with a little bit of effort on my side. If I go up the management chain and ask if (1) I can make this an official project and put it on the roadmap or (2) I can do this on my own time and keep some % of the savings for myself as a reward, the answer would be a very clear "no" to both. So overall, as an end developer I really have no incentive to work harder and ensure lower operating costs for my company, and I'm sure most developers in the industry are in the same position as me.
If your company has a spending commitment with AWS in order to get a few percent savings, and it's just barely hitting the contracted amount, it may not be worth the effort to pursue any cost savings. Suppose your company has committed to hit 5M in spend, and they're just barely inching over the line at 5.01M. You might spend a bunch of time and labor expense knocking it down to 4M of usage and not really move the bottom line at all.
> (2) I can do this on my own time and keep some % of the savings for myself as a reward
Theoretically, your company could reduce their commitment to 4M next year, but the AWS sales would start negotiating hard against it, like "you will not get the same discount with less commitment".
Not only will you not get the discount you won't get the commit at all, AWS only reups commits with incremental growth. You can spend less; you just don't get commit pricing.
Ironically, I was asked by a manager some time ago if we can imagine using some (more) resources from AWS to reach the next spending commitment. If you're just below the threshold, it's probably inconvenient.
EDIT: To give some context: you can only do this, of course, when you know that you're not really wasting resources, otherwise you end up with just burning money to save little to no money :)
Can you explain how this is perverse incentives? Wouldn’t this actually be a good way to align incentives of the employee and the company (it makes both of them want to save money)? Or do you mean it makes the employee increase cost unnecessarily from time to time so he can reduce it later?
Spending commitment on AWS? Are you referring to ec2 reserved instances?
Even if you have one-year reservations on your instances, starting service migrations/deprecations now would pay off quickly enough. On average, your commitment has about 6 months left to run.
When you're a large enterprise customer you get private pricing in exchange for committing to spending a certain amount over a few years. This is unrelated to reserved instances and such.
Part of it is that it creates an incentive to create wasteful systems, only to "optimize" them later to rack up a bonus. Even if it gets changed to only pay out for reducing spend incurred by other engineers, it's possible to collude in such a way to extract bonuses from the company.
A better way to have aligned incentives for the company and the employees would be to allocate a bonus pool for the entire company, from which AWS expenses are taken out of, but that might be a bit unorthodox.
> allocate a bonus pool for the entire company, from which AWS expenses are taken out of
Also a perverse incentive.
If we use ec2.small, the customer's query will take 3x longer but be half the price. Let's turn off the nightly security audits. We can live with quarterly backups, right? What do we need all these logs for, anyway? We could hack something that works together in 2 weeks, but if we spend 3 months, it could be really efficient, let's do that...
This is one of many insights that hint at why biz-facing cloud architecture is so popular, wasteful, and profitable.
The incentives are designed to form an enormous cash siphon. From aggressively marketing toward fearful & liable (or maybe just tech-cost-illiterate) upper-management to the silencing effect that the low-rung experts experience when sounding the alarm.
Companies make and spend so much money that this doesn't matter. Thousands, tens of thousands, even hundreds of thousands are pointless in comparison to the potential of building features. A developer costs about half a million a year (salary, bonus, RSUs, taxes, benefits).
If they are paying you to save tens to maybe hundreds of thousands, the company is losing money on you, so they won't do this.
If you're at a public company, look at your company's quarterly reports and see what it would take to make any kind of impact on net income.
Maybe there should be a sort of "anti-saas-sales" role: you get commission on whatever costs you're able to justify as superfluous. After all, the person at AWS makes commission selling you the stuff.
Unfortunately it seems like businesses only wake up and ask for cost reduction on the infrastructure spend side once the problem is out of control. At that point, the level of operations effort to get it under control feels more like a "big rewrite" than a collection of small tweaks.
Nah, that'd imply they think about this objectively. Most companies simply don't.
Most time this "opportunity cost" is then spent on useless hacky features that are never used and forgotten right after release (redesign anyone?).
Of course there are exceptions to all of these, but IME the majority of companies are either focusing on pennies or ignoring it completely. Not sure why you don't see balanced approaches more often. Maybe this will change with less VC money flying around.
That's why in our company we have 2 types of engineering effort: core project impact and improvement. Not all time is spent on impact. A balancing act is needed.
I've mentioned this before, in my company (big media company) I saw some S3 costs creeping up each month. I looked into it and it was a system we abandoned that was still copying files to this bucket.
I reached out to the team and they turned it off, it saved us $1m a year. The higher-ups rewarded me by telling me that a team should have caught this so I should meet with them now.
It's truly fascinating how companies won't bat an eye at spending ungodly amounts of money on things they don't need, but will sweat profusely at the thought of a tiny fraction of that going towards additional compensation.
Hopefully your company doesn't have its head up its ass and has basic things like logs setup. The vast majority of what I do in production is logged through AWS (Cloudtrail) or git.
I work for a company called CloudFix, and we are solely focused on AWS costs. We do automated AWS cost optimization. We find one of two reactions when we deliver savings to customers:
(A) "Hey wow, this is great! We are so excited to be saving from here on out." OR,
(B) "This should have been caught earlier. $TEAM was supposed to be experts..." and then blame game starts.
It is really unfortunate when institutions react in the latter way. Often the engineers are assigned to cost optimization, along with a million other things. And, the incentives aren't really aligned well to reward savings. For example, S3 Intelligent Tiering is the right thing in 99.9% of cases - so it should be your default bucket type. BUT, engineers often face only downside risk for the change, and very little upside reward. And, it isn't their money so they just leave it. The cost of overprovisioned S3 can be staggering!
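For illustration, the lifecycle-rule way of defaulting a bucket into Intelligent-Tiering looks roughly like this (the bucket name is a placeholder, this replaces any existing lifecycle rules on the bucket, and whether an immediate transition suits your access pattern is something to confirm for your own data):

    import boto3

    s3 = boto3.client("s3")

    # Transition all current and future objects into S3 Intelligent-Tiering so
    # rarely-accessed data moves to cheaper tiers on its own.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-assets-bucket",  # placeholder
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "default-to-intelligent-tiering",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to every object
                    "Transitions": [
                        {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                    ],
                }
            ]
        },
    )

New uploads can also simply set the storage class directly at upload time instead.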
What is really needed is to establish a proper FinOps discipline, put someone in charge of cost savings, and make sure incentives are aligned properly. And of course check out CloudFix if you can!
We work with a competitor called Vega, the product seems OK although the UI is very slow and confusing.
The biggest problem they have is that they have no business insight into what these costs are, or whether we can reduce the cost without any engineering effort or loss of performance.
One small issue I have as a developer who can spin up just about anything on AWS is this:
I have zero insight into the costs.
Yes, my company could turn that on for me but it's rare that they do so it's nearly impossible to know if I did something that costs a lot of money (relatively or in general) without access to the cost explorer/billing dashboard.
And before "well can look up what a t2.2xlarge costs and calculate it", sure. In a very contrived example I might be able to see what it costs but so many things are hidden/hard to see in AWS. For example, I recently spun up an RDS customer on my own AWS account. After testing for a while I decided it wasn't what I wanted and I deleted the cluster. Fast forward a month and my bill is well over what I expected (Like $30, no it's not a ton of money but it's my personal account and I wasn't expecting that charge). Come to find out it created a VPC as part of the RDS cluster (I think maybe it was for the RDS proxy? Still not sure) that didn't get deleted. I had to go chase that down and even that process wasn't easy. I had to make sure that it wasn't be used by anything else and then delete other things that were created when I made the RDS cluster before I could remove the VPC.
I was only able to do the above because I had access to the billing info. I would have left that VPC indefinitely on my work's AWS account by accident and been none the wiser.
I'm more than happy to take costs into account but without access to what things are actually costing us I can't help that much. Mostly because I need to know the costs to know what's worth optimizing. Sure I know I could improve X feature but if that costs us pennies a day (or month sometimes) then it's not worth it. Similarly if I know feature/infra Y is costing $XX,000/mo then I know I should rethink or investigate if that's correct/worth it.
> In the past, I had a case like this: a dev accidentally enabled a backup policy for a test database with no retention. FinOps thought the DB backup was important and ignored it. The dev had no access to billing and had no idea what was creeping up the bill.
Exactly, sometimes it's not clear at all what something will cost (and/or if the costs will go up). I'm happy to glance at the monthly costs here and there and if I see a jump I can dive in and see where it's coming from. We all make silly mistakes, like leaving logs on infinite retention in CloudWatch, and that's something I can easily fix/address but only if I have the info.
I've asked, off-hand, a couple times for billing access but nothing has come of it. I don't want to seem pushy but also it feels like data I need to perform my job to the best of my ability (especially at a small company). I don't think it comes from a place of "We don't want to give Josh access" or secrecy as much as it not being a priority but I need to bring it up again.
I’m very aware of that tool but it’s far from perfect. I’ve spec’d things out on that then seen very different prices when I actually create things in AWS. In part because the tool doesn’t take some things into account or because sometimes it’s impossible to guess your usage for a new feature.
I don’t believe the VPC was factored in when I used that calculator, even after selecting RDS Proxy.
I’m convinced that once a company reaches ~$10m/year in AWS spend it becomes entirely reasonable to hire an in-house engineer whose sole job is to find cost savings opportunities. Literally a “find unused stuff and turn it off” engineer.
I've spent some time doing this. There are always old systems people don't really understand, ownership is poorly defined, and no one knows what happens if you turn it off. It's archeology. Understand what the system is doing and how it interacts with other systems and the business. If it looks unneeded, back it up, stop the VM, wait and watch for fallout, and eventually terminate it.
There’s definitely a science to it. To complicate matters, the way you explore those connections, take backups, identify owners, and perform restores is different across pretty much every cloud service.
Readers may find Steampipe's [1] AWS Thrifty Mod [2] useful. It will automatically scan multiple accounts and regions for 50 cost saving opportunities - many of which are looking for over-provisioned or unused resources. For example, it's crazy how much you can save by doing things like just converting your EBS volumes to the newer gp3 type. Combine with Flowpipe [3] to automate checks and actions. It's all open source and extensible.
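As a standalone illustration of the gp3 point, a rough boto3 sketch of the in-place conversion (you'd probably want to review the list before letting anything actually modify volumes):

    import boto3

    ec2 = boto3.client("ec2")

    # gp3 is typically ~20% cheaper per GB than gp2 and includes a 3000 IOPS
    # baseline, so most gp2 volumes can simply be converted in place.
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
    ):
        for vol in page["Volumes"]:
            print(f"Converting {vol['VolumeId']} ({vol['Size']} GiB) gp2 -> gp3")
            ec2.modify_volume(VolumeId=vol["VolumeId"], VolumeType="gp3")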
It is interesting to note that the author works at VPBank, which is one of the larger Vietnamese banks. Saving $150k per year on an AWS bill is really nothing to them.
The fact that they even outsource their compute to AWS is kind of surprising when they could just fill up their existing data centers (like VNTT https://vntt.com.vn/) with equipment, and save a whole lot more money.
And it's also interesting that they can outsource their compute to AWS because AWS's nearest data centers are in Hong Kong & Singapore. I didn't realize a bank would allow that.
I thought it, but I wasn't going to say it. Vietnam's internet connection is notoriously unstable. The running joke is that sharks attack the fiber connections [0], because pointing fingers is a national pastime. The fact that a major bank is relying on an external AWS like that makes it even more comical.
My guess is that nobody in corporate approved this guy's post, and if word got back, it would disappear quickly.
Reminds me to forward this to my buddies who run Timo, which VPBank used to own, but then dropped [1]. Timo was the first forward thinking bank in Vietnam with a great tech platform, likely because it was started and run by foreigners... ¯\_(ツ)_/¯.
> The best optimization is simply shutting things off
This is the way.
A similar idea has been bouncing around in my mind for a while now. An ideal, turnkey system would do the following:
- Execute via Lambda (serverless).
- Support automated startup and shutdown of various AWS resources on a schedule influenced by specially formatted tags.
- Enable resources to be brought back up out of schedule when demand dictates.
- Operate as a TCP/HTTP proxy that can delay clients so that a given service can be started when it is dormant or, even better, the service isn't serverless but you want it to be. This can't work for everything, but perhaps enough things such that the need to run always on services is reduced.
Cloud Custodian [1] can purportedly do some of this, but I've been reluctant to learn yet another YAML-based DSL to use it.
So this is my "make things designed to be always-on serverless instead" project and the work AWS has done to make Java apps function on Lambda keeps me thinking about the potential to take things that 1) have a relatively long startup time and 2) are designed to be long running service loops, and find a way to force them into the serverless execution model.
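A bare-bones sketch of the first two bullets - assuming a hypothetical "autostop:schedule" tag like "08-18" (keep running 08:00-18:00 UTC) and an EventBridge rule invoking the function hourly; the proxy piece is the genuinely hard part and isn't shown:

    import boto3
    from datetime import datetime, timezone

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Stop or start EC2 instances based on their hypothetical autostop:schedule tag.
        hour = datetime.now(timezone.utc).hour
        paginator = ec2.get_paginator("describe_instances")
        for page in paginator.paginate(
            Filters=[{"Name": "tag-key", "Values": ["autostop:schedule"]}]
        ):
            for res in page["Reservations"]:
                for inst in res["Instances"]:
                    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                    start_h, end_h = (int(x) for x in tags["autostop:schedule"].split("-"))
                    should_run = start_h <= hour < end_h
                    state = inst["State"]["Name"]
                    if should_run and state == "stopped":
                        ec2.start_instances(InstanceIds=[inst["InstanceId"]])
                    elif not should_run and state == "running":
                        ec2.stop_instances(InstanceIds=[inst["InstanceId"]])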
> Operate as a TCP/HTTP proxy that can delay clients so that a given service can be started when it is dormant or, even better, the service isn't serverless but you want it to be. This can't work for everything, but perhaps enough things such that the need to run always on services is reduced.
My team mostly builds internal stuff and we save tons of $$$ by using Knative + Karpenter, which basically does that on container + EC2 levels.
Everything I've built in AWS is strictly serverless. You can do an incredible amount with a clever DynamoDB pay-per-request setup, S3 and CloudFront. I haven't once felt the need to reach out to EC2 or RDS and I can't imagine building any sort of control plane to spool them up and down for me.
This is especially easy if you can shutdown environments that are only used for dev/staging tasks. With 168 hours in a week - how many hours do those things need to be running?
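Rough arithmetic on that, assuming a dev environment only needs to be up 12 hours a day on weekdays:

    # 12 h/day x 5 weekdays = 60 running hours out of 168 in a week.
    hours_needed = 12 * 5
    always_on = 24 * 7
    print(f"{1 - hours_needed / always_on:.0%} of the on-demand compute avoided")
    # -> roughly 64% saved, before you even touch instance sizes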
I run a little tool for Heroku to make it easy to do this kinda thing.
This assumes they have something like RIs etc. for those resources. Those are typically used for production, but far too often dev/test resources are just turned on ad hoc.
This is why you should start the conversation with "I have drawn up a plan to save the company $14M per year. I will execute this plan in exchange for $7M upon completion."
If they say no then just go back to your regular duties.
Very few companies would make this deal at 0.7M, 0.07M, or even 0.007M. Directly sharing a % of savings with the responsible employee is simply not the way most companies work. Consultants, now, that's different...
But was that your job? Because if so, you really got salary+bonus. And if you'd found nothing you'd still have got salary. So you can look at it several ways.
I worked adjacent some telecom consultants in the 90s whose income was solely driven by a percentage of cost savings they could trim from telephone bills. Seemed like a very brash business model but they clearly knew there was gold to be mined.
I keep thinking I should be doing "cloud optimization" work and being compensated this way. Slicing and dicing output from usage/billing APIs and providing an "optimized spend" probably has the potential for a lot of low hanging fruit.
As someone who's been doing this for the better part of a decade: it's a mirage. No client is going to sign for a "percentage of savings" model when it comes to cloud. Believe you me, I've looked at this up, down, and sideways; neither the math nor the psychology work out the way you'd hope.
I saved my previous company $4000-5000/mo on AWS billing just by auditing the AWS account, turning off unused machines that old devs had spun up, and deleting hard drives after backing them up to S3, just in case. No one had really even bothered looking at it for years, and I did it in my "free time" at work without being tasked with it.
Ironically, I asked for a raise a year later and was denied, despite single-handedly saving the company nearly $50,000/yr. The raise I asked for wasn't close to the cost savings I had brought. I left the company shortly after.
I saw someone else have a similar experience here and a comment to it was saying rewarding this produces a bad incentive...well, honestly why would I have even bothered cutting costs if I felt I wouldn't be rewarded? Not rewarding it just makes me half regret doing it at all.
Looking forward to the kubernetes one - Most kubernetes clusters are designed for high availability, not necessarily for being able to quickly spin up/down and there’s often a lot of hidden complexity there (at least on aws).
Have you done this or attempted this yet? Every kubernetes cluster is different, but in my time working with them the last several years I anticipate the following issues:
- dependent services not coming up in the order you expect/want
- issues draining nodes due to crashlooping/erroring pods (can also be caused by dependent upstream services going down in wrong order)
- Persistent Volume retention/synchronization
- IAC not cooperating
- Configuration annoyances with deployments’ availability/replica settings
- Thundering herd types of problems
I can think of tons of things that can make this extraordinarily difficult. I’ve had many managers over the years pitch this idea of “rapidly deployable/destructable EKS clusters” and the projects always get killed due to the complexity around this. IMHO they simply aren’t really designed for this type of thing, however, I could be misunderstanding exactly what you’re trying to do.
Automatic reconciliation is like half the reason Kubernetes exists. A well-designed system should handle this and not have expectations about ordering, for example.
I’ve seen several clusters where one could kill more or less everything and it would just come back again.
> I’ve had many managers over the years pitch this idea of “rapidly deployable/destructable EKS clusters” and the projects always get killed due to the complexity around this.
This is exactly what we do: blue green eks cluster.
We just thought that if we do it on a monthly basis, DRP will be a piece of cake :)
Sorry for the rant, but this is usually wrong. The number of people that just keep their computer on is noticeable. And when I ask, it's usually "just to avoid having to wait" or "I've always done that".
I personally always hibernate my computer. When I turn it off it takes more time, but I'm already on the other side of the building so I don't care. When I turn it on it takes basically the same amount of time, and it is exactly as I left it. People keep the computer on just for convenience... and I don't think it's a good thing.
I keep mine on because it's my jumphost for working remotely. But I agree many people don't need to do this. My company, though, wants people to leave their PCs on so they can get automatic updates and be centrally managed.
I always leave my PC on, but for different reasons:
- I have a plex server running on it
- I can remote into it from my phone, this comes in handy a lot of the time.
- I can remote into it when traveling through my Fire stick using parsec, which means I don't have to carry a laptop with me everywhere I go ( I also setup my phone so I can use it as keyboard/mouse when I do this).
Regarding energy costs, it's negligible for the benefits it gives me
I always shut down my machine at night and sometimes restart when I leave for lunch. I've noticed that running Docker and other apps for a while makes my machine slower. I'm convinced there's a memory leak somewhere and restarting fixes those issues.
Restart the computer on a daily basis? Like it’s 1998? You need a different system man. Ubuntu can go strong for a month easily, with only sleep for leaving it unused. Only reason for restarts is security updates or that your battery ran totally dry.
Sure, after you reengineer your application. Even then, "serverless" apps often use persistent resources like databases, and your developers will likely spin up those resources for the same reasons as indicated in the article.
Cost savings can be incredible if you use the FaaS product in the most aggressive way possible. For us, this means using functions as a simple translation layer between SSR web forms served directly as text/html and whatever SQL provider (ideally also on a consumption-based tier).
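In practice the handler can be as small as this sketch - assuming a Lambda function URL / API Gateway proxy event, and with run_sql as a stand-in for whichever consumption-based SQL provider sits behind it:

    import base64
    from urllib.parse import parse_qs

    FORM = """<!doctype html>
    <form method="post">
      <input name="email" placeholder="email">
      <button>Subscribe</button>
    </form>"""

    def run_sql(statement, params):
        # Stand-in: call your serverless SQL provider here.
        raise NotImplementedError

    def handler(event, context):
        method = event["requestContext"]["http"]["method"]
        if method == "POST":
            body = event.get("body") or ""
            if event.get("isBase64Encoded"):
                body = base64.b64decode(body).decode()
            email = parse_qs(body).get("email", [""])[0]
            run_sql("INSERT INTO subscribers (email) VALUES (:email)", {"email": email})
            page = "<p>Thanks!</p>"
        else:
            page = FORM
        return {"statusCode": 200, "headers": {"Content-Type": "text/html"}, "body": page}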
90% sounds just about right. We are seeing figures going from $120/m for a VM-based QA environment to $10/m for a consumption-based / serverless stack.
depends on what your utilization looks like, serverless is usually +/- an order of magnitude more expensive. Ideally your workloads are stateless and containerized so you can shuffle them between serverless, container orchestration that you own and dedicated VMs.
You should always calculate if you're actually going to see cost savings. Counterintuitively, running for fewer hours can increase your bill if it causes you to switch to on-demand pricing [1]. There's a break even point you need to get past.
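A toy break-even check with made-up prices (the [1] link presumably has real numbers); the idea is comparing an always-on committed rate against the higher on-demand rate you pay only while running:

    # Illustrative prices only - check your own instance type and region.
    reserved_hourly = 0.06   # effective hourly rate of a 1-year commitment, billed 24/7
    on_demand_hourly = 0.10  # rate paid only while the instance is running

    always_on_monthly = reserved_hourly * 730
    break_even_hours = always_on_monthly / on_demand_hourly
    print(f"Below ~{break_even_hours:.0f} running hours/month, on-demand wins")
    # With these numbers: 43.8/month committed, break-even at ~438 hours (~60%
    # utilization). Run more than that and the commitment is cheaper even though
    # you "waste" idle hours.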
--> "The best optimization is simply not spinning things up!"
At least for local development and testing, as made possible by LocalStack (https://localstack.cloud), among other local testing solutions and emulators.
We've seen so many teams fall into the trap of "someone forgot to shut down dev resource X for a week and now we've racked up a $$$ bill on AWS".
What is everyone's strategy to avoid this kind of situation? Tools like `aws-nuke` (https://github.com/rebuy-de/aws-nuke) are awesome (!) to clean up unused resources, but frankly they should not be necessary in the first place...
In one project we had a testing setup which cost 600k USD annually. It was three times more expensive than the production setup we had for a product that was more than 3 years old.
Nothing special, just Mongo and Kafka of enormous size. If you run automation tests many times per day but do not clean anything up, you'll get a Mongo with terabytes of test data. And then, on top, there was Elasticsearch, which multiplied the bills.
I've for a long time set my cloud VMs to shutdown on idle. I usually use it to also justify running a much larger VM to cut down on build and test times.
Just set a cron to run the shutdown command with a grace period. And then if you're working late, you just cancel the shutdown and the shutdown will be retried in a couple of hours. And have a script or command to just run the cloud API calls to boot the VM in the morning / when needed, and the environment boots in a minute or two.
For other stuff I've been tempted to do a more complicated setup, with something like a micro-vm as a proxy, that will do the shutdown / activation on TCP connection, but haven't gotten around to it.
I’ve mentioned this before, but probably one of the most egregious costs on AWS are NAT gateways and NAT bandwidth pricing. Typically I deploy one NAT gateway per AZ so looking at $99 a month just for three NAT gateway instances with zero traffic.
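For reference, the hourly charge alone roughly accounts for that figure, assuming the common $0.045/hour per-gateway rate (data processing is billed on top):

    hourly_rate = 0.045  # $ per NAT gateway per hour (typical us-east-1 rate)
    gateways = 3         # one per AZ
    print(f"${hourly_rate * 730 * gateways:.2f}/month before any traffic")  # ~$98.55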
Disclosure: I'm CEO of https://www.vantage.sh/ -- a cloud cost observability platform. I previously worked at both AWS and DigitalOcean.
For people looking for how to save money on AWS - I'd [selfishly] recommend connecting up to Vantage. We profile AWS for all sorts of savings and give you the information on how much we can save prior to you paying us. It can be a good gut-check if nothing else on how well optimized you are.
Unfortunately we don't have an option for that route -- but we're happy to help support any paperwork for getting things up and running if you contact me or my team: ben [at] vantage [dot] sh.
If you're running a small setup and don't need any value add products or multi-AZ/multi-region this might work, but Hetzner and major cloud providers are by no means comparable.
Hetzner offers a 99.9% uptime guarantee only on their network. AWS has SLAs for every product offering - EC2 for example starts paying out credits if they fall below 99.99% uptime.
If you're a user of various managed cloud products, these will cost quite a bit to replicate on Hetzner and you'll be spending money on personnel to build these out and maintain them instead of just paying for the cloud product on AWS/GCP/Azure.
Was doing some research this weekend on cloud exit. Hetzner is attractive, but our company is pretty much limited to the US (no international companies due to our current business model). How practical would it be?
Also, I've seen a lot of concern over blocked IPs, especially for lower-cost hosts. Is that an issue with Hetzner?
I imagine one reason I didn't hear about them before is that they don't seem to offer self service. I absolutely don't want to talk to anyone when I am buying object storage, cloud vms, or even dedicated.
When I joined my current company I quickly found our internal Azure environment was effectively unmanaged. Four weeks in I shaved nearly 55k in monthly spend just scripting out VM shutdowns and service pauses.
Cloud revenue in most large companies is at least 25%, maybe up to 40%, pure developer waste, because nobody upstairs knows the difference.
Reacting to events to install security defaults (or any kind of defaults) sounds really error prone. Are people running AWS where devs just click buttons in aws and spin up random stuff? I thought we all decided that was dumb and switched to gitops/terraform?
Does anyone have a good experience with tools / services that track and analyze cloud usage? We don't use any, but could benefit from better visibility in spending patterns.
Here's a startup idea: a profiler for infrastructure-as-code that shows how much each line of code is costing per month, instead of showing where the CPU spends most of its time.
I did this at a previous employer. I leveraged a Lambda function and tags applied to instances to determine when they should be on and when they should be off.
The problem is always around abuse. If it becomes known that you can get a big bonus by wasting a lot of money on useless infra first and then reducing it, other people will start playing the game.
How do you reward cloud cost awareness without creating perverse incentives?
It's always the same answer: managers who pay attention to the details. People familiar with your work should be able to tell if you're gaming the system or not.
Will they also be paying what they owe on the added costs that could have been noticed earlier with due diligence? I'm imagining that what they'll receive instead is the compensation expected and agreed upon by both parties negotiated during either initial hiring or the multiple points in the year that allow for easy communication about changing payroll expectations, instead of hawking for dimes at first sight.