At least AWS allows using a prepaid credit card so they’ll need to call me if things go haywire. I bet if that $72k charge went through it would have been much harder to get out of. “Sorry, we don’t have the money” is a much better negotiating position than “can we please have our money back?”
Sorry, until you pay, no more Amazon services for your company.
Now you must move to a new cloud provider (or make a new company).
Oh wait, they now exchange (bad-)customer information to better detect fraud, and you just got flagged as "owing a lot to Amazon", so no cloud for you anymore at any provider.
Now you want to buy your own hardware. So you need a loan from the bank, but dang, you owe too much to a big company and the bank knows now, so no credit for your company either.
While part of the above scenario is luckily not how reality currently works, who knows when (part of) such a horror scenario becomes reality.
In the end, relying on forcibly not paying back money you contractually owe is just not a very viable strategy in my view.
I agree but why would you like to be in either position anyway? The so-called cloud services are terribly overpriced when compared to traditional servers.
Seems like companies hire five six-figure people to try and cut their Amazon bill by a couple of grand a month.
Never understood spending 50-100k a month to maybe save 5k
Not really, computing done correctly is about avoiding all of the pitfalls and finding ways to get zero-cost benefits, free computation out of necessary redundancy, etc. Selling cloud computing is about creating options around every pitfall, finding ways to charge for every mitigation that will be necessary, and charging for redundancy in the mitigation strategy for the mitigation strategy.
Even if you pay for all the redundant managed blah they offer so as not to lose your business by having any single point of technical failure in their network, their billing and IAM are your single points of failure. If you diversify to multiple clouds, all the guarantees either cloud offers become pointless redundancy, so you are paying 10X pricing for an inadequate redundancy layer.
If you look at Google's own model for computing, they didn't fall for this themselves, the computers they used were intentionally unreliable to not recursively pay for reliability and redundancy at any layer that can't provide the needed guarantee.
You can basically go all in with one of these clouds and become a franchise add-on with roughly the same rights as your average McDonald's store owner, or you are managing a strategy that is far more complex, because of the complexity of these offerings, than just using metal and free software.
However, for growing companies that 5k/month AWS premium can hit 200+k/month very quickly
I might want to have an app because I don't mind spending 50 dollars on my pet project as a hobby, but I don't ever want to spend more than that. Not if I write a bad query that suddenly becomes very expensive, not when I get attacked, and not even when I have legit users.
By the way, the same goes for some companies, too, just the threshold would be different.
They want to extract the maximum money from customers before they realize it.
For every one person who complains loudly and gets a goodwill gesture, there are hundreds of other companies that will not notice, or will just pay without recourse.
This is a naive understanding of how corporations like Google and Amazon work. Bad will and using gym membership tactics aren't how they scale or make money. Getting you to confidently try things knowing you won't get charged (the reason they have those free tiers) so you'll get your company, your start-up, your next side project on it is much better for business.
It's a miss that things like this aren't implemented and widespread, not by design.
> It's not complicated to add configurable hard limits for these companies but they don't allow it because the current situation is more interesting for them.
I'm not in this space, but from my observations:
- Each service has a different billing model and metering model. Most likely this data is held by the service. I'm familiar with AWS so I'll use them as an example. I'd wager only DynamoDB or only Lambda (the service owners) know how much of those services you've consumed
- Billing is most likely reconciled asynchronously after collecting all data from all services by an entirely different department with knowledge of payments and accounting
- GCP, AWS, Azure launch 50+ services a year
- Each large customer most likely has a special rate. I bet Samsung or Snap pay an entirely different set of rates than the normal customer. There are thousands of these exceptions
- Cutting your service off when you're over the limit is an incredibly complex set of edge conditions. Your long-running instance hosting your critical service is shut off because of experimenting on a new ML workflow?
Even with only the above I can see the difficulty in globally limiting your spending limit at an accurate level. I know there are features for both AWS and GCP and they try.
It's easy to stand on the sidelines and handwave away technical complexity at scale, but I'd encourage you to give all of these providers a more charitable view, at least on this topic.
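The asynchronous reconciliation guessed at above can be put in a toy model. This is pure speculation about the pipeline's shape, not any provider's actual design: each service emits usage records on its own schedule, so the total any spending cap could act on lags reality by the slowest service's reporting delay.

```python
# Toy model of asynchronous billing aggregation (speculative, not any
# provider's real pipeline). Each usage record carries the time it
# "arrived" at the billing pipeline; a cap evaluated at time `as_of`
# simply cannot see records that haven't arrived yet.
def billable_total(records, as_of):
    """Sum only the usage records visible to billing by `as_of`."""
    return sum(r["cost"] for r in records if r["arrived"] <= as_of)
```

So even a perfectly enforced cap on `billable_total` would under-count true spend by whatever is still in flight.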
Except they do that with their actions.
>Cutting your service off when you're over the limit is an incredibly complex set of edge conditions.
Sure! But if they cared about customers as you claim, they'd let users set hard limits, and when one of these mishaps happened, stop the services when their system eventually knows that the quota has been exceeded... and, make the user only pay the hard limit as the maximum amount. If this continues to happen, warn the user that their account will be terminated... and that's that. But they'll never do that.
Most of their clients pay for these mistakes because they don't have the reach or skills to make this a viral social media article to get people's attention and hence get them to forgive the costs.
I'm sure they know how much they make in revenue because of these mistakes and they deliberately don't do anything about it.
What is your educated guess for when this feature would be essentially correctly implemented in AWS and GCP (essentially = negligible costs to the providers from either false negatives (bills they eat) or false positives (PR fallout when SomeSite gets shut down despite not being over the limit))?
I think that's true. It's easier to measure usage and aggregate that data after the fact than to meter it in real time and stop at a limit. Those are very different things. What happens if you hit the cap while running multiple processes spread across a cloud?
One improvement might be to throttle things as the cap approaches, but on its own that doesn't really change the problem. Doing that and having the provider eat any overages should solve it from the user's point of view.
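The throttle-as-the-cap-approaches idea could look something like this sketch. The linear ramp and the 80% soft-start point are made-up policy choices for illustration, not any provider's behavior:

```python
def throttle_factor(spent, cap, soft_start=0.8):
    """Fraction of requests to reject as spend approaches the cap.

    Below soft_start * cap nothing is throttled; between soft_start
    and the cap, rejection ramps linearly from 0 to 1; at or over the
    cap, everything is rejected. (Illustrative policy only.)
    """
    if cap <= 0:
        return 1.0
    usage = spent / cap
    if usage < soft_start:
        return 0.0
    if usage >= 1.0:
        return 1.0
    return (usage - soft_start) / (1.0 - soft_start)
```

A runaway job would then be slowed progressively instead of hitting the wall all at once, buying time for the billing pipeline to catch up.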
I have very little money so I just don't use their services because a mistake would be disastrous. They might be losing out on me making a unicorn app on their platform. It's unlikely, but while the possibility of catastrophe exists I'll stick to not using them. That extends to not recommending anyone uses them either in case the worst happens.
Then the harsh reality is: companies don't care. Yeah, your app might turn out to be a unicorn, but the overwhelming odds are that it won't. And no one cares that you'll tell your other broke friends to avoid the service.
We'd all like to think it to be different, that a company might care about appeasing my broke ass. But as already pointed out, they want the whales. I also wonder, despite the number of years "cloud services" have been around, if companies aren't still trying to figure out a gazillion other things and limiting customer spend might be a bit low on the priority list.
I do my best to avoid FAANG giants who don't think about me.
...is it? If a lazy dev leaves their corporate account open and you can bill it for their negligence, protected by the contract you already signed, you earn a lot of money. From a purely business perspective, it is stupid(!) to provide a stopgap for that.
Edit: to be clear I am not advocating one way or the other. But it is surprising that people are "baffled" by this obvious profit optimization.
If everyone has this policy, Google, Amazon, Microsoft, and the rest are in a good place. And suddenly it's the "industry standard."
This hypothetical is already enacted today...
Here's the specific example
In our case, we racked up a $10,000 bill on BigQuery in ~6 hours, when a job was failing and auto-retrying.
We had set up every alert correctly and our reaction time was about 5 minutes (about $100 of usage, no big deal). So how did we get a $10,000 bill? Google's alert was 6 hours late (according to them, this was root-caused to us, because we were submitting jobs continuously). They pointed to their TOS and said they don't guarantee on-time delivery of the alert.
I had to write up a blog post with fancy graphs and prepare it for social media before they finally agreed to eat the bill.
Link (see blue info box a bit below the anchor on which the page is opened):
But if a budget 10,000× lower ($7 vs $70k) doesn't work, then...?
You misunderstand the intent of this - you basically set this up, and even if it fails (because messages are delayed), Google will refund.
This has happened to us before - they do a refund, since you had set the limits correctly. In general, they are not super assholes. I actually don't know of a case where they have refused to refund.
AWS is better here, since GCP doesn't have a support dashboard, so the "chasing them" experience is much worse.
This looks like it has the same problems as the post, because it also relies on those budget alerts that can happen a long while after you've exceeded them.
"Resources [...] might be irretrievably deleted."
Also it's not automatic, you have to manually write code to do it, and test it, and make sure not to break it.
A reasonable implementation of this feature would be built into the console, guarantee a maximum spend, not require writing your own fallible code, and provide an option to preserve storage (at normal cost) so that all your data isn't deleted when your compute/API stuff is shut down.
- hard limits caused downtime more often than they prevent these blog posts
- hard limits were inconsistently enforced, even within GAE
- platform wide quota notifications were implemented (reached "GA"), leaving the question of "how a developer wants to handle this" to the developer, not the platform
- maintenance burden
The "I bankrupted my startup by running tests in an infinite loop" blog posts happen ~once a year, while the number of customers (including internal teams!) who inadvertently went down because of this quota was staggering. I feel like I used to see one a week, at least. Most often someone on the team was like "oh I'm going to turn this down to zero because we don't want to spend any money during development", never told anyone, and then they go live and they forgot to turn the knob back up (or didn't properly estimate traffic/costs and set it too low).
I can tell you it hurts revenue a lot more when a large customer goes down for 15 minutes due to quota issues and their usage drops to zero (both in terms of revenue and customer credibility) vs when tiny developer accidentally blows through 10k in a month and we refund it (since, obviously, the providers cost is a lot less than that).
The fact that Google tied it to requiring a credit card for almost every single transaction, even free ones, gives the impression that it was done for financial purposes (i.e., a way to get more out of developers, or out of those who might be free-loading on App Engine's free tier).
Compute is an active resource, when you exceed your budget it can be automatically shutdown.
Storage is a passive resource; when you exceed your budget it can be automatically... deleted? That's almost always the wrong action.
Providing fine-grained cost limits helps some, as passive resources usually don't have massive cost spikes while active resources do, so you can better "protect" your passive resources by setting more aggressive cost limits on the active resources.
This quickly gets more complicated. Another example is most monitoring services are a combination of active (actual metric monitoring) and passive (metric history) resources. A cost limit on that monitoring service likely won't provide sub-service granularity, mostly depending on whether the service even has different charges for monitoring vs history.
Oh, also, even for a passive resource like storage, you also have active resource charges whenever you upload/download your data.
Ugh, what a mess. The best thing to do is pay attention to your spending, just like you do with your personal & corporate budget.
If anything it seems an easier problem than processor time.
I recall disk quotas on shared systems at university back in 1998 and I'm sure they existed before that.
Two thresholds IIRC, one at which you get a warning, second at which you can't write any further and the disk write operation fails.
I don't think they deleted files, it was just you couldn't write more than [quota] bytes to your disk.
Is there something particular about cloud based systems that prevent this from working?
i.e. is this a specific problem with distributed storage?
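The classic two-threshold disk quota described above behaves roughly like this sketch (the return values are illustrative; real Unix kernels fail the write with an `EDQUOT` error, and never delete files):

```python
def check_write(used, request, soft, hard):
    """Two-threshold quota check: warn past the soft limit, fail the
    write past the hard limit. Existing data is never touched."""
    if used + request > hard:
        return "EDQUOT"  # write fails, nothing is deleted
    if used + request > soft:
        return "warn"    # write succeeds, user gets nagged
    return "ok"
```

The point being: "stop accepting new bytes" is a well-understood, decades-old operation, quite different from "delete what's there".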
AWS has quotas on everything, including quotas on EBS storage per region.
You realize that after you spin up some instances with disks and it fails because you've hit 10 TB of EBS storage. You have to raise a ticket to raise the limit.
A better option would be to automatically reduce the budget by the amount it would cost to keep the storage forever. If doing that would reduce the budget to zero, do not allow increasing the amount of storage. That is: assume the storage will not be deleted, and budget according to that.
Even if we say "you get N months of storage before we delete it" and subtract N * current storage cost/month, what happens after you're locked out of all actions because you added an extra GB? Storage APIs cost money to use, so you would get locked out of those too (note that if you're not, people would set arbitrarily low limits and get storage access for free) and couldn't retrieve anything. The only remaining actions are delete (which is free) or raise the quota and do the whole rodeo over again.
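The N-months-of-storage reservation described above is easy to compute. In this sketch the per-GB rate and the default N are illustrative assumptions, not real pricing or policy:

```python
def spendable_after_storage_reserve(budget, stored_gb, months=3,
                                    price_per_gb=0.023):
    """Budget left for new spend after reserving enough to keep the
    existing data for `months` months (the N-month grace idea).
    Rate (~$0.023/GB-month) and default N are made-up figures."""
    reserve = stored_gb * price_per_gb * months
    return max(0.0, budget - reserve)
```

Once the spendable amount hits zero, growth in storage would be refused rather than old data deleted, which is exactly where the lock-out problem above kicks in.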
Abuse is impossible to ignore at public cloud scale, so "free storage forever" (or even, storage at a one time fixed price) as the fallback isn't a viable option.
Lastly, from an optics perspective, which blog post would you rather see on the front page of HN: "I did something dumb and spent too much money on Cloud" or "Google is holding our data hostage" (or "Google deleted all my data")?
Source: I launched Firebase Storage, which has a GCS bucket that has a hard limit.
Practically every time these blog posts come up they end with the provider refunding the costs. I just want that refund to be a feature.
How will Google automatically differentiate between an "honest mistake" and someone taking advantage of this feature?
Yes. But they should also develop mechanisms to warn users that they've made a mistake before it happens, and improve the speed they can detect mistakes to lower the cost, and invent some way to detect someone intentionally abusing the feature.
But mostly they should make the fact they do give away $4900 when a mistake happens explicit. That isn't actually a change. They just need to make it clear that's what happens.
Your examples are simple given this framework. Uploading/downloading data to storage is an API call. Monitoring is compute. Metric history storage is storage.
When there's no budget left, what do you do with those accruing costs for existing storage?
Once you get the alert that your budget is tripped you can go and see what's in storage via the console and delete it, only paying for a few hours of storage for things you don't want.
You set a quota for 50GB of storage and no more.
The server then restricts you by disk quota to that amount of storage.
The cost is then calculated as 1.15USD per month.
So you don't pay more than 1.15 per month.
Compute and transfer (and other things) could be covered by separate similar quotas with a single maximum spend figure at the bottom of the table.
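The quota-to-cost calculation is trivial to state. The per-GB rate here is an assumed $0.023/GB-month (roughly S3 standard pricing), which is consistent with the $1.15 for 50 GB above:

```python
def monthly_cap_for_quota(quota_gb, price_per_gb=0.023):
    """Maximum monthly storage bill implied by a hard byte quota.
    price_per_gb is an assumed rate, not a quoted price."""
    return round(quota_gb * price_per_gb, 2)
```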
It's really not a simple problem, because the next action depends on the choice the developer wants to make: do they increase the budget or decrease usage? No cloud provider wants to make this choice for them, because no matter what the choice is, it will be viewed as wrong. The best they can do is provide developers the best insight and tooling to make this choice themselves.
You could also have a setting in the admin panel as to what the system should do:
[ ] I want to keep going beyond my quotas (but email me)
[ ] Please shutdown my site
The other issue is that many large customers pay different prices, so billing and quota aren't really tied to each other, and it wouldn't be easy to reconcile this.
As for the button... having been on the product side of building this button, there is no right answer: people will say they never got the email (or it went to the wrong inbox, or their dog ate their phone...) or that they never checked the box to "shut down the site" ("I didn't think it would do X that made my app not work").
Probably arranged so you can type in a figure at the bottom for monthly expenditure and it would balance out the requirements based on typical use cases.
So enter $50 in the monthly cap figure and it allocates, say, $20 to compute, $20 to transfer operations and API calls, $10 to storage
which you could then fiddle with of course.
I can't offer much on the second point other than to say that unexpected bills annoy me much more than services that stop working.
I've also never worked anywhere with unlimited budgets. (alas)
I can see that there are probably cases where uptime is more important so they would be more annoyed the other way around.
Throttling doesn't stop the drain.
I'm guessing there's a good chance a lot of systems are only eventually consistent, which could explain why billing takes a long time to update.
Aggregation of service usage for billing could also be an expensive operation, so it's only updated irregularly instead of being near real-time.
It would be a great feature, but I can imagine it being very complex. It's also probably cheaper for them to just wave away excess usage like this instead of building out a solution.
If I say I only want to pay a maximum of $1000 a month, and I hit that limit but it takes a bit for the provider to shut everything down so really $1100 of resources were consumed, then the provider eats the $100 overrun and I get a bill for $1000.
With an actual hard limit you create a financial incentive for the provider to minimize this overrun. Yes it might be difficult to fix but I assure you, if hard limits existed, the technical issues would be solved soon enough because now there's a reason to invest in a solution.
It's a fun exercise similar to global rate-limiting/load-balancing.
If you have the time could you (anyone feel free) talk a bit about how you would implement a globally distributed budget?
I can imagine a few simple options, but they all seem to have significant shortcomings.
Web servers check with the leaf nodes for every ad they want to show. If that leaf has a budget greater than zero it decrements its own budget and returns success. If the web server gets a success it shows the ad, if not it checks with another budget server or two. Web servers frequently log how many ads were served per client.
Whenever leases are up the intermediate nodes inform the parents of how much was spent and get a new lease. If nodes crash or otherwise don't return their lease then their parents have to assume the whole budget was spent, but leases are kept small to avoid big discrepancies.
If the root crashes then there are problems so the root can be a slow ACID replicated database as long as its immediate children are mostly reliable and take large enough leases to minimize load on the root.
Periodically web server logs are aggregated to adjust the root budgets to account for crashed intermediate nodes and web servers.
The tree approach allows global low latency operation guaranteeing no overspending and minimizing underserving. Nodes are provisioned from the leaves on up to handle the necessary amount of traffic and to ask for leases large enough for 99.X% percent of child requests to succeed.
Any cloud provider could use the same technology on individual hosts to grab leases of CPU, RAM, disk, etc. by the minute per user and terminate services with no budget. Leases could be a lot longer because most budgets are monthly to cover all service needs and not pathological ad campaigns with low budget, high bid, and huge audience.
It's up to cloud (or ad server) providers to decide whether to stop services if the budget system is broken. In most cases it makes sense to fail open and keep serving and eat the loss because shutting everything down will incur even bigger losses.
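The lease-tree design sketched above can be condensed into a toy two-level version (class names and numbers are invented for illustration): the root is authoritative, leaves spend from local leases with no per-request round trip, and a crashed leaf costs at most one unreturned lease.

```python
class BudgetRoot:
    """Authoritative budget; hands out small leases to leaf servers.
    A crashed leaf's lease is treated as fully spent, so overspend is
    impossible and underserving is bounded by the lease size."""
    def __init__(self, total):
        self.remaining = total

    def lease(self, amount):
        granted = min(amount, self.remaining)
        self.remaining -= granted
        return granted

    def settle(self, unspent):
        # A leaf returns whatever it did not spend when its lease ends.
        self.remaining += unspent


class BudgetLeaf:
    def __init__(self, root, lease_size):
        self.root = root
        self.lease_size = lease_size
        self.local = 0.0

    def try_spend(self, cost):
        """Decrement the local lease; fetch a new one when it runs dry.
        Returns False when the global budget is exhausted."""
        if self.local < cost:
            self.local += self.root.lease(self.lease_size)
        if self.local >= cost:
            self.local -= cost
            return True
        return False  # budget exhausted: stop serving

    def expire(self):
        self.root.settle(self.local)
        self.local = 0.0
```

Real systems would add the intermediate tiers, lease timeouts, and log-based reconciliation described above, but the invariant is the same: spend at a leaf never exceeds what the root has granted.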
One of your competitors could just rent a cheap server on OVH with uncapped transfer and rack up $10k in costs for you in a few hours.
* I don't actually have any idea about OVH
There is no engineering hurdle that is a valid excuse for allowing a customer to go over their stated budget by 86 million percent.
S3 -- you can't just delete customer data because they hit a billing limit
RDS -- not going to drop databases on the 27th of the month
Anything with persistent data is going to have to stay alive and accumulate costs. Admittedly these services aren't where the crazy bills come from, but it does make a simple kill switch a bit more complex.
Most services that have a spending cap will have a "grace period" of a couple of days during which the service does not work but the data is not deleted. That gives you some time to get notified of the issue and fix the problem/increase the limit.
You could factor that into the price, but then you're potentially making the price point even more unattractive to users than it already is, and users that are responsible with their budgets would be subsidizing those that aren't. Not a very workable solution.
I'd say a good solution is giving customers the option to stop accruing more storage capacity, and to have a max deadline accounted for in their budget to store data (basically each customer decides whether or not to pay for a grace period).
If you don't buy one coffee, or put a 20-dollar note in a book one month, then you're fine. And if you have to use EC2, just use a t2.micro or a Raspberry Pi on your desk.
But really the first lesson you should learn in any cloud setup is Billing Alarms :)
If you're doing ML or CV work then it's probably cheaper to build on the desktop and port to cloud once you understand what the workloads are.
If you get it right, great. If you get it wrong then you end up doing billions of operations by mistake, which could cost a huge amount. That's what happened to the author of the article.
But really the first lesson you should learn in any cloud setup is Billing Alarms
Alarms only tell you that something is going wrong. They don't stop it. If your mistake is costing $1000/minute and you're an hour away from a computer you have a very expensive problem.
This is HN; many of us are solo founders with no coworkers or employees. Also, how could a "friend" pull the plug? If it was a physical server running in your house, maybe, but otherwise you can't really give them access to your AWS account with all your private clients' data in there.
As for having a non-employee pull the plug, set up an IAM user with permission to access the test instance
Agile. Bringing you bankruptcy at the speed of cloud.
And why would you start a test if you won't be there to see the results of the test? Seems more sensible to either leave after you've run the test or wait to do so until you get back.
Whoever helped you inside Google will have gone to a LOT of trouble, opened a bunch of tickets and attended many, many meetings to make this happen.
It's a little bit nuts that there are no guardrails to prevent you from incurring such huge bills (especially as a solo developer that might just be trying out their services).
To me it looks very similar to the personal checking overdraft schemes banks were using up until a few years ago.
On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users. And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
Note: Principal at AWS. Have worked to issue many credits/bill amendments, but I don't work in that area nor do I speak for AWS.
What? Why wouldn't this just be an opt-in thing? It could even be tied to the account being used. It's not like AWS accounts are expensive or hard to set up.
If a user opts in to "kill if the bill goes too high" and they kill a critical portion of their business, then that's on them. Similar to how it's on the user if they choose spot instances and their spot instance ends up being destroyed. You've already got that "I can kill your stuff if you opt into it" capability.
> On the other hand retroactively forgiving the cost of unexpected/unintentional usage doesnt impact the customers users.
Yeah, and what happens if someone isn't big enough to justify AWS's forgiveness? What if they get a rep that blows off their request or is having a bad day? You are at the mercy of your cloud provider to forgive a debt, which is a real shitty place to be for anyone.
> And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
And what do they do if they miss the alert? You can rack up a huge bill in very little time with the right AWS services.
The point of the kill-switch cap is to guard against risk. The fact is that while 72k isn't too big for some companies, it means bankruptcy for others. It's because you might want to give your devs a training account to play with AWS services to gain expertise, but you don't want them to blow $1 million screwing around with Amazon satellite services.
"Oh cool, I'll set a $1k cap, never gonna spend that on this little side proj." Fast forward a year: the side proj has turned into a critical piece of the business, but the person who set it up has left and no one remembered to turn off the spending cap. Busy Christmas shopping period comes along, AWS shuts down the whole account because they go over the spending cap, 6hr outage during peak hours, $20k sales down the pan.
Of course it is technically the customers fault but it's a shit experience. Accidentally spending $72k is also technically the customers fault and also a shit experience. I don't think there is an easy solution to this problem.
"Oh cool, I'll only scale to 1 server, never gonna see high load for this little side proj."
"Oh cool, I'll deploy only to US West 1, outages are never going to matter for this little side proj."
There are a million ways to be out of money as a company. Why should this be any different? Why is the singular particular instance one where it is simply intolerable to accept that users can screw things up?
There are lots of things that are "shit experiences" that are the consumer's fault.
There is an easy solution. Give consumers the option and let them deal with the consequences. There are enough valid reasons to want hard caps on spending that it's crazy to not make it available because "Someone MIGHT accidentally set the limit too low which will cause them an outage in production that MIGHT mean they lose money"
This is simply wrong.
Depending on your use case, disabling active resources is the right, reasonable solution with fewer downsides.
E.g. most (smaller) companies would prefer their miscellaneous (i.e. non-core-product) website/app/service to be temporarily unavailable than to have a massive unexpected cost they might not be able to afford, which might literally force them to fire people because they can't pay them...
I mean, think about it: what is it worth that my app doesn't go temporarily unavailable during its free trial phase, when it means I'm going bankrupt from today to tomorrow and in turn can't benefit from it at all?
Sure, huge companies can always throw more money at it and will likely prefer uninterrupted service. But then for every huge company there are hundreds of smaller companies which have different priorities.
In the end it should be the user's choice, a configuration setting you can set (preferably per project).
And sure, limits should probably be resource limits (like accumulated compute time) and not billing limits, as prices might be in flux or dependent on your total resource usage or similar, so computing it is non-trivial or even impossible.
I often have the feeling that huge companies like Amazon or Google get so detached from how things work for literally everyone else (who is not a huge company) that they don't realize that solutions proper for huge companies might be not only sub-optimal but literally cripplingly bad for medium and small companies.
I'm no longer that person, but I think GCP/AWS are just being lazy about this - perhaps because they earn a lot of money from engineer mistakes. Of course it's possible to create an actual limit. There'll be some engineering cost, like 0.5%-1% extra per service?
Edit: Being European I think legislation might be the fix, since both Amazon and Google have demonstrated an unwillingness to fix this issue, for a very long time.
Lol what... this is exactly what happens any time you hit a rate limit on any AWS service. The customer's application is "catastrophically interrupted" during its most popular/active period.
The only difference is in that case, it suits AWS to do that whereas in the case of respecting a billing limit, it doesn't.
If you hit a billing limit, everything beyond that point is dropped, and the graph of requests plunges to zero. You're effectively hard down in prod.
I didn't even have any customers at that point.
Email me when we cross $X amount in one day, Text when we cross $Y, and Call when we cross $Z. Additionally, allow the user to configure a hard cut-off limit if they desire.
Just provide the mechanisms and allow users to make the call. Google et al would have a much stronger leg to stand on when enforcing delinquent account collections if they provided these mechanisms and the user chose to ignore them.
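Those escalating notifications plus an opt-in hard cut-off could be expressed as simply as this sketch (the threshold names X/Y/Z and the action labels are hypothetical, not a real provider API):

```python
def alert_actions(daily_spend, x, y, z, hard_cap=None):
    """Escalating notifications at user-set thresholds x < y < z,
    plus an optional opt-in hard cut-off. Purely illustrative."""
    actions = []
    if daily_spend > x:
        actions.append("email")
    if daily_spend > y:
        actions.append("text")
    if daily_spend > z:
        actions.append("call")
    if hard_cap is not None and daily_spend >= hard_cap:
        actions.append("cut-off")
    return actions
```

The policy is trivial; the hard part the thread discusses is producing an accurate `daily_spend` in time for the "cut-off" branch to matter.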
Additionally, Google et al should protect _themselves_ by tracking usage patterns and reaching out to customers that grossly surpass their average billable amount - just like OP with their near-$100k bill in 1 day. There was zero vetting to give even a reasonable guarantee that the individual or company is capable of paying such a large bill.
And then what? Sue a company that doesn't have $100k for $100k? This makes zero sense.
A better solution would have been 'limits' which they used to have (at least for Google App Engine) but which has been deprecated.
We had to spend some time researching to see if there was a workaround because, just like the author of the article, we were quite worried about suddenly consuming a huge amount of resources, getting a spike in our bill, and our accounts being cut off/suspended because we hadn't paid the bill. We've documented our solution here
Nor does that address the other complaint - Google (and possibly others) seem to be willing to extend an unlimited credit line to all customers without any prior vetting for ability to pay. That's crazy.
Well, this is true, but this is also true of a lot of limits, like limits.conf. Sometimes you really want to spawn loads of processes or open many files, but a lot of the time you don't, so a barrier to limit the damage makes sense.
There is no one solution that will fit everyone: people should be able to choose: "scale to the max", "spend at most $100", etc. If my average bill is $100, then a limit of $500 would probably make sense, just as a proverbial seat belt. This should never be reached and prevents things going out of control (which is also the reason for limits.conf).
This could be ameliorated by using namespacing techniques to separate prod from dev resources. For example, GCP uses projects to namespace your resources, and you can delete everything in a project in one operation that cannot fail partway, by just shutting down the project (no "you can't delete x, because y references it" messages).
Aggressive billing alerts and events, that delete services when thresholds are met, could be used only in the development namespace. That way, fun little projects can be shut down and prod traffic can be free to use a bit more billing when it needs to.
Making the worst case scenario no worse than traditional infrastructure.
Well there's a very easy way, adding a checkbox and an input:
[ ] I am just trying things out, don't charge me more than [ ] USD
And for those that are heading into that financial barrier it should be a straightforward problem to look at trending to anticipate the shutdown and send out an alert.
Literally did this my first week when trying out GCP for my company. It is entirely possible and documented (with code):
(Source link in parent post, emphasis mine).
In this case the delay caused an additional $72k in cost. Which, let's be honest, makes this feature kind of useless for anything but the relatively harmless cases.
Only by combining it with resource limits in load balancers, instance and concurrency limits, and the like can the maximal worst-case cost be bounded. But honestly, this partially cripples the auto-scaling functionality, and it's really hard to find a setting that doesn't allow too much "over" cost while also not hindering the intended auto-scaling use case.
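As a rough illustration, the worst case those caps imply can be bounded with back-of-envelope math. The $0.10/hour rate below is a placeholder, not a real GCP price:

```python
# Back-of-envelope worst-case cost bound implied by an auto-scaling cap.
# The hourly rate is a placeholder, not an actual cloud price.

def worst_case_daily_cost(max_instances, cost_per_instance_hour):
    """Upper bound on one day's compute spend if every allowed
    instance runs flat-out for 24 hours."""
    return round(max_instances * cost_per_instance_hour * 24, 2)

# With the article's cap of 1000 instances at a placeholder $0.10/hr,
# a runaway service could burn $2,400/day on compute alone -- and real
# bills also include requests, egress, logging, etc.
```

That's the trade-off in one line: the cap you choose is a direct multiplier on your worst-case bill, but also a ceiling on legitimate scaling.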
> I created a new GCP project ANC-AI Dev, set up $7 Cloud Billing budget, kept Firebase Project on the Free (Spark) plan.
There's a lot of middle ground between $7 and $72k. Your quote explains it perfectly though. They flat out can't, because the accounting and reporting are badly designed and incapable of providing (near) real-time data.
IMHO the easiest solution to this is government regulation. If you set a budget for a pay-what-you-use online service there should be legislation forbidding companies from charging you more than that.
I also find it (sort of) hilarious they can magically lock the whole thing down once payment fails, but not before the CC is (presumably) maxed out. Lol. Talk about a good deal for Google.
Unfortunately for all of us, your solution doesn't work, per the huge disclaimer on the page that says those alerts can be days late. You can rack up an almost unlimited $ bill in hours.
Now, I think some of these quotas can still lead to some pretty crazy bills... but that is the point of at least some of them.
I mean, like the article mentioned, they could have set the instance and concurrency settings to lower values, which in this case would have worked.
But finding the right settings to balance intentional auto-scaling against limiting how fast unexpected costs can rise is hard and prone to getting wrong.
Let's be honest: in the end it's a very flawed workaround that might help (if you know about it and set it up right).
I never used google again.
I run so many websites on Google Cloud Run that sometimes I feel I might be abusing them, but I have ensured each of my site has max limit of 2 hosts.
Thanks for sharing!
I have no idea what they did internally, but something like this was my guess. I only communicated through customer support channel and replied to emails, and shared my doc (which cited all the loopholes) with them.
It took them 10-15 days to get back and make a one-time goodwill contribution. The contribution didn't cover the logging cost, so we did pay a few hundred dollars.
Luckily we were able to get $11K refunded on our card and received $6K credits after spending all night with Google support.
Especially if you consider the dollar value of all those approvals and the business you might lose to some other platform and/or hesitance people will have to use those platforms for such things in the future.
> If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem.
Crappy situation for OP and his startup, but I find the part about reading up on bankruptcy to be a bit premature.
Perhaps not the most ethical choice, but what stops OP from just not paying the bill, and finding a different cloud provider? Obviously they'll want to not repeat the "experiment", but seriously... there's no mechanism at Google to stop a new client from running up a near-$100k bill in a single day?
That's absurd, and should be a learning lesson for Google more than this startup. Some malicious actor could apparently consume hundreds of thousands of dollars of Google resources and "get away" with it.
Wait and see what happens, then deal with it - would be sane advice.
Bankruptcy fear was real at the time. Google has at least a few thousand lawyers on payroll. They probably also have a process for handling delinquencies and sending out notices. A quick look at the lawyer fees just to manage the case, let alone fight it, is enough for a bootstrapped company to throw up its hands.
+1 to the bad-actors possibility. I shared this with the Google team; I'm not sure what they have done since.
We are out of that situation, and I wrote the post so others relatively new to Cloud don't make the same mistakes.
Fail fast is a very bad idea with Cloud.
However, Google's army of lawyers costs them real money, where your bill is largely made up numbers.
Perhaps the true cost is still enough to warrant sic'ing their lawyers on your company.
Even in that situation, a wait-and-see approach is still pretty advisable. The worst case scenario was already known to you - bankruptcy.
Nothing Google or their lawyers do would change that worst-case outcome, and if Google was aware you literally don't have $72k, and might just declare bankruptcy and walk away, they'll be much more eager to negotiate a more reasonable bill and settle your account. It's exactly as J. Paul Getty said...
Very glad it's being worked out and you will not have to go down that path.
You could even go scorched earth, represent yourself, and drag it out as long as possible. "Your honor, I'm a free man on the land and all I was doing was travelling the information super highway. I'm not bound by your laws!" Haha.
It's 220,000 square feet, but I've lived in a tent out back for the last 6 months because I can't get an occupancy permit, it's not zoned residential, and I refuse to pay rent on an apartment.
This is the first video I made there showing a bit of the size:
Is it an old airplane hangar?
Even trying to read the amazon pricing for their instances, hours and what not, drives me insane.
Seems this is done on purpose. No wonder they make so much money with it.
So I have never seen a reason to move any stuff to the cloud.
Just grab a dedicated server for a few bucks and put a bunch of docker containers on those.
It's way cheaper and usually not more complicated. Just use CI with GitLab runners or whatever and be done with it.
Most apps don't need scaling anyway, and if you do, just put that app on bare metal that fits your requirements.
Having just started my own journey of building products for myself, pretty much the first thing I realised about my tech was that I needed dedicated servers instead of cloud, just because they cost 100x less.
> Just grab a dedicated server for a few bucks and put a bunch of docker containers on those.
Exactly. If you really want Kubernetes coolness to act cloud-like, install Kubernetes: it's free and super easy to set up.
And with the cost savings you can literally buy multiple spare servers, and with Kubernetes use them all while keeping utilization low, letting you scale up new nodes if needed.
Can you point me to the super easy setup guide? Because I've tried a few and never gotten it working.
I set up a 3 node cluster using it in an afternoon and haven't had any problems since.
And I don't believe they make "more money" that way at all. AWS margins are either very low or very high, and the higher-margin, higher-priced products tend to be the "simpler" ones: packaged, managed products such as Redshift that are billed on fewer tiers and flatter prices.
When you design your application with AWS, pricing has to enter your design considerations. For example, if you are designing something that will interact a lot with S3, you want to minimize PUTs. You want to minimize RAM usage on Lambda by streaming rather than buffering. Etc.
AWS is not a suitable product for playground stuff. The only reason it gets used as such is because it's easier if you're already using AWS for other things (or you're already very familiar with it).
>AWS pricing is not obscure
There is a massive secondary consulting market because of AWS's price obscurities.
While that's true, there is consulting market for most things that are complicated. Doesn't mean they are shady. It's simply not for you. You are welcome either to dive in or get a consultant.
I promise you, though, that AWS pricing isn't difficult once you understand a few concepts and know your way around the Cost Explorer. With proper tagging, it's easy to drill down into which resource is consuming how much. I don't believe there is a way to have simple billing for complicated products.
It does mean it's not simple though.
Does AWS have high-margin prices? In aggregate, somewhat, but this is mostly driven by the big ticket managed enterprise items: Aurora, Redshift, Quicksight, probably Fargate, etc. A lot of their more popular stuff (S3, Lambda, …) offer incredible value for very little money. EC2 is the exception I believe, because I understand it to be high margin for how popular it is. But EC2 pricing is one of their simplest ones.
Could AWS simplify some of their pricing? Yes, probably. There's always room for optimization. Personally for example I'd like to see their pricing be global rather than different by region (with understandable exceptions for govcloud and china).
Is AWS making its pricing complicated for nefarious purposes? No, there is no evidence to support that.
AWS pricing absolutely is not simple. It's a part of the AWS stack. You need to study AWS's events/signals system to be able to write apps that make the best use of AWS's interconnected stack. You need to study their APIs / SDKs to really understand what you're able to implement. And you need to study their billing systems to understand how to implement apps that run cheaply, and be able to predict potential runaway costs.
It has to be a part of the design. That's why you may want to hire consultants for it: People who understand it better than you do, and will be able to assist you in reducing your costs.
It's just another kind of optimization. Maybe some software engineers don't like it because it hits them where it hurts (the wallet) when they don't do it right, rather than be able to brush it off as they usually do.
There is a massive secondary consulting market because the enterprise market is addicted to secondary consulting. This secondary consulting market includes AWS pricing because it includes pretty much any IT service the target market might be interested in.
A rational need for decomplexification isn’t necessary to explain the existence or coverage of enterprise secondary consulting, IT or otherwise.
> There is a massive secondary consulting market because of AWS's price obscurities.
Its. Not. For. You.
AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.
That consulting market is an optimization market. It's economics at its best.
If you are too small to have to take these things into account regardless, AWS is not for you. You're welcome to use it, but don't be surprised if you end up having to deal with these kinds of things which simply don't exist in the world of flat-price underprovisioned droplets.
This is marketing.
It's like saying you want to build a house and the quote you got ends up blowing up 100x overnight.
A great example is the $100k credit for startups. You can repeat "it's not for you" all you want, but their business is predicated on pricing ignorance and vendor lock-in.
You can get the $100/$300/$1000 tier if you are in "just checking it out" solo mode. $5k and up requires either connections, partnerships, or a serious application.
Anyway I don't know what your point is, I'm not even sure if you have one. They're not "marketing" their pricing, nor the fact that you are "forced to design systems that use less resources".
I think they are referring to this statement:
> > AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.
It is a defense that I've heard in many AWS talks in the past.
Where it turns into a 'marketing' blurb to me is my real world experience in these AWS talks in the places I work. As a real world example, we had a product that required -some- architectural work, but otherwise was solid, and could run on 3 live EC2 instances (2 web LB, 1 live backend) and 1 spare (spare backend)
The Consultant that AWS partnered us with? Suggested a very overdone architectural revamp, moving everything possible into AWS Specific technologies.
It's marketing in that in many of our experiences, we know there is often at least one person on a team who does -not- have the discipline and/or experience to -keep- a system using less resources as the field goes from green to brown.
I'm having trouble seeing how this changes what I'm saying: That with the way AWS pricing is structured, you are supposed to take it into account when designing your product.
When you reach a certain size / complexity and you have to design infrastructure, you should be making schematics, predictions on the usage peaks and troughs, how various parts of the infra will be affected, how active/idle they will be.
When you are dealing with AWS, pricing becomes extremely predictable because it can be derived from those plans. And it is far better to be dealing with that kind of model than to deal with "unlimited with a million asterisks" or something. AWS is predictable, reliable, and most notoriously has never ever increased their prices, so whatever you calculated will not go up because of Amazon's decisions.
To be honest, depending on the technology, the savings can be worth it. For example, did you know you get a discount if your traffic is served over CloudFront? Even if your distribution is set to not cache any resource, you can front your APIs with CloudFront and save on networking.
It's not that complicated, it's just not something engineers are usually used to doing. If you use an AWS service, you look at its pricing.
Take s3 for example: whenever you use it, you'll pay for outgoing bandwidth, PUTs, GETs, and storage.
So you seek to minimize all of these:
1. Bandwidth: use cache layers. This also minimizes GETs.
2. PUTs: design your app in a way that doesn't do unnecessary inserts into s3. Consider alternatives such as redis, postgres or filesystem depending on the need.
3. Storage: compress your objects if they compress well. If they aren't often accessed, use storage classes and auto lifecycle management.
Pricing in AWS generally reflects some kind of engineering limitations you will face at scale in the first place, so it makes sense to go through this whole exercise either way.
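To make the exercise concrete, here's a toy cost model over those four dimensions. The unit prices are placeholders chosen for illustration, not AWS's actual published rates:

```python
# Rough S3-style monthly cost model over the four billable dimensions
# listed above. Unit prices are placeholders, not AWS's real rates.

PRICES = {
    "storage_gb": 0.023,   # per GB-month
    "put_per_1k": 0.005,   # per 1,000 PUT requests
    "get_per_1k": 0.0004,  # per 1,000 GET requests
    "egress_gb": 0.09,     # per GB transferred out
}

def monthly_cost(storage_gb, puts, gets, egress_gb, prices=PRICES):
    """Estimate one month's bill from the four usage dimensions."""
    return round(
        storage_gb * prices["storage_gb"]
        + puts / 1000 * prices["put_per_1k"]
        + gets / 1000 * prices["get_per_1k"]
        + egress_gb * prices["egress_gb"],
        2,
    )

# A cache layer in front cuts both GETs and egress; compressing objects
# cuts storage and egress; batching writes cuts PUTs.
```

Plugging in your own usage numbers makes it obvious which of the four optimizations above actually moves your bill.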
I wish programmers had the prestige they deserve for combining science, tradition, authority, and art.
Engineers are not allowed to use tradition, authority, or art. They are restricted to being modern-day calculators.
Nothing is wrong with either.
An interesting analogue would be the automotive industry: as time progressed, companies focused more and more on 'engineering' versus art/tradition/etc. But as the industry evolved, "flashy" vehicles that took risks became either halo products for a brand or were relegated to luxury/boutique makers.
And, of course, there was a dark side to this shift. A good example from the 70s: the level of 'engineering' driving the design of the vehicle and its assembly didn't take the actual line worker into consideration; in Ohio the workers wound up getting overworked and burned out, and in some cases actively sabotaged the product, because they were being treated like automated machines.
Engineers are applied scientists.
Programmers are not applied scientists.
I'm guessing you are neither?
Software engineering could learn a lot from, say, civil engineering. It could also learn a lot from interface design and I'm sure even microbiologists and astronauts could teach us a lot. Engineering is not special.
This is exactly right. I host stuff in buckets/CloudFront and use a bit of Lambda/Route53. I end up paying $4 a month.
Now, that will be very different if 10 million people suddenly decide to visit my site, but if that happens, money probably won't be a problem after all.
I get your sentiment but the pricing is that way because they want to charge you exactly what you use, not for reserving stuff.
For example, if you deploy an EC2 instance that comes out to $15/mo total, and you deploy it on, say, the 10th of the month, do you want to be charged the whole $15? No, you want it pro-rated. But what if you only need that instance for 6 days? Then what? Are you going to do the math yourself to figure out what it would have cost, or just read the per-hour billing?
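The pro-rating arithmetic is simple enough to sketch. The 730 hours/month figure is a common billing approximation; the function is illustrative, not any provider's actual formula:

```python
# The pro-rating arithmetic described above: a "$15/mo" instance is
# really billed per hour, so 6 days costs a fraction of the month.
# 730 hours/month is a common billing approximation.

def prorated(monthly_price, hours_used, hours_in_month=730):
    """Hourly-billed cost for a resource advertised at a monthly price."""
    return round(monthly_price / hours_in_month * hours_used, 2)

# 6 days = 144 hours of a "$15/mo" instance:
# prorated(15, 144) -> 2.96 (about three dollars, not fifteen)
```

This is exactly what per-hour billing buys you: the provider does this division for you instead of charging the advertised monthly sticker price.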
Man, exactly right. Many of the guys here would love crypto once they stop asking why and start asking how.
The most lucrative projects these days are completely frontend UI; they don't even need their own backends, as they just read state from the nearest node when the client connects their wallet.
Some people forgot that the scalability game was to convert traffic into money with O(log n) overhead costs. So ditch that, and remember you are in the money game.
Most important!
Every time I read one of these stories, I get more and more convinced I will just simply never use scalable cloud tech for my side projects. I'm not going to risk my family's retirement savings on the all-too-possible chance that a small deep-implication error will cause runaway charges.
Your assumption is incorrect. I haven't been in touch with anyone in Google, and used 0 internal connections. Happy to make another post with my conversation + documentation to support this.
I reached out to GCP through their regular channels. This is not a paid post, and we are not sponsored by Google in any way.
> Having been a Googler for ~6.5 years and written dozens of project documents, incident reports, and what not, I knew how to put the case for Google team when they would come back to work in 2 days.
That certainly reads as an advantage that most non-Googlers would not have.
I'll share the doc I prepared and sent to Google in my consult, in one of the next posts.
Perhaps OP can edit the bottom to make it more clear?
Cloud is still good though, I believe it's the future. I just don't believe in deceiving your customers to hopefully rack up a high bill with them.
I view not having the ability to say "shut everything down if I go over $100/mo" the same as the pre-checked hidden cross-sales that MasterCard/Visa cracked down on in the adult industry a few years ago. Just money grabs.
I will definitely be putting such measures in my cloud platform.
"Yay! NBD! Google is the best! All I had to do was work there for a few years, rub elbows, make connections and ask for favors!" Really, google would be the best if there was no way to accidentally go over your stated budget cap by 86 million percent, or at the very least have a policy to refund people who can demonstrate that this is what happened.
So much of it is so unnecessary to begin with. You can do so much with a cheap VPS or two without thinking about lambdas or cloud functions or Kubernetes or who knows what. But these days you'd be forgiven for thinking it's dark magic.
You're not going to run up a 5 digit bill in a day by starting up on a few $10 VPSs. And you'll probably have an architecture that fits in your head to boot.
Also: the article title should really be "Saved $72k and avoided bankruptcy by being an ex-Googler."
> If you count the number of pages in GCP documentation, it’s probably more than pages in few novels. Understanding Pricing, Usage, is not only time consuming, but requires a deep understanding of how Cloud services work. No wonder there are full time jobs for just this purpose!
Great write-up - thanks for sharing @bharatsb! As you say, cloud pricing has become too complex for developers to understand quickly (they want to ship features, not calculate costs). Infra-as-code is great, but it has made it even harder to understand which code/config option costs what. `terraform apply` is like a checkout screen without prices.
We're trying to solve this problem with infracost.io, initially looking at Terraform. It would be interesting to get your feedback on whether such an approach might have helped you? Probably not as it doesn't look like you were using Terraform?
Based on this experience, we decided to lower the default value of "max instances" to 100 for future deployments. We believe 100 is a better trade off between allowing customers to scale out and preventing big billing surprises. Of course, customers can always decrease it or increase it up to 1,000, or even above with a simple quota increase request.
Why don't cloud providers allow setting a budget which cannot be exceeded? A simple, 1-click way to say: this account should never go over $500 a month. Just stop creating resources or responding to requests if it does.
- Early dev sets a limit.
- Product launches.
- Slowly grows.
- One day suddenly the entire business grinds to a halt. Globally. Across the carefully isolated shards. Everyone scrambles to figure out why! Tens of thousands of dollars are lost because of going $10 over a budget. End-users are lost. Trust is burned. If it's providing a critical system, maybe even people are hurt.
- Google then has to explain why they built in instant, global failure mode.
There are ways to design this, they can send notifications at 60% of the threshold, 80%, 90%, 95%. They can give you a grace period, put up prominent warnings in the console and for command line tools, etc. There are ways to do it, it's far from an intractable problem.
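Such a graduated scheme is easy to sketch. The percentages and grace period below are illustrative choices, not anything a provider actually offers:

```python
# Sketch of graduated budget warnings (60/80/90/95%) with a grace
# period before any hard cut-off, as described above. Illustrative only.

WARN_LEVELS = (0.60, 0.80, 0.90, 0.95)

def budget_actions(spend, budget, grace_hours=24):
    """Return which warning thresholds were crossed and whether the
    shutdown grace period has started."""
    ratio = spend / budget
    warnings = [round(level * 100) for level in WARN_LEVELS if ratio >= level]
    return {
        "warn_percent": warnings,
        "grace_period_started": ratio >= 1.0,
        "grace_hours": grace_hours if ratio >= 1.0 else None,
    }
```

At 85% of budget you'd have already received the 60% and 80% notices; only after 100%, plus the grace window, would anything actually stop. That's a long way from an "instant, global failure mode."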
I'm not saying that it can't happen but do you want to bet that a certain percentage of their business, for all cloud providers, is from carelessness and resources still running when they shouldn't or using more than they expected? Especially for bigger companies where it's easy to miss something. Just like gym subscriptions or other kinds of subscriptions where they're banking on you not noticing for a long time ;-)
This would be one more checklist item that you need to review regularly, just like domain name registrations and certificates, ...
Any of those expiring will cause an outage too.
Outages happen. Will happen. This is just one more way they can happen. It happens, you learn, and you move on.
Hell, even the biggest providers with the best admin teams on the planet have outages.
And Google doesn't need to explain anything, just like your name registrar doesn't have to explain anything if your CC was declined/not current...
2) Also if they are able to figure out when you've hit your daily free quota and cut you off almost immediately, how are they not able to figure this out?
This is good to hear. I use Cloud Run a lot for personal projects and I always set concurrency to 80, max instances to 1, memory to 128Mi (unless it's something beefy that needs the memory), and CPU to 1. If I need to scale it up, or I decide to open it up to actual usage, I'll do it when I recognize the need.
If I knew something was going to take more than a couple dozen milliseconds to run, it was built on the DO droplet.
Why would I pay by the CPU second for something that is taking a lot of CPU seconds? That billing model doesn't make sense.
For my super quick REST endpoints, yeah, all on Firebase, the convenience of writing + deploying makes it an obvious win. (Unless something goes wrong, debugging Firebase functions is not fun...)
Deploy your Postgres (for DB), minio (for s3 storage) and your webapp from Caprover. Add nodes as you need to scale out.
> To overcome the timeout limitation, I suggested using POST requests (with URL as data) to send jobs to an instance, and use multiple instances in parallel instead of using one instance serially. Because each instance in Cloud Run would only be scraping one page, it would never time out, process all pages in parallel (scale), and also be highly optimized because Cloud Run usage is accurate to milliseconds.
> If you look closely, the flow is missing few important pieces.
> Exponential Recursion without Break: The instances wouldn’t know when to break, as there was no break statement.
> The POST requests could be of the same URLs. If there’s a back link to the previous page, the Cloud Run service will be stuck in infinite recursion, but what’s worst is, that this recursion is multiplying exponentially (our max instances were set to 1000!)
Did you not consider how to stop this blowing up before implementing? Having one cloud function trigger another like this with no way to control how many functions are running at the same time with no simple and quickly met termination condition (with uncapped billing) is playing with fire. It's not going to be optimal either if most of the time each function is waiting for the URL data to download.
You need to be using something like a work queue, or just keep life simple and keep it on a single server if you can.
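For comparison, a minimal in-process crawler with a visited set and a hard page cap avoids both failure modes the article's quotes describe (back-link loops and unbounded fan-out). `fetch_links` here is a stand-in for real scraping:

```python
from collections import deque

# The runaway described above: each invocation POSTs every discovered
# link back into the system, so back-links cause exponential fan-out.
# A visited set plus a hard page cap guarantees the crawl terminates.

def crawl(start_url, fetch_links, max_pages=1000):
    """Breadth-first crawl bounded by max_pages; fetch_links(url) -> list."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:   # the missing "break" condition
                seen.add(link)
                queue.append(link)
    return order
```

Even on a graph where every page links back to the previous one, each page is visited exactly once, and `max_pages` bounds the cost no matter what the pages contain. A distributed work queue gives you the same guarantees at scale.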
It honestly reminds me of debugging a Jenkins pipeline. Something that was designed to be super generic of a runtime but yet the tooling can inexplicably only live on computers that are not your local development machine, and all of it is maximally painful to stub or test or debug to seduce you into "just running it live".
It's like the opposite of the "small agile team" thing they were talking about. If your program requires 7 API keys and some cloud environment to do a test run, I want no part of it.
Launching a cloud function that recursively triggers the same cloud function, that doesn't have a simple safeguard for it looping or blowing up, and where billing scales with the number of cloud functions ticks the "very high risk" and "very high impact" boxes for me. A program running on a single server isn't similar here (you could accidentally create a DoS attack though).
Typical cloud function use is some event gets triggered like a user sign up, the function executes, then it halts. The above isn't a standard use case and is so incredibly risky this approach shouldn't be attempted in my opinion.
Basic tier, 5 DTUs, 2 GB is listed as ~$4.8971/month or $0.0068/hour on this page. Extra storage would cost more but is not available for the basic tier.
We've seen this happen with similar stories on AWS. Neither platform supports prepayment with a hard limit on costs, and this seems unlikely to change.