Burnt $72k testing Firebase and Cloud Run and almost went bankrupt (tomilkieway.com)
282 points by bharatsb 9 months ago | hide | past | favorite | 388 comments

The fact that cloud providers don't have a simple "This is how much I can afford, don't ever bill me more than that!" box on their platforms makes development a lot scarier than it really needs to be.

This is my worst nightmare. Lol. I guess now is a great time to give Azure a shoutout for sitting on their hands for 8 years without so much as a response to the community for half a decade [1].

At least AWS allows using a prepaid credit card so they’ll need to call me if things go haywire. I bet if that $72k charge went through it would have been much harder to get out of. “Sorry, we don’t have the money” is a much better negotiating position than “can we please have our money back?”

1. https://feedback.azure.com/forums/170030-signup-and-billing/...

But then consider the following hypothetical but possible scenario:

Sorry, until you pay, no more Amazon services for your company.

Now you must move to a new cloud provider (or make a new company).

Oh wait, they now exchange (bad) customer information to better detect fraud, and you just got flagged as "owing a lot to Amazon", so no cloud for you anymore at any provider.

Now you want to buy your own hardware. So you need a loan from the bank, but dang, you owe too much to a big company already, so no credit for your company either.

Luckily, part of the above scenario is not how reality currently works. But who knows when (part of) such a horror scenario becomes real.

In the end, relying on being unable to pay money you contractually owe is just not a very viable strategy in my view.

I see no reason why such an arrangement couldn't be optional. Different projects, teams and people have different needs, and cloud computing services are marketed specifically on this point. It makes no sense that there isn't even an option in Firebase or AWS to immediately stop services over a certain amount. The current situation is ripe for lawsuits IMO.

> “Sorry, we don’t have the money” is a much better negotiating position than “can we please have our money back?”

I agree but why would you like to be in either position anyway? The so-called cloud services are terribly overpriced when compared to traditional servers.

Done correctly they save a lot of IT time.

I've seen companies hire five six-figure people to try and cut the Amazon bill by a couple of grand a month.

Never understood spending 50-100k a month to maybe save 5k

>Done correctly they save a lot of IT time

Not really, computing done correctly is about avoiding all of the pitfalls and finding ways to get zero-cost benefits, free computation out of necessary redundancy, etc. Selling cloud computing is about creating options around every pitfall, finding ways to charge for every mitigation that will be necessary, and charging for redundancy in the mitigation strategy for the mitigation strategy.

Even if you pay for all the redundant managed blah they offer, so that no single point of technical failure exists in their network, their billing and IAM are still your single points of failure. If you diversify to multiple clouds, the guarantees either cloud offers become pointless redundancy, so you are paying 10X pricing for an inadequate redundancy layer.

If you look at Google's own model for computing, they didn't fall for this themselves, the computers they used were intentionally unreliable to not recursively pay for reliability and redundancy at any layer that can't provide the needed guarantee.

You can basically go all in with one of these clouds and become a franchise add-on with roughly the same rights as your average McDonald's store owner, or you are managing a strategy that, because of the complexity of these offerings, is far more complex than just using metal and free software.

They're very useful if you're testing a concept, need agile scaling of computational power, or are just starting a service and don't want to / can't invest the capital in dedicated hardware. I agree with you on your last point though, making your service entirely dependent on these services makes you little more than a franchise and is a potential vulnerability if you ever compete with any important existing service. It probably isn't a good idea for a mature or rapidly maturing business to rely heavily on these services.

It’s often a fixed vs ongoing cost question. Spending 200k to save 5k per month breaks even in 3.4 years.

However, for growing companies that 5k/month AWS premium can hit 200+k/month very quickly

It is baffling why cloud providers don't have that option.

I might want to have an app because I don't mind spending 50 dollars on my pet project as a hobby, but I don't ever want to spend more than that. Not if I write a bad query that suddenly becomes very expensive, not when I get attacked, and not even when I have legit users.

By the way, the same goes for some companies, too, just the threshold would be different.

It's not complicated to add configurable hard limits, but these companies don't allow it because the current situation is more profitable for them.

They want to suck the maximum money from consumers before they realize.

For every one person who complains loudly and gets a goodwill gesture, there are hundreds of other companies that will not notice or just pay without recourse.

> They want to suck the maximum money from consumers before they realize.

This is a naive understanding of how corporations like Google and Amazon work. Bad will and using gym membership tactics aren't how they scale or make money. Getting you to confidently try things knowing you won't get charged (the reason they have those free tiers) so you'll get your company, your start-up, your next side project on it is much better for business.

It's a miss that things like this aren't implemented and widespread, not by design.

> It's not complicated to add configurable hard limits for these companies but they don't allow it because the current situation is more interesting for them.

I'm not in this space, but from my observations:

- Each service has a different billing model and metering model. Most likely this data is held by the service. I'm familiar with AWS so I'll use them as an example. I'd wager only DynamoDB or only Lambda (the service owners) know how much of those services you've consumed

- Billing is most likely reconciled asynchronously after collecting all data from all services by an entirely different department with knowledge of payments and accounting

- GCP, AWS, Azure launch 50+ services a year

- Each large customer most likely has a special rate. I bet Samsung or Snap pay an entirely different set of rates than the normal customer. There are thousands of these exceptions

- Cutting your service off when you're over the limit involves an incredibly complex set of edge conditions. Should your long-running instance hosting a critical service be shut off because of an experiment on a new ML workflow?

Even with only the above I can see the difficulty in globally limiting your spending limit at an accurate level. I know there are features for both AWS and GCP and they try.

It's easy to stand on the sidelines and handwave away technical complexity at scale, but I'd encourage you to give all of these providers a more charitable view, at least on this topic.

>Bad will and using gym membership tactics aren't how they scale or make money.

Except they do that with their actions.

>Cutting your service off when your over the limit is an incredibly complex set of edge conditions.

Sure! But if they cared about customers as you claim, they'd let users set hard limits, and when one of these mishaps happened, stop the services when their system eventually knows that the quota has been exceeded... and, make the user only pay the hard limit as the maximum amount. If this continues to happen, warn the user that their account will be terminated... and that's that. But they'll never do that.

Most of their clients pay for these mistakes because they don't have the reach or skills to make this a viral social media article to get people's attention and hence get them to forgive the costs.

I'm sure they know how much they make in revenue because of these mistakes and they deliberately don't do anything about it.

I work in this space and you're absolutely correct. Your last paragraph hits the nail on the head for pretty much every complaint people have about the public clouds.

Right, so let's say Congress passes a bill that requires cloud providers to enable hard spending limits by start of February 2021, and eat any extra usage costs that exceeded a set limit.

What is your educated guess by when this feature would be essentially correctly implemented in AWS and GCP (essentially = negligible costs to the providers due to either false negatives (bills they eat) and false positives (PR fallout, when SomeSite gets shutdown despite not being over limit)?

The fact that the dashboards and alerts have a delay sounds like there might be difficult consistency stuff going on. Many nodes need to coordinate their usage and billing. It may be a difficult problem, but solving billing problems might not really motivate anyone at the company. It's not a "cool" problem for engineers and not profitable for product.

>> The fact that the dashboards and alerts have a delay sounds like there might be difficult consistency stuff going on.

I think that's true. It's easier to measure usage and aggregate that data after the fact than to meter it in real time and stop at a limit. Those are very different things. What happens if you hit the cap while running multiple processes spread across a cloud?

One improvement might be to throttle things as the cap approaches, but that doesn't really change the problem at all. Doing that and having the provider eat any overages should solve it from the user's point of view.

There's an easy solution: you set a limit, and every time a service needs to spend some money it allocates a small portion of the budget; after some threshold, it puts the unused money back into the budget. The only downside is that your spending limit will be reached too optimistically, but I prefer that to paying thousands more than I wanted to. Given how the system works, maybe a lower and a higher threshold for the budget could be set.
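The reserve-and-return scheme described above can be sketched roughly like this (a toy, single-process illustration; `Budget`, `reserve`, and `settle` are made-up names for the sake of the sketch, not any real cloud API):

```python
import threading

class Budget:
    def __init__(self, limit):
        self.limit = limit       # total spend allowed, in dollars
        self.reserved = 0.0      # slices currently handed out to services
        self.spent = 0.0         # confirmed usage
        self.lock = threading.Lock()

    def reserve(self, amount):
        """A service asks for a small allocation before doing work."""
        with self.lock:
            if self.spent + self.reserved + amount > self.limit:
                return False     # budget exhausted: refuse the work
            self.reserved += amount
            return True

    def settle(self, reserved, used):
        """Afterwards, record actual usage and release the unused remainder."""
        with self.lock:
            self.reserved -= reserved
            self.spent += min(used, reserved)
```

Because reservations count against the limit before they are actually used, the cap trips early, which is exactly the "reached too optimistically" downside mentioned above.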

Every time a GCloud rep would ask us what we need, we would say: fix the billing interface. As far as I know, it never got fixed. The feelings I get when looking at cloud billing interfaces can be summed up as: obfuscated, like a pawnshop, and caveat emptor. I kind of came to the conclusion that if the cloud giants are not fixing their billing interfaces, then, just like Amazon not sending you the details of the items you ordered by email (thus pushing you into the app to deal with primenesia), there is a 'business' reason why the billing interfaces are generally incomprehensible.

They want to suck the maximum money from consumers before they realize.

I have very little money so I just don't use their services because a mistake would be disastrous. They might be losing out on me making a unicorn app on their platform. It's unlikely, but while the possibility of catastrophe exists I'll stick to not using them. That extends to not recommending anyone uses them either in case the worst happens.

I have very little money...

Then the harsh reality is: companies don't care. Yeah, your app might turn out to be a unicorn, but the overwhelming odds are that it won't. And no one cares that you'll tell your other broke friends to avoid the service.

We'd all like to think it to be different, that a company might care about appeasing my broke ass. But as already pointed out, they want the whales. I also wonder, despite the number of years "cloud services" have been around, if companies aren't still trying to figure out a gazillion other things and limiting customer spend might be a bit low on the priority list.

Meanwhile this leaves an opportunity for a different company to provide these services.

I do my best to avoid FAANG giants who don't think about me.

The highly price sensitive customer will force you to compete only on price. That's just forcing yourself into a commodity market. It's bad business. I would never try to cater to that market. Very dangerous. Competition will drive margins down to near zero.

For hobby projects you probably don't need auto-scaling, and should use a provider that charges a fixed monthly rate. You'll "waste" a little bit of money on unused uptime, but for a hobby project it will be a minuscule amount.

But then I don't get to be a "serverless hero" and write blog posts about how my side project (that no one uses) costs $0.000034 to run instead of $5.

> It is baffling why cloud providers don't have that option.

...is it? If a lazy dev leaves their corporate account open and you can bill it for their negligence, protected by the contract they already signed, you earn a lot of money. From a purely business perspective, it is stupid(!) to provide a safeguard against that.

Edit: to be clear I am not advocating one way or the other. But it is surprising that people are "baffled" by this obvious profit optimization.

Google is around a trillion dollar company, your $75,000 is a completely immaterial amount to them. Not to mention it would be a one time payment that would drive away customers and lead to bad PR like this post.

Except Google has 4 million customers. If 1% of their customers made a 75k mistake, they make 3 billion dollars.

So... where else are you going to go if you don't like these policies?

If everyone has this policy, Google, Amazon, Microsoft, and the rest are in a good place. And suddenly it's the "industry standard."

This hypothetical is already enacted today...

As a former victim to the same issue as OP, I am furious every time I see a Googler promote that as a solution.

In our case, we racked up a $10000 bill on BigQuery in ~6 hours, when a job was failing and auto-retrying.

We had set up every alert correctly and our reaction time was about 5 minutes (about $100 of usage, no big deal). So how did we get a $5000 bill? Google's alert was 6 hours late (according to them, this was root-caused to us, because we were submitting jobs continuously). They pointed to their TOS and said they don't guarantee on-time delivery of the alert.

I had to write up a blog post with fancy graphs and prepare it for social media before they finally agreed to eat the bill.

By now, GCP documentation says billing alerts can be late by days. Yes, days, not just hours. Totally crazy.


Link (see the blue info box a bit below the anchor on which the page opens):


> Recommendation: If you have a hard funds limit, set your maximum budget below your available funds to account for billing delays.

But if 1,000,000% lower doesn't work ($7 vs $70k) then...?

P.S. I'm not a Googler.

You misunderstand the intent of this: you basically set this, and even if it fails (because messages are delayed), Google will refund.

This has happened to us before: they do a refund, since you had set the limits correctly. In general, they are not super assholes. I actually don't know of a case where they have refused to refund.

AWS is better here, since GCP doesn't have a support dashboard, so the "chasing them" experience is much worse.

Is there a public postmortem anywhere? Your message points to 'no', but just in case.

> There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped. Following the steps in this capping example is not a guarantee that you will not spend more than your budget.

This looks like it has the same problems as the post, because it also relies on those budget alerts that can happen a long while after you've exceeded them.

Very late to the post, but this seems like "eventually consistent billing". Distributed systems seem to rely on "eventual consistency", but "eventual consistency" is not what most people want in billing threshold scenarios...

"Following the steps in this capping example is not a guarantee that you will not spend more than your budget."

"Resources [...] might be irretrievably deleted."

Also it's not automatic, you have to manually write code to do it, and test it, and make sure not to break it.

A reasonable implementation of this feature would be built into the console, guarantee a maximum spend, not require writing your own fallible code, and provide an option to preserve storage (at normal cost) so that all your data isn't deleted when your compute/API stuff is shut down.

Extremely technically, the only GCP product that had this feature was App Engine Standard v1, but looks like it's deprecated as of the end of 2019 (https://cloud.google.com/appengine/docs/managing-costs#chang...)

Probably hurt revenue ;)

As a former App Engine PM who spent a lot of time with billing/quotas (though, not the one who deprecated this feature), it's likely due to some combination of:

- hard limits caused downtime more often than they prevent these blog posts

- hard limits were inconsistently enforced, even within GAE

- platform wide quota notifications were implemented (reached "GA"), leaving the question of "how a developer wants to handle this" to the developer, not the platform

- maintenance burden

The "I bankrupted my startup by running tests in an infinite loop" blog posts happen ~once a year, while the number of customers (including internal teams!) who inadvertently went down because of this quota was staggering. I feel like I used to see one a week, at least. Most often someone on the team was like "oh I'm going to turn this down to zero because we don't want to spend any money during development", never told anyone, and then they go live and they forgot to turn the knob back up (or didn't properly estimate traffic/costs and set it too low).

I can tell you it hurts revenue a lot more when a large customer goes down for 15 minutes due to quota issues and their usage drops to zero (both in terms of revenue and customer credibility) vs when tiny developer accidentally blows through 10k in a month and we refund it (since, obviously, the providers cost is a lot less than that).

Personally, I don't think this is a good enough reason. Worst case, if I experience an unplanned shut down, I will increase my spending limit. Removing the feature entirely because of this just doesn't make sense.

Also, the fact that Google requires a credit card for almost every single transaction, even free ones, gives the impression that it is for financial purposes (aka a way to get more out of developers, or out of those who might be free-loading on App Engine's free tier).

I gotta say that seems like a bad reason to remove the feature. If someone intentionally set a hard spend limit - hit it - and their service went down because of it that's not Google's fault. The simple solution for that customer is to just turn off or increase the limit.

This is a reasonable way of achieving the balance needed. My company would freak out if we had even a short outage that affected all our customers because we set a billing quota too low. And I'd feel a lot more comfortable experimenting with serverless on my own projects if I knew Google would have my back if I made one of those once-in-a-year mistakes.

OP claims that the budgets are not real-time: they are eventually accurate, but if you spend too fast you may end up with a sum larger than your budget before anything triggers.

It's surprisingly complex to do that. Let's take a simple example and say your cloud account is doing 2 things - compute & storage.

Compute is an active resource, when you exceed your budget it can be automatically shutdown.

Storage is a passive resource, when you exceed your budget it can be automatically....deleted? That's almost always the wrong action.

Providing fine-grained cost limits helps some, as passive resources usually don't have massive cost spikes while active resources do, so you can better "protect" your passive resources by setting more aggressive cost limits on the active resources.

This quickly gets more complicated. Another example is most monitoring services are a combination of active (actual metric monitoring) and passive (metric history) resources. A cost limit on that monitoring service likely won't provide sub-service granularity, mostly depending on whether the service even has different charges for monitoring vs history.

Oh, also, even for a passive resource like storage, you also have active resource charges whenever you upload/download your data.

Ugh, what a mess. The best thing to do is pay attention to your spending, just like you do with your personal & corporate budget.

But we've had disk quotas before that mostly worked?

If anything it seems an easier problem than processor time.

I recall disk quotas on shared systems at university back in 1998 and I'm sure they existed before that.

Two thresholds IIRC, one at which you get a warning, second at which you can't write any further and the disk write operation fails.

I don't think they deleted files, it was just you couldn't write more than [quota] bytes to your disk.

Is there something particular about cloud based systems that prevent this from working?

ie. is this a specific problem with distributed storage?


S3 costs money to keep your files in, even if you're not touching them, so just preventing further uploads wouldn't do much to prevent your AWS bill from increasing.

It would let you set an upper limit on the price you pay though. Better than accidentally misconfiguring a logging service and writing gigabytes of unneeded data.

>>> But we've had disk quotas before that mostly worked?

AWS has quotas on everything, including quotas on EBS storage per region.

You will realize that after you spin up some instances with disks and they fail because you've hit 10 TB of EBS storage. You have to raise a ticket to raise the limit.

The IaaS services are easy. Other services are more difficult. Something like BigQuery ML could generate massive bills pretty easily.

> Storage is a passive resource, when you exceed your budget it can be automatically....deleted? That's almost always the wrong action.

A better option would be to automatically reduce the budget by the amount it would cost to keep the storage forever. If doing that would reduce the budget to zero, do not allow increasing the amount of storage. That is: assume the storage will not be deleted, and budget according to that.

How does this actually work? It clearly can't be forever, since any non-zero dollar amount * infinity months is infinity dollars, which is going to reduce the budget below zero since any non-infinite number minus infinity is less than zero... thus locking it immediately.

Even if we say "you get N months of storage before we delete it" and subtract N * current storage cost/month, what happens after you're locked out of all actions because you added an extra GB? Storage APIs cost money to use, so you would get locked out of those too (note that if you're not, people would set arbitrarily low limits and get storage access for free) and couldn't retrieve anything. The only remaining actions are delete (which is free) or raise the quota and do the whole rodeo over again.

Abuse is impossible to ignore at public cloud scale, so "free storage forever" (or even, storage at a one time fixed price) as the fallback isn't a viable option.

Lastly, from an optics perspective, which blog post would you rather see on the front page of HN: "I did something dumb and spent too much money on Cloud" or "Google is holding our data hostage" (or "Google deleted all my data")?

Source: I launched Firebase Storage, which has a GCS bucket that has a hard limit.

For it to work, obviously the budget has to be per month (for instance, $100/month), instead of an absolute limit. Most of the time, that's what you'd want: if you calculated that what you use will cost $50 each month, setting a budget of $100 per month would give some room for growth while preventing billing disasters (and you can always increase it a bit if necessary).

Off the top of my head, I'd say the way to calculate that is: if you're budgeting for storage, take the max you can afford for the time period you'd need to recover data in the event of a budget overrun, taking into consideration the delay time for notifications. And that sounds like something that is reasonable to put on the customer to calculate.

You've explained why it's hard for Google to not give me resources I can't pay for, but that's not what I care about, or what I'm asking for. What I'm asking for is a feature where I set a hard limit of $100 and that's the most I get billed - if my account accidentally uses $5000 of resources before Google reconciles the usage with my budget then Google automatically waives the additional $4900 and then limits my account in some way until the problem is rectified.

Practically every time these blog posts come up they end with the provider refunding the costs. I just want that refund to be a feature.

So...you're saying that Google should give away $4900 of usage?

How will Google automatically differentiate between an "honest mistake" and someone taking advantage of this feature?

So...you're saying that Google should give away $4900 of usage?

Yes. But they should also develop mechanisms to warn users that they've made a mistake before it happens, and improve the speed they can detect mistakes to lower the cost, and invent some way to detect someone intentionally abusing the feature.

But mostly they should make the fact they do give away $4900 when a mistake happens explicit. That isn't actually a change. They just need to make it clear that's what happens.

It's almost like you could make it configurable so users can choose what happens if they go over, and to what extent.

It's not really that complex. All compute should shut down. All API calls should fail. Storage should be (optionally) preserved at normal cost.

Your examples are simple given this framework. Uploading/downloading data to storage is an API call. Monitoring is compute. Metric history storage is storage.

But storage costs continue to add up even when you're not accessing them - there's a cost to storage existing which continues to accrue with time.

When there's no budget left, what do you do with those accruing costs for existing storage?

Storage costs are predictable and slow to accumulate. They are rarely the problem people are trying to address when they set a budget. As I said, storage would optionally continue to be charged at the normal rate, the other option being immediate deletion if you really need a super hard budget cap.

Once you get the alert that your budget is tripped you can go and see what's in storage via the console and delete it, only paying for a few hours of storage for things you don't want.

If the amount of storage that you can use is limited by quota (say 50GB) the problem becomes relatively easier.

You set a quota for 50GB of storage and no more. The server then restricts you by disk quota to that amount of storage.

The cost is then calculated as 1.15USD per month.

So you don't pay more than 1.15 per month.

Compute and transfer (and other things) could be covered by separate similar quotas with a single maximum spend figure at the bottom of the table.
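The costed-quota idea above can be illustrated like this: each hard resource quota converts into a worst-case monthly price, so the table's total is a guaranteed spending ceiling. The unit rates are made-up examples; only the storage row matches the 50 GB / $1.15 figure above.

```python
QUOTAS = {
    # resource: (quota, $ per unit per month)
    "storage_gb":  (50,  0.023),  # 50 GB * $0.023/GB = $1.15/month
    "egress_gb":   (100, 0.09),
    "compute_hrs": (200, 0.05),
}

def max_monthly_spend(quotas):
    """Worst-case bill if every quota is fully consumed."""
    return sum(qty * rate for qty, rate in quotas.values())
```

With these example rates the table bottoms out at $20.15/month, and no workload can exceed it because the quotas, not the meter, are the enforcement point.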

Moreover, once API calls are locked, what next? You can't delete files, and even if you can delete them, you aren't able to retrieve them before deletion... If a platform allows you to do those actions, then it's ripe for abuse, and at public cloud scale that ends up being a far, far bigger problem than the occasional blog post that ends up as a refund (because the other blog post is "I got free storage forever with this one weird trick").

It's really not a simple problem because the next action depends on the choice the developer wants to make: do they increase the budget or decrease usage, and no cloud provider wants to make this choice because no matter what the choice is it will be viewed as wrong. The best they can do is provide developers the best insight and tooling to make this choice themselves.

Once API calls are locked you can open the console, disable all the things that caused you to hit your budget, and then raise the budget a bit to get access to the storage APIs again and manage your storage. Or, the console's storage browser should let you browse and delete files as well. And again, there should be an option to delete all storage immediately for a hard cap on your budget if you really want that.

You need separate costed quotas for each type of activity with a combined total at the bottom.

You could also have a setting in the admin panel as to what the system should do:

[ ] I want to keep going beyond my quotas (but email me)

[ ] Please shutdown my site

If the answer is "you have a dollar limit set of GCS GETs, GCS PUTs, etc." I guess I could see this working, but hot damn that'll be a horrific interface.

The other issue is that many large customers pay different prices, so billing and quota aren't really tied to each other, and it wouldn't be easy to reconcile this.

As for the button... having been on the product side of building this button, there is no right answer: people will say they never got the email (or it went to the wrong inbox, or their dog ate their phone...) or that they never checked the box to "shut down the site" ("I didn't think it would do X that made my app not work").

I'd probably want it grouped by category with a drill down interface for the specifics.

Probably arranged so you can type in a figure at the bottom for monthly expenditure and it would balance out the requirements based on typical use cases.

So enter $50 in the monthly cap figure and it allocates, say, $20 to compute, $20 to transfer operations and API calls, $10 to storage

which you could then fiddle with of course.

I can't offer much on the second point other than to say that unexpected bills annoy me much more than services that stop working.

I've also never worked anywhere with unlimited budgets. (alas)

I can see that there are probably cases where uptime is more important so they would be more annoyed the other way around.

Not only development but also running in production. You can configure alerts but you can't configure a hard limit. That's just insane. It makes working with GCP like playing with fire.

What about throttling?

Nice to have, but people want a throttle that shuts off dead at a certain number of dollars

aka "Bankrupt me more slowly"

Throttling doesn't stop the drain.

Probably because it's not so simple on the backend.

I'm guessing there's a good chance a lot of systems are only eventually consistent, which could explain why billing takes a long time to update.

Aggregation of service usage for billing could also be an expensive operation, so it's only updated irregularly instead of being near real-time.

It would be a great feature, but I can imagine it being very complex. It's also probably cheaper for them to just waive excess usage like this instead of building out a solution.

This is a billing question, not a technical question, and looked at through that lens it's easy to put a hard limit on a monthly bill: just don't ever issue bills greater than that amount.

If I say I only want to pay a maximum of $1000 a month, and I hit that limit but it takes a bit for the provider to shut everything down so really $1100 of resources were consumed, then the provider eats the $100 overrun and I get a bill for $1000.

With an actual hard limit you create a financial incentive for the provider to minimize this overrun. Yes it might be difficult to fix but I assure you, if hard limits existed, the technical issues would be solved soon enough because now there's a reason to invest in a solution.

It's also a mostly solved problem because advertisers have budgets and it's common to implement globally distributed budget servers to avoid showing more ads than the advertiser paid for, despite tens of thousands of individual web servers needing to know which ads in their inventory have budget left.

It's a fun exercise similar to global rate-limiting/load-balancing.

That is fascinating.

If you have the time could you (anyone feel free) talk a bit about how you would implement a globally distributed budget?

I can imagine a few simple options, but they all seem to have significant shortcomings.

I think the simplest is a tree of servers (which can be sharded by user if necessary for load balancing). The root has the total budget and offers short-term small leases of ad views to child nodes, who may also have child nodes doing the same thing with even smaller leases.

Web servers check with the leaf nodes for every ad they want to show. If that leaf has a budget greater than zero it decrements its own budget and returns success. If the web server gets a success it shows the ad, if not it checks with another budget server or two. Web servers frequently log how many ads were served per client.

Whenever leases are up the intermediate nodes inform the parents of how much was spent and get a new lease. If nodes crash or otherwise don't return their lease then their parents have to assume the whole budget was spent, but leases are kept small to avoid big discrepancies.

If the root crashes then there are problems so the root can be a slow ACID replicated database as long as its immediate children are mostly reliable and take large enough leases to minimize load on the root.

Periodically web server logs are aggregated to adjust the root budgets to account for crashed intermediate nodes and web servers.

The tree approach allows global low latency operation guaranteeing no overspending and minimizing underserving. Nodes are provisioned from the leaves on up to handle the necessary amount of traffic and to ask for leases large enough for 99.X% percent of child requests to succeed.
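A toy single-process sketch of the lease mechanism described above (node structure, lease sizes, and numbers are illustrative, not any production system):

```python
class BudgetNode:
    """Node in a budget tree. The root holds the real campaign budget;
    children hold short-term leases carved out of the parent's budget."""

    def __init__(self, budget=0, parent=None, lease_size=10):
        self.remaining = budget      # ad views this node may still grant
        self.parent = parent
        self.lease_size = lease_size

    def try_spend(self, n=1):
        """Decrement the local budget, refilling via a parent lease when
        empty. Returns True if the ad may be shown."""
        if self.remaining >= n:
            self.remaining -= n
            return True
        if self.parent is not None and self.parent.try_spend(self.lease_size):
            self.remaining += self.lease_size
            return self.try_spend(n)
        return False

# A root with a 100-view campaign and two leaf servers leasing 10 at a time.
root = BudgetNode(budget=100)
leaves = [BudgetNode(parent=root), BudgetNode(parent=root)]

# 120 ad requests arrive across the leaves; only 100 can ever succeed.
served = sum(leaf.try_spend() for leaf in leaves for _ in range(60))
print(served)  # 100
```

The real versions are distributed (leases cross the network, crashed nodes forfeit their lease), but the invariant is the same: the sum of all outstanding leases never exceeds the root budget, so overspending is impossible by construction.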

Any cloud provider could use the same technology on individual hosts to grab leases of CPU, RAM, disk, etc. by the minute per user and terminate services with no budget. Leases could be a lot longer because most budgets are monthly to cover all service needs and not pathological ad campaigns with low budget, high bid, and huge audience.

It's up to cloud (or ad server) providers to decide whether to stop services if the budget system is broken. In most cases it makes sense to fail open and keep serving and eat the loss because shutting everything down will incur even bigger losses.

I think that's not really an issue though is it? If you say "never charge me more than $100" they can a) ensure they never charge you more than $100 and b) work to optimize their own systems so that they cut you off as close to $100 as humanly possible. In the beginning they might eat some costs since it takes them a day to catch it, but they could work over time to bring that down. And it's not like it's costing GCP/AWS/Azure "sticker price" to provide their services.

Azure has it for some plans [1], but not others like pay-as-you-go. It seems arbitrary.

1. https://azure.microsoft.com/en-us/support/legal/offer-detail...

It's even worse for services like AWS CloudFront.

One of your competitors could just rent a cheap server on OVH with uncapped transfer and run up $10k in costs for you in a few hours.

Maybe that is your cue to move your server from AWS to OVH*

* I don't have any idea about OVH

CloudFront is a CDN. What the poster you've replied to is talking about is a competitor setting up a server that repeatedly downloads your content to rack up a huge CDN bill. An OVH server is not a viable replacement for a CDN, so you can't just migrate from CloudFront to OVH.

There is an easy explanation: it's hard to build this feature, there is no pressing demand from upper management, and it's easier to get promoted doing other, simpler projects. Think about what a real-time snapshot means: you need to know how much of every service is being used, project that into the future, and compute the costs.

Really, it is a bit disappointing to see a bunch of engineers in this thread talking like this is some monumental, borderline unsolvable problem. The solution is pretty easy to figure out, even taking into consideration the different needs of different customers. The implementation might not be trivial, and legal liability questions might have to be considered beforehand, but the problem is not that hard.

That should be illegal, but hey, at least they support noble causes, so let them be. It sounds cynical but this is their game.

To me this is akin to the personal checking overdraft scams banks were running for many years until those practices were made illegal.

There is no engineering hurdle that is a valid excuse for allowing a customer to go over their stated budget by 86 million percent.

AFAIK digitalocean has notification if you go over user defined limit.

FWIW, Azure has that option

Price transparency is the antithesis of the "cloud" and its current financial success.

There are some cloud services where it's not quite this simple.

S3 -- you can't just delete customer data because they hit a billing limit

RDS -- not going to drop databases on the 27th of the month

Anything with persistent data is going to have to stay alive and accumulate costs. Admittedly these services aren't where the crazy bills come from, but it does make a simple kill switch a bit more complex.

You don't have to immediately delete customer data.

Most services with a hard cap have a "grace period" of a couple of days during which the service does not work but the data is not deleted. That gives you some time to get notified of the issue and fix the problem or increase the limit.

This is a solved problem for every other service out there. You don't just delete the data, you give the customer a few days, weeks, or a month to pay their bill and if they don't, then you delete their data.

The problem with this though is it opens a vector for exploitation: users could just use the grace period to store data for free for a period of time. This can quickly become a heavy financial burden if enough people do it.

You could factor that into the price, but then you're potentially making the price point even more unattractive to users than it already is, and users that are responsible with their budgets would be subsidizing those that aren't. Not a very workable solution.

I'd say a good solution is giving customers the option to stop accruing more storage capacity, and to have a max deadline accounted for in their budget to store data (basically each customer decides whether or not to pay for a grace period).

I've accidentally let my OVH subscription go unpaid, and they gave a 7 day window to pay my invoice or have my data deleted. That seems pretty fair to me, and they seem to have wide enough margins to eat the cost and still offer some of the cheapest prices out there right now.

I wouldn't be too scared. For AWS you pay about $0.20 per 1 million requests on Lambda. You can do quite a lot with a single Lambda function. And a million of anything is a lot for a dev. Put an HTTP API Gateway in front of that with a CDN and you're hitting ~ a few dollars.

Skip one coffee, or put a 20 dollar note in a book one month, and you're fine. And if you have to use EC2, just use a t2.micro or a Raspberry Pi on your desk.
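To make the arithmetic concrete (using only the $0.20-per-million figure from this comment, and ignoring Lambda's separate GB-second compute charge, API Gateway fees, and data transfer):

```python
PRICE_PER_MILLION = 0.20  # dollars per 1M Lambda requests (figure from above)

def lambda_request_cost(requests):
    """Request-charge portion of a Lambda bill only; ignores GB-second
    compute charges, API Gateway fees, and data transfer costs."""
    return requests / 1_000_000 * PRICE_PER_MILLION

print(lambda_request_cost(1_000_000))      # 0.2  -- "a million of anything"
print(lambda_request_cost(5_000_000_000))  # 1000.0 -- a runaway loop
```

Which is the whole problem in miniature: the dev-scale number is pocket change, but a bug that multiplies request volume by a few thousand multiplies the bill the same way.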

But really the first lesson you should learn in any cloud setup is Billing Alarms :)

If you're doing ML or CV work then it's probably cheaper to build on the desktop and port to cloud once you understand what the workloads are.

For AWS you get about $0.20 per 1 million requests on Lambda.

If you get it right, great. If you get it wrong then you end up doing billions of operations by mistake, which could cost a huge amount. That's what happened to the author of the article.

But really the first lesson you should learn in any cloud setup is Billing Alarms

Alarms only tell you that something is going wrong. They don't stop it. If your mistake is costing $1000/minute and you're an hour away from a computer you have a very expensive problem.

You can trigger events from alarms. And Lambdas only last 15 minutes. So still cheaper than $75k :D

That's not a bad idea. You could set it up to delete all Lambdas (assuming you've got a CI/CD system capable of redeploying them quickly later) if the billing goes over. Of course, this may hurt you more because of the outage it would cause. Up to you really.

So you're taking code that you haven't validated locally to see what resources it uses, you're putting this up on the cloud to test it, then you are immediately going to the middle of nowhere without your laptop/phone/etc, and you can't arrange for a coworker or friend to pull the plug for you if something goes wrong?

> and you can't arrange for a coworker or friend to pull the plug for you if something goes wrong?

This is HN, many of us are solo founders with no coworkers or employees. Also how could a "friend" pull the plug? If it was a physical server running in your house maybe, otherwise you can't really give them access to your AWS account with all your private clients data in there.

If you don't have anybody who can monitor your test, and you're not monitoring your test, why are you doing a test?

As for having a non-employee pull the plug, set up an IAM user with permission to access the test instance

> you're not monitoring your test, why are you doing a test?

Agile. Bringing you bankruptcy at the speed of cloud.

If I'm the only developer on a project and I really need to get to market I might do just that. I sometimes do day hikes on weeknights so this is actually a likely scenario for me.

Do you go hiking alone without your phone? That seems dangerous.

And why would you start a test if you won't be there to see the results of the test? Seems more sensible to either leave after you've run the test or wait to do so until you get back.

If a test is going to take more than an hour then I'm not going to sit around after work waiting for it to finish.

Yeah pretty frequently. It's not dangerous at all. Maybe if I was climbing alone it would be.

Just to expand on this. You can have a hard limit. For AWS, create a role/user with essentially ~root-like access. Make a Lambda function that's triggered by a billing alarm at your threshold to turn off things from most expensive to least. So turn off the DB servers, the apps error out, and the users go away.
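A rough sketch of that Lambda body (the alarm wiring, CloudWatch billing alarm → SNS → Lambda, is left out, and the clients are passed in so the logic can be exercised without an AWS account):

```python
def shutdown_expensive_resources(ec2, rds):
    """Kill switch meant to run as a Lambda behind an SNS topic that a
    CloudWatch billing alarm publishes to. In real use, ec2 and rds are
    boto3 clients; they are injected here so the function is testable."""
    # Stop every running EC2 instance first (usually the biggest line item).
    running = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    ids = [i["InstanceId"]
           for r in running["Reservations"] for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)

    # Stop (not delete) RDS instances: data survives, apps error out.
    for db in rds.describe_db_instances()["DBInstances"]:
        if db["DBInstanceStatus"] == "available":
            rds.stop_db_instance(
                DBInstanceIdentifier=db["DBInstanceIdentifier"])

def handler(event, context):
    """Actual Lambda entry point."""
    import boto3
    shutdown_expensive_resources(boto3.client("ec2"), boto3.client("rds"))
```

Note the caveats elsewhere in this thread still apply: billing data can lag by hours, so the alarm fires late, and serverless per-request services keep charging until the alarm actually trips.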

As an ex-Googler working in a customer facing role in Cloud you did very well to get a $72k bill written off! It's definitely possible but requires a lot of approvals and pulling in a few favours. I went through the process to write off a ~$50k bill for one of my customers and it required action every day for 3 months of my life.

Whoever helped you inside Google will have gone to a LOT of trouble, opened a bunch of tickets and attended many, many meetings to make this happen.

I know there's no reason for Google or AWS to do this, but man do I wish there was a way to put down a spending limit and simply disable anything that goes over that limit.

It's a little bit nuts that there are no guardrails to prevent you from incurring such huge bills (especially as a solo developer that might just be trying out their services).

In my opinion, and maybe I'm an absolutist about this, the fact that there aren't these guardrails is opportunistic and predatory. Agile, iterative design and testing will inevitably lead to failures, that's the whole point. Marketing a cloud service to developers who need scalable and changing access to computing during that process should take that into consideration.

I do not think the intention here is to be opportunistic and predatory, but an inability to empathize with small developers. A large customer will very likely just pay off a few hundred thousand dollars of extra expenses. It is only individual developers who are at risk here, and cloud operators do not have much interest in them.

I don't know about that. Large customers can almost always find the capital to run their own infrastructure and save on cost. That's not to say that there aren't big customers of these types of services or that certain business models make using them more attractive than maintaining infrastructure yourself, but I would guess that revenue from these sorts of services are largely built on appealing to smaller customers, so their needs would be taken into consideration. Not taking potential cost overruns into consideration to me seems a bit deliberate.

To me it looks very similar to the personal checking overdraft schemes banks were using up until a few years ago.

When I was a young and naive student I thought I could not charge my debit card below $0. Got down to -$3 and had to pay a $40-something fee when I was already out of money.

Definitely drank a few 45 dollar lattes in my day. Sucks.

The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.

On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users. And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.

Note: Principal at AWS. Have worked to issue many credits/bill amendments, but don't work in that area nor do I speak for AWS.

> And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.

What? Why wouldn't this just be an opt-in thing? It could even be tied to the account being used. It's not like AWS accounts are expensive or hard to set up.

If a user opts in to the "kill if bill goes too high" option and it kills a critical portion of their business, then that's on them. Similar to a user choosing spot instances and having their spot instances destroyed. You've already got that "I can kill your stuff if you opt into it" capability.

> On the other hand retroactively forgiving the cost of unexpected/unintentional usage doesnt impact the customers users.

Yeah, and what happens if someone isn't big enough to justify AWS's forgiveness? What if they get a rep that blows off their request or is having a bad day? You are at the mercy of your cloud provider to forgive a debt, which is a real shitty place to be for anyone.

> And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.

And what do they do if they miss the alert? You can rack up a huge bill in very little time with the right AWS services.

The point of the kill-switch cap is to guard against risk. The fact is that while $72k isn't too big for some companies, it means bankruptcy for others. It's why you might want to give your devs a training account to play with AWS services to gain expertise, but you don't want them to blow $1 million screwing around with Amazon satellite services.

> What? Why wouldn't this just be an opt in thing?

"Oh cool, I'll set a $1k cap, never gonna spend that on this little side proj." Fast forward a year, the side proj has turned into a critical piece of the business but the person who set it up has left and no one remembered to turn off the spending cap. Busy Christmas shopping period comes along, AWS shuts down the whole account because it goes over the spending cap, 6hr outage during peak hours, $20k of sales down the pan.

Of course it is technically the customer's fault, but it's a shit experience. Accidentally spending $72k is also technically the customer's fault and also a shit experience. I don't think there is an easy solution to this problem.

"Oh cool, I'll use spot instances, never gonna need reliability for this little side proj."

"Oh cool, I'll only scale to 1 server, never gonna see high load for this little side proj."

"Oh cool, I'll deploy only to US West 1, outages are never going to matter for this little side proj."

There are a million ways to be out of money as a company. Why should this be any different? Why is the singular particular instance one where it is simply intolerable to accept that users can screw things up?

There are lots of things that are "shit experiences" that are the consumer's fault.

There is an easy solution. Give consumers the option and let them deal with the consequences. There are enough valid reasons to want hard caps on spending that it's crazy to not make it available because "Someone MIGHT accidentally set the limit too low which will cause them an outage in production that MIGHT mean they lose money"

A solution totally exists. It's also user-hostile enough that it might actually get adopted: $cloud_vendor just has to (and probably would) constantly nudge people into loosening the limit. Have a red banner that says "you've already spent 3% of your monthly budget, think about increasing it". Also routinely send out reminder emails: "Black Friday is coming up, think about increasing your quota", even when your service has nothing to do with e-commerce.

> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customers application exactly when its the most popular/active. And theres no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.

This is simply wrong.

Depending on your use case, disabling active resources is the reasonable solution, with fewer downsides.

E.g. most (smaller) companies would prefer their miscellaneous (i.e. non-core-product) website/app/service to be temporarily unavailable rather than face a massive unexpected cost they might not be able to afford, one which might literally force them to fire people because they can't pay them....

I mean, think about it: what is it worth that my app doesn't go temporarily unavailable during its free trial phase if it means I'm going bankrupt from today to tomorrow and can't benefit from it at all?

Sure, huge companies can always throw more money at it and will likely prefer uninterrupted service. But for every huge company there are hundreds of smaller companies with different priorities.

In the end it should be the user's choice, a configuration setting you can set (preferably per project).

And sure, limits should probably be resource limits (like accumulated compute time) and not billing limits, as prices might be in flux or dependent on your total resource usage, so computing them is non-trivial or even impossible.

I often have the feeling that huge companies like Amazon or Google get so detached from how things work for literally everyone else (who is not a huge company) that they don't realize solutions proper for huge companies might be not just sub-optimal but literally cripplingly bad for medium and small companies.

The upside for the noob trying out/learning is huge.

I'm no longer that person, but I think GCP/AWS are just being lazy about this - perhaps because they earn a lot of money from engineer mistakes. Of course it's possible to create an actual limit. There'll be some engineering cost, like 0.5%-1% extra per service?

Edit: Being European I think legislation might be the fix, since both Amazon and Google have demonstrated an unwillingness to fix this issue, for a very long time.

"The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active."

Lol what ... this is exactly what happens any time you hit a rate limit on any AWS service. The customers application is "catastrophically interrupted" during its most popular/active period.

The only difference is in that case, it suits AWS to do that whereas in the case of respecting a billing limit, it doesn't.

If you hit a rate limit, the marginal portion of requests exceeding that limit is dropped: if you plot the requests, the graph gets clipped. Bad, but not catastrophic.

If you hit a billing limit, everything beyond that point is dropped, and the graph of requests plunges to zero. You're effectively hard down in prod.

And for some companies/individuals, if you keep charging then THEY will plunge to a large negative debt. It's not even zero, it's a lot worse than that.

Just as it is with bank accounts. Once you run out of money you hit a hard floor.

I was building a side project and had already incurred around $100 in fees. I imagine if I had made some looping/recursion bug I could have easily incurred a cost of $10,000 or, frankly, an infinite cost. How easy would it have been for me to get that pardoned? And at the very moment I discovered I had just lost $100,000, would I know in advance that they are definitely going to forgive it? Because I'd be in full panic mode. It was very scary for me to use the cloud in this case.

I didn't even have any customers at that point.

Why not alert thresholds, configurable by the user?

Email me when we cross $X amount in one day, Text when we cross $Y, and Call when we cross $Z. Additionally, allow the user to configure a hard cut-off limit if they desire.

Just provide the mechanisms and allow users to make the call. Google et al would have a much stronger leg to stand on when enforcing delinquent account collections if they provided these mechanisms and the user chose to ignore them.
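The escalation described above is simple to express; a sketch (thresholds and channel names are made up, and the optional hard cut-off is just the last tier):

```python
# Hypothetical user-configured policy: each threshold maps to a channel.
POLICY = [
    (100.0, "email"),    # $X: email me
    (500.0, "text"),     # $Y: text me
    (1000.0, "call"),    # $Z: call me
    (2000.0, "cutoff"),  # optional user-chosen hard cut-off
]

def actions_for_spend(daily_spend):
    """Return every channel whose threshold today's spend has crossed."""
    return [channel for limit, channel in POLICY if daily_spend >= limit]

print(actions_for_spend(750))   # ['email', 'text']
print(actions_for_spend(2500))  # ['email', 'text', 'call', 'cutoff']
```

The hard part isn't this logic, it's feeding it spend figures that aren't a day stale, which is exactly where the providers currently fall down.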

Additionally, Google et al should protect _themselves_ by tracking usage patterns, and reaching out to customers that grossly surpass the average billable amount - just like OP with their near $100k bill in 1 day. Zero vetting to even have a reasonable guarantee the individual or company is even capable of paying such a large bill.

And then what? Sue a company that doesn't have $100k for $100k? This makes zero sense.

Google has alert thresholds (you set it up under your Budget). But practically speaking, an alert is not enough - what if you are unavailable to get the alert, it comes in the middle of the night, etc?

A better solution would have been 'limits' which they used to have (at least for Google App Engine) but which has been deprecated.

We had to spend some time researching whether there was a workaround because, just like the author of the article, we were quite worried about suddenly consuming a huge amount of resources, getting a spike in our bill, and our accounts being cut off/suspended because we hadn't paid it. We've documented our solution here


Doesn't look like there's any cutoff mechanism there, and it's a separate, optional step instead of part of the setup flow with a mandatory opt-out warning.

Nor does that address the other complaint - Google (and possibly others) seem to be willing to extend an unlimited credit line to all customers without any prior vetting for ability to pay. That's crazy.

> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.

Well, this is true, but this is also true of a lot of limits, like limits.conf. Sometimes you really want to spawn loads of processes or open many files, but a lot of the time you don't, so a barrier to limit the damage makes sense.

There is no one solution that will fit everyone: people should be able to choose: "scale to the max", "spend at most $100", etc. If my average bill is $100, then a limit of $500 would probably make sense, just as a proverbial seat belt. This should never be reached and prevents things going out of control (which is also the reason for limits.conf).

> It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.

This could be ameliorated by using namespacing techniques to separate prod from dev resources. For example, GCP uses projects to namespace your resources. And you can delete everything in a project in one operation that is impossible to fail by just shutting down the project (no "you can't delete x, because y references it" messages).

Aggressive billing alerts and events, that delete services when thresholds are met, could be used only in the development namespace. That way, fun little projects can be shut down and prod traffic can be free to use a bit more billing when it needs to.
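GCP in fact documents a capping pattern along these lines: route budget notifications to a Pub/Sub topic and have a Cloud Function detach billing from the dev project, which shuts down its billable services. A sketch loosely following that documented example (the project name is a placeholder, and the API call details should be checked against the current docs):

```python
import base64
import json

def stop_billing(data, context):
    """Cloud Function subscribed to a budget's Pub/Sub topic. If the
    notification shows cost over budget, detach billing from the dev
    project. Sketch only; verify field names and the Cloud Billing API
    call against Google's current capping example before relying on it."""
    msg = json.loads(base64.b64decode(data["data"]).decode())
    if msg["costAmount"] <= msg["budgetAmount"]:
        return  # still under budget, nothing to do

    from googleapiclient import discovery
    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    billing.projects().updateBillingInfo(
        name="projects/my-dev-project",   # placeholder dev project id
        body={"billingAccountName": ""},  # empty string detaches billing
    ).execute()
```

Note Google's own caveat applies: budget notifications can lag actual usage by hours or days, so this caps the damage rather than guaranteeing the budget is never exceeded.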

>It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active.

Making the worst case scenario no worse than traditional infrastructure.

Correct. That argument assumes that every penny spent autoscaling has a positive ROI.

I think this is an insightful way to think about it. Thanks.

> "And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource."

Well there's a very easy way, adding a checkbox and an input:

[ ] I am just trying things out, don't charge me more than [ ] USD

There are ways it could be done relatively benignly, such as defaulting to paranoid and explicitly opting out.

And for those that are heading into that financial barrier it should be a straightforward problem to look at trending to anticipate the shutdown and send out an alert.

Or just ask for a default when opening a new account ;=)

This. App Engine used to offer hard spending limits, and they were removed precisely because so many users set them up to shoot themselves in the foot at the worst possible moment.

^^ this. Hard spending limits seem great until your app/service gets super popular and you have to explain to the CEO why you were down during the exact window you needed to be serving the demand.

I feel less troubled by this in AWS because they actually have functional customer service.

> but man do I wish there was a way to put down a spending limit and simply disable anything that goes over that limit.

Literally did this my first week when trying out GCP for my company. It is entirely possible and documented (with code):


> Note: There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped. Following the steps in this capping example is not a guarantee that you will not spend more than your budget. Recommendation: If you have a hard funds limit, set your maximum budget below your available funds to account for billing delays.

(Source link in parent post, emphasis mine).

In this case the additional cost due to that delay was $72k. Which, let's be honest, means this feature is kinda useless for anything but the mostly harmless cases.

Only by combining this with resource limits in load balancers, instance and concurrency limits, and similar can the maximum worst-case cost be bounded. But tbh this partially cripples auto-scaling functionality, and it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case.

> it's really hard to find a good setting which doesn't allow to much "over" cost and at the same time doesn't hinder the intended auto-scaling use-case

> I created a new GCP project ANC-AI Dev, set up $7 Cloud Billing budget, kept Firebase Project on the Free (Spark) plan.

There's a lot of middle ground between $7 and $72k. Your quote explains it perfectly though. They flat out can't because the accounting and reporting is badly designed and incapable of providing (near) real-time data.

IMHO the easiest solution to this is government regulation. If you set a budget for a pay-what-you-use online service there should be legislation forbidding companies from charging you more than that.

I also find it (sort of) hilarious they can magically lock the whole thing down once payment fails, but not before the CC is (presumably) maxed out. Lol. Talk about a good deal for Google.

There's something uncanny about understanding the situation enough to turn on the budget alerts, while at the same time not realizing it's not going to help in time if your system runs amok.

I'm not sure if you meant it this way, but your tone makes it seem like the parent just needs to "read the docs".

Unfortunately for all of us, your solution doesn't work, per the huge disclaimer on the page that says those alerts can be days late. You can rack up an almost unlimited $ bill in hours.

The article says that they had a limit in place but that in practice the billing limit lags up to 24 hours behind the "real" number.

That's not the best thing you can do. The best thing you can do is put extensive time into quotas. AWS has way better quotas for starters than GCP has, sadly.

There are guard rails in quotas. Like you can only spin up X servers without opening a ticket to ask for more.

Now, I think some of these quotas can still lead to some pretty crazy bills... but that is the point of at least some of them.

They are broken, unreliable guard rails that are hard to set up correctly.

I mean like the article mentioned they could have set the instances and concurrency settings to lower values. Which in this case would have worked.

But finding the right settings to balance intentional auto-scaling against limiting how fast unexpected costs can rise is hard and easy to get wrong.

Let's be honest, in the end it's a very flawed workaround that might help (if you know about it, and did it right).

Tbh, its lack is why I don't use Google or AWS for projects.

If you're on App Engine, we did an article about that


Yeah, had that when I just started using it and it happily kept scaling like crazy. $200 bill in one day.

I never used google again.

There are already such features, but a lot of indie developers are too lazy to configure their infra properly. A default low limit does not make sense, as it will piss off large customers.

I run so many websites on Google Cloud Run that sometimes I feel I might be abusing them, but I have ensured each of my sites has a max limit of 2 hosts.

This is already present and very easy to set up.

OP here.

Thanks for sharing!

I have no idea what they did internally, but something like this was my guess. I only communicated through customer support channel and replied to emails, and shared my doc (which cited all the loopholes) with them.

It took them 10-15 days to get back and make a one-time goodwill contribution. The contribution didn't cover the logging cost, so we did pay a few hundred dollars.

Sounds like you found an amazing support rep and made a great case for it - good job!

I went through this very scary experience recently as well (although in our case it was $17K, not $72K). One of our devs accidentally caused an infinite loop with one of the Google Maps paid APIs on our dev account and within hours both our prod and dev accounts were suspended (pro tip: don't link your prod account to the billing account of your dev account). The worst part was that after removing the suspension, our cloud functions were broken and had to be investigated and fixed by Google engineers themselves resulting in our prod app being down for 24 hours... be very careful.

Luckily we were able to get $11K refunded on our card and received $6K credits after spending all night with Google support.

By contrast I hear stories of AWS doing this quite often for one-off mistakes (crediting thousands of dollars). It doesn't make much sense to me not to consider well-intentioned requests for this sort of thing.

Especially if you consider the dollar value of all those approvals and the business you might lose to some other platform and/or hesitance people will have to use those platforms for such things in the future.

If I were in this situation I would probably offer 10% of the bill to the employee as a reward for their help.

That's too low. I usually tip my customer service reps 25% of whatever they save me.

So zero.

Sounds like a bribe.

Right, better hope they help me out of the goodness of their heart instead.

As J. Paul Getty once mused[1]:

> If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem.

Crappy situation for OP and his startup, but I find the part about reading up on bankruptcy to be a bit premature.

Perhaps not the most ethical choice, but what stops OP from just not paying the bill, and finding a different cloud provider? Obviously they'll want to not repeat the "experiment", but seriously... there's no mechanism at Google to stop a new client from running up a near-$100k bill in a single day?

That's absurd, and should be a learning lesson for Google more than this startup. Some malicious actor could apparently consume hundreds of thousands of dollars of Google resources and "get away" with it.

Wait and see what happens, then deal with it - would be sane advice.

[1] https://www.brainyquote.com/quotes/j_paul_getty_129274

OP here.

Bankruptcy fear was real at the time. Google has at least a few thousand lawyers on payroll. They probably also have a process for handling delinquencies and sending out notices. A quick look at the lawyer fees just to manage the case, let alone fight it, is enough for a bootstrapped company to throw up its hands.

+1 to the bad-actor possibility. I shared this with the Google team; I'm not sure what they've done since.

We are out of that situation, and I wrote the post so others relatively new to the cloud don't make the same mistakes.

Fail fast is a very bad idea with the cloud.

All true, and good points you raise.

However, Google's army of lawyers costs them real money, whereas your bill is largely made-up numbers.

Perhaps the true cost is still enough to warrant siccing their lawyers on your company.

Even in that situation, a wait-and-see approach is still pretty advisable. The worst case scenario was already known to you - bankruptcy.

Nothing Google or their lawyers do would change that worst-case outcome, and if Google was aware you literally don't have $72k, and might just declare bankruptcy and walk away, they'll be much more eager to negotiate a more reasonable bill and settle your account. It's exactly as J. Paul Getty said...

Very glad it's being worked out and you will not have to go down that path.

> Even in that situation, a wait-and-see approach is still pretty advisable. The worst case scenario was already known to you - bankruptcy.

You could even go scorched earth, represent yourself, and drag it out as long as possible. "Your honor, I'm a free man on the land and all I was doing was travelling the information super highway. I'm not bound by your laws!" Haha.

This is why you create a shell company to use cloud services with while your real company leases the servers from that company. As soon as you run up a bill you can’t pay you shut down the whole shell company and reopen a new one.

One of my favorite quotes of all time. J. Paul Getty was quite the weirdo. His Wikipedia article is worth a look, especially the section on his frugality.

Lol. I love it. I moved to a state I'd never considered because it had the largest, cheapest building in the US.

It's 220,000 square feet, but I've lived in a tent out back for the last 6 months because I can't get an occupancy permit, it's not zoned residential, and I refuse to pay rent on an apartment.

You live in a 220,000 square foot building?! Is this an abandoned missile silo or something? I want to know more.

It's the old headquarters of Varco Pruden. They manufactured steel buildings, and there are long, wide manufacturing bays with overhead cranes. You can see much more on my YouTube channel. I've got a few videos of different areas.

This is the first video I made there showing a bit of the size: https://youtu.be/qdkLzioUiAE

> It's 220,000 square feet

Is it an old airplane hangar?

As an interesting coincidence, a large part of the Google Cloud organization resides in a building that was formerly the headquarters of Getty Images, a company founded by Mark Getty, a grandson of J. Paul Getty.

That sort of crap is the reason we host all our stuff on root servers.

Even trying to read the Amazon pricing for their instances, hours, and whatnot drives me insane.

Seems this is done on purpose; no wonder they make so much money with it.

So I have never seen a reason to move any stuff to the cloud.

Just grab a dedicated server for a few bucks and put a bunch of docker containers on it.

It's way cheaper and usually not more complicated. Just use a CI with GitLab runners or whatever and be done with it.

Most apps don't need scaling anyway, and if you do, just put that app on bare metal fitting your requirements.

> That sort of crap is the reason we host all our stuff on root servers.

Having just started my own journey into building products for myself, pretty much the first thing I realised with my tech was I need to get dedicated servers instead of cloud just because it costs 100x less.

> Just grab a dedicated server for a few bucks and put a bunch of docker containers on those.

Exactly. If you really want Kubernetes coolness to act cloud-like, install Kubernetes; it's free and super easy to set up.

And with the cost savings you can literally buy multiple spare servers; Kubernetes can use them all while keeping utilization low, letting you scale up onto new nodes if needed.

> kubernetes ... is super easy to setup

Can you point me to the super easy setup guide? Because I've tried a few and never gotten it working.

Have a look at https://github.com/hobby-kube/guide

I set up a 3 node cluster using it in an afternoon and haven't had any problems since.

Canonical has microk8s, which you can install as a snap package. Super simple and works great.

I don't like using Kubernetes raw, but I am a fan of CapRover (which has Kubernetes support).

AWS pricing is not obscure, it's just not for you. So in that sense, you are correct to not see a reason to move to the cloud, but your advice does not apply to everyone.

And I don't believe they make "more money" that way at all. AWS margins are either very low or very high, and the higher margins and prices tend to be the "simpler" ones: packaged, managed products such as Redshift that are billed on fewer tiers and flatter prices.

When you design your application with AWS, pricing has to enter your design considerations. For example if you are designing something that will interact a lot with S3 you want to minimize PUTs. You want to minimize ram usage on lambda by streaming rather than buffering. Etc.

AWS is not a suitable product for playground stuff. The only reason it gets used as such is because it's easier if you're already using AWS for other things (or you're already very familiar with it).

AWS's margin is currently 30+% which is massive.

>AWS pricing is not obscure

There is a massive secondary consulting market because of AWS's price obscurities.

> There is a massive secondary consulting market because of AWS's price obscurities.

While that's true, there is a consulting market for most things that are complicated. That doesn't mean they are shady. It's simply not for you. You are welcome either to dive in or get a consultant. I promise you, though, that AWS pricing isn't difficult once you understand a few concepts and know your way around the Cost Explorer. With proper tagging, it's easy to drill down into which resource is consuming how much. I don't believe there is a way to have simple billing for complicated products.

> While that's true, there is consulting market for most things that are complicated. Doesn't mean they are shady.

It does mean it's not simple though.

Obscure and complex are different concepts. I'm part of that "secondary consulting market" FWIW, so I'd like to think I know a thing or two about it.

Does AWS have high-margin prices? In aggregate, somewhat, but this is mostly driven by the big ticket managed enterprise items: Aurora, Redshift, Quicksight, probably Fargate, etc. A lot of their more popular stuff (S3, Lambda, …) offer incredible value for very little money. EC2 is the exception I believe, because I understand it to be high margin for how popular it is. But EC2 pricing is one of their simplest ones.

Could AWS simplify some of their pricing? Yes, probably. There's always room for optimization. Personally for example I'd like to see their pricing be global rather than different by region (with understandable exceptions for govcloud and china).

Is AWS making its pricing complicated for nefarious purposes? No, there is no evidence to support that.

AWS pricing absolutely is not simple. It's a part of the AWS stack. You need to study AWS's events/signals system to be able to write apps that make the best use of AWS's interconnected stack. You need to study their APIs / SDKs to really understand what you're able to implement. And you need to study their billing systems to understand how to implement apps that run cheaply, and be able to predict potential runaway costs.

It has to be a part of the design. That's why you may want to hire consultants for it: People who understand it better than you do, and will be able to assist you in reducing your costs.

It's just another kind of optimization. Maybe some software engineers don't like it because it hits them where it hurts (the wallet) when they don't do it right, rather than be able to brush it off as they usually do.

It's much easier to ignore the waste produced by, say for example, the 3000 javascript dependencies shipped with the fat, unoptimized electron app they ship on their users' desktops, that do a ton of unnecessary expensive computing; when all that crap is client-side and it's the downstream user's electricity bills and CPU time that's being used.

> There is a massive secondary consulting market because of AWS’s price obscurities.

There is a massive secondary consulting market because the enterprise market is addicted to secondary consulting. This secondary consulting market includes AWS pricing because it includes pretty much any IT service the target market might be interested in.

A rational need for decomplexification isn’t necessary to explain the existence or coverage of enterprise secondary consulting, IT or otherwise.

The margin is absolutely not the same across all products.

> There is a massive secondary consulting market because of AWS's price obscurities.

Its. Not. For. You.

AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.

That consulting market is an optimization market. It's economics at its best.

If you are too small to have to take these things into account regardless, AWS is not for you. You're welcome to use it, but don't be surprised if you end up having to deal with these kinds of things which simply don't exist in the world of flat-price underprovisioned droplets.

>AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.

This is marketing.

It's like saying you want to build a house and the quote you got ends up blowing up 100x overnight.

A great example is the $100k credit for startups. You can repeat "it's not for you" all you want, but their business is predicated on pricing ignorance and vendor lock-in.

The $100K credit (which I've been granted multiple times) is there because if Amazon can get you to invest serious work into their infra, they'll make up for it in the long run. It's not "lock in", it's sales. The only amazon "lock in" really is their bandwidth-out pricing, which is a sleazy tactic for sure but I'm not hesitant to call it out when it's the case.

You can get the $100/$300/$1000 tier if you are in "just checking it out" solo mode. $5k and up requires either connections, partnerships, or a serious application.

Anyway I don't know what your point is, I'm not even sure if you have one. They're not "marketing" their pricing, nor the fact that you are "forced to design systems that use less resources".

> Anyway I don't know what your point is, I'm not even sure if you have one. They're not "marketing" their pricing, nor the fact that you are "forced to design systems that use less resources".

I think they are referring to this statement:

> > AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.

It is a defense that I've heard in many AWS talks in the past.

Where it turns into a 'marketing' blurb to me is my real-world experience with these AWS talks in the places I work. As a real-world example, we had a product that required -some- architectural work, but was otherwise solid, and could run on 3 live EC2 instances (2 web LB, 1 live backend) and 1 spare (spare backend).

The Consultant that AWS partnered us with? Suggested a very overdone architectural revamp, moving everything possible into AWS Specific technologies.

It's marketing in that in many of our experiences, we know there is often at least one person on a team who does -not- have the discipline and/or experience to -keep- a system using less resources as the field goes from green to brown.

Overengineering is easy and happens not just with AWS but with just about anything in software engineering.

I'm having trouble seeing how this changes what I'm saying: That with the way AWS pricing is structured, you are supposed to take it into account when designing your product.

When you reach a certain size / complexity and you have to design infrastructure, you should be making schematics, predictions on the usage peaks and troughs, how various parts of the infra will be affected, how active/idle they will be.

When you are dealing with AWS, pricing becomes extremely predictable because it can be derived from those plans. And it is far better to be dealing with that kind of model than to deal with "unlimited with a million asterisks" or something. AWS is predictable, reliable, and most notoriously has never ever increased their prices, so whatever you calculated will not go up because of Amazon's decisions.

> Suggested a very overdone architectural revamp, moving everything possible into AWS Specific technologies.

To be honest, depending on the technology, the savings could be worth it... for example, did you know you get a discount if your traffic is served over CloudFront? Even if your distribution is set not to cache any resource, you can front your APIs with CloudFront and save on networking costs.

How do you take pricing into your design considerations? Does it come with experience from using an AWS service in production and understanding how it's priced, combined with the usage numbers the new system might get? I'm trying to learn more about how engineers currently do this.

Basically, yes.

It's not that complicated, it's just not something engineers are usually used to do. If you use an AWS service, you look at its pricing.

Take s3 for example: whenever you use it, you'll pay for outgoing bandwidth, PUTs, GETs, and storage.

So you seek to minimize all of these:

1. Bandwidth: use cache layers. This also minimizes GETs.

2. PUTs: design your app in a way that doesn't do unnecessary inserts into s3. Consider alternatives such as redis, postgres or filesystem depending on the need.

3. Storage: compress your objects if they compress well. If they aren't often accessed, use storage classes and auto lifecycle management.

Pricing in AWS generally reflects some kind of engineering limitations you will face at scale in the first place, so it makes sense to go through this whole exercise either way.
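Point 3 above (compress your objects) can be sketched with nothing but the stdlib; the actual S3 upload call is omitted here, and the savings obviously depend on how compressible your data is:

```python
import gzip

def compress_for_s3(payload: bytes) -> bytes:
    # Compressible data (logs, JSON, CSV) often shrinks dramatically,
    # cutting both the storage and the outgoing-bandwidth line items.
    return gzip.compress(payload)

blob = b'{"event":"click","user":42}' * 1000  # highly repetitive sample
packed = compress_for_s3(blob)
print(len(packed) < len(blob) // 10)  # repetitive JSON compresses >10x here
```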

Calling programmers "Engineers" is a misnomer.

I wish programming had the prestige it deserves for combining science, tradition, authority, and art.

Engineers are not allowed to use tradition, authority, or art. They are restricted to being modern-day calculators.

Nothing is wrong with either.

The shift from 'Developer/Programmer' to engineer has indeed been part of a push away from creativity towards cookie-cutter work.

An interesting analogue would be the automotive industry; as time progressed, companies focused more and more on 'engineering' versus art/tradition/etc. But as the industry evolved, "flashy" vehicles that took risks became either halo products for a brand or were relegated to luxury/boutique.

And, of course, there was a dark side to this shift. A good example from the 70s: the level of 'engineering' driving the design of the vehicle and its assembly didn't take the actual line workers into consideration; in Ohio the workers wound up getting overworked and burned out, and in some cases actively sabotaged the product, because they were being treated like automated machines.

I think that missed the point.

Engineers are applied scientists.

Programmers are not applied scientists.

Why does any of this matter?

Different expertise.

I'm guessing you are neither?

Incorrect guess, and it still doesn't really change anything. You're just playing with words. It's no more useful than a full thread arguing about a misspelling; Just pure noise.

Software engineering could learn a lot from, say, civil engineering. It could also learn a lot from interface design and I'm sure even microbiologists and astronauts could teach us a lot. Engineering is not special.

> Most apps don't need scaling anyway

This is exactly right. I host stuff in buckets/CloudFront and use a bit of Lambda/Route 53. I end up paying $4 a month.

Now, that would be very different if 10 million people suddenly decided to visit my site, but if that happens, money probably won't be a problem after all.

> Even trying to read the amazon pricing for their instances, hours and what not, drives me insane.

I get your sentiment but the pricing is that way because they want to charge you exactly what you use, not for reserving stuff.

For example, if you deploy an EC2 instance that comes out to $15/mo total, and you deploy it on, say, the 10th of the month, do you want to be charged the whole $15? No, you want it pro-rated. But what if you only need that instance for 6 days? Then what? Are you going to do the math yourself to figure out what it would have cost, or just read the per-hour billing?
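The pro-rating in that example is just per-hour arithmetic; a quick sketch using the comment's hypothetical $15/month instance and an assumed 30-day month:

```python
# Hypothetical $15/month instance billed per hour (30-day month assumed).
HOURLY_RATE = 15 / (30 * 24)  # ~$0.0208/hour

def cost_for(days_used: float) -> float:
    # Pay only for the hours the instance actually existed.
    return round(days_used * 24 * HOURLY_RATE, 2)

print(cost_for(6))   # 6 days -> 3.0, i.e. $3 instead of the full $15
print(cost_for(30))  # full month -> 15.0
```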

> Most apps don't need scaling anyway and if you do

Man, exactly right. Many of the guys here would love crypto once you stop asking why and start asking how.

The most lucrative projects these days are completely frontend UI; they don't even need their own backends, as they just read state from the nearest node when the client connects their wallet.

Some people forgot that the scalability game was to convert traffic into money with O(log n) overhead costs. So ditch that, and remember you are in the money game.

> Most apps don't need scaling anyway and if you do, just put that app on bare metal fitting your requirements.

Most important!

Deeply dissatisfying to read. Ex-Googler uses connections to get his (understandable!) cloud mistake refunded.

Every time I read one of these stories, I get more and more convinced I will just simply never use scalable cloud tech for my side projects. I'm not going to risk my family's retirement savings on the all-too-possible chance that a small deep-implication error will cause runaway charges.

OP here.

Your assumption is incorrect. I haven't been in touch with anyone in Google, and used 0 internal connections. Happy to make another post with my conversation + documentation to support this.

I reached out to GCP through their regular channels. This is not a paid post, and we are not sponsored by Google in any way.

You might want to take another look at your paragraph:

> Having been a Googler for ~6.5 years and written dozens of project documents, incident reports, and what not, I knew how to put the case for Google team when they would come back to work in 2 days.

That certainly reads as an advantage that most non-Googlers would not have.

Thanks for sharing, I see now what you mean.

I'll share the doc I prepared and sent to Google in my consult, in one of the next posts.

FWIW, I didn't intend "deeply dissatisfying" as criticism against your writing, although I phrased that poorly so I can understand it coming across that way. If anything I feel for you when that unfair surprise hit you. It just sucks that it's possible, and that the odds feel against us when we're seeking a refund.

it's an advantage, sure, but still he was right in saying he did not use internal connections as alleged. calm down everyone.

Yeah, that would be a very good idea. The way it is stated gives that impression, and helping people understand how to resolve an issue like this would be priceless IMO.

Yes, please make a post sharing your knowledge on how to convince them. I suspect most people would have no idea how to pull this off.

You only wrote half the story and no technical details... why even write a post at all? It's just clickbait, and you give no other information about what went wrong, how it got fixed, or what the exact technical problem was...

Did you read Part 2, linked from the bottom? It has plenty of technical details.

Up until this comment, I did not realize that was a link either.

Perhaps OP can edit the bottom to make it clearer?

I didn't realize that was a link; seriously why does every website have to make links look different.

Lol to me it looked like part 2 wasn't written yet, but I clicked it anyway just to check and the page loaded, so I read it. No real downside in getting a 404.

I'm absolutely astounded that cloud providers allow stuff to get this way and don't even go "ope, this looks out of the ordinary, we should look into it" Nor do they offer the ability to straight up kill all services if exceeding a certain price set by the customer.

Cloud is still good though, I believe it's the future. I just don't believe in deceiving your customers to hopefully rack up a high bill with them.

I view not having the ability to say "Shut everything down if I go over $100/mo" the same as the pre-checked hidden cross-sells that MasterCard/Visa cracked down on in the adult industry a few years ago. Just money grabs.

I will definitely be putting such measures in my cloud platform.

I'm with you. I have yet to use any of these cloud computing providers for building or testing anything, and it is partly due to this (and partly due to privacy and confidentiality considerations).

"Yay! NBD! Google is the best! All I had to do was work there for a few years, rub elbows, make connections and ask for favors!" Really, google would be the best if there was no way to accidentally go over your stated budget cap by 86 million percent, or at the very least have a policy to refund people who can demonstrate that this is what happened.

I think I'll treat this as the latest in a long line of warnings about not going all-in on these cloud services until you seriously know what you're doing.

So much of it is so unnecessary to begin with. You can do so much with a cheap VPS or two without thinking about lambdas or cloud functions or Kubernetes or who knows what. But these days you'd be forgiven for thinking it's dark magic.

You're not going to run up a 5 digit bill in a day by starting up on a few $10 VPSs. And you'll probably have an architecture that fits in your head to boot.

Also: The article title should really be "Saved 72k and avoided bankruptcy by being an ex-Googler."

Just don't go all-in on them at all unless you're spending someone else's money.

> Had we chosen max-instances to be “2”, our costs would’ve been 500 times less. $72,000 bill would’ve been: $144. Had we chosen concurrency of “1” request, we probably wouldn’t have even noticed the bill.

> If you count the number of pages in GCP documentation, it’s probably more than pages in few novels. Understanding Pricing, Usage, is not only time consuming, but requires a deep understanding of how Cloud services work. No wonder there are full time jobs for just this purpose!
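The arithmetic in that first quote is worth making concrete; a tiny sketch assuming (as the article does) that the runaway bill scales roughly linearly with the max-instances cap:

```python
# Figures from the article: a $72,000 bill with max instances set to 1000.
# The linear-scaling assumption is the article's, not an official rate card.
def capped_bill(observed_bill: float, observed_cap: int, new_cap: int) -> float:
    return observed_bill * new_cap / observed_cap

print(capped_bill(72_000, 1000, 2))  # -> 144.0, the "$144" from the quote
```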

Great write-up - thanks for sharing @bharatsb! As you say, cloud pricing has become too complex for developers to understand quickly (they want to ship features, not calculate costs). Infra-as-code is great, but it has made it even harder to understand which code/config option costs what. `terraform apply` is like a checkout screen without prices.

We're trying to solve this problem with infracost.io, initially looking at Terraform. It would be interesting to get your feedback on whether such an approach might have helped you? Probably not as it doesn't look like you were using Terraform?

(Cloud Run PM here) I am sorry for the experience described in the blog post, we could definitely be better at bill management. I am glad that it worked out in the end and the customer was not required to pay for the bill.

Based on this experience, we decided to lower the default value of "max instances" to 100 for future deployments. We believe 100 is a better trade off between allowing customers to scale out and preventing big billing surprises. Of course, customers can always decrease it or increase it up to 1,000, or even above with a simple quota increase request.

Well, the real question for all cloud providers, for which I expect crickets as an answer, is:

Why don't cloud providers allow setting a budget which cannot be exceeded? A simple, 1-click way to say: this account should never go over $500 a month. Just stop creating resources or responding to requests if it does.

This is an outage waiting to happen for every customer:

- Early dev sets a limit.

- Product launches.

- Slowly grows.

- One day suddenly the entire business grinds to a halt. Globally. Across the carefully isolated shards. Everyone scrambles to figure out why! Tens of thousands of dollars are lost because of going $10 over a budget. End-users are lost. Trust is burned. If it's providing a critical system, maybe even people are hurt.

- Google then has to explain why they built in instant, global failure mode.

They can put it behind a clear warning, do stuff like AWS does for bucket deletion (the bucket has to be empty, you have to check a box and manually type the full name of the bucket).

There are ways to design this, they can send notifications at 60% of the threshold, 80%, 90%, 95%. They can give you a grace period, put up prominent warnings in the console and for command line tools, etc. There are ways to do it, it's far from an intractable problem.
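The tiered alerting described above is simple to express; a sketch where the thresholds come from the comment and the billing-sample interface is hypothetical:

```python
# Alert thresholds from the comment: 60%, 80%, 90%, 95% of budget.
THRESHOLDS = (0.60, 0.80, 0.90, 0.95)

def crossed_thresholds(prev_spend: float, curr_spend: float, budget: float):
    """Return the alert levels crossed between two billing samples."""
    return [t for t in THRESHOLDS if prev_spend < budget * t <= curr_spend]

# A spend jump from $400 to $960 against a $1000 budget trips every tier.
print(crossed_thresholds(400, 960, 1000))  # -> [0.6, 0.8, 0.9, 0.95]
```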

I'm not saying that it can't happen but do you want to bet that a certain percentage of their business, for all cloud providers, is from carelessness and resources still running when they shouldn't or using more than they expected? Especially for bigger companies where it's easy to miss something. Just like gym subscriptions or other kinds of subscriptions where they're banking on you not noticing for a long time ;-)

So?

This would just be one more checklist item that you need to review regularly, like domain name registrations, certificates, ...

Any of those expiring will cause an outage too.

Outages happen. Will happen. This is just one more way they can happen. They happen, you learn, and you move on.

Hell, even the biggest providers with the best admin teams on the planet have outages.

And Google doesn't need to explain anything, just like your domain registrar doesn't have to explain anything if your cc was declined/not current...

My guess is that the billing logic is separate from the application logic. There's probably a delay between the two and mostly one-way communication.

1) But they supported this before on GAE. GAE had 'spending limits'.

2) Also if they are able to figure out when you've hit your daily free quota and cut you off almost immediately, how are they not able to figure this out?

If I recall correctly, GAE is an example of something they made specifically to be a cloud product. Products like Compute Engine, GCS, Bigtable, and Pub/Sub are things developed internally and then sold publicly once they realized others might find them useful. Perhaps the products developed first for internal use weren't developed with features like measuring billing usage in real time in mind.

AWS recently released AWS Budget Actions which should allow you to do that.


It still looks rather complex but if it can actually enforce a budget, that would be great.

> Based on this experience, we decided to lower the default value of "max instances" to 100 for future deployments. We believe 100 is a better trade off between allowing customers to scale out and preventing big billing surprises.

This is good to hear. I use Cloud Run a lot for personal projects and I always set concurrency to 80, max instances to 1, memory to 128Mi (unless it's something beefy that needs the memory), and CPU to 1. If I need to scale it up, or I decide to open it up to actual usage, I'll do it when I recognize the need.
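For reference, those caps can be pinned at deploy time. A hedged sketch (the service and image names are placeholders; the flags match the gcloud CLI, but verify against your SDK version):

```shell
# Deploy a Cloud Run service with conservative scaling/billing caps.
# "my-service" and the image URL are placeholders.
gcloud run deploy my-service \
  --image gcr.io/my-project/my-image \
  --concurrency 80 \
  --max-instances 1 \
  --memory 128Mi \
  --cpu 1
```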

Why don't you just allow customers to set a limit if they would like to?

I don't understand why developers use cloud for bootstrapping/side projects. DigitalOcean is all you need: a $5 droplet + $15 Postgres, or even better, a $7 dyno on Heroku.

I built my startup on a combo of DO and Firebase.

If I knew something was going to take more than a couple dozen milliseconds to run, it was built on the DO droplet.

Why would I pay by the CPU second for something that is taking a lot of CPU seconds? That billing model doesn't make sense.

For my super quick REST endpoints, yeah, all on Firebase, the convenience of writing + deploying makes it an obvious win. (Unless something goes wrong, debugging Firebase functions is not fun...)

$5 droplet, 2GB swap (set it internally) and run Caprover.

Deploy your Postgres (for DB), minio (for s3 storage) and your webapp from Caprover. Add nodes as you need to scale out.


> To overcome the timeout limitation, I suggested using POST requests (with URL as data) to send jobs to an instance, and use multiple instances in parallel instead of using one instance serially. Because each instance in Cloud Run would only be scraping one page, it would never time out, process all pages in parallel (scale), and also be highly optimized because Cloud Run usage is accurate to milliseconds.

> If you look closely, the flow is missing few important pieces.

> Exponential Recursion without Break: The instances wouldn’t know when to break, as there was no break statement.

> The POST requests could be of the same URLs. If there’s a back link to the previous page, the Cloud Run service will be stuck in infinite recursion, but what’s worst is, that this recursion is multiplying exponentially (our max instances were set to 1000!)

Did you not consider how to stop this blowing up before implementing it? Having one cloud function trigger another like this, with no way to control how many functions are running at the same time and no simple, quickly met termination condition (with uncapped billing), is playing with fire. It's not going to be optimal either, if most of the time each function is waiting for the URL data to download.

You need to be using something like a work queue, or just keep life simple and keep it on a single server if you can.
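To make the "work queue" suggestion concrete, here is a minimal sketch (not the author's code) of a crawl loop with a queue, a visited set, and a hard page cap; any one of those three would have prevented the exponential blow-up described in the article. The `get_links` callback is a stand-in for actually fetching and parsing a page.

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl with a visited set and a hard page cap.

    get_links(url) -> list of URLs found on that page (stubbed here;
    a real version would fetch the page and extract links).
    """
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue            # dedupe: back links can't cause loops
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited

# A tiny link graph with a cycle (page B links back to A):
graph = {"A": ["B"], "B": ["A", "C"], "C": []}
print(crawl("A", graph.get))  # terminates despite the back link
```

The same structure works with a hosted queue (e.g. Cloud Tasks or SQS) instead of an in-process deque; the important parts are the dedupe check and the cap.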

We've all had a program crash from a stack overflow. The problem seems to be that instead of the "serverless panacea" they were promised, the code they built can now only run on one of many Google servers, none of which are theirs. No way to kick the tires at all.

It honestly reminds me of debugging a Jenkins pipeline. Something that was designed to be a super-generic runtime, yet the tooling inexplicably only lives on computers that are not your local development machine, and all of it is maximally painful to stub or test or debug, seducing you into "just running it live".

It's like the opposite of the "small agile team" thing they were talking about. If your program requires 7 API keys and some cloud environment to do a test run, I want no part of it.

> We've all had a program crash from a stack overflow.

Launching a cloud function that recursively triggers itself, with no simple safeguard against looping or blowing up, and where billing scales with the number of invocations, ticks the "very high risk" and "very high impact" boxes for me. A program running on a single server isn't comparable here (though you could accidentally create a DoS attack).

Typical cloud function use is that some event gets triggered, like a user sign-up, the function executes, then it halts. The above isn't a standard use case and is so incredibly risky that this approach shouldn't be attempted, in my opinion.
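If you do fan out recursively, the usual safeguard is to carry a depth (or TTL) field in the payload and refuse past a cap. A hypothetical sketch of such a handler, with `MAX_DEPTH` and the payload shape made up for illustration:

```python
MAX_DEPTH = 3  # hard cap on recursive fan-out (illustrative value)

def handle_request(payload):
    """Sketch of a cloud-function handler that refuses to recurse forever.

    `payload` carries a `depth` field incremented on every hop; the
    returned child jobs stand in for whatever triggers the next
    invocations (HTTP POST, pub/sub, a task queue, ...).
    """
    depth = payload.get("depth", 0)
    if depth >= MAX_DEPTH:
        return "refused: depth limit reached"
    child_jobs = []
    for url in payload.get("urls", []):
        child_jobs.append({"url": url, "depth": depth + 1})
    return child_jobs
```

A depth cap bounds the multiplication factor even when every page links back to every other; combined with a max-instances setting it turns "exponential until the bill arrives" into a fixed worst case.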

I'm just a student but I've spent about 10 hours trying to figure out why Azure has been charging me >$5/day for their "basic" database @5DTUs, 2gb max storage. This morning I was so exasperated I sent a letter threatening to report them for fraud if nobody could tell me why I was being charged 30x the listed rate, which so far no one has. This is an extremely cathartic post to see that I'm not alone, thanks for sharing.

Go to Billing > Cost analysis > filter by resource breakdown. Azure billing analysis is pretty amazing.

Yeah, but it just shows my database cost, which is higher than what's listed as far as I can tell.

Could it be that the price is listed "hourly" and you read it as "daily"? Add in VAT (equal to 25% in some countries) and you match a charge 30 times higher than expected.


Basic tier, 5 DTUs, 2 GB is listed as ~$4.8971/month or $0.0068/hour on this page. Extra storage would cost more but is not available for the basic tier.
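The factor in the hourly/daily suggestion above checks out arithmetically (quick sanity math, using the listed rate from the parent comment):

```python
hourly = 0.0068          # listed Basic-tier rate, $/hour
daily = hourly * 24      # actual daily cost, ~$0.16/day
with_vat = daily * 1.25  # with 25% VAT, where applicable

# Mistaking the hourly figure for a daily one makes the real charge
# look 24x too high, or 30x once VAT is included:
print(round(daily, 4), round(with_vat / hourly, 1))
```

That matches the "30x the listed rate" figure in the grandparent comment, though it doesn't by itself explain a charge of over $5/day.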

Do you have geo-replication turned on? More regions will be an additional $5/month (plus bandwidth between regions) if you replicate. You can serve everything out of a single region but it is pretty easy to add others if you're not paying attention during initial setup.

> Google let go of our bill as a one time gesture!

We've seen this happen with similar stories on AWS. Neither platform supports prepayment with a hard limit on costs, and this seems unlikely to change.

Yeah, a friend of mine wanted a "real" cert, not Let's Encrypt (I don't understand how that is more real, but OK). As a bit of a noob he clicked around on the AWS website, and some days later had a bill of 1500 EUR. They also nulled it. Still, this scares the hell out of me.

AWS Certificate Manager gives you free non-EV TLS certificates for use with AWS services; it's pretty nifty.

Yeah, it seems to be by design, hard to imagine otherwise.

I can sympathise with some of these stories, like the ones where an overnight DDoS attack racks up a huge unexpected bill, but this one in particular is just a story of gross incompetence and negligence. The guy hacked together some code in a few days and deployed it to a service with unlimited billing, without any kind of sanity checks and without even understanding what he was paying for. He's an ex-Googler; it's not like he hasn't heard stories like this before. And the takeaway? "Oops, don't deploy buggy code" and "I shouldn't have used the default settings". OK, sure, let me know how that works out for you.

But "the guy" was under the impression that he could do so, because he had set what he believed were sane limits in both GCP and on his credit card.
