It's unfortunate that the solution all providers are adopting for cloud pricing complexity is to add even more complexity on top.
The number you see on your bill is increasingly calculated by running some black-box algorithm on top of the billing events your resources generate. Was it accidental or not? What is a "weird" deployment vs a normal deployment? By what factor should the spikes on your billing graph be smoothed? None of this can be deterministically calculated from the pricing page. And there's no way for you to do these checks yourself before deployment because you have no idea what this logic even is. So you are entirely at the mercy of the provider for "forgiveness".
Who wants to bet that some provider is going to launch "AI cloud billing" within the next year?
Competition only matters for new contracts. Once you have picked a provider and made a big enough infrastructure investment, there's no realistic path to switch to someone else.
Keeping everything, including mission-critical databases, "inside" Kubernetes is smart. The smart people keep everything "in" Kubernetes. Everyone else can pay up.
Like everything else, it's a tradeoff. If you are running _everything_ inside Kubernetes you can easily move away, but I think you're losing much of the benefit of being in the $big_cloud to begin with. If you have the staffing to provide your own databases, block storage, caches, IAM, logging/monitoring, container registries, etc. in a reliable way "as a service", the jump to some much cheaper bare-metal hosting is not that far.
For me the sweet spot is to have all compute in Kubernetes and stick to open source or "standard" (e.g. S3) services for your auxiliary needs, but outsource them to the cloud provider. Fairly easy to move somewhere else, while still keeping the operational burden down.
But agree that having e.g all your databases in a proprietary solution from the cloud vendor seems sketchy.
Yeah, not _everything_, but almost or "pretty much" everything. The main exception to "everything" ultimately being S3, plus some monitoring exceptions here and there that are purely cloud-side, like monitoring AWS' Service Control Policies and using some cloud-side AWS tooling.
AWS has a very nice one-to-one mapping of K8s serviceAccounts to IAM roles.
We used some cryptography-centric GitOps patterns to eliminate any human hands beyond the dev environment, which also makes IAM easier (with no loss of granularity or quality).
> the jump to some much cheaper bare metal hosting is not that far.
Heh, at a consulting firm I was at not too long ago, all the K8s nodes were sole-tenant instances on their cloud providers. Intra-cluster Cilium-based pod-to-pod networking across cloud-based datacenter sites was super smooth, but I have to admit I'm probably tainted/biased by that team's uncommon access to talent.
Don't be ridiculous. To the nearest 9 decimal points, nobody keeps their mission critical database inside Kubernetes, and the K8s tax is excessive for essentially all cloud users anyway.
The last time I was involved in this kind of thing (long enough ago, but not too long), we ran Postgres and other databases strictly in K8s per a mandate by our CTO. Using reserved instances is great for avoiding a lot of cost.
I think for that particular firm, the "K8s tax" measured in fiat currency is negligible, and as for the human side, their people would respond to "K8s tax" with some ridicule along the lines of "someone doesn't know how to computer".
To be fair, most of the commenters on HackerNews should use something like Heroku.
This is absolutely not true. Customers regularly shift seven-figure-plus spends between cloud providers. Yes, it takes planning, and it takes many engineer-months of work, but it definitely happens.
And also factor in (1) the claim that most cloud growth is ahead of us, e.g. moving large customers from on-prem to cloud, and (2) that it would be terrible policy to try to charge existing customers more than new customers.
>> it would be terrible policy to try to charge existing customers more than new customers.
Companies do this very, very often. This is part of the reason why they have a "call the sales department for special pricing" option. They can give large contracts a nice discount to get them onboard, then slowly (or in some cases, if they really think they have you hooked, not so slowly) ratchet up the price. This is common in both B2C and B2B businesses.
Then how do AWS, Azure, and GCP charge 100 times as much for bandwidth as other hosting providers, and as the IP transit quote sitting in my email inbox?
Because they land your data in their systems by offering you a sweetheart deal in year 1-2 and then they jack the price back to the extortionate level once you're stuck. Compute is portable between providers, but data is not.
I've often wondered about this. I'm guessing large customers can all negotiate extremely deep discounts on bandwidth from all three providers. Smaller customers may not be paying that much in bandwidth that it would be decisive. I think also that if say GCP cut its bandwidth charges by 10x they might attract customers they really don't want.
A belligerent that so completely outclasses their opponent that they can inflict existentially threatening losses with so little effort that the expenditure is indistinguishable from normal overhead.
Alternatively, when Stitches managed to ambush your level-appropriate character alone in Duskwood.
Part 1 is that bandwidth prices vary tremendously by location, but clouds would like to have customers in expensive regions too; if the US overcharges at 100x and Brazil overcharges at 25x, the customer will only have to pay 2x for bandwidth in Brazil. There isn't a lot of cheap hosting in Brazil, from what I've seen, but there are a lot of users in Brazil, and attracting cloud customers justifies putting more cloud hardware there, which benefits the cloud.
The other part is that bandwidth is easy to measure and broadly correlates with general usage. On a shared system, you can't really meter watt-hours, but you can meter bandwidth. Bandwidth charges are how the difficult accounting is reconciled so that the cloud can charge enough to hit its desired margins. If bandwidth were just cost-plus, other things would cost more; if everything had to be cost-plus, accounting would be much more difficult for everyone (and it's already pretty non-transparent).
If the cost calculation is complex and opaque, that successfully prevents anyone from evaluating what their costs would be on a competing service. And vendor lock-in makes it expensive and nontrivial to simply try it out.
I'd recommend thinking of this (and other accident forgiveness schemes from competitors) as a gesture of goodwill that rarely happens rather than an official part of the billing policy.
If you actually look at your contract, no cloud provider is going to contractually obligate themselves to forgive your bill, and you shouldn't be planning or predicting your bill based on it.
Anyone else remember that, for actual years before AWS came along, we had widely available and completely understandable options to rent everything from baseline web hosting all the way up to private rack services? Apparently all of that was forgotten, Warhammer 40K style.
I've rented a VPS from a vendor for going on 20 years now (Holy fuck I'm old) and I've never once been surprised at the bill.
You know that book "JavaScript: The Good Parts"? There needs to be a similar one "AWS: just the good parts". It would probably talk about EC2 (VPS), S3 (cheap bulk storage), SES (emails), and that's about it. When folks get into the elastic-super-beanstalk-container-swarm-v0.3 products, that's when they really kill themselves on the bills.
That said, yes, just using a VPS vendor is the easy way to stick to the good parts.
You still can. My company rents space from Cogent communications, and they are very good at customer service, although the offering of "rent a cage and get a pipe" is a lot more DIY than AWS.
It's far from perfect, but the large providers do give you tools for categorizing your charges (tagging, etc). There's some fuzziness, especially around data transfer, but for the most part developers can look at an application and know what the cost of each resource is. The biggest risk seems to be out-of-control auto scaling turned on without doing reasonable analysis first.
> If you do something luridly stupid and rack up costs, AWS and GCP will probably cut you a break. [...] Everyone does.
If the incidents that made the rounds here in the last few months are any indication, they'll start out insisting you pay no matter what. You'll then have to write a blog post about it, post it to Twitter, HN, and Reddit, get a couple hundred comments expressing anger at the provider, and wait for someone from their PR department to see it. Only then will they finally waive the costs.
Good on Fly.io for trying to handle such situations more sensibly.
There's plenty of situations you don't hear about.
A few years ago I f'ed up and accidentally pushed keys to a public repo, and by the next morning, we racked up $50k in AWS charges from crypto miners. We reached out, they gave us a security checklist that if we followed, they'd take off the charges. We did, and by Monday (my code push was Friday evening) the charges were taken off. No public shaming required.
You likely already know that, but to anyone else interested: a good way to prevent these kinds of situations is to run 'nosey parker' on your git repo before pushing it to a remote. It will dig through your code and configs, looking at files and through all the git history, and highlight anything that looks like tokens, passwords, keys, etc. You can set it as a pre-commit hook to block the offending code from even being committed.
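The idea behind these scanners is simple enough to sketch in a few lines of Python (a toy illustration, not nosey parker itself; the patterns are common published heuristics, e.g. AWS access key IDs start with `AKIA` followed by 16 uppercase alphanumerics):

```python
import re

# A few well-known secret shapes (illustrative, far from exhaustive).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|secret)\s*[:=]\s*['\"][A-Za-z0-9/+=]{16,}['\"]"
    ),
}

def scan_text(text: str) -> list[str]:
    """Return the names of all secret patterns found in the given text."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

# A real pre-commit hook would run this over the staged diff
# (`git diff --cached`) and exit non-zero on any hit, blocking the commit.
```

Real tools add entropy analysis and history scanning on top, but even a crude pattern check like this would have caught the pushed keys in the story above.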
It's been a few years, so I'm going off of memory, but it was mostly best practices stuff (enabling Cloudtrail, rotating older keys, etc). Anything to ensure that once the attackers no longer had access, removing/monitoring anything that would have longer term implications.
What are you referring to, exactly? That doesn’t sound like AWS at all.
AWS are very well-known for bill forgiveness. It’s not something set in stone, but if your bill explodes accidentally, even due to a mistake you made, they will normally forgive it if you ask them. You don’t need to go running to social media at all.
You're right, I was getting two relatively recent posts mixed up:
1. After a DDoS attack, someone got a $100k bill from Netlify for his static site and after he asked to have it waived, they generously reduced it to $5k. Only after his posts about it blew up did Netlify waive it completely. [1]
2. Someone got a $1k bill from AWS because lots of people made _unauthorized_ requests against his empty S3 bucket. AWS did agree to waive it immediately, prior to any social media posts. [2]
I probably just remembered (2) as "that ridiculous billing situation involving AWS" but got the details of what exactly happened mixed up with (1).
I have personal knowledge of two cases of this happening and both were immediately dropped. One was a high school student's personal project gone awry for like $3k worth the other was a startup where it was $120k or so. Two different providers. Neither had to particularly argue their case, just ask and wait a few days.
We incurred a six-figure bill when the API that an authentication-token-handling lambda pulled updates from was taken down. The lambda went into a crazy self-invoking loop: it had a retry mechanism and a cron schedule, which piled invocations on top of retries. (Overworked team, poor design, etc.)
So far we have gotten no concessions from AWS, and we have annual bills in the millions, just not for this application whose budget now has an awkward and obvious spike.
Thanks, we are trying to compose an argument with this service in mind. We didn't have any recursive-invocation protection for this lambda, and the AWS services it hammered during its invocations contributed to the massive cost.
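One cheap guard against this failure mode is to carry a depth counter in the event payload and refuse to retry past a limit (a sketch with hypothetical names like `refresh_token` and `MAX_DEPTH`, not AWS's built-in loop detection):

```python
MAX_DEPTH = 3  # hypothetical cap on self-invocation chains

def refresh_token(event):
    # Stand-in for the upstream API call; here it always fails,
    # simulating the API having been taken down.
    raise RuntimeError("upstream API is gone")

def handler(event, context=None):
    """Toy Lambda-style handler that refuses to retry itself forever."""
    depth = event.get("retry_depth", 0)
    if depth >= MAX_DEPTH:
        # Fail loudly instead of piling invocations on top of retries.
        return {"status": "gave_up", "depth": depth}
    try:
        refresh_token(event)
        return {"status": "ok", "depth": depth}
    except Exception:
        # Instead of an unconditional retry, re-invoke with the counter bumped.
        return handler({**event, "retry_depth": depth + 1})
```

With a cron schedule on top you'd still get one bounded chain per tick instead of an unbounded pile-up.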
If you have elasticity, make sure you understand the limits. Even serverless has these knobs, but for various reasons they default to high elasticity.
But many folks don’t need the elasticity at all. So you should factor that into your architectural decisions.
Well, the ones where they forgive it right away don't write blog posts about it, so you don't know about them. I guarantee that every case which makes the front page is either an example of a mistake made by an individual support engineer following the wrong script, or the person posting it is being dishonest about what actually happened (I recall a particularly suspicious recent one where they hid the name of their site). The cloud providers all have a policy of forgiving these sort of cases. The only speedbump is ascertaining if it was a genuine mistake.
An argument often made only by folks who haven't run anything at scale on a public cloud. If you've ever sat in a room when a cloud deal was signed, you'd know how it goes: you don't get to pay less by threatening to switch providers (I've seen people try to suggest that and get laughed out of the room), but by buying more from a single one.
Hence accident forgiveness is in the same marketing bucket as free credits for new customers: reducing the aversion to trying the cloud and putting more work on it.
And the switching costs, the biggest line item isn't building for multiple clouds (you can't be cloud agnostic: abstractions always leak somewhere, so you need to select a set of specific clouds to build for), but the cost of moving data between clouds. That's the real lock-in.
In the IBM, HP days it wasn’t about paying less per se by threatening to switch providers, it was getting more concessions (of which money might be one). Some big companies had both systems so they could play them off of each other. Time to order more hardware, who will kiss our butts more?
I’d be very surprised if that doesn’t still exist for the big boys. Though most of us are not big boys, and half of the biggest boys are cloud providers themselves.
We all pay more [sic] so that data centers are over-provisioned, allowing quick expansion when we need it. I've been in a few incidents that were root-caused to a cloud provider lacking the capacity to provision more instances.
Since that overcapacity would've been bought and installed in the cloud provider's data centers regardless of the errors being forgiven, not billing for those errors is a net gain to the provider, at minuscule cost, if any, to other users.
There is a very obvious fix for surprise billing. Enforce a billing cap and terminate service if it's met. Even better if you send alerts when the cap is approaching.
If I pay $39/month, a default cap should be $39 per-month. Otherwise, let me set a cap I am comfortable with.
Surprise billing is never good for customers, only the business.
I promise, you are not the first person to have thought of this, and, believe it or not, there are reasons other than malice and avarice that cloud providers don't terminate service based on billing caps. Terminating service is a big deal.
Terminating service is a big deal for commercial customers' production environments.
Reversibly (i.e. shut down compute, don't delete anything, allow the customer to review, fix and reinstate quickly) terminating service is a minor annoyance for hobby/experimental setups, and in those, it's much more preferable than having to open a support ticket to deal with a massive bill.
Having quotas that the customer can increase themselves (but has to manually choose to increase) on storage prevents storage related surprise bills, and the rest you can shut down (optionally, make the user choose up front what they would prefer).
What am I missing? Too many commercial customers picking "experimental" initially and forgetting to change it?
No one really cares enough about hobby developers as a customer segment to rebuild the billing infrastructure to make this possible. Scalable billing at huge scale is solved at the cost of latency and being "eventually correct" (unless it has changed). To add a price cap feature, eventually correct isn't enough and then you ask yourself who would actually use it and you have to scroll really far down your list of biggest customers until you get to someone who wants it.
The problem with not caring about hobby developers is that it means developers won't be as familiar with your cloud environment when time comes to pick one for the next "real" project.
I would also expect a price cap feature to be useful for experimental/no-approval-required projects at work. In fact, if I ran a cloud project for work as a small team-internal project, a cost explosion would become an even bigger bureaucratic nightmare than if it happened at home.
My preferred option would be to have an optional billing cap that I can enable knowing full well that if it is exceeded the service would be terminated (obviously with notifications as that cap is approached). I could then apply this to simple hobby projects and such, while not having the risk of termination apply to more serious applications (though a 'soft cap' would be nice here so that I could still receive notifications as it approaches).
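The opt-in behavior described here is easy to state as policy. A sketch (made-up alert thresholds, not any provider's actual logic):

```python
def billing_action(spend: float, cap: float, alert_fractions=(0.5, 0.8, 0.9)):
    """Decide what an opt-in hard cap should do at the current spend level."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    if spend >= cap:
        return "terminate"  # hard cap reached: stop the service
    crossed = [f for f in alert_fractions if spend >= f * cap]
    if crossed:
        # Notify as the cap is approached, at the highest threshold crossed.
        return f"alert_{int(max(crossed) * 100)}pct"
    return "ok"
```

The hard part providers point to isn't this decision function, it's producing an accurate, timely `spend` figure to feed into it.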
The real fix is scoping credentials on AWS: if you don't use an account or role with limited permissions, then even if this toggle existed, the first step in an attack would be to disable it.
As a certain kind of user, you probably do think that. But I also think I should be able to have a root level admin account without MFA. The consensus is that no, that should not be up to the customer.
It's different here, sure, but the providers optimize for not letting customers shoot themselves in the foot, and remediation via bill forgiveness is a fine solution -- from the provider POV.
You should be able to have a root level admin account with no 2FA! I would print mine and keep it in a tamper-evident envelope in a safe at my lawyer's office with instructions for when and who can get it.
A company isn't liable if their customer gets themselves hacked because they decided to not use any of the many MFA options available to them and neither is a company liable if the customer set a billing limit rule that they executed correctly.
Companies can simply not be trusted to tell the difference between a foot-gun and a..whatever a good kind of gun would be...
I don't have MFA on my root-level account. Is it because my account is 16 years old or so at this point? My personal AWS account is tied directly to my "order more dish soap" Amazon account, because that's how it worked back then, I guess.
Having flashbacks to the time where we had paid for a server and were paying for rack space for a customer and they were refusing to pay their bill. Our lawyers told us in no uncertain terms that turning off the server would be a terrible idea. “Obstruction of service” is the term that comes to mind.
While the parent's point about cloud providers having arrived at their policies thoughtfully is fair, this particular issue is likely not part of the equation. There are plenty of services that run on a quota system (ChatGPT, Sentry, etc.). There is a difference between shutting off a service the customer reasonably expected to be always on and shutting off a service when it reaches a threshold the customer set as part of their purchase. The former is more like repossessing a physical good willy-nilly because the customer missed a payment or a check bounced… you can't do that.
Depends. For my hobby project that hasn't been monetized, being charged $100k is way worse than it going down.
If it were critical infrastructure, or monetized in a way that brought in revenue to cover the charge, then maybe I don't want it to shut down despite skyrocketing costs, but that's hardly the only situation you could be in.
But more importantly: just let the customer decide! Let them decide whether there's a threshold where an outage is less costly than the hosting, and what that threshold is
Just in case you don't know, the person you're replying to is the author of the blog post, and they mention in it that being stressed out by (potential) unexpected large bills is something they are aware of.
Although even then, "we will only threaten to charge you $100k without meaning it" isn't much of a reassurance.
Getting charged is not the same as paying those charges. I'm willing to bet real money that all who, in good faith, went and asked not to pay those bills ended up not paying them (if they didn't ask, that's a separate issue).
From that point of view, the cap seems more like a common-sense self defense feature that the cloud provider would want to implement. But we have cloud providers in this thread saying they don’t want to implement caps, so, I dunno.
From my technical understanding, via a few friends who may or may not know what they are talking about, the caps issue has to do with the processing delay between billing events being emitted and understood: by the time the billing events have been processed and action taken, the damage would already be done in all but the most extreme cases.
Secondly, the best solution is to simply stop everything. But now the customer has to cold-start their entire infrastructure, which may actually cost more than paying the bill.
Thirdly, it is likely that customers will set a billing limit and then forget about it years later. By then they've got a complicated infrastructure setup spanning the globe. They finally hit a scale where they reach the billing limit that Bill configured in the early days (and Bill doesn't even work there anymore), and suddenly the entire global infrastructure is shut down in the middle of the night.
This made me think. There is usually some "hard" cost limit X that you cannot / don't want to afford, so terminating the service is preferable.
There is also usually some "soft" limit Y < X that you don't want to exceed, and don't plan to exceed, but you'd rather pay >Y than face an outage.
But a hard limit would have to be set to X to avoid that outage, and if it gets exceeded, you'll face a bill of X and an outage.
So what a customer would actually need is to specify both X and Y, with the rule: If the cost would exceed X, then terminate it early so the cost doesn't actually exceed Y.
Sounds complicated to implement, but then, the current practice of waiving the bill is complicated too if you tried to formalize it.
(For the sake of this discussion, I'm ignoring all the technical difficulties of terminating a high-availability service at all.)
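Formalized, the rule above is a small decision function (a sketch; `projected_spend` would come from some trajectory prediction, which is exactly the hard part):

```python
def decide(current_spend: float, projected_spend: float,
           soft_y: float, hard_x: float) -> str:
    """Two-threshold policy: prefer paying between Y and X over an outage,
    but if spend is headed past X anyway, cut the service once Y is reached."""
    assert soft_y < hard_x, "soft limit Y must be below hard limit X"
    if projected_spend <= soft_y:
        return "run"                # comfortably within budget
    if projected_spend <= hard_x:
        return "run_and_alert"      # over Y: tolerable, but warn the customer
    # Projected to blow past X: the outage is coming either way, so
    # terminate as soon as actual spend crosses Y to keep the bill low.
    return "terminate" if current_spend >= soft_y else "run_and_alert"
```

This avoids the worst case the comment describes (a bill of X AND an outage) by trading the unavoidable outage for a bill closer to Y.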
It would, of course, be just mean and unscrupulous for a cloud vendor to look at the number you have set as being ‘the absolute most I am willing to pay for this service’ and then optimize their pricing offer to you specifically to make sure they go right up to that line and no further.
I didn't mean to imply that. A hard limit at X that does nothing below X and, once X is exceeded, leaves you with a bill of X and an outage is the easiest approach from the technical side: terminate the service when X is reached, and bill exactly what was provided. It is what you would instantly come up with when asked to implement a cost limit without for a second putting yourself in the customer's position.
Of course cloud vendors do put themselves in the customer's position, and that's why they say that customers would not be happy with a limit, even though they are asking for it.
Is this a soft limit or a trajectory prediction? I think there isn’t such a thing as a soft limit. Nobody wants to spend any money really right? But you need to spend some to avoid losing service. That’s just a cost you don’t like but need to pay.
I definitely get the idea of: I don’t want to spend X so if it looks like I will, terminate service at Y. But I think that’s a special case of the general situation, I want to know how much I’m on track to spend, right?
But I don’t know much about this at all. My whole experience was accidentally getting my own personal self a $500 AWS charge and then deciding they cloud services were dumb.
> Is this a soft limit or a trajectory prediction?
I don't know. I just tried to frame the problem from a customer's point of view, because cloud vendors' statement that customers would not like a limit is (IMHO) limited by their POV. Customers do want a limit, but not the way that cloud vendors would implement it. I think a huge part of the problem is understanding what exactly it is that you need when you use a cloud service. (This is varying from customer to customer, and from service to service, of course. You usually have important services that must be running, and others where an outage would be unpleasant but not critical.)
> Nobody wants to spend any money really right? But you need to spend some to avoid losing service. That’s just a cost you don’t like but need to pay.
That is not the issue. From a customer's POV, I would be ready to spend extra to keep the service running, but there is a limit where I'd prefer an outage because I can't bear that much. There are two problems with that: First, the limit is blurry. Second, a simple hard limit would leave me with a huge bill AND an outage. I would want to be able to choose one of those evils, not be left with both. And these two problems compound.
> But I don’t know much about this at all. My whole experience was accidentally getting my own personal self a $500 AWS charge and then deciding they cloud services were dumb.
I don't think they are dumber than the alternative. If you run your own hardware, you have a hard limit in both cost and computing power. You could technically get that with the cloud too, but it is not usually offered because it doesn't really solve the problem, and it doesn't for your own hardware either.
That said, it would be nice if the major clouds would offer a "hard limit" option, but it really only works for "unimportant" applications that are cost-sensitive and can take an outage.
For many companies' use cases (spin up EC2 instances a few times a year, otherwise leave them running), I would imagine they would really feel better if there was an option that said "if charges incurred are over (a configurable threshold you set to 10k higher than your normal monthly spend), then robocall the customer, and if they do not reply after 3 automated calls, block all additional AWS API calls that would incur more than 10% of my monthly spend if left running for 24h."
This would still allow all production services to run, but would stop someone from spinning up 200 crypto miners. I'm sure AWS is capable of implementing this, and I don't want to say it's "easy" but I would be shocked if they lacked the technical expertise to do this.
We’ve now gone from hard billing caps to “soft billing with alerting”.
If you look at lots of these threads you’ll see that many people don’t want to provide phone numbers, lots of people ignore emails, even repeated ones, directly to them, from billing.
This isn’t a technical problem, it’s a service problem. I can see the hn posts already “my site went viral and <HOST> shut me down”
Here's the thing. We have APIs for everything and their grandmother. You create an instance and there are APIs for adding tags, labels, nicknames... but not for spending caps? I understand I don't know all of the complexities involved, but if you can bill by the second or by the hour, you can certainly alert by the same metrics.
We have been measuring CPU, MEM with extreme granularity, how about considering price as a resource and measuring the same way, so that a service with a price cap can self manage and self terminate according to some priority field?
This might not be the actual solution, but we have been at this for a very long time, seems like there is not even a hint of an attempt at solving it by the giants. This is about incentives, sorry.
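Treating price as a schedulable resource could look something like this (a sketch with a hypothetical `priority` field, shedding the least important services first when the cap is hit):

```python
def shed_services(services, cap):
    """Given (name, priority, cost_rate) tuples, keep the highest-priority
    services whose combined cost rate fits under the cap; shed the rest."""
    keep, shed, budget = [], [], cap
    # Highest priority first; ties broken by cheaper cost rate.
    for name, priority, cost in sorted(services, key=lambda s: (-s[1], s[2])):
        if cost <= budget:
            keep.append(name)
            budget -= cost
        else:
            shed.append(name)
    return keep, shed
```

A scheduler running this periodically against metered per-service cost would degrade gracefully instead of either blowing the budget or going fully dark.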
I was pleasantly surprised when I was messing around with the Google maps API and found I was able to adjust a quota to put an upper cap on daily spend.
It made me feel much more comfortable hacking around and not needing to worry that I'd accidentally create a render loop or something that could rack up a bill whilst I wasn't looking
When you build an app for resiliency you end up with classes of service where the app fails in stages.
But to extend that to the billing case, you’d have to have a partnership with your customers, not just a dashboard where they push buttons and an API where you add or delete machines.
Maybe the website goes read only except for admin traffic when the budget is exceeded, for instance. Not as a bespoke process each company has to reinvent, but as functionality provided by the vendor.
I understand it’s super hard, but Azure has pay as you go and credits that do exactly this. Seems like if you were designing a billing system it’d be a good idea to include this feature.
When you're paying $39/month for something that generates $0/month, that is a very sensible policy.
When you're paying $50,000/month for something that generates $200,000/month in value, or if an outage can generate $100,000/month in costs, or if the people that can fix an outage cost $100,000/year, then it's not.
Because data storage costs money, hard billing caps require deleting both your data and its backups to stay under the hard cap. There are very few use cases where that's actually acceptable, not even development environments where people will get upset for losing their work that they haven't pushed to somewhere else yet.
I really, really wish there was just a way to put a hard limit on how much a cloud could charge me.
I am never going to want to spend 200k of my personal money on some project on a cloud. Never. I don't even want my ant-based basket viewing project simulator to cost me 1 thousand dollars because it went viral and all clouds overcharge for bandwidth.
I think this sounds easy, but it’s actually quite nontrivial in practice. You need special handling of such a limit for each kind of resource.
E.g., when my limit is reached, do they remove the database, along with all backups, and all objects on S3? Since storage is billed, it should also be stopped when the limit is reached, right?
I think in practice, among the companies that make up most of a provider's revenue, there'd be zero interest in this, while it would be a lot of effort to implement.
This is really something specific to hobby projects, which just shouldn’t be using those “unlimited potential cost” services.
I don't think any of these requirements are necessary. A basic "if I reach my bill limit, turn off things that are billing" toggle would suffice for 80%+ of users, especially with better billing controls/per-team billing accounts. I think you run into more problems/user frustrations with a tenuous conditional-shut-off approach.
My tinfoil-hat theory is that a lot of cloud billing is accidental, probably from "lab environments", and providers don't want to provide a way to budget/limit these.
What if my bug was saving, like, way more data to the cloud than I expected, and suddenly having a big bill for the hard drive space my data is using up just sitting there? It would be a pretty dumb bug, but hey, you never know, right? In that case they have to choose to either delete my data or keep charging me, so I guess there isn’t an easy zero-cost “stop” option.
The provider could reject further access to them (reads / writes) once the limit is reached. The cost of actually keeping objects as "cold" storage has a natural cap per billing cycle since those are billed based on time.
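The residual cost once access is frozen is easy to bound, since storage is billed per GB-month (a worked sketch with illustrative prices, not any provider's actual rates):

```python
def frozen_storage_cost(stored_gb: float, price_per_gb_month: float,
                        days_left: int) -> float:
    """Upper bound on remaining storage charges this billing cycle after
    reads/writes are cut off: the data just sits there, pro-rated by time
    over a nominal 30-day month."""
    return stored_gb * price_per_gb_month * (days_left / 30)

# E.g. 1 TB at a hypothetical $0.02/GB-month, frozen with 15 days left
# in the cycle, can accrue at most a few more dollars.
```

So rejecting further access bounds the bill without deleting anything, which is the whole point.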
If there was a law about this, companies would find a way. Where there's any uncertainty, they can simply eat those edge costs; the extremely fat profit margins cover it easily.
Yes. Very much so. Or at least one that charged SANE rates. I get 100 Mb outbound on my residential ISP, and that is unmetered, but clouds that sit almost directly on the backbone somehow charge a nontrivial amount of dollars per GB of bandwidth.
Exactly - the whole point of a cloud provider is scalability. If you're doing a personal hobby project, get off big scalable clouds and get yourself one (or multiple!) fixed-price VPS or dedicated servers.
But as I think of it, I think what people really want, for hobby projects, is not so much the scalability, but the managed offerings. They want zero-ops, zero-maintenance, zero-server-updates hosting, with a fixed price and hard limits. It won't be infinitely scalable, but it doesn't need to be - it's a hobby project.
They just don't wanna sysadmin a server of their own. Which is completely understandable.
There's room in the market for something like this.
An important part of hobby projects is the scaling to zero part too. I wager a lot of hobby projects use cloud simply because it's free or almost free (e.g. 2 cents a month), which isn't the case if you rent a VPS.
We tried this. I was tasked with automating popular software installs into fresh VMs. I think some of my scripts for doing so are on my github - wordpress and some dashboard software, at least.
Offerings were published on the main page and AFAIK there were no takers. We migrated off whatever hypervisor we were using onto Wok/Kimchi and finally to Proxmox, so my scripts still work, but Proxmox now has TurnKey Linux "quickstart" servers as well as LXD, so there's less reason to use my scripts.
> The Fully Automated Accident-Forgiving Billing System of the Future (which we are in fact building and may even one day ship) will give you a line-item veto on your invoice.
For a lot of the personal projects and early stage startups that are most terrified of these kinds of mistakes and therefore avoid autoscaling products like fly.io, we sincerely would rather have the entire account shut down when it goes over budget than ever see a bill that's higher than our net worth. A line item veto somewhat alleviates that concern but not fully.
They indicate at the end that they're going to do something along these lines, but what they're describing there also seems over-engineered compared to a simple circuit breaker that kills the account or some subset of it. Is there a good reason for these providers to avoid implementing that feature, which on a naive look seems far simpler than what they've actually proposed?
Something like half the comments on this story are a discussion of why or why not cloud providers do or don't provide this simple circuit breaker feature.
I read yours after posting this and it provides exactly zero information:
> I promise, you are not the first person to have thought of this, and, believe it or not, there are reasons other than malice and avarice that cloud providers don't terminate service based on billing caps. Terminating service is a big deal.
"Terminating service is a big deal" how? I can explicitly cancel a subscription after a certain date—what is the problem with me explicitly cancelling a subscription after a certain amount of spend?
No one is asking for automatic circuit breakers applied to all customers indiscriminately, but I'm not seeing any justification in here for why an opt-in circuit breaker is technically or legally challenging to implement.
> I can explicitly cancel a subscription after a certain date—what is the problem with me explicitly cancelling a subscription after a certain amount of spend?
As someone also involved in billing systems for public clouds: in theory there's no difference, but in practice there is a world of difference. This is the sort of situation where the end user is commonly surprised with the consequences of their own decisions. At MGC we have some "soft shut down" processes, and we constantly hear stuff like "I know I said shut down, but this is the one situation where that really didn't make sense"; where examples are "storage which keeps backups became unavailable", "a very simple but critical user auth system disappeared", "I had no idea this was still running on my account", or "OMG not in the middle of the weekend", etc. You can build heuristics and tracking into the system to minimize these situations, but that's a lot of work.
So yeah, it is a valid use case and something many CSPs would like to provide, but implementing something that is actually better than nothing is non-trivial.
If I were fly I’d implement spending caps just to filter out unserious operators.
My implementation of that feature would be when you turned it on fly would just kick you off.
No serious operator wants a provider to turn their system off. So if you want that it’s pretty clear you are hosting a silly system. Which likely costs more to support and drives less margin.
The real solution is to make it simpler to set up billing alerts. If my hobby project is ever projected to cost more than $0.50/hr, let me know. Billing alerts should be part of onboarding for all cloud providers, especially for individuals that don’t want surprises.
The biggest reason I am not using AWS, Google Cloud, Vercel, etc. for my personal projects is surprise bills. I am not earning anything from them, so I can put in $50 but can't afford $500 or $5000. I would feel much more secure if I could set a hard limit and be absolutely sure my bill will never go above $50. (Or cloud billing insurance :))
Except for things like student accounts, which we're working on, I don't think hard limits are coming any time soon. Our expectation is that the enforcement of hard billing limits would mostly make customers furious with us. If you read to the end of the post, the direction we're going is preemptive detection of weird billing spikes, so you don't even have to notice and ask us.
If you're really only looking to spend $50, we should put our cards on table and say that we're generally not making product pricing decisions with you in mind. If your needs are pretty straightforward, there are hosting providers that will do a better job of serving that business than we will.
If there is a class of people who get steered to your service by a small number of organizations, you might. Universities, trade schools, a podcaster you’re sponsoring.
You can’t scale to individualized service for $50 per month/quarter/year users, that’s true. But you can shape policy for a demographic with… shall we call it flocking behavior, for lack of a better term?
I love your service, but this erroneous take makes me think that even if I was looking to spend up to $500 with a "real" business, you don't have me in mind.
There should obviously be multiple types of caps. AWS and others have set a precedent that they can get away with "gotchas" for anyone who isn't paying attention.
It's really the primary business model. People that aren't watching costs have more and more sneak in there.
Does anyone know of a service like fly.io that has a perspective that's more friendly to bootstrapped startups?
FWIW I believe your reasoning is exactly what I saw/learned at AWS. The potential cost of losing/disappointing “real” customers greatly outweighed the notional cost of lost “scared” customers.
One interesting thought was trying to model some of this as an actual insurance product. Think of the cases where an adversary of the customer might inflate their costs through third-party usage/traffic. Providers don't want to incentivize those adversaries, deny the customer service, or charge them for unuseful service. Normally it devolved to credit/forgiveness, but that moves the customer's business-model risk to the provider. What if this functioned like an insurance model: very cheap, baked-in forgiveness for everyone (as today), then, based on risk profile (porn/games/political/gambling, or previous occurrences), the customer gets the option of buying forgiveness insurance or self-funding their risk. The real sticking point is the perceived/potential conflict of interest, and the goodwill needed for the provider to say "pay us more for a thing that you can't directly control."
Yeah, I am pretty sure no cloud provider thinks about people who are only willing to spend small money. But guess what? If one of my projects starts earning money, then I will continue with what I know best :) I think that's why every cloud provider still has configurations for a couple of dollars.
We think all the time about people starting small projects here. I think a really good line to draw, if you want to understand how we think about this stuff, is between people who would be OK with us terminating their service because they mispredicted what their cap should be, and people who need their stuff to stay up and running and will just talk to support if they're concerned about billing. A shorter way to say this is that we make product pricing decisions for people who are doing this work professionally.
You're selecting for people who feel they're privileged enough to reach out to support and ask for bill forgiveness even when they may've messed up, repeatedly. This does select for something, but I'm not 100% sure "people who are doing this work professionally" is that clear-cut, especially once you move past western cultural norms.
One thing I'm really curious about is why caps are so hard? (Perhaps this would result in a more technical blog post?)
I.e., you clearly don't want to terminate or shut down an account if they get too close to a cap. But what about things like a warning email, service slowdown, etc.?
Likewise, the old "slashdotted" or "hug of death" might be an appropriate result when something goes beyond a reasonable safety buffer?
Anyway, just curious. It's clear that it's a complicated topic, and the real constraints and challenges are interesting.
We talk about warnings in the post. We'll do that at some point. The hard part about caps is what to do when someone hits them.
If there was a way to make caps work for our core customers, we'd do it. We're open to ideas. A theme of our work this past month and these next several months is extracting maximal value from ANFWWAONW, our new billing system. The thing you have to remember though is that our belief about our core customer is that they are averse to nothing more fiercely than service disruption.
We're not in principle opposed to caps. We just don't have a product story for them that we're comfortable with. Keeping you from spending more money than you wanted to is an explicit product goal of ours (again: see post); we're just very wary of trading availability off against that goal.
Configurable warnings as webhooks would be pretty cool. Then I can automate whatever needs to happen on my side.
I already automate apps, machines, etc with the machines API and GraphQL, so my big worries in this area are:
- Whoops, some bad logic deployed too many machines (sounds like this policy helps)
- Some kind of mistake or attack that just explodes bandwidth usage suddenly
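A receiver for such warning webhooks could be quite small. The sketch below is hypothetical: the alert payload shape and the `scale_down` hook are assumptions, not a real Fly.io API; on my side I'd swap in real calls to the Machines API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def should_scale_down(alert: dict) -> bool:
    """Decide whether the projected spend in the alert breaches my budget.
    Fields here are an assumed payload shape, not a documented schema."""
    return alert.get("projected_monthly", 0.0) > alert.get("budget", float("inf"))

def scale_down(app_name: str) -> None:
    # Placeholder: this is where I'd call the Machines API to stop instances.
    print(f"scaling down {app_name}")

class BillingAlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))
        # e.g. {"app": "my-app", "projected_monthly": 120.0, "budget": 100.0}
        if should_scale_down(alert):
            scale_down(alert["app"])
        self.send_response(204)
        self.end_headers()

# To actually listen: HTTPServer(("", 8080), BillingAlertHandler).serve_forever()
```

The nice property is that the provider only has to emit warnings; the policy for what to kill (and whether to kill anything) stays entirely on the customer's side.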
I don't think anyone with a serious app running on us will use a cap. Just stay fixated on this scenario: a deploy-only token gets stolen, and the attacker (like most cloud attackers) uses it to stand up a bunch of Monero miners. As a consequence... their main app goes down? Who would be OK with that?
I think the cap (if they had any) would probably reflect the amount they stand to lose if their app goes down. If your app brings in $1000/h, you don’t necessarily care about spending that amount on servers. When your costs rise to $10k/hour, you might want to go with the nuclear option.
Of course it’s nicer if you can be certain that your provider is going to refund you the excess, but I feel like it’s hard to count on it. Or at least, harder than having explicit rules, which you just can’t really have for those situations that are sensitive to fraud.
Honestly, if I did set a cap I’d be very much aware of the fact my app could suddenly die in a situation where my deploy token were stolen (but it wouldn’t matter for me, since it’s a hobby project, I care about controlling costs, not uptime).
Metering, pricing, and billing are way more complicated than you might assume. There are historical posts here with more details if you search. In short: it's all async, there's variable lag, there are multiple “types” or dimensions of metering, the prices vary by SKU + customer + previous metering or billing value + other SKU usage, and billing is not uniform across customers. Imagine needing to accumulate all the metering events, apply pricing plans, modify the price based on accumulated value, then periodically recalculate it all to compute a bill, then apply more transforms and credits, and that gets you to the billable price.
Now evaluate your “cap” rules (which will be just as complex) and feed that back to the actual admin/control plane of the service.
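To make the shape of that pipeline concrete, here is a deliberately toy, synchronous model: accumulate events, apply per-SKU pricing with a usage-dependent tier, then credits. All prices, SKUs, and free tiers are invented; real systems add async ingestion, lag, per-customer price books, and periodic recalculation:

```python
from collections import defaultdict

# Invented list prices, for illustration only.
PRICES = {"compute_s": 0.0001, "egress_gb": 0.09}

def usage_by_sku(events):
    """Stage 1: accumulate raw metering events into per-SKU totals."""
    totals = defaultdict(float)
    for sku, qty in events:
        totals[sku] += qty
    return totals

def price(totals, free_egress_gb=1.0):
    """Stage 2: apply pricing, including a price that depends on the
    accumulated value itself (a free tier on egress)."""
    bill = totals.get("compute_s", 0.0) * PRICES["compute_s"]
    billable_egress = max(0.0, totals.get("egress_gb", 0.0) - free_egress_gb)
    return bill + billable_egress * PRICES["egress_gb"]

def apply_credits(bill, credits):
    """Stage 3: transforms and credits produce the billable price."""
    return max(0.0, bill - credits)

events = [("compute_s", 360000), ("egress_gb", 5.0), ("egress_gb", 2.0)]
bill = apply_credits(price(usage_by_sku(events)), credits=10.0)
# 360000s * $0.0001 = $36.00; (7 - 1 free) GB * $0.09 = $0.54; minus $10 credit
assert abs(bill - 26.54) < 1e-6
```

Even in this toy, notice that a cap check can't just watch raw events: it has to run the whole pipeline (tiers, credits) to know where the customer actually stands, which is the impedance mismatch the parent is describing.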
No, this is motivated reasoning brainrot. It's overcomplicating the problem by hyperfocusing on a specific implementation that's already been judged infeasible to justify not doing it.
The actual problem that people want solved is "the customer wants predictable, budgetable upper bound periodic cost". You are not unique in offering a service where this is a desirable property. Realigning this sort of cost structure is the bread and butter of insurance industry, and no, as much as they'd like to, they don't actually do it by making sure to stop the earthquake before it knocks down more of your house than your price cap.
I don't understand what you're on about. The post I replied to, and many others, use “caps” to refer to limits of service based on billing. This is an endless source of comments on every “cloud” topic. I provided a very brief overview of why large billing systems are more complicated than expected and have an impedance mismatch with the stated desire. But sure, tautological brain rot. Got it. I'm sure you have a wealth of experience with metering, billing, and pricing for billion-dollar revenue streams.
Now if you have some _other_ proposal for how billing and service limits could function, I'm legitimately interested. But I don't see anything at all specific or actionable in your replies. Insurance is interesting for _some_ facets. I'm curious how you think that aligns with dynamic resource utilization and what happens at the boundary.
> if you have some _other_ proposal for how billing and service limits could function Im legitimately interested.
Billing doesn't have to be so complicated you can't calculate it in less than a minute. That's a technical failure. Surely you can imagine a better way? If you really think it can't be better, then it's hard to argue against "brain rot".
Also on most systems it will work perfectly fine to use an estimate of the price per unit when calculating the bill for the last couple hours.
Again, unless you have domain expertise I suspect you are drastically underestimating the complexity of reality. As I alluded to earlier metering, pricing, and billing are _at least_ three separate complicated problems in substantial systems. There are multiple metering dimensions that interact with each other, time, and periodicity at a minimum. Saying “do billing better” without either a thorough understanding of the existing system, or an actual concrete proposal, is not useful.
Being separate systems is fine as long as they can all calculate a number within the time it takes to make a cup of coffee. That's not hard if you actually make it a design goal.
I'm sorry that "prioritize it" isn't very helpful, but it's true. If the calculation takes longer than that, it's a management-induced problem.
If we're positing a system that can already do that, then you need to be more specific about the problem you're describing because it's not obvious.
No, this is a much harder problem than you think it is; it's, like, CAP-theorem problematic. I think you should do the exercise of working through how these billing systems work.
I reiterate that you are not unique in offering a service where the customer's desire for consistent billing and your willingness to provide it are at odds.
The technical problem is not "we are operationally incapable of, say, getting someone else to underwrite insurance contracts". That's not a technical constraint, It's a business decision.
I get that the underlying issue is that your target consumer is whales who eat orders-of-magnitude pricing spreads as normal opex, and that anyone who comes in with a budget is barely a consideration. It's still absurd to pretend that pricing is too hard. Like, I'm not going to confidently assert "lol you can give up C, it'll be a rounding error anyway" but like, billing cycles are absurdly long on the scale of your technical constraints.
I don't think you've thought very carefully about this, because there are real technical problems here. If we did a cap feature, enabling it would involve disabling parts of our platform. We're just going to go back and forth on this, because I'm not going to write you an essay on this, and you're going to keep saying "it's a business decision" and I'm going to keep knowing you're wrong about that.
For the nth time on this thread: we don't make our quarterly nut billing people looking to spend $50/mo an occasional extra $1000. The people who actually pay the bills here do not want this feature.
They definitely want Accident Forgiveness, though. Which means it's going to cost us money to do this. And that's fine; they're growing, we're growing.
> If we did a cap feature, enabling it would involve disabling parts of our platform.
What I'm saying is that capping billing is not the same thing as shutting down parts of the platform. You're redirecting complaints about the former by focusing on the difficulty of doing the latter in real time, which is granted, but gross overkill.
The former is something that is not only doable, but something that already happens regularly in an informal capacity and nobody believes you don't have the data to price it.
Ok, so you think unintended usage should be costed out by the provider. Sure, T&S and support definitely have a handle on the topic. Now what? Because _today_ it's already baked into the P&L and pricing. You want the provider to give it to you as a line item that you don't control? Or to do an actuarial evaluation of your footgun propensity to charge you more or less? Why? I'm totally missing what solution you're suggesting, the problem it solves, and why the provider _and revenue-generating customers_ care.
> If you're really only looking to spend $50, we should put our cards on table and say that we're generally not making product pricing decisions with you in mind.
I think that is what gave off that vibe. I was on a read-only phone in bed when I saw the quoted message, then got up and logged into the PC to think about what to say. It may be time for cloud providers to dissuade small users instead.
He's telling you the reality. Cloud providers don't want to _dissuade_ hobbyist users. But that's not their target market, and they will make business decisions based on the revenue-generating customer profile. That's just being frank and honest.
Major cloud-style companies don't drive significant revenue from your $50/mo cohort. And a $5/mo dev account is basically a courtesy for the sales pipeline. The vast, vast majority of revenue is “enterprise” sales with private pricing and spend in the hundreds of thousands to millions range.
I understand all that; and that's not really the kind of information I'm looking for. (I know deeply that metadata is often more expensive and complicated to process than the data your customer cares about.)
> I don't think anyone with a serious app running on us will use a cap. Just stay fixated on this scenario: a deploy-only token gets stolen, and the attacker (like most cloud attackers) uses it to stand up a bunch of Monero miners. As a consequence... their main app goes down? Who would be OK with that?
Reading between the lines: If a customer's utilization suddenly spikes, the assumption is that the customer's revenue will follow. IE, if my utilization goes up 10x, my revenue will go up 10x, so I'll happily pay the bill.
What they are providing is more like an insurance policy against hacking.
It "seems" simple, but people have really complicated needs, and the simplest answer if you don't have the time to build and maintain something to cover enough of the state space to make enough people happy is to not support it.
Some people are going to want you to go to 1Mb/s if you blow your bandwidth limits, or cash spend. Some might want you to go to 10 connections at once. Some might want you to just disable networking entirely.
And what happens to your storage when you blow your billing? Or CPU time?
It all becomes a hairy mess of state machines and companies wanting precisely _their_ requirements met, so you try to offer as much as you need to still have a compelling offering that enough people want to use, and no more.
Probably at best you could provide an API to make it easy for customers to build their own state machine, but that's fraught because then the customer will still blame you even if their own code did the wrong thing.
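The "hairy mess" can be sketched as a per-customer policy table. Everything here is illustrative, not a real provider's feature set; the policy names and effects come straight from the hypothetical requirements above:

```python
from enum import Enum

class CapPolicy(Enum):
    """Each customer wants a different answer to 'what happens at the cap'."""
    THROTTLE_BANDWIDTH = "throttle_bandwidth"   # e.g. drop to 1 Mb/s
    LIMIT_CONNECTIONS = "limit_connections"     # e.g. 10 concurrent connections
    DISABLE_NETWORK = "disable_network"         # cut networking entirely

def on_cap_reached(policy: CapPolicy) -> dict:
    """Translate a customer's chosen policy into resource-limit actions.
    The action dicts are placeholders for real control-plane operations."""
    if policy is CapPolicy.THROTTLE_BANDWIDTH:
        return {"egress_mbps": 1}
    if policy is CapPolicy.LIMIT_CONNECTIONS:
        return {"max_connections": 10}
    return {"network": "off"}

assert on_cap_reached(CapPolicy.THROTTLE_BANDWIDTH) == {"egress_mbps": 1}
```

Even this tiny dispatcher hints at the problem: each branch implies different enforcement machinery in the data plane, and it says nothing yet about storage or CPU, which the comment above points out need their own answers.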
Sinclair effect in action: "It is difficult to get a man to understand something, when his salary depends on his not understanding it."
Obviously, cloud providers are capable of billing. It's not unreasonable to expect them to offer, at minimum, "Look, we're not evaluating billing continuously but we might, at our option, try to shut down your services after $X and won't bill you excess of $Y" as a (optional) hard contractual term. Having a larger, well-capitalized party absorb risk on behalf of a smaller party for a fee is a business model humans know how to price. It's just not the desired "product".
Note that this doesn't even per se require any technical artifact to implement, just very primitive metering and lush margins.
Serious question: have you thought through what billing for bandwidth, storage, compute, and services actually entails? I'm having a bit of a reaction to the word "obviously" there.
There are so many Next.js apps. Please make sure the Next.js deployment on Fly.io is smooth as you'll inevitably get a huge exodus (including me) if/when too many people pay high prices on Vercel.
It's important that all utility metering and pricing is effectively made up to balance costs across customers.
That's why residents in California probably pay 100,000x more per unit of water than agricultural customers do.
The same works for the cloud. An operation that costs $100000 does not cost the operator $85000 in operations on the margins.
Yes the metering and pricing schedules are necessary, but punitive pricing for accidents really is just an artifact of the system.
I'm guessing the biggest reason AWS provides forgiveness is that if anyone took them to court over a bill, a court would throw out the charge once the wholesale cost was revealed.
Thomas, every time I read something you write it's a delight. I love your writing style, and it reads to me like you're always putting effort into making it succinct yet unambiguous, without unnecessary embellishment.
I trust them to handle even edge cases more generously after this announcement. Just a gut feeling, as if they would otherwise risk losing significant credibility. I've written elsewhere that I'm not a big fan of those over-marketing posts, but they are small and more trustworthy than any of the large providers.
The sentence he's referring to says that we might tell you after you've, uh, aggressively asked us to forgive a series of bills, that future accidents are no longer on the house. So no, it's not the same as before.
I say that because I’ve had extensive GPU bills credited or forgiven before.
Presumably there was already a point where you were going to say “you keep messing up with the auto shut-off, at some point we’re going to stop refunding you”
When I was in high school, my friends and I were doing a small project. My friend accidentally ran Google's image recognition service in a while loop for a whole minute and racked up a $300 bill. Thankfully GCP waived it for us.
I ran up a $4,000 bill with like a week of TPU time because I didn’t realize that they kept running after the attached CPU turned off (unlike GPU offerings).
GCP also waived the whole thing - even refunding the $4k they had already pulled from my payment method.
This is why I love Hetzner. It has billing alerts. What was so difficult about it that fly had to rebuild their system?
Also, run everything except databases inside Kubernetes. Deploy Kubernetes across multiple clouds via WireGuard. Label your instances properly on each provider. Prefer bare-metal instances where available. Migrate your workloads accordingly. Force cloud providers to earn your money. Don’t, however, run the same workloads in multiple providers at the same time, as you will eat insane data-transfer costs.
I can't read the tone, but perhaps you haven't seen much technical debt. It's possible to have a fairly good codebase, full of good decisions, yet you still can't "get there from here".
Heroku is not Fly.io. Heroku was bought by the MBAs a while back, and they do something different than generic cloud hosting (they are closer to a hosted EBS). If you bother scrolling at all, you'll notice immediately below the fold that the pricing is not remotely as clear as you're pretending it is.
Just run Linux machines on IONOS with unlimited traffic and get on with the job, instead of worrying about silly things like how much traffic passed over some internal software system and whether it will bankrupt the company.
Same but on Hetzner. Incredibly beefy machines. Unlimited traffic. No fear of any sudden changes that I explicitly didn't act upon. I do not know why people fall for the shiny marketing so easily, that too hackers? Is it all the Javascript folks? Vercel I know is propped up by them for sure.
I am never going to get tired of telling people that if they can solve their problems running on Hetzner, they should do that. It's not like a dirty secret of our industry. By all means, deploy with VPSs! Or on bare metal servers!
We are incredibly backlogged on the blog. I've got 3000 words on Tigris rolling out next week, we've got the results of a $300K 3rd party GPU audit to write up, we wrote a replacement to Vault called Pet Semetary, we need to write up the billing system because, as you can see from this thread, people appear to think that's not a hard technical problem. I can imagine getting around to writing a post about when a VPS is a good alternative to launching a container on Fly.io, but it'll be aspirational; I doubt I'll ever get around to it.
I mentioned this as well in another comment to the parent. Hopefully it catches on.
"Hey, SMB? You probably don't need our services. While AWS, GC, etc. would be happy to take your money anyhow, [...]" I dunno. Obviously any sort of thing like this has to clear all the departments, because I imagine it increases support load.
No SMB is OK with us zapping their services because some screwed up deployment or stolen token exceeded a billing cap. You're not talking about SMBs, you're talking about people deploying random personal stuff. We love those people, we're happy to have them here, but we do not price the product specifically for them.
I get that this is a lot of venting about people's issues with cloud providers writ large, but damn.
I said SMB because SMBs don't need cloud; if you have enough users to warrant cloud, you're no longer S or M. I was actually going to start to concede that you might be right, and that maybe I should have said small-time users or something more eloquent. But no, I'll stand behind what I said: dissuade all SMBs and small users as much as possible. Then there's a 0% chance they get a surprise cloud bill. I solved the problem, yay.
As I mentioned elsewhere, I'm intimately familiar with pretty much every detail of "cloud", from hardware, software, network, and cooling to ops (I wouldn't call me a dev; I don't think anyone else would or should, either). I've bootstrapped cloud services from empty racks twice and repurposed existing hardware for cloud once.
I understand why there's no "cap" available on any cloud services. I mostly have a problem with capitalism, which is ironic, considering this site.
Y'all really, really want caps. We think that's a crazy feature. We don't want to do it. But I just had this conversation on a Slack with a bunch of friends at other companies and they were just as yell-y about this. There's a threshold of feedback at which I think you could get us to do some kind of cap thingo. I'm just saying.
I think hobbyists 100% want caps, businesses don't know what they want (but would be happier without caps if they could experience the counterfactual) and former hobbyists hired to work for businesses will put caps on if they are available.
Okay, so now that it's inevitable you add caps, can we wager on how long between rollout and the first front page "I told fly I wanted a service cap, but not like this!" post?
People keep implying a cap means “break/delete everything.” Not necessarily so?
It would make a lot more sense to throttle service, say 80% instead like ISPs do. Slow bandwidth/fewer cpus would raise flags without necessarily breaking anything.
I've said it over and over again: if you have a serious app, and someone somehow steals a credential from you and uses it to light up a bunch of crypto miners, you don't want us shutting down your main app in response. We perceive it mostly as a feature that will blow people up.
We agree about the underlying problem! You don't want to spend $5000 in a month for services you never wanted. We don't want you spending that either. We'd rather just improve our billing so that you can fix this after the fact without trading off availability.
But I have just planted a flag: we are prepared to be wrong about this. I don't think we are, but like, I'm the only one. :)
I'm pretty sure I'm fairly far from your target customer, given that I'm not building products, and right now my only usage of fly.io is to host a tiny app that expects to be used a few times per year by my spouse and me.
Having said that, the reason why I personally would be happier with the ability to set a hard cap is that if I'm going to put a project on fly.io, I'm spending my own personal money on it. If I'm spending my own personal money on it, I want a guarantee that it cannot possibly cost more than a given amount. When it's my own money on the line, I absolutely want the service to shut down rather than have even 0.01% chance of costing me a lot of money.
The moment I'm actually building something for a business, the moment it's company money instead of personal money, then priorities change and everything you're saying makes sense. But as long as I'm just a single developer playing with stuff and billing it to my personal credit card, I want that guarantee that I won't accidentally make myself go broke.
Giving people the ability to use caps as an optional feature doesn’t force anyone to use it.
Also, you could turn things off in the inverse order they were turned on until the cap is satisfied. So all the crypto miner instances would be turned off before database backups being deleted.
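That "inverse order" idea is easy to sketch. A minimal, hypothetical version (resource names and hourly rates are made up), assuming resources are tracked in creation order:

```python
def lifo_shutdown(resources, cap):
    """Stop the newest billable resources first until projected hourly spend
    fits under the cap.

    resources: list of (name, hourly_cost) tuples in creation order.
    Returns the names to stop, newest first."""
    stopped = []
    total = sum(cost for _, cost in resources)
    for name, cost in reversed(resources):
        if total <= cap:
            break
        stopped.append(name)
        total -= cost
    return stopped

# The stolen-token scenario: long-lived db/web survive, the freshly
# created miners (the newest resources) are the first to go.
resources = [("db", 0.5), ("web", 0.2), ("miner-1", 3.0), ("miner-2", 3.0)]
assert lifo_shutdown(resources, cap=1.0) == ["miner-2", "miner-1"]
```

This directly addresses the stolen-credential objection upthread: under a LIFO rule, the attacker's crypto miners get killed before the main app, because they were created last.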
I'll be honest, I think you've got an uphill battle to convince so many of us that are concerned about accidental pricing that this is exactly the thing we actually want, vs. what we feel we want. This position parrots what Vercel's leadership said when someone had a massive surprise bill, and the story made the rounds both here and elsewhere.
To be super honest, you might be right too. You might go down a huge engineering effort to build this in, only for 5% of your customers to ever engage it. I think the real question is what percentage of your customers will feel better knowing that they have the choice to set those limits, and how that comfort will actually improve their trust in Fly and cause them to choose it over another cloud provider.
It may end up being a lot like this feature we had in a platform that I used to support. It had a just-in-time analytics pipeline that at one point required tens of thousands of dollars in compute, storage and network hardware alone to function. Based on our analytics, it was barely used compared to the usage of the rest of our app, which made the zounds of resources and fairly frequent support attention it needed feel silly in comparison, so I advocated to sunset it. Product assured me that, regardless of how silly it might be to continue supporting this feature, it was a dealmaker, and losing it would be a dealbreaker.
So yeah, y'all might be right in that the majority of your customers don't actually want it. But maybe what they do need is to know that it's there, ready for them if they ever need to engage with it.
> You might go down a huge engineering effort to build this
This is an overlooked issue: billing caps are hard to implement and will likely incur losses for the cloud company that does.
Take an object storage service as an example. Imagine Company X has a hard cap at US$1000, but some bug makes their software upload millions of files to a bucket and rack up their bill. Since storage is charged per GB-month, they will not reach the cap until some time later that month. Then, when they do, what does termination of service mean? Does the cloud provider destroy every last resource associated with the account the second the hard cap is reached? If they don't, and they still have to store those files somewhere in their infra, then they'll start taking a loss while Company X just says "oops, sorry".
That's what tptacek is talking about: you want to NOT destroy the customers' resources because they can quickly figure out that something went wrong and then adjust while still maintaining service. But the longer you keep the resources the more you're paying out of pocket as a cloud provider. If you can't bill the overages to the customer, which a hard cap would imply, then you're at a loss. Reclaiming every resource associated to an account the moment a cap is reached is an extreme measure no one wants.
A hard cap then becomes only a "soft" cap, a mere suggestion, and cloud providers would then say "you hit the cap, but we had to keep your resources on the books for 12 hours, so here's the supplemental overage charge". Which would probably lead to just as many charge disputes as we have today.
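The timing problem described above is easy to quantify. A rough model (the per-GB rate is an assumption loosely based on S3-standard-like pricing, not any provider's actual rate): because GB-month charges accrue hour by hour, a burst upload on day one doesn't trip a dollar cap until weeks later.

```python
# Rough model of why a storage cap only trips after the fact:
# GB-month charges accrue hourly, so a burst upload on day 1
# takes weeks to reach a dollar cap.

PRICE_PER_GB_MONTH = 0.023   # assumed S3-standard-like rate, $/GB-month
HOURS_PER_MONTH = 730

def hours_until_cap(stored_gb, cap_dollars):
    per_hour = stored_gb * PRICE_PER_GB_MONTH / HOURS_PER_MONTH
    return cap_dollars / per_hour

# A runaway job uploads 100 TB on day one; the hard cap is $1000.
h = hours_until_cap(100_000, 1000)
print(f"cap reached after ~{h / 24:.0f} days")  # ~13 days
```

And once the cap trips, those 100 TB are still sitting on the provider's disks, which is exactly the "who eats the residual storage cost" problem the comment raises.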
$1000/mo in S3 Glacier Deep Archive storage buys you a petabyte (a million gigabytes) of storage. It's hard to imagine such a small customer uploading a petabyte without noticing, and part of what happens when you hit the cap could be moving things from normal object storage to Glacier.
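A quick back-of-envelope check of that petabyte figure, assuming a Deep-Archive-class rate of roughly $0.00099 per GB-month (an assumption; actual pricing varies by region and changes over time):

```python
# Sanity check: how much archival storage does $1000/month buy
# at an assumed ~$0.00099 per GB-month rate?
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099

gb = 1000 / DEEP_ARCHIVE_PER_GB_MONTH
print(f"$1000/mo buys ~{gb / 1e6:.1f} PB")  # ~1.0 PB
```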
If you turn off servers and shut off bandwidth you get rid of the vast majority of expenses.
Storage fees are a lot less risk, but if you want to cap those then you should cap the number of gigabytes directly. That prevents the overage issues you describe.
I think your position is essentially that you mostly want to serve people who have serious apps, for whom the cap idea doesn’t work. And you definitely know more about the general trajectory of your customers than all of us sitting here randomly speculating. But, is it really so uncommon for an application to transition from unserious, I want to cap, to serious?
I guess I’m slightly confused because I thought one of the nice things about cloud services is that they give you the ability to fit your infrastructure to your size while you are trying things. If I’m still trying things, I might not even know if I’m serious yet, right?
Sure, but bear in mind that across the entire industry of public clouds, which has been around for something like 2 decades now, nobody really has this feature. In fact, didn't Google have this feature and then pull it?
I've definitely wanted spending caps on a lot of my projects. The ones I've tried, AWS and GCP, made this ridiculously difficult. It might be that they will waive the bill, but the general feeling that they get to decide whether I go bankrupt or not is nightmare fuel.
In the end, I've just set up a $5 VPS where I self-host all my apps. That removes all the stress.
Yeah, so one issue we run into here is that we're not trying to get you to stop using $5 VPS's. That is not good business for us! There is nothing at all wrong with managing your own servers.
Yes, clearly it is not the convention to provide this feature. But the chain of thought seems so straightforward and obvious to me: cloud infrastructure is great for exploring what you'll be doing, and guardrails make you more willing to explore fearlessly. I'm obviously missing something, though!
It’s also easy to see why the convention would be to not have it. It earns the provider more money. After all, not every customer will ask for forgiveness for a surprise bill.
This is not why we don't have caps. Maybe it's why AWS doesn't. This simply isn't how we make our money. Our top line is dominated by companies growing businesses with us. Getting a hobbyist to cough up an extra $100 doesn't move a single dial for us.
When I was 15 years old, I was walking the mean streets of Chicago after attending an opera with my parents when a billing cap leapt from the alley and snatched my mom's purse. My dad fought it valiantly, but it prevailed, killing both my parents in the process. From that day forward...
That's why my phone SIM is still not a subscription: I manually buy credit each month. The price is even lower than a subscription and the benefits exceed it, with shitloads of GB and minutes. If I use roaming I'll pay for a more expensive credit that month, but I have peace of mind that I won't be charged thousands of euros at the end of the month.
Like telecom providers, cloud providers could of course offer metered service billed under a "get only as much as you paid for" policy, but it's obvious they simply, and in bad faith, don't want to.
Whilst it doesn't have the mindshare or some features that are table stakes for enterprise customers, I've found Linode's pricing to be extremely predictable for small projects, even post Akamai acquisition. I'm sure other smaller players are also just fine.
Couple that with the fact that you can achieve a lot on simple set-ups that are adequately sized to begin with, and you can save quite a bit.
Not all of us need elasticity, or environments being spun up/down on commit.
Isn't it easier to set hard limits on the account and notify the client when they're close to them so they can decide to increase them on their own if needed?
But then you run into the risk that the client will never increase those hard limits and ... pay less than they could. Not good for business.
> Meanwhile: like every public cloud, we provision our own hardware, and we have excess capacity. Your messed-up CI/CD jobs didn’t really cost us anything
This^ Don't feel bad about asking for credits when you accidentally make a costly mistake.
I've never understood why providers don't just give you the option of setting a cost cap: "If this piece of infra exceeds [5x expected usage cost], shut it down and send me an email"
You can say this about a whole bunch of new features people roll out, but that doesn't erase the fact that they are very much not table stakes yet, and introducing them is great because it raises the stakes!
> " You probably can’t tell me how much electricity your home is using right now, and may only come within tens of dollars of accurately predicting your water bill. But neither of those bills are all that scary, because you assume there’s a limit to how much you could run them up in a single billing interval. "
I had a $600 surprise water bill. It was (partially) forgiven because the water department could drive to my house and see evidence of the leak next to my water meter. It did turn out to be on my side of the meter, so it is my responsibility.
If the water department had driven to my house and seen evidence of commercial agriculture (so to speak), then it would not have been forgiven.
---
The parallel here is that the water department can't come into my house uninvited - the cloud provider SHOULD NOT have intimate access to the running code, but they are able to observe some patterns without 'breaking in'.
---
Side note: the size of the bill and the amount of forgiveness was largely driven by waiving an 'excess usage' surcharge - similar to how you can get a discount for cloud service reservations.
Water billing is indeed nonlinear with no cap. I had a surprise $3000 excess water bill due to a broken connection at the meter. Followed by a $3000 excess sewer bill (!) because the assumption is if I used $3000 of water, it must have gone down the drain. However, if you demonstrate to an inspector that the water must have leaked below ground (as in this case) there’s a process for getting the second charge waived. Unfortunately the leak was on our side of the meter, so the first charge was correct, though we got partial forgiveness.
So, when they read the meter, and saw that it was flooded at the meter or nearby, they didn't think to mention that to you? I ask because you said "surprise bill" - another comment i made explains my neighbor's water meters have a light on them that lights up if there's a leak, which the meter reader can see and the homeowner usually cannot - the only reason to not notify the homeowner of an issue is because of $.
If I recall correctly, we discovered and fixed the leak because the ground around the meter had become a mud pit, then the meter was read some time later. The size of the bill was still a surprise.
There's a note on the draft of this post with like 6 comments from people on our team about how the water bill is a bad example. I stand by what I wrote!
That’s the problem with irrigation. You get charged for the sewage cost of that much water even though it doesn’t end up there. Unless you’re a farmer.
my well only costs me electricity and has no meter - well, i have a meter somewhere, i just never installed it. I know for a fact it uses 9A to run, so it would cost me 30¢ or so, all-in, to run it for an hour. it's somewhere around 10-12GPM, so 600-700 gallons for 30c
I've wanted to put it on solar the entire time i've had it but the start current is 18A and an inverter that can handle that is (or maybe was, idk) real expensive, considering.
anyhow if you saw the area around my house, in the dog days of summer, right now, you'd think it had been raining non-stop all summer - in fact, it hasn't really rained at all since the beginning of june when we had 12" - i've personally watered 6+" over about 1/3rd of an acre, so like 50,000 gallons since then.
edit: oh crap i forgot the reason i wanted to comment on this at all! My neighbor recently had some issues with a pipe running along the bottom of his house that sprung a leak and they got a surprise $600 water bill - no forgiveness. the water company did, in fact, have the audacity to tell my neighbor "there's a light on the meter if you have a leak." His meter is like a quarter mile from his house, first of all. and second of all, when they read the meter and saw the light, why didn't they go across the street and knock on his door, or make a note to call/mail?
Inverters have gotten a lot more capable and quite cheap; you should take a second look. Also look up "soft starters", they're devices designed to reduce start current on motors. They get mostly used for solar and generator A/C, but would work with a well pump, and might put it in the range of your current inverter.
Speaking as someone who has had to run a gasoline generator to pump well water and is about to move it to solar, with great rejoicing.
!! oh wow, they actually support the full open-circuit voltage that my panels can be series-linked to reach! the only thing i saw many years ago was like... the Powerwall!
Not only that, it can run every panel i currently own all by itself. that is a leap forward compared to the last time i checked (a decade ago or so). back then it was depressing to want a separate solar setup: you need the grid-tie stuff so you can run your house if the power goes out, but i mostly wanted the solar to manage water, lights, and if possible the small window/wall aircons that i use to keep servers cool.
I'm suffering from this now. They just erroneously charged me almost $4,000 out of nowhere because they decided to scan all of our old container images for no reason and without notification. I have a support partner and everybody is just passing the buck around; it's super frustrating.
Being at the level where you have an account manager should help. Our technical account manager always said that if we ever accidentally rack up a bunch of cloud costs we should at least ask if we can get a refund, and as long as it's infrequent there's a good chance of getting some credits.
In other words, they are socializing the costs. Servers and electricity aren't free. Wouldn't it be better for the customer if they had no accident forgiveness and passed those cost savings along? Instead, you are paying for other peoples mistakes plus all the extra overhead caused from fraud that they incentivized.
We have to invest in fighting fraud and abuse anyway, such is the public cloud business. We don't intend to diminish user experience in service of fighting it.
One of the things I really admire about Fly is "how human they come across". Fly is a collection of people providing a computing service to other people. As a fellow human, there is a very primitive part of my brain that will always prefer "tech from humans" over "tech".
Given that the very teleology of cloud computing is digital & ephemeral resources, having an actual human face associated with it makes it tangible in a way that is hard to place.
My little brain can briefly understand complex computer systems on a need-to-know basis, with knowledge constantly coming and going based on the demands of the day. I can never fully "understand" my cloud infrastructure. But hey, Kurt's standing there. I like Kurt. I can understand Kurt as he's a human and so am I and he's in the same exact place I am. Let's work hard and make cool stuff, what do ya say Kurt?
I should add that another way Fly distinguishes itself as a "tangible" cloud company is that their blog art is all based on physical mediums. It feels very familiar and quantifiable to my brain.
No matter what, you'll have some fraud. Perhaps a better solution is giving the customer a knob they can turn that sets max limits, plus alerts when usage is abnormal.
I have absolutely no idea how much I pay for water, but vaguely think it’s probably between $0 and $100. Why does knowing it’s small mean you can predict it precisely, in a world where auto-payment exists so you never have to actually look at your bills?
This thread is full of people complaining that they just want the ability to set a hard billing cap. And yet, providers continue to insist that no serious customer wants this and/or it's impossible to implement.
What would it take for providers to listen to real customers here?
I have $25k in cloud spend that we absolutely cannot go one single cent over due to the politics of internal budgeting. That's my reality. If you want my $25k, I need to ensure that I don't spend more than this amount.
As is, my solution is to use old-school pre-rented, long-term contracted commitment VM hosting providers. This is really the only way to guarantee that you are paying an exact amount and no more.
But, I really would like to use a more scalable system that didn't require pre-provisioning. And, I wish people would believe the customer when they say something and not continue to gaslight us.
Providers say it's impossible, but I don't see how it would be so hard. Here's my sketch of how it could work:
The main component is a system that monitors billing events and watches for the slope of the bill to ensure that there is enough runway to stay under the cap. Optionally, they could implement rate limits on resource creation to ensure that a sudden surge doesn't outrun the monitoring.
You also need notifications for when the projected spend exceeds the cap. Optionally, you could implement a soft cap, where no new resources can be created.
And finally, you need the hard cap where things start to get deleted. If they're feeling generous, the provider could implement a grace period where VMs/lambdas/etc. are shut off, blob storage is made inaccessible, and so on, so that the account holder has some time to fund the account and/or fix whatever is causing the overage.
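The monitoring-then-escalation pipeline sketched above can be expressed as a simple decision function. This is only an illustration of the commenter's proposal under stated assumptions (the thresholds, function names, and the linear burn-rate projection are all made up, not any provider's API):

```python
# Sketch of the proposed cap pipeline: project end-of-month spend
# from the current burn rate, then escalate notify -> soft cap -> hard cap.

HOURS_PER_MONTH = 730

def projected_spend(spent_so_far, hourly_rate, hours_elapsed):
    # Naive linear projection: assume the current burn rate holds
    # for the rest of the billing month.
    return spent_so_far + hourly_rate * (HOURS_PER_MONTH - hours_elapsed)

def cap_action(spent, rate, hours, cap):
    projection = projected_spend(spent, rate, hours)
    if spent >= cap:
        return "hard_cap"   # stop compute, freeze resource access
    if projection >= cap:
        return "soft_cap"   # block creation of new resources
    if projection >= 0.8 * cap:
        return "notify"     # email: on track to exceed the cap
    return "ok"

print(cap_action(spent=100, rate=0.5, hours=200, cap=1000))   # ok
print(cap_action(spent=400, rate=2.0, hours=300, cap=1000))   # soft_cap
```

In a real billing system the hard part isn't this loop; it's that "spent" and "rate" arrive as eventually-consistent billing events with lag, which is why the rate limits on resource creation mentioned above matter.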
That set of features is all well within the competency of a cloud provider. Knowing how much things cost, billing for them, and turning them on and off is their main business. And I can't believe that they expect us to believe that it's impossible to do that tracking.
This is how I do system migrations. There are escalating warnings until one day the service is shut off for an hour and turned back on. That wakes up any laggards that missed the dozen or so communications over a three-to-six-month period. Finally, after a few more days, the service is shut down for good, and the data is deleted a week after that. Though I almost always keep a copy in cold storage, that isn't necessary as a provider with a limited relationship to the client.
Nobody's gaslighting you. It's not impossible to build this, though it is much more difficult than it seems (cloud billing is a large-scale eventually-consistent distributed system, and if you've done any distributed systems work the issue with plugging a system like that directly into a control loop should be obvious). It's just expensive to build, and disproportionately serves the interests of customers who aren't running real apps.
At $25k/mo spend, you can talk to many cloud providers (certainly including us) to work out an "I can't get invoiced for more than $25k" solution which will not involve having your app turned off abruptly when the 2,500,001st cent gets spent in October. What you'll notice in this thread is that people generally want billing caps for accounts they plan to spend, like, $10/mo on. And we can build cap systems for those people --- but they'll involve turning parts of the platform off for them.
I'm really pleading with people with strong opinions about hard caps to do the exercise of working out how these billing systems work. There are a huge number of apps running here, across a huge fleet of physicals, running in almost 40 regions around the world. Each of those apps has several different kinds of resources that meter at different granularities and incur different costs. Speaking as a witness to the creation of a new billing system just a month or two ago: it is kind of a miracle that these things work at all. Do the thought experiment, read some Call Me Maybe posts, and then tell me it's obvious that this feature should be straightforward to build.
It seems I've committed the cardinal sin of failing to specify my units. So to clarify, it's $25k per year.
But my mistake that aside, I appreciate your reply. I saw in another comment thread that you added that you have a blog post in the pipeline on building the new billing system. So I'm really looking forward to reading that. I've enjoyed reading other billing/payments content. And I'm sure your post will also be insightful and highly detailed.
I think it's so so fascinating that this feature is consistently solved at the contract/legal and customer support "layer" of the stack. That's really unpleasant to me, because it is a lot harder for me to wrap my head around the specifics of how different edge cases will play out.
Like, as a programmer, I've built up all these skills for reading technical documentation and understanding systems, their limits, and their complex interactions. But instead of using that muscle memory, I have to try to talk with a human and deal with the seemingly intentional vagueness of the legal system.
It feels to me a lot like the story from Mitchell Hashimoto about dealing with the bank for his startup, where he was dodging calls from his account executive and generally behaving in a way that the bank is not used to from enterprise clients. [0]
I'm ready to admit my behavior is anti-social and irrational here. But, it is what comes natural to me.
This is meandering now. But, I just want to sneak in a bit more info on my use cases, since you also mentioned that people are wearing you down and you just might implement this if forced to.
I build and run internal tools (think CRUD & reporting/analytics) for a small department in an extremely large enterprise. Our stuff is on the order of 99.9% available, so not particularly great but not terrible. But others are extremely bad. For example, one vendor has over 36 hours of scheduled downtime per month. And that system is way more critical to the business than mine.
So the standard that my coworkers in the department have come to expect is very low. If the tool is down, they just continue with their day doing some other task.
Many of the systems I manage are also purely background jobs. And no one would even notice if they were down for 12-24 hours.
Lastly, we have external backups for everything (on a different provider) and every system's deployment is automated from creating the VM's, networks, and block storage all the way through to installing system dependencies, the app, and data.
So, if a system were to magically get deleted some day, I'd get paged and have it back up in about an hour. And this is totally fine for our business.
On the other side though, there will be dire consequences for my career if we go over $25k annual spend. Even if the bill arrives and we have to contact support, it will give my management a heart attack and they will absolutely remember come review time.
Given this environment, I'd really appreciate the ability to protect myself against misconfiguration or leaked keys causing me to get possibly fired. The data will be fine. And systems can be restored quickly. But the damage to my reputation, compensation, and future job can't be restored quickly.