
This is more just "missed optimization opportunities in EC2" than a statement about mistakes in AWS as a whole.

If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours. You can accidentally create this issue across lots of different AWS services if you don't verify you haven't created any loops between resources and don't configure scaling limitations where available. "Infinite" scaling is great until you do it when you didn't mean to.

That being said, I think AWS (can't speak for other big providers) does offer a lot of value compared to bare metal and self-hosting. Their paradigms for things like VPCs, load balancing, and permissions management are something you end up recreating in almost every project anyway, so you might as well railroad that configuration process. I've seen companies that ran their own infrastructure make things like DB backups and upgrades painful enough that it would be hard to go back to a non-managed DB service like RDS for anything other than a personal project.

After so many years using AWS at work, I'd never consider anything besides Fargate or Lambda for compute solutions, except maybe Batch if you can't fit scheduled processes into Lambda's time/resource limitations. If you're just going to run VMs on EC2, you're better off with other providers that focus on simple VM hosting.




> we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours

May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

Btw, this reminds me a lot of one of my own early career screw-ups, where I had a batch job uploading images that was set up with unlimited retries. It failed halfway through, and the unlimited retries caused it to upload the same three images 100,000 times each. We emailed Cloudinary, the image CDN we were using, and they graciously forgave the costs we had incurred for my mistake.


> May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

AWS support caught it before we did, so they did something on their end to throttle the Lambda invocations. We asked for billing forgiveness from them; last I heard that negotiation was still ongoing over a year after it occurred.

Part of the problem was we had temporarily disabled our billing alarms at the time for some reason, which caused our team to miss this spike. We've enabled alerts on both billing and Lambda invocation counts to see if either go outside of normal thresholds. It still doesn't hard-stop this from occurring again, but we at least get proactively notified about it before it gets as bad as it did. I don't think we've ever found a solution to cut off resource usage if something like this is detected.
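
For reference, the invocation-count alert is just a CloudWatch alarm on the Lambda Invocations metric. A rough boto3 sketch of the kind of thing we set up (the function name, threshold, and SNS topic here are invented placeholders, not our actual setup):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when a single function's invocation count leaves its normal range.
    cloudwatch.put_metric_alarm(
        AlarmName="orders-handler-invocation-spike",   # hypothetical name
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": "orders-handler"}],
        Statistic="Sum",
        Period=300,                  # 5-minute buckets
        EvaluationPeriods=1,
        Threshold=50000,             # tune to your own baseline
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
    )

In principle you could wire that SNS topic to a small responder that throttles the function (e.g. by dropping its reserved concurrency to 0), but that's something you have to build and test yourself.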


Earlier in the week there were threads about how AWS will never implement the kind of resource cutoff you're talking about: big companies don't want to be shut off in the middle of a traffic spike, small companies don't pay enough money to matter, and it's not like it hurts Amazon's bottom line.


We use memory-safe languages, type-safe languages. AWS is not fundamentally billing-safe.

Just to give you nightmares: there have been DDoS attacks in the news lately, and I'm surprised nobody has yet leveraged those botnets to bankrupt orgs they don't like who use cloud autoscaling services.

I don't know how you monitor it; part of the issue is the sheer complexity. How do you know what to monitor? The billing page is probably the place to start, but it is too slow for many of these events.

I guess you could start with the common problems. Keep watchdogs on the number of Lambdas being invoked, or on any resource you spin up or that has autoscaling utilization. Egress bandwidth is definitely another I'd watch.

Dunno, just seems to me you'd need to watch every metric and report any spikes to someone who can eyeball the system.
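
The slow-but-standard backstop is a CloudWatch alarm on the EstimatedCharges billing metric (you have to enable billing alerts first, it's only published in us-east-1, and it updates every few hours, so it catches the bill rather than the spike). A minimal sketch, with the threshold and topic made up:

    import boto3

    # Billing metrics only exist in us-east-1 and lag by hours.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="month-to-date-bill-over-budget",     # hypothetical name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,               # 6 hours
        EvaluationPeriods=1,
        Threshold=1000.0,           # alert once the running bill passes $1,000
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical topic
    )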

For me? I limit my exposure to AWS as much as I reasonably can. The possibilities, combined with the known nightmare scenarios and a "recourse" that isn't always effective, don't make for good sleep at night.


> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

AWS Shield Advanced actually offers DDoS cost protection to mitigate this specific risk: https://aws.amazon.com/shield/features/


> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

That’s interesting because it seems like it would happen, but what is in it for the attacker, when under threat they can implement caps?


A severe enough bill can cause an organization to be instantly bankrupt. No opportunity to try to do something like caps.

Regardless, turning on spending caps isn't a final solution to this particular attack. With caps the site/resources will hit the cap and go offline. Accomplishing what a DDoS generally tries to accomplish anyway.

The only real solution is that you have to have a cheap way to filter out the attacking requests.


Could only be an attack of spite; you can't really hold it for ransom because the IPs of the malicious traffic could be blocked, or limits set, after the initial overspend. Perhaps if the botnet was big enough.


Some people get paid to destroy competition, others just enjoy watching the world burn...


I think you're limited to 1,000 concurrent Lambda invocations by default anyway. That said, it's not easy to get an overview of what's going on in an AWS account (except through Billing, but I don't know how up to the moment that is).


I've been able to get AWS support to waive fees for a runaway Lambda that no one spotted for a few weeks - they wanted an explanation of what happened and a mitigation strategy from us, and that was it. It is still unresolved because AWS wants us to pay the bill so they can then issue a credit, but the company credit card doesn't have a high enough limit to cover the bill.


>"Racked up a several-hundred-thousand dollar bill in a couple of hours."

This is enough to rent a big server from Hetzner / OVH for basically forever and have a person looking after it, with plenty of money left.

>"I've experienced how painful companies that tried to run their own infrastructure made things like DB backups"

I run businesses on rented dedicated servers. It took me a couple of days to create a universal shell script that can build a new server from scratch and / or restore its state from backups / standby. I test this script every once in a while and so far have had zero problems. And frankly, excluding cases when I wanted to move stuff to a different server, there has not been a single time in many years when I had to use it for a real recovery.

I did deployments and managed some infrastructure on Azure / AWS for some clients and, contrary to your experience, I would never touch those with a ten-foot pole when I have a choice. Way more expensive, and actually requires way more attention than dedicated servers.

Sure, there are cases when someone needs "infinite scalability". Personally, I have yet to find a client where my C++ servers deployed on a real multicore CPU with plenty of RAM and an array of SSDs came anywhere close to being strained. Zero problems handling a sustained rate of thousands of requests per second on a mixed read / write load.


I think your last paragraph is the sales pitch for AWS. Hiring that level of expertise doesn't scale. It's easier and cheaper to hire 10x as many "developers" and pay the AWS bill than to headhunt performance gurus who understand hardware and retain them.


What expertise? My specialty is new product design. I am very far from being a hardware performance guru. I just understand the basics and don't swallow the propaganda wholesale.


Even if you're right, it's still cheaper to get a dozen dedicated servers than to get a huge pile of AWS servers.

Bad performance means you need more servers, it doesn't mean you need instant scaling.


> Bad performance means you need more servers, it doesn't mean you need instant scaling.

Or better code/a better engineering organization


Oh sure but that's been declared too expensive in this scenario.


I'm not saying it can't be done cheaper or more efficiently on simpler providers or even self-hosting, but you need the expertise and time to stand up the foundation of a secure platform yourself then. For example, AWS Secrets Manager is just there and ready to code against, as opposed to standing up a Vault service and working through all of the configuration oddities before you can even start integrating secrets management into an application. If you already have a configuration-in-a-box that you can scale up, then more power to you.
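
To make the "just there" point concrete: reading a secret from Secrets Manager is a couple of lines against an API that already exists, with access controlled by the IAM setup you already have. A small sketch (the secret name and its JSON shape are hypothetical):

    import json

    import boto3

    secrets = boto3.client("secretsmanager")

    # "prod/db-credentials" is a made-up secret name for illustration.
    response = secrets.get_secret_value(SecretId="prod/db-credentials")
    credentials = json.loads(response["SecretString"])

    print(credentials["username"])

With Vault you first have to deploy, unseal, and configure the server before an equivalent snippet is even possible.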

Your use-case of running a web service that is written in a very efficient language like C++ is not something you see too much these days. While it would be nice if most devs could pump out services built on performant tech stacks, our industry isn't doing things that way for a reason. Even high-prestige companies with loads of talented engineers only build select parts of their systems using low-level languages.


>"Your use-case of running a web service that is written in a very efficient language like C++ is not something you see too much these days"

In some places, including big ones, it is very much being used.

>"our industry isn't doing things that way for a reason"

I think the real reason is: the slower your stack, the more money you will pay to Amazon, Azure, Google, or whoever else. And through advertising, trickling down into education, and lots of other means, they make sure that this is what everybody (well, most) uses.

>"using low-level languages."

Since when is modern C++ "low level"? It is rather "any level". I compared my C++ server code with similar ones written in JS, Python, PHP, etc., and frankly, if you skip standard libraries, the C++ code can end up being actually smaller.


> > "Racked up a several-hundred-thousand dollar bill in a couple of hours."

> This is enough to rent big server from Hetzner / OVH for like forever and have person looking after it with plenty of money left.

That's not a fair comparison, as you're comparing the cost of a worst case caused by a misconfiguration under very specific circumstances with the cost it takes to operate the service without such a worst case.

If you want to avoid any possibility of generating costs like that by accident, you are of course better off with self-hosting. However, even then generating such costs is certainly possible, e.g. by accidentally leaking a database with customer data through a misconfiguration.

Without assuming such worst cases AWS Lambda can be much more cost efficient than a dedicated server, depending on the use case.

There is no silver bullet. For some use cases self hosting makes sense, for other use cases using a cloud provider is the better choice.


AWS value comes from things like:

* RDS - easy db backups

* CloudWatch - easy monitoring

* IAM - easy access control

* Systems Manager - easy fleet management and a distributed parameter store, integrated with IAM so you can hide your secrets

The list goes on.

If all you need is one server then you don’t need all of that. Things change as soon as you need 40 servers, or you have 40 people accessing 10 servers.

You can do it with open source tools. It takes time and expertise to do so. Both expertise and time are not available to most companies.


> If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours.

I did more or less the same thing, but with a 3rd party webhook. The bill almost killed my company.


> The bill almost killed my company.

You had to pay although it was a mistake?


I was able to get AWS to forgive the majority of the bill. I think that was my one time pass though.


You spent resources. Of course you have to pay.


The resource usage required to tank a small startup (that could’ve become a bigger customer later) is probably peanuts to Amazon. I’m not sure how often they do this (or whether they do it at all) but it would make business sense for them to occasionally grant “billing forgiveness” in serious situations.


Of course Amazon can sponsor those companies hoping that they'll bring more profits in the end. But that's not a guarantee, just good will, maybe depending on the mood of the support person who handles that specific case. I made a mistake with Amazon in the past which cost me $100, and I did not get a refund despite asking for it. It left a sour taste in my mouth, but whatever, my mistake, my money.


There are other comments here today by people saying they could not get forgiveness.


If you are able to share the story, what went wrong with the webhook?


It was a 3rd party resource that when updated would call a lambda via a webhook, which would then update the 3rd party resource. So it would create an infinite loop for each resource that was modified.

Dumb mistake...
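
For anyone wanting to avoid the same trap, the usual guard is making the handler recognize its own writes and bail out before calling back into the third party. A hypothetical sketch, not any particular vendor's API (the event shape and marker field are invented):

    OUR_MARKER = "updated-by-sync-lambda"   # invented metadata tag

    def enrich(resource):
        # Stand-in for whatever the Lambda actually does to the resource.
        return {**resource, "processed": True}

    def push_update(resource_id, body):
        # Stand-in for the third-party API call that re-fires the webhook.
        print(f"would update {resource_id} with {body}")

    def handler(event, context):
        resource = event["resource"]        # invented event shape

        # If the last change was one we made, acknowledge and stop; this is
        # what breaks the update -> webhook -> update cycle.
        if resource.get("metadata", {}).get("source") == OUR_MARKER:
            return {"statusCode": 200, "body": "own update, ignored"}

        updated = enrich(resource)
        updated["metadata"] = {"source": OUR_MARKER}   # tag our write so the next webhook is skipped
        push_update(resource["id"], updated)
        return {"statusCode": 200}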


At Poll Everywhere we run a few high volume SMS short codes. Somehow somebody texted one of our short codes, which replied to an Uber SMS phone number, which replied to our short code, which replied back to Uber, which replied back to our short code, which replied back to Uber, which replied back to our short code…

After a few days of this we racked up a bill that I think was in the tens of thousands of dollars in SMS fees before we noticed the problem and terminated the loop.

It’s pretty crazy the problems you’ll run into given enough time and scale.


AWS doesn't offer this now, do they?


Not that I know of.


> Racked up a several-hundred-thousand dollar bill in a couple of hours.

Not doubting you, but curious how you hit such a high figure. Can you walk through the math? Are we talking trillions of <10ms requests?


AWS is the solution looking for a problem, which happened to be modern web dev practices.


Things AWS solves for me that I've always wanted to have solved:

* Database administration

* Security best practices by default

* Updated infrastructure

* Automatic load balancing

* Trivial credentials management

* 2FA for all infra administration

* Container image repositories

* Distributed file systems

I was an old-school bare-metal UNIX systems admin 15 years ago. Each of those things, in medium to large companies, would take a full-time sysadmin to keep up to date.


Same. I used to really hate "cloud", given I was old-school devops and could do all of the things listed above with self-written management tools and a sprinkling of Ansible. However, in the past year I've seen the light, shaved off my neckbeard, and have really enjoyed the lighter shoulders that come from not having to think about problems that no longer exist.


Many companies want disaster recovery and multi-region deployments without the capital expenses required to deploy this themselves.

I don't want to have to buy hardware from a vendor, find cabinet space, negotiate peering and power agreements, deal with 3am alerts for failed NICs, or hear about someone spending hours freeing up disk space while waiting on new drives to arrive.

I want the benefit of all these things, but I'd rather pay a premium for it over time than deal with the upfront capital expenses.


The problem is that not everyone wants to self-host, not everyone wants to manage hardware, and not everyone's tech scales in an extremely predictable and easy way. We launched a new tenant that required a bunch of new EC2 instances, databases, etc. It was trivial with AWS and Terraform. With our own homegrown solution we would have had to either order that hardware and wait on it, or keep it sitting in reserve just burning cash doing nothing.


Concurrency limits are the main way to prevent such a thing, right?


I think it's the easiest and most reliable way to prevent runaway bills.
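
A minimal sketch of capping a single function with boto3 (the function name and limit are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # Cap the function at 10 concurrent executions; anything beyond that is
    # throttled instead of scaling (and billing) without bound. Setting the
    # value to 0 disables the function entirely, which doubles as a kill switch.
    lambda_client.put_function_concurrency(
        FunctionName="orders-handler",      # hypothetical function name
        ReservedConcurrentExecutions=10,
    )

It caps the damage per function; it doesn't stop two functions from looping through each other, just how fast they can do it.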



