Long-running Lambda functions are called AWS Batch. It’s a relatively unknown service but pretty decent if you need something like a GPU or long-running jobs and can tolerate a 90-second cold start.
we even made an interface in code that runs a given task on a lambda (backed by a docker image) or batch (backed by the same docker image) depending on cpu/mem/time constraints. the tasks themselves also send a message when they're done so you can just subscribe to that vs. long polling.
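A rough sketch of what that dispatch could look like, assuming boto3; the function name, job queue, and job definition below are placeholders, not the actual setup described above:

    import json
    from dataclasses import dataclass

    import boto3

    lambda_client = boto3.client("lambda")
    batch_client = boto3.client("batch")

    # Rough Lambda ceilings: 15 minutes of runtime, ~10 GB of memory.
    LAMBDA_MAX_SECONDS = 15 * 60
    LAMBDA_MAX_MEMORY_MB = 10_240

    @dataclass
    class TaskSpec:
        name: str
        payload: dict
        est_seconds: int
        memory_mb: int

    def run_task(task: TaskSpec) -> None:
        """Run on Lambda if the task fits its limits, otherwise submit to Batch."""
        if task.est_seconds < LAMBDA_MAX_SECONDS and task.memory_mb <= LAMBDA_MAX_MEMORY_MB:
            lambda_client.invoke(
                FunctionName="task-runner",        # placeholder function name
                InvocationType="Event",            # async; completion arrives via the done-message
                Payload=json.dumps(task.payload).encode(),
            )
        else:
            batch_client.submit_job(
                jobName=task.name,
                jobQueue="task-runner-queue",      # placeholder queue
                jobDefinition="task-runner:1",     # same docker image as the Lambda
                containerOverrides={"command": ["run", json.dumps(task.payload)]},
            )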
In most cases, I think you can do fine with just having a proper async interface to the data pipeline, where you trigger something and then react to a notification that it has completed.
But then you wouldn't have the constant white knuckle anxiety that you could run up a five figure bill in 30mins by accidentally misconfiguring your infra deployment.
Don’t worry! You would replace that with a new white knuckle anxiety that misconfiguring your infra deployment would exhaust your four figure cost limit, causing AWS to helpfully shut down ALL of your infra to avoid any accidental overspend.
I feel like I'm the only engineer out there who just doesn't care for all the complexity of the modern cloud - like, I just run Ubuntu VMs on ec2, I don't use all the whizbang services, I just don't care.
And this model lets me be cloud agnostic for the most part - I run data workloads on gcp, dev/build workloads on linode, I've run bare metal in some places where I needed on-prem stuff. It's all just very much simpler than every cloud's flavor of doing everything slightly differently through different apis and tooling...
At my company, I've seen too many developers trying to cram fairly complex Flask websites (providing business tools) into Lambda functions. They deploy, and then the users complain that the website, which used to be immediately available, now runs very slowly, because every request is also initializing the Flask app from scratch. That's a ridiculous amount of overhead. Tools like this belong in ECS, not Lambda. Or rewrite as a SPA and use microservices. A classic case of hammer and screw.
We have a process where we strip the text out of PDFs and shove it into Elasticsearch. The Lambda starts by counting the pages, and if it’s 250 or fewer it handles the job. If it’s larger than that, the Lambda kicks the job to a temporary EC2 instance, which takes over. Our cutoff is around 250 pages, but it’s highly dependent on text density.
It would be great if Lambda could handle running long. I’d probably even be fine if the duration was punitive, in that the longer you run over X time, the progressively more expensive it becomes. This would create a disincentive for using the service wrongly but would allow for oddball tasks.
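Roughly what that cutoff and handoff could look like, as a sketch; the page-counting library (pypdf), the launch template, and the helper names are assumptions, not the actual implementation:

    import boto3
    from pypdf import PdfReader  # assumed page-counting library

    PAGE_CUTOFF = 250  # highly dependent on text density, per above

    s3 = boto3.client("s3")
    ec2 = boto3.client("ec2")

    def handler(event, context):
        bucket, key = event["bucket"], event["key"]
        local_path = "/tmp/" + key.rsplit("/", 1)[-1]
        s3.download_file(bucket, key, local_path)

        pages = len(PdfReader(local_path).pages)
        if pages <= PAGE_CUTOFF:
            extract_and_index(local_path)  # small enough: handle it in this Lambda
        else:
            # Too big for the 15-minute window: hand off to a throwaway EC2
            # instance that runs the same extraction job and terminates itself.
            ec2.run_instances(
                LaunchTemplate={"LaunchTemplateName": "pdf-extractor"},  # placeholder template
                MinCount=1,
                MaxCount=1,
                InstanceInitiatedShutdownBehavior="terminate",
                UserData=f"#!/bin/bash\nextract-pdf s3://{bucket}/{key}\nshutdown -h now\n",
            )

    def extract_and_index(path: str) -> None:
        """Strip the text and push it into Elasticsearch (omitted here)."""
        ...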
Why not count the pages and then create a separate task for each batch of N pages? You could orchestrate the batches through a parallel Map state in a Step Function and bring the results together in the final step.
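The fan-out half of that could be as small as a Lambda that turns the page count into batch descriptors for the Map state to iterate over; the batch size and field names here are made up:

    BATCH_SIZE = 50  # N pages per task; tune to what one Lambda can chew through in time

    def plan_batches(event, context):
        """Input like {"bucket": ..., "key": ..., "pages": 612}; output is a list of
        page ranges for the Step Functions Map state to run in parallel."""
        pages = event["pages"]
        batches = [
            {"bucket": event["bucket"], "key": event["key"],
             "first_page": start, "last_page": min(start + BATCH_SIZE - 1, pages)}
            for start in range(1, pages + 1, BATCH_SIZE)
        ]
        return {"batches": batches}

    # The Map state would point ItemsPath at "$.batches", invoke the per-batch
    # extraction Lambda for each entry, and a final state would merge the text.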
The audit logging story sucks unless you give them more money to understand the data they are throwing at you. I had a problem recently that was entirely Amazon's fault and resulted in a massive increase in billing. I'm still trying to scrape together the data they want to issue a credit, but it's a pain in the ass scouring through all the event logs because of all the internal stuff (do I really need a log entry every time an AWS internal process hits another AWS internal process for data?) polluting the output.
Having just started with AWS, I would say: letting me make a Kubernetes cluster that doesn't require so many different cloud objects before it will start to function.
This isn't true. 10GB is the limit for Docker-image-backed Lambda functions. Layers count against the same 250MB unzipped cap as regular Lambda deployment packages.
A couple of weeks ago, I tried to deploy a Lambda function in Python that created Azure subnets, and the Azure client was 265MB alone. My layer creation API call failed because of this.
Lambda is insanely expensive, which is why they don’t allow long-running jobs. A 1GB allocation is $43/mo. And most Lambda users are running a single task/process per Lambda invocation.
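For what it's worth, the $43/mo figure checks out against the published on-demand duration price (roughly $0.0000166667 per GB-second, request charges excluded):

    # One GB kept busy around the clock for a 30-day month:
    price_per_gb_second = 0.0000166667
    seconds_per_month = 60 * 60 * 24 * 30
    print(1 * seconds_per_month * price_per_gb_second)  # ~43.2 -> roughly $43/mo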
The actual reason is that long running invocations run synchronous workflows, typically requiring holding threads and sockets open for the entire duration of execution.
Lambda is a complex system, and holding those sockets for long times across many services could cause resource starvation issues. You've got load balancers, data plane, control plane, tenant vms, and a whole bunch of caches, and support services that all need to be ready to roll over the lifetime of the invocation.
And you have to consider the use case of draining and patching Lambda pools. If someone is running a two-hour function and you need to take down a server that's currently holding a thread or socket for it, you have to wait for the function to complete. You can't put new load on that server, so you're using resources really inefficiently until the function completes.
I would feel better about using AWS if Amazon treated all of their employees properly, including those in the warehouses doing hard physical labour.
Again, AWS is great for small businesses just starting up, which can make a lot of use of the free tier and find the pay-per-use pricing attractive, and for massive corporations that only need one approved vendor that can service all their needs without going through the purchasing process again.
It's the middle where you will get squeezed and not get a cost effective value without a dedicated AWS guy or two.
Completely scrap CloudFormation and CDK and come up with something that requires /less/ code - not more - and applies resource changes in parallel where possible. CFn is pretty garbage - CDK just makes it more complex.
I've gone through a bunch of audits, and automated scans, and I constantly have to explain this shit, even to AWS Employees.
How it works with ALBs, which do support security groups:
You want to receive traffic on port :443, and allow it to be accessible to the world.
You have EC2 instances, and they are listening on the VPC at port :1234
So, you create:
- ALB my_alb which listens on :443, and forwards traffic to tg_traffic
- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234
- Security Group sg_alb, attached to my_alb with two rules:
- rule 1, inbound, from 0.0.0.0/0:443
- rule 2, outbound, to sg_servers:1234
- Security Group sg_servers, attached to the EC2 instances with one rule:
- rule 1, inbound from sg_alb:1234
This makes everyone happy. The rules require that traffic from the internet has to go through the ALB.
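For reference, that rule set as a boto3 sketch; the VPC id is a placeholder, and note that a new security group also comes with a default allow-all egress rule you'd revoke for rule 2 to be meaningful:

    import boto3

    ec2 = boto3.client("ec2")
    VPC_ID = "vpc-0123456789abcdef0"  # placeholder

    sg_alb = ec2.create_security_group(
        GroupName="sg_alb", VpcId=VPC_ID,
        Description="ALB: 443 in from anywhere, 1234 out to the servers")["GroupId"]
    sg_servers = ec2.create_security_group(
        GroupName="sg_servers", VpcId=VPC_ID,
        Description="Servers: 1234 in from the ALB only")["GroupId"]

    # sg_alb rule 1: inbound 0.0.0.0/0:443
    ec2.authorize_security_group_ingress(GroupId=sg_alb, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

    # sg_alb rule 2: outbound to sg_servers:1234
    ec2.authorize_security_group_egress(GroupId=sg_alb, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "UserIdGroupPairs": [{"GroupId": sg_servers}]}])

    # sg_servers rule 1: inbound from sg_alb:1234
    ec2.authorize_security_group_ingress(GroupId=sg_servers, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "UserIdGroupPairs": [{"GroupId": sg_alb}]}])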
Now how it works on a NLB, with the same scenario:
You want to receive traffic on port :443, and allow it to be accessible to the world. You have EC2 instances, and they are listening on the VPC at port :1234
However, NLBs, as mentioned, don't support security groups.
So, you create:
- NLB my_nlb which listens on :443, and forwards traffic to tg_traffic
- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234
- Security Group sg_servers, attached to the EC2 instances with one rule:
- rule 1, inbound from 0.0.0.0/0:1234 (not :443, because the NLB translates the port for you, but not the source IP)
...that's it.
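And the NLB version really is just the one group with the one wide-open rule (same placeholder style as the ALB sketch above):

    import boto3

    ec2 = boto3.client("ec2")
    VPC_ID = "vpc-0123456789abcdef0"  # placeholder

    sg_servers = ec2.create_security_group(
        GroupName="sg_servers", VpcId=VPC_ID,
        Description="Servers behind the NLB: 1234 in from anywhere")["GroupId"]

    # rule 1: inbound 0.0.0.0/0:1234 -- the NLB passes the client's source IP
    # through, so there's no load balancer address to scope this down to.
    ec2.authorize_security_group_ingress(GroupId=sg_servers, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])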
However, now every audit/automated scan of the EC2 instance & its security group is going to see that you're listening on some random port and allowing traffic from anywhere. This throws errors/alerts all the time. Even AWS's automated scans are throwing these alerts.
When it's an auditor you have to take the time to explain that, no, that's how NLBs work. For automated scans, you have to just ignore the warnings/errors constantly.
If your instance has no public IP associated, then at least only that port is exposed, and traffic does have to go through the NLB.
If for some reason the instance does have a public IP associated, then anyone who can reach the public IP can bypass your NLB.
If you could have a SG attached, then you could force the traffic to go via the NLB and not come direct to the instance.
I don’t think that’s right. You wouldn’t use 0.0.0.0/0; you would use the private subnet range, so they have no direct access to the internet, and use a NAT gateway to give them access to the outside world but no inbound access.
Edit: in the first example the servers should be in a private subnet and not have a public IP allocated. They would require SSH hopping via a bastion or a VPN.
If your traffic is coming through a NLB, then the instances attached to the target group need to have 0.0.0.0/0 as the permitted source. (Assuming you want traffic from anywhere)
If you don't believe me, try setting it up yourself.
As for instances being on a private VPC/non-public IPs, that's deployment specific.
In any case, everything then complains about listening on strange ports with 0.0.0.0/0
I tried it out, and I don’t know why, but it feels complicated to me. Setting everything up with an application load balancer, public and private subnets, and security groups all makes perfect sense. But the network load balancer confuses me.
But adding it to my list of things to learn and understand better.
yep, it just works differently from everything else.
Once I realised that, from a network traffic perspective, AWS wants you to pretend it's just a magic straw delivering traffic from your source to your targets, basically invisible to everything, it started to click more for me.
That's why I want NLBs to support SGs, or at least an NLBv2 that does.
If you run the NLB in ‘ip’ mode instead of instance mode you don’t have this problem. You still can’t setup security groups. However, you can put the NLB in a small subnet then whitelist the subnet.
‘ip’ mode is very different from instance mode and might not be what you want. Though, ‘instance’ mode is subtly bugged if you are using cross zone load balancing.
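If that trade-off works, the whitelist is just the NLB subnet's CIDR instead of 0.0.0.0/0; the CIDR, port, and group id below are illustrative:

    import boto3

    ec2 = boto3.client("ec2")
    NLB_SUBNET_CIDR = "10.0.42.0/28"  # the small subnet the NLB nodes were placed in

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # the servers' security group (placeholder id)
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
            "IpRanges": [{"CidrIp": NLB_SUBNET_CIDR}],
        }],
    )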
What about, "treat AWS workers better"? Pay your people for their on call hours! Let them work on side projects and games in their spare time! Give them more than seven paid holidays. Give them more than two weeks vacation!
Only six weeks of paid parental leave?
I would absolutely be willing to pay more for AWS if I knew that amount was going toward treating the poor folks who built it all better.
AWS/Amazon might be great for customers, but it's a horrible place to work. Having worked in AWS for 3 years, I can say almost all services are half-baked, with tech debt in all parts of the code. But hey, we never see any issues? That's because there's an army of on-callers manually running commands and fixing issues.
I used to work in one of the DB services, and we used to get 20+ pages (sev2) every day. Because of that insane volume, we had daily on-call rotations.
Yes, I work at AWS. I’m never on call, and I haven’t worked more than 40 hours a week unless I’m learning something new and trying to figure it out. I control my own calendar and I manage expectations for my projects.
Have you considered that at a company with over 1.5 million employees, everyone’s experience wouldn’t be the same?
Besides, I purposely put myself in a position where I wouldn’t have to relocate to a high-cost-of-living area. I knew that Azure, AWS, and GCP had Professional Services departments that might require a lot of travel. But no relocation, no on-call, etc.
Then a worldwide pandemic happened that reduced travel…
Amazon is generally a terrible place to work for your well-being (personal experience and data-based). But 2 weeks of vacation only applies to first-year employees outside of CA.
Seattle dev: 1st year -> 2 weeks, 2-4th year -> 3 weeks, 5th+ year -> 4 weeks.
California dev: 1st year -> 3 weeks, 2-4th year -> 4 weeks, 5th+ year -> 5 weeks.