Long-running Lambda functions are called AWS Batch. It’s a relatively unknown service but pretty decent if you need something like a GPU or long-running jobs and can tolerate a 90-second cold start.
we even made an interface in code that runs a given task on a lambda (backed by a docker image) or batch (backed by the same docker image) depending on cpu/mem/time constraints. the tasks themselves also send a message when they're done so you can just subscribe to that vs. long polling.
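A rough sketch of what that dispatch could look like, assuming boto3; the function name, job queue, and job definition below are placeholders, not the actual setup described above:

    import json
    from dataclasses import dataclass

    import boto3

    lambda_client = boto3.client("lambda")
    batch_client = boto3.client("batch")

    # Rough Lambda ceilings: 15 minutes of runtime, ~10 GB of memory.
    LAMBDA_MAX_SECONDS = 15 * 60
    LAMBDA_MAX_MEMORY_MB = 10_240

    @dataclass
    class TaskSpec:
        name: str
        payload: dict
        est_seconds: int
        memory_mb: int

    def run_task(task: TaskSpec) -> None:
        """Run on Lambda if the task fits its limits, otherwise submit to Batch."""
        if task.est_seconds < LAMBDA_MAX_SECONDS and task.memory_mb <= LAMBDA_MAX_MEMORY_MB:
            lambda_client.invoke(
                FunctionName="task-runner",        # placeholder function name
                InvocationType="Event",            # async; completion arrives via the done-message
                Payload=json.dumps(task.payload).encode(),
            )
        else:
            batch_client.submit_job(
                jobName=task.name,
                jobQueue="task-runner-queue",      # placeholder queue
                jobDefinition="task-runner:1",     # same docker image as the Lambda
                containerOverrides={"command": ["run", json.dumps(task.payload)]},
            )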
In most cases, I think you can do fine with just having a proper async interface to the data pipeline, where you trigger something and then react to a notification that it has completed.
But then you wouldn't have the constant white knuckle anxiety that you could run up a five figure bill in 30mins by accidentally misconfiguring your infra deployment.
Don’t worry! You would replace that with a new white knuckle anxiety that misconfiguring your infra deployment would exhaust your four figure cost limit, causing AWS to helpfully shut down ALL of your infra to avoid any accidental overspend.
I feel like I'm the only engineer out there who just doesn't care for all the complexity of the modern cloud - like, I just run Ubuntu VMs on ec2, I don't use all the whizbang services, I just don't care.
And this model lets me be cloud agnostic for the most part - I run data workloads on gcp, dev/build workloads on linode, I've run bare metal in some places where I needed on-prem stuff. It's all just very much simpler than every cloud's flavor of doing everything slightly differently through different apis and tooling...
At my company, I've seen too many developers trying to cram fairly complex Flask websites (providing business tools) into Lambda functions. They deploy, and then the users complain that the website, which used to be immediately available, now runs very slowly, because every request is also initializing the Flask app from scratch. That's a ridiculous amount of overhead. Tools like this belong in ECS, not Lambda. Or rewrite as a SPA and use microservices. A classic case of hammer and screw.
We have a process where we strip the text out of PDFs and shove it into Elasticsearch. The Lambda starts by counting the pages, and if it’s 250 or fewer it handles the job. If it’s larger than that, the Lambda kicks the job to a temporary EC2 instance, which takes over. Our cutoff is around 250 pages, but it’s highly dependent on text density.
It would be great if Lambda could handle running long. I’d probably even be fine if the duration was punitive, in that the longer you run over X time, the progressively more expensive it becomes. This would create a disincentive for using the service wrongly but would allow for oddball tasks.
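Roughly what that cutoff and handoff could look like, as a sketch; the page-counting library (pypdf), the launch template, and the helper names are assumptions, not the actual implementation:

    import boto3
    from pypdf import PdfReader  # assumed page-counting library

    PAGE_CUTOFF = 250  # highly dependent on text density, per above

    s3 = boto3.client("s3")
    ec2 = boto3.client("ec2")

    def handler(event, context):
        bucket, key = event["bucket"], event["key"]
        local_path = "/tmp/" + key.rsplit("/", 1)[-1]
        s3.download_file(bucket, key, local_path)

        pages = len(PdfReader(local_path).pages)
        if pages <= PAGE_CUTOFF:
            extract_and_index(local_path)  # small enough: handle it in this Lambda
        else:
            # Too big for the 15-minute window: hand off to a throwaway EC2
            # instance that runs the same extraction job and terminates itself.
            ec2.run_instances(
                LaunchTemplate={"LaunchTemplateName": "pdf-extractor"},  # placeholder template
                MinCount=1,
                MaxCount=1,
                InstanceInitiatedShutdownBehavior="terminate",
                UserData=f"#!/bin/bash\nextract-pdf s3://{bucket}/{key}\nshutdown -h now\n",
            )

    def extract_and_index(path: str) -> None:
        """Strip the text and push it into Elasticsearch (omitted here)."""
        ...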
Why not count the pages and then create a separate task for each batch of N pages? You could orchestrate the batches through a parallel Map state in a Step Function and bring the results together in the final step.
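The fan-out half of that could be as small as a Lambda that turns the page count into batch descriptors for the Map state to iterate over; the batch size and field names here are made up:

    BATCH_SIZE = 50  # N pages per task; tune to what one Lambda can chew through in time

    def plan_batches(event, context):
        """Input like {"bucket": ..., "key": ..., "pages": 612}; output is a list of
        page ranges for the Step Functions Map state to run in parallel."""
        pages = event["pages"]
        batches = [
            {"bucket": event["bucket"], "key": event["key"],
             "first_page": start, "last_page": min(start + BATCH_SIZE - 1, pages)}
            for start in range(1, pages + 1, BATCH_SIZE)
        ]
        return {"batches": batches}

    # The Map state would point ItemsPath at "$.batches", invoke the per-batch
    # extraction Lambda for each entry, and a final state would merge the text.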
The audit logging story sucks unless you give them more money to understand the data they are throwing at you. I had a problem recently that was entirely Amazon's fault and resulted in a massive increase in billing. I'm still trying to scrape together the data they want to issue a credit, but it's a pain in the ass scouring through all the event logs because of all the internal stuff (do I really need a log entry every time an AWS internal process hits another AWS internal process for data?) polluting the output.
Having just started with AWS, I would say: letting me make a Kubernetes cluster that doesn't require so many different cloud objects before it will start to function.
This isn't true. 10GB is the limit for Docker-image-backed Lambda functions. Layers count against the same 250MB unzipped cap as regular Lambda deployment packages.
A couple of weeks ago, I tried to deploy a Lambda function in Python that created Azure subnets, and the Azure client was 265MB alone. My layer creation API call failed because of this.
Lambda is insanely expensive, which is why they don’t allow long-running jobs. A 1GB allocation is $43/mo. And most Lambda users are running a single task/process per Lambda invocation.
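For what it's worth, the $43/mo figure checks out against the published on-demand duration price (roughly $0.0000166667 per GB-second, request charges excluded):

    # One GB kept busy around the clock for a 30-day month:
    price_per_gb_second = 0.0000166667
    seconds_per_month = 60 * 60 * 24 * 30
    print(1 * seconds_per_month * price_per_gb_second)  # ~43.2 -> roughly $43/mo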
The actual reason is that long running invocations run synchronous workflows, typically requiring holding threads and sockets open for the entire duration of execution.
Lambda is a complex system, and holding those sockets for long times across many services could cause resource starvation issues. You've got load balancers, data plane, control plane, tenant vms, and a whole bunch of caches, and support services that all need to be ready to roll over the lifetime of the invocation.
And you have to consider the use case of draining and patching Lambda pools. If someone is running a two-hour function and you need to take down a server that's currently holding a thread or socket for it, you have to wait for the function to complete. You can't put new load on that server, so you're using resources really inefficiently until the function completes.
I would feel better about using AWS if Amazon treated all of their employees properly, including those in the warehouses doing hard physical labour.
Again, AWS is great for small businesses just starting up, which can make a lot of use of the free tier and find the pay-per-use pricing attractive, and for massive corporations that only need one approved vendor that can service all their needs without going through the purchasing process again.
It's the middle where you will get squeezed and not get a cost effective value without a dedicated AWS guy or two.
Completely scrap CloudFormation and CDK and come up with something that requires /less/ code - not more - and applies resource changes in parallel where possible. CFn is pretty garbage - CDK just makes it more complex.
I've gone through a bunch of audits, and automated scans, and I constantly have to explain this shit, even to AWS Employees.
How it works with ALBs, which do support security groups:
You want to receive traffic on port :443, and allow it to be accessible to the world.
You have EC2 instances, and they are listening on the VPC at port :1234
So, you create:
- ALB my_alb which listens on :443, and forwards traffic to tg_traffic
- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234
- Security Group sg_alb, attached to my_alb with two rules:
- rule 1, inbound, from 0.0.0.0/0:443
- rule 2, outbound, to sg_servers:1234
- Security Group sg_servers, attached to the EC2 instances with one rule:
- rule 1, inbound from sg_alb:1234
This makes everyone happy. The rules require that traffic from the internet has to go through the ALB.
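For reference, that rule set as a boto3 sketch; the VPC id is a placeholder, and note that a new security group also comes with a default allow-all egress rule you'd revoke for rule 2 to be meaningful:

    import boto3

    ec2 = boto3.client("ec2")
    VPC_ID = "vpc-0123456789abcdef0"  # placeholder

    sg_alb = ec2.create_security_group(
        GroupName="sg_alb", VpcId=VPC_ID,
        Description="ALB: 443 in from anywhere, 1234 out to the servers")["GroupId"]
    sg_servers = ec2.create_security_group(
        GroupName="sg_servers", VpcId=VPC_ID,
        Description="Servers: 1234 in from the ALB only")["GroupId"]

    # sg_alb rule 1: inbound 0.0.0.0/0:443
    ec2.authorize_security_group_ingress(GroupId=sg_alb, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

    # sg_alb rule 2: outbound to sg_servers:1234
    ec2.authorize_security_group_egress(GroupId=sg_alb, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "UserIdGroupPairs": [{"GroupId": sg_servers}]}])

    # sg_servers rule 1: inbound from sg_alb:1234
    ec2.authorize_security_group_ingress(GroupId=sg_servers, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "UserIdGroupPairs": [{"GroupId": sg_alb}]}])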
Now how it works on a NLB, with the same scenario:
You want to receive traffic on port :443, and allow it to be accessible to the world. You have EC2 instances, and they are listening on the VPC at port :1234
However, NLBs, as mentioned, don't support security groups.
So, you create:
- NLB my_nlb which listens on :443, and forwards traffic to tg_traffic
- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234
- Security Group sg_servers, attached to the EC2 instances with one rule:
- rule 1, inbound from 0.0.0.0/0:1234 (not :443, because the NLB translates the port for you, but not the source IP)
...that's it.
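And the NLB version really is just the one group with the one wide-open rule (same placeholder style as the ALB sketch above):

    import boto3

    ec2 = boto3.client("ec2")
    VPC_ID = "vpc-0123456789abcdef0"  # placeholder

    sg_servers = ec2.create_security_group(
        GroupName="sg_servers", VpcId=VPC_ID,
        Description="Servers behind the NLB: 1234 in from anywhere")["GroupId"]

    # rule 1: inbound 0.0.0.0/0:1234 -- the NLB passes the client's source IP
    # through, so there's no load balancer address to scope this down to.
    ec2.authorize_security_group_ingress(GroupId=sg_servers, IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])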
However, now every audit/automated scan of the EC2 instance & its security group is going to see that you're listening on some random port and allowing traffic from anywhere. This throws errors/alerts all the time. Even AWS's automated scans are throwing these alerts.
When it's an auditor you have to take the time to explain that, no, that's how NLBs work. For automated scans, you have to just ignore the warnings/errors constantly.
If your instance has no public IP associated, then at least only that port is exposed, and traffic does have to go through the NLB.
If for some reason the instance does have a public IP associated, then anyone who can reach the public IP can bypass your NLB.
If you could have a SG attached, then you could force the traffic to go via the NLB and not come direct to the instance.
I don’t think that’s right. You wouldn’t use 0.0.0.0/0; you would use the private subnet range, so they have no direct access to the internet, and use a NAT gateway to give them access to the outside world but no inbound access.
Edit: in the first example the servers should be in a private subnet and not have a public IP allocated. They would require SSH hopping via a bastion or a VPN.
If your traffic is coming through a NLB, then the instances attached to the target group need to have 0.0.0.0/0 as the permitted source. (Assuming you want traffic from anywhere)
If you don't believe me, try setting it up yourself.
As for instances being on a private VPC/non-public IPs, that's deployment specific.
In any case, everything then complains about listening on strange ports with 0.0.0.0/0
I tried it out, and I don’t know why, but it feels complicated to me. Setting everything up with an application load balancer, public and private subnets, and security groups all makes perfect sense. But the network load balancer confuses me.
But adding it to my list of things to learn and understand better.
yep, it just works differently from everything else.
Once I realised that, from a network traffic perspective, AWS wants you to pretend it's just a magic straw delivering traffic from your source to your targets, basically invisible to everything, it started to click more for me.
That's why I want NLBs to support SGs, or at least an NLBv2 that does.
If you run the NLB in ‘ip’ mode instead of instance mode you don’t have this problem. You still can’t setup security groups. However, you can put the NLB in a small subnet then whitelist the subnet.
‘ip’ mode is very different from instance mode and might not be what you want. Though, ‘instance’ mode is subtly bugged if you are using cross zone load balancing.
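If that trade-off works, the whitelist is just the NLB subnet's CIDR instead of 0.0.0.0/0; the CIDR, port, and group id below are illustrative:

    import boto3

    ec2 = boto3.client("ec2")
    NLB_SUBNET_CIDR = "10.0.42.0/28"  # the small subnet the NLB nodes were placed in

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # the servers' security group (placeholder id)
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
            "IpRanges": [{"CidrIp": NLB_SUBNET_CIDR}],
        }],
    )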
What about, "treat AWS workers better"? Pay your people for their on call hours! Let them work on side projects and games in their spare time! Give them more than seven paid holidays. Give them more than two weeks vacation!
Only six weeks of paid parental leave?
I would absolutely be willing to pay more for AWS if I knew that amount was going toward treating the poor folks who built it all better.
AWS/Amazon might be great for customers, but it's a horrible place to work. Having worked in AWS for 3 years, I can say almost all services are half-baked, with tech debt in all parts of the code. But hey, we never see any issues? That's because there's an army of on-callers manually running commands and fixing issues.
I used to work in one of the DB services, and we used to get 20+ pages (sev2) every day. Because of that insane volume, we had daily on-call rotations.
Yes, I work at AWS. I’m never on call, and I haven’t worked more than 40 hours a week unless I’m learning something new and trying to figure it out. I control my own calendar and I manage expectations for my projects.
Have you considered that at a company with over 1.5 million employees, everyone’s experience wouldn’t be the same?
Besides, I purposely put myself in a position where I wouldn’t have to relocate to a high-cost-of-living area. I knew that Azure, AWS, and GCP had Professional Services departments that might require a lot of travel. But no relocation, no on-call, etc.
Then a worldwide pandemic happened that reduced travel…
Amazon is generally a terrible place to work for your well-being (personal experience and data-based). But 2 weeks of vacation only applies to first-year employees outside of CA.
Seattle dev: 1st year -> 2 weeks, 2-4th year -> 3 weeks, 5th+ year -> 4 weeks.
California dev: 1st year -> 3 weeks, 2-4th year -> 4 weeks, 5th+ year -> 5 weeks.