What would make AWS even better (yehudacohen.substack.com)
28 points by ManWith2Plans on Sept 2, 2022 | 68 comments



Long running lambda functions are called AWS Batch. It’s a relatively unknown service but pretty decent if you need something like a GPU or long running jobs and can tolerate a 90 second cold start.
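
A minimal sketch of submitting such a job from Python with boto3; the queue and job definition names (and the GPU override) are placeholder assumptions, not anything from the comment:

    import boto3

    batch = boto3.client("batch")

    # Submit a long-running job to an existing queue; request a GPU via a
    # resource-requirement override. Names here are hypothetical.
    response = batch.submit_job(
        jobName="nightly-train",
        jobQueue="gpu-queue",          # placeholder
        jobDefinition="trainer:1",     # placeholder, backed by a Docker image
        containerOverrides={
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    )
    print(response["jobId"])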


exactly this.

we even made an interface in code that runs a given task on a lambda (backed by a docker image) or batch (backed by the same docker image) depending on cpu/mem/time constraints. the tasks themselves also send a message when they're done so you can just subscribe to that vs. long polling.
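
a rough sketch of what that dispatch interface can look like; the function, queue, and job definition names and the exact lambda ceilings used for routing are assumptions:

    import json
    import boto3

    lambda_client = boto3.client("lambda")
    batch_client = boto3.client("batch")

    # Rough Lambda ceilings: 15 min runtime, 10 GB memory.
    LAMBDA_MAX_SECONDS = 15 * 60
    LAMBDA_MAX_MEMORY_MB = 10240

    def run_task(payload, est_seconds, mem_mb):
        """Route a task to Lambda or Batch based on its resource needs.
        Function/queue/definition names are hypothetical placeholders."""
        if est_seconds < LAMBDA_MAX_SECONDS and mem_mb <= LAMBDA_MAX_MEMORY_MB:
            lambda_client.invoke(
                FunctionName="task-runner",   # backed by a Docker image
                InvocationType="Event",       # async; completion comes via a message
                Payload=json.dumps(payload),
            )
        else:
            batch_client.submit_job(
                jobName="task-runner",
                jobQueue="default",
                jobDefinition="task-runner:1",   # same Docker image
                containerOverrides={"environment": [
                    {"name": "PAYLOAD", "value": json.dumps(payload)},
                ]},
            )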


...and there's also AWS Data Pipeline.


Long running lambdas would be sick for infrequent / low concurrency data pipelines.

My wet dream is "bidirectional IaC". Let me make changes using the GUI and have them committed to the repo automatically.


In most cases, I think you can do fine with just having a proper async interface to the data pipeline, where you trigger something, and then you respond to the trigger that it has completed.
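
A minimal sketch of the "respond to the trigger" side, assuming the tasks publish a done message to SNS; the message shape and field names here are assumptions:

    import json

    def handle_completion(event, context):
        """Lambda handler for a task-completion message arriving via SNS.
        The message fields are hypothetical, not a fixed contract."""
        for record in event["Records"]:
            message = json.loads(record["Sns"]["Message"])
            if message.get("status") == "SUCCEEDED":
                print(f"job {message['job_id']} done, kick off the next stage")
            else:
                print(f"job {message['job_id']} failed: {message.get('error')}")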


Take a look at former2 and cdk-dasm.


Make it possible to set actual cost limits, not alarms.


But then you wouldn't have the constant white knuckle anxiety that you could run up a five figure bill in 30mins by accidentally misconfiguring your infra deployment.


Don’t worry! You would replace that with a new white knuckle anxiety that misconfiguring your infra deployment would exhaust your four figure cost limit, causing AWS to helpfully shutdown ALL of your infra to avoid any accidental overspend.


Whew. That was too close to emotional tranquillity for comfort. Thanks!


I feel like I'm the only engineer out there who just doesn't care for all the complexity of the modern cloud - like, I just run Ubuntu VMs on ec2, I don't use all the whizbang services, I just don't care.

And this model lets me be cloud agnostic for the most part - I run data workloads on gcp, dev/build workloads on linode, I've run bare metal in some places where I needed on-prem stuff. It's all just very much simpler than every cloud's flavor of doing everything slightly differently through different apis and tooling...


At my company, I've seen too many developers trying to cram fairly complex Flask websites (providing business tools) into Lambda functions. They deploy, and then the users complain that the website, which used to be immediately available, now runs very slowly, because every request is also initializing the Flask app from scratch. That's a ridiculous amount of overhead. Tools like this belong in ECS, not Lambda. Or rewrite as a SPA and use microservices. A classic case of hammer and screw.


i was told that AWS Glue solves the "15 minutes or more" need that lambda cannot provide. never tried it so i can't say if it's a good substitute.

that's the only service i see OP did not mention


Glue is Apache Spark as a service. If you don't need Spark, you will be dealing with lots of unnecessary complexity.

If you do need Spark, Databricks is likely a better option though :)


I am pretty surprised they don't compete with Stripe. They have some Amazon pay thing I'd never use, but competing with Stripe seems obvious.

Same with Twilio. They do kind of compete with them, but not really.

Their managed Airflow is insanely, basically unusably, expensive; I don't get that.


A lot of Stripe's success comes from their UX and dev documentation, something AWS struggles with.


We have a process where we strip the text out of PDFs and shove it into Elastic. The lambda starts by counting the pages, and if it's 250 or less it handles the job. If it's larger than that, we make the lambda kick the job to a temp EC2 instance which takes over. Our cutoff is around 250 pages, but it's highly dependent on text density.

It would be great if the lambda could handle running long. I'd probably even be fine if the duration was punitive, in that the longer you run over X time, the progressively more expensive it becomes. This would create a disincentive for using the service wrongly but would allow for oddball tasks.
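
A hedged sketch of that handoff, using pypdf for the page count; the bucket/key event shape, AMI id, and the process-pdf bootstrap command are all placeholders:

    import boto3
    from pypdf import PdfReader

    PAGE_CUTOFF = 250  # rough cutoff; highly dependent on text density

    s3 = boto3.client("s3")
    ec2 = boto3.client("ec2")

    def extract_and_index(path):
        """Hypothetical: strip the text out and index it into Elastic."""
        ...

    def handler(event, context):
        bucket, key = event["bucket"], event["key"]
        s3.download_file(bucket, key, "/tmp/doc.pdf")

        if len(PdfReader("/tmp/doc.pdf").pages) <= PAGE_CUTOFF:
            extract_and_index("/tmp/doc.pdf")
        else:
            # Too big for the 15-minute window: kick the job to a temp EC2
            # instance (AMI id and bootstrap command are placeholders).
            ec2.run_instances(
                ImageId="ami-placeholder",
                InstanceType="m5.large",
                MinCount=1,
                MaxCount=1,
                InstanceInitiatedShutdownBehavior="terminate",
                UserData=f"#!/bin/bash\nprocess-pdf s3://{bucket}/{key}\n",
            )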


Since Lambda is just one of at least three ways to “launch a Firecracker task to do something”, I choose the right one.

For something like that, I would use CodeBuild. In essence, CodeBuild is just a method to run a list of bash commands in a Linux Docker container.

Standard disclaimer: I work at AWS in Professional Services.
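
For what it's worth, a minimal sketch of kicking off such a build from boto3; the project name and environment variable are placeholders:

    import boto3

    codebuild = boto3.client("codebuild")

    # Start a build on an existing CodeBuild project. The build itself is
    # just the project's buildspec (a list of shell commands) run in a
    # Linux Docker container.
    build = codebuild.start_build(
        projectName="pdf-extractor",   # placeholder
        environmentVariablesOverride=[
            {"name": "INPUT_KEY", "value": "docs/big.pdf", "type": "PLAINTEXT"},
        ],
    )
    print(build["build"]["id"])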


Thanks for this! I’m not sure I would have looked at CodeBuild


I meant to mention:

You can run and test CodeBuild locally. Just download the Docker image and run the shell script.

https://docs.aws.amazon.com/codebuild/latest/userguide/use-c...


Why not count the pages and then create a separate task for batches of N pages? You could orchestrate the batches through a parallel map in a step function and bring the results together in the final step.
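
A small sketch of that fan-out, assuming a hypothetical state machine with a Map state; the ARN and batch size are placeholders:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    BATCH_SIZE = 50  # pages per parallel task; arbitrary

    def start_batched_extraction(bucket, key, page_count):
        # One input element per batch; a Map state in the (hypothetical)
        # state machine fans these out in parallel, and the final state
        # aggregates the results.
        batches = [
            {"bucket": bucket, "key": key,
             "first_page": i, "last_page": min(i + BATCH_SIZE, page_count)}
            for i in range(0, page_count, BATCH_SIZE)
        ]
        sfn.start_execution(
            stateMachineArn="arn:aws:states:us-east-1:111111111111:stateMachine:pdf-extract",  # placeholder
            input=json.dumps({"batches": batches}),
        )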


Use a step function and run a parallel task?


The audit logging story sucks unless you give them more money to understand the data they are throwing at you. I had a problem recently that was entirely Amazon's fault and resulted in a massive increase in billing. I'm still trying to scrape together the data they want to issue a credit, but it's a pain in the ass scouring through all the event logs because of all the internal stuff (do I really need a log entry every time an AWS internal process hits another AWS internal process for data?) polluting the output.


Having just started with AWS, I would say: letting me make a Kubernetes cluster that doesn't require so many different cloud objects before it will start to function.


Poof: AWS Blueprints removes a lot of boilerplate code; pair it with Terraform and that's it.


Poof?


Everyone complains about the 15-minute problem on lambda, but am I the only one that has a problem with the 250MB deployment size limit?


You can use Lambda to run Docker containers of sizes up to 10GB

https://aws.amazon.com/blogs/aws/new-for-aws-lambda-containe...


What are you publishing that has a binary that big but can't use a container image?


Sure you can get around that with proper use of Layers.


Total layer size is also restricted.


Restricted to 10GB. I'm sorry, but if you need more than that, you probably shouldn't be using Lambdas the way you are using Lambdas.


This isn't true. 10GB is the limit on docker-backed lambda function sizes. Layers are capped to 250MB just like lambda functions.

A couple of weeks ago, I tried to deploy a lambda function that created Azure subnets in Python, and the Azure client was 265MB alone. My layer creation API call failed because of this.


You're right. I forgot it was docker backed lambda rather than layers.

Out of curiosity, why didn't you use an Azure docker image to back your lambda function?


Being able to save state and restore it. Bonus points for being able to browse other configs, remix them and deploy them too.


TL;DR: "Long running lambda functions. Subscribe and hit the bell to learn when I announce the next one!"


Lambda is insanely expensive, which is why they don't allow long-running jobs. A 1GB allocation is $43/mo. And most lambda users are running a single task/process per lambda invocation.
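
For reference, that figure is just the on-demand rate (roughly $0.0000166667 per GB-second for x86) applied to a full month:

    1 GB × $0.0000166667/GB-s × 2,592,000 s/month ≈ $43.20/month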


Wait, so AWS does not allow long running lambdas because that would make too much money for them?


The actual reason is that long running invocations run synchronous workflows, typically requiring holding threads and sockets open for the entire duration of execution.

Lambda is a complex system, and holding those sockets for long times across many services could cause resource starvation issues. You've got load balancers, data plane, control plane, tenant vms, and a whole bunch of caches, and support services that all need to be ready to roll over the lifetime of the invocation.

And you have to consider draining and patching lambda pools. If someone is running a two-hour function and you need to take down a server that's currently holding a thread or socket for it, you have to wait for the function to complete. You can't schedule new load onto that server, so you're using resources really inefficiently until the function completes.


It sounds like you’re trying to use lambda for the wrong things and complaining.


I would feel better about using AWS if Amazon treated all of their employees properly, including those in the warehouses doing hard physical labour.


Yes, it can be improved.

It feels like things such as the S3 API were designed by committee. If you use tools like the CLI you'll notice how clunky it is.


Environment variables on lambda aliases would be a good thing to implement, to make the thing somewhat usable.
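
In the meantime, a common workaround is to namespace your own variables by alias; a sketch, where the variable names are hypothetical:

    import os

    def handler(event, context):
        # When invoked through an alias, the alias is the 8th segment of
        # arn:aws:lambda:region:account:function:name:alias
        parts = context.invoked_function_arn.split(":")
        alias = parts[7] if len(parts) > 7 else "default"

        # Workaround for the missing per-alias env vars: pick the variable
        # by alias (DB_URL_PROD, DB_URL_STAGING, ... are hypothetical).
        db_url = os.environ.get(f"DB_URL_{alias.upper()}") or os.environ["DB_URL"]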


IMO discounts would make AWS better… the kind you get when you use more and can negotiate


Be an enterprise customer.

Again, AWS is great for small businesses just starting up, which can make a lot of use of the free tier and find the pay-per-use pricing attractive, and for massive corporations that only need one approved vendor that can service all their needs without going through the purchasing process again.

It's the middle where you will get squeezed and not get a cost effective value without a dedicated AWS guy or two.


how much does one need to spend to become an enterprise customer?


For long-running tasks I like to use CodeBuild; ECS Fargate (task) is also an option.


Pretty good experiences here with ECS Fargate for a Dockerized Next.js runtime.


Glad to hear Next.js + Docker has been a smooth experience! Let me know if there's anything else we could do better.


a good pattern is using lambda to boot and then monitor ec2 spot. you get the flexibility of lambda and the power of ec2 spot.

some external event triggers the boot lambda.

1 minute schedule triggers the monitor lambda.
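
a minimal sketch of the two lambdas, with placeholder AMI id, instance type, and tag:

    import boto3

    ec2 = boto3.client("ec2")

    def boot(event, context):
        # Triggered by some external event: launch a spot instance
        # (AMI id, instance type, and tag are placeholders).
        ec2.run_instances(
            ImageId="ami-placeholder",
            InstanceType="c5.xlarge",
            MinCount=1,
            MaxCount=1,
            InstanceMarketOptions={"MarketType": "spot"},
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "spot-worker"}],
            }],
        )

    def monitor(event, context):
        # Runs on a 1-minute schedule: relaunch if spot capacity was reclaimed.
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:role", "Values": ["spot-worker"]},
                {"Name": "instance-state-name", "Values": ["pending", "running"]},
            ]
        )["Reservations"]
        if not any(r["Instances"] for r in reservations):
            boot(event, context)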


Completely scrap CloudFormation and CDK and come up with something that requires /less/ code - not more, and has resource changes applied in parallel where possible. CFn is pretty garbage - CDK just makes it more complex.


While I don't like CDK, I like CloudFormation.


Anything.


Network Load Balancers supporting security groups.

I've gone through a bunch of audits and automated scans, and I constantly have to explain this shit, even to AWS employees.

How it works with ALBs, which do support security groups:

You want to receive traffic on port :443, and allow it to be accessible to the world. You have EC2 instances, and they are listening on the VPC at port :1234

So, you create:

- ALB my_alb which listens on :443, and forwards traffic to tg_traffic

- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234

- Security Group sg_alb, attached to my_alb with two rules:

  - rule 1, inbound, from 0.0.0.0/0:443

  - rule 2, outbound, to sg_servers:1234

- Security Group sg_servers, attached to the EC2 instances with one rule:

  - rule 1, inbound, from sg_alb:1234

This makes everyone happy. The rules require that traffic from the internet has to go through the ALB.
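
As a concrete sketch, those two security groups might be wired up like this in boto3 (group IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # sg_alb: let the world in on :443.
    ec2.authorize_security_group_ingress(
        GroupId="sg-alb-placeholder",
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )

    # sg_servers: only accept :1234 traffic that originates from sg_alb.
    # This SG-to-SG reference is exactly what NLBs can't give you.
    ec2.authorize_security_group_ingress(
        GroupId="sg-servers-placeholder",
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 1234, "ToPort": 1234,
            "UserIdGroupPairs": [{"GroupId": "sg-alb-placeholder"}],
        }],
    )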

Now how it works on a NLB, with the same scenario:

You want to receive traffic on port :443, and allow it to be accessible to the world. You have EC2 instances, and they are listening on the VPC at port :1234

However, NLBs, as mentioned, don't support security groups.

So, you create:

- NLB my_nlb which listens on :443, and forwards traffic to tg_traffic

- Target group tg_traffic, which contains the EC2 instances and targets the EC2 instance with port 1234

- Security Group sg_servers, attached to the EC2 instances with one rule:

  - rule 1, inbound, from 0.0.0.0/0:1234  (not :443, because the NLB translates the port for you, but not the source IP)

...that's it.

However, now every audit/automated scan of the EC2 instance and its security group is going to see that you're listening on some random port and allowing traffic from anywhere. This throws errors/alerts all the time. Even AWS's automated scans are throwing these alerts.

When it's an auditor you have to take the time to explain that, no, that's how NLBs work. For automated scans, you have to just ignore the warnings/errors constantly.

If your instance has no public IP associated, then at least only that port is exposed, and traffic does have to go through the NLB.

If for some reason the instance does have a public IP associated, then anyone who can reach the public IP can bypass your NLB.

If you could have an SG attached, then you could force the traffic to go via the NLB and not come directly to the instance.


I don't think that's right. You wouldn't use 0.0.0.0/0; you would use the private subnet range, so the instances have no direct access to the internet, and use a NAT gateway to give them access to the outside world but no inbound access.

Edit: in the first example the servers should be in a private subnet and not have a public IP allocated. They would require SSH hopping via a bastion or a VPN.


If your traffic is coming through an NLB, then the instances attached to the target group need to have 0.0.0.0/0 as the permitted source (assuming you want traffic from anywhere).

If you don't believe me, try setting it up yourself.

As for instances being on a private VPC/non-public IPs, that's deployment-specific.

In any case, everything then complains about listening on strange ports with 0.0.0.0/0


I will try test it out this evening.


I tried it out, and I don't know why, but it feels complicated to me. Setting everything up with an application load balancer, public and private subnets, and security groups all makes perfect sense. But the network load balancer makes me confused.

But adding it to my list of things to learn and understand better.


> But the network load balancer makes me confused

yep, it just works differently from everything else.

Once you realise that, from a network traffic perspective, AWS wants you to pretend it's just a magic straw delivering traffic from your source to your targets, basically invisible to everything, it starts to click.

That's why I want NLBs to support SGs, or at least an NLBv2 that does.


If you run the NLB in 'ip' mode instead of 'instance' mode you don't have this problem. You still can't set up security groups. However, you can put the NLB in a small subnet and then whitelist the subnet.

'ip' mode is very different from 'instance' mode and might not be what you want. Though 'instance' mode is subtly bugged if you are using cross-zone load balancing.


What about, "treat AWS workers better"? Pay your people for their on call hours! Let them work on side projects and games in their spare time! Give them more than seven paid holidays. Give them more than two weeks vacation!

Only six weeks of paid parental leave?

I would absolutely be willing to pay more for AWS if I knew that amount was going toward treating the poor folks who built it all better.


AWS/Amazon might be great for customers, but it's a horrible place to work. Having worked in AWS for 3 years, I can say almost all services are half-baked, with tech debt in all parts of the code. But hey, you never see any issues? That's because it has an army of on-callers who are manually running commands and fixing issues.

I used to work on one of the DB services, and we used to get 20+ pages (sev2) every day. Due to the insane number of pages every day, we had daily on-call rotations.


https://www.amazon.jobs/en/landing_pages/pto-overview-us

15 PTO days. 6 personal days

Yes, I work at AWS. I'm never on call, and I haven't worked more than 40 hours unless I'm learning something new or trying to figure something out. I control my own calendar and I manage expectations for my projects.

I do work in ProServe though…


I work in tech in Seattle. AWS is notably worse than its peers. And even still, your experience is not typical for AWS.


Have you considered that at a company with over 1.5 million employees, everyone's experience wouldn't be the same?

Besides, I purposefully put myself in a position that I wouldn’t have to relocate to a high cost of living area. I knew that Azure, AWS, and GCP had a Professional Services department that may require a lot of travel. But no relocation, no on call, etc.

Then a worldwide pandemic happened that reduced travel…


Amazon is a terrible place to work for your well-being generally (personal experience and data-based). But, 2 weeks vacation only applies to 1st year employees outside of CA.

Seattle dev: 1st year -> 2 weeks, 2-4th year -> 3 weeks, 5th+ year -> 4 weeks.

California dev: 1st year -> 3 weeks, 2-4th year -> 4 weeks, 5th+ year -> 5 weeks.


If the average Amazonian washes out in under two years, I think it's pretty fair to say most Amazonians only get two weeks.


Say what you will about GCP or Azure, at least those folks get to see their families.




