AWS API Performance Comparison: Serverless vs. Containers (alexdebrie.com)
123 points by abd12 31 days ago | 54 comments



Perhaps I misread, but it sounds like the author never managed to get Fargate to run without even a minimum of 3% failures?

That to me sounds like an astoundingly high number, especially for the amount of resources dedicated to the test (400GB of memory and 200 CPUs). Hell, if you're monitoring uptime, that's one nine (97% success).

I can't believe that this is a well-configured setup. Running 50 instances with four CPUs each at a total of ~100rps means that half of the CPUs are probably doing exactly nothing (and if my understanding is correct and they're using Flask in a single-threaded way, 150 of the 200 CPUs are going to be idle).

Triggering SNS is an API call. Assuming that's all the test application is doing, you hardly need one server to do this. I'd bet that you could make 100 simultaneous API calls from a stock Macbook Pro with a small handful of Node or Go processes without even making your CPU spin up.
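To make the "you hardly need one server" point concrete, here is a minimal sketch. The comment mentions Node or Go; Python threads are swapped in here. `fake_api_call` is a hypothetical stand-in that sleeps for an assumed 50 ms of network latency instead of actually calling SNS, so this only illustrates that I/O-bound calls overlap cheaply, not real SNS throughput.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

ASSUMED_LATENCY_S = 0.05  # assumption: ~50 ms round trip per API call


def fake_api_call(i: int) -> int:
    """Hypothetical stand-in for an SNS publish: just waits on 'the network'."""
    time.sleep(ASSUMED_LATENCY_S)
    return i


def run_batch(n_calls: int = 100) -> Tuple[int, float]:
    """Issue n_calls concurrently; return (completed count, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_calls) as pool:
        results = list(pool.map(fake_api_call, range(n_calls)))
    return len(results), time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = run_batch()
    print(f"{done} calls in {elapsed:.2f}s")
```

Run sequentially, 100 such calls would take about 5 seconds; run concurrently they overlap, because the process spends nearly all of its time waiting rather than computing.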

If Fargate can't handle 100rps (to make a single API call per request) with <10 instances, it's a useless product. But I find it hard to believe that Amazon would put something so absolutely incapable into the wild. With the specs the author put up, that's the equivalent of ~$10/hr (if I'm reading their pricing page correctly). You could run 380 A1 instances, 58 t3.xl instances, or two m5a.24xlarge instances for that price.
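The ~$10/hr estimate and the idle-CPU count can be checked with a few lines of arithmetic. This is a rough sketch: the two per-hour rates below are assumed us-east-1 Fargate on-demand prices from around the time of the article, not values taken from the post, so check the current pricing page before relying on them.

```python
# Assumed Fargate on-demand rates (us-east-1, early 2019); verify before use.
PRICE_PER_VCPU_HOUR = 0.04048
PRICE_PER_GB_HOUR = 0.004445


def fargate_hourly_cost(tasks: int, vcpus_per_task: float, gb_per_task: float) -> float:
    """Hourly cost in USD for identical Fargate tasks at the assumed rates."""
    cpu_cost = tasks * vcpus_per_task * PRICE_PER_VCPU_HOUR
    mem_cost = tasks * gb_per_task * PRICE_PER_GB_HOUR
    return cpu_cost + mem_cost


def idle_cpus(tasks: int, vcpus_per_task: int, busy_per_task: int = 1) -> int:
    """CPUs left idle if each task keeps only busy_per_task cores working."""
    return tasks * (vcpus_per_task - busy_per_task)


if __name__ == "__main__":
    print(f"${fargate_hourly_cost(50, 4, 8):.2f}/hr")
    print(idle_cpus(50, 4), "of", 50 * 4, "CPUs idle")
```

With these assumed rates, 50 tasks at 4 vCPU and 8 GB come to about $9.87/hr, and a single-threaded server per task leaves 150 of the 200 CPUs idle.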


> If Fargate can't handle 100rps (to make a single API call per request) with <10 instances,

It can. Easily. I've done it.

> I can't believe that this is a well-configured setup.

It's not.

The author hasn't provided anywhere near enough information to draw any useful conclusions from this. I don't know why there were errors (the author doesn't give any precise information), I don't know how long the tests were run, the state of instances before the tests were run, what operations were being performed, how the testing was being performed, or numerous other things.

Simply put, it's not useful.


The author didn't spend much time discussing the application load balancer; I wonder if it was the source of the misconfiguration (besides their HTTP server itself).


I tend to think of these services as fulfilling different use cases.

For Fargate, the ideal scenario in my workloads is async background processing. Add tasks to SQS, Fargate pulls tasks off and does the job. Elastically scales up or down and lots of flexibility on machine specs. Ok with some failure rate.
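A minimal sketch of that worker pattern, assuming the container is handed something shaped like a boto3 SQS client (`receive_message` / `delete_message`). The function name `drain_once`, the `handle` callback, and the queue URL are all hypothetical names for illustration.

```python
from typing import Any, Callable


def drain_once(sqs: Any, queue_url: str, handle: Callable[[str], None]) -> int:
    """Long-poll the queue once, process each message, delete it on success.

    `sqs` is expected to behave like a boto3 SQS client. Returns the number
    of messages that were handled and deleted.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS maximum per receive call
        WaitTimeSeconds=20,      # long polling
    )
    handled = 0
    for msg in resp.get("Messages", []):
        try:
            handle(msg["Body"])
        except Exception:
            # Leave the message on the queue; it becomes visible again after
            # the visibility timeout and can be retried.
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled
```

In a real task you would pass `boto3.client("sqs")` and call this in a loop. Deleting only after a successful `handle` is what makes "ok with some failure rate" workable: a failed message simply reappears on the queue.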

For AWS Lambda, recently I like the combo of adding a Cloudflare worker in front and using it as an API gateway. More flexibility on routing, faster performance, and reverse proxy for free which can be good for SEO. And you get all the goodies of Cloudflare like DDOS protection and CDN.


I think you're right that Fargate and Lambda often serve different use cases.

However, I think containers and Lambda can and do serve this particular use case -- handle an API request and forward it to a different system. And Fargate is a good stand-in for containers generally. I wouldn't expect much different performance by using ECS or EKS or EC2 -- it's still going to be a load balancer forwarding to a container instance.

Definitely not perfect, but I think it works as a general approximation. For this particular common use case, you have three options. Was curious to see the perf differences between them.

* Original author here.


Do you have any details about using Cloudflare Workers as an API gateway?


Check out the boilerplate here for workers:

https://github.com/detroitenglish/cloudflare-worker-webpack-...

Then invoke lambda with this:

https://github.com/mhart/aws4fetch

Important to note that workers have a 15s timeout, so this is really only good for routing. You probably don’t want this to manage tasks that could potentially take longer.


> Important to note that workers have a 15s timeout

Not true -- the timeout on outgoing HTTP requests is (I think) 100 seconds (or unlimited as long as data is streaming).

The 15-second limit you may be thinking of is that Workers used to not let you start new outgoing HTTP requests 15 seconds into the event, but already-started requests could continue. This limit was recently removed -- instead, Workers now cancels outgoing requests if the client disconnects, but as long as the client is connected, you can keep making new requests. This was changed to support streaming video use cases where a stream is being assembled out of smaller chunks.

(I'm the tech lead for Workers.)


Didn’t know that! Thanks for chiming in.


What do you mean by reverse proxy?


If you have a blog running at blog.mycompany.com, that is not going to be as effective for SEO as if the blog were at mycompany.com/blog. You want your primary domain getting the credit. But it’s also not ideal to have a separate subfolder if you have two different services (blog and main product).

So the best solution is to reverse proxy so that internet traffic hits /blog, but the worker is actually forwarding the traffic to your internal service.
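The routing decision itself is small. Here is a sketch of it as a plain Python function rather than actual Worker code; `BLOG_ORIGIN` is a made-up internal hostname, and the `/blog` prefix handling is one assumption about how you might want paths rewritten.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical internal origin for the blog service.
BLOG_ORIGIN = "blog-internal.example.com"


def upstream_url(public_url: str) -> str:
    """Map a public URL to the origin that should actually serve it."""
    parts = urlsplit(public_url)
    if parts.path == "/blog" or parts.path.startswith("/blog/"):
        # Strip the /blog prefix and send the request to the blog origin.
        new_path = parts.path[len("/blog"):] or "/"
        return urlunsplit(
            (parts.scheme, BLOG_ORIGIN, new_path, parts.query, parts.fragment)
        )
    return public_url  # everything else goes to the main product unchanged
```

The visitor only ever sees mycompany.com/blog/...; the proxy fetches from the internal origin and relays the response.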


Got it. So you do that through a Cloudflare worker? Aren’t you losing the caching perks from Cloudflare? It would be nice to be able to do that directly through a page rule.


There are probably multiple ways to reverse proxy, it’s just that CF workers are versatile to handle many use cases. You don’t have to lose the caching benefits. You can still pull your assets through Cloudflare CDN since workers operate at the request level (afaik).

In fact it can be even more performant. You can use Workers KV to cache as well. So a request comes in, check KV store, return if found. If not, pull from asset CDN.


Don't use Workers KV for caching -- use the Cache API: https://developers.cloudflare.com/workers/reference/cache-ap...

KV is a global persistent data store, so reads and writes may have to cross the internet. In comparison, the Cache API reads and writes from the local datacenter's cache. Also, Cache API doesn't cost extra (KV does).

However, better than either of these is to formulate your outgoing fetch() calls such that they naturally get the caching properties you want. fetch() goes through Cloudflare's usual caching logic. When that does what you want, it works better because this is the path that has been most optimized over many years.


Can you point to some example code that uses fetch more optimally than Cache API?


There isn't really any magic here. If your resources or API are naturally cacheable using generic HTTP caching logic, then it's best to use fetch() and rely on that rather than try to re-implement the same logic with cache API. But if the default logic isn't a good fit, cache API makes sense.


This benchmark is all over the place; there are way too many variables changing to draw any sort of conclusion. You have a flask-gunicorn-meinheld stack on the Fargate side, all bits that might impact performance. Then there are the failing requests, which alone would disqualify the comparison. I would also assume ApiGW to be slower than plain ELB, although I don't know for sure and this comparison certainly didn't give any information about that. No comparison of different lambda/fargate instance types, nor any mention of costs. Obviously just throwing more money at it should get better perf, so perf/$ is sort of important.


Unless I misread the article, the writer claims to need about 50 instances to support 100 req/sec. "For Fargate, this meant deploying 50 instances of my container with pretty beefy settings — 8 GB of memory and 4 full CPU units per container instance." If the API service was just returning some data with a few manipulations, then that is quite inefficient. At NodeChef ( https://www.nodechef.com ), there are users running over 1000 req/sec with just around 12 containers, each with 512 MB RAM and 2 CPUs.


Original author here.

The two sentences right before what you quoted are helpful:

"I’m not a Docker or Flask performance expert, and that’s not the goal of this exercise. To remedy this, I decided to bump the specs on my deployments.

The general goal for this bakeoff is to get a best-case outcome for each of these architectures, rather than an apples-to-apples comparison of cost vs performance."

I wasn't trying to squeeze out every ounce of performance and determine the minimum number of instances to handle 100 req/sec. I was trying to normalize across the three patterns as much as possible to see best-case performance. I didn't want resource constraints to be an excuse.


I think there’s a pretty big flaw in your benchmark because that failure rate is _insane_ and you shouldn’t need anywhere near that hardware to accomplish this. I don’t think your data is credible as a result.


As usual with such benchmarks, I miss performance comparisons over time. Did Fargate outperform the other two solutions because it's generally faster or because API Gateway was just slow during the few minutes that benchmark took?

What'd also be interesting here would be a price comparison. Without having done the math I'd expect Fargate to be significantly more expensive than the other solutions, which'd make a nice trade-off of cost vs. performance: If performance matters choose Fargate, if cost matters choose API Gateway as service proxy.


Author here. I'd say this was about in line with what I'd see from using API Gateway and Lambda.

I thought about doing a price comparison, but that gets really tricky. At some point, you're testing the skills of the tester more than the services themselves.

Agree with your expectations though. I think Fargate is naturally faster and can be more so given the number of knobs you can tune. This will likely cost you more, both in direct resource costs and in engineering time trying to fine-tune the knobs. Whether that's worth it depends on your business needs.


It's a (relatively) fixed-cost model vs pay-as-you-go. At a certain level of traffic something built on top of ALB/Fargate will be cheaper to run than serverless solutions.
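The crossover point is simple to compute once you plug in your own numbers. A rough sketch; the figures in the usage note are made up for illustration and are not real AWS prices.

```python
def break_even_rps(fixed_cost_per_hour: float, cost_per_million_requests: float) -> float:
    """Requests/second at which pay-per-request cost equals the fixed cost."""
    requests_per_hour = fixed_cost_per_hour / cost_per_million_requests * 1_000_000
    return requests_per_hour / 3600


if __name__ == "__main__":
    # Hypothetical: $1.00/hr of always-on capacity vs $4.00 per million requests.
    print(f"{break_even_rps(1.00, 4.00):.1f} req/s")
```

With those hypothetical inputs the lines cross at roughly 69 req/s sustained. Below that, pay-as-you-go is cheaper; above it, the fixed-cost setup wins, provided it can actually serve the load.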


Hmmm what about Application Load Balancer to Lambda??? Cut api gateway out..


This is possible?!


Yup, it means your requests aren't limited to the 29 seconds that API Gateway enforces. And you don't have to deal with fiddly config. So if you're using something like ASP.NET Core it's insanely easy to set up a website in a lambda now.
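For reference, a Lambda behind an ALB target group returns a slightly different response shape than one behind API Gateway. A minimal Python sketch (the comment above is about ASP.NET Core; Python is swapped in here, and the echoed fields are just for illustration):

```python
import json


def handler(event, context):
    """Lambda handler for an ALB target group (no API Gateway involved)."""
    body = {"path": event.get("path"), "method": event.get("httpMethod")}
    # Response shape for a Lambda function registered as an ALB target.
    return {
        "statusCode": 200,
        "statusDescription": "200 OK",
        "isBase64Encoded": False,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

The ALB passes the HTTP method, path, headers, and body in the event, so the function can serve ordinary web requests directly.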


There's something I don't quite get. If you set up a website with ASP.NET Core or Node, and you're using async IO everywhere, doesn't Lambda restrict the process to one request at a time? If so, doesn't Lambda underutilize resources?


I guess the allocated resources would be underutilized, but being able to respond to web requests allows a single lambda to handle any number of different requests. For example, we had a client who wanted a SOAP service to interface with our system, so instead of creating a WCF service or something and hosting it in IIS, we used SoapCore, which is just middleware for accepting a SOAP action or sending the WSDL information back. So we have a small function (it's like 500kb) in Lambda accepting ~1-10 requests Monday through Friday for this client. And since Lambda only charges for the duration of execution, it ends up costing nothing as it falls into the free tier. So even if the resources are underutilized I don't see an issue. (Sorry for the wall of text, I'm on my phone.)


Hmm down voted... Would like to know why?


It was announced recently. It’s really convenient to slowly introduce lambda in a “classic” application.


Ah, yes I read this announcement. I had the impression it would still require API-Gateway.



I only learned this very recently and it’s saving us like 95% off of the (really unreasonably high) API Gateway costs.


Oh, it's also cheaper? Nice!


I'm confused how the author thinks it's ok that 10% of requests to Fargate are failing because they don't know how to configure Docker / Flask?


Helpful analysis. Thanks. I was planning to do something like this soon. I’m interested in the performance difference between the same Docker image running on Fargate vs ECS with a c5.large EC2 backing instance. My initial tests (a few months ago, before they started using Firecracker for Fargate) showed at least a 200% performance increase. I also noticed Fargate performance really depends on the processor AWS allocates to the Fargate task, and the processor is not always the same between identical tasks. It would be worth checking whether there are significant differences between Fargate tasks with different underlying CPU types.


only one of these has a reasonable local testing story


Of course you wouldn't rely on local testing, so that's moot.


This was so bad, I'm not even sure where to start. As general advice, when you see 4 cores/8GB doing ~100rps there is a serious problem somewhere. Every dev should know that this kind of instance doing that kind of benchmark (simple REST calls) should be in the ballpark of 4-5 digits of rps, not 3.


I think people are overly concerned with 'performance'.

So long as the approaches meet real world thresholds for specific things, particularly the speed of certain time-sensitive requests - then the technology is 'viable'.

(And let's also assume 'reliability' as a key component of 'viability')

Once the tech is 'viable' - it's really a whole host of other concerns that we want to look at.

#1 I think would be the ability of the tech to support dynamic and changing needs of product development.

Something that is easy, fewer moving parts, smaller API, requires less interference and support from Devops - this is worth a lot. Strategically - it may be the most valuable thing for most growing companies.

'Complexity' in all its various forms represents a kind of constant barrier, a force that the company is going to have to fight against to make customers happy. This is the thing we want to minimize.

Obviously, switching costs and the 'proprietary trap' are an issue, and of course 'total cost of operations', i.e. the cost of the services, is an important basis of comparison, but even the latter only becomes an issue later on for the company, once it reaches maturity. (i.e. a 'Dropbox' type company should definitely start in the cloud, and not until they have the kind of scale that warrants unit-cost scrutiny would they consider building their own infra.)

In the big picture of 'total cost of ownership', it's the ability of the system to meet the needs of product development, not 'how fast or cheap' it is, that's really the point.

Something that is 2x the cost, and 10% 'less performant' - but is very easy to use, requires minimal devops focus, and can enable feature iteration and easy scale - this is what most growing companies need.

Unless performance or cost are key attributes and differentiators of the product or service - then 'take the easy path'.


> For Fargate, this meant deploying 50 instances of my container with pretty beefy settings — 8 GB of memory and 4 full CPU units per container instance.

This is for a hello world application. This is insane. A single EC2 HVM instance would have better performance.

> Something that is 2x the cost, and 10% 'less performant' - but is very easy to use, requires minimal devops focus, and can enable feature iteration and easy scale - this is what most growing companies need.

Cost is close to 50x and Fargate is not necessarily lower overhead. Terraform + Packer + EC2 vs ECS + Docker + Fargate is pretty much the same for me. You still need to build images and manage the deployment lifecycle.


Why would you use " Terraform + Packer + EC2 vs ECS + Docker + Fargate" if you can just use AWS Lambda?


Because it's never just Lambda. It's the toolchain you use for deployment and orchestration, routing, etc. (eg, "Serverless Framework + Lambda + API Gateway")


Well 'serverless framework' really is just node.js or java. That's likely going to be used anyhow.

Lambda itself doesn't really take maintenance.

Routing can be done in the app, you only need one - or a small number of 'endpoints' so messing with API gateway can be minimized.

You can run any sized app with a single API Gateway endpoint, and a single lambda, on a simple node.js setup.

I can't see any reason to use containers on EC2 or containers on Amazon's container service until an app gains quite a degree of sophistication.
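As a sketch of "routing can be done in the app": one function behind one proxy endpoint, dispatching on method and path. The comment talks about node.js; Python is swapped in here, and the route names and payloads are hypothetical placeholders. It assumes an API Gateway proxy-style event carrying `httpMethod` and `path`.

```python
import json

ROUTES = {}


def route(method, path):
    """Register a function for a (method, path) pair."""
    def register(fn):
        ROUTES[(method, path)] = fn
        return fn
    return register


@route("GET", "/users")
def list_users(event):
    return {"users": []}  # placeholder payload


@route("GET", "/health")
def health(event):
    return {"ok": True}


def handler(event, context):
    """Single Lambda entry point: look up the route and call it."""
    fn = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if fn is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(fn(event))}
```

Adding an endpoint then means adding a decorated function, not touching the gateway config.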


CloudFormation, API Gateway, IAM, Secrets Manager. A production application requires much more than just a lambda. It is better to start from something more structured than shell scripts.


Great article. I’m interested in how the performance compares between the same Docker image running on Fargate vs a c5.large EC2 instance.


> First, I ran a small sample of 2000 requests to check the performance of new deploys. This was running at around 40 requests per second.

...

> When I ran my initial Fargate warmup, I got the following results. Around 10% of my requests were failing altogether!

....

> To remedy this, I decided to bump the specs on my deployments.

> The general goal for this bakeoff is to get a best-case outcome for each of these architectures, rather than an apples-to-apples comparison of cost vs performance.

> For Fargate, this meant deploying 50 instances of my container with pretty beefy settings — 8 GB of memory and 4 full CPU units per container instance.


This is what caught my eye too. I'm not familiar with the guts of Fargate, but this seems like very poor performance?


I don't see how this could be a symptom of Fargate as such, but rather of ECS. I mean, all Fargate does is add management features to the VMs running the containers. So if there's a problem with the runtime behavior of the containers, it seems like that would be something in the ECS infrastructure.


I sort of suspect the error wasn’t just about resources, since it continued after being given gobs of resources.


Having run a service on Fargate in prod before, I can almost guarantee that the errors he was seeing were 504s, which are caused by the server code and the ALB not having their request timeouts set to the same value.


It’s my understanding that you want to set the timeout in the server code to be higher than the idle connection timeout in the ALB. With them set to the same value, timeout errors will look different depending on whether the application or the ALB dropped the connection.
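For a gunicorn-based setup like the one in the article, that rule might look like the config sketch below. The 60-second figure is the ALB's default idle timeout; the 15-second margin is an arbitrary choice, and this assumes a worker class that honors `keepalive` (gunicorn's plain sync worker does not keep connections alive, and I haven't checked how meinheld treats it).

```python
# gunicorn.conf.py (sketch)
ALB_IDLE_TIMEOUT_S = 60  # ALB default idle timeout; adjust if yours differs

# Keep idle keep-alive connections open longer than the ALB does, so the
# load balancer (not the app) is the side that closes them.
keepalive = ALB_IDLE_TIMEOUT_S + 15

# Workers silent for longer than this are killed and restarted
# (gunicorn's default is 30).
timeout = 30

assert keepalive > ALB_IDLE_TIMEOUT_S, "app keep-alive must outlast the ALB idle timeout"
```

gunicorn ignores names it doesn't recognize in the config file, so the helper constant is harmless there.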


I agree. This tripped me up when I first started using a fargate service behind an ALB. Easy fix.


Author here. Yep, that was my suspicion as well so I didn't hold it against Fargate. I was more looking for a reasonable expectation of performance, so I just threw a ton of resources at it to rule out as much as possible.

One of the AWS container advocates mentioned it was likely something like the nofile ulimit: https://twitter.com/nathankpeck/status/1098992994131283968
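If you want to check the nofile theory from inside a container, the limit is easy to read at startup. A small sketch using Python's standard `resource` module (Unix only); `nofile_limits` is just a helper name for this example.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard}")
```

Every open socket uses a file descriptor, so a low soft limit caps how many concurrent connections a process can hold regardless of how much CPU or memory the task has.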



