
AWS API Performance Comparison: Serverless vs. Containers - abd12
https://www.alexdebrie.com/posts/aws-api-performance-comparison/
======
bastawhiz
Perhaps I misread, but it sounds like the author never managed to get Fargate
to run with less than a 3% failure rate?

That to me sounds like an astoundingly high number, especially for the amount
of resources dedicated to the test (400GB of memory and 200 CPUs). Hell, if
you're monitoring uptime, that's one nine (97% success).

I can't believe that this is a well-configured setup. Running 50 instances
with four CPUs each at a total of ~100rps means that half of the CPUs are
probably doing exactly nothing (and if my understanding is correct and they're
using Flask in a single-threaded way, 150 of the 200 CPUs are going to be
idle).

Triggering SNS is an API call. Assuming that's all the test application is
doing, you hardly need one server to do this. I'd bet that you could make 100
simultaneous API calls from a stock Macbook Pro with a small handful of Node
or Go processes without even making your CPU spin up.
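
As a rough illustration (a sketch, assuming Node with the v3 AWS SDK; the
topic ARN is hypothetical), 100 concurrent SNS publishes are barely any work
for a single process:

    import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

    const client = new SNSClient({ region: "us-east-1" });
    // Hypothetical topic ARN; substitute your own.
    const topicArn = "arn:aws:sns:us-east-1:123456789012:my-topic";

    async function main() {
      // Fire 100 publishes concurrently. Each one is a small HTTPS call
      // that spends nearly all of its time waiting on network I/O.
      const results = await Promise.allSettled(
        Array.from({ length: 100 }, (_, i) =>
          client.send(new PublishCommand({ TopicArn: topicArn, Message: `msg ${i}` }))
        )
      );
      console.log(results.filter((r) => r.status === "fulfilled").length, "succeeded");
    }

    main();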

If Fargate can't handle 100rps (to make a single API call per request) with
<10 instances, it's a useless product. But I find it hard to believe that
Amazon would put something so absolutely incapable into the wild. With the
specs the author put up, that's the equivalent of ~$10/hr (if I'm reading
their pricing page correctly). You could run 380 A1 instances, 58 t3.xl
instances, or two m5a.24xlarge instances for that price.

~~~
jasonlotito
> If Fargate can't handle 100rps (to make a single API call per request) with
> <10 instances,

It can. Easily. I've done it.

> I can't believe that this is a well-configured setup.

It's not.

The author hasn't provided anywhere near enough information to draw any
useful conclusions from this. I don't know why there were errors (the author
doesn't give any precise information), how long the tests were run, the state
of the instances before the tests were run, what operations were being
performed, how the testing was being performed, or numerous other things.

Simply put, it's not useful.

------
physcab
I tend to think of these services as fulfilling different use cases.

For Fargate, the ideal scenario in my workloads is async background
processing. Add tasks to SQS, Fargate pulls tasks off and does the job.
Elastically scales up or down and lots of flexibility on machine specs. Ok
with some failure rate.
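
A minimal sketch of that pull-based worker (assuming Node with the v3 SQS
client; the queue URL and doJob are hypothetical):

    import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

    const sqs = new SQSClient({ region: "us-east-1" });
    const queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"; // hypothetical

    async function doJob(body?: string) {
      // ... the actual work goes here ...
    }

    async function poll() {
      for (;;) {
        // Long-poll for up to 10 messages at a time.
        const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
          QueueUrl: queueUrl, MaxNumberOfMessages: 10, WaitTimeSeconds: 20,
        }));
        for (const msg of Messages) {
          await doJob(msg.Body);
          // Delete only after success; failures let the message reappear.
          await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle! }));
        }
      }
    }

    poll().catch(console.error);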

For AWS Lambda, recently I like the combo of adding a Cloudflare worker in
front and using it as an API gateway. More flexibility on routing, faster
performance, and reverse proxy for free which can be good for SEO. And you get
all the goodies of Cloudflare like DDoS protection and the CDN.

~~~
gorbypark
Do you have any details about using Cloudflare Workers as an API gateway?

~~~
physcab
Check out the boilerplate here for workers:

[https://github.com/detroitenglish/cloudflare-worker-webpack-boilerplate/blob/master/README.MD](https://github.com/detroitenglish/cloudflare-worker-webpack-boilerplate/blob/master/README.MD)

Then invoke lambda with this:

[https://github.com/mhart/aws4fetch](https://github.com/mhart/aws4fetch)

Important to note that workers have a 15s timeout, so this is really only good
for routing. You probably don’t want this to manage tasks that could
potentially take longer.
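
A minimal sketch of that Worker-to-Lambda hop (assuming aws4fetch; the
function name, region, and secret bindings are hypothetical, and the event
types come from @cloudflare/workers-types):

    import { AwsClient } from "aws4fetch";

    // Hypothetical secrets bound to the Worker script.
    declare const AWS_KEY: string;
    declare const AWS_SECRET: string;

    const aws = new AwsClient({ accessKeyId: AWS_KEY, secretAccessKey: AWS_SECRET });

    addEventListener("fetch", (event: FetchEvent) => {
      event.respondWith(handle(event.request));
    });

    async function handle(request: Request): Promise<Response> {
      // aws4fetch signs the request (SigV4) and forwards it to the
      // Lambda Invoke API; the Worker acts as the routing layer.
      const lambdaUrl =
        "https://lambda.us-east-1.amazonaws.com/2015-03-31/functions/my-api/invocations";
      const payload = JSON.stringify({ path: new URL(request.url).pathname });
      return aws.fetch(lambdaUrl, { method: "POST", body: payload });
    }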

~~~
kentonv
> Important to note that workers have a 15s timeout

Not true -- the timeout on outgoing HTTP requests is (I think) 100 seconds (or
unlimited as long as data is streaming).

The 15-second limit you may be thinking of is that Workers previously
wouldn't let you start new outgoing HTTP requests more than 15 seconds into
the event, though already-started requests could continue. This limit was
recently removed -- instead, Workers now cancels outgoing requests if the
client disconnects, but as long as the client is connected, you can keep
making new requests. This was changed to support streaming video use cases
where a stream is assembled out of smaller chunks.

(I'm the tech lead for Workers.)

~~~
physcab
Didn’t know that! Thanks for chiming in.

------
zokier
This benchmark is all over the place; there are way too many variables
changing to draw any sort of conclusion. You have a flask-gunicorn-meinheld
stack on the Fargate side, all bits that might impact performance. Then there
are the failing requests, which alone would disqualify the comparison. I would
also assume ApiGW to be slower than a plain ELB, although I don't know for
sure, and this comparison certainly didn't give any information about that. No
comparison of different lambda/fargate instance types, nor any mention of
costs. Obviously just throwing more money at it should get better perf, so
perf/$ is sort of important.

------
squid3
Unless I misread the article, the writer claims to need about 50 instances to
support 100 req/sec: "For Fargate, this meant deploying 50 instances of my
container with pretty beefy settings — 8 GB of memory and 4 full CPU units
per container instance." If the API service was just returning some data with
a few manipulations, then that is quite inefficient. At NodeChef (
[https://www.nodechef.com](https://www.nodechef.com) ), there are users
running over 1000 req/sec with just around 12 containers, each with 512 MB
RAM and 2 CPUs.

~~~
abd12
Original author here.

The two sentences right before what you quoted are helpful:

"I’m not a Docker or Flask performance expert, and that’s not the goal of this
exercise. To remedy this, I decided to bump the specs on my deployments.

The general goal for this bakeoff is to get a best-case outcome for each of
these architectures, rather than an apples-to-apples comparison of cost vs
performance."

I wasn't trying to squeeze out every ounce of performance and determine the
minimum number of instances to handle 100 req/sec. I was trying to normalize
across the three patterns as much as possible to see best-case performance. I
didn't want resource constraints to be an excuse.

~~~
CaveTech
I think there’s a pretty big flaw in your benchmark because that failure rate
is _insane_ and you shouldn’t need anywhere near that hardware to accomplish
this. I don’t think your data is credible as a result.

------
Dunedan
As usual with such benchmarks, I miss a performance comparison over time. Did
Fargate outperform the other two solutions because it's generally faster, or
because API Gateway was just slow during the few minutes the benchmark took?

What'd also be interesting here would be a price comparison. Without having
done the math I'd expect Fargate to be significantly more expensive than the
other solutions, which'd make a nice trade-off of cost vs. performance: If
performance matters choose Fargate, if cost matters choose API Gateway as
service proxy.

~~~
abd12
Author here. I'd say this was about in line with what I'd see from using API
Gateway and Lambda.

I thought about doing a price comparison, but that gets really tricky. At some
point, you're testing the skills of the tester more than the services
themselves.

Agree with your expectations though. I think Fargate is naturally faster and
can be more so given the number of knobs you can tune. This will likely cost
you more, both in direct resource costs and in engineering time spent
fine-tuning those knobs. Whether that's worth it depends on your business needs.

------
philliphaydon
Hmmm what about Application Load Balancer to Lambda??? Cut API Gateway out..

~~~
k__
This is possible?!

~~~
philliphaydon
Yup, it means your requests aren't limited to the 29-second timeout that API
Gateway enforces. And you don't have to deal with fiddly config. So if you're
using something like ASP.NET Core, it's insanely easy to set up a website in
a Lambda now.
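
For reference, a minimal sketch of the handler shape behind an ALB target
group (TypeScript here rather than ASP.NET Core; the types come from the
aws-lambda typings package):

    import { ALBEvent, ALBResult } from "aws-lambda";

    // The ALB forwards the raw HTTP request as an event; the handler
    // returns an HTTP-shaped result, with no API Gateway in the path.
    export const handler = async (event: ALBEvent): Promise<ALBResult> => {
      return {
        statusCode: 200,
        statusDescription: "200 OK",
        isBase64Encoded: false,
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ path: event.path, method: event.httpMethod }),
      };
    };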

~~~
pmoleri
There's something I don't quite get. If you set up a website with ASP.NET
Core or Node and you're using async IO everywhere, doesn't Lambda restrict
the process to one request at a time? If so, doesn't Lambda underutilize
resources?

~~~
philliphaydon
I guess resource allocation would be underutilized, but being able to respond
to web requests allows a single Lambda to execute for any number of different
requests. For example, we had a client who wanted a SOAP service to interface
with our system, so instead of creating a WCF service or something and
hosting it in IIS, we used SoapCore, which is just middleware for accepting a
SOAP action or sending the WSDL information back. So we have a small app
(it's like 500kb) in a Lambda accepting ~1-10 requests Monday-Friday for this
client. And since Lambda only charges for execution time, it ends up costing
nothing, as it falls into the free tier. So even if the resources are
underutilized, I don't see an issue. (Sorry for the wall of text, I'm on my
phone.)

------
kbar13
I'm confused how the author thinks it's OK that 10% of requests to Fargate
are failing because they don't know how to configure Docker/Flask?

------
zleman
Helpful analysis. Thanks. I was planning to do something like this soon. I'm
interested in the performance difference between the same Docker image running
on Fargate vs. ECS with a c5.large EC2 backing instance. My initial tests (a
few months ago, before they started using Firecracker for Fargate) showed at
least a 200% performance difference. I also noticed Fargate performance
really depends on the processor AWS allocates to the Fargate task, and the
processor is not always the same between identical tasks. It would be worth
checking if there are significant differences between Fargate tasks with
different underlying CPU types.

------
alexnewman
only one of these has a reasonable local testing story

~~~
optimuspaul
of course you wouldn't rely on local testing, so that's moot.

------
Thaxll
This was so bad, I'm not even sure where to start. As general advice: when
you see 4 cores/8 GB doing ~100 rps, there is a serious problem somewhere.
Every dev should know that this kind of instance on this kind of benchmark
(simple REST calls) should be in the ballpark of 4-5 digits of rps, not 3.

------
sonnyblarney
I think people are overly concerned with 'performance'.

So long as the approaches meet real world thresholds for specific things,
particularly the speed of certain time-sensitive requests - then the
technology is 'viable'.

(And let's also assume 'reliability' as a key component of 'viability')

Once the tech is 'viable' - it's really a whole host of other concerns that
we want to look at.

#1 I think would be the ability of the tech to support dynamic and changing
needs of product development.

Something that is easy, fewer moving parts, smaller API, requires less
interference and support from Devops - this is worth a lot. Strategically - it
may be the most valuable thing for most growing companies.

'Complexity' in all its various forms represents a kind of constant barrier, a
force that the company is going to have to fight against to make customers
happy. This is the thing we want to minimize.

Obviously, issues such as switching costs and the 'proprietary trap' are an
issue, and of course 'total cost of operations', i.e. the cost of the
services, is an important basis of comparison, but even the latter only
becomes an issue later on, once the company reaches maturity. (i.e. A
'Dropbox' type company should definitely start in the cloud, and not consider
building its own infra until it has the kind of scale that warrants unit-cost
scrutiny.)

In the big picture of 'total cost of ownership' - it's the ability of the
system to meet the needs of product development, not 'how fast or cheap' it
is, that's really the point.

Something that is 2x the cost and 10% 'less performant' - but is very easy
to use, requires minimal devops focus, and can enable feature iteration and
easy scale - this is what most growing companies need.

Unless performance or cost are key attributes and differentiators of the
product or service - then 'take the easy path'.

~~~
Chyzwar
> For Fargate, this meant deploying 50 instances of my container with pretty
> beefy settings — 8 GB of memory and 4 full CPU units per container instance.

This is for a hello-world application. This is insane. A single EC2 HVM
instance would have better performance.

> Something that is 2x the cost, and 10% 'less performant' \- but is very easy
> to use, requires minimal devops focus, and can enable feature iteration and
> easy scale - this is what most growing companies need.

Cost is close to 50x, and Fargate is not necessarily lower overhead. Terraform
+ Packer + EC2 vs. ECS + Docker + Fargate is pretty much the same for me. You
still need to build images and manage the deployment lifecycle.

~~~
sonnyblarney
Why would you use " Terraform + Packer + EC2 vs ECS + Docker + Fargate" if you
can just use AWS Lambda?

~~~
bdcravens
Because it's never just Lambda. It's the toolchain you use for deployment and
orchestration, routing, etc. (e.g., "Serverless Framework + Lambda + API
Gateway").

~~~
sonnyblarney
Well 'serverless framework' really is just node.js or java. That's likely
going to be used anyhow.

Lambda itself doesn't really take maintenance.

Routing can be done in the app; you only need one - or a small number of -
'endpoints', so messing with API Gateway can be minimized.

You can run any sized app with a single API Gateway endpoint, and a single
lambda, on a simple node.js setup.

I can't see any reason to use containers on EC2 or containers on Amazon's
container service until an app gains quite a degree of sophistication.

~~~
Chyzwar
CloudFormation, API Gateway, IAM, Secrets Manager. A production application
requires much more than just a Lambda. It is better to start from something
more structured than shell scripts.

------
denart2203
Great article. I'm interested in how the performance compares between the
same Docker image running on Fargate vs. a c5.large EC2 instance.

------
scottfr
> First, I ran a small sample of 2000 requests to check the performance of new
> deploys. This was running at around 40 requests per second.

...

> When I ran my initial Fargate warmup, I got the following results. Around
> 10% of my requests were failing altogether!

...

> To remedy this, I decided to bump the specs on my deployments.

> The general goal for this bakeoff is to get a best-case outcome for each of
> these architectures, rather than an apples-to-apples comparison of cost vs
> performance.

> For Fargate, this meant deploying 50 instances of my container with pretty
> beefy settings — 8 GB of memory and 4 full CPU units per container instance.

~~~
alexeldeib
This is what caught my eye too. I'm not familiar with the guts of Fargate, but
this seems like very poor performance?

~~~
reilly3000
I sort of suspect the error wasn't just about resources, since it continued
after the service was given gobs of resources.

~~~
vanrysss
Having run a service on Fargate in prod before, I can almost guarantee that
the errors he was seeing were 504s, which are caused by the server code and
the ALB not having their request timeouts set to the same value.

~~~
lukeck
It’s my understanding that you want to set the timeout in the server code to
be higher than the idle connection timeout in the ALB. With them set to the
same value, timeout errors will look different depending on whether the
application or the ALB dropped the connection.
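
For a Node backend, that ordering might look like this (a sketch; 60 seconds
is the ALB's default idle timeout, and these are standard Node http.Server
settings):

    import http from "http";

    const server = http.createServer((req, res) => {
      res.end("ok");
    });

    // Keep idle connections open longer than the ALB's 60s idle timeout,
    // so the ALB (not the app) decides when to drop a quiet connection.
    server.keepAliveTimeout = 65000;
    // headersTimeout should exceed keepAliveTimeout, or Node may still
    // tear down connections early.
    server.headersTimeout = 66000;

    server.listen(8080);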

