
Serverless Performance: Cloudflare Workers, Lambda and LambdaEdge - a17anxx
https://blog.cloudflare.com/serverless-performance-comparison-workers-lambda/?hn
======
jhgg
For what it's worth at work (Discord) we serve our marketing page/app/etc...
entirely from the edge using Cloudflare workers. We also have a non-trivial
amount of logic that runs on the edge, such as locale detection to serve the
correct pre-rendered sites, and more powerful developer only features, like
"build overrides" that let us override the build artifacts served to the
browser on a per-client basis.
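A minimal sketch of how a per-client build override could work (the cookie name `build_override` and the asset path layout are hypothetical, not Discord's actual implementation):

```javascript
// Hypothetical sketch of a per-client "build override": an internal
// cookie selects which build's assets the edge serves. Cookie name and
// path layout are made up for illustration.
function assetBaseForRequest(cookieHeader, defaultBuild) {
  const cookies = Object.fromEntries(
    (cookieHeader || "")
      .split(";")
      .map((c) => c.trim().split("="))
      .filter((pair) => pair.length === 2)
  );
  const build = cookies["build_override"] || defaultBuild;
  return `/assets/${build}/`;
}
```

Sharing an override is then just a link that sets the cookie; clearing it falls back to the release channel's default build.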

This is really useful to test new features before they land in master and are
deployed out - we actually have tooling that lets you specify a branch you
want to try out, and you can quickly have your client pull built assets from
that branch - and even share a link that someone can click on to test out
their client on a certain build. Just the other week, we shipped a non-obvious
bug, and I was able to bisect the troubled commit using our build override
stuff.

The worker stuff is completely integrated into our CI pipeline, and on a usual
day, we're pushing a bunch of worker updates to our different release
channels. The backing-store for all assets and pre-rendered content sits in a
multi-regional GCS bucket that the worker signs + proxies + caches requests
to.

We build the worker's JS using webpack, and our CI toolchain snapshots each
worker deploy, and allows us to roll back the worker to any point in time with
ease.

I wrote probably the stupidest worker script too, that streams a clock to your
browser using chunked requests:
[https://jake.lol/clock](https://jake.lol/clock) (no javascript required).
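One plausible way to do this (a guess at the approach, not the actual script): a ReadableStream that emits one timestamp chunk per tick, which a Worker would hand to `new Response(stream)` so the client renders each chunk as it arrives.

```javascript
// Sketch of a chunked "clock" response body: a stream that emits one
// timestamp chunk per tick. In a Worker, you'd return
// `new Response(stream)`; here the stream is just read locally.
function clockFrame(date) {
  // "\r" rewinds to the start of the line so each tick overwrites the
  // previous one in clients like curl.
  return "\r" + date.toISOString();
}

function clockStream(ticks, intervalMs) {
  let sent = 0;
  return new ReadableStream({
    pull(controller) {
      if (sent >= ticks) return controller.close();
      sent += 1;
      controller.enqueue(clockFrame(new Date()));
      // Delay the next pull until the next tick.
      return new Promise((resolve) => setTimeout(resolve, intervalMs));
    },
  });
}
```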

~~~
kentonv
At Cloudflare we're all in awe at how fast
[https://discordapp.com](https://discordapp.com) loads... nice work.

~~~
jhgg
Thanks! And thanks for building Cloudflare workers. You have no idea how much
nicer it is to write and express a lot of this logic in modern javascript,
rather than nginx configs and VCLs.

------
jedberg
The value of any function-as-a-service is the ecosystem within which it sits.
Pretty much all of them are the same: upload your code, we will run it.

The value comes from 1) What can trigger that code to run and 2) What services
that code can interact with.

And on those two points, AWS still wins hands down. They have by far the most
possible triggers for Lambda, and they have by far the most services that
Lambda can interact with.

It's cool that Cloudflare built something faster, but unless you're running in
a vacuum, speed is the least of your concerns.

~~~
zackbloom
Yes, that's how Amazon creates lock-in. But it depends what you're doing with
it right? If you are looking to run code based on a SQS event, yes you have to
use a Lambda. If you are looking to execute code when something visits a URL
you have more options.

~~~
greenail
The lockin argument alone is a red herring. Every technology implementation
creates lockin. The valid question is how hard is something to change. A good
architecture balances how easy it is to change something with how optimized it
is, also balancing how much it costs to build and maintain.

Realistically you can get as locked into Amazon as you want; Lambda alone does
not create inescapable lock-in by any measure, so I would argue Jeremy still
has a point in that tools become more useful when you can use them to do more
work (ecosystem)...

~~~
1996
Amazon performance is generally bad when using other services.

We are rolling out a CDN, with a goal of 20 ms latency in most countries. We
want more granularity than AWS offers - and some zones are just not well
served (no AWS in Africa, an incomplete offering in Brazil, etc.)

Still, we figured we would use Route 53 as you can do Latency Based Routing
even with non-AWS servers. Computing latency or using EDNS0 as a proxy is not
rocket science, so we thought the DNS would not be a limiting point.

Oh boy, how wrong we were! After wrongly blaming the bad performance on
Cloudflare caching, further tests revealed Route 53 takes as much as 0.7s to
reply to some DNS queries - and even worse when fronted by Cloudflare, as for
some reason the DNS TTL seems to be ignored by Cloudflare. The latency only
drops down after about 4 queries, which makes me think they have some kind of
Round-Robin that does not share the DNS results (I could be wrong).

In the article, the author says: "Most of that delay is DNS however
(Route53?). Just showing the time spent waiting for a response (ignoring DNS
and connection time)". No, you should not ignore the DNS delays! Route53
performance is very bad - 2 full seconds in your case!!

We are fortunate it did not take 2s for us. Still, having servers all over the
world that reply in 20 ms is useless when the first DNS query takes 700ms.

We ended up leaving for Azure: Traffic Manager outperforms Route 53 by a
factor of 2.

Eventually, we will roll our own GeoIP with DNS resolvers on a anycast subnet.

I do not understand how this level of "performance" can be tolerated. At 2
seconds for a DNS query, you are better off using the registrar free DNS
service!!

~~~
mayank
Saying Route53 takes “2 seconds” to resolve is pretty meaningless without a
distribution or at least percentiles. Route53 obviously doesn’t take 2 seconds
for all or most queries.

~~~
1996
I am quoting the author, and their analysis of the initial query. This
observation from whoever wrote the article matches my own experimental
results: initial queries are very very slow on Route 53 LBR. A distribution of
queries is useless and misleading, as later queries are cached if you have a
sufficient TTL - so only the first few really matter in the performance
results.

Later queries are fast of course, as the results are cached (TTL).

Even if the DNS is very poorly configured, all queries after the first one
will benefit from the cache!! So the first few queries matter much more, and
this is what we should be talking about instead of distributions and
percentiles.

Said differently: if each of your visitors has to wait a second or two until
the site comes up the first time, it may still give them a bad impression even
if the site works normally afterwards.

I measured the DNS delay on first Route53 reply to be over 700 ms personally.
For the author it is 2000 ms. These results are in the same order of
magnitude, and make Route53 unsuitable for many applications. Of course, you
could start hacking, like keeping Amazon's cache warm by issuing queries
through cron, or by setting extremely long TTLs and hoping your visitors' DSL
modems will keep your A records in cache as long as you asked for - but these
are just hacks trying to compensate for the fact that the first DNS query
takes SECONDS to process.

Route53 LBR DNS is not advertised as "slow and requiring hacks". It's supposed
to be fast, simple to run, and to integrate with different ecosystems. To me,
it seems to be none of that.

After assessing Route53 as fubar, I switched from AWS to Azure: TrafficManager
offers the same features, and the first request takes less than 350ms. There
must still be some cruft in there, but at least it is manageable.

------
kentonv
As the architect of Workers I was obviously pretty happy with Zack's results
in general. But, I'm not happy with the tail latency (99th percentile), even
if it beats the competition. I suspect this has to do with GC pauses. The
solution may be to proactively run GC in a background thread between requests.
For high-traffic workers that are always processing requests, we could load
multiple instances of the worker and alternate between them.

BTW, if you're into modern C++ and this kind of work interests you, please
e-mail me at kenton at cloudflare.com. We're hiring!

~~~
1996
You are from Cloudflare? Could you tell me why the replies to geoip/latency
based routing CNAMEs do not seem to be cached by Cloudflare?

The setup is: domain.com -> geoiplbr.domain.com with cloudflare caching
enabled. Nothing else that is fancy and could cause delays.

If I measure the TTFB for domain.com, I see a large DNS delay until about the
4th consecutive query - and then the DNS is no longer the limiting factor.

The same measures on geoiplbr.domain.com normalize after the 2nd query.

It seems to me you have some kind of Round Robin going on that does not share
the DNS results.

Or maybe the caching is not done at the POP level?

~~~
kentonv
Sorry, I work on Workers, not DNS, so I honestly don't know.

------
skunkworker
This seems to disregard some of the other factors that make Lambda >
Cloudflare Workers. We run binaries on our Lambda instances alongside a
Go-based function. Since Lambda allows for up to 250MB of binaries, 3GB of RAM
and a 30s maximum, we can run computationally and RAM-heavy workloads without
worrying about our instance being killed off.

Also, I looked into using Cloudflare Workers to write my own custom edge CDN,
but they currently don't allow you to change where in the call requests are
processed, or let you tell Cloudflare what to cache vs. not cache. If they
added functionality that let you easily write your own multi-layered CDN, that
would be interesting.

~~~
zackbloom
It’s worth pointing out that Amazon charges more than 10x what a Worker costs
on a per-execution basis to use it as you describe, for just 100ms of compute.
If you’re actually using 30s it’s probably very expensive indeed.

The statements in the second paragraph are fortunately incorrect. With the
exception of some security features Workers totally takes over the incoming
request. It can use flags in its subrequests to configure the cache as you
need, and will soon have access to the raw Cache API.
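As an illustration of those subrequest flags: a Worker can pass per-request cache hints via the non-standard `cf` property on its fetch options. The flag names below follow the Workers documentation as I understand it; treat this as a sketch.

```javascript
// Sketch of per-subrequest cache control from a Worker. `cf` is a
// Cloudflare-specific extension to the fetch options object.
function cachedFetchOptions(ttlSeconds, cacheEverything) {
  return {
    cf: {
      cacheTtl: ttlSeconds,             // override the edge cache TTL
      cacheEverything: cacheEverything, // cache normally-uncacheable responses too
    },
  };
}
```

Inside a Worker this would be used roughly as `fetch(request, cachedFetchOptions(300, true))`.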

~~~
skunkworker
Interesting, I'm glad to know that raw access to the Cache API is being added,
when I contacted Cloudflare about this a number of months ago at the time they
didn't support this. For my edge CDN needs I will reevaluate cloudflare
workers soon.

On the first paragraph: we have shifted some computationally heavy and
horizontally restricted functions from our own servers to Lambda, which allows
us to instantly scale to meet our non-consistent demand. With the Lambda
workers we are using, we are averaging 5 to 11s of execution time with
approximately 800MB of memory, and utilize the CPU heavily. If Cloudflare
Workers ever expanded to allow for a similar scope, I would definitely take a
second look.

------
poulpi
Have you thought about including a Golang-based Lambda function in your
benchmark?

As you're guessing that Cloudflare's superior JS runtime plays a big role, it
could be interesting to see if it can compete against a Golang Lambda as well.

~~~
kentonv
A lot of our performance benefit comes from lighter-weight sandboxing using V8
instead of containers, which makes it feasible for us to run Workers across
more machines and more locations. It wouldn't surprise me too much if a Worker
written in JS can out-perform a Lambda written in Go, as long as the logic is
fairly simple. But I agree we should actually run some tests and put up
numbers... :)

On another note, currently we only support JavaScript, but we're putting the
finishing touches on WebAssembly support, which would let you run Go on
Workers... stay tuned.

~~~
Matthias247
Just curious: The article mentions V8 isolates. Do you actually also run all
IO of the worker in the same isolate? Or in a different one, and the API calls
are bridged (via some webworker-like API)?

I guess one of the main challenges is that all resources are properly released
when a worker is shut down. Releasing memory sounds pretty easy, if V8 does it
for you. But releasing all IO resources might be a bit harder, especially if
they are shared between isolates.

~~~
kentonv
The Workers runtime itself is implemented in (modern) C++, not JavaScript. So,
there's no need for a separate isolate -- API objects are implemented directly
in C++.

In C++, memory and I/O resources are both managed through RAII. Of course,
when binding to JavaScript, we often end up at the mercy of the JavaScript GC
to let us know when an object is no longer reachable from JS, and the GC makes
no promises as to how promptly it will notice this (maybe never). That's fine
for memory (it amortizes out) but not for I/O resources. So we're back at the
original problem.

Luckily, in the CF Workers environment, it turns out that all I/O objects are
request-scoped. So, once a request/response completes, we can proactively
release all I/O object handles bound into JS during that request/response. If
JS is still holding on to those handles and calls them later, it gets an
exception.
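The request-scoped release described above can be sketched in plain JS (this is my own illustration of the pattern, not Cloudflare's implementation):

```javascript
// Sketch of request-scoped I/O handles: every handle handed to JS is
// registered to the current request, and revoked when the request
// completes. Calling a revoked handle throws instead of leaking I/O.
class RequestScope {
  constructor() {
    this.live = new Set();
  }

  // Wrap an I/O operation in a revocable handle.
  wrap(ioFn) {
    const scope = this;
    const handle = (...args) => {
      if (!scope.live.has(handle)) {
        throw new Error("I/O handle used after request completed");
      }
      return ioFn(...args);
    };
    this.live.add(handle);
    return handle;
  }

  // Called when the request/response finishes: proactively release all
  // I/O handles bound into JS during this request.
  finish() {
    this.live.clear();
  }
}
```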

~~~
Matthias247
Thanks for the explanation!

Yes, I guess part of my question was whether the destructors/finalizers that
the JS object bindings impose on the C++ side are called fast enough to
guarantee isolation and prevent resource leakage. Looks like in your case that
happens through the request scoping.

------
iamleppert
This really isn't a very good benchmark. It's basically only validating the
Cloudflare edge network, but the test itself is far from real-world. A service
that returns the current time is not doing anything practical and borders on
meaningless.

~~~
zackbloom
We have a post comparing CPU-intensive workloads that should be ready after
the American holiday. The summary is that a 128MB Lambda provides you with
roughly 1/8 of a CPU core, and is therefore about 8x slower than a Worker.

------
mrkurt
Comparing Workers to Lambda proper seems silly. Lambda lets you connect to
DBs, use a lot more than 128MB of memory, etc, etc, etc.

Comparing them to Lambda@Edge makes sense, but Lambda@Edge is not a very good
product.

(Full disclosure: my company competes with Cloudflare Workers.)

~~~
BillinghamJ
Could you expand on your opinion of Lambda@Edge? For my company's needs, it
has worked superbly.

~~~
mrkurt
Sure! I think it's fine infrastructure-wise, but the dev experience is awful.
I like tools you can build, test and run locally, for example.

The fundamental problem I run into with Lambda@Edge is just that their request
stages aren't a great abstraction (OpenResty/nginx has a similar problem). It
really limits what kinds of problems you can solve.

~~~
kentonv
> The fundamental problem I run into with Lambda@Edge is just that their
> request stages aren't a great abstraction (OpenResty/nginx has a similar
> problem). It really limits what kinds of problems you can solve.

Yes! I completely agree. Interesting that we both ended up with the Service
Workers API instead. I'm really hoping that Service Workers becomes the
standard for JavaScript HTTP handling in the future.
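A toy model of why that abstraction composes better: in the Service Workers style, one handler sees the whole request and produces the whole response, rather than hooking fixed proxy stages. Real Workers use `addEventListener('fetch', ...)`; the dispatcher below is simulated.

```javascript
// Toy model of the Service Workers abstraction. The dispatcher stands
// in for the real runtime; handlers get full request/response control.
const fetchHandlers = [];
function onFetch(handler) {
  fetchHandlers.push(handler);
}

function dispatchFetch(request) {
  let response;
  const event = {
    request,
    respondWith(r) {
      response = r;
    },
  };
  for (const handler of fetchHandlers) handler(event);
  return response;
}

// A Worker-style handler: routing, rewriting, and short-circuiting all
// live in ordinary code instead of fixed stages.
onFetch((event) => {
  const url = new URL(event.request.url);
  event.respondWith(
    url.pathname === "/time"
      ? { status: 200, body: new Date().toISOString() }
      : { status: 404, body: "not found" }
  );
});
```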

------
bufferoverflow
Does Cloudflare have a free DB of sorts, like Amazon's DynamoDB? Or can I
query Amazon's DynamoDB from the worker?

~~~
kentonv
Building out storage is my current focus. The challenge is that we want to
build something that actually utilizes our network of 151 locations today,
1000's of locations in the future. If your application has users on Mars (or,
New Zealand), you should be able to store their data at the Cloudflare
location on Mars (or, New Zealand) so that they can get to it with minimal
latency.

PS. If you're a storage expert and building a hyper-distributed storage system
interests you, e-mail me at kenton at cloudflare. We're hiring.

~~~
ranman
Let me know if you need help with the Mars location in the future. I can't
wait for AWS to open their utopia-planitia-1 region with SpaceX or BlueOrigin.

------
jbergstroem
I'm using Cloudflare Workers to "polyfill" client hints if they're missing,
with cookie logic. With their addition of being able to mutate cache keys via
edge workers, I find it an extremely powerful way to do everything from per-
device image optimization (or Google Data Saver or HiDPI support) to serving
different pages for the same URI based on your requirements (and storing this
in cache).
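A sketch of the cache-key idea (header and cookie names here are illustrative, not the actual setup): prefer a real Client Hints header, fall back to a cookie "polyfill" written by client-side script on a previous visit, and fold the result into the cache key so the same URI caches separately per device class.

```javascript
// Sketch of a client-hints-aware cache key. Names are illustrative.
function cacheKeyFor(url, headers) {
  const cookie = headers["cookie"] || "";
  const dpr = headers["dpr"] || (cookie.match(/dpr=(\d+)/) || [])[1] || "1";
  const saveData = headers["save-data"] === "on" ? "lite" : "full";
  // Same URI, different cache entries per device class.
  return `${url}#dpr=${dpr};${saveData}`;
}
```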

------
smoll
What Infrastructure as Code (IaC) options exist for Cloudflare Workers? AFAICT
neither Serverless nor Terraform support it. IaC is table stakes for any new
part of my tech stack, and I would prefer not to code it from scratch - unless
deployment/configuration is extremely easy to automate via CLI or something...

~~~
prdonahue
To expand on what Zack said, we're just about ready to merge in Cloudflare
Workers support to our golang SDK (see
[https://github.com/cloudflare/cloudflare-go/pull/188](https://github.com/cloudflare/cloudflare-go/pull/188)).

Once this is merged, it clears the way for us adding Terraform support (as
terraform-provider-cloudflare wraps cloudflare-go).

There's been lots of interest from our customers in being able to manage
Workers using Terraform, so it's high on the list.

------
kevan
>To be fair, comparing my Lambda, which only runs in us-east-1 (Northern
Virginia, USA), to a global service like Workers is a little unfair.

At least you acknowledge that it's a bit silly to use a global benchmark to
compare a global service with an intentionally-regionalized service.

~~~
zackbloom
Why would you run something in a single location if you can run it everywhere
for the same price though? It's not like Lambda is cheaper for being
centralized.

~~~
kevan
Isolating failure domains and complying with data residency requirements are a
couple reasons. Also, global reach usually means global blast radius if you
screw something up.

For the specific use case you tested workers on the edge absolutely make more
sense than lambda, but I think the headline is a bit click-baity.

------
speeq
I wish Cloudflare would offer some kind of key-value store with Workers,
something like Google Cloud Memorystore but globally distributed in all of
their PoPs - even if it's really limited like 32 MB RAM.

~~~
zackbloom
Would you like to be able to write to it from your Worker, or only read from
it? Can you tell us more about your use case?

Feel free to email zack [at] cloudflare.com directly if you like.

------
adreamingsoul
Can we see the code that was used for testing Lambda and Workers?

~~~
zackbloom
Yes! [https://github.com/cloudflare/worker-performance-examples/tree/master/time](https://github.com/cloudflare/worker-performance-examples/tree/master/time)

~~~
richardowright
Any plans to publish the results from the pbkdf2 version?

~~~
zackbloom
Yep, later this week!

------
mchahn
> The functions being tested simply return the current time

Not a very interesting benchmark. This would only measure net latency and
spin-up time.

~~~
zackbloom
I have a post which should come out later in the week which dives into the
performance with CPU-intensive workloads. tl;dr is that a 128MB Lambda is
about 8x slower than a Worker.

~~~
mayank
Can you also throw in some adversarial workloads? Simple proof of work using
node’s builtin crypto module would be a nice benchmark for V8 isolates vs
Lambda’s processes, and would go a long way convincing people that using
isolates is reliable in a shared setting relative to processes/containers.

~~~
zackbloom
Can you elaborate on this? How would doing crypto demonstrate the security of
the isolate?

~~~
mayank
Nail the CPU and ensure that other isolates on the same process/machine don’t
get starved. I meant isolation more than security.

~~~
kentonv
Different isolates can run concurrently on different threads, so pegging the
CPU in one isolate doesn't block any others. Also, if a worker spends more
than 50ms of CPU time on a request, we cancel it, terminating execution of the
worker even if it's in an infinite loop. (Almost all non-buggy Workers we've
seen in practice use more like 0ms-2ms per request. Note this is CPU time;
time spent waiting for network replies doesn't count.)
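The CPU-time budget can be illustrated in plain JS (the real runtime enforces this inside V8; this just shows the accounting, using Node's `process.cpuUsage` as a stand-in):

```javascript
// Illustration of a per-request CPU-time budget. The worker function
// receives a checkpoint to call inside hot loops; work exceeding the
// budget is cancelled by throwing.
function runWithCpuBudget(fn, budgetMs) {
  const start = process.cpuUsage();
  const checkpoint = () => {
    const used = process.cpuUsage(start);
    const elapsedMs = (used.user + used.system) / 1000; // µs → ms
    if (elapsedMs > budgetMs) {
      throw new Error("CPU time limit exceeded");
    }
  };
  return fn(checkpoint);
}
```

Note the budget counts CPU time, not wall time, so time spent waiting on the network is free, matching the 50ms rule described above.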

------
squark007
I would love to see fly.io included in these benchmarks - their product is
also very similar.

------
cordite
“Half a decade of experience” doesn’t sound like much these days.

~~~
zackbloom
That’s fair, the truth is some of the people here have been around since the
beginning of the internet, so four decades might be more fair. Unless you add
up everyone’s experience, then we’re in the millennia...

------
xstartup
I was working on an app which serves roughly 400*10^6 API requests/day.

One cool property of our app is that it's rarely updated and mostly read.

And the goal is to achieve the lowest possible latency at the edge.

It scales beautifully; I am not sure which other architecture could help us
keep this afloat with only 4 developers working on it.

So, we have a DynamoDB table which is replicated to multiple regions using
DynamoDB streams and Lambda.

For us, Lambda means achieving a lot without many developers and system
administrators but I understand that not all problems yield gracefully to this
pattern.

It seems using Cloudflare Workers to trigger our Lambda function instead of
API gateway could prove to be cheaper.

~~~
zackbloom
Would it be possible to read the data through the Cloudflare cache? If so,
your data and API would be replicated not just around the world but actually
within the majority of the world's ISPs. Based on our experiences with
1.1.1.1, Cloudflare is within a 20ms roundtrip of most people on earth.

