
Serverless: Cold Start War - kiyanwang
https://mikhail.io/2018/08/serverless-cold-start-war/
======
stephenr
Tech, to Business: So, hear me out guys. With the power of "The Cloud", we can
break our compute workload down to the function level, and have _them_ run as
a service for us, rather than say an entire VM, or even an entire container.

And because it's a "Cloud" service, we pay for what we use, so if there's no
workload for the functions to service, there's no cost. We just pay for the
time the tiny little container is active.

Business: Ok, that sounds like it can save us money, and you seem confident in
the technology, let's go with that.

1 Month Later:

Business: Why do all these actions take so much longer to complete? They used
to load instantly with a "Done" message, now they're measurably delayed to
respond.

Tech: Well, you see, the containers that run our functions stop running once
they've finished their work and it looks like there's nothing more for them to
do, so when new work comes in there's a delay while the container starts.

Business: You mean like how my Dell takes 5 minutes to boot up?

Tech: Well, kind of, but it's much quicker than that obviously, and it's not a
whole operating system, it's just some processes in a process namespace...

Business: <visibly zoning out>

Tech: Ok, well, tell you what, we can solve this: we can set up something to
periodically ping the system so the containers running our functions don't
ever get de-activated, and that way they'll be ready to service requests
immediately, all the time.

Business: OK, great, that sounds like what we want.

1 Month Later:

Finance: Why did our bill for "functions" suddenly spike in the last month,
compared to when we started using it?

Tech: Well you see now we have to keep the function containers running 24x7 so
they're quick to respond, because $Business complained they're too slow to
start up from inactive state.

Tech Onlookers: ..... <blink>.... Wat... Why... Why would you do that?

(Edited to add:)

Tech Entrepreneur: I can outsource that process of keeping your function
containers active for you, for just $1/container/month!

~~~
TrueTeller
That's all plausible except the price issue. Say, you keep 10 containers
alive, so you make 10x 100ms calls every 5 minutes. That's gonna be $0.20 per
month. 20 cents.
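
Back-of-the-envelope, in code (a sketch assuming 2018-era Lambda list prices
and a 1 GB function; both figures are my assumptions, not from the article):

    // Keep-warm cost: 10 containers pinged every 5 minutes.
    // Assumed pricing: $0.20 per 1M requests, $0.00001667 per GB-second,
    // 100ms billed per ping, 1 GB memory per function.
    const containers = 10;
    const pings = containers * (60 / 5) * 24 * 30;       // 86,400 pings/month
    const requestCost = (pings / 1e6) * 0.20;            // ~$0.02
    const computeCost = pings * 0.1 * 1.0 * 0.00001667;  // ~$0.14 of GB-seconds
    console.log(`$${(requestCost + computeCost).toFixed(2)}/month`); // $0.16/month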

~~~
village-idiot
Internal math at my org says that Lambdas cost about 20% more than an
equivalent amount of EC2 power at full utilization. Now, hitting full
utilization is very hard, but the point is that Lambdas are only cheaper
compared to underutilized EC2 hosts.

~~~
weego
That's ultimately a misuse of the technology though, right? Barring a freak,
business-changing spike in usage due to sudden exposure or whatever, Lambdas
are not critical-path suitable and don't seem to have been implemented with
that intent.

For example, if your sign-up process runs on Lambda, you got your cost-benefit
analysis wrong on anything other than a prototype or MVP. But if your 'change
your profile photo' action, something that happens maybe a handful of times
per user per month at best, is implemented as a Lambda to reduce load and
delay scaling needs on your core infrastructure, then that feels like you did
it right.

------
acjohnson55
I'm still a bit puzzled by the hype around FaaS. It seems like a useful tool
for things where you don't want major queuing under pressure, but you can
tolerate human perceptible delays. But it also seems easy to build a big ball
of mud deeply tied to the nuances of the chosen FaaS provider. It just seems
like most use cases are probably going to be just fine with more conventional
horizontal scaling techniques.

But I think I'm missing something, so I'd love to hear from someone who can
paint this picture for me.

~~~
biztos
I'm pretty new to it but so far I think the big selling points in a corporate
environment are:

1. (Maybe, someday) no real Ops work.

2. 100% utilization (at a cost).

3. Measurability of cost in #2.

I agree about the ball-of-locked-in-mud danger. And so far I've seen the Ops
part actually be a bigger issue than it was before, because there's so much
opaque, badly documented madness involved. With AWS Lambda + API Gateway
anyhow; I can't speak to the others.

Even so, it seems like long-term the Ops end of the puzzle can be
planned/automated away for the most part. That leaves utilization and billing.

It can be very very useful to say "This costs exactly $X and that cost scales
linearly" when planning resource allocations. Even if everyone agrees it could
probably be done for an unknowable amount less. The predictability and the
isolation of the cost is sometimes worth spending more money. (Of course the
predictability requires that the Ops Ninjas not be required, which again I
think is possible long-term but definitely not short-term.)

Anyway that's just for one realm of application, which I think is getting more
popular ("do this internal thing we used to have some EC2's do").

~~~
acjohnson55
The no-ops part is compelling, but we've already had that for years in the
form of PaaS for small teams and K8s for large ones.

I think the billing part sounds a bit like a trap to me. At least while FaaS
retains so many quirks, it's like trading billing for the engineering time of
contorting a business process to a given FaaS model (e.g. trying to eke out
the lowest average and worst-case latency; also working within language
runtime limitations).

I'm pretty sure ops will never be automated away; it can only be transmuted to
a different form :). But, at best, you can achieve elegant separation of ops
and process concerns. That's where I worry FaaS could be a bit of a hazard, if
not carefully utilized.

~~~
biztos
> ops will never be automated away; it can only be transmuted to a different
> form

Precisely, and the promise of FaaS is that it can become a black box you don't
have to care about, its cost bundled into the markup you're paying on CPU,
memory and network.

So far in real life it looks more like Ops is very much involved whenever you
update a function or -- gods help you -- actually have to debug something in
production. As long as this is the case it's a broken model, because as soon
as you pull your Ops Ninja away from some other task you're right back in the
bad old world of unpredictable costs.

My sense (as a relative newbie) is that the providers know this and are slowly
trying to make stuff easier and more flexible -- and more predictable -- to
deploy. Amazon Elastic Container Service being one example.

The catch is that the more standard it is, the less lock-in there is, and so
far my experience with AWS suggests that lock-in is a major part of their
strategy.

------
jondubois
Lambda is not useful. It solves a few problems but creates even more new
problems.

Some problems include:

- It makes managing multiple environments (e.g. development, staging,
production) almost impossible.

- It makes debugging difficult because you can't run the code on your own
machine and step through it. Most projects cannot be tested end-to-end due to
environment incompatibilities between the cloud services and the front-ends
running locally... Multiple developers can't share a single development
environment, because each developer needs to work with their own test data,
and Lambda doesn't allow for this.

- Lambda adds all sorts of unexpected limits on your code; e.g. cold starts,
maximum function execution duration and others.

- The lock-in factor is significant; once you're hooked into Lambda and all
the surrounding services it encourages, you cannot leave, you have no
bargaining power over hosting costs, and your future is entirely dependent on
Amazon.

- Other AWS services that you can integrate with Lambda exacerbate the
problems with handling multiple environments and debugging. Services like
Elastic Transcoder and S3 are black boxes and very hard to debug. If something
goes wrong, sometimes the only way to resolve the problem is to contact Amazon
support and spend weeks sending messages back and forth to figure out the
issue.

- You're contributing to the centralization of wealth and power instead of
helping small companies and small open source projects. You're helping to turn
Amazon into yet another too-big-to-fail company with infinite leverage over
the rest of the economy.

- It takes the fun out of coding. As a developer, you no longer feel any
ownership of or responsibility for the code you produce; you're just handing
it all over to Amazon. In fact, it might as well belong to Amazon, because
that's the only company able to execute it. It doesn't help with employee
turnover either.

The main reasons Lambda is popular are that Amazon spent a fortune marketing
it and that a lot of vested interests in the industry want it to succeed (to
drive up Amazon's share price).

~~~
sebazzz
Doesn't that mostly apply to all serverless platforms? Serverless is cheap and
allows cloud providers to use excess capacity to run small pieces of software,
scaling almost infinitely[0]. But it is always vendor lock-in, whether you
choose Azure or Amazon. It is a proprietary platform, but that applies to most
of the cloud if you use services that don't exist elsewhere (Azure blobs,
Azure Cosmos DB, etc.)

[0]: https://www.troyhunt.com/serverless-to-the-max-doing-big-things-for-small-dollars-with-cloudflare-workers-and-azure-functions/

~~~
stephenr
I think I would take the parent comment to apply to any hosted Functions as a
Service platform.

------
abalone
Two things.

1- It's amazing that Java has the fastest cold start time! Faster than
Node.js.[1] That's exactly the opposite of what I've heard before.

2- I am so tired of hearing about cold start times for dormant apps as if that
is the only cold start scenario. It is arguably a worse problem to have cold
starts when scaling!

What do I mean by cold starts when scaling? You adopt serverless. Things go
great. Your app is never dormant. You're not serverless because you want to
shave costs for infrequently used apps; you're serverless so you have
infinite scale and pay by the millisecond and minimal devops and so on. But
whenever you have a burst of scale and Lambda needs to spin up more
instances... some of your unlucky users are going to hit that cold start. And
the hack of keeping an instance warm would do nothing to solve that.

I mean, do they? Do we know? It's possible that AWS warms up instances before
throwing traffic at them, but... has anyone looked at this?

[1] https://mikhail.io/2018/08/serverless-cold-start-war//coldstart-dependencies.png

~~~
akvadrako
That chart is quite strange. Why would a JS start take 500ms (on my system
it's 80ms), and why would it be slower to start with more dependencies?

Maybe it's because the bulk of the time is just copying the deployment
artifact to the local disk. In that case the overriding factor is the size of
the package.

~~~
icebraining
If you require() the libraries at the top, as seems to be common in Node
applications, it makes sense that more dependencies add to the cold start
time.
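
One common mitigation, as a sketch (aws-sdk here just stands in for any heavy
dependency): defer the require() from module load time, which runs on every
cold start, to the first invocation that actually needs it.

    // Eager: paid on every cold start.
    // const AWS = require('aws-sdk');

    let AWS; // Lazy: paid only on the code path that needs it.
    exports.handler = async (event) => {
      AWS = AWS || require('aws-sdk');
      // ... handle the event ...
      return { statusCode: 200 };
    };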

------
pjc50
What if we had smaller programs, such that an executable could be started on
demand for each request? You could also avoid GC and long-term stability
issues this way by having a short-lived program. It even allows you to have
multiple different microservice functions written in different languages
served by the same system. So long as they use a similar API - we could call
it a Common Gateway Interface?

(Yes, this is a joke based on how we used to do it 20 years ago)
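
(For the record, a complete "function" under that model really was this
small. A sketch, with Node standing in for the Perl of the era:)

    #!/usr/bin/env node
    // CGI: the web server starts one process per request; the process
    // writes headers, a blank line, then the body, and exits.
    console.log('Content-Type: text/plain\n');
    console.log('Hello from 1998');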

~~~
falcolas
It's somewhat amazing to me how some CGI services are so often faster than
full-blown-listening-on-port-8080 "micro" services.

It's unfortunate that we consider 1-3 seconds to be acceptable response times.

------
zackbloom
One thing this doesn't talk about is Cloudflare Workers. Rather than running a
separate container for each function, we use V8 isolates directly, meaning the
cold start time is under 5ms.
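
For reference, a minimal Worker in the service-worker syntax (a sketch, not
taken from the article):

    // The whole deployable unit: a fetch handler running in a V8 isolate.
    addEventListener('fetch', (event) => {
      event.respondWith(new Response('Hello from the edge'));
    });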

~~~
bni
Interesting. I always thought AWS Lambda being a complete Linux environment
with curl and everything was just supremely wasteful. What could be the reason
AWS implemented it like that? To support many programming languages/runtimes?

~~~
15155
I would guess for those reasons, and for liability reasons: what happens when
V8's sandboxing is exploited?

------
isuckatcoding
I feel as if the whole "warm up the Lambda as a pre-step" thing takes away
from the whole benefit of serverless. I wonder if AWS Lambda could be smart
enough to anticipate requests based on some historical or daily pattern. Also,
I'd be curious to know whether it is still cost-effective to use Lambda with
this technique (pre-warming) or better to go straight for an EC2 instance.

~~~
stephenr
Or.. you know.. don't have user-facing interfaces serviced by endpoints that
aren't running all the time?

I can see that _maybe_ there is a case to make for using functions as a
service, to handle batch processing of things, or possibly to service
background API requests.

------
xrd
I'm still really puzzled why AWS Lambda has Go and Azure Functions has Java
and Python, but GCF doesn't have anything but JavaScript. That's a big hole
that must be leaving a lot of developers feeling frustrated.

~~~
manigandham
GCP is just very slow at releasing, and behind in both features and services
compared to AWS and Azure. The trade-off is that the GA services are usually
more consistent, cheaper, faster and easier to use and integrate.

GCF is still in beta so it's the worst case, but they recently announced
Serverless Containers, which will let you run anything inside a Docker
container on-demand. That'll get around the language barriers and is the
inevitable destination of all serverless platforms eventually.

~~~
steren
GCF is now Generally Available.

And to sign up for serverless containers:
[https://g.co/serverlesscontainers](https://g.co/serverlesscontainers)

(I am a PM on GCP)

~~~
xrd
I'm assuming there is a blog post detailing this somewhere, can you share it
here?

~~~
manigandham
Here's the GCP Next 2018 video:
[https://www.youtube.com/watch?v=Y1sRy0Q2qig](https://www.youtube.com/watch?v=Y1sRy0Q2qig)

https://cloudplatform.googleblog.com/2018/07/bringing-the-best-of-serverless-to-you.html

------
davewritescode
I'm not sure if there's an equivalent in GCP but one of the issues we have
with AWS lambda is that cold start performance is also greatly affected by
whether or not the function is running in a VPC. In addition to warming up the
function, a virtual network interface has to be attached to the worker. This
can take 10+ seconds in the worst cases.

Hopefully this is an issue AWS fixes soon. The VPC cold start latency could
end up forcing you to make some less-than-optimal infrastructure choices, like
running infrastructure on the public internet.

------
xienze
The whole point of serverless functions is to do work that isn't necessarily
real-time, but of which there is a lot, and where each unit of work can be
completed relatively quickly (say, periodic sync tasks and the like). I really
think that if the nitty-gritty of cold start behavior is foremost in your
mind, you're likely Doing It Wrong(tm).

~~~
ssijak
I'm making a multiplayer game that involves dice using Firebase + cloud
functions. I must use cloud functions for the game because I cannot let the
client roll the dice on its own and just write the result to Firebase; that
could be easily cheated. When starting the game, cold starts are very
noticeable, but after a few moves everything feels fine.

~~~
phnofive
Very cool! Firmly believe no game should ever trust clients. Can you share the
project?

------
enitihas
Has anyone here migrated back from Lambda to conventional servers? What was
the experience like? Is there any straightforward way to convert serverless
projects to traditional services en masse? In what cases would you recommend
moving away from serverless?

~~~
Nialna
We have, from Google Cloud Functions, because they are a joke (see the cold
start graph in the OP link). Even during development it was horribly painful
to have 10+ second call times on many, many requests while testing (and the
same in production). Even requests that were only a few minutes or seconds
apart would cold start randomly.

I already had a centralised entry point for cloud functions (just a basic
abstraction), but generally they are pretty much wrapped Express requests, in
GCP at least, and I think in AWS too, so the experience should be similar
between AWS and GCP.

Changing that to use actual pure Express handlers rather than the cloud
functions API was pretty easy and quick, and while I was there I refactored
our entry points a bit to be easier to migrate in the future.
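
A sketch of what that looks like (the names are illustrative): GCF HTTP
functions already receive Express-style (req, res) arguments, so the same
plain handler can be exported as a cloud function or mounted on an ordinary
Express app.

    // Plain handler: no cloud functions API used.
    function updateProfile(req, res) {
      // ... business logic ...
      res.status(200).json({ ok: true });
    }

    // Deployed as a Google Cloud Function:
    exports.updateProfile = updateProfile;

    // Or mounted on a plain Express app (e.g. on App Engine):
    const express = require('express');
    const app = express();
    app.post('/updateProfile', updateProfile);
    app.listen(process.env.PORT || 8080);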

The only thing that took time was making a new deployment process (we moved to
App Engine, so it's still "kinda serverless"). Since cloud functions have
their own deployment system, I had to write our own deployment scripts,
management of environments and so on, rather than relying on the ones provided
by Firebase cloud functions. Not a lot of work really, and this is something
you would need if you had your own servers anyway.

Once you have this done, it's pretty easy to move your "cloud functions" to
any scaling node server host, or even to another cloud FaaS provider.

------
iamjustlooking
I commented on a thread a couple weeks ago about Cloud Functions on GCP here:
[https://news.ycombinator.com/item?id=17796893](https://news.ycombinator.com/item?id=17796893)

Eager to test it out, we ran thousands of test attempts with different RAM
sizes, and I can corroborate this person's findings regarding the reduction of
cold start time for functions with larger RAM allocations and the seeming
unpredictability of cold starts on GCP. I hope that with time they will
improve cold start times or increase the minimum idle time before a function
goes "cold".

~~~
jacques_chester
From the parts I can see in Knative-land, it's being given a lot of thought.
My view is that the biggest improvement to be made is in smarter handling of
raw bits. Kubernetes doesn't quite understand disk locality yet and most
docker images are less-than-ideally constructed in any case.

~~~
mwcampbell
What do you find suboptimal about most Docker images? Just the size, or
something else?

~~~
jacques_chester
I wrote several thousand words on the topic a few months (email me for the
link).

The gist is: you can make an image easy for developers, or you can make it
performant in production, but you cannot have both.

Ease of development typically leads to kitchen-sink images or squashed images,
but production performance requires careful attention to the ordering and
content of layers.
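
As a sketch of the production-oriented ordering (an illustrative Node image;
the principle is that rarely-changing layers come first so they stay cached
and can be reused across pulls):

    FROM node:8-alpine
    WORKDIR /app
    # Dependencies change rarely: isolate them in an early, cacheable layer.
    COPY package.json package-lock.json ./
    RUN npm install --production
    # Application code changes often: keep it in a late, small layer.
    COPY . .
    CMD ["node", "server.js"]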

------
jchrisa
An alternative approach for user-facing apps is to connect directly to the
database from the browser. Lambda still has a role to play in such an
architecture, but hopefully most of your basic CRUD operations can go direct,
with less commonly called functions like login depending on Lambda. I address
this option about 2/3 of the way through this webcast on serverless best
practices: https://blog.fauna.com/webcast-video-serverless-best-practices-with-netlify

~~~
alpb
Probably worth adding a disclaimer that you work on FaunaDB. :)

I am wondering in what sort of cases users can just hit the DB. Do you mind
giving a tl;dr? Does it only work for read-only/non-sensitive data?

~~~
jchrisa
Thanks. :) That was the last thing I wrote before falling asleep... TLDR is
that many (not just FaunaDB) cloud databases have a security and connection
model that's suitable for connecting from the browser. I know Firebase has
been doing it forever, and you can also do it with DynamoDB if you don't mind
complexity.

It takes a little bit of thinking to set up the security rules so that users
can only see what they are allowed to, but it's worth it for the performance
and runtime simplicity. And of course you can always invoke a Lambda for code
paths that need to run with privileged access.
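
As a sketch of the kind of rule involved, in Firebase Realtime Database
syntax (the users/$uid layout is illustrative, not from the comment): each
authenticated user can read and write only their own subtree.

    {
      "rules": {
        "users": {
          "$uid": {
            ".read": "auth != null && auth.uid === $uid",
            ".write": "auth != null && auth.uid === $uid"
          }
        }
      }
    }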

------
thomasfoster96
The comparison of JavaScript cold start times by the number/size of
dependencies is a little confusing.

I’m not too familiar with how all the various serverless platforms work, but a
decent bundler would surely improve the cold start times of most JavaScript
functions. Deploying dozens of megabytes of dependencies across dozens and
dozens of files is obviously going to result in a longer start up time than
uploading a single bundle generated by webpack or Rollup.
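
A minimal setup of that kind (webpack shown; the file names are illustrative):

    // webpack.config.js: bundle the handler and its deps into a single file.
    module.exports = {
      entry: './src/handler.js',
      target: 'node',               // don't bundle Node built-ins
      mode: 'production',           // minify and tree-shake
      output: {
        path: __dirname + '/dist',
        filename: 'handler.js',
        libraryTarget: 'commonjs2', // so the platform can require() the exports
      },
    };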

Enforcing a single <1MB file [0] seems to have at least partly enabled the
drastically improved cold start times of Cloudflare Workers in comparison
with AWS Lambda, Azure Functions and Google Cloud Functions (although
Cloudflare Workers also has a much smaller feature set).

[0] https://developers.cloudflare.com/workers/writing-workers/storing-data/

------
catchmeifyoucan
This is very well done! I love the comparisons, and the fact that it compares
different languages.

------
tomcam
One of my favorite articles of recent times. What jumped out at me was that
there is a dramatic difference in startup time depending on how much memory
your cloud server has: the more memory, the faster the startup.

~~~
ssijak
More memory = faster CPU allocated

~~~
tomcam
Which is not necessarily intuitive. For all I knew, there was some kind of
penalty for a 1024MB memory image vs 512MB. Didn't think so... but I'm
increasingly baffled by hardware issues in an increasingly virtualized world.

------
a_imho
So, what is the killer app of FaaS? To my layman understanding the selling
point is easy scalability, but it seems to be inherently at odds with
serverless being the most expensive computing model?

~~~
TrueTeller
IMO, every greenfield app that can benefit from the "just focus on your
business code" model.

P.S. I'm the author of OP, thanks for reading!

~~~
jacques_chester
This is an argument for buildpacks, isn't it?

Disclosure: I am currently working on buildpacks again.

------
hacknat
While the idea of only paying for what you use is an interesting proposition
with serverless, I think the true cost savings come from no longer having to
manage VMs.

With that in mind, the cold start problem can be avoided entirely with Fargate
and Azure container groups. Sure, you pay for an app to be on all the time,
but you were doing that anyway.

I’m using Azure container groups and their managed DB service right now, and
my CI pipeline is thin and I have had no need of an Ops person to manage VMs.

------
supahfly_remix
I don't remember the blog post, but one had a cost breakdown of FaaS. I was
surprised to learn that the API Gateway cost is more than the Lambda cost
itself.

------
trevyn
This is just AWS, GCP, and Azure — it would be nice to see Zeit and Cloudflare
Workers in the comparison as well. (Any other serverless providers of note?)

------
maxxxxx
I think it would be nice to also show the performance after a cold start, so
you can make a real-world comparison with all the trade-offs.

------
wimbledon
I think there is a business case for a small app where you could input the
URL of your function and have it called every 'x' seconds or minutes to keep
the function warm.

Would anyone here use it? I can put together a small app in a few days.

~~~
stephenr
A few _days_ to put `curl $url > /dev/null` in a crontab? Are you kidding me?
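
The entire "product", for reference (the URL is a placeholder):

    */5 * * * * curl -s https://example.com/my-function > /dev/null 2>&1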

~~~
bufferoverflow
He is describing an HTML+JS solution. Even then, it's just a few lines of JS:
create an image or a script object with a given URL and append it to the body.
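
Something like this (the URL is a placeholder):

    // Ping the function from the page; the timestamp busts any caching.
    const img = new Image();
    img.src = 'https://example.com/my-function?ping=' + Date.now();
    document.body.appendChild(img);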

------
abhisuri97
Does anyone know a potential reason why .NET shows a clear trend of decreasing
cold start time as memory size increases?

~~~
reilly3000
At least on AWS, increasing memory also comes with more compute and more
networking bandwidth. If there are a lot of packages to be imported and
possibly decompressed, that could have a large impact:

"AWS Lambda allocates CPU power proportional to the memory by using the same
ratio as a general purpose Amazon EC2 instance type, such as an M3 type. For
example, if you allocate 256 MB memory, your Lambda function will receive
twice the CPU share than if you allocated only 128 MB. "

https://docs.aws.amazon.com/lambda/latest/dg/resource-model.html

------
mattip
This is the second article on the front page right now with "war" in the
title. Why can't it be an analysis instead?

~~~
TrueTeller
"Analysis" is not that hackernews-worthy :P

------
songco
Great comparisons

