And because it's a "Cloud" service, we pay for what we use, so if there's no workload for the functions to service, there's no cost. We just pay for the time the tiny little container is active.
Ok, that sounds like it can save us money, and you seem confident in the technology. Let's go with that.
1 Month Later:
Business: Why do all these actions take so much longer to complete? They used to load instantly with a "Done" message; now there's a noticeable delay before they respond.
Tech: Well, you see, the containers that run our functions stop running once they've finished their work and it looks like there's nothing more for them to do, so when new work comes in there's a delay while the container starts.
Business: You mean like how my Dell takes 5 minutes to boot up?
Tech: Well, kind of, but it's much quicker than that obviously, and it's not a whole operating system, it's just some processes in a process namespace...
Business: <visibly zoning out>
Tech: Ok, well, tell you what, we can solve this: we can set up something to periodically ping the system so the containers running our functions never get de-activated, and that way they'll be ready to service requests immediately, all the time.
Business: OK, great, that sounds like what we want.
Finance: Why did our bill for "functions" suddenly spike in the last month, compared to when we started using it?
Tech: Well, you see, now we have to keep the function containers running 24x7 so they're quick to respond, because $Business complained they were too slow to start up from an inactive state.
Tech Onlookers: ..... <blink>.... Wat... Why... Why would you do that?
(Edited to add:)
Tech Entrepreneur: I can outsource that process of keeping your function containers active for you, for just $1/container/month!
If you’re not latency sensitive, such as with queue-to-queue lambdas, then go for it.
Pretty sketchy marketing in my opinion.
I think you would expect them to at least use swap for density and always keep one extra instance running and ready to serve requests. It's not like people generate functions on the fly, so it shouldn't even cost anything extra. Swap will help with unnecessary things kept in memory.
2. The last company I worked for isolated functions to VMs at a customer level. That way, if someone hacked their way past the function container, they would still have to hack their way past the VM to access another customer's data, which is what you would have to do anyway on the public VM offering.
For example, if your sign-up process is a Lambda, you got your cost-benefit analysis wrong on anything other than a prototype or MVP. But if "change your profile photo" (something that maybe happens a handful of times per user per month, at best) is what you implemented as a Lambda, to reduce load and delay scaling needs on your core infrastructure, then that feels like you did it right.
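To make that cost-benefit intuition concrete, here's a rough back-of-envelope (the rates are AWS's published pay-per-use Lambda prices, worth re-checking; the workload numbers are invented for illustration):

```typescript
// Back-of-envelope only: AWS's published Lambda rates (~$0.20 per 1M
// requests, ~$0.0000166667 per GB-second); workload numbers are made up.
const requestsPerMonth = 1_000_000; // e.g. profile-photo changes
const avgDurationSec = 0.2;         // 200 ms per invocation
const memoryGb = 0.5;               // 512 MB allocated

const requestCost = (requestsPerMonth / 1_000_000) * 0.20;
const computeCost = requestsPerMonth * avgDurationSec * memoryGb * 0.0000166667;

console.log(`~$${(requestCost + computeCost).toFixed(2)}/month`); // about $1.87
```

A couple of dollars a month for a workload that would otherwise mean keeping capacity around; run the same arithmetic against a busy sign-up flow and the analysis flips.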
Business: We were really successful with hundreds of requests a minute, but for some reason the delay came back!
Tech: Oh... warming up doesn't scale. <scratches head and walks away>
But I think I'm missing something, so I'd love to hear from someone who can paint this picture for me.
1. (Maybe, someday) no real Ops work.
2. 100% utilization (at a cost).
3. Measurability of cost in #2.
I agree about the ball-of-locked-in-mud danger. And so far I've seen the Ops part be actually a bigger issue than it was before, because there's so much opaque, badly documented madness involved. With AWS Lambda + API Gateway anyhow, I can't speak to the others.
Even so, it seems like long-term the Ops end of the puzzle can be planned/automated away for the most part. That leaves utilization and billing.
It can be very very useful to say "This costs exactly $X and that cost scales linearly" when planning resource allocations. Even if everyone agrees it could probably be done for an unknowable amount less. The predictability and the isolation of the cost is sometimes worth spending more money. (Of course the predictability requires that the Ops Ninjas not be required, which again I think is possible long-term but definitely not short-term.)
Anyway that's just for one realm of application, which I think is getting more popular ("do this internal thing we used to have some EC2's do").
I think the billing part sounds a bit like a trap to me. At least while FaaS retains so many quirks, it's like trading billing for the engineering time of contorting a business process to a given FaaS model (e.g. trying to eke out the lowest average and worst-case latency; also working within language runtime limitations).
I'm pretty sure ops will never be automated away; it can only be transmuted to a different form :). But, at best, you can achieve elegant separation of ops and process concerns. That's where I worry FaaS could be a bit of a hazard, if not carefully utilized.
Precisely, and the promise of FaaS is that it can become a black box you don't have to care about, bundled into the markup you're paying on CPU, memory and network.
So far in real life it looks more like Ops is very much involved whenever you update a function or -- gods help you -- actually have to debug something in production. As long as this is the case it's a broken model, because as soon as you pull your Ops Ninja away from some other task you're right back in the bad old world of unpredictable costs.
My sense (as a relative newbie) is that the providers know this and are slowly trying to make stuff easier and more flexible -- and more predictable -- to deploy. Amazon Elastic Container Service being one example.
The catch is that the more standard it is, the less lock-in there is, and so far my experience with AWS suggests that lock-in is a major part of their strategy.
You keep the ability to exercise a fair amount of control, easily, and clearly.
And you don't sit inside really opaque execution environment.
Sure, the scaling and provisioning of those Docker containers is opaque, but I'm much more willing to deal with it at that level.
The flip side - so long as it's a service and not a job, who cares? :D
I guess you can. However, when I built Lambda functions I did pay attention to restricting lock-in to the entrypoint, and when I later had to migrate elsewhere for non-technical reasons, that was very much doable. In other words, it's easy but not necessary to build a big ball of locked-in mud.
The reason I've chosen this is because I do not want to manage an OS. Patch management is something I am not interested in tackling - it's a hard problem and AWS has it down anyways.
There are other reasons as well (I find ephemeral systems very appealing), but this is the most significant.
Some problems include:
- It makes managing multiple environments (e.g. development, staging, production) almost impossible.
- It makes debugging difficult because you can't run the code on your own machine and step through it. Most projects cannot be tested end-to-end due to environment incompatibilities between different services and front-ends running locally... Multiple developers can't share a single development environment because each developer needs to work with their own test data, but Lambda doesn't allow this.
- Lambda adds all sorts of unexpected limits on your code; e.g. cold starts, maximum function execution duration and others.
- The lock-in factor is significant; once you're hooked into Lambda and all the surrounding services that it encourages, you cannot leave and you have no bargaining power in terms of hosting costs and your future is entirely dependent on Amazon.
- Other AWS services that you can integrate with Lambda also exacerbate problems related to handling multiple environments and debugging. Services like Elastic Transcoder and S3 are black boxes and are very hard to debug. If something goes wrong, sometimes the only way to resolve the problem is to contact Amazon support and spend weeks sending messages back and forth to figure out the issue.
- You're contributing to centralization of wealth and power instead of helping small companies and small open source projects. You're helping to turn Amazon into yet another too big to fail company with infinite leverage on the rest of the economy.
- It takes the fun out of coding. As a developer, you no longer feel any ownership or responsibility over the code that you produce, you're just handing over all that code to Amazon. In fact, it might as well belong to Amazon because that's the only company that is able to execute that code. It doesn't help with employee turnover.
The main reasons why Lambda is popular are because Amazon spent a fortune on marketing it and there are a lot of vested interests in the industry who want it to succeed (to drive up Amazon share price).
> It makes managing multiple environments (e.g. development, staging, production) almost impossible.
I don't know why you say this? In my experience, this was one of the most amazing parts: my CI/CD setup simply deployed a new Lambda function for every branch in my repo, which was equivalent to the production one in terms of environment.
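For the curious, the deploy step can be as small as this. A minimal sketch with the AWS SDK for JavaScript; the function naming scheme, role variable and bundle path are placeholders I've invented:

```typescript
// Hedged sketch of a per-branch deploy step (AWS SDK for JavaScript v2).
import { Lambda } from "aws-sdk";
import { readFileSync } from "fs";

const lambda = new Lambda({ region: "us-east-1" });

async function deployBranch(branch: string) {
  const FunctionName = `my-api-${branch}`;  // one function per branch
  const ZipFile = readFileSync("dist/bundle.zip");

  try {
    // Update in place if this branch already has a function...
    await lambda.updateFunctionCode({ FunctionName, ZipFile }).promise();
  } catch {
    // ...otherwise create a fresh, production-equivalent copy.
    await lambda.createFunction({
      FunctionName,
      Runtime: "nodejs18.x",
      Role: process.env.LAMBDA_ROLE_ARN!, // assumed preexisting IAM role
      Handler: "index.handler",
      Code: { ZipFile },
    }).promise();
  }
}

deployBranch(process.env.BRANCH_NAME ?? "main").catch(console.error);
```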
> It makes debugging difficult because you can't run the code on your own machine and step through the code.
It took some work to set up initially, but I did manage to do this pretty well. Perhaps related to the next point:
> The lock-in factor is significant; once you're hooked into Lambda and all the surrounding services that it encourages, you cannot leave and you have no bargaining power in terms of hosting costs and your future is entirely dependent on Amazon.
This is true, but can be mitigated. I restricted the lock-in to the entrypoint, and managed to transfer my functions elsewhere later with relatively little effort. In this case, the Lambda functions were behind an API Gateway proxy, and I converted them to the same interface used by Express.js before passing them to my business logic. That allowed me both to execute the business logic locally, and to migrate it where it is running now, on a standard Express.js server.
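Roughly, the entrypoint looked like the sketch below (the business-logic module and its (req, res) signature are my own illustration, not a library API): the Lambda-specific part translates the API Gateway event into an Express-shaped request/response pair, so the business logic can later sit behind a real Express server unchanged.

```typescript
// Keep the Lambda-specific glue at the entrypoint: translate an API
// Gateway proxy event into an Express-like (req, res) pair before
// calling the business logic. Assumes JSON request bodies.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
import { handleRequest } from "./business-logic"; // hypothetical module

export async function handler(
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> {
  // Minimal Express-shaped request object.
  const req = {
    method: event.httpMethod,
    path: event.path,
    headers: event.headers,
    query: event.queryStringParameters ?? {},
    body: event.body ? JSON.parse(event.body) : undefined,
  };

  // Minimal Express-shaped response object that records what was sent.
  let statusCode = 200;
  let body = "";
  const res = {
    status(code: number) { statusCode = code; return res; },
    send(payload: unknown) { body = JSON.stringify(payload); return res; },
  };

  await handleRequest(req, res); // business logic sees only (req, res)
  return { statusCode, body, headers: { "Content-Type": "application/json" } };
}
```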
> It takes the fun out of coding.
Well, I can just say that it was still fun for me - most of the fun usually is in the business logic, although the experience of simply experimenting in a different branch and having a complete production-like environment pulled up for that still amazes me.
It certainly doesn't. You just have to use an infrastructure-as-code tool such as Terraform or Serverless. AWS's own method of dealing with environments is best ignored.
> Lambda adds all sorts of unexpected limits on your code; e.g. cold starts, maximum function execution duration and others.
Like any other platform, Lambdas have certain performance limitations you need to be aware of. They're not a fit for every problem, but the future is going to be increasingly serverless. These limitations are likely to become much less of a problem as the technology matures.
Again, use of a platform-agnostic setup like Terraform helps mitigate against this. In theory it ought to be relatively simple to change FaaS providers (I haven't actually tried this).
> You're contributing to centralization of wealth and power instead of helping small companies and small open source projects. You're helping to turn Amazon into yet another too big to fail company with infinite leverage on the rest of the economy.
The centralisation of power and wealth is true to an extent, but there are two sides to this. It's also empowering for small organisations: it allows small companies to very quickly ramp up availability and have extremely solid reliability without a specialised infrastructure team. You can also start something up very quickly and cheaply given AWS's free usage tier.
I do think you raise some great points however.
a. It is possible to have multiple environments [https://up.docs.apex.sh/#configuration.stages]
b. You can run code on your machine [https://up.docs.apex.sh/#commands.start]
c. Regarding vendor lock-in: alternative serverless providers (apart from AWS) are planned [https://github.com/apex/up/issues/4]
d. "You're contributing to centralization of wealth and power instead of helping small companies and small open source projects."
up is open-source but also has a paid plan [https://up.docs.apex.sh/#guides.subscribing_to_up_pro]
e. "It takes the fun out of coding."
I'm not sure about that.
1- It's amazing that Java has the fastest cold start time! Faster than Node.js. That's exactly the opposite of what I've heard before.
2- I am so tired of hearing about cold start times for dormant apps as if that is the only cold start scenario. It is arguably a worse problem to have cold starts when scaling!
What do I mean by cold starts when scaling? You adopt serverless. Things go great. Your app is never dormant. You’re not serverless because you want to shave costs for infrequently used apps; you’re serverless so you have infinite scale, pay by the millisecond, minimal devops, and so on. But whenever you have a burst of scale and Lambda needs to spin up more instances... some of your unlucky users are going to hit that cold start. And this hack of keeping an instance warm would do nothing to solve that.
I mean, do they? Do we know? It's possible that AWS warms up instances before throwing traffic at them but.. has anyone looked at this?
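It's easy enough to probe for yourself: fire a burst of concurrent requests at an idle-but-warm function and look at the latency distribution; cold-started instances should show up as a distinct slow cluster. A minimal sketch (assumes Node 18+ for global fetch; the URL is a placeholder):

```typescript
// Fire a burst of concurrent requests and eyeball the latency spread.
const URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/ping";

async function timeOne(): Promise<number> {
  const start = Date.now();
  await fetch(URL);
  return Date.now() - start;
}

async function burst(n: number) {
  // All requests in flight at once forces the platform to scale out.
  const latencies = await Promise.all(Array.from({ length: n }, timeOne));
  latencies.sort((a, b) => a - b);
  console.log(`p50=${latencies[Math.floor(n * 0.5)]}ms`,
              `p95=${latencies[Math.floor(n * 0.95)]}ms`,
              `max=${latencies[n - 1]}ms`);
}

burst(100).catch(console.error);
```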
I've been doing a bunch of reading as a side-effect of being in and around the riff and Knative autoscaler efforts. What you're describing is known in other professions as a "stock-out".
The good news is: there are existing models for answering this kind of question. From what I've seen the "order up to" model is a fit, but I've yet to find time to work on testing that theory.
The bad news is: this problem never goes away. You are always going to be oversupplied or undersupplied. Autoscalers don't let you break Little's Law or overturn causality.
The good or bad news, depending on how you think: this tradeoff can be tuned. You can choose an acceptable probability of running out of running instances vs the acceptable average level of utilisation. That tradeoff is purely economic, it is a business decision, not an engineering decision.
An autoscaler cannot throw the bones, gaze into the crystal ball and mystically divine your intentions. A human will still be responsible for the decisions that matter.
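To make the tradeoff concrete, a toy version of the capacity decision might look like this (a sketch, not a real autoscaler; the Poisson/normal safety margin is my own simplification): you pick the stock-out probability via the safety factor, and average utilisation falls out of it.

```typescript
// Toy "order up to" sizing for warm instances.
// Little's Law: average concurrency L = arrival rate (lambda) x service time (W).
function instancesToKeepWarm(
  arrivalsPerSec: number,  // arrival rate
  serviceTimeSec: number,  // time each request holds an instance
  z: number                // safety factor; z = 2 gives roughly 2% stock-out risk
): number {
  const mean = arrivalsPerSec * serviceTimeSec; // L = lambda * W
  const safety = z * Math.sqrt(mean);           // Poisson stddev is sqrt(mean)
  return Math.ceil(mean + safety);
}

// 50 req/s at 200 ms each: mean concurrency 10, keep ~17 warm for z = 2.
console.log(instancesToKeepWarm(50, 0.2, 2));
```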
Maybe it's because the bulk of the time is just copying the deployment artifact to the local disk. In that case the overriding factor is the size of the package.
For most apps, scale-out cold start won't be too much of a deal breaker: longer instance lifetime + shorter start will do the trick. And wait for the next wave of optimizations from vendors, I'm sure there's more to come.
P.S. I'm the author of OP, thanks for reading!
Uhh... you’re just speculating here, right? I seriously doubt Lambda spins up a whole new container for every single “parallel” request in a short burst. There’s probably a little bit of queuing and/or they aren’t exactly parallel.
I think it’s something the vendors need to solve, like by directly prewarming prior to throwing traffic at it.
Oh and thank you for the detailed post!
(Yes, this is a joke based on how we used to do it 20 years ago)
It's unfortunate that we consider 1-3 seconds to be acceptable response times.
And IncludeOS is a modern way to do it.
I can see that maybe there is a case to make for using functions as a service, to handle batch processing of things, or possibly to service background API requests.
2. Cloud vendors are working and succeeding to make cold starts faster, and AWS is at the forefront.
3. Pre-warming done right is extremely cheap, you won't pay anything meaningful, just several cents per month.
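"Done right" here mostly means the function recognises a scheduled warming ping (say, a CloudWatch Events / EventBridge rule every few minutes) and exits immediately, so each ping bills only a few milliseconds. A minimal sketch; the marker field is my own convention:

```typescript
// Minimal pre-warming sketch: a scheduled rule invokes the function with
// a marker payload; the handler short-circuits so the ping costs almost
// nothing. The "warmer" field name is an invented convention.
export async function handler(event: { warmer?: boolean }) {
  if (event.warmer) {
    return { warmed: true }; // container stays alive, no real work done
  }
  return doRealWork(event);
}

async function doRealWork(event: unknown): Promise<unknown> {
  // ... the function's actual job would go here ...
  return { ok: true };
}
```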
P.S. I'm the author of OP.
This is called "predictive autoscaling". In other fields it's called seasonality and you can build inventory and manufacturing plans around it.
While researching FaaSes last year I saw a slide deck from an Amazon PM about Lambda (I promptly lost the link). The thing that stuck with me was a claim that a substantial amount of their autoscaling "magic" was due to predictive autoscaling.
Netflix wrote up, but has not open-sourced, their predictive autoscaler "Scryer" in a few blog posts. The tl;dr is that they use a combination of Fourier transforms and simple correlations to make a basic forecast of how many VMs to have ready at different times of day. A reactive autoscaler adjusts on the day.
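The "simple correlations" half of that is not much code. A toy flavour of the seasonal baseline (emphatically not Scryer's actual algorithm): average historical traffic by hour of day, convert to instance counts, and let a reactive autoscaler correct the residual on the day.

```typescript
// Toy seasonal baseline: per-hour historical averages -> instance counts.
function hourlyPlan(history: { hour: number; requestsPerSec: number }[],
                    perInstanceCapacity: number): number[] {
  const sums = new Array(24).fill(0);
  const counts = new Array(24).fill(0);
  for (const { hour, requestsPerSec } of history) {
    sums[hour] += requestsPerSec;
    counts[hour] += 1;
  }
  // For each hour of the day, plan enough instances for the average rate.
  return sums.map((s, h) =>
    Math.ceil((counts[h] ? s / counts[h] : 0) / perInstanceCapacity));
}

// e.g. weeks of per-hour samples in, 24 planned instance counts out:
// hourlyPlan(samples, 100) -> [3, 2, 2, ..., 12, 9]
```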
GCF is still in beta so it's the worst case, but they recently announced Serverless Containers, which will let you run anything inside a Docker container on-demand. That'll get around the language barriers and is the inevitable destination of all serverless platforms eventually.
And to sign up for serverless containers: https://g.co/serverlesscontainers
(I am a PM on GCP)
Hopefully this is an issue AWS fixes soon. The VPC cold start latency could end up forcing you to make some less-than-optimal infrastructure choices, like running infrastructure on the public internet.
For example, on the real-time side, busy APIs with functions that execute quickly. If the caller has either a decent timeout + retry configured (or doesn't care), cold start really isn't an issue. User-facing web technologies? Serverless has never been that compelling there, given auto-scaling webservers is a pretty solid technique.
But truth be told, not having to deal with servers at all is magical. And if your team is used to it, it does cut down on a lot of worry such as patching stuff. Spectre or Meltdown patches? Ha, zero effort for us.
The problem with this (which applies to "cloud" infrastructure) is that, ultimately, it's somebody else's server.
You may not "worry" about something like low-level security patches, but that also means you have no control or even visibility into them. It doesn't mean you're not subject to the consequences.
Of course, there's always something magical about any form of outsourcing when it works really well. I'm not sure this form has enough of a track record to be blindly trusted, however.
I already had a centralised entry point for cloud functions (just a basic abstraction), but generally they are pretty much wrapped Express requests, in GCP at least, and AWS too I think, so the experience should be similar between AWS and GCP.
Changing that to be actual pure Express functions and not use the cloud functions API was pretty easy and quick, and while I was there I refactored our entry points a bit to be easier to migrate in the future.
The only thing that took time was making a new deployment process (we moved to App Engine so it's still "kinda serverless"). Since cloud functions have their own deployment system, I had to write our own deployment scripts, management of environments and so on rather than relying on the one provided by Firebase cloud functions. Not a lot of work really, and this is something you would need if you have your own servers anyway.
Once you have this done, it's pretty easy to move your "cloud functions" to any scaling node server host, or even to another cloud FaaS provider.
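The reason the move is cheap is that a GCP HTTP cloud function already receives Express-style (req, res) arguments, so the same handler mounts on a plain Express server unchanged. A minimal sketch (route and handler names invented):

```typescript
// A handler previously exported as a GCP HTTP cloud function: it already
// has the Express (req, res) signature, so no rewrite is needed.
import express from "express";
import type { Request, Response } from "express";

export function updateProfile(req: Request, res: Response) {
  // ... business logic, untouched by the migration ...
  res.status(200).json({ ok: true });
}

// On App Engine (or any Node host), mount it yourself instead:
const app = express();
app.use(express.json());
app.post("/updateProfile", updateProfile);

const port = Number(process.env.PORT) || 8080;
app.listen(port);
```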
Eager to test it out, we ran thousands of test attempts with different RAM sizes, and I can corroborate this person's findings regarding the reduction of cold start time for functions with larger RAM allocations and the seeming unpredictability of cold starts on GCP. I hope with time they will improve cold start times or increase the time before a function goes "cold".
The gist is: you can make an image easy for developers, or you can make it performant in production, but you cannot have both.
Ease of development typically leads to kitchen-sink images or squashed images, but production performance requires careful attention to the ordering and content of layers.
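The canonical example of layer ordering: copy the dependency manifest and install before copying the frequently-edited source, so rebuilds and pulls only touch the small top layer. A generic sketch, not specific to any FaaS platform:

```dockerfile
# Illustrative only: slow-changing layers first so they stay cached.
FROM node:18-slim
WORKDIR /app

# Dependencies: rebuilt only when the manifest changes...
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# ...while the frequently-edited source lives in its own later layer.
COPY src/ ./src/
CMD ["node", "src/index.js"]
```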
I am wondering in what sort of cases users can just hit the DB. Do you mind giving a tl;dr? Does it only work for readonly/non-sensitive data?
It takes a little bit of thinking to set up the security rules so that users can only see what they are allowed, but it's worth it for the performance and runtime simplicity. And of course you can always invoke a Lambda for code paths that need to run with privileged access.
Enforcing a single <1MB file seems to have at least partly allowed drastically improved cold start times for Cloudflare Workers in comparison with AWS Lambda, Azure Functions and Google Cloud Functions (although Cloudflare Workers also has a much smaller feature set).
Disclosure: I am currently working on buildpacks again.
With that in mind, the cold start problem can be avoided entirely with Fargate and Azure container groups. Sure, you pay for an app to be on all the time, but you were doing that anyway.
I’m using Azure container groups and their managed DB service right now, and my CI pipeline is thin and I have had no need of an Ops person to manage VMs.
Would anyone here use it? I can put together a small app in a few days.
"AWS Lambda allocates CPU power proportional to the memory by using the same ratio as a general purpose Amazon EC2 instance type, such as an M3 type. For example, if you allocate 256 MB memory, your Lambda function will receive twice the CPU share than if you allocated only 128 MB. "