Serverless Computing: One Step Forward, Two Steps Back (arxiv.org)
465 points by snaky 4 months ago | 222 comments

On the same note, CloudFlare’s approach to serverless computing by leveraging V8 sandboxing instead of traditional virtualization is fascinating: https://blog.cloudflare.com/cloud-computing-without-containe...

Considering the challenges of interconnected cloud computing and common IR (of which the latter is mentioned in this paper), an interesting distinction arises: WASM, in comparison to containers, brings us a single language (JS), and through this, possibly a single “orchestrator” (V8 Isolates). Of course, this is more or less saying that the world would be a better place if everyone drove a Volkswagen. But this is like a second chance to iterate on such an idea and maybe do something about it. Isn't this a step towards the future of cloud computing?

Workers are network addressable, which seems to directly respond to one of the complaints in the article. Beautifully, if you call one Worker from another they have a decent chance of running on not just the same server, but in the same process. When one Worker calls another its data doesn't even leave user-space, even if the two Workers are owned by different people. Imagine calling a third-party API and having your request never even need to leave the process to be handled.

If the request never has to leave the process, that means there isn't anything the third party API is doing that persists state or calls out to another backend. In that case, wouldn't it just be better to import this third party API directly as code and call it for significantly better reliability, performance, and reduced complexity?

There are advantages in being able to have a security barrier between your code and that of your customers, and that you can push releases etc without having to update every downstream consumer.

More importantly though, we're working on incorporating storage in a meaningful way.

If the third party code is running in the same process then the security is managed by V8 (or whatever runtime). Does such a thing exist in other managed languages? Does anyone know if C# or Java (for example) have the concept of untrusted libraries?

You could think of an "untrusted library" as a wrapper around an HTTP API: code you don't run yourself, but whose results you depend on.

Dunno about managed runtimes, but the whole thing sounds isomorphic to running a bunch of untrusted docker containers in a single qemu VM.

Then your engineers are responsible for maintaining and updating it. Calling to a third party allows you to offload that.

On the contrary, WASM brings us quite a few languages, and likely even more as compiler backends are developed and the platform’s capabilities are extended. It would actually be pretty terrible at running JS as it exists right now.

This is true, yet apparently we are standardizing on JavaScript for foreign function calls. How do you pass a string or an array between workers written in different languages, other than as JavaScript?

It seems like whatever language you use for generating WASM gets hidden as an implementation detail.

Am I the only one that thinks serverless is about creating a bunch of vendor locked-in technical debt?

You're not the only one.

Some time ago I worked with a bunch of people who staked their careers on moving everything to AWS and converting it to serverless architecture. Here is just one example of how fishy it was. They used Dynamo. During presentations to the outgroup, they constantly praised Dynamo as being flexible, scalable and fast. However, internally they constantly struggled with various limitations of the database, from limitations of indexing to not being able to get certain types of metadata.

My point is, the architecture they were designing was heavily bent around Dynamo's strengths and weaknesses. Considering how extreme and peculiar those strengths and weaknesses are, switching to another database would require heavy re-engineering of the rest of their architecture, much more so than, say, switching between Oracle, MS SQL and Postgres.

Not sure I understand this. Why did they choose Dynamo? And what does serverless have to do with this?

This seems like a simple case of people choosing the wrong tool for the job, without really understanding its limitations or capabilities.

> say, switching between Oracle, MS SQL and Postgres.

That's because those are all relational DBs that use SQL. If they chose some other NoSQL DB they would likely have encountered similar "peculiar" strengths and weaknesses. And I'm still not sure what this has to do with serverless - they could have easily implemented a serverless architecture with MySQL or Postgres on RDS instead of Dynamo. But they chose not to.

> Why did they choose Dynamo? And what does serverless have to do with this?

Possibly because lambda is integrated with dynamodb. You can hook a function to an event like a row being inserted or modified: https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

That lets you skip a lot of explicit queue management code and the change+notify is atomic by default.

You can use lambda within an Aurora (AWS version of MySQL) trigger also.

The advantage of DynamoDB when using lambda is that it is API based instead of based on database connections and pooling and it doesn’t run inside of a VPC - meaning a network interface doesn’t need to be created inside your VPC every time a new lambda process comes online.

But now Aurora Serverless has the Data API that doesn’t use traditional database connections so there is even less of a need for DynamoDB for lambda performance reasons.

Sure, but that's simply a single point in favour of Dynamo in what should have been a comprehensive analysis of strengths and weaknesses with respect to the required use case.

If they chose DynamoDB simply due to this strength (ignoring for the moment that the same thing can be achieved with native functions in Aurora), while ignoring the other things that ultimately turned out to be more important, then this again points to a flawed analysis. It's hard to see how serverless is to blame for this.

AWS might have fixed this by now with serverless Aurora, but as of earlier this year the issue was that connection pools didn’t work in Lambda functions: because of their ephemeral nature, they couldn't keep TCP sockets open for more than a few minutes before the functions were frozen or terminated.

Connection pools always worked quite well with Lambdas, as long as the connection pool was initiated in the global scope of the running code, and not within the handler function. See [1] and [2]

[1] - https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-...

[2] - https://blog.spotinst.com/2017/11/19/best-practices-serverle...

We use pools in lambdas to Amazon Aurora and it works out fine. Only limit we hit was when the db server was configured way too small.

Are you running your Lambdas inside the VPC (and incurring huge cold-start latency), or are you running your database exposed to the public internet? I've avoided using Lambda for anything that requires relational database connectivity because those appear to be the two options, and neither is viable for me.

The Data API was announced less than a month ago.


> That's because those are all relational DBs that use SQL

I have dealt with all of those, including MySQL. Switching is never easy. Every technology choice, even ones that should be similar, like an RDBMS, has strengths and weaknesses.

If you are actually taking advantage of all of the features of sql server or Oracle, you’re not just going to switch overnight.

I’m no fan of DynamoDB, though I am a multi-certified AWS fan, but I know its strengths and weaknesses and so should anyone else.

SQL Server and Oracle have different features and idiosyncrasies, but use the same fundamental model of data storage, one that is a product of decades of research and development and isn't controlled or promoted by a single company. You're not going to suddenly run into some fundamental limitation that will fuck you over on an architectural level.

This happens all the time with databases like Dynamo. All of them feel amazing when you use them in the "intended" way, but they also have extreme limitations and those limitations are different for every product. Why? Because they aren't based on some fundamental data storage model, but rather on an aggregation of specific use cases, which are different across different products.

Here is a mild practical example: https://blog.codebarrel.io/why-we-switched-from-dynamodb-bac...

Dynamo, like many NoSQL dbs, is (partition key, subkey, value). Partition key gets you to a storage node. Subkey gets you range queries on that storage node. That's the "fundamental data storage model" for most of NoSQL. And it's extremely useful for some use cases.
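
To make the (partition key, subkey, value) model concrete, here is a toy in-memory sketch. It is not DynamoDB's API; the class `TinyKeyValueStore` and its methods are hypothetical. The partition key picks a bucket (standing in for a storage node), and the sorted subkey within that bucket is what makes range queries cheap.

```python
import bisect
from collections import defaultdict


class TinyKeyValueStore:
    """Toy (partition key, sort key, value) store; not DynamoDB's API."""

    def __init__(self):
        # partition key -> parallel lists of sort keys (kept sorted) and values
        self._keys = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, partition, sort_key, value):
        keys, values = self._keys[partition], self._values[partition]
        i = bisect.bisect_left(keys, sort_key)
        if i < len(keys) and keys[i] == sort_key:
            values[i] = value  # overwrite an existing item
        else:
            keys.insert(i, sort_key)
            values.insert(i, value)

    def query(self, partition, low, high):
        """Range query on the sort key within one partition (inclusive)."""
        keys, values = self._keys[partition], self._values[partition]
        start = bisect.bisect_left(keys, low)
        end = bisect.bisect_right(keys, high)
        return list(zip(keys[start:end], values[start:end]))


store = TinyKeyValueStore()
store.put("user#1", "2018-12-01", "login")
store.put("user#1", "2018-12-15", "purchase")
store.put("user#1", "2019-01-02", "logout")
store.put("user#2", "2018-12-10", "login")
december = store.query("user#1", "2018-12-01", "2018-12-31")
```

Note what is missing: there is no efficient way to ask "all purchases across all users" without scanning every partition. In this model you would typically answer that by writing the data a second time under a different key, which is the data duplication mentioned in the replies.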

You could have mapped your use case to this storage model and gotten ridiculously fast queries on vast mountains of data, which is largely the point of NoSQL. But this would have required data duplication, home-made management tools for dealing with said data duplication, and other work that was clearly better spent implementing the much simpler SQL solution.

A benefit of the SQL approach is that if your queries start getting bogged down from growth in data size and/or request volume (not entirely unexpected behavior when you have a couple joins), you can move to a different model where you treat the SQL db as a slow-access source of truth and periodically generate key/value into NoSQL or Redis/ES/etc for actually running queries.

I’m very well aware of that. If they thought about using DynamoDB and they did even cursory research, they would know that it’s basically a glorified key value store with the chance of using multiple key value types - ie global and local indexes.

There are certain (limited) use cases where it is good. Every technology choice takes you down a certain road and limits your implementation choices in certain ways. It’s not the fault of the tool if they choose one that doesn’t meet their needs.

Also, the beauty of something like AWS, is that you can do “polyglot persistence”. You can have different services in your infrastructure use different types of storage depending on the use case.

A product that actually uses all the features of Oracle would be naturally hilarious. Oracle's documentation is over 500 megabytes compressed. It's an alternative computing reality.

Here is a 5559 page book of error codes. Imagine having a physical copy of this on your desk.


> alternative computing reality

This is such a perfect description. I was working on a Django feature last year that had to work on all DBs and testing the Oracle stuff was like entering a new reality lol.

And full of such amazing utility, such as:

" CLSRSC-00009: No value passed as OLR locations

Cause: No value was passed as OLR locations.

Action: None "

Holy shit

This is exactly right, wow. I am almost speechless.

> If you are actually taking advantage of all of the features of sql server or Oracle

I doubt any real world application comes anywhere close to doing that. Heck, I’d be mildly surprised if any real world enterprise, considering all their apps together, does that for either SQL Server or Oracle.

This sounds more like an argument to not use a NoSQL database without a specific use case. That's an argument I completely agree with. I'm not sure it's a specific argument against Dynamo or AWS in general though.

This is any system that uses a database. It's just worse because only aws has Dynamo.

I think you’re underestimating the degrees of difference. There are many apps which support multiple SQL databases but far fewer support multiple NoSQL systems because the differences are more significant than an ORM can cover up.

You’re always “locked into” your infrastructure. I shake my head when I see some bushy tailed developer using the repository pattern just in case the corporation decides to abandon their million dollar Oracle installation to go with Postgres.

As far as being “locked in” because of serverless. With AWS the only thing you are doing code wise to make your code “serverless” is creating one function as an entry point that takes in two parameters - a JSON object and a lambda context. The only thing your handler should be doing is mapping the event to your domain objects and calling your back end code.
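
A rough sketch of that shape in Python, under the assumption of a made-up order-processing domain (`Order`, `process_order`, and the event fields are all hypothetical). Only `handler` knows it is running under Lambda; the rest could be called from a service, a CLI, or a test.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Domain object; knows nothing about Lambda or any other host."""
    order_id: str
    quantity: int


def process_order(order: Order) -> dict:
    """Back-end logic, callable from a Lambda, a CLI, or a long-running service."""
    return {"order_id": order.order_id, "accepted": order.quantity > 0}


def handler(event, context):
    """The only Lambda-specific code: map the event, then call the back end."""
    order = Order(order_id=event["order_id"], quantity=int(event["quantity"]))
    return process_order(order)
```

Moving off Lambda would then mean replacing `handler` with a different thin entry point, not rewriting `process_order`.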

I have a c# .Net Core project that can quite easily be deployed as both a lambda that is triggered by an SQS message or as a Windows service. In fact, my CodeBuild step that is triggered by a git commit compiles it as both a zip file that is deployed to lambda and a Windows executable - yes you can build a Windows .Net Core executable on Linux and vice versa.

People have a dream of making their implementation vendor neutral but wholesale migrations involving major infrastructure changes rarely happen. The risk of regressions is too high and the benefits too low.

This 1000x.

A well designed and properly isolated codebase is going to have limited technical debt in switching providers for various services.

But the cost of maintaining vendor neutral code and avoiding making use of the unique product advantages of the provider you've already chosen has significant downside.

It becomes a matter of designing for the lowest common denominator between providers instead of making full use of the ways in which your existing product choice excels.

If you are frequently changing providers, you have other issues afoot. If you rarely change providers, it's doubtful it'd be a net loss to design vendor-specific interactions with those services/products. Just add an intermediate API to isolate vendor specific code, which is going to be a good thing to do anyways for testing.

The vendor lock in argument is one of the few programming boogie-men that really gets under my skin.

Amazon made a pretty big deal of how they finally migrated off of Oracle to get away from their repressive licensing. Being locked in sucks when the vendor can extract monopoly rents.

Maintaining a second source is a good business strategy.

Yes and Amazon built their own globally distributed infrastructure, DynamoDB and Redshift - a purposeful play on shifting from Oracle - a random company isn’t going to do that. And it took 10 plus years. Again a random company is not going to take on a ten year plan to move from Oracle.

Amazon has engineering teams that very few other companies have. Do you think a bank would be able to move off Oracle like Amazon did? I do not think so. Maybe they would be able to move to DB2 or SQL Server with the help of the other vendors, so trading one vendor lock-in for another.

It's why the military tend to insist on it.

>> I shake my head when I see some bushy tailed developer using the repository pattern just in case the corporation decides to abandon their million dollar Oracle installation to go with Postgres.

The repo abstraction is a fantastic way of declaring your data access implementation and separating its concerns (especially limitations) from the service tier. Once declared - via interfaces - you can substitute the concrete implementation for something that suits the environment and the ever maturing agile use case.

Given how this is a fundamental scale/refactoring primitive, perhaps you haven't considered how others use it. I've used it time and time again at least 3 ways:

Monitor and consolidate data access patterns. If ITopSecret repo requires you pass an IUser to your verbs, you only need to audit/validate this layer to prevent unauthorised access. No one will replicate the query in IExportService and forget to check permissions again.

To scale read-heavy data, with new technologies like predictive search: Slowly pivot away from SQL sprocs/views to Solr/Elasticsearch by swapping out the concrete implementation ISomethingSearch repo with a new one.

For offline 'work' and unit tests: Create new concrete implementation of IBlobRepository that read/writes to your file system or an in-memory chunk.
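
The third use can be sketched like this. It is a Python illustration rather than the C#-style interfaces named above, and `BlobRepository`, `InMemoryBlobRepository`, and `archive_report` are hypothetical names. The service function depends only on the interface, so a test can hand it an in-memory implementation.

```python
from typing import Dict, Optional, Protocol


class BlobRepository(Protocol):
    """Interface the service tier depends on (hypothetical)."""

    def read(self, key: str) -> Optional[bytes]: ...

    def write(self, key: str, data: bytes) -> None: ...


class InMemoryBlobRepository:
    """Concrete implementation for unit tests and offline work."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def read(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data


def archive_report(repo: BlobRepository, name: str, body: str) -> str:
    """Service-tier code: unaware of where the bytes actually end up."""
    key = f"reports/{name}"
    repo.write(key, body.encode("utf-8"))
    return key
```

A production implementation backed by object storage or a database would satisfy the same `read`/`write` shape and could be passed in without touching `archive_report`.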

The larger point is though if a developer went to a CTO and told them that they can get rid of the million dollar Oracle installation and replace it with Postgres because they used the repository pattern so therefore there is no vendor lock in, they would be laughed out of the building.

You're not the only one, the paper mentions this too :)

> As a result, serverless computing today is at best a simple and powerful way to run embarrassingly parallel computations or harness proprietary services. At worst, it can be viewed as a cynical effort to lock users into those services and lock out innovation

No, it is also about writing fancy stuff in your CV.

Vendor lock-in is a serious issue. But there is an upside. Say you're building a serverless API. Say you build it using Python with Flask or Bottle. You can use a packaging framework such as Zappa such that it can be deployed to serverless hosting or it can run in uwsgi / nginx on a server.

I'd be delighted to see a universally adopted standard specification for serverless computing, including management interfaces and platform compatibility. The sort of specification that competing cloud providers could implement, thus better empowering consumers like myself. I have faith it will come around eventually; all of this is relatively new.

This is what the CNCF is trying to achieve with Kubernetes as a foundation for managing compute/containerized workloads. If we lay down the right set of abstractions for app developers then the theory is the workloads become somewhat portable. Think of it as the core AWS services generalized as kubernetes services -- and k8s (or even stuff on top in theory) could still be managed by the provider.

The problem for the CNCF and its projects is you can almost always move quicker if subscribing to the walled gardens of the major vendors... even if the experience is less than optimal and definitely not portable.

Hopefully this changes over the next year as better abstractions (think Fn Project/Knative/Rook/someDB) and even some standards (cloudevents) emerge.

That said I suppose data could be traditional DB's (sql/nosql) managed by k8s... not sure about this yet.

Is it though? I think the possibility to move to another solution is the way to deal with vendor lock in. Simple example, one of my clients wanted to move to AWS but they were super afraid of lock in. We had to create exit strategy for them with cost analysis and now they are happily using AWS services knowing they can move to GCP or something else for a certain amount of money.

Glad to hear things worked out for your client. The key thing is, YOU had to create an exit strategy just for your client.

If the specifications for the serverless platform were "open" then the client could move their stack to different vendor with insignificant planning/risk.

An analogy would be vendors somehow changing machine virtualization such that your code was required to be very hypervisor-aware to function, and porting from vmware to xen imposed significant cost and risk.

I am not sure how much of this is not already mitigated with https://serverless.com. I have never used Lambda without it. For me serverless.com is the solution to avoid vendor lock in. However, Lambda is a tiny fraction of the entire stack we are talking about here. S3 + EMR + ALB + EC2 are all there and they are much more difficult to dodge the vendor lock in bullet with. One particularly sticky vendor lock in for S3 is pricing and reliability. Unfortunately it is almost impossible to beat that. My clients just discovered this, it would be significantly more expensive to move from S3 to HDFS or any other storage solution. This has nothing to do with how open the platform is, yet it is a much more serious lock in.

I don't find it that concerning for a couple reasons.

1) Much serverless code will be vendor specific anyways. Anecdotally I'd say at least half of all serverless code I've written was to deal/work with vendor services.

2) You can easily write code to be platform independent with a thin wrapper for whichever serverless system you are currently using. This is pretty straightforward unless you are using vendor services, in which case you are locked in anyways due to that dependency.

Lock-in is a massive problem with FaaS, but FaaS certainly isn't about lock-in [1]. It's all about reduction of developer ops - not having to provision (or even scale) servers is quite nice.

[1] In fact, for FAAS start-ups (such as my own - PiCloud), the lock-in fear was one of the biggest hurdles.

Uber-nitpick, but I’m pretty sure that you meant to suggest that serverless appears to be about creating vendor lock-in, not about creating “vendor locked-in” technical debt. While technical debt can take the form of vendor lock-in, the two concepts are otherwise unrelated, and they’re both fairly well defined. I can imagine the public cloud providers wanting to create vendor lock-in, but I don’t see why they’d care if doing so creates technical debt for their customers. I say this based on the accepted definition of technical debt, and again, this is a huge nitpick.

You can abstract your logic from the infrastructure with the Serverless framework and deploy on any provider.


I agree that is a big risk of serverless. But it isn’t inherent to it. GitLab will have serverless on December 22 based on Knative and open source code. But the current state of the industry is that 95%+ of workloads are bound to a specific hyper cloud (AWS, Azure, GCP).

Proprietary FaaS is like Stored Procedures in proprietary DBMSes but in the Cloud.

I would say that broadly categorizing FaaS as vendor lock-in would be disingenuous on its own. Practical FaaS is a simple concept as far as developer implementation is concerned. Write functions, get HTTP context, and return a result. This article really only touches on problems with vendor implementation of FaaS, AWS may have these issues with I/O and lifetime of a function, but there are many open source implementations that you could install on custom hardware that bypasses these issues. Even some of the issues that seem like fundamental issues with FaaS (non-addressable instances, unable to use specialized hardware) are already solved by other open source self-hosted FaaS solutions.

Care to elaborate?

Edit: to clarify, I’m referring to the assertion at the end of your comment, and I’m genuinely interested.

The increase in CPU utilization efficiency (and lower costs) is decreased by lack of portability.

Everything has a cost. And so being able to move your code around at will is the cost of saving money.

Actually, most modern FaaS based on Amazon's is portable; it's the rest of the stack that isn't very portable.

I loved the article https://arxiv.org/pdf/1812.03651.pdf

It mentions two temporary bottlenecks:

1. Functions can run only for 15 minutes and experience cold-starts.

2. You can't use specialized hardware like GPUs.

It also mentions two fundamental bottlenecks:

1. Serverless functions are run on isolated VMs, separate from data. In addition, serverless functions are short-lived and non-addressable, so their capacity to cache state internally to service repeated requests is limited. Hence FaaS routinely “ships data to code” rather than “shipping code to data.”

2. Because there is no network addressability of serverless functions, two functions can work together serverlessly only by passing data through slow and expensive storage. This stymies basic distributed computing. With all communication transiting through storage, there is no real way for thousands (much less millions) of cores in the cloud to work together efficiently using current FaaS platforms other than via largely uncoordinated (embarrassing) parallelism.

Disclosure: We just shipped serverless in GitLab https://about.gitlab.com/2018/12/11/introducing-gitlab-serve...

As far as “bottlenecks”.

Almost all applications have data separate from code. Most implementations don’t run their applications on the same server as their databases.

Most app servers shouldn’t be “caching state locally” anyway. Once you do that it makes scaling horizontally harder - you have to Implement sticky sessions.

And in all practicality, I would assume that most people are using EBS for their storage which is network attached storage.

Two lambda backed microservices work together just like any other microservices - via known urls and over https.

>you have to Implement sticky sessions.

There is more to the world than serving HTTP requests.

To expand on your point, there is also more to cache than user-specific data. Exchange rates, tax rates, product catalogs, and more can be usefully cached without requiring sticky sessions.

Things like exchange rates, catalogues, tax rates are not generally 'session oriented' anyhow. They could be cached by some other means hopefully at least somewhat transparent to the business logic anyhow, so to the OP's point, I don't think this is a limitation.

I'm in agreement that the fundamental limitations described in the article are academic for the most part. 'Custom GPUs' are not really what we're after with Lambdas etc, at least not yet - and I'll bet when we get to that point it may be possible to access specialized hardware in Lambdas as well.

In addition, things like '15 minute timeouts' are kind of arbitrary product design issues: we generally don't need 'long-timed Lambdas' because that's not what they are for. Something that needs 15 mins of crunching is better suited to another piece of the infrastructure anyhow.

Serverless computing, done in a manner that doesn't throw up a bunch of hurdles, definitely has a lot of advantages.

The more obvious downside to me is the degree to which they are all fairly custom implementations, and that Lambdas aren't so easily ported between providers.

who keeps that data locally on the server?

Anyone implementing business-critical applications which have to continue to function if external services go down.

Ever wondered how cash registers work? Exactly like that. The world may burn down around you, but as long as your cash register PC is running, it must continue to at least offer basic functionality.

The vast majority of point of sale systems are networked, and this is nothing new. There's literally millions of PoS systems out there running off of AS/400 backends, even today.

You mean like all of the cash registers at companies like Target that were breached because they were distributed?


If you're querying the IRS for tax rates every time you run a calculation, you're doing it wrong.

If you aren’t querying exchange rates often - you’re also doing it wrong....

Depends on what you're using the exchange rates for. The rates used by Visa only change once per day, as an example.

And you still wouldn’t store that locally. You may cache it locally - something you could also practically do with lambda.

Well yes, the parent we're replying to was talking about caching.

Sure, you have communications over queues, event messages etc. both supported by lambda. You also have higher volume messages from something like IOT devices where you would use something like Kinesis.

Are FaaS’s fundamentally aimed at those use cases?

Yes. Those are all processes that are “embarrassingly parallel” workloads. You can just point lambda to a queue for instance and it will automatically scale based on workload. When nothing is in the queue, you’re not paying for resources just to poll an empty queue.

> Most app servers shouldn’t be “caching state locally” anyway. Once you do that it makes scaling horizontally harder - you have to Implement sticky sessions.

Why not? In-memory caches are fast and fairly uncomplicated to write.
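
For a sense of how little code that is, here is a minimal time-to-live cache sketch. The class `TTLCache` and the injectable `clock` parameter are assumptions for illustration; it has no size limit and is not thread-safe, so treat it as a starting point rather than something production-ready.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Something like `cache.get("EUR/USD", fetch_rate)` (with `fetch_rate` being whatever slow lookup you already have) would then hit the backing service at most once per TTL per process, with no sticky sessions involved because the data is not user-specific.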

You always have to apply pressure to keep the system decoupled or you get a Ball of Mud. If you switch to micro services and serverless and then have a high fanout and caches everywhere just to make the numbers work, you haven’t gotten rid of the Ball of Mud, you’ve just hidden it in plain sight.

Serverless and embarrassingly parallel problems look like pure functions punctuated by one or two state changes. To get that you should probably be passing all the data in, instead of communicating out of hand with a cache to keep your fanout from looking awful (again, hiding the problem).

But who is doing the "passing-in" then, and where does the data come from? This still sounds like hiding the problem...or more specifically, making it someone else's problem.

Who does the passing in in development mode? Where does the data come from?

You have this problem anyway, and it blocks you from launching new code. These out of band communication mechanisms always lead to problems with developer ergonomics and repeatability.

It's essentially a larger version of the problem of testing pure functions versus heavy usage of mocks. Passing the data in directly makes testing the callee very simple.

Usually the data comes from outside the system - an api integration, user interactions with a website, an ETL process involving a file, IOT devices, etc.

The sticky session thing is about a client's session. You store the identifier on one machine, and if the same client hits another machine, it knows nothing of that session ID. This is one example, but in-memory caching only works well in this setup for client-agnostic data: type table lookups, etc.

Because when you have horizontally scaled to 1000 servers, when the next request comes in, it's very likely to hit a different server from last time.

So your cache hit rate is going to be horrendous. Take the cache hit rate you'd have for a single server, and divide it by 1000, the odds of getting the same server again. So if your cache hit rate was 80%, now it's 0.08%.
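
The arithmetic above, written out. This follows the comment's simplified model (requests are routed at random, and a cached entry is only useful on the one server that has it); the function name is made up for illustration.

```python
def expected_hit_rate(single_server_hit_rate, servers):
    """Hit rate when a request must land on the one server holding its entry."""
    return single_server_hit_rate / servers


rate = expected_hit_rate(0.80, 1000)
print(f"{rate:.2%}")  # prints 0.08%
```

The model only applies to per-client data. Entries that every server ends up caching, such as tax rates or a product catalog, keep roughly their single-server hit rate.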

Of course you could arrange for a user's requests to hit the same server as last time, but that's sticky sessions.

Wait so how is Lambda making this worse? In this case it’s pretty clear you want a caching server instead of an in memory cache. I don’t see why the 1000 serverless functions is worse than 1000 long running processes.

That’s kind of the point. People act like lambda is some weird thing that completely changes your architecture.

Even when I’m hosting things on VMs, I still consider them as disposable. I don’t store state on them, logs are all sent to a central store like Cloudwatch or ElasticSearch, if a health check fails, autoscaling just kills the instance and initiates another one based on a custom prebaked image and a startup script, etc.

Long lived data is stored in a database and cached with either Memcache or Redis.

We had an old school Devops guy that wanted me to list the names and IP addresses of some servers that I was using. I told him that I would have no idea and even if I did give him a list, it would change tomorrow. I have some Windows servers in non-prod environments that host some Windows services based on queues. The servers are automatically terminated when nothing is in the queue and it may spin up a dozen when needed. My servers don’t have names - they have tags.

How is this any different than using a Lambda?

> How is this any different than using a Lambda?

You had to write all that orchestration code yourself.

That said, I completely agree with you, and utilize the same arch at dayjob.

There are “app tier” cattle, and there are a few special pets (databases) that are backed up, loved, hand-held, etc. because database failover is still painful for users for seconds-to-minutes. When you’re serving thousands of requests per second, that’s a lot of angry users.

RDS, Azure SQL, Cassandra, Aurora, whatever still cause measurable pain for the end user during node failure.

> You had to write all that orchestration code yourself.

With the Windows VMs I had to:

- integrate TopShelf to make it a Windows Service

- write the code to poll the queue (no big deal but still)

- create two Cloudwatch alarms to monitor the queue to tell the autoscaling group when to scale in and out

- define the launch configuration for the autoscaling group

- define the autoscaling group

- add health check functionality to the Windows service, which could be integrated with the autoscaling group, to ensure that the Windows service was running (well, I didn’t have to, but to be complete)

Alternatively with lambda all I had to do was

- add a method that took in two parameters - the SQS event and the lambda context.

- add the SQS event to trigger the lambda.

Of course both stacks were created with CloudFormation. One I had to write myself; for the other I basically just set everything up in the web console, exported the CloudFormation template and made some minor changes.
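For comparison, here is roughly what the Lambda side looks like, sketched in Python rather than .NET. The `process_order` function and the message contents are hypothetical; the only part taken from Lambda itself is that SQS messages arrive as a batch under `Records`, each with a `body` string.

```python
import json

def process_order(order: dict) -> str:
    # Hypothetical domain logic; stands in for whatever the queue consumer really does.
    return f"processed {order['id']}"

def handler(event, context):
    """Lambda entry point: takes the SQS event and the Lambda context."""
    results = []
    for record in event["Records"]:  # SQS delivers messages in batches under "Records"
        results.append(process_order(json.loads(record["body"])))
    return results

# Local smoke test with a hand-built event shaped like an SQS batch.
fake_event = {"Records": [{"body": json.dumps({"id": 1})}, {"body": json.dumps({"id": 2})}]}
print(handler(fake_event, None))
```

The polling loop, alarms and scaling configuration from the VM list have no counterpart here because the platform owns them.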

>Almost all applications have data and separate from code.

Almost all applications aren't distributed in any meaningful sense of the word.

If you aren’t doing distributed applications, why are you even considering lambda - the main purpose of which is massive parallelism?

For us, lambda is great for the random junk drawer of compute load. Cron jobs; miscellaneous async jobs we'd have used something like sidekiq for in the past; infrequently used microservices (where cold start doesn't matter) like schema registries (where client caching means it might not be hit for days). In the past I've had "job servers" running these various things.

Yeah, the article isn't arguing about those things, but the rest of the jobs, especially data dependent things. For non time critical miscellaneous stuff lambda is great.

Because 'you' don't get to make the decisions.

What type of organization will ignore experienced developers who can rationally argue for a type of architecture? I've been developing professionally for over 20 years and I can honestly say that I’ve never been overridden on small-scale architectural decisions, like whether we should use lambda versus an EC2 instance, unless the decision makers had some larger insight than I had.

I don’t mean I could have decided whether to go to Azure or AWS or use Oracle instead of SQL Server or something on that level.

I'm not in this arena at all but Amazon's lambdas seem like a way bigger architectural decision than azure vs AWS or Oracle vs sqls

It’s actually smaller. A lambda, code-wise, is just a function that takes in two parameters. Like I said before, architecturally, your lambda entry point should be skinny - take in lambda event data, map it to your domain object and call your domain services (not http services, your business layer).

In fact, you can take a standard C# Web API and, using the AWS SDK, deploy the same code either as a lambda or as a standard deployment to IIS or a self-hosted executable with Kestrel.

I’m speaking in terms of .Net but the same concept applies to any language.

One project has your lambda interface with a dependency on your domain classes.

A second project has your standard controller code with a dependency on your same domain classes.

You can have yet a third project that in my case uses TopShelf - a nuget package that makes it easy to create Windows services - that also uses Kestrel to self host a web service.

I can deploy the same code to either IIS, Windows as a service, a Linux VM (with Kestrel, without TopShelf) or lambda. The only difference is the code deployment part of my CI/CD pipeline. I use AWS CodeBuild with a preconfigured .Net Core Linux Docker container that builds to either target and stores the artifact as a zip file in S3.
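The layering described here, sketched in Python instead of .NET: one domain function with two thin hosts around it. `quote_price` and its pricing rule are made up for illustration, and the Lambda event shape assumes an API Gateway proxy integration.

```python
import json
from urllib.parse import parse_qs

# Domain layer: knows nothing about Lambda or HTTP servers.
def quote_price(quantity: int) -> dict:
    return {"quantity": quantity, "total": quantity * 5}  # hypothetical business rule

# Host 1: Lambda entry point (assumes an API Gateway proxy-style event).
def lambda_handler(event, context):
    qty = int((event.get("queryStringParameters") or {}).get("quantity", 1))
    return {"statusCode": 200, "body": json.dumps(quote_price(qty))}

# Host 2: plain WSGI app for a VM or container; any WSGI server can run it.
def wsgi_app(environ, start_response):
    qty = int(parse_qs(environ.get("QUERY_STRING", "")).get("quantity", ["1"])[0])
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(quote_price(qty)).encode()]
```

Both hosts only translate their input into a call to `quote_price`, so switching deployment targets leaves the domain code untouched.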

None of this is onerous. I test locally on my Windows machine and just push to git; everything else just works based on some YAML files.

I think another very huge benefit is that developers don't have to worry about infrastructure.

I don't know if it's there yet, but at least I imagine that's something serverless technologies strive towards.

Not worrying about infrastructure implies trivial portability between IaaS providers... Is that at all true?

Again, in the real world, an AWS shop isn’t just going to decide willy-nilly to pick up and go to Azure just to temporarily save a few dollars.

Is the main purpose massive parallelism? I was under the impression it was mostly for rarely used resources, because you only pay when you use them.

Not every application is a web application. There is this quaint old framework for distributed computing called Hadoop, maybe you've heard of it? It has a little brother Spark. Anyway the whole point is to "ship code to data". It's an interesting idea, but I don't think it will ever go anywhere.

And surprisingly that use case works with Kinesis and queues etc.

No, map reduce is not stream processing at all. It’s a transformation/computation over what may be many terabytes of data.

Yes, it's possible, and the SP describes the shortcomings involved in giving up data locality and cross-node interaction.

> there is no network addressability of serverless functions

Didn't inetd (4.3BSD, June 1986) solve the problem?

No. It solved a problem that was vaguely analogous if you consider a binary sitting on some Unix server the equivalent of a “serverless function”. They are somewhat similar, in that both are relatively self-contained and stateless compared to the alternative, not requiring anything to be explicitly spun up or down (a daemon or a VM/container respectively). But an inetd binary runs on an existing server, and the client is expected to know the address of the server. A “serverless function” can be spun up on the fly on any number of provider-owned servers which nobody has seen before, so there needs to be a separate mechanism to (1) request a server to spin up if necessary and (2) locate that server. Right now, Amazon Lambda only provides this through high-level abstractions such as Amazon API Gateway; you can’t even listen on a port from a Lambda container, so you’re limited by the speed of those gateways.

"Locate that server" is DNS. Or possibly reverse proxying, if DNS is hard and Apache is easy.

Transparent elasticity is the fundamental advance here, and it seems to come with a surprising number of tradeoffs.

There is nothing stopping inetd services from sharing disk data or communicating with other local services.

The particular inetd was just an example, but even for use with exact inetd from 1986, there are many solutions in modern operating systems to restrict any process, like SELinux, AppArmor, Seccomp and capabilities, LXC unprivileged containers and whatnot.

You're not wrong.

Regarding the bottlenecks, note that CSPs have been doing work to run functions next to the data without having to “retrieve” it to a warmer storage level.

Consider Amazon Web Services’ S3 Select, Glacier Select, and Athena introduced a year ago:


If you look at 2017 announcements then 2018 announcements, you can project a likely strategy here.

Very interesting. I wasn’t aware that with S3 select [0] you can select columns by using SQL on a zipped file stored in s3.

[0] https://aws.amazon.com/blogs/aws/s3-glacier-select/
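A hedged sketch of what such a query might look like from Python. It only builds the arguments for boto3's `select_object_content` call, so it runs without AWS credentials; the bucket, key and column names are invented, and the object is assumed to be a gzip-compressed CSV with a header row.

```python
def s3_select_params(bucket: str, key: str, columns: list) -> dict:
    """Build keyword arguments for boto3's S3 client select_object_content call."""
    cols = ", ".join(f's."{c}"' for c in columns)
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT {cols} FROM S3Object s",
        # Assumes a gzip-compressed CSV whose first row holds the column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
        "OutputSerialization": {"CSV": {}},
    }

params = s3_select_params("example-bucket", "logs/2018.csv.gz", ["user_id", "status"])
print(params["Expression"])
# With credentials configured: boto3.client("s3").select_object_content(**params)
```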

That's not a disclosure, that's an ad.

So knowing this what has Gitlab done to overcome it?

If they haven't, why not? Why is it hard or inconvenient?

What we now shipped is a Minimal Viable Change (MVC). I suspected that serverless wasn’t ideal for all use-cases but I couldn’t articulate this until I read this paper.

We have taken no steps to overcome these problems, I think we’re likely to follow the leaders in the industry like Knative from Google and TriggerMesh.

In general I think serverless is more of a pattern than a technology. We’ll use Knative to offer PaaS functionality (Heroku) on Kubernetes with GitLab.

I find it kinda funny that serverless can't do GPUs while docker containers mounted with the nvidia-docker runtime can do GPU capable tasks.

Are these serverless solutions really so different?

Serverless/FaaS is just a buzzword which describes proprietary hidden abstraction on top of vendor's isolation system (VM), proprietary image and orchestration APIs. One day they may happily announce that they allow you to:

   func yay(...) {

Yes, they usually have greater levels of isolation.

I'm interested to pick your brain about the second fundamental bottleneck you/the article mentioned. I'm not sure that I agree that serverless functions are fundamentally non-addressable. Today they're not addressable, but why can't they be in some future evolution of serverless? Sure serverless functions are ephemeral, but during their brief lifetime, is there something fundamental keeping them from being addressable?

I’m not a serverless expert. Consider opening an issue with your proposal, or even better a merge request against the serverless functionality in GitLab that is already in master.

Implementation details, not conceptual ones

As others here have already noted, serverless is still not fundamentally any better than good old CGI, and despite recent advances in virtualisation it still follows the same stateless request-response paradigm.

For me the biggest shortcoming in serverless is that it is actually a poor fit for modern interactive web apps (and apps in general) with a constantly changing state.

Technologies such as HTTP/2, websocket and SSE are clearly pointing in the direction of long-running client-server interaction, yet current serverless solutions have (to the best of my knowledge) no answer for that.

I think the big challenge is in how to do serverless computing with long-running processes, in a way that solves isolation and scalability at the same time.

> Technologies such as HTTP/2, websocket and SSE are clearly pointing in the direction of long-running client-server interaction, yet current serverless solutions have (to the best of my knowledge) no answer for that.

This absolutely kills serverless for me: we are moving into a world where real-time interaction and updates are critical for serving our customers.

HTTP/2, websockets, and SSE bring various benefits. Here are a couple of examples:

- The potential for much improved user experience through greater application responsiveness,

- Delivery of relevant data and insight, as and when it becomes available, without the need for clients to ask for it - again, for us that can deliver a much better user experience.

These are nullified by (current implementations of) serverless.

Exactly. I find that stateful patterns are particularly hard to implement in the serverless paradigm. Even where some components can do it, most of them are tricky.

One thing serverless architecture looks a little bit like is Erlang/BEAM. There you can: 1. Call a function to spawn a process; the function itself is executed in that new process. 2. Have the process hold state and iterate in a recursive manner. 3. Have the process wait for a message, or send another process a message. And that's the minimalist way to build up something called an Actor.
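A minimal sketch of those three steps in Python. Threads and queues stand in for BEAM processes and mailboxes, and a loop replaces Erlang's tail recursion, so this only illustrates the shape of the idea; `spawn`, `counter` and the `None` stop signal are names and conventions chosen for this example.

```python
import queue
import threading

def spawn(behavior, state):
    """Start an actor: a thread that owns `state` and reads messages from a mailbox."""
    mailbox = queue.Queue()

    def loop(current):
        while True:
            msg = mailbox.get()               # 3. wait for a message
            if msg is None:                   # None is our (arbitrary) stop signal
                return
            current = behavior(current, msg)  # 2. carry state into the next iteration

    thread = threading.Thread(target=loop, args=(state,))
    thread.start()                            # 1. the function runs in its own "process"
    return mailbox, thread

def counter(count, msg):
    kind, reply_to = msg
    if kind == "incr":
        return count + 1
    if kind == "get":
        reply_to.put(count)                   # send a message back to the asker
    return count

mailbox, thread = spawn(counter, 0)
for _ in range(3):
    mailbox.put(("incr", None))
reply = queue.Queue()
mailbox.put(("get", reply))
total = reply.get(timeout=5)
mailbox.put(None)
thread.join()
print(total)  # 3
```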

With Actors and bi-directional protocols like WebSocket, you can achieve amazing dynamic stuff easily.

Serverless, on the other hand, is mentally very similar to that paradigm, but without the long-running process, and with only single-direction messaging, etc.

There should be huge potential ahead. All we need is a cloud operating system like Erlang/BEAM, but a much bigger one, at planet scale.

What's happening here that a Layer 7 Load Balancer, Multiple Serverless Functions, and a Websocket can't handle?

AWS released exactly that during the last re:Invent with the Websocket-integrated ALB being capable of triggering Lambda

You're basically locking yourself in with a particular vendor if you choose to use that tech (rather than a standard stack deployable anywhere with compute power).

Going beyond the vendor lock-in, one of the biggest initial draws of serverless - from speaking with friends - has been simplicity in situations where state isn't a concern. On the other hand, here we're discussing a bunch of hoop jumping to get stateful functionality that might be more easily achieved by adding a library such as SignalR or socket.io to your app.

Whilst this is probably a few screens to set up via the AWS console, and can obviously be automated, you add complexity to development and delivery. Critically, you also make it harder to diagnose and fix problems.

Like anything it's a trade-off: for some apps these issues won't matter so much, for others they will. For us, it would definitely be a problem.

Serverless is a tool with good use cases where your tradeoff-win will be great scalability. It's not that great for general use (yet) IMHO. You can create a "session" in redis pretty easily if you're willing to add another database as a requirement. That way you can share state. It's also very new so the ecosystem will keep maturing and enabling more use-cases.

Does the websocket really need to do much more than ferry pub/sub data to the client?

The next question is about read versus write traffic. If your reads outstrip writes, you move the work to save time instead of display time.

If both of those are true then I can see some ways to wedge serverless in there but they all look like job execution systems. Having a polyglot job processing system is nothing to sneeze at, but it seems like people want serverless to be more than that.

> Technologies such as HTTP/2, websocket and SSE are clearly pointing in the direction of long-running client-server interaction

Yes, we've skipped from raw TCP directly to websockets. There is the same need for interactivity. Protocol change is just for going from an OS to this browser-OS we have now.

API Gateway now supports websockets. Will that change anything?

> Technologies such as HTTP/2, websocket and SSE are clearly pointing in the direction of long-running client-server interaction

No, not really.

In modern web "application" load-balancer servers like Nginx, when the web server receives HTTP2 traffic, it will unwrap the HTTP2 stream into its individual constituent HTTP "sessions", and then make a separate upstream request for each session. From your application-server's perspective, you don't need to care about the "state" of HTTP2; it's just a transport-layer detail. If you wrote a (connectionless) REST API, its semantics survive entirely unchanged over an HTTP2 carrier. You just need an HTTP2 "tunnel terminator", sort of like how stunnel(1) works for TLS, or how IPSec ESP tunnels are terminated by the kernel.

The same goes for websockets and SSE. You just deploy a reverse-proxy like https://github.com/fanout/pushpin, and then the client sees WebSockets or SSE streams, while your backends can treat the client as 1. a source of regular HTTP requests (translated from WebSocket messages); and 2. a webhook registrant (where the webhook is really just an endpoint on the reverse-proxy, and reverse-proxy sends the registration message when the client opens the stream), where you can then call that webhook with data—from any of your backends—to push a message into the relevant connection-oriented stream.

With a "smart" load-balancer/API-gateway layer like this keeping connection state for you, you can totally write your app entirely as a set of stateless functions running on stateless backends, while still allowing clients to "approach" the app statefully. (And I say this as someone who writes extremely stateful, connection-oriented Erlang service backends. Even my most fraught use-cases, that would never work or scale under e.g. PHP, do have a way to both work and scale under the FaaS+"smart API gateway" paradigm.)
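To make that split concrete, here is a toy in-process model in Python. It is not Pushpin's API or any real gateway's; `Gateway`, `on_open` and `on_event` are invented for illustration, and lists stand in for open client streams. The gateway holds all connection state, while the backend functions keep nothing between calls.

```python
from collections import defaultdict

class Gateway:
    """Toy stand-in for a connection-holding proxy; a real one holds sockets."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # channel -> list of client outboxes

    def connect(self, handler, request):
        outbox = []
        channel = handler(request)             # stateless call; returns a channel name
        self.subscribers[channel].append(outbox)
        return outbox                          # represents the client's open stream

    def publish(self, channel, message):
        for outbox in self.subscribers[channel]:
            outbox.append(message)

# Stateless "functions": no memory between invocations.
def on_open(request):
    return f"room:{request['room']}"

def on_event(gateway, event):
    gateway.publish(f"room:{event['room']}", event["text"])

gw = Gateway()
alice = gw.connect(on_open, {"room": "lobby"})
bob = gw.connect(on_open, {"room": "other"})
on_event(gw, {"room": "lobby", "text": "hello"})
print(alice, bob)  # ['hello'] []
```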

And note as well that the "smart API gateway" isn't [or at least shouldn't be] something proprietary+nonstandard that a given FaaS provider is running for you (it's just a piece of software you can stand up an autoscaling cluster of); and nor does your API-gateway cluster need to live in the same cloud that your FaaS functions do. Your API-gateway is just a fancy web proxy making requests to upstreams; if you think of it like your API-gateway being your web-app, and the upstreams being third-party SaaS vendors you make API calls to during the request lifecycle, you'll have the right mental model for what ops with a FaaS looks like.


That being said, I don't necessarily disagree with this:

> For me the biggest shortcoming in serverless is that it is actually a poor fit for modern interactive web apps (and apps in general) with a constantly changing state.

I wouldn't call Serverless a poor fit, personally, but interactive apps are certainly not the simplest, most idiomatic use-case for Serverless architecture.

Instead, Serverless is most idiomatic when used to build

1. the "Command" and "Query" layers of an event-streaming CQRS architecture; or

2. the map and reduce steps of ETL pipelines (i.e. the "DoFn"s, in Apache Beam parlance; or the "design documents" in CouchDB parlance.)

In both paradigms, you're already pretty much restricted from manipulating any sort of global state anyway, so you don't lose anything by using FaaS functions rather than long-running code. It's a much more natural fit.

(And the interesting thing is, you can frequently rearchitect an "interactive web-application" into a CQRS system + a bunch of async ETL pipelines. Twitter is essentially this kind of thing, for example. So is your bank.)


Still, admittedly, one thing you lose with Serverless is the ability to keep local computations cached.

I think Lambda, at least, is moving toward an architecture that will address this somewhat, with Lambda "functions" now essentially being Docker containers that do HTTP request/response over their stdio. If that Docker container were required to be able to stick around for multiple HTTP requests, you'd get something that's a lot more like "distributed FCGI" than "distributed CGI", and you'd be able to go back to the same sort of local in-memory caching that you see in e.g. Ruby/Python web-apps.

the analogy to "distributed CGI" is on point

I'm not sure if I understand their view on fundamental limitations - they don't seem fundamental to me:

1. It does not seem impossible to imagine a function that spawns code close to data, be it on a VM with a connected fast SSD drive already populated with data. Also, Lambda-at-edge and Cloudflare workers are already more like “shipping code to data”, or "the customer" in this case.

2. Functions are load-balanced and potentially parallelisable to millions of invocations. The only missing piece is some kind of parallel-invoke call, to give each instance of a function a distinct piece of data to process, and an identifier to save the result under. The identifier could easily refer to a local disk location, in some future implementation.
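A local sketch of what such a parallel-invoke call could look like, in Python. A thread pool stands in for the FaaS platform, and `parallel_invoke` and the `part-N` identifiers are hypothetical; no current provider API is implied.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_invoke(fn, pieces):
    """Run fn once per piece of data and key each result by an identifier."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {f"part-{i}": pool.submit(fn, piece) for i, piece in enumerate(pieces)}
        return {ident: fut.result() for ident, fut in futures.items()}

def word_count(text):
    return len(text.split())

results = parallel_invoke(word_count, ["a b c", "d e", "f"])
print(results)  # {'part-0': 3, 'part-1': 2, 'part-2': 1}
```

In a real platform the identifier would name a storage location rather than a dictionary key.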

Also, another point from the article, "FaaS discourages Open Source service innovation", seems wrong to me, as it may be said only about current super-early implementations, the reason being that they are new.

In the long run, I'd expect serverless to help open source, because of the simplicity of deployment. We will likely have projects working on an abstraction layer of compute and storage, hiding the underlying cloud or multi-cloud implementation. (Kubernetes is one candidate; it just needs ideas along the lines of Virtual Kubelet to become more serverless.)

The paper didn’t mention any fundamental limitations. Its main point was that current serverless architectures suffer from two major performance deficiencies: inter[process/agent/what-have-you] communication is funneled through the bottleneck of slow storage (e.g., S3, DynamoDB); and various forms of optimizations based on caching are hamstrung by the fact that agents are short-lived and not directly addressable over the network.

The authors’ concern regarding the potential lack of open-source projects that integrate with the serverless platforms currently on offer is based on the severity—roughly between one and three orders of magnitude, overall—of the aforementioned performance deficiencies. It appears to be based on the assumption that open-source contributors won’t invest in what they (accurately) perceive to be a technically inferior platform. This part of the paper isn’t particularly clear, but I believe the authors are talking more about extensions to current serverless platforms than about application-level code that simply runs on top of said platforms.

Anyway, I’m no expert, but I’m fairly familiar with the subject matter and I read the paper in its entirety. I’m open to corrections from those who possess a deeper understanding of the issues involved.

One major caveat: I’m much more familiar with AWS than I am with its competitors, and I’m taking the authors’ word for it when they assert that their AWS-based examples are broadly representative.

> 1. It does not seem impossible to imagine a function that spawns code close to data, be it on a VM with a connected fast SSD drive already populated with data. Also, Lambda-at-edge and Cloudflare workers are already more like “shipping code to data”, or "the customer" in this case.

This would work, of course. But doesn't it defeat at least some of the convenience of a "serverless" architecture if I still need to manage/configure servers with attached (and pre-populated) storage?

> 2. Functions are load-balanced and potentially parallelisable to millions of invocations...

Continuing from point (1), if the code needs to run proximate to data it may be difficult to achieve a huge number of parallel invocations. My parallel capacity is limited by the number of servers available for function execution, which is only those servers with direct/fast access to storage.

> This would work, of course. But doesn't it defeat at least some of the convenience of a "serverless" architecture if I still need to manage/configure servers with attached (and pre-populated) storage?

It might not be you who maintains the server. Internally, Amazon’s DynamoDB equivalent allows code owned by teams to run on data nodes triggered by events (writes, deletes, fetches). That code is run in a sandbox with certain constraints that ensure computation stays local. It’s serverless for the function owners.

In my experience that’s really only true at small scale. Once your dataset/traffic volume gets bigger you have to start getting much more hands on with sharding, keying/affinity, and availability.

When I left Amazon, this was a single data store with thousands of partitions, hundreds of billions of records, dozens of teams writing functions that ran on it, thousands of data sets, and hundreds of thousands of requests per second being made. Our team had several functions that handled thousands of requests per second. It was a critical piece of infrastructure for, among other things, Amazon retail, Prime, etc.

Sure, there was a team that owned the platform, but that wasn’t us. We were customers akin to AWS customers.

Joyent’s Manta system is the closest to a ‘bring code to the data’ system that I’ve seen: https://www.joyent.com/blog/hello-manta-bringing-unix-to-big.... Though it is geared more to data processing than serving traffic.

Yes, I'd love to be able to use Manta every day. It's pretty crazy to write a simple shell pipeline and have it actually run not on my local machine but on all the data nodes.

> It does not seem impossible to imagine a function that spawns code close to data

The authors discuss this:

> To achieve good performance, the infrastructure should be able and willing to physically colocate certain code and data. This is often best achieved by shipping code to data, rather than the current FaaS approach of pulling data to code. At the same time, elasticity requires that code and data be logically separated, to allow infrastructure to adapt placement: sometimes data needs to be replicated or repartitioned to match code needs. In essence, this is the traditional challenge of data independence, but at extreme and varying scale, with multi-tenanted usage and fine-grained adaptivity in time.

Not impossible to imagine, but the challenges are non-trivial.

This is very well put. In particular, there is a very common use case where Lambda-at-edge excels: thin query transformation layers, which both expand the number of use cases where back-and-forth can happen close to the user, and allow centralized services to focus on complex tasks worthy of the long haul.

Use lambda-at-edge to verify the format and size of an image. Ship it to your data center only when it's ready for permanent storage, object recognition AI, etc.
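A rough sketch of that kind of edge check in Python. The 5 MiB limit and the `check_image` name are assumptions for illustration, and the test only inspects the leading signature bytes, so it is a cheap gate rather than full validation.

```python
MAX_BYTES = 5 * 1024 * 1024  # assumed upload limit, not from any provider
SIGNATURES = {"png": b"\x89PNG\r\n\x1a\n", "jpeg": b"\xff\xd8\xff"}

def check_image(data: bytes):
    """Return the detected format, or None if the upload should be rejected at the edge."""
    if len(data) > MAX_BYTES:
        return None
    for fmt, magic in SIGNATURES.items():
        if data.startswith(magic):   # compare against the file's magic number
            return fmt
    return None

print(check_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # png
print(check_image(b"GIF89a"))                            # None
```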

> "to imagine"

The paper's critique is exactly that: the current state is missing too many easy-to-imagine points for other paradigms.

>Cloudflare workers are already more like “shipping code to data”, or "the customer" in this case.

That's shipping code further from the data, unless you store all of your data on your customers' hardware.

The closest data would be in the datacenter closest to the customer, not on customer's hardware. But I mentioned it to illustrate that the code already is location-independent a little.

Is this your data or your customers' data? I'd much prefer the code that runs on my data to come to me, instead of me having to give my data away.

Serverless applications suffer from a few underreported drawbacks:

- Local development is complex. Does one provision a full cloud environment per developer? Emulate a cloud environment locally? These are solvable but it's a grind of special cases.

- Limited customization. Eventually one needs to customize the language / libraries of the lambda. Again these are solvable with a grind of special cases.

- Limited execution model. Not all compute tiers are the same. Some quickly assemble results from a database. Others run for a long time. Others build up a model in RAM before delivering results. Serverless optimizes for some but not all models.

Unlike serverless / lambda, all of the above can be addressed by containers.

- Containers have orchestration solutions for local development and deployment

- Containers are designed to support customization

- Containers can be deployed to run in many different ways.

Not to be a booster, but in my experience containers are the scalable compute architecture.

> drawbacks:

> ... Serverless optimizes for some but not all models.

How is this a drawback? General purpose things are equally bad at all things. If serverless is optimal for some models, those are the places you use serverless!

Please, share the use cases where serverless is the optimal choice.

Because I've tried it. For problems like thumbnail generation, which is literally the canonical example [1]. And even in that case, I found myself switching the workload back to containers to control the exact version of imagemagick to prevent a color gamut bug.

1. https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example...

> ...containers*scalable, etc...

Depends on your domain. Say you are containerizing nodes sharing KVM hypervisor resources with full VMs. The containerized application in use resides on a file-served backend, depends heavily on a SQL RDB for input, outputs results to file and DB, and is programmed using MPI via openmp. Processing is in real time and time sensitive.

One size doesn't fit all. Containers are web tier solutions. They require too many compromises and work for a truly global solution.

that sounds like an interesting architecture and problem set but don't see how 'serverless' would help

Replied to the previous poster who postulated that containers were the new everywhere. They aren't and they won't be.

Serverless is an attractive prospect: Provide a function and lvalues and get a result.

Adaptation: You just need to break out everything to the point where you can farm it to function(lvalue) = ret.

A much needed paper which goes into the programming model in serverless rather than the operational model. Although, the first case study is obviously bad for serverless (training an ML model) whereas the second case study does not indicate bad results (using a trained ML model) as half a second latency is acceptable. As the paper itself quotes:

`One might argue that FaaS encourages a new, event-driven distributed programming model based on global state.`

Yes, exactly. It is not made for cluster computing or data-centric applications. It is made for making applications more modular, atomic and compositional with the upside of less ops and more scale and the downside of lower performance.

I think that the authors of the paper would say that, ideally, serverless would be ubiquitous, the obvious choice for the vast majority of use cases. A pipe dream? Perhaps. But that’s the dream, and we’ll never be able to make the dream a reality unless people write papers like this one that highlight the work that separates where we are from where we want to be.

Going through the article I think they made some mistakes in analysis and also miscategorized some features.

1) Limited Lifetimes, Stickiness: These are good and lead to good design when you scale beyond a single server. They prepare you for dealing with the reality that if you want to run a service 24/7 with no downtime you need to prepare for things like different versions being in production, failures, and stale caches.

2) I/O Bottlenecks: If you are parallelizing your workloads you will actually see more aggregate bandwidth than you could scale up quickly with normal server hardware. Sure a single Lambda might not have the full amount but you can run 1000 of them at once.

3) Communication through slow storage: This is not entirely true, you can call another Lambda directly but yes you can't access a particular instance. That is a good thing. Designing systems where you need to return to the same instance is an anti-pattern.

4) No Specialized Hardware: I don't expect this to be a limitation for long. There is no reason why you couldn't ask for specialized hardware in the definition of a Lambda and have the scheduler take care of it. It will, in fact, be better than the status quo because parts of your application that don't need that specialized hardware won't need to pay for it. In a typical larger application you would probably end up running part of your heavy CPU load inefficiently on your GPU hardware because it would be convenient. Also, most of this might be unnecessary in many cases since you would likely take advantage of a service that was specifically designed to run specialized workloads. Only if you DIY would you need this support. Lambda is especially good at coordinating such services.

5) FaaS is a datashipping arch: Not even true today. Lambda lets you move the computation to the data, as with CloudFront Lambda, S3 Batch and Snowball Edge Compute. It is in fact easier to execute the code near the data when encapsulated in this way.

6) FaaS Stymies Dist Computing: Maybe they have applications that can afford to fail at any point and not recover but keeping your global state in a distributed data storage system is the right thing to do generally, not just with Lambda. Might not work for HPC but it generally doesn't need to be reliable in the same way applications do.

At least in section 3.2 they finally seem to understand that most of the limitations are a good thing and if you are building software that doesn't have them you are likely not building scalable, reliable software systems. Also, if they think that their section 4 is somehow non-obvious to AWS and others they really aren't paying attention.

Having your functions be stateless is actually an asset and forces you to write your code in a way that doesn’t make assumptions about the state of some cache, which makes it inherently more reliable.

Cold starts aren’t much of a problem if you have non-interactive workloads, or if your modality allows you to design UIs in such a way as to let the user know the first interaction will be slow but the rest fast. These problems tend to go away with any moderate scale in any event.

I’ve found the time limits to be an asset as well; they force me to think about how to structure my task into discrete and well-known units of work. This makes development harder but leads to an overall more reliable system where any one invocation failure won’t seriously jeopardize the larger task and can easily be recovered. You can’t say the same for a process that’s been running for hours and suddenly dies without committing its work.
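A minimal Python sketch of that "discrete units of work" idea, under stated assumptions: `process_in_chunks`, the `checkpoint` dict (standing in for a durable store such as a database row) and the `handle` callback are all hypothetical names, not anything from the paper or from AWS. In a real Lambda the remaining time could instead be read from `context.get_remaining_time_in_millis()`.

```python
import time


def process_in_chunks(items, checkpoint, handle, deadline_seconds, clock=time.monotonic):
    """Process items starting at checkpoint["next"], committing progress after each one.

    Returns True when every item has been handled, or False when the time
    budget ran out and a later invocation should resume from the checkpoint.
    """
    started = clock()
    index = checkpoint.get("next", 0)  # resume where the previous invocation stopped
    while index < len(items):
        if clock() - started >= deadline_seconds:
            return False  # out of time; committed progress is already in the checkpoint
        handle(items[index])
        index += 1
        checkpoint["next"] = index  # commit after every unit of work
    return True
```

If an invocation dies or times out, at most the unit in flight is lost; the next invocation picks up from `checkpoint["next"]` rather than starting over.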

That said, I’d love to have serverless GPU functions. I’d love to be able to somehow run shaders over superresolution images. That would be amazing for me working in the mapping world.

If a marketing term means exactly the opposite of what you think it means (that's some hefty server they've got there), then you probably should look elsewhere anyway. Unless you enjoy dealing with a community and support team that speaks an inverted lingo, of course.

Only programs that talk to servers can be serverless.

In two years we'll discover the benefits of running serverless workloads on the client and there will be a new name for that.

Some questions:

1. The authors are quoting exact performance metrics and limits, which is making the article more about "Amazon Lambda" rather than "Serverless Computing".

2. The authors mentioned a "15-minute lifetime" problem. Although it could be concerning that the cache may be invalidated sooner than when running the same code on other infrastructure, a use case like this could easily be migrated to a container. Not to mention that this number could be adjustable: it is a limit set by AWS rather than some theoretical limit. Thus using this problem to attack the serverless computing idea would be unfair.

3. The authors mentioned a low IO problem. As public cloud is a shared infrastructure, it is only reasonable for end users to assume a baseline IO performance when using the platform - and this applies to any service that any public cloud provides, currently and in the future. It would be more beneficial if AWS could reveal the baseline performance in numbers, as this would assist developers with planning.

4. The authors attacked serverless computing for its "Communication Through Slow Storage". This is probably not avoidable, is common practice in modern development, and should not be used to attack the idea of serverless computing. AWS does provide the u-12tb1.metal EC2 server that comes with 12TB of RAM for this use case, though.

5. The authors mentioned that currently serverless has "No Specialized Hardware". Again, this is an attack on AWS Lambda rather than serverless computing, and it is a particular function that could be easily added. (I have a feeling that Cloud TPU could be used with Google Cloud Functions, but it's an assumption.)

6. The authors also claimed that "FaaS discourages Open Source service innovation." Supposedly one can only imagine that more software projects would be PORTED TO serverless, and there should be no real issue running them as standalone applications. I lost track of what the authors are trying to argue.

I actually agree with many of the suggestions in the paper, but of course am biased as these are my colleagues and we disagree on some of the interesting capabilities of modern serverless architectures (see https://arxiv.org/abs/1810.09679). But it comes across like this DeWitt/Stonebraker piece from back in the day, "MapReduce: a major step backwards" https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_...

If I were a Machiavellian dominant player trying to keep my advantage in distributed systems, I'd invest in serverless propaganda to mislead everyone performance/cost-wise, keep my dominant position, and get people doing real performant distributed systems for cheap while the market hunts a slow/costly fad.

How does someone grow up to be Machiavellian? Bad childhood? Extreme circumstances forcing them to always default to a ruthless, uncaring attitude?

Does anyone know why Lambda is taking off, compared to Heroku-style PaaS?

Heroku style still removes the burden of managing servers ("serverless"), but doesn't lock you in as much. You could easily move your own server process to a different provider.

Is it just because Amazon doesn't offer anything similar to Heroku?

A number of reasons. For one, you pay per execution with Lambda (and other similar services like Azure Functions and Cloudflare Workers)...and the costs can be dramatically lower. Here's a great article by the haveibeenpwned author explaining how they support 141 million requests a month at 2.6 cents per day: https://www.troyhunt.com/serverless-to-the-max-doing-big-thi...

Besides cost, automatic scaling is something traditional PaaS does not offer.

Amazon is the new IBM - nobody gets fired for using it. I have personally moved services which were cheaper and more performant with other 3rd party providers to Amazon just because higher-ups wanted us to keep as much stuff with them as possible, for obscure reasons like "we already have a relationship with Amazon", i.e. their sales rep keeps sending me free stuff and sending me on BS "training courses" with free lunch.

I honestly don’t understand why this hasn’t been PHP’s niche for years. It’s exactly what the language has been doing for years. That’s why PHP hosting is so cheap.

I’m not sure I would call it two steps back.

I think it’s quite obvious that not every web application needs a full OS install chugging along all day every day waiting for things like mouse input or having its own patches for services that will never be used.

It’s first gen tech. Really curious to see where it goes.

I mean, where do we think the functions are executing now? It's not like Ubuntu Server is running Xorg, or has libinput installed, nor have any of those things ever been operational or developmental considerations on any "non-serverless" app.

Or maybe you're in a super cutting edge functions deployment where you're actually deploying microkernels? Seems orthogonal to me, and doesn't seem to address the data locality, etc issues brought up in this article.

There are so many people in this thread saying stuff like "PaaS doesn't allow auto-scaling" and I just don't really know what's going on. It's baffling to me. I guess if all you know is an archaic build process and manual deployment, then it looks great. But I'm willing to bet most of these function platforms bottom out on containers-on-k8s-on-vms anyway.

I am increasingly, increasingly convinced that people don't really need or want "serverless". They want the final step in the build/development process that they thought they were going to get with Docker -- they want to write code and have it running somewhere. I really don't buy that Serverless/Lockin are the right or best way to that future.

>They want the final step in the build/development process that they thought they were going to get with Docker

This isn’t my area, so I don’t know - but why didn’t people get that with Docker?

They didn’t get the magic functionality they wanted because magic isn’t real. Coming from a devops background with more of an emphasis on “ops” than “dev”, my (admittedly biased) opinion is that a lot of developers believe three things simultaneously:

1) Ops people are bitter, power-tripping killjoys who couldn’t make it as developers and do nothing but click buttons and think up ways to make the lives of developers difficult.

2) Ops/infra really isn’t as difficult as ops people would have you believe, and some day very soon all the ops people will be replaced by code.

3) Someone else should write the code that’s going to automate the ops people out of existence.

These developers are like highly specialized scientists who think that their PhDs make them experts on everything. They have no idea what happens to their code once they commit it and they don’t care to learn, but they’re pretty sure that it can’t be all that complicated.

Ops will—like everything else—eventually be eaten by software, but the people who drive that transformation sure as fuck won’t be JavaScript specialists who can’t troubleshoot their inability to connect to the VPN. They’ll either be systems/infra people who can code, or well-rounded coders who understand that infra is actually nontrivial (or a group comprised of both).

If it seems like I have a problem with devs, it’s honestly only because a vocal minority of devs have a big problem with me.

The paper authors agree:

> Taken together, these challenges seem both interesting and surmountable.

I love how everyone always shat on PHP (the original serverless technology...convince me otherwise) and now everyone is scrambling to reinvent PHP.

> I love how everyone always shat on PHP (the original serverless technology...

I like and use PHP, and you absolutely can create serverless solutions with PHP, but you may be confusing "serverless" with "stateless". PHP itself is in no way intrinsically more "serverless" than other options.

Does PHP automatically scale infrastructure to meet demand? Can you bring up 1000 instances of your service for spikes in volume? And then scale back down when the spike is over? Without writing code?

Nobody is reinventing PHP.

Totally stateless PHP apps (retrieving and storing state at the top and bottom of each request) plus autoscaling based on CPU load start to look pretty close to Lambdas + scheduled tasks to keep them "warm". This is how most LAMP apps are/were built, notably WordPress.
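The "state at the top and bottom of each request" pattern can be sketched in a few lines. This is an illustrative Python version rather than PHP, and `handle_request`, the `store` dict (standing in for an external database or session store) and the `"increment"` action are all made-up names for the example.

```python
def handle_request(store, session_id, action):
    """Stateless request handler: nothing survives in the process between calls."""
    # Top of the request: load whatever state exists from the external store.
    state = dict(store.get(session_id, {"count": 0}))

    # Middle: do the work using only the loaded state and the request itself.
    if action == "increment":
        state["count"] += 1

    # Bottom of the request: write the state back before returning.
    store[session_id] = state
    return state["count"]
```

Because every request starts from the external store, any instance can serve any request, which is what makes both autoscaled PHP workers and Lambda-style functions interchangeable behind a load balancer.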

The real pain point in any auto-scaling/cloud/serverless application is and will probably always be the stateful bits. I think the first cloud provider to really figure out this killer bit will be the ultimate winner of the cloud wars. It will probably look something like Amazon's "Aurora Serverless" but be even more frictionless.

> Does PHP automatically scale infrastructure to meet demand? Can you bring up 1000 instances of your service for spikes in volume? And then scale back down when the spike is over? Without writing code?

People talk about this a lot. I wonder for what percentage of systems this kind of scalability is really a requirement. I also wonder what percentage of devs work on these kinds of systems.

Don't forget Perl and CGI as the even more original serverless tech.

That's why I made http://bigcgi.com. It's just CGI behind a reverse proxy that handles file sync between hosts. My blog runs on it.

BigCGI looks interesting. I signed up for your newsletter.

That's great, thanks! I'm working on getting the blog up and running, and plan to cross-post there, along with any news about the platform. Feel free to email me any suggestions... my address is on the bigcgi site.

Here's a web version if you're on a phone: https://www.arxiv-vanity.com/papers/1812.03651/



(1) Limited Lifetimes. After 15 minutes, function invocations are shut down by the Lambda infrastructure. Lambda may keep the function’s state cached in the hosting VM to support “warm start”, but there is no way to ensure that subsequent invocations are run on the same VM. Hence functions must be written assuming that state will not be recoverable across invocations.

(2) I/O Bottlenecks. Lambdas connect to cloud services—notably, shared storage—across a network interface. In practice, this typically means moving data across nodes or racks. With FaaS, things appear even worse than the network topology would suggest. Recent studies show that a single Lambda function can achieve on average 538Mbps network bandwidth; numbers from Google and Azure were in the same ballpark [26]. This is an order of magnitude slower than a single modern SSD. Worse, AWS appears to attempt to pack Lambda functions from the same user together on a single VM, so the limited bandwidth is shared by multiple functions. The result is that as compute power scales up, per-function bandwidth shrinks proportionately. With 20 Lambda functions, average network bandwidth was 28.7Mbps—2.5 orders of magnitude slower than a single SSD [26].

(3) Communication Through Slow Storage. While Lambda functions can initiate outbound network connections, they themselves are not directly network-addressable [...] can only communicate through an autoscaling intermediary service; today, this means a storage system like S3 [...]

(4) No Specialized Hardware. FaaS offerings today only allow users to provision a timeslice of a CPU hyperthread and some amount of RAM [...] no API or mechanism to access specialized hardware. However, [...], hardware specialization will only accelerate in the coming years.

While each of the above points is mostly quite factual, I don't personally find any of them to be restricting. We run a mixture of stateless and stateful services, and serverless is suitable for the stateless ones.

The original network security problems have been solved by running functions in a VPC, and secure connections to databases and even connection pools make pretty much everything work.

Running functions in a VPC causes them to have cold starts of ~5 seconds. Connection pools need to be centralized to be of any use, since Lambdas scale automatically.

I found myself in the middle of migrating a startup's entire "something" ingestion pipeline to be entirely done on Lambdas. For our purposes of one data block every 10-20 minutes, and then processing and ingesting that data, Lambdas work perfectly and are about 10x cheaper than using ECS or EC2.

Well, none of the applications we moved to AWS Lambda are affected by these. I think the problem is that many users misunderstand the scope of Lambda and treat it as the holy grail of computing. Simple price calculations usually make it clear what Lambda is not for. You could pick on big data Hadoop, for example, and find projects where people wanted to use it for something it is not good at; this is not specific to "serverless".

re 1) Lots of production Lambda applications are not truly stateless. Checking for environment reuse is trivial. For instance, in Java, you can handle this by leaving your instantiation logic in the function constructor just as you would in a typical application.

re 2) I think that they meant invocations from the same function use a single ENI. This is public knowledge, and having any function size between 1.5GB (exclusive) and 3GB will leave you with the entire network interface bandwidth.

re 3) Absolutely a valid criticism, but DynamoDB is typically the source of information sharing among time-critical applications and has a significantly different performance profile than S3.
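The environment-reuse check from "re 1)" can be sketched as follows. The comment describes Java constructors; this is a rough Python analogue using module-level state, and `_get_client`, `_EXPENSIVE_CLIENT` and `_INIT_COUNT` are hypothetical names. It assumes only that the platform may keep the loaded module alive between invocations, with no guarantee that it will.

```python
import time

# Module-level state lives as long as the execution environment stays warm.
# Treat it purely as a cache; correctness must never depend on it.
_EXPENSIVE_CLIENT = None
_INIT_COUNT = 0


def _get_client():
    """Create the (pretend) expensive client once per execution environment."""
    global _EXPENSIVE_CLIENT, _INIT_COUNT
    if _EXPENSIVE_CLIENT is None:
        _INIT_COUNT += 1
        _EXPENSIVE_CLIENT = {"created_at": time.time()}
    return _EXPENSIVE_CLIENT


def handler(event, context):
    """Report whether this invocation reused an already-initialized environment."""
    cold_start = _EXPENSIVE_CLIENT is None
    _get_client()
    return {"cold_start": cold_start, "init_count": _INIT_COUNT}
```

Calling `handler` twice in the same process reports a cold start only on the first call, which mirrors what a warm environment would do.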

What does "too less" mean? Is it a joke? (I've heard the saying "less is more").

I think it’s a play on the word serverless and their argument that serverless is a lot less performant than “serverful” in the general use case.

“Less is more” ==> “serverless is more” ==> “But serverless is too much less performant” ==> “Too much ‘less’” ==> “Too less”

...or something like that? I wouldn’t worry about it. I’m a native English speaker and I also thought it was awkward, though I (think I) know what they were going for.

I really wish that one of the big cloud providers was building out actors as a service. That would address a lot of the shortcomings of FaaS, whilst still having many of the important benefits.

One way to get around vendor lock-in, if you're worried about it, is to use a shim that translates the proprietary code to a basic http interface. So you just write your code as if it were a basic interface. There are tools like apex up, https://github.com/apex/up, which do this already, though at the moment they only support AWS, but could eventually be set up to push to any provider.

The Serverless Framework https://serverless.com/framework/ consolidates the functionality of all the major cloud providers behind one API, greatly reducing lock-in.

Agree though, when you’re talking about Lambda it should just be another entry point into your business logic.
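A rough sketch of such a shim in Python, assuming the API Gateway proxy event shape (`httpMethod`, `path`, `body` in; `statusCode`, `headers`, `body` out). The `app` function and its `/hello` route are invented for illustration and are not how apex up or the Serverless Framework work internally.

```python
import json


def app(method, path, body):
    """Provider-neutral business logic: plain values in, (status, payload) out."""
    if method == "GET" and path == "/hello":
        return 200, {"message": "hello"}
    return 404, {"error": "not found"}


def lambda_adapter(event, context):
    """Thin shim: unpack the provider-specific event, call app, repack the response."""
    status, payload = app(
        event.get("httpMethod", "GET"),
        event.get("path", "/"),
        event.get("body"),
    )
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```

Only `lambda_adapter` knows about the provider; moving elsewhere would mean writing another small adapter around the same `app`.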

Another alternative is ZEIT Now 2.0 which is supposed to be cloud-agnostic serverless. It puts a few more layers of abstractions between the developer and the cloud providers.


People will argue that there is vendor lock-in with Now 2.0 - it has its own config file and way of doing things.

But I am running Express and the config is tiny, just a few lines. If you use a cloud DB provider for a database that you can also host yourself (like Postgres or Mongo in my case), you can move to your own infrastructure without any pain.

Mind you, using Express is not the "ideal" way Zeit seems to push. It is exactly like with PHP - you are supposed to use simple files and leave routing to the webserver. I think they forget that there is no PHP framework that uses this. All of them use the "Front controller" pattern and the webserver basically routes everything to it. There must be a reason for this.

Came to conclusions really close to these when I investigated FaaS/Lambda a couple of months ago. Here's the blog post (faster read): https://archbee.io/blog/why-serverless-is-not-there-yet/

We have found the same when working on our own projects. Within the last few months a friend and I have been working on our own FaaS: https://3clouds.io/ We believe it solves about 90% of those issues - it's currently under heavy development and not ready to be used in production yet, but we think we are on to something here. Check it out, let us know what you think.

Interesting that you have the Docker option. I wish big cloud providers would allow us to provide a Docker image to run as serverless. And not be as expensive as AWS Fargate.

If you want to avoid vendor lock-in, may I recommend the open source FN Project - you can run it anywhere, as it's a container-native serverless platform. https://fnproject.io (Full disclosure: my company supports development of this project.)

Interesting article. It has often seemed to me that efforts such as Urbit (urbit.org) deserve our attention, precisely because they might actually deliver on the 'serverless' premise to a far greater extent than current solutions.

Really? Isn’t urbit based on a computation model multiple orders of magnitude less efficient than KVM?

If you can write a paper like this, however right or wrong, is that the standard for Berkeley Computer Science, often cited as one of the best programs out there?

It’s nothing bad per se; it just seems closer to a nice blog post than to research.

It’s basically an opinion piece - CIDR is an unusual conference and encourages this:


Maybe the perception has less to do with the paper or Berkeley than with a possible bias against Software Engineering research.

I’ve always perceived it, maybe unfairly, as one of the least rigorous topics in Computer Science, and more removed from the foundational insights that can eventually have more important, broader impact or increase fundamental understanding that enables advances in other areas.

If any of the conjecture is true, it wouldn’t be a criticism of any researcher in this area or their capabilities.

Maybe it’s just a matter of maturity. How rigorous can I expect theory to be when the top software companies in the world still have trouble estimating delivery and complexity of software implementations?

We may just not know enough yet.

How is it an opinion piece? If you called it a bait-and-switch I might agree, since it starts off criticizing the performance of serverless architectures in general, and proceeds to critique the performance of one specific serverless architecture. But that critique uses the scientific method, and their conclusions appear to be reasonably well argued, so long as you substitute “Lambda + DynamoDB/S3” for “serverless”.

What am I missing?

One thing that stops me seriously thinking about serverless is: what would my application's test suite look like if the application made heavy use of AWS Lambda?

It should look just like it does now:


Lambda handler -> business logic


Test function -> business logic.

Testing a Lambda function is no different than testing a controller in an MVC framework.
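As a minimal sketch of that split in Python: `total_cents`, `lambda_handler` and the event fields are hypothetical names chosen for the example. The handler only unpacks the event, so the test can call the business logic directly without any AWS machinery.

```python
def total_cents(prices_cents, tax_percent):
    """Business logic: no knowledge of Lambda, events, or contexts."""
    subtotal = sum(prices_cents)
    return subtotal + subtotal * tax_percent // 100


def lambda_handler(event, context):
    """Lambda handler -> business logic: unpack the event and delegate."""
    return {"total_cents": total_cents(event["prices_cents"], event["tax_percent"])}


def test_total_cents():
    """Test function -> business logic: same call, no handler involved."""
    assert total_cents([1000, 500], 10) == 1650
    assert total_cents([], 10) == 0
```

The handler itself can still be exercised with a plain dict as the event, much like calling a controller method with a fake request object.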

"Cloud" = back to dumb terminals & mainframes. Giant step back.

Edit: and pulling content from two dozen separate servers ... madness^2

Yeah, simpler to just serve all content from the file server sitting in your garage. After all, you tested your site from the laptop on your kitchen table, and everything loaded just fine.

Why does serverless remind me of PHP?

Maybe because making a "lambda" is quite like putting some code in a .php file and uploading it to a LAMP server, except it's pay-as-you-go and it autoscales.

Because it's CGI.

Except with vendor lock-in.

One size never fits all. The end.

Not exactly. The authors are part of the (large) camp that envisions serverless eventually becoming ubiquitous. In that sense, they hope that one size will fit all, some day.

> In that sense, they hope that one size will fit all, some day.

Nothing new under the sun there. And, that being the case, they're likely to be disappointed.

I built an Adtech platform on Lambda recently:

It's processing 9 billion events per week.

1. We used Firehose which pushed the data to S3

2. Go binary on Lambda transformed the data into Parquet format

3. Used Athena to query this data

4. Using Lambda to adjust the machine learning data based on the arriving data in batches.

5. Using Lambda to query Athena/BigQuery for data dashboard queries. Again using Go binary for max performance.

All this made our platform 10x cheaper.

Yep, there are workloads that benefit from the FaaS model, but not all.

In my case, the cost / hassle of high-availability, scaling, no need for state, time-based triggers, and more, made Lambda an easy decision to make: and it's working quite well. A cron-triggered job every minute is still in the "free" tier, so even better!
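A back-of-the-envelope check of the "every minute is still free" claim, in Python. The free-tier figures used here (1,000,000 requests and 400,000 GB-seconds per month) are my assumption about the AWS Lambda free tier and should be verified against current pricing; the 128 MB memory size and 1-second run time are made-up inputs.

```python
# Assumed AWS Lambda monthly free-tier allowances; verify against current pricing.
FREE_REQUESTS_PER_MONTH = 1_000_000
FREE_GB_SECONDS_PER_MONTH = 400_000


def monthly_usage(runs_per_minute, days, memory_mb, seconds_per_run):
    """Return (invocations, GB-seconds) for a job triggered on a fixed schedule."""
    invocations = runs_per_minute * 60 * 24 * days
    gb_seconds = invocations * (memory_mb / 1024) * seconds_per_run
    return invocations, gb_seconds


invocations, gb_seconds = monthly_usage(
    runs_per_minute=1, days=31, memory_mb=128, seconds_per_run=1.0
)
print(invocations, gb_seconds)
print(invocations <= FREE_REQUESTS_PER_MONTH and gb_seconds <= FREE_GB_SECONDS_PER_MONTH)
```

Under those assumptions a once-a-minute job makes 44,640 invocations in a 31-day month, a small fraction of either allowance.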

Oh look a bunch of academics telling developers how to do their job, again.

Translation for those not proficient in Academese: Serverless computing, a novel paradigm enabled by a large cloud provider, does not match our past research, thus we argue why it's not the right thing.

This is the classic "if theory does not match practice, let's change the practice" approach, which is far too common in academic systems research.

I trust AWS to know what its customers truly want (in terms of performance and cost) and what it can provide, since AWS has real financial stakes in its success.

A decade ago the same researchers would have mourned the emergence of cloud computing as the wrong thing and instead asked for P2P computing, since that's what they had spent the decade before doing research on.

I'm afraid your "translation" does a disservice to all the potential readers of the paper, to those who want to decide whether to use serverless computing at present and to the authors.

Quoting from the conclusion of the paper:

"Taken together, these challenges seem both interesting and surmountable. The FaaS platforms from cloud providers are not fully open source, but the systems issues delineated above can be explored in new systems by third parties using cloud features like container orchestration. The program analysis and scheduling issues are likely to open up significant opportunities for more formal research, especially for data-centric programs. Finally, language design issues remain a fascinating challenge, bridging program analysis power to programmer productivity and design tastes. In sum, we are optimistic that research can open the cloud’s full potential to programmers. Whether we call the new results “serverless computing” or something else, the future is fluid."

Interestingly, a paper not 10, but 9 years ago, not by the same authors, but by the same group (systems folks at Berkeley), was proclaiming the cloud as an idea whose time had finally come. No mourning there. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-...
