This is a great overview! I’ve been working on a project where I’ve tried a few of these and it’s definitely a space where this data is super valuable.
I really wanted to like Banana.dev but runpod consistently outperforms them for my use case. I love the innovation in this space.
Here is my wishlist:
1. Faster cold starts. If you’re building a consumer product with a request/response flow on top of one of these services, I’m seeing at least a 20-second delay before the server even begins working.
2. Much cheaper GPUs. This is an unrealistic expectation right now because supply and demand have these services completely crowded with people happy to pay. I just wish I could afford to keep a few of the faster GPUs prewarmed and ready to go, but that would be several thousand dollars a month. Doable eventually, but for my bootstrapped side project that hasn’t found product-market fit, it’s a little rough.
3. Ability to create custom models from the service. I’m on an Intel Mac, so making a custom model requires me to ssh into a machine. If only there was a service that let me rent a high-end GPU server by the second. /sarcasm. I wasn’t able to get access to Docker or install it on RunPod, and support confirmed it isn’t something they support.
For custom model building I found the prices and the flexibility to do whatever you need on lambdalabs.com to be the best. Their prices blow all of these other services away. However, there’s no serverless option. The space is so crowded with consumers that I’m almost afraid to even mention them, because I worry I won’t have GPUs available for myself. I’m seeing this mentality a lot.
Firstly, thank you so much for trying our service, we'll do our best to meet performance expectations and win you back!
Re: #1 and #2, cold boots are the most vital thing for us to solve, because it fixes #1 directly and helps #2 indirectly. As we drive cold starts exponentially toward 0s (obv impossible but there's a near-zero asymptote to what's possible, limited only by disk read throughput), it makes it more viable to actually run serverless and scale from 0->1 for each user call. Our goal RN is to hit 1s, as that's generally the sweet spot for LLM app builders to stop feeling pain from the cold boot. We're getting close (2-5s) for most models with Turboboot (shill warning https://www.banana.dev/blog/turboboot). Glorious future would be more like 100ms, but depending on where model sizes end up, 1s+ cold boots may just be the cost of doing business.
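To put rough numbers on that disk-throughput floor (all sizes and speeds below are illustrative assumptions, not our actual hardware):

    # Back-of-the-envelope cold-boot floor from storage read throughput alone.
    # All numbers are illustrative assumptions, not any provider's real specs.
    model_size_gb = 5.0        # a mid-sized fp16 checkpoint (assumed)
    local_nvme_gbps = 7.0      # fast PCIe 4.0 NVMe sequential read, GB/s (assumed)
    network_store_gbps = 1.25  # 10 Gbit/s network storage path, GB/s (assumed)

    print(f"local NVMe floor:   {model_size_gb / local_nvme_gbps:.1f} s")
    print(f"network pull floor: {model_size_gb / network_store_gbps:.1f} s")

Everything else (scheduling, imports, CUDA init) stacks on top of that floor, which is why shaving the read path matters so much.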
Re: #3, if I understand, you're looking for interactive compute (i.e. you ssh in, mess around, do training runs, attach a Jupyter notebook). For my personal ML training I use and suggest Lambda Labs or Brev.dev. Have heard great things about CoreWeave, and users of Mosaic seem quite satisfied, but I don't believe it's as interactive as you may want. Banana has no plans to support interactive GPU sessions, to conserve focus toward being best at cold boots.
You're definitely not alone in this wishlist, so I validate you. If I were building applications on top of a provider, I'd expect the same things. Big gnarly challenge with all the tools being a few years old at best! Fastest route to dependable tools is intense focus, us on cold boots.
> cold boots are the most vital thing for us to solve ... Our goal RN is to hit 1s
Did a bit of package management and experimenting with optimizing bits over the wire for serverless scaling on NFLX's internal serverless platform.
I'm selfishly interested in learning more about what you're doing to optimize meeting an incoming request with a live instance, but also might be able to help.
Know you're busy but if you, or your engineering team, have time to connect I'd love to chat: hn@blankenship.io
Neat! Bit too busy now, but we'll hopefully put out some technical blog posts over time to explain these things.
We don't do any predictive scaling yet; only when a call hits the queue do we scale replicas. Replicas cold boot (pod scheduling + application loading models into GPU memory) then subscribe to the queue.
Moving away from this replica + queue design very soon; using K8s primitives has gotten us this far, but it's impossible to hit 1s cold boots with the orchestration overhead. We're building our own orchestrator and Python runtime now.
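A toy version of that replica + queue shape, just to illustrate the 0->1 behavior (not our actual orchestration code; the sleep stands in for pod scheduling plus loading the model into GPU memory):

    # Toy 0->1 autoscaling: nothing runs until a call lands in the queue,
    # then a replica "cold boots" and subscribes. Illustration only.
    import queue, threading, time

    calls = queue.Queue()
    replica_running = False

    def replica():
        global replica_running
        time.sleep(2.0)                       # simulated cold boot
        print("replica warm, subscribed to queue")
        while True:
            try:
                call = calls.get(timeout=10)  # scale to zero after 10s idle
            except queue.Empty:
                break
            print(f"handled {call}")
        replica_running = False
        print("idle, scaled to zero")

    def autoscaler():
        global replica_running
        while True:
            if not calls.empty() and not replica_running:
                replica_running = True
                threading.Thread(target=replica, daemon=True).start()
            time.sleep(0.1)

    threading.Thread(target=autoscaler, daemon=True).start()
    calls.put("inference-request-1")          # the first call pays the cold boot
    time.sleep(15)

In the real system the "sleep" is scheduling plus model load, which is the part Turboboot and the custom runtime are meant to attack.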
Are you colocating your storage plane and GPUs? What’s ingress/egress to a node, and are those links near saturation (with comfy room for returning model output, but I’m assuming moving the models dwarfs I/O from customer workloads)? Do you see high reusability across workloads? Have you explored chunking/hashing your workloads IPFS-style (do these models radically change, or is there a high chance that two models that share an ancestor also share 50% of their bits)? If you’re chunking your models and colocating the storage plane with GPUs, can you distribute chunks to increase the hit rate of a chunk being on-node? Is your scheduler aware of the existing distribution of chunks across nodes? Given the workload patterns you see, and the shared bits between models, is it even practical to try and chase a local cache hit rate to reduce bits over the wire? If you have a cache miss, what’s the path to getting those bits to the node with the GPU? How does the cost of that path compare to the cost of the scheduler making a decision?
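To be concrete about what I mean by chunking, a minimal sketch (fixed-size chunks + SHA-256 here; a real system would likely use content-defined chunking so shared regions still line up after insertions):

    # Split model files into chunks, hash each chunk, dedupe on the hashes.
    import hashlib
    from pathlib import Path

    CHUNK = 4 * 1024 * 1024  # 4 MiB, arbitrary

    def chunk_hashes(path):
        hashes = []
        with Path(path).open("rb") as f:
            while block := f.read(CHUNK):
                hashes.append(hashlib.sha256(block).hexdigest())
        return hashes

    def shared_fraction(a, b):
        ha, hb = set(chunk_hashes(a)), set(chunk_hashes(b))
        return len(ha & hb) / max(len(ha | hb), 1)

    # e.g. two fine-tunes of the same base model:
    # print(shared_fraction("model_v1.safetensors", "model_v2.safetensors"))

If two fine-tunes of the same base share most of their chunks, a node that already has one of them only needs to pull the delta.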
Can only publicly answer one of these:
- reusability of workloads: yes, introducing the community templates feature (https://banana.dev/templates) for common models has dramatically cut back on storage requirements and transfers. We're still majority custom code, but it's helped prevent us from exploding storage over people running the same "model of the week"
As for caching / chunking, sounds like you're thinking on our wavelength, perhaps even ahead of us, so maybe I should take you up on the offer to chat! Will reach out.
Some of these are a bit more "host a server for you or others to run." I wish this comparison also compared billing models a bit, and any other value-adds.
The cool thing about replicate.com is that you can use someone else's public model and it's billed to you, the caller. For someone like me who is gluing models together in a hobbyist setting, it's been great.
For some image identification tasks, it's been pretty neat to be able to call https://replicate.com/andreasjansson/blip-2, which is someone else's already-deployed model that's kept warm by some level of activity, and get results back. I've been captioning images and putting the captions into OpenAI prompts.
I've also myself put out https://replicate.com/nelsonjchen/minigpt-4_vicuna-13b to see if maybe it's an improvement in captioning. Unfortunately, it takes like 15 minutes to spin up. That said, it's currently free for me to put up. If someone else wants to run it, they can wait/pay. And they only pay for the runtime and not setup. And if it were to get popular, it'll be naturally warm for everyone. For me, it was 6x the cost of https://replicate.com/andreasjansson/blip-2, and although my experiment did not produce something suitable or usable for me, maybe those caveats are appropriate for someone else's use case, and they can super-easily reuse my deployment on their dime without costing me any money.
Not to mention that replicate also put out some pretty alright APIs or libraries to call their service. It's been consistent.
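For anyone curious, a call with their Python client looks roughly like this (the version hash is a placeholder you'd copy from the model's page, the input names are from memory of the blip-2 API tab and may differ, and REPLICATE_API_TOKEN needs to be set in the environment):

    # Rough shape of calling someone else's public Replicate model.
    import replicate

    output = replicate.run(
        "andreasjansson/blip-2:<version-hash>",   # placeholder version hash
        input={
            "image": open("photo.jpg", "rb"),
            "question": "What is in this picture?",
        },
    )
    print(output)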
All this ease of use does come with some caveats, replicate.com is pretty expensive for the raw calls.
Wondering whether most major inference libraries support storage-direct, or if the listed providers are cheaping out on storage latency. Several seconds to load a 100MB model when PCIe 4.0 is ~256 Gb/s? I'd have expected at least an order of magnitude less, at least from my experience with GPUDirect and real-time processing of up to 200Gb/s network streams with datacenter GPUs (not performing inference, but actual streaming into GPU memory). PCIe 5.0 and the H100 (the only one in the line-up that has PCIe 5...) should also improve on that.
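The arithmetic behind that expectation:

    # 100 MB over a PCIe 4.0 x16 link (~256 Gbit/s, ~32 GB/s theoretical)
    size_bytes = 100 * 1024**2
    link_bytes_per_s = 32e9
    print(f"{size_bytes / link_bytes_per_s * 1e3:.1f} ms")  # ~3 ms on the bus

so multi-second load times are dominated by storage, deserialization, and framework overhead rather than the bus itself.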
Additional note, the L4 inferencing board is also still on PCIe-4 (like the L40, very sadly), so NVIDIA doesn't seem to foresee the need for more bandwidth on inferencing workloads in the next 3 years (until they maybe get a H30 out, with the binned not-fully-functional H100s? Maybe? Hopefully?) and/or will make you pay for a full H100.
I don’t understand how these small companies can compete with the big clouds over time. They aren’t offering anything fundamentally different from each other or from the big clouds themselves.
At some point, Lambda will offer GPUs on their instances, and there are plenty of serverless offerings right now that have GPUs.
Fly.io at least is differentiated with their “run your compute closer to the user” with zero hassle. Though even that doesn’t seem super sustainable in the long term.
Wow, I always thought AWS was crazy complicated just because... stuff is hard. But you're right: if it's complex enough, people define their career identity as an "AWS expert", and at that point you have lock-in for life.
I've found that a lot of the complexity of the cloud is self imposed; the rest is because they're running a huge range of mixed-quality software that needs a lot of scaffolding to even run properly.
For example, if the clouds used IPv6, then 90% of the networking complexity would simply evaporate. But they can't, because despite two decades of warnings, nobody has working IPv6 LAN networks. Similarly, IPv4-only server software is still being written, today.
In comparison IPv4 needs a ton of infrastructure to function. Stateful subnet and IP assignments. Methods for dealing with overlapping private networks. Split DNS, private endpoints, NAT, and on and on...
I could list other examples, but you get the idea. The big clouds are pandering to their customers, and the customers can't modernise their end fast enough, so you end up with complex features to account for all the legacy software and networks.
> I've found that a lot of the complexity of the cloud is self imposed
Indeed. I prefer the terms circumstantial and inherent complexity. Inherent is the easiest to define: it’s the bare minimum to do what needs to be done, analogous to Kolmogorov complexity and in the spirit of “if I had time, I would have written you a shorter letter” or “make it as simple as possible, but not simpler”. In the real world, your software interacts with imperfect and legacy systems with their own issues, but dealing with that is also part of the inherent complexity (because if you removed it, it wouldn’t satisfy the requirements).
Circumstantial complexity otoh comes in many forms: initial poor design, tech debt from scope creep and requirement changes. In adversarial environments it can even be deliberate, such as DRM, inter-team politicking, and “job security through obscurity”, or “if nobody else understands this doc, it’s less likely to be changed and my coworkers will think I’m smarter”.
In the case of AWS, I suspect there’s all kinds of circumstantial complexity, but in particular there are “green field accidents”. As an early mover (all cloud tech is extremely new), you don’t get the simplest possible design on the first try. Instead, you get something that works but is full of redundancy. For one, the best way to layer your systems isn’t clear, so you end up with individual products having to reinvent things like consensus, caching, durability, replication, data integrity, yadda yadda. But at the same time, there’s immense pressure to get stuff out the door, so you make do with what you have and keep cargo-culting until it’s cost-efficient to replace a lot of the garbage with something better. That can take 10 or 20 years.
I've used both AWS and Azure extensively. Everyone seems to think Azure is "weird and difficult", most likely because they had internalised the circumstantial complexities of AWS and can't wrap their heads around a cleaner but unfamiliar model.
In AWS, everything has non-human-readable identifiers shown in flat lists in some random order. This adds a lot of unnecessary complexity. Almost the entire circus around having to create dozens of AWS accounts just evaporates in Azure's model, which has folders called Resource Groups containing resources with names.
Yeah, that's right: folders and object names. The magic anti-complexity technology that harks back to the 1960s UNIX era that AWS still hasn't been able to replicate in the 2020s despite a decade of trying.
A lot of other incidental complexity stemmed from old issues that have been resolved but still linger around due to backwards compatibility. For example, not being able to change the IP address of an EC2 VM resulted in all sorts of craziness. Similarly, both Azure and AWS have unexpected naming restrictions on things like KMS / Key Vault secret names. E.g., Key Vault secrets can't have names that match typical "web.config" parameter names or environment variable names in Linux... "for reasons". Stupid reasons. Hence, you need a back-and-forth encoding or escape/unescape mechanism between two things that should be identical.
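A minimal sketch of such an escape/unescape scheme (one possible convention, not anything Azure ships; ASCII names assumed):

    # Reversible mapping between config/env-var style names and Key Vault's
    # allowed charset (alphanumerics and dashes). ASCII-only for brevity.
    import re

    def to_secret_name(name):
        # every char outside [A-Za-z0-9] becomes '-' + two hex digits
        return "".join(c if c.isalnum() and c.isascii() else f"-{ord(c):02x}"
                       for c in name)

    def from_secret_name(name):
        return re.sub(r"-([0-9a-f]{2})",
                      lambda m: chr(int(m.group(1), 16)), name)

    assert from_secret_name(to_secret_name("ConnectionStrings.Default_DB")) \
        == "ConnectionStrings.Default_DB"

Because every literal dash is itself encoded, the decoding is unambiguous; it's still a silly tax to pay for a naming restriction.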
> Yeah, that's right: folders and object names. The magic anti-complexity technology that harks back to the 1960s UNIX era that AWS still hasn't been able to replicate
Right on. These choices are often frontloaded to the greenfield stage, where you have to make some decision way before you can say which data model makes the most sense in the future. Even the wisest of architects cannot predict everything, so it’s not for a lack of competence. The people I knew at @faang-gig were incredibly bright, but technical design as an early mover is still an incredibly delicate art form.
Eventually the “happy path” will simplify. AWS is spending all its time going upmarket wooing enterprises, but if there is money to be had at the low end, they’ll go after it.
Beyond AI, I don’t see any upcoming paradigm shifts, so what’s out there will just continue to get better rather than being displaced by something fundamentally better.
Is this a sarcastic quip or are you able to expand on this?
I use a lot of serverless daily, handling events (even ML inference), and it seems to work great, but would love to understand the alternatives and your perspective.
The overhead of abstracting away the servers is a luxury in many ways. I believe this extra cost was heavily funded by low interest rates, which flooded the VC world with dough. There’s been a lot less serverless talk since the Fed started cranking the rates.
Sorry, but this feels like a total non sequitur. Serverless or FaaS is pretty mature now. People get the concept, businesses understand the savings, and the services and tooling are stable. We don't talk about it because it's boring.
Maybe the backend is, but the frontend aspect is very bad
Both GCP and AWS have terrible web UIs for their cloud functions offerings, and every deploy is so slow. I'm lucky that next Monday is a holiday so I can rest from the stress of having to use GCP last Friday (on a deadline).
Codesandbox should offer their own serverless functions so I can actually have serverless for the whole development cycle
I've used serverless for the past 3 years in production. Unfortunately my experience with it is that it's several orders of magnitude more expensive than a k3s cluster on a cheap provider like Hetzner, and it's slower.
When I last calculated the cost of serverless, it was ~500-5,000x more expensive for the compute compared to k3s and ~10x more expensive for bandwidth at a minimum. To me, removing the burden of maintaining infra didn't justify that level of cost.
Some examples:
- Upstash latency was ~70ms for Redis. Cost was prohibitive.
- AWS Lambda / Cloudflare Worker / Firebase Function cost becomes prohibitive. At least cold starts aren't as bad as they used to be.
- Firebase Realtime Database performance didn't scale, and wound up getting maxed out because of the way it works with nested key updates. Replaced with a Redis instance in k3s which is now running at <2% max capacity and is ~1,000x cheaper.
- Tried Planetscale. Cost was much higher than PostgreSQL.
- Tried Vercel. Bandwidth costs are very scary ($400 / TB egress, or ~350x the cost of Hetzner if you don't count Hetzner's free 20 TB per node)
That being said, I don't know of any good, reasonably-priced GPU offerings.
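If you want to redo the comparison for your own workload, the arithmetic is easy enough to script; every price below is a placeholder to swap for real quotes, not anyone's current pricing:

    # Toy break-even: always-on small box vs. pay-per-invocation serverless.
    # All prices are placeholders (assumed), not current list prices.
    box_monthly = 5.00            # small VM / share of a k3s node, $/month (assumed)
    per_gb_second = 0.0000167     # serverless compute, $/GB-second (assumed)
    per_million_req = 0.20        # serverless request fee, $ (assumed)

    mem_gb, avg_ms = 0.5, 50      # workload shape (assumed)
    for req in (1e5, 1e6, 1e7, 1e8):
        compute = req * (avg_ms / 1000) * mem_gb * per_gb_second
        requests = req / 1e6 * per_million_req
        print(f"{req:>12,.0f} req/mo: serverless ~${compute + requests:8.2f}  vs box ~${box_monthly:.2f}")

The crossover moves around a lot with memory size, duration, and especially egress, which is where the scary multipliers showed up for me.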
Also hard disagree (as one of the Banana founders). Many users on our platform spend less than $10 a month on A100 GPUs while building whole startups, compared to the alternative of a minimum of $1k monthly for an always-on A100.
As an avid Runpod user, I have come across some information that is not entirely accurate.
- Although the number of models is limited, the platform has a community feature where users can fork models.
I'm not entirely sure of the intended meaning, but it contradicts being able to bring any container.
However, if it pertains to the selection of pre-made template containers, it is true that the options may be limited. Nonetheless, there aren't a significant number of commonly-used open-source models available either...
Or it is referring to the API endpoints. That is indeed correct, but it's confusing what exactly this review is about.
- Post-deployment, it can be confusing to understand how the platform works, which may result in users receiving a bill if they are not careful.
Although the documentation is sometimes lacking, it is very clear what things cost, and there's no such thing as a surprise bill...
- There is no bot or instant support mechanism available.
They have a live chat available, and their response time was good. They are also very active on their Discord channel, providing speedy support to users.
Because the server is entirely abstracted away: these are computing applications that do not require direct access to the underlying hardware and only have to produce a computation result.
Conceptually, «serverless» is a reincarnation, or an evolution, of VAX/VMS-style cluster computing: applications running on VMS clusters see the cluster as a single computer. Adding a new node makes the cluster more powerful, yet the app continues to see the cluster as a single computer. Cluster nodes can be geographically distributed and can appear and disappear without the app noticing. So serverless apps, whilst technically running on a server, actually run somewhere (on a cluster node), but the app does not know it and does not care about it.
So «serverless» in the cloud takes the concept further, slaps on the automation and makes it more accessible to mere mortals, it scales (almost) indefinitely and completely transparently from the application. It requires no setup and is easy to use (setting up VMS clusters requires an experienced and knowledgeable systems engineer, for instance).
We can agree and disagree on the semantic correctness and richness (or the lack thereof) of the term «serverless», but it has already caught on and has evolved to mean more than just cloud serverless functions.
I literally don’t know what it means. Does serverless just mean doing it locally? Like an on premises cluster of GPUs in the IT closet?
If you are doing it on someone else's computer, and that computer is part of a big server farm, I find it odd to use the term “serverless.” And by odd I mean literally the exact opposite of serverless.
"Serverless" is a cloud-native software development paradigm. Of course there is a server - the machine, and there is a server running on the machine (that you don't control). The point is in the code you write - it allows you (the developer) to forget about servers as in software as well as servers as in machines.
A serverless program (often called a function) is invoked by a server on-demand instead of running all the time, and it doesn't listen for anything on any port. The client request is passed to it by the server (e.g. on standard input) and it's supposed to pass the response back to the server (e.g. through standard output) - and the server sends it to the client. Then the program halts.
Compare that to Django, Node.js or ASP.NET: A classic backend app exposes a HTTP server on a port and handles client connections itself - and thus it has to run all the time, it literally is a server (in the software sense).
If you know PHP, that's the original serverless. As opposed to the Python/ASP.NET/Node.js backends, your page.php is invoked by the Apache daemon only when someone opens that page, and there's no "server.listen(3000)" in page.php.
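In Python terms the difference looks roughly like this (framework choice is just for illustration). The classic backend is itself a server:

    from flask import Flask

    app = Flask(__name__)

    @app.route("/hello")
    def hello():
        return {"message": "hi"}

    app.run(port=3000)   # binds a port and runs until killed

whereas the serverless version is just a function the platform's server calls per request (AWS Lambda-style signature, as one concrete example):

    def handler(event, context):
        return {"statusCode": 200, "body": '{"message": "hi"}'}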
Serverless is cool because it allows the cloud provider to fully utilize a machine while the developer pays only for their portion of actual usage. You don't need to reserve a specific amount of compute resources and worry about up/down-scaling or about paying for unused hardware.
Serverless GPUs are about bringing that concept to the GPU as a service space - ideally you'd have a function that uses the GPU. That function could be invoked by sending a request to the platform's server, at which point the server would load and execute it, pass the client request to it and pass the response back to the client once the function is done.
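Concretely, the usual shape is something like this: a generic sketch, not any particular provider's SDK, with the expensive model load at module scope (the cold boot) so warm invocations only pay for the forward pass. The model name and event fields are just for illustration:

    import torch
    from transformers import pipeline

    # Loaded once per instance -- this is the cold boot the thread is about.
    device = 0 if torch.cuda.is_available() else -1
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base",
                         device=device)

    def handler(event, context):
        # Warm instances skip straight to the forward pass.
        result = captioner(event["image_url"])
        return {"caption": result[0]["generated_text"]}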
Having a server is like having a car. It's yours, you do what you want with it, but:
1) If you don't drive much then you might be overpaying
2) You have to worry about the car infrastructure; you have to maintain it, and if it breaks down you might be offline (unable to travel) until it's repaired.
3) If you need to bring home a washer and dryer, for example, you have to worry about your car having the needed capacity (scaling)
In today's world you can decide not to own a car and just use Uber or whatever vehicle-sharing service is available to you. You pay only for the rides you need, you don't have to worry about infrastructure (repairs), and you can always pay for a larger vehicle on demand.
That's serverless computing. Instead of paying for a physical computer or a virtual computer (compute service), you deploy the app you want to run and pay for each run of your app. Typically the app is some type of request/response or event-processing app. The cloud company maintains the compute fleet, and when a request comes in they will deploy your app to one of their compute nodes long enough to service the request.
It's called serverless not because there are no servers but because you don't know or care about the server it runs on. The hardware could change out between executions. You don't care about failed hardware, OS patching, etc. The lowest level you typically deal with is selecting the app language (Node, Java, etc.).
There is zero actionable information there, it’s just describing the article. I was hoping to get a recommendation for which provider to use for which kinds of workloads. I guess I have to read it all then :/
Surprised GKE Autopilot isn’t mentioned, as it now has some level of GPU virtualization. I’ve found it to be the most flexible, fully featured way of approaching this problem.
> Pardon my ignorance on this matter but if the processing unit isn’t displaying graphics shouldn’t it just be referred to as a cpu?
No, I mean, it’s an auxiliary processor (there is a CPU, and this isn’t it) doing floating-point math, so I guess you could call it either a math coprocessor or an FPU, but... well, that’s somewhat confusing for historical reasons.
The term “serverless GPU” somehow wrecks my brain. Logically the absence of a server suggests its opposite, and the opposite of a server is a client, and client GPUs are the default. But this means “server GPU that’s available on-demand for very short-lived jobs” I guess.
"Serverless" has been a standard term in industry for at least 7 years. Sometimes words don't map perfectly onto the subcomponents that form them. For example a "mailbox" isn't always a box, and doesn't always contain mail. You just have to learn to use the words and not worry so much about the etymology.
The tortured etymology becomes apparent again when these words are combined in new ways. “Serverless GPU” might be something like “mailbox SSD” in your example. What would that mean? It’s not obvious at first sight. The metaphor loses its power when it’s attached to a physical descriptor which is not a metaphor.
No, it's obvious to everyone working with GPUs who knows what "serverless" means.
It's a very standard construction in English. "Serverless GPU" means "GPU" that is "serverless". If you know what both words mean in the jargon, you know what they mean together. It's ok to not know what they mean, but arguing that it's "tortured" rings to me as misguided obtuseness.
It's a new buzzword coined by the marketing team at Amazon in 2014. Somewhat confusing here as you are renting time on a GPU server described as "serverless".
I don't understand how people haven't gotten over this yet. When someone says serverless, I immediately understand that to mean "we've obfuscated the underlying server hardware from the consumer of this product." It means "you don't think about servers," not "there are no servers."
Because if I'm thinking about the GPU, I'm fundamentally thinking about the hardware, the server.
Serverless responding to http requests. Sure. I write some code. It gets fed data and returns data. I don't have to know how many cores the server has, or what microcode version the CPU is, or how many other things are running on the server, or if I'm writing an interpreted language (probably the case) even what architecture the CPU is.
But... I need to know all of that if I'm writing GPU code today.
Are you really thinking about the hardware when thinking about the GPU? For instance, if you use pytorch to write a NN, don't you kind of expect it to execute on the GPU without needing to get into the gritty details of it?
The specific version of Nvidia CUDA or AMD ROCm that the GPU supports is often very important; some software needs to be compiled from a specific branch or with specific settings to support a given version. Case in point: the official PyTorch website offers four distinct builds for various platforms, two of those being different versions of CUDA.
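e.g. the first sanity check you end up running on an unfamiliar GPU host:

    import torch

    print("torch built against CUDA:", torch.version.cuda)  # None on CPU-only builds
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))

If the wheel's CUDA build and the host's driver don't line up, that's where it shows up first.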
I think it'll still be useful for plenty of people to choose which GPU runs their code, even if it's compatible with any GPU offered by the service. You might want to choose an older, cheaper GPU for basic parallel computation, but as supply catches up with demand for a newer and more energy-efficient model you'll then want to switch to that. There are only so many GeForce 4090s to go round :)
Recently, folks have begun to use the word "serverless" to mean something subtly different from its intended meaning in this document. Here are two possible definitions of "serverless":
Classic Serverless: The database engine runs within the same process, thread, and address space as the application. There is no message passing or network activity.
Neo-Serverless: The database engine runs in a separate namespace from the application, probably on a separate machine, but the database is provided as a turn-key service by the hosting provider, requires no management or administration by the application owners, and is so easy to use that the developers can think of the database as being serverless even if it really does use a server under the covers.
Yeah, it should be 'elastic' and 'very reduced runtime, possibly just inferencing'. So: exposing Triton to an API gateway and putting a custom load balancer and task-queue facade in front? Curious too.
Also, GPUs do other things than inferencing, right?