Ask HN: What do you use for ML Hosting?
133 points by blululu on May 2, 2023 | 65 comments
I'm trying to set up a server to run ML inference. I need to provision a somewhat beefy GPU with a decent amount of RAM (8-16 GB). Does anyone here have personal experience and recommendations about the various companies operating in this space?



On Modal.com, these 34 lines of code are all you need to serverlessly run BERT fill-mask inference on an A10G (which has 24GB of GPU memory). No Dockerfile, no YAML, no Terraform or AWS CloudFormation. Just these 34 lines.

  import modal

  def download_model():
      from transformers import pipeline
      pipeline("fill-mask", model="bert-base-uncased")

  CACHE_PATH = "/root/model_cache"  # model location in image
  ENV = modal.Secret({"TRANSFORMERS_CACHE": CACHE_PATH})

  image = (
      modal.Image.debian_slim()
      .pip_install("torch", "transformers")
      .run_function(download_model, secret=ENV)
  )
  stub = modal.Stub(name="hn-demo", image=image)


  class Model:
      def __enter__(self):
          from transformers import pipeline
          self.model = pipeline("fill-mask", model="bert-base-uncased", device=0)

      @stub.function(
          gpu="a10g",
          secret=ENV,
      )
      def handler(self, prompt: str):
          return self.model(prompt)


  if __name__ == "__main__":
      with stub.run():
          prompt = "Hello World! I am a [MASK] machine learning model."
          print(Model().handler.call(prompt)[0]["sequence"])

Running `python hn_demo.py` prints "Hello World! I am a simple machine learning model."

You can check out available GPUs at https://modal.com/docs/reference/modal.gpu.

There's also a bunch of easy-to-run examples in our docs :) https://modal.com/docs/guide/ex/stable_diffusion_cli


Btw, HN supports very simple code formatting: just indent by two or more spaces. https://news.ycombinator.com/formatdoc


Ah nice. Thank you. I was using backticks


Modal's usability is amazing. I'm just a bit wary because they use AWS, but somehow their prices are lower than AWS's own GPU machine prices.


Perfect sell. I'm signing up just from this comment alone ;)


How does Modal perform, latency-wise? Since it has to spin up an instance/GPU, does it take long to return results?


Love Modal. We use it for data processing, queues, apis, and all sorts of random things. Such a great product!


Hey! Would love to have you try https://banana.dev (bias: I'm one of the founders). We run A100s for you and scale 0->1->n->0 on demand, so you only pay for what you use.

I'm at erik@banana.dev if you want any help with it :)


+1 on banana.dev, I used it for a side project and deployed some custom code and it was a good experience! I liked the pricing model (lack of minimums and pay for usage instead of a "plan") and how you can package up whatever code you want.


Thanks for the +1!

Small note here: our billing is changing within the next month, to up-front payments that apply as a credit balance to your account. It still won't have minimums and you'll have the option to set up auto-refill on your balance, so it will functionally remain pay-as-you-go, but just wanted to add flavor to your comment on the pricing model.

Thanks for using us btw, you rock


Kind of funny how cellphones went the opposite way - we all hated "paying for usage (minutes/texts)" and now we want just the Plan.


I still use that because I rarely use the phone without wifi and don't make a lot of calls. And because it's 2023 I can change to a paid plan for a month at any time in the app, so it's the best of both worlds.


I have the cheapest plan, 5GB, and an unlimited data Android tablet for a total of $30+$15=$45 for both...

I pay my cell bill a year in advance - then I never have to worry about a bill ever.


Looking at your pricing, is that per second of GPU usage or per total time the app is running?

E.g. it might have only a few minutes of usage in an hour, and the rest of the time is spent waiting for requests. How's that billed?


https://docs.banana.dev/banana-docs/core-concepts/billing You're only billed for active replica time. Call comes in, we start a replica, it handles the request, it waits around for a 10s (configurable) idle timeout to handle any additional calls, and shuts down if no other calls to serve. The idle timeout is to prevent cold boots when not necessary, but is billed, so you can get closer to pure pay-per-call pricing by reducing idle timeout.
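As a back-of-envelope illustration of how that nets out (the numbers here are purely hypothetical):

  # Hypothetical workload: 100 isolated calls/day, 5s of inference each,
  # with the default 10s idle timeout billed after every call.
  calls_per_day = 100
  inference_s = 5
  idle_timeout_s = 10

  # Worst case: no calls share a warm replica, so each pays the full idle timeout.
  billed_s = calls_per_day * (inference_s + idle_timeout_s)
  print(billed_s / 3600, "GPU-hours billed per day")  # ~0.42, vs 24 for always-on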


Looks great. It appears aimed at the inference use case rather than training, yes?


yeah, we're optimizing the infra for realtime inference, though people definitely still do run training on us, with a weights upload implemented at the end of your handler.


How were you able to tell this? Still trying to understand what infra is better suited for inference (say, realtime image category matching) vs. training (feeding a chatbot huge amounts of data).


Here are some candidates:

- HuggingFace Inference Endpoints: https://huggingface.co/inference-endpoints
- Amazon SageMaker: https://aws.amazon.com/sagemaker/
- Replicate: https://replicate.com/

The first two are more customizable than the last. SageMaker is the cheapest.
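For a feel of what calling one of those looks like, here's a minimal sketch against a HuggingFace Inference Endpoint; the endpoint URL and token are placeholders you'd get after deploying:

  import requests

  # Placeholders: substitute the URL and token shown in your endpoint's dashboard
  ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
  HF_TOKEN = "hf_..."

  def query(prompt: str):
      # Endpoints expose the model behind a simple JSON-over-HTTPS API
      response = requests.post(
          ENDPOINT_URL,
          headers={"Authorization": f"Bearer {HF_TOKEN}"},
          json={"inputs": prompt},
      )
      response.raise_for_status()
      return response.json()

  print(query("Hello World! I am a [MASK] machine learning model."))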


My preference is not to have to change my code to use some special framework, and to just get access to a GPU machine I can run my stuff on.

I'm assuming you know what you need for a GPU. If you're unsure, consider running inference on a CPU first to see how long it takes and whether it could work.
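A minimal timing sketch for that sanity check, assuming a transformers pipeline as the workload:

  import time
  from transformers import pipeline

  # device=-1 forces CPU; compare this latency against your budget before renting a GPU
  model = pipeline("fill-mask", model="bert-base-uncased", device=-1)

  start = time.perf_counter()
  model("Hello World! I am a [MASK] machine learning model.")
  print(f"CPU inference took {time.perf_counter() - start:.2f}s")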

And then just look at price and reliability for a GPU machine across the different cloud providers. OVH is cheap, but the only thing worse than their reliability is their customer service. Various niche players offering V100s used to pop up that were pretty cheap. AWS is more expensive and more reliable, though they may still have availability problems. Paperspace looks pretty good. Etc.


> are worth avoiding so you don't get stuck with somebody else's framework.

Modal eng here. Modal is not set up as a framework. Think of it more as Python-defined serverless infrastructure with native support for the Python runtime. This is in some places called "Infrastructure from code", as opposed to "Infrastructure as code", which in practice means source-controlling K8s YAML and CloudFormation.

A major benefit of this approach is that the cloud becomes part of your dev loop, as opposed to doing `docker build`, `docker push`, `kubectl`, etc just to ship a change to a GPU.

In the script I posted Modal APIs are mixed in with standard Python code for brevity, but many customers just keep their code in their own modules and have a `modal_infra.py` module that defines the serverless infrastructure.
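A rough sketch of that split, reusing the pieces from the script above (the file layout is just illustrative):

  # modal_infra.py -- all the serverless infrastructure in one place
  import modal

  CACHE_PATH = "/root/model_cache"
  ENV = modal.Secret({"TRANSFORMERS_CACHE": CACHE_PATH})
  image = modal.Image.debian_slim().pip_install("torch", "transformers")
  stub = modal.Stub(name="hn-demo", image=image)

  # app.py -- model code stays ordinary Python, just decorated
  from modal_infra import stub, ENV

  @stub.function(gpu="a10g", secret=ENV)
  def predict(prompt: str):
      from transformers import pipeline
      return pipeline("fill-mask", model="bert-base-uncased", device=0)(prompt)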


Glad to see this concept being used. I was thinking the other day how odd it is to have IaC and then keep that code siloed from the app code. Why doesn't the app make its own infra? I think you have articulated how that would work.


Understood, thanks for clarifying. I'll edit my post.


That makes sense. Brev.dev is a really simple way to run your code on a configured GPU without having to change your code. It'll also optimize your GPU to save money when possible.


Disclaimer: I work at Truefoundry

You can give us a shot at https://truefoundry.com. We are a general-purpose ML deployment platform that works on top of your existing Kubernetes clusters (AWS EKS, GCP GKE, or Azure AKS), abstracting away the complexity of dealing with cloud providers and Kubernetes. We support Services for ML web apps and APIs, Jobs for ML training jobs, a Model Registry for storing models, and Model Servers for no-code model deployments. (Our platform can be partially or completely self-hosted for privacy and compliance.)

Adding one or more GPUs (V100, T4, A10, A100, etc) is simply one extra line https://docs.truefoundry.com/docs/gpus#adding-gpu-to-service...

Examples:

- Stable Diffusion with Gradio: https://github.com/truefoundry/truefoundry-examples/tree/mai...

- GPT-J 6B fp16 with FastAPI: https://github.com/truefoundry/truefoundry-examples/tree/mai...


Love TrueFoundry! We use it for infra provisioning on our own cloud and for deploying ML models behind the model server of our choice. The pricing model is also good for early-stage start-ups :)


PS: (I am one of the founders) - you can write to us at founders@truefoundry.com. We can help understand your use case and suggest whatever is best from what's available in the ML serving ecosystem.


For serverless: check the list I posted here https://news.ycombinator.com/item?id=34742087 (I ended up using Banana, it was fine)

For non-serverless, some to check out are these (though likely all overkill if you just need a single GPU):

- https://www.coreweave.com/
- vast.ai
- Lambda Labs


How come you didn't end up using Modal, seeing as it was recommended in the only reply in the thread? [I'm a Modal person looking for insight :)]


I'm using a Docker container on Ubuntu, on my home lab that's an ESXi 6.5 hypervisor. Going to be building a new machine with a few hundred GB of RAM and then, at some point in the next 6 months, looking at getting a good GPU with a bunch of VRAM.

Wrapped the thing in a Flask app so I can expose the APIs I build out.
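A minimal sketch of that wrapper (the model and route are just placeholders for whatever I'm actually serving):

  from flask import Flask, jsonify, request
  from transformers import pipeline

  app = Flask(__name__)
  # Load the model once at startup rather than on every request
  model = pipeline("fill-mask", model="bert-base-uncased")

  @app.route("/predict", methods=["POST"])
  def predict():
      prompt = request.get_json()["prompt"]
      return jsonify(model(prompt))

  if __name__ == "__main__":
      app.run(host="0.0.0.0", port=8000)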


I'm currently running a Discord bot with a 7B model off a free Oracle Ampere instance with their PyTorch Accelerated[0] image. It's not terribly fast, but totally usable for group chats that want to interrogate an AI. If you're doing some sort of offline processing or non-time-sensitive operation, something like this might be worth looking into.

[0] https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...


Oracle's free forever tier is so underrated. They just throw half a startup at you at no cost.


They also shut down your instances after a while and ask you to upgrade to the paid tier.


Does that use all 4 OCPUs / 24GB memory?


It can! I'm using 2 cores per request though, and I've got memory to spare.


What Discord bot? :)


Genesis Cloud (https://www.genesiscloud.com/pricing).

Disclaimer: I am the CTO ;)

Why use us?

- Competitive prices (billing by the minute; you only pay when you actually run an instance).
- High reliability (professional DCs, customized hardware to suit requirements).
- Good connectivity (traffic is also free, no in-/egress fees).
- High security (full VMs with dedicated GPUs and proper separation of customers, instead of shared hosts with Docker).
- Free storage.
- A great support team.
- Green energy (no greenwashing by carbon offsetting; we use energy sources that are renewable and carbon-free at the source: geothermal/hydro).

I could go on... Would love it if you just try our services; after sign-up there are free credits available for risk-free testing.


I have had good experiences with Replicate and Runpod. Replicate seems to be nicer but has a very bad cold boot issue. Runpod is great once you have an app set up!

I use mix of both for my side project: https://trainengine.ai


Founder of Replicate here. Looks like you're using DreamBooth for TrainEngine. We have a beta version of really fast cold boots for DreamBooth trainings. I'll drop you an email to get you set up with it.

We've got some big cold boot improvements rolling out across everything soon. We can also just keep models switched on to avoid cold boots entirely.


Seconding the cold boot issue; hoping it gets fixed. I've had to look elsewhere because I can't have users unexpectedly waiting 3 min for a 5-15 second inference. I must say I do like the site and product!


Try www.salad.com. We've got 10k+ GPUs, from 8GB to 24GB. You get 10x more inferences per dollar compared to others. Our product team is pretty happy to help out on Discord. Some prices of interest:

- RTX 3060 (12 GB): $0.08/hr
- RTX 3090 (24 GB): $0.25/hr


Wow, looks like there's a ton of choices here I haven't looked at. For iterate.world we use Replicate but just added Kandinsky from Runpod. Thinking about switching everything to Runpod because it's 5-10x cheaper and we only use models that they have anyway.

There's one I won't name that's now defunct, but you could use any diffusers-compatible project on Hugging Face with it, which was such a cool feature. I wish someone (cheap) would implement this!

edit: just looked at banana.dev in this thread; their templates look closest to the Hugging Face integration, though I don't think they have webhooks.


Founder of Replicate here. Also YC founder (W20). :)

It's also worth noting that we bill by the second for how long your prediction is running, and we don't bill for any idle time, so in practice Replicate works out cheaper for many workloads. We can give discounts if you're putting through a decent amount of traffic. We should be able to match Runpod's pricing.

Drop me an email: ben@replicate.com


Hey! Banana founder here. Explicit webhook support is coming out soon, though one could always add an HTTP POST request to their webhook endpoint at the end of their handler to send the data that way, rather than awaiting the results from the client. It'd take some customization, but if you're into our templates, you can click the GitHub icon in the UI to see the source repo, fork it, add the HTTP POST call at the end of the handler, and then deploy that to Banana as a custom repo.
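A rough sketch of that pattern (the handler shape, run_model, and the webhook URL are placeholders here, not our actual template API):

  import requests

  WEBHOOK_URL = "https://example.com/my-webhook"  # hypothetical receiver you control

  def handler(prompt: str):
      result = run_model(prompt)  # placeholder for the template's inference call
      # Push the result to your own endpoint instead of making the client wait on it
      requests.post(WEBHOOK_URL, json={"prompt": prompt, "output": result}, timeout=10)
      return result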


Let us know if you need help with webhooks, we make it super easy...

https://www.svix.com/



I've been very happy with Genesis Cloud (www.genesiscloud.com) - they have worked with me on getting additional GPU capacity and have very reasonable prices: $0.70 USD/hr for an Nvidia GeForce RTX 3090. They give you $15 in credits for starting an account, but you can get $50 with this referral code: https://gnsiscld.co/x5tpz


Baseten was by far the easiest setup I've tried https://www.baseten.co


Do any of the "serverless"/SaaS model hosting services perform optimizations such as quantization or input micro-batching?
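To be concrete about the first of those, I mean something like PyTorch's dynamic quantization, sketched below; the question is whether any host applies this kind of optimization for you automatically:

  import torch
  from transformers import AutoModelForMaskedLM

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  # Convert Linear layer weights to int8; activations are quantized dynamically at runtime
  quantized = torch.quantization.quantize_dynamic(
      model, {torch.nn.Linear}, dtype=torch.qint8
  )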


If you're using Python, Modal (modal.com) was awesome to set up.

They’ll take a FastAPI setup too and just put it online to be used on demand.


Have you tried self-hosting? All you need is business internet with a static IP, which is quite inexpensive, and inference can be done on a CPU depending on what you want to run. Also, wherever you are hosting, a good rule of thumb is to have at least 1.5 times your VRAM in regular RAM.


I'm the director of engineering for Databricks' model serving product. It is serverless, meaning it autoscales to & from zero. If you are a Databricks customer or willing to be, you can reach out about enrolling in the GPU preview.


Founder of https://replicate.com/ here, which has been mentioned a few times. Happy to help you get set up. :) ben@replicate.com


Banana.dev is what I use. The cold boots are fast


Check out JetML.com (I'm the founder). Happy to help get you started with a demo if you want to reach out nick@jetml.com.


Check out BentoML https://github.com/bentoml


+1 for BentoML. Open source, good docs, and the community around it is responsive.


What about just using a cloud VM with an Ansible script? I find ML deployment solutions to be very over-engineered.


Nvidia Jetson Nano boards (Orin and the previous one) at home. The cloud is so expensive for GPU usage.


We use SageMaker at work because AWS. I don't really like their style of APIs, but it works.


Vast.ai. Nobody has better prices.


Another vote for vast.ai, has been around quite a while and I've been using them for shell access to bare metal machines stuffed with GPUs, always had a decent experience.


Brev.dev

This is exactly what you’re looking for


If you want to host voice ML models, check out Uberduck.


KFServing



