Ask HN: How does deploying a fine-tuned model work
124 points by FezzikTheGiant 7 months ago | 64 comments
If I've managed to build my own model, say a fine-tuned version of Llama, and trained it on some GPUs, how do I then deploy it and use it in an app? Does it need to be running on the GPUs all the time, or can I host the model on a web server or something? Sorry if this is an obvious/misinformed question; I'm a beginner in this space.



There are a lot of places, e.g. Replicate, where you can fine-tune and deploy language models. You'll need GPUs, but you can simply select a hardware class and pay per second of usage, similar to AWS Lambda or other serverless infrastructure.

Serverless AI is quickly becoming popular precisely because of the scenario you're describing — it's currently still pretty hard to deploy your own GPU stack, not to mention crazy expensive to run e.g. an A100 24/7, plus the orchestration for scaling up/down. It's why so many model authors don't host their own demos anymore and simply toss them on Hugging Face or Replicate directly.

Serverless providers will basically do the infra for you, as well as make the necessary GPU reservations, then charge you a slight premium on the reserved price — so you'd pay less than on-demand on GCP/AWS, while they benefit from economies of scale.

I do imagine at some point soon GPUs will become cheap and more commercially available, so you could get away with hosting your own VPS in the distant (or near?) future.


Is it really a slight premium, or more like "at least 2x the cost"?


A lot of the answers to your question focus solely on the infra piece of the deployment process, which is just one, albeit important, piece of the puzzle.

Each model is built on some predefined model architecture, and the majority of today's LLMs are implementations of the Transformer architecture, based on the "Attention Is All You Need" paper from 2017. That said, when you fine-tune a model, you usually start from a checkpoint and then compute new weights using techniques like LoRA or QLoRA. You do this in your training/fine-tuning script using PyTorch, or some other framework.

Once the training is done you get the final weights -- a binary blob of floats. Now you need to load those weights back into the model's inference architecture. You do that by using the framework that was used for training (PyTorch) to construct the inference pipeline. You can build your own framework/inference engine too if you want and try to beat PyTorch :) The pipeline will consist of things like the following (a rough sketch in code follows the list):

- loading the model weights

- doing pre-processing on your input

- building the inference graph

- running your input (embeddings/vectors) through the graph

- generating predictions/results
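
A minimal sketch of that pipeline using Hugging Face Transformers on top of PyTorch, assuming transformers and accelerate are installed and the fine-tuned weights were saved with save_pretrained() to a local directory (the path below is hypothetical):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "./my-finetuned-llama"  # hypothetical local checkpoint directory

    # load the model weights and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,  # half precision to fit in less memory
        device_map="auto",          # place layers on GPU(s) if available, else CPU
    )

    # pre-process the input, run it through the graph, generate a prediction
    prompt = "Summarize: the quick brown fox jumps over the lazy dog."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))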

Now, the execution of this pipeline can be done on GPU(s), so all the computations (matrix multiplications) are super fast and the results are generated quickly, or it can still run on good old CPUs, but much slower. Tricks like quantization of the model weights can be used here to reduce the model size and speed up execution by trading off some precision.
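
For example, with the same stack you can load the fine-tuned weights in 4-bit via bitsandbytes (a sketch, assuming a CUDA GPU and the bitsandbytes package; the checkpoint path is the same hypothetical one as above):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # do the math in fp16
    )
    model = AutoModelForCausalLM.from_pretrained(
        "./my-finetuned-llama",                # hypothetical checkpoint as above
        quantization_config=bnb_config,
        device_map="auto",
    )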

Tools like Ollama or vLLM abstract away all of the above steps, and that's why they are very popular -- they even allow you to bring your own (fine-tuned) model.

On top of the pure model execution, you can create a web service that serves your model via an HTTP or gRPC endpoint. It could accept a user query/input and return JSON with the results. Then it can be incorporated into any application, become part of another service, etc.
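
A minimal sketch of such an HTTP wrapper (FastAPI is just one choice here; it reuses the model and tokenizer loaded in the earlier snippet):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        prompt: str
        max_new_tokens: int = 64

    @app.post("/generate")
    def generate(q: Query):
        # model/tokenizer are loaded at startup, as in the earlier snippet
        inputs = tokenizer(q.prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=q.max_new_tokens)
        return {"result": tokenizer.decode(output_ids[0], skip_special_tokens=True)}

    # run with: uvicorn server:app --host 0.0.0.0 --port 8000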

So, the answer is much more than "get the GPU and run with it", and I think it's important to be aware of all the steps required if you want to really understand what goes into deploying custom ML models and putting them to good use.


Thanks for the insightful response. This is exactly the type of answer I was looking for. What's the best way to educate myself on the end-to-end process of deploying a production-grade model in a cost-efficient manner?


This might be asking for too much, but is there a guide that explains each part of this process? Your comment made the higher-level picture way clearer for me and I'd like to go into the weeds a bit on each of these steps.


download llama.cpp

convert the fine tuned model into gguf format. choose a number of quantization bits such that the final gguf will fit in your free ram + vram

run the llama.cpp server binary. choose the -ngl number of gpu layers, which is the max number that will not overflow your vram (i just determine it experimentally: i start with the full number of layers, divide by two if it runs out of vram, multiply by 1.5 if there is enough vram, etc.)

make sure to set the temperature to 0 if you are doing facts based language conversion and not creative tasks

if it's too slow, get more vram
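
a rough sketch of calling that server from python, assuming it's running on the default port 8080 and exposing llama.cpp's OpenAI-compatible chat endpoint (recent server builds do):

    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": "Convert to JSON: name=Ada, age=36"}],
            "temperature": 0,   # deterministic output for facts-based conversion
            "max_tokens": 128,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])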

ollama, kobold.cpp, and just running the model yourself with a python script as described by the original commenter are also options, but the above is what i have been enjoying lately.

everyone else in this thread is saying you need gpus but this really isn't true. what you need is ram. if you are trying to get a model that can reason you really want the biggest model possible. the more ram you have the less quantized you have to make your production model. if you can batch your requests and get the result a day later, you just need as much ram as you can get and it doesn't matter how many tokens per second you get. if you are doing creative generation then this doesn't matter nearly as much. if you need realtime then it gets extremely expensive fast to get enough vram to host your whole model (assuming you want as large a model as possible for better reasoning capability)


Interesting. Thanks for the response. Do you have any resources where I can educate myself about this? How did you learn what you know about LLMs?


Well, when Llama 1 came out I signed up and downloaded it, and that led me to llama.cpp. I followed the instructions to quantize the model to fit in my graphics card. Then later, when more models like Llama 2 and Mixtral came out, I would download and evaluate them.

I kept up with Hacker News posts and dug into comments about things I didn't understand. I've also found the LocalLLaMA subreddit to be a great way to learn.

Any time I saw a comment mention something, I would try it: ollama, kobold.cpp, sillytavern, textgen-webui, and more.

I also have a friend who has been into ai for many years and we always exchange links to new things. I developed a retrieval augmented generation (rag) app with him and a "transformation engine" pipeline.

So following ai stories on hn and reddit, learning through doing, and applying what I learned to real projects.


Thanks. Very cool. Have you ever tried to implement a transformer from scratch, like in the "Attention Is All You Need" paper? Can a first- or second-year college student do it?


Andrej Karpathy's course is a good resource: https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...


I haven't tried it yet, but I do intend to. I think the code for LLM inference is quite straightforward; the complexity lies in collecting the training corpus and doing good RLHF. That's just my intuition.


Hi, I work at a startup where we train / fine-tune / run inference on models on a GCP Kubernetes cluster with some A100s.

There isn't really that much information about how to do this properly because everyone is working it out as they go and it changes month by month. It requires a bunch of DevOps and infrastructure knowledge above and beyond the raw ML knowledge.

Your best bet is probably just to tool around and see what you can do.



Thanks!! This is really cool


For my hobby project, I:

1/ bought a gaming rig off Craigslist

2/ set it up to run as a server (auto-start services, auto-start if power is cut, etc.)

3/ set up a cloud Redis server

4/ the gaming rig fetches tasks from the Redis queue, processes them, and updates them

Queueing in front of ML models is important. GPUs are easily overwhelmed and will basically fail 90% of the traffic and hit max latency if you send them too much at once. Let the GPU server run at its own pace.
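
A minimal sketch of that pattern with redis-py (the hostname and key names below are made up):

    import json
    import redis

    r = redis.Redis(host="my-cloud-redis.example.com", port=6379)  # hypothetical host

    # web side: enqueue work instead of hitting the GPU box directly
    def submit_task(task_id: str, prompt: str) -> None:
        r.lpush("inference:tasks", json.dumps({"id": task_id, "prompt": prompt}))

    # GPU box: pull tasks at its own pace and write results back
    def worker_loop(run_model) -> None:
        while True:
            _, raw = r.brpop("inference:tasks")   # blocks until a task arrives
            task = json.loads(raw)
            result = run_model(task["prompt"])    # your model's inference call
            r.set(f"inference:result:{task['id']}", result)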


> Queueing in front of ML models is important

This sounds so clean! A super fast NoSQL to hold incoming requests.

Could even throttle your free users and prioritize the paying customers.

Btw, is the project online now?


Here is a queueing API server for self-hosted inference backends, from a friend of mine: https://github.com/aime-team/aime-api-server. Very lightweight and easy to use. You can even serve models from Jupyter notebooks with it without needing to worry about overwhelming the server; it just gets slower the more load you send to it.


Really cool! I like that they have live demos to prove it out. Thanks for sharing


I use ML for content moderation at Grab and my side project. Both are online. Both use queues, but Grab's traffic is more organic and real time than the side project's.


Could you please elaborate on the OS/tools you’re using to run the model locally?


I recommend checking out https://modal.com

It’s “serverless” and you only pay for the compute you use.


I have modal, runpod and beam.cloud in my notes.

Surprised I'm not seeing Modal recommended more. I plan on trying it myself soon.


This looks awesome, thanks for sharing


You still need a GPU.

I'm nothing even close to a knowledgeable expert on this, but I've dabbled enough to have an idea how to answer this.

The special thing about why GPUs are used to train models is that their architecture is well suited to parallelizing the kind of vector math that goes on under the hood when adjusting the weights between nodes in all the layers of a neural network.

The weights files (GGUF, etc) are just data describing how the networks need to be built up to be functional. Think compressed ZIP file versus uncompressed text document.

You can run a lot of models on just a CPU, but it's gonna be slooooooow. For example, I've been tinkering with running a Mixtral 8x7B model on my 2019 Intel MacBook Pro with llama.cpp. It works, but it may be running at 1-2 tokens per second at max, and that's even with the limited GPU offloading I figured out how to do.


Depends on the application.

If you can process "offline" for an hour and then cache the results, CPU inference is fine.

GPUs are expensive.


Very true. This is the approach I've been taking. My project isn't human-interactive and I'm fine with it taking a couple hours for each run to finish.

The downside is that it makes development and testing super slow without the speedup you get from having local GPU power.


> If you can process "offline" for an hour and then cache the results, CPU inference is fine.

> GPUs are expensive.

Depends on the GPU. I've found T4 GPUs to be cheaper than CPU compute on AWS when testing throughput per $ of spend.


CPU inference is "OK" on quantized models smaller than ~10B parameters.

It gives at least a couple of tokens per second on fast modern CPUs.

So it really depends on how fast responses need to be, the expected load, compute cost, budget, etc.


If you care about inference time, then you'll do two things.

1. Train a student model from your fine-tuned model. (Known as "knowledge distillation").

2. Quantize the student model so that it uses integers.

You might also prune the model to get rid of some close-to-zero weights.

This will get you a smaller model that can probably run OK on a CPU, but will also be much more efficient on GPU.
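
A minimal sketch of step 2 using PyTorch's post-training dynamic quantization (note this variant mainly speeds up CPU inference; the toy model below is just a stand-in for your distilled student):

    import torch
    import torch.nn as nn

    # stand-in for the distilled student model; in practice, load your own
    student = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 32000))

    # replace the Linear layers with int8-weight versions
    quantized = torch.ao.quantization.quantize_dynamic(
        student, {nn.Linear}, dtype=torch.qint8
    )
    torch.save(quantized.state_dict(), "student-int8.pt")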

Next: architect your code so that the inference step sits behind a queue. You generally do not want to have the user interface waiting on an inference event, because you can't guarantee latency or resource availability, and your model's inference processing will be the biggest, slowest thing in your stack, so you can't afford to overprovision it.

So have a queue of "things to infer", and have your inference process run in the background, chomping through the backlog and storing the results in your database. When it infers something, notify your front-end clients somehow that the result is ready in the database for them to retrieve. In this model, you can potentially run your model somewhere cheaper than AWS (e.g. a cheaper provider, or a machine under your desk).

Or, for the genius move: compile the model to ONNX and run it in a background thread in the users' browser, and then you don't have to worry about provisioning; users will wonder why their computer runs so slowly though.


This is really helpful. I'm really new to this; what's the best way to educate myself on fine-tuning and deploying an LLM in the most cost- and time-efficient way?


I've got video recordings of my lectures. I haven't edited them properly yet, but I can share them with you privately. Email me (it's in my profile) and I'll get them to you.


There are lots of third-party providers that will host your fine-tuned model for you, and just charge per token like OpenAI. Here are some of the providers I've personally used and would vouch for in production, along with their costs per 1M input tokens for Llama 3 8B, as a point of comparison:

- Replicate: $0.05 input, $0.25 output

- Octo: $0.15

- Anyscale: $0.15

- Together: $0.20

- Fireworks: $0.20

If you're looking for an end-to-end flow that will help you gather the training data, validate it, run the fine tune and then define evaluations, you could also check out my company, OpenPipe (https://openpipe.ai/). In addition to hosting your model, we help you organize your training data, relabel if necessary, define evaluations on the finished fine-tune, and monitor its performance in production. Our inference prices are higher than the above providers, but once you're happy with your model you can always export your weights and host them on one of the above!


Appreciate the response. Definitely gonna check it out. How is pricing structured?


For the inference part, you can dockerise your model and use https://banana.dev for serverless GPU.

They have examples on GitHub on how to deploy. I did it last year and it was pretty straightforward.


Followed this link to banana.dev. It appears that the company has sunset the product: https://www.banana.dev/blog/sunset


The hardware you need and the amount of time you run the model really depend on which model you've fine-tuned and what your model will be doing. If you don't need quick responses, you could use a CPU via llama.cpp. If you are only doing something like summarization, you could do the inference in batches and start and stop your GPU resources between them. If you're doing chat around something predictable, like a product, you could cache common questions and responses and use a smaller model to pick between them.


I've built a product in this regard - specifically for fine-tuning and deploying said fine-tuned models.

You'll need GPUs for inferencing + have to quantize the model + have it hosted in the cloud. The platform I've built follows the same workflow (but all of it is automated, along with autoscaling, and you get an API endpoint; you only pay for the compute you host on).

Generally, the GPU(s) you choose will depend on how big the model is + how many tokens/sec you're looking to get out of it.


Hijacking this thread to ask a question. Does anyone have a 1,000-ft-level description of all the steps in the ML pipeline? I thought I had a basic understanding, but I am now seeing things like pre-training which I am not sure about (why pre-train, and what are you pre-training with?). And fine-tuning, which I understand better (at least the concept). I guess I am looking for concept-level definitions of the verbiage/taxonomy used in the ML pipeline - ideally in chronological (pipeline) order.


Does anyone know which (if any) serverless providers support LoRA? It should be possible to serve community models like Llama3-70B and dynamically load an adapter on request, to greatly reduce the “cold start” problem.



Nice find! Better pricing than Replicate, too.


Fireworks does: https://readme.fireworks.ai/reference/requirements-and-limit... but it's limited to rank 64. Cloudflare Workers does too, but there the rank is limited to as low as 8.


The easiest stepping stone will involve deploying on Hugging Face. Think of it, in this case, as basically a mashup of GitHub x Heroku. You'll upload your model, they'll already have a repo somewhere for the model you fine-tuned, and you'll tweak 20 lines of Python to run your version instead of theirs.

The next easiest step would be OpenAI. They offer extremely easy fine-tuning: more or less, you upload a file with training data to their web app, a few hours later it's done, and you just use your API key + a model ID for your particular fine-tuned model.


He mentioned he already has a fine-tuned model. OpenAI does not host user-created models.


They do not have a fine-tuned model and don't claim to. Beyond that... this is just some filler text I'm writing because it felt aggro when I ended with a period. But I do want to correct the record, because the post flipped from +3 to -1 within 20 minutes of your post. Have a good Tuesday!


"If I've managed to build my own model, say a fine-tuned version of Llama"


"If", "say" are diametrically opposed to certainty in US English vernacular.

The only thought left after that is that if you've gotten to the point of fine-tuning a model, it'd be very difficult to have this level of question.

Frankly, I don't like asserting that because it's not guaranteed, and I really don't want to get into a pissing match about this. Bad vibes for no reason. I wrote an informative comment, from direct experience, that will help them. It wasn't just OpenAI... but I did enable them, or anyone else for whom Hugging Face would also be too much, to find an intermediate step.


I think you underestimate the skill/knowledge gap present in people who can download and tinker with an existing program vs. deploy it in a sensible way so that people can interact with it. I have friends in different areas of tech who might be able to complete basic projects but would struggle to get them out there without a little guidance.


- There's absolutely no reason to think they have a fine-tuned model ready.

- There are multiple strong indicators from the poster themself they specifically do not.

- It's dispiriting to see a long comment chain dedicated to picking apart a good, helpful response.

- It's uncomfortable to be told I'm underestimating skill/knowledge gap, after I purposely bracketed it.

- It's uncomfortable to be told I'm underestimating skill/knowledge gap, after I specifically said "I don't like asserting that because it's not guaranteed, and I really don't want to get in a pissing match about this."


I get pissed too when that happens. But that's the way it goes. One can choose between two questions:

- I am important, ain't I?

- What can I buy with the HN karma points?

I mostly choose the second question and answer it briefly:

Nothing.

And then I realize, for what it's worth: it's worth nothing and I can't buy anything with it.

I'm learning not to be pissed off on the Internet. I think it's an important skill.


No, actually, you're allowed to post and tease people who are saying nothing. The karma is just a useful framing as to why, people don't like it when you tease for no reason. (I only mentioned it once briefly lol, 4 comments up, relax buddy)


The mistake I always make is to ask, after the initial comment gets negative points: "Why no text, just minus points? It's just some random points on the Internet. I couldn't care less if you downvote, so feel free..."

Sudden death reply:

"So then why you feel important to tell that you don't care if you don't care?"

Nailed on the spot :)


When this stuff happens to me, I try to imagine that people accidentally tapped the down arrow, because it is so close to the up arrow.

Enjoy my upvote!


First, you should do some experimentation to determine whether fine-tuning is even needed for you. GPT-4 and Claude Opus have enough context window to ground the model with your knowledge base, and a best-in-class hosted model is going to be far better at reasoning than any model you can afford to run yourself.

If you need data privacy and don't trust OpenAI and Anthropic when they say they don't train on API call data, then you will need a local model.


TLDR: you'll probably serve it on GPUs.

If it's a small model you might be able to host it on a regular server with CPU inference (see llama.cpp).

Or a big model on CPU, but really slowly.

But realistically you'll probably want to use GPU inference.

Either running on GPUs all the time (no cold-start times) or on serverless GPUs (but then the downside is that the instances need to start up when needed, which might take ~10 seconds).


Use vLLM on a GPU server. Unfortunately it won't be cheap if you need it 24/7.


Question: Why do so many people ask if it's needed "24/7"? I've been on 2 calls this week where someone asked about my project "Do you need it available 24/7?"

I'm wondering where this question keeps coming from and why people ask it - of course if I'm launching a public product it should be available 24/7. Is that what the question is?


You can rent GPUs by the hour and there are (limited) serverless options too. So this question makes a gigantic difference in cost.

>of course if I'm launching a public product it should be available 24/7.

Generally yes, but not a given. In some context like data processing you can accumulate tasks and run them in a batch.

e.g. OpenAI just launched batches... 50% discount, but you don't get instant results; you get them sometime in the next 24h.


Ahh got it, so not for real-time consumer use cases then.

Typically when I think of a user using an AI product they're hitting the LLM a lot and expecting results on their screen right then, but I guess there are a lot of back office and data processing use cases.


You can use https://github.com/dstackai/dstack to deploy your model to the most affordable GPU clouds. It supports auto-scaling and other features.

Disclaimer: I’m the creator of dstack.


This is why everything I do is fine-tuned OpenAI. It is so much easier, faster, and, at least at my scale, cheaper. I suspect it's also better for small datasets like mine. It's unbelievable to me that none of the big competitors offer a similar service.


GPU vs CPU:

It's faster to use a GPU. If you tried to play a game on a laptop with onboard gfx vs buying a good external graphics card, it might technically work but a good GPU gives you more processing power and VRAM to make it a faster experience. Same thing here.

When is GPU needed:

You need it for both initial training (which it sounds like you've done) and also when someone prompts the LLM and it processes their query (called inference). So to answer your question: your web server that handles incoming LLM queries also needs a great GPU, because with any amount of user activity it will be running effectively 24/7 as users continually prompt it, just as they would use any other site you have online.

When is GPU not needed:

Computationally, inference is just "next token prediction", but depending on how the user enters their prompt sometimes it's able to provide those predictions (called completions) with pre-computed embeddings, or in other words by performing a simple lookup, and the GPU is not invoked. For example in this autocompletion/token-prediction library I wrote that uses an ngram language model (https://github.com/bennyschmidt/next-token-prediction), GPU is only needed for initial training on text data, but there's no inference component to it - so completions are fast and don't invoke the GPU, they are effectively lookups. An LM like this could be trained offline and deployed cheaply, no cloud GPU needed. And you will notice that LLMs sometimes will work this way, especially with follow-up prompting once it already has the needed embeddings from the initial prompt - for some responses, an LLM is fast like this.

On-prem:

Beyond the GPU requirement, it's not fundamentally different than any other web server. You can buy/build a gaming PC with a decent GPU, forward ports, get a domain, install a cert, run your model locally, and now you have an LLM server online. If you like Raspberry Pi, you might look into the NVIDIA Jetson Nano (https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...) as it's basically a tiny computer like the Pi but with a GPU and designed for AI. So you can cheaply and easily get an AI/LLM server running out of your apartment.

Cloud & serverless:

Hosting is not very different from conventional web servers, except that the hardware has more VRAM and the software is designed for LLM access rather than a typical web backend (different DB technologies, different frameworks/libraries). Of course, AWS already has options for deploying your own models, and there are a number of tutorials showing how to deploy Ollama on EC2. There are also serverless providers - Replicate, Lightning.AI - these are your Vercels and Herokus that you might pay a little more for, but you get the convenience of getting up and running quickly.

TLDR: It's like any other web server except you need more GPU/VRAM to do training and inference. Whether you want to run it yourself on-prem, host in the cloud, use a PaaS, etc. those are mostly the same decisions as any other project.


Replying to follow


You can use "favorite" for this, which is less spammy.



