Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models (anyscale.com)
308 points by robertnishihara on Aug 11, 2023 | 59 comments



Just to add to this, I ran through a lot of these topics around fine-tuning Llama 2 on your own dataset (for me it's my own code :P) in a coding live stream a couple of weeks ago. All on a single Colab GPU.

Fine-tuning Llama stream: https://www.youtube.com/watch?v=TYgtG2Th6fI&t=2282s

I have a couple more, including one where I do a QLoRA fine-tuning session and explain the concepts as a self-taught engineer (software engineer of 8 years, recently moving into ML).

QLoRA fine-tuning stream: https://www.youtube.com/watch?v=LitybCiLhSc&t=4584s

Overall I'm trying to break down how I'm approaching a lot of my personal projects and my current AI-driven startup. I want to make this information as accessible as possible. I also have a series where I'm fine-tuning a model to be as small a webdev LLM as possible, which people seem to be liking. I've only been streaming for about a month and there's plenty more to come.

Ask me any questions about the streams and fine-tuning Llama!


What is the general thought process on when it makes sense to use RAG vs fine tuning?

How does segmenting fine-tuned models make sense? Do I need a Terraform LLM, a SQL LLM, and a Python LLM, or can I just use a “code” LLM?


Fine-tuning is for training the model to perform a new task; RAG is for adding knowledge.

In your example, you would fine-tune the model to teach it to code in a language it hasn't seen before; RAG won't really help with that.


I've read that SFT is good for "leveraging existing knowledge" gained during initial pretraining, and helpful in changing the way that the model responds, but not useful for teaching it new knowledge. In your experience is that true?

For example, changing the way in which it responds could be:

  - debate me
  - brainstorm
  - be sarcastic
Which also seems like something that could be accomplished with a system prompt or few shot examples, so I'm not sure when SFT is the more appropriate approach or what the tradeoffs are.

Alternatively, gaining new knowledge would be training it on a dataset of e.g. sports trivia to make it highly effective at answering those types of questions.

P.S. nice username... Irving Fisher would approve.


I have a RAG video (my "make a ChatGPT with podcasts" video) you might be interested in. Semantic search is incredible, and you might be surprised how good a Q&A solution can be just by extracting passages that answer the question.

Overall it depends on whether you can turn your data into a fine-tuning dataset, and whether you can find a low-enough-parameter model that can use your retrieved contexts as input, either self-hosted or via inference endpoints. Hosting an LLM is actually not easy, and working in the field on an information retrieval business, I'm finding OpenAI isn't terrible compared to the cost of running GPUs for your users across the world.
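To make the retrieve-then-read idea concrete, here's a rough sketch (not the exact code from my stream) using sentence-transformers; the embedding model and passages are just placeholders:

  # Embed passages, find the ones closest to the question, and pass them
  # to whatever model answers the question.
  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

  passages = ["chunk one...", "chunk two..."]       # your chunked documents
  passage_emb = model.encode(passages, convert_to_tensor=True)

  question = "What did the guest say about fine-tuning?"
  question_emb = model.encode(question, convert_to_tensor=True)

  # Top-3 most similar passages by cosine similarity
  hits = util.semantic_search(question_emb, passage_emb, top_k=3)[0]
  context = "\n\n".join(passages[h["corpus_id"]] for h in hits)
  # 'context' then goes into the prompt of the Q&A / reader model.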


There is an article at the original site about that: https://www.anyscale.com/blog/fine-tuning-is-for-form-not-fa...

Everybody new to this field thinks they need fine-tuning to teach the LLM new facts. I made the same mistake initially; later I published a slightly ranty post on that: https://zzbbyy.substack.com/p/why-you-need-rag-not-finetunin...


Quick question: the Gorilla paper talks about fine-tuning for RAG. Do you see this in practice? Can you do fine-tuning that specifically improves RAG?


Sorry, I don't have much experience myself yet; I'm at the research phase. But from what I've read, it makes sense to fine-tune the model to better understand the format used for calling external tools, including a search engine.


We really need a simple "put your source stuff in this directory, then press this button, then chat with your contents" type of app/module/library.

Too much implementation detail is required, which makes it inaccessible for any non-significant use case. I imagine privateGPT will get there slowly.


I wrote a simple implementation to do this in ChatGPT via local plugin [0]. Obviously it doesn’t hit the “fully private” requirement but I imagine it would be relatively straightforward to integrate into a local LLM. The question is whether a local LLM would be as good at grabbing enough context and nuance from the project to answer meaningfully as GPT-4 is able to do with plugins.

[0] https://github.com/samrawal/chatgpt-localfiles


In one of my streams I essentially build this from scratch: https://www.youtube.com/watch?v=kBB1A2ot-Bw&t=236s. It's a retriever-reader model. Let me know if you want the code; I think the Colab is linked in the comments, but let me know if you need more.


At this stage of AI, the implementation details matter a lot for the chat to be actually meaningful… RAG is over-hyped.


This is brilliant. Could you do a series about how to prepare custom datasets for fine-tuning? That's the part a lot of other tutorials skip, especially for different goals like safety, accuracy, etc.


Of course. I have a few where I web scrape and build a dataset for myself with prefix tokens. I can break that down more in a specific stream about it.


Well, not so much the raw data acquisition (scraping and stuff), but really the data prep for fine-tuning. I'm hearing that each model needs it in a different format: chat fine-tuning data is different from instruct data, etc.
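For example, as far as I can tell (rough sketch, the exact templates vary by model and repo), a Llama-2-chat row versus an Alpaca-style instruct row look roughly like:

  # Llama-2-chat style training example (roughly):
  chat_example = (
      "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
      "Summarize this paragraph... [/INST] Here's the summary... </s>"
  )

  # Alpaca-style instruct example:
  instruct_example = (
      "### Instruction:\nSummarize this paragraph...\n\n"
      "### Response:\nHere's the summary..."
  )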


One GPU? Feasible with a single 3060?


Absolutely. With QLoRA / 4-bit / GPTQ fine-tuning, you can easily train a 7B model on an RTX 3060 (12GB VRAM).

If you have a 24GB VRAM GPU like an RTX 3090/4090, you can QLoRA fine-tune a 13B or even a 30B model (in a few hours).
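For reference, a minimal QLoRA setup sketch with transformers + peft + bitsandbytes (the model name, rank, and target modules below are just illustrative defaults, not a recipe):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  # Load the base model in 4-bit (NF4) so the 7B weights only take a few GB of VRAM
  bnb = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.float16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Llama-2-7b-hf", quantization_config=bnb, device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

  # Attach small trainable LoRA adapters; the quantized base weights stay frozen
  model = prepare_model_for_kbit_training(model)
  lora = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # typically well under 1% of total params

Training from there is a normal Trainer/SFTTrainer loop over your dataset.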


It would be good to see a rigorous analysis of the quality of these PEFT methods. There still seems to be debate about whether or not they sacrifice quality.


+1 this


> Additionally, while this wasn’t an issue for GPT, the Llama chat models would often output hundreds of miscellaneous tokens that were unnecessary for the task, further slowing down their inference time (e.g. “Sure! Happy to help…”).

That's the problem I've been facing with Llama 2 as well. It's almost impossible to have it just output the desired text. It will always add something before and after its response. Does anyone know if there's any prompt technique to fix this problem?


Use a better model.

airoboros supports the PLAINFORMAT token "to avoid backticks, explanations, etc. and just print the code".

https://huggingface.co/TheBloke/airoboros-l2-70B-GPT4-2.0-GG...


It's not useful for code, but you can see the difference of approach with NovelAI's homegrown Kayra model, which is set up to handle a mix of text completion and instruct functionality. It never includes extraneous prefix/suffix text and will smoothly follow instructions embedded in text without interrupting the text.


Thanks, I'll give this a try.

I wonder if LLMs will have less reasoning power if they simply return the output. AFAIK, they think by writing their thoughts. So forcing an LLM to just return the goddamn code might limit its reasoning skills, leading to poor code. Is that true?


Potentially it could have an impact if it omits a high level description before writing the code, although obviously things like "Sure! Happy to help" do not help.

In practice I haven't seen it make too much of a difference with GPT. The model can still use comments to express itself.

For non coding tasks, adding "Think step by step" makes a huge difference (versus YOLOing a single word reply).


> although obviously things like "Sure! Happy to help" do not help.

Yes you're right. I'm mostly concerned with the text that actually "computes" something before the actual code begins. Niceties like "sure! happy to help" don't compute anything.

CoT indeed works. Now I've seen people take it to the extreme with tree of thoughts, forest of thoughts, etc., but I'm not sure how much "reasoning" we can extract from a model that is obviously limited in terms of knowledge and intelligence. CoT already gets us 80% of the way there. With some tweaks it can get even better.

I've also seen simulation methods where GPT "agents" talk to each other to form better ideas about a subject. But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.


> But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.

Not necessarily the same thing, as you're still putting in more processing power and checking more possible paths. It's kinda like simulated annealing: sure, the system is dumb, but as long as checking whether you have a correct answer is cheap, it still narrows down the search space a lot.


> Its kinda like simulated annealing.

Yeah I get that. We assume there's X amount of intelligence in the LLM and try different paths to tap on that potential. The more paths are simulated, the closer we get to the LLM's intelligence asymptote. But then that's it—we can't go any further.


You can also just parse the text for all valid code blocks and combine them. I have a script which automatically checks the clipboard for this.

There's no reason to handle the LLM side of things, unless you want to try and optimize the amount of tokens which are code vs comments vs explanations and such. (Though you could also just start a new context window with only your code or such)
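The parsing part is basically just a regex over fenced blocks, something like this (sketch, not my actual script; the clipboard part is just a poll on top with something like pyperclip):

  import re

  def extract_code_blocks(text: str) -> str:
      # Grab the contents of every ``` fenced block, ignoring the language tag
      blocks = re.findall(r"```[^\n]*\n(.*?)```", text, re.DOTALL)
      return "\n\n".join(block.strip() for block in blocks)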


The model card also has prompt formats for context-aware document Q&A and multi-CoT; using those correctly improves performance at such tasks significantly.


Llama-2-chat models have been overly fine-tuned to be like this. You can give few-shot prompting a try, but it still doesn't guarantee the desired output. The best way to guarantee it is to fine-tune on a small (~1k) set of data points and go from there.


Fine-tune the chat or the base model?


It depends on what your goal is, but I've had success reproducing specific output formatting by fine-tuning the base LLaMA2 models instead of the RLHF'd models. My use cases were simpler - information extraction/synthesis from text rather than creative writing. The base models might not be good fits for your task.


Prompt the model to always output answers/code within ```content``` strings or as JSON. If it's JSON, you can identify where it starts and ends and strip everything outside it.
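If you go the JSON route, the stripping step is just slicing between the first "{" and the last "}", roughly:

  import json

  def extract_json(response: str) -> dict:
      # Ignore any chatter before/after the JSON object itself
      start = response.index("{")
      end = response.rindex("}") + 1
      return json.loads(response[start:end])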


I'm really glad to see a post like this come out. I've seen so many discussions online about customizing models -- this post really does cut through the noise.

Really like the evaluation methodology, and seems well-written as well.


It's weird that LoRA and training with quantization aren't being taken more seriously. They're way cheaper, take less time, and a lot of evidence shows they're pretty good.

I don't think they should be brushed off to the side as something to try out later.



I'm not sure to whom he is responding, since no one is claiming LoRA performs as well as traditional fine tuning. If you click through to the original Tweet he shared, it says "when you have a lot of data and limited compute go for LoRA, while with limited data and ample compute go for full finetuning" which I think is absolutely correct and few would disagree. As these models get bigger and bigger though, fewer and fewer people are going to have the "ample compute" required for full fine tuning.


I'm not sure less data should require full fine-tuning. If I had 5 pages of text, I don't see why I'd need to train billions of parameters that are already trained pretty well on general internet knowledge and already know how to chat.

From a practical perspective, unless cost is really immaterial, I think most will end up starting with LoRA, especially for 13B or 70B models: you could do 10 LoRA fine-tuning runs for the cost of a few full fine-tunings.

But it's still all witchcraft to me to some degree, and I'd probably try both full fine-tuning and LoRA.


The tweet is referring to a paper that fine-tunes a Chinese dataset on an English base model. I'm not surprised by LoRA's poor result in that setup.


Glad to see the NER-like task performed the best, as I was just about to test something like this for comparison with a fine-tuned BERT model. Any idea about the training costs for this task?


Hey, I'm one of the co-authors of the post. The training data for ViGGO has about 5.1k rows, which we trained with a block size of 512 (you can lower the block size if you want, but we didn't because it was easier not to change the code :)). On 16xA10Gs, 7B took ~15 min per epoch and 13B took ~25 min per epoch, so the on-demand cost per epoch is ~$7.2 for 7B and ~$12 for 13B. This counts only time spent on training and doesn't include cluster startup and shutdown time.


Great! Thank you!


Great question. I wish they said how long the 10 epochs took, so we could figure out the cost (or better, just posted the time and cost together):

"For the 7B and 13B models, we used 16xA10Gs, and for the 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances). When using Ray, there's no need to secure A100s to perform full-parameter fine-tuning on these models! The process is simply repeated for each task. Figures below show an example run based on a context length of 512, with a total of 3.7M effective tokens per epoch on GSM8k dataset.

We ran the training for a maximum of 10 epochs and selected the best checkpoint according to the minimum perplexity score on the validation set."


Training times for GSM8k are mentioned here: https://github.com/ray-project/ray/tree/master/doc/source/te...


One challenge is that to get large enough custom datasets you either need a small army or a very strong existing model. Which means that you probably have to use OpenAI. And using OpenAI to generate training material for another model violates their terms.

Has anyone taken them to court about this? Do we all just decide it's not fair and ignore it?


This is not true for all tasks. For many NLP tasks, you just need to reformat existing data to match the LLM format.
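For example, a labeled NER record can be turned into a prompt/completion row in a couple of lines (the format here is purely illustrative):

  # Hypothetical labeled record -> instruction-style fine-tuning row
  record = {
      "text": "Alice moved to Berlin in 2019.",
      "entities": [("Alice", "PERSON"), ("Berlin", "LOC"), ("2019", "DATE")],
  }

  row = {
      "prompt": "Extract the named entities from the text.\n\nText: " + record["text"],
      "completion": "; ".join(f"{e} ({label})" for e, label in record["entities"]),
  }
  # row["completion"] -> "Alice (PERSON); Berlin (LOC); 2019 (DATE)"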


Why not ignore ToS? The worst that can happen is that you lose access.


The worst that can happen is you get brought into an expensive lawsuit.


Seeing NER examples pop up more frequently now, and wondering why folks don't use spaCy for those sorts of tasks.


spaCy doesn't work well for multilingual training data, and I've found it barfs in more, and somehow even odder, ways than stuff in transformers.


My line of thinking is to use the more expensive model to label data, then use a teacher/student methodology to train a smaller model (spaCy or BERT) for cost and speed.
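Roughly: have the expensive model emit character spans, then dump them into spaCy's training format and train the small model on that (sketch; the labels and file names are made up):

  import spacy
  from spacy.tokens import DocBin

  nlp = spacy.blank("en")
  db = DocBin()

  # 'silver_data' is whatever the teacher model labeled: (text, [(start, end, label), ...])
  silver_data = [("Alice moved to Berlin.", [(0, 5, "PERSON"), (15, 21, "LOC")])]

  for text, spans in silver_data:
      doc = nlp.make_doc(text)
      ents = [doc.char_span(s, e, label=l) for s, e, l in spans]
      doc.ents = [ent for ent in ents if ent is not None]  # drop misaligned spans
      db.add(doc)

  db.to_disk("train.spacy")
  # then: python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy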


I use a fine-tuned BERT-like model for NER, but I'd be interested to compare how it performs.


Disclaimer: I work for Anyscale

This blog post seems to have gotten good attention :) so we definitely plan to add it to Ray Summit: https://raysummit.anyscale.com/agenda

Please comment on this thread if you have ideas about what kind of content you'd like to see more of at Ray Summit.


> ~14 min. for 7B for 1 epoch on 3.5M tokens. ~26 min for 13B for 1 epoch.

> At least 1xg5.16xlarge for head-node and 15xg5.4xlarge for worker nodes for both 7B and 13B

For the uninitiated, anyone have an idea how much this would cost on AWS?


g5.16xlarge - $4.0960/hour

g5.4xlarge - $1.6240/hour

You're looking at about $30/hour to run this in us-east-1 (1 × $4.096 + 15 × $1.624 ≈ $28.46/hour on demand).

https://instances.vantage.sh/?selected=g5.16xlarge,g5.4xlarg...


thanks


Is it possible to fine-tune Llama-2 locally on an M1 Ultra with 64GB? Any pointers would be appreciated. Most of the guides are for the cloud or use Nvidia CUDA on Linux.


I don't think so. I have an M1 Max with 64GB and it works okay for some inference. I'm buying a few credits from RunPod; it will be a few tens of dollars to get it trained.


Has anyone had luck with fine-tuning Llama-v2-7b using the paid (€11.00) Colab Pro?




