Just to add to this, I ran through a lot of these topics around fine-tuning Llama 2 on your own dataset (for me it's my own code :P) in a coding live stream a couple of weeks ago, all on a single Colab GPU.
Fine-tuning Llama stream: https://www.youtube.com/watch?v=TYgtG2Th6fI&t=2282s
I have a couple more, including one where I do a QLoRA fine-tuning session and explain the concepts as a self-taught engineer (software engineer of 8 years, recently moving into ML).
QLoRA fine-tuning stream: https://www.youtube.com/watch?v=LitybCiLhSc&t=4584s
Overall I'm trying to break down how I'm approaching a lot of my personal projects and my current AI-driven startup. I want to make this information as accessible as possible. I also have a series where I'm fine-tuning a model to be the smallest webdev LLM possible, which people seem to be liking. I've only been streaming for about a month, and there's plenty more to come.
Ask me anything about the streams and fine-tuning Llama!
I've read that SFT is good for "leveraging existing knowledge" gained during initial pretraining, and helpful in changing the way that the model responds, but not useful for teaching it new knowledge. In your experience is that true?
For example, changing the way in which it responds could be:
- debate me
- brainstorm
- be sarcastic
Which also seems like something that could be accomplished with a system prompt or few shot examples, so I'm not sure when SFT is the more appropriate approach or what the tradeoffs are.
Alternatively, gaining new knowledge would be training it on a dataset of e.g. sports trivia to make it highly effective at answering those types of questions.
P.S. nice username... Irving Fisher would approve.
I have a RAG video (my "make a ChatGPT with podcasts" video) you might be interested in. Semantic search is incredible, and you might be surprised how good a Q/A solution can be just by extracting passages that answer the question (a minimal sketch of that retrieval step is below).
Overall it depends on whether or not you can turn your data into a fine-tuning dataset, and whether you can find a low-parameter (enough) model that can use your retrieved contexts as input, either to host yourself or to use via inference endpoints. Hosting an LLM is actually not easy, and working in an information retrieval business I'm finding that OpenAI isn't terrible compared to the cost of running GPUs for your users across the world.
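For illustration, here's a minimal sketch of the retriever half in Python using sentence-transformers; the model name and passages are placeholders, not taken from the video:

```python
# Minimal semantic-search sketch (an assumption; the video may use a
# different embedding model or a proper vector store).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "The episode covers LoRA fine-tuning on consumer GPUs.",
    "Guests discuss the history of transformer architectures.",
    "A segment on hosting open models with inference endpoints.",
]
question = "How do I fine-tune a model on a single GPU?"

# Embed the corpus once, then embed each incoming question.
corpus_emb = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)

# Cosine similarity ranks passages; the top hit feeds a reader model or an LLM prompt.
hits = util.semantic_search(query_emb, corpus_emb, top_k=1)[0]
print(passages[hits[0]["corpus_id"]], hits[0]["score"])
```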
Sorry, I don't have much experience myself yet; I'm still at the research phase. But from what I've read, it makes sense to fine-tune the model to better understand the format used for calling external tools, including a search engine (a sketch of what such a training example might look like is below).
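Purely as an illustration (this format is hypothetical, not from any particular paper or toolkit), a tool-calling fine-tuning example might pair a question with a structured call and the final answer:

```python
# Hypothetical tool-call training example; field names and tags are invented.
example = {
    "prompt": "What won Best Picture at the 2020 Oscars?",
    # Train the model to emit a structured call instead of guessing:
    "completion": '<tool>search("best picture 2020 Oscars")</tool>',
    # A second turn then trains it to read the tool's result:
    "tool_result": "Parasite won Best Picture at the 92nd Academy Awards.",
    "final_answer": "Parasite won Best Picture in 2020.",
}
```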
I wrote a simple implementation to do this in ChatGPT via a local plugin [0]. Obviously it doesn't hit the "fully private" requirement, but I imagine it would be relatively straightforward to integrate into a local LLM. The question is whether a local LLM would be as good at grabbing enough context and nuance from the project to answer meaningfully as GPT-4 is able to do with plugins.
In one of my streams I essentially build this from scratch: https://www.youtube.com/watch?v=kBB1A2ot-Bw&t=236s. It's a retriever-reader model; let me know if you want the code. I think I link the Colab in the comments, but let me know if you need more.
this is brilliant.
Could you do a series about how to prepare custom datasets for fine-tuning? That's the part that a lot of other tutorials skip.
Especially for different goals - like safety, accuracy, etc.
Well, not so much the raw data acquisition (scraping and stuff), but really the data prep for fine-tuning.
I'm hearing that each model needs it in a different format: chat fine-tuning data is different from instruct data, etc. (see the sketch of the two formats below).
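To make that concrete, here's roughly how the same example would be laid out for Llama-2-chat versus an Alpaca-style instruct model. The Llama 2 special tokens are the real ones from its chat template; the example content and record fields are just illustrative:

```python
# Llama-2-chat expects its special tokens wrapping each turn:
llama2_chat = (
    "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "Summarize this changelog. [/INST] Here is the summary... </s>"
)

# Alpaca-style instruct data is usually plain fields instead:
alpaca_instruct = {
    "instruction": "Summarize this changelog.",
    "input": "<changelog text>",
    "output": "Here is the summary...",
}
```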
Would be good to see a rigorous analysis of how these PEFT methods affect quality. There still seems to be debate about whether these methods sacrifice quality or not.
> Additionally, while this wasn’t an issue for GPT, the Llama chat models would often output hundreds of miscellaneous tokens that were unnecessary for the task, further slowing down their inference time (e.g. “Sure! Happy to help…”).
That's the problem I've been facing with Llama 2 as well. It's almost impossible to have it just output the desired text. It will always add something before and after its response. Does anyone know if there's any prompt technique to fix this problem?
It's not useful for code, but you can see the difference of approach with NovelAI's homegrown Kayra model, which is set up to handle a mix of text completion and instruct functionality. It never includes extraneous prefix/suffix text and will smoothly follow instructions embedded in text without interrupting the text.
I wonder if LLMs will have less reasoning power if they simply return the output. AFAIK, they think by writing their thoughts. So forcing an LLM to just return the goddamn code might limit its reasoning skills, leading to poor code. Is that true?
Potentially it could have an impact if it omits a high level description before writing the code, although obviously things like "Sure! Happy to help" do not help.
In practice I haven't seen it make too much of a difference with GPT. The model can still use comments to express itself.
For non-coding tasks, adding "Think step by step" makes a huge difference (versus YOLOing a single-word reply); see the example below.
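For example (the question is made up, but the pattern is the standard zero-shot chain-of-thought trick):

```python
# Zero-shot chain-of-thought: the same question, with and without the nudge.
question = "A train leaves at 3:15pm and arrives at 6:05pm. How long is the trip?"
direct_prompt = f"{question}\nAnswer with the duration only."
cot_prompt = f"{question}\nLet's think step by step."
# The CoT variant elicits intermediate reasoning before the final answer,
# which is where the accuracy gain over a one-word reply comes from.
```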
> although obviously things like "Sure! Happy to help" do not help.
Yes you're right. I'm mostly concerned with the text that actually "computes" something before the actual code begins. Niceties like "sure! happy to help" don't compute anything.
CoT indeed works. Now I've seen people take it to the extreme with tree of thoughts, forest of thoughts, etc., but I'm not sure how much "reasoning" we can extract from a model that is obviously limited in terms of knowledge and intelligence. CoT already gets us 80% of the way; with some tweaks it can get even better.
I've also seen simulation methods where GPT "agents" talk to each other to form better ideas about a subject. But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.
> But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.
Not necessarily the same thing, as you're still putting in more processing power/checking more possible paths. It's kinda like simulated annealing: sure, the system is dumb, but as long as checking whether you have a correct answer is cheap, it still narrows down the search space a lot.
Yeah, I get that. We assume there's X amount of intelligence in the LLM and try different paths to tap into that potential. The more paths are simulated, the closer we get to the LLM's intelligence asymptote. But then that's it: we can't go any further.
You can also just parse the text for all valid code blocks and combine them. I have a script which automatically checks the clipboard for this (a sketch of the parsing step is below).
There's no reason to handle it on the LLM side, unless you want to try to optimize how many tokens are code vs. comments vs. explanations and such. (Though you could also just start a new context window with only your code, or some such.)
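Not the commenter's actual script, but a minimal sketch of the parsing idea:

```python
import re

def extract_code_blocks(text: str) -> str:
    """Pull out every ```-fenced block and join them into one source file."""
    # Matches ```lang\n...\n``` fences; the language tag is optional.
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", text, flags=re.DOTALL)
    return "\n\n".join(block.strip() for block in blocks)

reply = "Sure! Happy to help...\n```python\nprint('hello')\n```\nHope that helps!"
print(extract_code_blocks(reply))  # -> print('hello')
```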
The model card also has prompt formats for context aware document Q/A and multi-CoT, using those correctly improves performance at such tasks significantly.
Llama-2-chat models have been overly fine-tuned to be like this. You can give few-shot prompting a try, but it still doesn't guarantee the desired output. The best way to guarantee it is to fine-tune on a small dataset (~1k data points) and go from there.
It depends on what your goal is, but I've had success reproducing specific output formatting by fine-tuning the base LLaMA2 models instead of the RLHF'd models. My use cases were simpler - information extraction/synthesis from text rather than creative writing. The base models might not be good fits for your task.
Prompt the model to always output answers/code within ```content``` fences or as JSON. If it's JSON, you can identify where it starts and ends and strip everything outside it (see the sketch below).
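A minimal sketch of that stripping step (the slicing heuristic is deliberately naive; it assumes a single top-level JSON object in the reply):

```python
import json

def extract_json(reply: str) -> dict:
    """Strip any chatter around the first {...} span and parse it."""
    start = reply.index("{")
    end = reply.rindex("}") + 1
    return json.loads(reply[start:end])

reply = 'Sure! Here you go: {"answer": 42} Let me know if you need more.'
print(extract_json(reply))  # -> {'answer': 42}
```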
I'm really glad to see a post like this come out. I've seen so many discussions online about customizing models -- this post really does cut through the noise.
Really like the evaluation methodology, and seems well-written as well.
It's weird that LoRA and training with quantization aren't being taken more seriously. It's way cheaper, takes less time, and a lot of evidence shows it's pretty good (a minimal setup sketch is below).
I don't think it should be brushed aside as something to be tried out later.
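To show how little code it takes, here's a minimal QLoRA-style setup sketch using Hugging Face transformers/peft/bitsandbytes; the hyperparameters are common defaults, not recommendations from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit so it fits on a single consumer/Colab GPU.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb, device_map="auto"
)

# Attach small trainable LoRA adapters; the 4-bit base weights stay frozen.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total params
```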
I'm not sure to whom he is responding, since no one is claiming LoRA performs as well as traditional fine tuning. If you click through to the original Tweet he shared, it says "when you have a lot of data and limited compute go for LoRA, while with limited data and ample compute go for full finetuning" which I think is absolutely correct and few would disagree. As these models get bigger and bigger though, fewer and fewer people are going to have the "ample compute" required for full fine tuning.
I'm not sure less data should require full fine-tuning. If I had 5 pages of text, I don't see why I'd need to train billions of parameters that are already trained pretty well on general internet knowledge and already know how to chat.
From a practical perspective, unless cost is really immaterial, I think most will end up starting with LoRA, especially for 13B or 70B models. You could do 10 LoRA fine-tuning runs for the cost of a few full fine-tunings.
But it's still all witchcraft to me to some degree, and I'd probably try both full and LoRA.
Glad to see the NER-like task performed the best, as I was just about to test something like this for comparison with a fine-tuned BERT model. Any idea about the training costs for this task?
Hey, I'm one of the co-authors of the post. The training data for ViGGO has about 5.1k rows, which we trained with a block size of 512 (you can lower the block size if you want, but we didn't because it was just easier not to change the code :)). On 16xA10Gs, 7B took ~15 min per epoch and 13B took ~25 min per epoch, so the on-demand cost per epoch is ~$7.2 for 7B and ~$12 for 13B. This is based only on time spent on the training part and doesn't account for cluster startup and shutdown time.
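Back-of-envelope, those figures imply an hourly rate of roughly $1.80 per A10G (my inference from the numbers above, not a quoted price): 16 GPUs × $1.80/hr × 0.25 hr ≈ $7.2 per 7B epoch, and 16 × $1.80 × (25/60) hr ≈ $12 per 13B epoch.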
Great question. I wish they said how long the 10 epochs took, so we could figure out the cost (or better, just posted the time and cost together):
"For the 7B and 13B models, we used 16xA10Gs, and for the 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances). When using Ray, there's no need to secure A100s to perform full-parameter fine-tuning on these models! The process is simply repeated for each task. Figures below show an example run based on a context length of 512, with a total of 3.7M effective tokens per epoch on GSM8k dataset.
We ran the training for a maximum of 10 epochs and selected the best checkpoint according to the minimum perplexity score on the validation set."
One challenge is that to get a large enough custom dataset you need either a small army or a very strong existing model, which means you probably have to use OpenAI. And using OpenAI to generate training material for another model violates their terms.
Has anyone taken them to court about this? Do we all just decide it's not fair and ignore it?
My line of thinking is to use the more expensive model to label data, then use a teacher/student methodology to train a smaller model (spaCy or BERT) for cost and speed (sketched below).
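A minimal sketch of the labeling half of that pipeline; the label set, prompt, and file name are all made up, and the call uses the current OpenAI chat-completions client:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["bug_report", "feature_request", "question"]  # hypothetical task

def label(text: str) -> str:
    """Ask the teacher model for a single label the student will learn from."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Classify as one of {LABELS}. Reply with the label only.\n\n{text}",
        }],
    )
    return resp.choices[0].message.content.strip()

# Dump teacher labels to JSONL, ready for spaCy/BERT student training.
with open("teacher_labels.jsonl", "w") as f:
    for text in ["The app crashes on login.", "Please add dark mode."]:
        f.write(json.dumps({"text": text, "label": label(text)}) + "\n")
```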
Is it possible to fine-tune Llama 2 locally on an M1 Ultra with 64GB? I'd like to know; any pointer would be good. Most tutorials are on the cloud or use Nvidia CUDA on Linux.
I don't think so. I have an M1 Max with 64GB and it works okay for some inference. I'm buying a few credits from RunPod; it'll be a few tens of dollars to get it trained.