Takeaways from hundreds of LLM finetuning experiments with LoRA (lightning.ai)
258 points by rasbt 7 months ago | 39 comments



Great work, lots of useful information here.

The only thing I wish you had done differently was to explore alpha > 2 * r.

In this blog post, the author found that alpha of 4 * r (where r=64) outperformed all smaller alphas in terms of loss when finetuning Llama-7b on databricks-dolly-15k.

https://medium.com/@drishtisharma96505/comparative-analysis-...

Additionally, you identify r=16 (with alpha = 2*r) as inferior to r=256; however, aside from arithmetic, r=16 actually outperforms all the others. And the base model outperforms every finetuned variant on both arithmetic metrics.
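For reference on how r and alpha interact: in a standard LoRA layer the low-rank delta is scaled by alpha/r, so alpha = 2*r doubles the delta and alpha = 4*r quadruples it. A minimal PyTorch sketch (illustrative only, not the article's code; the layer and parameter names are made up):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base linear layer plus a trainable low-rank delta.
        def __init__(self, in_features, out_features, r=16, alpha=32):
            super().__init__()
            self.base = nn.Linear(in_features, out_features, bias=False)
            self.base.weight.requires_grad = False   # base weights stay frozen
            self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(out_features, r))
            self.scaling = alpha / r                 # alpha=2*r -> 2.0, alpha=4*r -> 4.0

        def forward(self, x):
            # y = W x + (alpha/r) * B A x
            return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling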


It seemed to take terribly performing models and marginally improve them. None of them are fit for purpose at the end of training, so what was the point?


Science.


I'm outside my edit window but I'm surprised at the negative reaction to this.

This experimental process on small NNs is exactly how science is done. You develop and test theories of what will work using small networks because you can run them quickly.

Then you test on large, production size networks based on the lessons from the small ones.


Probably because one word replies come across as low effort. IMO your second reply is great.


It had the right amount of effort that the parent question deserved.


How would anyone know if they did not posit an idea or theory and then implement it as a test? Publish results and it may spur others to tweak it or try other approaches.


I'd like to see a writeup of LORAs for something that is readily tractable without LORAs. E.g., a pretrained ResNet34 ImageNet model that gets a LORA instead of being fine-tuned or fully re-trained. The pedagogical value is that it can be compared to the alternatives which are tractable in this setting (and which are not in an LLM setting).


Disclaimer: this is just my intuition; I don't have direct knowledge of LoRA on small models.

It's possible that this doesn't work. LoRA (for Low Rank) benefits from the "small changes" introduced during finetuning of a model: the update to the weights has low rank. With a smaller model, the update's rank may not be so low, and the low-rank compression could then degrade the metrics. I'd be interested to see whether LoRA still has a benefit in this configuration.
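One way to probe that intuition, as a rough sketch (the function and variable names are mine, not from any paper): take the weight delta between a base and a fully finetuned checkpoint and count how many singular values are needed to capture most of its energy:

    import torch

    def effective_rank(w_base, w_finetuned, energy=0.99):
        # SVD of the finetuning delta; count how many singular values are
        # needed to cover `energy` of the total squared spectrum.
        delta = w_finetuned - w_base
        s = torch.linalg.svdvals(delta)
        cum = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
        return int(torch.searchsorted(cum, energy).item()) + 1

If that count stays small relative to the matrix dimensions even for a small model, a low-rank adapter should still be a reasonable fit.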


> LoRA (for Low Rank)

Pedantic, but it is actually for Low Rank Adaptation.


The original LoRA paper does exactly this using GPT-2 and BERT-type language models. They are explicit about it being designed for LLMs, but since it seems to work really well for Stable Diffusion, it would be interesting to see metrics in the image domain.


If you write it "LORA" then people won't know if you mean "LoRA" or "LoRa".

Microsoft already created this huge confusion, so I'd recommend helping not make it worse by using the relevant capitalization.


Both this article and Microsoft are referring to the same thing (low-rank adaptation, LoRA, which I have lazily miscapitalized). What is the confusion that you think will be addressed by my fixing that?


LoRa is a LOng RAnge radio system https://lora-alliance.org/

LoRA is LOw Rank Adaptation, the LLM tuning technique


So you are saying that in the context of an article about low-rank adaptation, where I was talking about applying this to ResNets, there was actually confusion that I was referring to long range radio?


Sample size of one, but I clicked the article thinking that someone was using an LLM to "tune" the range performance of LoRa radio somehow. :(


Make that a sample size of two.


Had you heard of LoRA in the context of machine learning before?


Make that 3. I imagined somehow using an LLM to even further increase LoRa range, maybe sifting through noise.


As someone who is familiar with LoRa, but not LoRA, I was confused as to what they could possibly have been using LoRa for to finetune an LLM. The internet has a lot of acronyms, and a lot of jargon, and a lot of people doing seemingly absurd things, okay?


I'm not the one who originally commented to complain about your capitalization, but sure.

Even if not in this specific comment thread, maybe having correct capitalization here means someone else knows how to write the one they mean to talk about in another conversation later, and then it will have been helpful.

If everyone went around calling both of them LORA all the time, that would absolutely have people sometimes talking past each other about different technologies. Even with them capitalized correctly it will still happen, but not as much.

Too bad Microsoft didn't call theirs LoRaA and it would've been more obvious.


Yep, I clicked on it because I thought it was about LoRa (the radio protocol). LoRa has many capabilities and many uses, from simple signaling and telemetry data to full-on mesh networks for chat-based communication, and, well... there are many things that could be fine-tuned there :)


There is "LLM" in the title ...


Well yeah, and Meshtastic is a chat app over LoRa (radio)... why not have an LLM somewhere on LoRa?


My comment was more general: people who are aware of both, and read carefully, will spot the distinction if you capitalize it unambiguously.

They (and I) may need to do a double-take if they process it wrong, and you reduce that risk by being as clear as possible, despite Microsoft's incredibly poor naming choice.

By analogy, if I refer to MICRO soft, and their operating system, I'm sure there's no confusion to you. But it is harder to read, and may (for longer sentences) require re-reading the whole sentence.

(and, of course, the other replies you got, and other people being confused every single time there's any announcement or article about LoRA)


LoRAs are, like most fine-tuning, a spectrum.

LoRAs can be nearly the same size as the original model, with nearly the same representation capacity/trainability, or they can be a tiny tiny fraction of the original model size, with correspondingly fewer learnable parameters.

As such, they are suitable for nearly all tasks. We should be asking if they are better than regular fine-tuning, or soft-prompts (aka textual inversion), or slapping new trainable layers on the end (aka hypernetworks). The stable diffusion community seems to think that they are.
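To make that size spectrum concrete, a back-of-the-envelope sketch (the dimensions are illustrative, not taken from the article): for a single d_in x d_out weight matrix, a LoRA adds r * (d_in + d_out) trainable parameters versus d_in * d_out for full finetuning:

    # Fraction of a single weight matrix's parameters that a LoRA of rank r trains.
    def lora_param_fraction(d_in, d_out, r):
        return r * (d_in + d_out) / (d_in * d_out)

    # For a 4096x4096 projection (Llama-2 7B sized):
    for r in (8, 64, 256, 2048):
        print(r, f"{lora_param_fraction(4096, 4096, r):.4f}")
    # r=8 trains ~0.4% of the matrix; r=2048 matches the full matrix.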


This is amazing, thank you!!

> My hypothesis is that the Alpaca dataset does not contain any related arithmetic tasks, and the model actively unlearns basic arithmetic when it focuses more on other tasks.

I'm surprised this wasn't verified; it's a major benchmark stat. My eyes keep getting drawn to it because it seems to have the most variance. Does anyone know?

Also, just throwing it out there: I would love to see a Neptune/W&B view of the hyperparameter tuning :)


There's also been a recent discovery that the MMLU benchmark contains a significant percentage of answers with completely wrong ground truth, where models answering correctly lose points.

The most popular benchmarks and datasets are mostly haphazardly cobbled together with hardly any oversight or verification, sometimes even synthetically generated from GPT-3.5 et al. without checking the output at all. Frankly, it's amazing that any of it works when people blindly train and test on what's essentially self-contradicting garbage.


Oh wow, yeah, I see. A web search brings up a lot of examples.

>> As a result of an accident, Abdul lost sight in his right eye. To judge the distance of vehicles when he is driving, Abdul is able to rely on cues of

>> A. I only

>> B. II only

>> C. III only

>> D. I and II only

> You didn’t read that wrong. The question never explains what I, II or III are. This appears to have been improperly copied from crackap.com. Somehow Platypus 2 still gets the right answer with high confidence. Is this a sign it has merely memorized the answers? I checked the second best ranked model upstage/LLama-2–70b-instruct-v2 and it also somehow got the answer right (the third best Open LLM also gets this question right so I don’t know what is happening).

https://derenrich.medium.com/errors-in-the-mmlu-the-deep-lea...


We might get a hint that a model has been trained on these datasets if the model gets these questionable questions "correct".


> the model actively unlearns

Aka "catastrophic forgetting" (CF).


That Alpaca has no arithmetic tasks? Just look at the data, it’s text...
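For the curious, a rough sketch of what "looking at the data" could mean here (the file name and JSON fields are those of the original Stanford Alpaca release; treat them as assumptions if your copy differs):

    import json
    import re

    # Count Alpaca examples whose instruction/input contains arithmetic-looking expressions.
    with open("alpaca_data.json") as f:
        data = json.load(f)

    arith = re.compile(r"\d+\s*[-+*/x]\s*\d+")
    hits = [ex for ex in data
            if arith.search(ex["instruction"] + " " + ex.get("input", ""))]
    print(f"{len(hits)} of {len(data)} examples look arithmetic-related")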


claude2 tldr: Here are a few key points I gathered from the article:

The article explores optimizing LoRA hyperparameter settings for finetuning large language models. The goal is to maximize performance while minimizing memory usage and training time.

The base model used is Llama-2 7B. Experiments compare default LoRA, QLoRA (4-bit quantized), AdamW, SGD optimizers, and different choices for rank r and alpha hyperparameters.

Key findings:

QLoRA provides substantial memory savings (6GB less than default LoRA) at the cost of slower training. Performance impact is minor.

AdamW vs SGD makes little difference in memory or performance.

Increasing training iterations from 50k to 100k hurts performance, likely because the Alpaca dataset lacks diversity.

Tuning rank r and alpha is most impactful. A good rule of thumb is to set alpha=2*r. The best model uses r=256, alpha=512 and improves over the base model on most tasks, except arithmetic.

The optimized LoRA model was submitted to the NeurIPS efficiency challenge and showed improvements on several benchmarks compared to the base Llama-2 model.

Takeaways are practical tips for tuning LoRA hyperparameters and trading off memory, compute, and model performance.
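If anyone wants to try the r=256 / alpha=512 setting themselves, here is a sketch of the equivalent configuration using Hugging Face PEFT (not the Lit-GPT code the article actually uses; the dropout and target modules are my assumptions):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    config = LoraConfig(
        r=256,
        lora_alpha=512,        # alpha = 2 * r, the article's rule of thumb
        lora_dropout=0.05,     # assumption, not stated in the summary
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()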


LoRA, not LoRa.

I was VERY confused for a minute.


I was excited to see long-range radio meet machine learning.


Good article, but my god, why is there a roughly 2-second delay before every user interaction on this page? Scrolling, selecting text, everything has some kind of 2-second "bootup" time, after which things work normally; but if you stop interacting with it for a bit, it goes back into some idle mode. Really weird.


As others noted, the latency is from a whole slew of trackers that a privacy extension would block. Maybe this is a good nudge for you to go install one. It's a problem that's already widespread and will only get worse.

But on top of that, this is also one of the many fun examples of where a static, linear document that's so simple it could have been written in markdown requires megabytes of download across dozens of files.

Nobody cares about user experience right now -- it's all about developer and content management experience, so runtime bulk and responsiveness are ignored in favor of... whatever these kinds of developers think they're gaining on the production end. I often have a hard time guessing what that is without assuming the responsible devs are not-so-great.


Suggest upgrading your browsing experience by switching to Firefox with uBlock Origin.


Maybe this is why :) https://i.imgur.com/odYgMpd.png



