
LoRA finds a subset of the original weights (about 1%) which can be trained to achieve about the same result as training the whole model while using 100x less compute.

Original weights frozen = Rather than modify the original model, the training results are saved to a small file of only a few MB.

In practice this means you can fine tune a 30B parameter model on a consumer GPU in a couple of hours. Without LoRA you would need to run multiple expensive data center GPUs for days or weeks.




To be more exact, LoRA adds two matrices `A` and `B` to any layers that contain trainable weights. The original weights (`W_0`) have the shape `d × k` and are frozen. Matrix `A` has dimensions `d × <rank>` (`rank` is configurable) and matrix `B` has the shape `<rank> × k`. `A` and `B` are then multiplied together and added to `W_0` to get the altered weights. The benefit here is that the extra matrices are small compared to `W_0`, which means fewer parameters need to be optimized, so fewer activations need to be stored in memory.
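
A rough sketch of that in PyTorch, just to make the shapes concrete (the class name, initialization, and default rank here are my own illustrative choices, not something prescribed above):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W_0 (d x k) plus a trainable low-rank update A @ B."""
    def __init__(self, d: int, k: int, rank: int = 8):
        super().__init__()
        # W_0 is frozen: it never receives gradient updates.
        self.W_0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A is d x rank, B is rank x k; only these two are trained.
        # A starts at zero so the model's behavior is unchanged at step 0.
        self.A = nn.Parameter(torch.zeros(d, rank))
        self.B = nn.Parameter(torch.randn(rank, k) * 0.01)

    def forward(self, x):                      # x: (..., d)
        W = self.W_0 + self.A @ self.B         # effective d x k weight
        return x @ W                           # (..., k)
```

After training, `A @ B` can either be merged into `W_0` or shipped separately as the small file of alterations described above.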


Ah, so the resulting model contains both the large matrix of original weights, and also the two small matrices of alterations? But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

Why is fine-tuning done with separate alterations, rather than by mutating the original weights?


> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

The goal of most parameter-efficient methods is to store one gold copy of the original model, and learn minor modifications/additions to the model. The easiest way to think about this is in some kind of deployment setting, where you have 1 capable model and you learn different sets of LoRA weights for different tasks and applications.

The original intent of parameter-efficient methods is to reduce the amount of storage space needed for models (do you really want to keep a whole additional copy of LLaMA for each different task?). A secondary benefit is that because you are fine-tuning a smaller number of parameters, the optimizer states (which can take up to 2x the size of your model) are also heavily shrunk, which makes it more economical (memory-wise) to (parameter-efficient) fine-tune your model.
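
A back-of-the-envelope illustration of that storage argument (the layer count, dimensions, and rank below are made-up, LLaMA-7B-ish numbers, not figures from any paper):

```python
n_layers, d, k, rank = 32, 4096, 4096, 8

full_copy = n_layers * d * k            # extra weights per task if you fork one projection matrix per layer
lora_copy = n_layers * rank * (d + k)   # A (d x rank) + B (rank x k) per layer instead

print(f"{full_copy:,} vs {lora_copy:,} parameters per task ({full_copy // lora_copy}x smaller)")
# => 536,870,912 vs 2,097,152 (256x smaller), and the Adam optimizer states
#    (roughly two extra values per trained parameter) shrink by the same factor.
```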


That’s probably what OpenAI does with their custom fine tuned models, no?


> But this is smaller than the alternative of a model which contains the large matrix of original weights, and an equally large matrix of alterations.

It's actually larger. If you just have two equally large matrices of the same dimension, one original, and one of "alterations"... then you can just add them together.

> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?

Then you'd have to compute the gradients for the whole network, which is very expensive when the model has 7b, 65b, 165b parameters. The intent is to make that cheaper by only computing gradients for a low rank representation of the change in the weight matrix from training.
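
In PyTorch terms this amounts to flipping `requires_grad` so that only the low-rank matrices are trainable; a minimal, self-contained sketch (the parameter names are arbitrary):

```python
import torch
import torch.nn as nn

# Toy stand-in for one layer: a frozen base weight plus LoRA matrices.
params = nn.ParameterDict({
    "base_W": nn.Parameter(torch.randn(512, 512)),
    "lora_A": nn.Parameter(torch.zeros(512, 8)),
    "lora_B": nn.Parameter(torch.randn(8, 512) * 0.01),
})

# Freeze everything except the LoRA matrices.
for name, p in params.named_parameters():
    p.requires_grad = name.startswith("lora_")

# The optimizer only sees (and keeps Adam state for) the trainable parameters,
# so gradient buffers and optimizer state scale with the LoRA size, not the full model.
trainable = [p for p in params.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```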


>Then you'd have to compute the gradients for the whole network

You have to do that with LoRA regardless, to compute the gradients for the lowest-level LoRA weights.


Correct me if I'm wrong, but I think you still need to compute gradients of non-trained weights in order to compute the gradients of the LoRA weights. What you don't have to do is store and update the optimizer state for all those non-trained weights.


I mean the derivative of a constant is 0. So if all of the original weights are considered constants, then computing their gradients is trivial, since they’re just zero.


Computing gradients is easy/cheap. What this technique solves is that you no longer need to store the computed values of the gradient until the backpropagation phase, which saves on expensive GPU RAM, allowing you to use commodity hardware.


It's larger, but there are fewer parameters to train for your specific use case, since you are training only the small matrices while the original ones remain unaltered.


Can rank decomposition be used to reduce the original weight matrices as well? Or are they assumed to be compressed already?


Those fully trained networks are usually considered full-rank. At least that is what they say in the LoRA paper.


Your explanation is crystal clear. I suppose it works well in practice, but is there any reason it works that well?


Per the original paper, empirically it's been found that neural network weights often have low intrinsic rank. It follows, then, that the change in the weights as you train also has low intrinsic rank, which means you should be able to represent it with a lower-rank matrix.
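
One way to see concretely what "low intrinsic rank" buys you (a numpy sketch; the "weight update" here is synthetically constructed to be nearly low-rank, which real updates only approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1024, 1024, 8

# A synthetic weight update that is genuinely rank-8, plus a little noise.
delta_W = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))
delta_W += 0.01 * rng.standard_normal((d, k))

# The singular values collapse after the first 8, so a rank-8 factorization
# A @ B captures delta_W using 2*8*1024 numbers instead of 1024*1024.
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
A, B = U[:, :r] * S[:r], Vt[:r, :]
rel_err = np.linalg.norm(delta_W - A @ B) / np.linalg.norm(delta_W)
print(S[:10].round(1), rel_err)   # sharp drop after index 7; relative error well under 1%
```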


Since we are in ELI5, it seems that the concept of low rank approximation is required to understand this method.

(1) https://en.wikipedia.org/wiki/Low-rank_approximation

Edited: By the way, it seems to me that there is an error in the Wikipedia page: if the low-rank approximation uses a larger rank, the bound on the error should decrease, but on that page the error increases.


>> that the change in the weights as you train also have low intrinsic rank

It seems that the initial matrix of weights W has a low-rank approximation A, which implies that the difference E = W - A is small. It also seems that PCA fails when E is sparse, because PCA is designed to be optimal when the error is Gaussian.


As for PCA: it's also quite expensive computationally, and you'd probably have to do SVD instead.

Since the weights are derived from gradient descent, yeah we don't really know what the distributions would be.

A random projection empirically works quite well for very high dimensions, and is of course very cheap computationally.
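
For what it's worth, the random-projection point is easy to check numerically (numpy sketch with arbitrary dimensions): pairwise distances between high-dimensional points survive a random projection surprisingly well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 100, 10_000, 256                      # 100 points in 10k dims, projected down to 256 dims

X = rng.standard_normal((n, d))
P = rng.standard_normal((d, r)) / np.sqrt(r)    # random projection, scaled to roughly preserve norms
Y = X @ P

i, j = 3, 42
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))   # typically within ~10% of each other
```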


Does this mean the matrices are highly compressible?


Kinda/yes. To translate to more intuitive concepts: the matrices don't spread their variance across as many degrees of freedom as they could.

Think of a point cloud of a piece of paper floating in the wind. It would be a 3xn list of points, but "really" it's a 2d piece of paper.

Just like I can rewrite the number 27 as 3×3×3 or 8+19 or (2^3)+(2^4)+3... Given a single matrix, one can find myriad ways to rewrite it as a sequence of matrices that have the same (or similar) numeric value, but with interesting or desirable properties. :D

My favorite example (which is used in signal processing) is to take your ugly matrix and rewrite it as a set of smaller matrices where most of the elements are zero, or a power of 2.

It turns out, computers can multiply by zeros and powers of two very fast.
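
The paper-in-the-wind analogy, made concrete (numpy sketch; the sheet is idealized as perfectly flat, whereas a real fluttering page would only be approximately 2-D):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# n points sampled on a flat 2-D sheet, then embedded into 3-D with a random rotation.
sheet_2d = rng.uniform(size=(n, 2))
basis_3d = np.linalg.qr(rng.standard_normal((3, 3)))[0][:, :2]   # orthonormal 3x2 map
cloud_3d = sheet_2d @ basis_3d.T                                 # looks like 3-D data, stored as n x 3

# SVD of the centered cloud: only two non-negligible singular values,
# i.e. the "3-D" data is really 2-D -- the same sense in which a weight matrix can be low rank.
_, S, _ = np.linalg.svd(cloud_3d - cloud_3d.mean(axis=0), full_matrices=False)
print(S)   # something like [6.4, 6.2, 1e-15]: the third direction carries no information
```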


>In practice this means you can fine tune a 30B parameter model on a consumer GPU in a couple of hours.

Consumer GPU, yes, but in practice LoRA doesn't actually reduce training time. What it mainly reduces is memory requirements. In fact, LoRA training can often require more training steps than full fine-tuning and therefore be slower (you can imagine why this is the case: the optimization is trying to modify the model's behavior through a smaller number of parameters, and so has a harder job).


Modern PEFT methods with LoRA actually do reduce training time by orders of magnitude.

Here's an example of 20 seconds per epoch on a single consumer GPU: https://github.com/johnsmith0031/alpaca_lora_4bit/issues/7#i...


It’s actually as low as 0.01% of the original weights.

From the LoRA paper:

> When the pre-trained model is GPT-3 175B, the number of trainable parameters |Θ| can be as small as 0.01% of |Φ_0|.
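
That number is easy to sanity-check against GPT-3 175B's published shape (d_model = 12288, 96 layers); the rank and the choice of adapting only the query and value projections below are assumptions for illustration, roughly in line with the paper's setups:

```python
total_params = 175e9
budget = 1e-4 * total_params                 # 0.01% => about 17.5M trainable parameters

d_model, n_layers, r = 12288, 96, 4
per_layer = 2 * (2 * r * d_model)            # A + B for both the query and value projections
print(budget, per_layer * n_layers)          # ~17.5M budget vs ~18.9M at r=4 on W_q and W_v
```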


Is this the same as or similar to the Lottery Ticket concept from a few years ago?


Is this the same as Knowledge Distillation (teacher-student training)?


> about the same result

Can you qualify this? Is it still useful or not?


>Is it still useful or not?

Is what still useful? A LoRA is about as good and useful as a full fine-tune. If you have unlimited storage space to store them, or unlimited compute to make them, then I would still prefer full fine-tunes. But the difference is marginal and generally not worth the storage space or increased compute costs for individuals.



