
> With some kind of streaming-from-flash architecture you might be in the realm already.

I thought mmap'ing models to only keep the currently needed pieces in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58B is, it should still be okay-ish.
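To make that concrete, here is a minimal numpy sketch of the mmap idea (the file name and shapes are made up; real runtimes like llama.cpp mmap an entire GGUF file rather than one raw matrix):

    import numpy as np

    d_in, d_out = 4096, 11008  # hypothetical layer shape
    # mode="r" keeps the weights on disk; the OS pages in only what gets touched
    W = np.memmap("layer0.weight.f16", dtype=np.float16, mode="r",
                  shape=(d_out, d_in))

    def matvec(a: np.ndarray) -> np.ndarray:
        # Reading W faults pages in from flash on demand; cold pages
        # never occupy RAM and can be evicted under memory pressure.
        return W @ a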




There is a more detailed paper from Apple on this. Basically, you can do a little bit better than only keeping current weights in RAM with mmap.

For an LLM you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. has many zeros), you don't need all the columns of W to do the matrix-vector multiplication: a cleverly arranged W can make sure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n, :] @ a` for `i in range(0, N, n)`, so that while the previous b[i:i+n] is still computing, you already have visibility into which columns of the next matrix need to be loaded (see the sketch below).
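A rough numpy sketch of both ideas (W is shown as a plain in-memory array for clarity; in the flash setup it would be the memory-mapped tensor, and the shard size n is arbitrary):

    import numpy as np

    def sparse_matvec(W: np.ndarray, a: np.ndarray) -> np.ndarray:
        # b = W @ a only needs the columns of W where a is non-zero,
        # so with a memory-mapped W only those columns get read from flash.
        nz = np.flatnonzero(a)
        return W[:, nz] @ a[nz]

    def sharded_matvec(W: np.ndarray, a: np.ndarray, n: int = 1024) -> np.ndarray:
        # Row sharding: each shard of n rows produces b[i:i+n] independently,
        # so while one shard is still computing you can already inspect which
        # columns of the next matrix will be needed and start prefetching them.
        N = W.shape[0]
        b = np.empty(N, dtype=W.dtype)
        for i in range(0, N, n):
            b[i:i+n] = W[i:i+n, :] @ a
        return b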


You need all of the model in RAM to perform the matmul that gets you the next token from it. There's no shortcut.


I'm not sure what use that is, other than to maintain the KV cache across requests.



