
> With some kind of streaming-from-flash architecture you might be in the realm already.

I thought mmap'ing models to only keep the currently needed pieces in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58B is, it should still be okay-ish.
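To make that concrete, here is a minimal numpy sketch of the mmap idea (the file name and shapes are made up; real runtimes like llama.cpp mmap an entire GGUF file rather than one raw matrix):

    import numpy as np

    d_in, d_out = 4096, 11008  # hypothetical layer shape
    # mode="r" keeps the weights on disk; the OS pages in only what gets touched
    W = np.memmap("layer0.weight.f16", dtype=np.float16, mode="r",
                  shape=(d_out, d_in))

    def matvec(a: np.ndarray) -> np.ndarray:
        # Reading W faults pages in from flash on demand; cold pages
        # never occupy RAM and can be evicted under memory pressure.
        return W @ a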




There is a more detailed paper from Apple on this. Basically, you can do a little bit better than only keeping current weights in RAM with mmap.

For an LLM you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. has many zeros), you don't need all the columns of W to do the matrix-vector multiplication: a cleverly arranged W can make sure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n, :] @ a` for `i in range(0, N, n)`, so that while the previous b[i:i+n] is still computing, you already have visibility into which columns of the next matrix need to be loaded (see the sketch below).
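A rough numpy sketch of both ideas (W is shown as a plain in-memory array for clarity; in the flash setup it would be the memory-mapped tensor, and the shard size n is arbitrary):

    import numpy as np

    def sparse_matvec(W: np.ndarray, a: np.ndarray) -> np.ndarray:
        # b = W @ a only needs the columns of W where a is non-zero,
        # so with a memory-mapped W only those columns get read from flash.
        nz = np.flatnonzero(a)
        return W[:, nz] @ a[nz]

    def sharded_matvec(W: np.ndarray, a: np.ndarray, n: int = 1024) -> np.ndarray:
        # Row sharding: each shard of n rows produces b[i:i+n] independently,
        # so while one shard is still computing you can already inspect which
        # columns of the next matrix will be needed and start prefetching them.
        N = W.shape[0]
        b = np.empty(N, dtype=W.dtype)
        for i in range(0, N, n):
            b[i:i+n] = W[i:i+n, :] @ a
        return b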


You need all of the model in RAM to perform the matmul that gets you the next token from it. There's no shortcut.


I'm not sure what use that is, other than to maintain the KV cache across requests.



