If your filesystem is fast enough, you can dynamically load and unload chunks of the model as you use it.



I don't think that works for LLMs. My understanding is that generating every single token requires running floating-point operations against all x-billion parameters, so the whole model needs to be resident in memory at all times for it to work.


Nope, you can use mmap to map it into virtual memory, and then you don't have to hold the whole thing in RAM at once. I have spent the last 4-5 weeks working on this and optimising it.

You can see some information here: https://justine.lol/mmap/
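The core of the technique is just mapping the weights file read-only and letting the kernel's page cache do the rest. A minimal sketch in C, not llama.cpp's actual code: "model.bin" is a placeholder name, and the file is treated as raw float32 purely for illustration.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only. No copy is made up front: the
           kernel pages chunks in from disk on first access and can evict
           them again under memory pressure. */
        const float *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);  /* the mapping stays valid after the fd is closed */

        /* Assumes the file begins with raw float32 data, for illustration. */
        printf("first weight: %f\n", weights[0]);

        munmap((void *)weights, st.st_size);
        return 0;
    }

Nothing is read from disk until the inference loop actually touches a page, which is where the fast-startup win comes from.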


How does that work? Does it mean that for every token generated it has to page areas of disk into RAM and then back out again?


You'd still have to have enough RAM to hold the whole model, or performance will be awful. mmap is just a way to get faster startup (if the mmap'd file matches the in-memory representation exactly) and easier sharing (the mapped region can be shared read-only memory that multiple processes use).
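The sharing point is worth spelling out: because the mapping is file-backed and read-only, every process mapping the same file is backed by the same physical pages in the kernel's page cache. A sketch, reusing the hypothetical "model.bin" from above, with error handling omitted for brevity:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* same placeholder file */
        struct stat st;
        fstat(fd, &st);
        const char *w = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);

        /* Parent and child both read through the same mapping. The kernel
           backs both with the same physical pages, so N processes cost
           roughly one model's worth of RAM, not N. */
        if (fork() == 0) {
            printf("child  sees byte 0 = %d\n", w[0]);
            _exit(0);
        }
        printf("parent sees byte 0 = %d\n", w[0]);
        wait(NULL);
        return 0;
    }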


You can shuffle it back and forth between disk and memory. It's slow, but it works.

There are people working on compute-in-memory hardware, like flash chips that can do matrix multiplication in place. None of it is close to reaching the market, but there's a lot more interest now that neural networks are obviously useful.
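And you don't have to manage that shuffling yourself: the kernel's demand paging does it, and madvise() lets you hint at it. A Linux-flavored sketch against the hypothetical read-only mapping from above (the function name is made up for illustration):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Given a file-backed, read-only mapping `w` of length `len`
       (page-aligned, as mmap returns): */
    void paging_hints(void *w, size_t len) {
        /* Ask the kernel to prefetch the pages before a pass over the weights. */
        madvise(w, len, MADV_WILLNEED);

        /* ... touch the weights here; the kernel pages them in from disk ... */

        /* Tell the kernel these pages can be dropped. Because the mapping is
           file-backed and read-only, nothing is written back, and the next
           access simply re-reads them from disk. */
        madvise(w, len, MADV_DONTNEED);
    }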



