If your filesystem is fast enough, you can dynamically load and unload chunks of the model as you use it.



I don't think that works for LLMs. My understanding is that generating every single token requires running floating-point operations against all x-billion parameters, so the whole model needs to be resident in memory at all times for it to work.


Nope, you can use mmap to map it into virtual memory, and then you don't have to hold the whole thing in RAM at once. I have spent the last 4-5 weeks working on this and optimising it.

You can see some information here: https://justine.lol/mmap/
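The core of the technique is just mapping the weights file read-only and letting the kernel's page cache do the rest. A minimal sketch in C, not llama.cpp's actual code: "model.bin" is a placeholder name, and the file is treated as raw float32 purely for illustration.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only. No copy is made up front: the
           kernel pages chunks in from disk on first access and can evict
           them again under memory pressure. */
        const float *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);  /* the mapping stays valid after the fd is closed */

        /* Assumes the file begins with raw float32 data, for illustration. */
        printf("first weight: %f\n", weights[0]);

        munmap((void *)weights, st.st_size);
        return 0;
    }

Nothing is read from disk until the inference loop actually touches a page, which is where the fast-startup win comes from.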


How does that work? Does it mean that for every token generated it has to page areas of disk into RAM and then back out again?


You'd still have to have enough RAM to hold the whole model, or performance will be awful. mmap is just a way to get faster startup (if the mmap'd file matches the in-memory representation exactly) and easier sharing (the mapped region can be shared read-only memory that multiple processes use).
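The sharing point is worth spelling out: because the mapping is file-backed and read-only, every process mapping the same file is backed by the same physical pages in the kernel's page cache. A sketch, reusing the hypothetical "model.bin" from above, with error handling omitted for brevity:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* same placeholder file */
        struct stat st;
        fstat(fd, &st);
        const char *w = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);

        /* Parent and child both read through the same mapping. The kernel
           backs both with the same physical pages, so N processes cost
           roughly one model's worth of RAM, not N. */
        if (fork() == 0) {
            printf("child  sees byte 0 = %d\n", w[0]);
            _exit(0);
        }
        printf("parent sees byte 0 = %d\n", w[0]);
        wait(NULL);
        return 0;
    }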


You can shuffle it back and forth between disk and memory. It's slow, but it works.

There are people working on compute-in-memory hardware, like flash chips that can do matrix multiplication in place. None of it is close to reaching the market, but there's a lot more interest now that neural networks are obviously useful.
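And you don't have to manage that shuffling yourself: the kernel's demand paging does it, and madvise() lets you hint at it. A Linux-flavored sketch against the hypothetical read-only mapping from above (the function name is made up for illustration):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Given a file-backed, read-only mapping `w` of length `len`
       (page-aligned, as mmap returns): */
    void paging_hints(void *w, size_t len) {
        /* Ask the kernel to prefetch the pages before a pass over the weights. */
        madvise(w, len, MADV_WILLNEED);

        /* ... touch the weights here; the kernel pages them in from disk ... */

        /* Tell the kernel these pages can be dropped. Because the mapping is
           file-backed and read-only, nothing is written back, and the next
           access simply re-reads them from disk. */
        madvise(w, len, MADV_DONTNEED);
    }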



