
Ah, gotcha! I thought you probably meant something else. I've been wondering this too, and it's something I've been meaning to look at.

On a related note, it doesn't seem like many local runners are leveraging techniques like PagedAttention yet (see https://vllm.ai/), which is inspired by operating-system memory paging and reduces the memory requirements of LLM inference.

It's not quite what you mentioned, but it might have a similar effect! Would love to know if you've seen other methods that might help reduce memory requirements; it's one of the largest resource bottlenecks for running LLMs right now!
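
Roughly, the idea (as I understand it from vLLM) is to stop reserving one contiguous KV-cache buffer per sequence sized for the maximum context length, and instead hand out fixed-size blocks from a shared pool as tokens actually arrive, with a per-sequence block table mapping logical positions to physical blocks. A minimal sketch of that allocation scheme in Python (all names and sizes are made up for illustration, not vLLM's actual API):

    import numpy as np

    BLOCK_SIZE = 16     # tokens per KV-cache block
    NUM_BLOCKS = 1024   # total blocks in the shared pool
    HEAD_DIM = 128      # illustrative per-token KV width

    # One shared physical pool instead of a max-length buffer per sequence.
    kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)
    free_blocks = list(range(NUM_BLOCKS))

    class Sequence:
        def __init__(self):
            self.block_table = []  # logical block index -> physical block index
            self.num_tokens = 0

        def append_kv(self, kv_vector):
            # A new physical block is claimed only when the previous one fills up,
            # so memory grows with the actual sequence length, not the max context.
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(free_blocks.pop())
            block = self.block_table[self.num_tokens // BLOCK_SIZE]
            kv_pool[block, self.num_tokens % BLOCK_SIZE] = kv_vector
            self.num_tokens += 1

    # Usage: seq = Sequence(); seq.append_kv(np.random.rand(HEAD_DIM))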




That's a clever one, I had not seen that yet, thank you.

The hint for me is that the models compress so well; that suggests the information content is much lower than the size of the uncompressed model indicates, which is a good reason to investigate which parts of the model are so compressible and why. I haven't looked at the raw data of these models, but maybe I'll give it a shot. Sometimes you can learn a lot about the structure (built in or emergent) of data just by staring at the dumps.
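
One low-effort way to start staring: zlib-compress each tensor's raw bytes and compare the ratios across layers. A toy sketch, assuming a PyTorch checkpoint containing a state_dict that you can load locally (the path is hypothetical):

    import zlib
    import torch

    # Hypothetical path; point it at any local checkpoint holding a state_dict.
    state_dict = torch.load("model.pth", map_location="cpu")

    for name, tensor in state_dict.items():
        t = tensor
        if t.dtype == torch.bfloat16:
            t = t.view(torch.int16)  # reinterpret the bytes so numpy can handle them
        raw = t.numpy().tobytes()
        ratio = len(zlib.compress(raw)) / len(raw)
        print(f"{name:60s} compressed to {ratio:.2f} of original size")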


That's quite interesting. I hadn't thought of sparsity in the weights as a way to compress models, although this is an obvious opportunity in retrospect! I started doing some digging and found https://github.com/SqueezeAILab/SqueezeLLM, although I'm sure there's newer work on this idea.
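
For a quick sense of how much naive sparsity is there, one could count the fraction of weights below some small magnitude threshold per tensor. A rough sketch, assuming a local PyTorch checkpoint (path and threshold are arbitrary placeholders; SqueezeLLM itself, as I understand it, isolates a small set of outlier weights rather than just zeroing small ones):

    import torch

    state_dict = torch.load("model.pth", map_location="cpu")  # hypothetical path
    THRESHOLD = 1e-2  # arbitrary cutoff for "near zero"

    for name, tensor in state_dict.items():
        if tensor.dtype.is_floating_point:
            near_zero = (tensor.abs() < THRESHOLD).float().mean().item()
            print(f"{name:60s} {near_zero:.1%} of weights below {THRESHOLD}")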




