
Yes, CXL will soon benefit from PCIe Gen 7 x16, expected to deliver 64GB/s in 2025, and non-HBM high-bandwidth I/O alternatives are improving by the day. For most near-real-time LLM inference it will be feasible. The majority of SME companies and other DIY users (humans or ants) running localized LLMs should have no issues [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].

[1] Forget ChatGPT: why researchers now run small AIs on their laptops:

https://news.ycombinator.com/item?id=41609393

[2] Welcome to LLMflation – LLM inference cost is going down fast:

https://a16z.com/llmflation-llm-inference-cost/

[3] New LLM optimization technique slashes memory costs up to 75%:

https://news.ycombinator.com/item?id=42411409




I have been working on my own local inference software:

https://github.com/ryao/llama3.c/blob/master/run.c

First, CXL is useless as far as I am concerned.

The smaller LLM stuff in [1] and [2] is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.

Finally, the 75% savings figure in [3] is misleading. It applies to the context, not to the model weights themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
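
To put that distinction in rough numbers, here is a back-of-envelope sketch (the model dimensions are assumed, loosely 8B-class, and are not taken from [3]): a 75% cut to the context cache saves around a gigabyte, while the weights are untouched.

    /* Back-of-envelope: weight memory vs. KV-cache (context) memory.
       All dimensions are assumed, roughly 8B-class; adjust to taste. */
    #include <stdio.h>

    int main(void) {
        const double GiB = 1024.0 * 1024.0 * 1024.0;

        /* assumed: 8e9 params at 2 bytes each (fp16/bf16) */
        double weight_bytes = 8e9 * 2.0;

        /* assumed KV cache: 32 layers, 8 KV heads, head dim 128,
           8192-token context, 2 bytes per value, K and V */
        double kv_bytes = 2.0 * 32 * 8 * 128 * 8192 * 2.0;

        printf("weights      : %6.2f GiB\n", weight_bytes / GiB);
        printf("KV cache     : %6.2f GiB\n", kv_bytes / GiB);
        printf("75%% of cache: %6.2f GiB saved, weights untouched\n",
               0.75 * kv_bytes / GiB);
        return 0;
    }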


No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7 x16 = 256GB/s), which is 4x less than the memory bandwidth of my two-year-old pleb GPU (1TB/s), which is 10x less than a state-of-the-art professional GPU (10TB/s), which is what the cloud services will be using.

That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
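
To make those ratios concrete, here is a rough sketch of the decode-speed ceiling they imply (the 70GB model size is an assumed example, and this ignores KV-cache traffic): a memory-bandwidth-bound decoder streams roughly the whole model once per generated token, so tokens/s is capped at bandwidth divided by model size.

    /* Rough upper bound on decode throughput when memory-bandwidth-bound:
       each generated token streams (roughly) all weights once, so
       tokens/s <= bandwidth / model_bytes. Model size is an assumed example. */
    #include <stdio.h>

    int main(void) {
        double model_gb = 70.0;   /* assumed: ~70B params at 8 bits each */
        double bw_gbps[]    = {64.0, 256.0, 1000.0, 10000.0};
        const char *label[] = {"64 GB/s (claimed upthread)",
                               "256 GB/s (Gen7 x16)",
                               "1 TB/s (consumer GPU)",
                               "10 TB/s (datacenter GPU)"};

        for (int i = 0; i < 4; i++)
            printf("%-28s -> at most %6.1f tok/s\n",
                   label[i], bw_gbps[i] / model_gb);
        return 0;
    }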

Nothing says it can't be useful. My most-used model is running on a microcontroller. Just keep those expectations tempered.

(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)


It is reportedly 242GB/sec due to overhead:

https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_7.0
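
The arithmetic behind the raw vs. effective figures, as a quick sketch (128 GT/s per lane is the PCIe 7.0 signaling rate; the 242GB/s is the Wikipedia number quoted above):

    /* PCIe 7.0 x16, per direction: raw vs. reported effective bandwidth. */
    #include <stdio.h>

    int main(void) {
        double gt_per_lane = 128.0;                  /* GT/s, PCIe 7.0 */
        int lanes = 16;
        double raw_gbs = gt_per_lane * lanes / 8.0;  /* 256 GB/s raw */
        double effective_gbs = 242.0;                /* figure quoted above */

        printf("raw      : %.0f GB/s\n", raw_gbs);
        printf("effective: %.0f GB/s (%.1f%% of raw)\n",
               effective_gbs, 100.0 * effective_gbs / raw_gbs);
        return 0;
    }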



