
Yes, CXL will soon benefit from PCIe Gen 7 x16, expected to deliver 64GB/s in 2025, and non-HBM high-bandwidth I/O alternatives are improving by the day. For most near-real-time LLM inference it will be feasible. The majority of SME companies and other DIY users (humans or ants) running localized LLMs should have no issues [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].

[1] Forget ChatGPT: why researchers now run small AIs on their laptops:

https://news.ycombinator.com/item?id=41609393

[2] Welcome to LLMflation – LLM inference cost is going down fast:

https://a16z.com/llmflation-llm-inference-cost/

[3] New LLM optimization technique slashes memory costs up to 75%:

https://news.ycombinator.com/item?id=42411409




I have been working on my own local inference software:

https://github.com/ryao/llama3.c/blob/master/run.c

First, CXL is useless as far as I am concerned.

The smaller LLM stuff in [1] and [2] is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.

Finally, the 75% savings figure in [3] is misleading. It applies to the context, not to the model weights themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
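
To put that distinction in rough numbers, here is a back-of-envelope sketch (the model dimensions are assumed, loosely 8B-class, and are not taken from [3]): a 75% cut to the context cache saves around a gigabyte, while the weights are untouched.

    /* Back-of-envelope: weight memory vs. KV-cache (context) memory.
       All dimensions are assumed, roughly 8B-class; adjust to taste. */
    #include <stdio.h>

    int main(void) {
        const double GiB = 1024.0 * 1024.0 * 1024.0;

        /* assumed: 8e9 params at 2 bytes each (fp16/bf16) */
        double weight_bytes = 8e9 * 2.0;

        /* assumed KV cache: 32 layers, 8 KV heads, head dim 128,
           8192-token context, 2 bytes per value, K and V */
        double kv_bytes = 2.0 * 32 * 8 * 128 * 8192 * 2.0;

        printf("weights      : %6.2f GiB\n", weight_bytes / GiB);
        printf("KV cache     : %6.2f GiB\n", kv_bytes / GiB);
        printf("75%% of cache: %6.2f GiB saved, weights untouched\n",
               0.75 * kv_bytes / GiB);
        return 0;
    }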


No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7 x16 = 256GB/s), which is 4x less than the memory bandwidth of my two-year-old pleb GPU (1TB/s), which is 10x less than a state-of-the-art professional GPU (10TB/s), which is what the cloud services will be using.

That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
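
To make those ratios concrete, here is a rough sketch of the decode-speed ceiling they imply (the 70GB model size is an assumed example, and this ignores KV-cache traffic): a memory-bandwidth-bound decoder streams roughly the whole model once per generated token, so tokens/s is capped at bandwidth divided by model size.

    /* Rough upper bound on decode throughput when memory-bandwidth-bound:
       each generated token streams (roughly) all weights once, so
       tokens/s <= bandwidth / model_bytes. Model size is an assumed example. */
    #include <stdio.h>

    int main(void) {
        double model_gb = 70.0;   /* assumed: ~70B params at 8 bits each */
        double bw_gbps[]    = {64.0, 256.0, 1000.0, 10000.0};
        const char *label[] = {"64 GB/s (claimed upthread)",
                               "256 GB/s (Gen7 x16)",
                               "1 TB/s (consumer GPU)",
                               "10 TB/s (datacenter GPU)"};

        for (int i = 0; i < 4; i++)
            printf("%-28s -> at most %6.1f tok/s\n",
                   label[i], bw_gbps[i] / model_gb);
        return 0;
    }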

Nothing says it can't be useful. My most-used model is running on a microcontroller. Just keep those expectations tempered.

(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)


It is reportedly 242GB/sec due to overhead:

https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_7.0
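
The arithmetic behind the raw vs. effective figures, as a quick sketch (128 GT/s per lane is the PCIe 7.0 signaling rate; the 242GB/s is the Wikipedia number quoted above):

    /* PCIe 7.0 x16, per direction: raw vs. reported effective bandwidth. */
    #include <stdio.h>

    int main(void) {
        double gt_per_lane = 128.0;                  /* GT/s, PCIe 7.0 */
        int lanes = 16;
        double raw_gbs = gt_per_lane * lanes / 8.0;  /* 256 GB/s raw */
        double effective_gbs = 242.0;                /* figure quoted above */

        printf("raw      : %.0f GB/s\n", raw_gbs);
        printf("effective: %.0f GB/s (%.1f%% of raw)\n",
               effective_gbs, 100.0 * effective_gbs / raw_gbs);
        return 0;
    }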



