Correct. The main bottleneck with LLM inference is, and has always been, memory...

		dannyw 3 days ago | on: GLM-5.2 – How to Run Locally
		Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth. TPS ≈ your memory bandwidth in GB/s / active weights in GB. That’s it for decode. That’s all.
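
To put numbers on that rule of thumb, here is a minimal sketch. It assumes every active weight is read from memory once per generated token and ignores KV-cache reads, compute, and other overhead, so the result is a rough ceiling rather than a measurement. The function names and the example figures (32B active parameters, 4-bit weights, 800 GB/s) are made up for illustration and are not taken from the thread or from any specific model or device.

```python
def weights_size_gb(active_params_billions: float, bytes_per_param: float) -> float:
    """Size of the weights touched per token, in GB (1e9 params * bytes / 1e9)."""
    return active_params_billions * bytes_per_param


def estimate_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on decode tokens/sec for a memory-bandwidth-bound model.

    Assumption: each token requires streaming all active weights once,
    so tokens/sec is bandwidth divided by bytes read per token.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Hypothetical example: 32B active parameters at 0.5 bytes each (4-bit).
    size = weights_size_gb(active_params_billions=32, bytes_per_param=0.5)
    tps = estimate_decode_tps(bandwidth_gb_s=800.0, weights_gb=size)
    print(f"{size:.1f} GB active weights -> about {tps:.1f} tokens/sec at best")
```

For a mixture-of-experts model the relevant size is the active parameters per token, not the total, which is why the estimate takes active weights as its input.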