
I think quantization (e.g. 4-bit, https://arxiv.org/abs/2212.09720) and sparsity (e.g. SparseGPT, https://arxiv.org/abs/2301.00774) will bring down inference cost.

Edit: This isn’t handwaving, btw; the point is that some fairly decent solutions are already available today.
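To make the quantization half concrete, here is a rough NumPy sketch of plain group-wise round-to-nearest 4-bit weight quantization. It's only the basic idea, not the exact method from either linked paper, and the group size and toy tensor are arbitrary choices of mine:

    import numpy as np

    def quantize_4bit(w, group_size=64):
        # Absmax round-to-nearest 4-bit quantization of a weight vector.
        # Weights are split into groups; each group keeps one float scale
        # plus 4-bit integer codes in [-8, 7], so storage drops ~4x vs fp16.
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per group
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_4bit(q, scale):
        # Reconstruct approximate float weights at inference time.
        return q * scale

    # Toy example: 4096 random weights, check the reconstruction error.
    w = np.random.randn(4096).astype(np.float32)
    q, scale = quantize_4bit(w)
    w_hat = dequantize_4bit(q, scale).reshape(-1)
    print("mean abs error:", np.abs(w - w_hat).mean())

The real methods add things like better rounding and outlier handling, but even this naive version shows where the memory (and hence cost) savings come from.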
