CPUs/GPUs are general purpose; if enough workload demand exists, specialized Transformer cores will be designed. Likewise, it's not at all clear that current O(N^2) self-attention is the ideal setup for larger context lengths. All to say, I'd believe we have another 8-10x algorithmic improvement in inference costs over the next 10 years, in addition to whatever Moore's law brings.
They quantized the model from 16 bits to 4 bits, which was low-hanging fruit, and it looks like they can't quantize it further to 2 bits.
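For a rough sense of why 2-bit is so much harder than 4-bit, here's a toy symmetric linear quantizer in NumPy (illustrative only; real schemes use per-group scales and calibration, not a single per-tensor scale like this):

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Map floats onto integer levels in [-(2^(bits-1)-1), 2^(bits-1)-1]
    # using a single per-tensor scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

errs = {}
for bits in (8, 4, 2):
    q, s = quantize_symmetric(w, bits)
    errs[bits] = float(np.abs(w - dequantize(q, s)).mean())
    print(f"{bits}-bit mean abs error: {errs[bits]:.4f}")
```

Each bit you drop roughly doubles the quantization step size, so going 4 -> 2 bits blows up the rounding error much faster than 16 -> 4 did relative to the weights' noise floor.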