
With bigger 7B and 8B models, the battery life goes from over a day to a few hours on my iPhone 15 Pro.

The 8B model nominally works on 6GB phones, but it's quite slow on them. OTOH, it's very usable on iPhone 15 Pro / Pro Max devices and even better on M1/M2 iPads.
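A rough back-of-envelope calculation (my own assumptions: 4-bit quantization, i.e. ~0.5 bytes per parameter, plus roughly 10% overhead for embeddings and quantization scales; real footprints vary by format) shows why an 8B model is a tight fit on a 6GB phone:

```python
# Back-of-envelope estimate of weight memory for quantized LLMs.
# Assumes 4-bit quantization (0.5 bytes/param) plus ~10% overhead
# for embeddings and quantization scales; actual sizes vary by format.

def weight_memory_gb(params_billions: float,
                     bits_per_weight: float = 4.0,
                     overhead: float = 0.10) -> float:
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8) * (1 + overhead)
    return bytes_total / 1e9

print(f"7B @ 4-bit: ~{weight_memory_gb(7):.1f} GB")  # ~3.9 GB
print(f"8B @ 4-bit: ~{weight_memory_gb(8):.1f} GB")  # ~4.4 GB
```

On a 6GB phone, ~4.4 GB of weights plus the KV cache and whatever the OS reserves leaves very little headroom, which is consistent with the sluggishness noted above.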

Every framework (llama.cpp, MLX, and mlc-llm, which is the one I use) only uses the GPU. Using the ANE, and perhaps the undocumented AMX coprocessor, for efficient decoder-only transformer inference is still an open problem. I've made some early progress on quantised inference using the ANE, but there are still a lot of issues to solve before it's even demo-ready, let alone a shipping product.



Super interesting, thank you!



