The only thing I wish you had done differently is explore alpha > 2 * r. In this blog post, the author found that alpha = 4 * r (with r = 64) outperformed all smaller alphas in terms of loss when fine-tuning Llama-7b on databricks-dolly-15k:

https://medium.com/@drishtisharma96505/comparative-analysis-...

Additionally, you identify r=16 (with alpha = 2*r) as inferior to r=256, but outside of the arithmetic tasks, r=16 actually outperforms all the other ranks. And on both arithmetic metrics, the base model outperforms every fine-tuned variant.
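For anyone unfamiliar with why alpha and r are discussed as a ratio: in standard LoRA, the low-rank update is scaled by alpha / r before being added to the base weights. A minimal numpy sketch (an illustration of the scaling rule, not the blog post's experiment; the matrices and dimensions here are made up) shows why alpha = 2*r behaves identically across ranks, and why alpha = 4*r simply doubles the update's contribution:

```python
import numpy as np

def lora_delta(x, A, B, alpha, r):
    """Contribution of a LoRA adapter: (alpha / r) * x @ A @ B."""
    return (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(0)
d, r = 8, 4  # toy dimensions, purely illustrative
x = rng.normal(size=(1, d))
A = rng.normal(size=(d, r))  # down-projection
B = rng.normal(size=(r, d))  # up-projection

base = lora_delta(x, A, B, alpha=2 * r, r=r)     # scale = 2, for any r
boosted = lora_delta(x, A, B, alpha=4 * r, r=r)  # scale = 4

# alpha = 4*r is exactly a 2x stronger update than alpha = 2*r
assert np.allclose(boosted, 2 * base)
```

So holding alpha = 2*r fixed while sweeping r only changes the rank of the update, whereas raising alpha beyond 2*r (as in the linked post) also amplifies how strongly the adapter perturbs the base model.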