The only thing I wish you had done differently is explore alpha > 2 * r. In this blog post, the author found that alpha = 4 * r (with r = 64) outperformed all smaller alphas in terms of loss when fine-tuning Llama-7b on databricks-dolly-15k:

https://medium.com/@drishtisharma96505/comparative-analysis-...

Additionally, you identify r=16 (with alpha = 2*r) as inferior to r=256, but outside of the arithmetic tasks, r=16 actually outperforms all the other ranks. And on both arithmetic metrics, the base model outperforms every fine-tuned variant.
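For anyone unfamiliar with why alpha and r are discussed as a ratio: in standard LoRA, the low-rank update is scaled by alpha / r before being added to the base weights. A minimal numpy sketch (an illustration of the scaling rule, not the blog post's experiment; the matrices and dimensions here are made up) shows why alpha = 2*r behaves identically across ranks, and why alpha = 4*r simply doubles the update's contribution:

```python
import numpy as np

def lora_delta(x, A, B, alpha, r):
    """Contribution of a LoRA adapter: (alpha / r) * x @ A @ B."""
    return (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(0)
d, r = 8, 4  # toy dimensions, purely illustrative
x = rng.normal(size=(1, d))
A = rng.normal(size=(d, r))  # down-projection
B = rng.normal(size=(r, d))  # up-projection

base = lora_delta(x, A, B, alpha=2 * r, r=r)     # scale = 2, for any r
boosted = lora_delta(x, A, B, alpha=4 * r, r=r)  # scale = 4

# alpha = 4*r is exactly a 2x stronger update than alpha = 2*r
assert np.allclose(boosted, 2 * base)
```

So holding alpha = 2*r fixed while sweeping r only changes the rank of the update, whereas raising alpha beyond 2*r (as in the linked post) also amplifies how strongly the adapter perturbs the base model.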