Did you run the fine-tuning on LLaMA yourselves using the 52k examples from Alpaca? Or is there a pre-trained 7B Alpaca model out there that you grabbed?
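For context, this is roughly what I mean by "fine-tuning on the 52k examples" — a minimal sketch using the Hugging Face `transformers`/`datasets` stack. The `tatsu-lab/alpaca` dataset mirror, the model path, the abridged prompt template, and all hyperparameters here are my own placeholders, not anything taken from your repo:

```python
# Minimal supervised fine-tuning sketch on the 52k Alpaca instruction records.
# Model path, prompt template, and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "huggyllama/llama-7b"  # placeholder path to LLaMA-7B weights

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# The 52k instruction/input/output records released with Stanford Alpaca.
data = load_dataset("tatsu-lab/alpaca", split="train")

def to_example(ex):
    # Abridged Alpaca-style prompt: instruction, optional input, then response.
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex["input"]:
        prompt += f"### Input:\n{ex['input']}\n\n"
    prompt += f"### Response:\n{ex['output']}"
    return tokenizer(prompt, truncation=True, max_length=512)

data = data.map(to_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("alpaca-7b-sft", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-5, fp16=True),
    train_dataset=data,
    # mlm=False makes this plain causal-LM training (labels = shifted inputs).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```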
Can you comment on the '8bit version' mentioned above? Does that mean the parameters are uint8s (converted from the original float16 params)? Looking at your PyTorch code, I see some float16 declarations.
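For reference, here is what I'm guessing the 8-bit version might mean — a minimal sketch of bitsandbytes-style int8 loading through `transformers`. The model path is a placeholder, and I don't know whether this is actually how you quantized; if it is, the float16 declarations would just be the non-quantized pieces (layer norms, compute dtype):

```python
# Sketch of LLM.int8()-style loading: Linear weights stored as signed int8,
# everything else (and the compute) kept in float16 — which would explain
# seeing float16 declarations alongside an "8bit" checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",      # placeholder model path
    load_in_8bit=True,          # bitsandbytes int8 quantization of Linear layers
    torch_dtype=torch.float16,  # dtype for the non-quantized parameters
    device_map="auto",
)

# Shows a mix of torch.int8 (quantized weights) and torch.float16 (the rest).
print({p.dtype for p in model.parameters()})
```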
I've been running alpaca.cpp 13B locally, and your 7B model performs much better than it does. I had assumed this was because alpaca.cpp converts the weights from float16 down to 4 bits, but is there some other fine-tuning you're doing that might also account for the better performance of chatLLaMA over alpaca.cpp?
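To make sure I'm comparing the right things, here is a rough NumPy sketch of the kind of blockwise 4-bit quantization alpaca.cpp/ggml applies. This is my approximation of a Q4_0-style scheme, not the exact ggml code, and it's only meant to illustrate how much precision is dropped relative to float16:

```python
# Approximate Q4_0-style quantization: split weights into blocks, keep one
# float scale per block, and round each weight to a 4-bit signed level.
import numpy as np

def quantize_q4(weights, block_size=32):
    w = weights.astype(np.float32).reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # per-block scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

# Fake float16 weights, just to measure the rounding error introduced.
w = np.random.randn(4096 * 32).astype(np.float16)
q, s = quantize_q4(w)
err = np.abs(dequantize_q4(q, s) - w.astype(np.float32)).mean()
print(f"mean abs quantization error: {err:.4f}")
```

So my assumption was that this rounding error alone explains the quality gap, but I'd be curious whether additional fine-tuning on your side is also a factor.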