Yeah, I mention this in the post, but this variant of LLaMA isn't storing any of the conversation in memory, so it doesn't have context on prior questions. You're starting fresh with each prompt. We have some ideas for how to improve this though... more soon :)
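To make it concrete, the usual way to fake memory on top of a stateless model is to keep the transcript client-side and prepend it to every new prompt. A minimal sketch (`generate` is a stand-in for whatever actually calls the model, not part of our API):

```python
def generate(prompt: str) -> str:
    """Placeholder for a stateless call to the model."""
    raise NotImplementedError


class Conversation:
    def __init__(self):
        self.turns: list[tuple[str, str]] = []  # (user, assistant) pairs

    def ask(self, question: str) -> str:
        # Rebuild the full transcript so the model "sees" earlier turns,
        # even though each call to generate() is stateless.
        history = "".join(
            f"User: {u}\nAssistant: {a}\n" for u, a in self.turns
        )
        answer = generate(history + f"User: {question}\nAssistant:")
        self.turns.append((question, answer))
        return answer
```

The catch is that the transcript eventually exceeds the model's context window, so at some point you have to truncate or summarize older turns.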
Heads up: the UI isn't usable in iOS Safari when the URL bar is set to the bottom. The bar covers the text area and send button, and if you scroll down to reveal them, the page jumps back up and hides them again.
Did you run the fine-tuning on LLaMA yourselves using the 52k examples from Alpaca? Or is there an already fine-tuned 7B Alpaca model out there that you grabbed?
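For context, the 52k Alpaca examples are instruction/input/output records that the Stanford repo formats into a fixed prompt template before fine-tuning. Roughly like this (a sketch of the Alpaca template; whether chatLLaMA trained on the same format is exactly what I'm asking):

```python
# Sketch of how the Stanford Alpaca repo formats its 52k
# instruction/input/output examples into training prompts.

def format_alpaca_example(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```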
Can you comment on the '8-bit version' mentioned above? Does that mean the parameters are uint8s (converted from the original float16 params)? Looking in your PyTorch code, I see some float16 declarations.
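For reference, the common "8-bit" scheme I'm picturing is absmax quantization to signed int8 (rather than uint8) with a float scale, dequantized back to float16 at compute time, which would explain float16 declarations alongside an "8-bit" model. Something like this sketch; whether that's what you're actually doing is my question:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-tensor absmax scale: the largest-magnitude weight maps to +/-127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_fp16(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Weights are *stored* as int8 but expanded to float16 for compute.
    return (q.to(torch.float32) * scale).to(torch.float16)

w = torch.randn(1024, 1024)          # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_fp16(q, scale)
print("mean round-trip error:", (w - w_hat.float()).abs().mean().item())
```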
I've been running alpaca.cpp 13B locally, and your 7B model performs much better than it does. I had assumed this was because alpaca.cpp converts the weights from float16 down to 4 bits, but is there some other fine-tuning you're doing that might also account for chatLLaMA outperforming alpaca.cpp?
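For what it's worth, here's a rough sketch of why I suspected the 4-bit conversion: round-tripping weights through a generic absmax quantizer at 4 bits loses noticeably more precision than at 8 bits. (Illustration only; alpaca.cpp's actual format uses grouped quantization with per-block scales, which fares better than this.)

```python
import torch

def roundtrip_error(w: torch.Tensor, bits: int) -> float:
    # Symmetric absmax quantization at an arbitrary bit width.
    qmax = 2 ** (bits - 1) - 1                # 7 for 4-bit, 127 for 8-bit
    scale = w.abs().max() / qmax
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax)
    return (w - q * scale).abs().mean().item()

w = torch.randn(1024, 1024)
print("4-bit mean error:", roundtrip_error(w, 4))   # noticeably larger
print("8-bit mean error:", roundtrip_error(w, 8))
```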