
Could someone give a breakdown of why Dolly 2 is so much more difficult to run than Llama via llama.cpp?



Could just be timing. Dolly was announced yesterday; Llama was announced by FB, and it took perhaps a week or two for llama.cpp to appear.


Will llama.cpp give the same results as Llama? And how is it so much easier to run?


llama.cpp is just a self-contained, redistributable C++ program. The original Llama release depended on a Python machine-learning toolchain IIRC, so there are a lot more dependencies to install.


Hey there! I'm one of the folks working on Dolly. Dolly-V2 is based on the GPT-NeoX architecture. llama.cpp is a really cool library built to optimize execution of Facebook's Llama architecture on CPUs, and as such it doesn't support other architectures at this time, from what I understand. Llama features most of the tricks used in GPT-NeoX (and probably more), so I can't imagine it would be a heavy lift to add support for NeoX and GPT-J to the library.
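
To give a flavour of what "supporting an architecture" means here: the forward pass has to be spelled out by hand against ggml's C API, one operator at a time. A toy sketch of a single linear layer plus activation, based on my rough understanding of the early-2023 ggml API (sizes and values are made up, this is not actual llama.cpp code):

    #include <stdio.h>
    #include "ggml.h"

    // Toy forward pass: y = gelu(W x + b), built as a ggml compute graph.
    int main(void) {
        struct ggml_init_params params = {
            .mem_size   = 16 * 1024 * 1024,  // scratch arena for tensors + graph
            .mem_buffer = NULL,
        };
        struct ggml_context * ctx = ggml_init(params);

        const int n_in = 8, n_out = 8;
        struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_in);
        struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_in, n_out);
        struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_out);

        // In a real model these would be filled from the converted weights.
        for (int i = 0; i < n_in;         i++) ggml_get_data_f32(x)[i] = 1.0f;
        for (int i = 0; i < n_in * n_out; i++) ggml_get_data_f32(W)[i] = 0.01f;
        for (int i = 0; i < n_out;        i++) ggml_get_data_f32(b)[i] = 0.0f;

        // Every architecture is a hand-written chain of such operators.
        struct ggml_tensor * y =
            ggml_gelu(ctx, ggml_add(ctx, ggml_mul_mat(ctx, W, x), b));

        struct ggml_cgraph gf = ggml_build_forward(y);
        gf.n_threads = 1;
        ggml_graph_compute(ctx, &gf);

        printf("y[0] = %f\n", ggml_get_data_f32(y)[0]);
        ggml_free(ctx);
        return 0;
    }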

We couldn't build on Llama because we wanted a model that can be used commercially, and the Llama weights aren't licensed for that kind of use.


llama.cpp is just a frontend for the GGML tensor library, which requires models to be converted (and optionally quantized) to the GGML format.

Of course people are working on that for Dolly as well: https://huggingface.co/snphs/dolly-v2-12b-q4
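
For a sense of what the "optionally quantized" step buys you: weights get stored in blocks of 4-bit integers sharing one scale per block, roughly in the spirit of GGML's Q4 formats. A minimal sketch (this is not the exact Q4_0 bit layout; the struct and names are illustrative):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define QK 32  // elements per block; GGML's Q4 formats also use 32

    // One block: a shared fp32 scale + 32 values packed as 4-bit nibbles.
    typedef struct {
        float   d;           // scale
        uint8_t qs[QK / 2];  // two 4-bit values per byte
    } block_q4;

    // Quantize 32 floats down to one scale plus a nibble each,
    // i.e. from 32 bits per weight to roughly 5 bits per weight.
    static void quantize_block(const float *x, block_q4 *b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; i++)
            amax = fmaxf(amax, fabsf(x[i]));
        b->d = amax / 7.0f;  // map [-amax, amax] onto integers [-7, 7]
        const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i += 2) {
            int q0 = (int)roundf(x[i]     * id) + 8;  // bias into 1..15
            int q1 = (int)roundf(x[i + 1] * id) + 8;
            b->qs[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }

    static float dequantize_one(const block_q4 *b, int i) {
        uint8_t byte = b->qs[i / 2];
        int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        return (q - 8) * b->d;
    }

    int main(void) {
        float x[QK];
        for (int i = 0; i < QK; i++) x[i] = sinf((float)i);
        block_q4 b;
        quantize_block(x, &b);
        for (int i = 0; i < 4; i++)
            printf("x[%d] = %+.4f  ->  %+.4f\n", i, x[i], dequantize_one(&b, i));
        return 0;
    }

The accuracy loss per weight is small, but the model shrinks by roughly 6x and fits in far less RAM, which is a big part of why llama.cpp is so easy to run on ordinary machines.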


Besides converting Dolly's weights to GGML, it's also necessary to implement the model using ggml operators, right? Or does the GGML format also carry the model's architecture with it?
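
As far as I can tell, the file only stores a magic number, the hyperparameters, the vocab, and the (possibly quantized) tensors, not the compute graph, so yes, the forward pass still has to be written against ggml's operators. A sketch of peeking at such a header (field names and counts are my assumptions; they vary per converter script):

    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s model.bin\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        uint32_t magic = 0;
        if (fread(&magic, sizeof magic, 1, f) != 1) { fclose(f); return 1; }
        printf("magic: 0x%08x\n", magic);  // e.g. 0x67676d6c, "ggml" little-endian

        // Hypothetical hyperparameter block: a few int32s follow the magic
        // (think n_vocab, n_embd, n_head, n_layer, ftype, ...).
        int32_t hp[5];
        if (fread(hp, sizeof(int32_t), 5, f) == 5)
            for (int i = 0; i < 5; i++)
                printf("hparam[%d] = %d\n", i, hp[i]);

        // After this come the vocab entries and the tensor data, each tensor
        // tagged with a name, shape, and type. There is no operator graph
        // anywhere in the file: the architecture itself lives in code.
        fclose(f);
        return 0;
    }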



