
Have a repo anywhere?

It's just the same llama.cpp repo everyone else is using. You just git clone it to your Android phone in Termux, run make, and you're done. https://github.com/ggerganov/llama.cpp

Assuming you have the model file downloaded (you can use wget to download it), these are the instructions to install and run:

pkg install git
pkg install cmake
pkg install build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
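A note on that last step: a bare make -j launches an unbounded number of parallel compile jobs, which can exhaust RAM on a phone. A safer sketch, assuming nproc is available (it ships with Termux's coreutils package):

```shell
# nproc prints the number of available processing units.
nproc
# Cap the parallel build at that count instead of leaving -j unbounded:
#   make -j"$(nproc)"
```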


Yeah, I’ve already been running llama.cpp locally, but haven’t found it to perform at the level attested in the comment (a 30B model as a chatbot on commodity hardware). 13B runs okay, but inference is generally too slow to do anything useful on my MacBook. I wondered what you might be doing to get usable performance in that context.

You can change the number of threads llama.cpp uses with the -t argument. By default it uses only 4. For example, if your CPU has 16 physical cores, you can run ./main -m model.bin -t 16

16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bottlenecks, so 32 cores is not twice as fast as 16 cores, unfortunately.
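To pick a reasonable -t value on a given machine, nproc is a quick check. Note it reports logical CPUs, so on a chip with SMT/hyperthreading the physical-core count the comment above refers to may be half of what it prints:

```shell
# Print the number of logical CPUs; use this (or half of it, on SMT
# chips) as the -t argument, e.g.  ./main -m model.bin -t 8
nproc
```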

Thanks! Will test that out!
