Better output than the smaller llamas in my limited testing, but it's surprisingly slow:

Output generated in 101.74 seconds (0.98 tokens/s, 100 tokens, context 82, seed 532878022)

Output generated in 515.46 seconds (0.99 tokens/s, 511 tokens, context 27, seed 660997525)

According to nvidia-smi, the GPU stalls at ~130 W (out of ~470 W max) power draw, ~25% GPU utilization and ~10% memory bandwidth utilization. There's a fair amount of traffic on the PCI bus though, and the Python process sits at a steady 100% of one core. Is the GPU possibly limited by something handled in Python? Pausing the GPU-accelerated video decoding of a Twitch stream gives a surprisingly large boost:

Output generated in 380.42 seconds (1.34 tokens/s, 511 tokens, context 26, seed 648992918)
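
For what it's worth, the figures in these log lines can be sanity-checked with a few lines of Python. This is only a sketch: the regex is my own guess at the log format based on the three lines quoted here, and the names `parse_line` and `tokens_per_second` are made up for illustration. It recomputes tokens/s from the token count and wall time, then compares the run with the stream paused against the earlier 511-token run, which works out to roughly 35% faster.

```python
import re

# The three log lines quoted in this post.
LOG_LINES = [
    "Output generated in 101.74 seconds (0.98 tokens/s, 100 tokens, context 82, seed 532878022)",
    "Output generated in 515.46 seconds (0.99 tokens/s, 511 tokens, context 27, seed 660997525)",
    "Output generated in 380.42 seconds (1.34 tokens/s, 511 tokens, context 26, seed 648992918)",
]

# Assumed format, inferred only from the lines above.
PATTERN = re.compile(
    r"Output generated in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens"
)


def parse_line(line):
    """Return (seconds, tokens, reported_rate) for one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return float(match.group(1)), int(match.group(3)), float(match.group(2))


def tokens_per_second(line):
    """Recompute throughput from the token count and elapsed time."""
    seconds, tokens, _reported = parse_line(line)
    return tokens / seconds


rates = [tokens_per_second(line) for line in LOG_LINES]

# Relative gain of the run with the stream paused over the earlier 511-token run.
speedup = rates[2] / rates[1] - 1.0

for line, rate in zip(LOG_LINES, rates):
    _seconds, _tokens, reported = parse_line(line)
    print(f"recomputed {rate:.2f} tokens/s (log says {reported:.2f})")
print(f"gain with the stream paused: {speedup:.1%}")
```

The two 511-token runs have nearly the same context size (27 vs 26), so comparing them directly seems fair, but it is still a single sample of each.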