
ExLlama still uses a bit less VRAM than anything else out there: https://github.com/turboderp/exllama#new-implementation - this is sometimes significant: in my personal experience it can run full context on a quantized llama-33b model on a 24GB GPU, a setup that OOMs with other inference engines.
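
For a sense of why 24GB is right on the edge, here's a rough back-of-the-envelope estimate in Python. The model dimensions are the published llama-33b hyperparameters; the per-group quantization overhead and the fp16 KV-cache layout are my own assumptions, not numbers taken from ExLlama:

  # Back-of-the-envelope VRAM estimate for a 4-bit llama-33b at full (2048) context.
  GIB = 1024 ** 3

  n_params   = 32.5e9   # llama-33b parameter count
  n_layers   = 60
  hidden     = 6656
  seq_len    = 2048     # full context
  bits       = 4        # 4-bit quantized weights
  group_size = 128      # 128g: scale/zero metadata per 128 weights

  weight_bytes   = n_params * bits / 8
  # assumed ~4 bytes of scale/zero metadata per group of 128 weights
  quant_overhead = n_params / group_size * 4
  # fp16 K and V caches: 2 tensors * layers * seq_len * hidden * 2 bytes
  kv_cache       = 2 * n_layers * seq_len * hidden * 2

  total = weight_bytes + quant_overhead + kv_cache
  print(f"weights      : {weight_bytes / GIB:5.1f} GiB")
  print(f"quant scales : {quant_overhead / GIB:5.1f} GiB")
  print(f"kv cache     : {kv_cache / GIB:5.1f} GiB")
  print(f"total (min)  : {total / GIB:5.1f} GiB + activations, CUDA context, fragmentation")

That comes out to roughly 19 GiB before activations, the CUDA context, and allocator fragmentation, so an engine that isn't careful with the remaining ~5 GiB of headroom tips over into OOM.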

oobabooga recently did a direct perplexity comparison across various engines and quant formats: https://oobabooga.github.io/blog/posts/perplexities/

On wikitext, for llama-13b, the perplexity of a q4_K_M GGML model on llama.cpp was within 0.3% of a 4-bit 128g desc_act GPTQ model on ExLlama, so the two are basically interchangeable in terms of output quality.
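
For reference, "perplexity" here is just exp(mean negative log-likelihood per token) over the test set, evaluated in fixed-size windows. A minimal sketch of that computation, assuming a plain fp16 Hugging Face checkpoint rather than either quantized runtime (and not oobabooga's exact harness):

  import torch
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "huggyllama/llama-13b"   # placeholder: any causal LM checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id, torch_dtype=torch.float16, device_map="auto"
  )

  test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
  ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

  window = 2048                        # evaluation context length
  nlls, n_tokens = [], 0
  for start in range(0, ids.size(1) - 1, window):
      chunk = ids[:, start : start + window].to(model.device)
      with torch.no_grad():
          # labels=chunk returns mean cross-entropy over the window
          loss = model(chunk, labels=chunk).loss
      nlls.append(loss * (chunk.size(1) - 1))  # un-average to a sum of NLLs
      n_tokens += chunk.size(1) - 1

  ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
  print(f"perplexity: {ppl.item():.2f}")

Exact tokenization and windowing differ between llama.cpp's and oobabooga's harnesses, which is part of why absolute perplexity numbers are only comparable within a single benchmark run.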

There are some new quantization formats being proposed, like AWQ, SpQR, and SqueezeLLM, that perform slightly better, but none of them have been implemented in any mainstream inference engine yet (the SqueezeLLM paper is the most recent and includes comparisons against AWQ and SpQR if you want to read more: https://arxiv.org/pdf/2306.07629.pdf)
