You can make up for it by learning more parameters, even though each parameter is stored at a lower resolution. The tradeoff evidently works out favorably down to about 4 bits; I'm basing this on the results reported in the k-bit inference scaling laws paper by Tim Dettmers and Luke Zettlemoyer, see Figure 1 here [1].

[1] https://arxiv.org/abs/2212.09720
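
For intuition, here's a rough back-of-the-envelope sketch (my own illustration with an assumed 24 GB budget, not numbers from the paper): at a fixed memory budget, lower-precision weights let you fit proportionally more parameters, and the scaling-law result is that the bigger low-bit model tends to win down to roughly 4 bits.

    # Back-of-the-envelope illustration: how many parameters fit in a fixed
    # memory budget at a given weight precision. (Assumed 24 GB budget.)
    def params_that_fit(budget_gb: float, bits_per_param: float) -> float:
        bytes_total = budget_gb * 1024**3
        return bytes_total / (bits_per_param / 8)

    budget_gb = 24  # e.g. a single 24 GB GPU
    for bits in (16, 8, 4, 3):
        print(f"{bits}-bit: ~{params_that_fit(budget_gb, bits) / 1e9:.1f}B parameters")
    # 16-bit: ~12.9B, 8-bit: ~25.8B, 4-bit: ~51.5B, 3-bit: ~68.7B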

Has there been any quantitative benchmark of this?


I tried to find the article I saw that did exactly that, but couldn't. Empirically, though, if you look at the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) and add the "precision" column to the data, you can see that the GPTQ/4-bit/8-bit quants still handily beat the smaller models run at full precision. The downside is that there's no 3-bit option on the submission page, so we can't easily gauge how those are doing, but all my anecdotal personal experience with 3-bit has been extremely disappointing. Exllamav2 might have bridged that gap a bit. Again, I wish I could find that article for you; it laid all this out and showed a huge perplexity dropoff below 4-bit.

Here's a reddit post showing the 2.5-bit (exllamav2) quant performing very poorly, at least: https://www.reddit.com/r/LocalLLaMA/comments/16mif47/compari...
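
If anyone wants to run that kind of comparison themselves, here's a minimal sketch using the standard transformers API (the model names and the wikitext_sample variable are just illustrative placeholders, not taken from the reddit post): perplexity is exp of the mean next-token cross-entropy over a held-out text, so lower is better.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model_id: str, text: str) -> float:
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            # With labels=ids, the model returns the mean next-token cross-entropy.
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    # Illustrative comparison: a 4-bit quant of a larger model vs. a smaller fp16 model.
    # (wikitext_sample is a placeholder string of held-out evaluation text.)
    # print(perplexity("TheBloke/Llama-2-13B-GPTQ", wikitext_sample))
    # print(perplexity("meta-llama/Llama-2-7b-hf", wikitext_sample))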
