
I think you need more evidence than this paper (which is very short and light on actual numbers) to be this shocked.

For example, most of the plots in the paper are of throughput, memory, and so on: performance characteristics that are, of course, better on the ternary version.

The only tables that contain perplexities are Tables 1 and 2. There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA LLM in various sizes" on the RedPajama data set. The first thing to note is that the perplexities are quite high: they're all at least ~9.9, compared with, for example, quantized LLaMA on wikitext-2 at 6.15 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but that's a big gap.
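
For a sense of scale, perplexity is just exp of the mean per-token negative log-likelihood, so the gap between ~9.9 and 6.15 works out to roughly 0.48 nats per token. A quick back-of-the-envelope in Python (nothing to do with the paper's eval code, just the definition):

    import math

    def perplexity(token_nlls):
        # token_nlls: per-token negative log-likelihoods in nats;
        # perplexity is exp of the mean.
        return math.exp(sum(token_nlls) / len(token_nlls))

    print(perplexity([2.29] * 4))  # ~9.9, the paper's ballpark
    print(perplexity([1.82] * 4))  # ~6.2, roughly the wikitext-2 figure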

I think probably their benchmark (their "reproduced FP16 LLaMA LLM") is just not very good. They didn't invest much in training their baseline and so they handily beat it.




Thank you. I think the paper as it stands provides enough evidence to support the claims. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to keep the comparison apples-to-apples. That's sensible, and it makes the results easier to replicate. That said, I agree with you that more extensive testing, after more extensive pretraining, is still necessary.


And that's true, but why do they limit it to 100B tokens? And why not include the loss curves to show that both models had converged? What this paper doesn't demonstrate, to me, is the model's ability to scale and generalize to bigger datasets. It's easy to see how a model of sufficient size could overcome the quantization bottleneck when trained on such a small dataset, which is perhaps why the smaller variants failed.
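
For what it's worth, to make "quantization bottleneck" concrete: the b1.58 weights come from something like the absmean scheme described in the BitNet papers, where each weight matrix is scaled by its mean absolute value, rounded, and clipped to {-1, 0, +1}. A rough numpy sketch of that idea (my paraphrase, not the authors' code):

    import numpy as np

    def absmean_ternary(w, eps=1e-5):
        # Scale by the mean absolute weight, then round and clip so every
        # entry lands in {-1, 0, +1}; gamma is kept as a per-matrix scale.
        gamma = np.abs(w).mean() + eps
        return np.clip(np.round(w / gamma), -1, 1), gamma

    w = np.random.randn(4, 4).astype(np.float32)
    w_q, gamma = absmean_ternary(w)
    # the forward pass then uses gamma * w_q in place of w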




