However, they trained their models from scratch, which is also why they only have meaningful numbers for the 700M, 1.3B, 3B, and 3.9B models. Apparently they are following BitNet's approach of replacing linear layers with quantized layers during training? If it were trivial to convert existing models without performance loss, I would have expected them to include a benchmark of that somewhere in the paper to generate even more impact.
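For anyone curious what "quantized linear layers" means here: a minimal sketch of the absmean ternary quantization that BitNet b1.58 describes, where weights are restricted to {-1, 0, +1} plus one per-matrix scale. The function name and toy shapes are my own; this is an illustration, not the paper's implementation (which also handles activation quantization and straight-through gradients during training).

```python
import numpy as np

def absmean_ternary_quantize(w, eps=1e-6):
    # Scale by the mean absolute value of the weight matrix ("absmean"),
    # then round and clip each entry to the ternary set {-1, 0, +1}.
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q, scale  # ternary weights + scale for dequantization

# Toy example: quantize a small weight matrix and use it as a linear layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
x = rng.normal(size=(4,)).astype(np.float32)
# The matmul now only needs additions/subtractions of x's entries,
# which is where the claimed speed and memory wins come from.
y = scale * (w_q @ x)
```

The point is that after quantization each weight carries ~1.58 bits (log2 of 3 states), so converting a pretrained full-precision model this way naively would throw away most of the information in the weights, which is presumably why they train from scratch.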
You're both right. I skimmed the paper and saw the large-model numbers, but didn't notice they were only for speed. On the HF page they say those models are still being trained:
"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."
Yes. I wonder, then, how long before someone who does have a lot of compute, like OpenAI/MS or others, can rapidly pivot and try this out on even larger models.
Doesn't this mean that the current big players can rapidly expand their models by huge multiples in size?