
I'd be VERY cautious about being excited here.

My priors are like this:

1. Initial training of a neural network moves all weights around a large amount at first.

2. Later training of the network adjusts them a small amount.

3. An undertrained network will therefore look a lot like figuring out "positive, negative, or 0?" for each weight during early training.

If all these things are true, then

1. Early training of an fp16 network and of a bitnet with a 0 state added will give roughly similar results

2. Later training will yield different / worse results, as the network gets into the 'fine tuning' part of the training.

I think the paper's stats back these priors up -- they say "this works on (3B+) large networks, but not small ones." They then imply there's something about the structure of a large network that allows a bitnet to do well. It seems more likely to me it works on large networks because they have not put the compute into 3B+ networks to get past the 'gross tuning' phase.

The networks they do have the compute to train 'fully' -- the smaller ones -- don't show those results.

Also, a quick reminder that a perplexity of 12 is really terrible. You would not want to use such a network. Hopefully I'm wrong and we can get something for free here! But, I'm cautious-to-skeptical.
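
For concreteness, here's roughly what "positive, negative, or 0?" per weight means as a quantizer. A minimal sketch assuming the absmean-style scaling the paper describes for BitNet b1.58; the helper name is mine:

  import torch

  def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
      # Scale by the mean absolute value, then snap every entry
      # to the nearest of {-1, 0, +1}.
      scale = w.abs().mean().clamp(min=eps)
      q = (w / scale).round().clamp(-1, 1)
      return q, scale

  w = torch.randn(4, 4) * 0.02
  q, scale = ternary_quantize(w)
  print(q)          # entries are -1.0, 0.0, or 1.0
  print(q * scale)  # coarse reconstruction of w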




Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.


I was pretty unhappy and suspicious for the same reason. Reporting efficiency numbers for a 70B network but not its perplexity suggests someone ran the evaluation and the result wasn't good enough to put in the paper.


According to the author, the 70B model is not fully trained.


"Is not fully trained" can also mean "we did not figure out how to reach an acceptable loss" or "training was unstable," both of which are common for ML systems.


It probably just means that the model is not fully trained, because it is very expensive to train a 70B model; not even Mamba or RWKV have a model that comes close to that size. The leeriness is kinda silly, honestly.


Extraordinary claims require extraordinary evidence.

That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.


> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?


Yes. The optimizer keeps a higher-precision copy, so training is likely slower and requires more memory than an equivalent full-precision model. I'd also imagine it takes a multiple of the epochs to match one full-precision epoch, because the forward pass needs several goes to settle on the right choice among three states, rather than just moving a little bit in the right direction.
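
A minimal sketch of that arrangement (quantization-aware training with a straight-through estimator; this reflects my reading of the setup, not the authors' code):

  import torch
  import torch.nn as nn

  class TernaryLinear(nn.Module):
      def __init__(self, in_features, out_features):
          super().__init__()
          # self.weight is the full-precision latent copy that the
          # optimizer actually updates.
          self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

      def forward(self, x):
          scale = self.weight.abs().mean().clamp(min=1e-5)
          q = (self.weight / scale).round().clamp(-1, 1) * scale
          # Straight-through estimator: the forward pass sees the ternary
          # weights, the backward pass treats quantization as the identity.
          w = self.weight + (q - self.weight).detach()
          return x @ w.t()

  layer = TernaryLinear(16, 8)
  loss = layer(torch.randn(4, 16)).pow(2).mean()
  loss.backward()  # gradients accumulate on the latent fp weights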


Most research universities have the resources to train a ~10B parameter model, at least.


For sure, bigger models are needed to compete with transformer LLMs (same thing for Mamba). I was just bothered by the distrust of something as reasonable as not being able to fully train a 70B model.


One can forgive the lack of quality results for the 70B model, but apparently they trained 7B and 13B versions of their model, and don't report those either.


Wait, are we reading the same paper? What I'm seeing is comparable accuracy to unquantized models for <4B params, and nothing reported for larger models except resource consumption.


Nope, you're right, I got the table inverted in my head. I'm updating my top comment.


Then perhaps a method emerges out of this to make training faster (but not inference) - do early training on highly quantized (even ternary) weights, and then swap out the weights for fp16 or something and fine-tune? Might save $$$ in training large models.
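
A rough sketch of that hand-off, assuming the early phase kept full-precision latent weights (all names here are hypothetical):

  import torch
  import torch.nn as nn

  # Stand-in for a layer whose early training used a ternary forward
  # pass but kept full-precision latent weights.
  ternary_pretrained = nn.Linear(16, 8)

  # Warm-start an ordinary fp16 layer from those latent weights,
  # then fine-tune it the usual way (optimizer setup omitted).
  fp16_layer = nn.Linear(16, 8).half()
  with torch.no_grad():
      fp16_layer.weight.copy_(ternary_pretrained.weight)
      fp16_layer.bias.copy_(ternary_pretrained.bias)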


Thank you. Your key point -- that so far all models with the proposed methods may have been only "grossly trained" -- is compelling. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That seems sensible to me and makes replication easier, but I agree we need to see extensive testing, after more extensive pretraining, on models of larger sizes.


They also trained a 3B model with 2 trillion tokens.

> The number of training tokens is a crucial factor for LLMs. To test the scalability of BitNet b1.58 in terms of tokens, we trained a BitNet b1.58 model with 2T tokens following the data recipe of StableLM-3B [ TBMR], which is the state-of-the-art open-source 3B model.

> [..]

> Our findings shows that BitNet b1.58 achieves a superior performance on all end tasks, indicating that 1.58-bit LLMs also have strong generalization capabilities.


And I was hoping to agree on this, but there is no 'SOTA StableLM-3B' trained on 2T tokens, which is a big gap in the paper: StableLM 3B is trained on 1T tokens for 4 epochs, and the benchmarks StableLM reports far exceed the ones shown in the paper. You can find them in the official StableLM git and compare them to the results in the paper: https://github.com/Stability-AI/StableLM?tab=readme-ov-file#...


You're right. Thank you for pointing that out!


Intuitively I've always been a bit skeptical of quantization. Wouldn't there be a tiny loss in precision from this type of quantization? I could imagine the error function increasing with these techniques.


John Carmack pointed out (and I learned it here on HN) that what training really needs is the *sign* of each individual gradient component. I.e., you can quantize the gradient to -1, 0, and 1 and still have the neural network learn much of the dataset.
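
A toy version of that update rule (signSGD-style; this just illustrates the idea, not anything from the paper):

  import torch

  def sign_sgd_step(params, lr=1e-3):
      # Use only the sign of each gradient component: the update per
      # parameter is -1, 0, or +1 times the learning rate.
      with torch.no_grad():
          for p in params:
              if p.grad is not None:
                  p -= lr * p.grad.sign()

  w = torch.randn(10, requires_grad=True)
  loss = (w ** 2).sum()
  loss.backward()
  sign_sgd_step([w])  # w moves toward zero despite the crude gradient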


Why isn't John Carmack working for OpenAI? Hell, why did he waste years at Meta to work on a VR headset and NOT AI? He even announced he wants to focus on AGI but he missed out on literally all the action.


he has his own AGI startup now https://dallasinnovates.com/john-carmacks-keen-technologies-...

TBH I think they won't get anywhere. Doing good game engine work... why would that translate to AGI?


That game engine was over 3 decades ago! John is one of the sharpest minds I've ever seen; if he's passionate about AGI, he surely has a much deeper understanding of what he's doing than the AI trendies on social media.


Let me introduce you to the wonderful game that is The Talos Principle: https://en.wikipedia.org/wiki/The_Talos_Principle

It discusses whether it is possible to evolve AGI using... a computer game engine! And that is John's bread and butter.


Wow! Is there a link to read up more on this?


  > It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.
https://forums.fast.ai/t/how-to-do-reproducible-models-and-u...


They seem to be doing training with higher precision. The optimizer is keeping a copy.


It does increase the "error" (meaning the model is less likely to predict the next word when compared against a dataset), but the degradation is smaller than your intuition would suggest.


Quantization does reduce the quality of the outputs. But the point is that you save enough memory doing so that you can cram a larger model into the same hardware, and this more than compensates for the lost precision.
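
Back-of-the-envelope version of that trade-off, ignoring activations, the KV cache, and packing overhead (my numbers, not the paper's):

  import math

  def weight_gib(params_billion, bits_per_weight):
      # bits -> bytes -> GiB
      return params_billion * 1e9 * bits_per_weight / 8 / 2**30

  print(weight_gib(7, 16))             # ~13.0 GiB: 7B model in fp16
  print(weight_gib(70, math.log2(3)))  # ~12.9 GiB: 70B model at 1.58 bits/weight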


Yes, each weight will not be able to "learn" as much if it has fewer bits of precision. But the idea is that you can use more weights, and the big question is whether these low-precision weights can make the model more accurate as a whole.


> Also, a quick reminder that Perplexity 12 is really terrible.

The 3B model had a perplexity of 9.91, lower than LLaMA 1 in fp16.



