
"Is not fully trained" can also mean "we did not figure out how to reach an acceptable loss" or "training was unstable," both of which are common for ML systems.



It probably means that the model is not fully trained, because training a 70B model is very expensive; not even Mamba or RWKV has a model that comes close to that size. The leeriness is just kinda silly, honestly.


Extraordinary claims require extraordinary evidence.

That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.


> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?


Yes. The optimizer keeps a higher-precision copy, so training is likely slower and needs more memory than an equivalent full-precision model. I'd also imagine it takes a multiple of the usual epochs to reach an equivalent point, because the forward pass has to settle on the right choice among three states rather than just nudging a weight a little in the right direction.
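
To make the memory point concrete, here is a minimal sketch (assuming PyTorch; the TernaryLinear layer and the mean-absolute-value scaling are illustrative, not the paper's exact recipe) of quantization-aware training: the optimizer updates a latent full-precision weight, while the forward pass only sees the ternarized copy through a straight-through estimator.

    # Minimal sketch (assuming PyTorch) of ternary quantization-aware training.
    # The latent fp32 weight plus the optimizer state is why training memory
    # is not reduced, even though the forward pass uses {-1, 0, +1} weights.
    import torch
    import torch.nn as nn

    class TernaryLinear(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            # Latent full-precision weights: the copy the optimizer updates.
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

        def forward(self, x):
            w = self.weight
            # Ternarize to {-1, 0, +1}, scaled by the mean absolute value.
            scale = w.abs().mean()
            w_q = torch.round(torch.clamp(w / (scale + 1e-8), -1, 1)) * scale
            # Straight-through estimator: forward uses w_q, backward treats
            # the quantization as identity so gradients reach the latent w.
            w_ste = w + (w_q - w).detach()
            return nn.functional.linear(x, w_ste)

    layer = TernaryLinear(64, 64)
    opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)  # fp32 weights + two Adam moments
    x = torch.randn(8, 64)
    loss = layer(x).pow(2).mean()
    loss.backward()
    opt.step()  # updates the latent fp32 weights, not the ternary copy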


Most research universities have the resources to train a ~10B parameter model, at least.


For sure, bigger models are needed to compete with transformer LLMs, and the same goes for Mamba. I was just bothered by the distrust toward something very reasonable, like not being able to fully train a 70B model.



