
"Is not fully trained" can also mean "we did not figure out how to reach an acceptable loss" or "training was unstable," both of which are common for ML systems.



It probably means that the model is not fully trained, because training a 70B model is very expensive; not even Mamba or RWKV has a model that comes close to that size. The leeriness is just kinda silly, honestly.


Extraordinary claims require extraordinary evidence.

That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.


> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?


Yes. The optimizer keeps a higher-precision copy, so training is likely slower and needs more memory than an equivalent full-precision model. I'd also imagine it takes a multiple of the usual epochs to reach an equivalent point, because the forward pass has to settle on the right choice among three states rather than just nudging a weight a little in the right direction.
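
To make the memory point concrete, here is a minimal sketch (assuming PyTorch; the TernaryLinear layer and the mean-absolute-value scaling are illustrative, not the paper's exact recipe) of quantization-aware training: the optimizer updates a latent full-precision weight, while the forward pass only sees the ternarized copy through a straight-through estimator.

    # Minimal sketch (assuming PyTorch) of ternary quantization-aware training.
    # The latent fp32 weight plus the optimizer state is why training memory
    # is not reduced, even though the forward pass uses {-1, 0, +1} weights.
    import torch
    import torch.nn as nn

    class TernaryLinear(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            # Latent full-precision weights: the copy the optimizer updates.
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

        def forward(self, x):
            w = self.weight
            # Ternarize to {-1, 0, +1}, scaled by the mean absolute value.
            scale = w.abs().mean()
            w_q = torch.round(torch.clamp(w / (scale + 1e-8), -1, 1)) * scale
            # Straight-through estimator: forward uses w_q, backward treats
            # the quantization as identity so gradients reach the latent w.
            w_ste = w + (w_q - w).detach()
            return nn.functional.linear(x, w_ste)

    layer = TernaryLinear(64, 64)
    opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)  # fp32 weights + two Adam moments
    x = torch.randn(8, 64)
    loss = layer(x).pow(2).mean()
    loss.backward()
    opt.step()  # updates the latent fp32 weights, not the ternary copy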


Most research universities have the resources to train a ~10B parameter model, at least.


For sure, bigger models are needed to compete with transformer LLMs, and the same goes for Mamba. I was just bothered by the distrust toward something very reasonable, like not being able to fully train a 70B model.



