Down to one bit, but that's taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the differentiable activation function is important (to avoid the linear perceptron limitation), but the weight scaling can be so simple, at least for LLMs.
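A tiny numpy sketch of that point (the sizes and the ternary rounding scheme below are my own illustrative choices, not whatever the article actually does): two linear layers with no activation collapse into a single linear map, which is the old perceptron limitation, while a nonlinearity between them keeps the composition expressive even when each weight matrix is crudely quantized.

    import numpy as np

    rng = np.random.default_rng(0)

    def ternary_quantize(W):
        # Crude stand-in for low-bit weight quantization (an assumption, not
        # the article's exact scheme): scale by mean |w|, round to {-1, 0, +1},
        # then rescale.
        scale = np.mean(np.abs(W)) + 1e-8
        return np.clip(np.round(W / scale), -1, 1) * scale

    W1 = rng.normal(size=(16, 16))
    W2 = rng.normal(size=(16, 16))
    x = rng.normal(size=16)

    # Without an activation, two linear layers collapse into one linear map,
    # so depth buys nothing: the perceptron limitation.
    assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

    # With a differentiable nonlinearity between them, the composition is no
    # longer linear, even when both weight matrices are quantized to ternary.
    y = ternary_quantize(W2) @ np.tanh(ternary_quantize(W1) @ x)
    print(y[:4])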

It makes me wonder whether the extra layers are effectively compensating; in other words, could the number of layers or hidden neurons be trimmed down if we added more bits to each weight, and still see equivalent effectiveness?
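Here's a back-of-envelope bit-budget framing of that trade-off (all the layer counts, widths, and bit widths below are made-up numbers, purely illustrative): a deep low-precision stack and a shallow high-precision one can land on the same total weight memory, and whether they train to equivalent effectiveness is exactly the open question.

    def weight_bits(layers, hidden, bits_per_weight):
        # Very rough count: a stack of square hidden-to-hidden matrices only;
        # ignores embeddings, attention/MLP structure, biases, activations.
        return layers * hidden * hidden * bits_per_weight

    # Two hypothetical configurations with the same total weight-memory budget:
    deep_low_precision = weight_bits(layers=48, hidden=4096, bits_per_weight=2)
    shallow_high_precision = weight_bits(layers=6, hidden=4096, bits_per_weight=16)

    print(deep_low_precision)       # 1610612736 bits
    print(shallow_high_precision)   # 1610612736 bits, same budget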




You can just delete whole layers, because the point of residual layers is to let the model learn the hyperparameter "layer count" automatically. Compare this to a network without residual layers, where the model must use every layer: there you would have to get the layer count exactly right, which isn't really possible, since each data point might benefit from a different effective depth. The extra layers therefore exist primarily so the model becomes robust to a poorly chosen hyperparameter. You still need a minimum number of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887
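To make the residual point concrete, here's a toy numpy sketch (not the linked paper's method, just the identity-skip argument): each block computes x + f(x), so deleting a block is equivalent to forcing that block's f to contribute nothing, and the skip path still carries the signal to the rest of the stack. Without the skip connections, deleting the same blocks changes the computed function outright.

    import numpy as np

    rng = np.random.default_rng(0)

    def residual_block(x, W):
        # x + f(x): the skip path carries x through even if f contributes little.
        return x + np.tanh(W @ x)

    def plain_block(x, W):
        # No skip path: every block is load-bearing.
        return np.tanh(W @ x)

    def run(block_fn, weights, keep):
        h = x0
        for i, W in enumerate(weights):
            if i in keep:
                h = block_fn(h, W)
        return h

    weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(6)]
    x0 = rng.normal(size=8)
    all_blocks = set(range(6))
    pruned = all_blocks - {3, 4}   # delete two middle blocks

    def rel_change(block_fn):
        full = run(block_fn, weights, all_blocks)
        cut = run(block_fn, weights, pruned)
        return np.linalg.norm(full - cut) / np.linalg.norm(full)

    # Expect the residual stack's relative change under deletion to be much
    # smaller than the plain stack's.
    print("residual:", rel_change(residual_block))
    print("plain:   ", rel_change(plain_block))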



