Down to one bit, but that's taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the differentiable activation function is important (to avoid the linear perceptron limitation), but the weight scaling can be so simple, at least for LLMs.
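A tiny numpy sketch of that point (the sizes and the ternary rounding scheme below are my own illustrative choices, not whatever the article actually does): two linear layers with no activation collapse into a single linear map, which is the old perceptron limitation, while a nonlinearity between them keeps the composition expressive even when each weight matrix is crudely quantized.

    import numpy as np

    rng = np.random.default_rng(0)

    def ternary_quantize(W):
        # Crude stand-in for low-bit weight quantization (an assumption, not
        # the article's exact scheme): scale by mean |w|, round to {-1, 0, +1},
        # then rescale.
        scale = np.mean(np.abs(W)) + 1e-8
        return np.clip(np.round(W / scale), -1, 1) * scale

    W1 = rng.normal(size=(16, 16))
    W2 = rng.normal(size=(16, 16))
    x = rng.normal(size=16)

    # Without an activation, two linear layers collapse into one linear map,
    # so depth buys nothing: the perceptron limitation.
    assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

    # With a differentiable nonlinearity between them, the composition is no
    # longer linear, even when both weight matrices are quantized to ternary.
    y = ternary_quantize(W2) @ np.tanh(ternary_quantize(W1) @ x)
    print(y[:4])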

It makes me wonder whether the extra layers are effectively compensating; in other words, could the number of layers or hidden neurons be trimmed down if we added more bits to each weight, and still see equivalent effectiveness?
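Here's a back-of-envelope bit-budget framing of that trade-off (all the layer counts, widths, and bit widths below are made-up numbers, purely illustrative): a deep low-precision stack and a shallow high-precision one can land on the same total weight memory, and whether they train to equivalent effectiveness is exactly the open question.

    def weight_bits(layers, hidden, bits_per_weight):
        # Very rough count: a stack of square hidden-to-hidden matrices only;
        # ignores embeddings, attention/MLP structure, biases, activations.
        return layers * hidden * hidden * bits_per_weight

    # Two hypothetical configurations with the same total weight-memory budget:
    deep_low_precision = weight_bits(layers=48, hidden=4096, bits_per_weight=2)
    shallow_high_precision = weight_bits(layers=6, hidden=4096, bits_per_weight=16)

    print(deep_low_precision)       # 1610612736 bits
    print(shallow_high_precision)   # 1610612736 bits, same budget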




You can just delete whole layers, because the point of residual layers is to let the model learn the hyperparameter "layer count" automatically. Compare this to a network without residual layers, where the model must use every layer: there you would have to get the layer count exactly right, which isn't really possible, since each data point might benefit from a different effective depth. The extra layers therefore exist primarily so the model becomes robust to a poorly chosen hyperparameter. You still need a minimum number of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887
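To make the residual point concrete, here's a toy numpy sketch (not the linked paper's method, just the identity-skip argument): each block computes x + f(x), so deleting a block is equivalent to forcing that block's f to contribute nothing, and the skip path still carries the signal to the rest of the stack. Without the skip connections, deleting the same blocks changes the computed function outright.

    import numpy as np

    rng = np.random.default_rng(0)

    def residual_block(x, W):
        # x + f(x): the skip path carries x through even if f contributes little.
        return x + np.tanh(W @ x)

    def plain_block(x, W):
        # No skip path: every block is load-bearing.
        return np.tanh(W @ x)

    def run(block_fn, weights, keep):
        h = x0
        for i, W in enumerate(weights):
            if i in keep:
                h = block_fn(h, W)
        return h

    weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(6)]
    x0 = rng.normal(size=8)
    all_blocks = set(range(6))
    pruned = all_blocks - {3, 4}   # delete two middle blocks

    def rel_change(block_fn):
        full = run(block_fn, weights, all_blocks)
        cut = run(block_fn, weights, pruned)
        return np.linalg.norm(full - cut) / np.linalg.norm(full)

    # Expect the residual stack's relative change under deletion to be much
    # smaller than the plain stack's.
    print("residual:", rel_change(residual_block))
    print("plain:   ", rel_change(plain_block))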



