they can also use more layers without increasing model size.
Additionally, in section 4.9 they compare more layers and find "The difference between 12-layer and 24-layerALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration"
Additionally, in section 4.9 they compare more layers and find "The difference between 12-layer and 24-layerALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration"