Hacker News

You don't want your embedding model to be trained only on ImageNet, even if it's a small embedding model: the dataset is not very diverse, and training with a cross-entropy loss over 1000 classes biases the learned representations. For a small embedding model it's much better to use a DINOv2 distilled model like ViT-S/14 (21M parameters), which was trained and distilled on 142M images without supervision, so there is no label bias in the training objective.
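To make the "embedding model" use case concrete: embeddings from a model like DINOv2 ViT-S/14 (which outputs 384-dimensional vectors) are typically L2-normalized and compared by cosine similarity for retrieval. A minimal sketch with random stand-in vectors (the actual vectors would come from the model; the names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 384-dim embeddings (ViT-S/14's output size), using
# random stand-ins instead of real model outputs for illustration.
gallery = rng.normal(size=(5, 384))
query = gallery[2] + 0.01 * rng.normal(size=384)  # near-duplicate of item 2

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity is the dot product of unit-normalized vectors.
sims = l2_normalize(gallery) @ l2_normalize(query)
best = int(np.argmax(sims))
print(best)  # → 2 (the near-duplicate is retrieved)
```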

Btw, the split batch normalization is very interesting. I wonder whether this split normalization could also be applied to other types of normalization, like layer normalization in transformer models.
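Assuming "split" here means computing independent statistics per sub-batch rather than one set over the whole batch, the core idea can be sketched in a few lines (a toy version only; a real layer would also track running statistics and learn scale/shift parameters):

```python
import numpy as np

def split_batch_norm(x, num_splits=2, eps=1e-5):
    """Toy split batch norm: normalize each contiguous sub-batch of x
    (shape [batch, features]) with its own mean and variance, instead
    of one set of statistics over the full batch."""
    splits = np.array_split(x, num_splits, axis=0)
    out = [(s - s.mean(axis=0)) / np.sqrt(s.var(axis=0) + eps) for s in splits]
    return np.concatenate(out, axis=0)

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(8, 4))
y = split_batch_norm(x, num_splits=2)
# Each half of the batch is normalized independently to ~zero mean, unit variance.
print(np.allclose(y[:4].mean(axis=0), 0.0, atol=1e-6))  # → True
```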



