You don't want your embedding model to be trained only on ImageNet, even if it's a small one: the dataset is not very diverse, and training with cross-entropy loss on 1000 classes biases the features toward those classes. For a small embedding model it is much better to use one of the distilled DINOv2 models such as ViT-S/14 (21M parameters), which was trained and distilled on 142M images without supervision, so the training objective carries no label bias.
Btw, the split batch normalization is very interesting. I wonder if the same splitting can also be applied to other types of normalization, like layer normalization in transformer models.
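To make the layer-norm question concrete, here is a rough NumPy sketch. It assumes "split" batch normalization means computing the batch statistics independently on each sub-batch, which may not match the original implementation exactly. The function names are mine, and the learnable scale/shift parameters are left out to keep it short.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    # Normalize each feature with statistics taken across the batch axis.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def layer_norm(x, eps=1e-5):
    # Normalize each sample with statistics taken across its own features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def split_norm(norm_fn, x, num_splits):
    # Assumption: "split" means applying the normalization separately to
    # equally sized sub-batches and concatenating the results.
    chunks = np.split(x, num_splits, axis=0)
    return np.concatenate([norm_fn(c) for c in chunks], axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))  # 8 samples, 16 features

# Splitting changes batch norm, because each sub-batch has its own statistics.
bn_changed = not np.allclose(batch_norm(x), split_norm(batch_norm, x, 2))

# Splitting along the batch axis leaves layer norm untouched, because its
# statistics never mix different samples in the first place.
ln_changed = not np.allclose(layer_norm(x), split_norm(layer_norm, x, 2))

print("batch norm changed by splitting:", bn_changed)
print("layer norm changed by splitting:", ln_changed)
```

Under that assumption, splitting the batch is a no-op for layer normalization, so a transformer analogue would presumably have to split along some other axis (tokens or feature groups, for example), which is a different idea from the one described here.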