They introduce a complicated training regime to make up for tearing out a lot of known-good ideas like residual connections (i.e., not so vanilla after all). They end up needing a lot more parameters and FLOPs to match ResNet quality...
(Edit to add) Why not use some clever distillation to get things to work well? Take the large, well-performing model and have the student match its intermediate activations to avoid gradient collapse. Training should then be very straightforward.
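Roughly what I have in mind, as a minimal PyTorch sketch (the `teacher`/`student` models, the per-stage channel lists, and the `beta` weight are all my assumptions for illustration, not anything from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """Match a student's intermediate activations to a frozen teacher's
    (in the spirit of FitNets-style hint training)."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convs project student features to the teacher's channel widths,
        # since the two architectures generally won't line up exactly.
        self.projections = nn.ModuleList(
            nn.Conv2d(s, t, kernel_size=1)
            for s, t in zip(student_channels, teacher_channels)
        )

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for proj, sf, tf in zip(self.projections, student_feats, teacher_feats):
            sf = proj(sf)
            # Match spatial sizes if the stages downsample differently.
            if sf.shape[-2:] != tf.shape[-2:]:
                sf = F.adaptive_avg_pool2d(sf, tf.shape[-2:])
            loss = loss + F.mse_loss(sf, tf.detach())  # teacher stays frozen
        return loss

# One training step, assuming each model returns (logits, list_of_feature_maps):
# distill = FeatureDistillationLoss(student_channels=[64, 128],
#                                   teacher_channels=[256, 512])
# logits_s, feats_s = student(images)
# with torch.no_grad():
#     logits_t, feats_t = teacher(images)
# loss = F.cross_entropy(logits_s, labels) + beta * distill(feats_s, feats_t)
```

The point is that the feature-matching term gives every stage a direct supervision signal, so you don't need a fragile multi-stage schedule to keep gradients flowing through a deep plain network.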
Really, any time I see multi-stage training in a paper I get worried... It tends to indicate that the whole system is very fragile, and that we'll be searching for just the right hyperparameters for months...
I would have liked to see it compared to MobileOne (https://github.com/apple/ml-mobileone), which appears to run faster on an iPhone 12 Pro, at comparable accuracy, than VanillaNet does on an A100 server-class GPU.
It would also be great if there were pretrained models available that better match typical mobile scenarios, e.g., comparable in complexity to MobileNetV2 0.5.
Transformers don't seem optimal as an architecture for vision tasks, but I don't think pure convnets are either. Either way, having the right data is 10x as important.