They introduce a complicated training regime to make up for tearing out a lot of known-good ideas like residual connections (i.e., not so vanilla after all). They end up needing a lot more parameters and FLOPs to match ResNet quality...
(Edit to add) Why not use some clever distillation to get things to work well? Take the large, well-performing model and have the student match its intermediate activations to avoid gradient collapse. Training should then be very straightforward.
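Roughly what I have in mind, as a minimal PyTorch sketch (the `teacher`/`student` models, the per-stage channel lists, and the `beta` weight are all my assumptions for illustration, not anything from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """Match a student's intermediate activations to a frozen teacher's
    (in the spirit of FitNets-style hint training)."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convs project student features to the teacher's channel widths,
        # since the two architectures generally won't line up exactly.
        self.projections = nn.ModuleList(
            nn.Conv2d(s, t, kernel_size=1)
            for s, t in zip(student_channels, teacher_channels)
        )

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for proj, sf, tf in zip(self.projections, student_feats, teacher_feats):
            sf = proj(sf)
            # Match spatial sizes if the stages downsample differently.
            if sf.shape[-2:] != tf.shape[-2:]:
                sf = F.adaptive_avg_pool2d(sf, tf.shape[-2:])
            loss = loss + F.mse_loss(sf, tf.detach())  # teacher stays frozen
        return loss

# One training step, assuming each model returns (logits, list_of_feature_maps):
# distill = FeatureDistillationLoss(student_channels=[64, 128],
#                                   teacher_channels=[256, 512])
# logits_s, feats_s = student(images)
# with torch.no_grad():
#     logits_t, feats_t = teacher(images)
# loss = F.cross_entropy(logits_s, labels) + beta * distill(feats_s, feats_t)
```

The point is that the feature-matching term gives every stage a direct supervision signal, so you don't need a fragile multi-stage schedule to keep gradients flowing through a deep plain network.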
Really, any time I see multi-stage training in a paper I get worried... It tends to indicate that the whole system is very fragile, and that we'll be searching for just the right hyperparameters for months...
I would have liked to see it compared to MobileOne (https://github.com/apple/ml-mobileone), which appears to run faster on an iPhone 12 Pro, at comparable accuracy, than VanillaNet does on an A100 server-class GPU.
It would also be great if there were pretrained models available that better match typical mobile scenarios, e.g., comparable in complexity to MobileNetV2 0.5.
Transformers don't seem optimal as an architecture for vision tasks, but I don't think pure convnets are either. Either way, having the right data is 10x as important.