Convolutional Neural Networks for Visual Recognition

ks2048 · 2024-05-19T20:35:54

One page says "CS231n: Deep Learning for Computer Vision" and another says "CS231n: Convolutional Neural Networks for Visual Recognition". Did they rename it recently to acknowledge other methods (ViT)?

danjl · 2024-05-19T20:37:54

Certainly still worth learning CNNs. Still unclear if ViT is better. And there's certainly enough for a full course on CNNs and a separate course on vision transformers.

fzliu · 2024-05-19T21:40:35

Agreed. ViTs are better if you're looking to go multimodal or use attention-specific mechanisms such as cross-attention. If not, there's evidence out there that ViTs are no better than convnets, either for small networks or at scale (https://frankzliu.com/blog/vision-transformers-are-overrated).

abrichr · 2024-05-20T00:49:31

ViTs have also proven more effective for zero-shot generalization tasks due to their ability to capture global context and relationships in the input data, something CNNs struggle with.

https://arxiv.org/abs/2304.02643