Hacker News
Vision Transformers Are Overrated (frankzliu.com)
67 points by fzliu 9 months ago | 17 comments



> With a bit of experimentation on Imagenet-1k, we can reach 82.0% accuracy with a 176x176 training image size with no extra data, matching ConvNeXt-T (v1, without pre-training a-la MAE) and surpassing ViT-S (specifically, the ViT flavor from DeiT-III).

This is no surprise, and well known. ViTs excel when data is plentiful but perform worse than convolutions in lower-data regimes; ConvNets (ResNets, for example), on the other hand, don't do as well in high-data regimes. ImageNet is considered a "low data regime" compared to what ViTs normally get trained on: most SOTA ViTs on ImageNet are pre-trained on MUCH larger datasets (e.g. JFT-300M).


My intuition for this is that CNNs impose a stronger prior on the function space. We rely on the previous experience of computer vision and even our own retinas to give it a useful prior.

Transformers are more general, they impose a weaker prior and require more data. But by imposing a weaker prior they also are more flexible and adaptable, thus able to take advantage of more data.
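To make the "stronger prior" concrete: a convolution is just a dense linear map constrained to be local and translation-equivariant, which collapses the number of free parameters. A toy numpy sketch (sizes are arbitrary, purely for illustration):

```python
import numpy as np

n, k = 32, 3  # sequence length and kernel size (arbitrary choices)

# A dense layer mapping n inputs to n outputs: n*n free parameters.
dense_params = n * n

# A 1D convolution with kernel size k is the same n->n linear map, but
# the weight matrix is banded (locality) and every band shares one value
# (translation equivariance), leaving only k free parameters.
conv_params = k

# Build the conv as an explicit constrained dense matrix to show it's a
# special case of the dense map.
w = np.random.randn(k)
W = np.zeros((n, n))
for i in range(n):
    for j, wj in enumerate(w):
        col = i + j - k // 2
        if 0 <= col < n:
            W[i, col] = wj  # the same k values reused on every row

print(dense_params, conv_params)  # 1024 vs 3 free parameters
```

The dense map has to learn locality and weight sharing from data; the conv gets both for free, which is exactly the prior that pays off in low-data regimes.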


Not all vision transformers have weak priors. Shifted-window transformers and neighborhood attention have priors well suited to images; the latter is showing extremely strong performance in image generation (such as the recent hourglass diffusion transformer, which lets pixel-space diffusion be trained orders of magnitude faster than with vanilla attention) and in general image classification, and certainly does not need the dataset sizes of classical ViTs.
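Roughly, the locality prior here can be sketched as an attention mask where each position only attends to a fixed window around itself. This is a toy 1D single-head version in numpy (real neighborhood attention is 2D and implemented far more efficiently, e.g. without materializing the full score matrix):

```python
import numpy as np

def neighborhood_mask(n, window):
    """Boolean mask: position i may attend to j only if |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_attention(q, k, v, window):
    # Toy masked softmax attention over a 1D token sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(neighborhood_mask(len(q), window), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
out = local_attention(q, q, q, window=2)
print(out.shape)  # (8, 4)
```

The mask is the prior: like a convolution's receptive field, it hard-codes locality, while the attention weights within the window stay content-dependent.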


Agreed. However, I feel there may still be many priors that are useful in practice, like decomposing the vision task into a hierarchy of subproblems - but as a human, I could be biased.


Oh absolutely, I think having a prior makes these problems tractable. Otherwise we would just force feed data to fully connected networks!


It would be much better if we had a way to learn good priors -> extract them -> initialize new models with them. We shouldn’t be trying to guess or handcraft them.


I recommend checking out this paper: https://arxiv.org/pdf/2310.16764.pdf. From the concluding section:

"Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly."

Neural networks can be unpredictable, and there's evidence that questions how important transformers' lack of inductive bias (at scale) really is.


Yes, exactly. ViTs need O(100M)-O(1B) images to overcome the lack of spatial priors. In that regime and beyond, they begin to generalize better than ConvNets.

Unfortunately, ImageNet hasn't been a useful benchmark for a while now, since pre-training is so important for production visual foundation models.


Pulling out a key part of this post from a DeepMind 2023 paper[1]: “Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly.”

Another common constraint in vision vs. language is that the long tails are very long in the visual world. There are a number of domains where you have very few examples to learn from (defects are designed to happen infrequently; rare species for identification show up, well, rarely). And pulling from the blog: "But small models ... benefit greatly from the exact type of experiment outlined in this post: strong augmentation with limited data trained across many epochs."

[1] https://arxiv.org/pdf/2310.16764.pdf


You don't want your embedding model to be trained only on ImageNet, even if it's a small embedding model: the dataset is not very diverse, and training with a cross-entropy loss over 1000 classes biases the model. For a small embedding model it's much better to use the DINOv2 distilled models like ViT-S/14 (21M parameters), trained and distilled on 142M images without supervision, so there's no bias in the training objective.

Btw, the split batch normalization is very interesting. I wonder if this split normalization could also be applied to other types of normalization, like layer normalization in transformer models.
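I haven't seen a spec for it beyond the post, but if "split" means maintaining separate normalization statistics for disjoint groups within a batch (say, clean vs. heavily augmented samples), a toy version might look like this - purely my guess at the idea, not a reference implementation:

```python
import numpy as np

def split_batch_norm(x, split, eps=1e-5):
    """Hypothetical 'split' BN: normalize each batch group with its own
    mean/var instead of pooled batch statistics. `split` holds group ids.
    (An assumed interpretation of the post, not its actual definition.)"""
    out = np.empty_like(x)
    for g in np.unique(split):
        m = split == g
        mu = x[m].mean(axis=0)
        var = x[m].var(axis=0)
        out[m] = (x[m] - mu) / np.sqrt(var + eps)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
split = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # e.g. clean vs augmented halves
y = split_batch_norm(x, split)
```

Under that reading, the same grouping trick would port to layer norm by computing the per-group statistics over features instead of over the batch.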


idk, sounds more like they just architecturally overfit to ImageNet - especially if they evaluated as they made each of their arch changes

also, this is a low-data setting


Couldn't you use split normalization to improve other networks too?

Also didn't realize Milvus had a pure Python version. Is there even a reason to use Chroma now?


This is a total misunderstanding of current machine learning, of how you should run experiments to avoid training on the test data, and of what performance means.

First, tuning the hyperparameters of a CNN to gain <2% improvement on a dataset while you're constantly looking at the answers is meaningless! You're double-dipping into the data. You look at the performance, tune some part of the network by hand, see if it helps, and keep doing that. It's testing on the training data, and it's totally statistically bogus! Never mind that the variance between runs wipes out a lot of these differences, or that this performance difference doesn't matter for any problem in the real world. The results at the end show none of this matters. This is really just the worst kind of performance-hacking ML that contributes nothing at all.
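The double-dipping effect is easy to simulate: repeatedly "tuning" against the same test set and reporting the best run inflates accuracy even for models with zero real skill. A quick numpy sketch (sizes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_tries = 1000, 50  # test-set size and number of hand-tuning rounds

labels = rng.integers(0, 2, n_test)  # binary test labels

# Each "tuning round" is a model with no real skill: random predictions.
accs = [(rng.integers(0, 2, n_test) == labels).mean() for _ in range(n_tries)]

# What you report if you keep peeking at the test set after every tweak.
best = max(accs)
print(f"expected accuracy: 0.500, best-of-{n_tries}: {best:.3f}")
```

The best-of-50 score beats chance purely through selection; small hand-tuned gains on a fixed benchmark can be exactly this kind of noise.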

Second, the point of modern ML is not to perform well on ImageNet-1k! No one cares about ImageNet-1k. No one needs to classify ImageNet-1k in real life. The goal is to generalize to new data that the customer has. For that, a ViT trained at scale is far better than a CNN trained on a small dataset.

> But small models (e.g. most embedding models) are arguably more important because of their portability and adaptability, and these models benefit greatly from the exact type of experiment outlined in this post: strong augmentation with limited data trained across many epochs. This is exactly the type of data that Imagenet-1k represents.

This is totally mathematically bogus and ignorant of like the past 10+ years of machine learning. I cannot believe that someone that has a job working in ML would write something like this. Much less someone who works at a vector database company!

ImageNet-1k represents basically garbage for embedding models. What you want is a ViT that's seen massive amounts of data so that your embeddings don't become degenerate because they're far away from ImageNet-1k! Almost everything is far away from ImageNet. It's an extremely narrow slice of the possible images.

It also confuses the size of the training data with the size of the model. Again, a confusion that would make much more sense 10 years ago. Just because a ViT has seen a lot of data doesn't mean it needs to be big; it could have started out big and been distilled down. From the NLP world we've learned that there are absolutely tiny embedding models, just a few MB in size, that work amazingly well; they can even beat multi-GB models.
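Distillation of an embedding model is conceptually simple: train the small model to match the big model's embeddings on unlabeled data. A bare-bones numpy sketch of that objective, with a linear "student" and a random matrix standing in for the frozen teacher (not any particular method's recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb, n = 16, 8, 64  # arbitrary toy dimensions

x = rng.standard_normal((n, d_in))
W_teacher = rng.standard_normal((d_in, d_emb))  # stand-in for a big frozen model
teacher_emb = x @ W_teacher

# Tiny linear student trained by gradient descent on an MSE
# distillation loss against the teacher's embeddings.
W_student = rng.standard_normal((d_in, d_emb)) * 0.01
lr = 0.05
for _ in range(500):
    pred = x @ W_student
    grad = 2 * x.T @ (pred - teacher_emb) / n  # d(MSE)/dW
    W_student -= lr * grad

loss = np.mean((x @ W_student - teacher_emb) ** 2)
```

The student's parameter count is independent of how much data the teacher saw, which is the point: "trained on a billion images" does not imply "big at inference time".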

As an ML researcher, I can't believe anyone would write this article today. I'd fail a student if they wrote this.


> You're double-dipping into the data. You look at the performance, then you tune some part of the network by hand, see if it helps, and then keep doing that. It's testing on the training data.

I purposely tried to avoid adding any niche network modifications that would help it overfit to in-1k. All three of the modifications are applicable to other networks and datasets.

> No one cares about ImageNet-1k. No one needs to classify ImageNet-1k in real life.

I completely agree with you; I just don't have the compute to train this on a massive dataset. With that being said, I'm not advocating for taking an in-1k model and putting it into production. I'm merely saying we can get ResNet to reach the same level of performance as ViTs. And that there's evidence that convnets reach that same level of performance at scale.

> What you want is a ViT that's seen massive amounts of data so that your embeddings don't become degenerate because they're far away from ImageNet-1k!

https://arxiv.org/pdf/2310.16764.pdf


> I purposely tried to avoid adding any niche network modifications that would help it overfit to in-1k. All three of the modifications are applicable to other networks and datasets.

You introduced a ridiculous number of degrees of freedom with your architectural modifications, and the vast majority of them don't seem to be a standard way of doing things. If they're generally applicable to most datasets, then show an eval against ViT on a new dataset that you haven't looked at while making your arch modifications.

The way this is described in the blog is just not sufficient to support the sweeping titular claim; it needs a more robust evaluation that guards against overfitting/p-hacking.


What's SOTA for ViTs that can generate good bounding boxes and easily be fine-tuned or retrained (e.g. I might want to identify household objects or UI elements)?


What an article!

GResNet seems to generate the most similar results, from my human evaluation.

Maybe we can combine both models with a hybrid search?



