Transformers gained popularity because the architecture scales well and parallelizes efficiently on existing GPU/XLA hardware; modeling is always conditioned on the hardware available at hand. Unlike CNN/RNN-style models, transformers lack built-in inductive bias, which makes them generic building blocks, and by injecting inductive bias such as positional encoding they translate well to various domains.
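
For concreteness, here's a minimal sketch (my own, not from any of the papers discussed) of the sinusoidal positional encoding from the original transformer paper, in PyTorch. It's the simplest example of injecting an ordering bias that plain self-attention otherwise lacks:

    import math
    import torch

    def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
        # One row per position, one column per embedding dimension (dim assumed even).
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)         # (seq_len, 1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / dim))                         # (dim/2,)
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        return pe

    # Added to the token embeddings before the first attention layer:
    # x = token_embeddings + sinusoidal_positions(seq_len, dim)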



Are transformers competitive with (for example) CNNs on vision-related tasks when there's less data available? I'm not that familiar with "injecting inductive bias" via positional encodings, but it sounds really interesting. My crude understanding is that positional encodings were used in the original Transformer architecture to encode the ordering of words for NLP. Are they more flexible than that? For example, can they be used to replicate the image-related inductive bias of CNNs and match CNN performance on small datasets (1,000 to 10,000 samples)?

If not, then it seems to me that only industries with access to large amounts of representative data (more than a million samples, say) benefit from transformers. In industries where there are bottlenecks to data generation, there's a clear benefit to leveraging the inductive bias of other architectures, such as the various ways CNNs are biased towards image recognition.

I'm in an industry (building energy consumption prediction) where we can only generate around 10,000 to 100,000 datapoints (from simulation engines) for DL. Are transformers ever used with that scale of data?


> Are transformers competitive with (for example) CNNs on vision-related tasks when there's less data available?

They can be. There's current research into the trade-offs between local inductive bias (information from local receptive fields; CNNs have a strong local bias) and global inductive bias (large receptive fields, i.e. attention). There are plenty of works that combine CNNs and attention/transformers. A handful of them focus on smaller datasets, but the majority are more interested in ImageNet. There's also work on changing the receptive fields within attention mechanisms as a way to balance this.
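
As a rough illustration of what combining the two biases can look like (a sketch of my own, not any specific paper's block): a depthwise convolution supplies the local receptive field and self-attention supplies the global one.

    import torch
    import torch.nn as nn

    class LocalGlobalBlock(nn.Module):
        """Hypothetical hybrid block: depthwise conv for local bias,
        multi-head self-attention for global context."""
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                                       # x: (batch, tokens, dim)
            x = x + self.local(x.transpose(1, 2)).transpose(1, 2)   # local receptive field
            h = self.norm(x)
            out, _ = self.attn(h, h, h)                             # global receptive field
            return x + out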

> Are transformers ever used with that scale of data?

There's a yes and a no to your question, but definitely yes in the sense that people have trained on Flowers102 (6.5k training images) and CIFAR-10 (50k training images). Keep in mind that not all of these models are pure transformers: some have early or intermediate convolutions. Some of these works even have fewer parameters and better computational efficiency than comparable CNNs.

But more importantly, I think the big question is what type of data you have. If large receptive fields help your problem, transformers will work great. If you need local receptive fields, CNNs will tend to do better (or combinations of transformers and CNNs, or transformers with reduced receptive fields). I doubt there will be a one-size-fits-all architecture.

One thing to also keep in mind is that transformers typically like heavy amounts of augmentation, and not all data can be augmented significantly. There's also pre-training and knowledge transfer/distillation.
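
For reference, here's the kind of augmentation recipe these models are usually paired with (a sketch using torchvision; the exact policies and magnitudes vary by paper and by dataset):

    from torchvision import transforms

    # Heavy augmentation commonly used when training vision transformers
    # on modest datasets; tune the magnitudes for your own data.
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandAugment(num_ops=2, magnitude=9),
        transforms.ToTensor(),
        transforms.RandomErasing(p=0.25),
    ])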


Good point. The fact that there is no inductive bias inherent to transformers makes it difficult to train a decent model from scratch on small datasets. However, there are recent research directions that try to address this problem [1].

Also, baking some sort of domain-specific inductive bias into the model architecture itself can address this problem as well [2].

[1]: Escaping the Big Data Paradigm with Compact Transformers: https://arxiv.org/abs/2104.05704

[2]: CvT: Introducing Convolutions to Vision Transformers: https://arxiv.org/abs/2103.15808
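
The general idea in [1] and [2] is to bake locality in before the transformer sees the tokens. A very rough sketch (details differ from both papers): replace the linear patch embedding with a small convolutional tokenizer.

    import torch.nn as nn

    class ConvTokenizer(nn.Module):
        """Sketch of a convolutional tokenizer: conv + pooling give the
        tokens a local inductive bias before the transformer encoder."""
        def __init__(self, in_ch: int = 3, dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, dim, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )

        def forward(self, x):                       # x: (B, C, H, W)
            x = self.net(x)                         # (B, dim, H/2, W/2)
            return x.flatten(2).transpose(1, 2)     # (B, tokens, dim)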


Maybe a naive question: is there no transfer learning with transformers? I've done a lot of work with CNN architectures on small datasets, and I almost always start with something trained on ImageNet and fine-tune, or do some kind of semi-supervised training to start. Can we do that with ViT et al. as well? Or are they really usually trained from scratch?


Lots of people do transfer learning with transformers. ViT [0] originally did CIFAR that way. Then DeiT [1] introduced knowledge distillation (note: their student is larger than the teacher). ViT was pretrained on both ImageNet-21k and JFT-300M.

CCT ([1] from the comment above) was focused on training from scratch.

There are two paradigms to be aware of. ImageNet pre-training can often be beneficial, but it doesn't always help: it really depends on the problem you're trying to tackle and whether the target dataset shares features with the pre-training dataset. If there is low similarity, you might as well train from scratch. You also might not want models as large as ViT or DeiT (ViT has more parameters than CIFAR-10 has features).
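
If you do go the pre-training route, a minimal fine-tuning sketch looks something like the following (using the timm library, which isn't mentioned above but is one common way to grab pretrained ViT weights; treat it as a sketch, not a recipe):

    import timm

    # Load an ImageNet-pretrained ViT and give it a fresh 10-class head.
    model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

    # Optionally freeze the backbone and train only the new head at first.
    for name, p in model.named_parameters():
        if 'head' not in name:
            p.requires_grad = False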

Disclosure: Author on CCT

[0] https://arxiv.org/abs/2010.11929

[1] https://arxiv.org/abs/2012.12877


Awesome, thanks for the reply. It's been on my list for a while now to try transformers instead of (mainly) ResNets.


Sure thing. Also, if you're getting into transformers I'd recommend lucidrains's GitHub [0], since it has a large collection of them with links to the papers. It's nice that things are consolidated.

[0] https://github.com/lucidrains/vit-pytorch
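
A minimal usage example, roughly following that repo's README (double-check the README for the current argument names):

    import torch
    from vit_pytorch import ViT

    v = ViT(
        image_size = 256,
        patch_size = 32,
        num_classes = 1000,
        dim = 1024,
        depth = 6,
        heads = 16,
        mlp_dim = 2048,
    )

    img = torch.randn(1, 3, 256, 256)
    preds = v(img)  # (1, 1000) class logits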


> by injecting inductive bias like positional encoding

I don't think it's fair to call positional encoding just "inductive bias". The positional encoding is the only way word order is communicated to the model. That would be like saying it's inductive bias to include color channels when working with images.
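
A quick way to see this: without positional information, self-attention treats its input as a set, so permuting the tokens just permutes the output. A small PyTorch check (assuming the default of no dropout):

    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
    x = torch.randn(1, 5, 8)
    perm = torch.randperm(5)

    out, _ = attn(x, x, x)
    out_p, _ = attn(x[:, perm], x[:, perm], x[:, perm])
    print(torch.allclose(out[:, perm], out_p, atol=1e-5))  # True: word order carries no signal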


All of this is just another illustration of the bitter lesson.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html



