Do vision transformers see like convolutional neural networks? (arxiv.org)
110 points by jonbaer 24 days ago | 42 comments

I am much more interested if they fall for the same tricks.

For example, whether it is easy to fool them with optical illusions, such as innocent images that look racy at first glance: https://medium.com/@marekkcichy/does-ai-have-a-dirty-mind-to... CW: Even though it does not contain a single explicit picture, it might be considered NSFW (literally, as at first glance it looks like nudity); full disclosure: I mentored the project.

I suggest you take a look at Geometric Deep Learning, but the gist here is that convolutions can be thought of as translation equivariant functions, and pooling operations as permutation invariant combinations, all operating on a graph of as many components as the number of times the operation will be output, each component is composed of the pixels that the operation will act on, so there is local information that is slowly combined through the layers, then relative positioning can be decoded when the representation is transformed into a dense 1d-vector aka flattening.
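To make the two properties named above concrete, here is a minimal sketch in plain Python. It assumes a 1D signal and a circular (wrap-around) convolution so that the equivariance holds exactly at the borders; real CNN layers pad instead of wrapping, so the property only holds approximately near the edges. The function names are illustrative and do not come from any library.

```python
def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel`, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically shift `signal` to the right by k positions."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def global_max_pool(signal):
    """A pooling step that ignores where each value sits."""
    return max(signal)


signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0]
kernel = [1.0, -2.0, 1.0]

# Equivariance: shifting then convolving equals convolving then shifting.
conv_of_shifted = circular_conv1d(shift(signal, 3), kernel)
shifted_conv = shift(circular_conv1d(signal, kernel), 3)
equivariant = conv_of_shifted == shifted_conv

# Invariance: the pooled value does not change when the input is reordered.
permuted = [signal[i] for i in (5, 2, 7, 0, 3, 6, 1, 4)]
invariant = global_max_pool(permuted) == global_max_pool(signal)

print(equivariant, invariant)
```

Both checks come out true here: the convolution output moves with the input, while the pooled value ignores position entirely.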

In contrast, attention mechanisms in transformers can be seen as taking into consideration dense graphs of the whole input (at least in text, I haven't really worked with vision transformers but if an attention mechanism exists then it should be similar), along with some positional encoding and a neighborhood summary.
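The "dense graph" view of attention can be sketched as follows. This is a simplified single-head self-attention in plain Python that assumes identity query/key/value projections and omits positional encodings, so it is not how any particular transformer is implemented; `self_attention` and the toy `tokens` are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification).

    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
outputs, weights = self_attention(tokens)

# Every weight is strictly positive, so each token is connected to every
# other token: the attention graph over the input is fully dense.
dense = all(w > 0.0 for row in weights for w in row)
print(dense)
```

Unlike the convolution case, where each output only sees a small window, every row of `weights` spans the whole input from the first layer on.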

If they indeed can be thought of as stacking neighborhood summaries along with attention mechanisms, then they shouldn't fall for the same tricks since they have access to "global" information instead of disconnected components.

But take this reply with a grain of salt as I am still learning about Geometric DL. If I misunderstood something, please correct me.

Thank you for bringing up GDL. I've been following its developments, and a really great resource is this site: https://geometricdeeplearning.com/

It contains links to the paper and lectures, and the keynote by M. Bronstein is illuminating and discusses the operations on graphs that lead to equivalence to other network topologies and designs: transformer equivalence, and more.

> Keynote: https://www.youtube.com/watch?v=9cxhvQK9ALQ

How are you learning/studying GDL? Would you like someone else to discuss/learn it with?

I believe Bronstein is onto something huge here, with massive implications. Illuminating is the best adjective to describe it, as I watched this keynote:

> https://www.youtube.com/watch?v=w6Pw4MOzMuo

Everything clicked into place and I was given a new language to see the world that combined everything together well beyond the way standard DL is taught:

> we do feature extraction using this function that resembles the receptive fields of the visual cortex and then we project the dense feature representation onto multiple other vectors and pass that through stacked non-linearities, and oh by the way we have myriad of different, seemingly disconnected, architectures that we are not sure why they work, but we call it inductive bias.

> https://geometricdeeplearning.com/

That's my main source, along with the papers that lead up to the proto-book, so pretty much Bronstein's work along with related papers found using `connectedpapers.com`. I don't have an appropriate background so I am grinding through abstract algebra, geometric algebra, will then go into geometry and whatever my supervisor suggests I should read. Sure, I would like to have other people to discuss it, but don't expect much just yet.

I agree, this perspective is very interesting and tames the zoo of architectures through mathematical unification. It is indeed exciting!

Good luck with your studies/learning!

> I suggest you take a look at Geometric Deep Learning, but the gist here is that convolutions can be thought of as translation equivariant functions, and pooling operations as permutation invariant combinations, all operating on a graph of as many components as the number of times the operation will be output, each component is composed of the pixels that the operation will act on, so there is local information that is slowly combined through the layers, then relative positioning can be decoded when the representation is transformed into a dense 1d-vector aka flattening.

Hey, just trying to check my understanding: this is what Tacotron does, right? It outputs an image that can be Fourier transformed into a soundwave, which is the flattening to a dense 1d vector? And the construction of that image works because the network was able to learn from examples of existing sounds transformed into images, because the transformation into an image encodes some invariance that biases the learning network to generalize better or something?

I didn't do really great in math in college, but always found deep learning interesting. Not sure if anything I said above makes any sense.

Since it is easy to fool people with optical illusions, I doubt that you will be able to train a computer to not be fooled by optical illusions.

The hard thing to emulate is that people quickly become aware that they are looking at an illusion. Even though you can't turn your perception off, Escher's infinite staircase doesn't actually trick you into thinking a set of stairs can go in a closed loop.

It fools humans only at first glance. A few seconds later we make a correct assessment.

Typical CNNs miss this second stage.

And the humor of the images comes from our initial expectations and how different they are from our actual understanding of what we're seeing.

I think because the human world is made for humans there’s a lot of value in an AI with similar failure modes to humans. Right now AI can do a good job of learning to classify images, but fails in ways that are entirely foreign to us.

We can use our understanding of how the world usually is and plausibly might be to get us out of local minima.

Those pictures are definitely NSFW when viewed at low res/from far away, which is how coworkers typically see your monitor contents. An argument that starts with “Well, technically” is unlikely to carry much weight in a discussion with HR (and probably rightfully so).

I think the scientific point here is that visual processing is not a one-shot process. Tasked with object detection, some scenes demand more careful processing and more computation.

Almost all neural network architectures process a given input size in the same amount of time, and some applications and datasets would benefit from an "anytime" approach, where the output is gradually refined given more time.
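One way to picture the "anytime" idea is an early-exit loop: run the network stage by stage, read off a prediction after each stage, and stop once it is confident enough. The sketch below is a toy under stated assumptions: `staged_logits` is a hypothetical stand-in for a backbone that refines its logits at every stage, and the evidence values are invented purely to show the control flow.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def staged_logits(evidence_per_stage):
    """Hypothetical backbone: each stage adds evidence to the running logits.

    Written as a generator so later stages are only computed if requested.
    """
    logits = [0.0] * len(evidence_per_stage[0])
    for evidence in evidence_per_stage:
        logits = [l + e for l, e in zip(logits, evidence)]
        yield logits


def anytime_predict(stages, threshold=0.9):
    """Consume stages until the top class probability reaches `threshold`.

    Returns (predicted_class, confidence, stages_used).
    """
    best, conf, used = None, 0.0, 0
    for logits in stages:
        used += 1
        probs = softmax(logits)
        conf = max(probs)
        best = probs.index(conf)
        if conf >= threshold:
            break  # confident enough: skip the remaining computation
    return best, conf, used


# An "easy" input is decided after one stage; a "hard" one needs all three.
easy = anytime_predict(staged_logits([[4.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]))
hard = anytime_predict(staged_logits([[0.5, 0.4, 0.0], [0.5, 0.3, 0.0], [2.5, 0.0, 0.0]]))
print(easy[2], hard[2])
```

The same loop can be cut off by a time budget instead of a confidence threshold, in which case the last available prediction is returned.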

I understand the point you are making, but it's kind of irrelevant. The task is to produce an answer for the image at the given resolution. It is an accident and coincidence that the neural network produces an answer that is arguably correct for a blurrier version of the image.

Counter argument in addition to downvote, please?

Probably because "Well, technically" is a really unfair way to characterize that argument.

It's not porn. It's not simulated porn. It's a hallway, and if you're not setting it as your desktop background to trick people on purpose then you're not doing anything wrong.

Not sure if this is along the lines of what you're thinking, but we tried looking at this a while ago: https://arxiv.org/abs/2103.14586

It's always nice to see big labs working more towards building an understanding of things instead of just chasing SOTA. On the other hand, I'm not sure there are a lot of actionable findings in here. I guess that's the trade-off with these things though....

Demonstrating really different representations is a good indicator that ensembles are worth trying.
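A toy illustration of why differing representations make ensembling attractive: if two models make mistakes on different inputs, averaging their probabilities can correct both. The numbers below are invented for illustration only (they are not results from the paper or from any real model), and the benefit in practice depends on how correlated the two models' errors are.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of binary labels matched by thresholding P(class 1)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


labels = [0, 1, 1, 0, 1, 0]

# Hypothetical P(class 1) from two models that err on different examples.
probs_a = [0.10, 0.90, 0.40, 0.20, 0.80, 0.60]  # wrong on examples 2 and 5
probs_b = [0.30, 0.45, 0.90, 0.55, 0.70, 0.10]  # wrong on examples 1 and 3

# Simple ensemble: average the two probability estimates per example.
probs_ensemble = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]

acc_a = accuracy(probs_a, labels)
acc_b = accuracy(probs_b, labels)
acc_ensemble = accuracy(probs_ensemble, labels)
print(acc_a, acc_b, acc_ensemble)
```

In this constructed case each model alone gets four of six examples right, while the average gets all six, because neither model is confidently wrong where the other is confidently right.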

Offtopic sort of, but does anyone know if folks are working on combining vision and natural language in one model? I think that could yield some interesting results.

And here is a short guide and a link to a Google Colab notebook that anyone can use to create their own AI-powered art using VQGAN+CLIP: https://sourceful.us/doc/935/introduction-to-vqganclip

yeah there has definitely been work done in that space: it’s called multi-modal models

not sure if this is the latest work but here’s some results from Google’s AI Blog


What would be really cool is neural networks with routing. Like circuit switching or packet switching. No idea how you would train such a beast though.

Like imagine the vision part making a phonecall to the natural language part to ask it for help with something.

Sounds like The Society of Mind - https://en.m.wikipedia.org/wiki/Society_of_Mind

Capsule networks have a routing algorithm as far as I know

It makes sense to me that attention would be hugely beneficial for vision tasks. We use contextual clues every day to decide what we’re looking at.

It may make sense, but it also makes no sense. CNNs already have full view of the entire input image. That's how discriminators are able to discriminate in GANs.

We added attention and observed no benefits at all in our GAN experiments.
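The "full view" claim can be checked with standard receptive-field arithmetic: each layer with kernel size k widens the field by (k - 1) times the current stride product. The sketch below computes the theoretical receptive field only; the layer stacks are hypothetical examples, and the effective receptive field of a trained network is often noticeably smaller than this upper bound.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3, stride-1 convolutions see a 7x7 window.
small = receptive_field([(3, 1)] * 3)  # 7

# A hypothetical discriminator-style stack of five 4x4, stride-2 convolutions
# sees 94 pixels per side, which already exceeds a 64x64 input.
deep = receptive_field([(4, 2)] * 5)  # 94
print(small, deep)
```

So a deep strided CNN does reach the whole image by its last layers; the difference with attention is that it gets there gradually rather than in the first layer.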

Does tesla use transformers for the auto pilot?

They do. Karpathy spoke about it at Tesla AI Day. They use it for transforming image space to a vector space.

See: https://youtu.be/j0z4FweCy4M (timestamp 54.40 onwards)

Can you provide a more specific timestamp? 54.40 doesn't seem to mention anything about transformers, and "onwards" is two hours.

I'd be really surprised if they use transformers due to how computationally expensive they are for anything involving vision.

EDIT: Found it. 1h: https://www.youtube.com/watch?v=j0z4FweCy4M?t=1h

Fascinating. I guess transformers are efficient.

I gave the timestamp where they start talking about the problem they are trying to solve using transformers. As you said, it is around the 1hr mark.

Doubtful. The biggest downside of transformers for vision is how ungodly long they take to produce results. Tesla has to operate in realtime.

In Karpathy’s recent AI day presentation he specifically stated they use transformers.

But not on the raw camera input — they use regnets for that. The transformers come higher up the stack:


Transformers mentioned on the slide at timestamp 1:00:18.

They use the key-value lookup/routing mechanism from Transformers to predict pixel-wise labels in bird view (lane, car, obstacle, intersection etc.). The motivation here is that some of the predictions may temporarily be occluded, so for predicting these occluded areas it may be particularly helpful to attend to remote regions in the input images which requires long-range dependencies that highly depend on the input itself (e.g. on whether there is an occlusion), which is exactly where the key-value mechanism excels. Not sure they even process past camera frames at this point. They only mention that later in the pipeline they have an LSTM-like NN incorporating past camera frames (Schmidhuber will be proud!!).
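The key-value lookup described above can be sketched as cross-attention: each output cell issues a query, which is matched against keys computed from image locations, and the result is a weighted mix of the corresponding values. This is a generic illustration in plain Python, not Tesla's implementation; the keys, values and query below are invented, and a real system would learn them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output row per query."""
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical keys for three image regions and the features stored there.
keys = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# A query for one bird's-eye-view cell that matches the third region,
# no matter how far away that region is in the image.
queries = [[0.0, 0.0, 8.0]]
retrieved = cross_attention(queries, keys, values)[0]
print(retrieved)
```

Because which region gets retrieved depends on the query-key match rather than on spatial distance, the lookup can pull in a remote, unoccluded region when the nearby ones are uninformative.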

Edit: A random observation which just occurred to me is that their predictions seem surprisingly temporally unstable. Observe, for example, the lane layout wildly changing while it makes a left turn at the intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can use the comma and period keys to step through the video frame-by-frame.

Thank you!

A useful HN feature would be a small space to put in a summary, like the abstract:

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks?

Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers.

We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.

> A useful HN feature would be small space to put in a summary, like the abstract

Interestingly, HN does actually save accompanying text when you submit a link also. It just doesn’t show the text on the website.


Personally I like it the way that it is. I think showing an accompanying text for links would allow too much for anyone posting a link to “force” everyone to read their comment on it. Leaving it so that comments must be posted separately in order to be visible in the thread makes it so that useful accompanying comments can float to the top, whereas a useless comment from the submitter sinks to the bottom while still allowing the submitted link to be voted on individually.

I guess this is really in response to all the other responses as well, but I thought the idea would be:

to help people decide if they want to click the link.

So the title may not be sufficiently informative to let people know if they can understand the article, are interested in it, if it is at the right technical level and so on.

I think you're right that it will be abused in many instances and might not be worth it.

This would be handy, but at the same time, I think I like that not doing so encourages people to click the link and read more of the article than they might otherwise.

There is little point in duplicating the first thing you see on the page on HN too.

Is the idea to not make people click the links before discussing?
