For example, if it is easy to fool them with optical illusions, such as innocent images that look racy at first glance:
CW: even though it does not contain a single explicit picture, it might be considered NSFW (literally, since at first glance it looks like nudity); full disclosure: I mentored the project.
In contrast, attention mechanisms in transformers can be seen as operating on a dense graph over the whole input (at least in text; I haven't really worked with vision transformers, but if they use an attention mechanism it should be similar), along with some positional encoding and a neighborhood summary.
If they can indeed be thought of as stacking neighborhood summaries along with attention mechanisms, then they shouldn't fall for the same tricks, since they have access to "global" information instead of disconnected components.
But take this reply with a grain of salt as I am still learning about Geometric DL. If I misunderstood something, please correct me.
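That "dense graph" view can be sketched concretely: in single-head dot-product attention, every position attends to every other position, so each output row is a weighted summary of the entire input rather than a local neighborhood. A toy numpy sketch (sizes and weight matrices here are random placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, wq, wk, wv):
    """Single-head dot-product attention: one message-passing step
    over the complete graph of input positions."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): one edge per pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ v                               # every output sees every input

n, d = 5, 8                                          # toy sizes (assumptions)
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (5, 8)
```

Perturbing any single input token changes every output row, which is exactly the "global receptive field" property that local convolutions lack.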
It contains links to the paper and lectures, and the keynote by M. Bronstein is illuminating: it discusses the operations on graphs that lead to equivalences with other network topologies and designs, transformers included.
> Keynote: https://www.youtube.com/watch?v=9cxhvQK9ALQ
How are you learning/studying GDL? Would you like someone else to discuss/learn it with?
Everything clicked into place and I was given a new language to see the world that combined everything together well beyond the way standard DL is taught:
> we do feature extraction using this function that resembles the receptive fields of the visual cortex, then we project the dense feature representation onto multiple other vectors and pass that through stacked non-linearities; oh, and by the way, we have a myriad of different, seemingly disconnected architectures that we are not sure why they work, but we call it inductive bias.
That's my main source, along with the papers that led up to the proto-book, so pretty much Bronstein's work plus related papers found using `connectedpapers.com`. I don't have an appropriate background, so I am grinding through abstract algebra and geometric algebra, will then move on to geometry and whatever my supervisor suggests I should read. Sure, I would like to have other people to discuss it with, but don't expect much just yet.
Good luck with your studies/learning!
Hey, just trying to check my understanding: this is what Tacotron does, right? It outputs an image (a spectrogram) that can be inverse-Fourier-transformed into a soundwave, which is the flattening into a dense 1D vector? And the construction of that image works because the network was able to learn from examples of existing sounds transformed into images, because the transformation into an image encodes some invariance that biases the network to generalize better, or something?
I didn't do great in math in college, but I've always found deep learning interesting. Not sure if anything I said above makes sense.
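A rough sketch of the spectrogram-to-waveform step, using scipy's inverse STFT on a toy signal. Note this is a simplification: Tacotron predicts only magnitudes and has to estimate phase (Griffin-Lim in the original paper), whereas this roundtrip keeps the true phase:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                           # sample rate (assumed)
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t)   # 1 s of a 440 Hz tone

# Forward: waveform -> complex spectrogram (the "image" a model would predict).
f, times, spec = stft(wave, fs=fs, nperseg=512)

# Inverse: spectrogram -> waveform. With the true phase kept, the
# overlap-add reconstruction recovers the signal almost exactly.
_, recon = istft(spec, fs=fs, nperseg=512)

print(np.max(np.abs(recon[:wave.size] - wave)))  # ~0
```

So yes: the network learns to produce the time-frequency "image", and a fixed (non-learned) inverse transform turns it back into audio.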
Typical CNNs miss this second stage.
Almost all neural network architectures process a given input size in the same amount of time, and some applications and datasets would benefit from an "anytime" approach, where the output is gradually refined given more time.
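A minimal sketch of what such an "anytime" inference loop could look like; the stages here are hypothetical stand-ins for progressively deeper (or finer-grained) model passes, each refining the previous estimate until the time budget runs out:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def anytime_predict(x, stages, deadline):
    """Run refinement stages until the time budget is spent,
    returning the best answer available so far."""
    estimate = None
    start = time.monotonic()
    for stage in stages:
        estimate = stage(x, estimate)
        if time.monotonic() - start > deadline:
            break                      # stop early; estimate is still usable
    return estimate

# Toy "stages": each folds in another chunk of data, refining a running
# estimate of the mean (a real system would run deeper network layers).
def make_stage(chunk):
    def stage(x, est):
        part = x[chunk].mean()
        return part if est is None else 0.5 * (est + part)
    return stage

x = rng.normal(loc=3.0, size=1000)
stages = [make_stage(slice(i * 250, (i + 1) * 250)) for i in range(4)]
print(anytime_predict(x, stages, deadline=1.0))  # refined estimate near 3.0
```

The key property is that interrupting the loop at any point still yields a valid (if coarser) answer, unlike a fixed-depth forward pass.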
I understand the point you are making, but it's kind of irrelevant. The task is to produce an answer for the image at the given resolution. It is an accident and coincidence that the neural network produces an answer that is arguably correct for a blurrier version of the image.
It's not porn. It's not simulated porn. It's a hallway, and if you're not setting it as your desktop background to trick people on purpose then you're not doing anything wrong.
The results are quite interesting. Not sure if this is the latest work, but here are some results from Google's AI Blog:
Like imagine the vision part making a phone call to the natural language part to ask it for help with something.
We added attention and observed no benefits at all in our GAN experiments.
See: https://youtu.be/j0z4FweCy4M (timestamp 54.40 onwards)
I'd be really surprised if they use transformers due to how computationally expensive they are for anything involving vision.
EDIT: Found it, at the 1h mark: https://www.youtube.com/watch?v=j0z4FweCy4M&t=1h
Fascinating. I guess transformers are efficient.
But not on the raw camera input — they use regnets for that. The transformers come higher up the stack:
Transformers mentioned on the slide at timestamp 1:00:18.
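Generically, that pattern is a CNN backbone producing a small feature map that gets flattened into tokens for a transformer higher up. The sketch below is a stand-in illustration (shapes and the random "backbone" are assumptions, not Tesla's actual stack); it also shows why attention over backbone features is far cheaper than over raw pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_backbone(image):
    """Stand-in for a RegNet-style CNN: downsamples the input to a
    small feature map. (Ignores the input; a real backbone convolves it.)"""
    c, h, w = 32, 8, 8                  # assumed output shape
    return rng.normal(size=(c, h, w))

def to_tokens(feature_map):
    """Flatten the spatial grid into a token sequence for a transformer."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T   # (h*w tokens, c channels)

image = rng.normal(size=(3, 256, 256))
tokens = to_tokens(conv_backbone(image))
print(tokens.shape)  # (64, 32)
```

Attention cost scales with the square of the token count, so 64 backbone tokens are tractable where 256x256 = 65536 pixel tokens would not be.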
Edit: A random observation which just occurred to me is that their predictions seem surprisingly temporally unstable. Observe, for example, the lane layout wildly changing while the car makes a left turn at the intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can use the comma and period keys to step through the video frame-by-frame.
Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks?
Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
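Layer-by-layer comparisons like these are typically made with centered kernel alignment (CKA), which this line of work uses to measure representation similarity. A minimal linear-CKA sketch on random stand-in "activations" (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cka(x, y):
    """Linear centered kernel alignment between two sets of layer
    activations (rows = examples, columns = features)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") *
                   np.linalg.norm(y.T @ y, "fro"))

a = rng.normal(size=(200, 64))          # toy "layer activations"
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
b = a @ q                               # same information, rotated basis
c = rng.normal(size=(200, 64))          # unrelated activations

print(round(linear_cka(a, b), 6))       # 1.0: invariant to orthogonal transforms
print(linear_cka(a, c))                 # much lower for unrelated features
```

The invariance to rotations of the feature basis is what makes CKA suitable for comparing layers of different architectures, where individual neurons don't line up.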
Interestingly, HN does actually save accompanying text when you submit a link. It just doesn't show the text on the website.
Personally, I like it the way it is. Showing accompanying text for links would let anyone posting a link "force" everyone to read their comment on it. Requiring comments to be posted separately means useful accompanying comments can float to the top while a useless comment from the submitter sinks to the bottom, and the submitted link can still be voted on individually.
to help people decide if they want to click the link.
So the title may not be informative enough to let people know whether they can understand the article, whether they're interested in it, whether it's at the right technical level, and so on.
I think you're right that it will be abused in many instances and might not be worth it.