The model was trained to match an image with its caption. So it's unsurprising to find a neuron that fires for both Spider-Man the character and a spider: both are things a valid caption might mention. Same with the "iPod" example. It seems a stretch to suggest equivalence to biological neurons (unless you think those are also trained by text supervision, which is an open hypothesis, I suppose).
I think the surprising thing is that it's the same exact neuron that does the different modes, not that the network has the ability to detect a photo of spiderman and a picture with the text "spiderman". You might instead imagine that the network would have some neurons specialized to reading text, and others specialized to recognizing faces, and those would be like two separate paths through the network.
I think the OP meant that the network writes "spider-man" in both cases. So the multi-modal "Spider-Man" neuron is more of a "write spider-man" neuron. It is impossible to tell (at the moment) whether its formation and purpose are really comparable to how biological multi-modal "think about Spider-Man" neurons evolve and work. IMO it is not very probable.
Well, that's not really what's going on here. CLIP has two components - a text encoder that encodes candidate captions, and an image encoder. There's no part that does "writing" - it just makes an encoding for the image, and then sees which candidate text encoding is the most similar. Further, what's being looked at here, as I understand it, is JUST the image encoder part. The neuron in question isn't seeing or generating caption text, it's just a step along the way in trying to come up with a representation of the image. So it's surprising that that same neuron is strongly activated both by the word "spiderman" appearing in an image, and by an actual picture of spiderman.
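To make that concrete, here's a rough sketch of the matching step. The encoders below are just stand-ins that return random vectors (not CLIP's actual ResNet and Transformer towers), so the scores are meaningless - it's only meant to show that inference is pure embedding comparison:

    import torch
    import torch.nn.functional as F

    # Stand-ins for CLIP's two towers (image encoder, text encoder); they
    # return random embeddings so the matching logic is runnable on its own.
    def encode_image(image):
        return torch.randn(512)

    def encode_text(caption):
        return torch.randn(512)

    def rank_captions(image, captions):
        # At inference time CLIP only compares embeddings by cosine
        # similarity -- it never generates or "writes" any text.
        img = F.normalize(encode_image(image), dim=-1)
        txt = F.normalize(torch.stack([encode_text(c) for c in captions]), dim=-1)
        sims = txt @ img  # one similarity score per candidate caption
        return sorted(zip(captions, sims.tolist()), key=lambda p: -p[1])

    print(rank_captions("photo.jpg", ["a spider", "spider-man", "a piggy bank"]))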
If the text encoder's representation of "spiderman" and "spider" are close together, then the image encoder is penalized if it makes the image representations far apart from each other. So what we are witnessing here could simply be the neuron that says "I think 'spider' appears in the caption". Of course it could be that the network understands the metaphorical relationship of spiderman to spiders, but the simpler explanation seems more plausible to me.
Right, so I don't think the surprise is that the neuron responds to a picture of spiderman and a picture of a spider. What's surprising is that it responds to a picture of spiderman, and a picture of the text "spider". The thing that's interesting is not that it understands that the concept of a spider and the concept of spiderman are close together, it's that the same neuron is responsible for both the visual depiction of spiderman, and for an image of the text. In previous networks, you'd maybe have a neuron that detected a picture of text in general, but not one that would look for a particular word as well as the image it corresponds to.
If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But, I see your point.
Yeah, I mean it's definitely the case that the signal here is that the same words occur in captions for pictures of text as for pictures of objects. But if you do this same experiment of visualizing dataset examples for other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
I am not sure if I get it, but isn't the CLIP "multi-modal" neuron just considered "multi-modal" because it occurs some layers before the actual output? I am not sure, but maybe the indirection is just obfuscation and not a sign of abstraction.
I meant "write" not in a literal sense. "CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset" Isn't this the implicit coupling between text and image that is observed as multi-modal neurons?
Well, the text encoder sees the ascii characters s-p-i-d-e-r (after byte-pair encoding). That's different from seeing a photograph of a piece of paper that says "spider" on it. It's not surprising that the network can associate a picture of spiderman with a caption that contains the text "spider", but rather that the same neuron lights up when you show it a piece of paper that says "spider" as when you show it a picture of spiderman.
Maybe I don't get something about CLIP. But won't there be the same labels, and as a result the same pairings, for a written piece of paper with "spider" on it and a picture of Spider-Man?
The labels are just whatever people on the internet wrote next to the image. Certainly there are some instances of things like "this is a picture that says 'spider'" or whatever (probably a little more natural than that), or else the network would have no way of learning to read. But what's interesting here is that it's the same neuron doing the reading and doing the recognizing of Spider-Man's head. That's not the only way that it could have solved the problem. There could have been some dimensions of the representation vector used for reading text, and others for recognizing visual objects, and those would be handled by separate subsets of neurons in the network.
Maybe it just recognizes pixel soups? Why should it know the difference between a piece of text and a real spider? It's just our interpretation of the image that makes it "multi-modal". CLIP probably just categorizes certain kinds of white and black patterns as a special kind of spider that happens to also look like a piece of paper and instances of text.
I mean, yeah, it does just recognize pixel soups - all the neurons are just semi-scrutable combinations of other features. It's probably the case that there are some early neurons that recognize various letters, and so you'd have some subset of neurons that are shared between the "spiderman" neuron circuit and the circuits that are used by other neurons that recognize other words. I don't know how much credit you'd give that for "reading", but I'd say it at least would qualify as multi-modal.
So if the network can recognize species by their rear view, you would consider it also multi-modal? Because that is what I am trying to say... there are probably no higher-level concepts here, just various different pixel soups that happen to need the same label.
If the trained network produces common labels for sets of pixel soups that we consider to be semantically related, but are not visually related, then that is interesting.
I agree. But isn't it more probable that some kind of arbitrary "OR" logic forms, rather than the "real abstraction" implied by choosing the word "multi-modal"?
I guess we see something like this:
e.g. "Photo of spider" -> Hierarchy of pixel soups -> "Photo of spider" OR "Photo of word spider" OR "Spider rear view" OR "Spiderman" OR ... -> [Spider]
What I think the authors mean when they call it multi-modal:
"Photo of spider" -> "Characteristics of a spider" -> [Spider]
"Photo of word spider" -> "Letters S-P-I-D-E-R" -> "Written word spider" -> [Spider]
I guess there are two ways of looking at that question.
One is just basic generalisation - do these neurons effectively capture things within their semantic group (whatever that means) but completely outside of the training data? If yes, then I guess the answer might be 'yes, in some sense it is like a "real abstraction"'.
Second, and (afaik) currently a more philosophical framing - it isn't obvious whether "(sufficiently advanced) OR logic" and "real abstraction" are actually different. Additionally, for the purposes of a model like this one, I find it hard to see how they could be different. The best the model can do is (roughly speaking) assign neurons to particular concepts, be they ones that fit with our mental models of the world, or ones that are more "functional". The better a job it can do of the former, the more we might be inclined to believe that it is modelling things as we understand them.
A long time ago I implemented a Neocognitron. As far as I understand, most deep learning models are very similar in architecture. My Neocognitron could recognize 1 and 0 (greyscale). I handcrafted the weights and hierarchy (as you do with a normal beginner Neocognitron). When you wrote a 1 into the hole of the 0, or multiple 0s next to a 1, you could see how it sometimes weighted the 1 or the 0 higher. It's not very far-fetched to add an "OR" layer that can combine patterns of 1 and 0. Yet I wouldn't call this layer multi-modal. I guess that in CLIP's case, due to the training method and the output alignment, this kind of OR layer simply occurs in some cases.
When I talked about OR logic I meant a simple and direct aggregation of all possible pixel soups that should be labeled in a specific way. Calling this multimodal and comparing it to the Jennifer Aniston neuron is framing the situation in a way that is not good for science (see AI winter). Especially when multimodal neurons in neuroscience refer to processing more than one sense.
I find it hard to imagine a different solution than sufficiently advanced OR logic for real abstraction within current ANN models. That does not mean there is none within ANNs, and especially not in the brain. We are far from being able to do anything like what the brain does, and we don't know how it learns new concepts so fast - without thousands of sample images. Just because we observe an effect in ANNs that reminds us of some vague findings in the brain, and we give the effect the same name, does not make ANNs more like brains.
I’ve always thought it’s wild how we can apply one concept to so many different types of things. For example, if I say something is “soft,” you probably think of the opposite of firmness. But at the same time, I can describe a person as “soft,” and the same descriptor can say something meaningful about their character.
Seeing the Spider-Man neuron work on multiple types (pictures, drawings, text) makes it seem like we can teach AI to learn these same type connections.
And if we scale up the network size enough, what if we could see these types through the equivalent of a being with 1000 IQ? What connection types are the most effective for a being like that? Can we even understand them? Maybe they would be deep and archetypal, in the way that Odysseus and Harry Potter are the same despite the fact that one is an ancient Greek king and the other is a modern British wizard. Even more interestingly, maybe the connections would be completely inexplicable to us, with no apparent rhyme or reason perceptible to humans.
One of the amazing things about this project exploring CLIP was seeing some hints of this. For example, one day I was studying one of the Africa neurons and it generated the text "IMBEWU" -- it turns out this is a popular TV show in South Africa (https://en.wikipedia.org/wiki/Imbewu:_The_Seed). That's a trivial example, but it begins to hint at something interesting.
I'd really love to see what a domain expert analyzing CLIP would make of things. For example, I'd love to hear what ethnographers think of the region neurons, or what historians think of the time period neurons. Especially for future, larger models.
> What connection types are the most effective for a being like that?
That would be analogous to asking: what kinds of tokens, when put together, make state-of-the-art algorithms? They are just tokens, even in the best implementation.
Intelligence is in the game, including the agent, the environment and goals. It's not in some kind of special neuron connections, it's in the way the brain is connected to opportunities and dangers outside.
I don't know, I have pretty mixed feelings about the whole thing.
In theory, it appears that the articles are supposed to be self-contained Git repositories. In practice, this article loads a bunch of assets from Google's Cloud Storage. Blocking them appears to partially break various bits in subtle ways - figure 6 is particularly bad. It also fails to render correctly without a couple of third-party scripts hosted by Cloudflare. It's hardly an ideal state of affairs.
Articles that aren't self-contained make it frustratingly difficult to view offline copies. They also raise the possibility of the publication changing over time, as anything hosted by a third party is outside the control of the journal. This lack of an immutable version is highly concerning because it means that no single canonical form exists for incorporation into the scholarly record. (Note that even retracted, entirely fraudulent papers have traditionally remained available from the original publisher with appropriate warning labels attached.)
Reflowable text and interactive figures are very welcome additions, but the concept of a mutable scientific literature is absolutely horrifying.
Figure 1 has an intriguing caption: "You can click on any neuron to open it up in OpenAI Microscope to see feature visualizations, dataset examples that maximally activate the neuron, and more."
I have been unable to get this to work. Does it work for anyone else?
(Each neuron has a dropdown menu, but it sounds like there's more to it.)
Oh woah, thanks for catching that! The ability to change each neuron's facet on hover was added towards the end, but it seems to be blocking one from clicking the <a> tag.
The misclassification of a poodle as a piggy bank by putting "$$$" text on it is amazing to see work:
> The finance neuron [1330], for example, responds to images of piggy banks, but also responds to the string "$$$". By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.
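If you want to try reproducing that, the attack is literally just pasting text onto the photo before encoding it. A minimal sketch, assuming PIL for the overlay; classify() is a hypothetical placeholder for whatever zero-shot CLIP pipeline you're running:

    from PIL import Image, ImageDraw

    def add_text_patch(path, text="$$$", out="attacked.jpg"):
        # Overlay a text label on the photo, roughly like the handwritten
        # labels in the paper's typographic attack examples.
        img = Image.open(path).convert("RGB")
        draw = ImageDraw.Draw(img)
        draw.text((20, 20), text, fill="black")  # default PIL bitmap font
        img.save(out)
        return out

    # classify("poodle.jpg")                   -> e.g. "standard poodle"
    # classify(add_text_patch("poodle.jpg"))   -> e.g. "piggy bank"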
I thought neuroscience had largely moved away from the idea of "grandmother neurons" in favor of sparsity [1]. This seems to suggest that the model is functioning differently than the human brain...
I'm impressed with OpenAI confronting this head on.
"Our model, despite being trained on a curated subset of the internet, still inherits its many unchecked biases and associations."
If these models find their way into production environments - if they are good enough and profitable enough - they will eventually become legacy systems quietly perpetuating the biases of past times.
What's interesting about this to me is that ultimately we _want_ our AIs to learn bias. The whole point of a predictive AI is to model the behavior of the thing it's predicting. So an AI trained against humans, which CLIP is, must by necessity learn our prejudices. If it didn't, it wouldn't be good at predicting how we describe images.
The model learning bias isn't the issue. You could ask me what I think the racist members of my family might write about a given image. I'd then be able to emulate them inside my head and accurately predict their responses. We all do that. It's how we have moments like "I knew you were going to say that."; "That's typical of you to say."; "Why am I not surprised?"; etc. The fact that I, and everyone else, can do that does not imply that we are biased. It's how we behave that determines if we are biased.
We want our AIs to do the same.
The real ethics question here is not how we prevent AIs from learning bias. It's how we get AIs to not _express_ those biases. We need a way to put them into "impartial" mode, much like we take biased and fallible humans and make them judges in courtrooms.
Personally I don't think that's going to be as hard as some imagine. Again, remember that these AIs are learning to emulate humans, _including_ judges. Give GPT-* a bunch of court documents and transcripts and it will learn the capacity to emulate a judge. Then you just need to carefully craft its prompt text so that for any given query, you can be reasonably sure it's acting impartially.
I think that's the challenging part about bias. If you can make that discrimination (no pun intended) then it's a feature which you can control.
The irony is that the black bias mentioned in this paper is probably due to inherent bias in image processing algorithms of the sensors themselves. Look up the "Shirley Card"
I am interested to know if it's possible to correct biases after training without resorting to retraining and training data curation.
As for prompt engineering for GPT, it feels a bit like reading tea leaves. I'm not sure if it is possible to know for certain that a specific prompt will elicit the desired behavior all the time.
I wouldn't say it's only past times. You would be surprised how many people still hold those opinions (not only white-supremacist types). The internet just brings out the things people sometimes think but do not express in their everyday lives (unless they're under some kind of pressure or stressed).
So for now these models can be treated as a very reductive snapshot of humanity (at least its English-speaking segment), warts and all.
Can someone give more technical detail on what they are showing with the "neurons"?
They say "Each neuron is represented by a feature visualization with a human-chosen concept labels to help quickly provide a sense of each neuron", and these neurons are selected from the final layer. I don't think I understand this.
The ones you see in this work are mostly a variant of the standard feature visualization, which tries to show different "facets" in neurons that respond to multiple things. The details are explained in the appendix of the paper (https://distill.pub/2021/multimodal-neurons/).
Worth noting that Chris Olah (who wrote this comment) has led much of the interesting work in making feature visualisations useful. If you look for "Visualizing Neural Networks" on his page https://colah.github.io/ you'll find lots of other interesting links in this area.
Has there been any study of variability in these activation images - like, are there many disconnected local maxima depending on the initialization, or how do they vary when retraining the network (or e.g. with dropout, etc.), or when varying the model parameters in some direction that keeps the loss in a local minimum?
I could picture that maybe they always look the same, but sometimes there would be cases where they have different modes that accomplish the same thing.
Quick unrelated question as you seem to be a subject expert. Are there currently any neural network models that deal with multiple separate networks that occasionally trade or share nodes? It might be useful for modeling a system where members of an organization may leave and join another organization such as a business or a church
Start with random input, then incrementally optimize the input to maximize the activation of one of the nodes in the graph, the neuron. The visualization is one of those inputs that hit a maximum.
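In code it's roughly the toy gradient-ascent loop below; model and neuron_idx are placeholders for whatever network and unit you're probing, and real feature visualization (e.g. the lucid/lucent libraries) adds jitter, frequency-space parameterization and other regularizers so the result is more than high-frequency noise:

    import torch

    def visualize_neuron(model, neuron_idx, steps=256, lr=0.05, size=224):
        # Start from random noise and nudge the pixels so that one unit
        # fires as strongly as possible.
        param = torch.randn(1, 3, size, size, requires_grad=True)
        opt = torch.optim.Adam([param], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            img = param.sigmoid()                 # keep pixels in [0, 1]
            loss = -model(img)[0, neuron_idx]     # maximize the chosen activation
            loss.backward()
            opt.step()
        return param.detach().sigmoid()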
During the initial research into multi-layer neural networks, it appeared that only the input and output layers had any human-comprehensible meaning; anything else would be an indecipherable vector of how much weight each item on the previous layer should have, e.g. 0.1*pixel[0] + 0.004*pixel[2] + … + 0.4*pixel[1024]
In more recent research, people noticed that sensible labels can be given by seeing what sort of inputs activated it. For example, if that neuron lights up with any picture with a lot of right-angles, and nothing else, a researcher may call it “the right-angle neuron”.
In this case, the researchers did this to the CLIP network and found it was much easier than usual for humans to come up with sensible names for the type of thing any given neuron would recognise.
You can also demonstrate what a neuron does by running the maths in reverse to get an input which would activate that neuron more strongly than any other input; in this case an image.
> In more recent research, people noticed that sensible labels can be given by seeing what sort of inputs activated it. For example, if that neuron lights up with any picture with a lot of right-angles, and nothing else, a researcher may call it “the right-angle neuron”.
Isn’t this also how neuroscientists identified brain regions?
I guess it makes sense this would happen but it’s curious that we’ve been able to build a blackbox algorithm/technique that’s as difficult to reverse engineer as real brains.
I'm not sure what you would call it, but is there any research into the dilemma that exists when an artificial intelligence surpasses its creator? How exactly would we evaluate its decisions?
I'm not suggesting this example is the case, but with AlphaZero, for example, many of the moves are beyond an average player's comprehension. What happens when it's beyond the greatest of us?
Yes, there is research in this area. It's controversial because most practising researchers in the field of building AI don't believe it is a concern on any close timescale.
Note that a very significant fraction do believe human level machine intelligence is likely within about 50 years, and a good few on closer timescales, though the precise numbers are endlessly debatable. Two major AI corporations, DeepMind and OpenAI, are explicitly aiming to create AGI.
Although ML has definitely developed something, it has nothing to do with AGI, so they don't seem anywhere close. Instead it seems like we've gotten good at inventing visual cortexes.
AGI would involve having some motivation to stay alive, affect the world and imagine things, instead of just using a fixed amount of CPU to point out whether something looks like a cat.
From the article:
> Unlike humans, CLIP can’t slow down to compensate for the harder task. Instead of taking a longer amount of time for the incongruent stimuli, it has a very high error rate.
I like to keep personal opinions out of statistical data about opinions, since they are very different qualities of argument. Any individual take doesn't meaningfully affect the distribution of opinions. The key is that there is no argument from consensus to be made; smart people disagree, AGI within our lifetimes is not a niche expectation.
Anyhow, I suspect you're not familiar with ML research, since, e.g., there are plenty of non-feedforward networks available. I also don't think much of this really matters; it is absurdly anthropocentric to expect ‘motivation to stay alive’ to be fundamental to intelligence.
"Martin Ford interviewed 23 of the most prominent men and women who are working in AI today, including DeepMind CEO Demis Hassabis, Google AI Chief Jeff Dean, and Stanford AI director Fei-Fei Li. In an informal survey, Ford asked each of them to guess by which year there will be at least a 50 percent chance of AGI being built.... Ray Kurzweil, a futurist and director of engineering at Google, suggested that by 2029, there would be a 50 percent chance of AGI being built, and Rodney Brooks, roboticist and co-founder of iRobot, went for 2200. The rest of the guesses were scattered between these two extremes, with the average estimate being 2099 — 81 years from now."
I don't think there's a single timestamp for this since the exact question was never asked. But the gist is very explicit in his numerous talks (which, note, include both the Future of Life Institute and the Singularity Summit), and I think https://youtu.be/d-bvsJWmqlc?t=203 sums it up beyond reasonable doubt.
> I can point at "Geoffrey Hinton and Demis Hassabis: AGI is nowhere close to being a reality"
This is a disaster of a headline: Hinton does say that, but Hassabis never does.
> The rest of the guesses were scattered between these two extremes, with the average estimate being 2099 — 81 years from now.
The bars are quartiles, so 25% of the probability mass is for ~53 years away or closer.
Note that timelines for the HLMI variant of the question[1], which was asked of more people, had more like a 50% aggregate probability for 50 years away, per Figure 1.
[1] “High-level machine intelligence” (HLMI) is achieved when unaided machines can accomplish every task better and more cheaply than human workers.
> A CLIP model consists of two sides, a ResNet vision model and a Transformer language model, trained to align pairs of images and text from the internet using a contrastive loss.
It isn't surprising that a model trained for aligning photos and corresponding text labels has neurons that have learned higher-level semantic concepts, as that is exactly the goal of generalisation.
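For reference, the contrastive objective is roughly the symmetric cross-entropy sketched below (my own minimal version of the loss described in the CLIP paper, not their actual code; the 0.07 temperature is just a common starting value, CLIP learns it):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: [batch, dim]; row i of each is a matched pair.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature  # pairwise similarities
        targets = torch.arange(len(logits))              # matched pairs on the diagonal
        loss_i = F.cross_entropy(logits, targets)        # image -> text
        loss_t = F.cross_entropy(logits.t(), targets)    # text -> image
        return (loss_i + loss_t) / 2

Anything that pulls an image embedding toward its caption's embedding lowers this loss, which is exactly the pressure that produces those shared "concept" directions.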
Some of the answers aren't that wrong; the model is trying to answer a silly question. The authors are asking what the image "is", but an image isn't one thing. They're also covering up the most identifying parts of the original object - like with the coffee cup, where they cover the handle with the label.