I immediately found the results suspect, and I think I have found what is actually going on. The dataset it was trained on was 2770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if a subject looks at a picture of a teddy bear, the system outputs an overfitted picture of another teddy bear from the training dataset instead.
The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The buildings generated by 'mind reading' subjects 2 and 4 look strikingly similar to each other, but not very similar to the ground truth! From manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.
If so, at most they found that looking at similar subjects lights up similar regions of the brain, and putting Stable Diffusion on top of that serves no purpose. At worst it's entirely cherry-picked coincidences.
I don’t get the criticism here. Normally I’d be the first to err on the side of skepticism, but this work seems above board.
I think the confusion is that this model is generating “teddy bear” internally, not a photo of a teddy bear. I.e. the diffusion part was added for flair, not to generate the details of the images that exist inside your mind. They could just as easily have run print("teddy bear"), but they’re sending it to diffusion instead of printing it to console.
The fact that it can correctly discern between a dozen different outputs is pretty remarkable. And that’s all that this is showing. But that’s enough.
It’s not really a “gotcha” to say that it’s showing an image from the training set. They could have replaced diffusion with showing a static image of a teddy bear.
It sounds like this is many readers’ first time confronting the fact that scientists need to do these kinds of projects to get funding. As long as they’re not being intentionally deceptive, it seems fine. There’s a line between this and that ridiculous “rat brain flies plane” myth, and this seems above it.
Disclaimer: I should probably read the paper in detail before posting this, but the criticism of “the building looks like a training image” is mostly what I’m responding to. There are only so many topics one can think about, and having a machine draw a dog when I’m thinking about my dog Pip is some next-level sci-fi “we live in the future” stuff. Even if it doesn’t look like Pip, does it really matter?
Besides, it’s a matter of time till they correlate which parts of the brain are more prone to activating for specific details of the image you’re thinking about. Getting pose and color right would go a long way. So this is a resolution problem; we need more accurate brain-sampling techniques, e.g. Neuralink. Then I’m sure diffusion will get a lot more of those details correct.
Because pretty much everybody that reads the article will have taken away a grossly exaggerated idea of what the system is actually capable of. If Stable Diffusion was intentionally added "for flair" and really is unnecessary, then I would absolutely say that the researchers were being intentionally deceptive.
Even if we do a massive goalpost-move and grant that the system is only identifying the label "dog" from a brain scan of a person looking at a dog, we would need to see actual statistics of its labelling accuracy before judging it that way. If the images in the paper are cherry-picked(1), it could easily be extracting only a handful of bits, or no bits at all, and the entire thing could very well turn out to be replicable from random noise.
(1) Note that the paper even states "We generated five images for each test image and selected the generated images with highest PSMs [perceptual similarity metrics].", so it even directly admits that the presented images are cherry-picked at least once.
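To see why "generate five, keep the best by a similarity metric" matters, here is a toy simulation (not the paper's data): even when the "reconstructions" carry zero information about the target, selecting the best of five candidates inflates the apparent similarity score.

```python
import random
import statistics

random.seed(0)

def psm(_):
    # Stand-in for a perceptual similarity metric: pure random noise,
    # i.e. the "reconstruction" carries no information about the target.
    return random.gauss(0.0, 1.0)

# Score of a single generated image vs. the best of five candidates.
single = [psm(None) for _ in range(10_000)]
best_of_5 = [max(psm(None) for _ in range(5)) for _ in range(10_000)]

print(f"mean single score:    {statistics.mean(single):+.2f}")
print(f"mean best-of-5 score: {statistics.mean(best_of_5):+.2f}")
```

The single-sample mean sits near zero while the best-of-five mean rises to roughly the expected maximum of five standard normals (about 1.16), purely from selection.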
We can take fMRI scans when people are looking at images and generate blurry blobs that do indeed resemble the images spatially.
We can predict a text label of the image the person is looking at using another technique.
If you use SD just on the text labels and you generate an image, you get the semantic content, but not the spatial content.
If you combine the image and the text label and run it through an LDM then you get pictures that more closely match both the semantic and spatial characteristics of the images shown to the person.
That’s my understanding as well. It all depends whether their technique really can do this. If it can, it’s solid work imo. If it can’t (better than random chance), then it’s bunk.
There’s not much way to know other than to try it and see. But that’s true of almost every paper in ML. Some of them suck, some of them are great, but they all contribute something in their own way. Even the “rat brain flies plane” paper (as much as I despise it) showed that you can change the activity of rat neurons in a lab setting.
I'm definitely not an expert in this subject, but even if the model is overfitted, doesn't the fact that it can pull out the similar images at all give credit to the idea that a larger, non-overfitted model could actually work as the paper describes? It means that there does exist some correlation between the shown subject, the captured fMRI data, and the resulting location in latent space.
The output part is basically nonsense. It would be more honest if the output were text, e.g. "teddy bear" instead of a bad image of a random teddy bear.
In this specific case I agree, since the model may be overfitted, it seems like it's currently just a glorified object classifier based on what was in the training data, but the fact that it works at all may indicate that the underlying idea has merit. They would probably have to train a much larger network to see if it's able to separate features distinctly enough using the input fMRI data to be useful.
The problem is that it's impossible to know what is in the fMRI data and what is hallucinated by the reconstruction.
In this case, the real bear has a blue ribbon and the "reconstructed" bear has a red ribbon. Is the ribbon in the fMRI data and the computer chose the wrong color, or did most of the images in the training set have ribbons and the computer just added one?
Imagine something like this is used in the future to build something like https://en.wikipedia.org/wiki/Facial_composite . People may give too much importance to the details and arrest someone only because the computer imagined some detail, like the logo on a baseball cap.
> Imagine something like this is used in the future to build something like https://en.wikipedia.org/wiki/Facial_composite . People may give too much importance to the details and arrest someone only because the computer imagined some detail, like the logo on a baseball cap.
Wow, we went from "tech not working" to "tech might kill someone" super fast here.
In the real world when tech doesn't work people die.
OP is right to be concerned. This kind of tech (magickal mind-reading AI?!) is going to be bought up by security agencies, who will not understand its limitations and will misuse it to accuse people of crimes they aren't related to.
There is ample precedent. Just for one recent example, see plans to use an "AI lie detector" based on discredited pseudo-science at EU borders:
The picture is a high-resolution image that makes the system look accurate. They don't use the AI buzzword, but my guess is it's only a matter of time. Anyway, the important paragraph is:
> Seeing the composite image with no context or knowledge of DNA phenotyping, can mislead people into believing that the suspect looks exactly like the DNA profile. “Many members of the public that see this generated image will be unaware that it's a digital approximation, that age, weight, hairstyle, and face shape may be very different, and that accuracy of skin/hair/eye color is approximate,” Schroeder said.
It's not an object classifier at all. They had to text-prompt the system first. I think the general idea is using the fMRI data as the pseudorandom initialization for the latent diffusion model to explore.
From what I understand, regular Stable Diffusion starts by generating noise and then iteratively hallucinates modifications that remove some of that noise. The more steps you let it run, the better the results.
So instead of just starting with a meaningless random noise, they're using the fMRI data to start. But if you didn't have the text prompt, you wouldn't get the right image. If you were looking at a cat but told it you were looking at a house, you'd probably end up with a small house, similar to one in its training set, positioned roughly where the cat was located in the original image.
Briefly reading the paper, it seems they trained 2 models (using data from different stages in the visual cortex) to generate latent vectors for both the visual and textual representations of the fMRI data, then feed those into Stable Diffusion. Those are the models that would be overfit in this case: instead of encoding features like "toy, animal, fluffy, brown, ears, nose, arms, legs" individually, they are likely just encoding all of those features combined into a generic "teddy bear" because the input dataset is too small. Obviously this is an oversimplification, but hopefully you get what I mean. I didn't mean it was literally an object classifier, but that a model like this, with a dataset so small, does not have the ability to extrapolate fine details. With a larger dataset and more training, it may be able to actually do that.
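For concreteness, the paper says those mappings are just L2-regularized linear regressions from fMRI signals to each LDM component. Here is a closed-form ridge-regression sketch on synthetic stand-in data (random "voxels" and latent vectors; none of this is the paper's real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: "voxel" responses and the latent vectors an LDM
# would condition on.
n_trials, n_voxels, latent_dim = 500, 200, 16
true_w = rng.normal(size=(n_voxels, latent_dim)) / np.sqrt(n_voxels)
fmri = rng.normal(size=(n_trials, n_voxels))
latents = fmri @ true_w + 0.1 * rng.normal(size=(n_trials, latent_dim))

def ridge_fit(X, Y, alpha):
    # Closed-form L2-regularized linear regression: one weight matrix
    # mapping fMRI voxels directly to an LDM latent component.
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

X_train, Y_train = fmri[:400], latents[:400]
X_test, Y_test = fmri[400:], latents[400:]

W = ridge_fit(X_train, Y_train, alpha=10.0)
pred = X_test @ W
r2 = 1 - np.sum((Y_test - pred) ** 2) / np.sum((Y_test - Y_test.mean()) ** 2)
print(f"held-out R^2: {r2:.2f}")
```

On this synthetic (genuinely linear) data the held-out fit is good; the open question in the paper is how much of the real fMRI-to-latent relationship such a linear map can capture without memorizing the small training set.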
Largely agree with this, although I think it would be interesting to formulate in terms of: "what is the mutual information between the fMRI scan and the stimulus".
i.e., is there actually more information than a few bits encoding a crude object category, with Stable Diffusion then hallucinating the rest (or regurgitating an overfit image)?
Or are there many bits, corresponding spatially to different regions of the stimulus, allowing for some meaningful degree of generalization?
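That framing can be made concrete with a plug-in mutual-information estimate over (true label, decoded label) pairs. A sketch, using made-up labels: a decoder that only ever recovers a crude 10-way category caps the extractable information at log2(10), about 3.3 bits per image, no matter how detailed the generated pictures look.

```python
import math
from collections import Counter

def mutual_information(pairs):
    # Plug-in estimate of I(X;Y) in bits from (true, decoded) label pairs.
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# A perfect 10-way category decoder extracts exactly log2(10) bits.
perfect = [(i % 10, i % 10) for i in range(1000)]
print(f"{mutual_information(perfect):.2f} bits")  # ≈ 3.32
```

Running the same estimate on the paper's actual decoded outputs would tell you whether it carries more than those few category bits.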
If you train a model where the input is an integer between 1 and 10, and the output is a specific image from a set of ten, the model will be able to get zero loss on the task. That is what's happening here.
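As an illustration of that point (a toy sketch, not the paper's setup): a memorizing model over ten stored "fMRI patterns" trivially reaches zero training loss while reconstructing nothing.

```python
import random

random.seed(0)

# Ten "fMRI patterns" (here just random vectors) paired with ten image IDs.
patterns = {i: [random.random() for _ in range(50)] for i in range(10)}

def nearest_neighbor(query):
    # A memorizing "model": return the ID of the stored pattern closest
    # to the query, by squared Euclidean distance.
    return min(
        patterns,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(patterns[i], query)),
    )

# Zero training loss: every stored pattern maps back to its own image,
# which demonstrates memorization, not reconstruction.
loss = sum(nearest_neighbor(patterns[i]) != i for i in range(10))
print(f"training loss: {loss}")  # 0
```

Zero loss here tells you nothing about how the model behaves on a pattern it has never stored, which is exactly the overfitting worry upthread.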
Are you saying the demonstrated results are all in sample? Because this is definitely not true for out of sample data. And the GP comment implies that there is in fact a validation/holdout set.
It's still a legitimate direction to pursue. Once you get to large enough training sets, it's basically the same way our own brains work. We don't perceive or remember all the details of a building - just "building, style 19B", plus a few extra generic parameters like distance, angle, color and so on. Totally manageable for deep learning to recognize, and perhaps even combine.
We performed visual reconstruction from fMRI signals using LDM in three simple steps as follows (Figure 2, middle). The only training required in our method is to construct linear models that map fMRI signals to each LDM component, and no training or fine-tuning of deep-learning models is needed. We used the default parameters of image-to-image and text-to-image codes provided by the authors of LDM, including the parameters used for the DDIM sampler. See Appendix A for details.
I am pretty sure that this is just per person. So all it does is categorize complex brain patterns of one person into 10 category numbers and then do some hula hoop to display the numbers.
Good find. When I read it I called bullshit, but I got lost trying to understand the diagrams.
Another gotcha is the semantic decoder: they are just looping the model on itself. "A cozy teddy bear" + random fMRI input => a teddy bear!!!
Subject 4 in the first row also looks very different from the ground truth, but is clearly an airliner. I'm curious if there is also a closer match to that one in the training set.
> The dataset it was trained on was 2770 images, minus 982 of those used for validation.
I don't think you got that 2770 correct. Might be 9250 images, minus 982 (that one you got right). Then again, the paper is so badly written, I find it difficult to decipher what they did. From section 3.1:
Briefly, NSD provides data acquired from a 7-Tesla fMRI scanner over 30–40 sessions during which each subject viewed three repetitions of 10,000 images. We analyzed data for four of the eight subjects who completed all imaging sessions (subj01, subj02, subj05, and subj07).
We used 27,750 trials from NSD for each subject (2,250 trials out of the total 30,000 trials were not publicly released by NSD). For a subset of those trials (N=2,770 trials), 982 images
were viewed by all four subjects. Those trials were used as the test dataset, while the remaining trials (N=24,980) were used as the training dataset.
I feel like you might be moving the goal posts here a bit. Getting a reconstruction that is a bear, even if not the same bear, is impressive enough to be noteworthy.
I think the point is that it's not a reconstruction. It's more like recognizing which letter of a thousand-letter alphabet is shown to the human after decoding their brain waves. Still impressive, but not really as impressive as visual reconstruction.
TBH, I was not impressed up until now, but given the videos I have in mind of people trying to use brain-computer interfaces to type text, now I'm impressed.
The only training required in our method is to construct linear models that map fMRI signals to each LDM component, and no training or fine-tuning of deep-learning models is needed.

...

To construct models from fMRI to the components of LDM, we used L2-regularized linear regression, and all models were built on a per subject basis. Weights were estimated from training data, and regularization parameters were explored during the training using 5-fold cross-validation.
1. https://i.imgur.com/ILCD2Mu.png
2. https://i.imgur.com/ftMlGq8.png