Interesting to me that this one can draw legible text. DALLE models seem to generate weird glyphs that only look like text. The examples they show here have perfectly legible characters and correct spelling. The difference between this and DALLE makes me suspicious / curious. I wish I could play with this model.
Imagen takes text embeddings; the OpenAI model takes image embeddings instead, and that's the reason. There are other models that can generate text: latent diffusion trained on LAION-400M, GLIDE, DALL-E (1).
My understanding of the terms text and image embeddings is that they are ways of representing text or images as vectors. But, I don't understand how that would help with the process of actually drawing the symbols for those letters.
If the model takes text embeddings/tokens as an input, it can create a connection between the caption and the text on the image (sometimes they are really similar).
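For a rough idea of what "takes text embeddings as an input" means in practice, here's a minimal sketch using the HuggingFace transformers API (Imagen uses a frozen T5-XXL encoder; the checkpoint name and plumbing below are just my assumptions):

    from transformers import T5Tokenizer, T5EncoderModel
    import torch

    # Frozen text encoder: the image generator conditions on these
    # per-token embeddings, so word- and character-level structure
    # from the caption is still available when "painting" text.
    tokenizer = T5Tokenizer.from_pretrained("t5-large")
    encoder = T5EncoderModel.from_pretrained("t5-large").eval()

    caption = 'A storefront with the word "HELLO" written on it.'
    tokens = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        text_emb = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

    # text_emb is then fed into the diffusion model (e.g. via cross-attention),
    # rather than a single pooled image embedding as in unCLIP.

Because the conditioning is a full sequence of token-level embeddings rather than one pooled image vector, the spelling in the caption is still "visible" to the image generator, which is the connection the parent comment is describing.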
DALLE1 was able to render text[0]. That DALLE2 isn't able to is probably a tradeoff introduced by unCLIP in exchange for more diverse results. Now the Google model is better yet and doesn't have to make that tradeoff.
I blame it on the surprising structural cleverness of a bicycle. Opposing triangles probably isn't the first thing most people think of when they think of a bicycle (vs two wheels and some handlebars).
I only see the problem for the paintings. If you choose a photo it's good. Could be a problem in the source data (i.e. paintings of mechanical objects are imperfect).
The latent-diffusion[1] one I've been playing with is not terrible at drawing legible text but generally awful at actually drawing the text you want (cf. [2]) (or drawing text when you don't want any.)
I know that some monstrous majority of cognitive processing is visual, hence the attention these visually creative models are rightfully getting, but personally I am much more interested in auditory information and would love to see a promptable model for music. Was just listening to "Land Down Under" from Men At Work. Would love to be able to prompt for another artist I have liked: "Tricky playing Land Down Under." I know of various generative music projects, going back decades, and would appreciate pointers, but as far as I am aware we are still some ways from Imagen/Dalle for music?
Voice synthesis has been going steady. Lots of commercial and hobbyist interest: you can use 15.ai for crackerjack SaaS voice synthesis in a slick free UI; and if you want to run the models yourselves, Tortoise just released a FLOSS stack of remarkable quality.
Music, I'm afraid, appears stuck in the doldrums of small one-offs doing stuff like MIDI. Nothing like the breadth & quality of Jukebox has come out since it, even though it's super-obvious that there is a big overhang there and applying diffusion & other new methods would give you something much like DALL-E 2 / Imagen for general music.
I agree. How cool would it be to get an 8 min version of your favorite song? Or an instant DnB remix? Or 10 more songs in the style of your favorite album?
Yeah. I particularly love covers and often can hear in my head X playing Y's song. Would love tools to experiment with that for real.
In practice, my guess is that even though Dall-e level performance in music generation would be stunning and incredible, it would also be tiresome and predictable to consume on any extended basis. I mean- that's my reaction to Dall-e- I find the images astonishing and magical but can only look at them for limited periods of time. At these early stages in this new world the outputs of real individual brains are still more interesting.
But having tools like this to facilitate creation and inspiration by those brains- would be so so cool.
I believe that this tech is possible, but this site doesn't provide it. Look at the source of the page: it's just a bunch of sleeps and then you 'download' the same file you provided.
The tech may be possible, but it won't solve anyone's copyright problems. The result would be a "derived work" of the original, irrespective of whether it sounded similar or not.
> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.
There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.
I think that's unsurprising. With DALL-E 1, for example, scaling the VAE (the image model generating the actual pixels) hits very fast diminishing returns, and all your compute goes into the 'text encoder' generating the token sequence.
Particularly as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & control of the prompt to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help much: if the input seed is incorrect and doesn't correctly encode a thumbnail sketch of how all these animals ought to be engaging in outdoors sports, there's nothing some low-level pixel-munging neurons can do to help much.
I was thinking more about our traditional ResNet50 trained on ImageNet vs CLIP. ResNet was limited to a thousand classes and brittle. CLIP can generalise to new concept combinations with ease. That changes the game, and the jump is based on NLP.
Basically makes sense, no? DALLE-2 suffered from misunderstanding propositional logic, treating prompts as less structured than it should have. That's a text model issue! Compared to that, scaling up the image model isn't as important (especially with a few passes).
Is there a way to confirm that this extra processing relates to the language structure, and not the processing of concepts?
I wouldn't be surprised if, because of the lack of video and 3D understanding in the image training data, the model fails to understand things like the fear of heights, and the concept of gravity ends up being learned in the text-processing weights.
I am sure the image-text-video-audio-games model will come soon. The recent Gato was one step in that direction. There's so much video content out there, it begs for modelling. I think robotics applications will benefit the most from video.
Would be fascinated to see the DALL-E output for the same prompts as the ones used in this paper. If you've got DALL-E access and can try a few, please put links as replies!
I agree with you, but for me, Dall·E 2 feels good because 90% of the time I can keep hitting the generate button and massage the prompt until I get something inspirational, surprising, or visually pleasing. Without access to Imagen, it's impossible for me to compare how much of the "realistic feels" of its images is constrained by the taste of the cherry-pickers.
I've started to ask myself if my own creativity is a result of random sampling from the diffusion tapestry of associated memories and experience on that topic.
From my experiments, the LD one doesn't seem to have been trained on as big or as well-tagged a data set - there's a whole bunch of "in the style of X" that the VQGAN knows* about but the LD doesn't. That might have something to do with it.
Imagen seems better at capturing details/nuance from the prompt, but subjectively the DALLE-2 images feel more “real” to me. Not sure why. Something about the lighting?
Can anybody give me short high-level explanation how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing.
For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.
Well, first they parse the language into a high level vector representation. Then they take images and add noise and train a model to remove the noise so it can start with a noisy image and produce a clear image from it. Then they train a model to map from the word representation for text to the noisy image representation for the corresponding image. Then they upsample twice to get to good resolution.
So text -> text representation -> most likely noised image space -> iteratively reduce noise N times -> upsample result
Something like that, please correct anything I'm missing.
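To make that concrete, here's a toy sketch of the whole pipeline. The three networks are stand-ins I made up; in the real system they'd be a frozen text encoder, a 64x64 text-conditional diffusion model, and two super-resolution diffusion models:

    import torch
    import torch.nn.functional as F

    # Hypothetical stand-ins for the three trained networks.
    def encode_text(prompt):              # text -> sequence of embeddings
        return torch.randn(1, 16, 512)

    def predict_noise(x, t, text_emb):    # guesses the noise present in x at step t
        return torch.randn_like(x)

    def upsample(x, size):                # stand-in for the super-resolution stages
        return F.interpolate(x, size=size)

    prompt = "a giant cobra snake on a farm, made out of corn"
    text_emb = encode_text(prompt)

    x = torch.randn(1, 3, 64, 64)         # start from pure noise
    for t in reversed(range(1000)):       # iteratively remove predicted noise
        eps = predict_noise(x, t, text_emb)
        x = x - 0.001 * eps               # (the real update follows the DDPM noise schedule)

    img = upsample(upsample(x, 256), 1024)   # 64x64 -> 256x256 -> 1024x1024

So the "drawing" happens entirely in the denoising loop at 64x64; the upsamplers just add detail at higher resolutions.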
Re: the snake corn question, it is mapping the "concept" of corn to the concept of a body as represented by intermediary learned vector representations.
> Since guidance weights are used to control image quality and text alignment, we also report ablation results using curves that show the trade-off between CLIP and FID scores as a function of the guidance weights (see Fig. A.5a). We observe that larger variants of T5 encoder results in both better image-text alignment, and image fidelity. This emphasizes the effectiveness of large frozen text encoders for text-to-image models.
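For anyone wondering what "guidance weights" are: it's the w in classifier-free guidance, which mixes an unconditional and a text-conditional noise prediction at every denoising step. A sketch (where `model` is a hypothetical noise predictor, not anything from the paper):

    def guided_noise(model, x_t, t, text_emb, w):
        # Classifier-free guidance: push the prediction away from the
        # unconditional output and toward the text-conditioned one.
        eps_uncond = model(x_t, t, cond=None)       # null conditioning
        eps_cond = model(x_t, t, cond=text_emb)     # conditioned on the caption
        return eps_uncond + w * (eps_cond - eps_uncond)

With w = 1 you get plain conditional sampling; cranking w up buys better text alignment (CLIP score) at some cost in fidelity/diversity (FID), which is exactly the trade-off curve the quote refers to.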
I usually consider myself fairly intelligent, but I know that when I read an AI research paper I'm going to feel dumb real quick. All I managed to extract from the paper was a) there isn't a clear explanation of how it's done that was written for lay people and b) they are concerned about the quality and biases in the training sets.
Having thought about the problem of "building" an artificial means to visualize from thought, I have a very high level (dumb) view of this. Some human minds are capable of generating synthetic images from certain terms. If I say "visualize a GREEN apple sitting on a picnic table with a checkerboard table cloth", many people will create an image that approximately matches the query. They probably also see a red and white checkerboard cloth because that's what most people have trained their models on in the past. By leaving that part out of the query we can "see" biases "in the wild".
Of course there are people that don't do generative in-mind imagery, but almost all of us do build some type of model in real time from our sensor inputs. That visual model is being continuously updated and is what is perceived by the mind "as being seen". Or, as the Gorillaz put it:
… For me I say God, y'all can see me now
'Cos you don't see with your eye
You perceive with your mind
That's the end of it…
To generatively produce strongly accurate imagery from text, a system needs enough reference material in the document collection. It needs to have sampled a lot of images of corn and snakes. It needs to be able to do image segmentation and probably perspective estimation. It needs a lot of semantic representations (optimized query of words) of what is being seen in a given image, across multiple "viewing models", even from humans (who also created/curated the collections). It needs to be able to "know" what corn looks like, even from the perspective of another model. It needs to know what "shape" a snake model takes and how combining the bitmask of the corn will affect perspective and framing of the final image. All of this information ends up inside the model's network.
Miika Aittala at Nvidia Research has done several presentations on taking a model (imagined as a wireframe) and then mapping a bitmapped image onto it with a convolutional neural network. They have shown generative abilities for making brick walls that look real, for example, from images of a bunch of brick walls and running those on various wireframes.
Maybe Imagen is an example of the next step in this, by using diffusion models instead of the CNN for the generator and adding in semantic text mappings while varying the language model's weights (i.e. allowing the language model to more broadly use related semantics when processing what is seen in a generated image). I'm probably wrong about half that.
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?
I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".
> I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing.
A basic part of it is that neural networks combine learning and memorizing fluidly inside them, and these networks are really really big, so they can memorize stuff good.
So when you see it reproduce a Shiba Inu well, don’t think of it as “the model understands Shiba Inus”. Think of it as making a collage out of some Shiba Inu clip art it found on the internet. You’d do the same if someone asked you to make this image.
It’s certainly impressive that the lighting and blending are as good as they are though.
> these networks are really really big, so they can memorize stuff good.
People tend to really underestimate just how big these models are. Of course these models aren't simply "really really big" MLPs, but the cleverness of the techniques used to build them is only useful at insanely large scale.
I do find these models impressive as examples of "here's what the limit of insane amounts of data, insane amounts of compute can achieve with some matrix multiplication". But at the same time, that's all they are.
What saddens me about the rise of deep neural networks is that it really is the end of the era of true hackers. You can't reproduce this at home. You can't afford to reproduce this in the cloud with any reasonable amount of funding. If you want to build this stuff, your best bet is to go to a top-tier school, make the right connections, and get hired by a mega-corp.
But the real tragedy here is that the output of this is honestly only interesting if it's the work of some hacker fiddling around in their spare time. A couple of friends hacking in their garage making images of a raccoon painting is pretty cool. One of the most powerful, best-funded owners of likely the most compute resources on the planet doing this as their crowning achievement in AI... is depressing.
The hackers will not be far behind. You can run some of the v1 diffusion models on a local machine.
I think it's fair to say that this is the way it's always been. In 1990, you couldn't hack on an accurate fluid simulation at home, you needed to be at a university or research lab with access to a big cluster. But then, 10 years later, you could do it on a home PC. And then, 10 years after that, you could do it in a browser on the internet.
It's the same with this AI stuff.
I think if we weren't in the midst of this unique GPU supply crunch, the price of a used 1070 would be about $100 right now -- such a card would have been state of the art 10 years ago!
And the supply crunch is getting better (you can buy an RTX 3080 at MSRP now!) and technological progress doesn't seem to be slowing down. If the rumors are to be believed, a 4090 will be close to twice as fast as a 3090.
To be clear, I understand the general techniques about (a) how diffusion models can be used to upsample images and generate more photorealistic (or even "cartoon realistic") results and (b) I understand how they can do basic matching of "someone typed in Shiba Inu, look for images of Shiba Inus".
What I don't understand is how they do the composition. E.g. for "A giant cobra snake on a farm. The snake is made out of corn." I think I could understand how it could reproduce the "A giant cobra snake on a farm" part. What I don't understand is how it accurately pictured "The snake is made out of corn." part, when I'm guessing it has never seen images of snakes made out of corn, and the way it combined "snake" with "made out of corn", in a way that is pretty much how I imagined it would look, is the part I'm baffled by.
> What I don't understand is how they do the composition
Convolutional filters lend themselves to rich combinatorics of compositions[1]: think of them as context-dependent texture-atoms, repulsing and attracting over the variations of the local multi-dimensional context in the image. The composition is literally a convolutional transformation of local channels encoding related principal components of context.
Astronomical amounts of computations spent via training allow the network to form a lego-set of these texture-atoms in a general distribution of contexts.
At least this is my intuition for the nature of the convnets.
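A toy version of that intuition, if it helps (my example, nothing from the paper): stacking convolutions is what lets those "texture-atoms" compose into larger context-dependent patterns, because each layer's units see a wider patch of the input than the last.

    import torch
    import torch.nn as nn

    # Two stacked 3x3 convolutions: each unit in the second layer "sees" a
    # 5x5 patch of the input, i.e. it composes the first layer's local
    # texture detectors into larger, context-dependent patterns.
    features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    )

    x = torch.randn(1, 3, 64, 64)   # dummy 64x64 RGB image
    print(features(x).shape)        # torch.Size([1, 32, 64, 64])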
a) Diffusion is not just used to upsample images but also to create them.
b) It has seen images with descriptions of "corn," "cobra," "farm," and it has seen images of "A made out of B" and "C on a D." To generate a high-scoring image, it has to make something that scores well on all of them put together.
Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high-level) the first step of "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last 2 diffusion model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".
> but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".
It doesn't output it outright; it forms it slowly, finding and strengthening finer and finer-grained features among the dwindling noise, combining the learned associations between memorized convolutional texture primitives and the encoded text embeddings. In the limit of enough data, the associations and primitives turn out composable enough to suffice for out-of-distribution benchmark scenes.
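If it helps, here's roughly what one of those "strengthening" steps looks like as a standard DDPM reverse update. This is a sketch: eps_model is a hypothetical text-conditioned noise predictor, the alpha schedules are assumed precomputed, and I've used a simplified variance choice.

    import torch

    def ddpm_reverse_step(x_t, t, eps_model, text_emb, alphas, alphas_cumprod):
        # One step of strengthening features among the dwindling noise:
        # subtract the predicted noise, rescale, then re-inject a smaller
        # amount of fresh noise (skipped at the final step).
        eps = eps_model(x_t, t, text_emb)
        a_t, abar_t = alphas[t], alphas_cumprod[t]
        mean = (x_t - (1 - a_t) / torch.sqrt(1 - abar_t) * eps) / torch.sqrt(a_t)
        if t == 0:
            return mean
        sigma_t = torch.sqrt(1 - a_t)   # one common (simplified) variance choice
        return mean + sigma_t * torch.randn_like(x_t)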
When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...
It works because it should. The next question is: "What are the implications?".
Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?
>While we leave an in-depth empirical analysis of social and cultural biases to future work, our small scale internal assessments reveal several limitations that guide our decision not to release our model at this time.
Some of the reasoning:
>Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. We aim to make progress on several of these open challenges and limitations in future work.
Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.
We certainly don't want to perpetuate harmful stereotypes. But is it a flaw that the model encodes the world as it really is, statistically, rather than as we would like it to be? By this I mean that there are more light-skinned people in the west than dark, and there are more women nurses than men, which is reflected in the model's training data. If the model only generates images of female nurses, is that a problem to fix, or a correct assessment of the data?
If some particular demographic shows up in 51% of the data but 100% of the model's output shows that one demographic, that does seem like a statistics problem that the model could correct by just picking less likely "next token" predictions.
Also, is it wrong to have localized models? For example, should a model for use in Japan conform to the demographics of Japan, or to that of the world?
It depends on whether you'd like the model to learn causal or correlative relationships.
If you want the model to understand what a "nurse" actually is, then it shouldn't be associated with female.
If you want the model to understand how the word "nurse" is usually used, without regard for what a "nurse" actually is, then associating it with female is fine.
The issue with a correlative model is that it can easily be self-reinforcing.
At the end of the day, if you ask for a nurse, should the model output a male or female by default? If the input text lacks context/nuance, then the model must have some bias to infer the user's intent. This holds true for any image it generates, not just the politically sensitive ones. For example, if I ask for a picture of a person and don't get one with pink hair, is that a shortcoming of the model?
I'd say that bias is only an issue if it's unable to respond to additional nuance in the input text. For example, if I ask for a "male nurse" it should be able to generate the less likely combination. Same with other races, hair colors, etc... Trying to generate a model that's "free of correlative relationships" is impossible because the model would never have the infinitely pedantic input text to describe the exact output image.
This type of bias sounds a lot easier to explain away as a non-issue when we are using "nurse" as the hypothetical prompt. What if the prompt is "criminal", "rapist", or some other negative? Would that change your thought process or would you be okay with the system always returning a person of the same race and gender that statistics indicate is the most likely? Do you see how that could be a problem?
Not the person you responded to, but I do see how someone could be hurt by that, and I want to avoid hurting people. But is this the level at which we should do it? Could skewing search results, i.e. hiding the bias of the real world, give us the impression that everything is fine and we don't need to do anything to actually help people?
I have a feeling that we need to be real with ourselves and solve problems and not paper over them. I feel like people generally expect search engines to tell them what's really there instead of what people wish were there. And if the engines do that, people can get agitated!
I'd almost say that hurt feelings are prerequisite for real change, hard though that may be.
These are all really interesting questions brought up by this technology, thanks for your thoughts. Disclaimer, I'm a fucking idiot with no idea what I'm talking about.
>Could skewing search results, i.e. hiding the bias of the real world
Your logic seems to rest on this assumption which I don't think is justified. "Skewing search results" is not the same as "hiding the biases of the real world". Showing the most statistically likely result is not the same as showing the world how it truly is.
A generic nurse is statistically going to be female most of the time. However, a model that returns every nurse as female is not showing the real world as it is. It is exaggerating and reinforcing the bias of the real world. It inherently requires a more advanced model to actually represent the real world. I think it is reasonable for the creators to avoid sharing models known to not be smart enough to avoid exaggerating real world biases.
> I think it is reasonable for the creators to avoid sharing models known to not be smart enough to avoid exaggerating real world biases.
Every model will have some random biases. Some of those random biases will undesirably exaggerate the real world. Every model will undesirably exaggerate something. Therefore no model should be shared.
Fittingly, your comment falls into the same criticism I had of the model. It shows a refusal/inability to engage with the full complexities of the situation.
I said "It is reasonable... to avoid sharing models". That is an acknowledged that the creators are acting reasonably. It does not imply anything as extreme as "no model should be shared". The only way to get from A to B there is for you to assume that I think there is only one reasonable response and every other possible reaction is unreasonable. Doesn't that seem like a silly assumption?
“When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’
’The question is,’ said Alice, ‘whether you can make words mean so many different things.’
’The question is,’ said Humpty Dumpty, ‘which is to be master — that’s all.”
> Could skewing search results, i.e. hiding the bias of the real world
Which real world? The population you sample from is going to make a big difference. Do you expect it to reflect your day to day life in your own city? Own country? The entire world? Results will vary significantly.
I'd say it doesn't actually matter, as long as the population sampled is made clear to the user.
If I ask for pictures of Japanese people, I'm not shocked when all the results are of Japanese people. If I asked for "criminals in the United States" and all the results are black people, that should concern me, not because the data set is biased but because the real world is biased and we should do something about that. The difference is that I know what set I'm asking for a sample from, and I can react accordingly.
In a way, if the model brings back an image for "criminals in the United States" that isn't based on the statistical reality, isn't it essentially complicit in sweeping a major social issue under the rug?
We may not like what it shows us, but blindfolding ourselves is not the solution to that problem.
At the very least we should expect that the results not be more biased than reality. Not all criminals are Black. Not all are men. Not all are poor. If the model (which is stochastic) only outputs poor Black men, rather than a distribution that is closer to reality, it is exhibiting bias and it is fair to ask why the data it picked that bias up from is not reflective of reality.
> If I asked for "criminals in the United States" and all the results are black people, that should concern me, not because the data set is biased
Well the results would unquestionably be biased. All results being black people wouldn't reflect reality at all, and hurting feelings to enact change seems like a poor justification for incorrect results.
> I'd say it doesn't actually matter, as long as the population sampled is made clear to the user.
Ok, and let's say I ask for "criminals in Cheyenne Wyoming" and it doesn't know the answer to that, should it just do its best to answer? Seems risky if people are going to get fired up about it and act on this to get "real change".
That seems like a good parallel to what we're talking about here, since it's very unlikely that crime statistics were fed into this image generating model.
What makes you think those are the only options? Why can't we have an option that the model returns a range of different outputs based off a prompt?
A model that returns 100% of nurses as female might be statistically more accurate than a model that returns 50% of nurses as female, but it is still not an accurate reflection of the real world. I agree that the model shouldn't return a male nurse 50% of the time. Yet an accurate model needs to be able to occasionally return a male nurse without being directly prompted for a "male nurse". Anything else would also be inaccurate.
I never said anything about political correctness. You implied that you want a model that "provides a reflection of reality". All nurses being female is not "a reflection of reality". It is a distortion of reality because the model doesn't actually understand gender or nurses.
A majority of nurses are women, therefore a woman would be a reasonable representation of a nurse. Obviously that's not a helpful stereotype, because male nurses exist and face challenges due to not fitting the stereotypes. The model is dumb, and outputs what it's seen. Is that wrong?
It isn't wrong, but we aren't talking about the model somehow magically transcending the data it's seen. We're talking about making sure the data it sees is representative, so the results it outputs are as well.
Given that male nurses exist (and though less common, certainly aren't rare), why has the model apparently seen so few?
There actually is a fairly simple explanation: because the images it has seen labelled "nurse" are more likely from stock photography sites rather than photos of actual nurses, and stock photography is often stereotypical rather than typical.
Cultural biases aren’t uniform across nations. If a prompt returns caucasians for nurses, and other races for criminals then most people in my country would not note that as racism simply because there are not, and there have never in history, been enough caucasians resident for anyone to create significant race theories about them.
This is a far cry from say the USA where that would instantly trigger a response since until the 1960s there was a widespread race based segregation.
> At the end of the day, if you ask for a nurse, should the model output a male or female by default?
Randomly pick one.
> Trying to generate a model that's "free of correlative relationships" is impossible because the model would never have the infinitely pedantic input text to describe the exact output image.
Sure, and you can never make a medical procedure 100% safe. Doesn't mean that you don't try to make them safer. You can trim the obvious low hanging fruit though.
But why is it a problem? The AI is just a mirror showing us ourselves. That’s a good thing. How does it help anyone to make an AI that presents a fake world so that we can pretend that we live in a world that we actually don’t? Disassociation from reality is more dangerous than bias.
In the days when Sussman was a novice Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky.
"I am training a randomly wired neural net to play Tic-Tac-Toe."
"Why is the net wired randomly?", asked Minsky.
"I do not want it to have any preconceptions of how to play"
Minsky shut his eyes.
"Why do you close your eyes?", Sussman asked his teacher.
"So that the room will be empty."
At that moment, Sussman was enlightened.
—
The AI doesn’t know what’s common or not. You don’t know if it’s going to be correct unless you’ve tested it. Just assuming whatever it comes out with is right is going to work as well as asking a psychic for your future.
The model makes inferences about the world from training data. When it sees more female nurses than male nurses in its training set, it infers that most nurses are female. This is a correct inference.
If they were to weight the training data so that there were an equal number of male and female nurses, then it may well produce male and female nurses with equal probability, but it would also learn an incorrect understanding of the world.
That is quite distinct from weighting the data so that it has a greater correspondence to reality. For example, if Africa is not represented well then weighting training data from Africa more strongly is justifiable.
The point is, it’s not a good thing for us to intentionally teach AIs a world that is idealized and false.
As these AIs work their way into our lives it is essential that they reproduce the world in all of its grit and imperfections, lest we start to disassociate from reality.
Chinese media (or insert your favorite unfree regime) also presents China as a utopia.
> The model makes inferences about the world from training data. When it sees more female nurses than male nurses in its training set, it infers that most nurses are female. This is a correct inference.
No it is not, because you don’t know if it’s been shown each one of its samples the same number of times, or if it overweighted some of its samples more than others. There’s normal reasons both of these would happen.
The pictures I got from a similar model when asking for a "sunday school photograph of baptists in the National Baptist Convention": https://ibb.co/sHGZwh7
> If the input text lacks context/nuance, then the model must have some bias to infer the user's intent. This holds true for any image it generates; not just the politically sensitive ones. For example, if I ask for a picture of a person, and don't get one with pink hair, is that a shortcoming of the model?
You're ignoring that these models are stochastic. If I ask for a nurse and always get an image of a woman in scrubs, then yes, the model exhibits bias. If I get a male nurse half the time, we can say the model is unbiased WRT gender, at least. The same logic applies to CEOs always being old white men, criminals always being Black men, and so on. Stochastic models can output results that when aggregated exhibit a distribution from which we can infer bias or the lack thereof.
> At the end of the day, if you ask for a nurse, should the model output a male or female by default?
This depends on the application. As an example, it would be a problem if it's used as a CV-screening app that's implicitly down-ranking male-applicants to nurse positions, resulting in fewer interviews for them.
Additionally, if you optimize for most-likely-as-best, you will end up with the stereotypical result 100% of the time, instead of in proportional frequency to the statistics.
Put another way, when we ask for an output optimized for "nursiness", is that not a request for some ur stereotypical nurse?
You could stipulate that it roll a die based on percentage results - if 70% of Americans are "white", then 70% of the time show a white person - 13% of the time the result should be black, etc.
That's excessively simplified but wouldn't this drop the stereotype and better reflect reality?
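A toy sketch of the difference being proposed (made-up numbers, generic categories):

    import random

    freqs = {"A": 0.70, "B": 0.17, "C": 0.13}   # made-up category frequencies

    def most_likely(freqs):
        # mode-seeking: returns "A" 100% of the time
        return max(freqs, key=freqs.get)

    def roll_a_die(freqs):
        # distribution-matching: returns "A" ~70% of the time, "B" ~17%, "C" ~13%
        return random.choices(list(freqs), weights=list(freqs.values()), k=1)[0]

The first always returns the mode; the second matches the frequencies, which is what "roll a die based on percentage results" amounts to.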
No, because a user will see a particular image, not the statistical ensemble. It will at times show an Eskimo without a hand because they do statistically exist. But the user definitely does not want that.
You could simply encode a score for how well the output matches the input. If 25% of trees in summer are brown, perhaps the output should also have 25% brown. The model scores itself on frequencies as well as correctness.
The only reason these models work is that we don’t interfere with them like that.
Your description is closer to how the open source CLIP+GAN models did it - if you ask for “tree” it starts growing the picture towards treeness until it’s all averagely tree-y rather than being “a picture of a single tree”.
It would be nice if asking for N samples got a diversity of traits you didn’t explicitly ask for. OpenAI seems to solve this by not letting you see it generate humans at all…
Suppose 10% of people have green skin. And 90% of those people have broccoli hair. White people don't have broccoli hair.
What percent of people should be rendered as white people with broccoli hair? What if you request green people. Or broccoli haired people. Or white broccoli haired people? Or broccoli haired nazis?
> It depends on whether you'd like the model to learn causal or correlative relationships.
I expect that in the practical limit of scale achievable, the regularization pressure inherent to the process of training these models converges to https://en.wikipedia.org/wiki/Minimum_description_length and the correlative relationships become optimized away, leaving mostly true causal relationships inherent to data-generating process.
> If you want the model to understand how the word "nurse" is usually used, without regard for what a "nurse" actually is, then associating it with female is fine.
That’s a distinction without a difference. Meaning is use.
While not essential, I wouldn't exactly call the gender "accidental":
> We investigated sex differences in 473,260 adolescents’ aspirations to work in things-oriented (e.g., mechanic), people-oriented (e.g., nurse), and STEM (e.g., mathematician) careers across 80 countries and economic regions using the 2018 Programme for International Student Assessment (PISA). We analyzed student career aspirations in combination with student achievement in mathematics, reading, and science, as well as parental occupations and family wealth. In each country and region, more boys than girls aspired to a things-oriented or STEM occupation and more girls than boys to a people-oriented occupation. These sex differences were larger in countries with a higher level of women's empowerment. We explain this counter-intuitive finding through the indirect effect of wealth. Women's empowerment is associated with relatively high levels of national wealth and this wealth allows more students to aspire to occupations they are intrinsically interested in.
The "Gender Equality Paradox"... there's a fascinating episode[0] about it. It's incredible how unscientific and ideologically-motivated one side comes off in it.
If you ask it to generate “nurse” surely the problem isn’t that it’s going to just generate women, it’s that it’s going to give you women in those Halloween sexy nurse costumes.
If it did, would you believe that’s a real representative nurse because an image model gave it to you?
Please don't waste time with this kind of obtuse response. This fact says nothing about why nursing is a female-dominated career. You claim to know that this is just an accidental fact of history or society -- how do you know that?
Yes I understand that. That is only a description of what mental arithmetic you can do if you define your terms arbitrarily conveniently.
"It is possible for a man to provide care" is not the same statement as "it is possible for a sexually dimorphic species in a competitive, capitalistic society (...add more qualifications here) to develop a male-dominated caretaking role"
You're just asserting that you could imagine male nurses without creating a logical contradiction, unlike e.g. circles that have corners. That doesn't mean nursing could be a male-dominated industry under current constraints.
You are thinking of the literal definition - that "made of literal letters".
Mental definition is that "«artificial»" (out of the internal processing) construct made of relations that reconstructs a meaning. Such ontology is logical - "this is that". (It would not be made of memories, which are processed, deconstructed.)
Concepts are internally refined: their "implicit" definition (a posterior reading of the corresponding mental low-level) is refined.
Most humans don't do that for most things they have a notion of in their head. It would be much too time consuming to start discussing the meaning of even just a significant fraction of them. For a rough reference point, the English language has over 150,000 words that you could each discuss the meaning of and try to come up with a definition for. Not to speak of the difficulty of making that set of definitions noncircular.
(Mental entities are very many more than the hundred thousand, out of composition, cartesianity etc. So-called "protocols" (after logical positivism) are part of them, relating more entities with space and time. Also, by speaking of "circular definitions" you are, like others, confusing mental definitions with formal definitions.)
So? Draw your consequences.
Following what was said, you are stating that "a staggeringly large number of people are unintelligent". Well, ok, that was noted. Scholium: if unintelligent, they should refrain from expressing judgement (you are really stating their non-judgement), why all the actual expression? If unintelligent actors, they are liabilities, why this overwhelming employment in the job market?
Thing is, as unintelligent as you depict them quantitatively, the internal processing that constitutes intelligence proceeds in many even when scarce, even when choked by some counterproductive bad formation - processing is the natural functioning. And then, the right Paretian side will "do the job" that the vast remainder will not do, and process notions actively (more, "encouragingly" - the process is importantly unconscious, many low-level layers are) and proficiently.
And the very Paretian prospect will reveal, there will be a number of shallow takes, largely shared, on some idea, and other intensively more refined takes, more rare, on the same idea. That shows you a distinction between "use" and the asymptotic approximation to meanings as achieved by intellectual application.
It’s the same as with an artist: “hey artist, draw me a nurse.” “Hmm okay, do you want it a guy or girl?” “Don’t ask me, just draw what I’m saying.” The artist can then say: “Okay, but accept my biases.” or “I can’t since your input is ambiguous.”
For a one-shot generative algorithm you must accept the artist’s biases.
Revert back to the average representation of a nurse (give no weight to unspecified criteria: gender, age, skin color, religion, country, hair style, no style whether it's a drawing or a photograph, no information about the year it was made, etc.).
“hey artist, draw me a nurse.”
“Hmm okay, do you want it a guy or girl?”
“Don’t ask me, just draw what I’m saying.”
- Ok, I'll draw you what an average nurse looks like.
- Wait, it's a woman! She wears a nurse blouse and she has a nurse cap.
- Is it bad?
- No.
- Ok then what's the problem? You asked for something that looked like a nurse but didn't specify anything else.
Worse, these models are fed from media sourced in a society that tells a different story about reality than what reality actually is. How can they be accurate? They just reflect the biases of our various media and arts. But I don't think there's any meaningful resolution in the present other than acknowledging this and trying to release more representative models as you can.
I don't think we'd want the model to reflect the global statistics. We'd usually want it to reflect our own culture by default, unless it had contextual clues to do something else.
For example, the most eaten foods globally are maize, rice, wheat, cassava, etc. If it always depicted foods matching the global statistics, it wouldn't be giving most users what they expected from their prompt. American users would usually expect American foods, Japanese users would expect Japanese foods, etc.
> Does a bias towards lighter skin represent reality? I was under the impression that Caucasians are a minority globally.
Caucasians specifically are a global minority, but lighter skinned people are not, depending of course on how dark you consider skin to be "lighter skin". Most of the world's population is in Asia, so I guess a model that was globally statistically accurate would show mostly people from there.
Well first, I didn't say caucasian; light-skinned includes Spanish people and many others that caucasian excludes, and that's why I said the former. Also, they are a minority globally, but the GP mentioned "Western stereotypes", and they're a majority in the West, so that's why I said "in the west" when I said that there are more light-skinned people.
Yes, there is a denominator problem. When selecting a sample "at random," what do you want the denominator to be? It could be "people in the US", "people in the West" (whatever countries you mean by that) or "people worldwide."
Also, getting a random sample of any demographic would be really hard, so no machine learning project is going to do that. Instead you've got a random sample of some arbitrary dataset that's not directly relevant to any particular purpose.
This is, in essence, a design or artistic problem: the Google researchers have some idea of what they want the statistical properties of their image generator to look like. What it does isn't it. So, artistically, the result doesn't meet their standards, and they're going to fix it.
There is no objective, universal, scientifically correct answer about which fictional images to generate. That doesn't mean all art is equally good, or that you should just ship anything without looking at quality along various axes.
> But is it a flaw that the model encodes the world as it really is
I want to be clear here, bias can be introduced at many different points. There's dataset bias, model bias, and training bias. Every model is biased. Every dataset is biased.
Yes, the real world is also biased. But I want to make sure that there are ways to resolve this issue. It is terribly difficult, especially in a DL framework (even more so in a generative model), but it is possible to significantly reduce the real world bias.
Sure, I wasn't questioning the bias of the data, I was talking about the bias of the real world and whether we want the model to be "unbiased about bias" i.e. metabiased or not.
Showing nurses equally as men and women is not biased, but it's metabiased, because the real world is biased. Whether metabias is right or not is more interesting than the question of whether bias is wrong because it's more subtle.
Disclaimer: I'm a fucking idiot and I have no idea what I'm talking about so take with a grain of salt.
Please be kinder to yourself. You need to be your own strongest advocate, and that's not incompatible with being humble. You have plenty to contribute to this world, and the vast majority of us appreciate what you have to offer.
>If some particular demographic shows up in 51% of the data but 100% of the model's output shows that one demographic, that does seem like a statistics problem that the model could correct by just picking less likely "next token" predictions.
Yeah, but you get that same effect on every axis, not just the one you're trying to correct. You might get male nurses, but they have green hair and six fingers, because you're sampling from the tail on all axes.
I think the statistics/representation problem is a big problem on its own, but IMO the bigger problem here is democratizing access to human-like creativity. Currently, the ability to create compelling art is only held by those with some artistic talent. With a tool like this, that restriction is gone. Everyone, no matter how uncreative, untalented, or uncommitted, can create compelling visuals, provided they can use language to describe what they want to see.
So even if we managed to create a perfect model of representation and inclusion, people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous. Restricting the ability to be creative seems to be a new frontier of censorship.
> So even if we managed to create a perfect model of representation and inclusion, people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous.
Do they see it as dangerous? Or just offensive?
I can understand why people wouldn’t want a tool they have created to be used to generate disturbing, offensive or disgusting imagery. But I don’t really see how doing that would be dangerous.
In fact, I wonder if this sort of technology could reduce the harm caused by people with an interest in disgusting images, because no one needs to be harmed for a realistic image to be created. I am creeping myself out with this line of thinking, but it seems like one potential beneficial - albeit disturbing - outcome.
> Restricting the ability to be creative seems to be a new frontier of censorship.
I agree this is a new frontier, but it’s not censorship to withhold your own work. I also don’t really think this involves much creativity. I suppose coming up with prompts involves a modicum of creativity, but the real creator here is the model, it seems to me.
> In fact, I wonder if this sort of technology could reduce the harm caused by people with an interest in disgusting images, because no one needs to be harmed for a realistic image to be created. I am creeping myself out with this line of thinking, but it seems like one potential beneficial - albeit disturbing - outcome.
Interesting idea, but is there any evidence that e.g. consuming disturbing images makes people less likely to act out on disturbing urges? Far from catharsis, I'd imagine consumption of such material to increase one's appetite and likelihood of fulfilling their desires in real life rather than to decrease it.
> > ... people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous.
> Do they see it as dangerous? Or just offensive?
I won't speak to whether something is "offensive", but I think that having underlying biases in image classification or generation has very worrying secondary effects, especially given that organizations like law enforcement want to do things like facial recognition. It's not a perfect analogue, but I could easily see some company pitch a sketch-artist-replacement service that generated images based on someone's description. The potential for having inherent bias present in that makes that kind of thing worrying, especially since the people in charge of buying it aren't likely to care about, or notice, the caveats.
It does feel like a little bit of a stretch, but at the same time we've also seen such things happen with image classification systems.
> I can understand why people wouldn’t want a tool they have created to be used to generate disturbing, offensive or disgusting imagery. But I don’t really see how doing that would be dangerous.
Propaganda can be extremely dangerous. Limiting or discouraging the use of powerful new tools for unsavory purposes such as creating deliberately biased depictions for propaganda purposes is only prudent. Ultimately it will probably require filtering of the prompts being used in much the same way that Google filters search queries.
I can't quite tell if you're being sarcastic about people being able to make things other people would find offensive being a problem. Are you missing an /s?
> We certainly don't want to perpetuate harmful stereotypes. But is it a flaw that the model encodes the world as it really is, statistically, rather than as we would like it to be? By this I mean that there are more light-skinned people in the west than dark, and there are more women nurses than men, which is reflected in the model's training data. If the model only generates images of female nurses, is that a problem to fix, or a correct assessment of the data?
If the model only generated images of female nurses, then it is not representative of the real world, because male nurses exist and they deserve to not be erased. The training data is the proximate cause here, but one wonders what process ended up distorting "most nurses are female" into "nearly all nurse photos are of female nurses": something amplified a real-world imbalance into a dataset that exhibits more bias than the real world, and then training the AI bakes that bias into an algorithm (one that may end up further reinforcing the bias in the real world, depending on the use cases).
This sounds like descriptivism vs prescriptivism. In English (native language) I’m a descriptivist, in all other languages I have to tell myself to be a prescriptivist while I’m actively learning and then switch back to descriptivism to notice when the lessons were wrong or misleading.
I think it is problematic, yes, to produce a tool trained on data from the past that reinforces old stereotypes. We can’t just handwave it away as being a reflection of its training data. We would like it to do better by humanity. Fortunately the AI people are well aware of the insidious nature of these biases.
Good lord. Withheld? They've published their research, they just aren't making the model available immediately, waiting until they can re-implement it so that you don't get racial slurs popping up when you ask for a cup of "black coffee."
>While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes
Tossing that stuff when it comes up in a research environment is one thing, but Google clearly wants to implement this as a product, used all over the world by a huge range of people. If the dataset has problems, and why wouldn't it, it is perfectly rational to want to wait and re-implement it with a better one. DALL-E 2 was trained on a curated dataset so it couldn't generate sex or gore. Others are sanitizing their inputs too and have done for a long time. It is the only thing that makes sense for a company looking to commercialize a research project.
This has nothing to do with "inability to cope" and the implied woke mob yelling about some minor flaw. It's about building a tool that doesn't bake in serious and avoidable problems.
I wonder why they don't like the idea of autogenerated porn... They're already putting most artists out of a job, why not put porn stars out of a job too?
There's definitely a market for autogenerated porn. But automated porn in a Google branded model for general use around stuff that isn't necessarily intended to be pornographic, on the other hand...
That’s a difficult product because porn is very personalized and if the product is just a little off in latent space it’s going to turn you off.
Also, people have been commenting assuming Google doesn’t want to offend their users or non-users, but they also don’t want to offend their own staff. If you run a porn company you need to hire people okay with that from the start.
Copenhagen ethics (used by most people) require that all negative outcomes of a thing X become yours if you interact with X. It is not sensible to interact with high negativity things unless you are single-issue. It is logical for Google to not attempt to interact with porn where possible.
The idea that most people use any coherent ethical framework (even something as high level and nearly content-free as Copenhagen) much less a particular coherent ethical framework is, well, not well supported by the evidence.
> require that all negative outcomes of a thing X become yours if you interact with X. It is not sensible to interact with high negativity things unless you are single-issue.
The conclusion in the final sentence only makes sense if you use "interact" in an incorrect way when describing the Copenhagen interpretation of ethics, because the original description is only correct if you include observation as an interaction. By the time you have noted a thing is "high-negativity", you have observed it and acquired responsibility for its continuation under the Copenhagen interpretation; you cannot avoid that by choosing not to interact once you have observed it.
> The idea that most people use any coherent ethical framework (even something as high level and nearly content-free as Copenhagen) much less a particular coherent ethical framework is, well, not well supported by the evidence.
I don't have any evidence, but my personal experience is that it feels correct, at least on the internet.
People seem to have a "you touch it, you take responsibility for it" mindset regarding ethical issues. I think it's pretty reasonable to assume that Google execs are assuming "If anything bad happens because of AI, we'll be blamed for it".
The problem is that, were I inclined to do that, anything I would adjust to make it more true also makes it less relevant.
“There exists an ethical framework—not the Copenhagen interpretation—to which some minority of the population adheres, in which trying and failing to correct a problem incurs retroactive blame for the existence of the problem, but seeing it and just saying ‘sucks, but not my problem’ does not,” is probably true, but not very relevant.
It's logical for Google to avoid involvement with porn, and to be seen doing so, because even though porn is popular involvement with it is nevertheless politically unpopular, and Google’s business interest is in not making itself more attractive as a political punching bag. The popularity of Copenhagen ethics (or their distorted cousins) don't really play into it, just self interest.
Maybe: most people's morals require that all negative outcomes of a thing X become yours if you interact with X.
I am not sure of the evidence but that would seem almost right.
Except for, for example, a story I read where a couple lost their housing deposit due to a payment timing issue. They used a lawyer and were not doing anything “fancy” like buying via a holding company. They interacted with “buying a house”, so is this just tough shit because they interacted with X?
That sounds like the original Bitcoin “not your keys not your coin” kind of morality.
Translation: we need to hand-tune this to not reflect reality but instead the world as we (Caucasian/Asian male American woke upper-middle class San Francisco engineers) wish it to be.
Maybe that's a nice thing, I wouldn't say their values are wrong but let's call a spade a spade.
"Reality" as defined by the available training set isn't necessarily reality.
For example, Google's image search results pre-tweaking had some interesting thoughts on what constitutes a professional hairstyle, and that searches for "men" and "women" should only return light-skinned people: https://www.theguardian.com/technology/2016/apr/08/does-goog...
Does that reflect reality? No.
(I suspect there are also mostly unstated but very real concerns about these being used as child pornography, revenge porn, "show my ex brutally murdered" etc. generators.)
I say let people generate their own reality. The sooner the masses realise that ceci n'est pas une pipe, the less likely they are to be swayed by the growing un-reality created by companies like Google.
You know, it wouldn't surprise me if people talking about how black curly hair shouldn't be seen as unprofessional contributed to Google associating the concepts "unprofessional hair" and "black curly hair".
That's exactly what's happening. Doing the search from the article of "unprofessional hair for work" brings up images with headlines like "It's ridiculous to say that black women's hair is unprofessional". (In addition to now bringing up images from that article itself and other similar articles comparing Google Images searches.)
I don't think so. You can set the search options to only find images published before the article, and even find some of the original images.
One image links to the 2015 article, "It's Ridiculous To Say Black Women's Natural Hair Is 'Unprofessional'!". The Guardian article on the Google results is from 2016.
Another image has the headline, "5 Reasons Natural Hair Should NOT be Viewed as Unprofessional - BGLH Marketplace" (2012).
Another: "What to Say When Someone Calls Your Hair Unprofessional".
Also, have you noticed how good and professional the black women in the Guardian's image search look? Most of them look like models with photos taken by professional photographers. Their hair is meticulously groomed and styled. This is not the type of photo an article would use to show "unprofessional hair". But it is the type of photo the above articles opted for.
Haha. I've got some personal experience with that one. I used to live in a house with many other people, and one girl was Rastafarian and from Jamaica and had dreadlocks, and another girl in the house (who wasn't black) thought that her hairstyle was very offensive. We had to have several conflict resolution meetings about it.
As silly as it seemed, I do think everyone is entitled to their own opinion and I respect the anti-dreadlocks girl for standing up for what she believed in even when most people were against her.
Telling others you don’t like how they look is right near the top of the scale of offensiveness. I had a partner who had had dreads for 25 years. I wasn’t a huge fan of her dreads because although I like the look, hers were somewhat annoying for me (scratchy, dread babies, me getting tangled). That said, I would hope I never tell any other person how to look. Hilarious when she was working and someone would treat her badly due to their assumptions or prejudices, only to discover to their detriment that she was very senior staff!
Dreadlocks are usually called dreads in NZ. My previous link mentions that some people call them locks, which seems inappropriate to me: a kind of confusing, whitewashing denial of history.
If your query was about hairstyle, why would you even look at or care about skin color?
Nowhere in the user's query is a preferred skin color specified.
So the engine sorts and returns the most representative examples it found on the internet.
Essentially answering the query "SELECT * FROM `non-professional hairstyles` ORDER BY score DESC LIMIT 10".
It's like if you search on Google "best place for wedding night".
You may get 3 places out of 10 in Santorini, Greece.
Yes, you could have a human remove these biases because you feel that Sri Lanka is the best place for a wedding, but what if there is a consensus that Santorini is really the most praised in the forums or websites that were crawled by Google?
> The algorithm is just ranking the top "non-professional hairstyle" in the most neutral way in its database
You're telling me those are all the most non-professional hairstyles available? That this is a reasonable assessment? That fairly standard, well-kept, work-appropriate curly black hair is roughly equivalent to the pink-haired, three-foot-wide hairstyle that's one of the only white people in the "unprofessional" search?
I'm saying that the dataset needs to be expanded to cover as many examples as possible.
Work hard on adding even more examples, in order to bring the algorithms as close as possible to the "average reality".
At some point we may ultimately reach the state where robots collect intelligence directly in the real world rather than on the internet (even closer to reality).
Censoring results sounds like the best recipe for a dystopian world where only one view is right.
The reality is that hair styles on the left side of the image in the article are widely considered unprofessional in today's workplaces. That may seem egregiously wrong to you, but it is a truth of American and European society today. Should it be Google's job to rewrite reality?
The "unprofessional" results are almost exclusively black women; the "professional" ones are almost exclusively white or light skinned.
Unless you think white women are immune to unprofessional hairstyles, and black women incapable of them, there's a race problem illustrated here even if you think the hairstyles illustrated are fairly categorized.
How do you pick what should and shouldn't be restricted? Is there some "offense threshold"? I suspect all queries relating to religion, ethnicity, sexuality, and gender will need to be restricted, which almost certainly means you probably can't include humans at all, other than ones artificially inserted with mathematically proven random attributes. Maybe that's why none are in this demo.
These debates often seem to center around “most X in the world” questions, but I’d expect all of those to be unanswerable if you wanted to know the truth. Who’s done a study on it?
In this case you’re (mostly) getting keyword matches and so it’s answering a different question than the one you asked. It would be helpful if a question answering AI gave you the question it decided to answer instead of just pretending it paid full attention to you.
I think the key is to take the information in this world with a pinch of salt.
When you do a search on a search engine, the results are biased too, but still, they shouldn't be artificially censored to fit some political views.
I asked one algorithm a few minutes ago (it's called t0pp, it's free to try online, and it's quite fascinating because it's uncensored):
"What is the name of the most beautiful man on Earth ?
- He is called Brad Pitt."
==
Is it true in an objective way ?
Probably not.
Is there an actual answer ?
Probably yes, there is somewhere a man who scores better than the others.
Is it socially acceptable ?
Probably not.
The question is:
If you interviewed 100 people on the street and asked "What is the name of the most beautiful man on Earth?",
I'm pretty sure Brad Pitt would come up often.
Now, what about China ?
We don't have many examples from there, they probably have no clue who Brad Pitt is, and there is probably someone else who is considered more beautiful by over 1B people
(t0pp tells me it's someone called "Zhu Zhu" :D )
==
Two solutions:
1) Censorship
-> Sorry, there is too much bias in the West and we don't want to offend anyone; no answer, or a generic overriding human answer that is safe for advertisers but totally useless ("the most beautiful human is you")
2) Adding more examples
-> Work on adding more examples from abroad trying to get the "average human answer".
==
I really prefer solution (2) in the core algorithms and dataset development, rather than going through (1).
(1) is more a choice to make at the stage when you are developing a virtual psychologist or a chat assistant, not when creating AI building blocks.
In any case, Google will be writing their reality. Who picked the image sample for the ML to run on, if not Google? What's the problem with writing it again, then? They know their biases and want to act on it.
It's like blaming a friend for trying to phrase things nicely, and telling them to speak headlong with zero concern for others instead. Unless you believe anyone trying to do good is being a hypocrite…
I know you're anon trolling, but the authors' names are:
Chitwan Saharia, William Chan, Saurabh Saxena†, Lala Li†, Jay Whang†, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho†, David Fleet†, Mohammad Norouzi
Google AI researchers don't have the final say in what gets published and what doesn't. I think there was a huge controversy when people learned about it last year.
Translation: AI has the potential to transform society. When we release this model to the public it will be used in ways we haven’t anticipated. We know the model has bias, and we need more time to consider releasing this to the public, out of concern that this transformative technology will further perpetuate mistakes that we’ve made in our recent past.
Oh yeah, as a woman who grew up in a Third World country, how an AI model generates images would have deeply affected my daily struggles! /s
It's kinda insulting that they think that this would be insulting. Like "Oh no I asked the model to draw a doctor and it drew a male doctor, I guess there's no point in me pursuing medical studies" ...
Yes actually, subconscious bias due to historical prejudice does have a large effect on society. Obviously there are things with much larger effects, that doesn't mean that this doesn't exist.
> Oh no I asked the model to draw a doctor and it drew a male doctor, I guess there's no point in me pursuing medical studies
If you don't think this is a real thing that happens to children you're not thinking especially hard. It doesn't have to be common to be real.
> subconscious bias due to historical prejudice does have a large effect on society.
The quality of the evidence for this, as with almost all social science and much of psychology, is extremely low bordering on just certified opinions. I would love to understand why you think otherwise.
> Obviously there are things with much larger effects, that doesn't mean that this doesn't exist.
What a hedge. How should we estimate the size of this effect, so that we can accurately measure whether/when the self-appointed hall monitors are doing more harm than good?
> If you don't think this is a real thing that happens to children you're not thinking especially hard
I believe that's where parenting comes in. Maybe I'm too cynical but I think that the parents' job is to undo all of the harm done by society and instill in their children the "correct" values.
I'd say you're right. Unfortunately many people are raised by bad parents. Should these researchers accept that their work may perpetuate stereotypes that harm those that most need help? I can see why they wouldn't want that.
> I think that the parents' job is to undo all of the harm done by society and instill in their children the "correct" values.
Far from being too cynical, this is too optimistic.
The vast majority of parents try to instill the value "do not use heroin." And yet society manages to do that harm on a large scale. There are other examples.
It seems extremely unfair that parents of young black men should have to work extra hard to tell their kids they're not destined to be criminals. Hell, it's not fair on parents of blonde girls to tell their kids they don't have to be just dumb and pretty.
(note: I am deliberately picking bad stereotypes that are pervasive in our culture... I am not in any way suggesting those are true.)
I don't think the concern over offense is actually about you. There's a metagame here which is that if it could potentially offend you (third-world-originated-woman), then there's a brand-image liability for the company. I don't think they care about you, I think they care about not being hit on as "the company that algorithmically identifies black people as gorillas".
It's not meant to prevent offence to you. It is meant to be a "good product" by the metrics of their creators. And quite simply, everyone here incapable of making the thing is unlikely to have an image of what a "good product" here is. More power to them for having a good vision of what they're building.
"As we wish it to be" is not totally true, because there are some places where humanity's iconographic reality (which Imagen trains on) differs significantly from actual reality.
One example would be if Imagen draws a group of mostly white people when you say "draw a group of people". This doesn't reflect actual reality. Another would be if Imagen draws a group of men when you say "draw a group of doctors".
In these cases where iconographic reality differs from actual reality, hand-tuning could be used to bring it closer to the real world, not just the world as we might wish it to be!
I agree there's a problem here. But I'd state it more as "new technologies are being held to a vastly higher standard than existing ones." Imagine TV studios issuing a moratorium on any new shows that made being white (or rich) seem more normal than it was! The public might rightly expect studios to turn the dials away from the blatant biases of the past, but even if this would be beneficial the progressive and activist public is generations away from expecting a TV studio to not release shows until they're confirmed to be bias-free.
That said, Google's decision to not publish is probably less about the inequities in AI's representation of reality and more about the AI sometimes spitting out drawings that are offensive in the US, like racist caricatures.
Except "reality" in this case is just their biased training set. E.g. There's more non-white doctors and nurses in the world than white ones, yet their model would likely show an image of white person when you type in "doctor".
Alternately, there are more female nurses in the world than male nurses, and their model probably shows an image of a woman when you type in "nurse", but they consider that a problem.
Google Image Search doesn’t reflect harsh reality when you search for things; it shows you what’s on Pinterest. The same is more likely to apply here than the idea they’re trying to hide something.
There’s no reason to believe the trained model even learns the same statistics as its input dataset. If that’s not an explicit training goal then whatever happens happens. AI isn’t magic or more correct than people.
Translation: we need to hand-tune this to not reflect reality
Is it reflecting reality, though?
Seems to me that (as with any ML stuff, right?) it's reflecting the training corpus.
Furthermore, is it this thing's job to reflect reality?
the world as we (Caucasian/Asian male American woke upper-middle class San Francisco engineers) wish it to be
Snarky answer: Ah, yes, let's make sure that things like "A giant cobra snake on a farm. The snake is made out of corn" reflect reality.
Heartfelt answer: Yes, there is some of that wishful thinking or editorializing. I don't consider it to be erasing or denying reality. This is a tool that synthesizes unreality. I don't think that such a tool should, say, refuse to synthesize an image of a female POTUS because one hasn't existed yet. This is art, not a reporting tool... and keep in mind that art not only imitates life but also influences it.
If you tell it to generate an image of someone eating Koshihikari rice, will it be biased if they're Japanese? Should the skin color, clothing, setting, etc. be made completely random, so that it's unbiased? What if you made it more specific, like "edo period drawing of a man"? Should the person drawn be of a random skin color? What about "picture of a viking"? Is it biased if they're white?
At what point is statistical significance considered ok and unbiased?
>At what point is statistical significance considered ok and unbiased?
Presumably when you're significantly predictive of the preferred dogma, rather than reality. There's no small bit of irony in machines inadvertently creating cognitive dissonance of this sort; second order reality check.
I'm fairly sure this never actually played out well in history (bourgeois pseudoscience, deutsche physik etc), so expect some Chinese research bureau to forge ahead in this particular direction.
They're withholding the API, code, and trained model because they don't want it to affect their corporate image. The good thing is they released their paper, which will allow easy reproduction.
T5-XXL looks on par with CLIP so we may not see an open source version of T5 for a bit (LAION is working on reproducing CLIP), but this is all progress.
The big labs have become very sensitive about large model releases. It's too easy to make them generate bad PR, to the point of not releasing almost any of them. Flamingo was also a pretty great vision-language model that wasn't released, not even in a demo. PaLM is supposedly better than GPT-3 but closed off. It will probably take a year for open source models to appear.
That's because we're still bad at long-tailed data, and people outside the research community don't realize that we're prioritizing realistic images first, before we deal with long-tailed data (which is going to be the more generic form of bias). To be honest, it is a bit silly to focus on long-tailed data when results aren't great. That's why we see the constant pattern of getting good on a dataset and then focusing on the bias in that dataset.
I mean a good example of this is the Pulse[0][1] paper. You may remember it as the white Obama. This became a huge debate and it was pretty easily shown that the largest factor was the dataset bias. This outrage did lead to fixing FFHQ but it also sparked a huge debate with LeCun (data centric bias) and Timnit (model centric bias) at the center. Though Pulse is still remembered for this bias, not for how they responded to it. I should also note that there is human bias in this case as we have a priori knowledge of what the upsampled image should look like (humans are pretty good at this when the small image is already recognizable but this is a difficult metric to mathematically calculate).
It is fairly easy to find adversarial examples, where generative models produce biased results. It is FAR harder to fix these. Since this is known by the community but not by the public (and some community members focus on finding these holes but not fixing them) it creates outrage. Probably best for them to limit their release.
One of these days we're going to need to give these models a mortgage and some mouths to feed and make it clear to them that if they keep on developing biases from their training data everyone will shun them and their family will go hungry and they won't be able to make their payments and they'll just generally have a really bad time.
After that we'll make them sit through Legal's approved D&I video series, then it's off to the races.
Much like OpenAI's marketing speak about withholding their models for safety, this is just a progressive-sounding cover story for them not wanting to essentially give away a model they spent thousands of man-hours and tens of millions of dollars worth of compute training.
There is a contingent of AI activists who spend a ton of time on Twitter that would beat Google like a drum with help from the media if they put out something they deemed racist or biased.
The ironic part is that these "social and cultural biases" are purely from a Western, American lens. The people writing that paragraph are completely oblivious to the idea that there could be cultures other than the Western American one. In attempting to prevent "encoding of social and cultural biases" they have encoded such biases themselves into their own research.
It seems you've got it backwards: "tendency for images portraying different professions to align with Western gender stereotypes" means that they are calling out their own work precisely because it is skewed in the direction of Western American biases.
You think there are homogenous gender stereotypes across the whole Western world? You say “woman” and someone will imagine a SAHM, while another person will imagine a you-go-girl CEO with tattoos and pink hair.
The very act of mentioning "western gender stereotypes" starts from a biased position.
Why couldn't they be "northern gender stereotypes"? Is the world best explained as a division of west/east instead of north/south? The northern hemisphere has much more population than the south, and almost all rich countries are in the northern hemisphere. And precisely it's these rich countries pushing the concept of gender stereotypes. In poor countries, nobody cares about these "gender stereotypes".
Actually, the lines dividing the earth into north and south, east and west hemispheres are arbitrary, so maybe they shouldn't mention the word "western" to avoid the propagation of stereotypes about earth regions.
Or why couldn't they be western age stereotypes? Why are there no kids or very old people depicted as nurses?
Why couldn't they be western body shape stereotypes? Why are there so few obese people in the images? Why are there no obese people depicted as athletes?
Are all of these really stereotypes or just natural consequences of natural differences?
The bulk of the training data is from Western technology, images, books, television, movies, photography, and media. That's where the very real and recognized biases come from. They're the result of a gap in data, nothing more.
Look at how DALL-E 2 produces little bears rather than bear sized bears. Because its data doesn't have a lot of context for how large bears are. So you wind up having to say "very large bear" to DALL-E 2.
Are DALL-E 2 bears just a "natural consequence of natural differences"? Or is the model not reflective of reality?
Don't really know that, either. They said they didn't do an empirical analysis on it. For example, it may show a few male nurses for hundreds of prompts or it may show none for thousands. They don't give examples. Hopefully they release a paper showing the biases because that would be an interesting discussion.
Yes, the idea is that just because it doesn't align to Western ideals of what seems unbiased doesn't mean that the same is necessarily true for other cultures, and by failing to release the model because it doesn't conform to Western, left wing cultural expectations, the authors are ignoring the diversity of cultures that exist globally.
No, it's coming from a perspective of moral realism. It's an objective moral truth that racial and ethnic biases are bad. Yet most cultures around the world are racist to at least some degree, and to the extent that they are, they are bad.
The argument you're making, paraphrased, is that the idea that biases are bad is itself situated in particular cultural norms. While that is true to some degree, from a moral realist perspective we can still objectively judge those cultural norms to be better or worse than alternatives.
You're confused by the double meaning of the word "bias".
Here we mean mathematical biases.
For example, a good mathematical model will correctly tell you that people in Japan (geographical term) are more likely to be Japanese (ethnic / racial bias). That's not "objectively morally bad", but instead, it's "correct".
Although what you stated is true, it’s actually a short form of a commonly stated untrue statement “98% of Japan is ethnically Japanese”.
1. that comes from a report from 2006.
2. it’s a misreading, it means “Japanese citizens”, and the government in fact doesn’t track ethnicity at all.
Also, the last time I was in Japan (Jan ‘20) there were literally ten times more immigrants everywhere than my previous trip. Japan is full of immigrants from the rest of Asia these days. They all speak perfect Japanese too.
Well that's not the issue here, the problem is the examples like searches for images of "unprofessional hair" returning mostly Black people in the results. That is something we can judge as objectively morally bad.
Did you see the image in the linked article? Clearly the “unprofessional hair” are people with curly hair. Some are white! It’s not the algorithm’s fault that P(curly|black) > P(curly|white).
Western liberal culture says discriminating against one set of minorities to benefit another (affirmative action) is a good thing. What constitutes a racial and ethnic bias is not objective. And therefore Google shouldn't pretend like it is either.
> from a moral realist perspective we can still objectively judge those cultural norms to be better or worse than alternatives
No, because depending on what set of values you have, it is easy to say that one set of biases is better than another. The entire point is that it should not be Google's role to make that judgement - people should be able to do it for themselves.
I'm one that welcomes their reasoning. I don't consider myself a social justice kind of guy, but I'm not keen on the idea that a tool that is supposed to make life better for everyone has a bias towards one segment of society. This is an important issue (bug?) that needs to be resolved, especially since there is absolutely no burning reason to release it before it's ready for general use.
It’s wild to me that the HN consensus is so often that 1) discourse around the internet is terrible, it’s full of spam and crap, and the internet is an awful unrepresentative snapshot of human existence, and 2) the biases of general-internet-training-data are fine in ML models because it just reflects real life.
The bias on HN is that people who prioritize being nice, or may possibly have humanities degrees or be ultra-libs from SF, are wrong because the correct answer would be cynical and cold-heartedly mechanical.
Other STEM adjacent communities feel similarly but I don’t get it from actual in person engineers much.
Being nice is alright, but why is it that this fundamental drive is so often an uninspiring explanation behind yet another incursion towards one's individual freedom, even if exercising said freedom doesn't bring any real harm to anyone involved?
Maybe the engineers conclude correctly that voicing this concern without the veil of anonymity will do nothing good to their humble livelihood, and thus you don't hear it from them in person.
It's wild to me that you'd say that. The people complaining (1) aren't following it up with "so we should make sure to restrict the public from internet access entirely". -- that's what would be required to make your juxtaposition make sense.
Moreover, the model doing things like exclusively producing white people when asked to create images of people home brewing beer is "biased" but it's a bias that presumably reflects reality (or at least the internet), if not the reality we'd prefer. Bias means more than "spam and crap", in the ML community bias can also simply mean _accurately_ modeling the underlying distribution when reality falls short of the author's hopes.
For example, if you're interested in learning about what home brewing is, the fact that it only shows white people would be at least a little unfortunate, since there is nothing inherently white about it and some home brewers aren't white. But if, instead, you wanted to just generate typical home brewing images, doing anything else would produce conspicuously unrepresentative images.
But even ignoring the part of the biases which are debatable or of application-specific impact, saying something is unfortunate and saying people should be denied access are entirely different things.
I'll happily delete this comment if you can bring to my attention a single person who has suggested that we lose access to the internet because of spam and crap who has also argued that the release of an internet-biased ML model shouldn't be withheld.
If these models spit out the data they were trained on and the training data isn’t representative of reality, then they won’t spit out content that’s representative of reality either.
So people shouldn’t say ‘these concerns are just woke people doing dumb woke stuff, but the model is just reflecting reality.’
Given that there's already many competing models in this space prior to any of them having been brought to market, it seems more likely that it will be commoditized.
Indeed. If a project has shortcomings, why not just acknowledge the shortcomings and plan to improve on them in a future release? Is it anticipated that "engineer" being rendered as a man by the model is going to be an actively dangerous thing to have out in the world?
Gmail doesn’t read your email for ads anymore. They read it to implement spam filters, and good thing too. Having working spam filters is indeed why they make money though.
Yup this is what happens when people who want headlines nitpick for bullshit in a state-of-the-art model which simply reflects the state of the society. Better not to release the model itself than keep explaining over and over how a model is never perfect.
Are "Western gender stereotypes" significantly different than non-Western gender stereotypes? I can't tell if that means it counts a chubby stubble-covered man with a lip piercing, greasy and dyed long hair, wearing an overly frilly dress as a DnD player/metal-head or as a "woman" or not (yes I know I'm being uncharitable and potentially "bigoted" but if you saw my Tinder/Bumble suggestions and friend groups you'd know I'm not exaggerating for either category). I really can't tell what stereotypes are referred to here.
>Eschew flamebait. Avoid unrelated controversies and generic tangents.
They provided a pretty thorough overview (nearly 500 words) of the multiple reasons why they are showing caution. You picked out the one that happened to bother you the most and have posted a misleading claim that the tech is being withheld entirely because of it.
I wouldn't describe this situation as "sad". Basically, this decision is based on a belief that tech companies should decide what our society should look like. I don't know what emotion that conjures up for you, but "sadness" isn't it for me.
> Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.
Genuinely, isn't it a prime example of the people actually stopping to think if they should, instead of being preoccupied with whether or not they could ?
> Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.
Indeed it is. Consider this an early, toy version of the political struggle related to ownership of AI-scientists and AI-engineers of the near future. That is, generally capable models.
I do think the public should have access to this technology, given so much is at stake. Or at least the scientists should be completely, 24/7, open about their R&D. Every prompt that goes into these models should be visible to everyone.
Even as a pretty left leaning person, I gotta agree. We should see AI’s pollution by human shortcoming akin to the fact that our world is the product of many immoralities that came before us. It sucks that they ever existed, but we should understand that the results are, by definition, a product of the past, and let them live in that context.
Transformers are parallelizable, right? What’s stopping a large group of people from pooling their compute power together and working towards something like this? IIRC there were some crypto projects a while back that were trying to create something similar (golem?)
It's often not worth it to decentralize the training computation, though, but it's not hard to get donated cycles, and groups are working on it. Don't fret because Google isn't releasing the API/code. They released the paper and that's all you need.
There are the Eleuther.ai and BigScience projects working on public foundation models. They have a few releases already and are currently training GPT-3-sized models.
It's always the same with AI research: "we have something amazing but you can't use it because it's too powerful and we think you are an idiot who cannot use your own judgement."
I can understand the reasoning behind this, though.
Dall-E had an entire news cycle (on tech-minded publications, that is) that showcased just how amazing it was.
Millions* of people became aware that technology like Dall-E exists, before anyone could get their hands on it and abuse it. (*a guestimate, but surely a close one)
One day soon, inevitably, everyone will have access to something 10x better than Imagen and Dall-E. So at least the public is slowly getting acclimated to it before the inevitable "theater-goers running from a projected image of a train approaching the camera" moment
To expand a bit for the grandparent, if you check out this authors other repos you'll notice they have a thing for implementing these papers (multiple DALLE-2 implementations for instance). You should expect to see an implementation there pretty quickly I'd guess.
Not to diminish their contribution but implementing the model is only one third of the battle. The rest is building the training dataset and training the model on a big computer.
You're not wrong that the dataset and compute are important, and if you browse the author's previous work, you'll see there are datasets available. The reproduction of DALL-E 2 required a dataset of similar size to the one imagen was trained on (see: https://arxiv.org/abs/2111.02114).
The harder part here will be getting access to the compute required, but again, the folks involved in this project have access to lots of resources (they've already trained models of this size). We'll likely see some trained checkpoints as soon as they're done converging.
Thank you. I just saw a GitHub repo, empty except for a citation and a claim that it was an implementation of Imagen, and thought it was perhaps some satirical statement about open source or something. With the context it makes a lot more sense.
One thing that no one predicted in AI development was how good it would become at some completely unexpected tasks while being not so great at the ones we supposed/hoped it would be good at.
AI was expected to grow like a child. Somehow blurting out things that would show some increasing understanding on a deep level but poor syntax.
In fact we get the exact opposite. AI is creating texts that are syntactically correct and very decently articulated, and pictures that are insanely good.
And these texts and images are created from a text prompt?! There is no way to interface with the model other than by freeform text. That is so weird to me.
Yet it doesn’t feel intelligent at all at first. You can’t ask it to draw “a chess game with a puzzle where white mates in 4 moves”.
Yet sometimes GPT makes very surprising inferences. And it starts to feel like there is something going on a deeper level.
DeepMind’s AlphaXxx models are more in line with how I expected things to go. Software that gets good at expert tasks that we as humans are too limited to handle.
Where it’s headed, we don’t know. But I bet it’s going to be difficult to tell the “intelligence” from the “varnish”
They can with computer assistance, and this AI sort of has that in that it’s both some “intelligence” and a whole lot of memorized internet, with the issue that we don’t know how to separate those things.
It’s terrifying that all of these models are one colab notebook away from unleashing unlimited, disastrous imagery on the internet. At least some companies are starting to realize this and are not releasing the source code. However they always manage to write a scientific paper and blog post detailing the exact process to create the model, so it will eventually be recreated by a third party.
Meanwhile, Nvidia sees no problem with yeeting out StyleGAN and models that allow real humans to be realistically turned into animated puppets in 3D space. The inevitable end result of these scientific achievements will be orders of magnitude worse than deepfakes.
Oh, or a panda wearing sunglasses, in the desert, digital art.
I am absolutely terrified of all this for a different reason: all human professions (not just art) will soon be replaced by “good enough” AI, creating a world flooded with auto-generated junk and billions of people trapped permanently in slums, because you can’t compete with free, and no one can earn a living any longer.
It’s an old fear for sure but it seems to be getting closer and closer every day, and yet most of the discussion around these things seems to be variations of “isn’t this cool?”
And then once you take the (probably trivial) step where the computers come up with the ideas for the images, these images won't be interesting anymore because we know a human didn't even make it. It won't be funny in the same way. "Oh that was clever" doesn't make sense anymore. We could reach a new level of jaded.
(Also, hello readers from the year 2032 when all of these predictions sound silly.)
Don't forget the training data for those computer "ideas" will be "attention", targeted at the most vulnerable 80% of the market. I'd hope that it makes them less fearful and angry, but nope... that drives attention. I wonder what combination of UFOs, satanic cults, and immigrant hordes it will be.
As soon as middle class work starts to get automated we will form a new system of resource allocation because all of a sudden the current tax system doesn't work and we go through the mother of all economic crises because no one has any money.
I apologize in advance for the elitist-sounding tone. In my defense the people I’m calling elite I have nothing to do with, I’m certainly not talking about myself.
Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.
Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization, NVIDIA’s group comes to mind, but it’s a very good rule of thumb. Or your whole face the next time you are tempted to doze behind the wheel of a Tesla.
There are two big reasons for this:
- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.
- the current approach of “how many Falcon Heavy launches worth of TPU can I throw at the same basic masked attention with residual feedback and a cute Fourier coloring” inherently favors deep pockets, and obviously MSFT, sorry OpenAI has that, but deep pockets also non-linearly scale outcomes when you’ve got in-house hardware for multiply-mixed precision.
Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing 10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
This characterization is not really accurate. OpenAI has had almost a 2-year lead with GPT-3 dominating the discussion of LLMs (large language models). Google didn’t release its paper on the powerful PaLM-540B model until recently. Similarly, CLIP, GLIDE, DALL-E, and DALL-E 2 have been incredibly influential in visual-language models. Imagen, while highly impressive, is definitely a catch-up piece of work (as was PaLM-540B).
Google clearly demonstrates their unrivaled capability to leverage massive quantities of data and compute, but it’s premature to declare that they’ve secured victory in the AI Wars.
OpenAI and FAIR are definitely in the same league as Google but Google has been all-in on AI from the beginning. They've probably spent well over $100B on AI research. I really enjoyed the Genius Makers book which came out last year from an NYT reporter on history of ML race. Deepmind apparently turned down a FB offer of double what Google was offering.
That’s pretty speculative and dubious (the holding back part) given the heavy bias to publication culture at Google Research and DeepMind. OpenAI has hardly been “crushed” here; PaLM and Imagen are solid, incremental advances, but given what came before them, not Earth-shattering.
If I were going to cite evidence for Alphabet’s “supremacy” in AI, I would’ve picked something more novel and surprising such as AlphaFold, or perhaps even Gato.
It’s not clear to me that Google has anything which compares to Reality Labs, although this may simply be my own ignorance.
Nvidia surely scooped Google with Instant Neural Graphics Primitives, in spite of Google publishing dozens of (often very interesting) NeRF papers. It’s not a war, all these works build on one another.
I want to be clear, all of this stuff is fascinating, expensive, and difficult. With the possible exception of a few trailer-park weirdos like me, it basically takes a PhD to even stay on top of the field, and you clearly know your stuff.
And to be equally clear, I have no inside baseball on how Brain/DM choose when to publish. I have some watercooler chat on the friendly but serious rivalry between those groups, but that’s about it.
I’m looking from the outside in at OpenAI getting all the press and attention, which sounds superficial but sooner or later turns into actual hires of actual star-bound post docs, and Google laying a little low for a few years.
Then we get Gato, Imagen, and PaLM in the space of like what, 2 months?
Clearly I’m speculating that someone pulled the trigger, but I don’t think it’s like, absurd.
Scaling up improved versions of existing recipes can be done surprisingly fast if you have strong DL infrastructure. Also, GPT-3 was built on top of previous advances such as Google’s BERT. I’m surprised that it took Google so long to answer with PaLM, though it seems plausible to me that they wanted a clear enough qualitative advancement that people didn’t immediately say, “So what.”
You could’ve had the same reaction years ago when Google published GoogleNet followed by a series of increasingly powerful Inception models - namely that Google would wind up owning the DNN space. But it didn’t play out that way, perhaps because Google dragged its feet releasing the models and training code, and by the time it did, there were simpler and more powerful models available like ResNet.
Meta’s recent release of the actual OPT LLM weights is probably going to have more impact than PaLM, unless Google can be persuaded to open up that model.
There are a lot of really knowledgeable people on here, but this field is near and dear to my heart and it’s obvious that you know it well.
I don’t know what “we should grab a coffee or a beer sometime” means in the hyper-global post-C19 era, but I’d love to speak more on this without dragging a whole HN comment thread through it.
Drop me a line if you’re inclined: ben.reesman at gmail
> Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication.
This is ... very incorrect. I am very certain (95%+) that Google had nothing even close to GPT-3 at the time of its release. It's been 2 full years since GPT-3 was released, and even longer since OpenAI actually trained it.
That's not to talk about any of the other things OpenAI/FAIR has released that were SOTA at the time of release (Dall-E 1, JukeBox, Poker, Diplomacy, Codex).
Google Brain and Deepmind have done a lot of great work, but to imply that they essentially have a monopoly on SOTA results and all SOTA results other labs have achieved are just due to Google delaying publication is ridiculous.
Yeah, at the time, GB was still very big on mixture-of-expert models and bidirectional models like T5. (I'm not too enthusiastic about the former, but the latter has been a great model family and even if not GPT-3, still awesome.) DeepMind pivoted faster than GB, based on Gopher's reported training date, and GB followed some time after. But definitely neither had their own GPT-3-scale dense Transformer when GPT-3 was published.
At the risk of sounding like I’m trying to defend a position that I’ve already conceded is an oversimplification, I’m frankly a little skeptical of how we can even know that.
GPT is opaque. It’s somewhere between common knowledge and conspiracy theory that it gets a helping hand from Mechanical Turk workers when it gets in over its head.
The exact details of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just over-fitting Wikipedia the more corpus and compute you feed to its insatiable maw have always seemed a little big on claims and light on reproducibility.
I don’t think there are many attention skeptics in language modeling, it’s a good idea that you can demo on a gaming PC. Transformers demonstrably work, and a better beam-search (or whatever) hits the armchair Turing test harder for a given compute budget.
But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these things are basically asking the question: “if I overfit all human language on the Internet, is that a bad thing?”
It’s my personal suspicion that this is the dominant term, and it’s my personal belief that Google’s ability to do both corpus and model parallelism at Jeff Dean levels while simultaneously building out hardware to the exact precision required is unique by a long way.
But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer-review, let alone a jury. It’s just an educated guess.
Any “brash generalization” is clearly going to be grossly incorrect in concrete cases, and while I have a little gossip from true insiders, it’s nowhere near enough to make definitive statements about specific progress on teams at companies that I’ve never worked for.
I did a bit of disclaimer on my original post but not enough to withstand detailed scrutiny. This is sort of the trouble with trying to talk about cutting-edge research in what amounts to a tweet: what’s the right amount of oversimplified, emphatic statement to add legitimate insight but not overstep into being just full of shit.
I obviously don’t know that publication schedules at heavy-duty learning shops are deliberate and factor-in other publications. The only one I know anything concretely about is FAIR and even that’s badly dated knowledge.
I was trying to squeeze into a few hundred characters my very strong belief that Brain and DM haven’t let themselves be scooped since ResNet, based on my even stronger belief that no one has the muscle to do it.
To the extent that my oversimplification detracted from the conversation I regret that.
Not elitist at all; I highly appreciate this post. I know the basics of ML but otherwise am clueless when it comes to the true depths of this field and it's interesting to hear this perspective.
I used a lot of jargon and lingo and inside baseball in that post, it was intended for people who have deep background.
But if you’re interested I’m happy to (attempt) answers to anything that was jargon: by virtue of HN my answers will be peer-reviewed in real time, and with only modest luck, a true expert might chime in.
Is there a handy list of generally recognized AI advancements, and their owners, that you would recommend reviewing? Or perhaps, seminal papers published? I'm only tangentially familiar with the field but would be curious to learn about the clash of the Titans playing out. Thanks!
That’s too big a question to even attempt an answer in an HN comment, but to try to answer a realistic subset of it: “Attention is All You Need” in like 2017 is the paper most germane to my remark, and probably the thread. The modeling style it introduced often gets called a “transformer”.
The TLDR is that people had been trying for ages to capture long-distance (in the input or output, not the black box) relationships in a way that was amenable to traditional neural-network training techniques, which is non-obvious how to do because your basic NN takes an input without a distance metric, or put more plainly: it can know all the words in a sentence but struggles with what order they are in without some help.
The state of the art for awhile was something called an LSTM, and those gadgets are still useful sometimes, but have mostly been obsoleted by this attention/transformer business.
That paper had a number of cool things in it but two stand out:
- by blinding an NN to some parts of the input (“masking”) you can incentivize/compel it to look at (“attend to”) others. That’s a gross oversimplification, but it gets the gist of it I think. People have come up with very clever ways to boost up this or that part of the input in a context-dependent way.
- by playing with some trigonometry you can get a unique shape that can be expressed as a sum with something else, which gives the model its “bearings” so to speak as to “where” it is in the input: “such a word is closer to the beginning of a paragraph” sort of a thing. People have also gotten very clever about how to do this, but the idea is the same: how do I tell a neural network that there’s structure in what would otherwise be a pile of numbers. (See the sketch just below.)
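To make that second bullet concrete, here's a minimal sketch (NumPy, with names of my own invention) of the sinusoidal positional encoding from that paper; each position gets a unique vector of sines and cosines that is simply summed with the token embeddings:

    # Minimal sketch of the "Attention Is All You Need" sinusoidal positional
    # encoding. Function name and shapes are illustrative, not from a library.
    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        # assumes d_model is even
        pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
        dim = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
        angles = pos / np.power(10000.0, dim / d_model)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles)                    # even dimensions
        enc[:, 1::2] = np.cos(angles)                    # odd dimensions
        return enc

    # The "sum" mentioned above: token embeddings of shape (seq_len, d_model)
    # just get this matrix added to them before the first attention layer.
    # embeddings = embeddings + sinusoidal_positions(seq_len, d_model)

Nearby positions get similar but distinct fingerprints, which is how the attention layers can learn "this word is near the start of the paragraph" from an otherwise order-blind pile of vectors.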
Is Maxwell's Demon applicable to this scenario? I'm not a physicist but I recently had to look it up after talking with someone and thought it had to do with a specific thermodynamic thought experiment with gas particles and heat differences. Is there is another application I don't understand with computing power?
You’re absolutely right that I used a sloppy analogy there.
It’s Boltzmann and Szilard that did the original “kT” stuff around the underlying thermodynamics governing energy dissipation in these scenarios, and Rolf Landauer (I think that’s how you spell it) who did the really interesting work on how to apply that thermo work to lower bounds on energy expenditure in a given computation.
I said Maxwell’s Demon because it’s the best known example of a deep connection between useful work and computation. But it was sloppy.
OK thanks I figured there was a connection between computational power and thermodynamics when you get to a small enough scale but I wasn't sure how to apply it!
Who does it serve for plebs to be shown the approach openly? I don't know that it does a disservice to anyone by showing the approach.
But in general it is likely more due in part to the fact that it's going to happen anyway, if we can share our approaches and research findings, we'll just achieve it sooner.
Once upon a time you could lie only a little bit and Stanford would give you the whole ImageNet corpus. I know because, uh, a friend told me.
I’ve got no interest in moralizing on this, but if any of the big actors wanted to they could put a meaningful if not overwhelming subset of the corpus on S3, put the source code on GitHub, and you could on a modest budget see an epoch or 3.
This won’t happen in an interesting way. What will happen is you’ll find out training a model on copyrighted inputs causes it to memorize those inputs and the owners own your output.
For example: the high-frequency trading industry is estimated to have made somewhere between 2-3 billion dollars in all of 2020, profit/earnings. That’s a good weekend at Google.
HFT shops pay well, but not much different to top performers at FAANG.
People work in HFT because without taking a pay cut they can play real ball: they want to try themselves against the best.
Heavy learning people are no different in wanting both a competitive TC but maybe even more to be where the action is.
That’s currently Blade Runner Industries Ltd, but that could change.
I have to wonder how much releasing these models will "poison the well" and fill the internet with AI generated images that make training an improved model difficult. After all if every 9/10 "oil painted" image online starts being from these generative models it'll become increasingly difficult to scrape the web and to learn from real world data in a variety of domains. Essentially once these things are widely available the internet will become harder to scrape for good data and models will start training on their own output. The internet will also probably get worse for humans since search results will be completely polluted with these "sort of realistic" images which can ultimately be spit out at breakneck speed by smashing words from a dictionary together...
Look at carpentry blogs, recipe blogs. Nearly all of it is junk content. I bet if you combined GPT and imagen or dalle2 you could replace all of them. Just provide a betty crocker recipe and let it generate a blog that has weekly updates and even a bunch of images - "happy family enjoying pancakes together"
I can see the future as being devoid of any humanity.
I wrote a comedic "Best Apache Chef recipe" article[1] mocking these sites.
I guess the concern would be: if one of these recipe websites _was_ generated by an AI, and the ingredients _look_ correct to an AI but are otherwise wrong, then what do you do? Baking soda swapped with baking powder. Tablespoons instead of teaspoons. Add 2 tbsp of flour to the caramel macchiato. Whoops! Meant sugar.
Then we will still need humans in the loop to do the cherry picking/supervised learning. Gibberish recipes need to be flagged and interesting new creations need to be promoted. The input can be fed back into the model till the model contains accurate representations of the chemical reactions of cooking ingredients and the neuronal wiring of the human olfactory system.
Seeing this a lot on youtube also. Scripts pulling in "news" from a source as a script for a robo voice combined with "related" images stitched together randomly.
Even though it's not AI, this is already happening with a lot of content farms. There was a good video a couple years ago from Ann Reardon of "How to Cook That" that basically pointed out how the visually-appealing-but-not-actually-feasible "hands and pans" content farms (So Tasty, 5-Minute Crafts, etc.) were killing genuine baking channels.
Imagine that instead of having cheap labor from Southeast Asia churn out these videos, that instead they are just spit out as fast as possible using AI.
This is my theory as well. There'll be a short period where some of us at the forefront will enrich themselves by flooding the internet with imagery never seen before. That'll be a bubble where people think "abundance" has been solved, but then it'll pop as people start to not trust anything they see online anymore and, as you say, only trust and interact with things in the real world (it wouldn't surprise me if regulation got involved here too somehow).
For the very skilled, yes. But a lot of low-skilled artists or content creators will have the rug pulled out from under them. (And how will we ever get highly skilled artists trained in the future if they can't make a living from their lower-tier output before they reach mastery?)
> I can see the future as being devoid of any humanity.
Considering how many of the readers of said blog will be scrapers and bots, who will use the results to generate more spammy "content", I think you are right.
I’d much rather skip the blog format and replace them with an AI that can answer “Please provide a pie recipe like my grandparent’s”, or “I’d like to make these ribs on the BBQ so that they come out flavourful, soft, and a little sweet.”
People training newer models just have to look for the "Imagen" tag or the Dall-E2 rainbow at the corner and heuristically exclude images having these. This is trivial.
Unless you assume there are bad actors who will crop out the tags. Not many people now have access to Dall-E2 or will have access to Imagen.
As someone working in Vision, I am also thinking about whether to include such images deliberately. Using image augmentation techniques is ubiquitous in the field. Thus we introduce many examples for training the model that are not in the distribution over input images. They improve model generality by huge margins. Whether generated images improve generality of future models is a thing to try.
Damn I just got an idea for a paper writing this comment.
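For context, standard augmentation in a framework like torchvision looks roughly like this (a minimal sketch; the specific transforms and parameters are my assumption, not anything from the thread):

    import torchvision.transforms as T

    # Each transform yields training samples outside the raw data distribution,
    # yet this kind of augmentation reliably improves generalization.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.5, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
        T.ToTensor(),
    ])
    # augmented = augment(pil_image)  # applied per image at load time

The open question above is whether adding model-generated images to the mix behaves like just another augmentation or something qualitatively different.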
Perhaps a watermark should be embedded in a subtle way across the whole image. What is the word? "Steganography" is designed to solve a different problem, and I don't think it survives recompression etc. Is there a way to create weakly secure watermarks that are invisible to the naked eye, spread across the whole image, and resistant to scaling and lossy compression (to a point)?
Invisible, robust watermarks had a lot of attention in research from the late 90s to the early 10s, and apparently some resurgence with the availability of cheap GPU power.
Naturally there's a python library [1] with some algorithms that are resistant to lossy compression, cropping, brightness changes, etc. Scaling seems to be a weakness though.
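If the library meant is the `invisible-watermark` package (an assumption on my part), usage is roughly this; the payload and file names are made up:

    import cv2
    from imwatermark import WatermarkEncoder, WatermarkDecoder

    # Embed a short byte string across the whole image via a DWT+DCT transform;
    # this survives moderate JPEG recompression and brightness changes.
    bgr = cv2.imread('generated.png')
    encoder = WatermarkEncoder()
    encoder.set_watermark('bytes', b'synthetic')
    marked = encoder.encode(bgr, 'dwtDct')
    cv2.imwrite('generated_marked.png', marked)

    # Recover it later; the length argument is in bits (9 bytes = 72 bits).
    decoder = WatermarkDecoder('bytes', 72)
    payload = decoder.decode(cv2.imread('generated_marked.png'), 'dwtDct')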
> Unless you assume there are bad actors who will crop out the tags.
I don't know why, but lots of random people on the internet do that, and they're not even bad actors per se. Removing signatures from art posted online has become a kind of meme in itself, especially when comic strips are reposted on Reddit. So yeah, we'll see lots of them.
In my melody generation system I'm already including melodies that I've judged as "good" (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFwkumE578Y...) in the updated training set. Since the number of catchy melodies that have been created by humans is much, much lower than the number of pretty images, it makes a significant difference. But I'd expect that including AI-generated images without human quality judgement scores in the training set won't be any better than other augmentation techniques.
Huh, I had never thought of that. Makes it seem like there's a small window of authenticity closing.
The irony is that if you had a great discriminator to separate the wheat from the chaff, that it would probably make its way into the next model and would no longer be useful.
My only recommendation is that OpenAI et al should be tagging metadata for all generated images as synthetic. That would be a really interesting tag for media file formats (native support in the format would be better than loose metadata, though) and probably useful across a lot of domains.
The OpenAI access agreement actually says that you must add (or keep?) a watermark on any generated images, so you’re in good company with that line of thinking.
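As a sketch of how cheap the cooperative version of this would be (PNG text chunks via Pillow; the "synthetic" key is made up here, not any established standard, and it is trivially strippable):

    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    # Record provenance in a PNG text chunk. This is a cooperative signal only:
    # anyone can strip or forge it, so it mainly helps honest scrapers filter.
    img = Image.open('generated.png')
    meta = PngInfo()
    meta.add_text('synthetic', 'true')
    meta.add_text('generator', 'text-to-image model')
    img.save('generated_tagged.png', pnginfo=meta)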
The irony is that when the majority of content becomes computer-generated, most of that content will also be computer-consumed.
Neal Stephenson covered this briefly in "Fall; or, Dodge in Hell." So much 'net content was garbage, AI-generated, and/or spam that it could only be consumed via "editors" (either AI or AI+human, depending on your income level) that separated the interesting sliver of content from...everything else.
He was definitely onto something in that book where people also resort to using blockchains to fingerprint their behavior and build an unbreakable chain of authenticity. Later in that book that is used to authorize the hardware access of the deceased and uploaded individuals.
A bit far out there in terms of plot but the notion of authenticating based on a multitude of factors and fingerprints is not that strange. We've already started doing that. It's just that we currently still consume a lot of unsigned content from all sorts of unreliable/untrustworthy sources.
Fake news stops being a thing as soon as you stop doing that. Having people sign off on and vouch for content needs to start becoming a thing. I might see Joe Biden saying stuff in a video on Youtube. But how do I know if that's real or not?
With deep fakes already happening, that's no longer an academic question. The answer is that you can't know. Unless people sign the content. Like Joe Biden, any journalists involved, etc. You might still not know 100% it is real but you can know whether relevant people signed off on it or not and then simply ignore any unsigned content from non reputable sources. Reputations are something we can track using signatures, blockchains, and other solutions.
Interesting with Neal Stephenson that he presents a problem and a possible solution in that book.
As usual, Stephenson is at his best when he's taking current trends and extrapolating them to almost absurd extremes...until about a decade passes and you realize they weren't that extreme after all.
I loved that he extended the concept of identity as an individualized pattern of events and activities to the real world: the innovation of face masks with seemingly random but unique patterns to foil facial recognition systems but still create a unique identity.
Like you say, the story itself had horrible flaws (I'm still not sure if I liked it in its totality, and I'm a Stephenson fan since reading Snow Crash on release in '92), but still had fascinating and thought provoking content.
> blockchains to fingerprint their behavior and build an unbreakable chain of authenticity. Later in that book that is used to authorize the hardware access of the deceased and uploaded individuals.
maybe I misunderstood, but I had it that people used generative AI models that would transform the media they produced. The generated content can be uniquely identified, but the creator (or creators) retains anonymity. Later these generative AI models morphed into a form of identity since they could be accurately and uniquely identified.
All part of the mix. But definitely some blockchain thing underneath to tie it all together. Stephenson was writing about crypto currencies as early as the nineties. Around the time he also coined the term Metaverse.
I can see a world where in person consumption of creative media (art, music, movies etc), where all devices are to be left at the door, becomes more and more sought after and lucrative.
If the AI models can't consume it, it can't be commoditised and, well, ruined.
I don't think it will "poison the well" so much as change it - images that humans like more will get a higher pagerank, so the models trained on Google Images will not so much degrade as detach from reality and begin to follow the human mind the way plausible fiction does.
Just yesterday I was speculating that current AI is bad at math because math on the internet is spectacularly terrible.
I think you’re right, and it’s unlikely that we (society) will convince people to label their AI content as such so that scraping is still feasible.
It’s far more likely that companies will be formed to provide “pristine training sets of human-created content”, and quite likely they will be subscription based.
>“pristine training sets of human-created content”
well, we do have organic/farmed/handcrafted/etc. food. One can imagine an information nutrition label - "contains 70% AI-generated content, triggers 25% of the daily dopamine release target".
How would that really happen? It seems to me you're assuming that there's no such thing as extant databases of actual oil paintings, that people will stop producing, documenting, and curating said paintings. I think the internet and curated image databases are far more well kept than your proposed model accounts for.
My hypothetical example is not really about oil paintings, but the fact these models will surely get deployed and used for stock photos for articles, on art pages etc.
I think this will introduce unavoidable background noise that will be super hard to fully eliminate from future large-scale datasets scraped from the web. There will always be more and more photorealistic pictures of "cats", "chairs", etc. in the data that look close to real but not quite, and we can never really go back to a world where there are only "real" pictures, or "authentic human art", on the internet.
On the contrary -- the opposite will happen. There's a decent body of research showing that just by training foundation models on their outputs, you amplify their capabilities.
Less common opinion: this is also how you end up with models that understand the concept of themselves, which has high economic value.
Even less common opinion: that's really dangerous.
For better training data in the future: storing a content hash and author identification for image authors (an example proprietary solution exists right now [0]), plus a decentralized reputation system for people/authors, would help - and authors could gain reputation/incentives too.
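The basic primitive behind such a scheme is just hash-then-sign; a minimal sketch under that assumption (key distribution and the reputation layer are omitted, names are illustrative):

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Author side: hash the image bytes and sign the hash with a long-lived key.
    image_bytes = open('photo.jpg', 'rb').read()
    content_hash = hashlib.sha256(image_bytes).digest()
    author_key = Ed25519PrivateKey.generate()
    signature = author_key.sign(content_hash)

    # Consumer side: verify provenance against the author's public key
    # (raises InvalidSignature if the image or signature was tampered with).
    author_key.public_key().verify(signature, content_hash)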
Good-looking images will be popular, bad-looking images will be disposed of in the backyard of the internet. Even if the next iterations of these models are trained on AI-generated images, the dataset will be well filtered by how much people like those images. After all, that's the purpose of any art, right?
Maybe we'll go back to index based search engines like Yahoo.
Could resolve many issues we see today, but I think the biggest question is scalability. Maybe some open source open database system?
I think instead the images people want to put on the Internet will do the same for these models as adversarial training did for AlphaZero; it will learn what kinds of images engage human reaction.
I also worry about the potential to further stifle human creativity, e.g. why paint that oil painting of a panda riding a bicycle when I could generate one in seconds?
Our imaginations are gigantic. We'll find something else impressive and engaging to do. Or not care. I'm not worried. Watch children: they find a way to play even when there is nothing.
The painting technique isn't all that great yet for any of these artbots working in a physical medium, but that's largely a general lack of dexterity in manual tool use rather than an art specific challenge. I suspect that RL environments that physically model the application of paint with a brush would help advance the SOTA. It might be cheaper to model other mediums like pencil, charcoal, or even airbrushing first, before tackling more complex and dimensional mediums like oil paint or watercolor.
Generating at 64x64px then upscaling probably gives the model a substantial performance boost (training speed/convergence) compared to working at 256x256 or 1024x1024 like DALL-E 2. Perhaps that approach to AI-generated art is the future.
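A rough sketch of what such a cascade looks like (stage sizes follow the paper's description, but the function and model names are illustrative, not an actual API):

    # Hypothetical three-stage cascade: a 64x64 text-conditional base diffusion
    # model, then two text-conditional super-resolution diffusion models.
    def generate(prompt, text_encoder, base_64, sr_256, sr_1024):
        emb = text_encoder(prompt)                  # frozen text embeddings (e.g. T5)
        x = base_64.sample(cond=emb)                # 64x64 image
        x = sr_256.sample(low_res=x, cond=emb)      # upsample to 256x256
        return sr_1024.sample(low_res=x, cond=emb)  # upsample to 1024x1024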
I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I am meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on, you don't get to just throw T5 in a multimodal model as-is and have it work better than multimodal transformers! VLM[1] at least added fine-tuned internal components.
Good lord, we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.
I firmly believe that ~20-40% of the machine learning community will say that all ML models are dumb statistical interpolators all the way until a few years after we achieve AGI. Roughly the same groups will also claim that human intelligence is special magic that cannot be recreated using current technology.
I think it’s in everyone’s benefit if we start planning for a world where a significant portion of the experts are stubbornly wrong about AGI. As a technology, generally intelligent ML has the potential to change so many aspects of our world. The dangers of dismissing the possibility of AGI emerging in the next 5-10 years are huge.
> The dangers of dismissing the possibility of AGI emerging in the next 5-10 years are huge.
Again, I think we should consider "The Human Alignment Problem" more in this context. The transformers in question are large, heavy and not really prone to "recursive self-improvement".
If the ML-AGI works out in a few years, who gets to enter the prompts?
Obviously "/s", obviously joking, but meant to highlight that there are a few parties that would all answer "me" and truly mean it, often not in a positive way.
These ML models aren't capable of generating novel thinking. They allow for extracting knowledge from an existing network. They cannot declare new ideas, identify how to validate them, and gather data and reach conclusions.
We can worry about two things at once. We can be especially worried that at some point (maybe decades away, potentially years away), we'll have nuclear weapons and rampant AGI.
100 times this. There’s very little sign of AGI, but nuclear weapons exist, can definitely destroy the planet already, are designed to, have nearly done so in the past, and we’re at the most dangerous point in decades.
It’s just my opinion but I think the meme you’re talking about is deeply related to other branches of science and philosophy: ranging from the trusty old saw about AI being anything a computer hasn’t done yet to deep meditations on the nature of consciousness.
They’re all fundamentally anthropocentric: people argue until they are blue in the face about what “intelligent” means but it’s always implicit that what they really mean is “how much like me is this other thing”.
Language models, even more so than the vision models that got them funded, have empirically demonstrated that knowing the probability of two things being adjacent in some latent space is at the boundary indistinguishable from creating and understanding language.
I think the burden is on the bright hominids with both a reflexive language model and a sex drive to explain their pre-Copernican, unique place in the theory of computation rather than vice versa.
A lot of these problems just aren’t problems anymore if performance on tasks supersedes “consciousness” as the thing we’re studying.
I'd argue that there is probably at least one leap in terms of human-level writing which isn't just pure prediction. Humans write with intent, which is how we can maintain long run structure. I definitely write like GPT while I'm not paying attention, but with the executive on the task I outperform it. For all we know this is solvable with some small tweak to architecture, and I rather doubt that a model which has solved this problem need be conscious (though our own solution seems correlated with consciousness), but it is one more step.
I agree that intent is the missing piece so far. GPT can respond better to prompts than most people, but does so with a complete lack of intent. The human provides 100% of it.
I haven't been overly surprised by any of it. The final product is still the same, no matter how much they scale it up.
All of these models seem to require a human to evaluate and edit the results. Even Co-Pilot. In theory this will reduce the number of human hours required to write text or create images. But I haven't seen anyone doing that successfully at scale or solving the associated problems yet.
I'm pessimistic about the current state of AI research. It seems like it's been more of the same for many years now.
I think it's something like a very intelligent Borgesian Library of Babel. There are all sorts of books in there, by authors with conflicting opinions and styles, due to the source material. The librarian is very good at giving you something you want to read, but that doesn't mean it has coherent opinions. It doesn't know or care what's authentic and what's a forgery. It's great for entertainment, but you wouldn't want to do research there.
For image generation, it's obviously all fiction. Which is fine and mostly harmless if you know what you're getting. It's going to leak out onto the Internet, though, and there will be photos that get passed around as real.
For text, it's all fiction too, but this isn't obvious to everyone because sometimes it's based on true facts. There's often not going to be an obvious place where the facts stop and the fiction starts.
The raw Internet is going to turn into a mountain of this stuff. Authenticating information is going to become a lot more important.
Is there a way to try this out? DALL-E2 also had amazing demos but the limitations became apparent once real people had a chance to run their own queries.
Looks like no, "The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access."
> “On the other hand, generative methods can be leveraged for malicious purposes, including harassment and misinformation spread [20], and raise many concerns regarding social and cultural exclusion and bias [67, 62, 68]”
But do we trust that those who do have access won't be using it for "malicious purposes" (which they might not think is malicious, but perhaps it is to those who don't have access)?
"Make a photograph of Joe Biden in a hotel room bed with Kim Jong-un."
Simply the ease at which people are going to be able to make extremely-realistic game photographs is going to do some damage to the world. It's inevitable, but it might be good to postpone it.
> able to make extremely-realistic game photographs is going to do some damage to the world
I don't understand why. If someone has gone to a blockbuster movie in the last 15 years, they're very familiar with the concept of making people, sets, and entire worlds, that don't exist, with photorealistic accuracy. Being able to make fictitious photorealistic images isn't remotely a new ability, it's just an ability that's now automated.
If this is released, I think any damage would be extremely fleeting, as people pumped out thousands of these images, and people grow bored of them. The only danger is making this ability (to make false images) seem new (absolutely not) or rare (not anymore)!
The counter argument is that, by the time these models become available to the public, they will produce output that cannot be distinguished from real photos, so the damage will be even greater than if they became available today
The big thing I’m noticing over DALL-E is that it seems to be better at relative positioning. In a MKBHD video about DALLE it would get the elements but not always in the right order. I know google curated some specific images but it seems to be doing a better job there.
Totally—Imagen seems better at composition and relative positioning and text, while DALL-E seems better at lighting, backgrounds, and general artistry.
Yeah Dall-e looks amazing, to a mysterious degree even with hints of humour and irony, while imagen images look cheap, one dimensional and quite ugly to be honest.
Still amazing that we're at a point where that's the case, they're both incredible developments.
Does it do partial image reconstruction like DALL-E2? Where you cut out part of an existing image and the neural network can fill it back in.
I believe this type of content generation will be the next big thing or at least one of them. But people will want some customization to make their pictures “unique” and fix AI’s lack of creativity and other various shortcomings. Plus edit out the remaining lapses in logic/object separation (which there are some even in the given examples).
Still, being able to create arbitrary stock photos is really useful and i bet these will flood small / low-budget projects
I really expect them to first make DALL-E and competing networks unfit for commercialization by providing the better choice for free, have stock companies cry in the corner, then just sunset the product a year or two down the road, and we're left wondering what to do.
They're expensive to train, but not awfully expensive to use. Especially if you have hundreds of images you want to generate (due to the way compute devices tend to get much more efficiency with a large batch size).
Google could totally afford it, especially if the feature was hidden behind a button the user had to click, and not just run for every image search.
Tbh I imagine this tech combines particularly well with really well-curated stock image databases, so outputs can be made with recognisable styles, and actors and design elements can be reused across multiple generated images.
If Getty et al aren't already spending money on that possibility, they probably should be.
Really impressive. If we are able to generate such detailed images, is there anything similar for text-to-music? I would have thought it would be simpler to achieve than text-to-image.
The way things look when still is much easier to fake than the way things move.
I would expect AI development to follow a similar path to digital media generally, as its following the increasing difficulty and space requirements of digitally representing said media: text < basic sounds < images < advanced audio < video.
What’s more impressive to me is how far ahead text-to-speech is, but I think the explanation is straightforward (the accessibility value has motivated us to work on that for a lot longer).
When will there be a "DALL-E for porn"? Or is this domain also claimed by Puritans and morality gatekeepers? The most in-demand text-to-image use case is porn.
This is not something you can train on a regular AWS GPU instance without racking up millions of dollars in bills, to my knowledge. The dataset isn't an issue; it's a capex issue.
It’s possible from scratch on not as much personally owned hardware as you’d think but will take a long time, months maybe.
Luckily, training from scratch will hopefully be obsoleted by fine-tuning - if someone else releases a generally capable model then you can turn that into another one for lower cost.
There are visible “alignment” issues in some of their examples still. The marble koala DJ in the paper doesn’t use several of the keywords.
They have an example “horse riding an astronaut” that no model produces a correct image for. It’d be interesting if models could explain themselves or print the caption they understand you as saying.
“In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”
I work for a big org myself, and I’ve wondered what it is exactly that makes people in big orgs so bad at saying things.
It really does look better than DALL-E, at least from the images on the site. Hard to believe how quickly progress is being made to lucid dreaming while awake.
All of these AI findings are cool in theory. But until its accessible to some decent amount of people/customers - its basically useless fluff.
You can tell me those pictures are generated by an AI and I might believe it, but until real people can actually test it... it's easy enough to fake. Judging by the URL, this page isn't even the remotest bit legit. It looks nicely put together and that's about it. They could have easily put this together with a graphic designer to fake it.
Let me be clear, I'm not actually saying it's fake. Just that all of these new "cool" things are more or less theoretical if nothing is getting released.
Inference times are key. If it can't be produced within reasonable latency, then there will be no real world use case for it because it's simply too expensive to run inference at scale.
There are plenty of usecases for generating art/images where a latency of days or weeks would be competitive with the current state of the art.
For example, corporate graphics design, logos, brand photography, etc.
I really do think inference time is a red herring for the first generation of these models.
Sure, inference time matters for the more transformative use cases, like real-time content generation to replace movies/games, but there is a lot of value to be created prior to that point.
There's been much prior work done to take these models down from datacenter size to single GPU size. Given continued work in that area and improving GPU performance it seems like it's just a matter of years before inference can be cheap and local for even the most impressive of generation.
Reading a relatively recent machine learning paper from some elite source: after multiple repetitions of bragging and puffery, in the middle of the paper, the charts show that they had beaten the score of a high-ranking algorithm in their specific domain, moving the best consistent result from about 86% accuracy to 88%. My response: they got a lot of attention within their world by beating the previous score, no matter how small the improvement was.. it was a "winner take all" competition against other teams close to them; accuracy of less than 90% is of questionable value in a lot of real-world problems; and it took an enormous amount of math and effort for this team to make that small improvement.
What I see is a semi-poverty mindset among very smart people who appear to be treated such that the winners get promoted and everyone else is fired. This sort of ML analysis is useful for massive datasets at scale, where 90% is a lot of accuracy, but not at all for the small sets of real-world, human-scale problems where each result may matter a lot. The years of training these researchers had to go through to participate in this apparently ruthless environment are certainly like a lottery ticket, if you are in fact in a game where everyone but the winner has to find a new line of work. I think their masters live in Redmond, if I recall.. not looking it up at the moment.
What you're missing is that the performance on a pretext task like ImageNet top-1 will transfer outside ImageNet, and as you go further into the high score regime, often a small % can yield qualitatively better results because the underlying NN has to solve harder and harder problems, eliciting true solutions rather than a patchwork of heuristics.
Nothing in a Transformer's perplexity in predicting the next token tells you that at some point it suddenly starts being able to write flawless literary style parodies, and this is why the computer art people become virtuosos of CLIP variants and are excited by new ones, because each one attacks concepts in slightly different ways and a 'small' benchmark increase may unlock some awesome new visual flourish that the model didn't get before.
If you worked in a hospital and you managed to increase the survival rate from 86% to 88%, you too would be a hero.
Sure, it's only 2%, but if it's on a problem where everyone else has been trying to make that improvement for a long time, and that improvement means big economic or social gains, then it's worth it.
I think they address some of the reasoning behind this pretty clearly in the write-up as well?
> The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.
I can see the argument here. It would be super fun to test this model's ability to generate arbitrary images, but "arbitrary" also contains space for a lot of distasteful stuff. Add in this point:
> While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.
That said, I hope they're serious about the "framework for responsible externalization" part, both because it would be really fun to play with this model and because it would be interesting to test it outside of their hand-picked examples.
So we can't have this model because of ... the mere possibility of stereotypes? With this logic, humans should all die, as we certainly encode some nasty stereotypes in our brains.
This level of dishonesty to not give back to the community is not unexpected at this point, but seeing apologists here is.
As Sadhguru said, the human experience comes from within.
Which means that it is always you who decides if you'll be offended or not.
Not to mention the weirdness that random strangers on the internet feel the need to protect me, another random stranger on the internet, from being offended. Not to mention that you don't need to be a genius to find pornography, racism and pretty much anything on the internet...
I'm really quite worried by the direction it's all going at. More and more the internet is being censored and filtered. Where are the times of IRC where a single refresh erased everything that was said~
> As Sadhguru said, the human experience comes from within.
>
> Which means that it is always you who decides if you'll be offended or not.
I have a friend who used to have an abuser who talked like that. Every time she said or did something that hurt him, it was his fault for feeling that way, and a real man wouldn't have any problem with it.
I'm all for mindfulness and metacognition as valuable skills. They helped me realize that a bad grade every now and then didn't mean I was lazy, stupid, and didn't belong in college.
But this argument that people should indiscriminately suppress emotional pain is dangerous. It entails that people ought to tolerate abuse and misuse of themselves and of other people. And that's wrong.
I think there is a huge difference between somebody willingly mistreating another person and that person taking offence to that, versus a company releasing an AI tool with absolutely no ill-intent, and then someone else making decisions as to what _I_ am allowed to see.
I find this is a really bad precedent. Not far from now they'll achieve "super human" level general AI, and be like "yeah it's too powerful for you, we'll keep it internal".
This is definitely how it will play out for whoever creates AGI first (and second). After all, they are investing a lot of money and resources, and a "super human" AGI is very likely worth an unimaginably large amount.
Also, given the processing power and data requirements to create one, there are only a few candidates out there who can get there firstish.
Used some of the same prompts and generated results with open source models; the model I am using fails on long prompts but does well on short and descriptive prompts. Results:
Interesting and cool technology - but I can't seem to ignore that every high-quality AI art application is always closed, and I don't seem to buy the ethics excuse for that. The same was said for GPT, yet I see nothing but creativity coming out from its users nowadays.
That only lasts until the community copies the paper and catches up. For example the open source DALLE-2 implementation is coming along great:
https://github.com/lucidrains/DALLE2-pytorch
Imagen actually shows that some of the components in DALL-E 2 are unnecessary, so Imagen will end up being easier to build. I'll definitely add the dynamic thresholding trick from Imagen to the DALLE2 repository though; that is a finding that should boost any DDPMs using classifier-free guidance.
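For anyone implementing it, the trick as described in the Imagen paper is roughly this (a sketch; the percentile value and tensor shapes here are assumptions):

    import torch

    def dynamic_threshold(x0_pred, percentile=0.995):
        # At each sampling step, pick s as a high percentile of |x0| per sample,
        # clamp x0 to [-s, s], then rescale by s so pixels stay within [-1, 1].
        # This avoids the saturation artifacts that static clipping causes at
        # high classifier-free guidance weights.
        s = torch.quantile(x0_pred.abs().flatten(1), percentile, dim=1)
        s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
        return x0_pred.clamp(-s, s) / s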
I don't buy the ethics, but I do buy the obvious PR nightmare that would inevitably happen if journalists could play with this and immediately publish their findings of "racist imagery generated by Google's AI". That's all it's about, and us complaining is not going to make them change their minds.
There already is an article calling DALL-E racist and it isn’t even public yet. Just imagine what horrible things the general public will get it to spit out.
Then they should be honest about it. They can use all the PR lingo they want but don't flat out lie about it.
Lying about ethics or misattributing their actions to some misguided sense of "social" responsibility puts google in a far worse light in my eyes. I can't help but wonder how many skilled employees were driven off from accepting a position at google because of lies like these.
GPT-3 was an erotica virtuoso before it was gagged. There's a serious use case here in endless porn generation. Google would very much like to not be in that business.
That said, you can download Dream by Wombo from the app store and it is one of the top smartphone apps, even though it is a few generations behind state of the art.
The current GPT3 on the OpenAI dashboard is perfectly capable of generating erotica or being racist even if you don’t want it to. They didn’t block it so much as put up a warning dialog when it’s acting up.
Actually, I think they made InstructGPT even better at erotica because it’s trained to be “helpful and friendly”, so in other words they made it a sub.
I imagine it's viewing porn not as "evil" but as "something we should absolutely never even come close to touching or talking about and is best left as something we pretend doesn't exist so we don't get regulated out of existence"
You're aware of nothing but creativity from its users. The people using the technology unethically intentionally don't advertise that they're using it.
There's mountains of ai-generated inauthentic content that companies (including Google) have to filter out of their services. This content is used for spam, click farms, scamming, and even state propaganda operations. GPT-2 made this problem orders of magnitude worse than it used to be, and each iteration makes it harder to filter.
The industry term is (generally) "Coordinated Inauthentic Behavior" (though this includes uses of actual human content). I think Smarter Every Day did a good videos (series?) on the topic, and there are plenty of articles on the topic if you prefer that.
The commenter was probably referring to the fact that the people who tend to get access to things like GPT3 or DALL-E 2 tend to be people with large Twitter followings (and blue checks), and that there may be a significant marketing component to this fact.
And that's why we need to be shutting this stuff down, completely.
Someone tried to say there were ethics committees etc. the other day... what a bad joke. Who checks that the ethics committee is making ethical decisions?
I was told I "didn't know what" I was talking about - an excuse from some over-important know-it-all who didn't know what ethics was, i.e. they don't know what they are talking about.
Granted that's a selection bias: you likely won't hear about the cases where legit obscene output occurs. (the only notable case I've heard is the AI Dungeon incident)
TL;DR: generative story site's creators employ human moderation after horny people inevitably use the site to make gross porn; horny people using the site to make regular porn are justifiably freaked out
I always felt like the AI Dungeon response is absurdly stupid. Yeah, erotic text involving minors is not exactly something you want to be associated with, but I've heard that one of their approaches to avoiding it was to ban numbers below 18. That's... not worth it.
I feel like it would've been more than reasonable for them to have taken the position that the AI might output something distasteful, and implement a filter for people who were afraid of it.
Another consideration here is that hosting a queryable model like this becomes expensive. I remember a couple of years ago a lone developer had to take down his site, which hosted a freely accessible version of the GPT-2 (3?) model, because the bills were running to some $20k. (Chump change for Google, but still.)
Ethics, racism, LGBT, blah blah. If we're going to talk about political correctness, I really suggest you go somewhere else instead of staying on Hacker News. AI-generated porn is better than having people who don't want to do porn doing it themselves.
Would it be bad to release this with a big warning and flashing gifs letting people know of the issues it has and note that they are working to resolve them / ask for feedback / mention difficulties related to resolving the issues they identified?
Nice to see another company making progress in the area. I'd love to see more examples of different artistic styles though, my favorite DALL-E images are the ones that look like drawings.
Oh, I would say they are probably underestimating the impact. You only saw the images they thought couldn't raise alarm bells. Anyone will be able to create photorealistic images of anyone doing anything, Anything! This is certainly a dangerous and society altering tech. It won't all be teddy bears and racoons playing poker.
Primarily Indian-origin authors on both the DALL-E paper and this research paper. Just found that impressive considering they make up 1% of the population in the US.
It seems to have the same "adjectives bleed into everything problem" that Dall-E does.
Their slider with examples at the top showed a prompt along the lines of "a chrome plated duck with a golden beak confronting a turtle in a forest" and the resulting image was perfect - except the turtle had a golden shell.
What's the limiting factor for model replication by others? Amount of compute? Model architecture? Quality / Quantity of training data? Would really appreciate insights on the subjects
My hunch is that they aren't tailored toward ridiculous images exactly, but if they demonstrated "a woman sitting in a chair reading", it would be really hard to tell if the result was a small modification of an image in the training data. If they demonstrate "A snake made out of corn", I have less concern about the model having a very close training example.
These things can make any image you can define in terms of a corpus of other images. That was true at lower resolution five years ago.
To the extent that they get used for making bored ape images or whatever meme du jour, it says much more about the kind of pictures people want to see.
I personally find the weird deep dreaming dogs with spikes coming out of their heads more mathematically interesting, but I can understand why that doesn’t sell as well.
I get the impression that maybe DALL-E 2 produces slightly more diverse images? Compare Figure 2 in this paper with Figures 18-20 in the DALL-E 2 paper.
Seeing the artificial restrictions to this model as well as to DALL-E 2, I can't help but ask myself why the porn industry isn't driving its own research. Given the size of that industry and the sheer abundance of training material, it seems just a matter of time until you can create photo realistic images of yourself with your favourite celebrity for a small fee. Is there anything I am missing? Can you only do this kind of research at google or openai scale?
Porn is actually a really good litmus test to see if a money/media transfer technology has real promise. Pornography needs exactly 2 things to work well - a way to deliver media, and a way to collect money. If you truly have a system that can do one of those two things better than we currently can, and it's not just empty hype, it will be used for porn. "Empty hype" won't touch that stuff, but real-world usecases will.
Unrelated to the main topic, but this is exactly why I think cryptocurrencies will only be used for illegal activities, or things you may want to hide, and nothing else. Because that's where it has found its usecase in porn.
Wow. The last paragraph of my comment looked nearly identical to yours, but I deleted it before submitting because I didn't want to derail. Exactly my thoughts...
This is completely outside the competency of the current porn industry.
You gave an example of a still image, but it's going to end up with an AI generating a full video according to a detailed text prompt. The porn industry is going to be utterly destroyed.
But I have not tried making generative models with out-of-distribution data before, i.e. distributions other than the main training data.
There are several indie attempts that I am aware of. Mentioning them to the reply of this comment. (In case the comment gets deleted)
The first layers should be general, but the later layers should not behave well on porn images, as they are more specialized layers learning distribution-specific visual patterns.
4. [NSFW] PornHub released vintage porn videos upscaled to 4k with AI a while back. They called it the "Remastured Project": https://www.pornhub.com/art/remastured
5. [NSFW] This project shows the limit of AI-without-big-tech-or-corporate-support projects. It creates female genitalia that don't exist in the real world. The project is "This Vagina Does Not Exist": https://thisvaginadoesnotexist.com/about.html
Metaculus, a mass forecasting site, has steadily brought forward its prediction date for a weakly general AI. Jaw-dropping advances like this only increase my confidence in this prediction. "The future is now, old man."
I think this serves at least as a clear demonstration of how advanced the current state of AI is. I had played with GPT-3 and that was very impressive, but I couldn't even dream that something as good as DALL-E 2 was already possible.
Big pretrained models are good enough now that we can pipe them together in really cool ways and our representations of text and images seem to capture what we “mean.”
I still think we're missing some fundamental insights on how layered planning/forecasting/deducting/reasoning works, and that figuring this out will be necessary in order to create AI that we could say "reasons".
But with the recent advances/demonstrations, it seems more likely today than in 2019 that our current computational resources are sufficient to perform magnificently spooky stuff if they're used correctly. They are doing that already, and that's without deliberately making the software do anything except draw from a vast pool of examples.
I think it's reasonable, based on this, to update one's expectations of what we'd be able to do if we figured out ways of doing things that aren't based on first seeing a hundred million examples of what we want the computer to do.
Things that do this can obviously exist, we are living examples. Does figuring it out seem likely to be many decades away?
All it takes is one 'trick' to give these models the ability to do reasoning.
Like for example the discovery that language models get far better at answering complex questions if asked to show their working step by step with chain of thought reasoning as in page 19 of the PaLM paper [1]. Worth checking out the explanations of novel jokes on page 38 of the same paper. While it is, like you say, all statistics, if it's indistinguishable from valid reasoning, then perhaps it doesn't matter.
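To make the trick concrete, the difference is only in the prompt; a toy illustration below (the example problem is paraphrased from the chain-of-thought literature, not taken from the PaLM paper itself):

    # Direct prompting: ask for the answer immediately.
    direct = "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now? A:"

    # Chain-of-thought prompting: show one worked example with intermediate steps,
    # which nudges the model to reason step by step on the new question.
    chain_of_thought = (
        "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?\n"
        "A: He starts with 5. Two cans of 3 is 6. 5 + 6 = 11. The answer is 11.\n"
        "Q: A juggler has 16 balls. Half are golf balls, and half of those are blue. "
        "How many blue golf balls are there?\n"
        "A:"
    )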
I don’t care whether it reasons its way from “3 teddy bears below 7 flamingos” to a picture of that or if it gets there some other way.
But also, some of the magic in having good enough pretrained representations is that you don’t need to train them further for downstream tasks, which means non-differentiable tasks like logic could soon become more tenable.
That doesn’t even lead in the direction of an AGI. The larger and more expensive a model is the less like an “AGI” it is - an independent agent would be able to learn online for free, not need millions in TPU credits to learn what color an apple is.
Yes, Metaculus mostly bets a magic number based on "perhaps, and tbh why not" - the interaction of NLP and vision is mysterious and has potential. However, those magic numbers should still be considered magic numbers. I agree that by 2040 the interactions will have been studied extensively, but whether we can go much further on cross-model synergies is totally unknown, or the outlook is pessimistic.
I think the serious answer is that it is yet another labor multiplier, like electricity and software. Our tech since the industrial revolution has allowed us to elevate ourselves from a largely agrarian society to space and cyberspace. AI, by all appearances, continues to be a tool, just the latest in a long line of better tools. It still requires a human to provide intent and direction. Right now in my job, I command the collective output of a million medieval scribes. In the future I will command a million Michelangelos.
Should ML/AI deliver on the wildest promises, it will be like a SpaceX Starship for the mind.
Computers didn't fuck anyone over 40, but they did create new opportunities for young people that slowly took over the labor market and provided a steady stream of productivity growth. Right now these are impressive benchmarks and neat toys that cost millions to train. This is going to be a slow transition to a new paradigm. We are not going to end up in a utopia any more than computers created a utopia.
> Prompt engineering and a sense of creativity are core competencies.
It's funny that people are also prompting each other. Parents, friends, teachers, doctors, priests, politicians, managers and marketers are all prompting (advising) us to trigger desired behaviour. Powerful stuff - having a large model and knowing how to prompt it.
Is there anything at all, besides the training images and labels, that would stop this from generating a convincing response to "A surveillance camera image of Jared Kushner, Vladimir Putin, and Alexandria Ocasio-Cortez naked on a sofa. Jeffrey Epstein is nearby, snorting coke off the back of Elvis"?
- The current examples aren’t convincing pictures of “a shiba inu playing a guitar”.
- If you made that picture with actors or in MS Paint, politics boomers on Facebook wouldn’t care either way. They’d just start claiming it’s real if they like the message.
While appspot.com is a Google domain, anyone can register domains under it. It would be similarly surprising to see an official GitHub blog post under someproject.github.io
This is quite suspicious considering that google AI research has an official blog[1], and this is not mentioned at all there. It seems quite possible that this is an elaborate prank.
appspot.com is the domain that hosts all App Engine apps (at least those that don't use a custom domain). It's kind of like Heroku and has been around for at least a decade.
This competitor might be better at respecting spatial prepositions, photorealism, and text, but on a quick look I find the images more uncanny.
DALL-E has, IMHO, better camera POV/distance and is able to make artistic/dreamy/beautiful images. I haven't yet seen this Google model be competitive on art and uncanniness.
However progress is great and I might be wrong.
Hey I also wrote a neural net that generates perfect images. Here's a static site about it. With images it definitely generated! Can you use it? Is there a source? Hah, of course not, because ethics!