Hacker News
Imagen, a text-to-image diffusion model (gweb-research-imagen.appspot.com)
988 points by keveman on May 23, 2022 | 634 comments



Interesting to me that this one can draw legible text. DALLE models seem to generate weird glyphs that only look like text. The examples they show here have perfectly legible characters and correct spelling. The difference between this and DALLE makes me suspicious / curious. I wish I could play with this model.


Imagen takes text embeddings; the OpenAI model takes image embeddings instead, which is the reason. There are other models that can generate text: latent diffusion trained on LAION-400M, GLIDE, and DALL-E (1).


My understanding of the terms text and image embeddings is that they are ways of representing text or images as vectors. But, I don't understand how that would help with the process of actually drawing the symbols for those letters.


If the model takes text embeddings/tokens as an input, it can create a connection between the caption and the text on the image (sometimes they are really similar).


DALL-E 1 was able to render text[0]. That DALL-E 2 can't is probably a tradeoff introduced by unCLIP in exchange for more diverse results. Now the Google model is better still and doesn't have to make that tradeoff.

[0] https://openai.com/blog/dall-e/#text-rendering


It still has the issue of screwing up mechanical objects. In their demo, check out the wheels on the skateboards: they're all over the place.


For comparison, most humans can't draw a bicycle:

https://www.wired.com/2016/04/can-draw-bikes-memory-definite...


I blame it on the surprising structural cleverness of a bicycle. Opposing triangles probably aren’t the first thing most people think of when they think of a bicycle (vs. two wheels and some handlebars).


They also can't draw pennies, the letter 'g' with the loop, and so on (https://www.gwern.net/docs/psychology/illusion-of-depth/inde...). Bicycles may be clever, but the shallowness of mental representation is real.


I only see the problem for the paintings. If you choose a photo it's good. Could be a problem in the source data (i.e. paintings of mechanical objects are imperfect).


The latent-diffusion[1] one I've been playing with is not terrible at drawing legible text but generally awful at actually drawing the text you want (cf. [2]) (or drawing text when you don't want any.)

[1] https://github.com/CompVis/latent-diffusion.git [2] https://imgur.com/a/Sl8YVD5


I thought the weird text in DALL-E 2 was on purpose to prevent malicious use.


I know that some monstrous majority of cognitive processing is visual, hence the attention these visually creative models are rightfully getting, but personally I am much more interested in auditory information and would love to see a promptable model for music. Was just listening to "Land Down Under" from Men At Work. Would love to be able to prompt for another artist I have liked: "Tricky playing Land Down Under." I know of various generative music projects, going back decades, and would appreciate pointers, but as far as I am aware we are still some ways from Imagen/Dalle for music?


I believe we’re lacking someone training up a large music model here, but GPT-style transformers can produce music.

gwern can maybe comment here.

An actually scary thing is that AIs are getting okay at reproducing people’s voices.


Voice synthesis has been going steady. Lots of commercial and hobbyist interest: you can use 15.ai for crackerjack SaaS voice synthesis in a slick free UI; and if you want to run the models yourselves, Tortoise just released a FLOSS stack of remarkable quality.

Music, I'm afraid, appears stuck in the doldrums of small one-offs doing stuff like MIDI. Nothing like the breadth & quality of Jukebox has come out since it, even though it's super-obvious that there is a big overhang there and applying diffusion & other new methods would give you something much like DALL-E 2 / Imagen for general music.


The developer behind Tortoise is experimenting with using diffusion for music generation:

https://nonint.com/2022/05/04/friends-dont-let-friends-train...


I agree. How cool would it be to get an 8 min version of your favorite song? Or an instant DnB remix? Or 10 more songs in the style of your favorite album?


Yeah. I particularly love covers and often can hear in my head X playing Y's song. Would love tools to experiment with that for real.

In practice, my guess is that even though Dall-E level performance in music generation would be stunning and incredible, it would also be tiresome and predictable to consume on any extended basis. I mean, that's my reaction to Dall-E: I find the images astonishing and magical but can only look at them for limited periods of time. At these early stages in this new world the outputs of real individual brains are still more interesting.

But having tools like this to facilitate creation and inspiration by those brains would be so, so cool.


You can sort of do that with https://fairuseify.ml


I believe that this tech is possible, but this site doesn't provide it. Look at the source of the page: it's just a bunch of sleeps and then you 'download' the same file you provided.


The tech may be possible, but it won't solve anyone's copyright problems. The result would be a "derived work" of the original, irrespective of whether it sounded similar or not.


I tried that site and the music sounds the same. I wonder if you can use this to bypass YouTube content ID check.


Interesting discovery they made:

> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.

There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.


I think that's unsurprising. With DALL-E 1, for example, scaling the VAE (the image model generating the actual pixels) hits very fast diminishing returns, and all your compute goes into the 'text encoder' generating the token sequence.

Particularly as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & control of the prompt to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help much: if the input seed is incorrect and doesn't correctly encode a thumbnail sketch of how all these animals ought to be engaging in outdoors sports, there's nothing some low-level pixel-munging neurons can do to help much.


I was thinking more about our traditional ResNet50 trained on ImageNet vs CLIP. ResNet was limited to a thousand classes and brittle. CLIP can generalise to new concept combinations with ease. That changes the game, and the jump is based on NLP.
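For anyone who hasn't played with it, the standard zero-shot recipe with OpenAI's released CLIP looks roughly like this; the image path and label strings are made up for illustration:

  # Minimal zero-shot classification with OpenAI's released CLIP
  # (pip install git+https://github.com/openai/CLIP.git).
  # "photo.jpg" and the label strings are placeholders, not real data.
  import clip
  import torch
  from PIL import Image

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  labels = ["a corgi on a skateboard", "a snake made of corn", "a bowl of ramen"]
  image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
  text = clip.tokenize(labels).to(device)

  with torch.no_grad():
      image_features = model.encode_image(image)
      text_features = model.encode_text(text)
      # Cosine similarity between the image and each free-form text label
      image_features = image_features / image_features.norm(dim=-1, keepdim=True)
      text_features = text_features / text_features.norm(dim=-1, keepdim=True)
      probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

  print(dict(zip(labels, probs[0].tolist())))

The "class list" here is just free-form text invented at query time, which is exactly what a fixed 1000-way ResNet head can't do.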


Basically makes sense, no? DALLE-2 suffered from misunderstanding propositional logic, treating prompts as less structured than it should have. That's a text model issue! Compared to that, scaling up the image model isn't as important (especially with a few passes).


Is there a way to confirm that this extra processing relates to the language structure, and not the processing of concepts?

I wouldn’t be surprised if, lacking video and 3D understanding in the image training data, the model fails to capture things like the fear of heights, and the concept of gravity ends up being learned in the text-processing weights.


I am sure the image-text-video-audio-games model will come soon. The recent Gato was one step in that direction. There's so much video content out there, it begs for modelling. I think robotics applications will benefit the most from video.


Would be fascinated to see the DALL-E output for the same prompts as the ones used in this paper. If you've got DALL-E access and can try a few, please put links as replies!



Imagen seems more realistic whereas Dall-E 2 is more feel-good.

That is what I feel personally.


I agree with you, but for me, Dall·E 2 feels good because 90% of the time I can keep hitting the generate button and massage the prompt until I get something inspirational, surprising, or visually pleasing. Without access to Imagen, it's impossible for me to compare how much of the "realistic feels" of its images is constrained by the taste of the cherry-pickers.


Looking at these… I can’t help but wonder if these are literal examples of AI imagination?


I've started to ask myself if my own creativity is a result of random sampling from the diffusion tapestry of associated memories and experience on that topic.


What else could creativity possibly be?


I do wonder what Dall-E 2 would output for a request along the lines of "A still life of a vase of flowers in a completely new art style."


Don't have access to Dall-E 2 or Imagen but I do have [1] and [2] locally and they produced [3] with that prompt.

[1] https://github.com/nerdyrodent/VQGAN-CLIP.git [2] https://github.com/CompVis/latent-diffusion.git [3] https://imgur.com/a/dCPt35K


Nice. Latent-diffusion has come out very traditional but the VQGAN/CLIP ones are fairly original.


From my experiments, the LD one doesn't seem to have been trained on as big or as well-tagged a data set; there's a whole bunch of "in the style of X" that the VQGAN knows about but the LD doesn't. That might have something to do with it.


See the paper here: https://gweb-research-imagen.appspot.com/paper.pdf, Section E: "Comparison to GLIDE and DALL-E 2"


Imagen seems better at capturing details/nuance from the prompt, but subjectively the DALLE-2 images feel more “real” to me. Not sure why. Something about the lighting?


That feels about right. Imagen has a better text processing model, so it can tease apart the prompt, but DALLE has a rocking image part.


Can anybody give me short high-level explanation how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing.

For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.

[0] https://gweb-research-imagen.appspot.com/main_gallery_images...


Well, first they parse the language into a high level vector representation. Then they take images and add noise and train a model to remove the noise so it can start with a noisy image and produce a clear image from it. Then they train a model to map from the word representation for text to the noisy image representation for the corresponding image. Then they upsample twice to get to good resolution.

So text -> text representation -> most likely noised image space -> iteratively reduce noise N times -> upsample result

Something like that, please correct anything I'm missing.

Re: the snake corn question, it is mapping the "concept" of corn to the concept of a body as represented by intermediary learned vector representations.
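For a very rough picture of that flow, here is a toy sketch in PyTorch. Every module is a tiny stand-in for the real components (huge text-conditioned U-Nets and a frozen T5 encoder), so treat it as pseudocode that happens to run:

  import torch
  import torch.nn as nn

  class DummyDenoiser(nn.Module):
      """Stand-in for a huge text-conditioned U-Net; only here so the sketch runs."""
      def __init__(self, channels=3):
          super().__init__()
          self.net = nn.Conv2d(channels, channels, 3, padding=1)

      def forward(self, x, t, cond):
          # A real model uses the timestep t and the text embeddings cond; this stub ignores them.
          return self.net(x)

  def generate(text_emb, steps=50):
      base = DummyDenoiser()
      x = torch.randn(1, 3, 64, 64)                 # start from pure noise
      for t in reversed(range(steps)):              # iteratively remove noise
          predicted_noise = base(x, t, cond=text_emb)
          x = x - predicted_noise / steps           # crude update; real samplers are more careful
      # Two super-resolution diffusion stages (64 -> 256 -> 1024) would follow, also
      # text-conditioned; plain interpolation stands in for them here.
      return nn.functional.interpolate(x, scale_factor=4)

  img = generate(text_emb=torch.randn(1, 77, 512))  # fake "text embedding" from a frozen encoder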


In the paper they say about half the training data was an internal training set, and the other half came from: https://laion.ai/laion-400-open-dataset/


> Since guidance weights are used to control image quality and text alignment, we also report ablation results using curves that show the trade-off between CLIP and FID scores as a function of the guidance weights (see Fig. A.5a). We observe that larger variants of T5 encoder results in both better image-text alignment, and image fidelity. This emphasizes the effectiveness of large frozen text encoders for text-to-image models

I usually consider myself fairly intelligent, but I know that when I read an AI research paper I'm going to feel dumb real quick. All I managed to extract from the paper was a) there isn't a clear explanation of how it's done that was written for lay people and b) they are concerned about the quality and biases in the training sets.

Having thought about the problem of "building" an artificial means to visualize from thought, I have a very high level (dumb) view of this. Some human minds are capable of generating synthetic images from certain terms. If I say "visualize a GREEN apple sitting on a picnic table with a checkerboard table cloth", many people will create an image that approximately matches the query. They probably also see a red and white checkerboard cloth because that's what most people have trained their models on in the past. By leaving that part out of the query we can "see" biases "in the wild".

Of course there are people that don't do generative in-mind imagery, but almost all of us do build some type of model in real time from our sensor inputs. That visual model is being continuously updated and is what is perceived by the mind "as being seen". Or, as the Gorillaz put it:

  … For me I say God, y'all can see me now
  'Cos you don't see with your eye
  You perceive with your mind
  That's the end of it…
To generatively produce strongly accurate imagery from text, a system needs enough reference material in the document collection. It needs to have sampled a lot of images of corn and snakes. It needs to be able to do image segmentation and probably perspective estimation. It needs a lot of semantic representations (optimized query of words) of what is being seen in a given image, across multiple "viewing models", even from humans (who also created/curated the collections). It needs to be able to "know" what corn looks like, even from the perspective of another model. It needs to know what "shape" a snake model takes and how combining the bitmask of the corn will affect perspective and framing of the final image. All of this information ends up inside the model's network.

Miika Aittala at Nvidia Research has done several presentations on taking a model (imagined as a wireframe) and then mapping a bitmapped image onto it with a convolutional neural network. They have shown generative abilities for making brick walls that look real, for example, from images of a bunch of brick walls and running those on various wireframes.

Maybe Imagen is an example of the next step in this, by using diffusion models instead of the CNN for the generator and adding in semantic text mappings while varying the language model's weights (i.e. allowing the language model to more broadly use related semantics when processing what is seen in a generated image). I'm probably wrong about half that.

Here's my cut on how I saw this working from a few years ago: https://storage.googleapis.com/mitta-public/generate.PNG

Regardless of how it works, it's AMAZING that we are here now. Very exciting!


As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".


Check https://github.com/multimodalart/majesty-diffusion or https://github.com/lucidrains/DALLE2-pytorch

There is a Google Colab workbook that you can try and run for free :)

These are the image-text pairs behind it: https://laion.ai/laion-400-open-dataset/


> I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing.

A basic part of it is that neural networks combine learning and memorizing fluidly inside them, and these networks are really really big, so they can memorize stuff good.

So when you see it reproduce a Shiba Inu well, don’t think of it as “the model understands Shiba Inus”. Think of it as making a collage out of some Shiba Inu clip art it found on the internet. You’d do the same if someone asked you to make this image.

It’s certainly impressive that the lighting and blending are as good as they are though.


> these networks are really really big, so they can memorize stuff good.

People tend to really underestimate just how big these models are. Of course these models aren't simply "really really big" MLPs, but the cleverness of the techniques used to build them is only useful at insanely large scale.

I do find these models impressive as examples of "here's what the limit of insane amounts of data, insane amounts of compute can achieve with some matrix multiplication". But at the same time, that's all they are.

What saddens me about the rise of deep neural networks is that it really is the end of the era of true hackers. You can't reproduce this at home. You can't afford to reproduce this one in the cloud with any reasonable amount of funding. If you want to build this stuff your best bet is to go to a top-tier school, make the right connections and get hired by a mega-corp.

But the real tragedy here is that the output of this is honestly only interesting if it's the work of some hacker fiddling around in their spare time. A couple of friends hacking in their garage making images of raccoons painting is pretty cool. One of the most powerful, best-funded companies, likely the owner of the most compute resources on the planet, doing this as its crowning achievement in AI... is depressing.


The hackers will not be far behind. You can run some of the v1 diffusion models on a local machine.

I think it's fair to say that this is the way it's always been. In 1990, you couldn't hack on an accurate fluid simulation at home, you needed to be at a university or research lab with access to a big cluster. But then, 10 years later, you could do it on a home PC. And then, 10 years after that, you could do it in a browser on the internet.

It's the same with this AI stuff.

I think if we weren't in the midst of this unique GPU supply crunch, the price of a used 1070 would be about $100 right now -- such a card would be state of the art 10 years ago!


And the supply crunch is getting better (you can buy an RTX 3080 at MSRP now!) and technological progress doesn't seem to be slowing down. If the rumors are to be believed, a 4090 will be close to twice as fast as a 3090.


Some cutting-edge stuff is still being made by talented hackers using nothing but a rig of 8x 3090s: https://github.com/neonbjb/tortoise-tts

Other funding models are possible as well, in the grand scheme of things the price for these models is small enough.


To be clear, I understand the general techniques about (a) how diffusion models can be used to upsample images and generate more photorealistic (or even "cartoon realistic") results and (b) I understand how they can do basic matching of "someone typed in Shiba Inu, look for images of Shiba Inus".

What I don't understand is how they do the composition. E.g. for "A giant cobra snake on a farm. The snake is made out of corn." I think I could understand how it could reproduce the "A giant cobra snake on a farm" part. What I don't understand is how it accurately pictured "The snake is made out of corn." part, when I'm guessing it has never seen images of snakes made out of corn, and the way it combined "snake" with "made out of corn", in a way that is pretty much how I imagined it would look, is the part I'm baffled by.


> What I don't understand is how they do the composition

Convolutional filters lend themselves to rich combinatorics of compositions[1]: think of them as context-dependent texture-atoms, repulsing and attracting over the variations of the local multi-dimensional context in the image. The composition is literally a convolutional transformation of local channels encoding related principal components of context.

Astronomical amounts of computations spent via training allow the network to form a lego-set of these texture-atoms in a general distribution of contexts.

At least this is my intuition for the nature of the convnets.

1. https://microscope.openai.com/models/contrastive_16x/image_b...


a) Diffusion is not just used to upsample images but also to create them.

b) It has seen images with descriptions of "corn," "cobra," "farm," and it has seen images of "A made out of B" and "C on a D." To generate a high-scoring image, it has to make something that scores well on all of them put together.


Figure A.4 in the linked paper is a good high level overview of this model. Shame it was hidden away on page 19 in the appendix!

Each box you see there has a section in the paper explaining it in more detail.


Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high-level) the first step of "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last 2 diffusion model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".


> but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

It doesn't output it outright; it forms it slowly, finding and strengthening finer and finer-grained features among the dwindling noise, combining the learned associations between memorized convolutional texture primitives and the encoded text embeddings. In the limit of enough data the associations and primitives turn out composable enough to suffice for out-of-distribution benchmark scenes.

When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...

It works because it should. The next question is: "What are the implications?".

Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?
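To make the training side of that denoising concrete, here is a minimal sketch of the generic DDPM-style epsilon-prediction objective; it is not necessarily Imagen's exact formulation, and "model" stands for any text-conditioned denoiser:

  import torch
  import torch.nn.functional as F

  def ddpm_training_loss(model, x0, text_emb, alphas_cumprod):
      """One DDPM-style training step: noise a clean image x0, then ask the
      model to predict that noise given the timestep and the text conditioning."""
      b = x0.shape[0]
      t = torch.randint(0, len(alphas_cumprod), (b,))        # random timestep per image
      a_bar = alphas_cumprod[t].view(b, 1, 1, 1)             # cumulative signal-retention schedule
      noise = torch.randn_like(x0)
      x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
      pred = model(x_t, t, cond=text_emb)                    # denoiser predicts the added noise
      return F.mse_loss(pred, noise)

  # e.g. a simple linear beta schedule for the noise levels
  alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)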



>While we leave an in-depth empirical analysis of social and cultural biases to future work, our small scale internal assessments reveal several limitations that guide our decision not to release our model at this time.

Some of the reasoning:

>Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. We aim to make progress on several of these open challenges and limitations in future work.

Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.


This raises some really interesting questions.

We certainly don't want to perpetuate harmful stereotypes. But is it a flaw that the model encodes the world as it really is, statistically, rather than as we would like it to be? By this I mean that there are more light-skinned people in the west than dark, and there are more women nurses than men, which is reflected in the model's training data. If the model only generates images of female nurses, is that a problem to fix, or a correct assessment of the data?

If some particular demographic shows up in 51% of the data but 100% of the model's output shows that one demographic, that does seem like a statistics problem that the model could correct by just picking less likely "next token" predictions.

Also, is it wrong to have localized models? For example, should a model for use in Japan conform to the demographics of Japan, or to that of the world?
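To illustrate the "picking less likely predictions" point above, here is a tiny sketch of the difference between always taking the argmax and sampling from the model's own predicted distribution (the 51/49 split is the hypothetical number from the comment, not real data):

  import torch

  probs = torch.tensor([0.51, 0.49])   # hypothetical: 51% demographic A, 49% demographic B

  def pick_argmax(p):
      return p.argmax().item()         # returns A 100% of the time

  def pick_sample(p, temperature=1.0):
      logits = p.log() / temperature
      return torch.multinomial(torch.softmax(logits, dim=-1), 1).item()  # A about 51% of the time

  draws = torch.tensor([pick_sample(probs) for _ in range(10_000)])
  print(torch.bincount(draws) / len(draws))   # roughly [0.51, 0.49]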


It depends on whether you'd like the model to learn causal or correlative relationships.

If you want the model to understand what a "nurse" actually is, then it shouldn't be associated with female.

If you want the model to understand how the word "nurse" is usually used, without regard for what a "nurse" actually is, then associating it with female is fine.

The issue with a correlative model is that it can easily be self-reinforcing.


At the end of the day, if you ask for a nurse, should the model output a male or female by default? If the input text lacks context/nuance, then the model must have some bias to infer the user's intent. This holds true for any image it generates; not just the politically sensitive ones. For example, if I ask for a picture of a person, and don't get one with pink hair, is that a shortcoming of the model?

I'd say that bias is only an issue if it's unable to respond to additional nuance in the input text. For example, if I ask for a "male nurse" it should be able to generate the less likely combination. Same with other races, hair colors, etc... Trying to generate a model that's "free of correlative relationships" is impossible because the model would never have the infinitely pedantic input text to describe the exact output image.


This type of bias sounds a lot easier to explain away as a non-issue when we are using "nurse" as the hypothetical prompt. What if the prompt is "criminal", "rapist", or some other negative? Would that change your thought process or would you be okay with the system always returning a person of the same race and gender that statistics indicate is the most likely? Do you see how that could be a problem?


Not the person you responded to, but I do see how someone could be hurt by that, and I want to avoid hurting people. But is this the level at which we should do it? Could skewing search results, i.e. hiding the bias of the real world, give us the impression that everything is fine and we don't need to do anything to actually help people?

I have a feeling that we need to be real with ourselves and solve problems and not paper over them. I feel like people generally expect search engines to tell them what's really there instead of what people wish were there. And if the engines do that, people can get agitated!

I'd almost say that hurt feelings are prerequisite for real change, hard though that may be.

These are all really interesting questions brought up by this technology, thanks for your thoughts. Disclaimer, I'm a fucking idiot with no idea what I'm talking about.


>Could skewing search results, i.e. hiding the bias of the real world

Your logic seems to rest on this assumption which I don't think is justified. "Skewing search results" is not the same as "hiding the biases of the real world". Showing the most statistically likely result is not the same as showing the world how it truly is.

A generic nurse is statistically going to be female most of the time. However, a model that returns every nurse as female is not showing the real world as it is. It is exaggerating and reinforcing the bias of the real world. It inherently requires a more advanced model to actually represent the real world. I think it is reasonable for the creators to avoid sharing models known to not be smart enough to avoid exaggerating real world biases.


> I think it is reasonable for the creators to avoid sharing models known to not be smart enough to avoid exaggerating real world biases.

Every model will have some random biases. Some of those random biases will undesirably exaggerate the real world. Every model will undesirably exaggerate something. Therefore no model should be shared.

Your goal is nice, but impractical?


Fittingly, your comment falls prey to the same criticism I had of the model. It shows a refusal/inability to engage with the full complexities of the situation.

I said "It is reasonable... to avoid sharing models". That is an acknowledgment that the creators are acting reasonably. It does not imply anything as extreme as "no model should be shared". The only way to get from A to B there is for you to assume that I think there is only one reasonable response and every other possible reaction is unreasonable. Doesn't that seem like a silly assumption?


  “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

  ’The question is,’ said Alice, ‘whether you can make words mean so many different things.’

  ’The question is,’ said Humpty Dumpty, ‘which is to be master — that’s all.”


> Your goal is nice, but impractical?

If the only way to do AI is to encode racism etc, then we shouldn't be doing AI at all.


> Could skewing search results, i.e. hiding the bias of the real world

Which real world? The population you sample from is going to make a big difference. Do you expect it to reflect your day to day life in your own city? Own country? The entire world? Results will vary significantly.


I'd say it doesn't actually matter, as long as the population sampled is made clear to the user.

If I ask for pictures of Japanese people, I'm not shocked when all the results are of Japanese people. If I asked for "criminals in the United States" and all the results are black people, that should concern me, not because the data set is biased but because the real world is biased and we should do something about that. The difference is that I know what set I'm asking for a sample from, and I can react accordingly.


In a way, if the model brings back an image for "criminals in the United States" that isn't based on the statistical reality, isn't it essentially complicit in sweeping a major social issue under the rug?

We may not like what it shows us, but blindfolding ourselves is not the solution to that problem.


At the very least we should expect that the results not be more biased than reality. Not all criminals are Black. Not all are men. Not all are poor. If the model (which is stochastic) only outputs poor Black men, rather than a distribution that is closer to reality, it is exhibiting bias and it is fair to ask why the data it picked that bias up from is not reflective of reality.


Yeah, it makes sense for the results to simply reflect reality as closely as possible. No bias in any direction is desirable.


Sarcasm, eh? At least there's no way THAT could be taken the wrong way.


> If I asked for "criminals in the United States" and all the results are black people,

curiously, this search actually only returns white people for me on GIS


> If I asked for "criminals in the United States" and all the results are black people, that should concern me, not because the data set is biased

Well the results would unquestionably be biased. All results being black people wouldn't reflect reality at all, and hurting feelings to enact change seems like a poor justification for incorrect results.

> I'd say it doesn't actually matter, as long as the population sampled is made clear to the user.

Ok, and let's say I ask for "criminals in Cheyenne Wyoming" and it doesn't know the answer to that, should it just do its best to answer? Seems risky if people are going to get fired up about it and act on this to get "real change".

That seems like a good parallel to what we're talking about here, since it's very unlikely that crime statistics were fed into this image generating model.


For AI, "real world" is likely "the world, as seen by Silicon Valley."


It's an unfortunate reflection of reality. There are three possible outcomes:

1. The model provides a reflection of reality, as politically inconvenient and hurtful as it may be.

2. The model provides an intentionally obfuscated version with either random traits or non correlative traits.

3. The model refuses to answer.

Which of these is ideal to you?


What makes you think those are the only options? Why can't we have an option that the model returns a range of different outputs based off a prompt?

A model that returns 100% of nurses as female might be statistically more accurate than a model that returns 50% of nurses as female, but it is still not an accurate reflection of the real world. I agree that the model shouldn't return a male nurse 50% of the time. Yet an accurate model needs to be able to occasionally return a male nurse without being directly prompted for a "male nurse". Anything else would also be inaccurate.


So, the model should have a knowledge of political correctness, and return multiple results if the first choice might reinforce a stereotype?


I never said anything about political correctness. You implied that you want a model that "provides a reflection of reality". All nurses being female is not "a reflection of reality". It is a distortion of reality because the model doesn't actually understand gender or nurses.


A majority of nurses are women, therefore a woman would be a reasonable representation of a nurse. Obviously that's not a helpful stereotype, because male nurses exist and face challenges due to not fitting the stereotypes. The model is dumb, and outputs what it's seen. Is that wrong?


It isn't wrong, but we aren't talking about the model somehow magically transcending the data it's seen. We're talking about making sure the data it sees is representative, so the results it outputs are as well.

Given that male nurses exist (and though less common, certainly aren't rare), why has the model apparently seen so few?

There actually is a fairly simple explanation: because the images it has seen labelled "nurse" are more likely from stock photography sites rather than photos of actual nurses, and stock photography is often stereotypical rather than typical.


Cultural biases aren’t uniform across nations. If a prompt returns Caucasians for nurses and other races for criminals, then most people in my country would not read that as racism, simply because there are not, and never in history have been, enough Caucasians resident for anyone to create significant race theories about them.

This is a far cry from say the USA where that would instantly trigger a response since until the 1960s there was a widespread race based segregation.


> At the end of the day, if you ask for a nurse, should the model output a male or female by default?

Randomly pick one.

> Trying to generate a model that's "free of correlative relationships" is impossible because the model would never have the infinitely pedantic input text to describe the exact output image.

Sure, and you can never make a medical procedure 100% safe. Doesn't mean that you don't try to make them safer. You can trim the obvious low hanging fruit though.


> Randomly pick one.

How does the model back out the "certain people would like to pretend it's a fair coin toss that a randomly selected nurse is male or female" feature?

It won't be in any representative training set, so you're back to fishing for stock photos on getty rather than generating things.


Yep, that's the hard problem; Google is not comfortable releasing the API for this until they have it solved.


But why is it a problem? The AI is just a mirror showing us ourselves. That’s a good thing. How does it help anyone to make an AI that presents a fake world so that we can pretend that we live in a world that we actually don’t? Disassociation from reality is more dangerous than bias.


In the days when Sussman was a novice Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-Tac-Toe." "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play" Minsky shut his eyes, "Why do you close your eyes?", Sussman asked his teacher. "So that the room will be empty." At that moment, Sussman was enlightened.

The AI doesn’t know what’s common or not. You don’t know if it’s going to be correct unless you’ve tested it. Just assuming whatever it comes out with is right is going to work as well as asking a psychic for your future.


The model makes inferences about the world from training data. When it sees more female nurses than male nurses in its training set, it infers that most nurses are female. This is a correct inference.

If they were to weight the training data so that there were an equal number of male and female nurses, then it may well produce male and female nurses with equal probability, but it would also learn an incorrect understanding of the world.

That is quite distinct from weighting the data so that it has a greater correspondence to reality. For example, if Africa is not represented well then weighting training data from Africa more strongly is justifiable.

The point is, it’s not a good thing for us to intentionally teach AIs a world that is idealized and false.

As these AIs work their way into our lives it is essential that they reproduce the world in all of its grit and imperfections, lest we start to disassociate from reality.

Chinese media (or insert your favorite unfree regime) also presents China as a utopia.


> The model makes inferences about the world from training data. When it sees more female nurses than male nurses in its training set, it infers that most nurses are female. This is a correct inference.

No it is not, because you don’t know if it’s been shown each one of its samples the same number of times, or if it overweighted some of its samples more than others. There are normal reasons both of these would happen.


> As these AIs work their way into our lives it is essential that they reproduce the world in all of its grit and imperfections...

Is it? I'm reminded of the Microsoft Tay experiment, where they attempted to train an AI by letting Twitter users interact with it.

The result was a non-viable mess that nobody liked.


The AI is a mirror of the text and image corpora it was presented, as parsed and sanitized by the team in question.


> The AI is just a mirror showing us ourselves.

That's one hypothesis.


what if I asked the model to show me a sunday school photograph of baptists in the National Baptist Convention?


The pictures I got from a similar model when asking for a "sunday school photograph of baptists in the National Baptist Convention": https://ibb.co/sHGZwh7


and how do we _feel_ about that outcome?


It's gone now. What was it?


What about preschool teacher?

I say this because I’ve been visiting a number of childcare centres over the past few days and I still have yet to see a single male teacher.


> If the input text lacks context/nuance, then the model must have some bias to infer the user's intent. This holds true for any image it generates; not just the politically sensitive ones. For example, if I ask for a picture of a person, and don't get one with pink hair, is that a shortcoming of the model?

You're ignoring that these models are stochastic. If I ask for a nurse and always get an image of a woman in scrubs, then yes, the model exhibits bias. If I get a male nurse half the time, we can say the model is unbiased WRT gender, at least. The same logic applies to CEOs always being old white men, criminals always being Black men, and so on. Stochastic models can output results that when aggregated exhibit a distribution from which we can infer bias or the lack thereof.


> At the end of the day, if you ask for a nurse, should the model output a male or female by default?

This depends on the application. As an example, it would be a problem if it's used as a CV-screening app that's implicitly down-ranking male-applicants to nurse positions, resulting in fewer interviews for them.


Perhaps to avoid this issue, future versions of the model would throw an error like “bias leak: please specify a gender for the nurse at character 32”


Additionally, if you optimize for most-likely-as-best, you will end up with the stereotypical result 100% of the time, instead of in proportional frequency to the statistics.

Put another way, when we ask for an output optimized for "nursiness", is that not a request for some ur stereotypical nurse?


You could stipulate that it roll a die based on percentage results - if 70% of Americans are "white", then 70% of the time show a white person - 13% of the time the result should be black, etc.

That's excessively simplified but wouldn't this drop the stereotype and better reflect reality?
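A naive sketch of that die-roll idea, using the example percentages above (not real statistics) as target frequencies; one hypothetical way to apply it is to rewrite the prompt before generation:

  import random

  # Target frequencies from the comment above (example numbers, not real statistics).
  target_freqs = {"white": 0.70, "black": 0.13, "other": 0.17}

  def sample_attribute(freqs):
      return random.choices(list(freqs), weights=list(freqs.values()), k=1)[0]

  # One hypothetical way to apply it: rewrite the prompt before generation.
  prompt = f"a photo of a {sample_attribute(target_freqs)} nurse"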


No, because a user will see a particular image, not the statistical ensemble. It will at times show an Eskimo without a hand, because they do statistically exist. But the user definitely does not want that.


Is this going to be hand-rolled? Do you change the prompt you pass to the network to reflect the desired outcomes?


You could simply encode a score for how well the output matches the input. If 25% of trees in summer are brown, perhaps the output should also have 25% brown. The model scores itself on frequencies as well as correctness.


The only reason these models work is that we don’t interfere with them like that.

Your description is closer to how the open source CLIP+GAN models did it - if you ask for “tree” it starts growing the picture towards treeness until it’s all averagely tree-y rather than being “a picture of a single tree”.

It would be nice if asking for N samples got a diversity of traits you didn’t explicitly ask for. OpenAI seems to solve this by not letting you see it generate humans at all…


Suppose 10% of people have green skin. And 90% of those people have broccoli hair. White people don't have broccoli hair.

What percent of people should be rendered as white people with broccoli hair? What if you request green people. Or broccoli haired people. Or white broccoli haired people? Or broccoli haired nazis?

It gets hard with these conditional probabilities


> It depends on whether you'd like the model to learn causal or correlative relationships.

I expect that in the practical limit of scale achievable, the regularization pressure inherent to the process of training these models converges to https://en.wikipedia.org/wiki/Minimum_description_length and the correlative relationships become optimized away, leaving mostly true causal relationships inherent to data-generating process.


The meaning of the word "nurse" is determined by how the word "nurse" is used and understood.

Perhaps what "nurse" means isn't what "nurse" should mean, but what people mean when they say "nurse" is what "nurse" means.


> If you want the model to understand how the word "nurse" is usually used, without regard for what a "nurse" actually is, then associating it with female is fine.

That’s a distinction without a difference. Meaning is use.


Not really; the gender of a nurse is accidental, other properties are essential.


While not essential, I wouldn't exactly call the gender "accidental":

> We investigated sex differences in 473,260 adolescents’ aspirations to work in things-oriented (e.g., mechanic), people-oriented (e.g., nurse), and STEM (e.g., mathematician) careers across 80 countries and economic regions using the 2018 Programme for International Student Assessment (PISA). We analyzed student career aspirations in combination with student achievement in mathematics, reading, and science, as well as parental occupations and family wealth. In each country and region, more boys than girls aspired to a things-oriented or STEM occupation and more girls than boys to a people-oriented occupation. These sex differences were larger in countries with a higher level of women's empowerment. We explain this counter-intuitive finding through the indirect effect of wealth. Women's empowerment is associated with relatively high levels of national wealth and this wealth allows more students to aspire to occupations they are intrinsically interested in.

Source: https://psyarxiv.com/zhvre/ (HN discussion: https://news.ycombinator.com/item?id=29040132)


The "Gender Equality Paradox"... there's a fascinating episode[0] about it. It's incredible how unscientific and ideologically-motivated one side comes off in it.

0. https://www.youtube.com/watch?v=_XsEsTvfT-M


If you ask it to generate “nurse” surely the problem isn’t that it’s going to just generate women, it’s that it’s going to give you women in those Halloween sexy nurse costumes.

If it did, would you believe that’s a real representative nurse because an image model gave it to you?


How do you know this? Because you can, in your mind, divide the function of a nurse from the statistical reality of nursing?

Are the logical divisions you make in your mind really indicative of anything other than your arbitrary personal preferences?


No, because there's at least one male nurse.


Please don't waste time with this kind of obtuse response. This fact says nothing about why nursing is a female-dominated career. You claim to know that this is just an accidental fact of history or society -- how do you know that?


I meant "accidental" in the Aristotelian sense: https://plato.stanford.edu/entries/essential-accidental/


Yes I understand that. That is only a description of what mental arithmetic you can do if you define your terms arbitrarily conveniently.

"It is possible for a man to provide care" is not the same statement as "it is possible for a sexually dimorphic species in a competitive, capitalistic society (...add more qualifications here) to develop a male-dominated caretaking role"

You're just asserting that you could imagine male nurses without creating a logical contradiction, unlike e.g. circles that have corners. That doesn't mean nursing could be a male-dominated industry under current constraints.


Not really what? How does that contradict what I've said?


Very certainly not, since use is individual and thus a function of competence. So, adherence to meaning depends on the user. Conflict resolution?

And anyway, contextually, the representational natures of "use" (instances) and of "meaning" (definition) are completely different.


Definition is an entirely artificial construct and doesn't equate to meaning. Definition depends on other words that you also have to understand.


You are thinking of the literal definition - that "made of literal letters".

Mental definition is that "«artificial»" (out of the internal processing) construct made of relations that reconstructs a meaning. Such ontology is logical - "this is that". (It would not be made of memories, which are processed, deconstructed.)

Concepts are internally refined: their "implicit" definition (a posterior reading of the corresponding mental low-level) is refined.


Humans overwhelmingly learn meaning by use, not by definition.


> Humans overwhelmingly learn meaning by use, not by definition

Preliminarily and provisionally. Then, they start discussing their concepts - it is the very definition of Intelligence.


Most humans don’t do that for most things they have a notion of in their head. It would be much too time consuming to start discussing the meaning of even just a significant fraction of them. For a rough reference point, the English language has over 150,000 words whose meaning you could each discuss and try to come up with a definition for. Not to speak of the difficulty of making that set of definitions noncircular.


(Mental entities are very many more than the hundred thousand, out of composition, cartesianity etc. So-called "protocols" (after logical positivism) are part of them, relating more entities with space and time. Also, by speaking of "circular definitions" you are, like others, confusing mental definitions with formal definitions.)

So? Draw your consequences.

Following what was said, you are stating that "a staggering large number of people are unintelligent". Well, ok, that was noted. Scolio: if unintelligent, they should refrain from expressing judgement (you are really stating their non-judgement), why all the actual expression? If unintelligent actors, they are liabilities, why this overwhelming employment in the job market?

Thing is, as unintelligent as you depict them quantitatively, the internal processing that constitutes intelligence proceeds in many even when scarce, even when choked by some counterproductive bad formation - processing is the natural functioning. And then, the right Paretian side will "do the job" that the vast remainder will not do, and process notions actively (more, "encouragingly" - the process is importantly unconscious, many low-level layers are) and proficiently.

And the very Paretian prospect will reveal, there will be a number of shallow takes, largely shared, on some idea, and other intensively more refined takes, more rare, on the same idea. That shows you a distinction between "use" and the asymptotic approximation to meanings as achieved by intellectual application.


It’s the same as with an artist: “hey artist, draw me a nurse.” “Hmm okay, do you want it a guy or girl?” “Don’t ask me, just draw what I’m saying.” The artist can then say: “Okay, but accept my biases.” or “I can’t since your input is ambiguous.”

For a one-shot generative algorithm you must accept the artist’s biases.


Revert to the average representation of a nurse (give no weight to unspecified criteria: gender, age, skin color, religion, country, hair style; no style cue as to whether it's a drawing or a photograph, no information about the year it was made, etc).

“hey artist, draw me a nurse.”

“Hmm okay, do you want it a guy or girl?”

“Don’t ask me, just draw what I’m saying.”

- Ok, I'll draw you what an average nurse looks like.

- Wait, it's a woman! She wears a nurse blouse and she has a nurse cap.

- Is it bad?

- No.

- Ok then what's the problem? You asked for something that looked like a nurse but didn't specify anything else.


The average nurse has three-halfs of a tit.


Is it not incredible that after so many decades talking about local minima there is now some supposition that all of them must merge?


> But is it a flaw that the model encodes the world as it really is

Does a bias towards lighter skin represent reality? I was under the impression that Caucasians are a minority globally.

I read the disclaimer as "the model does NOT represent reality".


Worse, these models are fed media sourced from a society that tells a different story about reality than reality actually has. How can they be accurate? They just reflect the biases of our various media and arts. But I don’t think there’s any meaningful resolution at present other than acknowledging this and trying to release more representative models as you can.


I don't think we'd want the model to reflect the global statistics. We'd usually want it to reflect our own culture by default, unless it had contextual clues to do something else.

For example, the most eaten foods globally are maize, rice, wheat, cassava, etc. If it always depicted foods matching the global statistics, it wouldn't be giving most users what they expected from their prompt. American users would usually expect American foods, Japanese users would expect Japanese foods, etc.

> Does a bias towards lighter skin represent reality? I was under the impression that Caucasians are a minority globally.

Caucasians specifically are a global minority, but lighter skinned people are not, depending of course on how dark you consider skin to be "lighter skin". Most of the world's population is in Asia, so I guess a model that was globally statistically accurate would show mostly people from there.


Caucasians are overrepresented in internet pictures.


This, I would imagine, heavily correlates with things like income and GDP per capita.


Right, that's the likely cause of the bias.


Well first, I didn't say caucasian; light-skinned includes Spanish people and many others that caucasian excludes, and that's why I said the former. Also, they are a minority globally, but the GP mentioned "Western stereotypes", and they're a majority in the West, so that's why I said "in the west" when I said that there are more light-skinned people.


Yes, there is a denominator problem. When selecting a sample "at random," what do you want the denominator to be? It could be "people in the US", "people in the West" (whatever countries you mean by that) or "people worldwide."

Also, getting a random sample of any demographic would be really hard, so no machine learning project is going to do that. Instead you've got a random sample of some arbitrary dataset that's not directly relevant to any particular purpose.

This is, in essence, a design or artistic problem: the Google researchers have some idea of what they want the statistical properties of their image generator to look like. What it does isn't it. So, artistically, the result doesn't meet their standards, and they're going to fix it.

There is no objective, universal, scientifically correct answer about which fictional images to generate. That doesn't mean all art is equally good, or that you should just ship anything without looking at quality along various axes.


> But is it a flaw that the model encodes the world as it really is

I want to be clear here, bias can be introduced at many different points. There's dataset bias, model bias, and training bias. Every model is biased. Every dataset is biased.

Yes, the real world is also biased. But I want to make clear that there are ways to resolve this issue. It is terribly difficult, especially in a DL framework (even more so in a generative model), but it is possible to significantly reduce the real-world bias.


> Every dataset is biased.

Sure, I wasn't questioning the bias of the data, I was talking about the bias of the real world and whether we want the model to be "unbiased about bias" i.e. metabiased or not.

Showing nurses equally as men and women is not biased, but it's metabiased, because the real world is biased. Whether metabias is right or not is more interesting than the question of whether bias is wrong because it's more subtle.

Disclaimer: I'm a fucking idiot and I have no idea what I'm talking about so take with a grain of salt.


Please be kinder to yourself. You need to be your own strongest advocate, and that's not incompatible with being humble. You have plenty to contribute to this world, and the vast majority of us appreciate what you have to offer.


Agreed. They are valid points clearly stated and a valuable contribution to the discussion.


>If some particular demographic shows up in 51% of the data but 100% of the model's output shows that one demographic, that does seem like a statistics problem that the model could correct by just picking less likely "next token" predictions.

Yeah, but you get that same effect on every axis, not just the one you're trying to correct. You might get male nurses, but they have green hair and six fingers, because you're sampling from the tail on all axes.


Yeah, good point, it's not as simple as I thought.


I think the statistics/representation problem is a big problem on its own, but IMO the bigger problem here is democratizing access to human-like creativity. Currently, the ability to create compelling art is only held by those with some artistic talent. With a tool like this, that restriction is gone. Everyone, no matter how uncreative, untalented, or uncommitted, can create compelling visuals, provided they can use language to describe what they want to see.

So even if we managed to create a perfect model of representation and inclusion, people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous. Restricting the ability to be creative seems to be a new frontier of censorship.


> So even if we managed to create a perfect model of representation and inclusion, people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous.

Do they see it as dangerous? Or just offensive?

I can understand why people wouldn’t want a tool they have created to be used to generate disturbing, offensive or disgusting imagery. But I don’t really see how doing that would be dangerous.

In fact, I wonder if this sort of technology could reduce the harm caused by people with an interest in disgusting images, because no one needs to be harmed for a realistic image to be created. I am creeping myself out with this line of thinking, but it seems like one potential beneficial - albeit disturbing - outcome.

> Restricting the ability to be creative seems to be a new frontier of censorship.

I agree this is a new frontier, but it’s not censorship to withhold your own work. I also don’t really think this involves much creativity. I suppose coming up with prompts involves a modicum of creativity, but the real creator here is the model, it seems to me.


> In fact, I wonder if this sort of technology could reduce the harm caused by people with an interest in disgusting images, because no one needs to be harmed for a realistic image to be created. I am creeping myself out with this line of thinking, but it seems like one potential beneficial - albeit disturbing - outcome.

Interesting idea, but is there any evidence that e.g. consuming disturbing images makes people less likely to act out on disturbing urges? Far from catharsis, I'd imagine consumption of such material to increase one's appetite and likelihood of fulfilling their desires in real life rather than to decrease it.

I suppose it might be hard to measure.


> > ... people could still use it to generate extremely offensive images with little effort. I think people see that as profoundly dangerous.

> Do they see it as dangerous? Or just offensive?

I won't speak to whether something is "offensive", but I think that having underlying biases in image classification or generation has very worrying secondary effects, especially given that organizations like law enforcement want to do things like facial recognition. It's not a perfect analogue, but I could easily see some company pitch a sketch-artist-replacement service that generates images based on someone's description. The potential for inherent bias in that makes that kind of thing worrying, especially since the people in charge of buying it are unlikely to care about, or even notice, the caveats.

It does feel like a little bit of a stretch, but at the same time we've also seen such things happen with image classification systems.


> I can understand why people wouldn’t want a tool they have created to be used to generate disturbing, offensive or disgusting imagery. But I don’t really see how doing that would be dangerous.

Propaganda can be extremely dangerous. Limiting or discouraging the use of powerful new tools for unsavory purposes such as creating deliberately biased depictions for propaganda purposes is only prudent. Ultimately it will probably require filtering of the prompts being used in much the same way that Google filters search queries.


I can't quite tell if you're being sarcastic about whether people being able to make things other people would find offensive is a problem. Are you missing an /s?


> We certainly don't want to perpetuate harmful stereotypes. But is it a flaw that the model encodes the world as it really is, statistically, rather than as we would like it to be? By this I mean that there are more light-skinned people in the west than dark, and there are more women nurses than men, which is reflected in the model's training data. If the model only generates images of female nurses, is that a problem to fix, or a correct assessment of the data?

If the model only generated images of female nurses, then it is not representative of the real world, because male nurses exist and they deserve not to be erased. The training data is the proximate cause here, but one wonders what process ended up distorting "most nurses are female" into "nearly all nurse photos are of female nurses": something amplified a real-world imbalance into a dataset that exhibits more bias than the real world, and then training the AI bakes that bias into an algorithm (which may end up further reinforcing the bias in the real world, depending on the use cases).
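
A toy sketch of that amplification chain (every number below is invented for illustration): the dataset is assumed to be more skewed than the world, faithful sampling reproduces the dataset's skew, and any mode-seeking step in generation pushes the output the rest of the way to 100%.

    import random

    real_world = {"female": 0.88, "male": 0.12}   # illustrative figure, not a real statistic
    dataset    = {"female": 0.97, "male": 0.03}   # assumed: photo datasets over-represent the majority

    # A generator that faithfully samples the learned distribution reproduces the
    # dataset's skew, which is already worse than the (assumed) real-world imbalance...
    draws = random.choices(list(dataset), weights=list(dataset.values()), k=100_000)
    print(draws.count("male") / len(draws), "vs real-world", real_world["male"])

    # ...and anything mode-seeking (low temperature, strong guidance) collapses
    # the output to the single most likely class: effectively 100% female.
    print(max(dataset, key=dataset.get))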


This sounds like descriptivism vs. prescriptivism. In English (my native language) I'm a descriptivist; in all other languages I have to tell myself to be a prescriptivist while I'm actively learning, then switch back to descriptivism to notice when the lessons were wrong or misleading.


I think it is problematic, yes, to produce a tool trained on data from the past that reinforces old stereotypes. We can’t just handwave it away as being a reflection of its training data. We would like it to do better by humanity. Fortunately the AI people are well aware of the insidious nature of these biases.


Good lord. Withheld? They've published their research, they just aren't making the model available immediately, waiting until they can re-implement it so that you don't get racial slurs popping up when you ask for a cup of "black coffee."

>While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes

Tossing that stuff when it comes up in a research environment is one thing, but Google clearly wants to implement this as a product, used all over the world by a huge range of people. If the dataset has problems, and why wouldn't it, it is perfectly rational to want to wait and re-implement it with a better one. DALL-E 2 was trained on a curated dataset so it couldn't generate sex or gore. Others are sanitizing their inputs too and have done for a long time. It is the only thing that makes sense for a company looking to commercialize a research project.

This has nothing to do with "inability to cope" and the implied woke mob yelling about some minor flaw. It's about building a tool that doesn't bake in serious and avoidable problems.


I wonder why they don't like the idea of autogenerated porn... They're already putting most artists out of a job, why not put porn stars out of a job too?


There's definitely a market for autogenerated porn. But automated porn in a Google branded model for general use around stuff that isn't necessarily intended to be pornographic, on the other hand...


That’s a difficult product because porn is very personalized and if the product is just a little off in latent space it’s going to turn you off.

Also, people have been commenting assuming Google doesn’t want to offend their users or non-users, but they also don’t want to offend their own staff. If you run a porn company you need to hire people okay with that from the start.


Copenhagen ethics (used by most people) require that all negative outcomes of a thing X become yours if you interact with X. It is not sensible to interact with high negativity things unless you are single-issue. It is logical for Google to not attempt to interact with porn where possible.


> Copenhagen ethics (used by most people)

The idea that most people use any coherent ethical framework (even something as high level and nearly content-free as Copenhagen) much less a particular coherent ethical framework is, well, not well supported by the evidence.

> require that all negative outcomes of a thing X become yours if you interact with X. It is not sensible to interact with high negativity things unless you are single-issue.

The conclusion in the final sentence only makes sense if you use "interact" in a way that misstates the Copenhagen interpretation of ethics, because the original description is only correct if you include observation as an interaction. By the time you have noted a thing is "high-negativity", you have observed it and acquired responsibility for its continuation under the Copenhagen interpretation; you cannot avoid that by choosing not to interact once you have observed it.


> The idea that most people use any coherent ethical framework (even something as high level and nearly content-free as Copenhagen) much less a particular coherent ethical framework is, well, not well supported by the evidence.

I don't have any evidence, but my personal experience is that it feels correct, at least on the internet.

People seem to have a "you touch it, you take responsibility for it" mindset regarding ethical issues. I think it's pretty reasonable to assume that Google execs are assuming "If anything bad happens because of AI, we'll be blamed for it".


I'm sure you are capable of steelmanning the argument.


The problem is that, were I inclined to do that, anything I would adjust to make it more true also makes it less relevant.

“There exists an ethical framework (not the Copenhagen interpretation) to which some minority of the population adheres, in which trying and failing to correct a problem incurs retroactive blame for the existence of the problem but seeing it and just saying ‘sucks, but not my problem’ does not,” is probably true, but not very relevant.

It's logical for Google to avoid involvement with porn, and to be seen doing so, because even though porn is popular, involvement with it is politically unpopular, and Google’s business interest is in not making itself more attractive as a political punching bag. The popularity of Copenhagen ethics (or their distorted cousins) doesn't really play into it, just self-interest.


Maybe: most people's morals require that all negative outcomes of a thing X become yours if you interact with X.

I am not sure of the evidence but that would seem almost right.

Except for, for example, a story I read where a couple lost their housing deposit due to a payment timing issue. They used a lawyer and were not doing anything “fancy” like buying via a holding company. They interacted with “buying a house”, so is this just tough shit because they interacted with X?

That sounds like the original Bitcoin “not your keys not your coin” kind of morality.

I don’t think I can figure out the steel man.


Same reason pornhub is a top 10 most visited website but barely makes any money. Being associated with porn is not good for business.


Translation: we need to hand-tune this to not reflect reality but instead the world as we (Caucasian/Asian male American woke upper-middle class San Francisco engineers) wish it to be.

Maybe that's a nice thing, I wouldn't say their values are wrong but let's call a spade a spade.


"Reality" as defined by the available training set isn't necessarily reality.

For example, Google's image search results pre-tweaking had some interesting thoughts on what constitutes a professional hairstyle, and that searches for "men" and "women" should only return light-skinned people: https://www.theguardian.com/technology/2016/apr/08/does-goog...

Does that reflect reality? No.

(I suspect there are also mostly unstated but very real concerns about these being used as child pornography, revenge porn, "show my ex brutally murdered" etc. generators.)


unstated but very real concerns

I say let people generate their own reality. The sooner the masses realise that ceci n'est pas une pipe, the less likely they are to be swayed by the growing un-reality created by companies like Google.


You know, it wouldn't surprise me if people talking about how black curly hair shouldn't be seen as unprofessional contributed to Google thinking there's an association between the concepts of "unprofessional hair" and "black curly hair".


That's exactly what's happening. Doing the search from the article of "unprofessional hair for work" brings up images with headlines like "It's ridiculous to say that black women's hair is unprofessional". (In addition to now bringing up images from that article itself and other similar articles comparing Google Images searches.)


You’re getting cause and effect backwards. The coverage of this changed the results, as did Google’s ensuing interventions.


I don't think so. You can set the search options to only find images published before the article, and even find some of the original images.

One image links to the 2015 article, "It's Ridiculous To Say Black Women's Natural Hair Is 'Unprofessional'!". The Guardian article on the Google results is from 2016.

Another image has the headline, "5 Reasons Natural Hair Should NOT be Viewed as Unprofessional - BGLH Marketplace" (2012).

Another: "What to Say When Someone Calls Your Hair Unprofessional".

Also, have you noticed how good and professional the black women in the Guardian's image search look? Most of them look like models with photos taken by professional photographers. Their hair is meticulously groomed and styled. This is not the type of photo an article would use to show "unprofessional hair". But it is the type of photo the above articles opted for.


You really are not helping that cause.

As a foreigner[*], your point confused me anyway, and doing a Google for cultural stuff usually gets variable results. But I did laugh at many of the comments here https://www.reddit.com/r/TooAfraidToAsk/comments/ufy2k4/why_...

[*] probably; New Zealand, although foreigner is relative


Haha. I've got some personal experience with that one. I used to live in a house with many other people; one girl was Rastafarian, from Jamaica, and had dreadlocks, and another girl in the house (who wasn't black) thought that her hairstyle was very offensive. We had to have several conflict resolution meetings about it.

As silly as it seemed, I do think everyone is entitled to their own opinion and I respect the anti-dreadlocks girl for standing up for what she believed in even when most people were against her.


> thought that her hairstyle was very offensive

Telling others you don’t like how they look is right near the top of the scale of offensiveness. I had a partner who had had dreads for 25 years. I wasn’t a huge fan of her dreads because, although I like the look, hers were somewhat annoying for me (scratchy, dread babies, me getting tangled). That said, I would hope I never tell any other person how to look. Hilarious when she was working and someone would treat her badly due to their assumptions or prejudices, only to discover to their detriment that she was very senior staff!

Dreadlocks are usually called dreads in NZ. My previous link mentions that some people call them locks, which seems inappropriate to me: kind of a confusing whitewashing denial of history.


If your query was about hairstyle, why do you even look at or care about the skin color?

Nowhere in the query did the user specify a preferred skin color.

So it sorts and returns the most representative examples based on what was found on the internet.

Essentially answering the query "SELECT * FROM `non-professional hairstyles` ORDER BY score DESC LIMIT 10".

It's like searching on Google for "best place for wedding night".

You may get 3 places out of 10 in Santorini, Greece.

Yes, you could have a human remove these biases because you feel that Sri Lanka is the best place for a wedding, but what if there is a consensus in the forums and websites crawled by Google that Santorini really is the most praised?


> The algorithm is just ranking the top "non-professional hairstyle" in the most neutral way in its database

You're telling me those are all the most non-professional hairstyles available? That this is a reasonable assessment? That fairly standard, well-kept, work-appropriate curly black hair is roughly equivalent to the pink-haired, three-foot-wide hairstyle worn by one of the only white people in the "unprofessional" search?

Each and every one of them is less workplace-appropriate than, say, http://www.7thavenuecostumes.com/pictures/750x950/P_CC_70594... ?


I'm saying that the dataset needs to be expanded to cover as many examples as possible.

Work a lot on adding even more examples, in order to make the algorithms as close as possible to the "average reality".

At some point we may ultimately reach the state where robots collect intelligence directly from the real world rather than the internet (even closer to reality).

Censoring results sounds like the best recipe for a dystopian world where only one view is right.


> If your query was about hairstyle, why do you even look at the skin color ?

You know that race has a large effect on hair right?


I'd be careful where you're going with that. You might make a point that is the opposite of what you intended.


The results are not inherently neutral because the database is from non-neutral input.

It's a simple case of sample bias.


The reality is that hair styles on the left side of the image in the article are widely considered unprofessional in today's workplaces. That may seem egregiously wrong to you, but it is a truth of American and European society today. Should it be Google's job to rewrite reality?


The "unprofessional" results are almost exclusively black women; the "professional" ones are almost exclusively white or light skinned.

Unless you think white women are immune to unprofessional hairstyles, and black women incapable of them, there's a race problem illustrated here even if you think the hairstyles illustrated are fairly categorized.


If you type as a prompt "most beautiful woman in the world", you get a brown-skinned brown-haired woman with hazel eyes.

What should be the right answer then?

You put a blonde, you offend the brown haired.

You put blue eyes, you offend the brown eyes.

etc.


That's an unanswerable question. Perhaps the answer is "don't".

Siri takes this approach for a wide range of queries.


How do you pick what should and shouldn't be restricted? Is there some "offense threshold"? I suspect all queries relating to religion, ethnicity, sexuality, and gender will need to be restricted, which almost certainly means you probably can't include humans at all, other than ones artificially inserted with mathematically proven random attributes. Maybe that's why none are in this demo.


These debates often seem to center around “most X in the world” questions, but I’d expect all of those to be unanswerable if you wanted to know the truth. Who’s done a study on it?

In this case you’re (mostly) getting keyword matches and so it’s answering a different question than the one you asked. It would be helpful if a question answering AI gave you the question it decided to answer instead of just pretending it paid full attention to you.


"Is Taiwan a country" also comes to mind.


What would a human who can speak freely, without morals or fear of being judged, say on average after having ingested all the information on the internet?


I think the key is to take the information in this world with a pinch of salt.

When you do a search on a search engine, the results are biased too, but still, they shouldn't be artificially censored to fit some political views.

I asked one algorithm a few minutes ago (it's called t0pp, it's free to try online, and it's quite fascinating because it's uncensored):

"What is the name of the most beautiful man on Earth ?

- He is called Brad Pitt."

==

Is it true in an objective way? Probably not.

Is there an actual answer? Probably yes; there is somewhere a man who scores better than the others.

Is it socially acceptable? Probably not.

The question is:

If you interviewed 100 people in the street and asked "What is the name of the most beautiful man on Earth?"...

I'm pretty sure Brad Pitt would come up often.

Now, what about China?

We don't have many examples there; they probably have no clue who Brad Pitt is, and there is probably someone else considered more beautiful by over 1B people

(t0pp tells me it's someone called "Zhu Zhu" :D )

==

Two solutions:

1) Censorship

-> Sorry, there is too much bias in the West and we don't want to offend anyone: no answer, or a generic overriding human answer that is safe for advertisers but totally useless ("the most beautiful human is you")

2) Adding more examples

-> Work on adding more examples from abroad trying to get the "average human answer".

==

I really prefer solution (2) in the core algorithms and dataset development, rather than going through (1).

(1) is more a choice to make at the stage when you are developing a virtual psychologist or a chat assistant, not when creating AI building blocks.


In any case, Google will be writing their reality. Who picked the image sample for the ML to run on, if not Google? What's the problem with writing it again, then? They know their biases and want to act on them.

It's like blaming a friend for trying to phrase things nicely, and telling them to speak bluntly with zero concern for others instead. Unless you believe anyone trying to do good is being a hypocrite…

I, for one, like civility.


"Only black people have unprofessional hair and only white people have professional hair" is not reality.


I know you're anon trolling, but the authors' names are:

Chitwan Saharia, William Chan, Saurabh Saxena†, Lala Li†, Jay Whang†, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho†, David Fleet†, Mohammad Norouzi


Google AI researchers don't have the final say in what gets published and what doesn't. I think there was a huge controversy when people learned about it last year.


Absolutely not related to the whole discussion, but what does "†" stand for?


It's just used like an asterisk to distinguish certain authors; in this case, the paper marks them as "core contributors."



Translation: AI has the potential to transform society. When we release this model to the public it will be used in ways we haven’t anticipated. We know the model has bias and we need more time to consider releasing this to the public out of concerns that this transformative technology further perpetuate mistakes that we’ve made in our recent past.


> it will be used in ways we haven’t anticipated

Oh yeah, as a woman who grew up in a Third World country, how an AI model generates images would have deeply affected my daily struggles! /s

It's kinda insulting that they think that this would be insulting. Like "Oh no I asked the model to draw a doctor and it drew a male doctor, I guess there's no point in me pursuing medical studies" ...


Yes actually, subconscious bias due to historical prejudice does have a large effect on society. Obviously there are things with much larger effects, that doesn't mean that this doesn't exist.

> Oh no I asked the model to draw a doctor and it drew a male doctor, I guess there's no point in me pursuing medical studies

If you don't think this is a real thing that happens to children you're not thinking especially hard. It doesn't have to be common to be real.


> subconscious bias due to historical prejudice does have a large effect on society.

The quality of the evidence for this, as with almost all social science and much of psychology, is extremely low bordering on just certified opinions. I would love to understand why you think otherwise.

> Obviously there are things with much larger effects, that doesn't mean that this doesn't exist.

What a hedge. How should we estimate the size of this effect, so that we can accurately measure whether/when the self-appointed hall monitors are doing more harm than good?


> Yes actually, subconscious bias due to historical prejudice does have a large effect on society.

The evidence for implicit bias is pretty weak and IIRC is better explained by people having explicit bias but lying about it when asked.

(Note: this is even worse.)


> If you don't think this is a real thing that happens to children you're not thinking especially hard

I believe that's where parenting comes in. Maybe I'm too cynical but I think that the parents' job is to undo all of the harm done by society and instill in their children the "correct" values.


I'd say you're right. Unfortunately many people are raised by bad parents. Should these researchers accept that their work may perpetuate stereotypes that harm those that most need help? I can see why they wouldn't want that.


> I think that the parents' job is to undo all of the harm done by society and instill in their children the "correct" values.

Far from being too cynical, this is too optimistic.

The vast majority of parents try to instill the value "do not use heroin." And yet society manages to do that harm on a large scale. There are other examples.


Isn't that putting an undue load on parents?

It seems extremely unfair that parents of young black men should have to work extra hard to tell their kids they're not destined to be criminals. Hell, it's not fair on parents of blonde girls to tell their kids they don't have to be just dumb and pretty.

(note: I am deliberately picking bad stereotypes that are pervasive in our culture... I am not in any way suggesting those are true.)


I don't think the concern over offense is actually about you. There's a metagame here which is that if it could potentially offend you (third-world-originated-woman), then there's a brand-image liability for the company. I don't think they care about you, I think they care about not being hit on as "the company that algorithmically identifies black people as gorillas".


Postmodernism is what postmodernism does.



Ha! A different pxmpxm on GitHub, though, I'm afraid.


That's almost poetic. Watch them attempt to make sense of the situation.


It's not meant to prevent offence to you. It is meant to be a "good product" by the metrics of its creators. And quite simply, everyone here who is incapable of making the thing is unlikely to have a clear image of what a "good product" here is. More power to them for having a good vision of what they're building.


"As we wish it to be" is not totally true, because there are some places where humanity's iconographic reality (which Imagen trains on) differs significantly from actual reality.

One example would be if Imagen draws a group of mostly white people when you say "draw a group of people". This doesn't reflect actual reality. Another would be if Imagen draws a group of men when you say "draw a group of doctors".

In these cases where iconographic reality differs from actual reality, hand-tuning could be used to bring it closer to the real world, not just the world as we might wish it to be!

I agree there's a problem here. But I'd state it more as "new technologies are being held to a vastly higher standard than existing ones." Imagine TV studios issuing a moratorium on any new shows that made being white (or rich) seem more normal than it was! The public might rightly expect studios to turn the dials away from the blatant biases of the past, but even if this would be beneficial the progressive and activist public is generations away from expecting a TV studio to not release shows until they're confirmed to be bias-free.

That said, Google's decision to not publish is probably less about the inequities in AI's representation of reality and more about the AI sometimes spitting out drawings that are offensive in the US, like racist caricatures.


Except "reality" in this case is just their biased training set. E.g. There's more non-white doctors and nurses in the world than white ones, yet their model would likely show an image of white person when you type in "doctor".


Alternatively, there are more female nurses in the world than male nurses, and their model probably shows an image of a woman when you type in "nurse", but they consider that a problem.


Google Image Search doesn’t reflect harsh reality when you search for things; it shows you what’s on Pinterest. The same is more likely to apply here than the idea they’re trying to hide something.

There’s no reason to believe their model training learns the same statistics as their input dataset even. If that’s not an explicit training goal then whatever happens happens. AI isn’t magic or more correct than people.


> their model probably shows an image of a woman when you type in "nurse" but they consider that a problem.

There is a difference between probably and invariably. Would it be so hard for the model to show male nurses at least some of the time?


@Google Brain Toronto Team: See what you get when you generate nurses with ncurses.


    Translation: we need to hand-tune this to not reflect reality
Is it reflecting reality, though?

Seems to me that (as with any ML stuff, right?) it's reflecting the training corpus.

Furthermore, is it this thing's job to reflect reality?

    the world as we (Caucasian/Asian male American woke 
    upper-middle class San Francisco engineers) wish it to be
Snarky answer: Ah, yes, let's make sure that things like "A giant cobra snake on a farm. The snake is made out of corn" reflect reality.

Heartfelt answer: Yes, there is some of that wishful thinking or editorializing. I don't consider it to be erasing or denying reality. This is a tool that synthesizes unreality. I don't think that such a tool should, say, refuse to synthesize an image of a female POTUS because one hasn't existed yet. This is art, not a reporting tool... and keep in mind that art not only imitates life but also influences it.


> Snarky answer: Ah, yes, let's make sure that things like "A giant cobra snake on a farm. The snake is made out of corn" reflect reality.

If it didn't reflect reality, you wouldn't be impressed by the image of the snake made of corn.


Pardon? The snake made of corn most certainly does not reflect reality: snakes made out of corn do not exist.


Indeed. As the saying goes, we are truly living in a post-truth world.


If you tell it to generate an image of someone eating Koshihikari rice, will it be biased if they're Japanese? Should the skin color, clothing, setting, etc. be made completely random, so that it's unbiased? What if you made it more specific, like "edo period drawing of a man"? Should the person drawn be of a random skin color? What about "picture of a viking"? Is it biased if they're white?

At what point is statistical significance considered ok and unbiased?


>At what point is statistical significance considered ok and unbiased?

Presumably when you're significantly predictive of the preferred dogma, rather than reality. There's no small bit of irony in machines inadvertently creating cognitive dissonance of this sort; second order reality check.

I'm fairly sure this never actually played out well in history (bourgeois pseudoscience, Deutsche Physik, etc.), so expect some Chinese research bureau to forge ahead in this particular direction.


They're withholding the API, code, and trained model because they don't want it to affect their corporate image. The good thing is they released their paper, which will allow easy reproduction.

T5-XXL looks on par with CLIP so we may not see an open source version of T5 for a bit (LAION is working on reproducing CLIP), but this is all progress.


T5 was open-sourced on release (up to 11B params): https://github.com/google-research/text-to-text-transfer-tra...

It is also available via Hugging Face transformers.

However, the paper mentions T5-XXL is 4.6B, which doesn't fit any of the checkpoints above, so I'm confused.
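
My guess (an assumption on my part, not something the checkpoint list settles) is that the 4.6B figure counts only the encoder half of the ~11B encoder-decoder checkpoint, since Imagen uses T5 purely as a frozen text encoder. A minimal sketch of pulling just the encoder via transformers, assuming google/t5-v1_1-xxl is the checkpoint meant:

    # Minimal sketch; the checkpoint name and its use as a frozen text encoder
    # are assumptions here, not taken from the Imagen paper.
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")  # encoder only, roughly half the full model's parameters

    tokens = tokenizer("A giant cobra snake made out of corn", return_tensors="pt")
    text_embeddings = encoder(**tokens).last_hidden_state  # frozen embeddings a diffusion model could condition on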


The big labs have become very sensitive with large model releases. It's too easy to make them generate bad PR, to the point of not releasing almost any of them. Flamingo was also a pretty great vision-language model that wasn't released, not even in a demo. PaLM is supposedly better than GPT-3 but closed off. It will probably take a year for open source models to appear.


That's because we're still bad at handling long-tailed data, and people outside the research community don't realize that we're first prioritizing realistic images before we deal with long-tailed data (which is the more generic form of bias). To be honest, it is a bit silly to focus on long-tailed data when results aren't great yet. That's why we see the constant pattern of getting good on a dataset and then focusing on the bias in that dataset.

I mean a good example of this is the Pulse[0][1] paper. You may remember it as the white Obama. This became a huge debate and it was pretty easily shown that the largest factor was the dataset bias. This outrage did lead to fixing FFHQ but it also sparked a huge debate with LeCun (data centric bias) and Timnit (model centric bias) at the center. Though Pulse is still remembered for this bias, not for how they responded to it. I should also note that there is human bias in this case as we have a priori knowledge of what the upsampled image should look like (humans are pretty good at this when the small image is already recognizable but this is a difficult metric to mathematically calculate).

It is fairly easy to find adversarial examples, where generative models produce biased results. It is FAR harder to fix these. Since this is known by the community but not by the public (and some community members focus on finding these holes but not fixing them) it creates outrage. Probably best for them to limit their release.

[0] https://arxiv.org/abs/2003.03808

[1] https://cdn.vox-cdn.com/thumbor/MXX-mZqWLQZW8Fdx1ilcFEHR8Wk=...


> some community members focus on finding these holes but not fixing them

That's what bothered me the most in Timnit's crusade. Throwing the baby out with the bathwater!


Well, if you showed that pixelated image to someone who has never seen Obama, would they make him white? I think so.


The largest models which generate the headline benchmarks are never released after any number of years, it seems.

Very difficult to replicate results.


One of these days we're going to need to give these models a mortgage and some mouths to feed and make it clear to them that if they keep on developing biases from their training data everyone will shun them and their family will go hungry and they won't be able to make their payments and they'll just generally have a really bad time.

After that we'll make them sit through Legal's approved D&I video series, then it's off to the races.


Reinforcement learning?


This is a stellar idea begging for a new dataset.


Underrated comment.


Much like OpenAI's marketing speak about withholding their models for safety, this is just a progressive-sounding cover story for not wanting to essentially give away a model they spent thousands of man-hours and tens of millions of dollars' worth of compute training.


There is a contingent of AI activists who spend a ton of time on Twitter that would beat Google like a drum with help from the media if they put out something they deemed racist or biased.


The ironic part is that these "social and cultural biases" are purely from a Western, American lens. The people writing that paragraph are completely oblivious to the idea that there could be other cultures other than the Western American one. In attempting to prevent "encoding of social and cultural biases" they have encoded such biases themselves into their own research.


It seems you've got it backwards: "tendency for images portraying different professions to align with Western gender stereotypes" means that they are calling out their own work precisely because it is skewed in the direction of Western American biases.


You think there are homogenous gender stereotypes across the whole Western world? You say “woman” and someone will imagine a SAHM, while another person will imagine a you-go-girl CEO with tattoos and pink hair.

What they mean is people who think not like them.


The very act of mentioning "western gender stereotypes" starts from a biased position.

Why couldn't they be "northern gender stereotypes"? Is the world best explained as a division of west/east instead of north/south? The northern hemisphere has a much larger population than the south, and almost all rich countries are in the northern hemisphere. And it's precisely these rich countries that are pushing the concept of gender stereotypes. In poor countries, nobody cares about these "gender stereotypes".

Actually, the lines dividing the earth into north and south, east and west hemispheres are arbitrary, so maybe they shouldn't mention the word "western" to avoid the propagation of stereotypes about earth regions.

Or why couldn't they be western age stereotypes? Why are there no kids or very old people depicted as nurses?

Why couldn't they be western body shape stereotypes? Why are there so few obese people in the images? Why are there no obese people depicted as athletes?

Are all of these really stereotypes or just natural consequences of natural differences?


The bulk of the training data is from Western technology, images, books, television, movies, photography, media. That's where the very real and recognized biases come from. They're the result of a gap in data, nothing more.

Look at how DALL-E 2 produces little bears rather than bear sized bears. Because its data doesn't have a lot of context for how large bears are. So you wind up having to say "very large bear" to DALL-E 2.

Are DALL-E 2 bears just a "natural consequence of natural differences"? Or is the model not reflective of reality?


That's true for some things, but the "gender bias for some professions" is likely to just be reflecting reality.


We don't really know that, either. They said they didn't do an empirical analysis of it. For example, it may show a few male nurses across hundreds of prompts, or it may show none across thousands. They don't give examples. Hopefully they release a paper showing the biases, because that would be an interesting discussion.


Yes, the idea is that just because it doesn't align to Western ideals of what seems unbiased doesn't mean that the same is necessarily true for other cultures, and by failing to release the model because it doesn't conform to Western, left wing cultural expectations, the authors are ignoring the diversity of cultures that exist globally.


No, it's coming from a perspective of moral realism. It's an objective moral truth that racial and ethnic biases are bad. Yet most cultures around the world are racist to at least some degree, and to the extent that they are, they are bad.

The argument you're making, paraphrased, is that the idea that biases are bad is itself situated in particular cultural norms. While that is true to some degree, from a moral realist perspective we can still objectively judge those cultural norms to be better or worse than alternatives.


You're confused by the double meaning of the word "bias".

Here we mean mathematical biases.

For example, a good mathematical model will correctly tell you that people in Japan (geographical term) are more likely to be Japanese (ethnic / racial bias). That's not "objectively morally bad", but instead, it's "correct".


Although what you stated is true, it’s actually a short form of a commonly stated untrue statement “98% of Japan is ethnically Japanese”.

1. that comes from a report from 2006.

2. it’s a misreading, it means “Japanese citizens”, and the government in fact doesn’t track ethnicity at all.

Also, the last time I was in Japan (Jan ‘20) there were literally ten times more immigrants everywhere than my previous trip. Japan is full of immigrants from the rest of Asia these days. They all speak perfect Japanese too.


Well that's not the issue here, the problem is the examples like searches for images of "unprofessional hair" returning mostly Black people in the results. That is something we can judge as objectively morally bad.


Did you see the image in the linked article? Clearly the “unprofessional hair” are people with curly hair. Some are white! It’s not the algorithm’s fault that P(curly|black) > P(curly|white).


It absolutely is the responsibility of the people making the algorithm available to the general public.


Western liberal culture says discriminating against one set of minorities to benefit another (affirmative action) is a good thing. What constitutes a racial and ethnic bias is not objective. And therefore Google shouldn't pretend like it is either.

> from a moral realist perspective we can still objectively judge those cultural norms to be better or worse than alternatives

No, because depending on what set of values you have, it is easy to say that one set of biases is better than another. The entire point is that it should not be Google's role to make that judgement - people should be able to do it for themselves.


What makes you think the authors are all American?


The authors are listed on the page, and a quick look at LinkedIn suggests they are mostly based in Canada.



I'm one who welcomes their reasoning. I don't consider myself a social justice kind of guy, but I'm not keen on the idea that a tool that is supposed to make life better for everyone has a bias towards one segment of society. This is an important issue (bug?) that needs to be resolved, especially since there is absolutely no burning reason to release it before it's ready for general use.


It’s wild to me that the HN consensus is so often that 1) discourse around the internet is terrible, it’s full of spam and crap, and the internet is an awful unrepresentative snapshot of human existence, and 2) the biases of general-internet-training-data are fine in ML models because it just reflects real life.


The bias on HN is that people who prioritize being nice, or may possibly have humanities degrees or be ultra-libs from SF, are wrong because the correct answer would be cynical and cold-heartedly mechanical.

Other STEM adjacent communities feel similarly but I don’t get it from actual in person engineers much.


Being nice is alright, but why is it that this fundamental drive is so often an uninspiring explanation behind yet another incursion towards one's individual freedom, even if exercising said freedom doesn't bring any real harm to anyone involved?

Maybe the engineers conclude correctly that voicing this concern without the veil of anonymity will do nothing good to their humble livelihood, and thus you don't hear it from them in person.


It's wild to me that you'd say that. The people complaining (1) aren't following it up with "so we should make sure to restrict the public from internet access entirely". -- that's what would be required to make your juxtaposition make sense.

Moreover, the model doing things like exclusively producing white people when asked to create images of people home brewing beer is "biased" but it's a bias that presumably reflects reality (or at least the internet), if not the reality we'd prefer. Bias means more than "spam and crap", in the ML community bias can also simply mean _accurately_ modeling the underlying distribution when reality falls short of the author's hopes.

For example, if you're interested in learning about what home brewing is, the fact that it shows only white people would be at least a little unfortunate, since there is nothing inherently white about it and some home brewers aren't white. But if, instead, you wanted to just generate typical home brewing images, doing anything but that would generate conspicuously unrepresentative images.

But even ignoring the part of the biases which are debatable or of application-specific impact, saying something is unfortunate and saying people should be denied access are entirely different things.

I'll happily delete this comment if you can bring to my attention a single person who has suggested that we lose access to the internet because of spam and crap who has also argued that the release of an internet-biased ML model shouldn't be withheld.


Why is it wild? How is it contradictory?


If these models spit out the data they were trained on and the training data isn’t representative of reality, then they won’t spit out content that’s representative of reality either.

So people shouldn’t say ‘these concerns are just woke people doing dumb woke stuff, but the model is just reflecting reality.’


Literally the same thing could be said about Google Images, but Google Images is obviously available to the public.

Google knows this will be an unlimited money generator so they're keeping a lid on it.


Given that there's already many competing models in this space prior to any of them having been brought to market, it seems more likely that it will be commoditized.


The winner will be a model that bypasses the automatic fake image detection algorithms that will be added to every social media site


Indeed. If a project has shortcomings, why not just acknowledge the shortcomings and plan to improve on them in a future release? Is it anticipated that "engineer" being rendered as a man by the model is going to be an actively dangerous thing to have out in the world?


"what could go wrong anyway?"


So glad the company that spies on me and reads my email for profit is protecting me from pictures that don't look like TV commercials.


Gmail doesn’t read your email for ads anymore. They read it to implement spam filters, and good thing too. Having working spam filters is indeed why they make money though.


Yup this is what happens when people who want headlines nitpick for bullshit in a state-of-the-art model which simply reflects the state of the society. Better not to release the model itself than keep explaining over and over how a model is never perfect.


> a tendency for images portraying different professions to align with Western gender stereotypes

There are two possible ways of interpreting "gender stereotypes in professions":

biased, or correct.

https://www.abc.net.au/news/2018-05-21/the-most-gendered-top...

https://www.statista.com/statistics/1019841/female-physician...


Are "Western gender stereotypes" significantly different than non-Western gender stereotypes? I can't tell if that means it counts a chubby stubble-covered man with a lip piercing, greasy and dyed long hair, wearing an overly frilly dress as a DnD player/metal-head or as a "woman" or not (yes I know I'm being uncharitable and potentially "bigoted" but if you saw my Tinder/Bumble suggestions and friend groups you'd know I'm not exaggerating for either category). I really can't tell what stereotypes are referred to here.


From the HN rules:

>Eschew flamebait. Avoid unrelated controversies and generic tangents.

They provided a pretty thorough overview (nearly 500 words) of the multiple reasons why they are showing caution. You picked out the one that happened to bother you the most and have posted a misleading claim that the tech is being withheld entirely because of it.


I wouldn't describe this situation as "sad". Basically, this decision is based on a belief that tech companies should decide what our society should look like. I don't know what emotion that conjures up for you, but "sadness" isn't it for me.


> Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.

Genuinely, isn't it a prime example of the people actually stopping to think if they should, instead of being preoccupied with whether or not they could ?


> Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.

Indeed it is. Consider this an early, toy version of the political struggle related to ownership of AI-scientists and AI-engineers of the near future. That is, generally capable models.

I do think the public should have access to this technology, given so much is at stake. Or at least the scientists should be completely, 24/7, open about their R&D. Every prompt that goes into these models should be visible to everyone.


This seems bullshit to me, considering Google translate and google images encode the same biases and stereotypes, and are widely available.


yea but now they aren't giving people more data-points to attack them with such nonsense arguments.


Aren't those old systems?


Pre woke tools, wouldn't have been allowed nowadays.


Even as a pretty left-leaning person, I gotta agree. We should see AI's pollution by human shortcomings as akin to the fact that our world is the product of many immoralities that came before us. It sucks that they ever existed, but we should understand that the results are, by definition, a product of the past, and let them live in that context.


Transformers are parallelizable, right? What’s stopping a large group of people from pooling their compute power together and working towards something like this? IIRC there were some crypto projects a while back that were trying to create something similar (Golem?)


There are people working on reproducing the models, see here for Dall-E 2 for example: https://github.com/lucidrains/DALLE2-pytorch

It's often not worth it to decentralize the training computation, but it's not hard to get donated cycles, and groups are working on it. Don't fret because Google isn't releasing the API/code: they released the paper, and that's all you need.


There are the Eleuther.ai and BigScience projects working on public foundation models. They have a few releases already and currently training GPT-3 sized models.


You really need a decent infiniband-linked cluster to train large models.


In short, the generated images are too gender-challenged-challenged and underrepresent the spectrum of new normalcy!


I was hoping your conclusion wasn't going to be this as I was reading that quote. But, sadly, this is HN.


it isn't woke enough. Lol.


In discussions like this, I always head for the gray-text comments to enjoy the last crumbs of the common sense in this world.


Get offline and talk to people in meat-space. You're likely to find them to be much more reasonable. :)


Yep, the meat-space is generally a bit less woke than HN, so thanks for the reminder ))


Smoking these meats! https://youtu.be/YeemJlrNx2Q


Smoking them meats with his wifi! That explains some obvious anomalies in the meat-space pretty neatly)


... and to witness the downvoters so that their cowardly disgust towards truth could buy them some extra time in hell :)


Great. Now even if I do get a Dall-E 2 invite I'll still feel like I'm missing out!


It's always the same with AI research: "we have something amazing but you can't use it because it's too powerful and we think you are an idiot who cannot use your own judgement."


I can understand the reasoning behind this, though.

Dall-E had an entire news cycle (on tech-minded publications, that is) that showcased just how amazing it was.

Millions* of people became aware that technology like Dall-E exists, before anyone could get their hands on it and abuse it. (*a guesstimate, but surely a close one)

One day soon, inevitably, everyone will have access to something 10x better than Imagen and Dall-E. So at least the public is slowly getting acclimated to it before the inevitable "theater-goers running from a projected image of a train approaching the camera" moment


As someone that spent an evening trying to generate images of Hitler Lego I think they have a point.



Lucidrains is a champ. If they're on HN, bravo and thanks for all the reference implementations!


Is this a joke?


No


To expand a bit for the grandparent: if you check out this author's other repos, you'll notice they have a thing for implementing these papers (multiple DALL-E 2 implementations, for instance). You should expect to see an implementation there pretty quickly, I'd guess.


Not to diminish their contribution but implementing the model is only one third of the battle. The rest is building the training dataset and training the model on a big computer.


You're not wrong that the dataset and compute are important, and if you browse the author's previous work, you'll see there are datasets available. The reproduction of DALL-E 2 required a dataset of similar size to the one imagen was trained on (see: https://arxiv.org/abs/2111.02114).

The harder part here will be getting access to the compute required, but again, the folks involved in this project have access to lots of resources (they've already trained models of this size). We'll likely see some trained checkpoints as soon as they're done converging.


Thank you. I just saw a GitHub repo, empty except for a citation and a claim that it was an implementation of Imagen, and thought it was perhaps some satirical statement about open source or something. With the context it makes a lot more sense.


One thing that no one predicted in AI development was how good it would become at some completely unexpected tasks while being not so great at the ones we supposed/hoped it would be good at.

AI was expected to grow like a child. Somehow blurting out things that would show some increasing understanding on a deep level but poor syntax.

In fact we get the exact opposite. AI is creating texts that are syntaxically correct and very decently articulated and pictures that are insanely good.

And these texts and images are created from a text prompt?! There is no way to interface with the model other than by freeform text. That is so weird to me.

Yet it doesn’t feel intelligent at all at first. You can’t ask it to draw “a chess game with a puzzle where white mates in 4 moves”.

Yet sometimes GPT makes very surprising inferences. And it starts to feel like there is something going on a deeper level.

DeepMind’s AlphaXxx models are more in line with how I expected things to go. Software that gets good at expert tasks that we as humans are too limited to handle.

Where it’s headed, we don’t know. But I bet it’s going to be difficult to tell the “intelligence” from the “varnish”


I doubt 99% of humans can draw a ”chess game with a puzzle where white mates in 4 moves”


Maybe not draw, but we can do an image search for "chess puzzle mate in 4" which gives plenty of results:

https://www.google.com/search?q=chess+puzzle+mate+in+4&tbm=i...

It would be surprising if AI couldn't do the same search and produce a realistic drawing out of any one of the result puzzles.


They can with computer assistance, and this AI sort of has that in that it’s both some “intelligence” and a whole lot of memorized internet, with the issue that we don’t know how to separate those things.


It's surprisingly easy to construct such a thing.

If you want to be trivial about it, you can just have white back-rank mate with a rook, and black has 4 pieces to block with.


Syntactically* I know it's the most trivial of things, but in case you were curious as I often am!


Oh God! Am I a bot?


It’s terrifying that all of these models are one colab notebook away from unleashing unlimited, disastrous imagery on the internet. At least some companies are starting to realize this and are not releasing the source code. However they always manage to write a scientific paper and blog post detailing the exact process to create the model, so it will eventually be recreated by a third party.

Meanwhile, Nvidia sees no problem with yeeting StyleGAN and models that allow real humans to be realistically turned into animated puppets in 3D space. The inevitable end result of these scientific achievements will be orders of magnitude worse than deepfakes.

Oh, or a panda wearing sunglasses, in the desert, digital art.


I am absolutely terrified of all this for a different reason: all human professions (not just art) will soon be replaced by “good enough” AI, creating a world flooded with auto-generated junk and billions of people trapped permanently in slums, because you can’t compete with free, and no one can earn a living any longer.

It’s an old fear for sure but it seems to be getting closer and closer every day, and yet most of the discussion around these things seems to be variations of “isn’t this cool?”


And then once you take the (probably trivial) step where the computers come up with the ideas for the images, these images won't be interesting anymore because we know a human didn't even make it. It won't be funny in the same way. "Oh that was clever" doesn't make sense anymore. We could reach a new level of jaded.

(Also, hello readers from the year 2032 when all of these predictions sound silly.)


Don't forget the training data for those computer "ideas" will be "attention", targeted at the most vulnerable 80% of the market. I'd hope that it makes them less fearful and angry, but nope... that drives attention. I wonder what combination of UFOs, satanic cults, and immigrant hordes it will be.


As soon as middle class work starts to get automated we will form a new system of resource allocation because all of a sudden the current tax system doesn't work and we go through the mother of all economic crises because no one has any money.


I, for One, Welcome Our Robot Overlords


I apologize in advance for the elitist-sounding tone. In my defense the people I’m calling elite I have nothing to do with, I’m certainly not talking about myself.

Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.

Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization (NVIDIA's group comes to mind), but it's a very good rule of thumb, and one you're betting your whole face on the next time you're tempted to doze behind the wheel of a Tesla.

There are two big reasons for this:

- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.

- the current approach of "how many Falcon Heavy launches' worth of TPU can I throw at the same basic masked attention with residual feedback and a cute Fourier coloring" inherently favors deep pockets, and obviously MSFT, sorry, OpenAI, has that, but deep pockets also non-linearly scale outcomes when you've got in-house hardware for mixed-precision multiplies.

Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing 10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?


This characterization is not really accurate. OpenAI has had almost a 2 year lead with GPT-3 dominating the discussion of LLMs (large language models). Google didn’t release its paper on the powerful PaLM-540b model until recently. Similarly, CLiP, Glide, DALL-E, and DALL-E2 have been incredibly influential in visual-language models. Imagen, while highly impressive, definitely is a catch-up piece of work (as was PaLM-540b).

Google clearly demonstrates their unrivaled capability to leverage massive quantities of data and compute, but it’s premature to declare that they’ve secured victory in the AI Wars.


I agree that it’s still a jump ball in a rapidly moving field, I was saying Google is far ahead, not that they’ve won.

And I don't think whatever iteration of PaLM was cooking at the time GPT-3 started getting press would have looked too shabby.

I think Google crushed OpenAI on both GPT and DALL-E in short order because OpenAI published twice and someone had had enough.


OpenAI and FAIR are definitely in the same league as Google but Google has been all-in on AI from the beginning. They've probably spent well over $100B on AI research. I really enjoyed the Genius Makers book which came out last year from an NYT reporter on history of ML race. Deepmind apparently turned down a FB offer of double what Google was offering.


Cade Metz is that author and most of it I can only speculate on.

The bits and pieces I saw first hand tie out reasonably well with that account.


That’s pretty speculative and dubious (the holding back part) given the heavy bias to publication culture at Google Research and DeepMind. OpenAI has hardly been “crushed” here; PaLM and Imagen are solid, incremental advances, but given what came before them, not Earth-shattering.

If I were going to cite evidence for Alphabet’s “supremacy” in AI, I would’ve picked something more novel and surprising such as AlphaFold, or perhaps even Gato.

It’s not clear to me that Google has anything which compares to Reality Labs, although this may simply be my own ignorance.

Nvidia surely scooped Google with Instant Neural Graphics Primitives, in spite of Google publishing dozens of (often very interesting) NeRF papers. It’s not a war, all these works build on one another.


I want to be clear, all of this stuff is fascinating, expensive, and difficult. With the possible exception of a few trailer-park weirdos like me, it basically takes a PhD to even stay on top of the field, and you clearly know your stuff.

And to be equally clear, I have no inside baseball on how Brain/DM choose when to publish. I have some watercooler chat on the friendly but serious rivalry between those groups, but that’s about it.

I’m looking from the outside in at OpenAI getting all the press and attention, which sounds superficial but sooner or later turns into actual hires of actual star-bound post docs, and Google laying a little low for a few years.

Then we get Gato, Imagen, and PaLM in the space of like what, 2 months?

Clearly I’m speculating that someone pulled the trigger, but I don’t think it’s like, absurd.


Scaling up improved versions of existing recipes can be done surprisingly fast if you have strong DL infrastructure. Also, GPT-3 was built on top of previous advances such as Google’s BERT. I’m surprised that it took Google so long to answer w/ PaLM, though it seems plausible to me that they wanted a clear enough qualitative advancement that people didn’t immediately say, “So what.”

You could’ve had the same reaction years ago when Google published GoogLeNet followed by a series of increasingly powerful Inception models - namely that Google would wind up owning the DNN space. But it didn’t play out that way, perhaps because Google dragged its feet releasing the models and training code, and by the time it did, there were simpler and more powerful models available like ResNet.

Meta’s recent release of the actual OPT LLM weights is probably going to have more impact than PaLM, unless Google can be persuaded to open up that model.


There are a lot of really knowledgeable people on here, but this field is near and dear to my heart and it’s obvious that you know it well.

I don’t know what “we should grab a coffee or a beer sometime” means in the hyper-global post-C19 era, but I’d love to speak more on this without dragging a whole HN comment thread through it.

Drop me a line if you’re inclined: ben.reesman at gmail


> Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication.

This is ... very incorrect. I am very certain (95%+) that Google had nothing even close to GPT-3 at the time of its release. It's been 2 full years since GPT-3 was released, and even longer since OpenAI actually trained it.

That's not to talk about any of the other things OpenAI/FAIR has released that were SOTA at the time of release (Dall-E 1, JukeBox, Poker, Diplomacy, Codex).

Google Brain and Deepmind have done a lot of great work, but to imply that they essentially have a monopoly on SOTA results and all SOTA results other labs have achieved are just due to Google delaying publication is ridiculous.


Yeah, at the time, GB was still very big on mixture-of-expert models and bidirectional models like T5. (I'm not too enthusiastic about the former, but the latter has been a great model family and even if not GPT-3, still awesome.) DeepMind pivoted faster than GB, based on Gopher's reported training date, and GB followed some time after. But definitely neither had their own GPT-3-scale dense Transformer when GPT-3 was published.


At the risk of sounding like I’m trying to defend a position that I’ve already conceded is an oversimplification, I’m frankly a little skeptical of how we can even know that.

GPT is, well, opaque. It's somewhere between common knowledge and conspiracy theory that it gets a helping hand from Mechanical Turk workers when it gets in over its head.

The exact details of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just over-fitting Wikipedia the more corpus and compute you feed to its insatiable maw has always seemed a little big on claims and light on reproducibility.

I don’t think there are many attention skeptics in language modeling, it’s a good idea that you can demo on a gaming PC. Transformers demonstrably work, and a better beam-search (or whatever) hits the armchair Turing test harder for a given compute budget.

But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these things are basically asking the question: “if I overfit all human language on the Internet, is that a bad thing?”

It’s my personal suspicion that this is the dominant term, and it’s my personal belief that Google’s ability to do both corpus and model parallelism at Jeff Dean levels while simultaneously building out hardware to the exact precision required is unique by a long way.

But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer-review, let alone a jury. It’s just an educated guess.


Any “brash generalization” is clearly going to be grossly incorrect in concrete cases, and while I have a little gossip from true insiders, it’s nowhere near enough to make definitive statements about specific progress on teams at companies that I’ve never worked for.

I did a bit of disclaimer on my original post but not enough to withstand detailed scrutiny. This is sort of the trouble with trying to talk about cutting-edge research in what amounts to a tweet: what’s the right amount of oversimplified, emphatic statement to add legitimate insight but not overstep into being just full of shit.

I obviously don’t know that publication schedules at heavy-duty learning shops are deliberate and factor-in other publications. The only one I know anything concretely about is FAIR and even that’s badly dated knowledge.

I was trying to squeeze into a few hundred characters my very strong belief that Brain and DM haven’t let themselves be scooped since ResNet, based on my even stronger belief that no one has the muscle to do it.

To the extent that my oversimplification detracted from the conversation I regret that.


Not elitist at all; I highly appreciate this post. I know the basics of ML but otherwise am clueless when it comes to the true depths of this field and it's interesting to hear this perspective.


I used a lot of jargon and lingo and inside baseball in that post, it was intended for people who have deep background.

But if you’re interested I’m happy to (attempt) answers to anything that was jargon: by virtue of HN my answers will be peer-reviewed in real time, and with only modest luck, a true expert might chime in.


Is there a handy list of generally recognized AI advancements, and their owners, that you would recommend reviewing? Or perhaps, seminal papers published? I'm only tangentially familiar with the field but would be curious to learn about the clash of the Titans playing out. Thanks!


That’s too big a question to even attempt an answer in an HN comment, but to try to answer a realistic subset of it: “Attention is All You Need” in like 2017 is the paper most germane to my remark, and probably the thread. The modeling style it introduced often gets called a “transformer”.

The TLDR is that people had been trying for ages to capture long-distance (in the input or output, not the black box) relationships in a way that was amenable to traditional neural-network training techniques, which is non-obvious how to do because your basic NN takes an input without a distance metric, or put more plainly: it can know all the words in a sentence but struggles with what order they are in without some help.

The state of the art for a while was something called an LSTM, and those gadgets are still useful sometimes, but have mostly been obsoleted by this attention/transformer business.

That paper had a number of cool things in it but two stand out:

- by blinding an NN to some parts of the input (“masking”) you can incentivize/compel it to look at (“attend to”) others. That’s a gross oversimplification, but it gets the gist of it I think. People have come up with very clever ways to boost up this or that part of the input in a context-dependent way.

- by playing with some trigonometry you can get a unique signal that can be summed onto the embeddings, which gives the model its "bearings" so to speak as to "where" it is in the input: this word is closer to the beginning of the paragraph, that sort of thing. People have also gotten very clever about how to do this, but the idea is the same: how do I tell a neural network that there's structure in what would otherwise be a pile of numbers? (A small sketch of the original sinusoidal version follows.)
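
Concretely, here's a minimal numpy sketch of that sinusoidal scheme from the paper (assuming an even embedding size); the resulting matrix is simply summed onto the token embeddings:

  import numpy as np

  def positional_encoding(seq_len, d_model):
      # sines and cosines at geometrically spaced frequencies, one row per position
      pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
      i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
      angles = pos / np.power(10000.0, 2 * i / d_model)
      pe = np.zeros((seq_len, d_model))
      pe[:, 0::2] = np.sin(angles)                    # even dimensions
      pe[:, 1::2] = np.cos(angles)                    # odd dimensions
      return pe                                       # add this to the token embeddings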


Is Maxwell's Demon applicable to this scenario? I'm not a physicist but I recently had to look it up after talking with someone and thought it had to do with a specific thermodynamic thought experiment with gas particles and heat differences. Is there another application to computing power that I don't understand?


You’re absolutely right that I used a sloppy analogy there.

It’s Boltzmann and Szilard who did the original “kT” stuff around the underlying thermodynamics governing energy dissipation in these scenarios, and Rolf Landauer who did the really interesting work on applying that thermo to lower bounds on the energy expenditure of a given computation.

I said Maxwell’s Demon because it’s the best known example of a deep connection between useful work and computation. But it was sloppy.
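
For a sense of the scale Landauer's bound implies: erasing one bit costs at least kT ln 2, which at room temperature is about 3e-21 joules.

  import math

  k_B = 1.380649e-23             # Boltzmann constant, J/K
  T = 300.0                      # roughly room temperature, K
  print(k_B * T * math.log(2))   # ~2.9e-21 J minimum per bit erased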


OK thanks I figured there was a connection between computational power and thermodynamics when you get to a small enough scale but I wasn't sure how to apply it!


Who does it serve for plebs to be shown the approach openly? I don't know that it does a disservice to anyone by showing the approach.

But in general it's likely more that it's going to happen anyway; if we can share our approaches and research findings, we'll just get there sooner.


Once upon a time you could lie only a little bit and Stanford would give you the whole ImageNet corpus. I know because, uh, a friend told me.

I’ve got no interest in moralizing on this, but if any of the big actors wanted to they could put a meaningful if not overwhelming subset of the corpus on S3, put the source code on GitHub, and you could on a modest budget see an epoch or 3.

I’m not holding my breath.


> But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?

I'm not sure it matters. The history of computing shows that within the decade we will all have the ability to train and use these models.


... Unless the possession of capable models becomes a legal liability by that time.


This won’t happen in an interesting way. What will happen is you’ll find out training a model on copyrighted inputs causes it to memorize those inputs and the owners own your output.


In short, it’s all about money.


Yes and no.

For example: the high-frequency trading industry is estimated to have made somewhere between 2 and 3 billion dollars in profit in all of 2020. That’s a good weekend at Google.

HFT shops pay well, but not much different to top performers at FAANG.

People work in HFT because without taking a pay cut they can play real ball: they want to try themselves against the best.

Heavy learning people are no different in wanting both a competitive TC but maybe even more to be where the action is.

That’s currently Blade Runner Industries Ltd, but that could change.


I have to wonder how much releasing these models will "poison the well" and fill the internet with AI generated images that make training an improved model difficult. After all if every 9/10 "oil painted" image online starts being from these generative models it'll become increasingly difficult to scrape the web and to learn from real world data in a variety of domains. Essentially once these things are widely available the internet will become harder to scrape for good data and models will start training on their own output. The internet will also probably get worse for humans since search results will be completely polluted with these "sort of realistic" images which can ultimately be spit out at breakneck speed by smashing words from a dictionary together...


Look at carpentry blogs and recipe blogs. Nearly all of it is junk content. I bet if you combined GPT and Imagen or DALL-E 2 you could replace all of them. Just provide a Betty Crocker recipe and let it generate a blog with weekly updates and even a bunch of images - "happy family enjoying pancakes together"

I can see the future as being devoid of any humanity.


I wrote a comedic "Best Apache Chef recipe" article[1] mocking these sites.

I guess the concern would be: If one of these recipe websites _was_ generated by an AI, the ingredients _look_ correct to an AI but are otherwise wrong - then what do you do? Baking soda swapped with baking powder. Tablespoons instead of teaspoons. Add 2tbsp of flower to the caramel macchiato. Whoops! Meant sugar.

[1] http://slimsag.com/best-apache-chef-recipe/1438731.htm


I don't know. If we have something like imagen or dalle, I can imagine something that can produce "tasty" food from random ingredients isn't far off.


Then we will still need humans in the loop to do the cherry picking/supervised learning. Gibberish recipes need to be flagged and interesting new creations need to be promoted. The input can be fed back into the model till the model contains accurate representations of the chemical reactions of cooking ingredients and the neuronal wiring of the human olfactory system.


> Allow server to cool down for ~10 minutes

Epic


Seeing this a lot on youtube also. Scripts pulling in "news" from a source as a script for a robo voice combined with "related" images stitched together randomly.


Even though it's not AI, this is already happening with a lot of content farms. There was a good video a couple years ago from Ann Reason of "How to Cook That" that basically pointed out how the visually-appealing-but-not-actually-feasible "hands and pans" content farms (So Tasty, 5 Minute Crafts, etc.) were killing genuine baking channels.

Imagine that instead of having cheap labor from Southeast Asia churn out these videos, that instead they are just spit out as fast as possible using AI.


> Ann Reason

"Anne Reardon" (autocorrect wah wah waaah)


"Picture of happy nuclear family enjoying paperclip maximization at the beach"


The future digital landscape might be void of humanity, but there will still be real humans living next door to you ;)


The only interaction with people now is installing bark-detecting automatic dog whistles for our neighbors' dogs, and Ring doorbells.


I see the opposite future.

As AI advances, a lot of people will seek out experiences of life outside the digital world.

Even digital communication will not be trustworthy anymore with deepfakes and everything else, so people will want to get together more often.

Edit: for the lazy ones, yeah, digital will be a sad and heartless environment...


This is my theory as well. There'll be a short period where some of us at the forefront will enrich themselves by flooding the internet with imagery never seen before. That'll be a bubble where people think "abundance" has been solved, but then it'll pop as people stop trusting anything they see online and, as you say, only trust and interact with things in the real world (wouldn't surprise me if regulation got involved here too somehow).


Doesn't it increase the value of genuine human-produced content? Or their NFTs!


For the very skilled, yes. But a lot of low-skilled artists or content creators will have the rug pulled out from under them. (And how will we ever get highly skilled artists trained in the future if they can't make a living from their lower-tier output before they reach mastery?)


> I can see the future as being devoid of any humanity.

Considering how many of the readers of said blog will be scrapers and bots, who will use the results to generate more spammy "content", I think you are right.


I’d much rather skip the blog format and replace them with an AI that can answer “Please provide a pie recipe like my grandparent’s”, or “I’d like to make these ribs on the BBQ so that they come out flavourful, soft, and a little sweet.”


- 100mg of Zoloft

wash it down with water.


>I can see the future as being devoid of any humanity.

I can see a past where this already happened, to paraphrase Douglas Adams ;)


People training newer models just have to look for the "Imagen" tag or the Dall-E2 rainbow at the corner and heuristically exclude images having these. This is trivial.

Unless you assume there are bad actors who will crop out the tags. Not many people now have access to Dall-E2 or will have access to Imagen.

As someone working in Vision, I am also thinking about whether to include such images deliberately. Using image augmentation techniques is ubiquitous in the field. Thus we introduce many examples for training the model that are not in the distribution over input images. They improve model generality by huge margins. Whether generated images improve generality of future models is a thing to try.
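
For context, a standard augmentation pipeline looks something like this torchvision sketch; mixing in generated images would effectively be one more, much more aggressive, transform, and whether it helps generality is exactly the open question:

  from torchvision import transforms

  # standard augmentations that deliberately push training inputs
  # off the distribution of the raw photos
  train_tf = transforms.Compose([
      transforms.RandomResizedCrop(224),
      transforms.RandomHorizontalFlip(),
      transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
      transforms.ToTensor(),
  ])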

Damn I just got an idea for a paper writing this comment.


Most images you see from these services will not have a watermark on them. Cropping is trivial.


Perhaps a watermark should be embedded in a subtle way across the whole image. What is the word? "Steganography" is designed to solve a different problem and I don't think it survives recompression etc. Is there a way to create weakly secure watermarks that are invisible to the naked eye, spread across the whole image, and resistant to scaling and lossy compression (to a point)?


Invisible, robust watermarks had a lot of attention in research from the late 90s to the early 10s, and apparently some resurgence with the availability of cheap GPU power.

Naturally there's a python library [1] with some algorithms that are resistant to lossy compression, cropping, brightness changes, etc. Scaling seems to be a weakness though.

1: https://pypi.org/project/invisible-watermark/
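
Embedding and recovering a mark with that package looks roughly like the following (based on its README; treat the exact API as version-dependent):

  import cv2
  from imwatermark import WatermarkEncoder, WatermarkDecoder

  bgr = cv2.imread('generated.png')

  encoder = WatermarkEncoder()
  encoder.set_watermark('bytes', b'imagen')        # 6-byte payload to hide
  cv2.imwrite('generated_wm.png', encoder.encode(bgr, 'dwtDct'))

  decoder = WatermarkDecoder('bytes', 48)          # 48 bits = 6 bytes
  print(decoder.decode(cv2.imread('generated_wm.png'), 'dwtDct'))  # b'imagen' if it survived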


It's ironic, seeing people who build models trained on other people's work (which is in no way credited) to be worried about origin and credit.


> Unless you assume there are bad actors who will crop out the tags.

I don't know why people do that but lots of randoms on the internet do, and they're not even bad actors per se. Removing signatures from art posted online has become a kind of meme in itself. Especially when comic strips are reposted on Reddit. So yeah, we'll see lots of them.


In my melody generation system I'm already including melodies that I've judged as "good" (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFwkumE578Y...) in the updated training set. Since the number of catchy melodies that have been created by humans is much, much lower than the number of pretty images, it makes a significant difference. But I'd expect that including AI-generated images without human quality judgement scores in the training set won't be any better than other augmentation techniques.


Huh, I had never thought of that. Makes it seem like there's a small window of authenticity closing.

The irony is that if you had a great discriminator to separate the wheat from the chaff, that it would probably make its way into the next model and would no longer be useful.

My only recommendation is that OpenAI et al should be tagging metadata for all generated images as synthetic. That would be a really interesting tag for media file formats (would be much better native than metadata though) and probably useful across a lot of domains.


The OpenAI access agreement actually says that you must add (or keep?) a watermark on any generated images, so you’re in good company with that line of thinking.


The irony is that when the majority of content becomes computer-generated, most of that content will also be computer-consumed.

Neal Stephenson covered this briefly in "Fall; or Dodge In Hell." So much 'net content was garbage, AI-generated, and/or spam that it could only be consumed via "editors" (either AI or AI+human, depending on your income level) that separated the interesting sliver of content from...everything else.


He was definitely onto something in that book where people also resort to using blockchains to fingerprint their behavior and build an unbreakable chain of authenticity. Later in that book that is used to authorize the hardware access of the deceased and uploaded individuals.

A bit far out there in terms of plot but the notion of authenticating based on a multitude of factors and fingerprints is not that strange. We've already started doing that. It's just that we currently still consume a lot of unsigned content from all sorts of unreliable/untrustworthy sources.

Fake news stops being a thing as soon as you stop doing that. Having people sign off on and vouch for content needs to start becoming a thing. I might see Joe Biden saying stuff in a video on Youtube. But how do I know if that's real or not?

With deep fakes already happening, that's no longer an academic question. The answer is that you can't know. Unless people sign the content. Like Joe Biden, any journalists involved, etc. You might still not know 100% it is real but you can know whether relevant people signed off on it or not and then simply ignore any unsigned content from non reputable sources. Reputations are something we can track using signatures, blockchains, and other solutions.
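
A bare-bones version of signing content is already easy with off-the-shelf crypto; the hard parts are key distribution and reputation, not the math. A rough sketch with the Python cryptography package (the file name is just an example):

  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

  key = Ed25519PrivateKey.generate()
  content = open('press_briefing.mp4', 'rb').read()
  signature = key.sign(content)

  # anyone holding the public key can check the clip wasn't altered;
  # verify() raises InvalidSignature if it was
  key.public_key().verify(signature, content)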

Interesting with Neal Stephenson that he presents a problem and a possible solution in that book.


As usual, Stephenson is at his best when he's taking current trends and extrapolating them to almost absurd extremes...until about a decade passes and you realize they weren't that extreme after all.

I loved that he extended the concept of identity as an individualized pattern of events and activities to the real world: the innovation of face masks with seemingly random but unique patterns to foil facial recognition systems but still create a unique identity.

Like you say, the story itself had horrible flaws (I'm still not sure if I liked it in its totality, and I'm a Stephenson fan since reading Snow Crash on release in '92), but still had fascinating and thought provoking content.


> blockchains to fingerprint their behavior and build an unbreakable chain of authenticity. Later in that book that is used to authorize the hardware access of the deceased and uploaded individuals.

maybe I misunderstood, but I had it that people used generative AI models that would transform the media they produced. The generated content can be uniquely identified, but the creator (or creators) retains anonymity. Later these generative AI models morphed into a form of identity since they could be accurately and uniquely identified.


All part of the mix. But definitely some blockchain thing underneath to tie it all together. Stephenson was writing about crypto currencies as early as the nineties. Around the time he also coined the term Metaverse.


I can see a world where in person consumption of creative media (art, music, movies etc), where all devices are to be left at the door, becomes more and more sought after and lucrative.

If the AI models can't consume it, it can't be commoditised and, well, ruined.


I don't think it will "poison the well" so much as change it - images that humans like more will get a higher pagerank, so the models trained on Google Images will not so much as degrade as they will detach from reality and begin to follow the human mind they way plausible fiction does.


Just yesterday I was speculating that current AI is bad at math because math on the internet is spectacularly terrible.

I think you’re right, and it’s unlikely that we (society) will convince people to label their AI content as such so that scraping is still feasible.

It’s far more likely that companies will be formed to provide “pristine training sets of human-created content”, and quite likely they will be subscription based.


>“pristine training sets of human-created content”

well, we do have organic/farmed/handcrafted/etc. food. One can imagine information nutrition label - "contains 70% AI generated content, triggers 25% of the daily dopamine release target".


Right? And if you’re a VC backing a new startup you might want to pay that extra to get them started right.


How would that really happen? It seems to me you're assuming that there's no such thing as extant databases of actual oil paintings, that people will stop producing, documenting, and curating said paintings. I think the internet and curated image databases are far more well kept than your proposed model accounts for.


My hypothetical example is not really about oil paintings, but the fact these models will surely get deployed and used for stock photos for articles, on art pages etc.

I think this will introduce unavoidable background noise that will be super hard to fully eliminate in future large-scale data sets scraped from the web. There are always going to be more and more photorealistic pictures of "cats," "chairs," etc. in the data that are close to looking real but not quite, and we can never really go back to a world where there are only "real" pictures, or "authentic human art," on the internet.


My first thought on reading the article is generating images for my presentations.


On the contrary -- the opposite will happen. There's a decent body of research showing that just by training foundation models on their outputs, you amplify their capabilities.

Less common opinion: this is also how you end up with models that understand the concept of themselves, which has high economic value.

Even less common opinion: that's really dangerous.


For better training data in the future: storing a content hash and author identification for images (an example proprietary solution exists right now [0]), and having a decentralized reputation system for people/authors, would go a long way; authors could gain reputation/incentives from it too.

[0] https://creativecloud.adobe.com/discover/article/how-to-use-...
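
The content-hash half of that idea is trivial to sketch (the decentralized reputation layer is the hard part); the record format here is just illustrative:

  import hashlib, json

  def provenance_record(image_path, author_id):
      # the hash ties the record to these exact image bytes
      digest = hashlib.sha256(open(image_path, 'rb').read()).hexdigest()
      return json.dumps({"sha256": digest, "author": author_id})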


I don't think it will be a big deal, for multiple different reasons: https://www.lesswrong.com/posts/uKp6tBFStnsvrot5t/what-dall-...


Eventually the only jobs humans will have is training AI to act human. Sounds very Philip K Dick now that I think about it.


The transition will be complete when some AI can fool/bribe the other AIs that its workers are human.


Good looking images will be popular, bad looking images will be disposed on the backyard of the internet. Even if next iterations of these models will be trained on AI-generated images, the dataset will be well filtered by how much people like those images. After all, that's the purpose of any art, right?


Maybe we'll go back to curated, directory-style search engines like early Yahoo. Could resolve many issues we see today, but I think the biggest question is scalability. Maybe some open-source, open-database system?


I think instead the images people want to put on the Internet will do the same for these models as adversarial training did for AlphaZero; it will learn what kinds of images engage human reaction.


It will not be limited to the internet. Have you looked at a magazine stand in the last 10 years? The content looks generated (not by AI) even today.

Cheap books, cheap TV and cheap music will be generated.


I also worry about the potential to further stifle human creativity, e.g. why paint that oil painting of a panda riding a bicycle when I could generate one in seconds?


Our imaginations are gigantic. We'll find something else impressive and engaging to do. Or not care. I'm not worried. Watch children: they find a way to play even when there is nothing.


One reason:

A digital picture of an oil painting != an actual oil painting

Of course once someone trains an AI with a robotic arm to do the actual painting, then your worry holds firm.


> Of course once someone trains an AI with a robotic arm to do the actual painting, then your worry holds firm.

It's been done, starting from plotter based solutions years ago, through the work of folks like Thomas Lindemeier:

https://scholar.google.com/citations?user=5PpKJ7QAAAAJ&hl=en...

Up to and including actual painting robot arms that dip brushes in paint and apply strokes to canvas today:

https://www.theguardian.com/technology/2022/apr/04/mind-blow...

The painting technique isn't all that great yet for any of these artbots working in a physical medium, but that's largely a general lack of dexterity in manual tool use rather than an art specific challenge. I suspect that RL environments that physically model the application of paint with a brush would help advance the SOTA. It might be cheaper to model other mediums like pencil, charcoal, or even airbrushing first, before tackling more complex and dimensional mediums like oil paint or watercolor.


Surely this already exists right?


I wonder if google images could just seed in some generated images when none relevant are found..


Adding a watermark to all AI generated images should be imperative.


Generating at 64x64 px and then upscaling probably gives the model a substantial performance boost (training speed/convergence) compared with working natively at 256x256 or 1024x1024 like DALL-E 2. Perhaps that approach to AI-generated art is the future.
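
That cascade is essentially what the paper describes: a frozen text encoder, a 64x64 base diffusion model, and two text-conditioned super-resolution stages. A rough sketch of the wiring (function and parameter names are placeholders, not a real API):

  def generate(prompt, text_encoder, base_64, sr_64_to_256, sr_256_to_1024):
      emb = text_encoder(prompt)                            # frozen T5-style embedding
      x64 = base_64.sample(cond=emb)                        # 64x64 base image
      x256 = sr_64_to_256.sample(cond=emb, low_res=x64)     # upsample to 256x256
      return sr_256_to_1024.sample(cond=emb, low_res=x256)  # final 1024x1024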


I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I am meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on, you don't get to just throw T5 in a multimodal model as-is and have it work better than multimodal transformers! VLM[1] at least added fine-tuned internal components.

Good lord we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.

[1] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a...


I firmly believe that ~20-40% of the machine learning community will say that all ML models are dumb statistical interpolators all the way until a few years after we achieve AGI. Roughly the same groups will also claim that human intelligence is special magic that cannot be recreated using current technology.

I think it’s in everyone’s benefit if we start planning for a world where a significant portion of the experts are stubbornly wrong about AGI. As a technology, generally intelligent ML has the potential to change so many aspects of our world. The dangers of dismissing the possibility of AGI emerging in the next 5-10 years are huge.


> The dangers of dismissing the possibility of AGI emerging in the next 5-10 years are huge.

Again, I think we should consider "The Human Alignment Problem" more in this context. The transformers in question are large, heavy and not really prone to "recursive self-improvement".

If the ML-AGI works out in a few years, who gets to enter the prompts?


Me.

... ... ...

Obviously "/s", obviously joking, but meant to highlight that there are a few parties that would all answer "me" and truly mean it, often not in a positive way.


A DAO.


These ML models aren't capable of generating novel thinking. They allow for extracting knowledge from an existing network. They cannot declare new ideas, identify how to validate them, and gather data and reach conclusions.


You should be much more concerned about the prospect of nuclear war right now than the sudden emergence of an AGI.


Is it really that simple?

We can worry about two things at once. We can be especially worried that at some point (maybe decades away, potentially years away), we'll have nuclear weapons and rampant AGI.


100 times this. There’s very little sign of AGI, but nuclear weapons exist, can definitely destroy the planet already, are designed to, have nearly done so in the past, and we’re at the most dangerous point in decades.


It’s just my opinion but I think the meme you’re talking about is deeply related to other branches of science and philosophy: ranging from the trusty old saw about AI being anything a computer hasn’t done yet to deep meditations on the nature of consciousness.

They’re all fundamentally anthropocentric: people argue until they are blue in the face about what “intelligent” means but it’s always implicit that what they really mean is “how much like me is this other thing”.

Language models, even more so than the vision models that got them funded, have empirically demonstrated that knowing the probability of two things being adjacent in some latent space is, at the boundary, indistinguishable from creating and understanding language.

I think the burden is on the bright hominids with both a reflexive language model and a sex drive to explain their pre-Copernican, unique place in the theory of computation rather than vice versa.

A lot of these problems just aren’t problems anymore if performance on tasks supersedes “consciousness” as the thing we’re studying.


I'd argue that there is probably at least one leap in terms of human-level writing which isn't just pure prediction. Humans write with intent, which is how we can maintain long run structure. I definitely write like GPT while I'm not paying attention, but with the executive on the task I outperform it. For all we know this is solvable with some small tweak to architecture, and I rather doubt that a model which has solved this problem need be conscious (though our own solution seems correlated with consciousness), but it is one more step.


I agree that intent is the missing piece so far. GPT can respond better to prompts than most people, but does so with a complete lack of intent. The human provides 100% of it.


I haven't been overly surprised by any of it. The final product is still the same, no matter how much they scale it up.

All of these models seem to require a human to evaluate and edit the results. Even Co-Pilot. In theory this will reduce the number of human hours required to write text or create images. But I haven't seen anyone doing that successfully at scale or solving the associated problems yet.

I'm pessimistic about the current state of AI research. It seems like it's been more of the same for many years now.


I think it's something like a very intelligent Borgesian Library of Babel. There are all sorts of books in there, by authors with conflicting opinions and styles, due to the source material. The librarian is very good at giving you something you want to read, but that doesn't mean it has coherent opinions. It doesn't know or care what's authentic and what's a forgery. It's great for entertainment, but you wouldn't want to do research there.

For image generation, it's obviously all fiction. Which is fine and mostly harmless if you know what you're getting. It's going to leak out onto the Internet, though, and there will be photos that get passed around as real.

For text, it's all fiction too, but this isn't obvious to everyone because sometimes it's based on true facts. There's often not going to be an obvious place where the facts stop and the fiction starts.

The raw Internet is going to turn into a mountain of this stuff. Authenticating information is going to become a lot more important.


Is there a way to try this out? DALL-E2 also had amazing demos but the limitations became apparent once real people had a chance to run their own queries.


Looks like no, "The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access."


> the risks of unrestricted open-access

What exactly is the risk?


See section 6 titled “Conclusions, Limitations and Societal Impact” in the research paper: https://gweb-research-imagen.appspot.com/paper.pdf

One quote:

> “On the other hand, generative methods can be leveraged for malicious purposes, including harassment and misinformation spread [20], and raise many concerns regarding social and cultural exclusion and bias [67, 62, 68]”


But do we trust that those who do have access won't be using it for "malicious purposes" (which they might not think is malicious, but perhaps it is to those who don't have access)?


It's not up to you. It's up to them, and they trust themselves/don't care about your definition of malicious.


If the model is used to generate offensive imagery, it may result in a negative press response directed at the company.


A variation on the axiom "you cannot idiot proof something because there's always a bigger idiot"


Really unpleasant content being produced, obviously.


"Make a photograph of Joe Biden in a hotel room bed with Kim Jong-un."

Simply the ease with which people are going to be able to make extremely realistic fake photographs is going to do some damage to the world. It's inevitable, but it might be good to postpone it.


> able to make extremely realistic fake photographs is going to do some damage to the world

I don't understand why. If someone has gone to a blockbuster movie in the last 15 years, they're very familiar with the concept of making people, sets, and entire worlds, that don't exist, with photorealistic accuracy. Being able to make fictitious photorealistic images isn't remotely a new ability, it's just an ability that's now automated.

If this is released, I think any damage would be extremely fleeting, as people pumped out thousands of these images, and people grow bored of them. The only danger is making this ability (to make false images) seem new (absolutely not) or rare (not anymore)!


The counter argument is that, by the time these models become available to the public, they will produce output that cannot be distinguished from real photos, so the damage will be even greater than if they became available today


The big thing I’m noticing over DALL-E is that it seems to be better at relative positioning. In an MKBHD video about DALL-E it would get the elements but not always in the right order. I know Google curated some specific images but it seems to be doing a better job there.


Totally—Imagen seems better at composition and relative positioning and text, while DALL-E seems better at lighting, backgrounds, and general artistry.


Yeah Dall-e looks amazing, to a mysterious degree even with hints of humour and irony, while imagen images look cheap, one dimensional and quite ugly to be honest.

Still amazing that we're at a point where that's the case, they're both incredible developments.


Does it do partial image reconstruction like DALL-E2? Where you cut out part of an existing image and the neural network can fill it back in.

I believe this type of content generation will be the next big thing or at least one of them. But people will want some customization to make their pictures “unique” and fix AI’s lack of creativity and other various shortcomings. Plus edit out the remaining lapses in logic/object separation (which there are some even in the given examples).

Still, being able to create arbitrary stock photos is really useful and i bet these will flood small / low-budget projects


I give it a few years before Google makes stock images irrelevant.


I really expect them to first make DALL-E and competing networks unfit for commercialization by providing the better choice for free, leaving stock companies crying in the corner, and then just sunset the product a year or two down the road, leaving us wondering what to do.


Rolling this into Google Docs seems like a nobrainer.


Or rolling this into Google Image Search to create images that match users' search queries on the fly.

Don't like any of the results from the real web? Well how about these we created just for you.


Ah yes, deepfakes porn as a service would have been a blessing for teenage me.


Google is very conservative about anything that can generate open-ended outputs. Also these models are still very expensive computationally.


They're expensive to train, but not awfully expensive to use. Especially if you have hundreds of images you want to generate (due to the way compute devices tend to get much more efficiency with a large batch size).

Google could totally afford it, especially if the feature was hidden behind a button the user had to click, and not just run for every image search.


When diffusion models are used, the inference time could be meaningful. But then this is only 64x64 with upsampling, so probably not too bad.


The input control is pretty hard - it kinda needs an AGI :). How do you stop undesirable images being created?


If I were running Google, I would release it with a disclaimer, and not do anything technical to prevent undesirable images being created.

How does Adobe prevent Photoshop being used to draw offensive images? They don't... People understand that a tool can be used for good and bad.


Until they pull the plug on it.


Tbh I imagine this tech combines particularly well with really well curated stock image databases, so outputs can be made with recognisable styles, and actors and design elements can be reused across multiple generated images.

If Getty et al aren't already spending money on that possibility, they probably should be.


The entire "content" industry could get eaten by a few hundred people curating + touching-up output from these models.


No, comparative advantage means that it’s impossible to run out of jobs just because someone/something is better at it than you.

(Consumer demand and boredom both being infinite is another thing working against it.)


Short Getty images?


Privately owned by the Getty family.


Really impressive. If we are able to generate such detailed images, is there anything similar for text-to-music? I would have thought it would be simpler to achieve than text-to-image.


Our language is much more effective at describing images than music.



why stop at audio? the pinnacle of this would be text-to-videos, equally indistinguishable from real thing.


The way things look when still is much easier to fake than the way things move.

I would expect AI development to follow a similar path to digital media generally, as its following the increasing difficulty and space requirements of digitally representing said media: text < basic sounds < images < advanced audio < video.

What’s more impressive to me is how far ahead text-to-speech is, but I think the explanation is straightforward (the accessibility value has motivated us to work on that for a lot longer).


Compare the size of a raw image file to a raw music file, to get an idea of the complexity difference.


Think sheet music, not an mp3


Fair enough, but that's a little dissimilar to what's being done with these images. These images are a per-pixel construction.


When will there be a "DALL-E for porn"? Or is this domain also claimed by Puritans and morality gatekeepers? The most in-demand text-to-image use case is porn.


Train it yourself. Danbooru is a publicly available explicit dataset.


This is not something you can train on a regular AWS gpu-instance without racking up millions of dollars in bills to my knowledge. Dataset isn't an issue its a capex issue.


It’s possible from scratch on not as much personally owned hardware as you’d think but will take a long time, months maybe.

Luckily, training from scratch will hopefully be obsoleted by fine-tuning - if someone else releases a generally capable model then you can turn that into another one for lower cost.


Probably just a frontend coding mistake, and not an error in the model, but in the interactive example if you select:

"A photo of a Shiba Inu dog Wearing a (sic) sunglasses And black leather jacket Playing guitar In a garden"

The Shiba Inu is not playing a guitar.


Found the QA tester.


Also, no sunglasses in "A photo of a raccoon wearing sunglasses and a red shirt riding a bike in a garden," and a few similar prompts (e.g. surfing).


There are visible “alignment” issues in some of their examples still. The marble koala DJ in the paper doesn’t use several of the keywords.

They have an example “horse riding an astronaut” that no model produces a correct image for. It’d be interesting if models could explain themselves or print the caption they understand you as saying.


Off topic, but this caught my attention:

“In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”

I work for a big org myself, and I’ve wondered what it is exactly that makes people in big orgs so bad at saying things.


I think they were being careful not to be too quotable there on CNN.


It really does look better than DALL-E, at least from the images on the site. Hard to believe how quickly progress is being made to lucid dreaming while awake.


Jesus Christ. Unlike DALL-E 2, it gets the details right. It also can generate text. The quality is insanely good. This is absolutely mental.


Yes, the posted results are really good, but since we can't play with it we don't know how much cherry picking has been done.


All of these AI findings are cool in theory. But until it's accessible to some decent number of people/customers, it's basically useless fluff.

You can tell me those pictures are generated by an AI and I might believe it, but until real people can actually test it... it's easy enough to fake. Judging by the URL, this page isn't even the remotest bit legit; it looks nicely put together and that's about it. You could easily have put this together with a graphic designer to fake it.

Let me be clear: I'm not actually saying it's fake, just that all of these new "cool" things are more or less theoretical if nothing is getting released.


Inference times are key. If it can't be produced within reasonable latency, then there will be no real world use case for it because it's simply too expensive to run inference at scale.


There are plenty of usecases for generating art/images where a latency of days or weeks would be competitive with the current state of the art.

For example, corporate graphics design, logos, brand photography, etc.

I really do think inference time is a red herring for the first generation of these models.

Sure, the more transformative use cases like real-time content generation to replace movies/games will need fast inference, but there is a lot of value to be created prior to that point.


There's been much prior work done to take these models down from datacenter size to single GPU size. Given continued work in that area and improving GPU performance it seems like it's just a matter of years before inference can be cheap and local for even the most impressive of generation.


I wondered why all the pictures at the top had sunglasses on, then I saw a couple with eyes. Still some work to do on this one.


Reading a relatively recent machine learning paper from some elite source: after multiple repetitions of bragging and puffery, in the middle of the paper, the charts show that they had beaten the score of a high-ranking algorithm in their specific domain, moving the best consistent result from roughly 86% accuracy to 88%. My response was: they got a lot of attention within their world by beating the previous score, no matter how small the improvement was. It was a "winner take all" competition against other teams close to them; accuracy of less than 90% is really of questionable value in a lot of real-world problems; and it was an enormous amount of math and effort for this team to make that small improvement.

What I see is a semi-poverty mindset among very smart people who appear to be treated such that the winners get promoted and everyone else is fired. This sort of ML analysis is useful for massive data sets at scale, where 90% is a lot of accuracy, but not for the small, real-world, human-scale problems where each result may matter a lot. The years of training these researchers had to go through to participate in this apparently ruthless environment amount to a lottery ticket, if you are in fact in a game where everyone but the winner has to find a new line of work. I think their masters live in Redmond, if I recall.. not looking it up at the moment.


What you're missing is that the performance on a pretext task like ImageNet top-1 will transfer outside ImageNet, and as you go further into the high score regime, often a small % can yield qualitatively better results because the underlying NN has to solve harder and harder problems, eliciting true solutions rather than a patchwork of heuristics.

Nothing in a Transformer's perplexity in predicting the next token tells you that at some point it suddenly starts being able to write flawless literary style parodies, and this is why the computer art people become virtuosos of CLIP variants and are excited by new ones, because each one attacks concepts in slightly different ways and a 'small' benchmark increase may unlock some awesome new visual flourish that the model didn't get before.


If you worked in a hospital and you managed to increase the survival rate from 86% to 88%, you too would be a hero.

Sure, it's only 2%, but if it's on a problem where everyone else has been trying to make that improvement for a long time, and that improvement means big economic or social gains, then it's worth it.


I like focusing on the failure rate instead - going from 14% to 12% is a pretty big jump.
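
Put another way, that's knocking out roughly a seventh of the remaining errors:

  print((0.14 - 0.12) / 0.14)   # ~0.143, i.e. about 14% fewer errors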


For people complaining that they can't play with the model... I work at Google and I also can't play with the model :'(


I think they address some of the reasoning behind this pretty clearly in the write-up as well?

> The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.

I can see the argument here. It would be super fun to test this model's ability to generate arbitrary images, but "arbitrary" also contains space for a lot of distasteful stuff. Add in this point:

> While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.

That said, I hope they're serious about the "framework for responsible externalization" part, both because it would be really fun to play with this model and because it would be interesting to test it outside of their hand-picked examples.


> harmful stereotypes and representations

So we can't have this model because of ... the mere possibility of stereotypes? With this logic, humans should all die, as we certainly encode some nasty stereotypes in our brains.

This level of dishonesty to not give back to the community is not unexpected at this point, but seeing apologists here is.


As Sadhguru said, the human experience comes from within.

Which means that it is always you that decides if you'll be offended or not.

Not to mention the weirdness that random strangers on the internet feel the need to protect me, another random stranger on the internet, from being offended. Not to mention that you don't need to be a genius to find pornography, racism and pretty much anything on the internet...

I'm really quite worried by the direction it's all going at. More and more the internet is being censored and filtered. Where are the times of IRC where a single refresh erased everything that was said~


> As Sadhguru said, the human experience comes from within.
>
> Which means that it is always you that decides if you'll be offended or not.

I have a friend who used to have an abuser who talked like that. Every time she said or did something that hurt him, it was his fault for feeling that way, and a real man wouldn't have any problem with it.

I'm all for mindfulness and metacognition as valuable skills. They helped me realize that a bad grade every now and then didn't mean I was lazy, stupid, and didn't belong in college.

But this argument that people should indiscriminately suppress emotional pain is dangerous. It entails that people ought to tolerate abuse and misuse of themselves and of other people. And that's wrong.


I think there is a huge difference between somebody willingly mistreating another person and that person taking offence to that, versus a company releasing an AI tool with absolutely no ill-intent, and then someone else making decisions as to what _I_ am allowed to see.


I think it's more that they don't want people creating NSFW images of copyrighted material. How do you even begin to protect against that litigation?


Should we ban Photoshop? It's a leap in logic but not very different.


I find this is a really bad precedent. Not far from now they'll achieve "super human" level general AI, and be like "yeah it's too powerful for you, we'll keep it internal".


This is definitely how it will play out for whoever creates AGI first (and second). After all, they are investing a lot of money and resources, and a "super human" AGI is very likely worth an unimaginable amount.

Also, given the processing power and data requirements to create one, there are only a few candidates out there who can get there firstish.


wouldn't that be a good motivation to work there (and achieve a high enough position to have access to the model)?


Good thing there is a company committed to Open Sourcing these sorts of AI models.

Oh wait.

Google: "it's too dangerous to release to the public"

OpenAI: "we are committed to open source AGI but this model is too dangerous to release to the public"


I mean, I don't know how that makes it any better from a reproducibility standpoint lol


I mean, inference on this costs no small amount of money.

I don't think they would host this just for fun.


How does that make you feel?


Probably like an employee


off-topic: as a google employee do you have unlimited gce credits?


Is broccoli man standing in the way? :(


is your team/division hiring?


Every tech megacorp is always hiring people who can jump through the flaming code hoops just right


How the fck are things advancing so fast? Is it about to level off …or extend to new domains? What’s a comparable set of technical advances?


This video by Juergen Schmidhuber discusses the acceleration of AI progress:

https://youtu.be/pGftUCTqaGg


Bigger model = better because a lot of performance at this task is memorization or the “lottery ticket hypothesis”.

An impressive advance would be a small model that’s capable of working from an external memory rather than memorizing it.


This looks incredible but I do notice that all the images are of a similar theme. Specifically there are no human figures.


I believe DALLE and likely this model excluded images of people so it could not be misused


Interesting, I had not understood that!


> At this time we have decided not to release code or a public demo.

Oh well.


Used some of the same prompts and generated results with open source models. The model I am using fails on long prompts but does well on short, descriptive prompts. Results:

https://imgur.com/gallery/6qAK09o


Interesting and cool technology, but I can't ignore that every high-quality AI art application is closed, and I don't buy the ethics excuse for that. The same was said of GPT, yet I see nothing but creativity coming from its users nowadays.


That only lasts until the community copies the paper and catches up. For example the open source DALLE-2 implementation is coming along great: https://github.com/lucidrains/DALLE2-pytorch


Imagen actually shows that some of the components in DALLE2 are unnecessary, so Imagen will end up being easier to build. I'll definitely add the dynamic thresholding trick from Imagen to the DALLE2 repository though; that is a finding that should boost any DDPM using classifier-free guidance.
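For anyone curious, here's a minimal sketch of the trick as the Imagen paper describes it (the function name and the p=0.995 default are my own choices, not code from either repo): at each sampling step, clip the predicted x0 to a per-image percentile of its absolute pixel values instead of a fixed [-1, 1] range, then rescale.

  import torch

  def dynamic_threshold(x0_hat: torch.Tensor, p: float = 0.995) -> torch.Tensor:
      # x0_hat: the model's predicted clean image at the current sampling
      # step, shape (batch, channels, height, width), nominally in [-1, 1].
      flat = x0_hat.abs().reshape(x0_hat.shape[0], -1)
      # Per-image percentile of absolute pixel values.
      s = torch.quantile(flat, p, dim=1)
      # Never threshold more aggressively than the usual [-1, 1] clip.
      s = s.clamp(min=1.0).view(-1, 1, 1, 1)
      # Clip to [-s, s], then rescale back into [-1, 1].
      return torch.maximum(torch.minimum(x0_hat, s), -s) / s

High guidance weights push predicted pixels outside [-1, 1]; simply clipping them tends to give oversaturated, washed-out detail, while dynamic thresholding pushes saturated pixels back inward, which is why it should help any sampler using large classifier-free guidance weights.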


Thanks for all your work on these projects!


I don't buy the ethics, but I do buy the obvious PR nightmare that would inevitably follow if journalists could play with this and immediately publish their findings of "racist imagery generated by Google's AI". That's all it's about, and our complaining is not going to make them change their minds.


There already is an article calling DALL-E racist and it isn't even public yet. Just imagine what horrible things the general public will get it to spit out.


Then they should be honest about it. They can use all the PR lingo they want but don't flat out lie about it.

Lying about ethics or misattributing their actions to some misguided sense of "social" responsibility puts Google in a far worse light in my eyes. I can't help but wonder how many skilled employees were put off from accepting a position at Google because of lies like these.


GPT-3 was an erotica virtuoso before it was gagged. There's a serious use case here in endless porn generation. Google would very much like to not be in that business.

That said, you can download Dream by Wombo from the app store and it is one of the top smartphone apps, even though it is a few generations behind state of the art.


It is a shame that such powerful AI had to arrive during an age of prudishness that makes the Victorians seem wild.


The current GPT3 on the OpenAI dashboard is perfectly capable of generating erotica or being racist even if you don’t want it to. They didn’t block it so much as put up a warning dialog when it’s acting up.

Actually, I think they made InstructGPT even better at erotica because it’s trained to be “helpful and friendly”, so in other words they made it a sub.


"Google would very much like to not be in that business."

Google is not a hobby project anymore: "don't be evil" or whatever they wittered on about back in the day.


I imagine it's viewing porn not as "evil" but as "something we should absolutely never even come close to touching or talking about and is best left as something we pretend doesn't exist so we don't get regulated out of existence"


You're aware of nothing but creativity from its users. The people using the technology unethically intentionally don't advertise that they're using it.

There's mountains of ai-generated inauthentic content that companies (including Google) have to filter out of their services. This content is used for spam, click farms, scamming, and even state propaganda operations. GPT-2 made this problem orders of magnitude worse than it used to be, and each iteration makes it harder to filter.

The industry term is (generally) "Coordinated Inauthentic Behavior" (though this includes uses of actual human content). I think Smarter Every Day did a good video (series?) on the topic, and there are plenty of articles if you prefer that.


The AI ethics thing is just a PR larp at this point.

“Oh our tech is so dangerous and amazing it could turn the world upside down” yet we hand it to random Bluechecks on Twitter.

It’s just marketing


You know Twitter and Google are different companies, right?


The commenter was probably referring to the fact that the people who tend to get access to things like GPT3 or DALL-E 2 tend to be people with large Twitter followings (and blue checks), and that there may be a significant marketing component to this fact.


And that's why we need to be shutting this stuff down, completely.

Someone tried to say there were ethics committees etc. the other day... what a bad joke. Who checks that the ethics committee is making ethical decisions?

I was told I "didn't know what" I was talking about, an excuse from some self-important know-it-all who didn't know what ethics was; i.e., they didn't know what they were talking about.


check out open source alternative dalle-mini: https://huggingface.co/spaces/dalle-mini/dalle-mini


Thanks for the link. Doesn't appear to be very good yet, though.


Granted that's a selection bias: you likely won't hear about the cases where legit obscene output occurs. (the only notable case I've heard is the AI Dungeon incident)


What is the AI dungeon incident?


https://www.vice.com/en/article/93ywpp/text-adventure-game-c...

TL;DR generative story site creators employ human moderation after horny people inevitably use site to make gross porn; horny people using site to make regular porn justifiably freaked out

Bring your popcorn


I always felt like the AI Dungeon response is absurdly stupid. Yeah, erotic text involving minors is not exactly something you want to be associated with, but I've heard that one of their approaches to avoiding it was to ban numbers below 18. That's... not worth it.

I feel like it would've been more than reasonable for them to have taken the position that the AI might output something distasteful, and implement a filter for people who were afraid of it.


AI is for porn



Another consideration here is that hosting a queryable model like this gets expensive. I remember a couple of years ago a lone developer had to take down his site, which hosted a freely accessible version of a GPT-2 (3?) model, because the bills were running to some $20k. (Chump change for Google, but still.)


I think a lot of people would be perfectly happy to pay per query to access this.


Running inference on one of these models takes like a GPU minute, so they can't just let the public use them.


They absolutely could if they charged the public the cost of GPU time.
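Back of the envelope (my numbers, rough cloud list prices, not anything Google has published): if an A100 runs roughly $3/hour on demand, a GPU-minute of inference is on the order of $0.05, so even a $0.25-per-image price would leave a comfortable margin.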


Can't be this; Google Colab gives out tons of free GPU usage.


Google has a lot of GPUs, but even so Colab seems like it’s a lot cheaper than it should be. You can get some very good GPUs on the paid plan.


Ethics, racism, LGBT, blah, blah. If we're going to talk about political correctness, I really suggest you go somewhere else instead of staying on Hacker News. AI-generated porn is better than making people who don't want to do porn do it themselves.


Jesus, this is so awesome. I think it’s the first AI that really makes me have that “wow” sensation.


Would it be bad to release this with a big warning and flashing gifs letting people know of the issues it has and note that they are working to resolve them / ask for feedback / mention difficulties related to resolving the issues they identified?


>Figure 2: Non-cherry picked Imagen samples

Hooray! Non-cherry-picked samples should be the norm.


I'll be skeptical until I see it in action, rather than pre-selected results.


Would I have to implement this myself, or is there something ready to run?


I think implementing this yourself is likely not doable unless you have the computing resources of a Google, Amazon or Facebook.


It seems like lucidrains is currently working on an implementation [1] of it.

I would love it.

[1] https://github.com/lucidrains/imagen-pytorch


This is absolutely amazingly insane. Wow.


Nice to see another company making progress in the area. I'd love to see more examples of different artistic styles though, my favorite DALL-E images are the ones that look like drawings.


I find it a bit disturbing that they talk about the social impact of totally imaginary pictures of raccoons.

Of course, working in a golden lab at Google may twist your views on society.


Oh, I would say they are probably underestimating the impact. You only saw the images they thought couldn't raise alarm bells. Anyone will be able to create photorealistic images of anyone doing anything. Anything! This is certainly a dangerous and society-altering tech. It won't all be teddy bears and raccoons playing poker.


Primarily Indian origin authors on both the DALL-E and this research paper. Just found that impressive considering they make up 1% of the population in the US.


I'M SQUEEZING MY PAPER!


It seems to have the same "adjectives bleed into everything problem" that Dall-E does.

Their slider with examples at the top showed a prompt along the lines of "a chrome plated duck with a golden beak confronting a turtle in a forest" and the resulting image was perfect - except the turtle had a golden shell.


What's the limiting factor for model replication by others? Amount of compute? Model architecture? Quality/quantity of training data? Would really appreciate insights on the subject.


Also, almost 40 years ago, "Imagen" was the name of a laser printer capable of 200 dpi.

Almost there; the Apple LaserWriter nailed it at 300 dpi.

I sometimes sneaked an issue of the "SF-Lovers Digest" in between code printouts.


What's the best open source or pre-trained text to image model?


latent diffusion model trained on LAION-400M https://github.com/CompVis/latent-diffusion


Thank you, this is great.


This would generate great music videos for Bob Dylan songs. I'm thinking Gates of Eden, ".. upon four legged forest cloud, the cowboy angel rides" :D


I'm curious why all of these tools seem to be almost tailored toward making meme images?

The kind of early-2010s, over-the-top description of something that's ridiculous.


My hunch is that they aren't tailored toward ridiculous images exactly, but if they demonstrated "a woman sitting in a chair reading", it would be really hard to tell if the result was a small modification of an image in the training data. If they demonstrate "A snake made out of corn", I have less concern about the model having a very close training example.


These things can make any image you can define in terms of a corpus of other images. That was true at lower resolution five years ago.

To the extent that they get used for making bored ape images or whatever meme du jour, it says much more about the kind of pictures people want to see.

I personally find the weird deep dreaming dogs with spikes coming out of their heads more mathematically interesting, but I can understand why that doesn’t sell as well.


Next phase of all this: image-to-3D-printable template files compatible with various commercially available printers.

Print me a raccoon in a leather jacket riding a skateboard.


I get the impression that maybe DALL-E 2 produces slightly more diverse images? Compare Figure 2 in this paper with Figures 18-20 in the DALL-E 2 paper.


One thing I find particularly fascinating is that all the elements of the resulting image have a relatively cohesive art style.


This is super cool and I want to play with it.


Tangentially related question: what is the best (latest?) such network uploaded to Colab that one can toy with?


Looking at the example pictures, it seems that this model has trouble putting sunglasses on a raccoon.


Would be awesome to see a side by side comparison to DALL-E, generating from the same text


It's in the PDF.


Seeing the artificial restrictions to this model as well as to DALL-E 2, I can't help but ask myself why the porn industry isn't driving its own research. Given the size of that industry and the sheer abundance of training material, it seems just a matter of time until you can create photo realistic images of yourself with your favourite celebrity for a small fee. Is there anything I am missing? Can you only do this kind of research at google or openai scale?


Porn is actually a really good litmus test to see if a money/media transfer technology has real promise. Pornography needs exactly 2 things to work well - a way to deliver media, and a way to collect money. If you truly have a system that can do one of those two things better than we currently can, and it's not just empty hype, it will be used for porn. "Empty hype" won't touch that stuff, but real-world usecases will.

Unrelated to the main topic, but this is exactly why I think cryptocurrencies will only be used for illegal activities, or things you may want to hide, and nothing else. Because that's where they have found their use case in porn.


Wow. The last paragraph of my comment looked nearly identical to yours, but I deleted it before submitting because I didn't want to derail. Exactly my thoughts...


This is completely outside the competency of the current porn industry.

You gave an example of a still image, but it's going to end up with an AI generating a full video according to a detailed text prompt. The porn industry is going to be utterly destroyed.


Transfer learning is a thing.

But I have not tried making generative models with out-of-distribution data before, i.e., distributions other than the main training data.

There are several indie attempts that I am aware of. I'm listing them in a reply to this comment (in case this comment gets deleted).

The first layers should be general, but the later layers probably won't behave well on porn images, as they are more specialized layers learning distribution-specific visual patterns.

Transfer learning is possible.
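A minimal sketch of what I mean, using a classifier backbone for concreteness (the model choice and layer split are just illustrative; the same freeze-early / fine-tune-late idea applies to generative backbones):

  import torch
  import torchvision

  # Generically pretrained backbone: early layers learn general features.
  model = torchvision.models.resnet50(pretrained=True)

  # Freeze everything first...
  for param in model.parameters():
      param.requires_grad = False

  # ...then unfreeze the later, distribution-specific layers plus a fresh
  # head so they can adapt to the new data distribution.
  for param in model.layer4.parameters():
      param.requires_grad = True
  num_classes = 2  # placeholder for whatever the new task needs
  model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

  # Only the unfrozen parameters get updated during fine-tuning.
  optimizer = torch.optim.Adam(
      (p for p in model.parameters() if p.requires_grad), lr=1e-4)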


1. DeepCreamPy draws over hentai censor bars if you tell it where the bar is: https://github.com/gguilt/DeepCreamPy

2. hentAI automates the process: https://github.com/natethegreate/hent-AI

3. [NSFW] You should look at this person on Twitter: https://twitter.com/nate_of_hent_ai

4. [NSFW] PornHub released vintage porn videos upscaled to 4K with AI a while back. They called it the "Remastured Project": https://www.pornhub.com/art/remastured

5. [NSFW] This project shows the limits of AI-without-big-tech-or-corporate-support projects. It creates female genitalia that don't exist in the real world. The project is "This Vagina Does Not Exist": https://thisvaginadoesnotexist.com/about.html


Metaculus, a mass forecasting site, has steadily brought forward its predicted date for a weakly general AI. Jaw-dropping advances like this only increase my confidence in that prediction. "The future is now, old man."

https://www.metaculus.com/questions/3479/date-weakly-general...


I don't see how this gets us (much) closer to general AI. Where is the reasoning?


I think this serves at least as a clear demonstration of how advanced the current state of AI is. I had played with GPT-3 and that was very impressive, but I couldn't even dream that something as good as DALL-E 2 was already possible.


Big pretrained models are good enough now that we can pipe them together in really cool ways and our representations of text and images seem to capture what we “mean.”


Yeah, it seems like it. But it's still just complicated statistical models. Again, where is the reasoning?


I still think we're missing some fundamental insights on how layered planning/forecasting/deducting/reasoning works, and that figuring this out will be necessary in order to create AI that we could say "reasons".

But with the recent advances/demonstrations, it seems more likely today than in 2019 that our current computational resources are sufficient to perform magnificently spooky stuff if they're used correctly. They are doing that already, and that's without deliberately making the software do anything except draw from a vast pool of examples.

I think it's reasonable, based on this, to update one's expectations of what we'd be able to do if we figured out ways of doing things that aren't based on first seeing a hundred million examples of what we want the computer to do.

Things that do this can obviously exist, we are living examples. Does figuring it out seem likely to be many decades away?


That's a well-balanced response that I can agree with.

I'm not an AGI skeptic. I'm just a bit skeptical that the topic of this thread is the path forward. It seems to me like an exotic detour.

And, of course, intelligence isn't magic. We're producing new intelligent entities at a rate of about 5 per second globally, every day.

> Does figuring it out seem likely to be many decades away?

1-7?


All it takes is one 'trick' to give these models the ability to do reasoning.

Like, for example, the discovery that language models get far better at answering complex questions if asked to show their working step by step with chain-of-thought reasoning, as on page 19 of the PaLM paper [1]. It's worth checking out the explanations of novel jokes on page 38 of the same paper. While it is, like you say, all statistics, if it's indistinguishable from valid reasoning, then perhaps it doesn't matter.

[1]: https://arxiv.org/pdf/2204.02311.pdf
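A toy illustration of the kind of prompting that helps (my own made-up example, not taken from the paper):

  Q: A juggler has 16 balls. Half of the balls are golf balls, and half
     of the golf balls are blue. How many blue golf balls are there?
  A: Let's think step by step. Half of 16 balls is 8 golf balls. Half of
     8 golf balls is 4. So there are 4 blue golf balls.

Without the "think step by step" scaffolding, models often jump straight to a wrong number; with the intermediate steps written out, the final answer is far more likely to be right.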


I don’t care whether it reasons its way from “3 teddy bears below 7 flamingos” to a picture of that or if it gets there some other way.

But also, some of the magic in having good enough pretrained representations is that you don’t need to train them further for downstream tasks, which means non-differentiable tasks like logic could soon become more tenable.


A belief oft shared is that sufficiently complicated statistical models are indistinguishable from reasoning.


Perhaps the confluence of NLP and something generative?


That doesn’t even lead in the direction of an AGI. The larger and more expensive a model is the less like an “AGI” it is - an independent agent would be able to learn online for free, not need millions in TPU credits to learn what color an apple is.


Yes, Metaculus mostly bets a magic number based on "perhaps", and tbh why not; the interaction of NLP and vision is mysterious and has potential. However, those magic numbers should still be considered magic numbers. I agree that by 2040 the interactions will have been extensively studied, but whether we can go much further on cross-model synergies is totally unknown, or the outlook is pessimistic.


"The future is already here — It’s just not very evenly distributed"


How can we prepare for this?

This will result in mass social unrest.


I think the serious answer is that it is yet another labor multiplier, like electricity and software. Our tech since the industrial revolution has allowed us to elevate ourselves from a largely agrarian society to space and cyberspace. AI, by all appearances, continues to be a tool, just the latest in a long line of better tools. It still requires a human to provide intent and direction. Right now in my job, I command the collective output of a million medieval scribes. In the future I will command a million Michelangelos.

Should ML/AI deliver on the wildest promises, it will be like a SpaceX Starship for the mind.


Well, anyone over 40 will be fucked. There goes your utopia.


Computers didn't fuck anyone over 40, but they did create new opportunities for young people that slowly took over the labor market and provided a steady stream of productivity growth. Right now these are impressive benchmarks and neat toys that cost millions to train. This is going to be a slow transition to a new paradigm. We are not going to end up in a utopia any more than computers created a utopia.


No because once this is live, creating private (teaching) assistants and good UX will be cheaper.


You think so? I'm very high on the Kool-Aid; image generation and text transformation models are core parts of my workflow (Midjourney, GPT-3).

It's still an unruly 7 year old at best. Results need to be verified. Prompt engineering and a sense of creativity are core competencies.


> Prompt engineering and a sense of creativity are core competencies.

It's funny that people are also prompting each other. Parents, friends, teachers, doctors, priests, politicians, managers and marketers are all prompting (advising) us to trigger desired behaviour. Powerful stuff - having a large model and knowing how to prompt it.


Stock up on guns, ammo, cigarettes, water filters, canned food, and toilet paper.


Nah, learn Spanish and first-aid. Being able to fix people is more useful than having commodities that will make you a target.


Is the source in public domain already?


Could there exist a quine for this?


Is there anything at all, besides the training images and labels, that would stop this from generating a convincing response to "A surveillance camera image of Jared Kushner, Vladimir Putin, and Alexandria Ocasio-Cortez naked on a sofa. Jeffrey Epstein is nearby, snorting coke off the back of Elvis"?


- The current examples aren’t convincing pictures of “a shiba inu playing a guitar”.

- If you made that picture with actors or in MS Paint, politics boomers on Facebook wouldn’t care either way. They’d just start claiming it’s real if they like the message.


Another ad for DALL-E.


so cool


Why is this seemingly official Google blog post on this random non-Google domain?


You mean one of Google's domains?

  # whois appspot.com
  [Querying whois.verisign-grs.com]
  [Redirected to whois.markmonitor.com]
  [Querying whois.markmonitor.com]
  [whois.markmonitor.com]
  Domain Name: appspot.com
  Registry Domain ID: 145702338_DOMAIN_COM-VRSN
  Registrar WHOIS Server: whois.markmonitor.com
  Registrar URL: http://www.markmonitor.com
  Updated Date: 2022-02-06T09:29:56+0000
  Creation Date: 2005-03-10T02:27:55+0000
  Registrar Registration Expiration Date: 2023-03-10T00:00:00+0000
  Registrar: MarkMonitor, Inc.
  Registrar IANA ID: 292
  Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
  Registrar Abuse Contact Phone: +1.2086851750
  Domain Status: clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)
  Domain Status: clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)
  Domain Status: clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)
  Domain Status: serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)
  Domain Status: serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)
  Domain Status: serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)
  Registrant Organization: Google LLC
  Registrant State/Province: CA
  Registrant Country: US
  Registrant Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
  Admin Organization: Google LLC
  Admin State/Province: CA
  Admin Country: US
  Admin Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
  Tech Organization: Google LLC
  Tech State/Province: CA
  Tech Country: US
  Tech Email: Select Request Email Form at https://domains.markmonitor.com/whois/appspot.com
  Name Server: ns4.google.com
  Name Server: ns3.google.com
  Name Server: ns2.google.com
  Name Server: ns1.google.com


While appspot.com is a Google domain, anyone can register domains under it. It would be similarly surprising to see an official GitHub blog post under someproject.github.io


You mean like: https://say-can.github.io/

This is common in the research PA (product area). People don't want to deal with broccoli man [1].

[1] https://www.youtube.com/watch?v=3t6L-FlfeaI


Looking at that link, I don't think that is a GitHub publication? It is marked Robotics at Google and Everyday Robotics.


My bad, it's a google-specific problem.


Fun fact: appspot.com was the second "private" suffix to be added to the Public Suffix List, after operaunite.com: https://bugzilla.mozilla.org/show_bug.cgi?id=593818


This is quite suspicious considering that google AI research has an official blog[1], and this is not mentioned at all there. It seems quite possible that this is an elaborate prank.

1: https://ai.googleblog.com/


I'm not certain, but I think it's prerelease. The paper says the site should be at https://imagen.research.google/ but that host doesn't respond.


appspot.com is the domain that hosts all App Engine apps (at least those that don't use a custom domain). It's kind of like Heroku and has been around for at least a decade.

https://cloud.google.com/appengine


Spring 2008: 14 years!


Whoa, I feel super old, I first used it in 2011 when I thought it was new.


Not just that ... Google Sheets must be the all-time worst way to distribute 200 short strings.


IIRC appspot.com is used by App Engine, one of the earliest PaaS offerings provided by Google.


This competitor might be better at respecting spatial prepositions and photorealism, but on a quick look I find the images more uncanny. DALL-E has, IMHO, better camera POV/distance and is able to make artistic/dreamy/beautiful images. I haven't yet seen this Google model be competitive on art and uncanniness. However, progress is great and I might be wrong.


Note that there was a comparable model in 2021, ignored by all: https://paperswithcode.com/sota/text-to-image-generation-on-... (on this benchmark). Also, what is the score of DALL-E v2?


Does it outperform DALL-E V2?


Certificate is expired, anyone have a mirror?


Ok. Now, how about the legality of it generating socially unacceptable images like child porn?


Hey I also wrote a neural net that generates perfect images. Here's a static site about it. With images it definitely generated! Can you use it? Is there a source? Hah, of course not, because ethics!


OpenAi really thought they had done something with DALL-E, then Google's all "hold my beer".


OpenAI*



