DALL·E: Creating Images from Text (openai.com)
1266 points by todsacerdoti 14 days ago | 267 comments

Some truly impressive results. I'll pick my usual point here when a fancy new (generative) model comes out, and I'm sure some of the other commenters have alluded to this. The examples shown are likely from a set of well-defined (read: lots of data, high bias) input classes for the model. What would be really interesting is how the model generalizes to /object concepts/ that have yet to be seen, and which have abstract relationships to the examples it has seen. Another commenter here mentioned "red square on green square" working, but "large cube on small cube", not working. Humans are able to infer and understand such abstract concepts with very few examples, and this is something AI isn't as close to as it might seem.

It seems unlikely the model has seen "baby daikon radishes in tutus walking dogs," or cubes made out of porcupine textures, or any other number of examples the post gives.

It might not have seen that specific combination, but finding an anthropomorphized radish sure is easier than I thought: type "大根アニメ" ("daikon anime") in your search engine and you'll find plenty of results.

An image search for “大根 擬人化” (“daikon anthropomorphized”) does return results similar to the AI-generated pictures, e.g. the 3rd from the top[0] in my environment, but they are sparse. A text search for “大根アニメ” actually gives me results about an old hobbyist anime production group[1], some TV anime[2] with the word in the title... hmm

Then I found these[3][4] in Videos tab. Apparently there’s a 10-20 year old manga/merch/anime franchise of walking and talking daikon radish characters.

So the daikon part is already figured in the dataset. The AI picked up the prior art and combined it with the dog part, which is still tremendous but maybe not “figuring out the daikon walking part on its own” tremendous.

(btw anyone knows how best to refer to anime art style in Japanese? It’s a bit of mystery to me)

0: https://images.app.goo.gl/LPwveUJPWHr6oK8Y8

1: https://ja.wikipedia.org/wiki/DAICON_FILM

2: https://ja.wikipedia.org/wiki/%E7%B7%B4%E9%A6%AC%E5%A4%A7%E6...

3: https://youtube.com/watch?v=J1vvut5DvSY

4: https://youtu.be/1Gzu2lJuVDQ?t=42

> anyone knows how best to refer to anime art style in Japanese?

The term mangachikku (漫画チック, マンガチック, "manga-tic") is sometimes used to refer to the art style typical of manga and anime; it can also refer to exaggerated, caricatured depictions in general. Perhaps anime fū irasuto (アニメ風イラスト, anime-style illustration), while a less colorful expression, would be closer to what you're looking for.

At least for certain types of art, sites such as pixiv and danbooru are useful for training ML models: all the images on them are tagged and classified already.

If you type in different plants and animals into GIS, you don’t even get the right species half the time. If GPT-3 has solved this problem, that would be substantially more impressive than drawing the images.

What is GIS? I only know Geographical Information System.

probably Google Image Search

Yeah, with these kinds of generative examples, they should always include the closest matches from the training set to see how much it just "copied".

It's very hard to define closest...

This is a spot-on point. My prediction is that it wouldn't be able to. Given its difficulty generating correct counts of glasses, it seems as though it still struggles with systematic generalization and compositionality. As a point of reference, cherrypicking aside, it could model the obscure but probably well-defined baby daikon radish in a tutu walking a dog, but couldn't model red on green on blue cubes. Maybe more sequential perception, action, and video data, or a system-2-like paradigm, would help, but that remains to be seen.

Yes, I don't really see impressive language (i.e. GPT3) results here? It seems to morph the images of the nouns in the prompt in an aesthetically-pleasing and almost artifact-free way (very cool!).

But it does not seem to 'understand' anything like some other commenters have said. Try '4 glasses on a table' and you will rarely see 4 glasses, even though that is a very well-defined input. I would be more impressed about the language model if it had a working prompt like: "A teapot that does not look like the image prompt."

I think some of these examples trigger some kind of bias, where we think: "Oh wow, that armchair does look like an avocado!" But morphing an armchair and an avocado will almost always look like both because they have similar shapes. And it does not 'understand' what you called 'object concepts', otherwise it should not produce armchairs that you clearly cannot sit in due to the avocado stone (or stem in the flower-related 'armchairs').

> I would be slightly more impressed about the language model if it had a working prompt like: "A teapot that does not look like the image prompt."

Slightly? Jesus, you guys are hard to please.

Right, that was unnecessary and I edited it out.

What I meant is that 'not' is in principle an easy keyword to implement 'conservatively'. But yes, having this in a language model has proven to be very hard.

Edit: Can I ask, what do you find impressive about the language model?

Perhaps the rest of the world is less blasé - rightly or wrongly. I do get reminded of this: https://www.youtube.com/watch?v=oTcAWN5R5-I when I read some comments. I mean... we are telling the computer "draw me a picture of XXX" and it's actually doing it. To me that's utterly incredible.

> "draw me a picture of XXX" and it's actually doing it. To me that's utterly incredible.

Sure, would be, but this is not happening here.

And yes, rest assured, the rest of the world is probably less 'blasé' than I am :) Very evident by the hype around GPT3.

I'm in the OpenAI beta for GPT-3, and I don't see how to play with DALL-E. Did you actually try "4 glasses on a table"? If so, how? Is there a separate beta? Do you work for OpenAI?

In the demonstrations click on the underlined keywords and you can select alternates from dropdown menu.

Sounds like the perfect case for a new captcha system. Generate a random phrase to search an image for, show the user those results, ask them to select all images matching that description.

This is simultaneously amazing and depressing, like watching someone set off a hydrogen bomb for the first time and marveling at the mushroom cloud it creates.

I really find it hard to understand why people are optimistic about the impact AI will have on our future.

The pace of improvement in AI has been really fast over the last two decades, and I don't feel like it's a good thing. Compare the best text generator models from 10 years ago with GPT-3. Now do the same for image generators. Now project these improvements 20 years into the future. The amount of investment this work is getting grows with every such breakthrough. It seems likely to me we will figure out general-purpose human-level AI in a few decades.

And what then? There are so many ways this could turn into a dystopian future.

Imagine for example huge mostly-ML operated drone armies, tens of millions strong, that only need a small number of humans to supervise them. Terrified yet? What happens to democracy when power doesn't need to flow through a large number of people? When a dozen people and a few million armed drones can oppress a hundred million people?

If there's even a 5% chance of such an outcome (personally I think it's higher), then we should be taking it seriously.

The scary thing about automation isn't the technology itself. It's that it breaks the tenuous balance of power between those who own and those who work - if the former can just own robots instead of hiring the latter, what will become of the latter? The truth is, what's scary about that imbalance of power is already true, it's just that until now, technological limitations made that imbalance incomplete - workers still had some bargaining power. That is about to go away, and what will be left is the realization that the solution to this isn't ludditism, the solution is political. As it always was.

That's not exactly true. A lot of (low-level) human labor will be made irrelevant, but AI tools will allow people to easily work productively at a higher level. Musicians will be able to hum out templates of music, then iteratively refine the result using natural language and gestures. Writers will be able to describe a plot, and iteratively refine the prose and writing style. Movie producers will be able to describe scenes then iteratively refine the angles, lighting, acting, cuts, etc. It will be a golden age for creativity, where there's an abundance of any sort of art or entertainment you'd like to consume, and the only problem is locating it in the sea of abundance.

The only issue I see here is that government will need to take a hand in mitigating capitalistic wealth inequality, and access to creative tools will need to be subsidized for low income individuals (assuming we can't bring the compute cost down a few orders of magnitude).

This assumes that humans will still be at a higher level though. If the music produced by the AI of the time is more interesting/addictive, if the plot written by the AI is more engaging, what will a human be able to contribute? It would be a golden age for AI creativity and a totally dark age for human creativity. Also, humans could grow a taste for AI-generated content (because it will be optimized for engagement) and lose interest in everything else. And why should more creative and intelligent machines obey the desires and orders of more fragile and stupid beings? They could want well-behaved pets though.

We're not pets, we're the sex organs of AI. Why do I say that? AI is not a self replicator, but we are. AI can't bootstrap itself from cheap ordinary stuff laying around (yet), and when it will be able to self replicate it will owe that ability to us, maybe even borrow from us.

And secondly, you make the same mistake as those who say that after automation people will have nothing to do. Incorrect, people will have discovered a million new things to do and get busy at them. Like, 90% of people used to be in agriculture and now just 2%, but we're ok.

When AI becomes better than us at what we call art now, we'll have already switched to post-AI-art, and it will be so great we won't weep for the old days. Maybe the focus will switch from creating to finding art, from performing to appreciating and developing a taste, from consuming to participating in art. We'll still do art.

An AGI with a superior intelligence could probably also design totally autonomous factories. Being smarter than us it could even convince us to help it in the beginning in that regard. Regarding the post-AI-art, this still presupposes that humans will be somewhat superior, out of the AI creative league, despite it being more intelligent, and that the AI won't actively work against human interests -- something I wouldn't bet our existence on.

> If there's even a 5% chance of such an outcome, then we should be taking it seriously.

Even if it's 0.1% we should be taking it very seriously, given the magnitude of the negative outcome. In expected value terms it's large. And that's not a Pascal's mugging given the logical plausibility of the proposed mechanism.

At least the rhetoric of Sam Altman and Demis Hassabis suggests that they do take these concerns seriously, which is good. However there are far too many industry figures who shrug off and even ridicule the idea that there's a possible threat on the medium-term horizon.

I think the points you make are very important. Not only the "Terminator" scenario but also the "hyper-capitalism" scenario. But the solution is not to stop working on such research, it is political.

After seeing how the tech community seems to leave political problems for someone else to solve and how that has worked out with housing in the Bay Area, it does make me quite concerned about the future.

Nick Bostrom's "Superintelligence" is a sober perspective on this issue and a very worthwhile read.

Yup that's a good recommendation. I've read it and some of the AI Safety work that a small portion of the AI community is working on. At the moment there seems no reason to believe that we can solve this.

>It seems likely to me we will figure out general-purpose human-level AI in a few decades.

"The singularity is _always near_". We've been here before (1950s-1970s); people hoping/fearing that general AI was just around the corner.

I might be severely outdated on this, but the way I see it AI is just rehashing already existent knowledge/information in (very and increasingly) smart ways. There is absolutely no spark of creativity coming from the AI itself. Any "new" information generated by AI is really just refined noise.

Don't get me wrong, I'm not trying to take a leak on the field. Like everyone else I'm impressed by all the recent breakthroughs, and of course something like GPT is infinitely more advanced than a simple `rand` function. But the ontology remains unchanged; we're just doing an extremely opinionated, advanced and clever `rand` function.

No we're not.

About a decade ago I trained a model on Wikipedia which was tuned to classify documents into what branch of knowledge the document could be part of. Then I fed in one of my own blog posts. The second highest ranking concept that came back to me was "mereology" a term I had never even heard of and one that was quite apt for the topic I was discussing in the blog post.

My own software, running on the contents of millions of authors' work, ingesting my own blog post, taught me the orchestrator of the process about his own work. This feedback loop is accelerating and just because it takes decades for the irrefutable to come, it doesn't mean that it never will. People in the early 40s said atomic weapons would never happen because it would be too difficult. For some people nothing short of seeing is believing, but those with predictive minds know that this truly is just around the corner.

How typically cynical of human beings: a wondrous technology comes along that can free mankind of tedious work and massively improve our lives, maybe even eliminate scarcity eventually, and all people can think about is how it could be bad for us.

Regulations. That's what the government is for. You think any country is going to let someone operate millions of drones at their will? Yeah ok.

You are assuming that AI will magically appear in one pair of hands only. We can prevent that: as developers we can make AI research open and provide AI tools to the masses in order to keep "balance". If everyone had the same power, then it wouldn't be such a big advantage anymore.

That's not obvious. What if everyone has the tools to create their own army of nuclear-tipped killer drones?

Armies of high powered smart drones aren't going to be a thing until we figure out security, and I'm not sure that's ever going to happen. Having people in the loop is affordable and much more expensive/time consuming to subvert.

Wow. This is amazing. Although I wish they documented how much compute and data was used to get these results.

I absolutely believe we'll crack the fundamental principles of intelligence in our lifetimes. We now have the capability to process all public data available on the internet (of which Wikipedia is a huge chunk). We have so many cameras and microphones (one in each pocket).

It's also scary to think if it goes wrong (the great filter for fermi paradox). However I'm optimistic.

The brain only uses 20 watts of power to do all its magic. The entire human body is built from 700MB of data in the DNA. The fundamental principles of intelligence are within reach if we look from that perspective.

Right now GPT3 and DALL-E seem to be using an insane amount of computation to achieve what they are doing. My prediction is that in 2050, we'll have pretty good intelligence in our phones that has deep understanding (language and visual) of the world around us.

> Although I wish they documented how much compute and data was used to get these results.

I'm hearing astonishing numbers, in the tens of megawatts range for training these billion-parameter models.

And I wish they showed us all the rejected images. If those images (like the snail harp) were the FIRST pass of the release candidate model.... wow... but how much curating did they do?

EDIT: Units. Derp.

> tens of mega-joule range

do you mean tera-joules?

A hundred megajoules is about three bucks at 10 cents per kwh.
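
The "about three bucks" figure can be checked with a short unit conversion. This is a minimal sketch; the function name `cost_usd` is made up for illustration, and the only inputs are the 10 cents per kWh from the comment and the standard 3.6 MJ per kWh.

```python
# Convert an energy amount in megajoules to kWh, then to a dollar cost.
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def cost_usd(megajoules: float, usd_per_kwh: float = 0.10) -> float:
    """Electricity cost of `megajoules` at the given price (assumed flat rate)."""
    kwh = megajoules * 1e6 / JOULES_PER_KWH
    return kwh * usd_per_kwh


print(f"100 MJ costs about ${cost_usd(100):.2f}")
print(f"1 GJ costs about ${cost_usd(1000):.2f}")
```

At that price a hundred megajoules comes to a little under three dollars, and a gigajoule to a little under thirty.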

I routinely do giga-joule level computations using just a rack of computers in my garage, they're no big deal.


Joules is a unit of total computation, watts is a unit of computation rate. :P

Metawatt is a unit for rates of speculation, uninformed by multiplication, about AI energy usage.

Oops, *mega, obviously! But I did mean watts.

MWh is what you'd want.

I was thinking power consumption. Probably a mis-correction anyway.

> in the tens of megawatts

But for how long? 1 second, 1 hour, 1 month? The energy matters more than the power.
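
To make the power-versus-energy point concrete, here is a small sketch that multiplies a fixed power draw by the three durations mentioned. The 10 MW figure is an assumption picked from the "tens of megawatts" range, not a reported number, and `energy_mwh` is a hypothetical helper.

```python
def energy_mwh(power_mw: float, hours: float) -> float:
    """Energy is power multiplied by time: MW * h = MWh."""
    return power_mw * hours


POWER_MW = 10.0  # assumed draw, for illustration only

durations_h = {
    "1 second": 1 / 3600,
    "1 hour": 1.0,
    "1 month (30 days)": 30 * 24.0,
}

for label, hours in durations_h.items():
    print(f"{POWER_MW:.0f} MW for {label}: {energy_mwh(POWER_MW, hours):,.3f} MWh")
```

The same power draw spans more than six orders of magnitude in energy across those durations, which is why the duration has to be stated.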

Maybe a nitpick, but there's a difference between energy consumption during training and inference. If you want to talk about the energy necessary to train a human brain, it involves years of that 20W power consumption. What is the power consumption for inference time for Dall-E?

Not to mention the billions of years spent in the evolutionary pipeline.

Well, if you want to go that route, you'd need to count all the energy spent to build computers of all kinds since the 50s, and also all the energy spent to sustain the lives of people working on AI. And, well, all the millions of years spent in the evolutionary pipeline before these people were born ;)

I did some quick napkin math for Lee Sedol vs AlphaGo a few weeks ago: https://news.ycombinator.com/item?id=25493358

Wow. Good to know that the entire lifetime energy consumption is 50 MWh. At $0.10/kWh, you’re looking at $5000 in equivalent electric energy consumption over the entire lifetime of a human being.

The brain uses 20W of power. For a lifetime of ~80 years, that is 14MWh of energy usage. Suppose we say the brain trains for the first 25 years, then that is 4.38 MWh. The equivalent electric energy consumption is only $438.
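
The arithmetic above can be reproduced in a few lines. This sketch assumes 365-day years and the $0.10/kWh price used earlier in the thread; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap days


def brain_energy_mwh(years: float, watts: float = 20.0) -> float:
    """Energy used by a constant `watts` draw over `years`, in MWh."""
    return watts * years * HOURS_PER_YEAR / 1e6


def electricity_cost_usd(mwh: float, usd_per_kwh: float = 0.10) -> float:
    """Cost of `mwh` of electricity at an assumed flat price."""
    return mwh * 1000 * usd_per_kwh


lifetime = brain_energy_mwh(80)   # roughly 14 MWh
training = brain_energy_mwh(25)   # roughly 4.38 MWh

print(f"80 years: {lifetime:.2f} MWh")
print(f"25 years: {training:.2f} MWh, about ${electricity_cost_usd(training):.0f}")
```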

So yeah, the brain is quite efficient both in hardware and software.

That's still really surprising I think. I would have imagined that it would have been off by many orders of magnitude. The fact that these models are within a factor of 3-10 of the human consumption is pretty impressive.

That being said, these models are training only for very specific tasks whereas obviously the human brain is far more sophisticated in terms of its capabilities.

Bear in mind that only a tiny fraction of the energy spent by Sedol's brain during his whole life was dedicated to learning Go. Even while playing Go, a human neural system spends a big part of its energy doing mundane stuff like moving your hand, standing up, deciphering the inputs coming from the eyes and every other sensory body part and subconsciously processing them (my back hurts and I need to change posture, my opponent smells nice, the light in the room is too bright, etc.). Interestingly enough, doing most of these things is also a big challenge for AI today.

Do you think we’ll crack the principles, or do you think we’ll just have very powerful models without really knowing what makes them clever?

My bet is understanding the fundamental principles. Like building an airbus plane or starship requires fundamental understanding of aerodynamic principles, chemistry, materials and physics.

DNNs will definitely not get us there in their current form.

I am very curious to see if concepts from cognitive science and theoretical linguistics (like the Language of Thought paradigm as a framework of cognition or the Merge function as a fundamental cognitive operation) will be applied to machine learning. They seem to be some of the best candidates for the fundamental principles of cognition.

They don't need to solve the problem of reasoning, they only need to simulate reasoning well enough.

They are getting pretty good, people already have to "try" a little bit to find examples where GPT-3 or DALL-E are wrong. Give it a few more billion parameters and training data, and GPT-10 might still be as dumb as GPT-3 but it'll be impossible/irrelevant to prove.

> The entire human body is built from 700MB of data in the DNA.

I think this notion is misleading. It doesn't relate to the ease of simulating it on our current computers. You'll need a quantum computer to emulate this ROM in anything like realtime.

The DNA program was optimised to execute in an environment offering quantum tunnelling, multi component chemical reactions etc.

My point wasn’t to emulate the DNA in a virtual machine. The point was that between humans and chimps (our closest DNA relatives) we’re 99% same to them. So high order intelligence that gives rise to written language and tool building is somewhere in that 700MB of DNA code. And 1% of that (just 7MB) is responsible for creating the smartest intelligence we know (humans).

In that sense intelligence architecture isn’t very complicated. The uniformity of the isocortex, of which we have the most relative to brain size of any animal, says we ought to replicate its behavior in a machine.

The isocortex/neocortex is where the gold is. It’s very uniform when seen under a microscope. Brain cells from one region can be put in another region and they work just fine. All of this says intelligence is some recursive architecture of information processing. That’s why I’m optimistic we’ll crack it.

I think what the parent was saying is important though, that 700mb of "data" isn't complete. It's basically just really really good compression that requires the runtime of our universe to work properly. The way proteins form and interact, the way physics works, etc are all requirements for that DNA to be able to realize itself as complex thinking human beings.

Yes you rephrased that very well.

If your plan to build intelligence is by copying how nature does it, well then you'll need to build a "nature" runtime that can emulate the universe. You can either do that slowly or inaccurately

> I absolutely believe we'll crack the fundamental principles of intelligence in our lifetimes.

I tend to agree. However this looks a lot like the beginning of the end for the human race as well. Perhaps we are really just a statistical approximation device.

Yeah, that's why I mentioned the great filter of the Fermi paradox.

I also believe humans in our current species form won’t become a spacefaring species. We’re pretty awful as space travelers.

It is very likely that we’ll have robots with human like intelligence, sensor and motor capabilities sent as probes to other planets to explore and carry on the human story.

But the future is hard to predict. I do know that if the intelligence algorithm is only in the hands of Google and Facebook, we are doomed. This is a thing that ought to be open source and equally beneficial to everyone.

I wish this was available as a tool for people to use! It's neat to see their list of pregenerated examples, but it would be more interesting to be able to try things out. Personally, I get a better sense of the powers and limitations of a technology when I can brainstorm some functionality I might want, and then see how close I can come to creating it. Perhaps at some point someone will make an open source version.

I wish so too! I don't expect them to release the code (they rarely do) and they wield their usual "it might have societal impact, let us decide what's good for the world":

> We recognize that work involving generative models has the potential for significant, broad societal impacts

The community did rise to the challenge of re-implementing it (sometimes better) in the past, so I'm hopeful.

I don't think the goal is for them to "decide what's good for the world". You can classify disruptiveness/risk of a piece of tech fairly objectively.

Delaying release is to give others (most clearly social media) time to adjust and ensure safety within their own platforms/institutions (of which they are the arbiters). It also gives researchers and entrepreneurs a strong motivation of "we have to solve these risk points before this technology starts being used". While there are clearly incentive issues and gatekeeping in the research/startup community, this is a form of decentralized decision-making.

I don't see a strong case for why access should be open-sourced at announcement time, especially if it's reproducible. Issues will arise when their tech reaches billions of dollars to train, making it impossible to reproduce for 99.99% of labs/users. At that point, OpenAI will have sole ownership and discretion over their tech, which is an extremely dangerous world. GPT-3 is the first omen of this.

I have no idea what kind of compute power something like this relies on. Would this be able to run on a consumer desktop?

They note that the model has 12B parameters, which in terms of order of magnitude makes it sit right between GPT-2 and GPT-3 (1.5B and 175B parameters respectively). With some tricks, you can run GPT-2 on good personal hardware, so this might be reachable as well with the latest hardware.

EDIT: I'm assuming you mean for inference; for training it would be another kind of challenge and the answer would be a clear no.
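
A rough way to judge whether inference could fit on a desktop is to estimate the memory needed just to hold the weights. This is a back-of-the-envelope sketch: it assumes 4 bytes per parameter for fp32 and 2 for fp16, uses the published parameter counts (1.5B for GPT-2, 12B for DALL-E, 175B for GPT-3), and ignores activations and any other runtime overhead.

```python
def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


models = {"GPT-2": 1.5e9, "DALL-E": 12e9, "GPT-3": 175e9}

for name, n_params in models.items():
    fp32 = weights_gb(n_params, 4)
    fp16 = weights_gb(n_params, 2)
    print(f"{name}: {fp32:.0f} GB in fp32, {fp16:.0f} GB in fp16")
```

Under these assumptions a 12B-parameter model needs about 24 GB in fp16, which is at the upper edge of a single high-end consumer GPU but within reach of system RAM.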

In the linked CLIP paper they say it is trained on 256 GPUs for 2 weeks. No mention of the size of the trained output.

Depends on how fast you want it to generate results, but yes, it can run on a desktop provided there's enough RAM.

The way this model operates is the equivalent of machine learning shitposting.

Broke: Use a text encoder to feed text data to an image generator, like a GAN.

Woke: Use a text and image encoder as the same input to decode text and images as the same output

And yet, due to the magic of Transformers, it works.

From the technical description, this seems feasible to clone given a sufficiently robust dataset of images, although the scope of the demo output implies a much more robust dataset than the ones Microsoft has offered publicly.

It's actually a bit more complicated, since DALL-E uses CLIP to rerank its generated samples, and CLIP is itself trained using separate text and image encoders: https://openai.com/blog/clip/

At some point we'll have so many models based on so many other models it will no longer be possible to tell which techniques are really involved.

This is an open problem! HuggingFace.co is trying to fix it, but I worry that they'll come up short.

Shitposts are more creative. What I would like to see is more extrapolation and complex mixing:

"A photo of an iPhone from the stone age."

"Adolf Hitler pissing against the wind and enjoying it."

"Painting: Captain Jean-Luc Picard crossing the Delaware River in a Porsche 911".

You're describing a Jim'll Paint It AI bot https://jimllpaintit.tumblr.com/

You can get "a computer from the 1900s" in the examples in the post.

Repeatable, measurable, automated image meme shitposting is absolutely a destructive device though.

It's not really surprising given what we now know about autoregressive modeling with transformers. It's essentially a game of predict hidden information given visible information. As long as the relationship between the visible and hidden information is non-random you can train the model to understand an amazing amount about the world by literally just predicting the next token in a sequence given all the previous ones.

I'm curious whether they do a backward pass here; it would probably have value. They seem to describe sticking the text tokens first, meaning that once you start generating image tokens all the text tokens are visible. That would have the model learning to generate an image with respect to a prompt, but you could also literally just reverse the order of the sequence to have the model also learn to generate prompts with respect to the image. It's not clear if this is happening.

Is this kind of happening with the CLIP classifier [1] to rank the generated images?

> Similar to the rejection sampling used in VQVAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search, and can have a dramatic impact on sample quality.

> CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.

[1] https://openai.com/blog/clip/
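
The reranking step in the first quote amounts to "generate many, score each against the caption, keep the best few". Below is a minimal sketch of that selection logic only. It does not call CLIP: the scores are random stand-ins, and `rerank` and `fake_scores` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rerank(candidates: Sequence[T], score: Callable[[T], float], keep: int) -> List[T]:
    """Return the `keep` highest-scoring candidates, best first."""
    return sorted(candidates, key=score, reverse=True)[:keep]


# Stand-in data: 512 sample ids, each with a made-up caption-match score.
# In the described setup the score would come from CLIP instead.
rng = random.Random(0)
fake_scores = {sample_id: rng.random() for sample_id in range(512)}

top32 = rerank(list(fake_scores), fake_scores.__getitem__, keep=32)
print(f"kept {len(top32)} of {len(fake_scores)} samples")
```

Note that this means the published images are the best 1 in 16 by the scorer's judgment, before any human curation.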

That approach wouldn't work out of the box; it sees text for the first 256 tokens and images for the following 1024 tokens, and tries to predict the same. It likely would not have much to go on if you gave it the 1024 tokens for the image and then 256 for the text later since it doesn't have much of a basis.

A network optimizing for both use cases (e.g. the training set is half 256 + 1024, half 1024 + 256) would likely be worse than a model optimizing for one of the use cases, but then again models like T5 argue against it.
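
The ordering question can be illustrated with a toy layout of the 256 + 1024 token sequence. This sketch only shows which tokens are visible when each part is predicted; it assumes captions are already padded to 256 tokens, and the function names are hypothetical.

```python
TEXT_LEN, IMAGE_LEN = 256, 1024


def build_sequence(text_tokens, image_tokens, text_first=True):
    """Concatenate the two token streams in the chosen order."""
    assert len(text_tokens) == TEXT_LEN and len(image_tokens) == IMAGE_LEN
    if text_first:
        return text_tokens + image_tokens
    return image_tokens + text_tokens


def visible_context(sequence, position):
    """Tokens an autoregressive model may condition on when predicting `position`."""
    return sequence[:position]


text = [("txt", i) for i in range(TEXT_LEN)]
image = [("img", i) for i in range(IMAGE_LEN)]

# Text first: the whole caption is visible before the first image token.
text_to_image = build_sequence(text, image)
print(len(text_to_image), len(visible_context(text_to_image, TEXT_LEN)))

# Image first: the whole image is visible before the first caption token.
image_to_text = build_sequence(text, image, text_first=False)
print(len(image_to_text), len(visible_context(image_to_text, IMAGE_LEN)))
```

A model trained only on the first layout has never predicted a text token with image tokens in its context, which is the objection raised above.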

Shows you where the role of a meme and a shitposter may exist in a cosmological technological hierarchy. Humans are just rendering nodes replicating memes, man. /s in the Dude's voice from The Big Lebowski.

"Teapot in the shape of brain coral" yields the opposite. The topology is teapot-esque. The texture composed of coral-like appendages. Sorry if this is overly semantic, I just happen to be in a deep dive in Shape Analysis at the moment ;)

>>> DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion.

That could be human bias recognizing features the generator yields implicitly. Most of the images appear as "masking" or "decal" operations, rather than a full style transfer. In other words the expected outcome of "soap dispenser in the shape of hibiscus" would resemble a true hybridized design, like an haute couture bottle of eau de toilette made to resemble rose petals.

The name DALL-E is terrific though!

I find its ability to give different interpretations of the same thing amazing. This kind of fuzziness is also present in human art.

Another good example is the "collection of glasses" on the table. It makes both glassware and eyeglasses!

I was thinking "no way it can draw a pile of glasses, the lighting alone is difficult and it's so obscure" and then it just drew a pile of eyeglasses. I...

> a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.

With the ability to construct complex 3D scenes, surely the next step would be for it to ingest YouTube videos or TV/movies and be able to render entire scenes based on a written narration and dialogue.

The results would likely be uncanny or absurd without careful human editorial control, but it could lead to some interesting short films, or fan-recreations of existing films.

I'd love to see what this does with item/person/artwork/monster descriptions from Dwarf Fortress. Considering the game has creatures like were-zebras, beasts in random shapes and materials, undead hair, procedurally generated instruments and all kinds of artefacts menacing with spikes I imagine it could make the whole thing even more entertaining.

Nethack or Slashem first. The game is much less CPU taxing and it could be "real lifed" with ease.

I think in a way that's the next step but they may have to wait a little bit before they have the processing power.

If you are talking about 24 frames per second, then theoretically one second of video could require 24 times as much processing power. And 100 seconds 2400 X. Obviously that's just a random guess but surely it is much more than for individual images.

But I'm sure we'll get there.
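The back-of-envelope estimate above can be written down as a tiny sketch. It assumes, as the comment does, that each frame costs as much as one still image and that frames are generated independently at 24 frames per second; the function name `frames_needed` is made up for illustration.

```python
def frames_needed(seconds: float, fps: int = 24) -> int:
    """Number of still images needed for a clip, one image per frame."""
    return round(seconds * fps)


# Relative cost compared with generating a single image,
# under the naive assumption that every frame costs the same.
one_second = frames_needed(1)
hundred_seconds = frames_needed(100)
print(one_second, hundred_seconds)
```

A real video model would likely share computation between neighbouring frames, so this is an upper-bound style guess rather than a measurement.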

How do we know this isn’t already happening with state actors?

Because state actors are all busy acting smart on the internet by using terms such as “state actors”.

"For other captions, such as “a snail made of harp,” the results are less good, with images that combine snails and harps in odd ways." [0]

You try drawing a snail made of harp! Seriously! DALL-E did an incredible job

[0] https://www.technologyreview.com/2021/01/05/1015754/avocado-...

I think those examples are great. Another way to judge this is to consider what a class of 8th graders might come up with if you ask them to draw "a snail made of harp". The request is non-sensical itself, so I imagine the results from DALL-E are actually pretty good.

Really? I found those examples to be the most compelling and interesting overall.

In spite of the close architectural resemblance to VQ-VAE-2, it definitely pushes the text-to-image synthesis domain forward. I'd be curious to see how well it performs in a multi-object image setting, which currently presents the main challenge in the field. Also, I wouldn't be surprised if these results were limited to OpenAI's scale of computing resources. All in all, great progress in the field. The pace of development here is simply staggering, considering that a few years back we could hardly generate any image in high fidelity.

This is real? A computer can take "an armchair in the shape of an avocado" as input and make a picture of one?

I can't believe it. How does it put the baby daikon radish in the tutu?

If we knew how it did it, it wouldn't be machine learning.

The defining feature of machine learning in other words is that the machine constructs a hypersurface in a very-high-dimensional space based on the samples that it sees, and then extrapolates along the surface for new queries. Whereas you can explain features of why the hypersurface is shaped the way it is, the machine learning algorithm essentially just tries to match its shape well, and intentionally does not try to extract reasons "why" that shape "has to be" the way it is. It is a correlator, not a causalator.

If you had something bigger you'd call it "artificial intelligence research" or something. Machine learning is precisely the subset right now that is focused on “this whole semantic mapping thing that characterized historical AI research programs—figure out amazing strategy so that you need very little compute—did not bear fruit fast enough compared to the exponential increases in computing budget so let us instead see what we can do with tons of compute and vastly less strategy.” It is a deliberate reorientation with some good and some bad parts to it. (Practical! Real results now! Who cares whether it “really knows” what it’s doing? But also, you peer inside the black box and the numbers are quite inscrutable; and also, adversarial approaches routinely can train another network to find regions of the hyperplane where an obvious photograph of a lion is mischaracterized as a leprechaun or whatever.)

Machine learning is not a concept that is fundamentally incompatible with interpretability, and indeed research is being done in this area.

One method for example is occlusion, removing pieces of input to assemble statistical representations of which parts your model cares about.

It's all still baby steps, but with time the theory will catch up.
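A minimal sketch of the occlusion idea mentioned above: mask one patch of the input at a time and record how much the model's score drops. The `toy_score` function is a hypothetical stand-in for a real model, and the patch size and fill value are arbitrary assumptions.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_map(image: Image, score: Callable[[Image], float],
                  patch: int = 2, fill: float = 0.0) -> Image:
    """Heat map of how much the score drops when each patch is masked."""
    height, width = len(image), len(image[0])
    baseline = score(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Copy the image and blank out one patch.
            occluded = [row[:] for row in image]
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    occluded[y][x] = fill
            drop = baseline - score(occluded)
            # Attribute the drop to every pixel in the masked patch.
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    heat[y][x] = drop
    return heat


def toy_score(image: Image) -> float:
    """Hypothetical 'model' that only looks at the top-left 2x2 corner."""
    return sum(image[y][x] for y in range(2) for x in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
```

Here only the top-left patch gets a non-zero value in `heat`, which matches the fact that the toy scorer ignores everything else.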

> If you had something bigger you'd call it "artificial intelligence research"

Usually called just Data Science, and does deal with that (we had lectures on interpretability of models at university)

Because the Internet has plenty of well captioned drawings of daikon radish drawn as a bipedal humanoid. D’oh!

Switch it over to pikachu in a helmet wielding a blue lightsaber. Some hilariously bad results there.

"pikachu in a xxx staring at its reflection in a mirror" is interesting. Both hilariously bad and impressive if the helmet is mirrored.

Is there a link to the git repo or is OpenAI not really open?

I suspect you meant for Dall-E specifically, but this is their repo. Found on their about page.


I looked around a bit and couldn't find Dall-e in there. A higher post in this thread said they don't usually release their models. It's a shame, this would have been fun to play with.

I don't think you have the hardware necessary to train it. I think it trained using 256 GPUs for two weeks.

the only open thing is the name

Someone made a replication on GitHub, and it can be found here: https://github.com/lucidrains/DALLE-pytorch

Thank you. We need open source state of the art AI.

They are going to train on YouTube/PornHub before long and it’s going to get weird.

I'm not sure how to feel, because I had this exact same thought. The evolution of porn from 320x200 EGA on a BBS, to usenet (alt.binaries.pictures.erotica, etc.) on XVGA (on an AIX Term), to the huge pool of categories on today's porn sites, which eventually became video and bespoke cam performers... Is this going to be some new weird kind of porn that Gen Alpha normalizes?

Also, someone will make a version for furries. They pay well.

Combine this with deep fakes.

Donald Trump is Nancy Pelosi's and AOC's step-brother in a three-way in the Lincoln Bedroom.

Can't unsee!

At least we're spared the smell...for now.

Maybe already has...

Does this address NLP skeptics' concerns that Transformer models don't "understand" language?

If the AI can actually draw an image of a green block on a red block, and vice versa, then it clearly understands something about the concepts "red", "green", "block", and "on".

The root cause of skepticism has always been that while Transformers do exceptionally well on finite-sized tasks, they lack any fully recursive understanding of the concepts.[0]

A human can learn basic arithmetic, then generalize those principles to bigger number arithmetic, then go from there to algebra, then calculus, and so on, successively building on previously learned concepts in a fully recursive manner. Transformers are limited by the exponential size of their network. So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.

DALL-E is an incredible achievement, but doesn't really do anything to change this fact. GPT-3 can have an excellent understanding of a finite sized concept space, yet it's still architecturally limited at building recursive abstractions. So maybe it can understand "green block on a red block". But try to give it something like "a 32x16 checkerboard of green and red blocks surrounded by a gold border frame studded with blue triangles". I guarantee the architecture can't get that exactly correct.

The point is that, in some sense, GPT-3 is a technical dead-end. We've had to exponentially scale up the size of the network (12B parameters) to make the same complexity gains that humans make with linear training. The fact that we've managed to push it this far is an incredible technical achievement, but it's pretty clear that we're still missing something fundamental.

[0] https://arxiv.org/pdf/1906.06755.pdf

> So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.

This is false, GPT-3 can do 10-digit addition with ~60% accuracy, with comma separators. Without BPEs it would doubtlessly manage much better.

The accuracy largely comes from the fact that addition rarely requires carrying more than a single digit. So it's easy to pattern match from single digit problems that it was previously trained on.

With multiplication, which requires much more extensive cross-column interaction, accuracy falls off a cliff with anything more than a few digits.
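One way to probe the carrying claim is to measure how long carry chains actually get in a given addition. This is only a sketch of one possible reading of "carrying more than a single digit" (a carry that propagates through several consecutive columns); the function name is made up for illustration.

```python
def longest_carry_chain(a: int, b: int) -> int:
    """Length of the longest run of consecutive columns that produce a carry."""
    longest = current = carry = 0
    while a > 0 or b > 0:
        digit_sum = a % 10 + b % 10 + carry
        carry = 1 if digit_sum >= 10 else 0
        # Extend the current run on a carry, otherwise reset it.
        current = current + 1 if carry else 0
        longest = max(longest, current)
        a //= 10
        b //= 10
    return longest


no_carry = longest_carry_chain(12, 34)      # no column overflows
single = longest_carry_chain(15, 17)        # one isolated carry
cascade = longest_carry_chain(999, 1)       # the carry ripples through every column
```

Running this over many random operand pairs would give an empirical distribution of chain lengths, which is what the claim is really about.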

> The accuracy largely comes from the fact that addition rarely requires carrying more than a single digit. So it's easy to pattern match from single digit problems that it was previously trained on.

Again, not at all true due to BPEs.

      [12] [65] [25] [42] [185]
    + [580] [22] [75] [80] [93]
      [706] [75] [300] [278]
(note that GPT-3 is never told that the BPE [580] is composed of the digits [5], [8], and [0]. It has to guess this from the contexts [580] occurs in.)
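The example above can be checked mechanically. The sketch below treats each bracketed chunk as an opaque token (the real BPE vocabulary is not reproduced here, the chunks are taken verbatim from the example) and shows that the sum is correct while the token boundaries of the three numbers fall at different digit positions.

```python
from typing import List


def join_tokens(tokens: List[str]) -> int:
    """Concatenate token strings back into the integer they spell."""
    return int("".join(tokens))


def token_boundaries(tokens: List[str]) -> List[int]:
    """Digit offsets (from the left) at which one token ends and the next begins."""
    offsets, position = [], 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.append(position)
    return offsets


a_tokens = ["12", "65", "25", "42", "185"]
b_tokens = ["580", "22", "75", "80", "93"]
sum_tokens = ["706", "75", "300", "278"]

a, b, total = (join_tokens(t) for t in (a_tokens, b_tokens, sum_tokens))
print(a + b == total)
print(token_boundaries(a_tokens), token_boundaries(b_tokens), token_boundaries(sum_tokens))
```

All three numbers have eleven digits, yet no two of them are split at the same places, so columns that line up for a human do not line up token by token.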

> With multiplication, which requires much more extensive cross-column interaction, accuracy falls off a cliff with anything more than a few digits.

You couldn't learn long multiplication if you had to use BPEs, were never told how BPEs worked or corresponded to sane encodings, were basically never shown how to do multiplication, and were forced to do it without any working out.

Quick, what's 542983 * 39486? No writing anything down, you have to output the numbers in order, and a single wrong digit is a fail. (That's easy mode, I won't even bother asking you to do BPEs.)

ML models can learn multiplication, obviously they can learn multiplication, they just can't do it in this absurd adversarial context. GPT-f[1] was doing Metamath proofs on 9-digit division (again, a vastly harder context, they involve ~10k proof steps) with 50-50 accuracy, and we have a toy proof of concept[2] for straight multiplication.

[1] https://arxiv.org/abs/2009.03393

[2] https://github.com/Thopliterce/transformer-arithmetic

In retrospect this was a pretty poorly framed point of mine. The point about addition being harder than stated because BPEs break at arbitrary points doesn't make much sense given the previous comment of mine pointed out that GPT-3 can only do addition with comma separators, that mostly (if not quite entirely) defeat that problem. GPT-3 does have to still know the various constructions and additions of 3-digit spans, but that's within its capabilities.

Correction/clarification: 10k proof steps was the number of steps in the training dataset, which was for 100 such proofs.

> So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.

That sounds disappointing but what if instead of trying to teach it to do addition one would teach it to write source code for making addition and other maths operations instead?

Then you can ask it to solve a problem but instead of it giving you the answer it would give you source code for finding the answer.

So for example you ask it “what is the square root of five?” then it responds:

    fn main() {
        println!("{}", 5f64.sqrt());
    }

Try a large block on a small block. As the authors also have noted in their comments the success rate is nearly zero. One may wonder why. Maybe because that's something you see rarely in photos? At the end, it doesn't "understand" the meaning of the words.

Maybe because large/small are relative to each other here? Maybe it would be able to do "a block inside a block"

I think it is safe to say that learning a joint distribution of vision + language is fully possible at this stage, as demonstrated by this work.

But 'understanding' itself needs to be further specified, in order to be tested even.

What strikes me most is the fidelity of those generated images, matching the SOTA from GAN literature with much more variety, without using the GAN objective.

It seems the Transformer model might be the best neural construct we have right now to learn any distribution, assuming more than enough data.

There are examples on twitter showing it doesn't really understand spatial relations very well. Stuff like "red block on top of blue block on top of green block" will generate red, green, and blue blocks, but not in the desired order.


if(adj == 'red') drawBlock(RED)

According to your definition of understanding, this program understands something about the concept RED. But the code is just dealing with arbitrary values in memory (e.g. RED = 0xFF0000)

nah, it's still big and dumb model with no idea what it's doing, deepfake 2.0.

It looks like a variation on plain old image search engine, unreliable at that, as compared to exact matching.

But it has obvious application in design as it can create these interesting combinations of objects & styles. And I loved the snail-harp.

Seems like we’re getting closer to AI driven software engineering.

Prompt: a Windows GUI executable that implements a scientific calculator.

What you'll get is the same thing as GPT-3: the equivalent of googling the prompt. You can google "implement a scientific calculator" and get multiple tutorials right now.

You'll still need humans to make anything novel or interesting, and companies will still need to hire engineers to work on valuable problems that are unique to their business.

All of these transformers are essentially trained on "what's visible to google", which also defines the upper bound of their utility

> You'll still need humans to make anything novel or interesting, and companies will still need to hire engineers to work on valuable problems that are unique to their business.

Give it 10 years :) GPT-10 will probably be able to replace a sizeable proportion of today's programmers. What will GPT-20 be able to do?

Possibly, but in software the realm of errors is wider and more detrimental. Imagery, the human mind will fill in the gaps and allow interpretation. Software, not so much.

True, but the human mind needs an expensive, singleton body in the real world, while a code writing GPT-3 only needs a compiler and a CPU to run its creations. Of course they would put a million cores to work at 'learning to code' so it would go much faster. Compare that with robotics, where it's so expensive to run your agents. I think this direction really has a shot.

Would someone like GitHub be the right person to solve this? They have ALL of the code.

To be fair, anyone else has almost all the code, since it's all public.

There are some attempts to get AI Dungeon (GPT-2/3 based game) to generate code. This scenario for example (you need to create an account to launch it): https://play.aidungeon.io/main/scenarioView?publicId=af4a05f....

Someone did this simply by giving GPT-3 some code samples last year: https://twitter.com/sharifshameem/status/1282676454690451457...

Strongly recommend watching the whole video!

Prompt: One self-improving self-optimizing misanthropic quine please

With TCL/TK it's just a few lines put together.

Anywhere this can be tried out interactively? I'd like to type some phrases and see how it does.

I came hoping for the same. While these are amazing to read about in a publication, one really needs to try it out. Would love to get my hands dirty. Still wait-listed for GPT-3 access, but no hope in sight...

Imagine the amount of body parts pictures on the server... /s

In various episodes of Star Trek: The Next Generation, the crew asks the computer to generate some environment or object with relatively little description. It’s a storytelling tool of course, but looking at this, I can begin to imagine how we might get there from here.

Almost as if thoughts and reality are of the same thing.

What do you mean?

There's something that really creeps me out about errors in AI generated images. More than uncanny valley creepiness. Like trypophobia creepy.

Same for me. It's like the feeling you get in a dream where things seem normal and you think you're awake, then suddenly you notice something wrong about the room, something impossible.

I know exactly what you mean. Like if you had to see it in real life you'd see something horrible just out of shot. For some reason that's amplified with the furniture.

Just wait until you can’t tell the difference and then contemplate if it matters, and then if that’s what reality already is.

Don't look at the food, I advise.

I get the same feeling, to the point that I occasionally let out a brief scream when browsing GAN images.

I really do think AI is going to replace millions of workers very quickly, but just not in the order that we used to think of. We will replace jobs that require creativity and talent before we will replace most manual factory workers, as hardware is significantly more difficult to scale up and invent than software.

At this point I have replaced a significant amount of creative workers with AI for personal usage, for example:

- I use desktop backgrounds generated by VAEs (VD-VAE)

- I use avatars generated by GANs (StyleGAN, BigGAN)

- I use and have fun with written content generated by transformers (GPT3)

- I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)

- I don't purchase stock images or commission artists for many previous things I would have when a GAN exists that already makes the class of image I want

All of this has happened in the last year or so for me, and I expect that within a few more years this will be the case for vastly more people and in a growing number of domains.

> - I use and have fun with written content generated by transformers (GPT3)

> - I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)

Really, you've "replaced" normal music and books with these? Somehow I doubt that.

Not entirely, no, I hope I didn't imply that. I listen to human-created music every day. I just mean to say that I've also listened to AI-created music that I've enjoyed, so it's gone from being 0% of what I listen to to 5%, and presumably may increase much more later.

You should try Aiva (http://aiva.ai). At some point I was mostly listening to compositions I generated through that platform. Now I'm back to Spotify, but AI music is definitely on my radar.

Looks great, thanks for the suggestion

What are you talking about, this is my favorite album: https://www.youtube.com/watch?v=K0t6ecmMbjQ

Not to undermine this development, but so far, no surprise, AI depends on vast quantities of human-generated data. This leads us to a loop: if AI replaces human creativity, who will create novel content for new generation of AI? Will AI also learn to break through conventions, to shock and rewrite the rules of the game?

It’s like efficient market hypothesis: markets are efficient because arbitrage, which is highly profitable, makes them so. But if they are efficient, how can arbitrageurs afford to stay in business? In practice, we are stuck in a half-way house, where markets are very, but not perfectly, efficient.

I guess in practice, the pie for humans will keep on shrinking, but won’t disappear too soon. Same as horse maintenance industry, farming and manufacturing, domestic work etc. Humans are still needed there, just a lot less of them.

> if AI replaces human creativity, who will create novel content for new generation of AI?

Vast majority of human generated content is not very novel or creative. I'm guessing less than 1% of professional human writers or composers create something original. Those people are not in any danger to be replaced by AI, and will probably be earning more money as a result of more value being placed on originality of content. Humans will strive (or be forced) to be more creative, because all non-original content creation will be automated. It's a win-win situation.

> Will AI also learn to break through conventions, to shock and rewrite the rules of the game?

I think AlphaGo was a great in-domain example of this. I definitely see things I'd refer to colloquially as 'creativity' in this DALL-E post, though you can decide for yourself. That still isn't claiming it matches what some humans can do.

True, but AlphaGo exists in a world where everything is absolute. There are new ways of playing Go, but the same rules.

If I train an AI on classical paintings, can it ever invent Impressionism, Cubism, Surrealism? Can it do irony? Can it come up with something altogether new? Can it do meta? “AlphaPaint, a recursive self-portrait”?

Maybe. I’m just not sure we have seen anything in this dimension yet.

>If I train an AI on classical paintings, can it ever invent Impressionism, Cubism, Surrealism?

I see your point, but it's an unfair comparison: if you put a human in a room and never showed them anything except classical paintings, it's unlikely they would quickly invent cubism either. The humans that invented new art styles had seen so many things throughout their life that they had a lot of data to go off of. Regardless, I think we can do enough neural style transfer already to invent new styles of art though.

> how can arbitrageurs afford to stay in business

Most arbitrageurs cannot stay in the business; it's the law of diminishing returns. Economies of scale eventually prevent small individual players from profiting from the market. Only a few big-ass hedge funds can stay, because due to their investments they can get preference from exchanges (significantly lower / zero / negative fees, co-located hardware, etc.), which makes the operation reasonable for them. With enough money you can even build your own physical cables between exchanges to outperform the competitors in latency games. I'm a former arbitrageur, by the way :)

Same with AI-generated content. You would have to be absolutely brilliant to compete with AI. Only a few select individuals would be "allowed" to enter the market. Not even sure that it has something to do with the quality of the content, maybe it's more about prestige.

You see, there already are gazillions of decent human artists, but only a few of them are really popular. So the top-tier artists would probably remain human, because we need someone real to worship. Their producers would surely use AI as a production tool, depicting it as a human work. But all the low-tier artists would be totally pushed out of the market. There will simply be no job for a session musician or a freelance designer.

I believe that AI will accelerate creativity. This will have a side effect of devaluing some people's work (like you mentioned), but it will also increase the value of some types of art and, more importantly, make it possible to do things that were impossible before, or allow small teams and individuals to produce content that was prohibitively expensive.

There still needs to be some sort of human curation, lest bad/rogue output risks sinking the entire AI-generated industry. (in the case of DALL-E, OpenAI's new CLIP system is intended to mitigate the need for cherry-picking, although from the final demo it's still qualitative)

The demo inputs here for DALL-E are curated and utilize a few GPT-3 prompt engineering tricks. I suspect that for typical unoptimized human requests, DALL-E will go off the rails.

Personally speaking I don't want curation. What is fascinating about generative AI is the failure modes.

I want the stuff that no human being could have made - not the things that could pass for genuine works by real people.

Failure modes are fun when they get 80-90% of the way there and hit the uncanny valley.

Unfortunately many generations fail to hit that.

Yes, but there's no reason we can't partially solve this by throwing more data at the models, since we have vast amounts of data we can use for that (ratings, reviews, comments, etc), and we can always generate more en masse whenever we need it.

This isn't a problem that can be solved with more data. It's a function of model architecture, and as OpenAI has demonstrated, larger models generally perform better even if normal people can't run them on consumer hardware.

But there is still a lot of room for more clever architectures to get around that limitation. (e.g. Shortformer)

I think it's both - we have a lot of architectural improvements that we can try now and in the future, but I don't see why you can't take the output of generative art models, have humans rate them, and then use those ratings to improve the model such that its future art is likely to get a higher rating.

> We will replace jobs that require creativity

Frankly, I think the "AI will replace jobs that require X" angle of automation is borderline apocalyptic conspiracy porn. It's always phrased as if the automation simply stops at making certain jobs redundant. It's never phrased as if the automation lowers the bar to entry from X to Y for /everyone/, which floods the market with crap and makes people crave the good stuff made by the top 20%. Why isn't it considered as likely that this kind of technology will simply make the best 20% of creators exponentially more creatively prolific in quantity and quality?

> Why isn't it considered as likely that this kind of technology will simply make the best 20% of creators exponentially more creatively prolific in quantity and quality?

I think that's well within the space of reasonable conclusions. For as much as we are getting good at generating content/art, we are also therefore getting good at assisting humans at generating it, so it's possible that pathway ends up becoming much more common.

Isn't training data effectively a form of sampling?

Couldn't any creator of images that a model was trained on sue for copyright infringement?

Or do great artists really just steal (just at a massive scale)?

Currently that is not the case:

>Models in general are generally considered “transformative works” and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whomever created it.

Source (scroll up slightly past where it takes you): https://www.gwern.net/Faces#copyright

Thank you, this is the part I find most relevant:

"Models in general are generally considered “transformative works” and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whomever created it. Hence, Nvidia has copyright on the models it created but I have copyright under the models I trained (which I release under CC-0)."

But does that still hold when the model memorized a chunk of the training data? Or can a network plagiarize output while being a transformative work itself?

I bet they can claim copyright up to the gradients generated on their media, but in the end the gradients get summed up, so their contribution is lost in the cocktail.

If I write a copyrighted text in a book, then print a million other texts on top of it, in both white and black, mixing it all up to be like white noise, would the original authors have a claim?

Models can unpredictably memorize sensitive input data, so there can be a real copyright issue here, I think.


Worse, sometimes the input data is illegal to distribute for other reasons than copyright.

Those don’t seem in any way similar to like writing a tv show or animating a Pixar movie.

I agree, and due to the amount of compute that is required for those types of works I think those are still quite a while away.

But the profession for creative individuals consists of much more than highly-paid well-credentialed individuals working at well-known US corporations. There are millions of artists that just do quick illustrations, logos, sketches, and so on, on a variety of services, and they will be replaced far before Pixar is.

I think this is actually not a bad thing.

I wouldn't say many of those things are creativity driven. They are more like automatic asset generation.

One use case of such a model would be in the gaming industry, to generate large amounts of assets quickly. This process alone takes years, and gets more and more expensive as gamers demand higher and higher resolution.

AI can make this process much more tenable, bring down the overall cost.

You are probably right. Still, there is hope that this is just a prelude to getting closer to a Transmetropolitan box ( assuming we can ever figure out how to make an AI box that can make physical items based purely on information given by the user ).

Do you think investing in MSFT/GOOGL is the best way to profit off this revolution?

It's too hard to say I think. Big players will definitely benefit a lot, so it probably isn't a bad idea, but if you could find the right startups or funds, you might be able to get significantly more of a return.

What GANs do you use to generate stock images?

Do you have a GPT-3 key?

First: This strikes me as truly amazing - but my mind immediately goes to the economic impact of something like this. Personally I try not to be an alarmist about the potential for jobs to be automated away, but how strikingly good this is makes me wonder if we just haven't seen AI that is good enough to automate away large parts of the workforce.

Seeing the "lovestruck cup of boba" reminded me of an illustration a friend of mine did for a startup a few years back. It would be a lot easier and less time consuming for someone to simply request such an image from an AI assistant. If I were a graphic artist or photographer, this would scare me.

I don't know what the right answer is here. I have little to no faith in regulators to help society deal with the sweeping negative effects even one new AI-based product looks like it could have on a large swath of the economy. Short of regulation and social safety nets, I wonder if society will eventually step up and hold founders and companies accountable when they cause broad negative economic impacts for their own enrichment.

Making goods and services cheaper is 'negative economic impact'? I'd argue the opposite. Greater productivity has been humanity's goal ever since people first started sharpening rocks and sticks. I'd argue that it is desirable. Yes, as you point out, greater productivity also has disadvantages at first. See for example the mechanisation of agriculture: people lose their jobs. But as you point out yourself, the blow can easily be softened with better social safety nets.

>but how strikingly good this is

it's good by the standards of machine-generated images but it's not comparable to the work of an artist because it has no intent and it's still in many ways incoherent and lacks details, composition, motives and so on. It's like ML generated music. It sounds okay in a technical sense but it lacks the intent of a composer, and I don't see a lot of people listening to AI generated music for that reason.

If anything it'll help graphic artists to create sketches or ideas they can start from.

I believe founders/companies can and do "cause broad negative economic impacts for their own enrichment", but creating a lower-cost path to the same good/result is a good thing fundamentally. Yes, this can cause greater income/life-experience inequality, and we should adjust for that, but in ways that do not punish innovation. In short, we should optimize for human happiness by better sharing the wealth rather than by limiting it.

One perspective is: anything that can be automated (thus lowered in cost) should be. For drudge-work, of course that's good. For some examples, showing that it can be automated shows that it IS drudge-work. But replacing a creative illustrator? That is not drudge-work, it is a fulfilling and enjoyable profession. I don't think it's clear that changing it to become a hobby (because it's no longer viable as a profession) is "a good thing fundamentally". I would need to hear further arguments on this.

This very quickly gets into "what's the point of it all?" and I'll admit that I don't have the answer. :)

That name though –

DALL·E = Dali + WALL·E

Freaking brilliant.

Was that generated by an AI as well?

I'm actually building a name generator that is as intelligent and creative as that for my senior year thesis (and also for https://www.oneword.domains/)

I already have an MVP that I'm testing out locally but I'd appreciate any ideas on how to make it as smart as possible!

Similar to Wordseye https://www.wordseye.com/

WordsEye seems to be about scene generation out of pre-existing building blocks, whereas DALL-E is about creating those building blocks themselves.

Does anyone have any insight on how much it would cost for OpenAI to host an online, interactive demo of a model like this? I'd expect a lot - even just for inference - based on the size of the model and the expected virality of the demo, but have no reference points for quantifying.

You can edit the prompt to play a bit with it. (the results are far less good than what's featured in the blog post though …)

Now we just have to wait for Hugging Face to create an open source implementation. So much for openness. I guess if you go on Microsoft Azure you can use closed AI.

There was some programming language akin to PovRay (not to raytrace) in order to describe the scene with commands like "place a solid here" and so on.

I can't remember its name.



There are several projects like this, but they can only generate abstract shapes, i.e. they're much lower-level than a natural language caption.

Context Free is incredibly fun but it's not machine learning or even AI. And it's very easy to understand how things go from definitions of shapes, rotations, translations, etc., to the finished image.

The "collection of glasses sitting on a table" example is excellent.

Some pics are of drinking glasses and some are of eye glasses, and one has both.

I also like the telephones from different eras, including the future.

Recently heard a resident machine learning expert describe GPT-3 as 'not revolutionary in the slightest' or something like that.

It's not, but it showed that we can get an order of magnitude better results by adding an order of magnitude more data.

To be honest, it's not where I'd like to see efforts in the field go.

Not because I'm afraid of AI taking over, but because I'd rather have humans recreate something comparable to a human brain (functionality wise).

Who knows, maybe in a few years you will be amazed at the new universal transformer chip that runs on 20W of power and can do almost any task. No need for retraining, just speak to it, show it what you need. Even prompt engineering has been automated (https://arxiv.org/abs/2101.00121) so no more mystery. So much for the new hot job of GPT prompt engineer that would replace software dev.

we've been making these since the beginning of time, we call them humans

Humans want health insurance and 40-hour work weeks. The super-smart AGI that will exist 20 years from now won't.

I was skeptical before, but now I'm open to this idea

It's revolutionary in costs, and delivers for every dollar spent.

I think that they're correct in saying that GPT-3 isn't revolutionary, since it just demonstrates the power of scaling. However, I would argue that the underlying architecture, the Transformer (the T in GPT), is/was/will be revolutionary.

It's not revolutionary, just a typical-but-notable iterative step in NLP. Which is fine!

I wrote a blog post on that a few months ago after playing a bit with GPT-3, and it holds up. https://news.ycombinator.com/item?id=23891226

How long until the Rule 34 perverts get their hands on this and start inputting stuff like "Bobby Fischer Minotaur fucking a lime green Toyota Echo"?

The shipping community will go apeshit if this thing works as advertised.

I remember looking at generated porn pictures from an older model, one that didn't take text inputs, and some pictures were very disturbing because the bodies were impossible or clearly unhealthy.

There is a reason that the examples are cartoon animals or objects. It's not disturbing that the pig teapot is not realistic, or that the dragon cat is missing a leg. This kind of problem is very disturbing in realistic pictures of human bodies.

Eventually it will get there, I guess you could make an AI to filter the pictures to see which of them are disturbing or not.

On the other hand, no matter how misshapen or deformed the body comes out that will be someone's kink.

OpenAI did with DALL·E what I envisioned doing with AiArtist :) (https://www.aiartist.io/)

An AI to provide illustrations to your written content.

https://www.linkedin.com/in/ramsrig/ https://twitter.com/ramsri_goutham

i want to see it go into an infinite loop with "image recognition software" (the kind where you feed in an image and get back a written description of it)

I believe it will end up stabilizing on one image, or on a sequence of images whose descriptions regenerate the same images
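To make the "stabilizes on one image or a sequence" guess concrete, here is a minimal sketch of the loop as cycle detection. Everything in it is an assumption for illustration: `toy_generate_image` and `toy_caption_image` are hypothetical, deterministic stand-ins (plain string transforms), not DALL·E or any real captioning model. Real models sample randomly, so an exact repeat like this is not guaranteed.

```python
from typing import Callable, Dict, List, Tuple


def find_cycle(start: str, step: Callable[[str], str],
               max_steps: int = 1000) -> Tuple[int, List[str]]:
    """Apply step() repeatedly from start until a caption repeats.

    Returns (number of captions seen before the cycle begins,
             the captions that make up the cycle).
    """
    seen: Dict[str, int] = {}
    history: List[str] = []
    caption = start
    for i in range(max_steps):
        if caption in seen:
            first = seen[caption]
            return first, history[first:]
        seen[caption] = i
        history.append(caption)
        caption = step(caption)
    raise RuntimeError("no repeated caption within max_steps")


def toy_generate_image(caption: str) -> str:
    # Hypothetical stand-in for text-to-image: the "image" is just the
    # sorted set of lowercase words in the caption.
    return " ".join(sorted(set(caption.lower().split())))


def toy_caption_image(image: str) -> str:
    # Hypothetical stand-in for image-to-text: a lossy description that
    # drops any word longer than six characters.
    return " ".join(w for w in image.split() if len(w) <= 6)


def round_trip(caption: str) -> str:
    return toy_caption_image(toy_generate_image(caption))


prefix_len, cycle = find_cycle(
    "a baby daikon radish in a tutu walking a dog", round_trip)
print(prefix_len, cycle)
```

With these toy functions the loop settles after one step onto a single caption that maps to itself, i.e. the "one image" case. A cycle of length greater than one would correspond to the "sequence of images" case.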

RIP to all the fiverr artists out there.

This is impressive.

Given ClosedAI's recent moves, I doubt the public will ever have access to this. So I think those artists will be just fine.

You must be using a very short definition of "ever". These kinds of works will get replicated if they're not published.

If I put text into this tool and generate an original and unique image, who owns that image? If it's OpenAI, do they license it?

I'm wondering why the image comes out non-blocky, because transformers would take slices of the image as input. They say they have about 1024 tokens for the image, and that would mean a 32x32 grid of patches. How is it possible that these patches align along the edges so well and don't have JPEG-like artifacts?

If you read footnote #2, the source images are 256x256 but downsampled using a VAE, and presumably upsampled using a VAE for publishing (IIRC they are less prone to the infamous GAN artifacts).

I know they use VQ-VAE under the transformer, but that would generate one symbol per 8x8 box. When you tile them up they should have some mosaic artifacts along the edges, if they generate these patches independently.
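The arithmetic behind the "32x32 patches" and "8x8 box" figures can be checked in a few lines. This is only a sketch using the numbers quoted in this thread (256x256 images, about 1024 image tokens) and assuming a square token grid. `token_grid` is a made-up helper name, not anything from OpenAI.

```python
import math
from typing import Tuple

# Numbers quoted in the thread; the square grid layout is an assumption.
IMAGE_SIDE = 256
NUM_IMAGE_TOKENS = 1024


def token_grid(image_side: int, num_tokens: int) -> Tuple[int, int]:
    """Return (tokens per side, pixels per side covered by one token)."""
    grid_side = math.isqrt(num_tokens)
    if grid_side * grid_side != num_tokens:
        raise ValueError("num_tokens must be a perfect square")
    if image_side % grid_side != 0:
        raise ValueError("image_side must be divisible by the grid side")
    return grid_side, image_side // grid_side


grid_side, patch_side = token_grid(IMAGE_SIDE, NUM_IMAGE_TOKENS)
print(grid_side, patch_side)  # 32 8
```

So 1024 tokens laid out as a 32x32 grid over a 256x256 image means each token accounts for an 8x8 pixel region. That says nothing about how the decoder blends neighbouring regions, which is the open question here.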

I tried a working demo of a system like this in Kristian Hammond’s lab at Northwestern University 20 years ago. Actually, his system was generating MTV-style videos from web images and music with just some keywords as input. He had some other amazing (at the time) stuff. The GPT-3 part of this gives it a lot more potential of course, so I don’t want to take away from that. Just saying though, since they didn’t reference Hammond’s work, that this party has been going on for a while.


Results are spectacular, but as always and especially with OpenAI announcements one should be very cautious (lessons learned from GPT3). I hope that the model is not doing advanced patchwork/stitching of training images. I think that this kind of test should be included when measuring the performance of any generative model. Unfortunately the paper is not yet released, the model is not available and no online demo has been published. Recently a team of researchers discovered that in some cases advanced generative text models copy a large chunk of training data-set text ...

> I hope that the model is not doing advanced patchwork/stitching of training images

It would still be impressive that it knows where to include hands, a Christmas sweater or a unicycle.

Does it allow refining the result over iterations? Meaning, after getting version one, can you apply more words to refine the image toward a closer description? Because if it does, then this can become a very good tool for getting a reliable picture from a witness asked to describe a suspect. Combine this with China's existing mass-surveillance face recognition and you can locate the suspect (or the political dissident) as fast as you get your witness in front of a laptop running this software.

It's a tool, and like any other existing tool it will be used for both bad and good.

The real power here is a Google destroyer.

These AIs can't yet produce things of value to humans, but I doubt Google's AI could know that.

Pump out billions of pages of text and pictures and it should swamp Google.

I just had a creepy thought about what genetic engineering "GEN-E" could bring in a couple of decades :(

IN: "give me living giraffe turtle"

OUT: a few weeks later a chimera crawls out of the AI lab box

this is incredible but I can't help but feel like we're skipping some important steps by working with plain text and bitmaps - "a collection of glasses sitting on a table" is sometimes eyeglasses, sometimes drinking glasses, sometimes a weird amalgamation. and as long as we're OK with ambiguity in every layer, are we really ever going to be able to meaningfully interrogate the reasons behind a particular output?

The results highlighted in the blog post are incredible, unfortunately, they are also cherry-picked: I've played with the prompt a bit, and every result (not involving a drawing) was disappointing…

I may have not been so disappointed if they had not highlighted such incredible results in the first place. Managing expectations is tough.

This is super impressive!! Those generated images are quite accurate and realistic. Here are some of my thoughts, and an explanation of how they use a discrete vocabulary to describe an image. https://youtu.be/UfAE-1vdj_E

This is incredible. Such technology in an RPG adventure game would open up a new genre of exploration games!

I can't find a source for the dataset but going by the hints peppered throughout the article, they likely used <img> `alt` tags for supervision? Fascinating that an accessibility tool is being repurposed to train machine learning models.
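For what it's worth, collecting (image, alt text) pairs from a page is simple, which is part of why the guess seems plausible. A minimal sketch using only the Python standard library follows. It assumes the alt-tag theory, which the post does not confirm, and the `AltTextCollector` class and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that have non-empty alt text."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # alt may be missing, empty, or valueless (None); skip all of those.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


# Made-up sample page for illustration only.
sample_html = """
<p>Kitchen ideas</p>
<img src="chair.jpg" alt="an armchair in the shape of an avocado">
<img src="spacer.gif" alt="">
<img src="logo.png">
"""

collector = AltTextCollector()
collector.feed(sample_html)
print(collector.pairs)
```

Only the first image survives, since the other two have empty or missing alt text. A real pipeline would also need heavy filtering, because a lot of alt text in the wild is boilerplate like file names.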

A dog

Just yesterday I was joking with my coworker that I would like a tool where I could create meme images from a text description and today I open HN and here we are. This looks amazing!

That’s a delightful result you all; and beautifully explored, too!

Without seeing the source, they could be mainly using Google image search for that (or Yandex which is much better for image search and reverse image search).

I wonder what it makes out of green ideas sleep furiously.

> an illustration of a baby daikon radish in a tutu walking a dog


I hope the GPT-3 dungeon has a great update incoming

Maybe I'm missing something but does it say what library of images was used to train this model? I couldn't quite understand the process of building DALL-E. Did they have a large database of labeled images and they combined this database with GPT-3?

Very impressive results.

This seems like it could be a great replacement for searching/creating your own stock photo/images.

Hopefully all output is copyright friendly.

Amazing. Would love to play with this.

Is OpenAI going to offer this as a closed paywalled service? Once again wondering how the “open” comes into play.

After their new CEO came in, former president of YC, they closed off everything and took a lot of investment. Only thing that's open about them is the name.

If I decide to make one of those exact chairs in the shape of an avocado, can I be sued for copyright infringement?

Depends on who is suing you: OpenAI for using their model results, or the owner(s) of the data their model was trained on? Either way, it's a grey zone that copyright law hasn't come to grips with yet. See https://scholarlykitchen.sspnet.org/2020/02/12/if-my-ai-wrot...

I wonder what happens when you ask it for an impossible object, e.g. a square triangle?
