Then I found these in Videos tab. Apparently there’s a 10-20 year old manga/merch/anime franchise of walking and talking daikon radish characters.
So the daikon part was already represented in the dataset. The AI picked up on the prior art and combined it with the dog part, which is still tremendous, but maybe not “figuring out the daikon walking part on its own” tremendous.
(btw does anyone know how best to refer to the anime art style in Japanese? It’s a bit of a mystery to me)
The term mangachikku (漫画チック, マンガチック, "manga-tic") is sometimes used to refer to the art style typical of manga and anime; it can also refer to exaggerated, caricatured depictions in general. Perhaps anime fū irasuto (アニメ風イラスト, anime-style illustration), while a less colorful expression, would be closer to what you're looking for.
But it does not seem to 'understand' anything, as some other commenters have claimed. Try '4 glasses on a table' and you will rarely see 4 glasses, even though that is a very well-defined input. I would be more impressed with the language model if it could handle a prompt like: "A teapot that does not look like the image prompt."
I think some of these examples trigger a kind of bias, where we think: "Oh wow, that armchair does look like an avocado!" But morphing an armchair and an avocado will almost always look like both, because they have similar shapes. And it does not 'understand' what you called 'object concepts'; otherwise it would not produce armchairs you clearly cannot sit in because of the avocado stone (or the stem, in the flower-related 'armchairs').
Slightly? Jesus, you guys are hard to please.
What I meant is that 'not' is in principle an easy keyword to implement 'conservatively'. But yes, having this in a language model has proven to be very hard.
Edit: Can I ask, what do you find impressive about the language model?
Sure, it would be, but that is not what's happening here.
And yes, rest assured, the rest of the world is probably less 'blasé' than I am :) That much is evident from the hype around GPT-3.
I really find it hard to understand why people are optimistic about the impact AI will have on our future.
The pace of improvement in AI has been really fast over the last two decades, and I don't feel like it's a good thing. Compare the best text generator models from 10 years ago with GPT-3. Now do the same for image generators. Now project these improvements 20 years into the future. The amount of investment this work is getting grows with every such breakthrough. It seems likely to me we will figure out general-purpose human-level AI in a few decades.
And what then? There are so many ways this could turn into a dystopian future.
Imagine for example huge mostly-ML operated drone armies, tens of millions strong, that only need a small number of humans to supervise them. Terrified yet? What happens to democracy when power doesn't need to flow through a large number of people? When a dozen people and a few million armed drones can oppress a hundred million people?
If there's even a 5% chance of such an outcome (personally I think it's higher), then we should be taking it seriously.
The only issue I see here is that government will need to take a hand in mitigating capitalistic wealth inequality, and access to creative tools will need to be subsidized for low income individuals (assuming we can't bring the compute cost down a few orders of magnitude).
And secondly, you make the same mistake as those who said that after automation people would have nothing to do. Incorrect: people will have discovered a million new things to do and will get busy at them. Ninety percent of people used to work in agriculture and now just 2% do, but we're OK.
When AI becomes better than us at what we call art now, we'll have already switched to post-AI-art, and it will be so great we won't weep for the old days. Maybe the focus will switch from creating to finding art, from performing to appreciating and developing a taste, from consuming to participating in art. We'll still do art.
Even if it's 0.1% we should be taking it very seriously, given the magnitude of the negative outcome. In expected value terms it's large. And that's not a Pascal's mugging given the logical plausibility of the proposed mechanism.
At least the rhetoric of Sam Altman and Demis Hassabis suggests that they do take these concerns seriously, which is good. However there are far too many industry figures who shrug off and even ridicule the idea that there's a possible threat on the medium-term horizon.
"The singularity is _always near_". We've been here before (1950s-1970s); people hoping/fearing that general AI was just around the corner.
I might be severely outdated on this, but the way I see it, AI is just rehashing existing knowledge/information in (very, and increasingly) smart ways. There is absolutely no spark of creativity coming from the AI itself. Any "new" information generated by AI is really just refined noise.
Don't get me wrong, I'm not trying to take a leak on the field. Like everyone else I'm impressed by all the recent breakthroughs, and of course something like GPT is infinitely more advanced than a simple `rand` function. But the ontology remains unchanged; we're just doing an extremely opinionated, advanced and clever `rand` function.
About a decade ago I trained a model on Wikipedia, tuned to classify documents by which branch of knowledge they could belong to. Then I fed in one of my own blog posts. The second-highest-ranking concept that came back was "mereology", a term I had never even heard of and one that was quite apt for the topic I was discussing in the post.
My own software, running on the contents of millions of authors' work and ingesting my own blog post, taught me, the orchestrator of the process, about my own work. This feedback loop is accelerating, and just because the irrefutable takes decades to arrive doesn't mean it never will. People in the early 1940s said atomic weapons would never happen because it would be too difficult. For some people nothing short of seeing is believing, but those with predictive minds know that this truly is just around the corner.
I absolutely believe we'll crack the fundamental principles of intelligence in our lifetimes. We now have the capability to process all the public data available on the internet (of which Wikipedia is a huge chunk). We have so many cameras and microphones (one in each pocket).
It's also scary to think about what happens if it goes wrong (a candidate Great Filter for the Fermi paradox). However, I'm optimistic.
The brain uses only 20 watts of power to do all its magic. The entire human body is built from about 700MB of data in the DNA. The fundamental principles of intelligence are within reach if we look at it from that perspective.
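For what it's worth, that 700MB figure checks out as a back-of-envelope calculation (a rough sketch; it ignores that the genome is highly repetitive and compresses further):

    # ~3.1 billion base pairs, 4 possible bases (A, C, G, T) -> 2 bits each
    base_pairs = 3.1e9
    total_mb = base_pairs * 2 / 8 / 1e6
    print(f"{total_mb:.0f} MB")  # ~775 MB raw, the same ballpark as the 700MB figure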
Right now GPT-3 and DALL-E seem to use an insane amount of computation to achieve what they do. My prediction is that by 2050 we'll have pretty good intelligence in our phones, with deep understanding (language and visual) of the world around us.
I'm hearing astonishing numbers, in the tens of megawatts range for training these billion-parameter models.
And I wish they showed us all the rejected images. If those images (like the snail harp) were the FIRST pass of the release-candidate model... wow. But how much curating did they do?
EDIT: Units. Derp.
do you mean tera-joules?
A hundred megajoules is about three bucks at 10 cents per kwh.
I routinely do giga-joule level computations using just a rack of computers in my garage, they're no big deal.
Metawatt is a unit for rates of speculation, uninformed by multiplication, about AI energy usage.
But for how long? 1 second, 1 hour, 1 month? The energy matters more than the power.
The brain uses 20W of power. Over a lifetime of ~80 years, that is 14MWh of energy. If we say the brain trains for its first 25 years, that's 4.38MWh; the equivalent electricity cost is only about $438.
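A quick sanity check of those numbers (a minimal sketch; ignores leap years):

    watts = 20
    hours_per_year = 24 * 365
    lifetime_mwh = watts * hours_per_year * 80 / 1e6   # -> ~14.0 MWh over 80 years
    training_mwh = watts * hours_per_year * 25 / 1e6   # -> ~4.38 MWh over 25 years
    cost_usd = training_mwh * 1000 * 0.10              # at $0.10/kWh -> ~$438
    print(lifetime_mwh, training_mwh, cost_usd)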
So yeah, the brain is quite efficient both in hardware and software.
That being said, these models are training only for very specific tasks whereas obviously the human brain is far more sophisticated in terms of its capabilities.
DNNs will definitely not get us there in their current form.
They are getting pretty good; people already have to "try" a little to find examples where GPT-3 or DALL-E fail. Give it a few more billion parameters and more training data, and GPT-10 might still be as dumb as GPT-3, but it'll be impossible (or irrelevant) to prove.
I think this notion is misleading. The size of the data says nothing about the ease of simulating it on our current computers. You'd need a quantum computer to emulate this "ROM" in anything like realtime.
The DNA program was optimized to execute in an environment offering quantum tunnelling, multi-component chemical reactions, etc.
In that sense the architecture of intelligence isn’t very complicated. The uniformity of the isocortex, of which we have more relative to brain size than any other animal, suggests we ought to be able to replicate its behavior in a machine.
The isocortex/neocortex is where the gold is. It’s very uniform when seen under the microscope, and brain cells from one region can be put in another region and work just fine. All of this suggests intelligence is some recursive architecture of information processing. That’s why I’m optimistic we’ll crack it.
If your plan is to build intelligence by copying how nature does it, then you'll need to build a "nature" runtime that can emulate the universe. You can do that either slowly or inaccurately.
I tend to agree. However, this looks a lot like the beginning of the end for the human race as well. Perhaps we really are just statistical approximation devices.
I also believe humans in our current species form won’t become a space-faring species. We’re pretty awful as space travelers.
It is very likely that we’ll have robots with human-like intelligence and sensory and motor capabilities sent as probes to other planets, to explore and carry on the human story.
But the future is hard to predict. I do know that if the intelligence algorithm ends up only in the hands of Google and Facebook, we are doomed. This is the kind of thing that ought to be open source and equally beneficial to everyone.
> We recognize that work involving generative models has the potential for significant, broad societal impacts
The community did rise to the challenge of re-implementing it (sometimes better) in the past, so I'm hopeful.
Delaying release is to give others (most clearly social media) time to adjust and ensure safety within their own platforms/institutions (of which they are the arbiters). It also gives researchers and entrepreneurs a strong motivation of "we have to solve these risk points before this technology starts being used". While there are clearly incentive issues and gatekeeping in the research/startup community, this is a form of decentralized decision-making.
I don't see a strong case for why access should be open-sourced at announcement time, especially if it's reproducible. Issues will arise when this kind of tech costs billions of dollars to train, making it impossible to reproduce for 99.99% of labs/users. At that point, OpenAI will have sole ownership of and discretion over their tech, which is an extremely dangerous world. GPT-3 is the first omen of this.
EDIT: I'm assuming you mean for inference; for training it would be another kind of challenge, and the answer would be a clear no.
Broke: Use a text encoder to feed text data to an image generator, like a GAN.
Woke: Feed text and image tokens to the same model as one input stream and decode text and images as one output stream.
And yet, due to the magic of Transformers, it works.
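For anyone who hasn't read the paper: the "woke" version amounts to concatenating the caption's BPE tokens and the image's discrete dVAE codes into a single stream and doing ordinary next-token prediction over it. A toy illustration with dummy token IDs (the vocab-offset trick is one common way to share an embedding table, not necessarily OpenAI's exact implementation):

    TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192    # vocab sizes from the DALL-E paper
    text_tokens = [17, 512, 99]              # caption BPE codes (up to 256 of them)
    image_tokens = [3, 1407, 77, 4091]       # dVAE codebook indices (32*32 = 1024 of them)
    stream = text_tokens + [TEXT_VOCAB + t for t in image_tokens]
    # One autoregressive transformer predicts the next token of `stream`;
    # at sampling time you feed only the text and let it emit image codes.
    print(stream)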
From the technical description, this seems feasible to clone given a sufficiently robust dataset of images, although the scope of the demo output implies a much more robust dataset than the ones Microsoft has offered publicly.
At some point we'll have so many models based on so many other models that it will no longer be possible to tell which techniques are really involved.
"A photo of a iPhone from the stone age."
"Adolf Hitler pissing against the wind and enjoying it."
"Painting: Captain Jean-Luc Picard crossing of the Delaware River in a Porsche 911".
I'm curious whether they do a backward pass here; it would probably have value. They seem to describe putting the text tokens first, meaning that once you start generating image tokens, all the text tokens are visible. That has the model learning to generate an image with respect to a prompt, but you could also literally reverse the order of the sequence to have the model also learn to generate prompts with respect to the image. It's not clear if this is happening.
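Concretely, the reversed objective would just be a second ordering of the same stream (a toy sketch with dummy IDs; nothing in the post confirms they train this direction):

    text_tokens = [17, 512, 99]             # caption BPE codes
    image_tokens = [3, 1407, 77]            # dVAE image codes
    forward = text_tokens + image_tokens    # learns: image given caption
    backward = image_tokens + text_tokens   # would learn: caption given image
    print(forward, backward)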
> Similar to the rejection sampling used in VQVAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search, and can have a dramatic impact on sample quality.
> CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
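Both quoted tricks boil down to the same cosine-similarity score between CLIP's text and image embeddings, just pointed in different directions. A rough sketch with random stand-in embeddings (the real thing uses CLIP's trained encoders, not these placeholders):

    import numpy as np

    def clip_score(text_vecs, image_vecs):
        # Cosine similarity between (stand-in) text and image embeddings.
        t = text_vecs / np.linalg.norm(text_vecs, axis=-1, keepdims=True)
        v = image_vecs / np.linalg.norm(image_vecs, axis=-1, keepdims=True)
        return t @ v.T

    # Reranking: one caption against 512 generated samples, keep the top 32.
    caption = np.random.randn(1, 64)
    samples = np.random.randn(512, 64)
    top32 = np.argsort(-clip_score(caption, samples)[0])[:32]

    # Zero-shot classification: one image against a caption per class, take the best.
    class_captions = np.random.randn(10, 64)   # "a photo of a dog", "a photo of a cat", ...
    image = np.random.randn(1, 64)
    predicted_class = int(np.argmax(clip_score(class_captions, image)))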
A network optimized for both use cases (e.g. a training set that is half 256 text + 1024 image tokens and half 1024 + 256) would likely be worse than a model optimized for just one of them, but then again, models like T5 argue against that.
>>> DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion.
That could be human bias recognizing features the generator yields implicitly. Most of the images look like "masking" or "decal" operations rather than a full style transfer. In other words, the expected outcome of "soap dispenser in the shape of a hibiscus" would resemble a truly hybridized design, like an haute couture bottle of eau de toilette made to resemble rose petals.
The name DALL-E is terrific though!
Another good example is the "collection of glasses" on the table. It makes both glassware and eyeglasses!
With the ability to construct complex 3D scenes, surely the next step would be for it to ingest YouTube videos or TV/movies and be able to render entire scenes based on a written narration and dialogue.
The results would likely be uncanny or absurd without careful human editorial control, but it could lead to some interesting short films, or fan-recreations of existing films.
If you are talking about 24 frames per second, then theoretically one second of video could require 24 times as much processing power as a single image, and 100 seconds, 2,400 times. Obviously that's just a rough guess, but surely it is much more than for individual images.
But I'm sure we'll get there.
You try drawing a snail made of harp! Seriously! DALL-E did an incredible job
I can't believe it. How does it put the baby daikon radish in the tutu?
The defining feature of machine learning in other words is that the machine constructs a hypersurface in a very-high-dimensional space based on the samples that it sees, and then extrapolates along the surface for new queries. Whereas you can explain features of why the hypersurface is shaped the way it is, the machine learning algorithm essentially just tries to match its shape well, and intentionally does not try to extract reasons "why" that shape "has to be" the way it is. It is a correlator, not a causalator.
If you had something bigger you'd call it "artificial intelligence research" or something. Machine learning is precisely the subset right now that is focused on “this whole semantic mapping thing that characterized historical AI research programs—figure out amazing strategy so that you need very little compute—did not bear fruit fast enough compared to the exponential increases in computing budget so let us instead see what we can do with tons of compute and vastly less strategy.” It is a deliberate reorientation with some good and some bad parts to it. (Practical! Real results now! Who cares whether it “really knows” what it’s doing? But also, you peer inside the black box and the numbers are quite inscrutable; and also, adversarial approaches routinely can train another network to find regions of the hyperplane where an obvious photograph of a lion is mischaracterized as a leprechaun or whatever.)
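The "shape-matching without reasons" point shows up even in toy curve fitting: the fitted surface is excellent wherever it saw samples and confidently wrong outside them. A minimal illustration (nothing to do with DALL·E's internals):

    import numpy as np

    # Fit a degree-9 polynomial "hypersurface" to samples of sin(x) on [0, 6].
    x = np.linspace(0, 6, 50)
    coeffs = np.polyfit(x, np.sin(x), deg=9)
    print(np.polyval(coeffs, 3.0), np.sin(3.0))    # interpolation: nearly exact
    print(np.polyval(coeffs, 12.0), np.sin(12.0))  # extrapolation: wildly wrong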
One method for example is occlusion, removing pieces of input to assemble statistical representations of which parts your model cares about.
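A bare-bones version of occlusion analysis looks like this (`model` is any image classifier returning a scalar probability; a hypothetical stand-in, and the sketch assumes dimensions divisible by the patch size):

    import numpy as np

    def occlusion_map(model, image, patch=8):
        # Slide a gray patch over the image; large drops in the model's
        # output mark the regions the model "cares about".
        base = model(image)
        h, w = image.shape[:2]
        heat = np.zeros((h // patch, w // patch))
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                occluded = image.copy()
                occluded[i:i + patch, j:j + patch] = 0.5  # neutral gray
                heat[i // patch, j // patch] = base - model(occluded)
        return heat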
It's all still baby steps, but with time the theory will catch up.
> If you had something bigger you'd call it "artificial intelligence research"
It's usually just called data science, and it does deal with that (we had lectures on interpretability of models at university).
Donald Trump is Nancy Pelosi's and AOC's step-brother in a three-way in the Lincoln Bedroom.
If the AI can actually draw an image of a green block on a red block, and vice versa, then it clearly understands something about the concepts "red", "green", "block", and "on".
A human can learn basic arithmetic, then generalize those principles to bigger-number arithmetic, then go from there to algebra, then calculus, and so on, successively building on previously learned concepts in a fully recursive manner. Transformers are limited by the exponential size of their network. So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.
DALL-E is an incredible achievement, but doesn't really do anything to change this fact. GPT-3 can have an excellent understanding of a finite sized concept space, yet it's still architecturally limited at building recursive abstractions. So maybe it can understand "green block on a red block". But try to give it something like "a 32x16 checkerboard of green and red blocks surrounded by a gold border frame studded with blue triangles". I guarantee the architecture can't get that exactly correct.
The point is that, in some sense, GPT-3 is a technical dead-end. We've had to exponentially scale up the size of the network (12B parameters) to make the same complexity gains that humans make with linear training. The fact that we've managed to push it this far is an incredible technical achievement, but it's pretty clear that we're still missing something fundamental.
This is false: GPT-3 can do 10-digit addition with ~60% accuracy when given comma separators. Without BPEs it would doubtless manage much better.
With multiplication, which requires much more extensive cross-column interaction, accuracy falls off a cliff with anything more than a few digits.
Again, not at all true due to BPEs.
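Easy to see for yourself with the GPT-2 tokenizer (GPT-3 uses the same BPE vocabulary): long numbers get chopped into irregular multi-digit chunks, while comma grouping yields a much more regular pattern (the exact splits vary by number):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    print(tok.tokenize("542983"))    # irregular chunks, e.g. something like ['542', '983']
    print(tok.tokenize("542,983"))   # commas tend to force consistent 3-digit groups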
> With multiplication, which requires much more extensive cross-column interaction, accuracy falls off a cliff with anything more than a few digits.
You couldn't learn long multiplication if you had to use BPEs, were never told how BPEs worked or corresponded to sane encodings, were basically never shown how to do multiplication, and were forced to do it without any working out.
Quick, what's 542983 * 39486? No writing anything down, you have to output the numbers in order, and a single wrong digit is a fail. (That's easy mode, I won't even bother asking you to do BPEs.)
ML models can learn multiplication, obviously they can learn multiplication, they just can't do it in this absurd adversarial context. GPT-f was doing Metamath proofs on 9-digit division (again, a vastly harder context, they involve ~10k proof steps) with 50-50 accuracy, and we have a toy proof of concept for straight multiplication.
That sounds disappointing, but what if, instead of trying to teach it to do addition, one taught it to write source code for addition and other math operations?
Then you can ask it to solve a problem but instead of it giving you the answer it would give you source code for finding the answer.
So, for example, you ask it “what is the square root of five?” and it responds:
fn main() { println!("{}", 5f64.sqrt()); }
But 'understanding' itself needs to be specified further in order even to be tested.
What strikes me most is the fidelity of the generated images, matching the SOTA from the GAN literature with much more variety, without using the GAN objective.
It seems the Transformer might be the best neural construct we have right now for learning any distribution, given enough data.
According to your definition of understanding, this program understands something about the concept RED. But the code is just dealing with arbitrary values in memory (e.g. RED = 0xFF0000)
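To make that concrete, here's a complete program that passes the grandparent's test, drawing a green block sitting on a red block, while clearly 'understanding' nothing; the colors are just tuples of integers (a toy sketch that writes a PPM image, no libraries needed):

    RED, GREEN = (255, 0, 0), (0, 255, 0)
    W, H = 64, 128
    # Top half green, bottom half red: "a green block on a red block".
    pixels = [GREEN if y < H // 2 else RED for y in range(H) for x in range(W)]
    with open("blocks.ppm", "w") as f:
        f.write(f"P3 {W} {H} 255\n")
        f.write("\n".join(" ".join(map(str, p)) for p in pixels))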
It looks like a variation on a plain old image search engine, and an unreliable one at that, compared to exact matching.
But it has an obvious application in design, as it can create these interesting combinations of objects and styles. And I loved the snail-harp.
Prompt: a Windows GUI executable that implements a scientific calculator.
You'll still need humans to make anything novel or interesting, and companies will still need to hire engineers to work on valuable problems that are unique to their business.
All of these transformers are essentially trained on "what's visible to Google", which also defines the upper bound of their utility.
Give it 10 years :) GPT-10 will probably be able to replace a sizeable proportion of today's programmers. What will GPT-20 be able to do?
Strongly recommend watching the whole video!
At this point I have replaced a significant amount of creative work with AI for personal use, for example:
- I use desktop backgrounds generated by VAEs (VD-VAE)
- I use avatars generated by GANs (StyleGAN, BigGAN)
- I use and have fun with written content generated by transformers (GPT3)
- I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
- I don't purchase stock images or commission artists for many things I previously would have, when a GAN already exists that makes the class of image I want
All of this has happened in the last year or so for me, and I expect that within a few more years this will be the case for vastly more people and in a growing number of domains.
> - I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
Really, you've "replaced" normal music and books with these? Somehow I doubt that.
It’s like efficient market hypothesis: markets are efficient because arbitrage, which is highly profitable, makes them so. But if they are efficient, how can arbitrageurs afford to stay in business? In practice, we are stuck in a half-way house, where markets are very, but not perfectly, efficient.
I guess in practice, the pie for humans will keep on shrinking, but won’t disappear too soon. Same as horse maintenance industry, farming and manufacturing, domestic work etc. Humans are still needed there, just a lot less of them.
The vast majority of human-generated content is not very novel or creative. I'm guessing less than 1% of professional human writers or composers create something original. Those people are not in any danger of being replaced by AI, and will probably earn more money as a result of more value being placed on originality of content. Humans will strive (or be forced) to be more creative, because all non-original content creation will be automated. It's a win-win situation.
I think AlphaGo was a great in-domain example of this. I definitely see things I'd colloquially call 'creativity' in this DALL-E post, but you can decide for yourself; even so, that isn't claiming it matches what some humans can do.
If I train an AI on classical paintings, can it ever invent Impressionism, Cubism, Surrealism? Can it do irony? Can it come up with something altogether new? Can it do meta? “AlphaPaint, a recursive self-portrait”?
Maybe. I’m just not sure we have seen anything in this dimension yet.
I see your point, but it's an unfair comparison: if you put a human in a room and never showed them anything except classical paintings, it's unlikely they would quickly invent cubism either. The humans that invented new art styles had seen so many things throughout their life that they had a lot of data to go off of. Regardless, I think we can do enough neural style transfer already to invent new styles of art though.
Most arbitrageurs cannot stay in business; it's the law of diminishing returns. Economies of scale eventually prevent small individual players from profiting from the market. Only a few big-ass hedge funds can stay, because thanks to their investments they can get preferential treatment from exchanges (significantly lower / zero / negative fees, co-located hardware, etc.), which makes the operation reasonable for them. With enough money you can even build your own physical cables between exchanges to outperform competitors in latency games. I'm a former arbitrageur, by the way :)
Same with AI-generated content. You would have to be absolutely brilliant to compete with AI. Only a few select individuals would be "allowed" to enter the market. I'm not even sure it has much to do with the quality of the content; maybe it's more about prestige.
You see, there are already gazillions of decent human artists, but only a few of them are really popular. So the top-tier artists would probably remain human, because we need someone real to worship. Their producers would surely use AI as a production tool while depicting the result as human work. But all the low-tier artists would be totally pushed out of the market. There would simply be no job for a session musician or a freelance designer.
The demo inputs here for DALL-E are curated and utilize a few GPT-3 prompt engineering tricks. I suspect that for typical unoptimized human requests, DALL-E will go off the rails.
I want the stuff that no human being could have made - not the things that could pass for genuine works by real people.
Unfortunately many generations fail to hit that.
But there is still a lot of room for more clever architectures to get around that limitation. (e.g. Shortformer)
Frankly, I think the "AI will replace jobs that require X" angle of automation is borderline apocalyptic conspiracy porn. It's always phrased as if the automation simply stops at making certain jobs redundant. It's never phrased as if the automation lowers the bar to entry from X to Y for /everyone/, which floods the market with crap and makes people crave the good stuff made by the top 20%. Why isn't it considered as likely that this kind of technology will simply make the best 20% of creators exponentially more creatively prolific in quantity and quality?
I think that's well within the space of reasonable conclusions. For as much as we are getting good at generating content/art, we are also therefore getting good at assisting humans at generating it, so it's possible that pathway ends up becoming much more common.
Couldn't any creator of images that a model was trained on sue for copyright infringement?
Or do great artists really just steal (just at a massive scale)?
> Models in general are generally considered “transformative works” and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whomever created it. Hence, Nvidia has copyright on the models it created but I have copyright under the models I trained (which I release under CC-0).

Source (scroll up slightly past where it takes you): https://www.gwern.net/Faces#copyright
If I wrote a copyrighted text in a book, then printed a million other texts on top of it, in both white and black, mixing it all up until it looked like white noise, would the original authors have a claim?
Worse, sometimes the input data is illegal to distribute for other reasons than copyright.
But the profession for creative individuals consists of much more than highly-paid well-credentialed individuals working at well-known US corporations. There are millions of artists that just do quick illustrations, logos, sketches, and so on, on a variety of services, and they will be replaced far before Pixar is.
I wouldn't say many of those things are creativity-driven. They are more like automatic asset generation.
One use case for such a model would be in the gaming industry, to generate large amounts of assets quickly. That process alone takes years, and gets more and more expensive as gamers demand higher and higher resolutions.
AI can make this process much more tenable and bring down the overall cost.
Do you have a GPT-3 key?
Seeing the "lovestruck cup of boba" reminded me of an illustration a friend of mine did for a startup a few years back. It would be a lot easier and less time consuming for someone to simply request such an image from an AI assistant. If I were a graphic artist or photographer, this would scare me.
I don't know what the right answer is here. I have little to no faith in regulators to help society deal with the sweeping negative effects even one new AI-based product looks like it could have on a large swath of the economy. Short of regulation and social safety nets, I wonder if society will eventually step up and hold founders and companies accountable when they cause broad negative economic impacts for their own enrichment.
It's good by the standards of machine-generated images, but it's not comparable to the work of an artist, because it has no intent and in many ways is still incoherent, lacking detail, composition, motive and so on. It's like ML-generated music: it sounds okay in a technical sense, but it lacks the intent of a composer, and I don't see a lot of people listening to AI-generated music for that reason.
If anything, it'll help graphic artists create sketches or ideas they can start from.
DALL·E = Dali + WALL·E
Was that generated by an AI as well?
I'm actually building a name generator that is as intelligent and creative as that for my senior year thesis (and also for https://www.oneword.domains/)
I already have an MVP that I'm testing out locally but I'd appreciate any ideas on how to make it as smart as possible!
I can't remember its name.
Some pics are of drinking glasses and some are of eye glasses, and one has both.
To be honest, it's not where I'd like to see efforts in the field go.
Not because I'm afraid of AI taking over, but because I'd rather have humans recreate something comparable to a human brain (functionality wise).
I wrote a blog post on that a few months ago after playing a bit with GPT-3, and it holds up. https://news.ycombinator.com/item?id=23891226
The shipping community will go apeshit if this thing works as advertised.
There is a reason the examples are cartoon animals or objects. It's not disturbing that the pig teapot is unrealistic, or that the dragon cat is missing a leg; that kind of problem is very disturbing in realistic pictures of human bodies.
Eventually it will get there. I guess you could make an AI to filter out the disturbing pictures.
An AI to provide illustrations to your written content.
This is impressive.
It would still be impressive that it knows where to include the hands, the Christmas sweater, or the unicycle.
It's a tool, and like any other existing tool it will be used for both bad and good.
These AIs can't yet produce things of value to humans, but I doubt Google's AI could know that.
Pump out billions of pages of text and pictures and it would swamp Google.
IN: "give me living giraffe turtle"
OUT: a few weeks later himera crawls out of the AI lab box
I might not have been so disappointed if they hadn't highlighted such incredible results in the first place. Managing expectations is tough.
This seems like it could be a great replacement for searching/creating your own stock photo/images.
Hopefully all output is copyright friendly.
Is OpenAI going to offer this as a closed paywalled service? Once again wondering how the “open” comes into play.