We're still in the first innings of this stuff. Zero-shot/CLIP work will extend to audio, music, music videos and perhaps full movies. I would love to see:
"The Phantom Menace with Jar Jar Binks replaced by leonardo dicaprio using the voice of James Earl Jones"
Take it further. I want to see "an entirely new Star Wars movie directed by the Coen brothers where between scenes of relationship dysfunction, reality is revealed to be a Matrix-like prison owned and exploited by Disney and policed by the Jedi."
WTF, after waiting about two minutes for your site to load, all I could tell from your product page is that I could somehow spend $1,299.00 on whatever it was.
DALL-E was amazing in that it showed good image generation worked at all, but nothing else actually uses its model architecture; even craiyon/dalle-mini, which is popular right now, is quite different.
That said, I have access to DALL-E 2 and I have to say it's not as great as it looks. It is not productizable as is and couldn't replace an artist: it can't render text, the art styles change wildly and aren't controllable, image quality degrades with longer prompts, and, most importantly, the training data has had everything vaguely interesting censored out of it.
Which is surprising, since GPT-3 is very complete and, if anything, a little too uncensored. Not sure if they think visuals are worse, or if they just have a lot of extra ethicists messing with it now.
I feel like censoring the training data as a way of attaining "safety" is a cheap fix that will obviously never extend to AGIs, which will probably have to be trained in the real world to even be AGI.
In OpenAI’s case it is less about safety and more about serving outputs from an API without causing huge PR backlash and without having to wait for _proper_ safety research to be completed to do so.
Besides that, “safety” is a very abstract term, and when it comes to a specific model like this the problems become more prosaic things like “embarrassment”.
That could be the embarrassment of people you’ve made deepfakes of, but it could also be OpenAI embarrassed that their outputs look bad, or users embarrassed that they got an NSFW output for an SFW prompt.
I think the censorship is also about copyright though; it barely knows any characters while dalle mini users are generating stuff like “Shrek on trial for murder” all day.
Indeed, and while I have no access to DALL-E 2, I’d be willing to bet it happily outputs Minecraft-style imagery and IP, as business-daddy MS has assured them it won’t be an issue.
Well “Minecraft” isn’t a character so there’s definitely some of that in it. It knows the look of other IPs too, like various movies or games like Borderlands, it’s just that it doesn’t know many people’s names whether they’re fictional or not. So maybe “Minecraft Steve” wouldn’t work - can’t tell atm as I’m out of credits.
From earlier prompts I tried, “Minecraft” makes things look conceptually blockier rather than visually blockier, if that makes sense. So a “Minecraft mansion” will turn into a midcentury-modernist house with flat walls and roofs rather than being made of voxels.
That’s a pretty neat effect, but if you use names of video games in dalle or other image AIs they tend to start generating UI elements on top of your image and it looks bad.
This may have more to do with them filtering out training data containing names than anything else. They did the same thing for their released GLIDE checkpoint (which predates DALL-E 2 by a couple of months).
Indeed. The first evidence for this came from the DALL-E paper itself, where they used CLIP to rerank outputs (classifier-free guidance hadn’t yet been applied to fix DALL-E’s problems with noisy outputs).
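For anyone curious how CLIP reranking works in practice, it's simple enough to sketch: generate a batch of candidate images with whatever generator you like, score each one against the prompt with CLIP, and keep the highest-scoring ones. A rough sketch, assuming the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers and a hypothetical `candidate_images` list of PIL images from some generator (not the DALL-E paper's exact pipeline, just the general idea):

    # Rough sketch: rerank candidate images by CLIP similarity to the prompt.
    # `candidate_images` is assumed to be a list of PIL.Image objects from any
    # text-to-image generator; the checkpoint is the public CLIP ViT-B/32.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def rerank(prompt, candidate_images, top_k=4):
        inputs = processor(text=[prompt], images=candidate_images,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            # logits_per_image: (num_images, 1) similarity scores vs. the prompt
            scores = model(**inputs).logits_per_image.squeeze(1)
        best = torch.topk(scores, k=min(top_k, len(candidate_images))).indices
        return [candidate_images[i] for i in best]

The generator itself doesn't matter here; the whole trick is "sample many, keep what CLIP likes", which is why it worked as a stopgap before guidance methods landed.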
It's a bit of a freaky question (you mentioned the same example last week too; I had to double-check if you were the same person), but I think it's actually really important and we should be talking more about things like this. It's entirely feasible with current technology, and we can't just ignore that fact. We're basically one rich eccentric (who can afford the server time) away from a model that can remove the clothing from an image of a person. You can't put that cat back in the bag.
Most AI is currently done by big, "serious" companies that both care about liability and the bad press of being associated with that sort of thing, and also have a lot of AI-ethicist-type folks on board who care a lot about what they consider "misuse". Some AI designs try to limit the NSFW training data used in the model (e.g. DALL-E 2), while others try to fine-tune or censor results after the fact (e.g. GPT-3).
Right now the adult-oriented AI applications lag slightly behind the cutting edge, but they make up a shockingly big share of the consumer base, both current and potential. Adult content is probably one of the biggest potential applications for AI, and there are some really fascinating ethical questions around it (e.g. the ethics of AI-generated porn vs. real-life porn, considerations around real people, minors, other illegal content, etc.).
Generally, adult-oriented models are either hobbyist clones/finetunes of existing models, or just existing models that people have figured out ways to get to work with adult inputs. There are plenty of AI model hosting services out there that have no qualms about being used for shady or even illegal purposes, so it's difficult if not impossible to stop it from the server provider side.
We need to be thinking more about how we want to handle that sort of thing socially/culturally/legally, because it's gonna happen whether we want it to or not.
Our best minds are working on this amazing, near-magical new technology... which will end up being productized into an Instagram filter service to dynamically inject a stained glass unicorn in place of a horse in a video.
That's cool and all, but also really stupid and a pointless distraction compared to how novel the underlying mathematics and science are. This will quickly become a commodity and humans will acclimate to seeing such tricks.
The content produced won't even be considered particularly impressive.
Damn. I was hoping the singularity would be better.
I see it differently. As a robotics engineer I know the biggest impediment to robotics development is getting computers to understand the real world. The work on multimodal neurons, which see the word "cake" and know to associate it with images of cake, is a key stepping stone toward a fully functional embodied AI that can solve difficult real-world problems. CLIP, DALL-E, and all these offshoots are representations of what we can pull from these efforts today. But long term this work will be incorporated into bigger and more capable AI systems.
Just think: when I ask you “walk into the workshop, grab a hammer and a box of nails, and meet me on the roof to help me secure some loose shingles”, your mind is already imagining the path you will take to get there, what it will look like when you locate and grab the hammer and nails, and you’ve filled in that to get on the roof you have to meet me in the back yard to climb the ladder, which I never mentioned.
All these tiny details that your mind handles effortlessly take huge efforts like CLIP to sort out. And even CLIP is only text and images; there is a lot more to go from there.
A lot of people focus on DALL-E and the artifacts that come out along the way, but these are not the destination, just little stops showing the progress we are making on a much larger journey.
The thesis "Best new minds" is not even correct. If the "Best new minds" didn't create things like social networking and instagram we wouldn't even have the data sources to build upon these new algorithms to even work. Also, without a bunch of video game makers actually pushing graphics cards we also would have the hardware to do these things. So the "best new minds" accidentally enabled more "best new minds".
I think now is a good time (now that these models are out of the research zone and actually working) for a flood of new people to enter this space and try to come up with more creative ideas than that. It's an exciting time for cool ideas and cool projects if we take off our cynicism hats for a bit.
Well, you will have to do some work to understand them (read papers, etc.). But versions distilled from the larger models are quite capable of running even on the web. There's definitely cool low-hanging fruit here (though it's not plug-and-play, import-a-library stuff). The main thing is that these models have been proven to be as powerful as they are (last year it wasn't clear they'd get this good), so with some effort there's definitely cool stuff to be built. I'm excited (and working on it myself).
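To give a flavour of that low-hanging fruit: with an off-the-shelf CLIP checkpoint, a toy semantic image search over a folder of photos is maybe a dozen lines. A rough sketch, assuming the public openai/clip-vit-base-patch32 weights and a made-up "photos" folder and query string:

    # Rough sketch: embed a folder of images with CLIP and search them by text.
    # The folder name and query are illustrative placeholders.
    from pathlib import Path
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = sorted(Path("photos").glob("*.jpg"))
    images = [Image.open(p).convert("RGB") for p in paths]

    with torch.no_grad():
        img_emb = model.get_image_features(
            **processor(images=images, return_tensors="pt"))
        txt_emb = model.get_text_features(
            **processor(text=["a dog on a beach"], return_tensors="pt",
                        padding=True))

    # Cosine similarity between the query and every image, best match first.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    ranking = (img_emb @ txt_emb.T).squeeze(1).argsort(descending=True)
    for i in ranking[:5]:
        print(paths[int(i)])

That's the "not plug and play, but close" point: the hard part isn't the dozen lines, it's knowing which model to reach for and what its embeddings are actually good at.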
Yes, lowering the cost of special effects means they're not that special anymore and there will be lots of crap. On the other hand, it lowers the cost of filmmaking, so there should be storytellers who use this to good effect, where the special effect isn't that noticeable but serves the story.
Though, a question is whether the good storytellers can be found easily. It seems like the situation is similar in fan fiction.