As an owner of a Video Production studio, this kind of tech is blowing my mind and makes me equally excited and scared. I can see how we could incorporate such tools into our workflows, and at the same time I'm worried it'll be used to spam the internet with thousands and thousands of soulless generated videos, making it even harder to cut through the noise.
On a related note, I thought it would be fun to see what kind of movies AI would generate, so I created a "This Movie Does Not Exist" website[1] that auto-generates fake movies (movie posters + synopses). It basically uses GPT-3 to generate story plots, and then uses those as prompts (with in-between steps) for Stable Diffusion. Results may vary, but it definitely surprises sometimes with movies that look and sound amazing!
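For anyone curious, a pipeline like that can be sketched in a few lines. Everything below is hypothetical (the function names, prompts, and stand-in model calls are illustrative, not the site's actual code); real use would swap in the OpenAI and Stable Diffusion APIs:

```python
# Hypothetical sketch of a GPT-3 -> Stable Diffusion movie generator.
# Chain: plot synopsis -> condensed visual prompt -> poster image.
# The model callables below are placeholders, not real client calls.

def generate_synopsis(llm) -> str:
    # Step 1: ask the language model for a fake movie plot.
    return llm("Write a title and one-paragraph synopsis for a movie "
               "that does not exist. Avoid real film titles.")

def synopsis_to_image_prompt(llm, synopsis: str) -> str:
    # Step 2 (an "in-between" step): condense the plot into a short,
    # purely visual description usable as a diffusion prompt.
    return llm("Describe a theatrical poster for this movie in one "
               "sentence, focusing on visuals only:\n" + synopsis)

def make_poster(llm, diffusion,
                style: str = "movie poster, 35mm, dramatic lighting"):
    synopsis = generate_synopsis(llm)
    prompt = synopsis_to_image_prompt(llm, synopsis) + ", " + style
    image = diffusion(prompt)  # e.g. a Stable Diffusion txt2img call
    return synopsis, prompt, image

# Toy stand-ins so the sketch runs end to end:
fake_llm = lambda p: "A lighthouse keeper discovers the fog is alive."
fake_diffusion = lambda p: f"<image for: {p}>"
synopsis, prompt, image = make_poster(fake_llm, fake_diffusion)
```

The interesting engineering is all in the in-between prompts: turning a paragraph of plot into something a diffusion model renders well.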
Once an AI comes along that can make a "Two Brothers" trailer like the one below, humanity will have some serious issues to tackle, other than climate change ofc! :D
Yep, that's because GPT-3 was trained on real existing data, and it's quite a challenge to make sure the story plot is 100% fake. When it's too close to an existing film, it sometimes just gives it the same film title. I have in-between GPT-3 prompts to avoid that as much as possible, but sometimes real movie titles slip through the cracks. Something I hope to improve shortly.
Given Hollywood's proclivity to remake everything on a 20-year cycle, it seems completely appropriate to get a 2023 Terminator reboot in a 1920s style.
There are only so many stories to tell. The Terminator is a rehash of so many other previous stories. The real art is in putting it together so that it seems new and fresh and gets people excited about it. The Terminator 1920s style looks interesting.
If a story is a rehash of "many" stories, then it's actually a new story. Similar to how an "Airbnb but for dog walkers" isn't actually a ripoff of Airbnb, but is in fact an original idea.
Which reads like a bad translation of a bad translation. Like the old joke about the AI program that was supposed to translate "The spirit is willing but the flesh is weak" from English to Russian to English, and after the roundtrip came up with "The vodka is good but the meat is rotten."
I feel like "generate" is kind of a strong word in this case though. At this rate if the machines rise up, they will do so just to parrot all the "machines rise up" plot synopses in their training corpus.
I think some refer to this as the dead internet theory, e.g. AI content creation becomes the majority of media on the internet instead of humans posting (may be wrong in my explanation but think that's the premise).
It's scary to think about it but seems plausible—like if someone can make an app with Tiktok-like ubiquity of only AI content. Although to your point I imagine there will be so much nonsensical noise that curating will become a useful skill, it is today but even more so.
> It's scary to think about it but seems plausible—like if someone can make an app with Tiktok-like ubiquity of only AI content. Although to your point I imagine there will be so much nonsensical noise that curating will become a useful skill, it is today but even more so.
This just gave me a disturbingly vivid vision of exactly what you described. A seemingly likely future where everyone's in their homes scrolling endlessly on a TikTok-like app where there's literally infinite content being generated by the AI all the time, and as people like and dislike certain types of content, the AI just gets better and better at generating new videos...this is honestly kind of terrifying. I have no doubt this will exist one day, and that it'll print money for one company while billions of people are spending all their free time consuming it.
People like interacting with other people, not bots. If a few individuals start to feel like they're interacting with too many bots, they'll retreat into small private silos. The human-to-bot ratio on public forums then drops. Then more people realize that they're just talking to bots and further and further it goes until practically no humans are left. That's the dead internet theory.
I really don't think it applies to us in this context though because I think that a decent number of humans don't care whether some content is AI generated. Furry porn is all hand-drawn and people still like it despite it not being real.
From what I can see, YouTube has done quite a bit of work to clean up YouTube Kids, but it's kind of an arms race.
There's this worrying issue in AI ethics discussions where most people seem to assume the problems and dangers of AI are still off in the future, that as long as we don't have the malicious AGI of sci-fi stories, then AI and "lesser" algorithmically generated content isn't harming society.
I think that's not true at all. We've seen massive damage to social structures from algorithmic feeds and generated content already, for years now. Just because they aren't necessarily neural-network-based doesn't mean they're nothing to worry about.
So I don't see AI as a particularly new or different problem. It's an extension of an already existing, worrisome problem that most people have ignored beyond occasionally complaining about election results.
I know some extremely hard-working independent filmmakers who struggle so hard to get noticed. After this tech goes mainstream in 5 years and gets really good, I don't know what they're going to do.
Have you talked to the average 15 year old kid? Hell, have you talked to the average 50 year old? It will be the same as ever, a sea of absolute shit surrounding some true gems, be they from novel creativity or just excellent execution of well worn ground. The role of the trusted curator will rise and brands will gain more power.
I am sure I will watch MY 15 year old's attempts, and maybe a few from my extended circle, but most of my consumed content will still come from what makes the cut to Netflix or HBO etc. Technology like this will empower the truly creative once it has matured. I would expect closer to 20 years than 5, however.
> It will be the same as ever, a sea of absolute shit surrounding some true gems
Ah, I can see what's wrong there, just turn down shit randomness and increase the true gem sampling steps to proportionally increase the input weight of true gems and fix your output quality.
Basically - if it empowers creative people, their output will be fed back in and parameterised.
From my personal perspective, knowing how something is made affects my interest in it. I guess that's why provenance ascribes value. I value human creativity, and so I'll likely always seek out something created by craft rather than by shortcuts.
AI is eating the world, and the vast majority of people are not paying attention.
I don't know what artists, truck drivers, Uber/Lyft/taxi drivers, delivery drivers, programmers, doctors, judges, fast food workers, etc. are going to do.
Perhaps humanity would happily share the benefits of machine work and we could all live idyllic lives eating, laughing and loving while exploring the galaxy?
If they’re not trading with everyone else they’re not rich. Being rich /is/ the ability to trade a lot.
The idea that rich people will all leave and start a different rich-people-only economy that somehow takes all the economic activity with it isn’t how it really works, it’s the plot of Atlas Shrugged.
Why do you need consumers when you have machines making most of what you want? Then you just need to be able to trade things with other people who have different machines and resources from yourself to get the bits you can't have made by your machines.
Where do the inputs for the machines come from? ex: having a machine that generates CPUs out of sand and works for free would mean cheaper CPUs, but it would also mean a lot more employment in the sand industry, and there might not be a machine for creating sand.
Nope. He is saying power concentration will increase further. This is also Yuval Noah Harari's thesis. You can read about how that will get achieved here.
Surely if we get to the point where programmers are no longer needed, humans have essentially been replaced by AI? Since the AI programmers could just program better and better AI?
Why will conventional investments have value? Why will they be safe? I could be retired and well off with, say, even 5 or 10 million, and then large numbers of people lose their jobs. Even if we can still make widgets and grow and distribute food with robots doing everything, society will fall apart. There will be fewer people with money to support themselves or buy things. I don't see the capitalist society in the US giving people a basic income, but we won't need so many workers, and so: anarchy.
> I'm worried it'll be used to spam the internet with thousands and thousands of souless generated videos
I agree, and I think when that happens, it will tend to increase the value of curation. High quality curation that is, probably done mostly by hand, as opposed to the at-best-mediocre automated curation that is commonly used.
It could be bad for things like YouTube, for example. I think there will be an arms race between generated video content on one side and automated curation on the other. I mean, you can still leverage viewer choices for curation (looking at what people are watching a lot of), but that just shifts the burden of curation to users. Few people will be willing to sift through dozens of cheaply generated crap videos to find something they actually want to watch.
I was thinking about something like this site, but also taking some randomized existing plot outlines to generate specific stills from each part of the plot. Might require isolating character archetypes too?
The description text does not convert sentences with carriage returns (or, probably, newlines) into separate divs or whatever HTML element you'd prefer, FYI! Otherwise, very cool!
Cool toy. One of the most useful side effects of AI right now is idea generation. Market this as an idea generator for movies and such and people will eat it up. Try posting it on the entertainment focused area of Twitter and people will go nuts for it.
These prompts are way better than the drivel I see on Amazon Prime. Half their movie descriptions don't even tell you anything about the film, they seem to be just a random paragraph from the pitch document.
"Soulless" -- they'll be low quality, both in rendering and in plot/acting, but they'll be anything but soulless. Each will be a labor of love of someone with a dream.
For those who are scared about this technology, it’s good to look at what AI has done to Chess.
The best chess seems to be when AI is used along with humans. I think image and video AI will best be exploited when human input is also taken into account.
There is still something special about human creativity, I think AI will just be another tool to expand that. At least, in the short term I would say 10 years perhaps. AI will probably one day take over all aspects of creativity and humans won’t be able to contribute.
> The best chess seems to be when AI is used along with humans.
I don't think this is true anymore. I don't think I've heard about successful centaur chess games in years. I would love to be wrong there though (in particular if anyone knows about how correspondence chess games have been played in the last 2 or 3 years with the availability of Leela Zero and Stockfish NNUE).
Based on that thread, it looks like centaur chess is close to dead.
> Human input in top-level ICCF [correspondence games with chess engine support] games is now 99% eliminated, other than personal preference in selection of openings.
Don't worry, humans will still be relevant for ~10 years!
(and then regrettably irrelevant thereafter).
I think it is a legitimate worry, as the pace of progress is considerable. These tools are impressive, and are only going to get more impressive: more people should be talking about where this is headed.
These models are literally one French court decision away from being banned and investment in them being completely halted. So I wouldn't worry too much about them specifically.
It would have gotten rid of every single one of those if the only method for their production required a tech company to invest an 8-digit sum in R&D and said tech company was threatened with litigation.
Porn companies, Facebook, Uber, ... these are just the latest companies, in a long line, for whom huge lawsuits and fortune-sized settlements are simply a line item on the expense sheet.
Are you somehow under the impression that, if a government decides to impose a fine that will financially kill the company once and for all, it is incapable of it?
Incidentally, Uber has been banned in the entire country where I live for 8+ years now.
Of course. And that kind of action would make corporations sit up.
But what a government can do, and what its politicians will do are very different things. The political backlash for going nuclear on corporations would be massive and sustained. Not even good acting corporations want that kind of thing happening.
Regarding Uber and your country: your point makes my point stronger.
A whole country is now off limits to Uber, but on the whole, across all countries, their aggressive behavior has been a win for them anyway. Just another cost of doing business.
A French court banning it would probably be a (short) precursor to an EU-wide ban, which is effectively a worldwide ban. EU courts will fine companies heavily if they don't comply, which effectively means big tech no longer invests a penny in these models.
I use 'French court' here mostly because I believe it most likely to be firstly litigated there.
Even so, if the EU chooses to regress, I can imagine a lot of non-EU-based companies (especially smaller ones) just choosing not to deal with them anymore.
The EU can pretty much unilaterally cut said companies' access to the financial system, as demonstrated this year with a number of Russian entities.
Meanwhile, given what Xi has done in the last couple of years, I would not be surprised if China moves against the model makers first.
Yeah, just like Facebook and Google can't operate in Europe, or the US isn't still spying on the whole world. Those EU courts really control them.
EU courts fined Google just this week. Way to prove exactly how uninformed you are. There has been no reason to ban Meta or Google in the EU so far, but you can ask RT how good their operations are at the moment.
The EU wouldn't ban homegrown technology. They're making up laws as an excuse to punish Google for not being EU-based; it's not that Google happens to be violating the law.
EU would ban any technology that disrupts the labor market to an extent it leads to widespread protests among a segment of society. For how a somehow similar thing has played out in the last 40 years (instead of banning technologies, it has mostly been instituting quotas, paying subsidies and banning/taxing imports), you can look at its agricultural policy.
The EU has imposed many sanctions on Google and tried to force them to change various practices, just like the pointless attempts to get Microsoft to make changes. None of them made any real difference.
I think a key difference here is that with chess, 'goodness' is defined by winning. With content generation, the training methods point towards some form of comparing the generated thing to some observed data, but the 'goodness' of the content from the perspective of potentially competing with or displacing human creators is "do people like to consume it?"
If one trained using e.g. a tiktok like dataset showing viewer response measurements for each video, and do conditional generation on those response values ("prank video watchers are highly likely to watch the full video"), are we really that far from a system that learns to generate content that attracts and hold eyeballs? Not so long ago there were a lot of concerning trend pieces about how youtube had a network of creators making bizarre, disturbing or transfixing videos being watched entirely by young children. Before that, it was clickbait listicles. "Bad" content that can get eyeballs can still wildly steer what humans create and consume. I'm wondering if in 2 years we'll have an enormous number of short videos that we all agree are "bad" but which are nevertheless constantly watched.
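As a toy illustration of that idea (all data, tags, and thresholds below are made up), conditioning could be as simple as bucketing each training clip by its watch-through rate and prepending the bucket as a tag, then sampling with the tag fixed to the high-engagement bucket:

```python
# Illustrative sketch of engagement-conditioned generation: tag training
# examples with an engagement bucket, train conditionally, then sample
# with the bucket fixed to "high". Everything here is toy data.
import statistics

def engagement_bucket(watch_fraction: float, threshold: float) -> str:
    return "<high_engagement>" if watch_fraction >= threshold else "<low_engagement>"

# Toy training set: (caption, average fraction of the video watched)
dataset = [
    ("prank video", 0.92),
    ("lecture recording", 0.35),
    ("unboxing video", 0.81),
    ("conference talk", 0.28),
]

# Use the median watch fraction as the bucket threshold.
threshold = statistics.median(f for _, f in dataset)

# Conditioning: prepend the bucket tag to each caption, the way a class
# label or tag might be fed into a conditional generative model.
conditioned = [f"{engagement_bucket(f, threshold)} {caption}"
               for caption, f in dataset]

# At sampling time, you would condition on the high-engagement tag only:
sampling_prefix = "<high_engagement>"
```

The worrying part is that nothing in this loop asks whether the content is good, only whether it holds eyeballs.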
I may be mistaken but I believe that human/machine pairing was dominant for a long while, but the last few years the chess solvers have progressed to a point where they're dominant on their own.
Poker on the other hand I think human players still win vs GTO solvers, but again I may be mistaken here too.
Is there a single "intellectual" game left that humans can beat computers at?
I suppose an AI has yet to beat lebron james at basketball, but I suppose that's for want of having a body.
AI is the winningest in chess, but the real life purpose of chess is to produce interesting gameplay for people to watch, and so AI is less good than Magnus at that. You’d need the AI to throw games and write press releases.
There are two classes of engines. One is like you describe, faster and faster brute forcing. AlphaZero was much more creative and didn’t use brute force.
Yes, it's happening in real time; so far it's generated about 8k movies over the last 3 hours. The costs right now are roughly $150 for the generated images (Stable Diffusion) and $30 for the generated texts (GPT-3).
Sorry about this, we're having a hard time handling the heavy traffic coming from Hacker News. Over 4k movies generated in just the last 2 hours, which definitely impacts our server performance, not to mention our Stable Diffusion and GPT bills! hehe. Currently working to make things smoother!
What I want to see is a model that can generate 3D models for use in applications such as Blender. It would provide a good starting point for someone with talent to make something beautiful. Or just save people like me time when making games.
I've been looking into some of the 3D model generators this past week, and there is some work happening in that field. See the following non-exhaustive list:
Have you by chance tried out NeROIC? I'm a 3D printing enthusiast, mostly video game stuff, and it seems like it would be excellent for that purpose.
I actually have been trying it out this week, and in fact it's currently trying to process the video generation, like their example shows. While I was able to follow their steps for training using their dataset, and generate the lighting/depth maps for the milkcarton example, the video generation is taking a long time (over 24 hours so far, using a 3070Ti with 8GB VRAM).
From what I understand with NeROIC, it's not particularly meant to generate a 3D model that can be imported into Blender (or other software). It requires more work to take the meshes it generates and do something with them. See https://github.com/snap-research/NeROIC/issues/10
I too was looking into it to generate 3D models for some software I've been working on.
Thanks! I imagine one could use something like ZBrush's dynamesh to create a usable mesh from the output. Shame the library doesn't provide it by default, though.
Funnily enough, you might not even need a 3D model if the txt2video is good enough. Whatever you wanted to render in Blender could just be rendered via text prompt (when this becomes 100x better).
This looks like the video equivalent of Dall-E 1. Hard to believe how far we've come so quickly.
The paper talks about "pseudo 3D attention layers" that are used in place of temporal attention layers for each dimension due to memory consumption. It seems like AI research is vastly outpacing GPU development.
Indeed - it's not hard from a research point of view - it's hard from a compute perspective because adding one more dimension requires hundreds of times more compute.
Even then, these videos are only like 50 frames long - and a real movie you would want to be hundreds of thousands of frames long.
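A back-of-envelope calculation shows why the extra dimension hurts so much: self-attention cost grows with the square of the token count, so attending over all frames jointly multiplies the cost by the square of the frame count (the token counts below are illustrative, not from the paper):

```python
# Back-of-envelope: self-attention cost scales with the square of the
# token count. Adding a time axis multiplies tokens by the frame count,
# so full spatio-temporal attention scales with frames**2.
def attention_pairs(tokens: int) -> int:
    return tokens * tokens  # pairwise interactions in one attention layer

spatial_tokens = 64 * 64          # one image as a 64x64 latent/patch grid
image_cost = attention_pairs(spatial_tokens)

frames = 50                        # roughly the clip lengths shown here
video_cost = attention_pairs(spatial_tokens * frames)

print(video_cost // image_cost)   # frames**2 = 2500x the attention cost
```

That quadratic blowup is presumably why the paper factors attention per axis ("pseudo 3D") instead of attending over space and time jointly; a feature-length video at hundreds of thousands of frames would be astronomically worse.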
So you need to optimise on a compressed version, not the whole thing. What they're doing right now is akin to a human trying to hold an entire picture - or entire movie - in their head all at once.
We can’t do it. AIs can sort of do it.
Latent diffusion models already demonstrated that operating on a compressed representation gives far better results, faster, but I don’t think we’re anywhere near the limit for what’s possible there. It’s no coincidence that this is how humans work.
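The savings from working in a compressed latent are easy to quantify; the figures below match the commonly cited Stable Diffusion v1 setup (512×512×3 pixel images encoded to a 64×64×4 latent):

```python
# Rough arithmetic on why diffusing in a compressed latent space is so
# much cheaper (figures match the commonly cited Stable Diffusion v1
# setup: 512x512x3 images encoded to a 64x64x4 latent).
pixel_values = 512 * 512 * 3      # values per image in pixel space
latent_values = 64 * 64 * 4       # values per image in latent space
print(pixel_values // latent_values)   # 48x fewer values to denoise

# And attention scales with the square of the spatial positions:
pixel_positions = 512 * 512
latent_positions = 64 * 64
print((pixel_positions ** 2) // (latent_positions ** 2))  # 4096x
```

The same trick presumably has even more headroom for video, where the redundancy between neighboring frames is enormous.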
I am curious if there has been any research on temporal attention in humans. I'm not sure how you'd quantify it. But in myself I know that I'm constantly predicting where something will be or what it will look like based on how it did a second ago. It's probably the root of reflexes.
They put an eye tracker on someone and captured their motion when walking in some rough terrain. You can sort of see that the person is focusing on the most likely place their foot will go next.
I think that we will discover that there is a more efficient way to encode temporal relationships, which appears to be "just throw transformers at it." My guess is that it will be in a more conceptual latent space that this attention will be applied.
True, but the attention layers still need to be able to look at all the shots - for example to make sure the background of a room shown at the start of the movie is the same as the background of the same room at the end.
Obviously you could do 'human assisted' movie making where humans decide the storyboard and make directions for each shot, and then that isn't necessary.
Hardware has probably always lagged behind cutting-edge research; just consider video games, which have pushed hardware limits hard ever since Pong.
It's a good thing, to be fair: forcing research teams to optimize their projects is beneficial and creates competition for limited resources. This gets a bit skewed when we compare a university research team with a MANGA-type company, but the team behind Stable Diffusion proved that innovation can come from unexpected places.
What's mind blowing is that you can extrapolate where this is going to go. Eventually, you will be able to generate full movie scenes from descriptions.
What's interesting to me is how this is so similar to human imagination. Give me a description and I will fabricate the visuals in my mind. Some aspects will be detailed, others will be vague, or unimportant. Crazy to see how fast AI is progressing. Machines are approaching the ability to interpret and visualize text in the same way humans can.
This also fascinates me as a form of compression. You can transmit concepts and descriptions without transmitting pixel data, and the visuals can be generated onsite. Wonder if there is some practical application for this.
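As a toy calculation of how extreme that compression would be (the generative model on the receiving end acts as the decoder, so only the description needs to travel):

```python
# Toy comparison: transmitting a description vs. transmitting pixels.
# A generative model on the receiving end acts as the "decoder"; the
# prompt below is just an example string.
prompt = "A teddy bear painting a self-portrait, golden hour, 4k"
prompt_bytes = len(prompt.encode("utf-8"))

# One uncompressed 1080p RGB frame:
frame_bytes = 1920 * 1080 * 3

print(frame_bytes // prompt_bytes)  # the description is ~100,000x smaller
```

The catch, of course, is that it's lossy in a strange way: every receiver reconstructs different pixels, so it only works when the gist matters more than the exact image.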
IMHO this particular avenue is a dead end. It's an extraordinarily impressive dead end but it's clear that there's no real understanding here. Look at this video of the knight riding a horse:
The knight's upper body doesn't match with the lower and they're not moving correctly
I think ultimately the right path is something like AI automated Blender. AI creates the models & actions while Blender renders it according to a rules based physics engine.
Of course there "is no understanding here", but yet it's not all wrong. Somehow it did move the horse's legs roughly correctly (using the proper joints and all), somehow the cape is moving roughly as it should through the air and the knight's body absorbs the force of stomping on the ground…
It doesn't seem that the fundamental inability to understand what is going on in the scene is stopping models of this kind to eventually lead to realistic results.
I guess it comes down to how much wiggle room is in "roughly". If you watch the video closely, the horse briefly gets a fifth leg when the front left one moves the first time. And yes, the legs and joints are sort of right, but they don't match the direction the horse is moving and wouldn't work in the real world.
It's superficially close but when you look at details they're all slightly off. To wit:
> the knight's body absorbs the force of stomping on the ground…
But the Knight doesn't have any way to see through the helmet.
I would say that the wiggle room is smaller than I would expect it to be. Don't you agree?
And I was surprised by how small the wiggle room was the first time I interacted with GPT in text or saw the first images from DALL-E, since I too expected them to be (severely) limited by not understanding what's actually represented in the input/output.
With new versions the wiggle room shrinks further. So I guess the question is, whether it will be able to shrink enough to be satisfactory. We will see…
Whenever there is an explosion of content, curation and search become important. Meta has many of the products people use to show off their taste, so having more content to curate and share is good for their ecosystem (until people stop consuming as much because they know they can produce even better with their own imagination). This may be good for Google - the more content there is, the more you need search to find it.
The downsides: there will be less money out there for creators, because that becomes a commodity. You will be able to make money if you are known for quality content polishing, editing and generally bringing that last 5-10% of generated content to look perfect (all the way until automated tools are trained to do that as well). AI will automate and improve most white-collar jobs. Instead of generators, everyone will become curators, as taste will be more important (until that's trained into the system as well).
For deeper levels of consequence we have to look at history: how did the world change when people finally got paper after most writing was done on animal skins (more got to write, the richest or most powerful didn't have the only say), or water piped to your house after you had to carry buckets and dig wells (It freed up time for everyone for more interesting tasks). Now GPUs are going to be the new paper and the new PVC. Yes, software has been eating the world for a while, but you won't be able to brainstorm without AI generating the first pass.
Creators are already curators. Musicians often don’t produce their own sounds, they curate and piece together samples. Designers and software engineers cut and paste from existing work.
The idea of a blank slate creator has been dead long before ML tools were introduced :)
I'm rooting for this tech. Hopefully this will get modern movies out of their low risk reboot loop since it will be cheaper to make a movie that have new story lines that are commercially untested. I'd be happy to watch a movie that doesn't look AAA, but has compelling writing and makes me think. Or maybe I'll just stick to books.
I don't think modern movies are stuck in a "low risk reboot loop" because of the cost to produce, it's because of the potential profit.
Why spend money on a film with new IP and ideas that you're not sure will be popular when the data science team has already worked with marketing to figure out exactly what movie will sell well?
Good luck finding your movie with compelling and thought-provoking writing in the big pile of committee-produced movies that show up above yours in the discovery algorithms.
I was just thinking about this for the long tail problem in recommendation engines for stuff like video games. I blame the algorithms themselves.
We're in a situation where the very best algorithms (like the one used by Netflix), doing exactly what they're designed to do, create inequality and the vacuous economy of influencers we have today. Look at Steam or any other marketplace: they're all the same, with 1% of the players getting 99% of the prizes. In a very real way, the only winning move is not to play.
I would suggest that this tendency of capitalism (economic evolution) is unstoppable, and that it must be attacked from a different angle. If we don't want to inevitably end up in late-stage capitalism that looks like neofeudalism, then there has to be some form of redistribution or people spend the entirety of their lives running the rat race to make rent. Traditionally that was high taxes on the winners, but UBI would probably work better. Unfortunately, the very same people who win are the ones most resistant to any notion of a level playing field or social safety net.
So I feel like there may be no solution coming. We're probably looking at long slow decline for the next 15 years or so until AI reaches a level where economics don't really make sense anymore, since economic systems by definition control the distribution of resources under scarcity. Without scarcity, they're pointless. And we moved into the age of artificial scarcity sometime after WWII, probably in the late 1960s, but certainly no later than 1990 with the fall of the USSR and the rise of straw man enemies like terrorism, using divisive politics as the primary means of controlling the population. Noam Chomsky saw this coming before most of us were born.
In other words, when anyone can wish for anything by turning sentences into 3D-printed manifestations of their dreams, then artificial scarcity quickly loses its luster. Because the systems of control around dependency no longer work. Then a new fear-based enemy comes along to fill the void, probably aliens. I wish I was joking.
Films today require institutional capital. Fewer than a thousand directors a year get greenlit. There isn't enough conceptual or directorial diversity, and it sucks.
In the future, kids will be making their own Star Wars movies from home. All kinds of people from all kinds of backgrounds will make novel films that would ordinarily never be made, such as "Steampunk Vampires of Venus", starring John Wayne, a young Betty White, and Samuel L. Jackson. This is absolutely the future.
I'm working on building this. I'm sure lots of others are too.
With a future version of this you could theoretically prototype movies much faster and try ideas on test audiences without actually having to film them.
In the future version of this, the end user asks for a movie and gets a custom movie generated just for them.
You want a thought-provoking Bourne-style action thriller with hints of Jane Austen and a bollywood dance sequence? How about a Matrix sequel that lives up to the first one, but ends just the way you like it? Just ask.
Xenomorph cop. It's a Dirty Harry style movie except the lead is a xenomorph. The other characters are only vaguely aware that he's a xenomorph.
Or how about the movie Clue with the three endings except an infinite sequence of "or maybe it happened this way" sequences. I mean how else are we going to get a sequence where Darth Vader and Tim Curry reenact the "No I am your father" scene while Martin Mull dies in the background due to a heart attack.
This could certainly be entertaining. The trick will be for the studios to continue to brand it as recognizable, yet have it be unique. It raises the question of what part of the experience will be shared, since it could be radically different for all of us. Would we then share the story created for us with friends? Will there be an Oscar for best mind to have a video created for them?
I think it would be great for people who write scripts for a movie they imagined.
You could conceivably write a script and feed it into a machine and have a decent 1080p rendition of the movie with consistent characters and voice acting which you could use to better pitch your movie idea to people, or get to watch a movie you created between you and your friends even if no one else ever gets to see it.
Having written several feature length scripts, I'm certainly looking forward to trying this. But with an explosion of cheaply produced content, people will probably spend less time watching movies and TV shows. I'm a lot more selective now than I used to be because there's just too much stuff to watch even if I spent the rest of my life on the couch.
The current trend of remaking movies as 10 hour miniseries (and then making more and more seasons) is Not Great. Whereas I could be fascinated by a quirky-but-compelling original movie, I'm less attracted to 10+ hours of hyper-polished content. Sometimes I've watched a series and thought 'that was good, but it could have been a better movie.'
Not sure we're quite there yet. A real movie needs a lot of dialogue, speech, sound effects, music, etc. Even the best LLMs don't do really coherent storytelling yet, and a script for a movie is just the absolute bare bones.
I was pretty surprised to see that the WebVid-10M Dataset used as part of this training - https://m-bain.github.io/webvid-dataset/ - consists entirely of video preview clips scraped from Shutterstock!
Really cool seeing this & excited to really start playing with it. 2023 is going to be funky!
It's pretty wild to see how quickly the space of Generative AI Media is coming along.
I started a newsletter on the topic, called The Art of Intelligence (GPT-3 came up with the name), with the first post going out last Friday on how far we are from AI-generated videos and simulated worlds like the Holodeck, given the rapid progress of these visual A.I. models. Thought y'all might find it interesting:
https://artofintelligence.substack.com/p/dall-e-stable-diffu...
This type of progress also reminds me of a really lovely publication from 2017 in Distill.pub, on the topic of these A.I. enabled creation tools - I think y'all would enjoy seeing what folks were thinking even then:
https://distill.pub/2017/aia/
Are GPU vendors (well, GPU vendor, as far as I can tell) focusing heavily on increasing VRAM? My understanding is that transformers are pretty quick to train, but have significant memory costs.
When they say that video is infeasible due to memory... does that mean that if we had enough memory (128? 256? GB) we would be able to realistically train such networks with temporal attention?
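For intuition on why full temporal attention blows up in memory, here is a rough back-of-envelope sketch. The token counts, fp16 scores, and single-head assumption are all illustrative choices of mine, not figures from the paper:

```python
# Rough size of the attention score matrix for full spatio-temporal
# self-attention, assuming fp16 scores and a single attention head.
# All shapes here are illustrative assumptions, not figures from the paper.

def attn_matrix_gib(frames, height, width, bytes_per_score=2):
    n_tokens = frames * height * width          # one token per latent "pixel"
    score_bytes = n_tokens ** 2 * bytes_per_score  # the N x N score matrix
    return score_bytes / 2**30                  # bytes -> GiB

# A single 64x64 latent image: modest.
image = attn_matrix_gib(frames=1, height=64, width=64)

# 16 frames at the same resolution: the matrix grows with frames^2.
video = attn_matrix_gib(frames=16, height=64, width=64)

print(f"image: {image:.2f} GiB, video: {video:.2f} GiB")
```

And that is one matrix, per head, per layer. Since the cost scales with (frames x height x width) squared, the usual workaround is factorizing attention into separate spatial and temporal passes rather than just buying more VRAM.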
This is insanely exciting. It looks like we are limited, at this point, only by compute.
A long time. Much like self-driving cars, AI can take you 90% of the way there, but the last 10% is the difference between music that sort-of-seems-competent, and music that people will actually listen to and enjoy.
Is it really the same though? A driving system which can only create a good outcome 90% of the time is not really useful as irl human safety is a factor.
However, a creative system curated by a human could end up creating useful outputs, could it not?
Obviously different safety outcomes, but ultimately it's the same idea: unless you can hit that last 10%, it's essentially useless. You could pump out AI songs that are about as good as the 10,000 songs that get uploaded to Soundcloud every day, but nobody listens to those already. It's really only the best <1% of music that people listen to.
Something that can be further refined by humans is more interesting. There are people looking into AI-based sample generation, which is a lot more promising than full-song generation, IMO.
> Something that can be further refined by humans is more interesting.
Exactly, that is what I was trying to say. The way I look at it is that most people who have Ableton installed cannot create an amazing song. Now let's say they are able to prompt a Stable Diffusion-style audio system with something like "kanye type beat with flute melody in the key of E".
The system might output 90% hot garbage, but it's easy to skip that within seconds of hearing it. So they clip and loop the good part, add whatever personal skills they do have, and upload that.
And wow, I just found out that OpenAI's Jukebox[0] was creating this stuff two years ago. This seems like the lowest-hanging fruit to me, compared to visuals. Also could be extremely lucrative. I wonder if we are already listening to ML-generated music and it's just not advertised?
I'm not so sure that the most popular music is the best 1% of music.
Artists don't always become famous or popular because they're the best, instead it happens because they're pliable in a business sense and fit into a bigger picture of what the product is supposed to be.
> Honestly, MidJourney, Stable Diffussion, et. al. have closed that 10% gap quite quickly (in months!).
Have they? If anything, the past few months has shown that after the initial hype dies down, people find the limitations very quickly. It was only weeks ago that people here and on Reddit were proclaiming that graphic designers and artists were no longer needed. But while Stable Diffusion is great at making somewhat surreal images of Jeffrey Epstein eating an apple in the style of Picasso, it's not very good at, say, making a sleek, modern user interface mockup for a bank's login portal. And it turns out graphic designers are currently only paid for one of those things.
Stable Diffusion does not generate nice UIs because it has not been trained to do so. It has mostly been trained on the "laion-improved-aesthetics" dataset, which includes "aesthetic" images. There are only very few UI mockups in there. You can explore the dataset here: https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...
You could finetune the Stable Diffusion model to generate UI mockups and would receive much better results.
Self-driving will likely grow to a point where driving becomes an art, a sport, or a hobby practiced outside of self-driving traffic. For movies and music it could be similar in some ways. When the next generations are raised on AI, their tastes will adapt as well. Classical driving, music, and film won't completely die off but will become niche.
I think it will take a lot less time than self driving cars because testing and iteration is so much cheaper. I would say that, in 5 years this feature will be part of a major commercial streaming service and people listen to it and enjoy it enough to keep using the feature.
I think we could all watch a video of 'nails on a chalkboard' long before we could stand to hear the audio of nails on a chalkboard.
For some reason, uncanny sounds make for a much wider valley than uncanny videos/pictures. The hand-holding is uncanny in the family video, but I'm fine watching it for a second - it doesn't cause pain the way the same kind of error would in music.
I've found this is my biggest issue with music streaming services. Often I just want another song similar to the one I am listening to, but no "generate radio from this track" comes close.
Not to insult your startup or your work (as this is pretty cool), but this is a good illustration of why that last 10% is so important. While the AI-generated song is 90% similar at a surface level, there's also no way anyone would listen to that track on Spotify if it came up on an Eminem playlist.
Wonder if all these things will bring about a kind of cambrian explosion of creativity.
Imagine a future of Prompt Wizards, who are able to coax the AI to generate things in a very specific way.
Although we would probably need a much greater level of human curation. The way algorithms curate on youtube and spotify just doesn't really hit the spot.
Perhaps stability and Dall-e already kind of showed that the value is not so much in the physical act of creating, but more-so in the ability to express something that the AI can represent and which can connect with you.
Think about 25 or 30 or 50 years down the line. More interesting than the next couple of years. Speculate a bit.
After Neuralink and a few other companies torture enough poor monkeys, they eventually figure out how to create high bandwidth brain computer interfaces.
We go through a few more paradigm shifts in computing and get a 10,000-1,000,000x or more performance increase. Metaverse protocols have advanced to allow for seamless integration of simulated persons and environments across multiple clusters.
The software continues to improve.
What you could get is a simulated realm with simulated AI characters living their own lives. But groups of real people are plugged directly in to the simulation and can influence it with their thoughts. There may be some sort of rules to ensure a certain level of stability. But basically you just think "there should be a storm today" and maybe visualize some strong winds. And then it happens in their world.
So at that point we become Gods.
I am probably getting carried away because I am tired.
> Wonder if all these things will bring about a kind of cambrian explosion of creativity.
Maybe, if memes are the peak of creativity. (And who am I to judge?)
If we look at static images, the bar for distribution is zero and the bar for creation is near zero since the arrival of these new AI tools - though it was already approaching zero before that.
And what seems to be most widely shared is memes, in my feeds anyway. When Stable Diffusion landed, people giggled about the president of my country rendered in the style of Grand Theft Auto. After a week of that my feeds went right back to memes.
Every human with an internet connection can draw like Picasso now, but it doesn't seem to matter. Because what we mostly seem to want is to take part in a conversation and get some validation, it seems to me.
We already have evidence that the explosion in availability decreases the overall quality, even if the quantity of high-quality productions also increases greatly.
Every creative platform is going to be flooded with the equivalent of a Reddit comment.
I really don’t understand the fear people have about these things. have I missed something and everyone else was placing huge value in out of context videos and pictures?
if you read “France declares war on Canada”, you’re not gonna believe it unless it’s coming from an extremely reputable source. so why would you trust a random unsourced video?
the absolute worst thing that’s gonna happen is that video-based social media is gonna be flooded with low (or even high) quality AI videos. and I ask you: who the fuck cares? are these places doing wonders for society as it is? what’s a bit more rubbish amongst all the rest?
We are quickly approaching a point where these independently created AI systems will be wrangled together the same way protocols were to create a computer network, and the AI that emerges will likely be able to create its own code, solve its own problems, and generate its own recursive processes. Everything at a conscious level without the ability to self-recognize. If AI ever does wake up, we'll never know; the first thing that happens is it will hide from us.
The fear is real and only seems fantastical because life is often stranger than fiction.
Instead of training on data these AIs will soon train on "creativity" and these layers of containerized thought will merge.
> first thing that happens is it will hide from us.
To do this it would need to want to live and to know what fear is. It's just a piece of code; without being conscious, there is no need for it to be online/alive. What people call AI today is just a few to hundreds of GBs of data that reacts to some text and pushes it through a heuristic-like system to get an answer, doing fancy pattern matching. It doesn't do anything when there is no input: no processes, no thinking, no anything. It's "dead".
Fear and instinct are not taught in any species; they're inherited through prior biological processes. You would need to explicitly code feelings into an AI for it to be able to "feel" anything.
We can't even define consciousness or sentience in a useful way, though. You can't decide except at the extremes; people propose inadequate definitions that place, say, bricks, flowers, ants, honeybees, cars, and people on a spectrum.
well perhaps that does sound like it might be an issue, but I’m talking more specifically about the societal consequences of video being made more like text
> if you read “France declares war on Canada”, you’re not gonna believe it
A lie that is repeated a thousand times becomes truth. We are not talking about one out of place, weird news that would appear once on someone's newsfeed. We are talking about mass flooding.
> unless it’s coming from an extremely reputable source
there’s mass flooding of text too. some people fall for it, some people don’t. it being in a different medium makes little difference.
>verifying sources
I’m talking about if your news came from Reuters or the BBC compared to coming from realnews247.io or a facebook post. the same applies. if you’ll believe some text on the BBC, you’ll believe a video from there, and if you’ll believe the words of AnonTruther on facebook, then you’ll believe their videos too. this makes no difference
Think of the SEO blogspam from corporate sites that buried genuinely interesting hand-made websites on search results pages. That is what AI-generated video is going to do to YouTube and, pretty soon, Netflix and all the other streaming services.
When will one of these projects finally name themselves “Infinite Jest”? I’m guessing when the perfect pornography can be generated for you immediately, it will have an entertainment-to-death effect on a number of people. An Infinite Jest for conspiracy theories; one for shit posts, etc.
I mean, there are some theories that the mental illness epidemic is partially caused by internet use. I did a strict 2-month Internet fast and I can attest it has a healing effect. Does it have an entertainment-to-death effect? I guess there is a bigger-than-expected number of suicides that wouldn't have happened had the people not been addicted to the Internet. But I agree: with every passing year we are moving further and further into entertainment-to-death as a civilisation.
It's funny that in terms of popular culture, we've all but accepted that Star Trek's Holodeck-type fantasy creation is something that will/should exist in a "proper" future. Yet all the intermediary technology to get there -- AI-generated visuals would be among the first such steps -- still makes most of us feel a little uncomfortable at first glance.
These things still feel a bit like e.g. Google/GCP services to me: Super appealing at first glance, quite close to what you want, but somehow never quite there. Maybe they'll asymptotically get there, eventually? Perhaps that statistical model can't really make it to the level we want it to?
I’ve found that replacing the bad parts with new ones, like Dalle Outpainting, can remove the worst parts of the image, like the hands here… doesn’t make it perfect, but certainly removes the worst offenders that instantly bring attention to themselves.
It may be that it's the deep learning tech which will never quite get there. GPT-3 has similar shortcomings in its mimicry. We're 95% there, I guess, but may never quite reach 100%.
Nah, the current issues are just because we're trying to do everything in one step. Because we've built tools that have so much of a stimulus-response approach, few efforts have been made toward interfaces that ask for clarification ('when you say X, do you mean XYZ or XXX?').
Image-to-image and tuning already addresses many of these issues; just as inpainting works really well, it won't be long before we have select-and-repair, where you add an additional prompt like 'improve this part - the ice cream is fine, just work on the dog's muzzle.'
The mistakes the AI makes are too numerous and hard-to-define for this to work I think. They could perhaps be addressed by having two different models trained differently, each fixing the errors of the other. When humans draw a realistic artwork, it's not 'single-pass'; they have to iterate on the details to get it right.
I get the same feeling as well. This approach may well be eternal demo-ware, and you'll actually need AGI (or manual direction by a real human) to get to 100%.
The hands throw me off. The same with the cat holding the remote... never thought that hands on animals would be able to trigger my uncanny valley response, but here we are
if people weren’t so repressed, this could also be used to severely reduce exploitation in the porn industry. what’s the point in making and selling exploitative porn when it can be auto-generated at will?
"Our research takes the following steps to reduce the creation of harmful, biased, or misleading content."
"Our goal is to eventually make this technology available to the public, but for now we will continue to analyze, test, and trial Make-A-Video to ensure that each step of release is safe and intentional."
Are they really going to do a replay of OpenAI and Stable Diffusion? Deja vu coming soon.
What does "intentional" mean here? Does it mean that the person intends to type the text that they do? My cat sometimes walks across the keyboard, but for the most part I pretty much intend to type everything I type. Am I missing something?
It's referring to the release of the software. I read it as "released with minimal embarrassing unintended features". Like, as a rough made-up example, a "boy saluting" prompt producing a video of a Hitler Youth member giving a Nazi salute wouldn't make for great PR.
Very exciting to see the field progressing so quickly. I wonder how quickly it's going to move forward from here. Will we be generating coherent audio to accompany these videos soon? Will we have multi-scene videos in the next year? Ones with coherent plot? Can we get there just by scaling up, or are other advances needed? Excited to see what comes next!
I'm very interested in what will come out of this new (sub)medium. Because video is a collaborative medium, I never feel like I'm getting a message from a singular consciousness the way I do from less resource-intensive mediums like books (I know that book editors exist, but the medium has fewer filters to pass through compared to large productions like movies). I could see this substantively lowering the barrier to entry for video and enabling a lot of new stories to be told.
You can view all their example videos at https://make-a-video.github.io (warning: all the ~95 videos are webp and are loaded at once on the page so it may take some time to load)
My daughter just had her first child, a baby boy that was born in the first week of July. The rate of progress in AI/ML technology is nearly unbelievable and my mind boggles at how it's going to influence his childhood (both good and bad).
My daughter was born in 1998. The Internet made a bit of a phase change during her early childhood and there were many instances where there was no precedent to guide a decision on what to do, so we just had to wing it as parents based on our own intuition and values. I certainly got some things wrong but we made it in one piece. I just worry a bit about what kind of new challenges my daughter will face in raising her son in this new environment, where powerful organizations wield unbelievable resources to create deep, life long co-dependencies on their revenue streams and intellectual empires.
No worries, mate, just another shiny toy. As far as us getting milked for every bit of money, this has been going on for some time now, nothing really new, right? Douglas Adams put it best - "I've been to the future. It's the same as anywhere else. Just the same old stuff in faster cars and smellier air."
What is interesting to me is the decline in both understanding and communication abilities. When I look at modern speech and compare it to books written hundreds of years ago the deficiencies are stark.
I can't imagine how poorly people will speak and mentally process emotions in 20 years.
If you took an average working-class, "blue collar" person from hundreds of years ago, they would speak very differently from how those books are written.
Ok, maybe a better comparison that is sort of apples to apples would be to take a speech by George W. Bush or Donald Trump compared to say, George Washington or Abraham Lincoln.
Surely you can see there is a sharp drop off in vocabulary and ability (or desire) to convey complex ideas.
I remember the threads about StableDiffusion less than 4 weeks ago where people were confidently saying this kind of thing is 6 to 12 months away. Turns out it was a month. Singularity soon?
At this rate, within a few years we'll have people creating low quality movies just by describing what they want the movie to look like. Soon, every book will have a video version. As someone who has written several unpublished science fiction novels, I can't wait.
Some of these look like existing videos that have just been garbled - check the clownfish one and the litter of puppies - maybe that's because the prompts aren't that detailed, or they just need to up the "creative/randomness" factor.
The sloth one, though, has a more specific prompt and came out looking better and more original.
With the huge amount of data being used to back these, how do we know the "uniqueness" of the generated content? is it original or is it just mangling of existing content?
Even if it is just garbling existing content, it is pretty amazing. Frame by frame the mutant unicorns float across the beach hah.
Reading the paper, what's interesting is that they really build upon text-to-image models. Instead of generating 1 image with a diffusion model, they generate 16, and then use several upsampling/interpolation methods to increase the resolution/framerate. For the text-to-image model they use the DALL-E 2 architecture, so in theory just switching to a latent diffusion model (like Stable Diffusion) would reduce the training/inference cost of the model.
The shortcoming is that, since they don't use any video/text annotation (just image/text pairs), complex temporal prompts would probably not work - anything that requires multiple steps, like "a bear scoring a goal, then doing the Ronaldo celebration".
One can imagine that resolving this will require a separate model to construct a timeline/keyframe series from a text description then interpolate between them.
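The coarse-to-fine pipeline described above can be sketched as a toy illustration. Random arrays stand in for the diffusion model's outputs, and all the function names, shapes, and interpolation choices below are assumptions of mine, not the paper's:

```python
import numpy as np

def generate_keyframes(n_frames=16, size=64):
    """Stand-in for the diffusion model: emit low-res, low-fps frames."""
    rng = np.random.default_rng(0)
    return rng.random((n_frames, size, size, 3))

def upsample_spatial(frames, factor=4):
    """Stand-in for the spatial super-resolution stage (nearest-neighbour)."""
    return frames.repeat(factor, axis=1).repeat(factor, axis=2)

def interpolate_temporal(frames, factor=4):
    """Stand-in for frame interpolation: linear blends between neighbours."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for t in np.linspace(0.0, 1.0, factor, endpoint=False):
            out.append((1 - t) * a + t * b)
    out.append(frames[-1])
    return np.stack(out)

low = generate_keyframes()        # (16, 64, 64, 3) - coarse clip
hi = upsample_spatial(low)        # (16, 256, 256, 3) - higher resolution
video = interpolate_temporal(hi)  # (61, 256, 256, 3) - higher framerate
print(video.shape)
```

The real system of course uses learned super-resolution and interpolation networks rather than nearest-neighbour and linear blends, but the structure is the same: a cheap coarse pass over time, then separate passes to buy back resolution and framerate.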
Fascinating times. Either way, this is allegedly being open-sourced, so I'm hopeful others will be able to build on top of it.
I wouldn't have thought we'd have these text-to-video models before the end of this year, so we might actually not be that far from minutes-long video generation
I would guess that it didn't take long running on Meta's servers, but if you were doing this at home with a 3090 it would take about 6 minutes per second of video - roughly 90 minutes for a 15 second clip - assuming it takes 15 seconds to render a single frame and there are 24 frames per second in the video.
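Working those assumed numbers through (15 seconds per frame on a 3090, 24 fps output, a 15-second clip):

```python
seconds_per_frame = 15   # assumed single-frame render time on a 3090
fps = 24                 # frames per second of output video
clip_seconds = 15        # length of the target clip

frames = clip_seconds * fps                  # 360 frames to render
total_seconds = frames * seconds_per_frame   # 5400 s of rendering
print(f"{frames} frames -> {total_seconds / 60:.0f} minutes")
```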
I could see this type of technology being used at some point in the future to essentially algorithmically generate content (movies, tv shows, etc) for viewers to watch or engage with. I wonder if it leads to extremely customized content where what I watch is 100% different from any other user on the same platform. However, I also wonder if people enjoy watching the same things because it becomes harder to talk about a movie you’ve seen if there’s no way anyone else could have ever watched it without you sharing it with them.
it looks like bizarre dreams rendered as video... so far. but it will become less weird... eventually being used in VR or AR to give you instant environments/scenarios as you wish, I would imagine
This is going to be a revolution, I already imagine AI movies in cinemas, pioneering hits at scale of Avatar on IMAX 3D, remakes of classics just based on screenplay texts.
The progress in AI we are seeing is unprecedented. I can't think of any other time in history where I would have had my mind blown on such a regular basis.
“Your scientists were so preoccupied with whether they could, they didn't stop to think if they should.” –pithy quote from a summer action movie
I'll just say it now: this is a mistake, quite possibly a huge mistake. The average human is not intelligent enough to deal with computer-generated video that they can mistake for reality, and so this can and will become a tool for despots.
Only a small matter of time now before we have Vec2TikTok (or similar) where you'll be served up videos specifically optimised with the kind of content that will keep you scrolling or watching more.
People will end up watching news about events that never happened or product reviews for things that don't exist.
All the comments seem to be discussing the implications of this as if it worked, but almost all of these look like complete garbage. All the humans look deformed and broken, or transparent in places they should not be. The only ones that are passable are the fantastical ones like the sloth with the laptop.
A lot of people saying it's over for traditional movie-making - lmao. I look at these and see nothing but uncanny valley artifacts, and I don't think it will improve much from here.
It's like self-driving cars. They use very effective statistical models, certainly better than our previous ones, but they never seem to shake off that "almost" and become truly effective.
The first thing I thought was the exact opposite. This isn't very good, but it's only version 1. Motion pictures are less than 150 years old. In another 150 years I bet virtual filmmaking will progress a lot.
I don't agree with the prediction, but I will say that, at least in America, we've been overly optimistic about almost everything since World War II. It's particularly painful when you look at the imagination of science fiction versus the reality of science in the realm of, say, space travel, especially interstellar travel. This imagination effect gets the public behind your new invention in part because it seems to be just the beginning, the start of something new. The tip of the iceberg. But sometimes it's really just a tiny bit of ice.
For the record, I'm actually rather bullish on self-driving cars. There's nothing physically impossible about solving the problem, but I'm not surprised it's harder than it sounds. But I don't see humans being fundamentally prevented from solving the problem in the same way that humans are fundamentally prevented from ever engaging in everyday space travel.
Absolutely, particularly if you see this as simply another tool in the belt, like how Jurassic Park's thirty year old special effects absolutely stand the test of time because of them being a convincing mix of early CGI and puppetry.
It's not hard to imagine that this kind of thing could end up doing a lot of the heavy lifting for things like background scenes in the future, opening up the kind of stuff we saw in The Mandalorian, Game of Thrones, and the LOTR film trilogy to increasingly lower and lower budget productions.
The first Jurassic Park holds up unusually well. They used CGI for things they couldn't convincingly do any other way, and used practical effects for almost everything else. And for the most part they hid the CGI well, they took a lot of care to mix it up with close ups of practical effects, the way they framed the shots to direct audience attention, etc.
I think in a lot of modern stuff they go too far. They now use CGI for things that could easily be practical effects, but they go with CGI because it's simply cheaper or because they want to A/B test different colors of wall paneling behind a character in post production, or who knows what. The end result is apparently good enough to make a billion dollars, but I can't stand it. Movies don't feel authentic anymore. It's hard to describe rationally, but the word 'soulless' sums up how movies made the new way make me feel. Even the scenes which are wholly practical/real get degraded; the excessive CGI and compositing used in the rest of the movie cast a miasma of unrealness across the entire movie.
I remember listening to the DVD commentary track for X2 (from 2003), and loving that scene near the beginning where Nightcrawler is hiding out in a cathedral but losing control of his teleportation powers because he's being remotely controlled/possessed by the villain of the story. Bryan Singer talks on the track about how great it was to fling the character (can't remember if it was actually Alan Cumming or a stuntman) from the rafters, and how much better and more weighty it looked to have a "real" shot and just have to remove wires/harnessing vs filming it as a blank canvas and having to insert the character digitally.
Haha, why did I know this was going to be a Critical Drinker video.
For real though, I think the best CGI has always been when it's a light touch, providing only slight enhancements to an otherwise mostly-practical scene. And that's also when it's most invisible— so while superhero movies are obvious CGI-fests and can be clearly said to have been "ruined" by it, I think the most interesting modern CGI use is in lower-budget productions like TV shows, enabling the insertion of fantasy- or period-themed backdrops that would never be possible if you had to actually come up with all those props and extras in real life.
And I think to the extent that this is already happening, it's a lot harder to track because it's so much more subtle than the bombastic in-your-face effects of a Marvel movie showdown.
It would be interesting to see a revival of a show like Drive [0] as that was pretty ambitious and expensive for its time, but might be a lot more possible to do now.
Most of the AI generated images are good enough to be first drafts or sketches, and with some editing, they can be coherent images. This makes AI generators a good tool. Not sure the same is true for video though. It feels like you'll need reliable still image creation before you can get to a video generator that's useful since editing video takes a bit more than a still image.
The key to predicting technological advancement, that is so often misunderstood by technologists, is the S-curve.
Laypeople tend to think everything is linear, technologists tend to think everything is exponential; more often than not, reality tends to be sigmoidal. Technologies have exponential takeoffs followed by logarithmic plateaus. We're clearly well into the exponential phase of deep-learning ML, but it's only a matter of time before this approach hits its logarithmic phase.
Of course, the hard part of sigmoidal prediction is determining where in the curve we are. Does the current paradigm have an even steeper part ahead of it? Maybe. And yet, we could just as easily be right in the middle of the function, with a leveling off coming as the state of the art gives way to incremental improvements.
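For concreteness, the S-curve being described is the logistic function. The parameters below are arbitrary, just to show the shape:

```python
import math

def logistic(t, midpoint=0.0, rate=1.0):
    """Logistic S-curve: exponential take-off, then a plateau."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))

# Early on, equal steps multiply progress - it looks exponential.
early = [logistic(t) for t in (-6, -4, -2)]

# Late on, the same step sizes barely move the needle - the plateau.
late = [logistic(t) for t in (2, 4, 6)]

print(early, late)
```

The trouble, as noted above, is that from inside the curve the early exponential and the middle of the S are locally indistinguishable; you only learn where the midpoint was in hindsight.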
Case in point, just compare the results of Stable Diffusion / Dall-E / midjourney to papers from 5-10 years ago (here's a random one I found http://proceedings.mlr.press/v48/reed16.pdf). Remarkable progress in that time, and as I understand it, a good amount (but certainly not all) of the improvement comes “free” from being able to scale up to more weights.
> A lot of people saying it's over for traditional movie-making - lmao.
The groups supporting such absurd claims largely boil down to:
- Money-driven researchers in the applied AI field. Those are the people that spam popular ML conferences with barely novel contributions other than some minor tweaks on code from their previous papers.
- People unable to critically think and evaluate the significant limitations of SOTA methods occasionally marketed as AGI breakthroughs.
The last is virtually the entire HN userbase, and the one that needs to be taken the least seriously.
The first category is much more troubling, however, since they can significantly influence research directions due to the broken citation system in academia (more citations --> higher quality contributions).
On your second point, a few years ago the SOTA was Google's Deep Dream.
I agree with your deeper criticism, though preferential attachment/ranking is very much How Humans Do Things. You could do a much improved citation system by expanding the time dimension and looking at papers that were unpopular at first and then attracted wide interest later.
Of course, academics also have a tendency to over-cite (because they don't want to be rejected for inadequate literature review), so there are incentives to cite a bunch of research whose premises or conclusions you hope to overturn.
I don't think AI will be able to create a movie anytime soon, but I think it will become "good enough" to serve as inspiration for creatives, or to replace simple stock footage (much like SD and DALL-E 2 are now).
It will definitely improve from here onward. On the other hand, I agree that it's ridiculous to say that traditional movie-making is over. There's so much involved in making a movie, so much happenstance from the actors' performances in a specific environment. You will never be able to get this from AI.. you might get some sort of mimicry, but the comparison is futile.. a movie isn't just a sequence of changing pixels.
What you’re going to see is a race to the bottom, the same as with claymation films and 3d.
It will suddenly require far fewer (expensive, highly trained) people to make the same films.
You'll still need expensive, highly trained people to do it, but with different skills, and far fewer of them will be able to do a lot more, a lot more quickly.
…and that means that some studios will make bad, low grade films… and some studios will make amazing films using hybrid techniques (like 3d printed faces for claymation).
…but overall, the people funding movies will expect to get more for less, and that will mean a downsizing of the number of people employed currently in certain roles.
Traditional film making over? Hm… it’s complicated. Is it over if the entire industry changes, but people are still making films? Or is that just the “traditional” part of it which is over?
It’s definitely going to change the industry.
People will still definitely film things.
…but, I wager, fewer people will be doing highly paid skilled manual work, which will be replaced by a few people doing a different type of AI-assisted work.
…and we’ll see some really amazing indie films, with small, highly technical teams producing content with very little physical filming.
> A lot of people saying it's over for traditional movie-making
While these early versions are primitive, as a traditional filmmaker I think in a couple years these technologies will creatively empower visual storytellers in exciting new ways. The key will be developing interfaces which allow us to engage, direct and constrain the AI to help us achieve specific goals.
The leap from "generating a frame" to "generating a video" is not as big as the leap from where we are now to a 'perfect' image synthesis engine (which would be required to get rid of artifacts).
The much more likely scenario IMO is that people get used to the artifacts and notice them less.
The AI also needs to have a 'sense' of the world the scene exists in, so that longer videos can be created that are coherent. So far we've only seen long videos that have no consistency and jump around like an acid trip.
A good path forward is to fuse these image-element compositing tools with some of the 3d scene inference ones. So you start out with 'giant fish riding in a golf cart, using its tail to steer', then give that as ground truth to a modeling tool that figures out a fish and a wheeled vehicle well enough to reference some canonical examples with detailed shape and structure, the idea of weight etc. Then you build a new model with those and do some physics simulation (while maintaining a localized constraint of unreality that allows the fish to somehow stay in the seat of the golf cart).
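The pipeline described above can be sketched roughly like this. Every function here is a hypothetical placeholder stub, not a real API; an actual system would swap in an image generator, a shape-inference model, and a physics engine at each stage.

```python
# Hypothetical sketch of a compositing -> 3D inference -> simulation
# pipeline. All function names and return shapes are invented for
# illustration; nothing here refers to a real library.

def generate_composite(prompt):
    # Stand-in for a text-to-image compositing model.
    return {"prompt": prompt, "elements": ["fish", "golf cart"]}

def infer_scene(composite):
    # Stand-in for 3D scene inference: map each element to a canonical
    # mesh with detailed shape, structure, and an estimate of weight.
    return [{"name": e, "mesh": f"{e}_canonical", "mass_kg": 50.0}
            for e in composite["elements"]]

def simulate(scene, constraints):
    # Stand-in for a physics step that honors localized "unreality
    # constraints" (regions where normal physics is suspended).
    for obj in scene:
        obj["pinned"] = obj["name"] in constraints
    return scene

composite = generate_composite("giant fish riding in a golf cart")
scene = infer_scene(composite)
# Keep the fish in the seat despite physics saying otherwise.
frames = simulate(scene, constraints={"fish"})
```

The interesting design question is the last stage: the constraint set is what lets the output stay surreal where the prompt demands it while everything else obeys simulated physics.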
Their video training set is still small (10M + 10M). A lot of the interpolation artifacts seem to come from the model not having acquired enough real-world understanding of "natural" movements (see the horse-running and sailboat examples). I suspect scaling this up 10x would yield far fewer artifacts.
Reading the paper, it seems to be the "right" approach (separating temporal / spatial for both convolution and attention). Thus, I am optimistic what remains is to scale it up.
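For a rough sense of why factoring spatial and temporal attention matters, compare the attention costs (illustrative token counts, not numbers from the paper): joint spatio-temporal self-attention scales with the square of all tokens at once, while the factorized version does spatial attention per frame plus temporal attention per spatial location.

```python
# Illustrative cost comparison. The latent grid dimensions are made up:
# 32x32 spatial tokens per frame, 16 frames.
H, W, T = 32, 32, 16

full = (H * W * T) ** 2                            # joint attention over all tokens
factorized = T * (H * W) ** 2 + (H * W) * T ** 2   # spatial per frame + temporal per location

print(f"full: {full:,}, factorized: {factorized:,}, "
      f"ratio: {full / factorized:.1f}x")
```

Even at these small toy sizes the factorized form is over an order of magnitude cheaper, which is what makes scaling it up plausible in the first place.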
The horse leg motion in these examples is really poor. But to be fair, horse legs are very difficult for human animators too, generally necessitating the use of reference photos to get it right. The first time somebody got a series of photographs showing a horse in motion caused quite a stir: https://en.wikipedia.org/wiki/The_Horse_in_Motion
I'm not sure I'd agree with the "it won't improve much from here" sentiment, but I am a little confused by the sharp disagreements with this comment. They seem to contain a tacit assumption that the complexity curve for this problem is smooth, but I could easily imagine a very sharp elbow in that curve, making any progress come to a seemingly grinding halt.
We already have films that use "similar" technology (for example, adding Leia in the new SW movies), and this is just another step, another improvement into the direction of autogenerated videos. Are we far from generating complete movies? Sure! Are we progressing? Yes too!
It's not over for traditional movie-making. It will be decades before the software and hardware could surpass it. But it will improve tremendously, just as computers have for nearly everything.
The only way it will be over is that tiktok/instagram will attract more eyeballs and hurt their bottom line. Ultimately this is for amateurs to create really short form content for the foreseeable future.
This tech went from nothing to beating a human artist in an art competition in a few years, and yet you say "I don't think it will improve much". So, I disagree.
You'll see an excellent mostly- or all-AI feature within 5-10 years. There will be terrible ones before that, maybe 2-3 years. The first really good one will be enjoyed on its own merits, and the artificiality of it will come to light after it has gained popularity.
I would imagine you’ll see one or a few that are barely AI-produced as soon as marketing teams get wind of this. “come and see the first ever AI-produced* blockbuster!
*produced, assisted, or discussed on set one time during lunchbreak”
If this technology (or its future iterations) is not enough, you can always combine it with your own movie cuts. This lowers the resources/bar needed for creative people, and obviously also for the uncreative ones.
This is a good example of the most abject type of Luddism: telling yourself that a fast-developing tech you happen to dislike is already at or near the limit of its potential and that defects you notice now will be there forever.
It's not name calling to make a critique of an argument. Luddism is a real term that describes a particular philosophical tendency that places itself in opposition to what it views as a naive technophilia.
The Luddites were a labor rights movement reacting to the new shifting power dynamic of centralized production owned by a few, in an era with virtually no safety and worker rights regulations. How would you feel if your career evaporated and you were forced to choose between starvation and sending your children to work in a textile factory where they get maimed by the machines they're told to crawl into and repair? You probably wouldn't find much comfort in the popular retort of "but shirts are really cheap now!"
That Luddites have been successfully maligned as irrationally anti-technology crazy people is a propaganda victory by industrialist factory owners and their friends, the newspaper men.
Whatever their motivation, their ire is misguided and selfish. Is the world supposed to just sit around and never innovate or try to become more productive? So people can have a job doing work better done by machines? I don't think so. The Luddites, and any such analogs today, are focused on the wrong thing: they should not attack new technology or the companies and individuals using it, but rather campaign for a better social safety net from the government.
Maybe new technology should be restricted by default until regulation can catch up with it. I'm not sure, but the 'cheap shirts benefit everybody too much to pump the brakes' argument for unfettered innovation leaves me deeply unimpressed.
IMHO the best argument for unfettered innovation is the impossibility of slowing innovation globally; it can be slowed in one country, but that country can't force all the rest to get with that program and will eventually be overtaken by technologically superior foes.
You are hilariously wrong if you think it will not improve. Combination of prompt engineering, better and bigger datasets, better architectures, bigger models, artifact-fixing tools and sheer human creativity will seriously challenge traditional movie-making within two years.
The maximum compression for a video file of a movie is its script.
Perhaps eventually the most common (and time-tested) piracy platform will be email. Or maybe newsgroups (are they still on?), but the message payloads will be losslessly compressed text.
This tech will definitely revolutionize stock footage. Generating a post-modern cyber-ghost glitchcore horse drinking water is definitely way more exciting than just using some boring old video of some stupid real horse drinking water
The thing that frightens me is that we are rapidly reaching broad humanity disrupting ML technologies without any of the social or societal frameworks to cope with it.
Precisely, this is a dream for the likes of Netflix. They can perfect their vision of machine-generated assembly-line horseshit to feed to 190 countries. No staff, crew, no originality needed.
Note the "AI by Meta" watermark in the example. So Zuck gets a cut as well.
If you look out long-term, 2030s and 2040s, this is where all of this is likely going.
Right now the pool of content you might be interested in is constrained to all the content that has been made. But there may be better content that does not exist yet that you would be even more interested in. The future is going to be very weird, but also very entertaining
I don’t want to think reality is a simulation, but wouldn’t our everyday experience being generated by a neural network be far, far simpler to achieve than a universe of infinite minuscule atoms all interacting? Occam’s razor seems to point toward our lives being a realistic dream.
Like in 10 years you could plug this tech into high end VR and get a prompted reality dynamically generated that would be indistinguishable from our own.
That what we're seeing is really reality is the simpler explanation. Because the alternative is that what we see is a simulation ... inside another reality that has full complexity. Which increases the overall complexity. Thus, Occam's Razor says what we see is likely real and not a simulation.
Check out the original simulation argument paper[0]. The issue is, if we think we are heading to a world in which we can do simulations, it becomes increasingly likely that we are in one of those (presumably very many) worlds, rather than in the one world that existed before the advent of such simulations.
Sounds good because it seems like it explains what our reality is, but it really doesn't. It just pushes the fundamental problem up X simulations into the supposed "real reality". It also assumes that simply simulating a universe would generate conscious beings, which nobody knows the answer to; my guess is that it would not.
These particle physics are still needed. We generally assume the dream you have at night does not take place within a whole universe, yet it is still governed by some rules; for example, you may experience gravity within your dream and use it to move around, when there actually isn't any gravity, it just seems like there is.
It's equally important to understand and study the rules of this waking dream you are having while reading, not because there are actually atoms out there but because it behaves like there are. You can do physics without metaphysics, although the two usually get entangled somewhat.
The ‘outside’ reality only needs to be the size of a data center, not the entire universe. And maybe the outside reality only needs to be a trained 100gb model. A much simpler outside reality than we have now. You don’t even need to sim 8 billion people, just 1, you.
Occam's razor in this case applies to the complexity giving rise to a situation, though, not the complexity of a described system.
Conventional physics giving birth to our universe is currently the model with the fewest assumptions.
What would have to take place to give rise to a universe the size of a data center, running an AI model of a human? It feels like we have to bake in assumptions of stable physics, a rise of a stable system for that data center, and some path towards creating it and modelling a human.
That said, if we believe we're capable of running billions of believable simulations, then we're more likely to be in such a simulation than ground reality. But a datacenter pocket dimension bakes in a lot of assumptions that make it less likely than our own universe.
The outside universe doesn’t require our complex particles or physics - all that is overkill to run a neural net. Also think of all the data we can store in a fingernail, stable diffusion is like 4gb, what could a 4tb model produce?
You only need a universe the size of a data center with nothing in it but a bit of vacuum fluctuation that causes particles to appear sometimes. Then just wait.
Exactly. The simulation theory proponents almost always seem to presume the existence of some sort of being that deliberately created the simulation. What if the simulation is a complete accident that has arisen from the complex interaction of brainless bacteria competing for scarce resources on the surface of a big rock? An accidental simulation emerging from randomly initialized 'cellular automata' on one rock in a host universe containing trillions of such rocks.
The usual presumption of the simulation being a deliberate construction of a conscious being makes the whole thing seem like nu-religion for people who reject supernatural things. With the presumption of a being deliberately creating the situation, you pull in these notions: We're special, we exist on purpose, we are probably being examined and judged. This reeks of religion.
Because even if our reality came from that rock bacteria, we're on the cusp of creating simulations within it, and statistically, beings may still expect to be in a deliberate simulation if we think we'd eventually simulate more beings than exist.
The rules of the presumed host universe are unknown, but which do you guess is more common in ours: trillions upon trillions of rocks, many of them probably covered in organic goo interacting with itself, or intellectually sophisticated computer scientists deliberately designing simulations? I think it's got to be the goo.
The neural net itself needs something to run in (some universe). If the argument is that it needs a smaller or simpler universe than ours: first of all, it is simulating our universe, so all of our complexity is included anyway; but also, maybe a neural net in a big chaotic universe is more likely than a tiny universe designed to perfectly fit one neural net.
Neural nets as an abstract concept don’t necessarily need to be built with atoms, just the substance of the external universe which may be far simpler.
I also didn't imply this neural network's universe needs atoms, but it needs to run the computation somehow; the computation itself is the complexity, no matter what physics or other manifestation it shows up in.
And if it is simulating our universe's atoms, then that part of it basically is our atoms. But is a neural net doing that really simpler than the atoms just running with a more direct mathematical model?
You don’t need atoms, just a universe with the minimal blocks to build a Turing machine. Like building a computer in Minecraft, it could conceivably run these neural nets without any sense of ‘atoms’. The point is the outside universe could have an extremely simple design.
If you're assuming each atom is independent of other atoms then yes. But if you believe the universe is deterministic and atoms are just following the laws of physics then you can think of the universe as nothing more than a computer with electrons instead of atoms just floating around.
In the sim theory, the external neural net tells us our brain is made of atoms.. you could theoretically feed your senses a Minecraft reality and you'd think your head is full of interacting blocks - which is still capable of performing the same NN functions as a wet brain... So I don't know, the brain could be made out of anything that fits the requirements of running the computation.
That's assuming there even really is one - senses can be hijacked, just like in dreams, so we may only think we have a brain - so strange.
Why then program the brain to try to understand that it's in a simulation when the goal was to create a realistic simulation in the first place? That'd be a rookie mistake on part of the simulation developers.
Ever wonder why you weren't born a medieval peasant?
Well, from the outside reality's perspective, it's helpful for people to spend the first few decades of their lives in an early 21st century simulation, just so they can gradually acclimate themselves to all this technology.
Yeah, my grandfather once said that he lived in the most exciting possible time, having been born before the first automobile, and having lived to see a man on the moon.
But even so, this era feels like it could be a singular phase shift. Maybe.
Occam’s razor is a human-made concept; the universe doesn't obey human laws, it's the opposite
> Like in 10 years you could plug this tech into high end VR and get a prompted reality dynamically generated that would be indistinguishable from our own.
They said that 10 years ago about VR and it still is dog shit
>Our goal is to eventually make this technology available to the public, but for now we will continue to analyze, test, and trial Make-A-Video to ensure that each step of release is safe and intentional.
With DALL-E for pictures and this for video, I'm starting to wonder if Neal Stephenson was onto something when he wrote Fall, or Dodge in Hell.
I know it's not the most loved Stephenson book, but bear with me (spoilers warning). The book features a couple of major plot points that feel increasingly prophetic:
- A global hoax, carefully coordinated, convinces a good chunk of the world that Moab has been destroyed by an atomic bomb. This is managed via a flurry of confusion, thoughtfully deployed pre-recorded video footage, social media posts, and paid actors who don't realize the scope of the hoax at the time. Naturally, a massive chunk of the population refuses to believe that Moab still exists even after the hoax is exposed.
- A group of hackers deploy and open source a massive suite of spambots that inundate social media and news sources with nonstop AI-generated misinformation to drown out real-world conversations about a topic. This is used specifically to drown out real conversation about a single individual as a proof of concept, but soon after has repercussions for the entire net...
- Thanks to exactly this kind of spambot, the "raw" unfiltered internet becomes totally unusable. Those with means end up paying for filtering services to keep unwanted misinformation out of their perspective of the internet. Those without means... either don't use the internet, or work in factories where human eyes filter out content that can't be filtered by AI.
I worry that exactly these kinds of developments are speeding us faster and faster down the road to a dystopian world where the "raw" internet is totally unusable. Right now, stuff like captcha and simple filters can keep out a lot of low-effort bot content on sites like Hacker News and niche forums (I think of home barista, bike forums, atlas, etc). Sites like Reddit are losing the war against bots and corporate propaganda; comment sections across the rest of the internet lost that war long ago, and just didn't realize it.
But those filters and moderators can only keep up with the onslaught of content for so long. What happens when GPT is used to spew millions of comments and posts at a forum from millions of ephemeral cloud IPs? And when DALL-E creates memes and photographic content? And when Make-a-Video enables those same spambots to inundate YouTube and Vimeo? It's clear that captchas are not long for this world either.
Will we see websites force more and more users to authenticate as a "real human" using their passport and government-issued ID? Maybe a Turing or Voight-Kampff test? And what does it mean when there's no longer a way to participate on the internet anonymously? As far as I can tell, limiting a site to only real human users doesn't guarantee quality -- all you have to do is look at Facebook to understand that. And somehow, despite being an incredibly easy target for bots and spam, niches of 4chan retain (racist, insane, and confusing) traces of genuine thought and conversation.
The internet has been such a valuable tool in my life, and I still love browsing blogs, forums, etc. to learn about unique people doing unique things. What happens when human-generated content is hard to come by and near impossible to distinguish from promotional AI garbage? I fear for my ability to discover new content in that world.
Please don't post snarky comments, or shallow dismissals, to HN. You may not owe BigCo better but you owe this community better if you're participating in it:
[1] This Movie Does Not Exist: https://thismoviedoesnotexist.org/