Sora: Creating video from text (openai.com)
3647 points by davidbarker 3 months ago | 2231 comments



Related ongoing thread: Video generation models as world simulators - https://news.ycombinator.com/item?id=39391458 - Feb 2024 (43 comments)

Also (since it's been a while): there are over 2000 comments in the current thread. To read them all, you need to click the More link at the bottom of each page, or use links like these:

https://news.ycombinator.com/item?id=39386156&p=2

https://news.ycombinator.com/item?id=39386156&p=3

https://news.ycombinator.com/item?id=39386156&p=4 [etc.]


This is insane. But I'm impressed most of all by the quality of motion. I've quite simply never seen convincing computer-generated motion before. Just look at the way the woolly mammoths connect with the ground, and how their lumbering mass feels real.

Motion-capture works fine because that's real motion, but every time people try to animate humans and animals, even in big-budget CGI movies, it's always ultimately obviously fake. There are so many subtle things that happen in terms of acceleration and deceleration of all of the different parts of an organism, that no animator ever gets it 100% right. No animation algorithm gets it to a point where it's believable, just where it's "less bad".

But these videos seem to get motion entirely believable for both people and animals. Which is wild.

And then of course, not to mention that these are entirely believable 3D spaces, with seemingly full object permanence, as opposed to other efforts I've seen, which basically animate a 2D scene briefly to make it seem vaguely 3D.


I disagree, just look at the legs of the woman in the first video. First she seems to be limping, then the legs rotate. The mammoths are totally uncanny for me, as they're both running and walking at the same time.

Don't get me wrong, it is impressive. But I think many people will be very uncomfortable with such motion very quickly. Same story as the fingers before.


> I think many people will be very uncomfortable with such motion very quickly

So... I think OP's point stands. (impressive, surpasses human/algorithmic animation thus far).

You're also right. There are "tells." But, a tell isn't a tell until we've seen it a few times.

Jaron Lanier makes a point about novel technology. The first gramophone users thought it sounded identical to a live orchestra. When very early films depicted a train coming towards the camera, people fell out of their chairs... blurry black and white, at a super slow frame rate, projected on a bedsheet.

Early 3D animation was mindblowing in the 90s. Now it seems like a marionette show. Well... I suppose there was a time when marionette shows were not campy. They probably looked magical.

It seems we need some experience before we internalize the tells and it starts to look fake. My own eye for CG images seems to be improving faster than the quality. We're all learning to recognize GPT-generated text. I'm sure this generated motion will look more fake to us soon.

That said... the fact that we're having this discussion proves that what we have here is "novel." We're looking at a breakthrough in motion/animation.

Also, I'm not sure "real" is necessary. For games or film what we need is rich and believable, not real.


> You're also right. There are "tells." But, a tell isn't a tell until we've seen it a few times.

Once you have seen a few you can tell instantly. They all move at 2 keyframes per second, which makes all movements seem alien, and everything in a scene moves strangely in sync. The dog moves in slow motion since it needs more keyframes, etc. In that street scene, some people look like they're moving in slow motion and others don't.

People will quickly learn to notice those issues; they aren't even subtle once you are aware of them, not to mention the disappearing objects etc.

And that wouldn't be very easy to fix: they need to train on keyframes, because training frame by frame is too expensive.

But that should make this really easy for others to replicate. You just train on keyframes and then train a second model to fill in between keyframes, and you get this (a rough sketch of that recipe is below). It has some limitations, as we see with movement keeping the same pace in every video, but there are a lot of cool results from it anyway.
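
A minimal sketch of that two-stage recipe, assuming PyTorch (my guess at the idea, not anything OpenAI has published; all names here are made up): train an interpolator on real clips by hiding the middle frame and asking the network to reconstruct it from the two bracketing keyframes.

    # Hypothetical sketch of the two-stage idea above: some generator produces
    # sparse keyframes; this second model learns to fill in between them.
    import torch
    import torch.nn as nn

    class FrameInterpolator(nn.Module):
        """Predicts an intermediate frame from two bracketing keyframes."""
        def __init__(self, channels=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2 * channels, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, channels, 3, padding=1),
            )

        def forward(self, key_a, key_b):
            return self.net(torch.cat([key_a, key_b], dim=1))

    model = FrameInterpolator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(video):
        # video: (batch, 3 frames, channels, height, width); hide the middle
        # frame and reconstruct it from its neighbors.
        key_a, target, key_b = video[:, 0], video[:, 1], video[:, 2]
        loss = nn.functional.l1_loss(model(key_a, key_b), target)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()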


I have a friend who has worked on many generations of video compression over the last 20 years. He would rather watch a movie on film without effects than anything on a TV or in a digital theater. He's trained himself to spot defects, and now, even with the latest HEVC H.265, he finds it impossible to enjoy. It's artifacts all the way down and the work never ends. At the Super Bowl he was obsessed with blocking around fast objects, screen-edge artifacts, flat-field colors, and something with the grass.

Luckily, I think he'll retire sooner rather than later, and maybe it will get better then.


I think a lot of these issues could be "solved" by lowering the resolution, using a low quality compression algorithm, and trimming clips down to under 10 seconds.

And by solved, I mean they'll create convincing clips that'll be hard for people to dismiss unless they're really looking closely. I think it's only a matter of time until fake video clips lead to real life outrage and violence. This tech is going to be militarized before we know it.
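
For what it's worth, the degradation step is trivial; a sketch with OpenCV (file names made up, and this is just one way to do it):

    # Downscale, recompress with a lossy codec, and trim to ~10 s so that
    # generation artifacts hide behind ordinary compression artifacts.
    import cv2

    cap = cv2.VideoCapture("generated.mp4")            # hypothetical input
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    out = cv2.VideoWriter("degraded.avi",
                          cv2.VideoWriter_fourcc(*"MJPG"),  # lossy codec
                          fps, (320, 240))             # aggressive downscale

    frames_left = int(fps * 10)                        # keep under 10 seconds
    while frames_left > 0:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(cv2.resize(frame, (320, 240)))       # throw away fine detail
        frames_left -= 1

    cap.release()
    out.release()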


Yeah, we are very close to losing video as a source of truth.

I showed these demos to my partner yesterday and she was upset about how real AI has become, how little we will be able to trust what we see in the future. Authoritative sources will be more valuable, but they themselves may struggle to publish only the facts and none of the fiction.

Here's one possible military / political use:

The commander of Russia's Black Sea Fleet, Viktor Sokolov, is widely believed to have been killed by a missile strike on 22 September 2023. https://en.wikipedia.org/wiki/Viktor_Sokolov_(naval_officer)

Russian authorities deny his death and have released proof-of-life footage, which may be doctored or taken before his death. Authoritative source Wikipedia is not much help in establishing truth here, because without proof of death they must default to toeing the official line.

I predict that in the coming months Sokolov (who just yesterday was removed from his post) will re-emerge in the video realm, and go on to have a glorious career. Resurrecting dead heroes is a perfect use of this tech, for states where feeding people lies is preferable to arming them with the truth.

Sokolov may even go on to be the next Russian President.


> Yeah, we are very close to losing video as a source of truth.

I think this way of thinking is a distraction. No type of media has ever been a source of truth in itself. Videos have been edited convincingly for a long time, and people can lie about their context or cut them in a way that flips their meaning.

Text is the easiest medium to lie in; you can freely just make stuff up as you go. Yet we don't say "we cannot trust written text anymore".

Well yeah duh, you can trust no type of media just because it is formatted in a certain way. We arrive at the truth by using multiple sources and judging the sources' track records of the past. AI is not going to change how sourcing works. It might be easier to fool people who have no media literacy, but those people have always been a problem for society.


Text was never looked at as a source of truth the way video was. If you messaged someone something, they wouldn't necessarily believe it. But if you sent them a video of that something, they would feel they had no choice but to believe it.

> Well yeah duh, you can trust no type of media just because it is formatted in a certain way

Maybe you wouldn't, but the layperson probably would.

> We arrive at the truth by using multiple sources and judging the sources' track records of the past

Again, this is something that the ideal person would do, not the average layperson. Almost nobody would go through all that to decide if they want to believe something or not. Presenting them a video of this something would've been a surefire way to force them to believe it though, at least before Sora.

> people have always been a problem for society

Unrelated, but I think this attitude is by far the bigger "problem for society". It encourages us to look down on some people even when we do not know their circumstances or reasons, all for an extremely trivial matter. It encourages gatekeeping and hostility, and I think that kind of attitude is at least as detrimental to society as people with no media literacy.


During a significant part of history, text was definitely considered a source of truth, at least to the extent a lot of people see video now. A fancy recommendation letter from a noble would get you far. It makes sense, because if you forged it, that meant you had to invest a significant amount of effort and therefore had to plan the deception. It's a different kind of behavior than just lying on a whim.

But even then, as nowadays, people didn't trust the medium absolutely. The possibility of forgery was real, as it has been with video, even before generative AI.


To back up this claim: when fictional novels first became a literary format in the Western world, there was immense consternation about the fact that untrue things were being said in text. It actually took a while for authors to start writing in anything besides formats that mimicked non-fictional writing (letters, diary entries, etc.).


> No type of media has ever been a source of truth in itself.

'pics or it didn't happen' has been a thing (possibly) until very recently for good reason.


And they've been doctored almost as long as photography has been around: https://en.wikipedia.org/wiki/Censorship_of_images_in_the_So...


As has been pointed out ad nauseam by now, no one's suggesting that AI unlocks the ability to doctor images; they're suggesting that it makes it trivially easy for anyone, no matter how unskilled, to do so.

I really find this constant back and forth exhausting. It's always the same conversation: '(gen)AI makes it easy to create lots of fake news and disinformation etc.' --> 'but we've always been able to do that. have you guys not heard of Photoshop?' --> 'yes, but not on this scale this quickly. can you not see the difference?'

Anyway, my original point was simply to say that a lot of people have (rightly or wrongly) indeed taken photographic evidence seriously, even in the age of photographic manipulation (which as you point out, pretty much coincides with the age of photography itself).


> Videos have been edited convincingly for a long time,

You are right, but the thing with this is the speed and ease with which you can generate something completely fake.


> Yeah, we are very close to losing video as a source of truth.

Why have you been trusting videos? The only difference is that the cost will decrease.

Haven't you seen Hollywood movies? CGI has been convincing enough for a decade. Just add some compression and shaky mobile cam and it would be impossible to tell the difference on anything.


Of course, any video could be a fake; it's a question of the cost, and the corresponding likelihood of that being the case.


Hell, some people have been doubting the moon landing videos for even longer now. Video hasn't been a reliable source since its inception.


The truth is to be found in sources, not the content itself.

Every piece of information should have a "how do you know?" question attached.


> Yeah, we are very close to losing video as a source of truth.

We've been living in a post-truth society for a while now. Thanks to "the algorithm" interacting with basic human behavior, you can find something somewhere that will tell you anything is true. You'll even find a community of people who'll be more than happy to feed your personal echo chamber -- downvoting & blocking any objections and upvoting and encouraging anything that feeds the beast.

And this doesn't just apply to "dumb people" or "the others"; it applies to the very people reading this forum right now. You and me and everybody here lives in their safe, sound truth bubble. Don't like what people tell you? Just find somebody or something that will assure you that whatever it is you think, you are thinking the truth. No, everybody else is the asshole who is wrong. Fuck those pond-scum spreaders of "misinformation".

It could be a blog, it could be some AI generated video, it could even be "esteemed" newspapers like the New York Times or NPR. Everybody thinks their truth is the correct one and thanks to the selective power of the internet, we can all believe whatever truth we want. And honestly, at this point, I am suspecting there might not be any kind of ground truth. It's bullshit all the way down.


so where do we go from here? the moon landing was faked, we're ruled by lizard people, and there are microchips in the vaccine. at some level, you can believe what you want to believe, and if the checkout clerk thinks the moon is made of cheese, it makes no difference to me; I still get my groceries. but for things like nuclear fusion, are we actually making progress on it, or is it also a delusion? where the rubber meets the road is how money gets spent on building big projects. is JWST bullshit? is the LHC? ITER? GPS?

we need ground truths for these things to actually function. how else can things work together?


I've always found that take quite ridiculous. Fake videos have existed for a long time. This technology reduces the effort required but if we're talking about state actors that was never an issue to begin with.

People already know that video cannot be taken at face value. Lord of the Rings didn't make anyone believe orcs really exist.


> This technology reduces the effort required

Which is a huge deal. It’s absurd to brush that off.

> People already know that video cannot be taken at face value.

No, no they do not. People don’t even know to not take photos at face value, let alone video.

https://www.forbes.com/sites/mattnovak/2023/03/26/that-viral...


Lord of the Rings had a budget in the high millions and took years to make with a massive advertising campaign.

Riots happen due to out of context video clips. Violence happens due to people seeing grainy phone videos and acting on it immediately. We're reaching a point where these videos can be automatically generated instantly by anyone. If you can't see the difference between anyone with a grudge generating a video that looks realistic enough, and something that requires hundreds of millions of dollars and hundreds of employees to attain similar quality, then you're simply lying.


A key difference in the current trajectory is that it's becoming feasible to generate highly targeted content, down to an individual level. This can also be achieved without state-actor-level resources or the time delays traditionally needed, regardless of budget. The fact that it could also be automated is mildly terrifying.


Coordinated campaigns of hate through the mass media - like kicking up war fever before any major war you care to name - are far more concerning and have already been with us for about a century. Look at WWII and what Hitler was doing with it for the clearest example; propaganda was the name of the game. The techniques haven't gone anywhere.

If anything, making it cheap enough that people have to dismiss video footage might soften the impact. It is interesting how the internet is making it much harder for the mass media to peddle unchallenged lies or slanted perspectives. This tech might counter-intuitively make it harder again.


I have no doubt trust levels will adjust, eventually. The challenge is that takes a non-trivial amount of time.

It's still an issue with traditional mass media. See basically any political environment where the Murdoch media empire is active. The long tail of (I hate myself for this terminology, but hey, it's HN) 'legacy humans' still vote and have a very real effect on society.


It's funny you mention LotR, because the vast vast vast majority of the character effects were practical (at least in the original trilogy). They were in fact, entirely real, even if they were not true to life.


You can still be enraged by things you know are not real. You can reason about your emotional response, but it's much harder to prevent an emotional response from happening in the first place.


... and learning to prevent emotional response means unlearning to be human, like burnt out people.

The only winning move is to not watch.


You can have an emotional response and still act rationally.


The issue is not even so much generating fake videos as creating plausible deniability. Now everything can be questioned for the pure reason of seeming AI-generated.


Yeah, it looks good at first glance. Also, the fingers are still weird. And I suppose for every somewhat working vid, there were dozens of garbage ones. At least that was my experience with image generation.

I don't believe movie makers will be out of business any time soon. They will have to incorporate it, though. So far this can make convincing background scenery.


> I don't believe movie makers will be out of business any time soon

My son was learning how to play keyboard and he started practicing with a metronome. At some point I was thinking, why is he learning it at all? We can program which key is pressed at what point in time, and then the software can play it itself! Why bother?

Then it hit me! Musicians have been able to automate all the instruments with incredible accuracy for a long time. But they never do that. For some reason, they still want a person behind the piano / guitar / drums.


Isn't it obvious? Life is about experiences and enjoyment. All of this tech is fun and novel and interesting, but realistically it's really exciting for tech people because it's going to be used to make more computer games, social media posts, and advertisements; essentially, it's exciting because it's going to "make money".

Outside of that, people just want to know what it feels like to be able to play their favorite song on guitar and to go skiing etc.

Being perfect at everything would be honestly boring as shit.


I completely agree. There is more to a product than the final result. People who don't play an instrument see music in terms of money. (Hint: there's no money in music.) But those who play know that the pleasure is in the playing, and jamming with your mates. Recording and selling are work, not pleasure.

This is true for literally every hobby people do for fun. I am learning ceramics. Everything I've ever made could be bought in a shop for a 100th of the cost, and would be 100 times "better". But I enjoy making the pot, and it's worth more to me than some factory item.

Sora will allow a new hobby, and lots of people will have fun with it. Pros will still need to do Pro things. Not everything has to be viewed through the lens of money.


You articulated what I wanted to add to this thread -- thank you!

I play the piano, and even though MIDI exists, I still derive a lot of enjoyment from playing an acoustic instrument.


I like this saying: “The woods would be very silent if no birds sang except those who sang the best.” It's fun learning to play the instrument.


I think it's not. If musicians, and only musicians, wanted a person behind the instrument for its own sake, there would be a market for autogenerated self-playing music machines among their former patrons, who wouldn't care. And that's not the case; the market for ambient sound machines is small. It takes equal or more insanity to have one at home than, say, to have a military armored car in the garage.

On the other hand, you've probably heard of the iPod, which I think I could describe as a device dedicated to giving a false sense of an ever-present musician, so to speak.

So, "they" in "they still want a person behind the piano" is not just limited to hobbyists and enthusiasts. People wants people behind an instrument, for some reason. People pays for others' suffering, not for a thing's peculiarity.


I don't think this is entirely accurate. There are entire genres of music where the audience does not want a person behind the piano/guitar/drums. Plenty of electronic artists have tried the live band gimmick and while it goes down well with a certain segment of the audience, it turns off another segment that doesn't want to hear "humanized" cover versions of the material. But the point is that both of those audiences exist, and they both have lots of opportunity to hear the music they want to hear. The same will be true of visual art created by computers. Some people will prefer a stronger machine element, other people will prefer a stronger human element, and there is room for us all.


> I don't think this is entirely accurate. There are entire genres of music where the audience does not want a person behind the piano/guitar/drums.

Hilariously, nearly every electronic artist I can think of stands in front of a crowd and "plays live" by twisting dials etc., so I think it's fairly accurate.

Carl Cox, Tycho, Aphex Twin, Chemical Brothers, Underworld, to name a few.


DJ performances far outnumber "live" performances in the electronic scene. Perhaps you can cherry-pick certain DJs and make a point that they are creating a new musical composition by live-remixing the tracks they play, but even then a significant number of clubbers don't care, they just want to dance to the music. There are venues where a bunch of the audience can't even see the DJ and they still dance because they are enjoying the music on its own merits.

I stand by my original point. There are plenty of people who really do not care if there is a human somewhere "performing" the music or not. And that's totally fine.


If there is no human performing there, then it's a completely different event, so I actually have little idea what we're debating.


Your reasoning is circular. Humans who go to performances of other humans playing instruments enjoy seeing other humans playing instruments. That should not be surprising. The question is whether humans as a whole intrinsically prefer seeing other humans playing instruments over hearing a "perfect" machine reproduction. And the answer to that question is no. There are plenty of humans who really do prefer the machine reproduction.


If you're still talking about whether people want to hear live covers, or recordings, I think it's an apples to oranges comparison therefore I don't see the point in it.


Why does the DJ need to be there, in such a case?


Mainly to pick songs that fit the mood of the audience. At the moment, humans seem to do a better job "reading" the emotions of other humans in this kind of group setting than computers do, and people are willing to pay for experts who have that skill.

An ML model could probably do a good job at selecting tunes of a particular genre that fit into a pre-defined "journey" that the promoter is trying to construct, so I could see a role for "AI DJs" in the future, especially for low budget parties during unpopular timeslots like first day of a festival while people are still arriving and the crew is still setting up. Some of that is already done by just chucking a smart playlist on shuffle. But then you also have up-and-comer or hobbyist DJs who will play for free in those slots, so maybe there's not really a need for a smarter computer to take over the job.

This whole thread started from the question of why a human should do something when a machine can do it better. And the answer is simple: because humans like to do stuff. It is not because humans doing stuff adds some kind of hand-wavey X factor that other humans intrinsically prefer.


> Musicians could automate all the instruments with incredible accuracy since a long time. But they never do that.

What do you judge was the ratio of automated music (recordings played back) to live music played in the last year?


Just to be clear, I was talking about the original sound produced by a person (vs. a machine). Of course it was recorded and played back a _lot_ more than folks listening live.

But I take your point; maybe I'm not so familiar with the world's music, as I was talking more about Indian music. While the music is recorded and mixed across several tracks electronically, I think most of it is originally played (or sung) by a person.


His point still stands.

In the US at least there's the occasional acoustic song that becomes a hit, but rock music is obviously on its way to slowly reaching jazz status. It and country are really the last genres where traditional instruments are common during live performances. Pop, hip hop, and EDM are basically all put together to be nearly computer-perfect.

All the great producers can play instruments, and that's oftentimes the best way to get a section out initially. But what you hear on Spotify is more and more meticulously put together note by note on a computer after the fact.

Live instruments on stage are now often for spectacle or, worse, a gimmick, and it's not the song people came to love. I think the future will have people like Lionclad [1] in it pushing what it means to perform live, but I expect them to become fewer and fewer as music just gets more complex to produce overall.

[1] https://www.youtube.com/watch?v=MuBas80oGEU


Thankfully, art is not about the least common denominator and I'm confident that there will continue to be music played live as long as humanity exists.


Music has a lot of people who believe that not only is their favorite genre the best but that they must tear down people who don't appreciate it.

You aren't better because you prefer live music, you just have a preference. Music wasn't better some arbitrary number of years ago, you just have a preference.

Nobody said one form is objectively better, just that there is a form that is becoming more popular.

But to state my opinion, I can't imagine something more boring than thinking the best music, performance, TV, or media in general was created in the past.


It's not that I think my tastes in music are objectively better, it's that I strongly feel that music is a very personal matter for many people and there will be enough people who will seek out different forms of music than what is "popular". Rock, jazz, even classical music, are still alive and well.

> But to state my opinion, I can't imagine something more boring than thinking the best of music, performance, TV, or media in general was done best and created in the past.

And to state my opinion, art isn't about "the best" or any sort of progress, it's about the way we humans experience the world, something I consider to be a timeless preoccupation, which is why a song from 2024 can be equally touching as a ballad from the 14th century.


When I was studying music technology and using state of the art software synthesizers and sequencers, I got more and more into playing my acoustic guitar. There's a deep and direct connection and a pleasure that comes with it that computers (and now/eventually AI) will never be able to match.

(That being said, a realtime AI-based bandmate could be interesting...)


My son is an interesting example of this. I can play all the best guitar music on earth via the speakers, but when I physically get the guitar out and strum it, he sits up like he has just seen god, and is in total awe of the sound of it, the feel of the guitar, and the sight of it. It's like nothing else can compare. Even if he is hysterically crying, the physical instrument and the sound of it just make him calm right down.

I wonder if something is lost in the recording process that just cannot be replicated? A live instrument is something that you can actually feel the sound of IMO, I've never felt the same with recorded music even though I of course enjoy it.

I wonder if when we get older we just get kind of "bored" (sadly) and it doesn't mean as much to us as it probably should.


Mirror neurons?


What does this have to do with it?


I'm speculating that one would have more mirror neuron activation watching a person perform live, compared to listening to a recording or watching a video. Thus the missing component that makes live performance special.


The sound feels present with live music. Speakers have this synthetic far away feel no matter how good they are.


What about live music on non-acoustic instruments so it inherently comes through a speaker?


My son isn't even a toddler so I don't think it would possibly be "mirror neurons".


For me the guitar is like the keyboard I am writing on right now. It will never be replaced, because that is how I input music into the world. I could not program that. I was doing tracker music as a teenager, and all of the songs sounded weird because the timing and so on was not right. And now when I transcribe demos and put them into a DAW, there seem to be milliseconds off that are not quite right. I still play the piano parts live, because we don't have the technology right now to make it sound better than a human, and even if we had, it would not be my music, but what an AI performed.


I looked really briefly at AI in music; lots of wild things are being made. It is hard to explain: one tool generated a bunch of sliders after mimicking a sample from sine waves (quite accurately).


> Musicians could automate all the instruments with incredible accuracy since a long time. But they never do that. For some reason, they still want a person behind the piano / guitar / drums.

This actually happened on a recent hit, too -- Dua Lipa's Break My Heart. They originally had a drum machine, but then brought in Chad Smith to actually play the drums for it.

Edit: I'm not claiming this was new or unusual, just providing a recent example.


This goes way back. Nine Inch Nails was a synth-first band, with the music being written by Trent in a studio on a DAW. That worked, but what really made the band was live shows, so they found ways, even using two drummers, to translate the synths and machines into human-played instruments.

Also, way before that, back in the early '80s, Depeche Mode displayed the recorded drum reel onstage so everyone knew what it was, but when they got big enough they also transitioned into an epic live show with guitars and live drums, as well as synth-hooked drum devices they could bang on in addition to keyboards.

We are human. We want humans. Same reason I want a hipster barista to pour my coffee when a machine could do it just as well.


> Same reason I want a hipster barista to pour my coffee when a machine could do it just as well.

I've wondered about this for a long time too: why on earth is anyone still able to be a barista? It turns out people actually like the community around cafes, and often that means interacting with the staff on a personal level.

Some of my best friends have been baristas I've gone to over several years.


Back before Twitter was born, or perhaps TV, cafes were just that - a place to spend evenings (…just don't ask who watched over the kids)


It's more than that: doing it well is still beyond sophisticated automation. There are many variables that need to be constantly adjusted for. Humans are still much better at it than machines, regardless of the social element.


If true, probably not for long. Still, my point is that people are the customers. It's more fun to think about what won't change. I think we will still have baristas.


A good live performance is intentionally not 100% the same as in the studio, but there can and should be variations. A refrain repeated another time, some improvisation here. Playing with the tempo there. It takes a good band, who know each other intimately, to make that work, though. (a good DJ can also do this with electronic music)

A recorded studio version, I can also listen to at home. But a full band performing in this very moment is a different experience to me.


Regarding your point about music:

There are subtle and deliberate deviations in timing and elements like vibrato when a human plays the same song on an instrument twice, which is partly why (aside from recording tech) people prefer live or human musicians.

Think about how precise and exacting a computer can be. It can play the same notes in a MIDI editor with exact timing, always playing note B after 18 seconds of playing note A. Human musicians can't always be that precise in timing, but we seem to prefer how human musicians sound with all of the variations they make. We seem to dislike the precise mechanical repetition of music playback on a computer comparatively.

I think the same point generalises into a general dislike on the part of humans of sensory repetition. We want variety. (Compare the first and second grass pictures at [0] and you will probably find that the second which has more "dirt" and variety looks better.) "Semantic satiation" seems to be a specific case of the same tendency.

I'm not saying that's something a computer can't achieve eventually but it's something that will need to be done before machines can replace musicians.

[0] http://gas13.ru/v3/tutorials/sywtbapa_gradient_tool.php


You can modulate MIDI timing with noise. In some programs, there's literally a Humanize button.
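
Something like this, as a toy sketch (field names invented for illustration; real MIDI libraries such as mido have their own event types):

    # A toy "humanize" pass: jitter note onsets and velocities with small
    # Gaussian noise so playback stops being mechanically exact.
    import random

    def humanize(notes, timing_sd=0.010, velocity_sd=4.0):
        """notes: list of dicts with 'onset' (seconds) and 'velocity' (1-127)."""
        out = []
        for note in notes:
            jittered = dict(note)
            jittered["onset"] = max(0.0, note["onset"] + random.gauss(0, timing_sd))
            jittered["velocity"] = min(127, max(1, round(
                note["velocity"] + random.gauss(0, velocity_sd))))
            out.append(jittered)
        return out

    # Two quarter notes at 120 BPM, nominally 0.5 s apart:
    print(humanize([{"onset": 0.0, "velocity": 90},
                    {"onset": 0.5, "velocity": 90}]))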


Yes. I tried that with some software-based synthesisers (like the SWAM violin and Reason's Friktion) which are designed for human-playing (humans controlling the VST through a device that emits MIDI CC control messages) but my understanding is that the modulation that skilled human players perform with tends to be better/more desirable than what software modulators can currently achieve.


The real dilemma is with composition/song-writing.

Ability to create live experiences can still be a motivating factor for musicians (aside from the love of learning). Yet, when AI does the song-writing far more effectively, will the musician ignore this?

It's like Brave New World. Musicians who don't use these AI tools for song-writing will be like a tribe outside the modern world. That's a tough future to prepare for. We won't know whether a song was actually the experience and emotions of a person or not.


Even if we assume that people want fully automated music, the process of learning to play educates the musician. Similarly, you'd still need a director/auteur, editors, writers and other roles I have no appreciation or knowledge of to create a film from AI models.

Steam shovels and modern excavators didn't remove our need for shovels or more importantly, the know-how to properly apply these tools. Naturally, most people use a shovel before they operate an excavator.


It's interesting though; the question really becomes: if 10 people used to shovel manually to feed their families, and now it takes 1 person and an excavator, what in good faith do you tell those other 9... "don't worry, you can always be a hobby shovelist"?


They can apply their labor wherever it is valued. Perhaps they will become more productive excavator operators. By creating value in a specialized field their income would increase. Technology does not decrease the need for labor. Rather it increases the productivity of the laborer.

Human ingenuity always finds a need for value creation. Greater abundance creates new opportunities.

Take the inverse position. Should we go back to reading by candlelight to increase employment in candle making?

No, electric lighting allowed people to become productive during night hours. A market was created for electricity producers, which allowed additional products which consume electricity to be marketed. Technological increases in productivity cascade into all areas of life, increasing our living standards.

A more interesting, if not controversial line of inquiry might start with: If technology is constantly advancing human productivity, why do modern economies consistently experience price inflation?


You miss the important point, which is the productivity gain means the average living standard of society as a whole increases. A chunk of what is now regarded as 'toil' work disappears, and the time freed up is able to be deployed more productively in other areas.

Of course, this change is dislocating for the particular people whose toil disappeared. They need support to retrain to new occupations.

The alternative is to cling to a past where everyone - on average - is poorer, less healthy, and works in more dangerous jobs.


That's awesome, sign me up for retraining. Where do I go and who can I talk to so I can be retrained into a less drudgery filled position?

Clearly, if there are ways out of being displaced, please share them.


The ‘augmented singer’ is very popular, though. https://en.wikipedia.org/wiki/Auto-Tune: “Auto-Tune has been widely criticized as indicative of an inability to sing on key.”


Live play is what, 1% of all music heard in the world? Computers, radios, iPods and phones all play automated reproductions.


> Musicians have been able to automate all the instruments with incredible accuracy for a long time. But they never do that. For some reason, they still want a person behind the piano / guitar / drums.

You've never been to a rave, huh? For that matter, there's a lot of pop artists that use sequencers and dispense with the traditional band on stage.


I can see this being used extensively for short commercials, as the uncanny aspect of a lot of the figures will help to capture people's attention. I don't necessarily believe it will be less expensive than hiring a director and film crew however.


I love these hot takes based on profoundly incredible tech that literally just launched. Acting like 2030 isn't around the corner.


> I love these hot takes based on profoundly incredible tech that literally just launched. Acting like 2030 isn't around the corner.

It seems bizarre to think the gee whiz factor in a new commercial creative product makes critiquing its output out-of-bounds. This isn't a university research team: they're charging money for this. Most people have to determine if something is useful before they pay for it.


Let me guess, hard singularity take-off in 2030? Does the hype cycle not exist for techno-optimists? Just one breathless prediction after another?


Anything less than absolute enrapture is a "hot take"... :)


We’re glad you love them.


> fingers are still weird

Also keep an eye on teeth and high contrast text. Anything small and prone to distortion in low resolution video and images used to train this stuff.


Yeah. I think people nowadays are in a kind of AI euphoria, and they take every advancement in AI for more than it really is. The realization of their limitations will set in once people have been working long enough with the stuff. The capabilities of the newfangled AIs are impressive. But even more impressive are their mimicry capabilities.


Are you joking?

We were not even able to create random videos from just text prompting a few years back, and now this.

The progress is crazy.

Why do you dismiss this?


Not dismissing, but being realistic. I've observed that all the AI tools usually amaze most people initially by showing capabilities never seen before. Then people realise their limitations, i.e. what capabilities are still missing. And they're like: "oh, this is no genie in a bottle capable of satisfying every wish. We'll still have to work to obtain our vision..." So the magic fades away, and the world returns to normal, but now with an additional tool very useful in some situations :)


I'm still amazed.

The progress isn't slowing down at all right now.

This is probably one of the most exciting developments in the world besides the Internet.

And Gemini's news regarding the 1 million token context window shows where we are going.

This will impact a lot of people faster than most realize.


I agree. Skepticism usually serves people well as a lot of new tech turns out to be just hype. Except when it is not and I think this is one of those few cases.


Not who you're replying to but this is a toy.

AI won't make artistic decisions that wow an audience.

AI won't teach you something about the human condition.

AI will only enable higher quarterly profits from layoffs until GPU costs catch up.

What the fuck is the point of AI automating away jobs when the only people who benefit are the already enormously wealthy? AI won't be providing time to relax for the average worker; it will induce starvation. Anything to prevent that will be stopped via lobbying to ensure taxes don't rise.

Seriously, what is the point? What is the point? What the fuck is there to live for when art and humanities is undermined by the MBA class and all you fucking have is 3 gig jobs to prevent starvation?


Problem isn't the tool but with the tools using the tool.

It's not ML's fault that we don't have UBI; it's the voters' fault.


I believe AI and full automation are critical for a Star Trek society.

We are not very good at providing anything reasonable today because capitalism is still way too strong and manual labor still way too necessary.

Nonetheless, look at my country, Germany: we are a welfare state. Plenty of people get 'free' money and it works.

The other good thing: there are plenty of people who know what good is (good art etc.) but are not able to draw. They can now also express themselves. AI as a tool.

If we as a society discover that there will be no really new music or art happening, I don't know what we will do.

Plenty of people are well entertained with crap anyway.


Sure there are limitations but this is still absurdly impressive.

My benchmark is the following: imagine if someone 5 years ago told you that in 5 years we could do this, you would think they were crazy.


I would not. Five (six, seven?) years ago, we had style transfer with video and everyone was also super euphoric about that. If I compare to those videos, there is clearly progress but it is not like we started from zero 2 years ago.


I don't really know what you mean by "euphoric", this is a term I only know from drugs. Can you define it?


"Blissful/happy", which is why the word euphoria is often abused to be sinister


It means "extremely happy", but it's usually used to refer to a particular moment in time (rather than a general sentiment), and so the word sounds a bit out of place here, to me.


And further down the page:

"The camera follows behind a white vintage SUV with a black roof": The letters clearly wobble inconsistently.

"A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast": The woman in the white dress in the bottom left suddenly splits into multiple people like she was a single cell microbe multiplying.


Sure, but think what it will be capable of two papers ahead :)


Progress in this field has not been linear, though. So it's quite possible that two papers ahead we are still in the same place.


On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that videos and images are tokenized into patches, through a process that compresses the video/images into abstracted concepts in latent space. Those patches (image/video concepts in latent space) can then be used with transformers (because patches are the tokens). The point is that there is plenty of room for optimization following the first demonstration of a new architecture.

Edit: sorry, it’s not the first diffusion transformer. That would be [2]

[1] https://openai.com/research/video-generation-models-as-world...

[2] https://arxiv.org/abs/2212.09748
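
To make the "patches are the tokens" idea concrete, here is a rough sketch of a patchify step (shapes and names are my own illustrative guesses, not from either paper):

    # Carve a (latent) video tensor into non-overlapping spacetime cubes and
    # flatten each cube into one token vector for a transformer.
    import torch

    def patchify(video, patch_t=2, patch_h=16, patch_w=16):
        """video: (channels, time, height, width) -> (num_patches, dim)."""
        c, t, h, w = video.shape
        assert t % patch_t == 0 and h % patch_h == 0 and w % patch_w == 0
        patches = video.reshape(c, t // patch_t, patch_t,
                                h // patch_h, patch_h,
                                w // patch_w, patch_w)
        patches = patches.permute(1, 3, 5, 0, 2, 4, 6)  # one cube per token
        return patches.reshape(-1, c * patch_t * patch_h * patch_w)

    # e.g. a 16-frame 64x64 latent with 4 channels -> 128 tokens of dim 2048
    print(patchify(torch.randn(4, 16, 64, 64)).shape)  # torch.Size([128, 2048])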



I think it is misleading. The role of the diffusion network is completely absent from this explanation.


Hold on to your papers~


It’s not perfect, for sure. But maybe this isn’t the final pinnacle of the tech?


> I disagree, just look at the legs of the woman in the first video.

The people behind her all walk at the same pace and seem to be floating. The moving reflections, on the other hand, are impressive make-believe.


Really makes me think of The Matrix scene with the woman in the red dress. Can't tell if they did this on purpose to freak us all out? Are we all just prompts?


I'm 99% sure this is supposed to evoke cyberpunk, but not sure about The Matrix.


If you watch the background, you'll see one guy's pants change color. And also, some of the guys are absolute giants compared to the people around them.


Yep. If you look at the detail you can find obvious things wrong, and these are limited to 60 seconds in length with zero audio, so I doubt full motion picture movies are going to be replaced anytime soon. B-roll background video or AI-generated backgrounds for a green screen, sure.

I would expect any subscription to this service, when it comes out, to be very expensive. At some point I have to imagine the GPU/CPU horsepower needed will outweigh the monetary costs that could be recovered. Storage costs too. It's much easier to tinker with generating text or static images in that regard.

Of note: NVDA's quarterly results come out next week.


> Same story as the fingers before.

This is weird to me considering how much better this is than the SOTA still images 2 years ago. Even though there's weirdo artefacts in several of their example videos (indeed including migrating fingers), that stuff will be super easy to clean up, just as it is now for stills. And it's not going to stop improving.


Agreed and these are the cherry picked examples of course.


> just look at the legs of the woman

Denise Richards' hard sharp knees in '97

--

this infant tech is already insanely good... just wait, and rather try to focus on "what should I be betting on 5 years from now?"

I suggest 'invisibility cloaks' (ghosts in machines?)


> But I think many people will be very uncomfortable with such motion very quickly.

Given the momentum in this space, I think you will have to get very uncomfortable super quickly about any of the shortcomings of any particular model.


At second 15 of the woman video, the legs switch sides!! Definitely there are some glitches :)


The left and right side of her face are almost... a different person.


When others create text-to-video systems (e.g. Lumiere from Google), they publish the research (e.g. https://arxiv.org/pdf/2401.12945.pdf). OpenAI is all about commercialization. I don't like their attitude.


Google is hardly a good actor here. They just announced Gemini 1.5 along with a "technical report" [1] whose entire description of the model architecture is: "Gemini 1.5 Pro is a sparse mixture-of-expert (MoE) Transformer-based model". Followed by a list of papers that it "builds on", followed by a definition of MoE. I suppose that's more than OpenAI gave in their GPT-4 technical report. But not by much!

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...


The report and the previous one for 1.0 definitely contain much more information than the GPT-4 whitepaper. And Google regularly publishes technical details on other models, like Lumiere, things that OpenAI stopped doing after their InstructGPT paper.


Maybe because GPT-3.5 is closer to what Gemini 1.0 was... GPT-4 and Gemini 1.5 are similarly sparse in their "how we did it and what we used" when it comes to papers.


Not to be overly cute, but if the cutting edge research you do is maybe changing the world fundamentally, forever, guarding that tech should be really, really, really far up your list of priorities and everyone else should be really happy about your priorities.

And that should probably take precedence over the semantics of your moniker, every single time (even if hn continues to be super sour about it)


I'd much rather this tech be open - better for everyone to have it than a select few.

The more powerful, the more important it is that everyone has access.


Do you feel the same way about nuclear weapons tech?

That "the more powerful, the more important it is that everyone has access"?

Especially considering that the biggest killer app for AI could very well be smart weapons like we've never seen before.


I feel this is a false equivalence.

Nukes aren't even close to being commodities, cannot be targeted at a class of people (or a single person), and have a vanishingly small number of users. (Don't argue semantics with "class of people" when you know what I mean, btw.)

On the other hand, tech like this can easily become as common as photoshop, can cause harm to a class of people, and be deployed on a whim by an untrained army of malevolent individuals or groups.


So if someone discovered a weapon of mass destruction (say some kind of supervirus) that could be produced and bought cheaply and could be programmed to only kill a certain class of people, then you'd want the recipe to be freely available?


This poses no direct threat to human life though. (Unlike, say, guns - which are totally fine for everyone in the US!)

The direct threat to society is actually this kind of secrecy.

If ordinary people don't have access to the technology they don't really know what it can do, so they can't develop a good sense of what could now be fake that only a couple of years ago must have been real.

Imagine if image editing technology (Photoshop etc) had been restricted to nation states and large powerful corporations. The general public would be so easy to fool with mere photographs - and of course more openly nefarious groups would have found ways to use it anyway. Instead everybody now knows how easily we can edit an image and if we see a shot of Mr Trump apparently sharing a loving embrace with Mr Putin we can make the correct judgement regarding a probable origin.


The bottleneck for bioterrorism isn't AI telling you how to do something, it's producing the final result. You wanna curtail bioweapons, monitor the BSL labs, biowarfare labs, bioreactors, and organic 3D printers. ChatGPT telling me how to shoot someone isn't gonna help me if I can't get a gun.


This isn't related to my comment. I wasn't asking what if an AI invents a supervirus. I was asking what if someone invents a supervirus. AI isn't involved in this hypothetical in any way.

I was replying to a comment saying that nukes aren't commodities and can't target specific classes of people, and I don't understand why those properties in particular mean access to nukes should be kept secret and controlled.


I understand your perspective regarding the potential risks associated with freely available research, particularly when it comes to illegal weapons and dangerous viruses. However, it's worth considering that by making research available to the world, we enable a collaborative effort in finding solutions and antidotes to such threats. In the case of Covid, the open sharing of information led to the development of vaccines in record time.

It's important to weigh the benefits of diversity and open competition against the risks of bad actors misusing the tools. Ultimately, finding a balance between accessibility and responsible use is key.

What guarantee do we have that OpenAI won't become an evil actor like Skynet?


I'm not advocating for or against secrecy. I'm just not understanding the parent comment I replied to. They said nukes are different than AI because they aren't commodities and can't target specific classes of people, and presumably that's why nukes should be kept secret and AI should be open. Why? That makes no sense to me. If nukes had those qualities, I'd definitely want them kept secret and controlled.


An AI video generator can't kill billions of people, for one. I'd prefer it if access wasn't limited to a single corporation that's accountable to no one and is incentivized to use it for their benefits only.


> accountable to no one

What do you mean? Are you being dramatic or do you actually believe that the US government will/can not absolutely shut OpenAI down, if they feel it was required to guarantee state order?


For the US government to step in, they'd have to do something extremely dangerous (and refuse to share with the government). If we're talking about video generation, the benefits they have are financial, and the lack of accountability is in that they can do things no one else can. I'm not saying they'll be allowed to break the law, there's plenty of space between the two extremes. Though, given how things were going, I can also see OpenAI teaming up with the US government and receiving exclusive privileges to run certain technologies for the sake of "safety". It's what Altman has already been pushing for.


> An AI video generator can't kill billions of people, for one.

Not directly. But I won't be surprised if AI video generators aren't somewhere in the chain of causes of gigadeaths this century.


I think it could. The right sequence of videos sent to the right people could definitely set something catastrophic off.


> The right sequence of videos sent to the right people could definitely set something catastrophic off.

...after amazing public world wide demos that show how real the AI generated videos can be? How long has Hollywood had similar "fictional videos" powers?


> ...after amazing public world wide demos that show how real the AI generated videos can be?

How quickly do you think our gerontocracy will adapt to the new reality?


Flat earth Billy can now make videos with a $20 subscription.


I think that's great. Billy will feed his flat earther friends for a few weeks or months and pretty soon the entire world will wise up and be highly skeptical of any new such videos. The more of this that gets out there, the quicker people will learn. If it's 1 or 2 videos to spin an election... People might not get wise to it.


Given the last 10 years I have no such faith in the common person.


which will only continue to convince people if the technology stays safely locked away in the possession of a single corp.

if it were opened to the public, faking such videos would lose (nearly) all of its power


Make it high-enough fidelity, and it will be used to convince people to kill billions.


Video can convince people to kill each other now because it is assumed to show real things. Show people a Jew killing a Palestinian, and that will rile up the Muslims, or vice versa.

When a significant fraction of video is generated content spat out by a bored teenager on 4chan, then people will stop trusting it, and hence it will no longer have the power to convince people to kill.


You don't need to generate fake videos for that example. The State of Israel has been killing Palestinians en masse for a long time and has intensified the effort over the last 4 months. The death toll is 29,000+ and counting. Two thirds are children and women.

Israeli media machinery parades photographs of damage to houses that could only have been done by heavy artillery or tank shells, blaming it on rebels carrying infantry rifles.

But I agree: as if the current tools were not enough to sway people, they will now have even more means to sway public opinion.


Hamas has similarly been shooting rockets into Israel for a long time. Eventually people get tired and stop caring about long-lasting conflicts, just like we don't care about concentration camps in North Korea and China, or various deadly civil wars in Sub-Saharan Africa, some of which have killed way more civilians than all wars in Palestinian history. One can already see support towards Ukraine fading as well, even though there Western countries would have a real geopolitical interest.


> Especially considering that the biggest killer app for AI could very well be smart weapons like we've never seen before.

A homing missile that chases you across continents and shows you disturbing deepfakes of yourself until you lose your mind and ask it to kill you. At that point it switches to encourage mode, rebuilds your ego, and becomes your lifelong friend.


I don't think it's really that hard to make a nuclear weapon, honestly. Just because you have the plans for one, doesn't mean you have the uranium/plutonium to make one. Weapons-grade uranium doesn't fall into your lap.

The ideas of critical mass, prompt fission, and uranium purification, along with the design of the simplest nuclear weapon possible has been out in the public domain for a long time.


Oof, imagine if our safeguard for nuclear weapons was that a private company kept it safe.


While it's probably too idealistic to be possible, I'd rather try and focus on getting people/society/the world to a state where it doesn't matter if everyone has access (i.e. getting to a place where it doesn't matter if everyone has access to nuclear weapons, guns, chemical weapons, etc., because no-one would have the slightest desire to use them).

As things are at the moment, while suppression of a technology has benefits, it seems like a risky long-term solution. All it takes is for a single world-altering technology to slip through the cracks, and a bad actor could then forever change the world with it.


On a geopolitical level 'everyone' does have access.


Do you feel the same way about electricity?


As long as destroying things remains at least two magnitudes easier than building things and defending against attacks, this take (as a blanket statement) will continue to be indefensible and irresponsible.


Should nukes be open source?


I humbly refer you to this comment:

https://news.ycombinator.com/item?id=39389262


ML models of this complexity are just as accessible as nuclear weapons. How many nations possess a GPT-4? The only reason nuclear weapons are not more common is because their proliferation is strictly controlled by conventions and covert action.


The basic designs for workable (although inefficient) nuclear weapons have been published in open sources for decades. The hard part is obtaining enough uranium and then refining it.


If you have two pieces of plutonium and put them too close together, you have accidentally created something like a crude nuclear weapon… so yeah, nukes are open source; plutonium breeding isn't.


I love it when people make this "nuke" argument, because it tells you a lot more about them than it does about anything else. There are so many low-information people out there; the state of education, even in developed countries, is a bit sad. There are people trotting out the word "chemical" at things that are scary without understanding what exactly the word means or how it differs from the word "mixture". I don't expect most people to understand the difference between a proton and a quark, but at least a general understanding of physics and chemistry would save a lot of people from falling into the "the world is magic and information is hidden away inside geniuses" mentality.


Should electricity?


What a load… imagine if everyone else had guarded all their discoveries; there'd be no text-to-video, would there?


People defending this need to meditate on the meaning of the phrase "shoulders of giants".


New technology will always raise new giants to see from, but open source really is a nice ladder up to their shoulders. So many benefits come from sharing the tech.


This reminded me of a conversation with a historian. He had asked for a reconstruction of a monument in France that a game studio had already made.

The studio told him the model was their property, and they wouldn't share it.

Peculiar reasoning, isn't it?


This is meaningless until you've defined "world-changing". It's possible that open-sourcing AIs will be world-changing in a good way and developing closed-source AIs will be world-changing in a bad way.

If I had engineered the tech, I would be much more fearful of malice in the future leadership of the organization I'm under if they keep it closed than of the whole world getting the capability if they open-source it.

I feel that, as with the Yellow Journalism of the 1920s, much of the misinformation problem with generative AI will only be mitigated through widespread proliferation, wherein people become immune to new tactics and gain a new skepticism of the media. I've always thought it strange when news outlets discuss new deepfakes but refuse to show them, even with a watermark indicating they are fake. Misinformation research shows that people become more skeptical once they learn about the technological measures (e.g. buying karma-farmed Reddit accounts, or, in the 1920s, taking advantage of dramatically lower newspaper printing costs to print sensationalism) through which misinformation is manufactured.


The problem is when we start to run out of reliable sources after becoming sceptical of everything.


It will be kind of like most of history, where the only trustworthy method of communication was face to face, or a letter or book (perhaps cryptographically) verified as coming from a person you personally know or trust. Sounds good to me.
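For the curious, the "(perhaps cryptographically) verified" part is already cheap to do today. A minimal sketch in Python with the `cryptography` package, assuming the one thing that actually matters: you obtained the public key from your friend face to face. Key names and the message are illustrative.

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    signing_key = Ed25519PrivateKey.generate()   # stays with the author
    public_key = signing_key.public_key()        # the part you exchange in person

    letter = b"Meet me at the usual place."
    signature = signing_key.sign(letter)

    # Raises cryptography.exceptions.InvalidSignature if the letter was forged.
    public_key.verify(signature, letter)
    print("letter verified as coming from your friend")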


This is a fantastic write up and great parallel to the state of where we’re headed.


How convenient for all the OpenAI employees trying to make millions of dollars by commercializing their technology. Surely this technology won't be well understood and easily replicable as FOSS in a few years.


It will be, even if they guard their secret sauce. Let's not be naive about this: obfuscation is, and always will be, a minor nuisance.


>If you have world-changing technology it's better for a megacorp to control it.

You need to watch more dystopian movies.


The wheel should have been a tightly controlled technology?


Ironic, isn't it! OpenAI started out "open," publishing research, and now "ClosedAI" would be a much better name.


TBH they should just rename to ClosedAI and run with it; I and others would appreciate the honesty, plus it would be amusing.


However, if you are playing for the regulatory capture route (which Sam Altman seems to be angling for), it's much easier if your name is "OpenAI".


If you go full regulatory capture, you might as well name it "AI", The AI Company.


You never go "full" regulatory capture.


gottem


Sick burn!


When has OpenAI - for a company named "Open" AI - ever released any of their stuff into anything open?


They actually did a few years ago, but that's ancient history in AI terms.

The most recent thing they released was Whisper, which to be fair is the only model with absolutely no safety implications.


From what I remember reading, Open was never supposed to be like open source with the internals freely available, but Open as in available for the public to use, as opposed to a technology only for the company to wield and create content with.


They stopped releasing their stuff openly around the time GPT3 came to be.


Whisper was after GPT3 and that was fully open.


More like ClosedAI, amirite?


OAI requires a real mobile phone number to sign up, and is therefore an adtech company.


Might be one of the most absurd things said on here. Requiring a phone number for sign up does not automatically mean you are selling ads.


When the time for making money comes, if you don’t think OpenAI will sell every drop of information they have on you, then you are incredibly naive. Why would they leave money on the table when everyone else has been doing it for forever without any adverse effects?


They are currently hiring people with Adtech experience.

The simplest version would be an ad-supported ChatGPT experience. Anyone thinking that an internet consumer company with 100m weekly active users (I'm citing from their job ad) is not going to sell ads is lacking imagination.


If Google Workspace were selling my or any customer's information, at all or "forever", it would not be called Google Workspace; it would be called Google We-died-in-the-most-expensive-lawsuit-of-all-time.


There's a difference. OpenAI essentially has two products: the $20-a-month chatbot for Joe Schmoe, which they admit to training on your prompts, and the API for businesses. Workspace is like the latter. The former is closer to Google Search.


Sure, but there is no ambiguity about that, is there? You know that, because they tell you (and, sure, maybe they only tell you because they have to by law, but they do, and you know).

How we get from there to "just assume every company in the world will sell your data in wildly and obviously illegal ways", I don't know.


Well... that does seem to be the default. If they don't explicitly say they won't, they probably will. It's a sad world.


We're face to face with AGI and you're worried about ads?? Get your risks in order!!


We're still nowhere near AGI.


The day the AI starts ignoring prompts instead of following them is the day I will worry about AGI.


You'd be too late. You're just waiting for someone to imbue a model with agency. We have agency due to evolution. Robots need it programmed into them, and honestly, that is easy to do compared with instilling reasoning. Primitive animals have agency. No animal can reason on the level of GPT. That will get us to HAL 9000. If you stick it in a robot, you have the Terminator.


AI doesn't exist. Neither in practice nor theoretically. Artificial intelligence is an oxymoron. Intelligence is a complex system. Artificial systems are logic systems. You live in a complex universe that you cannot perceive, i.e. we perceive it as noise/randomness only. All you can see are the logical systems expressed at the surface (Mandelbrot set) of the noise. Everything you see and know is strictly logical; all known laws of the universe are derived from those logical systems. Hence, we can only build logical systems, not complex systems. There is a limit to what we can build here on the surface (Church-Turing). We never have and never will build a complex system.


> Motion-capture works fine because that's real motion

Except in games, where they mo-cap at a frame rate lower than the one it will be rendered at and just interpolate between mo-cap samples, which turns snappy movements into smooth movements and lands the motion in the uncanny valley.

It's especially noticeable when a character is talking and makes a "P" sound. In a "P", your lips basically "pop" open. But if the motion is smoothed out, it gives the lips the look of making an "mm" sound. The lips of someone saying "post" looks like "most".

At 30 fps, it's unnoticeable. At 144 fps, it's jarring once you see it and can't unsee it.
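To make the smoothing concrete, here's a tiny sketch (all values made up) of lip "openness" captured at 15 fps and lerped up to 60 fps. The instantaneous pop in the source becomes a four-frame ease in the render:

    def lerp(a, b, t):
        return a + (b - a) * t

    # Lip openness sampled at 15 fps: closed, closed, fully open ("P" pop).
    keys = [0.0, 0.0, 1.0]

    # Resample to 60 fps: 4 rendered frames per mo-cap interval.
    rendered = []
    for i in range(len(keys) - 1):
        for step in range(4):
            rendered.append(lerp(keys[i], keys[i + 1], step / 4))
    rendered.append(keys[-1])

    print(rendered)
    # [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0]
    # The one-frame pop is now a gradual ramp: "post" reads as "most".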


Out of all the examples, the wooly mammoths one actually feels the most like CGI to me; the other ones are much more believable than this one.


Possibly because there are no videos or even photos of live wooly mammoths, but loads and loads of CG recreations in various documentaries.


I saw the cat in the bed grow an extra limb...


Cats are weird sometimes.


Huh, strong disagree. I've seen realistic CGI motion many times and I don't consider this to feel realistic at all.


I'm a bit thrown off by the fact that the mammoths are steaming. Is that normal for mammoths?


Good question :)


You might just be subject to confirmation bias here. Perhaps there were scenes and entities you didn't realize were CGI due to high quality animation, and thus didn't account for them in your assessment.


Regarding CGI, I think it has become so good that you don't know it's CGI. Look at the dog in Guardians of the Galaxy 3. There's a whole series on YouTube called "no cgi is really just invisible cgi" that I recommend watching.

And as with CGI, models like Sora will get better until you can't tell them apart from reality. It's not there yet, but it's an immense, astonishing breakthrough.


Maybe it's my anthropocentric brain, but the animals move realistically while the people still look quite off.

It's still an unbelievable achievement though. I love the paper seahorse whose tail is made (realistically) using the paper folds.


Serious question: can one just pipe in an SRT (subtitle file), tell it to compare its version to the mp4, and then command it to zoom, enhance, and edit, basically using it to remould content? I think this sounds great!


It's possible that through sheer volume of training, the neural network essentially has a 3D engine going on, or at least picked up enough of the rules of light and shape and physics to look the same as Unreal or Unity.


It would have to, in order to produce these outputs. Our brains have crazy physics engines, though: F1 drivers can simulate an entire race in their heads.


I wonder if they could theoretically race multiple people at once like chess masters.


I'm not sure I feel the same way about the mammoths; as someone who grew up in a snowy area, the billowing snow makes no sense. If the snow were powder, maybe, but that's not what's depicted on the ground.


Pixar is computer generated motion, no?


Main Pixar characters are all computer-animated by humans. Physics effects like water, hair, clothing, smoke and background crowds use computer physics simulation, but there are handles allowing an animator to direct the motion as per the director's wishes.


With extreme amounts of man-hours to do so.


> I've quite simply never seen convincing computer-generated motion before

I'm fairly sure you have seen it many times; it was just so convincing that you didn't realize it was CGI. It's a fundamentally biased way to sample, as you won't see examples of well-executed stuff.


Nah, this still has the problem with connecting surfaces that never seems to look right in any CGI. It's actually interesting that it doesn't look right here either, considering these are completely different techniques.


It's been trained on videos exclusively. Then GPT-4 interprets your prompt for it.


Just set up a family password last week... Now it seems every member of the family will have to become their own certificate authority and carry an MFA device.

"Worried About AI Voice Clone Scams? Create a Family Password" - https://www.eff.org/deeplinks/2024/01/worried-about-ai-voice...


Don't think of them as "computer-generated" any more than your phone's heavily processed pictures are "computer-generated", or JWST's false color, IR-to-visible pictures are "computer-generated".

This article makes a convincing argument: https://studio.ribbonfarm.com/p/a-camera-not-an-engine


That is such a gem of an article, looking at AI through a new lens I haven't encountered before:

- AI sees and doesn’t generate

- It is dual to economics that pretends to describe but actually generates


I think the implications go much further than just the image/video considerations.

This model shows a very good (albeit not perfect) understanding of the physics of objects and relationships between them. The announcement mentions this several times.

The OpenAI blog post lists "Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care." as one of the "failed" cases. But this (and "Reflections in the window of a train traveling through the Tokyo suburbs.") seem to me to be 2 of the most important examples.

- In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

- In the chair one, OpenAI says the model failed to model the physics of the object (which hints that it did try to, which is not how the early diffusion models worked; they just tried to generate "plausible" images). And we can see one of the archeologists basically chasing the chair down to grab it, which does correctly model the interaction with a floating object.

I think we can't overstate how crucial that is to the building of a general model that has a strong model of the world. Not just a "theory of mind", but a literal understanding of "what will happen next", independently of "what would a human say would happen next" (which is what the usual text-based models seem to do).

This is going to be much more important, IMO, than the video aspect.


Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects which other objects cannot pass through?

Maybe I'm missing the big picture here, but the above, and all the weird spatial errors like the miniaturization of people, make me think you're wrong.

Clearly the model is an achievement and doing something interesting to produce these videos, and they are pretty cool, but understanding physics seems like quite a stretch?

I also don't really get the excitement about the girl on the train in Tokyo:

> In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo

I don't know a lot about how this model works personally, but I'm guessing that in the training data the vast majority of videos of people riding trains in Tokyo featured Asian people. Assuming this model works on statistics like all of the other models I've seen recently from OpenAI, why is it interesting that the girl in the reflection was Asian? Did you not expect that?


> Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects which other objects cannot pass through?

This just hit me, but humans do not have a good understanding of physics either; or maybe most humans have no understanding of physics. We just observe and recognize whether something is familiar or not.

That being the case, AI will need to be way more powerful than a human mind. Maybe orders of magnitude more "neural networks" than a human brain has.


Well, we feel the world. It's pretty wild when you think about how much data the body must be receiving and processing constantly.

I was watching my child in the bath the other day. They were having the most incredible time splashing, feeling the water, throwing balls up and down, and yes, they have absolutely no knowledge of "physics", yet they navigate and interact with it as if it were the best thing they've ever done. Not even 12 months old yet.

It was all just happening on feel, and yeah, I doubt they could describe how to generate a movie.


Operating a human body takes an incredible intuition for physics; just because you can't write or explain the math doesn't mean your mind doesn't understand it. Further, we are able to apply our models of physics to novel external situations on the fly, sometimes within milliseconds of encountering the situation.

You only need to see a ball bounce once and your brain has made some rough approximations of its properties and will calculate both where it's going and how to get your gangly menagerie of pivots, levers, meat servos and sockets to intercept it at just the right time.

Think also about how well people come to understand the physics of cars and bikes in motorsport and the like. The internal model of a car's suspension in operation is non-trivial, but people can hold it in their heads.


Humans have an intuitive understanding of physics, not a mathy science one.

I know I can't put my hand through solid objects. I know that if I drop my laptop from chest height it will likely break it, the display will crack or shatter, the case will get a dent. If it hits my foot it will hurt. Depending on the angle it may break a bone. It may even draw blood. All of that is from my intuitive knowledge of physics. No book smarts needed.


I agree; to me the clearest example is how the rocks in the sea vanish/transform after the wave: the generated frames are hyperreal for sure, but the represented space looks about as consistent as a dream.


They could test this by trying to generate the same image but set in New York, etc. I bet it would still be Asian.


Give it a year


Ok bro


The answer could be in between. Who said delusion models are limited to 2d pixel generations?


Did you mean diffusion?


> very good... understanding of the physics of objects and relationships between them

I am always torn here. A real physics engine has a better "understanding" but I suspect that word applies to neither Sora nor a physics engine: https://www.wikipedia.org/wiki/Chinese_room

An understanding of physics would entail asking this generative network to invert gravity, change the density or energy output of something, or atypically reduce a coefficient of friction partway through a video. Perhaps Sora can handle these, but I suspect it is mimicking the usual world rather than understanding physics in any strong sense.

None of which is to say their accomplishment isn't impressive. Only that "understand" merits particularly careful use these days.


Question is: how much do you need to understand something in order to mimic it?

The Chinese Room, however, seems to point to some sort of prewritten if-else algorithm. Someone following scripted algorithmic procedures might not understand the content, but obviously this simplification is not the case with LLMs or this video generation, as that kind of algorithmic scripting requires pre-written scripts.

The Chinese Room seems to refer more to cases like "if someone tells me 'xyz', then respond with 'abc'" - of course then you don't understand what xyz or abc mean - but it's not referring to neural networks training on a ton of material to build a model representation of things.


Good points.

Perhaps building the representation is building understanding. But humans did that for Sora and for all the other architectures too (if you'll allow a little meta-building).

But evaluation alone is not understanding. Evaluation is merely following a rote sequence of operations, just like the physics engine or the Chinese room.

People recognize this distinction all the time when kids memorize mathematical steps in elementary school but they do not yet know which specific steps to apply for a particular problem. This kid does not yet understand because this kid guesses. Sora just happens to guess with an incredibly complicated set of steps.

(I guess.)


I think this is a good insight. But if the kid gets sufficiently good at guessing, does it matter anymore..?

I mean, at this point the question is so vague… maybe it’s kinda silly. But I do think that there’s some point of “good-at-guessing” that makes an LLM just as valuable as humans for most things, honestly.


Agreed.

For low-stakes interpolation, give me the guesser.

For high-stakes interpolation or any extrapolation, I want someone who does not guess (any more than is inherent to extrapolating).


That matches how philosophers typically talk about the Chinese room. However, the Chinese room is supposed to "behave as if it understands Chinese" and can engage in a conversation (let us assume via text). To do this the room must "remember" previously mentioned facts, people, etc. Furthermore it must line up ambiguous references correctly (both in reading and writing).

As we now know from more than 60 years of good old-fashioned AI efforts, plus recent learning-based AI, this CAN be done using computers but CANNOT be done using just ordinary if-then-else type rules, no matter how complicated. Searle wrote before we had any systems that could actually (behave as if they) understood language and could converse like humans, so he can be forgiven for failing to understand this.

Now that we do know how to build these systems, we can still imagine a Chinese room. The little guy in the room will still be "following pre-written scripted algorithmic procedures." He'll have archives of billions of weights for his "dictionary". He will have to translate each character he "reads" into one or more vectors of hundreds or thousands of numbers, perform billions of matrix multiplies on the results, and translate the output of the calculations -- more vectors -- into characters to reply. (We may come up with something better, but the brain can clearly do something very much like this.)

Of course this will take the guy hundreds or thousands of years from "reading" some Chinese to "writing" a reply. Realistically if we use error correcting codes to handle his inevitable mistakes that will increase the time greatly.
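To make the guy's procedure concrete, here's a toy sketch: a three-character vocabulary, one made-up weight matrix, and the rote "characters in, matrix multiply, character out" loop. A real model would use billions of weights and many layers; nothing here is meant to be faithful beyond the shape of the work.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = list("你好吗")
    embed = {ch: rng.normal(size=4) for ch in vocab}  # character -> vector
    W = rng.normal(size=(4, 4))                       # his "archive of weights"

    def reply_char(ch: str) -> str:
        h = embed[ch] @ W  # the rote matrix multiply
        # Map the output vector back to whichever character it most resembles.
        return max(vocab, key=lambda c: float(embed[c] @ h))

    print(reply_char("好"))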

Implication: Once we expand our image of the Chinese room enough to actually fulfill Searle's requirements, I can no longer imagine the actual system concretely, and I'm not convinced that the ROOM ITSELF "doesn't have a mind" that somehow emerges from the interaction of all these vectors and weights.

Too bad Searle is dead, I'd love to have his reply to this.


Facebook released something in that direction today https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...


Wow this is a huge announcement too, I can't believe this hasn't made the front page yet.


This seems to be completely in line with the previous "AI is good when it's not news" type of work:

Non-news: Dog bites a man.

News: Man bites a dog.

Non-news: "People riding Tokyo train" - completely ordinary, tons of similar content.

News: "Archaeologists dust off a plastic chair" - bizarre, (virtually) no similar content exists.


I found the one about the people in Lagos pretty funny. The camera does about a 360° spin in total: in the beginning there are markets, then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

> A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.

> https://cdn.openai.com/sora/videos/lagos.mp4


Also, the woman in red next to the people is very tiny, the market stall is a mini market stall, and the table is made out of a bike.

For everyone carrying on about this thing understanding physics and having a model of the world... it's an odd world.


The thing is -- over time I'm not sure people will care. People will adapt to these kinds of strange things and normalize them -- as long as they are compelling visually. The thing about that scene is it looks weird only if you think about it. Otherwise it seems like the sort of pan you would see in some 30 second commercial for coffee or something.

If anything it tells a story: going from market, to people talking as friends, to the giant world (of Lagos).


I'm not so sure.

My Instagram feed is full of AI people. I can tell with pretty good accuracy when an image is "AI" or real: the lighting, the framing, the scene itself... something is just off.

I think a similar thing will happen here, over the next few months we'll adapt to these videos and the problems will become very obvious.

When I first looked at the videos I was quite impressed, but I looked again and saw a bunch of weird stuff going on. I think our brains are just wired to save energy, and accepting whatever we see in a video or an image as good enough is a pretty efficient, low-risk strategy.


Agreed. At first glance of the woman walking, I was so focused on how well she was animated that the surreal scene went unnoticed. Once I'd spotted the surreal scene, I started picking up on weird motion in the walk too.

Where I think this will get used a lot is in advertising. Short videos, lots going on, see it once and it's gone, no time to inspect. Lady laughing with salad pans to a beach scene, here's a product, buy and be as happy as salad lady.


This will be classified unconsciously as cheap and uninteresting by the brain real quick. It'll have its place in the tides of cheap content, but if overall quality were overlooked that easily, producers would never have increased production budgets that much just for the sake of it.


In the video of the girl walking down the Tokyo city street, she's wearing a leather jacket. After the closeup on her face they pull back and the leather jacket has hilariously large lapels that weren't there before.


Object permanence (just from images/video) seems like a particularly hard problem for a super-smart prediction engine. Is it the old thing, or a new thing?


There are also perspective issues: the relative sizes of the foreground (the people sitting at the café) and the background (the market) are incoherent. Same with the "snowy Tokyo with cherry blossoms" video.


Though I'm not sure of your point here: outside of America, in Asia and Africa, these sorts of markets mixed in with skyscrapers are perfectly normal. There is nothing unusual about it.


Yeah, some of the continuity errors in that one feel horrifying.


> then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

Ah but you see that is artistic liberty. The director wanted it shot that way.


It doesn't understand physics.

It just computes the next frame based on the current one and what it learned before; it's a plausible continuation.

In the same way that ChatGPT struggles with math without the code interpreter, Sora won't have accurate physics without a physics engine and rendered 3D objects.

Now it's just a "what is the next frame of this 2D image" model plus some textual context.


> It just computes the next frame based on the current one and what it learned before; it's a plausible continuation.

...

> Now it's just a "what is the next frame of this 2D image" model plus some textual context.

This is incorrect. Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once.

[1]: https://openai.com/research/video-generation-models-as-world...
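For intuition, here's a shape-level toy in Python/NumPy (not OpenAI's code; `denoise` is a stand-in for the learned model) contrasting the two styles being discussed:

    import numpy as np

    rng = np.random.default_rng(0)
    T, H, W, C = 8, 4, 4, 16  # 8 frames of 4x4 latent patches, 16 channels

    def denoise(x, step):
        # Placeholder for the trained network: the real one predicts cleaner
        # latents from noisier ones. Key point: it sees its whole input at once.
        return x * 0.5

    # Diffusion-transformer style: start from noise over the WHOLE clip and
    # refine all spatiotemporal patches together, step by step.
    clip = rng.normal(size=(T, H, W, C))
    for step in range(10):
        clip = denoise(clip, step)

    # Autoregressive style (what the parent comment assumed): each frame is
    # produced only from the frames before it.
    frames = [rng.normal(size=(H, W, C))]
    for t in range(1, T):
        frames.append(denoise(frames[-1], t))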


Good link.

But, even there it says:

> Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object states

Regardless of whether all the frames are generated at once or one by one, you can see in their examples that it's still just pixel-based. See the first example with the dog in the blue hat: the woman has a blue thing suddenly spawn into her hand because her hand passed over another blue area of the image.


I'm not denying that there are obvious limitations. However, attributing them to being "pixel-based" seems misguided. First off, the model acts in latent space, not directly on pixels. Secondly, there is no fundamental limitation here. The model has already acquired a limited yet impressive ability to understand movement, texture, social behavior, etc., just from watching videos.

I learned to understand reality by interpreting photons and various sensory inputs. Does that make my model of reality fundamentally flawed? In the sense that I only have a partial intuitive understanding of it, yes. But I don't need to know Maxwell's equations to get a sense of what happens when I open the blinds or turn on my phone.

I think many of the limitations we are seeing here - poor glass physics, flawed object permanence - will be overcome given enough training data and compute.

We will most likely need to incorporate exploration, but we can get really far with astute observation.


Actually your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed, and we're trying to copy it.

Heck, a super AI might not even be possible; what if we're peak intelligence, with our millions of years of evolution?

Just adding compute speed will not help much -- say the goal of an intelligence is to win a war. If you're tasked with it, then it doesn't matter if you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.

Or it will be like with chess solvers: machines will be more intelligent than us simply because they can load much more context into their "working memory" to solve a problem.


> Actually your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed, and we're trying to copy it.

As someone working in the field, I can say the vast majority of AI research isn't concerned with copying the brain, simply with building solutions that work better than what came before. Biomimicry is actually quite limited in practice.

The idea of observing the world in motion in order to internalize some of its properties is a very general one. There are countless ways to concretize it; child development is but one of them.

> If you're tasked with it, then it doesn't matter if you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.

I highly disagree.

Let's assume a superintelligent AI can break down a problem into subproblems recursively, find patterns and loopholes in absurd amounts of data, run simulations of the potential consequences of its actions while estimating the likelihood of various scenarios, and do so much faster than humans ever could.

To take your example of winning a war, the task is clearly not unsolvable. In some capacity, military commanders are tasked with it on a regular basis (with varying degrees of success).

With the capabilities described above, why couldn't the AI find and exploit weaknesses in the enemy's key infrastructure (digital and real-world) and people? Why couldn't it strategically sow dissent, confuse, corrupt, and efficiently acquire intelligence to update its model of the situation minute-by-minute?

I don't think it's reasonable to think of a would-be superintelligence as an oracle that gives you perfect solutions. It will still be bound by the constraints of reality, but it might be able to work within them with incredible efficiency.


This is an excellent comparison and I agree with you.

Unfortunately we are flawed. We do know how physics works intuitively and can somewhat predict it, but not perfectly. We can imagine how a ball will move, but the image is blurry and the trajectory only partially correct. This is why we invented math and physics studies, to be able to accurately calculate, predict and reproduce those events.

We are far off from creating something as efficient as the human brain. It will take insane amounts of compute power to simply match our basic inaccurate brains; imagine how much will be needed to create something that is factually accurate.


Indeed. But a point that is often omitted from comparisons with organic brains is how much "compute equivalent" we spent through evolution. The brain is not a blank slate; it has clear prior structure that is genetically encoded. You can see this as a form of pretraining through a RL process wherein reward ~= surviving and procreating. If you see things this way, data-efficiency comparisons are more appropriate in the context of learning a new task or piece of information, and foundation models tend to do this quite well.

Additionally, most of the energy cost comes from pretraining, but once we have the resulting weights, downstream fine-tuning or inference are comparatively quite cheap. So even if the energy cost is high, it may be worth it if we get powerful generalist models that we can specialize in many different ways.

> This is why we invented math and physics studies, to be able to accurately calculate, predict and reproduce those events.

We won't do without those, but an intuitive understanding of the world can go a long way towards knowing when and how to use precise quantitative methods.


GPT-4 doesn't "struggle with math". It does fine. Most humans aren't any better.

Sora is not autoregressive anyway, but there's nothing "just" about next frame/token prediction.


It absolutely struggles with math. It's not solving anything. It sometimes gets the answer right only because it's seen the question before. It's rote memorization at best.


No it doesn't. I know because I've actually used the thing and you clearly haven't.

And if Terence Tao finds some use for GPT-4, and Khan Academy employs it as a math tutor, then I don't think I have some wild opinion either.

Now, math isn't just arithmetic, but do you know how easy it is to go out of training distribution for, say, arithmetic?


Yesterday, it failed to give me the correct answer to 4 + 2 / 2. It said 3...


Just tried in ChatGPT-4. It gives the correct output (5), along with a short explanation of the order of operations (which you probably need to know, if you're asking the question).
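For reference, the two readings side by side:

    print(4 + 2 / 2)    # 5.0 -- division binds tighter: 4 + (2 / 2)
    print((4 + 2) / 2)  # 3.0 -- the left-to-right misreading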


Correct based upon whom? If someone of authority asks the question and receives a detailed response back that is plausible but not necessarily correct, and that version of authority says the answer is actually three, how would you disagree?

In order to combat authority you need to appeal to a higher authority, and that has been lost. One follows the AI. Another follows the old men from long ago whose words populated the AI.


The TV show American Gods becoming reality...


We shouldn't necessarily regard 5 as the correct output. Sure, almost all of us choose to make division higher precedence than addition, but there's no reason that has to be the case. I think a truly intelligent system would reply with 5 (which follows the usual convention, and would therefore mimic the standard human response), but immediately ask if perhaps you had intended a different order of operations (or even other meanings for the symbols), and suggest other possibilities and mention the fact that your question could be considered not well-defined...which is basically what it did.


I guess you might think 'math' means arithmetic. It definitely does struggle with mathematical reasoning, and I can tell you that because I and many others have tried it.

Mind you, it's not brilliant at arithmetic either...


I'm not talking about Arithmetic


> In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

How is this any more accurate than saying that the model has mostly seen Asian people in footage of Tokyo, and thus it is most likely to generate Asian-features for a video labelled "Tokyo"? Similarly, how many videos looking out a train window do you think it's seen where there was not a reflection of a person in the window when it's dark?


I'm hoping to see progress towards consistent characters, objects, scenes etc. So much of what I'd want to do creatively hinges on needing persisting characters who don't change appearance/clothing/accessories from usage to usage. Or creating a "set" for a scene to take place in repeatedly.

I know that with Stable Diffusion there are things like LoRA and ControlNet, but they are clunky. We still seem to have a long way to go towards scene and story composition.
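For concreteness, today's clunky-but-workable route looks roughly like this with the `diffusers` library; the base model ID is just an example and the LoRA weights file is hypothetical (one you'd have trained on your character beforehand):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # example base model
        torch_dtype=torch.float16,
    ).to("cuda")

    # A LoRA fine-tuned on one character keeps their face and outfit
    # consistent from shot to shot (hypothetical weights file).
    pipe.load_lora_weights("./my_character_lora.safetensors")

    image = pipe("my_character drinking coffee in a diner, film still").images[0]
    image.save("scene_01.png")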

Once we do, it will be a game changer for redefining how we think about things like movies and television when you can effectively have them created on demand.

