Sora: Creating video from text (openai.com)
3647 points by davidbarker 3 months ago | 2231 comments



This is leaps and bounds beyond anything out there, including both public models like SVD 1.1 and Pika Labs' / Runway's models. Incredible.


Let's hold our horses. Those are specifically crafted, hand-picked good videos, where the only requirement was "write a generic prompt and pick something that looks good". Which is very different from the actual process, where you have a very specific idea and want the machine to make it happen.

The DALL-E presentation also looked cool, and everyone was stoked about it. Now that we know its limitations and oddities? YMMV, but I'd say not so much - Stable Diffusion is still the go-to solution. I strongly suspect the same thing with Sora.


The examples are most certainly cherry-picked. But the problem is there are 50 of them. And even if you gave me 24 hours of full access to SVD 1.1/Pika/Runway (anything out there that I can use), I wouldn't be able to get 5 examples that match these in quality (~temporal consistency/motion/prompt following) and, more importantly, in length. Maybe I am overly optimistic, but this seems too good.


Credit to OpenAI for including some videos with failures (extra limbs, etc.). I also wonder how closely any of these videos might match one from the training set. Maybe they chose prompts that lined up pretty closely with a few videos that were already in there.


https://twitter.com/sama/status/1758200420344955288

They're literally taking requests and doing them in 15 minutes.


Cool, but see the drastic difference in quality ;)


Lack of quality in the details, yes, but the fact that characters and scenes depict consistent and real movement and evolution, as opposed to the cinemagraph and frame-morphing stuff we have had so far, is still remarkable!


That particular example seems to have more of a "cheap 3D" style to it, but the actual synthesis seems on par with the examples. If the prompt had specified a different style, it'd have that style instead. This kind of generation isn't like actual animating; "cheap 3D" style and "realistic cinematic" style take roughly the same amount of work to look right.


Drastic difference in quality of the prompts too. The ones used in the OP are mostly quite detailed.


There are absolutely example videos on their website which have worse quality than that.


It has a comedy-like quality lol

But all that said, it is no less impressive after this new demo.


Depends on the quality of the prompts.


The output speed doesn't disprove possible cherry-picking, especially with batch generation.


Who cares? If it can be generated in 15 minutes then it's commercially useful.


Especially if you think that afterwards you can get feedback and try again... 15 minutes later you have a new one... try again... etc.


What is your point? That they make multiple ones and pick out the best ones? Well duh? That’s literally how the model is going to be used.


Please make your substantive points without swipes. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.


OpenAI people running these prompts have access to way more resources than any of us will through the API.


Looks ready for _Wishbone_


The year is 2030.

Sarah is a video sorter; this is her life. She graduated top of her class in film, and all she could find was the monotonous job of selecting videos that looked just real enough.

Until one day, she couldn't believe it. It was her. A video of her in that very moment, sorting. She went to pause the video, but stopped when her doppelganger did the same.



I got reminded of an even older sci-fi story: https://qntm.org/responsibility


Man, I was looking for this story for a year or so... thanks for sharing.


Seems like in about two years I’ll be able to stuff this saved comment into a model and generate this full episode of Black Mirror


> Stable Diffusion is still the go-to solution. I strongly suspect the same thing with Sora.

Sure, for people who want detailed control with AI-generated video, workflows built around SD + AnimateDiff, Stable Video Diffusion, MotionDiff, etc., are still going to beat Sora for the immediate future, and OpenAI's approach structurally isn't as friendly to developing a broad ecosystem adding power on top of the base models.

OTOH, the basic simple prompt-to-video capacity of Sora now is good enough for some uses, and where detailed control is not essential that space is going to keep expanding -- one question is how much their plans for safety checking (which they state will apply both to the prompt and every frame of output) will cripple this versus alternatives, and how much the regulatory environment will or won't make it possible to compete with that.


I suspect given equal effort into prompting both, Sora probably provides superior results.


> I suspect given equal effort into prompting both, Sora probably provides superior results

Strictly in terms of prompting, probably, just as is the case with DALL-E 3 vs., say, SDXL.

The thing is, there’s a lot more that you can do than just tweaking prompting with open models, compared to hosted models that offer limited interaction options.


Generate stock video bits I think.


It doesn't matter if they're cherrypicked when you can't match this quality with SD or Pika regardless of how much time you had.

And I still prefer DALL-E 3 to SD.


In the past the examples tweeted by OpenAI have been fairly representative of the actual capabilities of the model. i.e. maybe they do two or three generations and pick the best, but they aren't spending a huge amount of effort cherry-picking.


Stable Diffusion is not the go-to solution; it's still behind Midjourney and DALL-E.


Would love to see hand-picked videos from competitors that can hold their own against what Sora is capable of.


Look at Sam Altman's Twitter, where he made videos on demand from what people prompted him.


Wrong, this is the first time I've seen an astronaut with a knit cap.


they're not fantastic either if you pay close attention

there are mini-people in the 2060s market and in the cat one an extra paw comes out of nowhere


The woman’s legs move all weirdly too


While Sora might be able to generate short 60-90 second videos, how well it would scale with a larger prompt or a longer video remains to be seen. And the general approach of having the model do 90% of the work for you and then editing what is required might be harder with videos.


60 seconds at a time is more than enough.

Most fictional long-form video (whether live-action movies or cartoons, etc) is composed of many shots, most of them much shorter than 7 seconds, let alone 60.

I think the main factor that will be key to generate a whole movie is being able to pass some reference images of the characters/places/objects so they remain congruent between two generations.

You could already write a whole book in GPT-3 from running a series of one-short-chapter-at-a-time generations and passing the summary/outline of what's happened so far. (I know I did, in a time that feels like ages ago but was just early last year)

Why would this be different?
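The chapter-at-a-time loop is only a handful of lines. A minimal sketch, assuming the current OpenAI chat API rather than the old GPT-3 completions endpoint (model name, outline and prompts are placeholder choices, not what was actually used):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    outline = "A heist story set on a generation ship, told in 20 short chapters."
    summary_so_far = "Nothing has happened yet."
    chapters = []

    for i in range(1, 21):
        user_msg = (
            f"Outline: {outline}\n"
            f"Summary of the story so far: {summary_so_far}\n"
            f"Write chapter {i} (short), then an updated one-paragraph summary "
            f"prefixed with the word SUMMARY:"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder; any chat model works
            messages=[
                {"role": "system", "content": "You are a novelist writing one short chapter at a time."},
                {"role": "user", "content": user_msg},
            ],
        )
        text = resp.choices[0].message.content
        # Split the chapter text from the rolling summary that gets fed back in.
        chapter, _, new_summary = text.partition("SUMMARY:")
        chapters.append(chapter.strip())
        if new_summary.strip():
            summary_so_far = new_summary.strip()

The whole trick is just carrying the running summary forward so each generation stays roughly consistent with what came before.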


> I think the main factor that will be key to generate a whole movie is being able to pass some reference images of the characters/places/objects so they remain congruent between two generations.

I partly agree with this. The congruency, however, needs to extend to more than two generations. If a single scene is composed of multiple shots, then those multiple shots need to be part of the same world the scene is being shot in. If you check the video with the title `A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.`, the surroundings do not seem to make sense: the view starts with a market, spirals around a point and then ends with a bridge which does not fit into the market. If the different shots the model generated did not fit together seamlessly, trying to make them fit together is where the difficulty comes in. However, I do not have any experience in video editing, so it's just speculation.


The CGI industry is about to be turned upside down. They charge hundreds of thousands per minute, and it takes them forever to produce the finished product.


You do realize virtually all movies are made up of shots, often lasting no longer than 10 seconds, edited together. Right?


The best films have long takes. Children of Men or Stalker come to mind.


Copacabana tracking shot in Goodfellas


I'm almost speechless. I've been keeping an eye on the text-to-video models, and if these example videos are truly indicative of the model, this is an order of magnitude better than anything currently available.

In particular, looking at the video titled "Borneo wildlife on the Kinabatangan River" (number 7 in the third group), the accurate parallax of the tree stood out to me. I'm so curious to learn how this is working.

[Direct link to the video: https://player.vimeo.com/video/913130937?h=469b1c8a45]


The video of the gold rush town just makes me think of what games like Red Dead and GTA could look like.


holy cow, is that the future of gaming? instead of 3D renders it's real-time video generation, complete with audio and music and dialog and intelligent AI conversations and it's a unique experience no one else has ever played. gameplay mechanics could even change on the fly


I think for the near future we’ll see something like this:

https://youtube.com/watch?v=P1IcaBn3ej0

From a few years ago, where the game is rendered traditionally and used as a ground truth, with a model on top of it that enhances the graphics.

After maybe 10-15 years we will be past the point where the entire game can be generated without obvious mistakes in consistency.

Realtime AI dialogue is already possible but still a bit primitive, I wrote a blog post about it here: https://jgibbs.dev/blogs/local-llm-npcs-in-unreal-engine


That's why NVIDIA's CEO said recently that in the future every pixel will be generated — not rendered.


five years ago: https://www.youtube.com/watch?v=ayPqjPekn7g I'm eager to see an updated version.


DLSS is essentially this, isn't it? It uses a low quality render from the game and then increases the fidelity with something very similar to a diffusion model.


The answer is most definitely YES. Computer games and, of course, porn: the stuff the internet is made for.


Shove all the tech you mentioned into a VR headset and it is literally game over for humans


You'd still get a headache after 20 minutes. No matter how addictive, it won't be a problem until you can wear VR headsets for hours.


Many people can. I can and have been since the DK1. I’ve done 12 hour plus stints in it.


Really? My head hurts bad after 30 minutes and I feel uneasy after like 10-15.

The DK1 I could wear for like 1 minute before feeling sick, so they are getting better ...

I am prone to sea sickness. Maybe it is related.


> Really?

Yeah, but I mean who knows why. I know some people can't, my GF is one of them.

I've often wondered if I'm ok with it because I'm used to the object-on-head stuff (like 25-odd years of motorcycle riding/ergo helmet wearing) and close-up, high-FOV-coverage, fast-paced gaming? (I play on a 32" maybe 70 cm from the eyes, give or take.)

> I am prone to sea sickness. Maybe it is related.

I'd think it might be, given my understanding of how the illness is triggered in many people. It's odd because I never got sick from it, but I've seen others get INCREDIBLY ill in two different ways.

1. My GF tried to use simple locomotion in a game and almost vomited as an immediate reaction

2. A friend who was fine at first, but then randomly started getting very slowly ill over a matter of like an hour, just getting more and more nauseous after the fact.

It's unfortunate, because thanks to the lack of bad feelings/nausea/discomfort etc., I love VR. Equally, though, from those around me I can see no real path forward for it as it stands today because of those impacts and limitations.

That being said, maybe they get smaller, lighter, we learn to induce motion sickness less, I dunno. I'm not optimistic.


lol horseshit


You’re projecting.


Even otherwise, and no matter how good the screen and speakers are, a screen and speakers can only be so immersive. People oversell the potential for VR when they describe it as being as good as or better than reality. Nothing less than the Matrix is going to work in that regard.


Yep, once your brain gets over the immediate novelty of VR, it’s very difficult to get back that “Ready Player One” feeling due to the absence of sensory feedback.

If/once they get it working though, society will shift fast.

There’s an XR app called Brink Traveler that’s full of handcrafted photogrammetry recreations of scenic landmarks. On especially gloomy PNW winter days, I’ll lug a heat lamp to my kitchen and let it warm up the tiled stone a bit, put a floor fan on random oscillation, toss on some good headphones, load up a sunny desert location in VR, and just lounge on the warm stone floor for an hour.

My conscious brain “knows” this isn’t real and just visuals alone can’t fool it anymore, but after about 15 minutes of visuals + sensory input matching, it stops caring entirely. I’ve caught myself reflexively squinting at the virtual sun even though my headset doesn’t have HDR.


Digital Westworld


I'll take one holodeck, please.


Sometimes, but for specific or unique art styles, statistical models like this may not work well.

For games like call of duty or other hyper realistic games it very likely will be.


For games like 2D/3D fighting games where you don't need to generate a lot of terrain, the possibility of randomly generating stages with unique terrain and obstacles is interesting.


That’s also true, but those stages would need to fit in a specific art style.

A large part of fighting games is the style.

The cost difference of just making bespoke art and tuning an AI system to generate it for you may not be worth it (at least right now.)


Lucid Dreaming as a Service.

See also: https://en.wikipedia.org/wiki/Vanilla_Sky


The diffusion is almost certainly taking place over some sort of compressed latent. From the visual quirks of the output, I suspect that the process of turning that latent into images goes latent -> NeRF / splat -> image, not latent -> convolutional decoder -> image.


Agreed. It's amazing how much of a head start OpenAI appears to have over everyone else. Even Microsoft who has access to everything OpenAI is doing. Only Microsoft could be given the keys to the kingdom and still not figure out how to open any doors with them.


Microsoft doesn't have access to OpenAI's research; this was part of the deal. They only have access to the weights and inference code of production models, and even then, who has access to that inside MS is extremely gated: only a few employees have it, based on an absolute need to actually run the service.

AI researchers at MSFT barely have more insight into OpenAI than you do reading HN.


This is not true. Microsoft have a perpetual license to all of OpenAI's IP. If they really wanted to they could get their hands on it.


Yeah but what's in the license? It's not public so we have no way of knowing


No. They have early access. Example: MSFT was using Dall-e Exp (early 3 version) in PUBLIC, since February of 2023.

In the same month, they were also using GPT4 in public - before OpenAI.

And they had access to GPT4 in 2022 (which was when they decided to create Bing Chat, now called Copilot).

All the current GPT4 models at MSFT are also finetuned versions (literally Creative and Precise mode runs different finetuned versions of GPT4). It runs finetuned versions since launch even...


I didn't realize that. Thank you for the clarification.


I promise you this isn't true.


Microsoft said that they could continue OpenAI's research with no slowdown if OpenAI cut them off by hiring all OpenAI's people, so from that statement it sounds like they have access.


Many people say the same about Google/DeepMind.


Eh. MSFT owns 49% of OpenAI. Doesn't really seem like they need to do much except support them.


Except they keep trying to shove AI into everything they own. CoPilot Studio is an example of how laughably bad at it they are. I honestly don't understand why they don't contract out to OpenAI to help them do some of these integrations.


Every company is trying to shove AI into everything they own. It's what investors currently demand.

OpenAI is likely limited by how fast they are able to scale their hiring. They had 778 FTEs when all the board drama occurred, up 100% YoY. Microsoft has 221,000. It seems difficult to delegate enough headcount to all the exploratory projects of MSFT and it's hard to scale headcount quicker while preserving some semblance of culture.


They don't own 49% of OpenAI. They have capped rights to 49% of OpenAI's profits.


Apparently all the rumors weren't true then, my mistake.

I don't think what you're saying is correct though, either. All the early news outlets reported 49% ownership:

https://en.wikipedia.org/wiki/OpenAI#:~:text=Rumors%20of%20t...

https://www.theverge.com/2023/1/23/23567448/microsoft-openai...

https://www.reuters.com/world/uk/uk-antitrust-regulator-cons...

https://techcrunch.com/2023/01/23/microsoft-invests-billions...

The only official statement from Microsoft is: "While details of our agreement remain confidential, it is important to note that Microsoft does not own any portion of OpenAI and is simply entitled to share of profit distributions," said company spokesman Frank Shaw.

No numbers, though.

Do you have a better source for numbers?


Yes, but I am stuck with their (American) view of what is considered appropriate. Not what is legal, but what they determine to be OK to produce.

Good luck generating anything similar to an 80s action movie. The violence and light nudity will prevent you from generating anything.


I suspect it's less about being puritanical about violence and nudity in and of themself, and more a blanket ban to make up for the inability to prevent the generation of actually controversial material (nude images of pop stars, violence against politicians, hate speech)


Put like that, it's a bit like the Chumra in Judaism [1]. The fence, or moat, around the law that extends even further than the law itself, to prevent you from accidentally committing a sin.

1. https://en.m.wikipedia.org/wiki/Chumra_(Judaism)


Na. It's more like what he said: Cover your ass legally for the real problems this could cause.


No, it's America's fault.


I am guessing a movie studio will get different access with controls dropped. Of course, that does mean they need to be VERY careful when editing, and making sure not to release a vagina that appears for 1 or 2 frames when a woman is picking up a cat in some random scene.


We can't do narrative sequences with persistent characters and settings, even with static images.

These video clips are just generic stock clips. You could cut them together to make a sequence of random flashy whatever, but you still can't do storytelling in any conventional sense. We don't appear to be close to being able to use these tools for the hypothetical disruptive use case we worry about.

Nonetheless, the stock video and photo people are in trouble. So long as the details don't matter, this stuff is presumably useful.


I wonder how much of it is really "concern for the children" type stuff vs not wanting to deal with fights on what should be allowed and how and to who right now. When film was new towns and states started to make censorship review boards. When mature content became viewable on the web battles (still ongoing) about how much you need to do to prevent minors from accessing it came up. Now useful AI generated content is the new thing and you can avoid this kind of distraction by going this route instead.

I'm not supporting it in any way, I think you should be able to generate and distribute any legal content with the tools, but just giving a possible motive for OpenAI being so conservative whenever it comes to ethics and what they are making.


I've been watching 80s movies recently, and amount of nudity and sex scenes often feels unnecessary. I'm definitely not a prude. I watch porn, I talk about sex with friends, I go to kinky parties sometimes. But it really feels that a lot of movies sacrificed stories to increase sex appeal — and now that people have free and unlimited access to porn, movies can finally be movies.


It's not a particularly American attitude to be opposed to violence in media though, American media has plenty of violence.

They're trying to be all-around uncontroversial.


Where is the training material for this coming from? The only resource I can think of that's broad enough for a general purpose video model is YouTube, but I can't imagine Google would allow a third party to scrape all of YT without putting up a fight.


It's movies; the shots are way too deliberate to have random YouTube crap in the dataset.


You can still have a broad dataset and use RLHF to steer it more towards the aesthetic like midjourney and SDXL did through discord feedback. I think there was still some aesthetic selection in the dataset as well but it still included a lot of crap.


It's very good. Unclear how far ahead of Lumiere it is (https://lumiere-video.github.io/) or if it's more of a difference in prompting/settings.


The big standout to me, beyond almost any other text-to-video solution, is that the video duration is tremendously longer (a minute+). Everything else that I've seen can't get beyond 15 to 20 seconds at the absolute maximum.


In terms of following the prompt and generating visually interesting results, I think they're comparable. But the resolution for Sora seems so far ahead.

Worth noting that Google also has Phenaki [0] and VideoPoet [1] and Imagen Video [2]

[0] https://sites.research.google/phenaki/

[1] https://sites.research.google/videopoet/

[2] https://imagen.research.google/video/


Must be intimidating to be on the Pika team at the moment...


you nailed it


All those startups have been squeezed in the middle. Pika, Runway, etc might as well open source their models.

Or Meta will do it for them.


It is incredible indeed, but I remember there was a humongous gap between the demoed pictures for DALL-E and what most prompts would generate.

Don't get overly excited until you can actually use the technology.


I know it's Runway (and has all manner of those dream-like AI artifacts), but I like what this person is doing with just a bunch of 4-second clips and an awesome soundtrack:

https://youtu.be/JClloSKh_dk

https://youtu.be/upCyXbTWKvQ


I agree in terms of raw generation, but runway especially is creating fantastic tooling too.


Yup, it's been even several months! ;) But now we finally have another quantum leap in AI.


The Hollywood Reporter says many in the industry are very scared.[1]

“I’ve heard a lot of people say they’re leaving film,” he says. “I’ve been thinking of where I can pivot to if I can’t make a living out of this anymore.” - a concept artist responsible for the look of the Hunger Games and some other films.

"A study surveying 300 leaders across Hollywood, issued in January, reported that three-fourths of respondents indicated that AI tools supported the elimination, reduction or consolidation of jobs at their companies. Over the next three years, it estimates that nearly 204,000 positions will be adversely affected."

"Commercial production may be among the main casualties of AI video tools as quality is considered less important than in film and TV production."

[1] https://www.hollywoodreporter.com/business/business-news/ope...


Honest question: of what possible use could Sora be for Hollywood?

The results are amazing, but if the current crop of text-to-image tools is any guide, it will be easy to create things that look cool but essentially impossible to create something that meets detailed specific criteria. If you want your actor to look and behave consistently across multiple episodes of a series, if you want it to precisely follow a detailed script, if you want continuity, if you want characters and objects to exhibit consistent behavior over the long term – I don't see how Sora can do anything for you, and I wouldn't expect that to change for at least a few years.

(I am entirely open to the idea that other generative AI tools could have an impact on Hollywood. The linked Hollywood Reporter article states that "Visual effects and other postproduction work stands particularly vulnerable". I don't know much about that, I can easily believe it would be true, but I don't think they're talking about text-to-video tools like Sora.)


I suspect that one of the first applications will be pre-viz. Before a big-budget movie is made, a cheap version is often made first. This is called "pre-visualization". These text to video applications will be ideal for that. Someone will take each scene in the script, write a big prompt describing the scene, and follow it with the dialog, maybe with some commands for camerawork and cuts. Instant movie. Not a very good one, but something you can show to the people who green-light things.

There are lots of pre-viz reels on line. The ones for sequels are often quite good, because the CGI character models from the previous movies are available for re-use. Unreal Engine is often used.


Especially when you can do this with still images on a normal M-series MacBook _today_, automating it would be pretty trivial.

Just feed it a script and get a bunch of pre-vis images for every scene.

When we get something like this running on hardware with an uncensored model, there are going to be a lot of redundancies but also a ton of new art that would've never happened otherwise.


This is a fascinating idea I'd never considered before.


People are extrapolating out ten years. They will still have to eat and pay rent in ten years.


It wouldn't be too hard to do any of the things you mention. See ControlNet for Stable Diffusion, and vid2vid (if this model does txt2vid, it can also do vid2vid very easily).

So you can just record some guiding stuff, similar to motion capture but with just any regular phone camera, and morph it into anything you want. You don't even need the camera, of course, a simple 3D animation without textures or lighting would suffice.

Also, consistent look has been solved very early on, once we had free models like Stable Diffusion.
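To make the ControlNet point concrete, a rough sketch of the naive per-frame version with Hugging Face diffusers (public checkpoint names; frame-by-frame like this has no temporal consistency, which is what the AnimateDiff-style add-ons are for):

    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from diffusers.utils import load_image

    # OpenPose-conditioned ControlNet: the pose skeleton extracted from your
    # phone footage drives the motion, the prompt decides what it looks like.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # pose_frames: OpenPose skeleton images pre-extracted from the guide video
    pose_frames = [load_image(f"pose_{i:04d}.png") for i in range(120)]
    out_frames = [
        pipe("a knight in ornate armor walking through a ruined city",
             image=frame, num_inference_steps=20).images[0]
        for frame in pose_frames
    ]
    for i, frame in enumerate(out_frames):
        frame.save(f"out_{i:04d}.png")

This is only meant to show the control idea; in practice you'd fix the seed and add a temporal module so the frames don't flicker.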


Right now you'd need an artistic/ML mixed team. You wouldn't use an off-the-shelf tool. There was a video of some guys doing this (sorry can't find it) to make an anime type animation. With consistent characters. They used videos of themselves running through their own models to make the characters. So I reckon while prompt -> blockbuster is not here yet, a movie made using mostly AI is possible, but it will cost a lot now and that cost will go down. While this is sad, it is also exciting. And scary. Black Mirror-like, we will start creating AIs we will have relationships with and bring people back to life (!) from history, and maybe grieving people will do this. Not sure if that is healthy, but people will do it once it is a click-of-a-button thing.


> There was a video of some guys doing this (sorry can’t find it) to make an anime type animation. With consistent characters. They used videos of themselves running through their own models to make the characters.

That was Corridor Crew: https://www.youtube.com/watch?v=_9LX9HSQkWo


It shows that good progress is still made.

Just this week sd audio model can make good audio effects like doors etc.

If this continues (and it seems it will) it will change the industry tremendously.


It won't be Hollywood at first. It will be small social ads for TikTok, IG and social media. The brands likely won't even care if they don't get copyright at the end, since they have copyright of their product.

Source: I work in this.


Seconding this. There is also a huge SMB and commercial business that supports many agencies and production companies. This could replace a lot of that work.


The OpenAI announcement mentions being able to provide an image to start the video generation process from. That sounds to me like it will actually be incredibly easy to anchor the video generation to some consistent visual - unlike all the text-based stable diffusion so far. (Yes, there is img2img, but that is not crossing the boundary into a different medium like Sora).


Probably a bad time to be an actor.

Amazing time to be a wannabe director or producer or similar creative visionary.

Bad time to be high up in a hierarchical/gatekeeping/capital-constrained biz like Hollywood.

Amazing time to be an aspirant that would otherwise not have access to resources, capital, tools in order to bring their ideas to fruition.

On balance I think the ‘20s are going to be a great decade for creativity and the arts.


> Probably a bad time to be an actor.

I don't see why -- the distance between "here's something that looks almost like a photo, moving only a little bit like a mannequin" and "here's something that has the subtle facial expressions and voice to convey complex emotions" is pretty freaking huge; to the point where the vast majority of actual humans fail to be that good at it. At any rate, the number of BNNs (biological neural networks) competing with actors has only been growing, with 8 billion and counting.

> Amazing time to be a wannabe director or producer or similar creative visionary. Amazing time to be an aspirant that would otherwise not have access to resources, capital, tools in order to bring their ideas to fruition.

Perhaps if you mainly want to do things for your own edification. If you want to be able to make a living off it, you're suddenly going to be in a very, very flooded market.


It’s for sure plausible that acting remains a viable profession.

The bull case would be something like ‘Ractives in “The Diamond Age” by Neal Stephenson; instead of video games people play at something like live plays with real human actors. In this world there is orders of magnitude more demand for acting.

Personally I think it’s more likely that we see AI cross the uncanny valley in a decade or two (at least for movies/TV/TikTok style content). But this is nothing more than a hunch; 55/45 confidence say.

> Perhaps if you mainly want to do things for your own edification.

My mental model is that most aspiring creatives fall in this category. You have to be doing quite well as an actor to make a living from it, and most who try do not.


> the distance between "here's something that looks almost like a photo, moving only a little bit like a mannequin" and "here's something that has the subtle facial expressions and voice to convey complex emotions" is pretty freaking huge;

The distance between pixelated noise and a single image is freaking huge.

The distance between a single image and a video of a consistent 3D world is freaking huge (albeit with rotating legs).

The distance between a video of a consistent 3D world and a full length movie of a consistent 3D world with subtle facial expressions is freaking huge.

So... next 12 months then.

>If you want to be able to make a living off it, you're suddenly going to be in a very, very flooded market.

That is, I believe, GPs point.


Considering that a year ago we had that nightmare fuel of Will Smith eating spaghetti, and Don and Joe Hair Force One, it seems odd to see those of you who assume we're not going to get to the point of being indistinguishable from reality in the near future.


* Flesh out a movie about x following the Hero's Journey in the style of Notting Hill.

* Create a scene in which a character with the mannerisms of Tom Cruise from Top Gun goes into a bar and says "...."


We might enter a world where "actors" are just for mocap. They do the little micro expressions with a bunch of dots on their face.

AI models add the actual character and maybe even voice.

At that point the amount of actors we "need" will go down drastically. The same experienced group of a dozen actors can do multiple movies a month if needed.


It's always a bad time to be an actor, between long hours, low pay, and a culture of abuse, but this will definitely make it worse. My writer and artist friends are already despondent from genAI -- it was rare to be able to make art full-time, and even the full-timers were barely making enough money to live. Even people writing and drawing for marketing were not exactly getting rich.

I think this will lead to a further hollowing-out of who can afford to be an actor or artist, and we will miss their creativity and perspective in ways we won't even realize. Similarly, so much art benefits from being a group endeavor instead of someone's solo project -- imagine if George Lucas had created Star Wars entirely on his own.

Even the newly empowered creators will have to fight to be noticed amid a deluge of carelessly generated spam and sludge. It will be like those weird YouTube Kids videos, but everywhere (or at least like indie and mobile games are now). I think the effect will be that many people turn to big brands known for quality, many people don't care that much, and there will be a massive doughnut hole in between.


> Even the newly empowered creators will have to fight to be noticed amid a deluge of carelessly generated spam and sludge. It will be like those weird YouTube Kids videos, but everywhere (or at least like indie and mobile games are now).

Reminds me of Syndrome's quote in the Incredibles.

"If everyone is super, then no one will be".


I dunno. Thanks to big corpo shenanigans (and, er, racism?) a lot of people have turned away from big brands (or, at least obviously brand-y brands) towards "trusted individuals" (though you might classify them as brands themselves). Who goes to PCMag anymore? It's all LTT and Marques Brownlee and any number of small creators. Or, the people on the right who abandoned broadcast and even cable news and get everything they "know" from Twitter randos. Even on this site, asks for a Google Search alternative are not rare, and you'll get about a dozen different answers each time, each with a fraction of the market share of the big guy (but growing).


> Probably a bad time to be an actor.

I'm thinking people will probably still want to see their favorite actors, so established actors may sell the rights to their image. They're sitting on a lot of capital. Bad time to be becoming an actor though.


You are talking about movie and TV stars, not actors in general. The vast majority of working actors are not known to the audience.


Even the average SAG-AFTRA member barely makes a living wage from acting. And those are the ones that got into the union. There's a whole tier below that. If you spend time in LA, you probably know some actress/model/waitress types.

There's also the weird misery of being famous, but not rich. You can't eat fame.


> established actors may sell the rights to their image

I had a conversation with a Hollywood producer last year who said this is already happening.


Likely less and less tho given that people will be able to generate a hyper personalized set of actors/characters/personalities in their hyper personalized generated media.

Younger generations growing up with hyper personalized media will likely care even less about irl media figures.


You can’t replace actors with this for a long time. Actors are “rendering” faster than any AI. Animation is where the real issues will show up first, particularly in Advertising.


Have you seen the amount of CGI in movies and TV shows? :)

In many AAA blockbusters the "actors" on screen are just CGI recreations during action scenes.

But you're right, actors won't be out of a job soon, but unless something drastic happens they'll have the role of Vinyl records in the future. For people who appreciate the "authenticity". =)


I think you can fill-in many scenes for the actor - perhaps a dupe but would look like the real actor - of course the original actor would have to be paid, but perhaps much less as the effort is reduced.


If it requires acting, it likely can't be done with AI. You underestimate, I think, how much an actor carries a movie. You can use it for digi doubles maybe, for stunts and VFX. But if his face is on the screen... We are ages away from having an AI actor perform at the same level as Daniel Day-Lewis, Willem Dafoe, or anyone else in that atmosphere. They make too many interesting choices per second for it to be replaced by AI.


Quality aside, there's a reason producers pay millions for A-list stars instead of any of the millions of really good aspiring actors in LA that they could hire for pennies. People will pay to see the new Matt Damon flick but wouldn't give it a second glance if some no-name was playing the part.

If you can't replace Matt Damon with another equivalently skilled human, CGI won't be any different.

Granted, maybe that's less true today, given Marvel and such are more about the action than the acting. But if that's the future of the industry anyway, then acting as a worthwhile profession is already on its way out, CGI or no.


Yes, people also take actors as a sign of the quality of the film, or at least they used to, before Marvel. Hence films with big names attached get more money, etc.

Still the idea that actors are easy to replace is preposterous to anyone who's ever worked with actors. They are preposterously HARD to replace, in theatre and film. A good actor is worth their weight in gold. Very very few people are good actors. A good actor is a good comedian, a master at controlling his body, and a master at controlling his voice, towards a specifically intended goal. They can make you laugh, cry, sigh, or feel just about anything. You just look at Paul Giamatti or Willem Dafoe or Denzel Washington. Those people are not replaceable, and their work is just as good and just as culturally important as a Picasso or a Monet. A hundred years from now people will know the name of actors, because that was the dominant mode of entertainment of our age.


The idea that this destroys the industry is overblown, because the film industry has already been dying since the 2000s.

Hollywood is already destroyed. It is not the powerful entity it once was.

In terms of attention and time of entertainment, Youtube has already surpassed them.

This will create a multitude more YouTube creators that do not care about getting this right or making a living out of it. It will just take our attention all the same, away from the traditional Hollywood.

Yes, there will still be great films and franchises, but the industry is shrinking.

This is similar to journalism saying that AI will destroy it. Well, there was nothing to destroy, because a bunch of traditional newspapers had already closed shop even before AI came.


They shouldn’t be worried so soon. This will be used to pump out shitty hero movies more quickly, but there will always be demand for a masterpiece after the hype cools down.

This is like a chef worrying going out of business because of fast food.


Yeah, but how many will work on that singular masterpiece? The rest will be reduced and won’t have a job to put food on the table


Only if the entertainment market remains the same size.


Without a change in copyright law, I doubt it. The current policy of the USCO is that the products of AI based on prompts like this are not human-authored and can't be copyrighted. No one is going to release AI-created stuff that someone else can reproduce, because it's public domain.


Has anyone else noticed the leg swap in the Tokyo video at 0:14? I guess we are past uncanny, but I do wonder if these small artifacts will always be present in generated content.

Also begs the question: if more and more children are introduced to media from a young age and are fed more and more generated content, will they be able to feel the "uncanniness" or become completely numb to it?

There's definitely an interesting period ahead of us; not yet sure how to feel about it...


There are definitely artifacts. Go to the 9th video in the first batch, the one of the guy sitting on a cloud reading a book. Watch the book; the pages are flapping in the wind in an extremely strange way.


The third batch, the one with the cat, the guy in bed has body parts all over, his face deforms, and the blanket is partially alive.


When there is a déjà vu cat, we know we are in trouble!


In the one with the cat waking up its owner, the owner's shoulder turns into a blanket corner when she rolls over.


Yep, I noticed it immediately too. Yet it is subtle in reality. I'm not that good at spotting imperfections in a picture, but in the video I immediately felt something was not quite right.


Tangent to feeling numb to it - will it hinder children developing the understanding of physics, object permanence, etc. that our brains have?


There have been children who reacted with irritation when they couldn't swipe away real-life objects. The idea is to give kids enough real-world experiences so this does not happen.


Kids have been exposed to decades of 2D and 3D animations that do not contain realistic physics etc; I’m assuming they developed fine?


Kids aren't supposed to have screen time until they're at least a few years old anyways


I noticed at the beginning that cars are driving on the right side of the road, but in Japan they drive on the left. The AI misses little details like that.

(I'm also not sure they've ever had a couple inches of snow on the ground while the cherry blossoms are in bloom in Tokyo, but I guess it's possible.)


The cat in the "cat wakes up its owner" video has two left front legs, apparently. There is nothing that is true in these videos. They can and do deviate from reality at any place and time and at any level of detail.


These artefacts go down with more compute. In four years when they attack it again with 100x compute and better algorithms I think it'll be virtually flawless.


I had to go back several times to 0:14 to see if it was really unusual. I get it of course, but probably watching 20 times I would have never noticed it.


Yep! Glad I wasn't the only one that saw that. I have a feeling THEY didn't see it or they wouldn't have showcased it.


I don't think that's the case. I think they're aware of the limitations and problems. Several of the videos have obvious problems, if you're looking - e.g. people vanishing entirely, objects looking malformed in many frames, objects changing in size incongruent with perspective, etc.

I think they just accept it as a limitation, because it's still very technically impressive. And they hope they can smooth out those limitations.


They swap multiple times lol. Not to mention it almost always looks like the feet are slightly sliding on the ground with every step.

I mean there are some impressive things there, but it looks like there's a long ways to go yet.

They shouldn't have played it into the close up of the face. The face is so dead and static looking.


certainly not perfect... but "some impressive things" is an understatement, think of how long it took to get halfway decent CGI... this AI thing is already better than clips I've seen people spend days building by hand


This is pretty impressive, it seems that OpenAI consistently delivers exceptional work, even when venturing into new domains. But looking into their technical paper, it is evident that they are benefiting from their own body of work done in the past and also the enormous resources available to them.

For instance, the generational leap in video generation capability of Sora may be possible because:

1. Instead of resizing, cropping, or trimming videos to a standard size, Sora trains on data at its native size. This preserves the original aspect ratios and improves composition and framing in the generated videos. This requires massive infrastructure. This is eerily similar to how GPT3 benefited from a blunt approach of throwing massive resources at a problem rather than extensively optimizing the architecture, dataset, or pre-training steps.

2. Sora leverages the re-captioning technique from DALL-E 3, using GPT to turn short user prompts into longer detailed captions that are sent to the video model. Although it remains unclear whether they employ GPT-4 or another internal model, it stands to reason that they have access to a superior captioning model compared to others.

This is not to say that inertia and resources are the only factors differentiating OpenAI; they may have access to a much better talent pool, but that is hard to gauge from the outside.
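Point 2 is easy to approximate from the outside. A minimal sketch of the re-captioning idea, assuming the public OpenAI chat API (the system prompt is a guess at the flavor of expansion, not OpenAI's actual recipe):

    from openai import OpenAI

    client = OpenAI()

    def expand_prompt(short_prompt: str) -> str:
        # Turn a terse user prompt into the kind of long, detailed caption
        # the video model was presumably trained against.
        resp = client.chat.completions.create(
            model="gpt-4",  # unclear which model OpenAI actually uses internally
            messages=[
                {"role": "system", "content":
                    "Rewrite the user's video idea as a single richly detailed caption: "
                    "subjects, setting, lighting, camera movement, lens, mood."},
                {"role": "user", "content": short_prompt},
            ],
        )
        return resp.choices[0].message.content

    print(expand_prompt("a dog surfing at sunset"))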


https://openai.com/sora?video=big-sur

In this video, there's extremely consistent geometry as the camera moves, but the texture of the trees/shrubs on the top of the cliff on the left seems to remain very flat, reminiscent of low-poly geometry in games.

I wonder if this is an artifact of the way videos are generated. Is the model separating scene geometry from camera? Maybe some sort of video-NeRF or Gaussian Splatting under the hood?


Curious about what current SotA is on physics-infusing generation. Anyone have paper links?

OpenAi has a few details:

>> The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

>> Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.

>> We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.

>> Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

The implied claims that it understands the physics of simple scenes and some instances of cause and effect are impressive!

Although I assume that's been SotA-possible for a while, and I just hadn't heard?
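For intuition on what the "patches" quote above means, here is a toy sketch of chopping a video tensor into flattened spacetime patches (all shapes are made up; per the report, the real model patchifies a compressed latent rather than raw pixels):

    import torch

    # toy video: 16 frames of 128x128 RGB
    video = torch.randn(16, 3, 128, 128)   # (T, C, H, W)
    pt, ph, pw = 4, 16, 16                 # spacetime patch size (arbitrary)

    T, C, H, W = video.shape
    patches = (
        video
        .reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
        .permute(0, 3, 5, 2, 1, 4, 6)      # (T', H', W', C, pt, ph, pw)
        .reshape(-1, C * pt * ph * pw)     # one flat "token" per patch
    )
    print(patches.shape)                   # torch.Size([256, 3072])

Each row is then treated like a token, which is what lets the same transformer ingest different durations, resolutions and aspect ratios.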


On the announcement page, it specifically says Sora does not understand physics


I saw similar artifacts in dalle-1 a lot (as if the image was pasted onto geometry). Definitely wouldn't surprise me if they use synthetic rasterized data to in the training, which could totally create artifacts like this.


The model is essentially doing nothing but dreaming.

I suspect that anything that looks like familiar 3D-rendering limitations is probably a result of the training dataset simply containing a lot of actual 3D-rendered content.

We can't tell a model to dream everything except extra fingers, false perspective, and 3D-rendering compromises.


Technically we can, that's what negative prompting[1] is about. For whatever reason, OpenAI has never exposed this capability in its image models, so it remains an open source exclusive.

[1] https://stable-diffusion-art.com/how-to-use-negative-prompts...
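For anyone who hasn't tried it, a minimal sketch of a negative prompt with the open-source diffusers library (checkpoint is just the public SDXL base; prompts are arbitrary examples):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="an astronaut riding a horse on the moon, film still",
        # things you explicitly don't want the model to dream up
        negative_prompt="extra fingers, extra limbs, deformed hands, blurry, watermark",
        num_inference_steps=30,
    ).images[0]
    image.save("astronaut.png")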


It's more complicated than that. Negative prompts are just as limited as positive prompts.


It's possible it was pre-trained on 3D renderings first, because it's easy to get almost infinite synthetic data that way, and after that they continued the training on real videos.


In the car driving on the mountain road video you can see level-of-detail popping artifacts being reproduced, so I think that's a fair guess.


Maybe it was trained on a bunch of 3d Google Earth videos.


Doesn't look flat to me.

Edit: Here[0] I highlighted a groove in the bushes moving with perfect perspective

[0] https://ibb.co/Y7WFW39


Look in the top left corner, on the plane


My vote is yes - some sort of intermediate representation is involved. It just seems unbelievable that it's end-to-end with 2D frames...


The water is on par with Avatar. Looks perfect


Wow, yeah I didn't notice it at first, but looking at the rocks in the background is actually nauseating


It looks perfect to me. That's exactly how the area looks in person.


I say this with all sincerity, if you're not overwhelmingly impressed with Sora then you haven't been involved in the field of AI generated video recently. While we understand that we're on the exponential curve of AI progress, it's always hard to intuit just what that means.

Sora represents a monumental leap forward; it's comically a 3000% improvement in seconds of 'coherent' video generation. Coupled with a significantly enhanced understanding of contextual prompts and overall quality, it has achieved what many (most?) thought would take another year or two.

I think we will see studios like ILM pivoting to AI in the near future. There's no need for 200 VFX artists when you can have 15 artists working with AI tooling to generate all the frame-by-frame effects, backgrounds, and compositing for movies. It'll open the door for indie projects that can take place in settings that were previously the domain of big Hollywood. A sci-fi opera could be put together with a few talented actors, AI effects and a small team to handle post-production. This could conceivably include AI scoring.

Sure, Hollywood and various guilds will strongly resist, but it'll require just a handful of streaming companies to pivot. Suddenly content creation costs for Netflix drop an order of magnitude. The economics of content creation will fundamentally change.

At the risk of being proven very wrong, I think replacing actors is still fairly distant in the future but again... humans are bad at conceptualizing exponential progress.


I strongly believe that AI will have massive impact on the film industry however it won't be because of a blackbox, text to video tool like Sora. VFX artists and studios still want a high level of control over the end product and unless it's very simple to tweak small details like the blur of an object in the background, or the particle physics of an explosion, then they wouldn't use it. What Hollywood needs are AI tools that can integrate with their existing workflows. I think Adobe is doing a pretty good job at this.


You're completely missing the point. Who cares what VFX artists and studios want if anyone with a small team can create high quality entertaining videos that millions of people would pay to watch? And if you think that's a bar too high for AI, then you haven't actually seen the quality of average videos and films generated these days.


I was specifically responding to this point which seemed to be the thesis of the parent commenter.

> I think we will see studios like ILM pivoting to AI in the near future. There's no need for 200 VFX artists when you can have 15 artists working with AI tooling

Yes this will bring the barrier to entry for small teams down significantly. However it's not going to replace the 200 people studios like ILM.


I believe this to be a failure of imagination. You're assuming Sora stays like this. The reality is we are on an exponential, and it's just a matter of time. ILM will be the last to go, but it'll eventually go, in the sense of needing fewer humans to create the same output.


I think it's fair to be impressed with Sora as the next stage of AI video, yet not be too surprised or consider it some insurmountable leap from the public pieces we've seen of AI video up to this point. We've always been just a couple papers away, seeking a good consistency modelling step - now we've got it. Amazing and viscerally chilling - seeing the net effect - but let's not be intimidated so easily or prop these guys up as gods just for being a bit ahead of the obviously-accelerating curve. Anyone tracking this stuff had a very strong prediction of good AI video within a year - two max. This was a context size increase and overall impressive quality pass reaching a new milestone, but the bones were there.


Watching these made me think, I'm going to want to go to the theatre a lot more in the future and see fellow humans in plays, lectures and concerts.

Such achievements in technology must lead to cultural change. Look at how popular vinyl has become; why not theatre again?


Do you feel the same way about modern movies? CGI is so ubiquitous and accessible, that most movies use some form of it. It's actually news when a filmmaker _doesn't_ use CGI (e.g. Nolan).

These advancements are just the next step in that evolution. The tech used in movies will be commoditized, and you'll see Hollywood-style production in YouTube videos.

I'm not sure why you think theater will become _more_ popular because of this. It has remained popular throughout the years, as technology comes and goes. People can enjoy both video and theater, no?


I agree, seeing real human actors on stage will always be popular for some consumers. Same for local live musicians.

That said, I helped a friend who makes low budget, edgy and cool films last week. I showed him what I knew about driving Pika.art and he picked it up quickly. He is very excited about the possibility of being able to write more stories and turn them into films.

I think there is plenty of demand for all kinds of entertainment. It is sad that so many creative people in Hollywood and other content creation centers will lose jobs. I think the very best people will be employed, but often partnered with AIs. Off topic, but I have been a paid AI practitioner since 1982, and the breakthroughs of deep learning, transformers, and LLMs are stunning.


We will soon find that story generation is easily automated.


It's already easily automated.


Drop a link to your friend’s work?


I actually suspect one of the new most popular mediums will be actors on a theatre stage doing live performances to a live AI CGI video being rendered behind them - similar to musicians in a live orchestra. It would bring together the nostalgia and wonder of human acting and performance art, while still smoothing and enhancing their live performance into the quality levels and wonder we've come to expect from movie theatre experiences. This will be technologically doable soon.


Imagine movies generated in real-time just for you, with the faces you know, places you know and what not!


That's terrifying and dystopian.


No it's not. Imagine turning on the television when you get home and it's a show all about you (think Breaking Bad, but you're Walter White). You flip to another channel and it's a pornographic movie where you sleep with all the world's most famous movie stars. Flip the channel again and it's all the home movies you wish you had but were never able to make.

This is a future we could once only dream of, and OpenAI is making it possible. Has anyone noticed how anti-progress HN has become lately?


I guess it depends on your definition of progress. None of those examples you listed sound particularly appealing to me. I've never watched a show and thought I'd get more enjoyment if I was at the center of that story. Porn and dating apps have created such unrealistic expectations of sex and relationships that we're already seeing the effects in younger generations. I can only imagine what on-demand fully generative porn will have on issues like porn addiction.

Not to say I don't have some level of excitement about the tech, but I don't think it's unwarranted pessimism to look at this stuff and worry about it's darker implications.


> You flip to another channel and it's a pornographic movie where you sleep with all the world's most famous movie stars.

This is not only dystopian, it's just sad. All these look taken from the first seasons of Black Mirror. I don't know what you think progress is but AI porno and ads are not.


I don't think any well adjusted person ever has actually wanted this


This might be more revealing of you than of people in general. Even when I play tabletop RPGs, a place I could _easily_ play a version of myself, I almost never do. There's nothing wrong with doing so, but most people don't.


That seems depressingly solipsistic. I think part of the appeal of art is that it's other humans trying to communicate with you, that you feel the personality of the creators shining through.

Also I've never interacted with any piece of art or entertainment and thought to myself "this is neat and all, but it would be much improved if this were entirely about me, with me as the protagonist." One watches Breaking Bad because Walter White is an interesting character; he's a man who falls into a life of crime initially for understandable reasons, but as the series goes on it becomes increasingly clear that he is lying to himself about his motivations and that his primary motivation for his escalating criminal life is his deep-seated frustration at the mediocrity of his life. More than anything else, he craves being important. The unraveling of his motivations and where they come from is the story, and that's something you can't really do when you're literally watching yourself shoehorned into a fictional setting.

You seem to regard it as self-evident that art or entertainment would be improved if (1) it's all about you personally and (2) involvement of other real humans is reduced to zero, but I cannot fathom why you would think that (with the exception of the porn example).


Seems like we're pretty close to inserting ourselves into pornographic movies.


We can do that already, you just need a camera


Can also achieve multiple (still) angles, with multiple phones.


Not close, we're there. Look up FaceFusion.


That would suck. I want to see something I haven't seen before.


I guarantee you haven't seen the entire latent space of any large model


Your wish is Sora's (or its successor model's) prompt.


I am a much better software engineer than I am a director. I can guarantee you that I don’t want to see anything that I could prompt.


Your favorite shows, where the season never ends, and the actors never age.


The vinyl narrative is so whack.

https://www.riaa.com/u-s-sales-database/

At its peak, inflation-adjusted vinyl sales were $1.4 billion in 1979. Fast forward to the lowest point, $3.4 million in 2009. So vinyl has been so popular that it grew to $8.5 million by 2021.

That is just nostalgia, not a cultural change driven by the dystopia of AI.


Why is my 14-year-old niece now collecting vinyl? I can guarantee it's not nostalgia. There's obviously more at play there, even acknowledging your point about relative market size.


Perhaps it is _anemoia_ - nostalgia for a time you've never known https://www.dictionaryofobscuresorrows.com/post/105778238455...

In this case, it's for the harmless charm of an imagined past, but the same forces are at play in some more dangerous forms of social conservatism.


It's a very narrow subgroup.

But things can coexist. It's now easier to create music than ever, and there is more music created by more artists than ever. Most music is forgettable and just streamed as background music. But there is also room for superstars like Taylor Swift.

Things don't have to be either-or.


How many 14-year-olds do you know who collect vinyl?


The medium is the message. I know several people born post 2000 who are embracing records and tapes.


I started when I was pretty much exactly that age, ten years ago.


> The vinyl narrative is so whack.

"Revenues for the LP/EP format were $1.2B in 2022 and accounted for 7.7% of total revenue of $15.9B for all selected formats for the year"

Adjusted for inflation.

It's my understanding that LP/EP is vinyl as well, not just vinyl singles.


This has to be it. Vinyl costs like $20 per record, and $8M is like 400k vinyl sales (buyers often buy more than one record, so it's a lot fewer buyers), which seems too low globally. At $1.2B, it's more like 60M sales, which seems more reasonable.


I think a lot of people collect vinyl less for nostalgia reasons and more so to have a physical collection of their music. I think vinyl wins over CDs just due to how it’s larger and the cover art often looks better as a result.


Obviously incredibly cool, but it seems that people are incredibly overstating the applications of this.

Realistically, how do you fit this into a movie, a TV show, or a game? You write a text prompt, get a scene, and then everything is gone—the characters, props, rooms, buildings, environments, etc. won’t carry over to the next prompt.


It doesn't need to replace the whole movie

You could use it for stuff like wide shots, close ups, random CG shots, rapid cut shots, stuff where you just cut to it once and don't need multiple angles

To me it seems most useful for advertising, where a lot of the time they only show something once, like in a montage


And it would be magic for storyboarding. This would be such a useful tool for a director to iterate on a shot and then communicate that to the team


I could arrange it in FrameForge 3D shot by shot, even adjusting for motion in between, then export to an AI solution. That, to me, would be everything. Of course, then come the issues of consistency, adjustments & tweaks, etc.


I also see advertising (especially lower-budget productions, such as dropshipping or local TV commercials) being early adopters of this technology once businesses have access to this at an affordable price.


It generates up to 1-minute videos, which is what all the kids are watching on TikTok and YouTube Shorts, right? And most ads are shorter than 1 minute.


A few months ago, AI-generated videos of people getting arrested for wearing big boots went viral on TikTok. I think this sort of silly "interdimensional cable" stuff will be really big on these short-form video sites once this level of quality becomes available to everyone.


Robot chicken, but full motion video


You wait a year and they'll figure it out.


It also seems hard to control exactly what you get. Like you'd want a specific pan, focus etc. to realize your vision. The examples here look good, but they aren't very specific.

But it was the same with Dall-E and others in the beginning, and there's now lots of ways to control image generators. Same will probably happen here. This was a huge leap just in how coherent the frames are.


What came to mind is what is right around the corner: you create segments and stitch them together.

"ok, continue from the context on the last scene. Great. Ok, move the bookshelf. I want that cat to be more furry. Cool. Save this as scene 34."

As clip sizes grow and context can be inferred from a previous scene, and a library of scenes can be made, boom, you can now create full feature length films, easy enough that elementary school kids will be able to craft up their imaginations.


You could use it to storyboard right now. Continuity of characters/wardrobe, etc. is not that important in storyboarding.


Family Guy is built on out of context clips.

It could also fill in for background videos in scenes, instead of getting real content they’d have to pay for, or making their own. The gangster movie Kevin was playing in Home Alone was specifically shot for that movie, from what I remember.


> You write a text prompt, get a scene, and then everything is gone—the characters, props, rooms, buildings, environments, etc. won’t carry over to the next prompt.

Sure, you can't use the text-to-video frontend for that purpose. But if you've got a t2v model as good as Sora clearly is, you've got the infrastructure for a lot more, as the ecosystem around the open-source models in the space has shown. The same techniques that allow character, object, etc., consistency in text-to-image models can be applied to text-to-video models.


It's pretty obvious they just need to add the ability to prompt it with an image saying "continue in this style and make the character..."


Nah, just fine-tune the model to a specific set of characters or an aesthetic. It's not hard; it's already done with SDXL LoRAs. You can definitely generate a whole movie from just a storyboard... if not now, then in maybe five years.
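
For what it's worth, here's a rough sketch of that idea with today's text-to-image tooling (assumes the Hugging Face diffusers package, the public SDXL base checkpoint, and a hypothetical "my-character-lora" adapter; this is illustrative, not anything Sora exposes):

    # Hypothetical example: a LoRA adapter trained on one character keeps that
    # character consistent across otherwise unrelated prompts.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights("my-character-lora")  # hypothetical adapter path

    # The LoRA biases generation toward the learned character/aesthetic, so
    # different prompts still render the same subject.
    image = pipe("the character riding a bicycle through Tokyo at night").images[0]
    image.save("consistent_character.png")

The same adapter trick is what people mean by carrying characters across shots today; a video model would need a temporal equivalent, but the mechanism itself is not new.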


Explicit video clips? 4chan is gonna have a field day with this.


Not necessarily explicit. Maybe just deliberately offensive. Maybe just weirdly specific.

It's gonna be great.


lots and lots and lots of b-roll and stock footage is about to get cheaper.

Also, using this kind of footage is the bread and butter for a lot of marketers for their content.

Imagine never having to pay stock footage companies


What? You're serious?

Script => video baseline. Take a frame of any character/prop/room/etc. you want to remain consistent, do one shitty photoshop, and it's part of the new scene.

Incredibly overstating. That is an incredible lack of imagination, buddy. Or even just of basic craftsmanship.


tiktok


People here seem mostly impressed by the high resolution of these examples.

Based on my experience doing research on Stable Diffusion, scaling up the resolution is the conceptually easy part that only requires larger models and more high-resolution training data.

The hard part is semantic alignment with the prompt. Attempts to scale Stable Diffusion, like SDXL, have resulted only in marginally better prompt understanding (likely due to the continued reliance on CLIP prompt embeddings).

So, the key question here is how well Sora does prompt alignment.
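
For anyone who hasn't looked inside these pipelines, here is a minimal sketch (assuming the Hugging Face transformers package and the public CLIP checkpoint that SD 1.x uses; purely illustrative) of why the text encoder is the bottleneck for prompt alignment:

    # Everything a Stable-Diffusion-style denoiser "knows" about the prompt is a
    # fixed-length sequence of CLIP token embeddings produced up front.
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "a corgi wearing a red scarf, photorealistic, shallow depth of field"
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        embeddings = text_encoder(tokens.input_ids).last_hidden_state

    # Shape (1, 77, 768): the denoiser cross-attends to this tensor and nothing
    # else, so scaling the image backbone can't recover nuance the text encoder
    # never captured in the first place.
    print(embeddings.shape)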


The real advancement is the consistency of character, scene, and movement!


There needs to be an updated CLIP-like model in the open-source community. The model is almost three years old now and is still the backbone of a lot of multimodal models. It's not a sexy problem to take on since it isn't especially useful in and of itself, but so many downstream foundation models (LLaVA, etc.) would benefit immensely from it. Is there anything out there that I'm just not aware of, other than SigLIP?
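
To make the "backbone" point concrete, here is a rough sketch of the LLaVA-style pattern (assumes PyTorch, the Hugging Face transformers package, and a hypothetical 1024-to-4096 projection; the dimensions are illustrative). The LLM only ever sees the image through whatever a frozen CLIP-class encoder captured, which is why a stronger open replacement would help so many downstream models:

    # Frozen CLIP vision tower -> linear projection -> "visual tokens" for an LLM.
    import torch
    import torch.nn as nn
    from PIL import Image
    from transformers import CLIPVisionModel, CLIPImageProcessor

    vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
    projector = nn.Linear(1024, 4096)  # hypothetical: CLIP width -> LLM hidden size

    image = Image.new("RGB", (224, 224))  # stand-in for a real image
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        patch_features = vision_tower(pixels).last_hidden_state  # (1, 257, 1024)
        visual_tokens = projector(patch_features)                 # (1, 257, 4096)

    # These visual tokens are prepended to the LLM's text embeddings; any detail
    # the vision encoder missed is simply unavailable to the rest of the model.
    print(visual_tokens.shape)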


I agree.

I think one part of the problem is using English (or whatever natural language) for the prompts/training. Too much inherent ambiguity. I’m interested to see what tools (like ControlNets with SD) are developed to overcome this.


I was super on board until I saw...the paw: https://player.vimeo.com/video/913131059?h=afe5567f31&badge=...

Exciting for the potential this creates, but scary for the social implications (e.g., this will make trial law nearly impossible).


If I understand trial law correctly, the rules of evidence already prohibit introducing a video at trial without proving where it came from (for example, testimony from a security guard that a given video came from a given security camera).

But social media has no rules of evidence. Already I see AI-generated images as illustrations on many conspiracy theory posts. People's resistance to believing images and videos from sketchy sources is going to have to increase very fast (especially for images and videos that they agree with).


All the more reason why we need to rely on the courts and not the mob justice (in the social sense) which has become popular over the last several years.


Nothing will change. Confirmation bias junkies already accept far worse fakes. People who use trusted sources will continue doing so. Bumping the quantity/quality of fabricated horseshit won't move the needle.


Wow. If I saw this clip a year ago I wouldn't think, "The image generator fucked up," I'd just think that a CG effects artist deliberately tweaked an existing real-world video.


Yeah, if that gets cleaned up (one would expect it to in time), this is going to change a lot.


How does one cope with this?

- Disruptions like this happen to every industry every now and then, just not usually at the level of "communicating with people with words and pictures". Anduril and SpaceX disrupted the defense contractors and United Launch Alliance; someone here who worked for a defense contractor or ULA and was affected by that might attest to the feeling.

- There will be plenty of opportunity to innovate. Industries are being created right now. People probably felt the same way when they saw HTTP on their screens for the first time. So don't think your career or life's worth of work is minuscule; it's just a moving target, so adapt & learn.

- The devil is in the details. When a bunch of large SaaS behemoths created enterprise software, an army of contractors and consultants grew up to support the glue that was ETL. A lot of work remains to be done. It will just be a more imaginative glue.


I would be willing to bet $10,000 that the average person's life will not be changed in any significant way by this technology in the next 10 years. Will there be some VFX disruption in Hollywood and games? Sure, maybe some. It's not a cure for cancer. It's not AGI. It's not earth shattering. It is fun and interesting though.


"by this technology" does a lot of heavy lifting. Look at the pace of AI development and extrapolate 10 years.


Relevant XKCD : https://xkcd.com/605/


Not really. We have way more data points than one on AI development. It has been incremental progress for more than a decade.


Totally agree with you.

Most of the responses in this thread remind me of why I don't typically go into the comment section of these announcements. It's way too easy to fall into the trap set by the doomsday-predicting armchair experts, who make it sound like we're on the brink of some apocalypse. But anyone attempting to predict the future right now is wasting time at best, or intentionally fear mongering at worst.

Sure, for all we know, OpenAI might just drop the AGI bomb on us one day. But wasting time worrying about all the "what ifs" doesn't help anyone.

Like you said, there is so much work out there to be done, _even if_ AGI has been achieved. Not to get sidetracked from your original comment, but I've seen AGI repeatedly mentioned in this thread. It's really all just noise until proven otherwise.

Build, adapt, and learn. So much opportunity is out there.


> But wasting time worrying about all the "what ifs" doesn't help anyone.

Worrying about the "what ifs" is all we have as a species. If we don't worry about how to stop global warming, or how we can prevent a nuclear holocaust, these things become far more likely.

If OpenAI drops an AGI bomb on us then there's a good chance that's it for us. From there it will just be a matter of time before a rogue AGI or a human working with an AGI causes mass destruction. This is every bit as dangerous as nuclear weapons - if not more dangerous - yet people seem unable to take the matter as seriously as it needs to be taken.

I fear millions of people will need to die or tens of millions will need to be made unemployable before we even begin to start asking the right questions.


Isn't the alternative worse though? We could try to shut Pandora's box and continue to worsen the situation gradually and never start asking the right questions. Isn't that a recipe for even more hardship overall, just spread out a bit more evenly?

It seems like maybe it's time for the devil we don't know.


We live in a golden age. Worldwide poverty is at historic lows. Billions of people don't have to worry about where their next meal is coming from or whether they'll have a roof over their head. Billions of people have access to more knowledge and entertainment options than anyone had 100 years ago.

This is not the time to risk it all.


Staying the course is risking it all. We've built a system of incentives which is asleep at the wheel and heading towards a cliff. If we don't find a different way to coordinate our aggregate behavior - one that acknowledges and avoids existential threats - then this golden age will be a short one.


Maybe. But I'm wary of the argument "we need to lean into the existential threat of AI because of those other existential threats over there that haven't arrived yet but definitely will".

It all depends on what exactly you mean by those other threats, of course. I'm a natural pessimist and I see threats everywhere, but I've also learned I can overestimate them. I've been worried about nuclear proliferation for the last 40 years, and I'm more worried about it than ever, but we haven't had another nuclear war yet.

