
10 to 20 years sounds wildly pessimistic

In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models.

https://www.youtube.com/watch?v=LXJ-yLiktDU
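A toy sketch of the frame math here, not anything taken from the linked clip: it assumes a hypothetical 30 fps (so ~5 s of occlusion is ~150 frames) and shows one crude way you could score whether an image region "survives" an occlusion, using synthetic pixel data and a made-up threshold.

```python
import numpy as np

FPS = 30                        # assumed frame rate, not stated in the clip
occlusion_seconds = 5
frame_gap = FPS * occlusion_seconds   # ~150 frames between the two views

# Synthetic stand-ins for the same image region sampled before/after occlusion.
rng = np.random.default_rng(0)
before = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
after_consistent = before + rng.normal(0, 2.0, size=before.shape)   # region preserved, tiny noise
after_regenerated = rng.integers(0, 256, size=before.shape).astype(np.float64)  # region redrawn

def mean_abs_diff(a, b):
    """Crude per-pixel consistency score: 0 means identical regions."""
    return float(np.mean(np.abs(a - b)))

THRESHOLD = 10.0                # made-up cutoff for "same region"
print(frame_gap)                                          # 150
print(mean_abs_diff(before, after_consistent) < THRESHOLD)    # True
print(mean_abs_diff(before, after_regenerated) < THRESHOLD)   # False
```

A real check would compare tracked regions across actual decoded frames (and something perceptual like SSIM rather than raw pixel differences), but the shape of the argument is the same: consistency across an occlusion is measurable, not just eyeballed.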

Buckle in, the train is moving way faster than that. I wouldn't be surprised if this is solved in the next few generations of video generators. The first generation is already doing very well.




Did you watch the video? It is completely different after the dragon goes past. There's still a flag there, but everything else changed. Even the stores in the background changed, and the mass of people is completely different, with no hint of anyone having moved there, etc.

You always get this from AI enthusiasts: they come and post "proof" that disproves their own point.


I'm not GP, but going over that video I'm actually having a hard time finding any detail that is present before the dragon obscures it and does not either exit frame right when the camera pans left slightly near the end, or re-appear with reasonably crisp detail after the dragon gets out of the way.

Most of the mob of people are indistinct, but there is a woman in a lime green coat who is visible, then obstructed by the dragon twice (beard and ribbon), and reappears fine. Unfortunately, when the dragon fully moves past she has been lost to frame right.

There is another person in black holding a red satchel which is visible both before and after the dragon has passed.

Nothing about the storefronts appears to change. The complex sign full of Chinese text (which might be gibberish text: it's highly stylized and I don't know Chinese) appears to survive the dragon passing without any changes even to the individual ideograms.

There is also a red box shaped like a Chinese paper lantern with a single gold ideogram on it at the store entrance, which spends most of the video obscured by the dragon and is still in the same location after it passes (though video artifacting makes it more challenging to verify that the ideogram is unchanged, it certainly does not appear substantially different).

What detail are you seeing that is different before and after the obstruction?


> What detail are you seeing that is different before and after the obstruction?

First frame: a guy in a blue hat next to a flag. That flag and the guy are then gone afterwards.

The two flags near the wall are gone; there is something triangular there, but there were two flags before the dragon went past.

Then there's the fact that the crowd is six people deep after the dragon went past, while it was just four people deep before; it is way more crowded.

Instead of the flag that was there before the dragon, it put in two more flags afterwards, far more to the left.

Around the third second, a guy was out of frame for a few frames and suddenly gained a blue scarf. After the dragon went by he turned into a woman. Next to that person was a guy with a blue cap; he completely disappears.

> Most of the mob of people are indistinct

No they aren't: they are mostly distinct, and basically all of them change. If you ignore that the entire mob totally changes in number, appearance, and position, sure, it is pretty good (except it forgot the flags). But how can you ignore the mob when we are talking about the model remembering details? The wall is much less information dense than the mob, so it is much easier for the model to remember; the difficulty is in the mob.

> but there is a woman in a lime green coat who is visible,

She was just out of frame for a fraction of a second, not for the big stretch where the dragon moves past. The guy in the blue jacket and blue cap behind her disappears, though, or merges with another person and becomes a woman with a muffler after the dragon moves past.

So in the end some broad strokes were kept, and only in a very tiny part of the image that was there both before and after the dragon moved past; it was far from a whole image with full details. Almost all details are wrong.

Maybe he meant that the house looked mostly the same. I agree the upper parts do, but I looked at the windows and they were completely different: they are full of people's heads after the dragon moved past, while before it was just clean walls.


We are looking at first-generation tech, and pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene. The prominent features are present. The model clearly shows the ability to go beyond "image-to-image" rendering.

If you want to be right because you can find any difference, sure. You win. But also completely missed the point.


> pretty much every human would recognize the "before dragon" scene as being the same place as the "after dragon" scene

Not in a game where those were enemies. It completely changed what they are and how many there are; people would notice such a massive change instantly if they looked away and there were suddenly 50% more enemies.

> The model clearly shows the ability to go beyond "image-to-image" rendering.

I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image: it is no harder to draw a straight pencil with something covering part of it than to draw a scene with something covering it for a while.
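The "same problem, one more axis" framing can be made concrete with tensor shapes. A minimal sketch, assuming an arbitrary 512x512 RGB resolution and a ~5 s clip at 30 fps (both numbers are illustrative, not from the thread): a video is just an image tensor with a time dimension prepended, so the raw data the model must keep consistent grows linearly with clip length.

```python
import numpy as np

H, W, C = 512, 512, 3     # illustrative image resolution
T = 150                   # ~5 s at an assumed 30 fps

image = np.zeros((H, W, C), dtype=np.uint8)      # one frame: (H, W, C)
video = np.zeros((T, H, W, C), dtype=np.uint8)   # clip: same tensor + time axis

# Raw data scales linearly with clip length.
print(video.nbytes // image.nbytes)   # 150
```

Of course, the modeling difficulty doesn't scale that gently: keeping details coherent across the time axis is exactly the consistency problem being argued about above.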

But still, even though it is that simple, these models are really bad at it, because it requires very large models and a lot of compute. So I just extrapolated from their current known abilities, as you demonstrated there, to estimate roughly how long it will take until we can even have consistent short videos.

Note that video won't follow the same progression as images: the early image models were very small and we quickly scaled up there, while for video we are already starting with really scaled-up models and have to wait for compute to get cheaper/faster the slow way.

> But also completely missed the point.

You completely missed my point, or you changed your point afterwards. My point was that current models can only remember little bits under such circumstances, and that to remember a whole scene they need to be massively larger. Almost all details in the scene you showed were missed; the broad strokes are there, but to keep the details around you need an exponentially larger model.



