Stable Video Diffusion (stability.ai)
1330 points by roborovskis 9 months ago | 302 comments



In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.

I vaguely understand how text-to-image works, and so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly able the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.


> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

I feel like we're close too, but for another reason.

For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

And the scene shall be sent into Blender and you'll click on a button and have an actual rendering made by Blender, with correct lighting.

Wanna move that bicycle? Move it in the 3D scene exactly where you want.

That is coming.

And for audio it's the same: why generate an audio file when soon models shall be able to generate the various tracks, with all the instruments and whatnots, allowing to create the audio file?

That is coming too.


> you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I'm always confused why I don't hear more about projects going in this direction. Controlnets are great, but there's still quite a lot of hallucination and other tiny mistakes that a skilled human would never make.


Blender files are dramatically more complex than any image format, which are basically all just 2D arrays of 3-value vectors. The blender filetype uses a weird DNA/RNA struct system that would probably require its own training run.

More on the Blender file format: https://fossies.org/linux/blender/doc/blender_file_format/my...


But surely you wouldn't try to emit that format directly, but rather some higher level scene description? Or even just a set of instructions for how to manipulate the UI to create the imagined scene?
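
A minimal sketch of what such a higher-level scene description could look like, in Python. Everything here (the `SceneObject` and `Scene` classes, the field names, the `move` helper) is hypothetical and only for illustration; it is not Blender's file format or API. The idea is that a model would emit or edit a small structured document, and a separate exporter would turn that into a real scene.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SceneObject:
    # Hypothetical fields; a real scene would also need materials, rotation, scale, etc.
    name: str
    kind: str                      # e.g. "mesh", "light", "camera"
    position: tuple = (0.0, 0.0, 0.0)


@dataclass
class Scene:
    objects: list = field(default_factory=list)

    def move(self, name, dx=0.0, dy=0.0, dz=0.0):
        """Translate the named object; raises KeyError if it is not in the scene."""
        for obj in self.objects:
            if obj.name == name:
                x, y, z = obj.position
                obj.position = (x + dx, y + dy, z + dz)
                return obj
        raise KeyError(name)

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


scene = Scene([
    SceneObject("bicycle", "mesh", (2.0, 0.0, 0.0)),
    SceneObject("sun", "light", (0.0, 0.0, 10.0)),
])

# "Move the bicycle to the left" becomes an explicit, inspectable edit.
scene.move("bicycle", dx=-3.0)
print(scene.to_json())
```

Because the description is plain structured data, a failed or odd edit can be inspected and corrected by hand before anything is rendered.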


It sure feels weird to me as well, that GenAI is always supposed to be end-to-end with everything done inside NN blackbox. No one seems to be doing image output as SVG or .ai.


Imo the thinking is that whenever humans have tried to pre-process or feature-engineer a solution or tried to find clever priors in the past, massive self-supervised-learning enabled, coarsely architected, data-crunching NNs got better results in the end. So, many researchers / industry data scientists may just be disinclined to put effort into something that is doomed to be irrelevant in a few years. (And, of course, with every abstraction you will lose some information that may bear more importance than initially thought)


The way that website builders using GenAI work is they have a LLM generate the copy, then find a template that matches that and fill it out. This basically means the "visual creativity" part is done by a human, as the templates are made and reviewed by a human.

LLMs are good at writing copy that sounds accurate and creative enough, and there are known techniques to improve that (such as generating an outline first, then generating each section separately). If you then give them a list of templates, and written examples of what they are used for, the LLM is able to pick one that's a suitable match. But this is all just probability, there's no real creativity here.

Earlier this year I played around with trying to have GPT-3 directly output an SVG given a prompt for a simple design task (a poster for a school sports day), and the results were pretty bad. It was able to generate a syntactically correct SVG, but the design was terrible. Think using #F00 and #0F0 as colours, placing elements outside the screen boundaries, layering elements so they are overlapping.

This was before GPT-4, so it would be interesting to repeat that now. Given the success people are having with GPT-4V, I feel that it could just be a matter of needing to train a model to do this specific task.


There is a fundamental disconnect between industry and academia here.


Over the last 10 years of industry work, I'd say about 20% of my time has been format shifting, or parsing half baked undocumented formats that change when I'm not paying attention.

That pretty much matches my experience working with NNs and LLMs.


I've seen this but producing Python scripts that you run in Blender, e.g. https://www.youtube.com/watch?v=x60zHw_z4NM (but I saw something marginally more impressive, not sure where though!)


My god that is an irritating video style, "AI woweee!"


Yeah I'd imagine that's the best way. Lots of LLMs can generate workable Python code too, so code that jives with Blender's Python API doesn't seem like too much of a leap.

The only trick is that there has to be enough Blender Python code to train the LLM on.


Maybe something like OpenSCAD is a good middle ground. Procedural code-like format for specifying 3D objects that can then be converted and imported in Blender.


I tried all the AI stuff that I could on OpenSCAD.

While it generates a lot of code that initially makes sense, when you use the code, you get a jumbled block.


This. I think the problem is that LLMs really struggle with 3D scene understanding, so what you would need to do is generate code that generates code.

But also I suspect there just isn't that much openscad code in the training data, and the semantics are different enough to python or any of the other languages that are well-represented that it struggles.


Scene layouts, models and their attributes are a result of user input (ok and sometimes program output). One avenue to take there would be to train on input expecting an output. Like teaching a model to draw instead of generate images.. which in a sense we already did by broadly painting out silhouettes and then rendering details.


Voxel files could be a simpler step for 3D images.


> I'm always confused why I don't hear more about projects going in this direction.

Probably because they aren't as advanced and the demos aren't as impressive to nontechnical audiences who don't understand the implications: there’s lots of work on text-to-3d-model generation, and even plugins for some stable diffusion UIs (e.g., MotionDiff for ComfyUI).


I think the bottleneck is data

For single 3D object the biggest dataset is ObjaverseXL with 10M samples

For full 3D scenes you could at best get ~1000 scenes with datasets like ScanNet I guess

Text2Image models are trained on datasets with 5 billion samples


Oh, I don't know about that. Working in feature film animation, studios have gargantuan model libraries from current and past projects, with a good number (over half) never used by a production but created as part of some production's world building. Plus, generative modeling has been very popular for quite a few years. I don't think getting more 3D models than they could use is a real issue for anyone serious.


Where can you find those? I'm in the same situation as him, I've never heard of a 3d dataset better than objaverse XL.

Got a public dataset?


These are not public datasets, but with some social engineering I bet one could get access.

I've not worked in VFX for a while, but when I did the modeling departments at multiple studios had giant libraries of completed geometries for every project they ever did, plus even larger libraries of all the pieces and parts they use as generic lego geometry whenever they need something new.

Every 3D modeler I know has their own personal libraries of things they'd made as well as their own "lego sets" of pieces and parts and generative geometry tools they use when making new things.

Now this is just a guess, but do you know anyone going through one of those video game schools? I wager the schools have big model libraries for the students as well. Hell, I bet Ringling and Sheridan (the two Harvards of Animation) have colossally sized model libraries for use by their students. Contact them.


There's a lot of issues with it, but perhaps the biggest is that there aren't just troves of easily scrapable and digestible 3D models lying around on the internet to train on top of like we have with text, images, and video.

Almost all of the generative 3D models you see are actually generative image models that essentially (very crude simplification) perform something like photogrammetry to generate a 3D model - 'does this 3D object, rendered from 25 different views, match the text prompt as evaluated by this model trained on text-image pairs'?

This is a shitty way to generate 3D models, and it's why they almost all look kind of malformed.


If reinforcement learning were farther along, you could have it learn to reproduce scenes as 3D models. Each episode's task is to mimic an image, each step is a command mutating the scene (adding a polygon, or rotating the camera, etc.), and the reward signal is image similarity. You can even start by training it with synthetic data: generate small random scenes and make them increasingly sophisticated, then later switch over to trying to mimic images.

You wouldn't need any models to learn from. But my intuition is that RL is still quite weak, and that the model would flounder after learning to mimic background color and placing a few spheres.
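
A toy sketch of the setup described above, in plain Python: each episode tries to mimic a target image, each step is one command that mutates the scene, and the reward is an image-similarity score. The rectangle-only "scene", the 8x8 canvas, and all names (`render`, `similarity`, `MimicEnv`) are assumptions made up for illustration; there is no learning agent here, only the environment it would interact with.

```python
import random

SIZE = 8  # toy canvas is SIZE x SIZE grayscale pixels


def render(scene):
    """Rasterize a list of (x, y, w, h, shade) rectangles onto a blank canvas."""
    canvas = [[0.0] * SIZE for _ in range(SIZE)]
    for x, y, w, h, shade in scene:
        for row in range(y, min(y + h, SIZE)):
            for col in range(x, min(x + w, SIZE)):
                canvas[row][col] = shade
    return canvas


def similarity(a, b):
    """Negative mean squared error: 0.0 is a perfect match, lower is worse."""
    total = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return -total / (SIZE * SIZE)


class MimicEnv:
    """Each episode: reproduce `target` by adding rectangles one step at a time."""

    def __init__(self, target):
        self.target = target
        self.scene = []

    def step(self, action):
        # An action is one scene mutation; here the only command is "add rectangle".
        self.scene.append(action)
        return similarity(render(self.scene), self.target)


# Synthetic data: the target is itself rendered from a small random scene.
rng = random.Random(0)
hidden = [(rng.randrange(4), rng.randrange(4), 3, 3, 1.0)]
env = MimicEnv(render(hidden))

blank_reward = similarity(render([]), env.target)
reward = env.step(hidden[0])  # an "oracle" action that happens to be correct
print(blank_reward, reward)
```

Making the synthetic scenes "increasingly sophisticated" would amount to growing `hidden` over time; the hard part, as the comment says, is whether an RL agent can actually learn a good policy from this reward.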



From my very clueless perspective, it seems very possible to train an AI to use Blender to create images in a mostly unsupervised way.

So we could have something to convert AI-generated image output into 3D scenes without having to explicitly train the "creative" AI for that.

Probably much more viable, because the quantity of 3D models out in the wild is far far lower than that of bitmap images.


I think this recent Gaussian Splatting technique could end up working really well for generative models, at least once there is a big corpus of high quality scenes to train on. Seems almost ideal for the task because it gets photorealistic results from any angle, but in a sparse, data efficient way, and it doesn’t require a separate rendering pipeline.


One was on the front page the other day, I’ll search for a link


I assume because it's still extremely early.


> However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I agree with this philosophy - Teach the AI to work with the same tools the human does. We already have a lot of human experts to refer to. Training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. natural language instructions) that can be nuanced precisely and to the point of imperceptibility with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails part way through, it's actually possible for a human operator to step in and attempt recovery.


Nah I disagree, this feels like a glorification of the process not the end result. Just because having the 3D model in the scene with all the lighting makes the end result feel more solid to you because you feel you can see the work that's going into it.

In the end diffusion technology can make a more realistic image faster than a rendering engine can.

I feel pretty strongly that this pipeline will be the foundation for most of the next decade of graphics, and that making things by hand in 3D will become extremely niche. Let's face it: as anyone who has worked in 3D knows, it's tedious, it's time-consuming, it takes large teams, and it's not even well paid.

The future is just tools that give us better controls and every frame will be coming from latent space not simulated photons.

I say this as someone who had done 3D professionally in the past.


Nah, I agree with GP. Who didn't suggest making 3D scenes by hand, but the opposite: create those 3D scenes using the generative method, use ray-tracing or the like to render the image. Maybe have another pass through a model to apply any touch-ups to make it more gritty and less artificial. This way things can stay consistent and sane, avoiding all those flaws which are so easy to spot today.


I know exactly what OP suggested but why are you both glorifying the fact there is a 3D scene graph made in the middle and then slower rendering at the end when the tech can just go from the first thing to a better finished thing?


Because it just can't. And it won't. It can't even reliably produce consistent shadows in a still image, so when we talk video with a moving camera, all bets are off. Creating flawless movie simulations through a dynamic and rich 3D world requires an ability to internally represent that scene with a level of accuracy which is beyond what we can hope generative models to achieve, even with the gargantuan amount of GPU-power behind ChatGPT, for example. ChatGPT, may I remind you, can't even properly simulate large-ish multiplications. I think you may need to slightly recalibrate your expectations for generative tech here.


I find that very unlikely. LLMs seem capable of simulating human intuition, but not great at simulating real complex physics. Human intuition of how a scene “should” look isn’t always the effect you want to create, and is rarely accurate, I'm guessing.


> LLMs seem capable of simulating human intuition, but not great at simulating real complex physics.

Diffusion models aren't LLMs (they may use something similar as their text encoder layer) and they simulate their training corpus, which usually isn't selected solely for physical fidelity, because that's not actually the sole criterion for visual imagery outside of what is created by diffusion models.


Huh fair enough. I mean they are large models based on language but I see your point. Even though everything you said is true, I still believe there’s a place for human-constructed logically-explicit simulations and functions. In general, and in visual arts.


>For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

The question is whether 99% of the audience would even care...


Of course they would. The internet spent a solid month laughing at the Sonic the Hedgehog movie because Sonic had weird-looking teeth.


Since that movie did well and spawned 2 sequels, the real conclusion is that the viewers didn't really care.

As for "the internet", there will always be some small part of it which will obsess and/or laugh over anything; that doesn't mean they represent anything significant, not even when they're vocal.


Viewers did care: the teeth got changed before the movie was released. And, I don't know if you missed it, but it wasn't just one niche of the internet commenting on his teeth. The "outrage" went mainstream; even dentists were making hit-pieces on Sonic's teeth. I'm not gonna lie, it was amazing marketing for the movie, intentional or not.


No they laughed at it because it looked awful in every single way


What's your reasoning for feeling that we're close?


We do it for text, audio and bitmapped images. A 3D scene file format is no different, you could train a model to output a blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?


>3D scene file format is no different

Not in theory, but the level of complexity is way higher and the amount of data available is much smaller.

Compare bitmaps to this: https://fossies.org/linux/blender/doc/blender_file_format/my...


Also the level of fault tolerance... if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your json is a bit blurry you have problems.


You can do "constrained decoding" on a code model which keeps it grammatically correct.

But we haven't gotten diffusion working well for text/code, so generating long files is a problem.
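
A toy sketch of constrained decoding, to make the idea concrete. At every step the decoder asks a grammar which tokens are legal and lets only those compete, so the output cannot become malformed even if the model prefers an illegal token. The "grammar" here only covers nested JSON lists of zeros, and `fake_logits` is a made-up stand-in for a real model's scores; both are assumptions for illustration, not how any particular library does it.

```python
import json


def allowed_next(prefix):
    """Return the tokens that keep `prefix` a valid start of a JSON list of zeros."""
    if not prefix:
        return {"["}
    depth = prefix.count("[") - prefix.count("]")
    last = prefix[-1]
    if last == "[":
        return {"[", "0", "]"}
    if last == ",":
        return {"[", "0"}
    if depth == 0:          # the outermost list just closed
        return {"<eos>"}
    return {",", "]"}       # after a value inside a list


def fake_logits(prefix):
    """Stand-in for a language model: fixed scores, with "]" rising over time."""
    return {"[": 1.0, "0": 3.0, ",": 2.5, "]": 0.6 * len(prefix), "<eos>": 5.0}


def constrained_greedy(max_steps=50):
    prefix = []
    for _ in range(max_steps):
        scores = fake_logits(prefix)
        legal = allowed_next(prefix)
        # Mask: only legal tokens compete, however much the model likes the others.
        token = max(legal, key=lambda t: scores[t])
        if token == "<eos>":
            break
        prefix.append(token)
    return "".join(prefix)


text = constrained_greedy()
print(text, json.loads(text))
```

Without the mask, this fake model would pick `<eos>` immediately and produce an empty, invalid document. Note that the mask only guarantees syntax: it says nothing about whether the content is any good, and hitting `max_steps` could still leave the output unfinished.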


Recent results for code diffusion here: https://www.microsoft.com/en-us/research/publication/codefus...

I'm not experienced enough to validate their claims, but I love the choice of languages to evaluate on:

> Python, Bash and Excel conditional formatting rules.



Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex, and the way you consume it is tangential to the rendering of it we use to interpret. It is a collection of arbitrary data structures.

We’ll need a new approach for this kind of problem


> Text, audio, and bitmapped images are data. Numbers and tokens.

> A 3D scene is vastly more complex

3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)


As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as fixed sized arrays representing images.


Text gen, one of the things you contrast 3d to, similarly isn't fixed size (capped in most models, but not fixed.)

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3d.


Text is a standard sized embedding vector that gets passed one at a time to an LLM. All tokens have the same shape. Each token is processed one at a time. All tokens also have a pre defined order. It is very different and vastly simpler.

Serializing 3D models as text is not going to work for anything beyond trivial cases.


That indeed sounds like a very plausible solution -- working with AI on the level of scene definitions, model geometries etc.

However, 3D is just one approach to rendering visuals. There are so many other styles and methods how people create images, and if I understand correctly, we can do image-to-text to analyze image content, as well as text-to-image to generate it - regardless of the original method (3d render or paintbrush or camera lens). There are some "fuzzy primitives" in the layers there that translate to the visual elements.

I'm hoping we see "editors" that let us manipulate / edit / iterate over generated images in terms of those.


Not that I’m against the described 3d way, but personally I don’t care about light and shadows until it’s so bad that I do. This obsession with realism is irrational in video games. In real life people don’t understand why light works like this or like that. We just accept it. And if you ask someone to paint how it should work, the result is rarely physical but acceptable. It literally doesn’t matter until it’s very bad.


This isn't coming, it's already here: https://github.com/gsgen3d/gsgen Yes, it's just 3D models for now, but it can generate whole scenes; it's just not great at it yet. The tech is there, it just needs to improve.


Are you working on all that?


Probably not. But there does seem to be a clear path to it.

The main issue is going to be having the right dataset. You basically need to record user actions in something like Blender (e.g., moving a model of a bike to the left of a scene), match them to a text description of the action (e.g., "move bike to the left"), and match those to before/after snapshots of the resulting file format.

You need a whole metric fuckton of these.

After that, you train your model to produce those 3d scene files instead of image bitmaps.

You can do this for a lot of other tasks. These general purpose models can learn anything that you can usefully represent in data.

I can imagine AGI being, at least in part, a large set of these purpose trained models. Heck, maybe our brains work this way. When we learn to throw a ball, we train a model in a subset of our brain to do just this and then this model is called on by our general consciousness when needed.

Sorry, I'm just rambling here but it's very exciting stuff.


The hard part of AGI is the self-training and few examples. Your parents didn't attach strings to your body and puppeteer you through a few hundred thousand games of baseball. And the humans that invented baseball had zero training data to go on.


Your body is a result of a billion year old evolutionary optimization process. GPT-4 was trained from scratch in a few months.


I have for some time been planning to do a 'Wikipedia for AI' (even bought a domain), where people could contribute all sorts of these skills (not only 3D video, but also manual skills, or anything). Given the current climate of 'AI will save/doom us', and that users would in some sense be training their own replacements, I don't know how much love such a site would have, though.


Excellent point.

Perhaps a more computationally expensive but better looking method will be to pull all objects in the scene from a 3D model library, then programmatically set the scene and render it.


I am guessing it will be similar to inpainting in normal Stable Diffusion, which is easy when using the workflow feature in the InvokeAI UI.


Thanks! this is exactly what I have been thinking, only you've expressed it much more eloquently than I would be able.


Where is the training data coming from?


we're working on this if you want to give it a try - dream3d.com


You should put a demo on the landing page


just redid the ux and making a new one, but here's a quick example: https://www.loom.com/share/fa84ba92d7144179ac17ece9bf7fbd99


Emu edit should be exactly what you're looking for: https://ai.meta.com/blog/emu-text-to-video-generation-image-...


It doesn’t look like the code for that is available anywhere though?


I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.

Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.


Yes that's exactly what I'm referring to! It feels as if there is no context continuity between the attempts.


> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Emu can do that.

The bluejay/toronto thing may be addressable later (I suspect via more detailed annotations a la dalle3) - these current video models are highly focused on figuring out temporal coherence


I wonder what other odd connections are made due to city-name almost certainly being the most common word next to sportsball-name.

Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?


Adobe is doing some great work here in my opinion in terms of building AI tools that make sense for artist workflows. This "sneak peek" demo from the recent Adobe Max conference is pretty much exactly what you described, actually better because you can just click on an object in the image and drag it.

See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...


Right, that's embedded directly into the existing workflow. Looks like a very powerful feature indeed.


Makes me wonder if they train their data on everything anyone has ever uploaded to Creative Cloud.


> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Nearly all of the available models have this, even the highly commercialized ones like in Adobe Firefly and Canva, it’s called inpainting in most tools.


I think that's more "inpainting" where the existing software solution uses AI to accelerate certain image editing tasks. I was looking for whole-image manipulation at the "conceptual" level.


They have this. Inpainting is just a subset of the image-to-image workflow and you don't have to provide a region if you want to do whole-image manipulation.
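
A rough sketch of the relationship described above, using nested lists as stand-in images. The mask decides which pixels the generated content may replace; with a mask of all ones nothing of the original is protected, which is the whole-image case. Real pipelines typically apply this kind of blending in latent space during denoising rather than once on pixels, and `generated` here is just a placeholder for a model's output, so treat this as an illustration only.

```python
def blend(original, generated, mask):
    """Keep `original` where mask is 0 and take `generated` where mask is 1."""
    return [
        [m * g + (1 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2]]
generated = [[0.9, 0.9, 0.9],   # stand-in for the model's output for the whole frame
             [0.9, 0.9, 0.9]]

# Inpainting: only the right-hand column may change.
region_mask = [[0, 0, 1],
               [0, 0, 1]]
# Whole-image manipulation: every pixel may change.
full_mask = [[1, 1, 1],
             [1, 1, 1]]

inpainted = blend(original, generated, region_mask)
img2img = blend(original, generated, full_mask)
print(inpainted)
print(img2img)
```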


Nice eye!

As for your last question yes that exists. There are two models from Meta that do exactly this, instruction based iteration on photos, Emu Edit[0], and videos, Emu Video[1].

There's also LLaVA-Interactive[2] for photos where you can even chat with the model about the current image.

[0]: https://emu-edit.metademolab.com/

[1]: https://emu-video.metademolab.com/

[2]: https://llava-vl.github.io/llava-interactive/


> they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Yeah. They're not "videos" so much as images that move around a bit.

This doesn't really look any better than those Midjourney + RunwayML videos we had half a year ago.

>Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Google has a model called Phenaki that supposedly allows for that kind of stuff. But the public can't use it so it's hard to say how good it actually is.


Have you seen fal.ai/dynamic where you can perform image to image synthesis (basically editing an existing image with the help of diffusion process) using LCMs to provide a real time UI?


I don’t spend a lot of time keeping up with the space, but I could have sworn I’ve seen a demo that allowed you to iterate in the way you’re suggesting. Maybe someone else can link it.


My guess is you're thinking of InstructPix2Pix[1], with prompts like "make the sky green" or "replace the fruits with cake"

[1] https://github.com/timothybrooks/instruct-pix2pix


This is exactly it!



It's not exactly like GP described (e.g. move bike to the left) but there is a more advanced SD technique called inpainting [0] that allows you to manually recompose parts of the image, e.g. to fix bad eyes and hands.

[0] https://stable-diffusion-art.com/inpainting_basics/


I also wonder if the model takes capitalization into account. Capitalized "Blue Jays" seems more likely to reference the sports team; the birds would be lowercase.


I see that as a reference to the AI generated Toronto Blue Jays advertisement gone wrong that went viral earlier this year. https://www.blogto.com/sports_play/2023/06/ai-generated-toro...


I wondered similarly whether the astronaut's weird gait was because it was kind of "moonwalking" on the moon.


Assuming we can post links, you mean this video: https://youtu.be/G7mihAy691g?si=o2KCmR2Uh_97UQ0N

Also, maybe you can't edit post facto, but when you give prompts, would you not be able to say : two blue jays but no CN tower


Yes, it's called a negative prompt. Idk if txt2video has it, but both LLMs and Stable Diffusion have it, so I'd assume it's good to go.


Haven't implemented negative prompts yet, but from what I can tell it's as simple as subtracting from the prompt in embedding space.
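
For reference, in the Stable Diffusion implementations I know of, the subtraction happens on the model's noise predictions rather than on the text embeddings themselves: the negative prompt takes the place of the empty prompt in classifier-free guidance. A minimal sketch of that combination step, using plain lists in place of tensors; the function name and all numbers are made up for illustration.

```python
def guided_noise(noise_positive, noise_negative, guidance_scale):
    """Classifier-free guidance: move away from the negative prediction, toward the positive."""
    return [
        neg + guidance_scale * (pos - neg)
        for pos, neg in zip(noise_positive, noise_negative)
    ]


# Made-up per-element noise predictions from the same model for two prompts.
noise_for_prompt = [1.0, 0.0, 2.0]      # e.g. "two blue jays"
noise_for_negative = [0.5, 0.5, 2.0]    # e.g. "CN Tower"

print(guided_noise(noise_for_prompt, noise_for_negative, guidance_scale=7.5))
```

With `guidance_scale=1.0` the result is just the positive prediction; larger values push further away from whatever the negative prompt describes.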


Not exactly what you're asking for, but AnimateDiff has introduced creating gifs to SD. Still takes quite a bit of tweaking IME.


that sounds like v0 by vercel, you can iterate just like you asked, to combine that type of iteration with video would be really awesome


> sportsball

This is not the flex you think it is. You don't have to like sports, but snarking on people who do doesn't make you intellectual, it just makes you come across as a douchebag, no different than a sports fan making fun of "D&D nerds" or something.


This has become a colloquial term for describing all sports, not the insult you're perceiving it to be.

Rather than projecting your own hangups and calling people names, try instead assuming that they're not trying to offend you personally and are just using common vernacular.


If only there was an existing way to refer to sports generally! And OP was referring to a specific sport (baseball), not sports generally.


The Rogers Centre hosts baseball, football, and basketball games - so in this case "sportsball" was just a shorthand for all these ball sports.


Would you get incensed by "petrolhead", "greenfingers" or "trekkie"? Is that what you choose to be emotional about?


You’re really not helping the “sports fans are combative thugs” stereotype by going off on an insult tirade over an innocent word.


Ah, Mr. Kettle, I see you've met my friend, Mr. Pot!


The rate of progress in ML this past year has been breathtaking.

I can’t wait to see what people do with this once controlnet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of stable diffusion typically involves lots of manual post processing to remove flicker.


What was the big “unlock” that allowed so much progress this past year?

I ask as a noob in this area.


I think these are the main drivers behind the progress:

- Unsupervised learning techniques, e.g. transformers and diffusion models. You need unsupervised techniques in order to utilize enough data. There have been other unsupervised techniques in the past, e.g. GANs, but they don't work as well.

- Massive amounts of training data.

- The belief that training these models will produce something valuable. It costs between hundreds of thousands to millions of dollars to train these models. The people doing the training need to believe they're going to get something interesting out at the end. More and more people and teams are starting to see training a large model as something worth pursuing.

- Better GPUs, which enables training larger models.

- Honestly the fall of crypto probably also contributed, because miners were eating a lot of GPU time.


I don't think transformers or diffusion models are inherently "unsupervised", especially not the way they're used in Stable Diffusion and related models (which are very much trained in a supervised fashion). I agree with the rest of your points though.


Generative methods have usually been considered unsupervised.

You're right that conditional generation starts to blur the lines though.


"Generative AI" is a misnomer; it's not the same kind of "generative" as the G in GAN.

While you're right about GANs, diffusion models and transformers are most commonly trained with supervised learning.


I disagree. Diffusion models are trained to generate the probability distribution of their training dataset, like other generative models (GAN, VAE, etc). The fact that the architecture is a Transformer (or a CNN with attention like in Stable Diffusion) is orthogonal to the generative vs discriminative divide.

Unsupervised is a confusing term as there is always an underlying loss being optimized and working as a supervision signal, even for good old kmeans. But generative models are generally considered to be part of unsupervised methods.


self-supervised is a better term


> The belief that training these models will produce something valuable

Exactly. The growth in the next decade is going to be unimaginable because now governments and MNCs believe that there can realistically be progress made in this field.


One factor is that Stable Diffusion and ChatGPT were released within about 3 months of each other – August 22, 2022 and November 30, 2022, respectively. That brought a lot of attention and excitement to the field. More excitement, more people, more work being done, more progress.

Of course those two releases didn't fall out of the sky.


Dalle 2 also went viral around the same time


Stable diffusion open source release and llama release


But what technically allowed for so much progress?

There’s been open source AI/ML for 20+ years.

Nothing comes close to the massive milestones over the past year.


Attention, transformers, diffusion. Prior image synthesis techniques - i.e. GANs - had problems that made it difficult to scale them up, whereas the current techniques seem to have no limit other than the amount of RAM in your GPU.


> But what technically allowed for so much progress?

The availability of GPU compute time. Up until the Russian invasion into Ukraine, interest rates were low AF so everyone and their dog thought it would be a cool idea to mine one or another sort of shitcoin. Once rising interest rates killed that business model for good, miners dumped their GPUs on the open market, and an awful lot of cloud computing capacity suddenly went free.


The "Attention Is All You Need" paper from Google, which may end up being a larger contribution to society than Google search, is foundational.

Emad Mostaque and his investment in stable diffusion, and his decision to release it to the world.

I'm sure there are others, but those are the two that stick out to me.


Public availability of large transformer-based foundation models trained at great expense, which is what OP is referring to, is definitely unprecedented.


People figuring out how to train and scale newer architectures (like transformers) effectively, to be wildly larger than ever before.

Take AlexNet - the major "oh shit" moment in image classification.

It had an absolutely mind-blowing number of parameters at a whopping 62 million.

Holy shit, what a large network, right?

Absolutely unprecedented.

Now, for language models, anything under 1B parameters is a toy that barely works.

Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

A whole lot of smart people had to do a bunch of cool stuff to be able to keep networks working at all at that size.

Many, many times over the years, people have tried to make larger networks, which fail to converge (read: learn to do something useful) in all sorts of crazy ways.

At this size, it's also expensive to train these things from scratch, and takes a shit-ton of data, so research/discovery of new things is slow and difficult.

But, we kind of climbed over a cliff, and now things are absolutely taking off in all the fields around this kind of stuff.

Take a look at XTTSv2 for example, a leading open source text-to-speech model. It uses multiple models in its architecture, but one of them is GPT.

There are a few key models that are still being used in a bunch of different modalities like CLIP, U-Net, GPT, etc. or similar variants. When they were released / made available, people jumped on them and started experimenting.


> Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

SDXL is 6.6 billion (that figure is for the base-plus-refiner pipeline; the base model alone is about 3.5 billion).


There has been massive progress in ML every year since 2013, partly due to better GPUs and lots of training data. Many are only taking notice now that it is in products but it wasn't that long ago there was skepticism on HN even when software like Codex existed in 2021.


Where do you want to start? The Internet collecting and structuring the world's knowledge into a few key repositories? The focus on GPUs in gaming and then the crypto market, creating a suite of libraries dedicated to hard scaling math? Or the miniaturization and focus on energy efficiency due to phones, making scaled training cost-effective? Finally, the papers released by Google and co, who didn't seem to recognise quite how easy it would be to build on and replicate them. Nothing was unlocked, apart from a lot of people suddenly noticing how doable all this already was.


I mean, you probably didn't pay much attention to battery capacity before phones, laptops, and electric cars, right? Battery capacity has probably increased though at some rate before you paid attention. It's just when something actually becomes relevant that we notice.

Not that more advances don't happen with sustained hype, just there's some sort of tipping point involving usefulness based either on improvement of the thing in question or its utility elsewhere.


MS subsidizing it with 10 billion USD and (un)healthy contempt towards copyright.


ControlNet has been adapted to video today; the issue is that it's very slow. Haven't you seen the insane quality of videos on civitai?


I have seen them; the workflows to create those videos are extremely labor intensive. ControlNet lets you maintain poses between frames, but it doesn’t solve the temporal consistency of small details.


People use animatediff’s motion module (or other models that have cross frame attention layers). Consistency is close to being solved.


Temporal consistency is improving, but “close to being solved” is very optimistic.


No I think we’re actually close. My source is I’m working on this problem and the incredible progress of our tiny 3 person team at drip.art (http://api.drip.art) - we can generate a lot of frames that are consistent, and with interpolation between them, smoothly restyle even long videos. Cross-frame attention works for most cases, it just needs to be scaled up.

And that’s just for diffusion focused approaches like ours. There are probably other techniques from the token flow or nerf family of approaches close to breakout levels of quality, tons of talented researchers working on that too.
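
As a rough illustration of what "cross-frame attention" can mean, here is a minimal pure-Python sketch of one common variant: every frame's tokens act as queries, while keys and values are taken from a single anchor frame, so all frames draw their content from the same source. The function names and the choice of the first frame as anchor are assumptions for illustration; this is not the drip.art or AnimateDiff implementation.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))
        ])
    return out

def cross_frame_attention(frames, anchor=0):
    """Each frame queries the anchor frame's tokens (used as keys and values)."""
    ref = frames[anchor]
    return [attention(frame, ref, ref) for frame in frames]

# Two frames with two 2-d tokens each; the anchor frame's tokens are identical,
# so every output token collapses to that shared anchor token.
frames = [
    [[1.0, 2.0], [1.0, 2.0]],
    [[5.0, -3.0], [0.0, 0.0]],
]
result = cross_frame_attention(frames)
```

Because each output token is a weighted average of anchor-frame values, details that exist in the anchor frame get reused across frames instead of being re-invented per frame, which is the intuition behind the consistency gain.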


The demo clips on the site are cool, but when you call it a "solved problem," I'd expect to see panning, rotating, and zooming within a cohesive scene with multiple subjects.


Thanks for checking it out! We’re certainly not done yet, but much of what you ask is possible or will be soon on the modeling side and we need tools to expose that to a sane workflow in traditional video editors.


Once a video can show a person twisting round, and their belt buckle is the same at the end as it was at the start of the turn, it's solved. VFX pipelines need consistency. TC is a long, long way from being solved, except by hitching it to 3DMMs and SMPL models (and even then, the results are not fabulous yet).


Hopefully this new model will be a step beyond what you can do with animatediff


> Haven't you seen the insane quality of videos on civitai?

I have not, so I went to https://civitai.com/ which I guess is what you're talking about? But I cannot find a single video there, just images and models.



Not sure I'd call that "insane quality", more like neat prototypes. I'm excited where things will be in the future, but clearly it has a long way to go.


https://civitai.com/images

Go there, in the top right of the content area it has two drop-downs: Most Reactions | Filters

Under filters, change the media setting to video.

Civitai has a notoriously poor layout for finding/browsing things unfortunately.


A small percentage of the images are animations. This is (for obvious reasons) particularly common for images used on the catalog pages for animation-related tools and models, but it's also not uncommon for (AnimateDiff-based, mostly) animations to be used to demo the output of other models.


Yeah, solving the flickering problem and achieving temporal consistency will be the key to realize the full potential of generative video models.

Right now, AnimateDiff is leading the way in consistency but I'm really excited to see what people will do with this new model.


> but the real utility of this will be the temporal consistency

The main utility will be misinformation


I understand the magnitude of innovation that's going on here. But I still feel like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos under these constraints. (Imagine trying to edit the blue jays to get the perfect view.)

Since videos are rarely consumed raw, what if this becomes a pipeline in Blender instead? (Blender the 3d software). Now the video becomes a complete scene with all the key elements of the text input animated. You have your textures, you have your animation, you have your camera, you have all the objects in place. We can even have the render engine in the pipeline to increase the speed of video generation.

It may sound like I'm complaining, but I'm just making a feature request...


What would solve all these issues is full generation of 3D models that we hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these LLMs.


> I’ve been advocating for a solid LiDAR camera on the iPhone

What do you mean by “advocating”? The iPhone has had a LiDAR camera since 2020.


That's probably why they qualified with "solid", the iPhone's LiDAR camera is quite terrible.


Yes, exactly.


we're working on this - dream3d.com


I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.


The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.


It also somewhat depends on open legal questions like whether models are copyrightable and, if so, whether model outputs are derivative works of the model. Suppose that models are not copyrightable, due to their not being the product of human creativity (this is debatable). Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree. Agreement can happen explicitly by pressing a button, or potentially implicitly just by downloading the model from them, if the terms are clearly disclosed beforehand. But if someone decides on their own (not induced by you in any way) to violate the contract by uploading it somewhere else, and you passively download it from there, then you may be in the clear.


> Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree.

I don't think it's possible to invent copyright-like rights.


Why not? Two willing parties can agree to bind themselves to all kinds of obligations in a contract as long as they're not explicitly illegal.

Copyleft is an example of someone successfully inventing a copyright-like right by bootstrapping off existing copyright with a specially engineered contract.


There are a few problems:

1) You and I invent our own private "copyright" for data (which is not copyrightable)

2) Everything is fine until my wife walks up to my computer and makes a copy of the data. She's not bound by our private "copyright." She doesn't even know it exists, and shares the data with her bestie.

And... our private pseudo-copyright is dead.

Also: Licenses are not the same as contracts. There are times when something can be both, one, or the other. But there are a lot of limits on how far they reach. The output of a program is rarely copyrightable by the author (as opposed to the user).


> my wife walks up to my computer and makes a copy of the data

As you agreed to in our contract, you now need to compensate me for the damage caused by your failure to prevent unauthorized third-party access. Of course you're free to attempt to recover the sum you have to pay me from your wife.

> The output of a program is rarely copyrightable by the author (as opposed to the user).

The author of the program can make it a condition of letting the user use the program that the user has to assign all copyright to the author of the program, kind of like "By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed." https://www.ycombinator.com/legal/


Okay. Now put yourself in the position of Microsoft, using this scheme for Windows. We'll pretend real copyright doesn't exist, and we've got your hare-brained scheme. This is how it plays out:

1) You have a $1T product.

2) My wife leaks it, or a burglar does. I am a typical consumer, with say, a $20k net worth.

You have two choices:

1) Sue me, recover $20k, and be down $1T (minus $20k, plus litigation fees), and get the press of ruining the life of some innocent random person

2) Not sue me. Be down $1T (including the $20k) .

And yes, the author of a program can put whatever conditions they want into the license: "By using this program, you agree to transfer $1M into my bank account in bitcoin, to give me your first-born baby, to swear fealty to me, and to give me your wife in servitude." A court can then read those conditions, have a good laugh, and not enforce them. There are very clear limits on what a court will enforce in licenses (and contracts), and owning the output of a program, and barring exceptional circumstance, courts will not enforce them:

https://www.lexology.com/library/detail.aspx?g=eb52567a-2104...

This is why programmers should learn basic law, not treat it as computer code, and consult lawyers when issues come up. Read by a lawyer, a license or contract with an unenforceable clause is as good as having no such clause.


> There are very clear limits on what a court will enforce in licenses (and contracts), and owning the output of a program, and barring exceptional circumstance, courts will not enforce them:

It seems to me that the cases in the article you linked involved the author of the program arguing that their copyright automatically extended to the output without any extra contractual provisions concerning copyright assignment, so I don't think they can be used as precedent regarding the enforceability of such clauses.


> The author of the program can make it a condition of letting the user use the program that the user has to assign all copyright to the author of the program

I think it is quite likely a court would find that unconscionable.


It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows licensing or WinRAR licensing. Lots and lots of people have pirated Windows or just buy some cheap keys off eBay, but none of them in their right mind would use anything like that at their company.

In the same way, you can easily violate any "non-commercial" clauses of models like this one as a private person or as some tiny startup, but companies that decide to use them for their business will more likely just go and pay.

So it's possible to ignore the license, but the legal and financial risks are not worth it for businesses.


I've heard companies also intentionally do not go after individuals pirating software e.g., Adobe Photoshop - it benefits them to have students pirate and skill up on their software and then enter companies that buy Photoshop because their employees know it, over locking down and having those students, and then the businesses, switch to open source.


I'm sure there are plenty of other examples, but in my personal experience this was Autodesk's strategy with AutoCAD. Get market saturation by being extremely light on piracy. Then, once you're the only one standing, lower the boom. I remember, it was almost like flipping a switch on a single DAY in the mid-00's when they went from totally lax on unpaid users to suing the bejeezus out of anyone who they had good enough documentation on.

One smart thing they did was they'd check the online job listings and if a firm advertised for needing AutoCAD experience they'd check their licenses. I knew firms who got calls from Autodesk legal the DAY AFTER posting an opening.


Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.

At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

As an example, see the Creative Commons license, ShareAlike clause:

> If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.


> At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

you can put whatever you want in a contract, doesn't mean it's enforceable


Do you have a link for the VS Community terms you're describing? What I've found is directly contradictory: "Any individual developer can use Visual Studio Community to create their own free or paid apps." From https://visualstudio.microsoft.com/vs/community/


Enterprise organizations are not allowed to use VS Community for commercial purposes:

> In enterprise organizations (meaning those with >250 PCs or >$1 Million US Dollars in annual revenue), no use is permitted beyond the open source, academic research, and classroom learning environment scenarios described above.


I see, thanks!


So, there's a few different things interacting here that are a little confusing.

First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].

Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.

In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.

The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.

Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, inasmuch as the AI companies don't own their training data[3].

However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.

If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.

[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.

The actual facts of the case are even more insane - basically a company trying to sue former employees for fixing its customers' computers. Imagine if Apple sued Louis Rossmann for pirating macOS every time he turned on a customer laptop. The only reason why they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.

[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!

[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.

[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.


> Licenses are, at least under US law, contracts

Not necessarily; gratuitous licenses are not contracts. Licenses which happen to also meet the requirements for contracts (or be embedded in agreements that do) are contracts or components of contracts, but that's not all licenses.


If a company trains the model from scratch, on its own dataset, could the resulting model be used commercially?


Nobody claimed otherwise?


There are sites that make Stable Diffusion-derived models available, along with GPU resources, and they sell the service of generating images from the models. The company isn't permitting that use, and it seems that they could find violators and shut them down.


Fantasy.ai was subject to controversy for attempting to license models.


They're not enforceable.


A software licence can definitely govern who can use it and what they can do with it.

> An image isn't GPL'd because it was produced with GIMP.

That's because of how the GPL is written, not because of some limitation of software licences.


Fascinating leap forward.

It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.
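
To make the distinction concrete, here is a toy 1-D sketch of the two step rules, loosely modeled on the way k-diffusion-style samplers are usually written (recalled from memory, so treat the exact formulas as an assumption). The "model" is a hypothetical ideal denoiser for data that is exactly -1 or +1, not a real network.

```python
import math
import random

def denoise(x, sigma):
    # Ideal denoiser for toy data that is -1 or +1 with equal probability.
    return math.tanh(x / sigma**2)

def euler_step(x, sigma, sigma_next):
    d = (x - denoise(x, sigma)) / sigma  # estimated derivative dx/dsigma
    return x + d * (sigma_next - sigma)  # purely deterministic update

def euler_ancestral_step(x, sigma, sigma_next, rng):
    # Step down further than sigma_next, then add fresh noise back in.
    sigma_up = min(
        sigma_next,
        math.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2),
    )
    sigma_down = math.sqrt(sigma_next**2 - sigma_up**2)
    d = (x - denoise(x, sigma)) / sigma
    return x + d * (sigma_down - sigma) + rng.gauss(0.0, 1.0) * sigma_up

def sample(x, sigmas, step, **kwargs):
    # Walk the noise schedule from the largest sigma down to zero.
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = step(x, sigma, sigma_next, **kwargs)
    return x

sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.1, 0.0]
euler_a = sample(3.0, sigmas, euler_step)
euler_b = sample(3.0, sigmas, euler_step)
ancestral_a = sample(3.0, sigmas, euler_ancestral_step, rng=random.Random(0))
ancestral_b = sample(3.0, sigmas, euler_ancestral_step, rng=random.Random(0))
```

With plain Euler the result depends only on the starting noise and the schedule. The ancestral variant also depends on the random draws made at every step, so it is reproducible only when the random generator is seeded the same way, and changing the number of steps changes which noise gets injected.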

I assume to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like rotate <object> to the left by 5 degrees, etc. I like the phrase another commenter used, "temporal consistency".

Edit: Indeed the special sauce is "temporal layers". [0]

> Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets

[0] https://stability.ai/research/stable-video-diffusion-scaling...


The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.

So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straight-forward text2img + Animatediff:

https://www.reddit.com/r/StableDiffusion/comments/180no09/on...

Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.

Animatediff motion painting made a splash the other day:

https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...

It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.


This field moves so fast. Blink an eye and there is another new paper. This is really cool and the learning speed of us humans is insane! Really excited about using it for downstream tasks! I wonder how easy it is to integrate animatediff with this model?

Also, can someone benchmark it on M3 devices? It would be cool to see if it is worth getting one to run these diffusion inferences and development. If the M3 Pro can allow finetuning, it would be amazing to use it on downstream tasks!


It makes sense that they had to take out all of the cuts and fades from the training data to improve results.

In the background section of the research paper they mention “temporal convolution layers”, can anyone explain what that is? What sort of training data is the input to represent temporal states between images that make up a video? Or does that mean something else?


It means that instead of (only) doing convolution in spatial dimensions, it also(/instead) happens in the temporal dimension.

A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...

The "also" case is an example of 3D convolution, an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...
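
A minimal pure-Python sketch of the "instead" case, assuming a video stored as a list of frames where each frame is a flat list of pixel values (the function name and data layout are made up for illustration). Each pixel position is filtered independently along the time axis, with no mixing between neighbouring pixels:

```python
def temporal_conv(frames, kernel):
    """Slide `kernel` over consecutive frames, per pixel (no padding).

    As in most deep learning code this is technically cross-correlation;
    for the symmetric kernel used below the two are identical.
    """
    n_frames, k = len(frames), len(kernel)
    n_pixels = len(frames[0])
    out = []
    for start in range(n_frames - k + 1):
        out.append([
            sum(kernel[j] * frames[start + j][p] for j in range(k))
            for p in range(n_pixels)
        ])
    return out

# Toy video: 4 frames of 2 pixels. Pixel 0 brightens steadily, pixel 1 flickers.
video = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
smoothed = temporal_conv(video, [0.25, 0.5, 0.25])
print(smoothed)  # [[1.0, 0.5], [2.0, 0.5]]
```

The flickering pixel comes out constant while the steadily brightening one keeps its trend, which hints at why learned filters along the time axis help with frame-to-frame consistency. A 3D convolution (the "also" case) would additionally mix in spatial neighbours within each frame.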


I would assume it is something similar to joining multiple frames/attentions(?) in the channel dimension and then moving values around inside so the convolution has access to some channels from other video frames.

I was working on a similar idea a few years ago using this paper as a reference, and it worked extremely well for consistency, also helping with flicker. https://arxiv.org/abs/1811.08383
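
A toy sketch of that channel-shifting idea, as I understand it from the linked paper: some channels are shifted one frame forward in time, some one frame backward, and the rest are left alone, so an ordinary per-frame convolution afterwards sees a bit of the neighbouring frames. The function name, the list-based layout, and the zero fill at the clip boundaries are assumptions for illustration.

```python
def temporal_shift(frames, n_shift):
    """frames: list of T frames, each a list of C channel values.

    Channels [0, n_shift) take their value from the previous frame,
    channels [n_shift, 2 * n_shift) from the next frame, and the
    remaining channels are unchanged. Missing neighbours become 0.
    """
    n_frames, n_channels = len(frames), len(frames[0])
    if 2 * n_shift > n_channels:
        raise ValueError("not enough channels to shift in both directions")
    out = [list(frame) for frame in frames]
    for t in range(n_frames):
        for ch in range(n_shift):
            out[t][ch] = frames[t - 1][ch] if t > 0 else 0
        for ch in range(n_shift, 2 * n_shift):
            out[t][ch] = frames[t + 1][ch] if t < n_frames - 1 else 0
    return out

frames = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
shifted = temporal_shift(frames, 1)
print(shifted)  # [[0, 21, 12], [10, 31, 22], [20, 0, 32]]
```

The shift itself has no learned weights; the information exchange between frames comes for free, and the layers that follow do the actual mixing.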


This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.

It's crazy to see this level of progress in just a bit over half a year.

[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...


Looks like I'm still good for my bet with some friends that before 2028 a team of 5-10 people will create a blockbuster style movie that today costs 100+ million USD on a shoestring budget and we won't be able to tell.


I wouldn't bet either way.

Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.

I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.

[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …


Your comment reminded me of this: https://www.reddit.com/r/gaming/comments/ktyr1/unreal_yes_th...

Many more examples, of course.


Yup, that castle flyby, those reflections. I remember being mesmerised by the sequence as a teenager.

Big quality improvement over Marathon 2 on a mid-90s Mac, which itself was a substantial boost over the Commodore 64 and NES I'd been playing on before that.


> Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic

Whenever I saw anybody calling those graphics "photorealistic", I always had to roll my eyes and question if those people were legally blind.

Like, c'mon. Yeah, they could be large leaps ahead of the previous generation, but photorealistic? Get real.

Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.


> Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

Looking just at the videos (because I don't have time to play the latest games any more and even if I did it's unreleased), I think that "Unrecord" is also something I can't distinguish from a filmed cinematic experience[0]: https://store.steampowered.com/app/2381520/Unrecord/

Though there are still caveats even there, as the pixelated faces are almost certainly necessary given the state of the art; and because cinematic experiences are themselves fake, I can't tell if the guns are "really-real" or "Hollywood".

Buuuuut… I thought much the same about Myst back in the day, and even the bits that stayed impressive for years (the fancy bedroom in the Stoneship age), don't stand out any more. Riven was better, but even that's not really realistic now. I think I did manage to fool my GCSE art teacher at the time with a printed screenshot from Riven, but that might just have been because printers were bad at everything.


Unrecord looks amazing, I forgot about that one.

IMO, though, the lighting in the indoor scenes is just not quite right. There's something uncanny valley about it to me. When the flashlight shines, it's clearly still a computer render to my eyes.

The outdoor shots, though, definitely look flawless.


I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web and generates the movie. This sidecar file would contain the "director" of the movie, with camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized remixed blockbusters.

On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.


> thousand Beatles out there. It still requires talent and effort

As well as marketing.


It'll happen, but I think you're early. 2038 for sure, unless something drastic happens to stop it (or is forced to happen.)


I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!


The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.


As per the reviews - it will be hard to say, as both positive and negative takes will be uploaded by ChatGPT bots (or its myriad descendants).


"I wonder what the Rotten Tomatoes page will look like"

Surely it will be written using machine vision and LLMs!


Definitely a big first for benchmarks. After that hyper personalized content/media generated on-demand


What I am really looking forward to is some Star Trek style holodeck, but I guess we will start with it in VR headsets first.

Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"


VRAM requirements are big for this launch. We're hosting this for free at https://app.decoherence.co/stablevideo. Disclaimer: Google log-in required to help us reduce spam.


How big is big?


40GB although hearing reports 3090 can do low frame counts


it's worth paying your subscription just for these free videos. would those have the watermark removed if I go "Basic"?


A seemingly off topic question, but with enough compute and optimization, could you eventually simulate “reality”?

Like, at this point, what are the technical counters to the assertion that our world is a simulation?


(disclaimer: worked in the sim industry for 25 years, still active in terms of physics-based rendering).

First off, there are zero technical proofs that we are in a sim, just a number of philosophical arguments.

In practical terms, we cannot yet simulate a single human cell at the molecular level, given the massive number of interactions that occur every microsecond. Simulating our entire universe is not technically possible within the lifetime of our universe, according to our current understanding of computation and physics. You either have to assume that ‘the sim’ is very narrowly focussed in scope and fidelity, and / or that the outer universe that hosts ‘the sim’ has laws of physics that are essentially magic from our perspective. In which case the simulation hypothesis is essentially a religious argument, where the creator typed 'let there be light' into his computer. If there isn't such a creator, the sim hypothesis 'merely' suggests that our universe, at its lowest levels, looks somewhat computational, which is an entirely different argument.


I don't think you would need to simulate the entire universe, just enough of it that the consciousness receiving sense data can't encounter any missing info or "glitches" in the metaphorical matrix. Still hard of course, but substantially less compute intensive than every molecule in the universe.


And if you’re in charge of the simulation, you get to decide how many “consciousnesses” there are, constraining them to be within your available compute. Maybe that’s ~8 billion — maybe it’s 1. Yeah, I’m feeling pretty Boltzmann-ish right now…


> but substantially less compute intensive than every molecule in the universe

Very true, but to me this view of the universe and one's existence within it as a sort of second-rate solipsist bodge isn't a satisfyingly profound answer to the question of life the universe and everything.

Although put like that it explains quite a lot.

[Edit] There is also a sense in which the sim-as-a-focussed-mini-universe view is even less falsifiable, because sim proponents address any doubt about the sim by moving the goal posts to accommodate what they claim is actually achievable by the putative creator/hacker on Planet Tharg or similar.


And you don't have to simulate it in real time, maybe 1 second here takes years or centuries to simulate outside the simulation. It's not like we'd have any way to tell.


These are all open questions in philosophy of mind. Nobody knows what causes consciousness/qualia, so nobody knows if it's substrate dependent or not, and therefore nobody knows if it can be simulated in a computer. And if it can, nobody knows what type of computer is required for consciousness to be a property of the resulting simulation.


Maybe something like quantum mechanics is an "optimization" of the sim, i.e. the sim doesn't actually compute the locations, spin etc. of subatomic particles but instead just uses probabilities to simulate it. Only when a consciousness decides to look more closely does it retroactively decide what those properties really were.

Kind of like how video games won't render the full resolution textures when the character is far away or zoomed out.

I'm sure I'm not the first person to have thought this.
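
For anyone who hasn't seen the video game trick spelled out, here is a rough Python sketch of distance-based level of detail. The function names, the distance threshold and the "halve the resolution each time the distance doubles" rule are assumptions made for illustration, not how any particular engine does it.

```python
import math


def mip_level(distance: float, full_detail_distance: float = 1.0, max_level: int = 8) -> int:
    """Pick a texture detail level: 0 is full resolution, each level halves it.

    Hypothetical rule: detail drops one level every time the viewer's
    distance doubles beyond `full_detail_distance`.
    """
    if distance <= full_detail_distance:
        return 0
    level = int(math.log2(distance / full_detail_distance))
    return min(level, max_level)


def texture_size(base_size: int, level: int) -> int:
    """Side length (in texels) of a square texture at the given detail level."""
    return max(1, base_size >> level)


# The closer the observer, the more detail gets computed.
for d in (0.5, 2.0, 10.0, 1000.0):
    lvl = mip_level(d)
    print(f"distance={d:>7} -> level {lvl}, {texture_size(4096, lvl)}px texture")
```

The point of the analogy is only that the expensive detail is computed when something is close enough to need it.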


The brain does simulate reality in the sense that what you experience isn't direct sensory input, but more like a dream being generated to predict what it thinks is happening based on conflicting and imperfect sensory input.


To illustrate your point, an easily accessible example of this is how the second hand on clocks appears to freeze for longer than a second when you quickly glance at it. The brain is predicting/interpolating what it expects to see, creating the illusion of a delay.

https://www.popsci.com/how-time-seems-to-stop/


Example vision: comes in from the optic nerve warped and upside down and as small patches of high resolution captured by the eyes zigzagging across the visual field (saccades), all of which is assembled and integrated into a coherent field of vision by our trusty old grey blob.


Why does it matter? Not trying to dismiss, but truly, what would it mean to you if you could somehow verify the "simulation"?

If it would mean something drastic to you, I would be very curious to hear your preexisting existential beliefs/commitments.

People say this sometimes and it's kind of slowly revealed to me that it's just a new kind of geocentrism: it's not just a simulation people have in mind, but one where earth/humans are centered, and the rest of the universe is just for the benefit of "our" part of the simulation.

Which is a fine theory I guess, but is also just essentially wanting God to exist with extra steps!


> Like, at this point, what are the technical counters to the assertion that our world is a simulation?

How about this theory is neither verifiable nor falsifiable.


The general concept is not falsifiable, but many variations might be, or their inverse might be. E.g. the theory that we are not in a simulation would in general be falsifiable by finding an "escape" from a simulation and so showing we are in one (but not finding an escape of course tells us nothing).

It's not a very useful endeavour to worry about, but it can be fun to speculate about what might give rise to testable hypotheses and what that might tell us about the world.


There can be no technical counters to the assertion that our world is a simulation. If our world is a simulation, then the hardware/software that simulates it is outside of our world and its technical constitution is inaccessible to us.

It's purely a religious question. When humanity invented the wheel, religion described the world as a giant wheel rotating in cycles. When humanity invented books, religion described the world as a book, and God as its writer. When humanity invented complex mechanisms, religion described the world as a giant mechanism and God as a watchmaker. Then computers were invented, and you can guess what happened next.


A little too freshman's first bit off a bong for me. There are, of course, substantial differences between video and reality.

Let's steel-man — you mean 3D VR. Let's stipulate there's a headset today that renders 3D visually indistinguishable from reality. We're still short the other 4 senses

Much like faith, there's always a way to sort of escape the traps here and say "can you PROVE this is base reality"

The general technical argument against "brain in a vat being stimulated" would be the computation expense of doing such, but you can also write that off with the equivalent of foveated rendering but for all senses / entities


Actually it was already done by sentdex with GAN Theft Auto:

https://youtu.be/udPY5rQVoW0

To an extent...

PS: Video is 2 years old, but still really impressive.


That theory was never meant to be so airtight such that it 'needs' to be refuted.


I've been following this space very very closely and the killer feature would be to be able to generate these full featured videos for longer than a few seconds with consistently shaped "characters" (e.g., flowers, and grass, and houses, and cars, actors, etc.). Right now, it's not clear to me that this is achieving that objective. This feels like it could be great to create short GIFs, but at what cost?

To be clear, this remains wicked, wicked, wicked exciting.


I admit I'm ignorant about these models' inner workings, but I don't understand why text is the chosen input format for these models.

It was the same for image generation, where one needed to write text prompts to create the image, and it took stuff like img2img and ControlNet to allow things like controlling poses, inpainting, or having multiple prompts with masks controlling which part of the image is influenced by which prompt.


According to the GitHub repo this is an "image-to-video model". They tease of an upcoming "text to video" interface on the linked landing page, though. My guess is that interface will use a text-to-image model and then feed that into the image-to-video model.
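
To make that guess concrete, here is a minimal Python sketch of chaining the two stages. Everything in it is hypothetical: `Image`, `Video`, `text_to_video` and the two stub models are placeholder names standing in for real models, and nothing here reflects Stability's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Image:
    description: str  # stand-in for pixel data


@dataclass
class Video:
    first_frame: Image
    num_frames: int


def text_to_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_to_video: Callable[[Image], Video],
) -> Video:
    """Compose the two stages: the prompt yields a still, the still gets animated."""
    still = text_to_image(prompt)
    return image_to_video(still)


# Stub models so the sketch runs without any weights (purely illustrative).
def fake_text_to_image(prompt: str) -> Image:
    return Image(description=prompt)


def fake_image_to_video(image: Image) -> Video:
    return Video(first_frame=image, num_frames=8)  # frame count is an arbitrary placeholder


clip = text_to_video("a tree swaying in the wind", fake_text_to_image, fake_image_to_video)
print(clip)
```

If the interface really works this way, the text prompt would only influence the first frame, and the motion would come entirely from the image-to-video stage.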


Imago Dei? The Word is what is spoken when we create.

The input eventually becomes meanings mapped to reality.


Can this be used for porn?


Porn will be one of the main use cases for this technology. Porn sites pioneered video streaming technologies back in the day, and drove a lot of the innovation there.


The question reminded me of this classic: https://www.youtube.com/watch?v=YRgNOyCnbqg


Depends on whether trains, cars, and/or black cowboys tickle your fancy.



If it can't, someone will massage it until it can. Porn, and probably also stock video to sell to YouTubers.


The answer to that question is always "yes", regardless what "this" is.

Diffusion models for moving images are already used to a limited extent for this. And I'm sure it will be the use case, not just an edge case.


Nope, all commercial models are severely gated.


It's already posted to Unstable Diffusion discord so soon we'll know.

After all fine-tuning wouldn't take that long.


Very unusual comment.

I do not think so as the chance of constructing a fleshy eldritch horror is quite high.


How is that not the first question to ask? Porn has proven to be a fantastic litmus test of fast market penetration when it comes to new technologies.


Market what?


This is true. I was hoping my educated guess of the outcome would minimize the possibility of anyone attempting this. And yet, here we are - the only losing strategy in the technology sector is to not try at all.


No pun intended?


That didn't stop people using PornPen for images and it wouldn't stop them using something else for video.


> I do not think so as the chance of constructing a fleshy eldritch horror is quite high.

There is a market for everything!


A surprisingly large number of people are into fleshy eldritch horrors.


Has anyone managed to run the thing? I got the streamlit demo to start after fighting with pytorch, mamba, and pip for half an hour, but the demo runs out of GPU memory after a little while. I have 24GB on GPU on the machine I used, does it need more?


Yeah, got a 24GB 4090, try to reduce the number of frames decoded to something like 4 or 8. Although, keep in mind it caps the 24GB and goes to RAM (with the latest NVIDIA drivers).
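
Why fewer frames per decode helps, as a rough Python sketch: if the decoder only ever sees a few latents at a time, its peak memory scales with the chunk size rather than with the whole clip. `decode_in_chunks` and `toy_decoder` are made-up names for illustration and are not the demo's actual code.

```python
from typing import Callable, List, Sequence


def decode_in_chunks(
    latents: Sequence[int],
    decode_batch: Callable[[Sequence[int]], List[str]],
    chunk_size: int,
) -> List[str]:
    """Decode latents a few at a time so only `chunk_size` are in flight at once."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    frames: List[str] = []
    for start in range(0, len(latents), chunk_size):
        chunk = latents[start:start + chunk_size]
        frames.extend(decode_batch(chunk))
    return frames


# Toy decoder that records how many latents it was handed per call.
batch_sizes: List[int] = []


def toy_decoder(chunk: Sequence[int]) -> List[str]:
    batch_sizes.append(len(chunk))
    return [f"frame-{latent}" for latent in chunk]


frames = decode_in_chunks(list(range(14)), toy_decoder, chunk_size=4)
print(len(frames), "frames decoded, largest batch:", max(batch_sizes))
```

The trade-off is more decoder calls, so a smaller chunk is slower but fits in less memory.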


Oh yes it works, thanks!


Have heard from others attempting it that it needs 40GB, so basically an A100/A6000/H100 or other large card. Or an Apple Silicon Mac with a bunch of unified memory, I guess.


Alright thanks for the information. I will try to justify using one A100 for my "very important" research activities.


Give it a week.


Is the checkpoint default fp16 or fp32?


Very excited to play with this. Some of my latest experiments - https://www.jasonfletcher.info/vjloops/


We're hosting this free (no credit card needed) at https://app.decoherence.co/stablevideo Disclaimer: Google log-in required to help us reduce spam. Let me know what you think of it! It works best on landscape images from my tests.


Model weights (two variations, each 10GB) are available without waitlist/approval: https://huggingface.co/stabilityai/stable-video-diffusion-im...

The LICENSE is a special non-commercial one: https://huggingface.co/stabilityai/stable-video-diffusion-im...

It's unclear how exactly to run it easily: diffusers has video generation support now but need to see if it plugs in seamlessly.


It looks like the huggingface page links their github that seems to have python scripts to run these: https://github.com/Stability-AI/generative-models


Those scripts aren't as easy to use or iterate upon since they are CLI apps instead of a REPL like a Colab/Jupyter Notebook (although these models probably will not run in a normal Colab without shenanigans).

They can be hacked into a Jupyter Notebook but it's really not fun.


Regular reminder that it is very likely that model weights can't be copyrighted (and thus can't be licensed).


These are basically like animated postcards, like you often see now on loading screens in videogames. A single picture has been animated. Still a long shot from actual video.


"2 more papers down the line"...


It's funny that we still don't really have video wallpapers on most devices (I'm only aware of Wallpaper Engine on Windows)


Mplayer/MPV used to be able to play videos in the X root window like a wallpaper. No idea if it still works nowadays.


I had a video wallpaper on my Motorola Droid back in 2010.


and a battery life of...?

I do wonder if there have been any codec studies that measure power usage with respect to RAM


Soon the Hollywood strike won't even matter, won't need any of those jobs. Entire west coast economy obliterated.


Seems relatively unimpressive tbh - it's not really a video, and we've seen this kind of thing for a few months now


It seems like the breakthrough is that the video generating method is now baked into the model and generator. I've seen several fairly impressive AI animations as well, but until now, I assumed they were tediously cobbled together by hacking on the still-image SD models.


Once text-to-video is good enough and once text generation is good enough, we could legit actually have endless TV shows produced by individuals! We're probably still far away from that, but it is exciting to think about!

I think this will really open new ways and new doors to creativity and creative expression.


Question for anyone more familiar with this space: are there any high-quality tools which take an image and make it into a short video? For example, an image of a tree becomes a video of a tree swaying in the wind.

I have googled for it but mostly just get low quality web tools.


That's what this is


Hmm, for some reason I was understanding this as a text-to-video model. I’ll have to read this again.


Very soon, we will be able to change the storyline of a web series dynamically: a little more thrill, a little more comedy, changing characters' faces to match ours and others', all in 3D with a 360-degree view. How far are we from this? 5 years?


At least several decades, I’d say. This is a hugely complex, multifaceted problem. LLMs can’t even write half-decent screenplays yet.


Model chain:

Instance One: Act as a top tier Hollywood scenarist, use the publicly available data for emotional sentiment to generate a storyline, apply the well known archetypes from proven blockbusters for character development. Move to instance two.

Instance Two: Act as top tier producer. {insert generated prompt}. Move to instance three.

Instance Three: Generate Meta-humans and load personality traits. Move to instance four.

Instance Four: Act as a top tier director.{insert generated prompt}. Move to instance five.

Instance Five: Act as a top tier editor.{insert generated prompt}. Move to instance six.

Instance Six: Act as a top tier marketing and advertisement agency.{insert generated prompt}. Move to instance seven.

Instance Seven: Act as a top tier accountant, generate an interface to real-time ROI data and give me the results on an optimized timeline into my AI induced dream.

Personal GPT: Buy some stocks, diversify my portfolio, stock up on synthetic meat, bug-coke and Soma. Call my mom and tell her I made it.


Much like in static images, the subtle unintended imperfections are quite interesting to observe.

For example, the man in the cowboy hat seems like he is almost gagging. In the train video the tracks seem to be too wide while the train ice skates across them.


How much longer will it be until we can play "video games" which consist of user-input streamed to an AI that generates video output and streams it to the player's screen?


If you're willing to accept text based output, then text adventure style games and even simulating bash were possible using ChatGPT until OpenAI nerfed it.


Stability.ai, please make sure your board is sane.


A default glitch effect in the video can make the distortions a "feature not a bug"


Finally! Now that this is out, I can finally start adding proper video widgets to CushyStudio https://github.com/rvion/CushyStudio#readme . Really hope I can get in touch with Stability AI people soon. Maybe Hacker News will help.


Needs 40GB VRAM, down to 24GB by reducing the number of frames processed in parallel.


cannot join the waiting list (nor opt in for marketing newsletter), because the sign-up form checkboxes don't toggle on android mobile Chrome or Firefox.


Is this available in the stability API any time soon?


And thanks to the porn community on Civit.ai!


How long until Replicate has this available?




Looks like there is a WIP here: https://replicate.com/lucataco/svd


Can't wait for these things to not suck


It's definitely pretty impressive already. If there could be some kind of "final pass" to remove the slightly glitchy generative artifacts, these look completely passable for simple .gif/.webm header images. Especially if they could be made to loop smoothly à la Snapchat's bounce filter.


This is gonna change everything


It's really not.

Don't get me wrong, this is insanely cool, but it's still a long way from good enough to be truly disruptive.


In a few years' time, teenagers will be consuming shows and films made by their peers, not by streaming providers. They'll forgive and perhaps even appreciate the technical imperfections for the sake of uncensored, original content that fits perfectly with their cultural identity.

Actually, when processing power catches up, I'm expecting a movie engine with well-defined characters, scenes, entities, etc., so people will be able to share mostly text-based scenarios to watch on their hardware players.


Similar to how all the kids today only play itch.io games thanks to Unity and Unreal dramatically lowering the bar of entry into game development.

Oh wait... No.

All it has done is create an environment where indie games are now assumed to be trash unless proven otherwise, making getting traction as a small developer orders of magnitude harder than it has ever been because their efforts are drowning in a sea of mediocrity.

That same thing is already starting to happen on youtube with AI content, and there's no reason for me to expect this going any other way.


It took ~2 years for my 10 year old daughter to get bored and give up the shitty user made roblox games and start playing on switch, steam or ps4.


They do that now (forget the name there's a popular one my niece uses to make animated comics, others do similar things in Minecraft etc), and have been doing that since forever - nearly 30 years ago my friends and I were scribbling comic panels into our notebooks and sharing them around class.


ms comic chat for the win


One year.

All of Hollywood falls.


Every time something like this is released someone comments how it’s going to blow up legacy studios. The only way you can possibly think that is that: 1-the studios themselves will somehow be prevented from using this tech themselves, and 2-that somehow customers will suddenly become amenable to low grade garbage movies. Hollywood already produces thousands of low grade B or C movies every year that cost fractions of what it costs to make a blockbuster. Those movies make almost nothing at the box office.

If anything, a deluge of cheap AI generated movies is going to lead to a flight to quality. The big studios will be more powerful because they will reap the productivity gains and use traditional techniques to smooth out the rough edges.


> 2-that somehow customers will suddenly become amenable to low grade garbage movies

People have been amenable to low grade garbage movies for a long, long time. See Adam Sandler's back catalog.


No offense, but this is absolutely delusional.

As long as people can "clock" content generated from these models, it will be treated by consumers as low-effort drivel, no matter how much actual artistic effort goes in the exercise. Only once these systems push through the threshold of being indistinguishable from artistry will all hell break loose, and we are still very far from that.

Paint-by-numbers low-effort market-driven stuff will take a hit for sure, but that's only a portion of the market, and frankly not one I'm going to be missing.


Very far, yes, but also in a fast moving field.

CGI in films used to be obvious all the time no matter how good the artists using it, now it's everywhere and only noticeable when that's the point; the gap from Tron to Fellowship of the Ring was 19.5 years.

My guess is the analogy here puts the quality of existing genAI somewhere near the equivalent of early TV CGI, given its use in one of the Marvel title sequences etc., but it is just an analogy and there's no guarantees of anything either way.


something unrelated improved over time so something else unrelated will also improve to whatever goal you've set in your mind

weird logic circles yall keep making to justify your beliefs, i mean the world is very easy like you just described if you completely strip all nuance and complexity

people used to believe at the start of the space race we'd have mars colonies by now because they looked at the rate of technological advancement from 1910 to 1970, from the first flight to landing on the moon; yet that didn't happen because everything doesn't follow the same repeatable patterns


First, lotta artists already upset with genAI and the impact it has.

Second, I literally wrote the same point you seem to think is a gotcha:

> it is just an analogy and there's no guarantees of anything either way


People also believed that recorded music would destroy the player piano industry and the market for piano rolls. Just because recorded music is cheaper doesn't mean that the audience will be willing to give up the actual sound of a piano being played.


Is it? How so?



