True video capability would entail describing a scene as a prompt and getting a video in return. Not interpolating between a handful of images as is being done now (not to discredit those).
This will be a huge game changer when it occurs. Whether it be for deep fake videos, creating custom content, or making a new season of your favorite tv show that was cancelled too early. The possibilities are endless.
This is probably not in the near future (i.e. this year), but I doubt it is very far off.
I am much more interested in an intermediary step. I would love to be able to use a tool like this to create a comic book. This is after all just static artwork which the tool already creates quite beautifully.
What it would need to be able to do to get from here to there is understand some concepts, the first being "characters". On Reddit there was a beautiful image that recently won first place in an art contest, and it's quite frustrated some of the art community. When I was looking at it I thought it was awesome, but wondered about the ability to create another hundred or so images in the same 'world' that the image was showing. I would want to give it a prompt like "tired old medieval knight with a mace and shield", have it create the character, name it "Tom" or something, and then feed it more prompts for that character, like "Tom is sitting in a forest brooding", and have it create the exact same character in a different context.
That would be pretty game changing for opening up amateur web comics to a large body of people who have ideas and stories to tell but no art skills to speak of - my stick characters are crooked :(
The result is bad though, for the same reason you can't generate video with it. Comic panels need to relate to each other; you can't simply make them out of random images. There aren't sufficient style controls to do that with current technology, even if Midjourney added in "textual inversion".
There is some work exploring that with Textual Inversion[1].
Another trick for approaching this problem is to specify the random seed. With a fixed seed, the same prompt generates the same image every time, without any randomness. If you then change the prompt, you get an image very similar to the first one, but with the variation included. Somebody used that to age a woman across 100 years[2] with quite stunning results. It even works with gender or style changes.
I recently saw a Twitter thread from last year where someone made a comic book with AI generated backgrounds. The characters were added in later, but it stuck with me as a very cool future use case.
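The seed trick can be illustrated with a toy sketch. Nothing here calls Stable Diffusion: `initial_noise` and `toy_generate` are hypothetical stand-ins built on Python's standard `random` module, only meant to show why a fixed seed with an unchanged prompt reproduces the same output, while a small prompt change gives a nearby variation.

```python
import random


def initial_noise(seed, size=8):
    """Hypothetical stand-in for the starting noise: fully determined by the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_generate(prompt, seed, size=8):
    """Toy 'generator' (an assumption, not a real model): noise plus a small prompt-dependent shift."""
    noise = initial_noise(seed, size)
    # Derive a small deterministic per-position shift from the prompt text.
    shift = [(sum(ord(c) for c in prompt) * (i + 1) % 7) / 70.0 for i in range(size)]
    return [n + s for n, s in zip(noise, shift)]


same_a = toy_generate("portrait of a woman, age 20", seed=1234)
same_b = toy_generate("portrait of a woman, age 20", seed=1234)
varied = toy_generate("portrait of a woman, age 80", seed=1234)
other_seed = toy_generate("portrait of a woman, age 20", seed=99)
```

Here `same_a` and `same_b` are identical, `varied` stays close to `same_a` because the noise is shared, and `other_seed` starts from entirely different noise.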
Imagine how fun sitting at a terminal in vim editing a 100 line 'script' for a short movie and getting rapid feedback back. I'm so excited about the future.
The possibilities are endless. "Insert Willy Wonka as Frodo's love interest, and Willy should join the major battles with Uzi machine guns, and his dialog should be as if he is an inner-city gang member."
I thought we'd never get image generation this fast. Last year it was 30 minutes per image. The Stable Diffusion folks are planning a 100 MB release of the image generator in Q1, which for sure would be real time. I actually suspect you can get something like that incredibly fast (even though all intuition says otherwise).
What is being referred to as "interpolation" here? As an outsider, isn't Stable Diffusion "interpolating" text into images/frames/video in a literal (if not technical) sense?
It's to be interpreted in the quasi-mathematical sense where you have images for frame A and frame B representing your data points. To interpolate between those frames, a flow of plausible images simulating the transition from A to B is generated.
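As a minimal sketch of that quasi-mathematical sense, the simplest possible interpolator is a per-pixel linear blend between frame A and frame B. The function names and pixel values below are illustrative assumptions; a learned interpolator would aim for plausible motion rather than the plain cross-fade shown here.

```python
def lerp_frames(frame_a, frame_b, t):
    """Blend two equally sized frames; t=0 gives frame_a, t=1 gives frame_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


def interpolate(frame_a, frame_b, steps):
    """Return `steps` in-between frames, excluding the two endpoints."""
    return [lerp_frames(frame_a, frame_b, i / (steps + 1)) for i in range(1, steps + 1)]


frame_a = [0.0, 0.0, 100.0]   # flattened grayscale pixels (illustrative values)
frame_b = [100.0, 50.0, 0.0]
in_between = interpolate(frame_a, frame_b, steps=3)
```

With `steps=3` the blend factors are 0.25, 0.5 and 0.75, so the middle frame is the exact average of A and B.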
Interpolation here meaning one smooth motion transition is all that is depicted. An entire episode of television requires things like cuts between scenes, possibly discontinuities like flashbacks, scenes that take place days, months, or even decades later, and characters should still look the same, but might be wearing different clothing, or grow a beard, or get really old but still have similar facial features and the same skin color. If one ages, they should all age about the same, unless it's a story with time travel or humanoid immortal characters that don't age.
I'm sure these types of capabilities will come at some point, but no current model can do it. It requires more than just projecting motion into a scene.
You could "hack it" by using a couple of other models as part of your pipeline, similar to how you sometimes have to use a GAN after SD to "fix" faces.
You could also put a language model on top of your prompting system, so "Gandalf kicking ass" gets translated into "Page XXX, Paragraph XX from LOTR".
> "Stable Diffusion was trained primarily on LAION-Aesthetics, a collection of subsets of the LAION 5B dataset, which is itself a subset of LAION-400M"
Err, no? As you can tell by the name itself, the 5B dataset is larger.
> "Unlike autoencoder-based deepfake content, or the human recreations that can be achieved by Neural Radiance Fields (NeRF) and Generative Adversarial Networks (GANs), diffusion-based systems learn to generate images by adding noise"
This is confused. Diffusion is orthogonal to NeRF. For example, here is a paper that uses both: https://arxiv.org/abs/2112.12390
> "Within days of release, the open sourced Stable Diffusion code and weights were packaged into a free Windows executable"
That's not how it became popular.
> "Additionally, at the time of writing, Google Research has just released a similar system called DreamBooth, which likewise ‘tokenizes’ a desired element into a distinct ‘object’."
The approaches are actually very different. DreamBooth uses fine-tuning, unlike textual inversion.
> This is confused. Diffusion is orthogonal to NeRF.
Also, diffusion models learn to remove noise (guided by a description of the undiffused image), not add it.
Though, it seems you don't have to use noise as the information destruction mechanism; blurring works, and I wonder if there isn't something better for animation.
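The point about removing rather than adding noise can be sketched as follows, assuming the common DDPM-style parameterization with Gaussian noise. The name `alpha_bar` and the "oracle" noise prediction are assumptions for illustration: in a real system a trained network predicts the noise from the noisy input and the prompt.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward process: mix clean data with Gaussian noise (fixed, nothing is learned here)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps


def remove_noise(x_t, predicted_eps, alpha_bar):
    """Reverse direction in closed form: recover the clean data from a noise prediction."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, predicted_eps)]


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x_t, true_eps = add_noise(x0, alpha_bar=0.3, rng=rng)

# A trained network would predict the noise from x_t (and the prompt).
# Here the true noise is passed in as a perfect "oracle" prediction (an assumption).
recovered = remove_noise(x_t, true_eps, alpha_bar=0.3)
```

The noising step has no learnable parts; everything the model learns lives in the noise prediction used by `remove_noise`.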
I'm not as excited by the demo as you are. It seems that they just detect foreground, key it and put an arbitrary picture (in this case from stable diff) on a skybox behind.
It seems closer to what Mac OS X Leopard's Photo Booth was able to do 15 years ago[1], than to a "Stable Diffusion for Video".
An image is already a 2-dimensional flattening of a 3-dimensional world. A generator of 2-dimensional images already has a disadvantage, since it has not learned anything about the three dimensions and is simply faking what other, real, 2-dimensional images look like, including all the artifacts that come from 3-dimensionality being flattened. But these artifacts will always be weird and not consistent with each other, since the image generator does not really know anything about the third dimension; it's just mimicking the artifacts it has learned by seeing them. Video is one additional dimension: time. I would guess that similar artifacts would also show up here, and be difficult to eliminate.
The current image generators certainly know about 3D internally. It's not put there explicitly but the wealth of knowledge they possess and the perfect lighting and shadowing and material reflections compared to the tiny amount of weights strongly suggest they have proper representations (just like humans by the way).
Even a small model in early training learns to do proper shadowing and lighting and mostly makes mistakes in the 3D high-level object space, like wrong number of wheels or legs or fingers and stuff like that.
I don't think that they consider 3d too much. Consider the "Girl with a Pearl Earring" image, that's meant to represent something that's state of the art:
The image might look consistent at first sight, but if you look closer, the dimensions are all over the place. They reflect the method by which they were created: reproducing observed patterns without deeper understanding.
You can find lots of buggy images with bugs in all possible ways. But the vast majority of the images have consistent shading, perspective, and even global illumination. Some of it can be done in "screen space" (as in some standard game 2D post-processing filters) but lots of them couldn't work this way.
At some low level, pixel patterns are rendered and I guess you could say that this is "reproducing observed patterns". Would you say that about a 3D game engine as well that does the same when it textures local regions of pixels? A network is layered for this particular reason. The lowest layer will have less "understanding" than the higher layers.
If you ask for something "reflective" or "in a hall of mirrors" it tends to work, though it often seems like a screen-space effect as it tends to draw the front of objects rather than their back in the reflection.
The perspective barely changes in the sphere. I'm not convinced that the AI implied 3d, lighting/shadows, and reflections from the dataset which is supposed to be, at most, just 2d projections of 3d environments. Maybe in the future when we feed it 3d (meta) information also, then we can expect some magic, but until then, I don't think so.
> It's not put there explicitly but the wealth of knowledge they possess and the perfect lighting and shadowing and material reflections compared to the tiny amount of weights strongly suggest they have proper representations
I don’t think they do, but I can’t prove it, either way.
> (just like humans by the way).
Oh, I don’t think humans have it either (mostly), since humans don’t have 3D vision; humans have the equivalent of wiggle stereoscopy¹, which gives some hints of depth, and humans have the additional advantage of the time dimension. People do have some intellectual capacity to reason about 3-dimensional shapes, and some people can even rotate things in their heads with ease. Blind people might have it too, since their concept of the world was not created by this pseudo-3D visual input. But mostly, people don’t think in 3D.
I think we can see this by looking at drawings by children. Child drawings are dominated by the concepts important to children: Faces, hands, etc. Concepts, not actual images. And it’s all in 2D, as was the majority of art for much of history.
I have to strongly disagree with you on this, I can walk around in my (small) city in my mind in 3D without any problems. Anybody who's ever had a dream would concur as well. It's absolutely not true that humans can't think in 3D.
The sensory input modalities and specifics do not control or limit the internal representations; a sufficiently capable neural network will extract the most efficient rep to predict the input data, which is moving 3D objects.
That it's difficult to "render" this to 2D by painting is not very surprising.
> I have to strongly disagree with you on this, I can walk around in my (small) city in my mind in 3D without any problems. Anybody who's ever had a dream would concur as well. It's absolutely not true that humans can't think in 3D.
Well, some people reportedly can, but it’s rare. You say you can imagine walking around your small city, and infer that you have the city conceptualized as a 3D object, and can rotate it at will, in any direction. But how do we know that you simply have not memorized your admittedly small city? Can you imagine what going around the city upside down (i.e. like walking on your hands) would look like, with ease? It should be just a simple rotation, right?
DCT is a linear transformation so the model ought to be able to learn it (and typically already does). If you're going to feed in some non-learned transformation it seems like it should be nonlinear to add value.
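The linearity claim can be checked numerically. This is a small sketch using an unnormalized DCT-II written with the standard library only; the helper name `dct2` and the sample vectors are illustrative.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, -1.0, 0.0, 2.0]
a, b = 2.0, -3.0

# Transform of a linear combination ...
lhs = dct2([a * xi + b * yi for xi, yi in zip(x, y)])
# ... equals the same linear combination of the transforms.
rhs = [a * u + b * v for u, v in zip(dct2(x), dct2(y))]

max_error = max(abs(l - r) for l, r in zip(lhs, rhs))
```

Because the transform is just a fixed matrix multiplication, `max_error` is at floating-point rounding level, which is why a single linear layer could represent it.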
I guess the next step is to make Stable Diffusion a lot better at local-context-driven and multi-image prediction.
Today the AI still misses local context, as the models are mostly trained on open images annotated in English by experts. But imagine a single image annotated by multiple people in different languages: what happens to the AI's capabilities then?
I'm wondering, is there a service to create a tilemap spritesheet? Say I want to make a retro style Ultima-ish spritesheet with 64*64 sprites, but I don't have any drawing skill. Is there an AI to generate some for me?
The "initial frame -> video" problem seems way harder than generating video from a text input. Once a good dataset for that problem is assembled, it seems like Stable Diffusion would naturally accept an additional "time" dimension, and generate cohesive output, with corresponding massively increased hardware requirements.
Though I'd venture that the first "novel to feature film" or "novel to TV series" algorithm won't just be an upscaling of this tech....
Video will suffer even more than still imagery or text from the inherent lack of continuity/self-consistent memory that these autocomplete/prediction algorithms have.
So cool for abstract art but not for storytelling or following a script. Unless you are OK with the content being visually inconsistent like an acid trip.
I think that's just a scaling issue. Fundamentally there's no reason why a model trained on video couldn't come to create coherent motion in the same way that image models can now produce coherent lighting/themes.
Smaller image models had the same problems with logical inconsistency, just because they didn't have a sufficient general understanding of visual concepts.
The same is almost certainly true of video: early, smaller models will likely create janky movement, but once they've seen enough video to understand how a person walks, how a scene is framed, etc., there's no reason we couldn't get to the same level of maturity as today's image models.
I think the real issue will come from labelling. Most video is only going to be labelled with basic info/captions, without detailed descriptions of the camera pan or the movement of subjects. The amount of text required to accurately describe a scene is much larger than for a still image, and I'm not sure how one would go about collecting this.
It does seem to be the case that this type of generative image AI makes somewhat surreal imagery unless given very specific literal directions (e.g., Tom Cruise hugging Ben Affleck). If using this AI right now, it is probably best to work within these constraints.
Curious, what other notable features or use cases are people aware of that are already existing in other systems, but not present in Stable Diffusion?
Anyone aware of any other open source projects that have a known list of requested features and way for community to express support for them, for example bounties?
While on the software side we are close, the amount of compute required to train a model to produce longer, high quality, sensible, and non-surrealistic videos - especially with a plot - is simply too much for the time being.
Video is already possible as is shown by e.g. SALT_VERSE [1] (which may or may not use SD/DALL-E/... but that is not that interesting), what is not there yet is a --txt2vid script option. Implementing this would not be that hard but the processing time needed is quite substantial.
Something else which would be possible is the use of a model like SD in combination with a frame interpolation model like [2] as a video generator. Use SD to generate key frames, feed these to FILM and let it generate the intermediate frames and you should get video.
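The key-frame idea can be sketched as a small pipeline. Neither SD nor FILM is called here: the key frames are hard-coded placeholder values, and `midpoint` is a hypothetical stand-in (a plain average) for whatever learned in-between-frame model is plugged in. The recursive midpoint scheme is likewise an assumption for illustration.

```python
def midpoint(frame_a, frame_b):
    """Hypothetical stand-in for a learned interpolator: plain per-pixel average."""
    return [(a + b) / 2.0 for a, b in zip(frame_a, frame_b)]


def fill_between(frame_a, frame_b, depth):
    """Recursively insert midpoints; returns the frames strictly between a and b."""
    if depth == 0:
        return []
    mid = midpoint(frame_a, frame_b)
    return fill_between(frame_a, mid, depth - 1) + [mid] + fill_between(mid, frame_b, depth - 1)


def keyframes_to_video(keyframes, depth):
    """Stitch key frames into one sequence with 2**depth - 1 frames between each pair."""
    video = [keyframes[0]]
    for prev, nxt in zip(keyframes, keyframes[1:]):
        video.extend(fill_between(prev, nxt, depth))
        video.append(nxt)
    return video


# Stand-ins for three generated key frames (flattened pixels, illustrative values).
keyframes = [[0.0, 0.0], [8.0, 4.0], [0.0, 8.0]]
video = keyframes_to_video(keyframes, depth=2)
```

Swapping `midpoint` for a real interpolation model changes the quality of the in-between frames but not the stitching logic; whether the result reads as convincing video depends entirely on that model.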
Amongst the many video experiments, I find Karen Cheng is making some interesting concepts by piping AI output to a range of other interpolation tools.
In summary:
Dall-e generated the outfits,
Ebsynth mapped those outfits to a range of frames (instead of having a new random artwork on each frame) &
Dain smoothed the transitions between each outfit change.
The idea is that it takes two txt2img prompts and animates between them. The result isn't 100% there yet, but I think this idea has some legs on the 2D side of things because of the way it's able to keep a consistent composition. (One of the core issues with txt2img is that while it can produce nice images, it can produce just one; animation, or even a series of storyboards, requires a lot of manual intervention from the creative.)
SALT_VERSE just uses still images generated by DALL•E-2 and Midjourney. They're animated using techniques that have already been around for a long time: panning around and zooming in and out of a larger frame, fake raster-to-3d effects (auto projection mapping), typical After Effects transitions. There are some overlaid face animations done with something along the lines of NVIDIA's Audio2Face. None of it is video generated by SD, DALL•E-2 and Midjourney.
The article is thorough and covers why your [2], under tweening, is insufficient. Video beyond stereotyped repetitive movements is likely AI-complete.
Presumably you'd want to be director and be able to control camera angles. You might want to have a cartoon or 3D render visual style. Unreal engine prompt tag even more literal. Most important, you will want actors. Anything alive will have to move convincingly and intentionally. Some might even bring up the ethics of allowing the movie to end.
Before proper video, I think we'll first have to see tools for music, 3D rendering and animation that bring down difficulty by orders of magnitude.
There are many open source models like GPT-3. The problem is that you need a GPU cluster if you want to run something yourself that has similar performance to GPT-3.
None of them really work though; the current GPT-3 (InstructGPT) is fine tuned to answer questions, which isn't the same as the general text completion problem they're all trained on.
You can try TextSynth and see the result is nowhere near as good.
It is still a significant endeavor, for the same reason audio is still difficult. It will be a huge breakthrough, because it means having made quantifying uncertainty in the near-continuous limit tractable.