These videos look a lot like the things, and the way they move, that I see in dreams. They're blurry and seem to make sense but actually don't, e.g. the running rabbit: its legs are moving, but it isn't going anywhere. This is almost exactly how I remember dreams. When I see people moving, I can rarely make out their limbs moving accordingly. When I look at my own hands, they might have more than five fingers and very vague, blurry palm lines. When I try to run, walk, or fly, it's just as weird as these videos.
This reminds me of how the first generation of these kinds of image generators was said to be 'dreaming'. It also makes me wonder whether our brains really work like these algorithms (or whether these algorithms mimic the brain that closely).
> trained only on Text-Image pairs and unlabeled videos
This is fascinating. It's able to pick up sufficiently on the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics.
Getting something that generates multiple angles of the same subject in different typical poses would go a long way.
I can get Midjourney to kind of do this by asking for "multiple angles", but it's hit or miss.
Performing these optimization processes at inference time has never been very practical for generative tasks: it requires a lot of time and memory (to store the gradients), and the quality is usually mediocre. I still remember VQGAN+CLIP, where the optimization process was to find a latent embedding that maximized the cosine similarity between the CLIP-encoded image and the CLIP-encoded prompt. It worked, but it wasn't very practical.
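For concreteness, here's a minimal sketch of that kind of inference-time optimization loop (VQGAN+CLIP style), not the pipeline from this paper. The decoder below is a toy stand-in for a pretrained VQGAN decoder, and the latent shape, learning rate, and step count are made-up placeholders; only the latent is updated, which is why you have to keep gradients around at inference time.

```python
# VQGAN+CLIP-style latent optimization sketch.
# Assumes OpenAI's `clip` package; the decoder is a toy stand-in, not a real VQGAN.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float().eval()

# Toy stand-in for a pretrained decoder: maps a latent grid to a 224x224 RGB image.
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, 4, stride=4),  # 14x14 -> 56x56
    nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=4),   # 56x56 -> 224x224
    nn.Sigmoid(),
).to(device)

prompt = "a watercolor painting of a fox"
with torch.no_grad():
    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1
    )

# The latent is the only thing being optimized; decoder and CLIP stay frozen.
z = torch.randn(1, 64, 14, 14, device=device, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(200):
    image = decoder(z)  # latent -> image in [0, 1]
    img_feat = F.normalize(clip_model.encode_image((image - mean) / std), dim=-1)
    loss = -(img_feat * text_feat).sum()  # maximize cosine similarity with the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Every step runs a full forward and backward pass through the decoder and CLIP just to nudge one latent, which is exactly why this approach eats time and memory compared to a single feed-forward generation.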
I really wish these datasets were more openly accessible. I always want to try replicating these models, but it seems the data is the blocker. Renting the compute needed to create an inferior model doesn't seem to be an issue; it's always the data.
It's perfectly reasonable to release a publicly accessible paper while keeping the code to yourself, especially if you're Meta or OpenAI and wish to commercialize it at some point.
You can recreate things from papers just fine. I've done it for several projects; it's often nicer than just copy-pasting code, and it fixes issues where one side is using Montreal's AI toolkit, another is using PyTorch, and another is using Keras.
Although for a tool like this, they clearly used pre-trained models as a large component, ones with publicly accessible weights as well. So replicating it will probably happen in the coming months if Meta doesn't (understandably) release the code they very clearly plan to use for their own Metaverse product.
Sure, it's perfectly reasonable to release such a paper as PR. I don't think it's perfectly reasonable for any academic journal to accept it. Leaving the code out of a paper about claims regarding the code is like leaving the experiment design out of a material science paper.
In addition, it's worth noting that Meta is generally good at releasing source code.
Often there's a paper deadline and the code still needs tidying up, or the same codebase supports additional models that are published in additional papers.
Keep an eye on the facebookresearch GitHub for this over the next few months.
Code is nice, but a paper should be written sufficiently well that it gets the ideas across such that the solution can be replicated. The ideas are the point, not the implementation.