Experiment: Can 3D improve AI video consistency? (backdroptech.github.io)
65 points by TobiasEnholmX 10 months ago | 36 comments


AI-generated video struggles with consistency. Flickering, weird proportions, and characters changing. I tried using 3D as a way to get more consistency.

Overall, it worked. No sudden changes in proportions, clothing, or style. Still, there are some limitations, especially with fine details.

We’re looking into whether this could be useful as a tool and would love to hear what you think: Has anyone experimented with 3D + AI generation for images or video? Or do you see a better way to approach this?

Demo and details in the blog: https://backdroptech.github.io/3d-to-video/


We and several other startups tried this, and we even filed a few patents.

1. The pool of users with the experience and patience for this is slim. Blocking out a scene and the animation is tough, and the users with that skill and inclination are already using Blender and ComfyUI or are submitting renders to RunwayML V2V. We're still too early for AI auto-rigging and animation to work, though those technologies will make this approach easier.

2. AI video users want I2V quality and predictability, not unpredictable V2V style transfer. You need control over the exact look and feel of the starting frame as well as the animation. If you can't get this, the renders are useless.

3. One of the advantages of AI video is that it can animate things a human animator cannot easily do. Non-humanoids, crowd movements, explosions, etc.

Basically, this requires deep integration with a new class of video model.

You'll find that even with this technology perfected, it fits into a comprehensive suite of tools that AI video creators will use. They will still lean on I2V for most shots and V2V compositing for other shots.

I've done a lot of hands-on interviews and demos. Steve May, various studios, schools, etc. Steve kind of negged me and told me there are bigger players working on this. My guess was Odyssey Systems at the time, but they turned out to be working on something else.

I do think this is a valuable technology, but there's a tremendous amount of work to do to make it work.


First off, I’d love to hear more about your experience. If you’re up for a chat, shoot me an email at tobias@backdrop.tech, or let me know where I can learn more about what you’ve worked on!

1. I assume you mean patience when setting up a 3D scene? That’s definitely a factor, but it’s getting easier with image-to-3D tools, and AI can even assist with object placement to speed things up.

2. Yeah, predictability is key. Our approach is about making it easier to generate high-quality, consistent images, which are then fed into video models—rather than relying on direct video-to-video style transfer, which can be more chaotic.

3. Agreed! AI can animate things that traditional methods struggle with, but consistency is still a challenge. This workflow helps strike a balance between AI flexibility and user control.


Yes, it's better to use AI to automate individual steps in the actual animation process itself, with AI workers.

The quality is identical to that of human animators (same tools, same process).

Just the cost is lower (although training the AI workers is a new cost).

The only companies that can do this are those that have a very strong workflow, because AI workers operate on individual steps, NOT the entire workflow.

2D animation already has this, so it's easier to adapt to AI workers than 3D (for that reason).


Nice demo examples. I'm a casual observer and curious how this differs from Gaussian splatting which also (implicitly?) uses 3D representations.

I could see applying changes at the 3D model level which wouldn't be directly accessible if it was only an internal representation.


Yeah, exactly. Gaussian Splatting works great when you have an image (or set of images) and want to reconstruct a whole scene in 3D, but it treats everything as a unified point-based representation. In a structured 3D scene, though, objects are clearly separated, so you can manipulate them individually.

For example, you can attach a LoRA specifically to one object and run a separate workflow just for that, giving you way more control. That’s a big difference—Gaussian Splatting doesn’t naturally lend itself to object-level edits since everything is blended into the same representation.
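
As a rough illustration of that object-level control (not our exact pipeline; the model ID, LoRA file, and image paths are placeholders): render the scene plus an object-ID mask for the one object, load a LoRA, and regenerate only that masked region, e.g. with diffusers:

    # Sketch: apply a LoRA to just one object in the rendered frame.
    # The 3D scene gives us a clean per-object mask via an object-ID render pass.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # LoRA trained for one specific character/object (placeholder file name).
    pipe.load_lora_weights("character_lora.safetensors")

    frame = Image.open("render.png").convert("RGB")            # full render of the 3D scene
    object_mask = Image.open("object_mask.png").convert("L")   # white = the one object to restyle

    # Only the masked object is regenerated under the LoRA; the rest of the frame stays untouched.
    result = pipe(
        prompt="the hero character, detailed outfit, cinematic lighting",
        image=frame,
        mask_image=object_mask,
        strength=0.6,
    ).images[0]
    result.save("frame_object_pass.png")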


Isn't modeling and animating the hardest part anyway? You're saving some time on texturing and lighting (which is still good, don't get me wrong).

But the whole premise of AI video is that you're basically directing, and the model does all the hard parts for you.


Yeah, modeling and animating used to be a huge bottleneck, but with tools like image-to-3D and improving AI animation models, it's getting easier and faster. The goal here isn’t to replace animation but to use 3D as a guide, so AI doesn’t have to guess everything from scratch. In this workflow, you don’t even need animations—just static 3D models. The output is a single image that gets passed to a video model, which handles the motion.
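
If it helps, the hand-off looks roughly like this (an illustrative sketch with an off-the-shelf image-to-video model, not our actual stack; the model ID and paths are placeholders):

    # Sketch: one styled frame in, motion out. The video model handles all the animation.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
    ).to("cuda")

    # The single image produced from the static 3D scene (placeholder path).
    start_frame = load_image("styled_render.png").resize((1024, 576))

    frames = pipe(start_frame, num_frames=25, decode_chunk_size=8).frames[0]
    export_to_video(frames, "shot.mp4", fps=7)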


Oh, I didn't realize you were feeding static 3D scenes to it, basically only the first frame.

That makes it a lot more useful.


The article doesn't clearly say what it's doing.

Is it training a model based on a 3D model? Is it doing img2img? Is it doing video2video using a 3D video/image as the source?

The article shows off results but is light on explanation of what is actually going on here.

Also, the person is bearded. It's hard to notice face changes there. I want to see more demos, but instead of a 3D-like render, do a realistic render without the beard, etc.

edit: Plus these are very short clips. 3D should theoretically help, but this article could have been better.

edit2: In the anime example, the girl's face is definitely changing (+ the lips are not correct). It feels like you are doing img2video here?


Appreciate the feedback! Sounds like you're looking for more of a deep dive into the tech and how it works. This post was more about showing results, but we’ll cover the process and technical details in another one.


I'm aware this is only a very tangential comparison, but my LLM-based coding workflows around Cursor regularly involve creating "empty" skeletons of files which I add to the context. For example, because I know it makes sense for an implementation to have a FooBarService and two entities Fizz and Buzz, I create these as empty files (with the "class FooBarService" line as their only content). This way, Cursor (or rather, Claude) doesn't get too creative and wild when deciding how and where to implement code.

This really increased the quality of results for me.
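
Concretely, the skeletons are as minimal as this (just the toy names from above, nothing Cursor-specific):

    # foo_bar_service.py -- intentionally empty; the model fills in the behavior.
    class FooBarService:
        ...

    # fizz.py
    class Fizz:
        ...

    # buzz.py
    class Buzz:
        ...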


That makes sense! Giving AI a structured starting point so it doesn’t go off the rails is similar to what we're doing with 3D. Instead of letting it hallucinate frame by frame, we define the space, characters, and style upfront—so the AI focuses more on refinement rather than making up new details every time. Interesting parallel!


What's up with the redirect? You created an otherwise empty GitHub repo just so you can post a GitHub link instead of a direct link to the announcement of a private, closed source project. Why?


Maybe it's a growth hacking trick?


I can see how it helps to feed a 60% final product, but at that point isn't it style transfer, or DLSS?


Yes, at least in the examples above, the 3D data is extracted and then fed into 2D image-generation models like Flux, followed by a relight step.
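
Roughly like this, as a sketch with public checkpoints standing in for the Flux + relight setup (not our actual code; the model IDs and file names are just examples):

    # Sketch: condition a 2D image model on depth extracted from the 3D render.
    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from diffusers.utils import load_image

    # Depth map rendered straight out of the 3D scene (no monocular estimation needed).
    depth_map = load_image("scene_depth.png")

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # Geometry from the 3D pass pins composition and perspective; the prompt handles look and lighting.
    image = pipe(
        prompt="photorealistic bearded man in a workshop, warm evening light",
        image=depth_map,
        num_inference_steps=30,
    ).images[0]
    image.save("styled_render.png")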


It would be surprising if it didn't, but it's a question of diminishing returns.

The complexity and cost likely increase approximately with the third power. Does the result justify this effort?


The rationale is that if you don't use 3D models to keep subjects consistent over multiple scenes, it becomes too unpredictable and you need far too much trial and error. 3D helps with that. And if you want that very same character to start from a specific pose, it's more likely you'll give in to whatever the AI model produces rather than the other way around.


We’re actually finding that leveraging 3D reduces complexity rather than increasing it. Setting up a 3D scene is getting easier with tools like image-to-3D, and once it's set up, you can reuse it flexibly across different shots without having to recreate assets from scratch.

By passing a structured 3D-rendered image to the AI, you increase the chances of the video model generating exactly what you want, rather than relying on unpredictable, frame-by-frame AI improvisation.


This is great tech, but I was expecting more from “I’ll show you how I used 3D to make AI generations more predictable” - I wanted to know how; instead we essentially got an ad.


Fair point! We debated whether to post this here since it's more about examples rather than the full how-to and technical breakdown. Next post, we’ll aim to go deeper into the process and techniques behind it. Appreciate the feedback!


I am very confused... what is 3D? To me, 3D means three dimensions - but it looks like 3D refers to a product? This page doesn't help explain anything.


Good question! When we say '3D' here, we mean using 3D models and scenes as a base to guide AI video generation. The AI isn’t just making things up from scratch—it follows the structure of a 3D-rendered scene to keep things consistent across frames. It’s not a product, just an approach to solving the problem of AI-generated video drifting too much between frames.


I’d suggest adjusting your demo, I also couldn’t figure out if “3D” meant a technique or a product here.


> 3D models and scenes

Maybe add "digital" too...


This is one of the most impressive demos I've ever seen. Truly incredible. I have been dreaming about working on something like this for a while. Open to help?


Glad you like it! In the blog post you can find my email, or join the discord https://discord.gg/backdrop


Not an expert here. I am not sure how 3D the videos in the article are. IMO they are 3D in a Pixar/animated sort of way.

But I have very recent first hand experience of creating a video for our startup's Facebook post with Minimax image-to-video inference, from an image of our animated avatar character.

...And yes, at first the videos were bad quality with lots of inconsistencies, but after adding "animated" to the prompt, in front of the word "man", the result was pretty great on the first try! I then ended up actually using it. (You can check it here if interested: https://fb.watch/xRC-fptexM/)

Perhaps it should be self-evident, but still, it was not to me. :)

Edit. I guess my point was also that the animated character in the video ended up being somewhat 3D as well.


Yeah, I see what you mean! When we say 3D here, we mean working with actual 3D scenes—models, depth, and lighting—rather than the '3D movie' style like Pixar. The goal is to use 3D to control AI generations more precisely, so things stay consistent across frames instead of the AI hallucinating every frame from scratch.

Your experience with Minimax sounds cool! It makes sense that adding 'animated' to the prompt helped consistency—AI models often struggle with structure, so any guidance helps.


That "anime styled" one looks... subpar. The eyes kept changing shape and it looks like it dropped a load of frames?


That's some way to market a new product.


Got an error (literally just said "Error") trying to sign up for the waitlist, FF mobile on Android.


I just checked and it works for me, but I'm on iOS. There is also a fallback that sends each subscriber to my email.


Is the idea here basically to transform the almost unlimited degrees of freedom an AI has in manipulating raster image properties from one frame to another into a limited set of known things it can change?

I mean, unless it's a wild magic sci-fi movie, it makes sense that once a character is known separately from its background, the background doesn't change with a hand movement of the character. And the character cannot move beyond earthly physics?

Is that what this is?


Yeah, that’s pretty much the idea. AI has a lot of freedom when generating each frame independently, which often leads to inconsistencies. By using a 3D base, we’re constraining that freedom: locking in composition, lighting, and perspective so the AI doesn’t hallucinate unintended changes. The goal is to have a stable structure where AI can still add detail and style but without breaking spatial or physical consistency.



