Reducing geometric detail while keeping outlines intact is one of the major showstoppers preventing current game engines from rendering realistic foliage. That exact same problem is also why a NeRF, with its near-infinite geometric detail, is impractical to use for games. And this paper is yet another way to produce a NeRF.
SpeedTree already used billboard textures 10 years ago, and that's still the way to go if you need a forest in UE5. Fortnite slightly improved on that by having multiple billboard textures that get swapped based on viewing angle, which they call impostors. But the core issue of how to reduce overdraw and poly count when starting with a high-detail object is still unsolved.
That's also the reason, BTW, why UE5's Nanite is used only for mostly solid objects like rocks and statues, but not for trees.
But until this is solved, you always need a technical artist to make a low-poly mesh whose textures you can bake your high-resolution mesh onto.
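To make the viewing-angle impostor swap mentioned above concrete, here's a minimal Python sketch of the frame selection; the atlas layout and frame count are assumptions for illustration, not how SpeedTree or Fortnite actually lay it out.

    import math

    def pick_impostor_frame(view_dir_x, view_dir_z, num_frames=16):
        # Assumes an impostor atlas of num_frames renders of the tree taken at
        # evenly spaced yaw angles around the vertical axis (a made-up layout).
        yaw = math.atan2(view_dir_x, view_dir_z) % (2.0 * math.pi)
        # Snap the horizontal view direction to the nearest precomputed frame.
        return round(yaw / (2.0 * math.pi) * num_frames) % num_frames

    # A camera looking roughly along +X picks frame 4 out of 16.
    print(pick_impostor_frame(1.0, 0.1))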
Nanite can actually do trees now, and Fortnite is using it in production, with fully modelled leaves rather than cutout textures because that turned out to be more efficient under Nanite. They talk about it here: https://www.unrealengine.com/en-US/tech-blog/bringing-nanite...
That's still ultimately triangle meshes though, not some other weird representation like NeRF, or distance fields, or voxels, or any of the other supposed triangle-killers that didn't stick. Triangles are proving very difficult to kill.
My understanding is that while they allow masked textures, the geometry is still fully emitted by Nanite. That means you still need to start with a mesh that has multiple leaves baked into a single polygon plane, as opposed to starting with individual leaf geometry that then somehow gets baked down automatically.
This illustration from the page you linked to shows that as well:
The alpha masked holes move around, but the polygons remain static. That means if you draw a tree with this, you still have the full overdraw of the highest-poly mesh.
Yeah alpha masking is inefficient under Nanite, but as they explain further down its handling of dense geometry is good enough that they were able to get away with not using masked materials for the foliage in Fortnite. The individual leaves are modelled as actual geometry and rendered with an opaque, non-masked material.
Please note that these results were obtained using a small amount of compute (compared to, say, a large language model training run) on a limited training set. Nothing in the paper makes me think that this won't scale. I wouldn't be surprised to see an AAA quality version of this within a few months.
There have been some attempts at translating these techniques over to 3D models, but the results so far aren't very useful. They have a tendency to produce extremely triangle-dense meshes which, ironically, aren't actually very detailed or well defined: most of the detailing is just painted over the surface as textures, which also have terrible UV mapping. Not to mention the topology is a mess, so it's impossible to rig for animation. I'm not sure how good they are at producing proper PBR material channels, but I'm guessing "not good" if they're derived from text-to-image models trained on fully rendered images/photos rather than individual PBR components, which is much more difficult data to source on a massive scale.
I suspect this will continue to be an uphill battle because there aren't billions of high quality 3D models and PBR textures just lying around on the internet to slurp up and train a model on, so they're having to build it as a second-order system using images as the training data, and muddle through the rest of the steps to get to a usable 3D model.
I look at such techniques when they appear here on HN every now and then, and indeed they are very impressive, but they're also something an intermediate 3D artist could do better with ease.
I think, however, that the existence of such tools could motivate more people to create low-quality 3D assets that for many purposes are good enough - people might currently be too shy to do that because so many high-quality assets put theirs to shame... Once AI 3D stuff floods the space, amateurs might start competing with it - or so I hope.
Also, it's just a matter of time until these tools reach 80% (from the 80/20 rule), then 90% and 98% quality. The remaining 2% will still be of value in AAA titles though...
MVDream can be combined with a NeRF (or an equivalent spatial interpolator such as InstantNGP/TorchNGP) followed by marching cubes to extract a mesh and produce a 3D model. SDS loss and guidance are computed from multiple equally-spaced views simultaneously. MVDream has a repo that implements this:
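Separately from that repo, here's a rough Python sketch of just the final mesh-extraction step: sample the trained density field on a grid and run marching cubes via scikit-image. query_density, the grid bounds, and the density threshold are placeholders for whatever the particular NeRF implementation exposes.

    import numpy as np
    from skimage import measure  # pip install scikit-image

    def nerf_to_mesh(query_density, resolution=256, bound=1.0, threshold=25.0):
        # query_density: callable taking an (N, 3) array of points and returning
        # (N,) densities -- a stand-in for the trained NeRF / InstantNGP field.
        xs = np.linspace(-bound, bound, resolution)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (R, R, R, 3)
        density = query_density(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
        # Extract the iso-surface where density crosses the chosen threshold.
        verts, faces, normals, _ = measure.marching_cubes(density, level=threshold)
        # Map voxel indices back to world coordinates in [-bound, bound].
        verts = verts / (resolution - 1) * (2 * bound) - bound
        return verts, faces, normals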
I expect people will build rough geometry with polygons and feed a depth+material map to a diffusion model for the final render. This way you get photorealism from the model but maintain precise control over what the scene has in it and where.
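That workflow is roughly what depth-conditioned diffusion already does. A hedged sketch using the diffusers library; the checkpoints named here are just commonly used public ones, and the file names are made up.

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # Depth map rendered from the rough polygon scene (any engine can emit this).
    depth = load_image("rough_scene_depth.png")

    # Commonly used public checkpoints -- swap in whatever depth ControlNet you prefer.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # The text prompt stands in for the material description; the depth map pins
    # down where everything is, so layout stays under the artist's control.
    image = pipe(
        "photorealistic forest clearing, mossy rocks, volumetric light",
        image=depth,
        num_inference_steps=30,
    ).images[0]
    image.save("diffused_frame.png")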
I’m confused why there is so much focus on text-to-image and text-to-3D generation. If you spent five minutes talking to anyone with artistic ability, they would tell you that this is not how they generate their work. Making images involves entirely different parts of reasoning than those used for speech and language. We seem to be building an entirely faulty model of image generation (outside of things like ControlNet) on the premise that text and images are equivalent, solely because that’s the training data we have.
Can you share some of what you have found about the creative process by talking to people with artistic ability?
What are your ideas about the differences between a human's and an AI's creative process?
Are there any similarities, or analogous processes?
Do you think creators have a kind of latent space where different concepts are inspired by multi-modal inputs (what sparks inspiration? e.g. sometimes music or a mood inspires a picture), and then the creators make different versions of their idea by combining different amounts of different concepts?
I am not being snarky; I am genuinely interested in views comparing human and AI creative processes.
I used to work as an illustrator. Most images appeared to me as concepts, somewhere between fuzzy and clear, unaccompanied by any words. I then had to take these concepts and translate them using principles of design, color, composition, abstraction, etc., such that they’re coherent and understandable to others.
Most illustration briefs are also not written descriptions of images, because people are remarkably bad at describing what they want in an image beyond its subject in the most general sense. This is why you see DALLE doing all kinds of prompt elaboration on user inputs to generate “good” images. Typically, the illustrator is given the work to be illustrated (e.g. an editorial), distills key concepts from the work, and translates these into various visual analogues, such as archetypes, metaphors, and themes. Depending on the subject, one may have to include reference images or other work in a particular style, if the client has something specific in mind.
Project briefs to an artist typically contain both text and reference images. Image diffusion models and the like likewise typically use a text prompt together with optional reference images.
Project briefs are generally not descriptions of images and reference tends to be more style than content focused. Source: I used to be an illustrator for major media outlets like the NYTimes, etc.
The artist is given the content to be illustrated, extrapolates themes and overarching rhetorical or narrative aspects of the work, creates visual representations or metaphors corresponding to these aspects, generates 3-5 interpretations, and shows them to the AD (art director), who provides feedback on what has been extrapolated as well as on various design considerations.
Not even wrong, in the Pauli sense: to engage with it requires conceding the incorrect premises that image models only accept text as input and that the generation process relies on that text.
Text prompts aren't an essential part of this technology. They're being used as the interface to generation APIs because they're easy to build, easy to moderate, and, for the Discord models like Midjourney, they make it easy for people to copy your work.
With a local model you can find latent-space coordinates any way you want and patch the pixel-generation model any way you want too. (The above are usually called textual inversion and LoRAs.)
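Both hooks are exposed in the diffusers library, for example; a minimal sketch, with made-up file paths and token name:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Textual inversion: a learned embedding acts as a latent-space "coordinate"
    # you can reference from the prompt (file and token name are made up).
    pipe.load_textual_inversion("./my-style.bin", token="<my-style>")

    # LoRA: small learned patches to the denoising model's weights (made-up path).
    pipe.load_lora_weights("./my-character-lora.safetensors")

    image = pipe("a portrait in <my-style> style").images[0]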
I would personally like to see a system that can input and output layers instead of a single combined image.
And for in-painting I think you’ll find text-to-image is still useful to artists. It’s extra metadata to guide the generation of a small portion of the final image.
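For example, with an off-the-shelf inpainting pipeline the prompt only steers the masked region; a quick sketch with the diffusers library (paths are placeholders, the checkpoint is a common public one):

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    init = load_image("artist_painting.png")      # the artist's own image
    mask = load_image("region_to_redo_mask.png")  # white = regenerate, black = keep

    # The text prompt acts as extra metadata guiding only the masked patch.
    result = pipe(
        prompt="weathered bronze statue, overcast light",
        image=init,
        mask_image=mask,
    ).images[0]
    result.save("patched.png")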
Not sure what these cars are all about. Everyone travels by horse and buggy…
We’re building a model optimized for the machine, not people.
Artists can go collect clay to sculpt and flowers to convert to paint. Computers are their own context and should not be romantically anthropomorphized.
In the same way fewer and fewer people go to church, fewer and fewer will see the nostalgia in being a data entry worker all day. Society didn’t stop when we all got our first beige box.
This is an incredibly dull, unthinking regurgitation of the nonsense people say online to feel better about their own lack of creative ability. My point wasn’t that computers can’t do the same thing as artists (they already can), it’s that computers won’t achieve the same result by having people describe the images they want to see because that’s fundamentally not how images are made or even perceived.
I play three instruments, draw, sculpt, and used to build houses.
No one ever set a goal for AI to achieve the same result; the goal is just to replace labor.
Your post is the same dull strawman that non-engineers (I also have a BSc in engineering and an MSc in math) repeat about AI.
Find me a formal proof of how “images are made” and I’ll show you one of an infinite number of possible models that could explain it, given a few axiomatically correct twists to the math, since all of our symbolic logic is a leaky abstraction that fails to capture how anything is “fundamentally made”.
Nah, that has absolutely nothing to do with what they're saying. I've used it for over a year, and this is a weird way for it to appear in conversation; I hope you're not astroturfing.
Fortsense FL6031 - automotive ready. For anyone not familiar with SPADs (Single Photon Avalanche Diodes), YouTube them. Very impressive computational imagery through walls, around corners, and such.