GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images (nv-tlabs.github.io)
177 points by lnyan on Sept 24, 2022 | 62 comments



On a somewhat related topic, I think we could just use Stable Diffusion to help convert single photos into 3D NeRFs.

1. find the prompt that best generates the image

2. generate a (crude) NeRF from your starting image and render views from other angles

3. use Stable Diffusion with the views from other angles as seed images, refining them using the prompt from step 1 combined with added view descriptions ("view from back", "view from top", etc.)

4. feed the refined views back to the NeRF generator, keeping the initial photo view constant

5. generate new views from the NeRF, which should now be much more realistic.

Run steps 2-5 above in a loop indefinitely. Eventually you should end up with a highly accurate, realistic NeRF which is fully 3D from any angle, all from a single photo.

Similar techniques could be used to extend the scene in all directions.
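
To make the idea concrete, here's a rough Python sketch of that loop. Only the img2img call is real diffusers API; fit_nerf, render_view, the prompt, and the file name are placeholders I've made up, and the step comments map to the numbered list above:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    def fit_nerf(views):
        """Placeholder: train a NeRF on {view name: image} pairs."""
        raise NotImplementedError

    def render_view(nerf, view_name):
        """Placeholder: render the NeRF from the named viewpoint as a PIL image."""
        raise NotImplementedError

    base_prompt = "a red vintage sports car"        # step 1: prompt recovered for the photo
    photo = Image.open("input.jpg").convert("RGB")  # the single known photo
    views = {"front view": photo}

    for _ in range(10):                             # loop over steps 2-5
        nerf = fit_nerf(views)                      # steps 2/4: (re)fit the NeRF
        for name in ("view from back", "view from top", "view from the side"):
            crude = render_view(nerf, name)         # step 2: crude novel view
            views[name] = pipe(                     # step 3: refine it with SD
                prompt=f"{base_prompt}, {name}",
                image=crude, strength=0.45,
            ).images[0]
        views["front view"] = photo                 # step 4: keep the original fixed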


The problem with such an approach is that NeRFs require a set of input images with their exact poses, and exact poses are only available if the underlying geometry is static. However, if you use SD to generate new views, it's only an approximation and you wouldn't be able to get the exact poses.

Not all hope is lost though. I'm pretty sure in a few years (perhaps sooner) we'll be able to generate entire 3D scenes directly without going through 2D images as an intermediate step.


I don't see it as a blocker... Especially if you alternate NeRF and SD iterations, i.e. don't generate a whole image each time; instead just choose a random angle, render that view from the NeRF, run a single SD iteration on that render, and do another training step of the NeRF.

That way, you know the exact pose for each image, because you chose it when rendering the NeRF.
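
As a structural sketch of what that interleaving might look like (every function here is a placeholder I'm assuming, not any particular library's API; the point is only that the pose is known exactly because we chose it):

    import random

    def random_camera_pose():                      # placeholder: sample a viewpoint
        return random.uniform(0, 360), random.uniform(-30, 60)   # (azimuth, elevation)

    def render_nerf(nerf, pose): ...               # placeholder: render the current NeRF
    def sd_single_step(image, prompt): ...         # placeholder: ONE denoising step, not a full sample
    def nerf_train_step(nerf, pose, target): ...   # placeholder: one gradient step on the NeRF

    nerf, prompt = None, "a red vintage sports car"
    for _ in range(100_000):
        pose = random_camera_pose()                # known exactly, because we chose it
        render = render_nerf(nerf, pose)
        nudged = sd_single_step(render, prompt)    # SD nudges the render toward plausibility
        nerf_train_step(nerf, pose, nudged)        # the NeRF absorbs the nudge at that exact pose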


This, 100%. Most images in SD's training data are paintings and have a certain composition (for instance, people usually looking at the camera, posed a certain way, etc.). Trying to generate the same thing from a different angle will yield wildly different results, which won't work well for NeRF. I do think diffusion models may work for generating 3D stuff, but right now a) we need a big labelled dataset of various 3D objects in different styles (yes, I know some exist, but they are nowhere near what we have for 2D images, since 2D images can be scraped from the web), and b) memory/computation is a big bottleneck, as 3D is orders of magnitude bigger in terms of memory and processing. That said, hopefully some smart person will come up with some shortcuts that make it possible on today's hardware.


You can optimize the poses as part of the model.
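
A minimal PyTorch sketch of what I take that to mean (in the spirit of bundle-adjusting NeRF variants; the shapes and names here are made up):

    import torch

    num_views = 8
    pose_deltas = torch.nn.Parameter(torch.zeros(num_views, 6))   # per-view 3 rotation + 3 translation corrections
    nerf_weights = torch.nn.Parameter(torch.randn(256, 256))      # stand-in for the actual NeRF MLP

    # One optimizer updates both: each step renders view i with
    # (initial_pose[i] composed with pose_deltas[i]) and backpropagates
    # the photometric loss into the poses as well as the NeRF.
    optimizer = torch.optim.Adam([pose_deltas, nerf_weights], lr=1e-3)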


That doesn't work because of the contradiction between NeRFs needing multiple images of the same geometry and Stable Diffusion not producing images with stable underlying geometry, since it's purely image-based.


I have my doubts that this will converge to anything meaningful.


In the short term you may be right, but in the long run it's a certainty that you won't be.


Ok, so on the generative model modality landscape I'm now aware of:

- speech

- images

- audio samples

- text

- code

- 3d models

I've seen basic attempts at music and video, and based on everything else we've seen, getting good results there seems to be mostly a matter of scaling.

What content generation modalities are left? Will all corporate generation of these fall to progressively larger models, leaving a (relatively) niche "Made by humans!" industry in its wake?


It's not just a matter of different modalities. It's still a matter of sophistication.

The end game is endless generative music or video streaming customized to your preferences. Being able to describe a story, or having the AI model take a guess at what you might find interesting/entertaining and generating a whole TV show or movie for you to watch. Or generating background music while you work that automatically adjusts to your tastes, as well as adjusting if you're finding it hard to concentrate or need to take a call.


Except it won't be, will it. Such things were promised for the internet, and we had maybe a good 10 years or so before corps caught up and told us what to watch through their channels.

I imagine this being much of the same: AI trained on corp-approved training sets to give suggestions to your preferences that they want.

Sure, you could spin up your own and buy a machine that trains on its own training data, but watch how no one will do that because of the cost, or the diminishing access to untainted resources.


This seems like a weird take to me. We just saw Stable Diffusion land, an open-source community-trained SOTA model. There is an open reimplemented version of GPT.

How is the correct extrapolation that corps will control all content generation?

Sure, Google will always be an OOM or two (more?) ahead in terms of compute dedicated to the problem. And so the best-quality stuff will likely come from big corps; Netflix (or their successor) will have the best quality video-generation AI. That is how it always has been though; movies are heavily capital intensive.

But this tech raises the quality of hobbyist-generated content vs. highly-capitalized studio content. So I think it’s reasonable to extrapolate to even more content at the long tail, instead of consolidation.


It's going to be akin to content creation on YouTube, perhaps even with people just using YouTube as their distribution medium. Anyone can make a YouTube video, but we don't see everyone creating content.

We should see a proliferation of the tech such that lots of small (even one-man) studios pop up pumping out high-quality content, but the content will be released on a schedule, similar to how YouTube videos are now. Your preferences come into play through your suggested watch list, which will be populated from this pre-created media based on whatever preference machine-learning algorithm the distribution platform (YouTube?) decides on. The feedback from watch metrics will then be used by these micro studios to decide what to create next. It's basically what already happens now with YouTube content creation, but the quality of what people produce will be better than Hollywood movies / TV shows, and the pace of release will be much quicker.

Not everyone needs to be training and generating their own content in order for your content preferences to be absolutely saturated with things you'd enjoy watching.


Management positions replaced by a supervising AI


The problem is finding a good training set.


I wonder what your input data would need to be for a competent AI in this space.

Finance, goals, delivery timelines, capabilities of the team, employee availability, which employees work well with each other, office politics, regulatory constraints...


It doesn’t need to work as well as current management, just well enough to be cheaper.

Perhaps not even that, factoring in loyalty.


Don't see why not, especially for low-impact, high-frequency decisions. Some AI-guided assistance with the option to automate. The next level of autocomplete, I guess.


The future is already here -- it's just not evenly distributed.

https://www.retailbiz.com.au/latest-news/inside-amazons-ai-p...


Music will continue to be made by humans because of strong copyright law. It's illegal to sample (and distribute) > 0 seconds of recorded music. If that makes it into any AI-generated music, it's game over if you distribute it.

Source on sampling: the head audio engineer at Juilliard School of Music.


Why would being a "head audio engineer" at Juilliard give you any credibility on AI-generated music and sampling/copyright law? Lol.


Because they work with/teach electronic music/sampling in addition to recording classical acoustic music.


It's not sampling. Sampling is copying & pasting, but that doesn't happen, technically. Just like stable diffusion doesn't copy any artworks. AI learns from previous works, but doesn't copy them. It's quite similar to how humans learn and make adaptations based on other work.


While technically true, the unspoken premise of my argument is that it can and will output distinctive samples derived from the source. E.g., you can't change the pitch and tempo, add other effects, and call the output your own, legally.

It's a landmine


Do you mean it's a minefield* :P

Not sure it's as much of a problem as you're making it seem though... likely there can be enough iteration that it would not be distinguishable.

Also, lots of mainstream artists ignore copyright anyway (Kanye West, etc.). Plus you could just use training data from public domain stuff if it were really such an issue; there are decades' worth of music outside of copyright law.


That seems like good advice for professionals, but I'm wondering if it's going to hold up with new ways of distribution.

Would distributing a generative model that can sometimes generate such music also be considered illegal? Will it actually stop people from doing it in practice?

Would it be illegal to share seeds and prompts?

Through these alternative methods, you could have a lot of people listening to music that's never distributed as audio or video files. And if there's an API for it, games could use such generated music via a plugin.

And then I suppose people start sharing on YouTube, and we see how good their copyright violation detection actually is.


People sample music all the time. Sometimes credit and royalties are given, but there is a lot of popular music being made that samples without credit.

Against the law? Maybe, probably. I don't think the laws are as strong as they appear though.


Real-time interactive 3D scenes. Imagine Myst, but the 60fps visuals can be anything you can imagine, and anything you can think of can be done within it. Imagine GTA but it doesn't take a AAAA team 5-10 years to flesh out the content to make everything interactable.

At some point we'll have a universal Holodeck more powerful than our imaginations, able to simulate slices of our experience better than our dreams. It'll be fun as heck


At some point we'll have fusion and/or fully self-driving cars. Still, the AI can only replicate learned ideas, so you can't do anything that hasn't been done before. Generated content will look 99% there but be just a bit off. Besides that, even if you could do 'anything' in a game, what's the point if nothing you do has meaning?


To me these models look worthless. They'd be useless for anything other than BG props with high DOF and complementary lighting; you can see on the rear windows in particular that there are artifacts from the topology. If you hit most of that shit with a light from the side, it would look horrendous.

You can get away with a lot, but I think this is too much. I think future iterations could be promising, but this definitely isn't challenging any pipeline I'm aware of.


To be fair, these models are no worse than the ones the iPhone makes with LIDAR. It's pretty impressive for being generated from a single static image.


Well, you're gonna need folks to sift through all the generated images and curate the results into something coherent. Taste is still a thing, after all.


Visions of the future where the consumer has to pick apart "human curated", "human assembled" and "human made" much the same way we do now for cage free and free range eggs at the grocer.


That would be the same as curating social media feeds.


HN comments.


I'm waiting for the "literal video" generator, which writes and sings new lyrics describing what is happening in the video.


Animation, at least the ‘background one’.


Still nowhere near good enough to be able to generate a VFX or video game asset from some pictures, which is what we'd really want for a practical application of such a tool.


This would make the coolest Katamari Damacy game ever.


Generating good video game assets from pictures is solved, but this does more than that. It generates modified versions from words.


>Generating good video game assets from pictures is solved

lol no, not at all. It still needs tons of manual work to get it up to quality in terms of topology, material, etc.


In terms of practical engineering it's not solved; I mean that the SOTA in photogrammetry is good enough to create high-quality textures and meshes directly from pictures.


Those meshes and textures are far from usable for realtime rendering in a 3D game.


Some of the videos aren't working in Firefox. Here's the error:

> Can't decode H.264 stream because its resolution is out of the maximum limitation


They all work for me on Firefox. Btw, it's using the system's decoder.


An AI that does good UV unwrapping would be much more interesting and useful.


And I'd love to see an AI-generated rigging tool for auto-generating bone structures so you don't have to do it by hand.

Baby steps, though. The data required to train an unwrapping/rigging tool is a lot more domain-specific than correlating an OBJ file with its completed render.


Can you help me understand how you are imagining that? As in, you have texture images already and you want to apply them to the 3D object intelligently? Is that the case you’re talking about? Or texture generation and UV unwrapping in one?


Generating a UV map from an untextured mesh. The UV map is stored in the vertices as texture coordinates, and together with the topology of the mesh it defines how the texture (images) gets mapped onto the mesh. A good UV map preserves surface area (i.e., every region of the mesh maps to a region of the texture proportionally, otherwise you get stretching), has few seams, and has little empty space around the texture, to reduce size.

There are ways to do this automatically but they're far from perfect. Artists usually take the mesh and literally unwrap it until it's planar, and convert this transformed mesh to the UV mapping. The advantage of this method is that it gives you very good control of seams and texture islands, but it's tricky to preserve surface area.
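
For reference, the automatic route typically looks something like this (a minimal sketch using the xatlas Python bindings together with trimesh; the file names are made up, and the quality caveats above still apply):

    import trimesh
    import xatlas

    mesh = trimesh.load_mesh("untextured_model.obj")
    # vmapping maps the new vertices back to the originals, indices is the
    # re-triangulated face list, and uvs holds the generated texture coordinates.
    vmapping, indices, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
    xatlas.export("unwrapped_model.obj", mesh.vertices[vmapping], indices, uvs)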

Those neural rendering methods are very cool because they use light/color fields, but they still have a lot of catching up to do compared to modern 3D graphics.


I think what they mean by UV unwrapping is generating a UV map from a textured model (here the texture is generated by a triplane network).

It's interesting for compression purposes, but kinda orthogonal to this method (a good UV unwrapping can be applied once the model is generated).


I think this is a bit of a useless comment. Someone made a cool thing, and the reply is "I can think of a cooler thing"? I mean, yeah, I can always think of cooler things, but now we have this cool thing so saying that it's not as cool as some other thing that can exist doesn't add anything to the conversation.


Spoiler: The results are not high quality, at all.


Higher quality than what I can whip up in Blender!

But yeah, calling this high quality is quite disingenuous. I don't think this kind of mis-labelling is helpful or productive. The results are what they are, and a massive step forward from what was available/possible some years ago.


Still better than any previous AI models that are out right now. Probably only a few years away from something truly impressive.


Great, now we can get the unreleased code for this paper and use it with the unreleased code for generating animations (really impressive stuff by Sebastian Starke, presented at various SIGGRAPHs) and build a video game generator.

I wouldn't even be mad if it were a paid product and not free code; just release something to the world so we can start using it.


One step closer to a dream I have: describe sci-fi objects to Stable Diffusion, use the image to create a 3D object, and print that object on my 3D printer. All on my laptop. (Well, I have SD running at home now; I'll have to see how the code for this runs when it is finally released.)


I'm working exactly on this


Hecking man


https://github.com/nv-tlabs/GET3D

> News

> 2022-09-22: Code will be uploaded next week!

Not really that interesting at this point; the 5-page paper has a lot of hand-waving, and without the code to see how they actually implemented it…

…I’m left totally underwhelmed.

No weights.

No model.

No code.

The pictures were very pretty.

/shrug


Disclaimer: This work is done by some of my colleagues.

As someone pointed out, there are 25 pages of text (not including bibliography of course), not 5.

Most publications come with a multiple-month delay before code release (if any); here you literally have a written soft deadline of one week. So maybe you could wait a few days before posting such a negative comment?


That's very dismissive. The paper is 39 pages. Most of the details are in the appendix, which I think is fairly standard. I think they describe the network quite well (page 16).


You are barking up the wrong tree; NVIDIA labs have a history of releasing their code and models quickly.



