DreamFusion: Text-to-3D using 2D Diffusion (dreamfusion3d.github.io)
833 points by nullptr_deref on Sept 29, 2022 | 201 comments



The most incredible thing here is that this demonstrates a level of 3D understanding that I didn't believe existed in 2D image models yet. All of the 3D information in the output was inferred from the training set, which is exclusively uncurated and unsorted 2D still images. No 3D models, no camera parameters, no depth maps. No information about picture content other than a text label (scraped from the web and often incorrect!).

From a pile of random undifferentiated images the model has learned the detailed 3D structure and plausible poses and variants of thousands (millions?) of everyday objects. And all we needed to get that 3D information out of the model was the right sampling procedure.


As far as I understand from a quick read of the paper, the 2D diffusion model doesn't have a 3D understanding. It probably has some sort of local neighborhood understanding, i.e. small geometric transformations of objects map close to each other in the diffusion space (that's why, as with latent spaces, you can "interpolate" (https://replicate.com/andreasjansson/stable-diffusion-animat...) in the diffusion space).
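(A minimal sketch of that kind of interpolation: spherical interpolation between two embedding/latent vectors, in plain NumPy, with made-up stand-in vectors rather than real embeddings.)

    import numpy as np

    def slerp(a, b, t):
        # Spherical interpolation between two latent/embedding vectors; decoding
        # points along this path is what produces the smooth "morphing" animations.
        a_n = a / np.linalg.norm(a)
        b_n = b / np.linalg.norm(b)
        omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
        if np.isclose(omega, 0.0):
            return (1.0 - t) * a + t * b   # nearly parallel: plain lerp is fine
        return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

    a = np.random.randn(768)   # stand-ins for two prompt/latent embeddings
    b = np.random.randn(768)
    frames = [slerp(a, b, t) for t in np.linspace(0.0, 1.0, 30)]   # 30 in-betweens to decode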

But that's not really surprising, because when you have enough data, even simple clustering methods group objects like faces by the direction they are looking in. With enough views, even a simple L2 distance in pixel space allows t-SNE to do that.

They are injecting the 3D constraints via the NeRF and an optimization process that adds consistency between the frames.

It's a deep-dream-like process that optimizes by alternating updates for 3D consistency and updates for text-to-2D-image correspondence. It's searching for a solution that satisfies both constraints at the same time.

Even though they only need to run a single diffusion step to get the update direction, this optimization process is quite long: about 1.5 hours (but they are not using things like Instant NeRF, or even simple voxel grids).
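(For the curious, here is a minimal runnable sketch of the alternating scheme described above. This is my own toy illustration, not the authors' code: the "NeRF" and "denoiser" below are tiny stand-ins for the real differentiable renderer and the frozen pretrained diffusion model.)

    import torch

    # Toy stand-ins so the sketch runs end to end: a tiny "NeRF" mapping a camera
    # angle to an 8x8 image, and a frozen conv net playing the role of the
    # pretrained text-to-image diffusion model.
    class ToyNeRF(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.ReLU(),
                                           torch.nn.Linear(64, 8 * 8 * 3))
        def forward(self, angle):
            return self.net(angle).view(3, 8, 8)

    nerf = ToyNeRF()
    denoiser = torch.nn.Conv2d(3, 3, 3, padding=1)   # frozen, "pretrained"
    for p in denoiser.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-2)

    for step in range(100):                          # the real thing runs for ~1.5 h
        angle = torch.rand(1) * 6.28                 # random camera around the object
        image = nerf(angle)                          # differentiable render (3D consistency)
        noise = torch.randn_like(image)
        noisy = image + noise                        # re-noise the rendered view
        pred_noise = denoiser(noisy.unsqueeze(0)).squeeze(0)  # one denoising step
        # Nudge the render in the direction that makes the (frozen) denoiser's
        # prediction agree with the injected noise; gradients flow only into the NeRF.
        direction = (pred_noise - noise).detach()
        loss = (direction * image).sum()
        opt.zero_grad(); loss.backward(); opt.step()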

But this will allow the creation of a dataset of 3D objects with corresponding text, which will then make it possible to train a diffusion model that has a 3D understanding and can generate 3D objects directly with a single diffusion process.


The model clearly has an understanding of the 3D structure of objects. If it didn't, using it to generate 3D models wouldn't work. The knowledge that the leg bone is connected to the knee bone, etc, isn't coming from NeRF, it's all in the "2D" model. Sure, maybe you could distill that knowledge into a different model architecture that is somehow natively 3D in order to improve the efficiency of sampling. But that's more or less just an optimization. What's interesting is the fact that the knowledge is in there already, learned without 3D input data.

Humans don't get natively 3D training data either. We don't have depth sensors or any other natively 3D senses. Our eyes only ever see 2D images and we learn 3D structure from that, so I guess these 2D image models are doing something analogous. Except they don't even have the benefit of stereo images!


I think you're both right! It is incredible that the 2D model knows enough about the visual world to produce many objects from all angles, but the 3D model is essential for gluing these views together, and in some ways can fill in the gaps the 2D model doesn't know about. Imagine just taking a huge collection of photographs of an object. While there is enough information in those photos to reconstruct the 3D object, I wouldn't personally call that collection of images "an understanding of 3D." In our case, the diffusion model is the collection of photos and the NeRF model + optimization procedure is what figures out how all those photos can be related to a shared underlying 3D representation. - ben p (author)


>The model clearly has an understanding of the 3D structure of objects.

And the submarine clearly can swim! :D

The model encompasses 3D information, but a model is not a thinking entity able to understand anything. Well, it might be argued that saying the "model understands" is a metaphor, and used as such, OK, fine. But that's it. Analogies rarely scale well.

As for humans, have you ever heard of proprioception? If that is not a "native 3D sense", I have no idea what a 3D sense might be.


https://www.reddit.com/r/Damnthatsinteresting/comments/xrxez...

A machine with 3d understanding from 2d samples predates computers


> Our eyes only ever see 2D images and we learn 3D structure from that

I think that would be an oversimplification. We do have some 3D information from focus and eye convergence.


It's not an oversimplification. The extra information from convergence is negligible... our eyes derive virtually identical information when looking at flat 2D pictures of 3D scenes. Evidence of this is everywhere in pictures.


I'm sure people born with one eye are perfectly able to understand the concept of 3D


Dunno if you fully read my comment. But we're in agreement.


You're right, we are in agreement.


I think this is a misunderstanding of how these models work. The model does not understand anything at all. It's computing correlations. I could spend all day computing correlation without ever understanding what the correlations correspond to in the physical world. The correlations could still amount to a useful description of some physical phenomenon, but their interpretation requires much more than just the ability to compute correlations in some limited space.

If we elevate "correlations" to mean "understanding", we're quickly going to run out of words.


Your brain merely computes correlations. Does that make it any less intelligent?

At some point we have to accept that when you layer simple ops on top of simple ops enough times you get complex behavior.


The model is trained on discretized photons (pixels). Doing so may have encouraged the model to learn features related to depth.


Co-author here - we were also surprised :) The breadth of knowledge of the visual world embedded in these 2D models and what they unlock is astounding.


Any word about how Nerf -> marching cubes works? I thought that was still an open problem. Is that another discovery in this research paper?


just seems to work when the 3D model is simple and smooth


the breadth of knowledge of the visual word embedded in these lines and what they unlock is astounding

-the point


So I wonder if unusual angles that normally do not get photographed will be distorted? For example, underneath a table looking up.


They reapply noise to the potentially distorted image and then predict the de-noised version, like the originally rendered first frame. So the image is at least internally consistent for the frame (to the extent that the system generates consistency whatsoever).

The example with a squirrel wearing a hoodie demonstrates an interesting edge case: the "front" of the squirrel (with the hoodie over the head) shows a normal hooded face as expected, but when you rotate to the "back" you get another face where the hoodie is low over the eyes. Each looks fine in isolation, but in aggregate it seems like we have a two-faced squirrel.


Yes, this is often a problem. We use view-dependent prompts (e.g. "cat wearing sunglasses, back view") but the pretrained 2D model often does not do a good job of interpreting non-canonical views and will put sunglasses on the back of the cat's head (as well as the front).


> "cat wearing sunglasses, back view"

Bad prompt, missing implied antecedent/ambiguous subject...

You may want:

Back view of a cat which is wearing sunglasses, back view of a cat, but the view is wearing sunglasses, etc... I actually tried using projective terms from drafting books, and didn't get great results. Nor anatomicals either.


>Back view of a cat which is wearing sunglasses, back view of a cat, but the view is wearing sunglasses, etc... I actually tried using projective terms from drafting books, and didn't get great results. Nor anatomicals either.

In short: natural language is not good enough and you need a DSL. If only the last 60 years of language research had warned us of this.

Up next: English sentences are ambiguous and need context information to parse correctly. Machine learning community in shambles.


I mean... It kinda did. Sarcasm aside.

The trouble is all those darn uninitiated, trying to create a generalized oracle to map their unspecific ramblings to what they mean, to free them of having to actually communicate properly...

Actually, funnily enough, this intersects with philosophy in a way most programmers scoff at; but communication is frigging hard, and worse, detecting when someone is trying to get something across but just needs a nudge in the right direction to find the language to explain it is really damn hard.

I run into it every time I get a haircut. I have no idea how to speak their language, so it's always "Uh... A little off the top and rounded at the back, I guess?"


yep our fixed strategy for view-dependent prompting is silly and there is tons of room for improvement!
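(For anyone curious what a "fixed strategy" might look like in code, here is a tiny illustrative sketch; the thresholds and phrases are made up, not the paper's actual scheme.)

    def view_dependent_prompt(prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
        """Append a coarse view phrase based on camera pose (illustrative only)."""
        if elevation_deg > 60:
            view = "overhead view"
        elif abs(azimuth_deg) < 45:
            view = "front view"
        elif abs(azimuth_deg) > 135:
            view = "back view"
        else:
            view = "side view"
        return f"{prompt}, {view}"

    # view_dependent_prompt("a cat wearing sunglasses", azimuth_deg=170, elevation_deg=10)
    # -> "a cat wearing sunglasses, back view"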


Some of the "mesh exports" used as examples on the page actually show this, to some extent. Look specifically at the plush corgi's belly and the weird geometry on the underside of the lemur's book, and to a lesser extent the underside of the bedsheet-ghost.


It'll be delusions and guesses, rather than distortions.

It'll just make up some colours and geometries that don't contradict anything it already knows from the defined perspectives.

Or leave it empty.


You can see that in some of these examples e.g. "plush toy of a corgi nurse"


Yes, this is amazing. The reason I am a bit less surprised is that I've been playing with textual inversion in the last few days. Just from six photos of me, it can create portraits showing me from any direction. That means it has a very good 3D model of my face, and that information is fully contained in the model plus the CLIP embedding obtained by textual inversion.


Did we hit some sort of technical inflection point in the last couple of weeks or is this just coincidence that all of these ML papers around high quality procedural generation are just dropping every other day?


From the abstract: “We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss.”

This seems like basically plugging together a couple of techniques that already existed, turning 2D text-to-image into text-to-3D.


> This seems like basically plugging a couple of techniques together that already existed [...]

In his Lex Fridman interview, John Carmack makes similar assertions about this prospect for AGI: That it will likely be the clever combination of existing primitives (plus maybe a couple novel new ones) that make the first AGI feasible in just a couple thousand lines of code.


That was a great interview. I really liked his perspective on how close we are to having AGI. His point is that there's only a few more things we need to figure out and then it will basically happen.

I also liked the analogy he made with his earlier work on 2D and 3D graphics engines, where taking a few shortcuts basically got him on a path to success. For a while we had this "almost" 3D capability long before the hardware was ready to do 3D properly. It's the same with AGI. A few shortcuts will get us AI that is pretty decent and can do some impressive things already - as witnessed by the recent improvements in image generation. It's not a general AI, but it has enough intelligence that it can still make photorealistic images that make sense to us. There's a lot of that happening right now, and just scaling that up is going to be interesting by itself.


That's a great example that reminds me of another one: there was nothing new about Bitcoin conceptually, it was all concepts we already had just in a new combination. IRC, Hashing, Proof of Work, Distributed Consensus, Difficulty algorithms, you name it. Aside from Base58 there wasn't much original other than the combination of those elements.


Base58 really should have been base57.


Hello Stavros, I agree. When I look at the goals that base58 sought to achieve, (eliminating visually similar characters) I couldn't help but wonder why more characters were not eliminated. There is quite a bit of typeface androgyny when you consider case and face.


Yeah, I don't know why 1 was left in there, seems like a lost opportunity. Discarding l, I, 0, O, but then leaving 1? I wonder why.


I can only assume it was for a superstitious reason so that the original address prefixes could be a 1. This is the only sense I can make from it.


Billions of creatures with stronger neural networks, more parameters, and better input have lived on earth for millions of years, but only now has something like humans shown up. I fully expect AI to do everything animals can do pretty soon, but since whatever it is that differentiates humans didn't happen for millions of years, there's a good chance AGI research will get stuck at a similar point.


Nature has the advantage of self organisation and (partially because of that) parallelism, that's proved hard to mimic in man made devices. But on the other hand, nature also has obstacles such as energy consumption, procreation & development, and survival, that AI doesn't have to worry about.

I think finding a niche for humans has proved difficult especially because of those reasons, and AI can take those hurdles much easier.


Change arrives gradually, and then suddenly.

It takes nature thousands of years to create a rock that looks like a face, just by using geology. A human can do that in a couple hours. And then this AI can generate 50 3d human faces per second (assuming enough CPU).

It could be that an AGI is around the corner, as they say. We might not be machines, but are way faster than nature at reaching places. We don't have the option of waiting for thousands of years.


> This seems like basically plugging a couple of techniques together that already existed

as with a majority of ML research


True (I made such a proposal myself a few hours ago, albeit in vaguer terms). The thing is deployment infrastructure is good enough now that we can just treat it as modular signal flows and experiment a lot without having to engineer a whole pile of custom infrastructure for each impulsive experiment.


Isn't that what the Singularity was described as a few decades ago? Progress so fast it's unpredictable even in the short term.


Same as it ever was, scientific revolutions arrive all at once, punctuating otherwise uneventful periods. As I understand, the present one is the product of the paper "Attention is all you need": https://arxiv.org/pdf/1706.03762.pdf.


... that one has 52K citations, and the 2D-to-3D paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" has 1488 citations.

https://arxiv.org/abs/2003.08934


>as with a majority of ML research

Plus "we did the same thing, but with 10x the compute resources".

But yeah.


> This seems like basically plugging a couple of techniques together that already existed

Do this enough times and eventually the thing you have looks indistinguishable from something completely novel.


Time and time again these ML techniques are proving to be wildly modular and pluggable. Maybe sooner or later someone will build a framework for end to end text-to-effective-ML-architecture that will just plug different things together and optimize them.


I think this is what huggingface (github for machine learning) is trying with diffusers lib: https://huggingface.co/docs/diffusers/index

They have others as well.


Fascinating stuff! But who is working on the text-to-ML-architecture thing?


Cool stuff. But who is working on the text-to-ML-architecture thing?


They're AI generated, the singularity already happened but the machines are trying to ease us into it.


Scary fn thought!

and I agree with you!

And the OP comment is by the magnanimous/infamous AnigBrowl

You need to start doing AI legal admin (I don't have the terms, but you may - we need legal language to control how we deal with AI)

and @dang - kill the gosh darn "posting too fast" thing

Jiminy Crickets, I have talked to you about this so many times...


"You're posting too fast" is a limit that's manually applied to accounts that have a history of "posting too many low-quality comments too quickly and/or getting into flame wars". You can email dang (hn@ycombinator.com) if you think it was applied in error, but if it's been applied to you more than once... you probably have a continuing problem with proliferating flame wars or posting otherwise combative comments.


I think you also can get it by just having unpopular opinions. Hackernews used to be much more nuanced than it is today imo.


Uhm... did you even check my account age (and my old one is two years older)?


It’s really easy to get throttled for a single comment out of thousands.


> Scary fn thought!

I'm a kotlin programmer so it's a scary fun thought for me.


It has become clear since alphaGo that intelligence is an emergent property of neural networks. Since then the time and cost requirements to create a useful intelligence have been coming down. The big change was in August when Stable Diffusion was able to run on consumer hardware. Things were already accelerating before August, but that has really kicked up the speed because millions of people can play around with it and discover intelligence applications, especially in the latent space.


SD is open source (for real open source) and the community has been having a field day with it.


We've hit a couple of inflection points.

The number of researchers and research labs has scaled up to the point that there are now many well-funded teams with experience.

Public tooling and collaboration have reached a point where research happens across the open internet between researchers at a pace that wasn't possible before (Common Crawl, Stable Diffusion, Hugging Face, etc.).

All the techniques that took years in small labs to prove viable are now getting scaled up across data and people in front of our eyes.


I think DALL-E really kicked things into high gear.


My hot take is that we're merely catching up on hardware improvements that until recently went unutilized. There's nothing 'self-improving'; it's largely "just" scaled-up methods or new, clever applications of scaled-up methods.

The pace at which methods scale up is currently a lot faster than hardware improvements, so unless these scaled up methods become incredibly lucrative (not impossible), I think it's quite likely we'll soon-ish (a couple years from now) see a slowdown.


Maybe the NeurIPS deadline, which is coming up?


This was submitted to ICLR


Whose full paper submission deadline was also 2 days ago.

This should be further up than all the speculation about AI accelerationism. There's a very simple explanation why a lot of awesome papers come out right now, it's prestigious conference paper deadlines.


Partially coincidence, but also ICLR submission deadline was yesterday, so now papers can be public.


This has been going on for years. The applications are just crossing thresholds now that are more salient for people, e.g. doing art.


Conference season?


It’s called the technological singularity. Pretty fun so far!


This isn't what is usually meant by "technological singularity". That is an inflection point where technological growth becomes uncontrollable and unpredictable, usually theorized to be caused by a self-improving agent (/AI) that becomes smarter with each of its iterations. This is still standard technological progress, under human control, even if very fast.


It's basically when AI starts self-improving. I think this started with large language models. They are central to these developments. Complete autonomy is not required for AGI, nor therefore for the singularity.

Whatever it is, this is a massive phase shift.


It's not really "human controlled". It's an evolutionary process, researchers are scanning the space of possibilities, each with a limited view, but in aggregate it has an emergent positive trend.


THIS

WTF - the singularity is closer than we thought!!!


yay


In the Make-A-Video thread I said that things are getting more and more impressive by the day. I was wrong, because that was a couple of hours ago. They're getting more and more impressive by the HOUR.

I'm curious where this will end up in a year. Will it plateau? If so, when?


Huh, it's a pretty similar technique to what I outlined a couple days ago: https://news.ycombinator.com/item?id=32965139

Although they start with random initialization and a text prompt. It seems to work well. I now see no reason we can't start with image initialization!


> Sussman attains enlightenment

> In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

> “What are you doing?”, asked Minsky.

> “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.

> “Why is the net wired randomly?”, asked Minsky.

> “I do not want it to have any preconceptions of how to play”, Sussman said.

> Minsky then shut his eyes.

> “Why do you close your eyes?”, Sussman asked his teacher.

> “So that the room will be empty.”

> At that moment, Sussman was enlightened.

http://www.catb.org/jargon/html/koans.html


directly training a NeRF on a single image is a terribly unconstrained problem that would lead to a volume that looks bad when the viewpoint changes. the gist of render + use diffusion model to refine is a great idea though and core to our method! the details of how to use the diffusion model for this refinement was the challenge, but once we figured that out it... just worked :)


Very cool! Can we tweak your algorithm a little to be seeded with a real photo from a single viewpoint?


it's a good idea :)


"those who say it cannot be done should not interrupt the people doing it"


They said it could be done, and even said how...


The version that you proposed wouldn't have worked


Amazing! How long then until we get photorealistic AI generated 3D VR games and experiences in the metaverse?


Why the downvote? I wasn't being sarcastic, it was an honest question. I'm really impressed how far this technology has come, from GPT-3 two years ago to DALL-E and Stable Diffusion to Meta's text-to-video to this...


I was wondering the same thing in the other thread about text-to-video. Someone asked about 3D Blender models, which made me think about animating Blender models. Bang, now on this thread we see animated images… it does feel like we could get to asking for a 3D environment, putting on VR glasses, and experiencing it. And with outpainting, we could even change it in real time.

It’s totally sci-fi, and at the same time seems to be possible? I am amazed how even image generation evolved over the last year, but that’s just me daydreaming.


Or in-painting with AR glasses. Change things in the real world just by looking at it (with eye tracking) and say what you want it changed into.


Maybe because you said "Metaverse" (and to some extent "VR") making it sound like sci-fi nonsense. You could have just said:

How long then until we get photorealistic AI generated 3D games and experiences?


3D game and VR game (i.e. stereoscopic 3D game) is not the same experience.

But I agree that particular company is creating bad marketing around the term Metaverse. (But pretty decent HW.) NVIDIA has much better footing with their Omniverse. But in the end, we all know the thing will be built the way web browsers are today. USD + JS + WebRTC + WebXR can go a long way.


But it is the same tech


You probably want to use ML instead of AI then.


I'm guessing the post could be interpreted as "normie" and not HN curiosity ;)


It's funny that the authors are 'anonymous' but they have access to Imagen so obviously it's by Google.


A large portion of the ML community (rightly) discredits Google papers because:

- they rarely provide the data or code used so it's basically "i swear it works bro" research

- what they achieve is usually through having the most pristine dataset on the planet and is often unusable by other researchers

- other times they publish papers that are basically "we slightly modified this excellent open source paper, slapped an internal name on it and trained it on our proprietary dataset"

- sometimes they achieve remarkably little but their papers still get a shiny spot because they're a big name and sponsor all the conferences

- they've also been caught trying to patent/copyright ML techniques; disregarding that this is the same as privatizing math, these are often techniques they plainly didn't come up with

Also ever since OpenAI did their "we have to go closed-source for-profit to save humanity" PR campaign, every company that releases models that can achieve a large amount in NLP/CV gets dragged by the media and equated to Skynet.


A ton of the big advances in AI that the community has benefited from have come from published Google research.


Notably, the transformer architecture, the basis of the last few years of ML breakthroughs, came out of Google.


IMO the biggest algorithmic advances made by Google, such as the transformer, have greatly pushed the field forward. The giant models that will have similar variations released in the next 3 months actually aren't that important on a conceptual level, except as a PoC.


While the transformer paper is important, that was one paper in the ocean they put out every year.

Also, it's one of the only papers they put out that falls completely outside what I listed above. They released everything about it, including the model code, pretrained weights, and the techniques; and it took quite a while for the model to "catch on" while it was peer reviewed and reproduced by others.

Something something broken clock


The full author list is on the updated link at: https://dreamfusion3d.github.io/


This is par for the course - there have been other instances where an 'anonymous' paper mentioned training on a cluster of TPUs that weren't publicly available yet - dead giveaway it was Google.


Dead giveaway... Dead giveaway...


> Paper under double-blind review

Once the paper is accepted (or rejected) the names may be revealed.

Though, in reality, the reviewers can often easily tell who wrote the paper.


Lots of reasons to stay anonymous besides hiding what org is behind the paper. Maybe they don't want to be kidnapped by the North Koreans and forced to produce new paeans with "lost footage" to Kim Il-Sung.


Can someone explain what's going on in this example from the gallery? The prompt is "a humanoid robot using a rolling pin to roll out dough":

https://dreamfusion-cdn.ajayj.com/gallery_sept28/crf20/a_DSL...

But if you look closely, the pin looks like it's actually rolling across the dough as the camera orbits.


The rolling pin is above the table but the shading is wrong because they don't render shadows.


Hi! Ajay here. Correct, our shading model doesn't compute intersections, since that's a bit challenging with a NeRF scene representation.


Super interesting work. Do you think that's a solvable problem and something you'll work on?


I think what's happening here is that the flat-looking table is actually raised up in the center, in the shape of something like a smooth pyramid. There's dough painted on both sides of the rolling pin, but because of the curvature of the "table" you only see each side's dough when the camera is on that side of the pyramid.


Correct link with full demo: https://dreamfusion3d.github.io/


That link also has a link to the authors and the paper preprint.


Changed now. Thanks!


As someone who went to college for 3D animation in *1997* AND DESIGNED the datacenter for Lucas' Presidio complex..

whereby learning that Pixar was developed by Steve Jobs when Lucas didn't think there was a future for computer animation... and so Steve bought the death star from Lucas...

That became pixar...

AI is going to fucking kill it - what will happen in the next decade will be ANYONE uploading a script to an AI to make a full length movie...

AND there will be editing tools as well that are AI driven...

Like mentioned by William Gibson

*The future is here, it's just not evenly distributed yet*


I think this will come, but it won't be competitive with filmmaking in a decade, not without real superhuman AGI. What's much more likely is efforts to reproduce performances will result in deep uncanny valley stuff - and correcting this last few percent for accuracy / weirdness will take a long time. Photorealistic video rendering is inevitable, voice duplication is already here (Tortoise should be much more well known but our culture is fixated on the visual side of this tech - https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUN...). But generatively creating a performance which accurately interprets the emotional context of a scene, in interrelationship with other characters doing the same? I don't see even a path towards that without AGI. At best you could crib specific performance elements and map them onto new models. My intuition is that to get all the way to generative movie you need AGI (or Zimbos so convincing we can't tell the difference).

So what we will get in coming years / decades - the timeline is anyone's guess - is movies acted out to camera in a rehearsal space, and that combined with a(n AI-augmented) script to generatively create a film (or a character-driven interactive game).


It's going to be worse than that. I'll just write a summary and an AI will generate the full script. Then the movie will be generated from the script. The full source code and assets for a video game too.

All the primitive components for this future seem to be at an early stage of inception. We can't say for sure whether they will mature to the point where they can replace us, but the trajectory is certainly pointing in that direction.


> AI is going to fucking kill it - what will happen in the next decade will be ANYONE uploading a script to an AI to make a full length movie...

Nah. These techniques will definitely lower the barrier for making stuff (not just movies), but that has been the case with all transformative technologies.

Before computers, if you wanted to shoot and edit a movie, it was a challenge. Now, you can shoot a movie with your pocket computer and edit it on the same device while shitting. Upload it to video sharing web and billions of people can watch it.

This class of technologies will enable creative people to make a lot of stuff, to iterate quickly. But don’t be naïve that everybody will do that. 99% of that will be trash, and that is fine. I like that it will enable individuals to bring their visions into this world without any need for collaboration. And when highly artistic individuals will begin to collaborate using these tools, that will be an awesome inflection point for art as we know it.


We're quickly approaching HNFusion: Text-to-HN-Article-That-Implements-That-Idea ...


This sounds like something that could be made to work with stable diffusion if someone just implements the code based on the paper.


Give it a week or two…


This is crazy good - most prior text-to-3d models produced weird amorphous blobs that would kind of look like the prompt from some angles, but had no actual spatial consistency.

Blown away by how quickly this stuff is advancing, even as someone who's relatively cynical about AI art.


Awesome! I wonder how long it will be until there is an open source implementation compatible with Stable Diffusion


Coincidentally came out the same day as Meta's text-to-video. I wonder if Google deliberately held out the release to make a bigger impact somehow?


I think it’s because of ICLR deadlines.


nvidia also released GET3D[1] a few days ago. research seems to be heading towards similar goals.

[1]: https://github.com/nv-tlabs/GET3D


Would they publish it anonymously? I'd bet they'd want to take credit somehow.


Someone posted the "correct" URL that has names: https://dreamfusion3d.github.io/


hi folks, ben p from the dreamfusion paper here. happy to answer qs for the next ~hour!


Curious about how long it took: brainstorming, research, hypothesis, work, iteration, bug fixing, writing, etc.

I'm curious about the process you and the team uses to do this. Additionally there's the meme that these things are appearing every hour now, so it could be good for some perspective like "well actually it took n weeks"


Great question! Our team has been working on text-to-3d for ~1.5 years starting with https://ajayj.com/dreamfields. We had hoped that we could swap the contrastive CLIP model in Dream Fields for the generative Imagen model and crank out an easy paper in a few weeks. But what was supposed to be an easy win turned into months of frustration. Nothing we tried worked any better than Dream Fields. After a long detour trying MCMC, we stumbled across the score distillation loss that powers DreamFusion. Going from an initial sign of life to the results you see today still took months of hard work.

Research progress is unpredictable and these advances are not inevitable. We have the privilege to work in an environment full of amazing colleagues and powerful models, but at the end of the day it took a persistent team and a bit of luck.


Could the NeRF be replaced with a voxel grid, backpropagating to the voxel color values directly? Or is there a reason that wouldn't work?


should work, and there are tons of new differentiable mesh and volumetric representations to try!


Is there code or notebook for this paper?


not yet, but see the appendix of the paper for pseudocode. the core update step from the diffusion model that powers dreamfusion is surprisingly simple and easy to implement.


What does this mean for our understanding of intelligence?

It trivializes it, in my opinion.

When asked the question of whether LaMDA/GPT-3 and/or DreamFusion and its derivatives are an aspect of sentience, there's always a bunch of people repeating the same cliche negative line of "no, it's only attempting to statistically mimic sentience." I agree with the reasoning.

But have we considered the other side of the story? That yes, the mimicry is all sentience actually is. Nothing more.


AI is getting quite good at a lot of things humans consider fairly difficult (like this example) but has made less progress at things humans consider fairly easy (e.g. maintaining consistency across paragraphs of text for GPT3, navigating 3D space, learning from small numbers of examples). That suggests that there is still a gap between what current AI approaches are doing and what the human brain is doing that doesn't just come down to throwing ever more data at the problem.


You just picked an arbitrary gap though. And it seems like a small gap that's crossable.

For example, just 2 weeks ago there were two other gaps that you could've used in your example. You could've said 3D interpretation of images wasn't possible and the creation of animated movies wasn't possible, and that those few things suggest there's a gap between what the human brain is doing and what the AI is doing which just throwing more data at the problem doesn't fix.

Those two examples would be irrelevant today, as both of those gaps have effectively been crossed.

See what I'm saying here? There are two ways of looking at it, even from your perspective... either that gap is so large that the human brain is completely different, or the gap is small, trivial, and will be crossed very, very soon.


I'm saying something different, that the most impressive examples of AI breakthroughs are doing things that humans find hard / are bad at. Meanwhile there are many things that people find easy / do without thinking / can be done by dogs or very young children that AI struggles with.

It suggests to me that what most current approaches are doing is something fairly different from what human / animal intelligence is doing in important ways. That means we will likely continue to see AI do increasingly amazing things while at the same time struggling to perform a lot of tasks that are quite basic for humans.

It is the fact that AI is proving to be a better artist than most humans while not being able to do many things that are simple for a 4 year old that suggests strongly to me that some of the fundamental mechanisms are fairly different still, or current AI approaches are missing some key insights.

I could be wrong. I'd bet money that I'm right if there was an easy way to do it though.


You will find that GPT-3's writing is consistently, on all fronts, better than what a 4-year-old can write. Even the strange inconsistencies and lack of awareness in the writing are better than what a 4-year-old can do.


Every one of these AI improvements, from DreamFusion to Meta's introduction, moves the needle closer and closer.

If someone asked this question when GPT-3 came out there'd be thousands of negative retorts throwing out the same tired lines. Now this statement is getting harder and harder to refute.


The thing that frightens me is that we are rapidly reaching broad humanity disrupting ML technologies without any of the social or societal frameworks to cope with it.


I'm usually not a fan of this general hand wringing / fear mongering around ML that a lot of people with too much time and not enough STEM background constantly bring up.

Stable diffusion has been made available to the public for quite a while now and if anything has disproved a lot of the ungrounded nonsense that made companies like OpenAI censor their generative models.


I don't think you understand.

I watched this humor group when I was little (think 30 years ago). Usually they were inoffensive, but they had this particular sketch where a (man cross-dressed as a) woman complains about her husband hitting her... and you can see that one of her eyes is black. With canned laughs during the whole thing.

This used to make people laugh. 30 years ago. Then we had social changes and gradually our perspective as a society shifted, and this humor piece looks ... grotesque.

That's how I interpret the "societal changes" the OP is referring to. We have shiny new things every week and there's no time to adapt to them.


SD has been out for about a month and it is first gen technology. If "quite a while" was 5 years, then sure, I'll agree.

But we are one month into the experiment, and the technology still struggles to get wholly authentic looking media. I'd say it would be wise to give it a few years before claiming victory.


>I'm usually not a fan of this general hand wringing / fear mongering around ML that a lot of people with too much time and not enough STEM background constantly bring up.

What's with the STEM reference? Are you implying that STEM is related to intelligence and that people without a STEM background are not intelligent?

It is well known among academics (AKA STEM MAJORS) that human society is a chaotic system and that ML can change society for the better or for the worse... the outcome is basically unknown... It is therefore the intelligent choice to consider the negative consequences of this technology.

To not consider the other side indicates a lack of something.


That's exactly what he's saying. How many citations does the top cited paper from the LessWrong community have?


There were bigger disruptions in the past. The telegraph, railroads, explosives. "The Devils" by Dostoevsky is a great fictional account of what all these technological disruptions do to the fragile social order in the late 19th-century Russian countryside. All of a sudden all these foreign people, ideas, technology, and commerce start streaming in to these once-isolated communities.


@dang The link should be updated to https://dreamfusion3d.github.io


Is there code for any of these models? Or a collab? Ajay Jain's colab doesn't work, but I would love to see a colab for this.


Silly replying to myself I know, but I had more thoughts. I'm an architect for 3D worlds and I am desperate, lol, for this kind of tool. I use both blender and grasshopper, but I use midjourney to think and prototype all the time. Obvious but it would be astonishing to have something like this for game worlds. I used another version of this to create "a forest emerging from an aircraft carrier" https://www.instagram.com/p/CiRfXKzpnLC/ but the technique didn't have good resolution yet (high fidelity).


You should try the AUTOMATIC1111 version of stable diffusion. It's crazy fast and has great results - https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...


I'm mostly only looking for 3DML these days. I want it to hallucinate architecture for games.


Hi @sirianth, this is Ajay. Are you talking about the https://ajayj.com/dreamfields colab? Feel free to dm me on Twitter (@ajayj_) or email me if you're facing bugs. Dependencies keep shifting, really should have pinned versions originally...


Yes, exactly, I was trying to fix the dependency problems in the colab and I couldn't... I'll hit you up on twitter.


My guess is we have to wait a good 3 months before someone makes an open-source version.


:)


As someone who dabbles in 3D modeling, this is going to be an incredible resource for creating static 3D objects. Someone ought to come up with a way to convert to mesh better than the marching cubes algorithm I've seen applied to most NeRFs. The models still lack coherent topology and would probably be janky if fully rigged.


With smooth enough geometry, converting NeRFs to meshes with marching cubes works pretty well. Would you say the topology of the meshes on our website is still too incoherent for rigging?


In case of a communication gap: the word 'topology' has a more domain-specific meaning in animation and rigging compared to the mathematical one.

It's used to mean that the placement of the lower-level components - vertices, edges, and faces - is well aligned to the higher-level structure of the object, making up a well-defined 2D grid that flows along the model's surface. In particular you'd want edges going along/perpendicular to mesh structures such as limbs, and around facial features and other details in a logical manner.

Otherwise, when applying deformations as part of an animation, the model will not have the detail in the right places to still look good; e.g. if there are no edges perpendicular to a joint in a limb, the bent version of the limb cannot have a clear smooth line along the joint, and the edges and faces become janky.

Under this definition, marching cubes cannot produce good 2D topology, as the mesh features are all aligned to the cardinal grid instead of the features of the object represented.


Aha thank you, this is helpful. Agreed there is much research needed to get this working but hopefully not too far off: https://twitter.com/kkpatain/status/1575758085821706240


I swear I recently saw something related to generating clean topology procedurally. Wish I could remember where.


So does this mean I can use DreamBooth to create plausible NERFs of myself in any scenario? The future is looking weird.


Nah. This is made by Google, so it'll never become useful.

You'll have to wait a few months for someone else to replicate it.


Someone did implement DreamBooth for stable diffusion so I was imagining, like you say, this (implemented by someone else) in a couple months with DreamBooth + stable diffusion


Fun that they had an octopus playing a piano.

I made the same thing the old fashioned way. Mine can actually play though. https://twitter.com/LeapJosh/status/1423052486760411136 :P


Text to 3D animation is the obvious next step though.


Cool.

The samples are lacking definition, but they're otherwise spatially stable across perspectives.

That's something that's been struggled with for years.


This is getting asymptotic.


Progress often happens in waves. There will be a trough again.


Seems a bit like a tsunami currently. But I wonder how we'll think about it 10 years from now.


AI might be different - as has been predicted for many years now - due to the compounding effects on intelligence.


I'm not sure this is quite that, if you will.

I'm also not sure someone couldn't cleverly figure out a way to use stable diffusion to write code based on a text prompt.


I'm not even going to pretend that I have a clue about how this is done. But I'm wondering if the output can be turned into 3D objects that can be used in any 3D modeling software? It would be a game changer for real-world product development, in terms of both speed and ease.


Well, they even have a "download model" option, so yep, you definitely can. I wouldn't think of this as an amazing panacea though, since once everyone has access to it, whatever made assets like this valuable before will suddenly be dirt cheap for all, and thus actually a net negative for people in that industry. Just saying and warning, not to be a bummer.


Thx, I went back and saw that I missed this the first time:

"Mesh exports Our generated NeRF models can be exported to meshes using the marching cubes algorithm for easy integration into 3D renderers or modeling software."

Like they say, "This is the start of something big."
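(For anyone curious what that export step looks like mechanically, a rough sketch below, assuming scikit-image and trimesh, with a placeholder density function standing in for the trained NeRF; this is not the authors' pipeline.)

    import numpy as np
    import trimesh
    from skimage import measure

    def nerf_density(points):
        # Placeholder for querying the trained NeRF's density field at xyz points;
        # here just a sphere of radius 0.5 so the script runs end to end.
        return (0.5 - np.linalg.norm(points, axis=-1)).clip(min=0.0)

    # Sample the density on a regular grid over the scene bounds.
    n = 128
    xs = np.linspace(-1.0, 1.0, n)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    density = nerf_density(grid.reshape(-1, 3)).reshape(n, n, n)

    # Extract an isosurface with marching cubes and write out a mesh.
    verts, faces, normals, _ = measure.marching_cubes(density, level=0.01)
    trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals).export("object.obj")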


FWIW, there's still a pretty big gap between a single static mesh and something that is a usable asset, say in a game. Maybe this could provide a shortcut for a modeler to get started, but it still is going to take a lot of skill from that point.


These are getting too good, too fast.

I'm excited and scared. The world is going to look very different in 10 years!


Futurists have been predicting when we'll have stable fusion for decades, but now we suddenly got stable diffusion working. That's good too, not what we wanted, but good. We're gonna need stable fusion or other renewables to run stable diffusion though. /s


Pretty neat, wish I could try it out ( maybe I missed a link). Obviously has interesting / novel uses, but kind of reminds me of the previous discussion of upscaling audio recordings to the “soundstage” format. I doubt most 2d images want to be 3d ;)


Unclear to me what is going on, but there's another URL that lists the authors' names. Given it's possible this change was done for a reason, I'm not linking to it, but it strikes me as odd that it's still up. Anyone know what's going on, without causing problems for the authors?


This link was from OpenReview which must be anonymous (double blind). The full author list is on the updated link at: https://dreamfusion3d.github.io


Aware of the link, though you have not provided any clarification for why there are two links; strikes me as odd that, if the authors are trying to post it anonymously, a simple Google search finds the authors' names.


I believe ICLR guidelines require the authors to submit papers and any supplementary materials (including links to webpages, videos, etc) without identifying information, but authors are not barred from public announcements on other forums. IIUC, the idea behind this policy was originally to accommodate author freedom to engage in common practices such as simultaneous submission to arxiv (which identifies the authors). To respect the double blind spirit of review, reviewers are asked not to actively search the web in attempt identify the authors. In the past, when social media promotion was less common, it was reasonably likely that reviewers would follow this guidance and would not have seen the arxiv submission, preserving the double blind nature of review in most cases. However, the use of social media in academia has radically changed in recent years, as more researchers use social media to keep up with the latest advancements, so promotion of papers in submission on platforms like Twitter can offer significant advantages to authors.

So, authors today often submit anonymously following the conference guidelines, but simultaneously post publicly elsewhere, walking a fine line as not to overstep the conference policies. This appears to be the case for this submission. Note, recently, some conferences, such as CVPR, have started to institute new policies forbidding social media promotion until acceptance, as they adapt to the changing landscape of social media promotion. If this were a CVPR submission, the authors would not be allowed to tweet publicly about their work yet, nor have the version of the webpage with their names visible.


Thanks, appreciate the explanation - maybe it's me, but I found it odd and didn't know what to make of it. If they're not against authors posting with their identity public elsewhere, they should just have a button you can click, a warning, and then you get to see the authors.


This is like magic to me. The pace at which we are getting these tools amazes me.


Oh my god, we are done for


Gives a new perspective on a classic verse:

"For he spoke, and it came to be;

he commanded, and it stood firm."

Psalm 33:9, NIV

:)


How so?


You give a prompt and it makes the thing.


Not really though.


btw guys stable diffusion img2img consistently applied frame-by-frame will get us some insane CGI for movies yo

"transform this into this realistically"

ILM's holy grail
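(A naive per-frame version is easy to sketch with Hugging Face's diffusers library, roughly as below; note the image argument is called init_image in older diffusers releases, the frame paths are hypothetical, and applying it frame-by-frame like this flickers badly without extra temporal-consistency work.)

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "the same scene, rendered as a photorealistic alien jungle"
    frames = [Image.open(f"frames/{i:04d}.png").convert("RGB") for i in range(240)]

    for i, frame in enumerate(frames):
        # Low strength keeps the source frame's structure; a fixed seed per clip
        # reduces (but does not remove) frame-to-frame flicker.
        generator = torch.Generator("cuda").manual_seed(42)
        out = pipe(prompt=prompt, image=frame, strength=0.35,
                   guidance_scale=7.5, generator=generator).images[0]
        out.save(f"out/{i:04d}.png")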


Is it a light version of script when the AGI comes fast


Is code available?


But it seems I can only generate models from predetermined inputs. When can I submit my own inputs to create a video?


Is there an API for using it myself?


I don't see a person in the gallery. Is it capable of generating a 3D model of me from only a photo?


That is so amazing! Next up, it puts a skeleton in them and animates :o


Anonymously authored research is very ominous.


The full author list is on the updated link at: https://dreamfusion3d.github.io/


Source?


Url changed from https://dreamfusionpaper.github.io/ to the page that names the authors.



