NeRF in the Wild: reconstructing 3D scenes from internet photography (nerf-w.github.io)
218 points by tambourine_man on Aug 6, 2020 | 124 comments

I recall many years back a website, I think a Microsoft project, that linked together photos in a 3D space of tourist destinations. It created something of a point cloud, but nothing this advanced. You could click through the points/photos to jump into each photo's perspective of the space.

Anyone remember what that project was called, or if it is even still around?

Edit: Found it. http://phototour.cs.washington.edu/ It later became Photosynth, since discontinued. https://en.wikipedia.org/wiki/Photosynth

Watching this demo makes it very stark how much our single page webapps are regressions in fluid performance.

Is this not a Silverlight app or something?

I highly doubt that demo was JS / could run in multiple browsers without some proprietary runtime.

It was, in point of fact, a Java applet. [0] Older, less secure, but more powerful.

[0] https://web.archive.org/web/20191231213153/http://phototour....

After watching that video: Why did no one build the application that tied your personal photos with a global database of all the other public photos taken in the same place? Seems like it would have been an amazing application.

yes ... impressive TED talk back then and even now

Original author here. AMA!

I had this vision that one day we'll be able to reconstruct memories from our past by taking old photos and having a ML model collate everything together to form a 3D rendering of that point in time. It seems like you have gotten most of the way there.

The next step would be to have the user grab a VR headset and immerse themselves in their favorite childhood moment. One could even add avatars for loved ones using again ML-generated audio based on recordings of their voices.

Your project made me think that it wouldn't be that interesting for me to view your memories, so perhaps the best initial step for a proof-of-concept that would allow the technology to mature would be to recreate historical moments so everyone could relive them – and they could do so entirely virtually, from the comfort of their own couch. Side note: it feels like this technology can disrupt traditional museums with the added bonus of being pandemic-proof.

Anyway, I don't really have a question... Just wanted to compliment you on this amazing work and throw this idea out there in case others want to think about it, as I'm in an entirely different field and don't have the skills and resources to make it real, and I do strongly feel this will inevitably come to life.

That's a really cool idea! This technology does a fantastic job at reconstructing static scenes. The moving objects -- people, cars, even flora -- are out of scope here. Why? It's really hard to build a 3D model of something you only see from one direction.

Anyone remember this scene from Enemy of The State (1998)? https://youtu.be/3EwZQddc3kY?t=45

Jon Voight: "Can the computer take us around to the other side?"

Jack Black: "It can HYPOTHESIZE"


I was amazed at the scene at the time, and thought it was unbelievable. But then I read they had NSA advisors and maybe the US govt might have had access to some sort of primitive photogrammetry at the time?

But if we know what a car ought to look like in 3D, can't we take the one photo we have from one direction and just fill in the blanks with that a priori 3D knowledge?

Similar to how GPT-3 can be applied not only to create Text, but also fill in missing pieces of Images (ie. complete the missing half of a face).

Would the logical next step be to use GPT-3 to create a 3D world? :)

GPT-3D rolls off the tongue nicely

Getting there. This is one part of the puzzle: https://arxiv.org/pdf/2007.11965

I pitched a similar idea in an interview years ago (http://www.wearegamedevs.com/2016/01/20/scott-anderson-rende...) with the added complexity of forward simulating past events with different choices. I was asked what I would do with infinite time and money though. To this day my elderly parents tell me to quit my job and work on this idea :-D.

I do believe that past time travel to reconstructed and recorded events will be one of the stickiest use cases for VR.

It's funny you say that... https://news.ycombinator.com/item?id=19529921

We're hiring too. Looking for an engineer with some game dev experience.

I wish I had a more substantive comment beyond “wow” but this is really impressive. I’ve wanted something like this for a long time.

Hah! Loved the interview, thanks for sharing.

"I Built a REAL-LIFE Time Machine! " by Lucas Builds The Future https://www.youtube.com/watch?v=aHyNYfFfXlg

Can this technique reconstruct good geometry for the visible parts if only part of the structure is ever imaged?

That's a precursor to: could this technique be used to enhance Street View? There are times when I would really like to be able to "walk around" outdoor scenes in finer steps. Current Street View smears between photos taken some distance apart. (I don't know if that is a limitation of the public interface or if the original data capture is really that coarse.) It would be nice to have a real 3D space to explore, but I certainly don't expect the un-imaged parts to be defined correctly.

Finally, does this also work for reconstructing interior spaces seen from the inside? Like the geometry of a cave, from pictures of the cave interior?

At this point, this method is only good at reconstructing parts of the scene that are well-photographed. You'll notice that our video for Sacre Coeur has some blurry bits, particularly the staircase in front of the Basilica. That's because we learn to reconstruct what was seen, but aren't yet able to imagine what wasn't!

Does this work in reconstructing indoor spaces? Give it a shot and find out!

> Does this work in reconstructing indoor spaces? Give it a shot and find out!

Have you released the code? (Or did you mean I could try re-implementing the work you published from the paper? That's a reasonable response too.) I didn't see a link to source in the github.io page or in the arXiv paper. The only source code link I saw was to https://github.com/bmild/nerf which I thought was earlier work than this paper.

Our code is not released yet. If your data is captured without occluders and in RAW format, you don't need the enhancements we propose :). NeRF can do amazing things with clean data!

I just saw the NeRF demo video, it's amazing. If its results are this good, why is photogrammetry software not basically perfect yet? It looks like they can generate models with vast amounts of detail.

Not the author (but I read the previous papers and the code), the simple answer is that it's still very costly in processing power. Think a few hours on good hardware for a set of pictures.

Note that the results are still very impressive imo, this is still early research phase.

NeRF was first published in March! Give us time :)

Thanks so much for taking the time to do this impromptu AMA. The excitement over the tech is clearly palpable :)

Love how you explained this! And I thought I was the only one using exclamation marks for everything

Great work! I’m an ex googler currently working on a farming robot we hope to make open source. I’m particularly interested in neural reconstruction of plants in a field. I want to capture the 3D structure of the plants as well as semantics like plant species. I’ve found that normal photogrammetry produces poor reconstructions due to movement of the plants in the wind.

We’re potentially about to start a non profit and formally kick off our whole robot as open source. I’m interested in finding research partners who would like to help produce a research paper on 3D reconstruction of plants. I can produce a high quality geo located dataset with 2cm accurate GPS tags, but I have no experience with neural rendering. This is work I want to do over the course of the next year.

Do you know anyone interested in helping with that kind of work? Thanks!

I'm a bit new to the field myself, so I'm afraid I can't provide any contacts. Ask me again in a couple of years.

The method presented here wouldn't do well with your problem either. 3D reconstruction of moving objects is an unsolved problem!

Indeed. I am seeing some generative approaches that know what the object should look like in 3D and use that knowledge to imagine a model that matches a photo. I think such a technique would be useful for good approximation of plant models. Such a project would require some new datasets I would think, but seems like a good approach.

Just came across this which is neat: https://github.com/AljazBozic/DeepDeform

Also this might be useful: https://github.com/paschalidoud/hierarchical_primitives

For now I’m just beginning to collect data but I hope to contribute more to the field in time!

Looks awesome! I'm not an ML guy and haven't read the paper, just watched the video - one thing isn't clear to me from it: is this fully automatic/unattended, you just throw images into it and out come magic rainbows of 3D structures? Or do you need to somehow help it, e.g. to disentangle the structure from the "transient" elements? In other words, I don't really understand what the "Appearance Embedding" even means... Or is the "input" that you mention in the video fed into a model that is already trained on a set of photos of a particular scene? I.e. the "input" + "appearance embedding" basically encodes just a choice of a framing & "atmosphere/lighting"?

It's a little hard to describe from scratch, but let me do my best.

The method is unattended, in the sense that it's photos + camera parameters in and scene representation out. The photos should all be of the same scene (e.g. the Trevi Fountain). Once you have a scene representation, you can ask what the scene would look like from new camera angles with your choice of lighting.

Choosing camera angles is straightforward. You tell me where and what direction the camera is facing. The question then becomes, how do you specify your choice of lighting? The answer is, you can't do so directly. Instead, you provide a picture with the lighting you want, and with a little magic, we can find a way to imitate that lighting. The way we do is by finding a corresponding "appearance embedding" via numerical optimization.
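To make "finding an appearance embedding via numerical optimization" a bit more concrete, here is a toy sketch in Python. It assumes a frozen renderer that maps an embedding to pixel values, and it nudges the embedding until the render matches the reference picture. The names `fit_embedding` and `toy_render`, the finite-difference gradient, and the step sizes are all illustrative assumptions, not the authors' code; a real system would presumably backpropagate through the trained network instead.

```python
def toy_render(embedding):
    """Hypothetical frozen renderer: a fixed scene whose per-pixel
    brightness is shifted by the embedding. Stands in for a trained model."""
    base_pixels = [0.5, 0.5]
    return [p + e for p, e in zip(base_pixels, embedding)]


def fit_embedding(render, target, dim, steps=200, lr=0.1, eps=1e-4):
    """Gradient descent on the embedding only; the renderer is never changed."""
    def loss(emb):
        return sum((p - q) ** 2 for p, q in zip(render(emb), target))

    emb = [0.0] * dim
    for _ in range(steps):
        base = loss(emb)
        grad = []
        for i in range(dim):
            bumped = list(emb)
            bumped[i] += eps
            # Forward finite difference approximates d(loss)/d(emb[i]).
            grad.append((loss(bumped) - base) / eps)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# Reference picture with the "lighting" we want to imitate.
reference = [0.8, 0.3]
embedding = fit_embedding(toy_render, reference, dim=2)
```

Once fitted, the same embedding can be reused while rendering from any other camera angle, which is the point of keeping it separate from the geometry.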

What is the precision required (or used in your datasets) for camera position and angles? Is the geotagging in the images from common cellphones and smart cameras enough? Were they back-calculated using some other method from non- or poorly-georeferenced images?

It's hard for me to say how precise camera position and direction needs to be. We use COLMAP to estimate both via multi-view stereo.

Why did you use neural networks? There are faster techniques in analytical geometry that can extract surface contours from color gradients from images, and they do this faster and directly.

1) My bread and butter for the last 10 years has been machine learning. When all you have is a hammer...

2) We don't extract surface contours, we learn a volumetric radiance field! To oversimplify, we learn a (smooth) function that, given a position in space, produces the differential opacity and color at that space. To render an image from a camera viewpoint, we approximately integrate along rays emitted from each pixel of the camera.
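For readers who want to see "approximately integrate along rays" spelled out, here is a minimal sketch of the usual front-to-back compositing rule for a volumetric radiance field, in plain Python. The function name `render_ray` and the sample values are assumptions for illustration; in the real method the densities and colors come from the learned function rather than hand-written lists.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Composite samples along one camera ray, nearest sample first.

    sigmas: density (differential opacity) at each sample
    colors: (r, g, b) tuple at each sample
    deltas: length of the ray segment each sample represents
    Returns the pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        # Opacity of this segment under the exponential absorption model.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return tuple(pixel), weights


# Empty space, then a nearly solid red surface, then something green behind it.
pixel, weights = render_ray(
    sigmas=[0.0, 1e9, 5.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0, 1.0],
)
```

The green sample contributes nothing because the red one has already absorbed all the light, which is how occlusion falls out of the integral without any explicit surface.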

Check out NeRF and our paper to learn more about this representation!

Neural networks are better compared to classical methods.

One of the best non-classical methods is this one (https://grail.cs.washington.edu/projects/sq_rome_g1/), and our method significantly improves upon it. We do not compare directly with it, but Neural Rerendering in the Wild does, and we improve upon it.

these nerf models are like 5MB large and have a ton of directional lighting support. speculars, caustics, refraction, mirrors, you name it!

also they're way higher quality than traditional techniques

Random thought:

Someone made a “camera” which tracks location & direction, and “takes” a picture by selecting the closest picture found taken from that spot & angle.

Use this new tech for a next generation of that “camera”, generating the Platonic frame which should occur from that location.

Then instead of a camera, display it in VR goggles. Allows you to walk around and see a landmark without all those pesky people ruining your view.

You can use the PhotoSynth tech described above.

this looks really cool. I'm not an ML chap, but always wondered: Can these kinds of algorithms also give you dimensional data? For example, can I 3D-print one of these models with any accuracy?

For that, you'll need to convert the representation we have (volumetric radiance field) to one your 3D printer can understand (a mesh?). The NeRF authors use the marching cubes algorithm to do just that. Check out their website: https://www.matthewtancik.com/nerf.

how many pictures or angles are needed to produce good results? I get that landmarks have an abundance of source material, but what's a reasonable amount of data to reconstruct scenes?

On the order of hundreds to low-digit thousands worked well for us. These photos contain a lot of occluders like tourists, and we needed to have enough views of the subject in question to build a good 3D scene representation.

can you elaborate on the key variables for the data? for instance, is it safe to assume 360 photos from the same angle would yield a worse model than 1 photo from 360 different angles?

what does the ideal minimal data set look like (eg, 5 photos from each 15-degree offset)?

thanks for being so active on this thread.

NeRF's (and all of photogrammetry's) bread and butter is 3D consistency -- that is, seeing the same object from multiple angles. A 360 degree photo from a fixed position just won't do. As to how to select the best camera angles...I'm not sure. I believe there is research in this area for classical photogrammetry techniques, but I'm not familiar enough to point you to a body of work.

How do you remove tourists? Is the network trained to segment and ignore humans?

The model does not explicitly learn to segment images. The answer is unfortunately more difficult to explain than a HN comment bears. I encourage you to read the paper for more details.


Just gotta say: amazing!

My follow up question would be: are you able to compare your results to actual photogrammetry data to see how good your reconstruction performs?

I'm actually quite new to the field, and I'm not even sure what to compare against nor how to compare it. What's typically measured and how?

Is the model able to capture the underlying geometry? E.g. if I have a pillar, part of which was not visible at any training point, is it able to reconstruct that part?

The model is trained to reconstruct what is observed, but not what is obscured. If you look closely at our videos, you'll notice some parts of the scene are blurry -- those parts weren't seen often enough to learn well. If you look at parts of the scene not observed at all, I'm not sure what you'd find.

would a sufficiently long video in motion, say from a drone, car or even a walking person, work instead?

Pictures are pictures, even as video frames :)

Did you consider using movies as a source too?

Consider? Yes. Try? Nope!

awww! Figured dolly shots and steady cam shots would fit perfect into something like that. Esp 24 frames per second and usually known locations. Course it would probably drag a lot of the net into being biased towards that time spot I guess?

There are problems associated with using video: motion blur, rolling shutter.

Oh I agree. In my head it seems like it should work. I could be wildly wrong though. I am every day :)

It definitely can work, and even has some additional benefits (1), but requires special considerations. You can deblur using global motion vectors (2), or additional hardware like accelerometer reading embedded in the video feed (3).

1) can't find the paper now, but by exploiting predictable rolling shutter you get additional temporal resolution

2) http://users.ece.northwestern.edu/~sda690/MfB/Motion_CVPR08....

3) http://neelj.com/projects/imudeblurring/imu_deblurring.pdf

This looks amazing! Congratulations. What was the most challenging aspect of this? I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly. For some reason, the paper isn't loading, so feel free to say it's explained in detail there.

Do you plan on releasing code?

Congrats again. This is very cool research.

> This looks amazing! Congratulations.

Thank you!

> What was the most challenging aspect of this?

Wow, that's hard to say! Our work truly stands on the shoulders of giants (Mildenhall et al, 2020). I can list off a few challenges: figuring out if an idea "kinda works" or "definitely works" or has a bug, figuring out how to measure progress, coordinating a group of 6 researchers living 9 hours apart, and assembling everything together for a simultaneous paper-website-video release!

> I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly.

We don't! All of this comes by the magic of machine learning :). We train our model to attenuate the importance of "difficult" pixels that aren't 3D-consistent in the training set. We also partition the scene into "static" and "transient" volumetric radiance fields without explicit supervision. We do so by regularizing the latter to be empty unless necessary, and providing it with access to a learned, image-specific latent embedding. We discard the transient radiance field when rendering these videos, thus removing tourists and other moving objects.
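One way to read "attenuate the importance of difficult pixels" and "regularize the transient field to be empty" is as a per-pixel loss with a learned uncertainty plus a density penalty. The sketch below is written under that assumption; the name `pixel_loss`, the argument names, and the default `lam` weight are hypothetical, and the paper is the place to check the exact formulation.

```python
import math


def pixel_loss(rendered, observed, beta, transient_sigmas, lam=0.01):
    """Loss for one pixel of one training image.

    rendered, observed: (r, g, b) colors
    beta: predicted uncertainty for this pixel (must be > 0)
    transient_sigmas: transient-field densities sampled along the pixel's ray
    lam: strength of the "stay empty" penalty (illustrative value)
    """
    squared_error = sum((r - o) ** 2 for r, o in zip(rendered, observed))
    # A large beta shrinks the color error of pixels that are hard to explain...
    data_term = squared_error / (2.0 * beta ** 2)
    # ...but costs something, so the model cannot call every pixel uncertain.
    uncertainty_penalty = math.log(beta ** 2) / 2.0
    # Pushes transient density toward zero unless it is needed.
    if transient_sigmas:
        sparsity = lam * sum(transient_sigmas) / len(transient_sigmas)
    else:
        sparsity = 0.0
    return data_term + uncertainty_penalty + sparsity


# A badly explained pixel (say, a tourist walked through it).
confident = pixel_loss((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), beta=1.0, transient_sigmas=[0.0])
uncertain = pixel_loss((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), beta=2.0, transient_sigmas=[0.0])
```

For the badly explained pixel, raising `beta` lowers the loss, so the optimizer is free to "give up" on it. For a pixel that already matches well, the log term makes a high `beta` more expensive than simply fitting the color.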

> For some reason, the paper isn't loading, so feel free to say it's explained in detail there.

Well that won't do. Download it here: https://arxiv.org/abs/2008.02268. It's 40 MB, so your download bar may indicate it's almost done, but it actually has a good bit left to go.

> Do you plan on releasing code?

I hope so! As with most code, what ran on our machines may not run on yours. Migrating the code to open source will be a big effort. I hope what we describe in the paper is sufficient to build something like you see here.

> Migrating the code to open source will be a big effort.

No, it is not. It's 5 shell commands at the most.

$ git init

$ git add .

$ git commit -m 'Initial import'

$ git remote add origin git://...

$ git push origin master

You are assuming there that all their code is non-proprietary, doesn't belong to someone else, already has a open-licence, and is of good enough quality to be shared with the public.

> their code is non-proprietary, doesn't belong to someone else, already has a open-licence

Then say "we can't share for legal reason", not "we're planning to". This is just a bs corporate answer.

> is of good enough quality to be shared with the public.

This is a petty excuse. There is plenty of open-source code utterly crappy/barely functional out there.

> There is plenty of open-source code utterly crappy/barely functional out there.

Mine included ;)

Are you planning to share the code ?

GPT-3 has some pretty interesting demos, which are unfortunately rather disappointing once outside a carefully crafted environment. Said otherwise, a paper is nothing if it cannot be reproduced.

Not too familiar with this area. IIUC, training a model would require photos with corresponding information about where and at what angle the photo is taken.

Is this information just available from the dataset?

It is provided by the dataset we use, but given a new dataset, you can use off-the-shelf tools to obtain it yourself! Check out COLMAP, it's super duper cool: https://colmap.github.io/.

so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?

There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.

Why does the network need a direction? Why can't we get a density (opacity) and a color given a position?

and what are z(t) r(t) in equation 5,6?

Answered most your other questions below in another comment.

> and what are z(t) r(t) in equation 5,6?

r(t) is a position in 3D space along a camera ray of the form, r(t) = origin + t * direction.
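The ray formula above translates directly into code. This is a small sketch in plain Python; the helper names `ray_point` and `sample_along_ray` and the evenly spaced sampling are assumptions for illustration, since the thread does not say how the sample positions are chosen.

```python
def ray_point(origin, direction, t):
    """r(t) = origin + t * direction, evaluated per coordinate."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def sample_along_ray(origin, direction, near, far, n_samples):
    """Evenly spaced positions between near and far (needs n_samples >= 2)."""
    step = (far - near) / (n_samples - 1)
    return [ray_point(origin, direction, near + i * step) for i in range(n_samples)]


# A camera at x = 1 looking along +y, sampled at t = 0, 1, 2.
points = sample_along_ray((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), near=0.0, far=2.0, n_samples=3)
```

Each of these positions is what gets fed to the network as the input r(t).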

z(t) is the output of our first MLP. Think of it as a 256-dimensional vector of uninterpretable numbers that represent the input position r(t) in a useful way.

Would it possible to integrate video into the model?

I'm not sure I understand your question.

If the question is, "can you reconstruct a (static) scene from the frames in a video?", the answer is yes!

If the question is, "can you reconstruct a scene with people and other moving objects, and model them moving around too?", the answer is not yet.

The former. Once the scene is synthesized, I figure that is where any dynamic output would occur. Although that raises an interesting thought of using the NeRF modeling to paint out certain things in potentially live video.

Very cool! Congratulations!

Thank you :)

Do you need posed images?

Yes, the images need to be posed. We use COLMAP to obtain camera pose.

cool project ... now spin up a public server so we can feed up our own set of images and get back the 3D synth object scene

Does it take reflections into consideration?

The work we build off of, NeRF, does. While there's nothing preventing NeRF-W from also representing reflections, we find it captures a more matte picture of the object.

Is there a code release somewhere?

Those are some very cool 3D visualizations generated, but it's a bit difficult to understand what the form of the dataset they generated it from is. They say "in-the-wild" photography, but of course don't really give you a great sense.

The light->dark transitions having consistent geometry is clean though.

We use images from the Image Matching Challenge 2020 dataset. If you look at the Appendix, we list how many images we use and the process by which they were chosen.

Download and have a look! https://vision.uvic.ca/image-matching-challenge/data/

Thanks, that's a clean reference.

> They say "in-the-wild" photography, but of course don't really give you a great sense.

Flickr user photos. Citation shows up in the lower right hand corner during the video.

This appears to be a substantial improvement on current open photogrammetry/structure from motion work [1]. I hope Google supports this making its way into cultural preservation efforts [2].

[1] https://github.com/mapillary/OpenSfM (developed by Mapillary, now part of Facebook)

[2] https://www.nytimes.com/2015/12/28/arts/design/using-laser-s... (Using Lasers to Preserve Antiquities Threatened by ISIS)

Yes, I mostly meant that I don't get a great sense of "how many photos there are" in these datasets.

I saw in the paper their citation [13] pointed to https://arxiv.org/pdf/2003.01587.pdf, which in section 3 says the following:

We thus build on 25 collections of popular landmarks originally selected in [48,101], each with hundreds to thousands of images.

So hundreds to thousands of photos are used, which is a decent quantity, but definitely makes the quality of the result very impressive.

I'm still looking for a program that takes a video and turns it into an animated 3D scene. All the stuff I've seen is on static scenery, besides some neural nets that can tweak camera angles.

Do you happen to know how intellectual property works when someone wants to use the algorithm/code?

I think we're going to use the MIT license. So, you'll be able to use it in almost any way you like...

MIT actually does not give an explicit patent grant. So if "any way you like" is your goal, you should choose something different like Apache License 2.0

There is currently no way I'm aware of to accurately reconstruct a moving 3D scene. Sorry! Ask us again in a few years :)

A while back I stitched together a "hyperlapse" of Stanford's Hoover Tower using lots of Flickr-scraped images. Everything was aligned using "classical" CV tricks and I was really happy with the results. I wonder how NeRF-w would fare on this data?


After going to one of the early Maker Faires, and seeing so many interesting exhibits and projects, I had this same idea, of course with absolutely no clue about how to implement it. If enough people take pictures of the exhibits from a variety of angles, and make them available online, a virtual Maker Faire could be created. Thanks for sharing this!

Great work! Having tried the code from the original NeRF paper I found the inference time (generation of new views) to be rather slow because the network had to be queried multiple times per ray (pixel). The paper said that there is still potential to speed this up. Did you improve inference speed and do you think that it will be possible to get it to real-time (>30 fps) in the foreseeable future?

We did not aim to speed this part of NeRF up. Check out Neural Sparse Voxel Fields (https://arxiv.org/abs/2007.11571) for some effort in that direction. It's 10x faster, but there's still another 10x to go till you get video frame rates :)

This sort of work will both allow for digital forensics (imagine reconstructing a scene from multiple socially shared images or video), as well as to create even better "deep fakes" (putting people in scenes they never actually went to; or at different times of day/night, or with different weather effects).

Is there a reason why the skies do not appear to be picked up by their "transient" filter of the scene? You end up with the skies constantly changing when moving in 3D point of view, which looks strange.

A good question! And a problem yet to be solved!

This is really cool and IMHO an area where ML truly shines: being able to disentangle the base geometric signal from lighting / crowds / occlusion via learning is truly amazing.

Amazing work! Reminds me of something I saw at SIGGRAPH back in '95 called 'Tour into Picture' I think the work came out of Japan.

When will you be sharing some code?!

Wow that is fantastic work! And so quick since NeRF debuted. This is exactly the kind of work I have been waiting for to reconstruct some old photos I have.

> reconstruct some old photos I have.

Note that it generates a light field, which is not exactly like a polygonal mesh ... YMMV

Is the geometry from each of the examples available in some format? It would be fun to look more closely. Apologies if I missed a link somewhere!

The magic of this method is that we don't construct a "geometry" the same way one might think. There are no triangles or textures here. Instead, we train a machine learning model to predict the derivative of the color and opacity at every point in 3D space. We then integrate along rays emitted from the camera to render an image. It's similar to what's used in CT scans!

That's very cool, but also makes it sound more challenging to integrate into the existing 3D modeling ecosystem vs, say, photogrammetry approaches. Is it possible to generate approximate textured meshes from the color and opacity information?

[Edit] After a little Googling I do see this has been done, using marching cubes (https://www.matthewtancik.com/nerf).

Thanks. So are the trained models for the examples available with code to generate 2D images from them?

This means you don't have an occlusion mesh or any other depth information, correct?

There is depth information, just not in the form of a mesh.

The model learns to compute a function that takes an XYZ position within a volume as input, and returns color and opacity. You can then render images by tracing rays through this volume. You can pretty easily compute the distance to the first sufficiently-opaque region, or the "average" depth (weighted by each sample's contribution to the final pixel color), at the same time.
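Both depth estimates mentioned above can be computed in the same loop that accumulates opacity along a ray. A rough sketch in plain Python follows; the function name `ray_depths`, the 0.5 opacity threshold, and the choice to normalize the average by the total weight are my assumptions, not something taken from the paper.

```python
import math


def ray_depths(sigmas, ts, deltas, opacity_threshold=0.5):
    """Return (first_hit_depth, average_depth) for one ray.

    sigmas: density at each sample, nearest first
    ts: distance of each sample from the camera
    deltas: length of the ray segment each sample represents
    Both results are None if the ray never hits anything.
    """
    transmittance = 1.0
    weighted_depth = 0.0
    total_weight = 0.0
    first_hit = None
    for sigma, t, delta in zip(sigmas, ts, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        # Same weight the sample would get in the final pixel color.
        weight = transmittance * alpha
        weighted_depth += weight * t
        total_weight += weight
        transmittance *= 1.0 - alpha
        # Accumulated opacity so far is 1 - transmittance.
        if first_hit is None and 1.0 - transmittance >= opacity_threshold:
            first_hit = t
    average = weighted_depth / total_weight if total_weight > 0.0 else None
    return first_hit, average


# Two empty samples, then a nearly solid surface at distance 3.
first_hit, average = ray_depths(
    sigmas=[0.0, 0.0, 1e9, 1.0],
    ts=[1.0, 2.0, 3.0, 4.0],
    deltas=[1.0, 1.0, 1.0, 1.0],
)
```

With a hard surface the two estimates agree; they drift apart for semi-transparent or fuzzy regions, where the average is pulled toward whatever lies behind.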

Another recent Google project figured out a way to approximate these radiance fields with layered, partially transparent images for efficient rendering: https://augmentedperception.github.io/deepviewvideo/

Another related project by our friends in NYC: https://twitter.com/Jimantha/status/1289184432553734144

We have depth! Check out our depth video: https://youtu.be/yPKIxoN2Vf0?t=146

so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?

There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.

Why does the network need a direction? Why can't we get a density and a color given a position?

> so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?

According to Wikipedia, "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)." We're being more specific about what we use :)

> There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input. Why does the network need a direction? Why can't we get a density and a color given a position?

Volume data of this form is unable to express the idea of view-dependent reflections. I admit, we don't make much use of that here, but it does help! See NeRF for where it makes a big, big difference: https://www.matthewtancik.com/nerf

The videos are very impressive. I wish you let us move the 3D scene with the mouse in the browser to get a better idea of the result.

Also, can you share any details of the compute requirement for this?

How does this compare to photogrammetry?

According to Wikipedia, "Photogrammetry is the science and technology of obtaining reliable information about physical objects and the environment through the process of recording, measuring and interpreting photographic images and patterns of electromagnetic radiant imagery and other phenomena."

I'd say that this research is in the field of photogrammetry.

I guess I meant how it compares to the commercial photogrammetry software out there.

As mentioned in another comment, traditional photogrammetry software typically generates a 3D polygonal mesh.

It looks like this generates a light field, which is not something that traditional 3D software handles directly.

I don't know! I'm not familiar enough with the field to say.
