
NeRF in the Wild: reconstructing 3D scenes from internet photography - tambourine_man
https://nerf-w.github.io/
======
mey
I recall many years back a website, I think a Microsoft project, that linked
together photos of tourist destinations in a 3D space. It created something of
a point cloud, but nothing this advanced. You could click through the
points/photos to jump into each photo's perspective of the space.

Anyone remember what that project was called, or if it is even still around?

Edit: Found it.
[http://phototour.cs.washington.edu/](http://phototour.cs.washington.edu/)
Later, the now-discontinued Photosynth:
[https://en.wikipedia.org/wiki/Photosynth](https://en.wikipedia.org/wiki/Photosynth)

~~~
ur-whale
Was it PhotoSynth ?

[https://www.ted.com/talks/blaise_aguera_y_arcas_how_photosyn...](https://www.ted.com/talks/blaise_aguera_y_arcas_how_photosynth_can_connect_the_world_s_images/transcript?language=en)

~~~
mey
Watching this demo makes it starkly clear how much of a regression in fluid
performance our single-page webapps are.

~~~
fastball
Is this not a Silverlight app or something?

I highly doubt that demo was JS / could run in multiple browsers without some
proprietary runtime.

~~~
shakna
It was, in point of fact, a Java applet. [0] Older, less secure, but more
powerful.

[0]
[https://web.archive.org/web/20191231213153/http://phototour....](https://web.archive.org/web/20191231213153/http://phototour.cs.washington.edu/applet/index.html)

------
duckworthd
Original author here. AMA!

~~~
airstrike
I had this vision that one day we'll be able to reconstruct memories from our
past by taking old photos and having an ML model collate everything together to
form a 3D rendering of that point in time. It seems like you have gotten most
of the way there.

The next step would be to have the user grab a VR headset and immerse
themselves in their favorite childhood moment. One could even add avatars of
loved ones, again using ML-generated audio based on recordings of their voices.

Your project made me think that it wouldn't be that interesting for _me_ to
view _your_ memories, so perhaps the best initial step for a proof of concept
that would allow the technology to mature would be to recreate historical
moments so everyone could relive them, entirely virtually, from the comfort of
their own couch. Side note: it feels like this technology could disrupt
traditional museums, with the added bonus of being pandemic-proof.

Anyway, I don't really have a question... Just wanted to compliment you on
this amazing work and throw this idea out there in case others want to think
about it. I'm in an entirely different field and don't have the skills or
resources to make it real, but I do strongly feel this will inevitably come to
life.

~~~
duckworthd
That's a really cool idea! This technology does a fantastic job of
reconstructing _static_ scenes. Moving objects -- people, cars, even flora --
are out of scope here. Why? It's really hard to build a 3D model of something
you only see from one direction.

~~~
airstrike
But if we know what a car ought to look like in 3D, can't we take the one
photo we have from one direction and just fill in the blanks with that a
priori 3D knowledge?

~~~
withjive
Similar to how GPT-3 can be applied not only to generate text, but also to
fill in missing pieces of images (i.e. complete the missing half of a face).

Would the logical next step be to use GPT-3 to create a 3D world? :)

~~~
airstrike
GPT-3D rolls off the tongue nicely

------
nawgz
Those are some very cool 3D visualizations, but it's a bit difficult to
understand what form the dataset they were generated from takes. They say
"in-the-wild" photography, but that doesn't really give you a great sense of
it.

The light->dark transitions having consistent geometry is clean though.

~~~
duckworthd
We use images from the Image Matching Challenge 2020 dataset. If you look at
the Appendix, we list how many images we use and the process by which they
were chosen.

Download and have a look!
[https://vision.uvic.ca/image-matching-challenge/data/](https://vision.uvic.ca/image-matching-challenge/data/)

~~~
nawgz
Thanks, that's a clean reference.

------
Mathnerd314
I'm still looking for a program that takes a video and turns it into an
animated 3D scene. All the stuff I've seen is on static scenery, besides some
neural nets that can tweak camera angles.

~~~
johanneskopf
Check this. Code coming (very) soon :)
[https://roxanneluo.github.io/Consistent-Video-Depth-Estimation/](https://roxanneluo.github.io/Consistent-Video-Depth-Estimation/)

~~~
ThisIsMyPasswrd
Do you happen to know how intellectual property works when someone wants to
use the algorithm/code?

~~~
johanneskopf
I think we're going to use the MIT license. So, you'll be able to use it in
almost any way you like...

~~~
lostmsu
MIT actually does not give an explicit patent grant. So if "any way you like"
is your goal, you should choose something different, like the Apache License
2.0.

------
hardmath123
A while back I stitched together a "hyperlapse" of Stanford's Hoover Tower
using lots of Flickr-scraped images. Everything was aligned using "classical"
CV tricks and I was really happy with the results. I wonder how NeRF-w would
fare on this data?

[https://github.com/kach/hootow-hyperlapse](https://github.com/kach/hootow-hyperlapse)

------
flyingcircus3
After going to one of the early Maker Faires, and seeing so many interesting
exhibits and projects, I had this same idea, of course with absolutely no clue
about how to implement it. If enough people take pictures of the exhibits from
a variety of angles, and make them available online, a virtual Maker Faire
could be created. Thanks for sharing this!

------
brookman64k
Great work! Having tried the code from the original NeRF paper, I found the
inference time (generation of new views) to be rather slow because the network
has to be queried multiple times per ray (pixel). The paper said that there is
still potential to speed this up. Did you improve inference speed, and do you
think it will be possible to get it to real-time (>30 fps) in the foreseeable
future?

~~~
duckworthd
We did not aim to speed this part of NeRF up. Check out Neural Sparse Voxel
Fields ([https://arxiv.org/abs/2007.11571](https://arxiv.org/abs/2007.11571))
for some effort in that direction. It's 10x faster, but there's still another
10x to go till you get video frame rates :)
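
A rough back-of-envelope sketch of why naive rendering is slow, assuming the
sample counts from the original NeRF paper (64 coarse + 128 fine network
queries per ray); the numbers are illustrative, not NeRF-W's exact
configuration:

    # Every pixel's ray is sampled many times, and each sample is one MLP query.
    height, width = 800, 800           # output resolution
    samples_per_ray = 64 + 128         # coarse + fine evaluations (original NeRF)
    queries_per_frame = height * width * samples_per_ray
    print(f"{queries_per_frame:,} MLP evaluations for one frame")  # ~123 million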

------
PeterCorless
This sort of work will both allow for digital forensics (imagine
reconstructing a scene from multiple socially shared images or videos) and
enable even better "deep fakes" (putting people in scenes they never actually
visited, or at different times of day/night, or with different weather
effects).

------
ekianjo
Is there a reason why the skies do not appear to be picked up by their
"transient" filter of the scene? You end up with the skies constantly changing
as you move the 3D point of view, which looks strange.

~~~
duckworthd
A good question! And a problem yet to be solved!

------
ur-whale
This is _really_ cool and IMHO an area where ML truly shines: being able to
disentangle the base geometric signal from lighting / crowds / occlusion via
learning is truly amazing.

------
nla
Amazing work! Reminds me of something I saw at SIGGRAPH back in '95 called
'Tour into Picture'. I think the work came out of Japan.

When will you be sharing some code?!

------
randyrand
Wow, that is fantastic work! And so soon after NeRF debuted. This is exactly
the kind of work I have been waiting for to reconstruct some old photos I
have.

~~~
ur-whale
> reconstruct some old photos I have.

Note that it generates a light field, which is not exactly the same as a
polygonal mesh ... YMMV

------
schemescape
Is the geometry from each of the examples available in some format? It would
be fun to look more closely. Apologies if I missed a link somewhere!

~~~
duckworthd
The magic of this method is that we don't construct a "geometry" the same way
one might think. There are no triangles or textures here. Instead, we train a
machine learning model to predict the derivative of the color and opacity at
every point in 3D space. We then integrate along rays emitted from the camera
to render an image. It's similar to what's used in CT scans!
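
Roughly, the integration step looks like the sketch below (an illustrative
numpy simplification, not our actual implementation): the model is queried at
sample points along each camera ray, and the predicted densities and colors
are numerically composited into a pixel color.

    import numpy as np

    def composite_ray(densities, colors, t_vals):
        """Numerically integrate density/color samples along one camera ray.

        densities: (N,) volume density at each sample point on the ray
        colors:    (N, 3) RGB predicted at each sample point
        t_vals:    (N,) distance of each sample from the camera
        """
        deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # sample spacing
        alpha = 1.0 - np.exp(-densities * deltas)            # per-segment opacity
        # Transmittance: how much light survives to reach each sample.
        trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
        weights = alpha * trans
        rgb = (weights[:, None] * colors).sum(axis=0)        # final pixel color
        return rgb, weights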

~~~
jayd16
This means you don't have an occlusion mesh or any other depth information,
correct?

~~~
teraflop
There is depth information, just not in the form of a mesh.

The model learns to compute a function that takes an XYZ position within a
volume as input, and returns color and opacity. You can then render images by
tracing rays through this volume. You can pretty easily compute the distance
to the first sufficiently-opaque region, or the "average" depth (weighted by
each sample's contribution to the final pixel color), at the same time.
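
A minimal sketch of the depth part, assuming per-sample compositing weights
like those produced by standard NeRF volume rendering (illustrative only):

    import numpy as np

    def ray_depth(weights, t_vals, threshold=0.5):
        """Estimate depth along one ray from volume-rendering weights.

        weights: (N,) per-sample compositing weights (opacity * transmittance)
        t_vals:  (N,) distance of each sample from the camera
        """
        # "Average" depth: sample distances weighted by their contribution.
        expected = (weights * t_vals).sum() / max(weights.sum(), 1e-10)
        # First sufficiently opaque point: where accumulated opacity crosses
        # the threshold (falls back to the far plane if it never does).
        acc = np.cumsum(weights)
        idx = np.searchsorted(acc, threshold)
        first_hit = t_vals[idx] if idx < len(t_vals) else t_vals[-1]
        return expected, first_hit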

Another recent Google project figured out a way to approximate these radiance
fields with layered, partially transparent images for efficient rendering:
[https://augmentedperception.github.io/deepviewvideo/](https://augmentedperception.github.io/deepviewvideo/)

~~~
duckworthd
Another related project by our friends in NYC:
[https://twitter.com/Jimantha/status/1289184432553734144](https://twitter.com/Jimantha/status/1289184432553734144)

------
billconan
So why does it use a multi-layer perceptron? Is it the same as an ANN? Why not
call it an ANN? Does it have activations?

There is another thing I don't understand. Traditional volume data is a map of
position to density (or color). There doesn't seem to be any need for a
direction as an input.

Why does the network need a direction? Why can't we get a density and a color
given a position?

~~~
duckworthd
> So why does it use a multi-layer perceptron? Is it the same as an ANN? Why
> not call it an ANN? Does it have activations?

According to Wikipedia, "A multilayer perceptron (MLP) is a class of
feedforward artificial neural network (ANN)." We're being more specific about
what we use :)

> There is another thing I don't understand. Traditional volume data is a map
> of position to density (or color). There doesn't seem to be any need for a
> direction as an input. Why does the network need a direction? Why can't we
> get a density and a color given a position?

Volume data of this form is unable to express the idea of view-dependent
reflections. I admit, we don't make much use of that here, but it does help!
See NeRF for where it makes a big, big difference:
[https://www.matthewtancik.com/nerf](https://www.matthewtancik.com/nerf)
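
Schematically, the network is just a function of a 3D position and a viewing
direction: density depends only on position, while color also sees the
direction, which is what lets it represent view-dependent effects. A
simplified PyTorch sketch (illustrative only; the real model also applies
positional encoding to its inputs, and NeRF-W adds per-image appearance and
transient embeddings):

    import torch
    import torch.nn as nn

    class TinyNeRF(nn.Module):
        """Sketch of the NeRF MLP: (position, direction) -> (density, RGB)."""

        def __init__(self, hidden=256):
            super().__init__()
            # The trunk sees only the 3D position.
            self.trunk = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.density_head = nn.Linear(hidden, 1)
            # The color head additionally sees the viewing direction.
            self.color_head = nn.Sequential(
                nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, 3), nn.Sigmoid())

        def forward(self, position, direction):
            features = self.trunk(position)
            density = torch.relu(self.density_head(features))
            rgb = self.color_head(torch.cat([features, direction], dim=-1))
            return density, rgb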

------
woko
The videos are very impressive. I wish you would let us move the 3D scene
around with the mouse in the browser to get a better idea of the result.

------
nla
Also, can you share any details of the compute requirements for this?

------
mrfusion
How does this compare to photogrammetry?

~~~
duckworthd
According to Wikipedia, "Photogrammetry is the science and technology of
obtaining reliable information about physical objects and the environment
through the process of recording, measuring and interpreting photographic
images and patterns of electromagnetic radiant imagery and other phenomena."

I'd say that this research is in the field of photogrammetry.

~~~
mrfusion
I guess I meant how it compares to the commercial photogrammetry software out
there.

~~~
ur-whale
As mentioned in another comment, traditional photogrammetry software typically
generates a 3D polygonal mesh.

It looks like this generates a light field, which is not something that
traditional 3D software handles directly.

