Anyone remember what that project was called, or if it is even still around?
Found it. http://phototour.cs.washington.edu/
Later discontinued: https://en.wikipedia.org/wiki/Photosynth
I highly doubt that demo was JS / could run in multiple browsers without some proprietary runtime.
The next step would be to have the user grab a VR headset and immerse themselves in their favorite childhood moment. One could even add avatars for loved ones, again using ML-generated audio based on recordings of their voices.
Your project made me think that it wouldn't be that interesting for me to view your memories, so perhaps the best initial step for a proof of concept that would allow the technology to mature would be to recreate historical moments so everyone could relive them – and they could do so entirely virtually, from the comfort of their own couch. Side note: it feels like this technology could disrupt traditional museums, with the added bonus of being pandemic-proof.
Anyway, I don't really have a question... Just wanted to compliment you on this amazing work and throw this idea out there in case others want to think about it, as I'm in an entirely different field and don't have the skills or resources to make it real, but I do strongly feel this will inevitably come to life.
Jon Voight: "Can the computer take us around to the other side?"
Jack Black: "It can HYPOTHESIZE"
Would the logical next step be to use GPT-3 to create a 3D world? :)
I do believe that past time travel to reconstructed and recorded events will be one of the stickiest use cases for VR.
We're hiring too. Looking for an engineer with some game dev experience.
That's a precursor to: could this technique be used to enhance Street View? There are times when I would really like to be able to "walk around" outdoor scenes in finer steps. Current Street View smears between photos taken some distance apart. (I don't know if that is a limitation of the public interface or if the original data capture is really that coarse.) It would be nice to have a real 3D space to explore, but I certainly don't expect the un-imaged parts to be defined correctly.
Finally, does this also work for reconstructing interior spaces seen from the inside? Like the geometry of a cave, from pictures of the cave interior?
Does this work in reconstructing indoor spaces? Give it a shot and find out!
Have you released the code? (Or did you mean I could try re-implementing the work you published from the paper? That's a reasonable response too.) I didn't see a link to source in the github.io page or in the arXiv paper. The only source code link I saw was to https://github.com/bmild/nerf which I thought was earlier work than this paper.
Note that the results are still very impressive imo; this is still at an early research phase.
We’re potentially about to start a non-profit and formally release our whole robot as open source. I’m interested in finding research partners who would like to help produce a research paper on 3D reconstruction of plants. I can produce a high-quality, geolocated dataset with 2 cm-accurate GPS tags, but I have no experience with neural rendering. This is work I want to do over the course of the next year.
Do you know anyone interested in helping with that kind of work? Thanks!
The method presented here wouldn't do well with your problem either. 3D reconstruction of moving objects is an unsolved problem!
Just came across this which is neat:
Also this might be useful:
For now I’m just beginning to collect data but I hope to contribute more to the field in time!
The method is unattended, in the sense that it's photos + camera parameters in and scene representation out. The photos should all be of the same scene (e.g. the Trevi Fountain). Once you have a scene representation, you can ask what the scene would look like from new camera angles with your choice of lighting.
Choosing camera angles is straightforward. You tell me where the camera is and what direction it's facing. The question then becomes, how do you specify your choice of lighting? The answer is, you can't do so directly. Instead, you provide a picture with the lighting you want, and with a little magic, we can find a way to imitate that lighting. The way we do this is by finding a corresponding "appearance embedding" via numerical optimization.
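For the curious, here is a minimal sketch of what that optimization could look like. The `render(camera, embedding)` function is a stand-in for a trained model, and the PyTorch usage is my own illustration, not the actual implementation:

    import torch

    def fit_appearance_embedding(render, camera, target_image, dim=48, steps=200):
        # `render(camera, embedding)` is assumed to be a differentiable function
        # (e.g. a trained model) that returns an H x W x 3 image tensor.
        embedding = torch.zeros(dim, requires_grad=True)
        optimizer = torch.optim.Adam([embedding], lr=1e-2)
        for _ in range(steps):
            optimizer.zero_grad()
            rendered = render(camera, embedding)
            loss = torch.mean((rendered - target_image) ** 2)  # photometric error
            loss.backward()
            optimizer.step()
        return embedding.detach()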
2) We don't extract surface contours, we learn a volumetric radiance field! To oversimplify, we learn a (smooth) function that, given a position in space, produces the differential opacity and color at that point. To render an image from a camera viewpoint, we approximately integrate along rays emitted from each pixel of the camera.
Check out NeRF and our paper to learn more about this representation!
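To make the rendering step concrete, here is a toy numerical integration along a single ray. `field(p)` stands in for the learned function, and the rest is standard alpha compositing rather than our actual code:

    import numpy as np

    def render_ray(field, origin, direction, near=0.1, far=10.0, n_samples=128):
        # `field(p)` is assumed to return (density, rgb) at the 3D point p.
        ts = np.linspace(near, far, n_samples)
        delta = ts[1] - ts[0]
        color = np.zeros(3)
        transmittance = 1.0  # fraction of light not yet absorbed along the ray
        for t in ts:
            p = origin + t * direction              # point on the ray
            density, rgb = field(p)
            alpha = 1.0 - np.exp(-density * delta)  # opacity of this segment
            color += transmittance * alpha * np.asarray(rgb)
            transmittance *= 1.0 - alpha
        return color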
One of the best non-classical methods is this one (https://grail.cs.washington.edu/projects/sq_rome_g1/), and our method significantly improves upon it. We do not compare with it directly, but Neural Rerendering in the Wild does, and we improve upon Neural Rerendering in the Wild.
also they're way higher quality than traditional techniques
Someone made a “camera” which tracks location & direction, and “takes” a picture by selecting the closest existing photo taken from that spot & angle.
Use this new tech for a next generation of that “camera”, generating the Platonic frame that one would see from that location.
You can use the PhotoSynth tech described above.
what does the ideal minimal data set look like (eg, 5 photos from each 15-degree offset)?
thanks for being so active on this thread.
My follow up question would be: are you able to compare your results to actual photogrammetry data to see how good your reconstruction performs?
1) can't find the paper now, but by exploiting a predictable rolling shutter you get additional temporal resolution
Do you plan on releasing code?
Congrats again. This is very cool research.
> What was the most challenging aspect of this?
Wow, that's hard to say! Our work truly stands on the shoulders of giants (Mildenhall et al, 2020). I can list off a few challenges:
figuring out if an idea "kinda works" or "definitely works" or has a bug,
figuring out how to measure progress,
coordinating a group of 6 researchers living 9 hours apart,
and assembling everything together for a simultaneous paper-website-video release!
> I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly.
We don't! All of this comes by the magic of machine learning :). We train our model to attenuate the importance of "difficult" pixels that aren't 3D-consistent in the training set. We also partition the scene into "static" and "transient" volumetric radiance fields without explicit supervision. We do so by regularizing the latter to be empty unless necessary, and providing it with access to a learned, image-specific latent embedding. We discard the transient radiance field when rendering these videos, thus removing tourists and other moving objects.
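Schematically, the per-ray training loss combines those ideas roughly like this (the symbols and the lambda weight are my own paraphrase of the paper, not the exact formulation):

    import numpy as np

    def per_ray_loss(rendered, target, beta, transient_density, lam=0.01):
        # beta: predicted uncertainty for this pixel; large beta down-weights
        # pixels that cannot be explained by 3D-consistent geometry.
        recon = np.sum((rendered - target) ** 2) / (2.0 * beta ** 2)
        penalty = np.log(beta)                       # keeps beta from growing without bound
        sparsity = lam * np.mean(transient_density)  # push the transient field toward empty
        return recon + penalty + sparsity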
> For some reason, the paper isn't loading, so feel free to say it's explained in detail there.
Well that won't do. Download it here: https://arxiv.org/abs/2008.02268. It's 40 MB, so your download bar may indicate it's almost done, but it actually has a good bit left to go.
> Do you plan on releasing code?
I hope so! As with most code, what ran on our machines may not run on yours. Migrating the code to open source will be a big effort. I hope what we describe in the paper is sufficient to build something like you see here.
No, it is not. It's 5 shell commands at most.
$ git init
$ git add .
$ git commit -m 'Initial import'
$ git remote add origin git://...
$ git push origin master
Then say "we can't share it for legal reasons", not "we're planning to". This is just a BS corporate answer.
> is of good enough quality to be shared with the public.
This is a petty excuse. There is plenty of utterly crappy / barely functional open-source code out there.
Mine included ;)
GPT-3 has some pretty interesting demos which are unfortunately rather disappointing once outside a carefully crafted environment. Said otherwise, a paper is nothing if it cannot be reproduced.
Is this information just available from the dataset?
There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.
Why does the network need a direction? Why can't we get a density (opacity) and a color given a position?
and what are z(t) r(t) in equation 5,6?
> and what are z(t) r(t) in equation 5,6?
r(t) is a position in 3D space along a camera ray of the form, r(t) = origin + t * direction.
z(t) is the output of our first MLP. Think of it as a 256-dimensional vector of uninterpretable numbers that represent the input position r(t) in a useful way.
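In code, r(t) is just the usual ray parameterization (a tiny illustration, not anything from the paper):

    import numpy as np

    def ray_point(origin, direction, t):
        # r(t) = origin + t * direction: a point along the camera ray
        return np.asarray(origin) + t * np.asarray(direction)

    # e.g. 64 sample positions along one ray between the near and far planes;
    # each would be fed to the first MLP, whose 256-d output is z(t)
    ts = np.linspace(0.1, 10.0, 64)
    points = np.stack([ray_point([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], t) for t in ts])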
If the question is, "can you reconstruct a (static) scene from the frames in a video?", the answer is yes!
If the question is, "can you reconstruct a scene with people and other moving objects, and model them moving around too?", the answer is not yet.
The light->dark transitions having consistent geometry is clean though.
Download and have a look! https://vision.uvic.ca/image-matching-challenge/data/
Flickr user photos. Citation shows up in the lower right hand corner during the video.
This appears to be a substantial improvement on current open photogrammetry / structure-from-motion work. I hope Google supports this making its way into cultural preservation efforts.
 https://github.com/mapillary/OpenSfM (developed by Mapillary, now part of Facebook)
 https://www.nytimes.com/2015/12/28/arts/design/using-laser-s... (Using Lasers to Preserve Antiquities Threatened by ISIS)
I saw in the paper that their citation pointed to https://arxiv.org/pdf/2003.01587.pdf, which in section 3 says the following:
> We thus build on 25 collections of popular landmarks originally selected in [48,101], each with hundreds to thousands of images.
So hundreds to thousands of photos are used, which is a decent quantity, but definitely makes the quality of the result very impressive.
When will you be sharing some code?!
Note that it generates a light field, which is not exactly like a polygonal mesh ... YMMV
[Edit] After a little Googling I do see this has been done, using marching cubes (https://www.matthewtancik.com/nerf).
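Roughly, that mesh extraction amounts to sampling the learned density on a grid and running marching cubes over it. A sketch, where `density_at` is a stand-in for querying the trained network and the threshold is arbitrary:

    import numpy as np
    from skimage import measure

    def extract_mesh(density_at, resolution=128, bound=1.0, threshold=25.0):
        # `density_at(points)` is assumed to map an (N, 3) array of positions
        # to N density values, e.g. by querying the trained network in batches.
        xs = np.linspace(-bound, bound, resolution)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
        densities = density_at(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
        verts, faces, normals, _ = measure.marching_cubes(densities, level=threshold)
        # rescale vertices from grid indices back to world coordinates
        verts = verts / (resolution - 1) * 2 * bound - bound
        return verts, faces, normals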
The model learns to compute a function that takes an XYZ position within a volume as input, and returns color and opacity. You can then render images by tracing rays through this volume. You can pretty easily compute the distance to the first sufficiently-opaque region, or the "average" depth (weighted by each sample's contribution to the final pixel color), at the same time.
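As a sketch of the "average depth" idea, reusing the same per-sample compositing weights that produce the pixel color (again my own illustration, not any released code):

    import numpy as np

    def expected_depth(densities, ts):
        # densities: per-sample densities along the ray, ordered front to back
        # ts: distance of each sample from the camera
        deltas = np.append(np.diff(ts), np.diff(ts)[-1])   # spacing between samples
        alphas = 1.0 - np.exp(-densities * deltas)          # per-sample opacity
        transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
        weights = transmittance * alphas                    # contribution to the final pixel
        return np.sum(weights * ts) / np.maximum(np.sum(weights), 1e-8)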
Another recent Google project figured out a way to approximate these radiance fields with layered, partially transparent images for efficient rendering: https://augmentedperception.github.io/deepviewvideo/
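The layered-image representation is cheap to render because it boils down to back-to-front "over" compositing. A toy version, assuming each layer is an H x W x 4 RGBA image ordered from farthest to nearest:

    import numpy as np

    def composite_layers(layers_rgba):
        # Composite partially transparent layers back to front ("over" operator).
        height, width, _ = layers_rgba[0].shape
        out = np.zeros((height, width, 3))
        for layer in layers_rgba:
            rgb, alpha = layer[..., :3], layer[..., 3:]
            out = alpha * rgb + (1.0 - alpha) * out
        return out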
Why does the network need a direction? Why can't we get a density and a color given a position?
According to Wikipedia, "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)." We're being more specific about what we use :)
> There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input. Why does the network need a direction? Why can't we get a density and a color given a position?
Volume data of this form is unable to express the idea of view-dependent reflections. I admit, we don't make much use of that here, but it does help! See NeRF for where it makes a big, big difference: https://www.matthewtancik.com/nerf
I'd say that this research is in the field of photogrammetry.
It looks like this generates a light field, which is not something that traditional 3D software handles directly.