
Neural scene representation and rendering - johnmoberg
https://deepmind.com/blog/neural-scene-representation-and-rendering/
======
cs702
This work is a natural progression from a lot of other prior work in the
literature... but that doesn't make the results any less impressive. The
examples shown are amazingly, unbelievably good! Really GREAT WORK.

Based on a quick skim of the paper, here is my oversimplified description of
how this works:

During training, an agent navigates an artificial 3D scene, observing multiple
2D snapshots of the scene, each snapshot from a different vantage point. The
agent passes these snapshots to a deep net composed of two main parts: a
representation-learning net and a scene-generation net. The representation-
learning net takes as input the agent's observations and produces a scene
representation (i.e., a lower-dimensional embedding which encodes information
about the underlying scene). The scene-generation network then predicts the
scene from three inputs: (1) _an arbitrary query viewpoint_ , (2) the scene
representation, and (3) stochastic latent variables. The two networks are
trained jointly, end-to-end, to maximize the likelihood of generating the
ground-truth image that would be observed from the query viewpoint. See Figure
1 on Page 15 of the Open Access version of the paper. Obviously I'm playing
loose with language and leaving out numerous important details, but this is
essentially how training works, as I understand it based on a first skim.
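For the curious, that data flow can be sketched in a few lines of toy numpy. Everything here (the shapes, and the single linear maps standing in for the two deep nets) is my own invention, purely to show how the pieces connect, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened image, camera pose, scene representation,
# and stochastic latent. The real nets are deep convolutional/recurrent
# models; single linear maps stand in for them here.
IMG, POSE, REPR, LATENT = 64, 7, 16, 8
W_repr = rng.normal(size=(IMG + POSE, REPR)) * 0.1
W_gen = rng.normal(size=(REPR + POSE + LATENT, IMG)) * 0.1

def represent(images, poses):
    """Encode each (snapshot, viewpoint) pair and sum into one scene code."""
    x = np.concatenate([images, poses], axis=1)   # (n_views, IMG + POSE)
    return (x @ W_repr).sum(axis=0)               # order-invariant aggregation

def generate(scene_code, query_pose, z):
    """Predict the image that would be seen from the query viewpoint."""
    x = np.concatenate([scene_code, query_pose, z])
    return x @ W_gen

# One fake "scene": three context views plus a held-out query viewpoint.
context_imgs = rng.normal(size=(3, IMG))
context_poses = rng.normal(size=(3, POSE))
query_pose = rng.normal(size=POSE)

r = represent(context_imgs, context_poses)        # scene representation
z = rng.normal(size=LATENT)                       # stochastic latent variables
prediction = generate(r, query_pose, z)           # flattened predicted image
```

Training would then push `prediction` toward the ground-truth image observed from `query_pose` and backprop through both maps jointly, which is the end-to-end part.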

EDIT: I replaced "somewhat obvious" with "natural," which better conveys what
I actually meant to write the first time around.

~~~
Rainymood
Literally just 15 minutes ago, I had a chat with a friend of mine about exactly
this: how what we are doing right now with computer vision is based on a flawed
premise (supervised 2D training sets). The human brain works in 3D space (or
3D+time) and then projects all this knowledge into a 2D image.

Here I was, thinking I finally had thought of a nice PhD project and then
Deepmind comes along and gets the scoop! Haha.

~~~
simonster
I don’t think this is a novel idea, but it is still a great topic for a PhD.
While the results in this paper look impressive, my suspicion is that the
system doesn’t generalize particularly well. (I suspect this from experience
with similar, albeit simpler, ideas, as well as from looking at the datasets.)
If you can make a system that generalizes to new environments and objects, or
a system that works with real-world natural image/video data, that would be a
tremendous accomplishment.

~~~
amelius
Generalization is a more fundamental problem, and (imho) should be tackled
first at a more fundamental level.

For example, if you have a classifier that can recognize cats, it doesn't mean
it will work for cartoon cats. You'd have to train the system all over again
with cartoon cats. Instead, you want the system to learn more like humans,
where only a small number of examples is necessary to make the connection
between real and cartoon cats.

~~~
simonster
It is possible that the problems are related: it may be that, to achieve
human-like generalization, neural nets need to learn in a human-like
environment, instead of from a folder full of images. But time will tell.

~~~
igravious
This has been said many times in different ways over the years. To achieve
human-like intelligence one needs a human-like body operating in a human-like
environment. It's the first of the E's in: embodied emergent extended
enactive. [https://plato.stanford.edu/entries/embodied-
cognition/](https://plato.stanford.edu/entries/embodied-cognition/)

------
TTPrograms
I'm surprised people are so blown away by this. It's a cool demonstration, but
for this problem you have basically infinite training data. If you can find a
latent space of faces this is hardly a stretch, since you already have a
fantastic notion of locality in your data (by perturbing the camera). The
interesting thing is generalization, which they show in figure 3B and is...
ok, I guess. It's not that surprising compared to any of the other VAE stuff
people have done (see the morphing scenes, 3D face illumination / rendering
and furniture stuff from 2 years ago, for instance). It's also not that
surprising compared to, e.g., the generative scene-model RL paper that came out a
few months ago (with Doom and the driving game). IMO deep learning research
has moved beyond "here's another set of points I can fit a curve to". It
really feels like this publication was heavily driven by prestige when most of
the innovative stuff was achieved by other groups 2 years ago or more.

E.g., how is this different from
[https://arxiv.org/pdf/1503.03167.pdf](https://arxiv.org/pdf/1503.03167.pdf)
from 2015?

~~~
2bitencryption
I don't think the breakthrough here is generating a 3D space from 2D
snapshots.

I think it's the idea that a network capable of doing that is a far, far
better input to training an agent than flat images, or even the ground-truth
3D space.

------
sgillen
Very cool work; DeepMind wows me once again.

One thing I wish they would make more explicit (and in all their papers that
I've read for that matter) is how much computational power it takes to train
these networks and achieve these results.

I'm not sure if this is something they usually leave out because it's not
interesting, or because it's something that people who work with deep
networks all the time (i.e., not me) already have a feel for.

As someone in a related field (sometimes using deep networks but not
researching them for their own sake), I certainly would like to know which of
DeepMind's results would be feasible to replicate using my research group's
resources, and it can be hard to do that without spending a lot of time
actually trying to replicate the results and benchmarking them on your
hardware.

~~~
dzdt
Retracted! See below; the compute is disclosed and is not that crazy. For this
project their training hardware was 4 NVIDIA K80s.

Original comment: _I think they leave it out because otherwise the standard
response to their work would be "no surprise they get better results than
anyone else, they are using two orders of magnitude more compute time than
anyone else!" Not highlighting the computational expense makes their results
look more impressive._

~~~
zerostar07
Also, because they are focusing heavily on the RL part of the modeling. They
obviously have obscene amounts of available compute, but that is not their
competitive advantage.

~~~
sgillen
What exactly do you mean? Are you saying that RL requires less compute?

I would say having an obscene amount of compute is definitely a big
competitive advantage, especially over a lot of small academic research labs.

~~~
zerostar07
> We train each GQN model simultaneously on 4 NVidia K80 GPUs for 2 million
> gradient steps. The values of the hyper-parameters used for optimisation are
> detailed in Table S1, and we show the effect of model size on final
> performance in Fig. S4.

> The values of all hyper-parameters were selected by performing informal
> search. We did not perform a systematic grid search owing to the high
> computational cost.

~~~
jacquesm
That's nowhere near an obscene amount of computing power for any serious ML
project.

------
GistNoesis
This seems impressive, but it shows that there is still some way to go compared
to old-school techniques. I don't know how many TPUs they used, but probably a
lot.

You can build a 3D key-point map using SLAM algorithms in real time on a
Raspberry Pi. From there, you render those key-points and descriptors to a
virtual screen given the desired camera pose, then you learn a deconvolution
mapping from these sparse rendered key-points to images.

Alternatively, using more memory, once you have a 3D map, you can save some
key-frames with camera poses; when asked for the view from a given pose, you
pick the k closest poses and interpolate (possibly with a neural net).

If you have some more compute, you can run the previous SLAM algorithms with
dense maps and interpolate the dense 2.5D point clouds.

Their network is probably doing a mixture of those different things
inefficiently, trading compute and memory for flexibility.
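The "render those key-points to a virtual screen given the desired camera pose" step is just a pinhole projection. A minimal numpy sketch, with a made-up focal length, image size, and random key-point map (no lens distortion model):

```python
import numpy as np

def project(points_3d, R, t, f=500.0, cx=320.0, cy=240.0):
    """Project (n, 3) world-frame key-points to pixel coords for a camera
    with rotation R, translation t, and a toy pinhole intrinsic (f, cx, cy)."""
    cam = points_3d @ R.T + t            # world frame -> camera frame
    z = cam[:, 2]                        # depth along the optical axis
    u = f * cam[:, 0] / z + cx           # perspective divide
    v = f * cam[:, 1] / z + cy
    return np.stack([u, v], axis=1), z

rng = np.random.default_rng(1)
# Fake SLAM map: 2000 key-points scattered in front of the camera.
key_points = rng.uniform([-1, -1, 2], [1, 1, 6], size=(2000, 3))
R = np.eye(3)                            # identity pose: looking down +z
t = np.zeros(3)

pixels, depth = project(key_points, R, t)
# Keep only the points that land on a 640x480 virtual screen; these sparse
# pixels would then be fed to the learned deconvolution.
visible = (depth > 0) & (pixels[:, 0] >= 0) & (pixels[:, 0] < 640) \
                      & (pixels[:, 1] >= 0) & (pixels[:, 1] < 480)
```

To move the virtual viewpoint you just change `R` and `t` and re-project; only the final sparse-points-to-image mapping needs learning.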

~~~
zxcvvcxz
> You can build a 3D key-point map using SLAM algorithms in real time on a
> Raspberry Pi.

Does anyone have links to any open source projects doing this? Preferably with
an example video(s) showing results?

~~~
GistNoesis
I used this today for my robot. [https://github.com/Alkaid-
Benetnash/ORB_SLAM2/](https://github.com/Alkaid-Benetnash/ORB_SLAM2/) (This
particular fork can save the map, although it needs to be generated on the Pi.)
It works almost out of the box; it just needs half a day of slow compilation.

On a Raspberry Pi 3B+ with 2000 key-points, it runs at 1-2 fps at 640x480 and
5-6 fps at 320x240. It uses about 500 MB for a few rooms, 75% CPU while
building the map, and 50% CPU once it's built. It's not optimized for the Pi,
so you can probably get it to run at least 3 times faster if you are willing
to get your hands dirty.

For it to work well, a 180-degree camera really makes a difference, and run the
OpenCV cpp-tutorial-cameracalibration on a chessboard to get the needed camera
parameters.

There are probably other SLAM algorithms in ROS, but I'm not sure how
Raspberry-Pi-compatible they are.

If you want to experiment with neural networks, once you have built your 3D
map on a powerful computer, you can train a neural network to learn the pose
from the image. This gives you a constant-time, constant-memory algorithm for
later use on the Pi, though it will probably be less precise.

------
state_less
Bravo. This sort of imagining of a scene could perhaps allow an agent to
recognize that it doesn't know what's behind the ball if asked. That would be
a nice feature if you wanted to reward the agent for finding unexplored areas.
It could also help an agent plan toward some goal: no need to guess whether
taking this fork in the road will mean retracing my steps; instead, imagine it
and avoid the imagined pitfalls before taking the action.

------
rasz
I can totally see Google incorporating this in self-driving cars down the
road, after training on millions of hours of dashcam footage, to augment or
maybe even replace LIDAR. The paper suggests it is capable of segmenting 2D
picture input into logical objects and their discrete
configurations/positions.

It's not without its pitfalls though: instead of monocular SLAM generating a
factual, albeit fuzzy, point-cloud map, we get an overfitted (5 pictures x 2M
similar scenes) magic black box generating very training-set-specific
hallucinations. This is how we get scanners replacing numbers in scanned
documents: [http://www.dkriesel.com/en/blog/2013/0802_xerox-
workcentres_...](http://www.dkriesel.com/en/blog/2013/0802_xerox-
workcentres_are_switching_written_numbers_when_scanning)

A similar example was posted 2 months ago:
[https://data.vision.ee.ethz.ch/aeirikur/extremecompression/](https://data.vision.ee.ethz.ch/aeirikur/extremecompression/)
The example picture (no doubt the best case the authors could manage) gained
additional data absent from the original, some of it dangerous, like fake
license-plate numbers.

------
zerostar07
> We also found that the GQN is able to carry out “scene algebra” [akin to
> word embedding algebra (20)]. By adding and subtracting representations of
> related scenes, we found that object and scene properties can be controlled,
> even across object positions.

This is incredible because it provides a way to link linguistic understanding
with manipulation of the rendering.
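To make the analogy with word-embedding algebra concrete, here is a toy illustration with hand-built vectors (not codes from a trained GQN): if scene properties are encoded additively, vector arithmetic can swap one property while keeping the other.

```python
import numpy as np

# Hand-built "scene representations" where color and shape occupy separate,
# additive slots. A trained GQN's codes are learned, not constructed like
# this; the point is only to show what "scene algebra" means.
color = {"red": np.array([1.0, 0.0]), "blue": np.array([0.0, 1.0])}
shape = {"sphere": np.array([1.0, 0.0]), "cube": np.array([0.0, 1.0])}

def scene(c, s):
    return np.concatenate([color[c], shape[s]])

# red sphere - blue sphere + blue cube = red cube
result = scene("red", "sphere") - scene("blue", "sphere") + scene("blue", "cube")
assert np.allclose(result, scene("red", "cube"))
```

The surprising part of the paper's result is that this kind of structure emerges in the learned representations without being built in by hand.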

------
formalsystem
If any of the authors are on this thread, I'm wondering if there are any plans
to release the source code for this? It could potentially make generating
environments for games or VR trivial: just take photos IRL and import them
into some game engine to generate a scene.

~~~
modeless
The important part is the dataset, which in this case is generated by DeepMind
Lab, which is already open source:
[https://github.com/deepmind/lab](https://github.com/deepmind/lab)

Reimplementing the rest of the paper shouldn't be tremendously difficult.
These techniques tend to be fairly simple at their core. But training it could
be expensive. In any case, it is a _long, long way_ from here to "make
generating environments for games or VR trivial by just taking photos IRL and
then importing them into some game engine". Many years of research remain.

~~~
formalsystem
Hey James, since you seem to know a lot about graphics and ML: I'm wondering
which specific problems remain open before what I mentioned becomes more
feasible product-wise?

~~~
modeless
The domain of images used in this research is extremely limited. These are
very low resolution artificially generated images of small scenes with simple
lighting, simple textures, simple geometry, and very restricted camera
positions and parameters that are known exactly (which is not the case for
most natural photos). Each of those restrictions needs to be lifted before
this will work on realistic natural scenes, and that will require many orders
of magnitude more data. It's not clear that this approach will easily scale up
to that amount of dataset variation. It's likely that a much fancier neural
net architecture and training scheme will be required, and probably faster
hardware too.

This is not intended as a criticism of this research, which I think is really
great.

------
tomxor
> The generation network is therefore an approximate renderer that is learned
> from data.

>
> [https://www.youtube.com/watch?v=G-kWNQJ4idw&feature=youtu.be](https://www.youtube.com/watch?v=G-kWNQJ4idw&feature=youtu.be)

I wonder if this is efficient... I know this isn't the researchers' intended
application, but the path tracer in me wants to see how far this can be pushed
for real-time rendering. I welcome the more interesting artefacts that a NN
might produce (I'm talking about pig-snails [1], of course :D)

full circle: GPU GLSL for graphics -> GPU cuda/opencl for NN -> GPU
cuda/opencl for NN graphics

[1] [https://www.newscientist.com/article/dn27755-artificial-
brai...](https://www.newscientist.com/article/dn27755-artificial-brain-turns-
clouds-into-psychedelic-pig-snails/)

~~~
halflings
Disney actually does a lot of research combining the world of graphics with
deep learning.

Some examples that you might appreciate (from the excellent channel "Two
Minute Papers"):

. "Disney's AI Learns To Render Clouds" [0]

. "AI Learns Noise Filtering For Photorealistic Videos" [1]

[0]
[https://www.youtube.com/watch?v=7wt-9fjPDjQ](https://www.youtube.com/watch?v=7wt-9fjPDjQ)
[1]
[https://www.youtube.com/watch?v=YjjTPV2pXY0](https://www.youtube.com/watch?v=YjjTPV2pXY0)

~~~
tomxor
Thanks! That is really some awesome stuff. It's even simpler in concept than
this.

------
VikingCoder
I'm disappointed they didn't demonstrate what happens when they take this
system and expose it to a few real-world photos. Can it handle that, or has it
been very much fitted to these shapes?

~~~
yagyu
It has very much been fitted. The supplementary material describes the
training set as 5 pictures each of 2M different scenes of the type you can see
in the paper (a square room with random objects).

So, to extrapolate wildly, it seems reasonable that to get similar results
for, say, real-world bedrooms, you'd need to take around 5 pics each of 2M
bedrooms, and record the location and angle of the camera for each picture.

Edit: I didn't mean to sound negative. Using artificially generated rooms to
develop the method is a great idea, and the next step will be narrow,
domain-specific applications (they mention e.g. robotic arms) where it's
feasible to automatically collect enough data for a task; somewhere in the
future we may have the data and compute to sample the distribution of
real-world environments at decent resolution.

~~~
chrisfosterelli
I'd say much more than that, since real-world bedrooms likely have a much more
complicated representation than the simpler generated rooms.

~~~
yagyu
I agree with you that the estimate is conservative, and would depend strongly
on image resolution and how broad your distribution of bedrooms is - only
modern US style, or also 40 year old Japanese houses?

------
hacker_9
Well, this is nothing short of incredible. I wonder if they'll get it to a
point where it can look at a 3D drawing and immediately produce a 3D model
that includes all the occluded parts.

~~~
pwaai
Probably not far off. We might get to a point where we have AI software that
can run on any computer and entertain us for eternity.

Combined with VXGI and other photorealism efforts, AI could produce any
permutation of your favorite TV show that ended too soon, e.g., Breaking Bad
Season 15: Walter Jr.'s Revenge, or something like that.

There's also an AI that produced a clone of a game by watching videos, so with
this new neural scene representation you wouldn't have to train it with
thousands of hours of gameplay footage; it could see a video once and figure
out the game mechanics. E.g., when stepping on a group of sprites it
recognizes as Enemy1, it should increment the score based on some generic
platformer template model.

Once again, DeepMind delivers.

------
aqsheehy
I wonder if we'll get to the stage where game engines become a series of
neural networks hallucinating the output.

~~~
hypothetical
I was thinking about exactly this kind of experiment. Given an input of
gameplay recordings, train a model to predict the next framebuffer from the
previous frame and keypress input. Would the model have to be excessively
complex to avoid rapid divergence into feedback patterns resembling a Winamp
visualizer? Probably, but it should be entertaining enough to watch and
interact with anyway.
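The training-pair construction described above is simple to sketch. Here a linear least-squares fit stands in for the deep net, and random arrays stand in for real gameplay recordings (all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, h, w, n_keys = 100, 8, 8, 4
frames = rng.random((n_frames, h, w))            # fake framebuffer history
keys = rng.integers(0, 2, size=(n_frames, n_keys))  # fake keypress state

# Build (frame_t, keys_t) -> frame_{t+1} training pairs.
X = np.stack([np.concatenate([frames[t].ravel(), keys[t]])
              for t in range(n_frames - 1)])
Y = frames[1:].reshape(n_frames - 1, -1)

# Linear least-squares "model" as a stand-in for the network. At rollout
# time you would feed predictions back in as the next input, which is
# exactly where the feedback-divergence worry comes from.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred_next = X[0] @ W                             # flattened predicted frame
```

A real attempt would use a recurrent or convolutional predictor and roll it out for many steps, measuring how quickly the rollout drifts from real gameplay.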

------
cabaalis
Could this be a pivot point in rendering technology? All of the thought and
effort over decades put into the math behind 3D rendering, meant to produce a
rasterized scene following perfect 3D calculations and rules -- replaced with
a system that just "imagines" the picture?

~~~
ClassyJacket
If you're referring to increasing the speed of 3D rendering, it's very
unlikely that this is faster than a traditional rendering method. It's
probably an order of magnitude or three slower. If anything it would assist in
the art stage, not the rendering stage.

Also, what would you render? You still need input.

------
juanuys
Demis alluded to this at the Cheltenham Science festival last Saturday when
someone asked about rats in mazes and how spatial neural connections are
formed.

[https://www.cheltenhamfestivals.com/science/whats-
on/2018/de...](https://www.cheltenhamfestivals.com/science/whats-
on/2018/demis-hassabis-the-future-of-ai-and-science/)

------
closetCS
Hey, I just skimmed the news article. It seems really interesting, but the
lack of information on compute requirements is concerning. I also wonder what
the latent factors and the specific layers in each model are; I tried to dig
deeper into the paper, but the description was pretty ambiguous.

------
allthenews
This undoubtedly brings us a leap closer to AI. Imagine a number of these self
learning nets arranged in some structure, learning and unlearning bits of
information on demand, perhaps with different levels of volatility.

Sounds almost like learning new skills and forgetting old ones.

------
guskel
This + the recent grid-cell work would allow the training viewpoints to be
generated unsupervised as well. Just drop an agent in an environment; it will
explore it and come up with the scene representation entirely on its own.

------
ankeshanand
One thing to note is that the camera viewpoint (its position, roll, pitch, and
yaw) is fed in along with the images during training. Requiring access to this
ground truth makes this method very constraining to use in practice.

~~~
ehsankia
What kind of use cases are you thinking of where this would be constraining?
Don't many computer vision algorithms also require something specifying the
parameters of the camera, such as the fundamental matrix for stereo imaging?

As humans, when we look at a scene, then move a few feet and look at it again,
we have a pretty good idea what the delta between the two views was, so why is
providing the same info here any different?

~~~
boxy310
I would add that humans also integrate gyroscopic and acceleration information
from the inner ear to understand relative balance. Multiple sources of sensor
data are a net benefit, not a drawback.

------
ChuckMcM
Nice work! I am guessing that this moves us that much closer to a camera-only
SLAM system.

I am also curious whether they are going to use this architecture to defend
against adversarial GANs that attempt to defeat image recognition.

------
martythemaniak
It is rather funny that in nearly every Tesla-related thread, they are slammed
as irresponsible fraudsters for their decision not to use LIDAR and to rely on
a radar/camera-based system. Cameras cannot detect obstacles, we are told, and
they'll never be able to make an autonomous vehicle without LIDAR.

This, despite the fact that humans do well enough and that Structure From
Motion has been a well-established part of Computer Vision research for a
while.

More on topic, this is pretty great work and it'll have wide applications, for
example Google's own efforts to make robot arms more perceptive using regular
cameras:
[https://ai.google/research/teams/brain/robotics/](https://ai.google/research/teams/brain/robotics/)

~~~
TeMPOraL
> _It is rather funny that in nearly every Tesla-related thread, they are
> slammed as irresponsible fraudsters for their decision not to use LIDAR and
> to rely on a radar/camera-based system. Cameras cannot detect obstacles, we
> are told, and they'll never be able to make an autonomous vehicle without
> LIDAR._

I don't think anyone reasonable says it's _impossible_; after all, humans are
living, walking proof that it's entirely doable. The core of the criticism is
that it's _insanely more difficult_ than just using LIDAR data. One could even
say that LIDAR, as a specialized tool for depth detection, is an intrinsically
better tool for the job.

~~~
haberman
In particular, I find this claim questionable: "All Tesla Cars Being Produced
Now Have Full Self-Driving Hardware." [https://www.tesla.com/blog/all-tesla-
cars-being-produced-now...](https://www.tesla.com/blog/all-tesla-cars-being-
produced-now-have-full-self-driving-hardware?redirect=no)

As you mention, this is trivially true on the level of the _sensors_. We know
cameras are enough because human brains can do it with eyes (biological
cameras).

But there is no evidence that we know how to write software for the
_processing_ part of it (to an acceptable degree of safety) with cameras only,
nor that the computing power on-board is up to the task.

How can you say that hardware package X is sufficient to implement something
that has literally never been done before?

~~~
joshuamorton
>this is trivially true on the level of the sensors.

Maybe. The human eye is fundamentally different from cameras in terms of how
it focuses on things. For similar performance, we may need much higher
resolution cameras. We don't know.

~~~
haberman
Interesting, I wasn't aware. How are they different?

~~~
joshuamorton
Eyes don't have uniform resolution. The centers of our eyes (or what we're
focusing on) have very, very high resolution, while the outer parts have much
lower. Cameras normally have something like a middling resolution in
comparison.

~~~
sorenjan
Eyes don't send the entire picture to the brain at once at a constant sampling
rate either; they work more like event cameras [0]. Combine this with
micro-movements of the eye and specialized brain structures, and it's not as
simple as saying that because we only need two eyes, robots only need two
cameras. Sure, stereo vision might be enough, but what kind of cameras, and
what kind of computers, do we need to reach feature parity with our own sight?

[0]
[http://www.rit.edu/kgcoe/iros15workshop/papers/IROS2015-WASR...](http://www.rit.edu/kgcoe/iros15workshop/papers/IROS2015-WASRoP-
Invited-04-slides.pdf)

------
mark_l_watson
Very nice. It will be interesting to see future results that work with real
(non synthetic) scenes - I would not be surprised if that happens in just a
few months.

~~~
extralego
Why will you not be surprised if that happens in just a few months?

As a mere CG artist, I will still be experiencing the surprise of seeing these
examples a few months ago.

Are any particular recent achievements, announcements or similar influencing
your expectations? If so, please share.

~~~
Maybestring
I also wouldn't be surprised. Results in transfer learning from synthetic to
real world vision tasks suggest to me that if you could train this system with
(for example) GTA-V as the environment, it may work reasonably well in the
real world.

------
polskibus
Is this patented? I heard that DeepMind is patenting a lot of this work; does
that apply to this particular technique? Where would such a patent be
enforceable?

~~~
ooyy
From the paper: _DeepMind has filed a U.K. patent application
(GP-201495-00-PCT) related to this work._

------
auggierose
The displayed scenes remind me a lot of Wolfenstein 3D.

------
bitL
Fantastic! Is there some secret at DeepMind for boosting one's capabilities in
this space to be _that_ good at the bleeding edge?

------
cryoshon
i'm envisioning a new kind of black-swan style mistake which i'm going to call
the allegory of the neural net in the cave.

people will feed neural nets data, and ask them to describe the specific data
set that the data is coming from -- without having the majority of that data
set in hand.

in this instance, it would be showing the neural net a picture of a 3D area,
and then waiting for it to extrapolate the details of the rest.

on average, the neural net's prediction may line up with reality. that is to
say, the simulated data set is identical to the real data. that is what we are
seeing in the OP link. but as soon as this method can apprehend and predict
things of more complexity, that's where the differences will start to show.

sure, it isn't the neural net's fault -- any one worth its salt will place a
confidence estimate on its extrapolated data points. but people don't
understand how to interpret those confidence estimates. they'll round up to
100%, or round down to 0% accuracy. once people start using these techniques
to guide serious decisions in business or elsewhere, that's where those
dastardly percentages between 0 and 100 come into play.

imagine using this neural net as a way to generate returns in the context of
trading stocks on wall st. it's a misuse of the tool, of course. but that
won't stop people from making a decision based on a 95% probability of being
correct; of course, 5% of the time, it will result in disaster. nor will it
stop people from getting screwed by unknown unknowns.

this is the stuff which the consulting businesses of the future are built on
-- scolding people about abusing models while trying to preserve the power of
the model as a tool. needless to say, i'm interested in where this goes.

------
kayoone
finally, CSI level hollywood tech will be real in a few years

~~~
yoz-y
Would hallucinated images be accepted in court though? I hope not.

------
vokep
I. Am. Terrified.

This is too close

slow the hell down

~~~
mabbo
> slow the hell down

I'm afraid this isn't a car that you can stop, it's a freefall without a
parachute. You're welcome to try flapping your arms, for all the good it will
do.

And yes, the ground may or may not be approaching at an alarming rate.

~~~
sgtmas2006
What if we're flying up and not down?

~~~
fvdessen
Into cold and empty space ?

~~~
sgtmas2006
Towards endless stars, bound to be pulled in by another

