

Deep Convolutional Inverse Graphics Network - tejask
http://willwhitney.github.io/dc-ign/www/

======
radarsat1
This is really clever. So basically, iiuc, they set up a network to encode down
to a representation that consists of parameters for a rendering engine. To
ensure that this is the representation that gets learned, the decoding stage is
used to re-render the image subject to transformations of those parameters,
with the decoding based on an initial reduction phase after rendering. I.e. it
is like an autoencoder, but the inner-most reduced representation is forced to
be related to a graphics rendering engine by manipulating the related
transformation parameters.

Not only is this interesting from the point of view of using it for learning
how to generate images, but it is a novel way to force a semantic internal
representation instead of leaving it up to a regularisation strategy and
interpreting the sparse encoding post-hoc. It forces the internal
representation to be inherently "tweakable."
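
A rough sketch of what I mean by "tweakable" (a toy PyTorch illustration of my
own; the layer sizes and the convention that dimension 0 holds pose are my
placeholders, not the paper's actual architecture):

    import torch
    import torch.nn as nn

    class InverseGraphicsAE(nn.Module):
        """Autoencoder whose bottleneck is an interpretable 'graphics code'."""
        def __init__(self, code_size=200):
            super().__init__()
            self.encoder = nn.Sequential(          # image -> graphics code
                nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, code_size))
            self.decoder = nn.Sequential(          # graphics code -> image
                nn.Linear(code_size, 64 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            code = self.encoder(x)
            return self.decoder(code), code

    model = InverseGraphicsAE()
    recon, code = model(torch.randn(1, 1, 64, 64))

    # the "tweakable" part: nudge the code dimension reserved for, say,
    # azimuth and just decode again -- no retraining needed
    code2 = code.clone()
    code2[0, 0] += 1.0                  # dim 0 reserved for pose, by convention
    rotated_view = model.decoder(code2)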

~~~
cs702
This can also be used for object recognition against invariant 3D
representations, potentially with more accuracy than traditional convolutional
neural net architectures.

Consider: their proof-of-concept face-recognition model achieves performance
comparable to traditional convnets on faces with varying degrees of pose,
lighting, shape, and texture, even though it was trained _completely
unsupervised_. I would expect this type of model to beat the state of the art
in face recognition and other similar tasks when fine-tuned with supervised
training in the not-too-distant future.

------
rirarobo
Very cool work; I'm happy to see more people thinking about deep networks
along these lines. It seems very similar to recent work posted on arXiv back
in November,
"Learning to Generate Chairs with Convolutional Neural Networks".
[http://arxiv.org/abs/1411.5928](http://arxiv.org/abs/1411.5928)

They also have a very cool video of the generation process:
[https://youtu.be/QCSW4isBDL0](https://youtu.be/QCSW4isBDL0)

It's very interesting to see two groups independently developing almost
identical networks for inverse graphics tasks, both using pose, shape, and
view parameters to guide learning. I think that continuing in this direction
could provide a lot of insight into how these deep networks work, and lead to
new improvements for recognition tasks too.

@tejask - You should probably cite the above paper, and thanks for providing
code! awesome!

~~~
tejask
Thanks for the references! I'm glad that many people are working on such
things. After looking at the chairs paper, it seems they render images given
pose, shape, view, etc. (a supervised setting). However, in our model there is
a twist, as it is trained either completely unsupervised or biased to separate
those variables (but it is never given the true values of those parameters ...
just raw data).
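
Roughly, the "biased to separate" variant relies only on mini-batches in which
a single scene factor varies while everything else is held fixed. A toy sketch
of the forward clamp in PyTorch (hypothetical code, not our actual
implementation, which also manipulates gradients in the backward pass):

    import torch

    def clamp_batch_code(code, active_dim):
        """For a batch where only ONE scene factor varied, force every
        inactive code dimension to its batch mean, so only `active_dim`
        can explain the within-batch variation."""
        clamped = code.mean(dim=0, keepdim=True).expand_as(code).clone()
        clamped[:, active_dim] = code[:, active_dim]    # let one dim vary
        return clamped

    # e.g. 16 renders of the SAME face under varying azimuth only:
    codes = clamp_batch_code(torch.randn(16, 200), active_dim=0)
    # the decoder must now explain all within-batch variation through
    # dim 0 alone, which pushes dim 0 to align with azimuth -- without
    # the network ever seeing the true azimuth values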

------
svantana
This is very nice; however, I wish they had used a traditional rendering
technique (e.g. raytracing) for the decoder stage. It would have been more
difficult to compute the gradient, but maybe not too bad if employing some
type of automatic differentiation. Done that way, the renderings could scale
to any resolution (post-learning) and employ all kinds of niceties such as
depth of field, sub-surface scattering, etc. Instead we're left with these
very blocky, quantized convolution-style images.
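
For a sense of what that could look like: autodiff frameworks will happily
push gradients through any renderer written as ordinary tensor ops. A toy
example of my own (nothing to do with the paper), differentiating Lambertian
shading of a sphere with respect to the light direction in PyTorch:

    import torch

    def render_sphere(light_dir, res=64):
        """Toy 'renderer': Lambertian shading of a unit sphere, written
        entirely in differentiable tensor ops."""
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, res),
                                torch.linspace(-1, 1, res), indexing="ij")
        r2 = xs ** 2 + ys ** 2
        mask = (r2 < 1.0).float()
        z = torch.sqrt(torch.clamp(1.0 - r2, min=0.0))
        normals = torch.stack([xs, ys, z], dim=-1)      # per-pixel normals
        l = light_dir / light_dir.norm()
        return torch.clamp((normals * l).sum(-1), min=0.0) * mask

    light = torch.tensor([0.3, 0.5, 1.0], requires_grad=True)
    target = render_sphere(torch.tensor([1.0, 0.0, 0.5]))

    loss = ((render_sphere(light) - target) ** 2).mean()
    loss.backward()        # gradient of the image loss w.r.t. the light
    print(light.grad)      # usable for gradient-based inverse graphics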

~~~
tejask
One of the authors here. You are absolutely right! In fact, I am currently
doing something similar but it is not working as well yet. As far as this work
is concerned, we wanted to see how model-free can we go.

~~~
rndn
I don’t understand much of the paper but it looks awesome! I have two
questions: Am I understanding it correctly that one would need to convert the
internal representation to a textured triangle mesh in order to use ray
tracing in the decoder stage? Is the encoder effectively similar to scene
reconstruction via structure from motion?

~~~
tejask
There are many ways to parametrize the decoder. One way is to constrain it to
output an explicit mesh or volumetric representation and express the rendering
pipeline so that it's differentiable. The encoder will then effectively learn
an "inference algorithm" to get the best output. A feedforward neural network
is not enough, and recurrent computations will eventually be necessary.
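
As a deliberately crude sketch of the volumetric option (hypothetical PyTorch
code; a real pipeline would ray-march with proper camera transforms):

    import torch
    import torch.nn as nn

    class VoxelDecoder(nn.Module):
        """Decoder constrained to an explicit volumetric representation:
        code -> 32^3 occupancy grid -> differentiable projection."""
        def __init__(self, code_size=200, res=32):
            super().__init__()
            self.res = res
            self.to_voxels = nn.Sequential(
                nn.Linear(code_size, res ** 3), nn.Sigmoid())

        def forward(self, code):
            occ = self.to_voxels(code).view(-1, self.res, self.res, self.res)
            # transmittance-style compositing along the depth axis;
            # every op is differentiable, so gradients reach the code
            transmittance = torch.cumprod(1.0 - occ + 1e-6, dim=1)
            return 1.0 - transmittance[:, -1]     # opacity image, head-on view

    dec = VoxelDecoder()
    images = dec(torch.randn(4, 200))             # (4, 32, 32) renders
    images.sum().backward()                       # end-to-end gradients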

~~~
jwp729
Can you explain a bit more why the recurrent network structure becomes
necessary at some point? Is that because reversing a CNN naturally means
rendering by (de)convolution?

~~~
tejask
In order to approximately learn a "real" graphics engine with support for
basic physics, just feed-forward computation might not be sufficient. A more
natural way to learn graphics/physics might be to learn the temporal structure
more explicitly. On the other hand, it might also be interesting to just add
temporal convolution-deconvolution structure in the existing model. This is
work in progress though.
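
The "just add temporal convolution" option could be as simple as treating time
as a third axis (a sketch of the general idea, not our model):

    import torch
    import torch.nn as nn

    # frames: (batch, channels, time, height, width); a 3D convolution
    # mixes information across neighboring frames, so the encoder can
    # pick up motion/physics cues a single-frame model cannot see
    temporal_enc = nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2))
    video = torch.randn(2, 1, 8, 64, 64)          # two 8-frame clips
    features = temporal_enc(video)                # (2, 32, 8, 64, 64)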

------
poslathian
Reminds me of being blown away in 2007 by Vetter and Blanz chasing a similar
aim:
[https://m.youtube.com/watch?v=jrutZaYoQJo](https://m.youtube.com/watch?v=jrutZaYoQJo)

------
ericjang
Whoa. Basically like
[http://www.di.ens.fr/willow/pdfscurrent/pami09a.pdf](http://www.di.ens.fr/willow/pdfscurrent/pami09a.pdf)
except it skips the (explicit) 3D mesh reconstruction altogether and goes
straight to the rendered output.

------
FallDead
In layman's terms, this does what?

~~~
kveykva
The network learns a model of lighting and geometry, so you can manipulate a
set of codes that represent some variables of that geometry and the positions
of those lights.

~~~
tejask
In summary, the most interesting part for the general audience might be the
following question -- can we learn a 3D rendering engine just from images or
videos without any hand-engineering?

Apart from the interesting applications for computer graphics (like rendering
novel viewpoints of an object), this can also be directly used for vision
applications. This is because computer vision can be thought of as the inverse
of computer graphics.

Goal of computer graphics: scene description -> images

and

Goal of vision: images -> scene description.

Therefore, training a neural network to behave like a graphics engine is
interesting from both these perspectives. We are a LONG way from even
scratching the surface.

~~~
perdunov
How long has this idea of making a 3D engine from conv nets been researched?

~~~
tejask
To the best of my knowledge, not much at all. It is an open question. Besides,
a feedforward net is not going to be enough.

------
_0ffh
Haven't read the paper yet, but sounds similar in concept to what Geoff Hinton
aims at for image recognition networks.

~~~
tejask
Yes this is very much inspired by Geoff's work.

------
amelius
So, in essence, this network can learn to "unproject" images.

Since projection is a lossy operation, a projected image has potentially
multiple inverses. And this makes me wonder how this system deals with the
situation where two or more inverses exist and are equally likely.

~~~
tejask
This is an interesting question. Technically, we capture a probability
distribution in the code layer (between encoder and decoder). So you can
sample from it multiple times and assess uncertainty. However, we have not
really studied this.
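
Concretely, since the code layer is trained variationally, sampling it
multiple times looks roughly like this (toy PyTorch sketch with stand-in
values instead of real encoder outputs):

    import torch

    def sample_code(mu, logvar):
        """Reparameterized sample from the approximate posterior q(z|x)."""
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    # stand-ins for what the encoder would produce for one image
    mu, logvar = torch.zeros(200), torch.zeros(200)

    # several plausible scene interpretations of the same image; their
    # spread is a crude measure of the model's uncertainty
    samples = torch.stack([sample_code(mu, logvar) for _ in range(10)])
    print(samples.std(dim=0).mean())   # high std => ambiguous unprojection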

