Based on a quick skim of the paper, here is my oversimplified description of how this works:
During training, an agent navigates an artificial 3D scene, observing multiple 2D snapshots of the scene, each snapshot from a different vantage point. The agent passes these snapshots to a deep net composed of two main parts: a representation-learning net and a scene-generation net. The representation-learning net takes as input the agent's observations and produces a scene representation (i.e., a lower-dimensional embedding which encodes information about the underlying scene). The scene-generation network then predicts the scene from three inputs: (1) an arbitrary query viewpoint, (2) the scene representation, and (3) stochastic latent variables. The two networks are trained jointly, end-to-end, to maximize the likelihood of generating the ground-truth image that would be observed from the query viewpoint. See Figure 1 on Page 15 of the Open Access version of the paper. Obviously I'm playing loose with language and leaving out numerous important details, but this is essentially how training works, as I understand it based on a first skim.
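To make that concrete, here's a minimal sketch of the training step in PyTorch. Every module below is my own hypothetical stand-in, not DeepMind's code: the actual paper uses a DRAW-style recurrent generator trained on a variational lower bound, where I substitute a single Gaussian latent and a plain reconstruction loss.

```python
# Hypothetical sketch of the training step described above (NOT DeepMind's code).
# The real GQN uses a DRAW-style recurrent generator and an ELBO objective;
# here a single Gaussian latent and an MSE reconstruction loss stand in for both.
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """Encodes one (image, viewpoint) observation into a scene embedding."""
    def __init__(self, repr_dim=256, pose_dim=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64 + pose_dim, repr_dim)

    def forward(self, image, viewpoint):
        h = self.conv(image).flatten(1)
        return self.fc(torch.cat([h, viewpoint], dim=1))

class GeneratorNet(nn.Module):
    """Predicts the image at a query viewpoint from (1) that viewpoint,
    (2) the scene representation, and (3) stochastic latent variables."""
    def __init__(self, repr_dim=256, pose_dim=7, z_dim=64):
        super().__init__()
        self.z_dim = z_dim
        self.fc = nn.Linear(repr_dim + pose_dim + z_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, scene_repr, query_viewpoint):
        z = torch.randn(scene_repr.size(0), self.z_dim, device=scene_repr.device)
        h = self.fc(torch.cat([scene_repr, query_viewpoint, z], dim=1))
        return self.deconv(h.view(-1, 64, 8, 8))

def training_step(repr_net, gen_net, optimizer, obs_images, obs_views, q_view, q_image):
    # Sum the per-observation embeddings into one scene representation,
    # render the query view, and penalize deviation from the ground truth.
    scene_repr = sum(repr_net(img, vp) for img, vp in zip(obs_images, obs_views))
    pred = gen_net(scene_repr, q_view)
    loss = nn.functional.mse_loss(pred, q_image)  # stand-in for the paper's likelihood term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The summation over per-observation embeddings is the part I'm most confident about: the paper does aggregate observations by element-wise addition, which is what lets the representation handle a variable number of snapshots.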
EDIT: I replaced "somewhat obvious" with "natural," which better conveys what I actually meant to write the first time around.
Here I was, thinking I finally had thought of a nice PhD project and then Deepmind comes along and gets the scoop! Haha.
For example, if you have a classifier that can recognize cats, it doesn't mean it will work for cartoon cats. You'd have to train the system all over again with cartoon cats. Instead, you want the system to learn more like humans, where only a small number of examples is necessary to make the connection between real and cartoon cats.
Novelty is usually a requirement for a PhD project.
It may be splitting hairs, but I think the mammalian brain, at least, can simulate/remember/imagine additional 'dimensions' like X/Y/Z spin, derivatives of velocity like acceleration/jerk/jounce.
Is space 11 dimensional (M string theory) or 2 dimensional (holographic principle)? What 'dimensions' does the human brain process? Is this capacity innate or learned; should we expect pilots and astronauts to have learned to more intuitively cognitively simulate gravity with their minds?
Everything takes figuratively forever to train, and the field is moving incredibly fast, so everything you see is both directly adjacent to previous work and also impressive.
To me, this represents the best of science--that we can collectively make rapid progress without having to invoke an Einstein figure to make some magical leap!
This is a natural progression from prior work. It's a compliment :-)
Ex. how's this different from https://arxiv.org/pdf/1503.03167.pdf from 2015?
Yes, this work is incremental, not a breakthrough; it's a natural progression from a lot of other prior work.
BUT they're doing this with environments (not just with objects), with an agent that explores those environments, with color (unlike some of the older work), and with evident applicability in a range of "Open AI Gym"-style deep RL tasks.
I find the examples the authors show amazing. They feel qualitatively different. The fact that the work is incremental doesn't make it any less impressive.
BTW, I remember reading that paper from MIT when it came out; I find it also amazing -- even if in the end it's all only "fitting a manifold to a set of points."
I think it's the idea that a network capable of doing that is a far, far better input to training an agent than flat images, or even the ground-truth 3D space.
"GQN uses analysis-by-synthesis to perform “inverse graphics,” but unlike existing methods (42), which require problem-specific engineering in the design of their generators, GQN learns this behavior by itself and in a generally applicable manner. However, the resulting representations are no longer directly interpretable by humans."
One thing I wish they would make more explicit (and in all their papers that I've read for that matter) is how much computational power it takes to train these networks and achieve these results.
I'm not sure if this is something they usually leave out because it's not interesting, or because people who work with deep networks all the time (i.e., not me) already have a feel for it.
As someone in a related field (sometimes using deep networks but not researching them for their own sake), I certainly would like to know which of DeepMind's results would be feasible to replicate using my research group's resources, and it can be hard to gauge that without spending a lot of time actually trying to replicate the results and benchmarking them on your hardware.
Original comment: I think they leave it out because otherwise the standard response to their work would be "no surprise they get better results than anyone else, they are using two orders of magnitude more compute time than anyone else!" Not highlighting the computational expense makes their results look more impressive.
I would say having an obscene amount of compute is definitely a big competitive advantage, especially over a lot of small academic research labs.
> The values of all hyper-parameters were selected by performing informal search. We did not perform a systematic grid search owing to the high computational cost.
I think they leave it out because it's a trade secret. Like the way they waited a long time before announcing that they used specialized hardware (TPU) for AlphaGo.
You can build a 3D key-point map using SLAM algorithms in real time on a Raspberry Pi. From there, you render those key-points and descriptors to a virtual screen given the desired camera pose, then you learn a deconvolution mapping from these sparse rendered key-points to images.
Alternatively, using more memory, once you have a 3D map you can save some keyframes with their camera poses; when asked for the view from a given pose, you pick the k closest poses and interpolate, possibly with a neural net (see the sketch below).
If you have some more compute, you can run the previous SLAM algorithms with dense maps and interpolate the dense 2.5D point clouds.
Their network is probably doing an inefficient mixture of those different things, trading compute and memory for flexibility.
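A toy sketch of that k-closest-pose interpolation, just to illustrate (my own invention; a real system would warp each keyframe by the relative pose rather than blend raw pixels):

```python
# Toy illustration of k-nearest-keyframe view synthesis (not a real renderer):
# blend the images of the k keyframes whose camera positions are closest to
# the requested pose, weighted by inverse distance.
import numpy as np

def render_from_keyframes(query_pos, keyframe_positions, keyframe_images, k=3):
    """query_pos: (3,) camera position; keyframe_positions: (N, 3);
    keyframe_images: list of HxWx3 uint8 arrays saved during mapping."""
    dists = np.linalg.norm(np.asarray(keyframe_positions) - np.asarray(query_pos), axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-6)
    weights /= weights.sum()
    blended = sum(w * keyframe_images[i].astype(np.float64)
                  for w, i in zip(weights, nearest))
    return blended.astype(np.uint8)
```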
Does anyone have links to any open source projects doing this? Preferably with an example video(s) showing results?
For use on a Raspberry Pi model 3B+ with 2000 key-points, it runs at 1-2 fps at 640x480 and 5-6 fps at 320x240. It uses about 500 MB for a few rooms, and 75% CPU while map-building, 50% once the map is built. It's not optimized for the Pi, so you can probably get it to run at least 3 times faster if you're willing to get your hands dirty.
For it to work well, a 180-degree camera really makes a difference; also run OpenCV's cpp-tutorial-cameracalibration on a chessboard to get the needed camera intrinsics.
There are probably other SLAM algorithms in ROS, but I'm not sure how Raspberry-Pi-compatible they are.
If you want to try and experiment with neural networks: once you have built your 3D map on a powerful computer, you can train a neural network to learn the camera pose from the image. This gives you a constant-time, constant-memory algorithm for later use on the Pi, though it will probably be less precise.
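Something like this PoseNet-style regressor is what I have in mind; layer sizes are made up, and the training pairs (image, pose) would come from the SLAM map built offline:

```python
# Hypothetical PoseNet-style sketch: regress camera pose directly from an
# image, trading the SLAM map for a constant-time, constant-memory localizer.
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 7)  # 3D position + unit quaternion orientation

    def forward(self, image):
        out = self.head(self.features(image).flatten(1))
        pos, quat = out[:, :3], out[:, 3:]
        return pos, quat / quat.norm(dim=1, keepdim=True)  # normalize the quaternion

# Train with position MSE plus a weighted orientation term, as in PoseNet.
```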
It's not without its pitfalls, though: instead of monocular SLAM generating a factual, albeit fuzzy, point-cloud map, we get an overfitted (5 pictures x 2M similar scenes) magic black box generating very training-set-specific hallucinations. This is how we get scanners replacing numbers in scanned documents: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...
A similar example was posted 2 months ago (https://data.vision.ee.ethz.ch/aeirikur/extremecompression/): the example picture (no doubt the best case the authors could manage) gained additional data absent from the original, some of it dangerous, like fake license-plate numbers.
This is incredible because it provides a way to link with linguistic understanding and manipulation of the rendering.
Reimplementing the rest of the paper shouldn't be tremendously difficult. These techniques tend to be fairly simple at their core. But training it could be expensive. In any case, it is a long, long way from here to "make generating environments for games or VR trivial by just taking photos IRL and then importing them into some game engine". Many years of research remain.
This is not intended as a criticism of this research, which I think is really great.
I wonder if this is efficient... I know this isn't the researchers' intended application, but the path tracer in me wants to see how far this can be pushed for real-time rendering. I welcome the more interesting artefacts that a NN might produce (I'm talking about pigsnails, of course :D)
Full circle: GPU GLSL for graphics -> GPU CUDA/OpenCL for NNs -> GPU CUDA/OpenCL for NN graphics.
Some examples that you might appreciate (from the excellent channel "Two Minute Papers"):
- "Disney's AI Learns To Render Clouds"
- "AI Learns Noise Filtering For Photorealistic Videos"
So to extrapolate wildly: it seems reasonable that to get similar results for, say, real-world bedrooms, you'd need to take around 5 pics each of 2M bedrooms, and record the location and angle of the camera for each picture.
Edit: I didn't mean to sound negative; using artificially generated rooms to develop the method is a great idea. The next step will be narrow, domain-specific applications (they mention e.g. robotic arms) where it's feasible to automatically collect enough data for a task, and somewhere in the future we may have the data and compute to sample the distribution of real-world environments in decent resolution.
I think there are only 15k different rooms, though, rather than the 2M you suggested!
Combined with VXGI and other photo realism efforts, AI could produce any permutation of your favorite TV show that ended too soon. Ex. Breaking Bad Season 15: Walter Jr's Revenge or something like that.
There's also an AI that produced a clone of a game by watching videos. With this new neural scene representation, you wouldn't have to train it with thousands of hours of gameplay footage; it could see a video once and figure out the game mechanics, e.g. stepping on a group of sprites it recognizes as Enemy1 should increment the score count, based on some generic platformer template model.
Once again, Deepmind delivers.
Also, what would you render? You still need input.
Sounds almost like learning new skills and forgetting old ones.
As humans, when we look at a scene, then move a few feet and look at it again, we have a pretty good idea what the delta between the two views was, so why is providing the same info here any different?
I am also curious whether they are going to use this architecture to defend against adversarial GANs that attempt to defeat image recognition.
This, despite the fact that humans do well enough and that Structure From Motion has been a well-established part of Computer Vision research for a while.
More on topic, this is pretty great work and it'll have wide applications, for example Google's own efforts to make robot arms more perceptive using regular cameras: https://ai.google/research/teams/brain/robotics/
I don't think anyone reasonable says it's impossible - after all, humans are living, walking proof that it's entirely doable. The core of the criticism is that it's insanely more difficult than just using LIDAR data. One could even say that LIDAR, as a specialized tool for depth detection, is an intrinsically better tool for the job.
As you mention, this is trivially true on the level of the sensors. We know cameras are enough because human brains can do it with eyes (biological cameras).
But there is no evidence that we know how to write software for the processing part of it (to an acceptable degree of safety) with cameras only, nor that the computing power on-board is up to the task.
How can you say that hardware package X is sufficient to implement something that has literally never been done before?
Maybe. The human eye is fundamentally different from cameras in terms of how it focuses on things. For similar performance, we may need much higher-resolution cameras. We don't know.
I mean, we could be focusing 100% on the road ahead, and can still respond to peripheral events lightning fast.
I think this is why a lack of sleep can so adversely affect driving. It seems that if you haven't slept well, the "behind the scenes" processing takes a hit, which greatly impacts driving capability.
LIDAR gives you data points which may be more useful than optical data, but in order for it to be safely incorporated into a guidance system, there need to be times when it is out-voted by optical sensors.
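As a toy illustration of that out-voting (entirely my own, not from any real vehicle stack), the logic might look like this:

```python
# Toy sensor-fusion vote (my own illustration, not any real guidance system):
# a lone LIDAR detection cannot trigger emergency braking on its own.
def should_emergency_brake(lidar_sees_obstacle: bool,
                           camera_sees_obstacle: bool,
                           radar_sees_obstacle: bool) -> bool:
    votes = sum([lidar_sees_obstacle, camera_sees_obstacle, radar_sees_obstacle])
    return votes >= 2  # require agreement from at least two independent sensors
```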
Back when the company I worked for was building the Heathrow Pod, they had the option of putting LIDAR on the vehicles for collision-detection. They eventually decided against it, determining that the probability of injuries due to false-positive e-braking manoeuvres was higher than the probability of injuries if LIDAR was omitted altogether.
This was more than 10 years ago, and the state of the art has advanced considerably. I'm sure the balance of this calculus has changed. But figured it was worth pointing out that LIDAR is by no means a trivial thing to incorporate into a guidance system.
Versus lidar, which can't be fooled about how far away something is.
Robot arms in a factory environment are very different from driving in a snowstorm, or blowing rain, or through a construction site, or with tree branches waving around.
It's precisely this experience that drives people to ask Wtf are they thinking?
In the current state of the art there are many things that a self-driving car can do better than any human (reaction speed, never getting distracted, etc.). Then there are other driving tasks that humans are still far better at... I'm not a self-driving expert, but I suspect that they're still worse at picking up complicated environmental cues (say, a pedestrian acting unusually or trying to alert you to something).
The safety premise of self-driving is that the former can outweigh the latter. The idea is that there will be some accidents that might have been avoidable by a proficient human driver, but many more that will be prevented by a machine's super-human capabilities in other areas.
To make a self-driving car as safe as possible, you'd want to give it every super-human capability that's practical. LIDAR provides one such capability -- a super-human understanding of where objects are even when the visual environment is challenging.
The question isn't whether it's possible for a LIDARless system to operate safely. Obviously humans don't have LIDAR and they're considered "safe enough" to put on the roads. The concern is whether it would be as safe as possible... and whether it would be so safe that people would feel comfortable sharing the highway with such a car.
As a mere CG artist, I'll still be experiencing the surprise of seeing these examples a few months from now.
Are any particular recent achievements, announcements or similar influencing your expectations? If so, please share.
People will feed neural nets data and ask them to describe the specific data set the data is coming from, without having the majority of that data set in hand.
In this instance, it would be showing the neural net a picture of a 3D area and then waiting for it to extrapolate the details of the rest.
On average, the neural net's prediction may line up with reality; that is to say, the simulated data set is identical to the real data. That is what we are seeing in the OP link. But as soon as this method can apprehend and predict things of more complexity, that's where the differences will start to show.
Sure, it isn't the neural net's fault: any one worth its salt will place a confidence estimate on its extrapolated data points. But people don't understand how to interpret those confidence estimates. They'll round up to 100% accuracy, or round down to 0%. Once people start using these techniques to guide serious decisions in business or elsewhere, that's where those dastardly percentages between 0 and 100 come into play.
Imagine using this neural net as a way to generate returns in the context of trading stocks on Wall St. It's a misuse of the tool, of course. But that won't stop people from making a decision based on a 95% probability of being correct; of course, 5% of the time it will result in disaster. Nor will it stop people from getting screwed by unknown unknowns.
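The arithmetic that bites people here is simple (numbers below are invented, purely for illustration):

```python
# Invented numbers, just to make the point: "right 95% of the time" can still
# have negative expected value when the 5% failure is catastrophic.
p_win, gain = 0.95, 1_000       # modest profit when the model is correct
p_lose, loss = 0.05, -50_000    # blow-up when it is wrong
expected_value = p_win * gain + p_lose * loss
print(expected_value)  # -1550.0
```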
This is the stuff the consulting businesses of the future are built on: scolding people about abusing models while trying to preserve the power of the model as a tool. Needless to say, I'm interested in where this goes.
This is too close.
Slow the hell down.
I'm afraid this isn't a car that you can stop, it's a freefall without a parachute. You're welcome to try flapping your arms, for all the good it will do.
And yes, the ground may or may not be approaching at an alarming rate.
What is it about this technology that isn't already done bigger / better / faster / more by TV and video games, or by alternate technology like LiDAR or even a Kinect?
There are extremely good arguments for caution in the development of artificial intelligence. It's reasonable to suppose it will be society altering at a minimum, but in not so improbably extremes it threatens extinction and immortality.
As for what makes this advance notable to someone counselling caution about speed: it's explicitly an example of a removed limiter, namely data collection and labeling. As such, it represents a potential acceleration of progress.
Given that speed is seen as being in conflict with caution, an increase in speed can be read, especially by the fearful, as a reduction in caution.
And, as an example of how we treat a technology with less inherent danger: we exercise an abundance of caution around nuclear and chemical weapons, to the point of military attacks to prevent their development.
Also differential in rate of roll-out. "The future's already here, it's just not evenly distributed." ~W. Gibson
Old people are already being attacked by automated systems they cannot understand. It will happen to you sooner or later. It may have already happened and you didn't notice.