Neural scene representation and rendering (deepmind.com)
545 points by johnmoberg 9 months ago | 114 comments

This work is a natural progression from a lot of other prior work in the literature... but that doesn't make the results any less impressive. The examples shown are amazingly, unbelievably good! Really GREAT WORK.

Based on a quick skim of the paper, here is my oversimplified description of how this works:

During training, an agent navigates an artificial 3D scene, observing multiple 2D snapshots of the scene, each snapshot from a different vantage point. The agent passes these snapshots to a deep net composed of two main parts: a representation-learning net and a scene-generation net. The representation-learning net takes as input the agent's observations and produces a scene representation (i.e., a lower-dimensional embedding which encodes information about the underlying scene). The scene-generation network then predicts the scene from three inputs: (1) an arbitrary query viewpoint, (2) the scene representation, and (3) stochastic latent variables. The two networks are trained jointly, end-to-end, to maximize the likelihood of generating the ground-truth image that would be observed from the query viewpoint. See Figure 1 on Page 15 of the Open Access version of the paper. Obviously I'm playing loose with language and leaving out numerous important details, but this is essentially how training works, as I understand it based on a first skim.
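
To make the data flow above concrete, here is a minimal toy sketch of the forward pass only (no gradient step). Everything in it is hypothetical: the function names `represent`, `aggregate`, `generate` and `loss` are mine, the "networks" are trivial arithmetic stand-ins, summing the per-observation embeddings is an assumption made for illustration, and squared error stands in for the likelihood objective. It is not the paper's architecture.

```python
import random

def represent(image, viewpoint):
    """Hypothetical stand-in for the representation net: one observation -> embedding."""
    mean_pixel = sum(image) / len(image)
    return [mean_pixel, *viewpoint]  # 1 pixel statistic + 3 viewpoint numbers

def aggregate(embeddings):
    """Combine per-observation embeddings into one scene representation.
    Element-wise sum is an assumption made for this sketch."""
    return [sum(values) for values in zip(*embeddings)]

def generate(query_viewpoint, scene_repr, latent):
    """Hypothetical stand-in for the generation net: one predicted pixel per latent value."""
    base = scene_repr[0] + sum(q * r for q, r in zip(query_viewpoint, scene_repr[1:]))
    return [base + z for z in latent]

def loss(predicted, target):
    """Squared error, used here as a stand-in for a negative log-likelihood."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Two observations of one scene: (flat 4-pixel image, 3-number viewpoint).
observations = [
    ([0.0, 1.0, 1.0, 0.0], (1.0, 0.0, 0.0)),
    ([1.0, 1.0, 1.0, 1.0], (0.0, 1.0, 0.0)),
]
scene_repr = aggregate([represent(img, vp) for img, vp in observations])

rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(4)]  # stochastic latent variables
query_viewpoint = (0.5, 0.5, 0.0)
predicted = generate(query_viewpoint, scene_repr, latent)

ground_truth = [2.0, 2.0, 3.0, 3.0]  # image actually seen from the query viewpoint
print(loss(predicted, ground_truth))  # the quantity joint training would push down
```

In real training, both stand-in functions would be neural nets and the loss would be backpropagated through both of them at once, which is what "trained jointly, end-to-end" refers to.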

EDIT: I replaced "somewhat obvious" with "natural," which better conveys what I actually meant to write the first time around.

Literally 15 minutes ago, I was chatting with a friend about how what we are doing right now with computer vision is all based on a flawed premise (supervised training on 2D datasets). The human brain works in 3D space (or 3D+time) and then projects all this knowledge onto a 2D image.

Here I was, thinking I finally had thought of a nice PhD project and then Deepmind comes along and gets the scoop! Haha.

I don’t think this is a novel idea, but it is still a great topic for a PhD. While the results in this paper look impressive, my suspicion is that the system doesn’t generalize particularly well. (I suspect this from experience with similar, albeit simpler, ideas, as well as from looking at the datasets.) If you can make a system that generalizes to new environments and objects, or a system that works with real-world natural image/video data, that would be a tremendous accomplishment.

Generalization is a more fundamental problem, and (imho) should be tackled first at a more fundamental level.

For example, if you have a classifier that can recognize cats, it doesn't mean it will work for cartoon cats. You'd have to train the system all over again with cartoon cats. Instead, you want the system to learn more like humans, where only a small number of examples is necessary to make the connection between real and cartoon cats.

It is possible that the problems are related: it may be that, to achieve human-like generalization, neural nets need to learn in a human-like environment, instead of from a folder full of images. But time will tell.

This has been said many times in different ways over the years. To achieve human-like intelligence one needs a human-like body operating in a human-like environment. It's the first of the E's in: embodied emergent extended enactive. https://plato.stanford.edu/entries/embodied-cognition/

> I don’t think this is a novel idea, but it is still a great topic for a PhD.

Novelty is usually a requirement for a PhD project.

For a result, not a project, and certainly not a topic!

Maybe in trivial examples, but bleeding edge has for quite some time adopted 3D convolutions (either 3D space or 2D + time), and combined them with RNNs for more power.

Here is an excellent talk by Geoffrey Hinton on exactly that topic https://www.youtube.com/watch?v=rTawFwUvnLE

Indeed. Here is a recent realization of those ideas by him and his team: https://openreview.net/forum?id=HJWLfGWRb

The trick is then to have one agent try to take on the perspective of another agent, then change its behavior accordingly (depending on task goal). I.e. theory of mind.

Hide and seek

There you go. A good task.

Would probably make for more interesting game AI and pathfinding behavior for sure.

"Spatial memory" https://en.wikipedia.org/wiki/Spatial_memory

It may be splitting hairs, but I think the mammalian brain, at least, can simulate/remember/imagine additional 'dimensions' like X/Y/Z spin, derivatives of velocity like acceleration/jerk/jounce.

Is space 11 dimensional (M string theory) or 2 dimensional (holographic principle)? What 'dimensions' does the human brain process? Is this capacity innate or learned; should we expect pilots and astronauts to have learned to more intuitively cognitively simulate gravity with their minds?

> This work is a somewhat obvious progression from a lot of other prior work in the literature... but that doesn't make the results any less impressive.

Everything takes figuratively forever to train, and the field is moving incredibly fast, so everything you see is both directly adjacent to previous work and also impressive.

To me, this represents the best of science--that we can collectively make rapid progress without having to invoke an Einstein figure to make some magical leap!

I replaced "somewhat obvious" with "natural," which better conveys what I actually meant to write the first time around

This is a natural progression from prior work. It's a compliment :-)

What is a stochastic latent variable? Is it just random noise inputted to the network?


I'm surprised people are so blown away by this. It's a cool demonstration, but for this problem you have basically infinite training data. If you can find a latent space of faces this is hardly a stretch, since you already have a fantastic notion of locality in your data (by perturbing the camera). The interesting thing is generalization, which they show in figure 3B and is... ok, I guess. It's not that surprising compared to any of the other VAE stuff people have done (see the morphing scenes, 3D face illumination / rendering and furniture stuff from 2 years ago, for instance). It's also not that surprising compared to ex. the generative scene model RL paper that came out a few months ago (with Doom and the driving game). IMO deep learning research has moved beyond "here's another set of points I can fit a curve to". It really feels like this publication was heavily driven by prestige when most of the innovative stuff was achieved by other groups 2 years ago or more.

Ex. how's this different from https://arxiv.org/pdf/1503.03167.pdf from 2015?

I can't disagree with the contents of your comment; I upvoted it.

Yes, this work is incremental, not a breakthrough; it's a natural progression from a lot of other prior work.

BUT they're doing this with environments (not just with objects), with an agent that explores those environments, with color (unlike some of the older work), and with evident applicability in a range of "OpenAI Gym"-style deep RL tasks.

I find the examples the authors show amazing. They feel qualitatively different. The fact that the work is incremental doesn't make it any less impressive.

BTW, I remember reading that paper from MIT when it came out; I find it also amazing -- even if in the end it's all only "fitting a manifold to a set of points."

I don't think the breakthrough here is generating a 3D space from 2D snapshots.

I think it's the idea that a network capable of doing that is a far, far better input to training an agent than flat images, or even the ground-truth 3D space.

They address that paper specifically here:

"GQN uses analysis-by-synthesis to perform “inverse graphics,” but unlike existing methods (42), which require problem-specific engineering in the design of their generators, GQN learns this behavior by itself and in a generally applicable manner. However, the resulting representations are no longer directly interpretable by humans."

This is not completely true - they provide labelled poses as part of the training set. The differences are really minor.

Maybe it's just me, but I had no idea any of this was currently possible. Any cool videos about this we should see for further mind-blowing?

To put this into a larger context: I wonder, do we currently have the technology to see another major breakthrough in AI? Because frankly all I’m seeing is impressive, but nonetheless incremental, advancement.

Don't worry, OP, I'm sure DeepMind patented this approach.

They're trying to, actually; the Science paper gives the patent application number.

Very cool work, DeepMind wows me once again.

One thing I wish they would make more explicit (and in all their papers that I've read for that matter) is how much computational power it takes to train these networks and achieve these results.

I'm not sure if this is something they usually leave out because it's not interesting, or because it's something that people that work with deep networks all the time (I.E. not me) already have a feel for.

As someone in a related field (sometimes using deep networks but not researching them for their own sake), I certainly would like to know which of DeepMind's results would be feasible to replicate using my research group's resources, and it can be hard to do that without spending a lot of time actually trying to replicate the results and benchmarking them on your hardware.

Retracted! See below; compute is disclosed and is not that crazy. For this project their training hardware used 4 Nvidia K80s.

Original comment: I think they leave it out because otherwise the standard response to their work would be "no surprise they get better results than anyone else, they are using two orders of magnitude more compute time than anyone else!" Not highlighting the computational expense makes their results look more impressive.

Also, because they are focusing heavily on the RL part of the modeling. They obviously have obscene amounts of available compute, but that is not their competitive advantage.

What exactly do you mean? Are you saying that RL requires less compute?

I would say having an obscene amount of compute is definitely a big competitive advantage, especially over a lot of small academic research labs.

> We train each GQN model simultaneously on 4 NVidia K80 GPUs for 2 million gradient steps. The values of the hyper-parameters used for optimisation are detailed in Table S1, and we show the effect of model size on final performance in Fig. S4.

> The values of all hyper-parameters were selected by performing informal search. We did not perform a systematic grid search owing to the high computational cost.

That's nowhere near an obscene amount of computing power for any serious ML project.

Wow. Okay I retract my snide comment. Thanks for finding that.

> I'm not sure if this is something they usually leave out because it's not interesting, or because it's something that people that work with deep networks all the time (I.E. not me) already have a feel for.

I think they leave it out because it's a trade secret. Like the way they waited a long time before announcing that they used specialized hardware (TPU) for AlphaGo.

This seems impressive, but it shows that there is still some way to go when comparing to old school techniques. I don't know how many TPU they used but probably a lot.

You can build a 3d key-points map using Slam algorithms real time on a raspberry pi. From there, you render those key-points and descriptors to a virtual screen given the desired camera pose, then you learn a deconvolution from these sparse rendered key-points to image mapping.

Alternatively, using more memory, once you have a 3d map, you can save some key-frames with their camera poses; when asked for the view from a given pose, you pick the k closest poses and interpolate (possibly with a neural net).
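
A toy sketch of that key-frame idea, under loud assumptions: poses are reduced to (x, y, z) positions with orientation ignored, images are flat lists of floats, the blend is a plain inverse-distance weighting rather than a neural net, and the name `interpolate_view` is made up for this example.

```python
import math

def interpolate_view(keyframes, query_pose, k=2):
    """Blend the images of the k key-frames whose positions are closest to query_pose.

    keyframes: list of (pose, image), pose = (x, y, z), image = flat list of floats.
    """
    ranked = sorted(keyframes, key=lambda kf: math.dist(kf[0], query_pose))[:k]
    # Exact hit on a stored pose: return that image and avoid dividing by zero.
    if math.dist(ranked[0][0], query_pose) == 0.0:
        return list(ranked[0][1])
    weights = [1.0 / math.dist(pose, query_pose) for pose, _ in ranked]
    total = sum(weights)
    n_pixels = len(ranked[0][1])
    return [
        sum(w * img[i] for w, (_, img) in zip(weights, ranked)) / total
        for i in range(n_pixels)
    ]

keyframes = [
    ((0.0, 0.0, 0.0), [0.0, 0.0]),
    ((2.0, 0.0, 0.0), [1.0, 1.0]),
    ((10.0, 0.0, 0.0), [5.0, 5.0]),
]
# Halfway between the first two key-frames; the far one is not among the 2 nearest.
print(interpolate_view(keyframes, (1.0, 0.0, 0.0), k=2))
```

Blending raw pixels like this only looks reasonable when the stored poses are very close together, which is presumably where the learned interpolation would come in.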

If you have some more compute, you can run the previous SLAM algorithms with dense maps, and interpolate the dense 2.5d point clouds.

Their network is probably doing an inefficient mixture of those different things, trading compute and memory for flexibility.

> You can build a 3d key-points map using Slam algorithms real time on a raspberry pi.

Does anyone have links to any open source projects doing this? Preferably with an example video(s) showing results?

I used this today for my robot. https://github.com/Alkaid-Benetnash/ORB_SLAM2/ (This particular fork can save the map although it needs to be generated on the pi). It works almost out of the box, just need half a day of slow compilation.

On a Raspberry Pi model 3B+ with 2000 key-points, it runs at 1-2 fps at 640x480 and 5-6 fps at 320x240. It uses 500M for a few rooms, and 75% CPU while building the map, 50% CPU once built. It's not optimized for the Pi, so you can probably get it to run at least 3 times faster if you are willing to get your hands dirty.

For it to work well, a 180 degree camera really makes a difference. Also run opencv cpp-tutorial-cameracalibration on a chessboard to get the needed camera intrinsics.

There are probably other slam algos in ROS, but I'm not sure how raspberry-pi compatible they are.

If you want to try and experiment with neural networks, once you have built your 3d map on a powerful computer you can train a neural network to learn the pose from the image. This will give you a constant-time, constant-memory algorithm for later use on the Pi, though it will probably be less precise.

Bravo. This sort of imagining of a scene could perhaps allow for an agent to recognize it doesn't know what's behind the ball if asked. That would be a nice feature if you wanted to reward the agent for finding unexplored areas. It also could help an agent plan to get to some goal. No need to guess if I take this fork in the road, will I need to retrace my steps? Instead imagine it and then avoid the imagined pitfalls before taking the action.

I can totally see Google incorporating this in self-driving cars down the road, after training on millions of hours of dashcam footage, to augment or maybe even replace Lidar. The paper suggests it is capable of segmenting 2D picture input into logical objects and their discrete configurations/positions.

It's not without its pitfalls, though: instead of monocular SLAM generating a factual, albeit fuzzy, point cloud map, we get an overfitted (5 pictures x 2M similar scenes) magic black box generating very training-set-specific hallucinations. This is how we get scanners replacing numbers in scanned documents: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

A similar example was posted 2 months ago: https://data.vision.ee.ethz.ch/aeirikur/extremecompression/ The example picture (no doubt the best case the authors could manage) gained additional data absent from the original, some of it dangerous, like fake license plate numbers.

> We also found that the GQN is able to carry out “scene algebra” [akin to word embedding algebra (20)]. By adding and subtracting representations of related scenes, we found that object and scene properties can be controlled, even across object positions.

This is incredible because it provides a way to link with linguistic understanding and manipulation of the rendering.
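
For anyone who hasn't seen embedding algebra before, here is a toy illustration of the idea in the quote. The vectors are hand-made with one readable axis per property, which is purely an assumption for the example: the paper says the learned representations are not directly interpretable, so real GQN vectors would not look like this.

```python
def add(a, b):
    """Element-wise sum of two representation vectors."""
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two representation vectors."""
    return [x - y for x, y in zip(a, b)]

# Hypothetical axes: [is_sphere, is_cube, is_red, is_blue]
red_sphere = [1, 0, 1, 0]
blue_sphere = [1, 0, 0, 1]
blue_cube = [0, 1, 0, 1]

# "red sphere" - "blue sphere" isolates the color change (blue -> red);
# adding it to "blue cube" should land on "red cube".
result = add(sub(red_sphere, blue_sphere), blue_cube)
print(result)  # [0, 1, 1, 0], i.e. a red cube under these made-up axes
```

In the learned case you would feed the resulting vector to the generation network and look at the rendered image instead of reading off the axes.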

If any of the authors are on this thread, I'm wondering if there are any plans to release the source code for this? This can potentially make generating environments for games or VR trivial by just taking photos IRL and then importing them into some game engine to generate a scene.

The important part is the dataset, which in this case is generated by DeepMind Lab, which is already open source: https://github.com/deepmind/lab

Reimplementing the rest of the paper shouldn't be tremendously difficult. These techniques tend to be fairly simple at their core. But training it could be expensive. In any case, it is a long, long way from here to "make generating environments for games or VR trivial by just taking photos IRL and then importing them into some game engine". Many years of research remain.

Hi, you can find the actual datasets used for the paper here: https://github.com/deepmind/gqn-datasets

Hey James, since you seem to know a lot about graphics and ML, I'm wondering which specific problems remain open before what I mentioned becomes more feasible product-wise?

The domain of images used in this research is extremely limited. These are very low resolution artificially generated images of small scenes with simple lighting, simple textures, simple geometry, and very restricted camera positions and parameters that are known exactly (which is not the case for most natural photos). Each of those restrictions needs to be lifted before this will work on realistic natural scenes, and that will require many orders of magnitude more data. It's not clear that this approach will easily scale up to that amount of dataset variation. It's likely that a much fancier neural net architecture and training scheme will be required, and probably faster hardware too.

This is not intended as a criticism of this research, which I think is really great.

> The generation network is therefore an approximate renderer that is learned from data.

> https://www.youtube.com/watch?v=G-kWNQJ4idw&feature=youtu.be

I wonder if this is efficient... I know this isn't the researchers' intended application, but the path tracer in me wants to see how far this can be pushed for real time rendering. I welcome the more interesting artefacts that a NN might produce (I'm talking about pigsnails [1] of course :D)

full circle: GPU GLSL for graphics -> GPU cuda/opencl for NN -> GPU cuda/opencl for NN graphics

[1] https://www.newscientist.com/article/dn27755-artificial-brai...

Disney actually does a lot of research combining the world of graphics with deep learning.

Some examples that you might appreciate (from the excellent channel "Two Minute Papers"):

. "Disney's AI Learns To Render Clouds" [0]

. "AI Learns Noise Filtering For Photorealistic Videos" [1]

[0] https://www.youtube.com/watch?v=7wt-9fjPDjQ

[1] https://www.youtube.com/watch?v=YjjTPV2pXY0

Thanks! that is really some awesome stuff. It's even simpler in concept than this.

This is also very likely to be useful for video compression.

I'm disappointed they didn't demonstrate what happens when they take this system and expose it to a few real-world photos. Can it handle that, or has it been very much fitted to these shapes?

"Our method still has many limitations when compared to more traditional computer vision techniques, and has currently only been trained to work on synthetic scenes. However, as new sources of data become available and advances are made in our hardware capabilities, we expect to be able to investigate the application of the GQN framework to higher resolution images of real scenes."

"As new sources of data"...

Like, StreetView?

That could be really interesting, StreetView but with this system imagining interpolated points between the original photos, giving the user more free and natural movement.

More important, more hardware, aka TPUs.

It has very much been fitted. The supplementary material describes the training set as 5 pictures each of 2M different scenes of the type you can see in the paper (square room with random objects).

So to extrapolate wildly, it seems reasonable that getting similar results for, say, real-world bedrooms, you'd need to take around 5 pics each of 2M bedrooms, and record the location and angle of the camera for each picture.

Edit: and I didn't mean to sound negative. Using artificially generated rooms to develop the method is a great idea, and the next step will be narrow domain-specific applications (they mention e.g. robotic arms) where it's feasible to automatically collect enough data for a task. Somewhere in the future we may have the data and compute to sample the distribution of real-world environments in decent resolution.

FYI, for artificially generated rooms, there is already the SceneNet RGB-D dataset - "5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth"


I think there are only 15k different rooms, rather than the 2M you suggested, though!

This could be achieved with a drone camera and positioning sensors.

I'd say much more than that, since real-world bedrooms likely have a much more complicated representation than more simple generated rooms.

I agree with you that the estimate is conservative, and would depend strongly on image resolution and how broad your distribution of bedrooms is - only modern US style, or also 40 year old Japanese houses?

Well, this is nothing short of incredible. I wonder if they'll get it to a point where it can look at a 3d drawing and immediately be able to produce a 3d model which includes all the occluded parts.

Probably not far off. We might get to a point where we have an AI software that can run on any computer which will entertain us to eternity.

Combined with VXGI and other photo realism efforts, AI could produce any permutation of your favorite TV show that ended too soon. Ex. Breaking Bad Season 15: Walter Jr's Revenge or something like that.

There's also an AI that produced a clone of a game by watching videos, so with this new neural scene representation you wouldn't have to train it with thousands of hours of gameplay footage. It could see a video once and figure out the game mechanics, e.g. when stepping on a group of sprites which it recognizes as Enemy1, it should increment the score count based on some generic platformer template model.

Once again, Deepmind delivers.

I wonder if we'll get to the stage where game engines become a series of neural networks hallucinating the output.

I was thinking about exactly this kind of experiment. Given an input of gameplay recordings, train a model to predict the next framebuffer from the previous frame and keypress input. Would the model have to be excessively complex to avoid rapid divergence into feedback patterns resembling a Winamp visualizer? Probably, but it should be entertaining enough to watch and interact with anyway.
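
The divergence worry can be shown with a deliberately silly toy. Assumptions: a "frame" is just a list of numbers, the true game adds the keypress value to every pixel, and the hypothetical learned model (`learned_step`) is the same rule with a 5% gain error. Fed its own output, the small per-step error compounds.

```python
def true_step(frame, key):
    """Ground-truth game dynamics for this toy: shift every pixel by the key value."""
    return [p + key for p in frame]

def learned_step(frame, key, gain=1.05):
    """Hypothetical imperfect model: slightly over-amplifies the previous frame."""
    return [gain * p + key for p in frame]

def rollout_errors(start, keys):
    """Max per-pixel error at each step when the model is fed its own predictions."""
    truth, pred = list(start), list(start)
    errors = []
    for key in keys:
        truth = true_step(truth, key)
        pred = learned_step(pred, key)  # input is the previous prediction, not the true frame
        errors.append(max(abs(a - b) for a, b in zip(truth, pred)))
    return errors

errors = rollout_errors([1.0, 2.0], [1] * 10)
print(errors)  # grows at every step of the rollout
```

A real model would fail in far less tidy ways, but the feedback structure is the same: every frame after the first is conditioned on earlier mistakes.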

Could this be a pivot point in rendering technology? All of the thought and effort over decades put into the math behind 3D rendering, meant to produce a rasterized scene following perfect 3D calculations and rules -- replaced with a system that just "imagines" the picture?

If you're referring to increasing the speed of 3D rendering, it's very unlikely that this is faster than a traditional rendering method. It's probably an order of magnitude or three slower. If anything it would assist in the art stage, not the rendering stage.

Also, what would you render? You still need input.

Doesn't mean the neural renderer is fast at all :)

Demis alluded to this at the Cheltenham Science festival last Saturday when someone asked about rats in mazes and how spatial neural connections are formed.


Hey, just skimmed the news article. Seems really interesting, but the lack of information on compute requirements is concerning. Also I wonder what the latent factors and the specific layers in each model are? I tried to dig deeper in the paper but the description was pretty ambiguous?

This undoubtedly brings us a leap closer to AI. Imagine a number of these self learning nets arranged in some structure, learning and unlearning bits of information on demand, perhaps with different levels of volatility.

Sounds almost like learning new skills and forgetting old ones.

This + the recent grid cell work would allow for the view training points to be generated unsupervised as well. Just drop an agent in an environment, it will explore it and come up with the scene representation entirely on its own.

One thing to note is that the camera viewpoint (its position, roll, pitch, and yaw) is fed along with the images during training. Requiring access to this ground truth makes this method very constraining to use in practice.

What kind of use cases are you thinking of where this would be constraining? Don't many computer vision algorithms also require something specifying the parameters of the camera, such as the fundamental matrix for stereo imaging?

As humans, when we look at a scene, then move a few feet and look at it again, we have a pretty good idea what the delta between the two views was, so why is providing the same info here any different?

I would add that humans also integrate gyroscopic & acceleration information from the inner ear to understand relative balance. Multiple sources of sensor data is a net benefit, not a drawback.

Nice work! I am guessing that this moves us that much closer to a camera only based SLAM system.

I am also curious if they are going to use this architecture to defend against adversary GANs that are attempting to defeat image recognition.

It is rather funny that in nearly every Tesla-related thread, they are slammed as irresponsible fraudsters for their decision to not use LIDAR and rely on a radar/camera-based system. Cameras cannot detect obstacles, we are told, and they'll never be able to make an autonomous vehicle without it.

This, despite the fact that humans do well enough and that Structure From Motion has been a well-established part of Computer Vision research for a while.

More on topic, this is pretty great work and it'll have wide applications, for example Google's own efforts to make robot arms more perceptive using regular cameras: https://ai.google/research/teams/brain/robotics/

> It is rather funny that in nearly every Tesla-related thread, they are slammed as irresponsible fraudsters for their decision to not use LIDAR and rely on a radar/camera-based system. Cameras cannot detect obstacles, we are told, and they'll never be able to make an autonomous vehicle without it.

I don't think anyone reasonable says it's impossible - after all, humans are living, walking proof that it's entirely doable. The core of criticism is that it's insanely more difficult than just using LIDAR data. One could even say that LIDAR, as a specialized tool for depth detection, is an intrinsically better tool for the job.

In particular, I find this claim questionable: "All Tesla Cars Being Produced Now Have Full Self-Driving Hardware." https://www.tesla.com/blog/all-tesla-cars-being-produced-now...

As you mention, this is trivially true on the level of the sensors. We know cameras are enough because human brains can do it with eyes (biological cameras).

But there is no evidence that we know how to write software for the processing part of it (to an acceptable degree of safety) with cameras only, nor that the computing power on-board is up to the task.

How can you say that hardware package X is sufficient to implement something that has literally never been done before?

>this is trivially true on the level of the sensors.

Maybe. The human eye is fundamentally different than cameras in terms of how it focuses on things. For similar performance, we may need much higher resolution cameras. We don't know.

Interesting, I wasn't aware. How are they different?

Eyes don't have uniform resolution. The centers of our eyes (or what we're focusing on) have very, very high resolution, while the outer parts have much lower. Cameras normally have something like a middling resolution in comparison.

Eyes don't send the entire picture to the brain at once at a constant sampling rate either; they work more like event cameras [0]. Combine this with micro movements of the eye and specialized brain structures, and it's not as easy as saying that because we only need two eyes, robots only need two cameras. Sure, stereo vision might be enough, but what kind of cameras, and what kind of computers do we need to reach feature parity with our own sight?

[0] http://www.rit.edu/kgcoe/iros15workshop/papers/IROS2015-WASR...

Apart from this, our brains (actually brains of all things that have to hunt/be hunted) are extremely good at detecting motion, especially in the periphery. This makes us extremely good at detecting stuff like a pedestrian who is about to cross the road, or an incoming vehicle from a junction...

I mean, we could be focusing 100% on the road ahead, and can still respond to peripheral events lightning fast..

I think this is why a lack of sleep can quite adversely affect driving. It seems that if you haven't slept well, the "behind the scenes" processing takes a hit, which greatly impacts driving capability.

"All Tesla Cars Being Produced Now Have Full Self-Driving Hardware at 5mph."

Ish. If you automatically do an emergency braking / avoidance manoeuvre every time a LIDAR-opaque object is in front of your vehicle, then you'll end up causing a lot of whiplash for the sake of sparing some paper bags.

LIDAR gives you data points which may be more useful than optical data, but in order for it to be safely incorporated into a guidance system, there need to be times when it is out-voted by optical sensors.

Back when the company I worked for was building the Heathrow Pod, they had the option of putting LIDAR on the vehicles for collision-detection. They eventually decided against it, determining that the probability of injuries due to false-positive e-braking manoeuvres was higher than the probability of injuries if LIDAR was omitted altogether.

This was more than 10 years ago, and the state of the art has advanced considerably. I'm sure the balance of this calculus has changed. But figured it was worth pointing out that LIDAR is by no means a trivial thing to incorporate into a guidance system.

Humans do awful at it. All sorts of visual confusion can fool humans about how far away something is.

Versus lidar, which can't be fooled about how far away something is.

Robot arms in a factory environment are very different from driving in a snowstorm, or blowing rain, or through a construction site, or with tree branches waving around.

It's precisely this experience that drives people to ask Wtf are they thinking?

LiDAR can be easily fooled by litter, bad weather, snow etc. It basically works reliably well only in good weather.

I think that's a poor description of the LIDARless controversy.

In the current state of the art there are many things that a self-driving car can do better than any human (reaction speed, never gets distracted, etc). Then there are other driving tasks that humans are still far better at... I'm not a self-driving expert, but I suspect that self-driving cars are still worse at picking up complicated environmental cues (say, a pedestrian acting unusually or trying to alert you to something).

The safety premise of self-driving is that the former can outweigh the latter. The idea is that there will be some accidents that might have been avoidable by a proficient human driver, but many more that will be prevented by a machine's super-human capabilities in other areas.

To make a self-driving car as safe as possible, you'd want to give it every super-human capability that's practical. LIDAR provides one such capability -- a super-human understanding of where objects are even when the visual environment is challenging.

The question isn't whether it's possible for a LIDARless system to operate safely. Obviously humans don't have LIDAR and they're considered "safe enough" to put on the roads. The concern is whether it would be as safe as possible... and whether it would be so safe that people would feel comfortable sharing the highway with such a car.

If you have hardware that can operate at 1000fps at full HD, then I'd bet it could drive very safely even without LiDAR. But if at best it manages 5fps, uses imprecise extended Kalman filters to interpolate movement of obstacles, low-res SfM and normal-based 3D detection, it could get really bad results at times. All these systems together at 1000fps would probably obliterate humans comfortably. It has been said that self-driving is a solved problem: the faster the HW, the better the results. However, detailed path planning (in inches/centimeters) is way, way worse than the average human's, and due to NP-hard fun there is not much hope there will be any great solution anytime soon.

As an occasional pedestrian & occasional road cyclist, I'd personally much rather have any automated cars around me using both LIDAR and SFM than SFM without LIDAR. Wouldn't you?

Very nice. It will be interesting to see future results that work with real (non synthetic) scenes - I would not be surprised if that happens in just a few months.

Why will you not be surprised if that happens in just a few months?

As a mere CG artist, I will still be experiencing the surprise of seeing these examples a few months ago.

Are any particular recent achievements, announcements or similar influencing your expectations? If so, please share.

I also wouldn't be surprised. Results in transfer learning from synthetic to real world vision tasks suggest to me that if you could train this system with (for example) GTA-V as the environment, it may work reasonably well in the real world.

In its last keynote, Apple showed that ARKit hallucinates scene elements that it cannot see but tries to infer (such as lights above the scene). I wonder if they use a similar technique to this.

Is this patented? I heard that DeepMind is patenting a lot of its work; does that apply to this particular technique? Where would such a patent be enforceable?

From the paper: DeepMind has filed a U.K. patent application (GP-201495-00-PCT) related to this work.

The displayed scenes remind me a lot of Wolfenstein 3D.

Fantastic! Is there some secret at DeepMind for how to boost one's capabilities in this space to be that good at the bleeding edge?

i'm envisioning a new kind of black-swan style mistake which i'm going to call the allegory of the neural net in the cave.

people will feed neural nets data, and ask them to describe the specific data set that the data is coming from -- without having the majority of that data set in hand.

in this instance, it would be showing the neural net a picture of a 3D area, and then waiting for it to extrapolate the details of the rest.

on average, the neural net's prediction may line up with reality. that is to say, the simulated data set is identical to the real data. that is what we are seeing in the OP link. but as soon as this method can apprehend and predict things of more complexity, that's where the differences will start to show.

sure, it isn't the neural net's fault -- any one worth its salt will place a confidence estimate on its extrapolated data points. but people don't understand how to interpret those confidence estimates. they'll round up to 100%, or round down to 0% accuracy. once people start using these techniques to guide serious decisions in business or elsewhere, that's where those dastardly percentages between 0 and 100 come into play.

imagine using this neural net as a way to generate returns in the context of trading stocks on wall st. it's a misuse of the tool, of course. but that won't stop people from making a decision based on a 95% probability of being correct; of course, 5% of the time, it will result in disaster. nor will it stop people from getting screwed by unknown unknowns.
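a rough sketch of why that 5% matters more than it looks. it assumes every decision is independent and correct with probability 0.95; both assumptions, and the function name, are illustrative only.

```python
def prob_at_least_one_failure(p_correct: float, n_decisions: int) -> float:
    """Chance that at least one of n independent decisions goes wrong."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if n_decisions < 0:
        raise ValueError("n_decisions must be non-negative")
    # All n decisions are right with probability p_correct ** n.
    return 1.0 - p_correct ** n_decisions


# Assumed: each call is right 95% of the time, independently of the others.
for n in (1, 10, 50):
    risk = prob_at_least_one_failure(0.95, n)
    print(f"{n:>2} decisions: {risk:.1%} chance of at least one failure")
```

under those assumptions the chance of at least one disaster is about 40% after 10 decisions and over 90% after 50.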

this is the stuff which the consulting businesses of the future are built on -- scolding people about abusing models while trying to preserve the power of the model as a tool. needless to say, i'm interested in where this goes.

finally, CSI level hollywood tech will be real in a few years

Would hallucinated images be accepted in court though? I hope not.

I. Am. Terrified.

This is too close

slow the hell down

> slow the hell down

I'm afraid this isn't a car that you can stop, it's a freefall without a parachute. You're welcome to try flapping your arms, for all the good it will do.

And yes, the ground may or may not be approaching at an alarming rate.

What if we're flying up and not down?

Into cold and empty space?

Towards endless stars, bound to be pulled in by another

Yes, it’s not really up to us anymore. Viewing corporations as non-human actors under capitalism, they will do it out of profit motive. Individuals may resign or shape the style of results, but the progression is inevitable barring collapse or radical alteration in the structure of modern society.

Honestly: What are you scared of?

What is it about this technology that isn't already done bigger / better / faster / more by TV and video games, or by alternate technology like LiDAR or even a Kinect?

Your question about this technology mistakes the fear for the thing prompting the fear. Picture someone saying that you need to slow down while driving toward an area. The request to slow down isn't mitigated just because there are examples of things which go faster or which go the same speed. It's not the speed that causes the fear, but the potential things ahead.

There are extremely good arguments for caution in the development of artificial intelligence. It's reasonable to suppose it will be society-altering at a minimum, but in not-so-improbable extremes it threatens extinction and immortality.

As for what makes this advance notable to someone cautioning speed? It's explicitly an example of a removed limiter, data collection and labeling. As such, it represents a potential acceleration of progress.

Since speed is seen as being in conflict with caution, an increase in speed can be read, especially by someone fearful, as a reduction in caution.

Why is it different now, though? People have been lying, manipulating, watching, and killing each other since forever. What is it about abuses of AI that makes it so unmanageable compared to all the other evils that humans manage?

I'm not saying it is different than other evils we manage. Caution isn't a new thing for managing evils. Prudence was recommended even farther back than Proverbs, I'm sure.

And, as an example of a technology with less inherent danger, we exercise an abundance of caution in the field of nuclear weapons/chemical weapons, to the point of military attacks to prevent their development.

Speed and scale.

Also differential in rate of roll-out. "The future's already here, it's just not evenly distributed." ~W. Gibson

Old people are already being attacked by automated systems they cannot understand. It will happen to you sooner or later. It may have already happened and you didn't notice.
