
Moving Camera, Moving People: A Deep Learning Approach to Depth Prediction - skybrian
https://ai.googleblog.com/2019/05/moving-camera-moving-people-deep.html
======
cedricd
The best part of this paper is that they used mannequin challenge videos as
their training dataset. That's super clever.

~~~
antome
I have seen people suggest that the "10 year challenge" was created to build
an age-related training dataset. While the mannequin challenge was probably
just spontaneous, I wonder if we will see an increasing number of viral
challenges in the future that center around the creation of structured
information.

------
johndough
You can tell that the authors have a very fast internet connection by the fact
that this website weighs in at 91.6 MB and takes over a minute to fully load
on a 25 Mbit/s connection.

~~~
JeremyBanks
Jesus. Why couldn't they use embedded video files instead of 30 megabyte gifs?

~~~
aeternus
A state-of-the-art deep learning neural net designed by digital video experts
within one of the most technology savvy companies in the world...

What do they use to reveal it to the world? GIFs!

------
joshvm
Always worth looking at a point cloud versus a disparity map.

Grayscale disparity/depth maps are somewhat misleading - the large regions of
constant intensity suggest that the algorithm is good at segmenting areas of
constant depth. However, the flickering in the map suggests that if you
actually tried to plot this in 3D, it'd be pretty noisy. Not to disparage the
result, but 2D depth/disparity maps tend to look better than what they
represent.

You can see this in the synthetic camera wiggle video; focus on the actor's
hands, for example.

You can also see this effect in the Stereolabs Zed promo video.

~~~
enriquto
To visualize depth maps it is best to look at their derivatives (e.g., a
directional derivative or the Laplacian). Mapping the depths to intensities
directly loses a lot of information.
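
For example, something like this (a rough sketch; the depth file and the
normalization choices are just placeholders):

    # Sketch: visualize a depth map via its Laplacian instead of raw intensity.
    import numpy as np
    from scipy.ndimage import laplace

    depth = np.load("depth.npy")            # hypothetical float32 HxW depth map
    lap = laplace(depth)                    # second derivative highlights structure
    # Robust normalization for display: clip to the 1st/99th percentiles.
    lo, hi = np.percentile(lap, [1, 99])
    vis = np.clip((lap - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    # `vis` can now be displayed with e.g. matplotlib's imshow.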

------
TaylorAlexander
I can’t wait until techniques like this find their way into open source
photogrammetry pipelines. I came up with a way of training neural nets for
robotics using a monocular camera, photogrammetry, and a simulation
environment with the captured 3D scene, but the photogrammetry was error-prone
and computationally intensive even on a beefy cloud server.

I’d love for OpenSFM or OpenMVS (check GitHub) to get this kind of capability.

Also would love to see an implementation of this on github, but hopefully that
will follow in time.

~~~
Fission
I personally do not believe that depth generated purely from deep learning can
be used as input to photogrammetry anytime soon.

Photogrammetry works exceedingly well because the depth maps it generates are
quite precise and accurate, and mesh reconstruction usually assumes that these
points are quite close to ground truth.

Deep learning approaches usually have medium accuracy but low precision, which
causes the flickering and smooth surfaces that you see on the person. Even the
background has flickering despite being computed through stereo, likely
because the camera motion is primarily forward-backward (vs. more accurate
side-to-side motion), the baseline is likely small, and the depth isn't
globally optimized.

This type of research is super great for applications requiring lower
accuracy, typically visual-only applications (e.g. selective blurring, faking
stereo on a frame, etc.). But as an input to photogrammetry — probably not
anytime soon, until the problems above get resolved.

~~~
TaylorAlexander
Interesting. Perhaps my idea of this being inserted into existing algorithms
would not work.

However, I do ultimately seek a low-accuracy, “visually approximate” 3D scene
that I could use for simulation purposes. I guess I could rephrase my desire
as: I’d love to see this kind of approach used to train an end-to-end deep
learning photogrammetry system. I feel like the parallel nature of neural
nets, as well as their ability to approximate results, could yield a much less
computationally intensive solution to my photogrammetry needs.

(I want to train my four wheel drive robot to follow forest trails using the
training method described in the “world models” research paper, which requires
a simulation to work.)

~~~
Fission
Some of my friends recently put out
[http://gibsonenv.stanford.edu/](http://gibsonenv.stanford.edu/)

Full simulation with realistic 3D spaces; it enables embodied agents to
interact with and learn from real-world spaces. Not forest trails, but a
real-world environment.

If you really want to create a 3D model of forest trails, photogrammetry
should be sufficient, because forest scenes are richly textured.

~~~
TaylorAlexander
Yes, I did come across Gibson Env, and it looks great for indoor scenes.

As far as photogrammetry of forest trails goes, I found it very
computationally intensive (a 32-core GCE instance took 30+ hours and 90+ GB of
RAM to compute a scene, and errors still made the result unusable). It felt
very heavy-handed, and given all the great work I've seen in scene
understanding using neural nets, deep learning seems like a promising approach
here. Maybe there is commercial photogrammetry software with better pipelines,
but I want to be able to compute my scenes on Linux and use hundreds of
images.

I did my computation with OpenSFM and OpenMVS, both wonderful projects for
being free and open source, and I did get a lot of great results. But I am
convinced a simpler way is possible with deep learning.

~~~
Fission
OpenSFM is quite out of date, so it's inefficient and rather inaccurate
(e.g. exhaustive matching is O(n^2), and there are smarter approaches that are
closer to O(n)).
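
One such trick, roughly sketched: match each image only against its k nearest
neighbours in a global-descriptor space instead of against every other image
(the descriptor choice and helper below are my own, not anything OpenSFM
actually ships):

    # Sketch: near-linear candidate pair selection for SfM matching.
    # `descriptors` is an (n_images, d) array of global image descriptors
    # (e.g. pooled CNN features) computed beforehand.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def candidate_pairs(descriptors, k=10):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(descriptors)
        _, idx = nn.kneighbors(descriptors)     # row i: image i itself + its k neighbours
        pairs = set()
        for i, neighbours in enumerate(idx):
            for j in neighbours[1:]:            # skip the self-match at position 0
                pairs.add((min(i, int(j)), max(i, int(j))))
        return sorted(pairs)                    # ~k*n pairs instead of n*(n-1)/2

    # The expensive local-feature matching then runs only on these pairs.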

Also, one of the main steps of mesh reconstruction is depth map generation. It
typically takes anywhere from 30-75% of compute time for dense reconstruction,
_IF_ it's parallelized on the GPU. If you're using only the CPU to calculate
depth maps, you're probably slowing yourself down by an order of magnitude.

If you have a GPU, and use a better SFM-MVS solution, then you can quite
easily reconstruct datasets of 1k-10k images within 24 hours.

~~~
jasonjs
What would you recommend as a better SFM-MVS solution?

------
ralusek
Why wouldn't they use 3D renderings as a large part of their training data
set? You could have perfectly accurate depth outputs generated alongside the
image input, and you could adjust things like focal length to all kinds of
values, which would let the model learn how shifting objects correlate with
depth across a variety of focal lengths. To be honest, I'm not even sure how
they're training with live footage; how are they even getting the depth maps
from the training footage to begin with?

~~~
pjc50
> we make use of an existing source of data for supervision: YouTube videos in
> which people imitate mannequins by freezing in a wide variety of natural
> poses, while a hand-held camera tours the scene. Because the entire scene is
> stationary (only the camera is moving), triangulation-based methods--like
> multi-view-stereo (MVS)--work, and we can get accurate depth maps for the
> entire scene including the people in it

I suspect the reason for not using 3D rendering is the desire to cope with the
noise and variability of real video.
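
For context on why the static scene matters: triangulation from two known
camera poses is the core operation MVS relies on. A minimal sketch with OpenCV
(the inputs are placeholders, not anything from the paper):

    # Sketch: triangulating 3D points from two views of a *static* scene.
    import cv2
    import numpy as np

    def triangulate(P1, P2, pts1, pts2):
        """P1, P2: 3x4 projection matrices; pts1, pts2: 2xN matched pixels."""
        pts_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous
        return (pts_h[:3] / pts_h[3]).T                     # Nx3 Euclidean points

    # If anything in the scene moves between the two exposures, the
    # correspondences break and the triangulated depth is wrong, which is
    # exactly why the "frozen" Mannequin Challenge footage is so useful.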

------
suyash
Here is the actual paper
[https://arxiv.org/pdf/1904.11111.pdf](https://arxiv.org/pdf/1904.11111.pdf)

------
jawns
Could this technology one day become so good as to eliminate the need for
lidar for self-driving cars? Or will lidar be so inexpensive by that point
that there will be no need to eliminate it?

------
AndrewKemendo
This is a great hack, but I'd love to see more detail on how they did pose
initialization to approximate ground truth on depth/pose from the Mannequin
set. The paper says they are using ORB-SLAM2, but AFAIK ORB still needs a
height label.

Maybe it's the case that this system doesn't actually return an X,Y,Z camera
pose for new inputs, but rather just a per-pixel depth.

------
IanCal
The predictions seem to have rapid flickering, which means the model is saying
lots of items are moving back and forth extremely quickly. Since this seems
common in video analysis (rapid changes per frame), is it that smoothing or
taking multiple frames into consideration is slow? Or does it cause more
issues than it solves?
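
The cheapest fix would be something like an exponential moving average over
the per-frame depth maps; a rough sketch (the parameter and the
`predict_depth` stand-in are mine):

    # Sketch: EMA over per-frame depth predictions to damp flicker.
    import numpy as np

    def smoothed_depths(frames, predict_depth, alpha=0.8):
        smoothed = None
        for frame in frames:
            depth = predict_depth(frame)            # stand-in for the model
            if smoothed is None:
                smoothed = depth.astype(np.float64)
            else:
                # A heavier weight on the running average damps flicker,
                # but lags behind objects that genuinely move in depth.
                smoothed = alpha * smoothed + (1.0 - alpha) * depth
            yield smoothed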

------
andr
DeepMind seems to be conducting research in a similar direction. Any insight
into how the two projects are related or different?

[https://deepmind.com/blog/neural-scene-representation-and-rendering/](https://deepmind.com/blog/neural-scene-representation-and-rendering/)

------
p1necone
At this point I feel like I'm psychic. Every single time I see an image
processing project posted on here I think to myself "I bet the only examples
are tiny low resolution thumbnails" and _every_. _single_. _time_. I'm proven
right. Whyyyyyyy?

To be fair, this particular application doesn't _really_ need more to show
its improvement over other approaches, but still.

~~~
okusername
Because it's a lot less data to crunch for the network.

------
viraptor
I'm not sure if this is due to a different mapping into greyscale, or whether
their method is completely killing far-distance detail.

Compared to "Chen et al.", which is a bit flickery in the foreground but full
of stable background detail, their result is almost completely black beyond
about 3 m.
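
One likely culprit is the display mapping itself: if inverse depth (disparity)
is mapped to intensity, everything past a few metres compresses into
near-black. A toy illustration (all the numbers are made up):

    # Toy illustration: linear depth vs. inverse-depth display mapping.
    import numpy as np

    depth = np.array([1.0, 3.0, 10.0, 30.0])        # metres, made-up values

    linear = depth / depth.max()                     # [0.03, 0.10, 0.33, 1.00]
    inverse = (1.0 / depth) / (1.0 / depth).max()    # [1.00, 0.33, 0.10, 0.03]

    # With the inverse mapping (near = bright), everything past ~3 m renders
    # nearly black even if the underlying depth estimates are fine.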

------
novaRom
Can other companies use YouTube videos for free, say for research in computer
vision?

~~~
bonoboTP
I think it's a gray area, but researchers often just do it. Better to ask for
forgiveness than permission I guess. You could never collect datasets like
ImageNet if you had to obtain individual permissions.

~~~
PeterisP
At least some jurisdictions have research exemptions in their copyright laws,
so there I don't need the copyright owner's permission to use data for
research purposes.

I'd still prefer to use explicitly open datasets because that allows simpler
data sharing and easier reproducibility; however, in cases where that's not
possible, whatever is available will do, even if I'm restricted in how I can
redistribute that data.

------
BadassFractal
Seems related to the Tesla video-based depth perception work?

~~~
TheArcane
Tesla's approach works chiefly on video scenes with static objects like parked
cars.

They train a DepthCNN to infer depth from monocular images (with lidar or
stereo for supervision) and make sure it's temporally consistent by warping
pixels from the previous and next frames using a PoseCNN.
[https://arxiv.org/abs/1704.07813](https://arxiv.org/abs/1704.07813)
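
A rough sketch of the view-synthesis idea in that paper (tensor names and
shapes are mine, not anything from Tesla or the authors): warp the previous
frame into the current one using the predicted depth and relative pose, then
penalize the photometric difference.

    # Sketch: photometric reprojection loss for self-supervised depth.
    # depth: (B,1,H,W), K: (B,3,3) intrinsics,
    # T: (B,4,4) transform from the frame-t camera to the frame-(t-1) camera.
    import torch
    import torch.nn.functional as F

    def reprojection_loss(img_t, img_prev, depth, K, T):
        B, _, H, W = depth.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
        pix = pix.view(1, 3, -1).expand(B, -1, -1)              # (B,3,H*W)
        # Back-project with predicted depth, transform to t-1, re-project.
        cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)
        cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], 1)    # (B,4,H*W)
        proj = K @ (T @ cam_h)[:, :3]                           # (B,3,H*W)
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
        # Normalize to [-1, 1] and sample the previous frame at those points.
        grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
        warped = F.grid_sample(img_prev, grid.view(B, H, W, 2), align_corners=True)
        return (warped - img_t).abs().mean()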

The folks at Google use optical flow (from the previous frame only) to make
sure their model, trained on static-scene video sequences, still works when
the scene is dynamic, by masking out a specific object class (humans here).
They do have to make sure that nothing but humans is dynamic in the scene.

------
polyterative
This could improve fake bokeh on smartphones to pro-camera levels of quality.

------
DeonPenny
Is it as good as LIDAR?

~~~
mesutpiskin
I do not think so.

