
3D Ken Burns Effect from a Single Image - sniklaus
http://sniklaus.com/papers/kenburns
======
sniklaus
Paper: [https://arxiv.org/abs/1909.05483](https://arxiv.org/abs/1909.05483)

Website: [http://sniklaus.com/kenburns](http://sniklaus.com/kenburns)

~~~
sniklaus
I just got the daily email from my server with some metrics. It has
experienced over 500 gigabytes of traffic in the last few hours and the video
is only 3 megabytes in size. I definitely did not expect such an HN effect.

~~~
FreeHugs
The video is 180 MB.

So 500 GB are about 2800 full video downloads.

~~~
sniklaus
Comparatively few people watched the full video. Here are the logs
for the content server from yesterday 5 PM PST, covering the previous 24 hours
(search for /kenburns/promo.mp4):
[https://pastebin.com/4zTxL2aw](https://pastebin.com/4zTxL2aw)

~~~
FreeHugs
What does this line mean:

/kenburns/promo.mp4 | GET | 175574 | 33.4183748571803

That the video was watched 175574 times? That would indeed be a lot!

~~~
sniklaus
That means that there were 175574 HTTP GET requests, but my guess is that
fast-forwarding triggers additional requests, so the actual number of views
may be a little lower.

------
dperfect
The most interesting thing about this (compared to results from similar
research/projects) is that in all of the examples, camera movement is forward
and down relative to the original perspective. The results are really good,
and some of that may be due to a superior algorithm, but it's also aided in
large part by the choice of movement. Since objects lower in the frame tend to
be closer (foreground elements), the downward camera movement causes those
objects to occlude parts of the background above (and behind) them, meaning
that a relatively small portion of the background needs to be inpainted by the
algorithm. If too much is interpolated, visual artifacts often ruin the
illusion.

The forward movement also helps in this case as those foreground objects grow
in size (relative to the background) with time, so there's even less need for
interpolation. If the movement were primarily lateral (or reversed relative to
the original image), I imagine the algorithm would have a much harder time
producing good results [1].

EDIT: After skimming the paper, it appears that the algorithm is
_automatically_ choosing these best-case virtual camera movements:

> This system provides a fully automatic solution where the start- and end-
> view of the virtual camera path are automatically determined so as to
> minimize the amount of disocclusion.

That is pretty impressive. I had originally assumed the paths were cherry-
picked by humans, so it's cool that the paths themselves are automatically
chosen (and that the algorithm matches the intuitive best-case scenario in
most cases). It's still slightly misleading in terms of results because they
mention that user-defined movements can be used instead, but of course, the
results are likely to suffer significantly if the movement doesn't match the
optimal path chosen by the algorithm.

[1] The last example shown in the full results video illustrates the issue
with too much background interpolation in lateral movement:
[http://sniklaus.com/papers/kenburns-results](http://sniklaus.com/papers/kenburns-results)
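
To make that intuition concrete, here is a toy sketch (my own construction
from the quoted description, not the paper's actual algorithm): score each
candidate end view by forward-warping a depth map and counting the holes,
then keep the least-disoccluding candidate. The scalar parallax model, the
toy depth map, and the candidate grid are all made up for illustration.

    import numpy as np

    def disocclusion_fraction(depth, camera_shift):
        """Fraction of output pixels left uncovered after a parallax warp.

        depth: (H, W) array of positive scene depths.
        camera_shift: (dx, dy) camera translation in pixel units at unit
                      inverse depth (a crude stand-in for a real projection).
        """
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        parallax = 1.0 / depth                    # nearer pixels move more
        tx = np.round(xs + camera_shift[0] * parallax).astype(int)
        ty = np.round(ys + camera_shift[1] * parallax).astype(int)
        valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
        covered = np.zeros((h, w), dtype=bool)
        covered[ty[valid], tx[valid]] = True      # forward splat
        # Holes = disoccluded area (out-of-frame loss counts too, which is
        # fine for a toy score).
        return 1.0 - covered.mean()

    # Toy scene: a near object (depth 2) in the lower half, far wall (depth 10).
    depth = np.full((120, 160), 10.0)
    depth[60:, 40:120] = 2.0

    # Score a small grid of candidate end views. The identity view is
    # excluded; a real system would also demand enough motion to look good.
    candidates = [(dx, dy) for dx in (-40, 0, 40) for dy in (-40, 0, 40)
                  if (dx, dy) != (0, 0)]
    best = min(candidates, key=lambda s: disocclusion_fraction(depth, s))
    print("least-disoccluding end-view shift:", best)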

~~~
sniklaus
Thank you for sharing your thoughts! We designed the automatic camera path
estimation to minimize the amount of disocclusion, which indeed simplifies the
problem. As you correctly pointed out, inpainting the background is an
additional challenge and while we address it, the inpainted results sometimes
lack texture.

------
amayne
Cool. But that’s not what we call the Ken Burns Effect in the industry. This
is a 2.5D parallax shift, as seen in the documentary The Kid Stays in the
Picture.

The Ken Burns Effect is panning and zooming to highlight features in a photo,
then fading into another.

~~~
sgt
Perhaps it would also be worth mentioning Ken Burns and his documentaries.

They're absolutely worth watching and give the BBC a run for its money.

Perhaps start with The West and then move on to Civil War. My personal
favorite was perhaps the one on Jazz, made in 2001.

~~~
afterburner
His Vietnam series was great.

Civil War... I guess I already knew a lot about it, so what I noticed mostly
was more sympathy than I'd like for a sort of Lost Cause nostalgia, and not
much talk of slavery. But I guess he had to sell it for its time...

~~~
pier25
Loved the Vietnam series too. Have watched it a couple of times on Netflix.

I'm a 40-year-old European, so I knew close to nothing about the Vietnam War
other than what I had seen in some movies. Very in-depth and emotionally
powerful. It also seemed pretty objective to me as a complete outsider.

------
JonathanFly
Not bad, but monocular depth estimation has gotten pretty good all around. I
made these similar images with basically no expertise and no manual mapping,
just trying random single-image depth projects from GitHub. I kind of just
went with whatever I could get running quickly and didn't even evaluate the
quality other than in cases where it obviously didn't work.

(Sorry for linking a million tweets but I didn't put these on a blog or
anything, that's the only place they exist.)

Your version does look better though! But I was impressed that I could just
run random images through depth estimators and get anything cool like this.

[https://twitter.com/jonathanfly/status/1156799136987013120](https://twitter.com/jonathanfly/status/1156799136987013120)

[https://twitter.com/jonathanfly/status/1153383325974896646](https://twitter.com/jonathanfly/status/1153383325974896646)

[https://twitter.com/jonathanfly/status/1154472832249860100](https://twitter.com/jonathanfly/status/1154472832249860100)

[https://twitter.com/jonathanfly/status/1153120643040337925](https://twitter.com/jonathanfly/status/1153120643040337925)

I would love to see your method's take on some Escher paintings. Can you try
it for me?
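
If anyone wants to try this kind of experiment themselves, here is a minimal
sketch with one off-the-shelf model (MiDaS via torch.hub). It is just an
example of this kind of GitHub project, not necessarily one of the ones I
used, and the input filename is a placeholder:

    import cv2
    import torch

    # Load MiDaS and its matching input transform from torch.hub.
    # The model weights are downloaded on first run.
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.imread("painting.jpg")              # placeholder input image
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    with torch.no_grad():
        # MiDaS predicts relative inverse depth at a reduced resolution.
        prediction = midas(transforms.default_transform(img))
        # Upsample the prediction back to the input size.
        depth = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().numpy()

    # Normalize for viewing and save as a grayscale depth visualization.
    depth = (depth - depth.min()) / (depth.max() - depth.min())
    cv2.imwrite("depth.png", (depth * 255).astype("uint8"))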

~~~
sniklaus
They look nice, thank you for sharing! It seems like the input images that you
used are paintings. I am not sure how well our depth estimation would work on
those; definitely something to try out.

~~~
JonathanFly
> It seems like the input images that you used are paintings.

Yeah, I love trying things that aren't supposed to work! They often do work
and surprise you, or fail in interesting ways.

------
whalesalad
I've always wondered if Apple was using technology like this for the movie
covers in the Apple TV store, since they have a parallax effect. For those who
don't know what I am talking about, the remote control has a touchpad on it,
so as you wiggle your finger around (while a title is selected) the cover will
move on a 3D axis w/ your finger until you use enough force to move to the
next title. Figured there would be a relatively straightforward way to
separate the layers w/ software.

But only certain titles have the effect ... so I would imagine the studios
provide a layered asset that can get composited together.

~~~
bonestamp2
Yes, I've submitted artwork to iTunes and they want (but don't require) a
layered file for the artwork to create that parallax effect. You'll notice
most just have the title and other text as movable layers, which is easy
since those were probably created as separate layers when the artwork was
made to begin with.

The background artwork is typically static, although I suspect this tool will
be included in Photoshop at some point, which will make it much easier to
apply this effect to backgrounds too.

Here are some specs if you're interested (there are various docs depending on
which assets you're submitting; this is just one example):

[https://help.apple.com/itc/videoaudioassetguide/#/itc0c10422...](https://help.apple.com/itc/videoaudioassetguide/#/itc0c10422b9)
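
For the curious, the compositing side of the effect is simple once you have
the layers: shift each one by the pointer offset scaled by a per-layer depth
factor. Here is a rough sketch (my guess at the general technique, not
Apple's actual implementation; the filenames and depth factors below are
placeholders):

    from PIL import Image

    def composite_parallax(layers, pointer_dx, pointer_dy):
        """Composite back-to-front, shifting each layer by the pointer
        offset scaled by its depth factor (0.0 = pinned background,
        1.0 = frontmost)."""
        frame = Image.new("RGBA", layers[0][0].size)
        for image, depth in layers:
            offset = (round(pointer_dx * depth), round(pointer_dy * depth))
            frame.paste(image, offset, image)    # image's alpha as the mask
        return frame

    # Placeholder assets: the layered file would supply these.
    layers = [
        (Image.open("background.png").convert("RGBA"), 0.0),  # static
        (Image.open("artwork.png").convert("RGBA"), 0.4),
        (Image.open("title.png").convert("RGBA"), 1.0),       # moves most
    ]
    composite_parallax(layers, 8, -4).save("frame.png")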

~~~
whalesalad
Oh rad, thanks for the link!

~~~
sniklaus
Thank you both for the pointers! Our research on the 3D Ken Burns effect could
be used to automate this, but the quality is still a little spotty. Some
results look great, some have obvious artifacts. An automated deployment would
thus either need to improve the overall quality or be able to detect artifacts
in order to fall back to the 2D result for those.

------
ChuckMcM
Can we replace that with
[http://sniklaus.com/papers/kenburns](http://sniklaus.com/papers/kenburns),
the actual paper?

~~~
sniklaus
I was thinking about what I should post and decided to make this little teaser
since it demonstrates the gist of our work within a few seconds. I understand
that the typical Hacker News audience may find the paper more worthwhile
though.

~~~
ChuckMcM
It was a good thought; however, the mechanism doesn't easily lend itself to
following through to the paper. Rather than using a video, consider a web page
with your examples embedded so they can be played (even animated GIFs would
work), plus a link to the paper. That gets you a bit of both worlds.

~~~
sniklaus
Good suggestion, thank you!

------
lostgame
It’s been said, but this is mistitled - this is not a ‘Ken Burns’ effect, but
rather a parallax panning effect with depth. To be honest it’s far more
impressive than I expected.

~~~
JansjoFromIkea
Yep same here, it misrepresents and understates the actual thing to the point
that I wouldn't have even clicked on it if the 3D part of the title didn't
confuse me.

------
anon1m0us
You'll notice the effect isn't as good at about 7 seconds into the movie where
the green grass between the bride and her maids moves with the bride.

Why does the algorithm think that is part of her, rather than the background?

~~~
pedrocx486
The two examples involving humans look terrible to me; the others look really
nice. Probably some bias in my perception.

~~~
sniklaus
Yeah, we are just good at spotting artifacts in renderings of humans; they can
easily look wrong or uncanny. In contrast, if a boulder is rendered with
incorrect geometry, we usually do not notice it.

------
nunodonato
Cool. I have a Fiverr service to make these for whoever wants to transform a
normal photo into a 2.5D one. But it's hand-made, so I guess this can make the
whole thing more accessible to anyone.

~~~
sniklaus
Our work focuses on physically correct depth. We have noticed that such
effects created by professional artists emphasize the parallax to an extent
that is not physically correct. As such, artists still seem to know best how
to animate a catchy parallax effect.

------
sprash
Found something similar on Shadertoy [1] from 2013. Of course, the geometry is
hand-coded in this case, but it would be a good starting point for
implementing something like this via shaders in WebGL and using it for all the
images on your webpage.

[1]:
[https://www.shadertoy.com/view/XdlGzH](https://www.shadertoy.com/view/XdlGzH)
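
To sketch the idea outside of GLSL: the shader boils down to a per-pixel
backward warp whose displacement scales with inverse depth, so animating the
shift over time fakes a camera move. A rough Python/numpy equivalent (my own
sketch with toy data, not the Shadertoy code; a fragment shader would do the
same per-pixel lookup on the GPU):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def parallax_frame(image, depth, shift):
        """Backward warp: sample each output pixel from an input location
        displaced horizontally in proportion to inverse depth."""
        h, w = image.shape
        ys, xs = np.mgrid[0:h, 0:w].astype(float)
        xs_src = xs - shift / depth              # nearer pixels move more
        return map_coordinates(image, [ys, xs_src], order=1, mode="nearest")

    # Toy data: a horizontal gradient with a near band across the middle.
    image = np.tile(np.linspace(0.0, 1.0, 160), (120, 1))
    depth = np.full((120, 160), 10.0)
    depth[40:80, :] = 2.0

    # Animating `shift` over time produces the parallax sweep.
    frames = [parallax_frame(image, depth, s) for s in np.linspace(-15, 15, 30)]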

~~~
sniklaus
That looks awesome, thank you for sharing!

------
covercash
I’ve manually done that in the past using Photoshop and multiple layers in
Final Cut Pro; it was very time-consuming. I wonder if this method requires
you to isolate the layers manually.

~~~
sniklaus
I watched a lot of tutorials like that when I started working on the project.
The results shown were created fully automatically, though one can optionally
refine the camera path. You can find more information on that in the video on
my website.

------
Animats
OK, so this is depth estimation from a single image, right? Like this classic
paper.[1] Then that's used to turn the image into a set of layers. This is
often done for video; that's how 2D movies are turned into 3D movies.

[1]
[http://www.cs.cornell.edu/~asaxena/learningdepth/ijcv_monocular3dreconstruction.pdf](http://www.cs.cornell.edu/~asaxena/learningdepth/ijcv_monocular3dreconstruction.pdf)

~~~
sniklaus
More or less. Estimating the depth from a single image is highly challenging
and far from solved. We thus had to make sure that the depth estimate is
suitable for synthesizing new views. And we are not explicitly modelling
layers; we actually model the scene geometry as a point cloud. But you
definitely got the gist of it. By the way, one advantage when estimating the
scene geometry in a video is that it is possible to employ structure from
motion.
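
In case it helps to picture the point-cloud representation, here is a minimal
sketch of backprojecting a depth map through a pinhole camera. This is
illustrative only; the intrinsics and depth map below are placeholders, and
the actual pipeline is more involved.

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        """Backproject a depth map through a pinhole camera.

        depth: (H, W) depths along the camera z-axis.
        Returns an (H*W, 3) array of XYZ points in camera coordinates."""
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        x = (xs - cx) * depth / fx               # pinhole backprojection
        y = (ys - cy) * depth / fy
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Placeholder depth map and intrinsics; a real pipeline would use the
    # network's depth estimate and calibrated (or assumed) camera parameters.
    depth = np.random.uniform(2.0, 10.0, (120, 160))
    points = depth_to_point_cloud(depth, fx=150.0, fy=150.0, cx=80.0, cy=60.0)
    print(points.shape)                          # (19200, 3)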

------
TazeTSchnitzel
I've seen this effect done to a single still image before, but presumably it
was done by hand somehow. Specifically, in the “Kony 2012” video (remember
that?)

~~~
sniklaus
The effect itself is actually not too uncommon. It is just very tedious and
time-consuming to do manually since it requires segmenting the elements in the
scene, arranging them in a 3D space, filling in any holes and specifying a
virtual camera trajectory.

~~~
seanalltogether
Thanks for clarifying this. My roommate used to do this in After Effects 10
years ago, so I couldn't understand what the demo video was attempting to
showcase.

------
logicallee
For easy clicking, the link at the end of the video is:

[http://sniklaus.com/kenburns](http://sniklaus.com/kenburns)

very interesting!

------
simonholland
I am a broadcast film editor (PBS, Nat Geo, BBC, etc.). Many shows need this
parallax effect to engage viewers with stills. At present, the After Effects
production of this effect is time-consuming and expensive. Please keep in
touch with me if you bring an app to market or offer a service to process our
3D stills. Simonhollandfilms@gmail.com

------
davidmurdoch
Reminds me of this gimbal-shot portrait-video parallax effect:
[https://youtu.be/Ryu4hp-HbwU](https://youtu.be/Ryu4hp-HbwU) (which was
probably somewhat inspired by the Ken Burns effect)

~~~
sniklaus
Those look nice! The premise of the project was to only use a single image as
an input in order to be applicable to existing footage. Without this
constraint, more stunning effects like the portrait-video parallax become
possible. I am sure we will see more exciting work like this in the future!

------
nayuki
This visual animation effect is almost the signature of the YouTuber "Business
Casual":
[https://www.youtube.com/channel/UC_E4px0RST-qFwXLJWBav8Q/videos](https://www.youtube.com/channel/UC_E4px0RST-qFwXLJWBav8Q/videos)

~~~
sniklaus
It seems like he is using them extensively, but not to an extreme. Looks
great, thank you for sharing!

------
astannard
Absolutely love it, fantastic work!

------
k__
Does this work real-time?

~~~
sniklaus
It takes two to three seconds to process the input image; after that, each
frame of the output video is synthesized in real time. This makes it possible
for the user to adjust the camera path and see the result immediately. Feel
free to have a look at the video on the website for an example of this.

------
adammenges
Great job! Love this.

------
rolltiide
Great! I had been thinking about doing this for years.

Every time I see an awesome landscape in person, I think about what's missing
from capturing that for other people to experience, and I concluded it came
down to depth perception and how our eyes dart around to create a composite
experience, resulting in ever-so-slight shifts of depth.

A 2D image can't capture that, but this seems a lot closer, and it's great
that it can use a 2D image as a base.

~~~
sniklaus
I wholeheartedly agree with you and believe that depth perception is a key
aspect that is missing in the status quo of viewing still images!

------
zekarlos
Looks good!

------
kd3
Incredible. Amazing. Holy shit.

~~~
sniklaus
This comment made me smile, thank you. :)

~~~
kd3
I imagine it must feel great to be a modern wizard. This is magic.

~~~
sniklaus
It is the result of a lot of work (a lot) and I truly believe everyone in my
situation with the right mindset would have been able to achieve it. Being a
Ph.D. student and having a chance to do internships at companies like Adobe or
Google does sometimes feel like a privilege though, so I understand that not
many are in the same situation.

