I just got the daily email from my server with some metrics. It has experienced over 500 gigabytes of traffic in the last few hours, and the video is only 3 megabytes in size. I definitely did not expect such an HN effect.
Comparatively few people watched the full video. Here are the logs from the content server, taken yesterday at 5 PM PST and covering the previous 24 hours (search for /kenburns/promo.mp4): https://pastebin.com/4zTxL2aw
I see your point now. This post originally pointed to the teaser video and not the project website (I do not know who changed that, must have been an admin): http://content.sniklaus.com/kenburns/promo.mp4
That means that there were 175574 HTTP GET requests, but my guess is that fast-forwarding might trigger another request, so the actual number of views may be a little lower.
Half of that must be me :-) This is so cool I watched it over and over again.
Hope you'll be able to release the source - or otherwise a binary to play with. Would love to try this out on 3D rendered human anatomy from CT scans. Cinematic rendering there already looks amazing, with this it would really come to life (eg: https://www.ajronline.org/doi/10.2214/AJR.17.17850)
I may be wrong, but given that two of the researchers work at Adobe, the plan is probably to keep it proprietary and put it into one of their software products.
I am planning to release the code as well as the dataset but have yet to get approval to do so. If the work is done by an intern, then Adobe is usually fairly open in terms of sharing the work.
I'm doing a video slideshow for my niece's wedding in a couple weeks and would love to add this effect to a bunch of their photos (without spending the usual hours and hours in Photoshop).
Thank you for your kind words! I moved away from having separate websites for each project and am currently trying to bring all my projects to my personal website. I just added the Mini Chess game last night, which used to be an Android app.
Really impressive results. Is your "Context-aware Synthesis for Video Frame Interpolation" available as an implementation somewhere? It looks really good.
I skimmed the paper about the Ken Burns effect, and if you don't mind I have some questions. I hope I didn't miss the answers in the paper itself, I'll be sure to read it more carefully when time permits.
1. The loss function for the depth estimation is L_depth = 0.0001·L_ord + L_grad. Is L_grad much bigger than L_ord by design, or will this basically make L_ord tiny and almost unnecessary? How did you arrive at this number rather than one an order of magnitude bigger or smaller?
2. How are you rendering the point cloud, like tiny discs in free space? When I think of point clouds I think of the typical LIDAR output renderings, but your results are continuous images, or video frames. Are the points rendered to an image plane and the result then interpolated?
And a couple more general ones:
1. Are depth estimation techniques other than neural networks obsolete now? Is there no point in estimating intrinsic camera parameters and epipolar lines when the state of the art seems to be to feed one or more images into a NN and let it produce the depth for you?
2. How do you decide what the NN should look like, with its various downsampling layers, convolutions, etc.? I've seen people start with pretrained networks and retrain them, but how do you build a novel network based on the input and desired output?
I unfortunately did not get the approval to release the context-aware frame interpolation, but one of my older works is open source for research purposes: https://github.com/sniklaus/pytorch-sepconv
1. I am not sure about the scale of the individual loss functions anymore, my apologies. I determined the combination of the two losses via a simple grid search and seeing what works best (plus / minus a magnitude did not make that much of a difference).
2. The points are just splatted to an image plane (see the sketch after this list); more advanced point cloud rendering techniques would be better though. There is no video frame interpolation, each individual frame in the output video is a rendering of the point cloud from a different camera perspective.
3. I am not sure about multi-view stereo, COLMAP still seems like the state of the art for that. But neural networks definitely outperform classic techniques for single image depth estimation.
4. Common architectures just did not do as well as I was hoping for so I tried about 1500 model architectures. I started with an architecture that intuitively seemed right and then gradually explored / refined alterations of it. It ultimately was a lot of trial and error.
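For anyone curious what "splatting the points to an image plane" looks like in code, below is a minimal sketch in Python/NumPy. It is not the renderer from the paper, just a naive single-pixel splat with a z-buffer; the point, color, and camera arrays (points, colors, K, R, t) are placeholders you would have to supply.

```python
import numpy as np

def splat_points(points, colors, K, R, t, height, width):
    """Minimal z-buffered point splatting: project a colored point cloud
    through a pinhole camera and keep the nearest point per pixel.
    points: (N, 3) world coordinates, colors: (N, 3) RGB in [0, 1],
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation."""
    cam = points @ R.T + t                 # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                       # keep points in front of the camera
    cam, z, colors = cam[valid], z[valid], colors[valid]

    proj = cam @ K.T                       # perspective projection
    u = np.round(proj[:, 0] / z).astype(int)
    v = np.round(proj[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    image = np.zeros((height, width, 3))
    zbuffer = np.full((height, width), np.inf)
    order = np.argsort(-z)                 # draw far-to-near so near points win
    for ui, vi, zi, ci in zip(u[order], v[order], z[order], colors[order]):
        if zi < zbuffer[vi, ui]:
            zbuffer[vi, ui] = zi
            image[vi, ui] = ci
    return image
```

Splatting each point to only a single pixel is also what leaves the seams discussed further down; real implementations use larger, soft footprints and run on the GPU.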
Was it the university that didn't want to release it? Are they looking at commercializing it, or how does that work? Is it available in any commercial software? It kind of looks like magic and would probably be very useful for a lot of purposes.
2. So basically each point is projected to the image plane without perspective mapping? So in 3D, the further they are from the camera, the bigger they are, so they all end up the same size on the image? And that prevents any seams from occurring in the pixel grid as things move around?
4. Experience, intuition, and elbow grease. Kind of what I thought, but I guess it's reassuring to see an expert in the field having to try 1500 variants.
It's complicated, but the gist of it is that they are trying to commercialize it, yes. For what it is worth, you might be able to find it in commercial software if they are successful with their business endeavors.
2. Yes, and there are two mechanisms for handling seams. First, the inpainting, which extends the point cloud and can provide a higher sample rate. Second, a postprocessing step that heuristically fills in any seams that may still be present despite the inpainting (a rough illustration follows below).
4. The downside of it is that one needs a lot of resources in order to try all of these variants, which not everyone is lucky enough to have access to.
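As a rough illustration of the heuristic hole filling mentioned in point 2 above (not the paper's learned inpainting or its actual postprocessing step), one could mark the pixels that received no points during splatting and fill them with a generic OpenCV inpainting call:

```python
import cv2
import numpy as np

# Toy stand-in for a splatted frame: a gradient image with a few thin "seams"
# (pixels that received no points) left at zero.
height, width = 128, 128
rendered = np.dstack([np.tile(np.linspace(0.0, 1.0, width), (height, 1))] * 3)
rendered[40:42, :, :] = 0.0
rendered[:, 90:91, :] = 0.0

# Mark pixels that received no point as holes.
holes = (rendered.sum(axis=2) == 0).astype(np.uint8) * 255

# Fill them from their surroundings. This is a generic heuristic,
# not the method from the paper.
frame_u8 = (np.clip(rendered, 0.0, 1.0) * 255).astype(np.uint8)
filled = cv2.inpaint(frame_u8, holes, 3, cv2.INPAINT_TELEA)
```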
The most interesting thing about this (compared to results from similar research/projects) is that in all of the examples, camera movement is forward and down relative to the original perspective. The results are really good, and some of that may be due to a superior algorithm, but it's also aided in large part by the choice of movement. Since objects lower in the frame tend to be closer (foreground elements), the downward camera movement causes those objects to occlude parts of the background above (and behind) them, meaning that a relatively small portion of the background needs to be inpainted by the algorithm. If too much is interpolated, visual artifacts often ruin the illusion.
The forward movement also helps in this case as those foreground objects grow in size (relative to the background) with time, so there's even less need for interpolation. If the movement were primarily lateral (or reversed relative to the original image), I imagine the algorithm would have a much harder time producing good results [1].
EDIT: After skimming the paper, it appears that the algorithm is automatically choosing these best-case virtual camera movements:
> This system provides a fully automatic solution where the start- and end-view of the virtual camera path are automatically determined so as to minimize the amount of disocclusion.
That is pretty impressive. I had originally assumed the paths were cherry-picked by humans, so it's cool that the paths themselves are automatically chosen (and that the algorithm matches the intuitive best-case scenario in most cases). It's still slightly misleading in terms of results because they mention that user-defined movements can be used instead, but of course, the results are likely to suffer significantly if the movement doesn't match the optimal path chosen by the algorithm.
[1] The last example shown in the full results video illustrates the issue with too much background interpolation in lateral movement: http://sniklaus.com/papers/kenburns-results
Thank you for sharing your thoughts! We designed the automatic camera path estimation to minimize the amount of disocclusion which indeed simplifies the problem. As you correctly pointed out, inpainting the background is an additional challenge and while we address it, the inpainted results sometimes lack texture.
Cool. But that's not what we call the Ken Burns effect in the industry. This is a 2.5D parallax shift, as seen in the documentary The Kid Stays in the Picture.
The Ken Burns effect is panning and zooming to highlight features in a photo and then fading into another.
The examples in the video were automatically generated without user feedback. It is actually also possible to manually specify the camera path in order to achieve the effect that you are expecting, for more information feel free to have a look at the video on the project website.
Understood. This is an area of image research I have an interest and some experience in (image segmentation, background inpainting, etc., through machine learning.)
I think your results are wonderful. I was just pointing out that from my background (a television producer - most recently Shark Week, etc.) that the Ken Burns effect and parallax effects are considered two different concepts (that can be combined, but mean different things.)
I did not mean to shoot you down or anything; in fact, I greatly appreciate your feedback! I unfortunately do not have a television / movie producer background but wish to learn more to better understand and support the workflow of professionals in this area. On a different note, I have not had time to watch television recently and hence just looked Shark Week up. It looks popular, congrats!
- Ken Burns effect -> zoom and pan around the image.
The Ken Burns effect is a type of panning and zooming effect used in video production from still imagery. The name derives from extensive use of the technique by American documentarian Ken Burns. (Wikipedia)
I would like to add that our method estimates the scene in full 3D space. Have a look at the additional results, there are a few results that exemplify this when the geometry is a little more complex: http://sniklaus.com/papers/kenburns-results
There are Fiverr gigs doing it, and tons of filmmakers (me included, when I dabble with that) do it ourselves too from time to time (for news, documentaries, and such), which takes time in Photoshop, Premiere, etc.
In fact I am about to pay a good 100-200 to have 4-5 photos done this way for a project...
Since I have done the work while interning at Adobe, I am afraid that I will not be able to commercialize it myself due to patenting. I agree with you that it has great potential, there are still quite a few failure cases though and it is a little hit and miss.
My special was “Andrew Mayne: Ghost Diver”, where I built a suit to fool a great white's senses. Not shown in the special was the work we did trying to understand how sharks process visual data differently than humans. One interesting note is that they don't seem to do image segmentation like we do (people in a shark cage are just a noisy box-thing and not discrete objects).
Again, your project is great. I’ve been trying to find a simple workflow to do something like this (currently it’s remove.bg and OpenCV - with a lot of tinkering.)
I just watched the trailer to that special, really interesting insights into the vision of sharks. And kudos on the nifty special effects to illustrate it!
I think OP meant for the 3D to be a qualifier of the Ken Burns effect. Meaning he added parallax to the Ken Burns effect to produce a "3D Ken Burns Effect".
I don't have any experience in any of it but I'm just offering another explanation of the discussion.
It might be a nice side result to use your tech to identify foreground elements automatically to generate the camera motion, without using the 2.5D layering. Then you’d have a more traditional Ken Burns effect. It might not work at picking a specific face out of a crowd, but it would work for a lot of scenarios.
I find Apple’s automatic Ken Burns effect to be way too noisy and almost useless because it zooms in/out from random places in the pictures. When I manually edit a Ken Burns effect myself, there’s a lot less movement, and the movement is so much more meaningful and relevant to the video.
Civil War... I guess I already knew a lot about it, so what I noticed mostly was more sympathy than I'd like for a sort of Lost Cause nostalgia, and not much talk of slavery. But I guess he had to sell it for its time...
Loved the Vietnam series too. Have watched it a couple of times on Netflix.
I'm a 40-year-old European, so I knew close to nothing about the Vietnam war other than what I had seen in some movies. Very in-depth and emotionally powerful. It also seemed pretty objective to me as a complete outsider.
The Roosevelts was my intro to Ken Burns and it was the only TV show ever that I could not stop watching. Since then, I've also been very impressed by the quality of all the other Ken Burns documentaries.
I'm not sure that they give the BBC a run for its money - you're talking about quite a lot of Beeb output ;-) I agree that they are of an unusually high standard.
Many BBC documentaries, at least, have a reputation beyond what they deserve, I think; oftentimes the real information in them is sparse, and they present overly simplistic narratives with too much padding.
In comparison to what, though? It's easy to knock the Beeb, but compare the output (TV, Radio, and Web) to anything vaguely similar and you will probably appreciate the quality. They're not just making docus, it's so much more. Funnily enough, the Beeb aired the Burns Vietnam, too.
We were doing a lot of reading to better understand video producers and their usage of the Ken Burns effect. We are ultimately still research scientists with no hands-on video-production experience, but we made sure to do our homework. :)
Some thoughts based on looking at the video a few times more (disclaimer - I'm not a video professional)
* There needs to be some variation in the zooming if you use this in a slideshow. I'd even say bias it towards zooming out more often, since that gradually reveals more of the image.
* You might have enough information to work with here to develop a similar "focus pull effect from single image" algo. That would be really cool.
* Maybe you could also try to develop a "sunset fade" for images where you detect blue sky: gradually adding a yellow-orange gradient fade to the sky, with a gradually warmer and darker foreground and non-sky background.
* There definitely should be more variety in terms of camera paths. Our framework supports either fully automatic results (as shown) or ones with a manual camera path. For an automatically generated slideshow, one should probably add more variety.
* There is some great work in this area, commonly focusing on portrait images. But yes, once the scene geometry has been estimated sufficiently well, one can definitely add some nice out-of-focus effects (a rough sketch follows below).
* That would be an interesting research direction. I am not an expert in relighting, but I can imagine that this requires a lot of work to make sure that the scene composition looks believable.
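To make the focus-pull idea above concrete, here is a crude depth-dependent blur, assuming you already have an RGB image and a depth map normalized to [0, 1] (both are assumptions; the depth could, for example, come from the estimate discussed in this thread). It is not the authors' method, just a layered blend of Gaussian blurs selected per pixel by distance from the focal plane; animating focus_depth over time yields a focus pull.

```python
import cv2
import numpy as np

def fake_focus_pull(image, depth, focus_depth, max_blur=21):
    """Crude depth-of-field: blur each pixel according to how far its
    normalized depth is from the focal plane. Purely illustrative."""
    blur_levels = [1, 5, 9, 13, 17, max_blur]     # odd Gaussian kernel sizes
    blurred = [cv2.GaussianBlur(image, (k, k), 0) for k in blur_levels]
    coc = np.clip(np.abs(depth - focus_depth) / 0.5, 0.0, 1.0)  # circle-of-confusion proxy
    idx = (coc * (len(blur_levels) - 1)).astype(int)
    out = np.zeros_like(image)
    for i, b in enumerate(blurred):
        mask = idx == i                           # pixels assigned this blur level
        out[mask] = b[mask]
    return out
```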
As for 2, the Focos app has recently added fake depth blur, and it is both absolutely impressive (oh my! It gets most of the image so right) and discernible (we are so visually sensitive that one wrongly estimated region - often a 'hole' like the ground between the legs - is enough to ruin the magic).
The authors might want to change the description on their amazing effects if they want to avoid lots of threads and comments about it not being the Ken Burns effect (or maybe that will generate more comments)
I am confused, a quote from Wikipedia: "The Ken Burns effect is a type of panning and zooming effect used in video production from still imagery."
Our framework takes a single still image and animates it via panning and zooming while adding 3D parallax. Does adding the 3D parallax not make it the Ken Burns effect anymore? Please let me know in case I am misunderstanding anything.
Yes, the examples that we have shown do not zoom as much as common Ken Burns examples. This stems from our framework processing the input at low resolution due to limitations in deep learning and the overall complexity of the problem. As such, there is not enough detail that could be zoomed into. This will be improved in later generations.
> Does adding the 3D parallax not make it the Ken Burns effect anymore?
Yes.
Specifically, the Ken Burns effect describes taking a still, 2D image and panning and zooming on the 2D image in a motion picture.
Adding 3D parallax makes it something quite different in the eyes of many of us on this forum. It is cool! We would just not use the words "Ken Burns effect" to describe what you've made.
On the contrary I think it basically serves the same purpose as the Ken Burns effect, except it tastefully incorporates the natural depth of the image.
One might even argue Ken Burns would have paired the panning and zooming with this parallax effect if he could.
I look forward to this enhancing basically all photo slideshows for a long time to come.
Thank you for clarifying, I see where you are coming from! My background is in research and it is fairly common to extend / reuse an existing terminology, but I totally understand and accept your point of view.
There is probably a name for this kind of "taking a 2D image, separating it into 2d cutouts, and panning with parallax". It's an effect that's been around for a while (hand done, not automated as you have it) but it's never been referred to as a "Ken Burns Effect"
Googling "How to make a parallax video from 2d pictures" brings up lots of tutorials, none of which mentions "Ken Burns Effect"
I think adding the 3D parallax gives the footage a very unique effect, and quite different from the traditional Ken Burns 2D scale and pan. I would say enough to make it deserve its own name, such as the sniklaus effect. Ken Burns is Ken Burns. This is something quite different, even though it may be inspired by KB.
I would be happy if it gets its own name in the future. I am voting against "sniklaus" effect though since this was a team effort and I could not have done it without my mentors.
Would it not be possible to do the image processing on the original image to create several pixel masks which then draw in the high resolution image data?
For instance, you take the source image at resolution n*n and do all processing at n/2 or n/4. Then at the moment you are going to composite the finished image you instead draw "pixels" from a source image which contains X,Y not RGB. Then you upscale the output image x2 (or x4) and replace each X,Y index with the relevant pixel from the source image (adding X%2,Y%2 or X%4 to the indexes to return to source resolution).
That is called guided upsampling and we are in fact doing just that to get a better estimate of the scene geometry. Specifically, we estimate the scene geometry at a resolution of 256x256 pixels (for a 1:1 aspect ratio, it varies for other resolutions) and then use the original image as a guide to upsample this estimate to 1024x1024 pixels. So yes, it could be used to a greater extent but it needs more work to be able to do this.
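For readers unfamiliar with the term, here is a hedged sketch of guided upsampling using a classic guided filter from opencv-contrib-python. The paper uses a learned refinement step, so this only illustrates the general idea; the file name, resolutions, and filter parameters are placeholders.

```python
import cv2
import numpy as np

image = cv2.imread('photo.png')                            # hypothetical full-resolution input
low_depth = np.random.rand(256, 256).astype(np.float32)    # stand-in low-resolution depth estimate

# Naive bilinear upsampling to the guide resolution...
up = cv2.resize(low_depth, (image.shape[1], image.shape[0]), interpolation=cv2.INTER_LINEAR)

# ...then refine it so the depth edges follow the edges of the guide image.
guide = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
refined = cv2.ximgproc.guidedFilter(guide, up, 8, 1e-4)    # radius and eps chosen ad hoc
```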
Not bad, but monocular depth estimation has gotten pretty good all around. I made these similar images with basically no expertise and no manual mapping, just by trying random single-image depth projects from GitHub. I kind of just went with whatever I could get running quickly and didn't even evaluate the quality other than in cases where it obviously didn't work.
(Sorry for linking a million tweets but I didn't put these on a blog or anything, that's the only place they exist.)
Your version does look better though! But I was impressed I could just run random images through depth estimates and get anything cool like this.
They look nice, thank you for sharing! It seems like the input images that you used are paintings. I am not sure how well our depth estimation would work on those, definitely something to try out.
I've always wondered if Apple was using technology like this for the movie covers in the Apple TV store, since they have a parallax effect. For those who don't know what I am talking about, the remote control has a touchpad on it, so as you wiggle your finger around (while a title is selected) the cover will move on a 3D axis w/ your finger until you use enough force to move to the next title. Figured there would be a relatively straightforward way to separate the layers w/ software.
But only certain titles have the effect ... so I would imagine the studios provide a layered asset that can get composed together.
Yes, I've submitted artwork to iTunes and they want (but don't require) a layered file for the artwork to create that parallax effect. You'll notice most just have the Title and (other text) as moveable layers, which is easy since those were probably created as separate layers when the artwork was created to begin with.
The background artwork is typically static, although I suspect this tool will be included in photoshop at some point at which time it will make it much easier to give this effect to backgrounds too.
Here are some specs if you're interested (there are various different docs depending on which assets you're submitting, this is just one example):
Thank you both for the pointers! Our research on the 3D Ken Burns effect could be used to automate this, but the quality is still a little spotty. Some results look great, some have obvious artifacts. An automated deployment would thus either need to improve the overall quality or be able to detect artifacts in order to fall back to the 2D result for those.
I was thinking about what I should post and decided to make this little teaser since it demonstrates the gist of our work within a few seconds. I understand that the typical Hacker News audience may find the paper more worthwhile though.
It was a good thought; however, the mechanism doesn't easily lend itself to following through to the paper. Rather than using a video, consider a web page with your examples embedded so they can be played (even animated GIFs would work) and a link to the paper. That gets you a bit of both worlds.
It’s been said, but this is mistitled - this is not a ‘Ken Burns’ effect, but rather a parallax panning effect with depth. To be honest it’s far more impressive than I expected.
Yep same here, it misrepresents and understates the actual thing to the point that I wouldn't have even clicked on it if the 3D part of the title didn't confuse me.
Estimating the scene geometry from a single image is highly challenging and remains an unsolved problem. As such, you can find subtle artifacts like this in most of the results. The artifact that you are referring to stems from a geometry adjustment step in which we make sure that salient objects like humans or animals are on a plane in the 3D space. This requires image segmentation which is also highly challenging and may lead to parts of the background being assigned to the segmented foreground object and vice versa.
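A much simplified version of that geometry adjustment, just to illustrate why a slightly wrong segmentation mask drags background pixels onto the foreground plane (the real system is more careful than this), could look like:

```python
import numpy as np

def flatten_object_depth(depth, mask):
    """Force all pixels of a salient object (boolean mask) onto one
    fronto-parallel plane by assigning them the object's median depth.
    Any background pixels wrongly included in the mask get pulled onto
    that plane too, which is the kind of artifact described above."""
    adjusted = depth.copy()
    adjusted[mask] = np.median(depth[mask])
    return adjusted
```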
Could you not use something like the "magic wand" selection tool in image editing programs? Doesn't the dramatic color difference indicate it's not part of the same object?
I thought maybe it grabbed it because it was the same color as the bouquet she is holding, but that is entirely within the white of her dress.
There definitely is room for improvement. Our proposed framework uses Mask R-CNN for object segmentation. However, Mask R-CNN only has a resolution of 28x28 pixels for the object masks, if I remember correctly (these are then resized to the actual resolution of the object in the image). So there are a lot of limitations from that alone.
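For reference, an off-the-shelf Mask R-CNN from torchvision (not necessarily the exact model or weights used in the paper) can be run like this; the input path and score threshold are placeholders, and torchvision >= 0.13 is assumed for the weights argument.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load a pretrained instance segmentation model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

image = Image.open('photo.png').convert('RGB')   # hypothetical input image
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

# Keep confident detections. The per-instance masks are predicted at a coarse
# internal resolution (28x28 in the mask head) and then resized to the image,
# which is why their boundaries tend to be imprecise.
keep = prediction['scores'] > 0.7
masks = prediction['masks'][keep] > 0.5          # (N, 1, H, W) boolean masks
labels = prediction['labels'][keep]
```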
Yeah, we are just good at seeing artifacts in human renderings; they can easily look wrong or uncanny. In contrast, if a boulder is rendered geometrically wrong, we usually do not notice it.
It's people moving up or down on the screen relative to one another in a way which their relative distances wouldn't dictate. Agreed that it gives me an uncanny-valley feeling (still super cool though).
Cool. I have a Fiverr service to make these for whoever wants to transform a normal photo into a 2.5D one. But it's handmade, so I guess this can make the whole thing more accessible to anyone.
Our work focuses on physically correct depth. We have noticed that such effects created by professional artists emphasize the parallax to an extent that is not physically correct. As such, artists still seem to know best how to animate a catchy parallax effect.
Found something similar on shadertoy [1] from 2013.
Of course the geometry is hand-coded in this case, but it would be a good starting point to implement something like this via shaders in WebGL and use it for all the images on your webpage.
I've manually done that in the past using Photoshop and multiple layers in Final Cut Pro; it was very time consuming. I wonder if this method requires you to isolate the layers manually.
I was watching a lot of tutorials like that when I started working on the project. The shown results were created fully automatically, one can optionally refine the camera path though. You can find more information on that in the video on my website.
OK, so this is depth estimation from a single image, right? Like this classic paper.[1] Then that's used to turn the image into a set of layers. This is often done for video; that's how 2D movies are turned into 3D movies.
More or less. Estimating the depth from a single image is highly challenging and far from solved. We thus had to make sure that the depth estimate is suitable for synthesizing new views. And we are not explicitly modelling layers; we actually model the scene geometry as a point cloud. But you definitely got the gist of it. By the way, one advantage when estimating the scene geometry in a video is that it is possible to employ structure from motion.
I've seen this effect done to a single still image before, but presumably it was done by hand somehow. Specifically, in the “Kony 2012” video (remember that?)
The effect itself is actually not too uncommon. It is just very tedious and time-consuming to do manually since it requires segmenting the elements in the scene, arranging them in a 3D space, filling in any holes and specifying a virtual camera trajectory.
Thanks for clarifying this. My roommate used to do this in after effects 10 years ago, so I couldn't understand what the demo video was attempting to showcase.
I am a broadcast film editor (PBS, Nat Geo, BBC, etc.); many shows need this parallax effect to engage viewers with stills. At present, producing this effect in After Effects is time consuming and expensive. Please keep in touch with me if you bring an app to market or offer a service to process our 3D stills. Simonhollandfilms@gmail.com
Reminds me of this gimbal-shot portrait-video parallax effect: https://youtu.be/Ryu4hp-HbwU (which was probably somewhat inspired by the Ken Burns effect)
Those look nice! The premise of the project was to only use a single image as an input in order to be applicable to existing footage. Without this constraint, more stunning effects like the portrait-video parallax become possible. I am sure we will see more exciting work like this in the future!
That's awesome, great tutorial. Another fun one is dolly zoom, which is pretty easy. Some of the DJI drones can even do it automatically: https://www.youtube.com/watch?v=MC7hkMR0hBs
It takes two to three seconds to process the input image and can subsequently synthesize each frame in the output video in real time. This makes it possible for the user to adjust the camera path and see the result in real time. Feel free to have a look at the video on the website for an example of this.
Every time I see an awesome landscape in person, I think about what's missing from capturing that for other people to experience, and I concluded it comes down to depth perception and how our eyes dart around to create a composite experience, resulting in ever-so-slight shifts of depth.
A 2D image can't capture that, but this seems a lot closer, and it's great that it can use a 2D image as a base.
It is the result of a lot of work (a lot) and I truly believe everyone in my situation with the right mindset would have been able to achieve it. Being a Ph.D. student and having a chance to do internships at companies like Adobe or Google does sometimes feel like a privilege though, so I understand that not many are in the same situation.
Website: http://sniklaus.com/kenburns