Animation seems like an especially poor fit to me, since the effective framerate (the rate at which new drawings appear) is often much lower than the video's framerate. Framerate can vary between scenes and even within different parts of one scene! Typically the background is very low framerate (sometimes as low as 4 FPS), the foreground is higher framerate (typically 8-12 FPS), while pans, zooms, and 3D elements are at a full 24 FPS. Most of the additional frames from interpolation will therefore be exact duplicates of other frames.
This does little to improve the smoothness of the video; it just adds artifacts. And since frames between two drawings will be interpolated while frames within one held drawing remain unchanged, the effective framerate will be inconsistent and appear as judder.
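Here's a toy sketch of that arithmetic (assuming, hypothetically, that scalar drawing indices stand in for frame images): naive 2x interpolation of footage drawn "on threes" mostly produces exact duplicates, with the motion still arriving in uneven bursts.

```python
# Toy model: each frame is represented by the index of the drawing it shows.
fps = 24
drawings = [f // 3 for f in range(fps)]  # "on threes": 8 drawings/sec at 24 FPS

# Naive 2x interpolation: insert the midpoint between each consecutive pair.
interpolated = []
for a, b in zip(drawings, drawings[1:]):
    interpolated.append(a)
    interpolated.append((a + b) / 2)  # an exact duplicate whenever a == b
interpolated.append(drawings[-1])

# Motion per output frame: zero inside a held drawing, a jump at each change.
steps = [round(b - a, 2) for a, b in zip(interpolated, interpolated[1:])]
print(steps)  # mostly 0.0 with isolated 0.5, 0.5 bursts -> judder, not smoothness
```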
Interpolation will never work for 2D animation. No way, no how. Any worthwhile system will need to modify existing frames rather than simply adding more in between the original frames. I can understand interpolation for live action (though I still dislike it), but it is absolutely god-awful for animation.
The reason I'm somewhat skeptical is that just because something looks realistic doesn't mean it's what was intended. It's a version of the 'zoom in, enhance, enhance' problem. It's like the _Hobbit_ problem: a GAN could perfectly well fake a 60FPS version of a 30FPS version of the _Hobbit_ such that you couldn't tell it wasn't the actual 60FPS version that Peter Jackson shot... but the problem is that it's 60FPS, and that just feels wrong for cinema. Animators, anime included, use the limitations of framerate and make deliberate switches between animating 'on twos' and so on, with framerate reductions done deliberately for action segments, sakuga, and other reasons. An anime isn't simply a film that was unavoidably shot at too low a framerate.
(This is less true of superresolution: in most cases, if an anime studio could have afforded to animate at a higher resolution originally, they would have; and you're not compromising any 'artistic vision' if you use a GAN to do a good upscaling job instead of a lousy bilinear upscale built into your video player.)
That's the problem: no matter how smart your algorithm is, you cannot make animation look smooth by only adding frames. Not even human animators could do that.
The framerate of animation is irrelevant. What matters is the number of drawings per second, not the number of frames. An intelligent system would interpolate between drawings, which would often require modifying or deleting frames from the source.
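To make that concrete, here's a hedged sketch (scalar positions standing in for actual drawing images; the function names are mine, not from any real system) of deduplicating held frames down to the drawings and then retiming them evenly:

```python
def dedupe(frames):
    """Keep only frames that differ from their predecessor (the drawings)."""
    out = [frames[0]]
    for f in frames[1:]:
        if f != out[-1]:
            out.append(f)
    return out

def retime(drawings, n_out):
    """Spread interpolated positions evenly across n_out output frames."""
    span = len(drawings) - 1
    return [drawings[0] + span * i / (n_out - 1) for i in range(n_out)]
    # a real system would blend actual images here, not scalar positions

held = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]   # 24 FPS footage drawn "on threes"
print(retime(dedupe(held), 10))          # even motion: 0.0, 0.33, 0.67, ...
```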
I'm not some purist claiming that this is an evil technology. It just plain doesn't apply to animation, except for pans or the rare scene animated at a full 24 FPS.
(IMO this is just something we have to push through. I hate the low frame rate of movies.)
One of the authors also worked on https://news.ycombinator.com/item?id=20188316
AI Learns Video Frame Interpolation | Two Minute Papers #197
The reason for this is that people felt that unsupervised learning was a misleading name for many of the so-called unsupervised learning methods, such as language modeling. They argue that there is a supervised training signal in these methods; the only difference is that the signal comes from the model's "input" itself rather than from an external label.
Ultimately, I'm not entirely sure there is really a distinction between the two once you argue it all the way down to the details (is PCA unsupervised? Or self-supervised, since it constructs a model with respect to its own inputs?), but I think it's generally intuitive what self-supervised methods refer to, and I'm on board with this renaming.
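For a concrete instance of the "signal comes from the input itself" point, here's a minimal next-token language-modeling setup: inputs and targets are the same sequence shifted by one, with no external annotation anywhere.

```python
# The supervision is manufactured from the raw text itself.
tokens = "the cat sat on the mat".split()
inputs  = tokens[:-1]          # ["the", "cat", "sat", "on", "the"]
targets = tokens[1:]           # ["cat", "sat", "on", "the", "mat"]
for x, y in zip(inputs, targets):
    print(f"predict {y!r} given context ending in {x!r}")
```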
> - A form of unsupervised learning where the data provides the supervision
> - In general, withhold some information about the data, and task the network with predicting it
> - The task defines a proxy loss, and the network is forced to learn what we really care about, e.g. a semantic representation, in order to solve it
In self-supervised training you use some kind of measurable structure to build a loss function against.
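Frame interpolation itself is a textbook case of that: withhold the middle frame of each video triplet and use it as the training target. A rough PyTorch sketch (the tiny conv stack is a placeholder of my own, not the paper's architecture):

```python
import torch
import torch.nn as nn

model = nn.Sequential(          # stand-in for a real interpolation network
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

def self_supervised_loss(frames):            # frames: (B, 3, 3, H, W) triplets
    first, middle, last = frames[:, 0], frames[:, 1], frames[:, 2]
    pred = model(torch.cat([first, last], dim=1))  # predict the withheld frame
    return nn.functional.l1_loss(pred, middle)     # proxy loss built from the data

batch = torch.rand(4, 3, 3, 64, 64)  # fake video triplets for illustration
print(self_supervised_loss(batch).item())
```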
But in common usage people say "unsupervised" to mean "self-supervised". For example, Word2Vec is usually referred to as unsupervised when it is technically self-supervised.
I think this is really because the self-supervised name was invented well after the techniques became commonplace.
I think Google did something like this some years ago?
This isn't really an unsupervised or self-supervised technique at all. It's a combination of supervised learning with reinforcement learning (which is a whole other thing too).
Speaking of, though: you'd think that by now, security cameras that capture footage at very low framerates for the sake of storage space would have ASICs in them using models like these to convolve together a bunch of grainy input frames into a stream of fewer, but very good and clean, frames.
Any hardware on the market with this capability yet?
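For what it's worth, the naive form of that fusion idea is simple: for independent noise, averaging N aligned frames cuts the noise by roughly sqrt(N). A toy NumPy illustration (static synthetic scene with Gaussian noise; real pipelines would align frames first and use a learned model rather than a mean):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.full((64, 64), 100.0)                      # stand-in static scene
noisy = clean + rng.normal(0, 20, size=(16, 64, 64))  # 16 grainy captures

fused = noisy.mean(axis=0)                            # naive temporal fusion
print(noisy[0].std(), fused.std())                    # ~20 vs ~5 (sqrt(16) gain)
```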
If it’s good enough for generating accurate fMRI images from sequentially overlaid magnetic flux readings, it’s definitely good enough for generating visuals from slightly suckier visuals.