

Next generation video: Introducing Daala, part 2 - walrus
https://people.xiph.org/~xiphmont/demo/daala/demo2.shtml

======
ZeroGravitas
Does Xiph have an official position on objective metrics for video codecs?

I'd imagine that what you choose to measure might make a big difference at
this early design stage, and from a purely practical point of view, an
"outsider" codec needs to prove itself against the incumbents via benchmarks
unless it is night and day better (and even then the incumbents can muddy the
waters with carefully chosen benchmarks).

The online peanut-gallery debate on this seems to have degenerated into a
sports-fan clash between SSIM aficionados and people who think PSNR is good
enough. But since the two measures are highly correlated with each other, and
both are based on treating video as a series of still images, it sometimes
seems like two very short twins arguing about who is taller. The Xiph
write-ups have a good track record of cutting through the BS and illuminating
good engineering practice, so I'd be interested in hearing their take.

A recent paper suggested that all such frame-based metrics heavily
underestimate the improvement from H.264 to HEVC compared with subjective MOS
ratings. As an "odd" codec design that intentionally strays from the MPEG
orthodoxy, Daala could have a correspondence between objective and subjective
quality measures that differs radically from that of existing codecs.

------
aidenn0
Quick comment: don't expect Daala to be a mature codec for a while yet. To
quote Monty: "Writing a complete new encoder from scratch is a small task
compared to the time required to then tune that new encoder into efficient
operation. Incremental changes allow demonstrable, steady progress."

The downside is that incremental changes can leave you stuck in a local
maximum if its basin is large enough, which is where Daala comes in.

~~~
mtgx
I'm not, but I am expecting it to be twice as efficient as VP9 and HEVC (the
same quality at half the bitrate) if it's going to arrive that late in the
game. I don't think anything less will work for them - the switch won't be
considered worth the trouble, and it won't get adopted. Hopefully they pull
it off.

~~~
zanny
The next generation of video codecs won't be as expensive a switch as the
MPEG-2 or H.264 eras. If a codec shows enough space efficiency, you could just
implement an OpenCL or compute-shader variant of it to run GPU-side, which
would outperform CPU decoding in power efficiency and speed by an order of
magnitude and, properly tuned, would be in the ballpark of dedicated hardware
decode times.

Dedicated audio-decoding hardware went out the window once CPUs got fast
enough that spending die area on it became excessive, even though it was more
efficient and faster. I think the same thing will happen (soon) to video
decoding: the extra dedicated decode hardware just isn't worth the hassle once
well-optimized GL 4.3 compute shaders or OpenCL deliver efficiency and
performance that is, if not equal to the dedicated hardware's, close enough
not to justify having it.

It almost wins on efficiency already, because just making dies larger to
include that extra circuitry raises the power ceiling of the device.
Genericizing die area doesn't give you at-runtime power gains, but overall you
can save juice by not wasting die (though power management has gotten so
sophisticated that it can outright shut off parts of a chip, which might
eliminate that downside).

But simultaneously, the GPU hardware in phones (Tegra 4 / Snapdragon 600 era)
is reaching the same threshold CPU audio decoding reached: it becomes silly to
waste the die when the performance is close enough.

And even if this isn't the generation where dedicated hardware goes out the
window, it will be the next one, and this one will be close enough that it
will be like the late '90s with audio, when the experimentation began.

~~~
brigade
Codecs aren't parallel enough to work well on GPUs; parallelism in general
hampers compression efficiency.

Anyway, audio resolution hasn't increased anywhere near as much as video
resolution. A 48 kHz sample rate at 16 bits per channel is still the highest
that's reasonably useful, and we've had essentially that since CDs.

Whereas internet video has gone from CIF (352x288) to 1080p (1920x1080), a
pixel-count increase of over 20x. And 4K is being pushed now for another 4x.

~~~
zanny
The point is that audio resolution peaked, and additional returns were
negligible at best. Video has the same effect occur somewhere around 300 PPI
at a 6" viewing distance, 200 PPI at 12", etc., and between 90 and 150 Hz
refresh rate. Color fidelity is also near its limits on some high-end IPS
panels.

Past those points, _most_ people don't notice the difference, just like most
people don't notice the difference between 16- and 24-bit audio at a 44.1 or
48 kHz sample rate. Once the vast majority of people no longer see a
difference, the technology peaks. I think video is (finally) approaching that
territory in the next 5 years, at least in 2 dimensions. I expect holographic
3D video will see a boom after that, not the eye-trick 3D crap we have now.

------
anarchy8
Can someone give a summary or TL;DR for us less technical types? This looks
very promising.

~~~
noahl
First, let me say that I am not an expert on video codecs - I've just worked
with VP9 a little. However, here is what I got from that.

Video codecs overall use two tricks to reduce video sizes: first, they drop
unnecessary information; second, they predict the information that's left.
(I'm not going to write about why prediction reduces size here, but look up
arithmetic coding in Wikipedia if you want to know.)
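(A quick toy illustration of that point - not from the blog post, just a
generic Python sketch with made-up pixel values. The idea is that a good
predictor turns varied values into mostly-small residuals, which an entropy
coder like an arithmetic coder can store in fewer bits:)

    # Toy sketch: why prediction helps an entropy coder.
    import math
    from collections import Counter

    def entropy_bits(symbols):
        # Empirical Shannon entropy in bits per symbol.
        counts = Counter(symbols)
        n = len(symbols)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # A smooth "scanline": neighboring pixels are similar.
    pixels = [10, 11, 13, 14, 14, 15, 17, 18, 18, 19, 21, 22]

    # Predict each pixel from its left neighbor; code only the residual.
    residuals = [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

    print(entropy_bits(pixels))     # ~3.3 bits/symbol: many distinct values
    print(entropy_bits(residuals))  # ~1.7 bits/symbol: mostly 0, 1, and 2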

If you look at a raw video, dropping information might seem hard - it's a
bunch of pixels, and you don't really want to drop a pixel. Instead, you
transform the video into a representation where some of the information seems
less important. Daala, like VP9, uses a Fourier-type transform (specifically a
discrete cosine transform). That means it writes a group of pixels (in this
case, a 4x4, 8x8, 16x16, or 32x32 block) in a different basis than normal
(look up "change of basis" for more information here).

The ordinary basis has a vector that corresponds to each pixel (or really,
each color component of each pixel, but I'm not getting into that). In the
basis they're using, the first basis vector corresponds to the average of all
of the pixels, and the later vectors drill down into more and more specific
information, until the final vector gives you information about the difference
between adjacent pixels. Since the difference between two adjacent pixels is
probably fuzz, you feel fine dropping it - this is the "dropping information"
part of the codec.
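Here is a minimal sketch of that in Python/NumPy. Note this uses a plain
(non-lapped) DCT for illustration - Daala's actual transform is lapped, and
real codecs quantize coefficients rather than just zeroing them:

    # Change of basis on an 8x8 block, then drop the fine-detail
    # coefficients and invert. Generic DCT demo, not Daala code.
    import numpy as np

    N = 8
    # Orthonormal DCT-II matrix: row k is the k-th basis vector.
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    D[0, :] /= np.sqrt(2.0)

    block = np.random.rand(N, N) * 255   # stand-in for 8x8 pixels
    coeffs = D @ block @ D.T             # 2-D DCT: change of basis

    # coeffs[0, 0] is the average ("DC"); high indices are fine detail.
    coeffs[4:, :] = 0                    # drop high vertical frequencies
    coeffs[:, 4:] = 0                    # drop high horizontal frequencies

    approx = D.T @ coeffs @ D            # inverse transform
    print(np.abs(block - approx).max())  # only fine detail was lost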

But what do you do with the information you can't drop? You try to predict it,
and that is what this blog post is about. When we're decoding a frame, we go
from left to right and from top to bottom. So when we're looking at a normal
32x32 block somewhere in the middle of the frame, we already know the blocks
directly above, left above, right above, and directly left. The idea is to use
what we know about those blocks to predict the one that we're decoding. This
is going to work because nearby pixels in videos are probably pretty similar -
for instance, in the background. (Note that you can also predict a block based
on a nearby region in _previous_ frames. The blog post doesn't talk about
that, but Daala will have a way to do it.)

The author's idea for prediction is both simple and clever. Each value is
going to be a weighted linear combination of old values - weight_1 * old_val_1
+ weight_2 * old_val_2 + ... . Now, having decided on the general form of the
predictors, they're _not actually choosing the predictors by hand_. Instead,
they've narrowed the search space enough that they can have their computers
search for optimal predictors within that space. It's a cool idea, and I hope
it will produce good results. One notable difference between them and other
codecs is that after doing a Fourier transform, they're predicting the
_transformed_ data rather than the plain data. We don't know yet what effect
that will have.
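A sketch of what that search might look like, assuming ordinary least squares
as the fitting method (the post doesn't spell out the optimization; the data
and setup here are invented for illustration):

    # Fit linear prediction weights from "training" data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Pretend each row of X holds values from already-decoded
    # neighboring blocks, and y holds the value to predict.
    n_samples, n_neighbors = 1000, 12
    X = rng.normal(size=(n_samples, n_neighbors))
    true_w = rng.normal(size=n_neighbors)
    y = X @ true_w + 0.1 * rng.normal(size=n_samples)  # plus noise

    # Weights minimizing squared prediction error over the data.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    residual = y - X @ w
    print(np.mean(residual ** 2))  # small residuals are cheap to code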

This is still early-stage work, but it does look cool. One big thing they only
briefly mentioned is error. Any video codec will produce some sort of error -
places where the compressed video doesn't look like the original video. The
trick is to make sure that the errors aren't things that people mind, and the
only real way to test that is to show videos to people and see what they
think. They're not at the point where they can do that, but it will be cool to
see the results when they can. And I really really hope that performance is as
good as they want - a 40% reduction over VP9 would be great for Internet
video.

~~~
sillysaurus
The human eye is more sensitive to intensity (roughly, a weighted average of
R, G, and B) than to color. You can drop most of the color info from a
picture without significantly degrading its viewing quality.

One way to do that is to transform from (r,g,b) to (intensity, chroma1,
chroma2) and then downsample the chroma1 and chroma2 channels to half
resolution. When you then transform back into (r,g,b), humans can hardly tell
the difference. Whereas if you tried to do that with the intensity channel,
the picture would look awful.
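A small sketch of that idea, using BT.601-style luma/chroma constants (one
common choice; actual codecs vary in the exact transform and subsampling):

    # RGB -> luma + two chroma channels, then halve the chroma.
    import numpy as np

    def rgb_to_ycbcr(rgb):
        # rgb: float array, shape (H, W, 3), values in [0, 1]
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b  # intensity (luma)
        cb = 0.564 * (b - y)                   # blue-difference chroma
        cr = 0.713 * (r - y)                   # red-difference chroma
        return y, cb, cr

    def halve(c):
        # Downsample 2x in each dimension by averaging 2x2 blocks.
        h, w = c.shape
        return c[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).mean(axis=(1, 3))

    img = np.random.rand(16, 16, 3)
    y, cb, cr = rgb_to_ycbcr(img)
    cb_small, cr_small = halve(cb), halve(cr)

    # Full-res luma plus two quarter-size chromas: half the samples.
    kept = y.size + cb_small.size + cr_small.size
    print(kept / img.size)  # -> 0.5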

~~~
atondwal
Also note: Daala is lossless:
https://git.xiph.org/?p=daala.git;a=blob;f=doc/design.tex;h=7a2bd470fe9d4ec58544a4ed103b28fe0f6f39d0;hb=HEAD#l50

------
aidenn0
It worries me that On2 has said in a private conversation that this is a dead
end. They aren't stupid there, and they've almost certainly at least partially
explored this space before.

~~~
brigade
It certainly depends on what they tried - Loren Merritt of x264 fame looked at
lapped transforms years ago as well, and concluded that they were inferior due
to the infeasibility of spatial prediction. I remember he also concluded that
frequency-domain prediction was inferior to spatial prediction, probably based
partly on how useless frequency prediction was in MPEG-4.

Personally, the amount of blur in those images worries me more than whatever
On2 might have concluded, simply because we all know what happened with
wavelets. But it's just the prediction we're seeing, not the result of the
transform, so time will tell.

~~~
StavrosK
What happened with wavelets?

~~~
brigade
15 years ago they were the next big revolution in image coding. Then people
made codecs based on them, which never beat DCT codecs perceptually. They did
well in PSNR though, since they blurred instead of blocked.

Of course, their main problem was that fine detail wound up in every
decomposition band, so you had to code it multiple times and no one came up
with good enough prediction to offset that. Lapped DCTs shouldn't have that
issue any more than traditional DCTs.
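(A rough illustration of the "detail in every band" problem, using the
simplest possible wavelet, a 1-D Haar decomposition - real image codecs used
fancier filters, but the effect is the same:)

    # A single sharp edge leaves nonzero detail coefficients at
    # every decomposition level, so it must be coded repeatedly.
    import numpy as np

    def haar_level(x):
        # One level: pairwise averages (low band) and differences (detail).
        return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0

    x = np.array([0.0] * 7 + [1.0] * 9)  # one sharp edge
    low = x
    for level in range(1, 5):
        low, detail = haar_level(low)
        # The edge shows up as a nonzero coefficient at every level.
        print(level, np.count_nonzero(detail))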

~~~
StavrosK
Ah, good explanation, thank you. I did wonder why they never took off.

------
ZeroGravitas
I'm kind of surprised that we've not heard more from Google about machine
learning applied to video codecs. After all, some people claim compression is
basically AI, and other people claim that Google is basically an AI company,
so you'd expect some fruitful collaboration to be possible. Or maybe they're
just keeping quiet about it?

------
vanderZwan
So this is obviously way over my head, but is it a coincidence that the Daala
predictors produce results that look very similar to the high-pass filters in
Photoshop?

~~~
aidenn0
No, because the predictors are best at predicting low-frequency data, which
leaves just the high-frequency data in the residual.
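(You can see the same effect with any smooth predictor - subtracting a
low-pass estimate from the original is, by definition, a high-pass result.
A generic sketch, not Daala's actual predictor:)

    # Original minus a low-frequency "prediction" = high frequencies.
    import numpy as np

    def box_blur(img, k=5):
        # Crude low-pass filter: average over a k x k neighborhood.
        pad = k // 2
        padded = np.pad(img, pad, mode='edge')
        out = np.zeros_like(img)
        h, w = img.shape
        for dy in range(k):
            for dx in range(k):
                out += padded[dy:dy + h, dx:dx + w]
        return out / (k * k)

    img = np.random.rand(32, 32)
    prediction = box_blur(img)   # smooth, low-frequency estimate
    residual = img - prediction  # what remains looks "high-passed"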

~~~
vanderZwan
And what about the difference in blockiness compared with the other filters?
Is that because of the lapped transform thingy?

~~~
aidenn0
Partly; as they say, you can't do exactly what the spatial predictors from
AVC/H.264 do.

Look at the figure labeled "AVC/H.264 Prediction modes" (the text is part of
the image, so you can't Ctrl-F for it, unfortunately). Notice that these
filters don't work in frequency space but are spatial: they just extend the
neighboring pixels into the block as stripes. That tends to be smooth on the
upper and/or left edges, and blocky on the lower and right edges.

The Daala predictors operate in frequency space, after the lapped transform,
so you end up with something that is not necessarily a block of stripes, and
you also get more smoothness at the lower and right boundaries.

