

A comprehensive guide to parallel video decoding - ZeroGravitas
http://emericdev.wordpress.com/2011/08/26/a-comprehensive-guide-to-parallel-video-decoding/

======
pmjordan
Good to see someone taking on VP8 decoding in hardware in earnest. I started
working on an OpenCL+OpenGL-based implementation back in April, but it ended
up taking too much of my time. I got as far as doing the loop filter on the
GPU, though if I remember correctly, I spent way too much time chasing some
precision bug.

In any case, here's a brain dump on the topic in case anyone is interested:

The way to hardware-accelerate video decoding is to start at the back of the
pipeline and work forwards, otherwise you spend too much time copying data
between CPU and GPU. The last stage is colour space conversion (easy to do on
GPU), before that is the loop filter/motion compensation/intra prediction
feedback loop. This one is a lot harder to parallelise, unfortunately by
design.
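To make that last stage concrete, here's a minimal sketch of why colour space conversion parallelises trivially: every output pixel depends only on the co-located input pixels, so on a GPU it's one thread per pixel. (Python/NumPy for illustration; I'm assuming BT.601 video-range coefficients and chroma already upsampled to luma resolution — the function name is mine, not anything from the VP8 code.)

```python
import numpy as np

def yuv_to_rgb(y, u, v):
    """Per-pixel YUV -> RGB conversion (BT.601 video range).

    Each output pixel depends only on the co-located input pixels,
    so this maps directly onto one GPU thread per pixel. Chroma is
    assumed to have been upsampled to luma resolution already.
    """
    y = y.astype(np.float32) - 16.0
    u = u.astype(np.float32) - 128.0
    v = v.astype(np.float32) - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.813 * v - 0.391 * u
    b = 1.164 * y + 2.018 * u
    rgb = np.stack([r, g, b], axis=-1)
    return np.clip(np.round(rgb), 0, 255).astype(np.uint8)
```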

Intra prediction uses the luma/colour information from up to 4 adjacent blocks
to approximate the contents of a block:

      A B C ...
      X Y ...
      ...

Blocks are processed in scanning order, so by the time we get to Y, we will
have reconstructed A, B, C and X, so we can base Y on them. This is great for
getting good compression ratios, but awful for parallelisation, because
everything is explicitly serial and the correct result depends on order of
execution.
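To see the serial dependency in code form, here's a toy DC-style reconstruction (my own illustration, not VP8's actual prediction modes): each block is predicted from already-reconstructed neighbours, so block (r, c) cannot even start until the blocks above it and to its left are finished.

```python
import numpy as np

def reconstruct_dc(residuals):
    """Toy DC-style intra reconstruction over a grid of block means.

    Each block's prediction is the average of the already-reconstructed
    blocks above and to the left (raster order). This is why the loop
    is inherently serial: block (r, c) depends on (r-1, c) and
    (r, c-1), so order of execution matters.
    """
    rows, cols = residuals.shape
    recon = np.zeros_like(residuals, dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            neighbours = []
            if r > 0:
                neighbours.append(recon[r - 1, c])
            if c > 0:
                neighbours.append(recon[r, c - 1])
            # no neighbours yet (top-left block): predict mid-grey
            pred = sum(neighbours) / len(neighbours) if neighbours else 128.0
            recon[r, c] = pred + residuals[r, c]
    return recon
```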

VP8 has 2 different types of loop filters, the 'normal' one and the 'simple'
one. The 'normal' loop filter reads up to 4 pixels either side of a block
boundary (and modifies up to 2), and since blocks are 4 pixels wide, you again
end up with an ordering dependency which is difficult to parallelise. The
simple loop filter doesn't suffer this problem, if I remember correctly, and
is embarrassingly parallel. Unfortunately, it doesn't seem to be widely used,
presumably because its visual results are worse.

The motion compensation is easier to parallelise, so it's probably a good idea
to get that and the loop filter working first, as that will at least offload
the majority of frames to the GPU.
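A rough sketch of why motion compensation parallelises well (whole-pixel motion only — real VP8 also does sub-pixel interpolation, which I'm omitting, and the function name and edge-clamping policy are my own): each block just copies a region of the reference frame at its motion vector, completely independently of every other block.

```python
import numpy as np

def motion_compensate(ref, mvs, block=16):
    """Whole-pixel motion compensation over a reference frame.

    mvs[i][j] is the (dy, dx) motion vector of macroblock (i, j).
    Every block is independent, so this loop nest maps directly onto
    one GPU work-item per block. Frame dimensions are assumed to be
    multiples of the block size for simplicity.
    """
    h, w = ref.shape
    out = np.empty_like(ref)
    for r in range(0, h, block):
        for c in range(0, w, block):
            dy, dx = mvs[r // block][c // block]
            # clamp the source region to stay inside the reference frame
            sy = min(max(r + dy, 0), h - block)
            sx = min(max(c + dx, 0), w - block)
            out[r:r + block, c:c + block] = ref[sy:sy + block, sx:sx + block]
    return out
```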

So, we need to parallelise the loop filter.

      A B C D E F ...
      L M N O ...
      X Y ...

The above are macroblocks. To get correct results, you're supposed to
calculate them all in alphabetical order. But observe that O needs C, D, E and
N, while Y needs L, M, N and X. Those two calculations are independent of each
other, so we can perform them simultaneously. Note that we _CAN'T_ calculate N
and Y in parallel, as Y depends on N. In general, though, we can always
process in parallel the next macroblock on each row along such a 2:1 diagonal
edge.

We start in the top left corner of the picture, where there's little
opportunity for parallelism, and work our way towards the bottom right, with
the batch size growing for a while and then shrinking again near the end. Not
ideal, but it could be good enough:

      [A]
      
       A[B]
      
       A B[C]
      [L]
    
       A B C[D]
       L[M]
      
       A B C D[E]
       L M[N]
      [X]
    

etc.
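The diagonal scheme above is easy to turn into an actual schedule. A small sketch (names mine): macroblock (r, c) depends on (r, c-1) and, through the filter taps, on (r-1, c+1), so every macroblock with the same value of c + 2*r forms one batch whose members can run in parallel.

```python
def wavefront_batches(mb_rows, mb_cols):
    """Group macroblock coordinates into 2:1 diagonal wavefront batches.

    Macroblock (r, c) depends on (r, c-1) to its left and, because the
    loop filter reads past the block edge, on (r-1, c+1) above-right.
    Both dependencies are satisfied once all blocks with a smaller
    value of c + 2*r are done, so that value is the batch index.
    Batches must run in order; blocks within a batch are independent.
    """
    batches = {}
    for r in range(mb_rows):
        for c in range(mb_cols):
            batches.setdefault(c + 2 * r, []).append((r, c))
    return [batches[k] for k in sorted(batches)]
```

On the 3x6 grid from the diagram, the fifth batch is E, N and X — exactly the step shown above.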

As it turns out, you can actually decompose the filter into a vertical and a
horizontal stage for each 16x16 macroblock. I forget which comes first (I
think the horizontal edges, i.e. a vertical kernel), but this fact might be
exploitable for hiding memory latency.

Intra prediction should be possible via a similar scheme, except that it
operates on blocks within macroblocks, which makes it a bit more complex. You
might be
able to achieve more parallelism by exploiting the fact that different intra
modes don't require access to all 4 neighbouring blocks. To benefit from that
you'll need a scheduling pass across all macroblocks on the CPU first.

The pipeline stage before the intra/MC/loop filter feedback loop is, happily,
embarrassingly parallelisable again: the inverse transform is a simple
per-block matrix operation and exactly the kind of problem GPUs excel at.
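For instance (a sketch only — I'm using an orthonormal floating-point DCT-II basis as a stand-in for VP8's actual integer 4x4 transform, which uses different scaling): all of a frame's coefficient blocks can be inverse-transformed in one batched matrix product, with no dependencies between blocks.

```python
import numpy as np

def dct_matrix(n=4):
    """Orthonormal DCT-II basis matrix (a floating-point stand-in for
    VP8's integer transform; the real codec's scaling differs)."""
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def idct_blocks(coeffs):
    """Inverse-transform a whole batch of 4x4 coefficient blocks.

    For each block C, the residual is D^T @ C @ D. Every block is
    independent, so the batch dimension parallelises perfectly.
    """
    d = dct_matrix(4)
    return np.einsum('ki,bkl,lj->bij', d, coeffs, d)
```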

It's of course entirely possible that I've made a mistake or that I've missed
a parallelisation opportunity, but this is certainly the kind of stuff you
have to think about when tackling this type of problem.

