
Good to see someone taking on VP8 decoding in hardware in earnest. I started working on an OpenCL+OpenGL-based implementation back in April, but it ended up taking too much of my time. I got as far as doing the loop filter on the GPU, though if I remember correctly, I spent way too much time chasing some precision bug.

In any case, here's a brain dump on the topic in case anyone is interested:

The way to hardware-accelerate video decoding is to start at the back of the pipeline and work forwards; otherwise you spend too much time copying data between CPU and GPU. The last stage is colour space conversion (easy to do on a GPU); before that is the loop filter/motion compensation/intra prediction feedback loop. That loop is a lot harder to parallelise, unfortunately by design.
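
To see why colour conversion is the easy part: it's a per-pixel affine transform with no dependencies between pixels, which maps trivially onto a fragment shader or OpenCL kernel. A minimal sketch in Python, assuming full-range BT.601-style coefficients (the exact coefficients and range depend on how the stream is tagged):

```python
def yuv_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel to RGB (BT.601-style
    coefficients -- an assumption here, not taken from the decoder spec).

    On a GPU this runs once per pixel with no inter-pixel dependencies,
    which is why colour conversion is the easiest stage to offload.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(r), clamp(g), clamp(b)
```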

Intra prediction uses the luma/colour information from up to 4 adjacent blocks to approximate the contents of a block:

  A B C ...
  X Y ...
Blocks are processed in raster (scanning) order, so by the time we get to Y we will have already reconstructed A, B, C and X, and can base Y on them. This is great for getting good compression ratios, but awful for parallelisation, because the process is explicitly serial and the correct result depends on the order of execution.
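
To make the dependency concrete, here's a simplified DC-style prediction of block Y from its reconstructed neighbours (a toy mode of my own for illustration -- the real VP8 modes and their arithmetic are specified in RFC 6386):

```python
def dc_predict(above_row, left_col):
    """Predict a 4x4 block as the rounded average of the reconstructed
    pixels directly above it and to its left (simplified DC-style mode,
    not the exact VP8 arithmetic).

    above_row comes from block B and left_col from block X -- both must
    already be reconstructed, which is exactly the serial dependency.
    """
    samples = list(above_row) + list(left_col)
    dc = (sum(samples) + len(samples) // 2) // len(samples)  # rounded mean
    return [[dc] * 4 for _ in range(4)]
```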

VP8 has two different loop filters, a 'normal' one and a 'simple' one. The normal loop filter reads up to 4 pixels on either side of a block boundary (and modifies up to 3), and since blocks are 4 pixels wide, you again end up with an ordering dependency that is difficult to parallelise. The simple filter doesn't suffer from this problem, if I remember correctly, and is embarrassingly parallel. Unfortunately, it doesn't seem to be widely used (presumably because its visual results are worse).
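
For flavour, the core of a simple-filter-style step on one 4-pixel span across an edge looks roughly like this (a sketch in the style of the H.263-family deblocking filter that VP8's simple filter belongs to; the exact thresholds and arithmetic live in RFC 6386):

```python
def clamp_s8(v):
    """Clamp to signed 8-bit range, as bitstream-spec arithmetic does."""
    return max(-128, min(127, v))

def simple_filter_edge(p1, p0, q0, q1):
    """Filter one span across a block edge: p1 p0 | q0 q1.

    Only p0 and q0 are modified, so adjacent edges never interfere and
    every edge in the frame can be filtered in parallel. This is a
    sketch of the general shape, not the exact spec math.
    """
    a = clamp_s8(clamp_s8(p1 - q1) + 3 * (q0 - p0))
    f1 = clamp_s8(a + 4) >> 3   # correction applied to q0
    f2 = clamp_s8(a + 3) >> 3   # correction applied to p0
    return p0 + f2, q0 - f1
```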

The motion compensation is easier to parallelise, so it's probably a good idea to get that and the loop filter working first, as that will at least offload the majority of frames to the GPU.
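
Motion compensation parallelises well because each block only reads from the reference frame, which is already complete -- there are no dependencies between blocks of the current frame. A whole-pixel sketch (real VP8 adds six-tap sub-pixel interpolation on top of this; the function and parameter names are mine):

```python
def motion_compensate_block(ref, frame_w, bx, by, mvx, mvy, size=4):
    """Copy a size x size block from a reference frame at a motion offset.

    ref is a flat pixel list, frame_w its width in pixels. Whole-pixel
    motion only (no sub-pixel filter, no edge clamping). Every block is
    independent, so all blocks of an inter frame can run in parallel.
    """
    out = []
    for y in range(size):
        row = []
        for x in range(size):
            sx, sy = bx + x + mvx, by + y + mvy
            row.append(ref[sy * frame_w + sx])
        out.append(row)
    return out
```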

So, we need to parallelise the loop filter.

  A B C D E F ...
  L M N O ...
  X Y ...
The above are macroblocks. To get correct results, you're supposed to calculate them all in alphabetical order. But observe that you need C, D, E and N to calculate O, and L, M, N and X to calculate Y. The two calculations are independent, so we can perform them simultaneously. Notice that we CAN'T calculate N and Y in parallel, as Y depends on N. But we can always process, in parallel, the next macroblock on each row along such a 2:1 gradient edge (the diagonal steps two columns for each row down).

So we start in the top left corner of the picture, where there's little opportunity for parallelism, and work our way down, with the batch size increasing for a while and then decreasing again as we approach the bottom right corner. Not ideal, but it could be good enough:

   A B[C]

   A B C[D]

   A B C D[E]
   L M[N]
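
This batching is the classic wavefront pattern: with the dependency set above (left neighbour plus above-left, above and above-right), every macroblock (r, c) with the same value of 2*r + c can go in the same batch, since each dependency has a strictly smaller 2*r + c. A sketch that enumerates the batches and sanity-checks the claim (the dependency set is as described above; adjust it for the real per-stage dependencies):

```python
def wavefronts(rows, cols):
    """Group macroblocks (r, c) into parallel batches keyed by
    w = 2*r + c, the 2:1 diagonal described above."""
    batches = {}
    for r in range(rows):
        for c in range(cols):
            batches.setdefault(2 * r + c, []).append((r, c))
    return [batches[w] for w in sorted(batches)]

def check(batches, rows, cols):
    """Verify every in-frame dependency lands in an earlier batch,
    assuming (r, c) depends on (r, c-1), (r-1, c-1), (r-1, c)
    and (r-1, c+1)."""
    done = set()
    for batch in batches:
        for r, c in batch:
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    assert (nr, nc) in done
        done.update(batch)
    return True
```

Running this on a small grid shows the batch sizes growing and then shrinking again, exactly the ramp-up/ramp-down described above.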

As it turns out, you can actually decompose the filter into a vertical and a horizontal stage for each 16x16 macroblock. If I remember the spec correctly, the vertical (left) edges are filtered first, i.e. a horizontal kernel, but either way this fact might be exploitable for hiding memory latency.

Intra prediction should be possible via a similar scheme, except that it operates on 4x4 sub-blocks within macroblocks, which makes it a bit more complex. You might be able to achieve more parallelism by exploiting the fact that different intra modes don't require access to all 4 neighbouring blocks. To benefit from that, you'll need a scheduling pass across all macroblocks on the CPU first.
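
The scheduling pass might look something like this: map each block's coded mode to the neighbours it actually reads, and only keep the dependency edges that exist. The mode names below follow VP8's 16x16 luma modes, but the per-mode neighbour sets are my reading of the spec and should be double-checked against RFC 6386:

```python
# Which neighbours each 16x16 intra mode reads (my reading of the
# spec -- verify before relying on this).
MODE_DEPS = {
    "DC_PRED": {"above", "left"},
    "V_PRED":  {"above"},
    "H_PRED":  {"left"},
    "TM_PRED": {"above", "left", "above_left"},
}

def block_deps(r, c, mode):
    """Return the (row, col) coordinates this block actually waits on.

    A V_PRED block doesn't need its left neighbour, so it can start as
    soon as the row above is done -- extra parallelism that a fixed
    wavefront schedule would miss."""
    coords = {"above": (r - 1, c), "left": (r, c - 1),
              "above_left": (r - 1, c - 1)}
    return {coords[n] for n in MODE_DEPS[mode]
            if coords[n][0] >= 0 and coords[n][1] >= 0}
```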

The pipeline stage before the intra/MC/loop filter feedback loop is, happily, parallelisable again: the inverse transform is a simple matrix operation and exactly the kind of problem GPUs excel at.
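
Concretely, a separable 4x4 inverse transform is just two small matrix multiplies per block, and every block is independent. Here's the 4x4 Walsh-Hadamard transform as an example (VP8 does use a WHT for the second-order DC coefficients, and a DCT-like integer transform for the rest; this plain-Python version ignores the spec's fixed-point rounding details):

```python
# 4x4 Hadamard matrix: symmetric, and H*H = 4*I.
H = [[1,  1,  1,  1],
     [1,  1, -1, -1],
     [1, -1, -1,  1],
     [1, -1,  1, -1]]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def wht4(block):
    """Forward 4x4 Walsh-Hadamard: coefficients = H * X * H."""
    return matmul(matmul(H, block), H)

def iwht4(coeffs):
    """Inverse: since H*H = 4*I, applying H on both sides again and
    dividing by 16 recovers the block exactly. Two matrix multiplies
    per block, all blocks independent -- ideal GPU work."""
    y = matmul(matmul(H, coeffs), H)
    return [[v // 16 for v in row] for row in y]
```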

It's of course entirely possible that I've made a mistake or that I've missed a parallelisation opportunity, but this is certainly the kind of stuff you have to think about when tackling this type of problem.
