In any case, here's a brain dump on the topic in case anyone is interested:
The way to hardware-accelerate video decoding is to start at the back of the pipeline and work forwards; otherwise you spend too much time copying data between CPU and GPU. The last stage is colour space conversion (easy to do on the GPU), and before that is the loop filter/motion compensation/intra prediction feedback loop. That one is a lot harder to parallelise, unfortunately by design.
Intra prediction uses the luma/colour information from up to 4 adjacent, already-decoded blocks to approximate the contents of a block:
A B C ...
X Y ...
Here Y is predicted from A (above-left), B (above), C (above-right) and X (left), so none of those can still be in flight when Y is decoded.
VP8 has 2 different types of loop filter, the 'normal' one and the 'simple' one. The 'normal' loop filter reads up to 4 pixels either side of a block boundary (and modifies up to 2), and since blocks are 4 pixels wide, one edge's reads overlap the next edge's writes, so you again end up with an ordering dependency which is difficult to parallelise. The 'simple' filter doesn't suffer from this problem, if I remember correctly, and is embarrassingly parallel. Unfortunately, it doesn't seem to be widely used (presumably because its visual results are worse).
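To illustrate why a narrow filter is embarrassingly parallel, here's a minimal sketch. The arithmetic in filter_edge is a placeholder of my own, NOT the real VP8 simple filter (RFC 6386 specifies the actual fixed-point maths); the point is purely the access pattern: with reads of 2 pixels and writes of 1 pixel either side of each edge, and edges 4 pixels apart, no edge's reads overlap another edge's writes, so every edge can be filtered independently, in any order.

```python
BLOCK = 4  # block width in pixels

def filter_edge(row, e):
    """Smooth the boundary between row[e-1] and row[e].
    Placeholder maths, not VP8's; reads 2 px either side, writes 1."""
    p1, p0, q0, q1 = row[e - 2], row[e - 1], row[e], row[e + 1]
    # write only the two pixels adjacent to the edge
    row[e - 1] = (p1 + 2 * p0 + q0 + 2) // 4
    row[e]     = (p0 + 2 * q0 + q1 + 2) // 4

def filter_row(row):
    # Each interior block edge reads [e-2, e+1] and writes [e-1, e];
    # with edges BLOCK=4 apart those ranges never collide, so this
    # loop could run in any order, or fully in parallel.
    for e in range(BLOCK, len(row) - 1, BLOCK):
        filter_edge(row, e)
    return row
```

Widen the reads to 4 pixels and the writes to 2, as the 'normal' filter does, and the ranges do collide, which is exactly the ordering dependency described above.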
The motion compensation is easier to parallelise, so it's probably a good idea to get that and the loop filter working first, as that will at least offload the majority of frames to the GPU.
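Motion compensation parallelises well because every block only reads from the already-complete reference frame; blocks of the current frame never depend on each other. A whole-pel-only sketch (the function name is mine, and real VP8 also does sub-pel interpolation with 6-tap filters plus edge clamping, both omitted here):

```python
def predict_block(ref, bx, by, mv, size=4):
    """Copy a size x size block from the reference frame 'ref'
    (a 2D list of rows), displaced by motion vector mv = (mvx, mvy).
    Whole-pel only; no edge clamping."""
    mvx, mvy = mv
    return [ref[by + mvy + y][bx + mvx : bx + mvx + size]
            for y in range(size)]

# Since every block only reads 'ref', all blocks of the current frame
# can be predicted concurrently, one GPU thread (group) per block.
```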
So, we need to parallelise the loop filter. Label the blocks of the picture like this:
A B C D E F ...
L M N O ...
X Y ...
So we start in the top-left corner of the picture, where there's little opportunity for parallelism, and work our way down, with the batch size increasing for a while and then decreasing again as we approach the bottom-right corner. So not ideal, but it could be good enough:
A B C[D]
A B C D[E]
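That schedule can be sketched as a wavefront (function name mine; assuming, conservatively, that each block depends on its left and above neighbours): all blocks on the same anti-diagonal are mutually independent, so each anti-diagonal is one parallel batch, and the batch sizes grow and then shrink exactly as described above.

```python
def wavefront_batches(cols, rows):
    """Group blocks into parallel batches, assuming block (r, c)
    depends on (r, c-1) and (r-1, c). All blocks with equal r + c
    lie on one anti-diagonal and form one batch."""
    batches = [[] for _ in range(cols + rows - 1)]
    for r in range(rows):
        for c in range(cols):
            batches[r + c].append((r, c))
    return batches

# For a 4-wide, 3-tall grid the batch sizes are 1, 2, 3, 3, 2, 1:
# little parallelism at the corners, the most in the middle.
```

If a block also depends on its above-right neighbour (as intra prediction does), the wave index becomes 2*r + c instead of r + c, so each row trails the one above by two blocks rather than one.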
As it turns out, you can actually decompose the filter into a vertical and a horizontal stage for each 16x16 macroblock. I forget which comes first (I think horizontal edges, i.e. a vertical kernel), but this fact might be exploitable for hiding memory latency.
Intra prediction should be possible via a similar scheme, except that it operates on blocks within macroblocks, which makes it a bit more complex. You might be able to achieve more parallelism by exploiting the fact that different intra modes don't require access to all 4 neighbouring blocks. To benefit from that, you'll need a scheduling pass across all macroblocks on the CPU first.
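A sketch of that CPU-side scheduling pass. The mode-to-neighbour table is illustrative rather than VP8's full mode list (RFC 6386 has the real one), but it captures the idea: a purely vertical predictor only needs the block above, a horizontal one only the block to the left, so the earliest wave a block can run in is one more than the latest wave among the neighbours its mode actually uses.

```python
# Which already-decoded neighbours each (illustrative) intra mode reads.
# Offsets are (dr, dc) relative to the block being predicted.
MODE_DEPS = {
    "DC": [(0, -1), (-1, 0)],            # left + above
    "V":  [(-1, 0)],                     # above only
    "H":  [(0, -1)],                     # left only
    "TM": [(0, -1), (-1, 0), (-1, -1)],  # left + above + above-left
}

def schedule(modes):
    """modes: 2D list of per-block mode names, in raster order.
    Returns a 2D list of wave numbers; blocks sharing a wave number
    have no dependency on each other and can run in parallel."""
    rows, cols = len(modes), len(modes[0])
    wave = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = [(r + dr, c + dc)
                    for dr, dc in MODE_DEPS[modes[r][c]]
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            wave[r][c] = 1 + max((wave[i][j] for i, j in deps),
                                 default=-1)
    return wave
```

For example, a row of "V" blocks all lands in the same wave (they only look upwards), whereas a row of "H" blocks is forced into consecutive waves, one per block.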
The pipeline stage before the intra/MC/loop filter feedback loop is annoyingly parallelisable again: the inverse transform is a simple matrix operation and exactly the kind of problem GPUs excel at.
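To show the shape of that problem, here's a sketch using an orthonormal 4-point DCT basis as a stand-in for VP8's actual fixed-point 4x4 transform (again, RFC 6386 defines the real one): each coefficient block becomes residual pixels via two small matrix multiplies, with no dependency whatsoever between blocks.

```python
import math

N = 4
# C[k][n]: orthonormal 4-point DCT-II basis, so C * C^T = identity.
# Stand-in for VP8's integer transform; the structure is what matters.
C = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def forward_transform(pixels):
    """coeffs = C * pixels * C^T"""
    return matmul(C, matmul(pixels, transpose(C)))

def inverse_transform(coeffs):
    """residual = C^T * coeffs * C — one row pass, one column pass."""
    return matmul(transpose(C), matmul(coeffs, C))
```

On a GPU you'd batch the 4x4 multiplies for every block of the frame into a single launch, which is why this stage is such a natural fit.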
It's of course entirely possible that I've made a mistake or that I've missed a parallelisation opportunity, but this is certainly the kind of stuff you have to think about when tackling this type of problem.