The forward movement also helps here: the foreground objects grow in size (relative to the background) over time, so there's even less need for interpolation. If the movement were primarily lateral (or reversed relative to the original image), I imagine the algorithm would have a much harder time producing good results.
EDIT: After skimming the paper, it appears the algorithm automatically chooses these best-case virtual camera movements:
> This system provides a fully automatic solution where the start- and end-view of the virtual camera path are automatically determined so as to minimize the amount of disocclusion.
That is pretty impressive. I had originally assumed the paths were cherry-picked by humans, so it's cool that the paths themselves are chosen automatically (and that the algorithm matches the intuitive best case in most scenes). The results are still slightly misleading, though: the paper mentions that user-defined movements can be used instead, and of course those are likely to suffer significantly when they don't match the optimal path the algorithm would have chosen.
The last example in the full results video illustrates the problem of excessive background interpolation under lateral movement: http://sniklaus.com/papers/kenburns-results
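To make the disocclusion intuition concrete, here's a toy sketch (my own, not from the paper): splat every pixel of a single RGB-D image along a candidate camera move and count the holes left behind. The scene, the parallax model, and the parameter values are all made up for illustration.

```python
import numpy as np

def disocclusion_fraction(depth, dx=0.0, zoom=0.0):
    """Fraction of target pixels left uncovered after a toy
    depth-based reprojection: splat each source pixel to its
    shifted location; any pixel nothing lands on is a hole
    that would need to be inpainted."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    disparity = 1.0 / depth                  # nearer pixels move more
    cx, cy = w / 2.0, h / 2.0
    # lateral shift and a forward "zoom" both scale with disparity
    nx = xs + dx * disparity + zoom * (xs - cx) * disparity
    ny = ys + zoom * (ys - cy) * disparity
    covered = np.zeros((h, w), dtype=bool)
    xi = np.clip(np.round(nx).astype(int), 0, w - 1)
    yi = np.clip(np.round(ny).astype(int), 0, h - 1)
    covered[yi, xi] = True
    return 1.0 - covered.mean()

# toy scene: one near object (depth 1) against a far background (depth 10)
depth = np.full((64, 64), 10.0)
depth[20:44, 20:44] = 1.0

lateral = disocclusion_fraction(depth, dx=8.0)    # sideways dolly
forward = disocclusion_fraction(depth, zoom=0.1)  # pushing in
print(lateral, forward)
```

In this toy setup the lateral move uncovers a whole strip of background behind the foreground object, while pushing in mostly just stretches the foreground over nearby pixels, so the forward move leaves a noticeably smaller fraction of holes, which matches the intuition above about why forward paths are the easy case.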