I initially thought about synchronisation with video too (aka lip sync). However, I don't think they are talking about video at all, but merely about whether a difference is detectable (rather than acceptable). I suspect the threshold for lip sync acceptability is a lot higher than what they measured here. I would have thought the threshold was higher than 5ms, but I haven't done any rigorous testing.
Thanks, that's really interesting, especially how video lag is more detectable than audio lag. I recently set up a display with a 130ms lag and this explains why it was so bad before I corrected the audio!