People are extremely sensitive to subtleties in mouth articulation, which facial landmark tracking tends to have trouble capturing. I question whether a single keyframe plus facial landmarks is enough to generate convincing lip sync or gaze. I suspect this is why most of the samples in the video are muted, a trick commonly used by facial performance capture researchers to hide poor lip sync results.