There's something very uncanny-valley about that video. I can't decide if it's the smoothness of the shading on the textures or the way the parallax on the buildings is sometimes just a tiny bit off. I don't generally get motion sickness from VR, but I feel like this would cause it.
You’ll find this is true of all NeRFs if you spend time playing around with them. If a NeRF is trying to render part of an object that wasn’t observed in the input images, it’s going to look strange, since it’s ultimately just guessing at the appearance. The NVidia example in the link has the benefit of focusing on a single entity that’s centered in all of the input photographs - the effect is much more pronounced in large-scale scenes with tons of objects, like the Waymo one. You can still see some of this distortion in the NVidia one - pay close attention to the backside of the woman’s left shoulder. You’ll see a faint haze or blur near her shoulder - the input images didn’t contain a clear shot of it from multiple angles, so the model has to guess when rendering it.
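To make the "guessing" a bit more concrete: the renderer just composites whatever density and color the network predicts at sample points along each camera ray. Here's a minimal numpy sketch of that standard compositing step (ray sampling and the network itself are omitted, and the function name is mine). In regions no training view ever covered, nothing in the loss constrains those predictions, so the haze you see is unconstrained extrapolation being composited like everything else.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite one ray using the standard NeRF volume rendering weights.

    sigmas: (N,)   predicted densities at the sample points along the ray
    colors: (N, 3) predicted RGB at those sample points
    deltas: (N,)   spacing between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # transmittance up to each sample
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # composited RGB for the ray
```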
I know that when doing typical 2D video-based rotoscoping, it's possible to use frames from before/after the current frame to recover data that's blocked in the current frame. It's also common in restoration when removing scratches, hair in the gate, etc. (rough sketch of the idea below).
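As a rough illustration of that trick, here's a sketch of a temporal median fill, assuming the neighboring frames have already been motion-aligned to the damaged one (the function and parameter names are mine):

```python
import numpy as np

def temporal_patch_fill(frames, t, mask, radius=2):
    """Fill masked pixels in frame t using a median over nearby frames.

    frames: list of (H, W, 3) uint8 arrays, assumed already motion-aligned
    t:      index of the damaged frame
    mask:   (H, W) boolean array, True where a scratch/hair obscures the frame
    radius: how many neighboring frames to draw from on each side
    """
    lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
    neighbors = [frames[i] for i in range(lo, hi) if i != t]
    stack = np.stack(neighbors).astype(np.float32)
    median = np.median(stack, axis=0)            # per-pixel temporal median
    out = frames[t].astype(np.float32)
    out[mask] = median[mask]                     # only replace the damaged pixels
    return out.astype(np.uint8)
```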
To that end, I wonder if exporting a similar bit of video along that same path as stills would be enough to generate the 3D version.
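For what it's worth, that's roughly how these datasets tend to get made: dump frames from the video, then run them through something like COLMAP to recover camera poses before training. A rough sketch of the frame-dumping step with OpenCV (the function name and sampling rate are just placeholders):

```python
import cv2

def dump_frames(video_path, out_dir, every_n=10):
    """Export every Nth frame of a video as stills for photogrammetry/NeRF input."""
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```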