Style transfer commonly hallucinates objects or adds artifacts. Making something look like a painting is actually much easier than making it look realistic: in a painting, artifacts are often tolerated as 'artistic', while in a photo they immediately pop out. Style-transfer approaches are also rarely temporally stable. Have a look at the comparisons with state-of-the-art image-to-image translation (CUT, TSIT) and photo style transfer (WCT2): in the GTA/Cityscapes case shown here, most of these methods put trees in the sky and/or flicker. The use of G-buffers also allows a much deeper and more robust translation than style-transfer approaches, which only work from the images themselves.
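To make the G-buffer point concrete, here is a minimal PyTorch sketch (not the paper's actual architecture; the channel counts and layer sizes are made up) of the difference between an image-only translation network and one conditioned on auxiliary G-buffers: the buffers are simply concatenated as extra input channels, giving the network per-pixel scene information that a pure style-transfer model never sees.

  # Minimal PyTorch sketch (not the authors' actual architecture; channel
  # counts are made up). An image-only network sees just RGB, while the
  # G-buffer-conditioned one gets depth/normals/IDs concatenated as extra
  # input channels.
  import torch
  import torch.nn as nn

  class TranslationNet(nn.Module):
      def __init__(self, in_channels):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
              nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
              nn.Conv2d(64, 3, 3, padding=1),  # enhanced RGB output
          )

      def forward(self, x):
          return self.net(x)

  rgb = torch.rand(1, 3, 256, 256)       # rendered game frame
  gbuffers = torch.rand(1, 8, 256, 256)  # e.g. depth, normals, albedo, IDs

  image_only = TranslationNet(in_channels=3)        # style-transfer-like input
  gbuffer_cond = TranslationNet(in_channels=3 + 8)  # G-buffer-conditioned input

  out_a = image_only(rgb)
  out_b = gbuffer_cond(torch.cat([rgb, gbuffers], dim=1))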


Thank you, this is really interesting.


There's also a version trained on Mapillary Vistas, which has images from all around the world. Results from that are at the bottom of the page and at the end of the video ( https://youtu.be/P1IcaBn3ej0?t=462 )


To me that version is much more reflective of the actual potential of this method than "GTA Düsseldorf dash cam". Watching the images produced with that one, I get the impression that the sense of realism is boosted by the dashcam-style image degradations, which could largely be emulated in-engine (a green LUT, plus blur and glow filters to emulate lens issues).
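As a rough illustration of that kind of in-engine emulation (a Pillow/NumPy sketch only; the tint factors, blur radii, and blend weight are guesses, not anything taken from the paper): a green-shifted tint, a mild blur, and a bloom-style glow already go a long way toward a dashcam look.

  # Rough Pillow/NumPy sketch of dashcam-style degradation; filter radii,
  # tint factors, and the blend weight are guesses, not measured values.
  import numpy as np
  from PIL import Image, ImageFilter

  def dashcam_degrade(img):
      # Green-shifted "LUT": nudge the color balance toward green.
      arr = np.asarray(img.convert("RGB")).astype(np.float32)
      arr[..., 1] *= 1.08  # boost green
      arr[..., 0] *= 0.96  # pull back red
      tinted = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

      # Mild blur to mimic a cheap, slightly soft lens.
      blurred = tinted.filter(ImageFilter.GaussianBlur(radius=1.2))

      # Glow: heavily blur a copy and screen-blend it back on top.
      glow = blurred.filter(ImageFilter.GaussianBlur(radius=6))
      a = np.asarray(blurred).astype(np.float32) / 255.0
      b = np.asarray(glow).astype(np.float32) / 255.0
      screened = 1.0 - (1.0 - a) * (1.0 - 0.4 * b)  # damped screen blend
      return Image.fromarray((screened * 255).astype(np.uint8))

  # degraded = dashcam_degrade(Image.open("frame.png"))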


The first two images are from real-world datasets, where someone drove around a city, took pictures, and then labeled all the pictures manually. That usually takes 60-90 minutes per image, because you have no information other than the picture itself (depth data from lidar or stereo is much sparser and does not help much with fine-grained outlining of objects). If you had an algorithm that could do this perfectly, you would not need such datasets; their purpose is to serve as training data for object detectors and the like. The problem is that modern algorithms (e.g. CNNs) need tons of data to train on (the more the better), but that training data is extremely costly if you need an hour per image.

Now they also create a dataset, but instead of recording and labeling the real world, they take images from GTA and use extracted mesh/texture/shader ids to automatically label all objects in an image.

However, the game does not provide any of these 'rendering resource to object class' associations by default (at least not at the level where they intercept the game/GPU communication), so someone still has to make this annotation in the first place. That is where the 'magic wand' tool comes in: a human is still annotating, but the effort drops to about 7 seconds per image, nearly three orders of magnitude less than the conventional way of creating such datasets.
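A hypothetical sketch of how such ID-based annotation could work (the function names and class IDs here are invented, not the authors' actual tool): a human assigns a class to the resource rendered at a clicked pixel, and every pixel that shares that resource ID, in every frame, inherits the label for free.

  # Hypothetical sketch of ID-based labeling (not the authors' actual tool).
  # Assumes the interception layer gives us, per frame, a buffer holding the
  # mesh/texture/shader resource ID that produced each pixel.
  import numpy as np

  resource_to_class = {}  # filled interactively by 'magic wand' clicks

  def magic_wand_click(id_buffer, x, y, class_label):
      """Label whatever resource was rendered at pixel (x, y)."""
      resource_to_class[id_buffer[y, x]] = class_label

  def labels_for_frame(id_buffer, unknown=255):
      """Turn a per-pixel resource-ID buffer into a per-pixel class map."""
      out = np.full(id_buffer.shape, unknown, dtype=np.uint8)
      for rid, cls in resource_to_class.items():
          out[id_buffer == rid] = cls
      return out

  # A handful of clicks label the shared resources once...
  # magic_wand_click(frame0_ids, 120, 340, class_label=0)  # e.g. 'road'
  # ...and every frame that reuses those resources is then labeled for free:
  # seg = labels_for_frame(frame1_ids)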

