
Playing for Data: Ground Truth from Computer Games - L_
http://download.visinf.tu-darmstadt.de/data/from_games/
======
lucb1e
Watching the video, I'm not sure what I'm looking at. On the left are a number
of buttons with objects; on the right, a cursor colors the corresponding
objects in the image. It looks very human in behavior but tells me nothing
about what is happening here.

It also claims they intercept communication with the GPU, but none of that is
visible in the demo. It looks like a magic-wand tool (like GIMP's) that
selects pixels of similar color, except for video and much slower.

And finally, the times mentioned: the first two images took an hour or more to
label, the third seven minutes. I'm guessing that's their innovation, but I'm
wondering what object-recognition program takes more than a few seconds to
process a frame in the first place. They mention being 'pixel perfect', but
any object recognition would be, given that it can recognize each object in
the image and thereby classify each part of it.

~~~
L_
The first two images are from real-world datasets, where someone drove around
a city, took pictures, and then labeled all the pictures manually. That
usually takes 60-90 minutes per image, because you have no information other
than the picture itself (depth data from lidar or stereo is much sparser and
does not help much with fine-grained outlining of objects). If you had an
algorithm that could do this perfectly, you would not need this kind of
dataset. The purpose of these datasets is to serve as training data for
object detectors and the like. The problem is that modern algorithms (e.g.
CNNs) need tons of data to train (the more the better), but that training
data is extremely costly if you need an hour per image.

Now they also create a dataset, but instead of recording and labeling the real
world, they take images from GTA and use extracted mesh/texture/shader IDs to
automatically label all objects in an image.

However, the game does not provide any of these 'rendering resource to object
class' associations by default (at least not at the level where they intercept
the game/GPU communication), so someone has to create this annotation in the
first place. That is the 'magic wand' tool: a human is still annotating, but
the effort is reduced by nearly three orders of magnitude (7 seconds per
image) compared to the conventional way of creating these datasets.
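
To make that concrete, here is a toy sketch of the automatic part (all ids and
class names are invented): once a human has mapped a rendering-resource id to
a class with the magic-wand tool, every pixel in every frame that uses that
resource is labeled for free.

```python
import numpy as np

# Per-pixel resource-id buffer, as captured by intercepting the
# game/GPU communication (values here are made up).
id_buffer = np.array([
    [7, 7, 3],
    [7, 3, 3],
    [5, 5, 5],
])

# One-time human annotation: resource id -> semantic class.
id_to_class = {7: "car", 3: "road", 5: "vegetation"}

# The automatic step: labels propagate to every pixel (and every
# other frame) that references an already-annotated resource.
label_image = np.empty(id_buffer.shape, dtype=object)
for resource_id, cls in id_to_class.items():
    label_image[id_buffer == resource_id] = cls
```

The point of the sketch is the amortization: the dictionary is filled in once
per resource, while the loop runs for free on every captured frame.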

------
stevebmark
The language used on this page is surprisingly poor. Can someone explain what
this paper is actually demonstrating? I assumed that PhD holders knew the
rules for writing paper abstracts, but this abstract doesn't follow any of
them.

~~~
Macuyiko
It's actually pretty simple and clever. Constructing labeled imagery costs a
lot of time and effort. I assume the current approach is to have a bunch of
humans (undergrads, most likely) go through every image and label (color)
them: these pixels are trees, these are cars. It's probably relatively
error-prone as well.

The authors propose to just use <some open world game> to take a huge number
of images. Since we're talking about a game, the computer has a perfect
internal representation of entities, and hence of the things that can be
considered cars, trees, streets, etc. We can thus obtain, for every frame, not
only an image that looks close to the real world, but also one that is
already labelled.
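
As a toy illustration of 'labelled for free' (all numbers and class names
invented): because the engine knows which entity it draws where, it can emit
a per-pixel class map alongside each rendered frame at no extra labeling cost.

```python
import numpy as np

H, W = 4, 6
# The label "frame", initialized to a background class.
label = np.full((H, W), "sky", dtype=object)

# Engine-side scene description: class plus a screen-space box
# (row0, row1, col0, col1). All values are made up.
entities = [
    ("road", (2, 4, 0, 6)),
    ("car",  (1, 3, 1, 3)),
]

# Painter's order: later entities overwrite earlier ones, just as the
# renderer resolves which object is in front.
for cls, (r0, r1, c0, c1) in entities:
    label[r0:r1, c0:c1] = cls
```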

Why is this helpful? To train computer-vision models such as the ones used in
self-driving cars. Of course, the assumption here is that the imagery obtained
from a game is close enough to the real world that a trained model would
continue to work on real imagery. I haven't read the paper in full, but the
authors' experiments show that this is the case. They still use some original
imagery though, so perhaps it's not possible to use game imagery alone. I
also don't think an experiment was performed to see whether this method would
still hold up with games that have older, worse-looking engines (it would be
interesting to see whether deep models could still generalize to the real
world from those).

Finally, the authors spend a lot of hacky effort forcing the game to output
labelled images. As others have suggested here, they probably would have been
better off contacting some mod authors (who could probably whip this up in a
day) or even the game developer itself (though I don't think Rockstar would
be particularly interested in collaborating on this).

------
socialist_coder
Seems like they should have written a shader / rendering mod that did all of
this in real time... I thought that was what their solution was going to be,
but they're still doing it semi-manually per image with that annotation tool.

~~~
microcolonel
Yeah, I found that very strange, too. And you'd think they would at least
propagate object types from texture image bindings (assuming GTA V doesn't use
virtual texturing, though that could be worked around as well).
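
A minimal sketch of that texture-binding idea (all names hypothetical): hook
the bind call, track the currently bound texture, and let every draw call
inherit the class that was assigned, once, to that texture.

```python
texture_class = {}      # texture handle -> semantic class, annotated once
_bound_texture = None   # state tracked by the hypothetical bind hook

def on_bind_texture(handle):
    """Called whenever the game binds a texture for subsequent draws."""
    global _bound_texture
    _bound_texture = handle

def class_for_draw():
    """Every pixel written by the current draw inherits this class."""
    return texture_class.get(_bound_texture, "unlabeled")

texture_class[0x2A] = "car"   # one-time human annotation
on_bind_texture(0x2A)
```

Unannotated or virtually-textured surfaces would fall through to
"unlabeled" and still need the manual tool.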

------
KidComputer
Seems like building a GTA-style simulator in UE4 or Unity would be a better
solution in the long run than hacking GPU resources like an aimbot developer.

