
A survey of comics research in computer science - blopeur
https://arxiv.org/abs/1804.05490
======
egypturnash
The main thing I took away from this survey is that the state of the datasets
researchers pass around is kind of terrible. None of them are very large. The
biggest one looks to be a pile of manga of dubious copyright status.

I find myself wondering if it might be worth spending some resources reaching
out to independent comics creators to get copies of their source files, which
will often have consistent methods of organization and labeling across
hundreds of pages, as well as machine-readable text, and building a new
dataset or expanding existing ones. Some time parsing the proprietary formats
these are mostly going to be in (largely Photoshop, Illustrator, and Manga
Studio) and/or pulling data out via the program’s scripting/plug-in interfaces
might prove worthwhile.

Reach out to “collectives” like Hiveworks or Spiderforest; you can find a lot
of artists working at a pro level that way.

~~~
gwern
> The main thing I took away from this survey is that the state of the
> datasets researchers pass around is kind of terrible. None of them are very
> large. The biggest one looks to be a pile of manga of dubious copyright
> status.

That's true of most of the big datasets. You don't really think ImageNet
negotiated licenses with all 10 million+ owners of the images, do you? They
were all ripped from Google Images based on search queries. Yes, there's an
'implied license' for anyone posting images online but 1. a large fraction,
probably the majority, of the posters don't own the copyright in the first
place; and 2. that implied license almost certainly doesn't extend to
unlimited redistribution forever to everyone as part of a ML dataset. Likewise
MS COCO, Visual Genome, WebVision... (I remember Google released a giant photo
dataset a while ago - not the photos, just the URLs! You were expected to
download the images yourself.)

To a first approximation, IP law is so maximalist that the answer to 'can I do
X' is always 'no'. I used to occasionally do copyright clearance for photos
for WP, and it was a nightmare. A single photo would take months for all the
back and forth, and that didn't even involve money.

> I find myself wondering if it might be worth spending some resources
> reaching out to independent comics creators to get copies of their source
> files, which will often have consistent methods of organization and labeling
> across hundreds of pages, as well as machine-readable text, and building a
> new dataset or expanding existing ones.

They will charge too much, refuse to license appropriately, take months to
decide, ignore you, be confused about what they are agreeing to, or not have
copies. The data will be very messy, if you can convert the formats to a
common format at all (necessary if you hope anyone will ever use it). And after
spending a ton of money and time, you'll have a dataset a few orders of
magnitude smaller (and proportionately much less diverse) than Jin's Getchu,
COMICS, nico-opendata, Danbooru2017, 'Quick, Draw!', etc. NN OCR is also
pretty good these days, so the benefit of the raw source files might not be so
great either, and it won't have the stroke-level data of 'Quick, Draw!'.

