
Open Questions about Generative Adversarial Networks - ml-engineer
https://distill.pub/2019/gan-open-problems/
======
Reelin
Slightly tangential question, but this is yet another piece of work published
on Distill. I keep a library of the papers I've read in Zotero. However, when
it comes to Distill I've run into a problem - I can't figure out a robust
method of archival.

Regular journal articles are published as PDF files, which make a decent
electronic analog of physical paper. In the case of Distill, there's the usual
DOI, volume, and issue, but the page itself isn't static. I did find a
seemingly undocumented feature referenced in a GitHub issue - you can append
"index.archive.html" to the URL to get a supposedly standalone HTML archive of
the page. However, when I tried this with a number of recent publications
there were missing images and broken formatting all over the place, seemingly
due to the authors having linked in external resources (primarily for a number
of the figures).

Does anyone know how to do this or have any ideas?

~~~
ludwigschubert
Hi! I help maintain Distill's infrastructure. We'd love to find a robust
solution for archival, too, but we haven't been able to think of a satisfying
automatic way of providing static fallbacks to interactive diagrams. (That's
why index.archive.html isn't a public feature ^_^)

If you have an idea of what we could do on our side with a fixed amount of
work, please let me know! Unfortunately we can't require our third-party
authors to be experts in graceful degradation. I'm sure better tooling could
help, but currently have no specific inspiration.

As a fallback for your specific use case—we do actually provide a tiny bit of
print styles in our CSS. Try your browser's PDF conversion/print feature; I
find it often produces acceptable results. (Minus proper alignment to pages,
of course. :/)

~~~
Reelin
Really glad to hear you guys are aware of this! I do love the expressiveness
and ease of understanding that interactive diagrams offer, but find the
prospect of a mutable scientific literature to be a somewhat horrifying one. I
understand and completely agree that academic authors cannot reasonably be
required to become experts in web design.

---

From my perspective, the main issue seems to be broken formatting and media
(alignment, hoverable citations and notes, missing or partial images, etc).
The recent activation atlases paper is a particularly good example of the
problem ([https://distill.pub/2019/activation-atlas](https://distill.pub/2019/activation-atlas)).
I don't have time to tally
up all of the differences right now, but briefly just a few big ones so we
have something concrete to refer to here:

* The index.archive.html version shows "Loading..." in place of all the images in the first figure (and most of the other figures as well).

* The Zotero web page snapshot version actually does have the images for the first figure, but that figure still has other issues. Many of the other figures still just show the "Loading..." text though.

* In both cases, text is flowing around the figures in strange and unreadable ways.

* The bibliography for the index.archive.html version is completely broken.

* In both archive cases, uMatrix is showing third party scripts running on the page. I've allowed them, but obviously loading things from remote locations means it isn't really standalone anymore.

---

> we haven't been able to think of a satisfying automatic way of providing
> static fallbacks to interactive diagrams.

Why does it have to be static? Why worry about graceful degradation at all? I
guess I just find myself wondering why there would be two versions of a
scientific paper (i.e. "real" and "archive") in the first place. I don't see
any reason why scientific work needs to be published in a static medium, but
it does seem like it should always be _standalone and immutable_.

To that end, why not require submissions themselves to be fully standalone
entities with absolutely no external dependencies (i.e. review them without
internet access) and then just publish those? Then instead of PDF papers, you
end up with interactive (but standalone) HTML papers that are viewable in any
standards-compliant web browser, scripts and all.

Barring such a drastic solution, I don't have any clever ideas from a
technical standpoint; I'm not a web design expert nor am I familiar with what
tooling you allow authors to make use of in their papers. A fairly blunt
approach might just be to require authors to provide static images as stand-ins
for each figure. It could literally just be a static image of what the
interactive figure initially looks like before you do anything with it. Then
you'd just have the existing formatting issues to sort out, but presumably
those are fairly straightforward.

An aside: Figures don't seem to be numbered in the activation atlases paper I
referenced above. It might be nice to add a requirement to clearly number and
label figures, tables, and other such insets - as far as I know, all the major
academic publications do this. If you're wondering why, at minimum it makes it
_much_ easier to discuss the paper when you have a clear label such as "fig
3a" or "fig 4 panel 2" or whatever to refer to.

------
tasdfqwer0897
Someone on the machine learning reddit asked me this:

> Question: How does copyright work for GAN output? If I input 300,000
> copyright protected photos of celebrities and generate images of new
> celebrities that do not exist, are the generated images public domain or
> would there be copyright issues?

AFAIK, this is not a settled issue, but I'd be really interested to hear what
an actual lawyer thinks about this.

~~~
6gvONxR4sf7o
The main question is whether the produced image is a derivative work.
The interesting requirement is that the derivative work has originality. I
can't tell whether there's a legal definition of originality, or if it's an "I
know it when I see it" kind of thing.

What's so interesting with ML is that there's a dense spectrum between
memorization and originality. I hope in the future people start checking how
much their models are memorizing.

My favorite case is Google's quick facts, like when you google "golden
retriever weight." Other people went out, measured it, wrote it up, and
published that info to the web; Google can extract the info and never direct
traffic to those sites. I still don't know whether I think it's okay.

~~~
genai
Facts are not copyrightable

Link: [http://www.dmlp.org/legal-guide/works-not-covered-copyright#facts](http://www.dmlp.org/legal-guide/works-not-covered-copyright#facts)

~~~
6gvONxR4sf7o
Legal question aside, I still can't decide if I feel like it's dickish.

Even that has a spectrum, from "Which Harry Potter book is first?" to "What is
the first line of Harry Potter?" to "What is the text of the first Harry
Potter book?"

------
tasdfqwer0897
Hey, I wrote this! Happy to answer questions.

~~~
formalsystem
So let's say I'm in a regime where I'd like to train something like a
language-model RNN on a private text dataset. If I train the RNN directly on
the data then I'm essentially leaking private data to the output. Do you know
of any approaches that generate fake data using GANs, train a language model
on that fake data, and still get a good classifier that does well on test
data, while quantifying how much private data is being leaked?

Related: do GANs work well as a data augmentation technique, and can their
exact contribution to how much a model improves be quantified?

EDIT: Added some clarifications

~~~
p1esk
You can train an RNN on encrypted data.

~~~
formalsystem
Sure, but the output would still leak data. E.g. you give the RNN the phrase
"Chase Bank" and it outputs "Sell the stock".

Encrypting the data secures the data loading part but it's not like your model
didn't learn anything.

~~~
p1esk
The output would be encrypted of course. You’d decrypt it on your end. Whoever
hosts the model can’t know what it learned, without breaking your encryption.

~~~
yorwba
Unless you use fully homomorphic encryption (way too expensive for machine
learning), the model can't learn anything without breaking your encryption. So
you fixed the leak only at the cost of making the model completely useless.

~~~
p1esk
Source? Unless the encryption actually destroys information, I don't see why
it would necessarily make the job harder for the RNN.

~~~
yorwba
If you want to learn the function y = f(x), but present x and y as encrypted
data x' and y', the problem becomes finding g such that g(x') = y' =
encrypt(y) = encrypt(f(x)) = encrypt(f(decrypt(x'))), so the model doesn't
just have to learn f, it also has to solve for the decryption and encryption
routines.
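
To make that composition concrete, here's a toy sketch. The Caesar cipher and the choice of f (string reversal) are hypothetical stand-ins, not anything from the thread; the point is only the identity g(x') = encrypt(f(decrypt(x'))):

```python
# Toy illustration of the composition above: a model trained on encrypted
# (x', y') pairs is really being asked to learn g = encrypt . f . decrypt,
# not f itself. The cipher and f here are illustrative stand-ins.

def encrypt(s, shift=3):
    """Caesar cipher over lowercase letters."""
    return "".join(chr((ord(c) - ord("a") + shift) % 26 + ord("a")) for c in s)

def decrypt(s, shift=3):
    """Inverse of encrypt: shift back the other way."""
    return encrypt(s, 26 - shift)

def f(s):
    """The plaintext function we actually care about (here: reversal)."""
    return s[::-1]

def g(s_enc):
    """What a model trained only on encrypted pairs would have to learn."""
    return encrypt(f(decrypt(s_enc)))

x = "hello"
assert decrypt(g(encrypt(x))) == f(x)  # g(x') = encrypt(f(decrypt(x')))
```

For a character-wise cipher like this one, g happens to stay simple; for a realistic cipher, g entangles f with the encryption and decryption routines, which is the whole point of the objection.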

~~~
p1esk
You look at this as a human, but for an RNN this would be the same as learning
to map x to y where both x and y are in a different language. It wouldn't need
to also learn how to translate that language into ours.

There might be encryption schemes which make learning such a "language" harder
for an rnn, but it does not necessarily mean these methods are harder to
break. So it might be possible to find/design an encryption method which is
easy to learn (map x to y), but hard to break. I don't know much about
encryption though, so I might be wrong.

~~~
yorwba
If an encryption scheme is easy to learn, in the sense that whenever f is
learnable so is g, then it is also easy to break.

Proof: You can easily train an RNN to return the n-th letter of its input and
ignore everything else. Take that function to be f_n. Then g_n is the function
that takes the encrypted input and returns the n-th letter, encrypted. If the
encryption scheme is learnable in the above sense, then you can also train an
RNN to perform that task. You can use multiple of those RNNs to extract an
encryption of each individual letter in the input, which is equivalent to
having the input encrypted with a substitution cipher. Substitution ciphers
are so easy to break with frequency analysis that people have been doing it by
hand for more than a thousand years.

Conversely, if you have a secure encryption scheme, then all statistical
regularities in the input should be obfuscated so that they cannot be
exploited by an RNN.
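
The frequency-analysis step at the end of that argument can be sketched in a few lines. The reference letter ordering (and any sample key or text used to try it) are assumptions for illustration only:

```python
import collections

# Minimal frequency-analysis sketch: break a substitution cipher by
# matching ciphertext letters to plaintext letters by frequency rank.

# English letters roughly ordered from most to least frequent (assumed
# reference ordering; real attacks use tables built from large corpora).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def guess_key(ciphertext):
    """Guess a substitution cipher's key by matching frequency ranks."""
    counts = collections.Counter(c for c in ciphertext if c.isalpha())
    cipher_order = [c for c, _ in counts.most_common()]
    return dict(zip(cipher_order, ENGLISH_ORDER))

def apply_key(text, key):
    """Apply a guessed key, leaving unknown characters unchanged."""
    return "".join(key.get(c, c) for c in text)
```

On a short sample only the most frequent letters line up, but with enough ciphertext most of the key falls out and the rest can be fixed by hand, which is why substitution ciphers offer no real protection.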

~~~
p1esk
Good point. I didn't realize the goal of a good cipher is to remove
statistical regularities (i.e. make it indistinguishable from random noise).

------
dmix
Off-topic but I love the design of this blog article. Everyone is copying
Medium these days so it's refreshing to see a smaller-typed and denser design
for once.

