
StyleGAN2 Distillation for Feed-Forward Image Manipulation - lelf
https://github.com/EvgenyKashin/stylegan2-distillation
======
Uberphallus
Ah, repos without code, my favourites!

~~~
suyash
Yes, this could have been a gist or a blog post instead. I'd love to play with
the code / model if anyone can point to it. Nice work btw, some of the
generated images are so good that it was funny!

------
sillysaurusx
I recently had an idea for training StyleGAN without a discriminator:

StyleGAN2 FFHQ 1024x1024 is somewhat limited. In many cases, 256x256 would be
fine. It would generate 16x faster, which is important for e.g. video
thumbnails. But you can't generate 256x256 images with stylegan FFHQ. You can
only generate 1024x1024, then downsample.

Suppose you wanted to train a StyleGAN 256x256 on FFHQ. (FFHQ is just "what
people normally think of as stylegan" -- the human faces model.) You can do
this now, and in fact we recently did a 512x512 model in <24 hours on a TPU
pod:
[https://github.com/sheminusminus/modelbattle](https://github.com/sheminusminus/modelbattle)

The problem is the latent space. You know how you can change your age, gender,
etc with stylegan "sliders"? Those are directions discovered in stylegan's
latent space. You train a classifier to predict "older" or "younger" faces,
then classify ~50k images, and fit a straight line through the latent space.
Now you have an age direction that works on any photo.
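For concreteness, here's a minimal sketch of that fitting step (the file names
and the ~50k latents/labels are assumptions, not anything from a real repo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: ~50k sampled latents and the age classifier's
# verdict on the image each one generates (1 = older, 0 = younger).
latents = np.load("latents.npy")    # shape (50000, 512)
labels = np.load("age_labels.npy")  # shape (50000,)

# "Fit a straight line": a linear separator whose normal vector is
# the age direction in latent space.
clf = LogisticRegression(max_iter=1000).fit(latents, labels)
age_direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

def age_slider(z, amount):
    # Move any latent along the direction to make the face look
    # older (amount > 0) or younger (amount < 0).
    return z + amount * age_direction
```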

If you retrain the model from scratch, every single direction you discovered
in 1024x1024 stylegan is immediately invalid. Bye bye, effort. And yes, you
could rediscover the directions in your new model. But now if you want N
resolutions (and perhaps M different aspect ratios of each resolution?) you'll
need to keep track of NxM sets of latent directions, one for each model. This
is a pain, to say the least.

The training process is modified as follows. Generate a random latent, Z. Pass
this latent through your new model, and through the original StyleGAN
1024x1024 model. Minimize the perceptual distance between the images.

Done.
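A rough sketch of that loop in PyTorch, assuming `teacher` is the frozen
pretrained 1024x1024 generator, `student` is the new 256x256 one, and the
`lpips` package supplies the perceptual distance (all of these names are
placeholders, not code anyone has published):

```python
import torch
import torch.nn.functional as F
import lpips

# teacher: frozen pretrained 1024x1024 generator; student: new 256x256
# generator. Both are placeholder nn.Modules, loaded elsewhere.
perceptual = lpips.LPIPS(net="vgg").cuda()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=2e-3)
batch_size, num_steps = 16, 200_000  # arbitrary choices

for step in range(num_steps):
    z = torch.randn(batch_size, 512, device="cuda")  # random latent Z
    with torch.no_grad():
        target = teacher(z)  # 1024x1024 image from the original model
        target = F.interpolate(target, size=256, mode="bilinear",
                               align_corners=False)  # downsample to match
    pred = student(z)  # 256x256 image from the new model, same latent
    loss = perceptual(pred, target).mean()  # minimize perceptual distance
    opt.zero_grad()
    loss.backward()
    opt.step()
```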

Yes, it's really that simple. And somewhat shockingly, I've run it by a few 4+
year veterans and they can't really see why it wouldn't work, nor can they
think of other projects that have already done that.

There are a few close ones, but I'm not trying to claim this is _novel_, just
that it's _effective_. You'll end up with a 256x256 FFHQ model with identical
latent structure. All the directions on 1024x1024 will work fine on your new
model. At least, I think this is what will happen – it's hard to predict
anything in ML.

There are a bunch of interesting variations on this, too. You can train two
models simultaneously, on two different datasets. Then at each training step,
force the weights of both models to average together, i.e. force them to learn
both objectives simultaneously. What will happen? Who knows? Something will
happen, and it's sure to be interesting.
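The averaging step could be as simple as something like this (hypothetical
names; assumes both models share an architecture):

```python
import torch

@torch.no_grad()
def average_weights(model_a, model_b):
    # Call after each training step: snap both models onto the mean of
    # their parameters, so each must keep satisfying both objectives.
    for pa, pb in zip(model_a.parameters(), model_b.parameters()):
        mean = (pa + pb) / 2
        pa.copy_(mean)
        pb.copy_(mean)
```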

Another neat variation would be to train a single model, conditioned on two
different datasets. Suppose you have an animefaces stylegan model and an FFHQ
model. You can do the training process described above, but pass the latent
through both models, then add a label to the latent (0 = animefaces, 1 = ffhq)
and minimize the perceptual distance to each class, in a single model. You
should end up with a final model that can generate both animefaces and ffhq
faces just by toggling the labels.

This isn't quite the same thing as simply training on two different datasets
and using labels to classify them. You're memorizing the latent structure of
two existing models. When you set label to 0, you're saying "the rest of the
latent vector is identical to the animefaces model". When label is 1, ditto,
but for ffhq.
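A sketch of what that label-conditioned distillation might look like, again
in PyTorch with placeholder names (`anime_teacher`, `ffhq_teacher`, and a
`student` that takes a 513-dim latent are all assumptions):

```python
import torch
import lpips

# anime_teacher, ffhq_teacher: frozen pretrained generators (assumed to
# output the same resolution); student: one generator taking a 512-dim
# latent plus a 1-dim label. All three are placeholders.
perceptual = lpips.LPIPS(net="vgg").cuda()
opt = torch.optim.Adam(student.parameters(), lr=2e-3)
batch_size, num_steps = 16, 200_000  # arbitrary choices

for step in range(num_steps):
    z = torch.randn(batch_size, 512, device="cuda")
    label = torch.randint(0, 2, (batch_size, 1), device="cuda").float()
    z_cond = torch.cat([z, label], dim=1)  # 0 = animefaces, 1 = ffhq
    pred = student(z_cond)
    with torch.no_grad():
        # Each sample's target comes from whichever teacher its label
        # selects, but with the *same* underlying latent z.
        target = torch.where(label[:, :, None, None].bool(),
                             ffhq_teacher(z), anime_teacher(z))
    loss = perceptual(pred, target).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```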

So the neat question is, what happens when you try to interpolate label 0 to
label 1? Something will happen. And it's sure to be interesting. Not
necessarily useful, but interesting.

~~~
svantana
I think one of the main reasons GANs are used for images is that perceptual
distance is not well defined; the discriminator can tease out whatever
statistical difference exists between the real and fake data (e.g.
blurriness), thus in practice learning a decent distance function. What
perceptual distance function did you have in mind?

~~~
sillysaurusx
Peter Baylies’ encoder is the best I’ve seen:
[https://github.com/pbaylies/stylegan-encoder](https://github.com/pbaylies/stylegan-encoder)

(He has a great twitter account too:
[https://twitter.com/pbaylies](https://twitter.com/pbaylies))

It’s incredible. It’s far more stable than the projector that ships with
stylegan2. I think it was originally Puzer’s work, and pbaylies honed it to
perfection. He has a neat “latent estimate” too, where a resnet model predicts
a closer latent starting point to an arbitrary photo. This cuts down on
fitting time considerably.
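To be clear, the following is just a sketch of that two-stage idea, not
pbaylies’ actual code; `latent_resnet` and `generator` are hypothetical:

```python
import torch
import lpips

perceptual = lpips.LPIPS(net="vgg").cuda()

def encode(target_image, generator, latent_resnet, steps=200):
    # Stage 1: a resnet regresses an initial latent near the target,
    # instead of starting the search from scratch.
    w = latent_resnet(target_image).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=0.01)
    # Stage 2: refine the latent by gradient descent on the perceptual
    # distance between the rendered face and the target photo.
    for _ in range(steps):
        loss = perceptual(generator(w), target_image).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```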

The theory is that the encoder is so good you can match faces identically. And
in fact, the readme in this submission shows Harry Potter’s face, so it’s
certainly good enough to provide a very close match to existing images.

Random tangent, but it’s also possible in practice to get close to Dota 2 hero
icons:
[https://twitter.com/theshawwn/status/1182208124117307392?s=2...](https://twitter.com/theshawwn/status/1182208124117307392?s=21)

Unfortunately, it seems there isn’t a fully automatic solution for that yet;
those were generated by me carefully tuning various
latent sliders and using pbaylies’ encoder to coax the face closer to the
target. You can see the workflow here:
[https://twitter.com/theshawwn/status/1185278057063632897?s=2...](https://twitter.com/theshawwn/status/1185278057063632897?s=21)

Point being, you can get an identical match with the encoder for images it was
trained on, or you can get pretty close to arbitrary targets by giving
artistic control to humans. Perhaps one day an ML model will learn that
mapping automatically. There is some impressive work in that direction
floating around.

Ultimately, you’re probably right that this might fail to learn smooth
interpolations in latent space for one reason or another. But it seems worth
implementing and seeing what the failure mode is, if any.

------
hardmaru
Should check that the repo has code before posting...

