-
Best-in-class GAN for artificially synthesizing natural-looking images by means of Machine Learning
-
There are approx. 50 pre-trained StyleGAN2 domain models available publicly
-
Each StyleGAN2 domain model can synthesize endless variations of images within the respective domain
-
Invented by NVIDIA in 2020; see teaser video & source code
-
New Version "Alias-Free GAN" expected in September 2021
-
Examples from afhqwild.pkl and ffhq.pkl model data (generated with my own implementation):
Legal information: All content of "StyleGAN2 in a Nutshell" is © 2021 HANS ROETTGER. You may use parts of it in your own publications if you mark them clearly with "source: [email protected]"
-
First invented at Université de Montréal in 2014 (see evolution of GANs)
-
Generative Network: A Generator synthesizes fake objects similar to a Training Set (domain)
-
Adversarial: the Generator and a Discriminator component compete with each other and hence both become better and better with each optimization step of the learning process
-
Even though the Generator & the Discriminator synthesize & assess totally randomly in the beginning, the Generator will eventually have learned to synthesize fake objects that can hardly be distinguished from the Training Set.
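A minimal sketch of one adversarial optimization step in TensorFlow 2 (the tiny dense Generator/Discriminator, the latent size 512, the batch size and the binary cross-entropy loss are illustrative placeholders, not the StyleGAN2 networks):

```python
import tensorflow as tf

# Placeholder networks - real StyleGAN2 uses convolutional, style-based architectures.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(64 * 64 * 3), tf.keras.layers.Reshape((64, 64, 3))])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(128, activation="relu"), tf.keras.layers.Dense(1)])

g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images, batch_size=16):
    z = tf.random.normal([batch_size, 512])            # random input vectors z
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(z, training=True)            # Generator synthesizes fakes
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fakes, training=True)
        # Discriminator: classify real images as 1, fakes as 0
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the Discriminator into outputting 1 for fakes
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```

Repeating this step over the whole Training Set again and again is what drives both components to improve together.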
- The StyleGAN2 Generator synthesizes outputs based on a small input vector z "modulation" (typical size: 2048 Bytes)
- Different values of z generate different outputs
- Therefore z could be interpreted as a very compressed representation of the synthesized output
- For almost all natural images there exists a generating z
- Most important: similar z generate similar output objects! Accordingly, a linear combination of two input vectors z1 & z2 results in an output object in between the outputs generated by z1 & z2 (see the sketch below)
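A minimal sketch of that interpolation property (the `generator` call is a hypothetical placeholder for whichever pre-trained model is loaded; the latent size 512 follows from the 2048-byte figure above):

```python
import numpy as np

z1 = np.random.randn(512).astype(np.float32)   # first input vector
z2 = np.random.randn(512).astype(np.float32)   # second input vector

# Linear combinations of z1 and z2: t=0 reproduces z1's output, t=1 reproduces z2's,
# and values in between yield outputs "in between" the two.
for t in np.linspace(0.0, 1.0, 5):
    z = (1.0 - t) * z1 + t * z2
    # image = generator(z[np.newaxis, :])      # hypothetical pre-trained generator
```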
-
Needs approx. 10,000 images in the Training Set
-
Takes a few days of computing time
-
Video below: learning process based on 11,300 fashion images fed 200 times (= "epochs") to the Discriminator, resulting in more than 2 million optimization steps for the Discriminator and the Generator. The video shows the generated fake images after each learning epoch. The 90 images in each video frame are linear combinations of the four input vectors z in the four corners (sketched below). Remember: similar z generate similar outputs!
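The 90 latent vectors of one video frame can be reproduced as bilinear combinations of the four corner vectors; a small sketch (the 9x10 grid layout and the latent size 512 are assumptions):

```python
import numpy as np

corners = np.random.randn(4, 512).astype(np.float32)   # z vectors in the four corners
rows, cols = 9, 10                                      # assumed layout of the 90-image grid

grid = []
for r in np.linspace(0.0, 1.0, rows):
    for c in np.linspace(0.0, 1.0, cols):
        # bilinear combination of top-left, top-right, bottom-left, bottom-right corner
        z = ((1 - r) * (1 - c) * corners[0] + (1 - r) * c * corners[1] +
             r * (1 - c) * corners[2] + r * c * corners[3])
        grid.append(z)
# Each of the 90 vectors in `grid` is fed to the Generator to render one grid cell.
```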
The StyleGAN2 Generator has no clue WHAT it is synthesizing (no hidden 3D geometrical model, no part/whole relationships, no lighting models, no nothing). It is just adding and removing dabs of paint, starting at a very coarse resolution and adding finer dabs in consecutive layers - similar to the wet-on-wet painting technique of Bob Ross.
A 64x64xRGBA image generator with 5 layers that has learned to generate emojis. All output layers have been normalized to visualize the full information contained (in reality the values have different ranges). Another example with 6 layers: 256x256xRGB
-
The output image is synthesized as the sum of the consecutive image layers L
-
Each image layer L is a projection P of a higher dimensional data space into the desired number of image channels C
-
The higher dimensional data space for each L is generated by convolution filters C1, C2 from its predecessor
-
The input vector w "modulates" each convolution filter and the projection! (w is a mapped version of z to achieve an equal distribution in image space)
-
Additionally, noise is added in each layer to generate more image variations; a simplified layer sketch follows below. Final result of applying noise to different layers:
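Putting the bullets above together, a heavily simplified sketch of one synthesis layer and the layer sum (the channel counts, the fixed 0.1 noise scale, leaky ReLU, nearest-neighbour upsampling, and the omission of weight demodulation, learned noise strengths and the mapping network are all simplifications, not NVIDIA's implementation):

```python
import tensorflow as tf

def modulate(x, w, style_layer):
    """Modulation: scale each input channel of the next convolution with a style
    derived from w (simplified; StyleGAN2 scales the filter weights and demodulates)."""
    style = style_layer(w)                              # [batch, channels]
    return x * style[:, tf.newaxis, tf.newaxis, :]

def synthesis_layer(x, w, rgb_sum, channels, image_channels=3):
    """One layer: two modulated convolutions C1, C2 with added noise, plus a modulated
    projection P into the image channels that is added to the running image sum."""
    for _ in range(2):                                   # convolution filters C1, C2
        style = tf.keras.layers.Dense(x.shape[-1])
        conv = tf.keras.layers.Conv2D(channels, 3, padding="same")
        x = conv(modulate(x, w, style))
        noise = tf.random.normal(tf.shape(x)[:3])[..., tf.newaxis]
        x = tf.nn.leaky_relu(x + 0.1 * noise)            # per-layer noise for variation
    to_rgb = tf.keras.layers.Conv2D(image_channels, 1)   # projection P
    rgb_sum = rgb_sum + to_rgb(modulate(x, w, tf.keras.layers.Dense(x.shape[-1])))
    return x, rgb_sum

# Output image = sum of the consecutive per-layer projections, coarse to fine:
w = tf.random.normal([1, 512])                           # mapped latent (mapping net omitted)
x = tf.random.normal([1, 4, 4, 256])                     # stands in for the learned constant input
rgb = tf.zeros([1, 4, 4, 3])
for channels in [256, 128, 64]:                          # consecutive layers
    x = tf.keras.layers.UpSampling2D()(x)
    rgb = tf.keras.layers.UpSampling2D()(rgb)
    x, rgb = synthesis_layer(x, w, rgb, channels)
# `rgb` now holds the synthesized image as the sum of all image layers L.
```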
- Reimplemented StyleGAN2 from scratch to understand how it works and to eliminate some flaws of NVIDIA's reference implementation
- NVIDIA implementation flaws: outdated Tensorflow version, square RGB images only, proprietary & non-transparent dnnlib, mode collapse tendency, bad CPU inference
- Own Implementation: slightly slower, needs much more GPU memory, but other flaws eliminated!
- Starting point: collected and applied the pre-trained domain models available on public websites
- Training of own StyleGAN2 domain models needs a high amount of computational resources and well-prepared Training Sets (at least 5000 images; StyleGAN2-ADA claims to work with 2000 images)
- Training Sets need to have a lot of variation within but should not be too diverse. Spatial alignment is also critical, since neural networks do not cope well with translations. The upcoming "Alias-Free GAN" claims to deal better with translations and rotations
Good sources for Training Sets: catalog images and consecutive frames from video clips!
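A sketch of harvesting spatially aligned training images from a video clip (OpenCV-based; the file names and the 64x64 target size are just examples):

```python
import os
import cv2  # pip install opencv-python

os.makedirs("dataset", exist_ok=True)
cap = cv2.VideoCapture("clip.mp4")               # example source clip
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    side = min(h, w)
    # center crop + fixed resize keeps all samples spatially aligned
    crop = frame[(h - side) // 2:(h + side) // 2, (w - side) // 2:(w + side) // 2]
    crop = cv2.resize(crop, (64, 64), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join("dataset", f"{index:06d}.png"), crop)
    index += 1
cap.release()
```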
-
Emoji Model (64x64xRGBA, Training Set size: 6620 source)
-
Fashion Model (128x64xRGB, Training Set size: 11379 source)
-
Rain Drops Model (64x128xGray, 3800 images from own video clip)
Too much diversity in the Training Set.
-
Stamp Model (48x80xRGB, 11491 images from various stamp catalogs). In this example StyleGAN learned that a stamp has perforations, a small border and typical color schemes, but the motifs were too diverse to be learned:
-
Tensorflow (Google) vs. PyTorch (Facebook); ML Abstraction Layer Keras
-
Use GPU acceleration!
- 10x Computational Power in float32 (standard for ML)
- BUT slower than the CPU in float64 (scientific applications); see the timing sketch after this list
- 10x faster memory access. Get as much GPU memory as possible!
- A slower GPU just needs more time, but if the ML network does not fit into GPU memory, you can't use it at all.
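A quick way to measure the float32 vs. float64 gap on your own hardware (the matrix size, repetition count and device names are arbitrary choices; requires a GPU-enabled TensorFlow build):

```python
import time
import tensorflow as tf

def bench(device, dtype, n=4096, reps=10):
    with tf.device(device):
        a = tf.random.uniform([n, n], dtype=dtype)
        b = tf.random.uniform([n, n], dtype=dtype)
        tf.linalg.matmul(a, b)                    # warm-up
        start = time.time()
        for _ in range(reps):
            c = tf.linalg.matmul(a, b)
        _ = c.numpy()                             # force async GPU work to finish
        return (time.time() - start) / reps

for device in ["/CPU:0", "/GPU:0"]:
    for dtype in [tf.float32, tf.float64]:
        print(device, dtype.name, f"{bench(device, dtype):.3f} s per matmul")
```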
-
My development environment
- Ryzen 5600x (64 GB) + RTX 3060 (12 GB)
- Tensorflow 2 in Jupyter Notebooks running in Docker Container
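Inside the container, a quick sanity check that TensorFlow actually sees the GPU:

```python
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))   # should list the RTX 3060
print(tf.test.is_built_with_cuda())             # True if the TF build has CUDA support
```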