Generative Models

Generative Adversarial Networks

Difficulty: Expert

Reading Time: 25 min

Track: Deep Learning

Training a generator and a discriminator in a minimax game to generate highly realistic, synthetic data.

Prerequisites

Neural Networks & Deep Learning

TL;DR

GANs pit a generator against a discriminator in a minimax game; at equilibrium the generator’s distribution matches the real data distribution.
The optimal discriminator for a fixed generator is $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$, and plugging it back in shows that the generator’s objective reduces, up to an additive constant, to minimizing the Jensen-Shannon divergence between $p_{data}$ and $p_g$.
Diffusion models instead define a fixed forward noising process that gradually destroys structure, then train a network to reverse it step by step, denoising pure noise back into data.
GAN training is a fragile saddle-point search (two networks chasing a moving target), while diffusion training is a stable regression problem (predict the noise), at the cost of needing many sampling steps.
Mode collapse, vanishing gradients, and non-convergence are the classic GAN failure modes; diffusion models trade training instability for slow, multi-step sampling.

Learning Objectives

Explain the minimax game formulation of Generative Adversarial Networks
Describe the role of the Generator and the Discriminator
Compare GANs, Variational Autoencoders (VAEs), and Diffusion Models
Identify common training issues such as mode collapse and vanishing gradients

Intuition

How to think conceptually about this topic

Think of the interaction as a game between a counterfeiter and a detective:

Initial Phase: The counterfeiter makes highly unconvincing copies (random noise). The detective easily spots them.
Adversarial Feedback: The counterfeiter receives feedback (their fakes were rejected) and improves their paper and printing methods. The detective also learns to spot newer, more subtle flaws.
Equilibrium: Eventually, the counterfeiter makes flawless bills that are indistinguishable from real currency. The detective has to guess randomly (50% accuracy).

In the ideal equilibrium, generated samples are indistinguishable from real samples, so the discriminator can do no better than guessing. In practice, GANs rarely reach that ideal exactly.
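As a rough numerical check of this equilibrium claim, the sketch below evaluates the optimal discriminator $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$ on a toy three-outcome distribution. The distributions and the helper names (`optimal_discriminator`, `value_at_optimum`, `jsd`) are illustrative assumptions, not part of any library, and the code assumes every outcome has nonzero probability under at least one of the two distributions.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each discrete outcome x."""
    return [p / (p + q) for p, q in zip(p_data, p_g)]

def value_at_optimum(p_data, p_g):
    """V(D*, G) = E_data[log D*(x)] + E_g[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_g)
    real_term = sum(p * math.log(d) for p, d in zip(p_data, d_star) if p > 0)
    fake_term = sum(q * math.log(1 - d) for q, d in zip(p_g, d_star) if q > 0)
    return real_term + fake_term

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]      # toy "real" distribution (assumed)
p_g_bad = [0.2, 0.3, 0.5]     # a generator that has not matched the data
p_g_ideal = list(p_data)      # a generator at the ideal equilibrium

# When p_g equals p_data, D* is 0.5 for every outcome.
print(optimal_discriminator(p_data, p_g_ideal))

# The value at the optimal discriminator equals 2 * JSD - log 4.
print(value_at_optimum(p_data, p_g_bad), 2 * jsd(p_data, p_g_bad) - math.log(4))
```

The two numbers on the last line should agree up to floating-point error, which is the Jensen-Shannon connection mentioned in the TL;DR.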

In Depth

Detailed explanations and context

Generative Adversarial Networks (GANs) are generative models trained through a two-player game between a generator ($G$) and a discriminator ($D$).

The Generator ($G$) maps random latent vectors from $p_z$ into synthetic samples.
The Discriminator ($D$) estimates whether a sample came from the real data distribution $p_{data}$ or from the generator.
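
Written out, the game between these two players is usually stated as the minimax objective below. This is the standard formulation from the original GAN paper; the symbol $V(D, G)$ for the value function is notation introduced here for illustration.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
$$

For a fixed $G$, the integrand at each $x$ has the form $a \log y + b \log(1 - y)$ with $a = p_{data}(x)$ and $b = p_g(x)$, which is maximized at $y = \frac{a}{a + b}$. That gives the optimal discriminator, and substituting it back yields the Jensen-Shannon form:

$$
D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}, \qquad V(D^*, G) = -\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)
$$

Since the Jensen-Shannon divergence is non-negative and zero only when the two distributions are equal, the minimum over $G$ is reached at $p_g = p_{data}$, where $D^*(x) = \frac{1}{2}$ everywhere.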

How It Compares

GAN vs VAE vs Diffusion Model

Dimension	GAN	VAE	Diffusion Model
Training stability	Unstable — adversarial saddle-point search, prone to mode collapse and non-convergence	Stable — single, well-defined loss (reconstruction + KL) being minimized	Stable — simple denoising regression loss at each step
Sample quality (images)	Historically sharpest, most photorealistic samples	Tends to produce blurrier samples due to the reconstruction objective and Gaussian decoder assumption	State-of-the-art fidelity, often surpassing GANs (e.g. Stable Diffusion, Imagen)
Sampling speed	Fast — a single forward pass through $G$	Fast — a single decoder forward pass	Slow — requires many iterative denoising steps (tens to thousands), though distillation/fast samplers narrow this gap
Likelihood estimation	No tractable likelihood; cannot directly score how probable a given sample is	Tractable lower bound on log-likelihood (the ELBO)	Tractable lower bound on log-likelihood via the variational bound over the denoising steps
Latent space structure	Latent space is learned implicitly; interpolation is often smooth but not principled	Explicit, regularized continuous latent space designed for smooth interpolation	No single compact latent code; generation unfolds across the noise trajectory $x_T \to x_0$

Takeaway: Choose GANs when you need fast, single-pass sampling and the sharpest possible images and can tolerate finicky training. Choose VAEs when you need a stable, principled latent space and tractable likelihoods with modest sample quality. Choose diffusion models when sample quality and training stability both matter most and you can afford slower, multi-step sampling.

When to Use It

Reach for this when

You need fast, single-pass sample generation at inference time (e.g. real-time image synthesis) and can invest in stabilizing adversarial training.
You want the sharpest possible generated images/audio and have enough data and compute to tune a GAN variant (StyleGAN, WGAN-GP) carefully.
You need smooth latent-space interpolation or manipulation (e.g. face attribute editing) and a GAN’s implicit latent structure is sufficient.
You are comparing against diffusion or autoregressive baselines and need a strong, well-understood adversarial baseline.

Avoid it when

Training stability and reproducibility matter more than peak sample sharpness — prefer diffusion models or VAEs.
You need a tractable likelihood or principled uncertainty estimates over generated samples — GANs provide neither; use VAEs or diffusion models instead.
You have limited compute/engineering budget to babysit adversarial training (tuning learning rate ratios, avoiding mode collapse) — a diffusion model’s simpler loss is often easier to get working.
Inference-time latency is not critical but sample quality and diversity are paramount — diffusion models are usually the safer default today.

Rules of thumb

If training oscillates or the generator collapses, try a Wasserstein loss with gradient penalty (WGAN-GP) or spectral normalization before tuning learning rates further.
Monitor the discriminator’s accuracy: if it pins to ~100% the generator’s gradients vanish; if it stays near 50% too early, it may be too weak to provide useful signal.
For diffusion models, more timesteps generally improve sample quality but increase sampling cost — use fast samplers (DDIM, distillation) when latency matters.
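
The discriminator-accuracy rule of thumb above can be sketched as a small monitoring helper. The function names and the `high` / `low` cutoffs are hypothetical values chosen for illustration; suitable thresholds depend on the model and dataset.

```python
def discriminator_accuracy(d_real, d_fake, threshold=0.5):
    """Fraction of samples the discriminator classifies correctly.

    d_real: D's outputs (probability of "real") on real samples.
    d_fake: D's outputs (probability of "real") on generated samples.
    """
    correct_real = sum(p > threshold for p in d_real)
    correct_fake = sum(p <= threshold for p in d_fake)
    return (correct_real + correct_fake) / (len(d_real) + len(d_fake))

def diagnose(accuracy, high=0.95, low=0.55):
    # `high` and `low` are hypothetical cutoffs, not established constants.
    if accuracy >= high:
        return "D dominates: generator gradients may be vanishing"
    if accuracy <= low:
        return "D near chance: fine late in training, suspicious early on"
    return "D and G roughly balanced"

# Example batch where D separates real from fake almost perfectly.
acc = discriminator_accuracy(d_real=[0.99, 0.97, 0.98], d_fake=[0.01, 0.03, 0.02])
print(acc, diagnose(acc))
```

In a real training loop the two lists would come from the discriminator's outputs on the current real and generated batches, logged every few hundred steps rather than acted on from a single batch.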

Implementation

Reference code implementation

Python

model_fitting.py

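# Single GAN training step (sketch)
import torch
import torch.nn as nn

# Assumes D and G are PyTorch nn.Modules defined elsewhere, that D ends in a
# sigmoid and returns a tensor of shape (batch, 1), and that optimizer_D and
# optimizer_G are optimizers over D's and G's parameters respectively.
criterion = nn.BCELoss()

def train_step(real_data, noise):
    real_labels = torch.ones(real_data.size(0), 1, device=real_data.device)
    fake_labels = torch.zeros(noise.size(0), 1, device=noise.device)

    # 1. Train Discriminator: real samples -> 1, generated samples -> 0
    optimizer_D.zero_grad()
    d_real_loss = criterion(D(real_data), real_labels)
    fake_data = G(noise)
    # detach() stops gradients from flowing into G during the D update
    d_fake_loss = criterion(D(fake_data.detach()), fake_labels)
    loss_D = d_real_loss + d_fake_loss
    loss_D.backward()
    optimizer_D.step()

    # 2. Train Generator: label generated samples as real so that G is
    #    rewarded when D is fooled
    optimizer_G.zero_grad()
    loss_G = criterion(D(fake_data), torch.ones_like(fake_labels))
    loss_G.backward()
    optimizer_G.step()

    return loss_D.item(), loss_G.item()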
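
The generator step in the code above labels generated samples as real instead of directly minimizing $\log(1 - D(G(z)))$; this is commonly called the non-saturating generator loss. The sketch below suggests why that choice helps, assuming the discriminator applies a sigmoid to a scalar logit `a` (the function and variable names are illustrative). It compares the gradient of each loss with respect to that logit.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the label-flipped generator loss."""
    return -(1.0 - sigmoid(a))

# a is D's logit on a generated sample; very negative means "confidently fake".
for a in (-8.0, -2.0, 0.0, 2.0):
    print(f"logit={a:+.1f}  saturating={grad_saturating(a):+.4f}  "
          f"non_saturating={grad_non_saturating(a):+.4f}")
```

When the discriminator confidently rejects a sample (large negative logit), the saturating gradient shrinks toward zero while the non-saturating one stays close to -1, so the generator keeps receiving a learning signal early in training.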

Strengths & Advantages

Generates sharp, high-fidelity synthetic images and audio, historically sharper than samples from VAEs.
Does not require explicit density modeling or approximating intractable integrals, unlike Variational Autoencoders, which optimize a variational bound.
Latent space interpolation allows smooth blending and semantic manipulation of generated attributes.

Limitations & Drawbacks

Training is unstable and prone to mode collapse, where the generator covers only a few modes of the data distribution and produces near-identical outputs.
Evaluation is indirect; sample quality is often assessed with proxy metrics such as FID or by downstream performance.
Non-convergence: simultaneous gradient updates can cycle or oscillate in parameter space instead of converging to a Nash equilibrium.
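
The non-convergence issue can be reproduced on a toy problem far simpler than a GAN. The sketch below runs simultaneous gradient descent-ascent on $f(x, y) = xy$, where $x$ plays the minimizing role of the generator and $y$ the maximizing role of the discriminator. The function name, learning rate, and starting point are assumptions made for illustration.

```python
import math

def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    radii = [math.hypot(x, y)]  # distance from the equilibrium at (0, 0)
    for _ in range(steps):
        grad_x, grad_y = y, x   # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(math.hypot(x, y))
    return x, y, radii

x, y, radii = simultaneous_gda(x=1.0, y=1.0)
print(f"start radius={radii[0]:.3f}, end radius={radii[-1]:.3f}")
```

The only equilibrium of this game is $(0, 0)$, yet each update multiplies the distance from it by $\sqrt{1 + \text{lr}^2}$, so the iterates spiral outward instead of converging. Real GAN dynamics are more complicated, but this is the basic mechanism behind oscillating losses.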

Real-World Case Studies

StyleGAN: photorealistic synthetic faces

Computer vision / synthetic media

Scenario

NVIDIA researchers wanted a GAN architecture capable of generating high-resolution (1024x1024) human face images with fine, controllable detail (hair strands, pores, freckles) and smoothly disentangled attributes like pose, identity, and style — far beyond what earlier DCGAN-style architectures could produce.

Approach

StyleGAN introduced a mapping network that maps the latent code $z$ to an intermediate latent code $w$, then injects style information at multiple resolutions via adaptive instance normalization (AdaIN). It combined this with progressive growing of the generator and discriminator from low to high resolution during training.
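
For reference, the AdaIN operation as defined in the StyleGAN paper normalizes each feature map $x_i$ and then applies a per-channel scale and bias. The symbols $y_{s,i}$ and $y_{b,i}$ follow the paper's notation and are not used elsewhere in this module; the style $y = (y_s, y_b)$ is produced from $w$ by a learned affine transformation.

$$
\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}
$$

Here $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation of feature map $x_i$, so each layer's statistics are replaced by values controlled by the style.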

Outcome

StyleGAN (and its successor StyleGAN2) produced face images that human viewers often find hard to distinguish from real photographs, popularizing sites like "This Person Does Not Exist." It also enabled fine-grained semantic editing (changing age, expression, or lighting) by manipulating the learned $w$-space, demonstrating that adversarial training can yield not just realism but a structured, controllable latent representation.

Source: A Style-Based Generator Architecture for Generative Adversarial Networks — Karras, T., Laine, S. and Aila, T.

Common Misconceptions

Misconception: The generator and discriminator should both reach 100% accuracy.

Correction: In an ideal equilibrium, the discriminator outputs exactly 0.5 for every sample, meaning it cannot distinguish real from generated data because the generator’s distribution matches the data distribution. This does not mean the generator copies individual training samples.

Misconception: GANs are the only way to generate realistic images.

Correction: Diffusion models (like Stable Diffusion) and autoregressive models (such as Transformer-based models) are now the dominant approaches for image and text generation, in large part because they are more stable to train.

References & Further Reading

Generative Adversarial Networks (textbook)
By Goodfellow, I. et al.
Deep Generative Modeling (textbook)
By Tomczak, J. M.
