A side-by-side PyTorch comparison of Deep Convolutional GAN (DCGAN) and Wasserstein GAN with Gradient Penalty (WGAN-GP) on the task of anime face generation, demonstrating how WGAN-GP addresses the mode collapse problem inherent in standard GAN training.
| Item | Detail |
|---|---|
| Foundation Paper | Generative Adversarial Nets (Goodfellow et al., NeurIPS 2014) |
| DCGAN Paper | Unsupervised Representation Learning with DCGANs (Radford et al., ICLR 2016) |
| WGAN-GP Paper | Improved Training of Wasserstein GANs (Gulrajani et al., NeurIPS 2017) |
| Dataset | Anime Faces (~21K images) |
| Platform | Kaggle (GPU) |
| Framework | PyTorch |
- The Problem: Mode Collapse
- The Solution: From DCGAN to WGAN-GP
- Architecture Overview
- Notebook Walkthrough
- Training Process
- Results & Analysis
- How to Run
- Requirements
- References
The original GAN framework (Goodfellow et al., 2014) defines a minimax game between two networks:
- Generator (G): Maps random noise z to a fake image, trying to fool the discriminator.
- Discriminator (D): Classifies images as real or fake.
The training objective is:
min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
While theoretically elegant, this formulation suffers from a critical failure mode: mode collapse. This occurs when the generator learns to produce only a small subset of possible outputs — sometimes collapsing to generate nearly identical images regardless of the input noise. The generator "finds a cheat" that consistently fools the discriminator, rather than learning the full data distribution.
- Vanishing gradients: When the discriminator becomes too powerful (which is easy with BCE loss), the gradient signal to the generator vanishes. The generator stops learning or learns erratically.
- JS divergence saturation: The standard GAN loss is related to the Jensen-Shannon divergence. When the real and generated distributions have minimal overlap (common early in training), JS divergence saturates — it provides no useful gradient.
- Non-informative loss: The discriminator loss in standard GANs doesn't meaningfully correlate with sample quality. A low D loss can coexist with terrible generations.
This notebook implements two GAN variants to compare their behaviour:
DCGAN (Radford et al., 2016) improves on the original GAN with architectural guidelines for stable training using convolutional networks:
- Replace pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use BatchNorm in both networks (except D's first layer and G's output layer).
- Use ReLU in the generator, LeakyReLU in the discriminator.
- Use Tanh activation in the generator's output layer.
- Train with BCE loss (standard GAN objective).
DCGAN still uses the standard GAN loss, so it remains susceptible to mode collapse, vanishing gradients, and training instability.
WGAN-GP (Gulrajani et al., 2017) fundamentally changes the training objective by replacing the JS divergence with the Wasserstein-1 (Earth Mover's) distance:
W(P_real, P_fake) = sup_||f||_L≤1 E[f(x)] - E[f(G(z))]
Key changes from standard GANs:
| Aspect | DCGAN (Standard GAN) | WGAN-GP |
|---|---|---|
| D/C role | Discriminator (classifier) | Critic (scores, no classification) |
| D/C output | Sigmoid (probability) | Raw score (no sigmoid) |
| Loss function | BCEWithLogitsLoss | Wasserstein loss: E[C(fake)] - E[C(real)] |
| Constraint | None | Gradient Penalty (λ=10) |
| Normalization | BatchNorm in D | No BatchNorm in Critic |
| Optimizer | Adam (β₁=0.5) | Adam (β₁=0.0, β₂=0.9) |
| Critic iters | 1 D step per G step | 5 Critic steps per G step |
- Meaningful loss metric: The Wasserstein distance provides a smooth, continuous measure of distance between distributions — even when they don't overlap. Gradients never vanish.
- Loss correlates with quality: Unlike BCE-based GANs, the WGAN critic loss directly correlates with sample quality. A decreasing critic loss genuinely means the generator is improving.
- Gradient Penalty (GP): Instead of weight clipping (original WGAN), GP enforces the 1-Lipschitz constraint softly by penalising the critic's gradient norm when it deviates from 1. This avoids capacity underuse from clipping and yields smoother training.
- More critic training: Training the critic 5× per generator step gives the generator a stronger, more informative gradient signal — the critic accurately approximates the Wasserstein distance.
Both DCGAN and WGAN-GP share the same Generator architecture. The difference lies in the discriminator/critic and the training procedure.
z (100×1×1)
│
▼
┌──────────────────────────────────┐
│ ConvTranspose2d(100→512, 4×4) │ → 512×4×4
│ BatchNorm2d + ReLU │
├──────────────────────────────────┤
│ ConvTranspose2d(512→256, 4×4) │ → 256×8×8
│ BatchNorm2d + ReLU │
├──────────────────────────────────┤
│ ConvTranspose2d(256→128, 4×4) │ → 128×16×16
│ BatchNorm2d + ReLU │
├──────────────────────────────────┤
│ ConvTranspose2d(128→64, 4×4) │ → 64×32×32
│ BatchNorm2d + ReLU │
├──────────────────────────────────┤
│ ConvTranspose2d(64→3, 4×4) │ → 3×64×64
│ Tanh │
└──────────────────────────────────┘
│
▼
Fake image (3×64×64), values in [-1, 1]
- All convolutions use `stride=2, padding=1` (except the first: `stride=1, padding=0`), with `bias=False` throughout since BatchNorm handles the bias.
- The Tanh output matches the `[-1, 1]` normalisation applied to real images.
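The generator above can be sketched as a compact PyTorch module. This is an illustrative reconstruction from the diagram, not the notebook's exact class; names like `features` are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the Generator diagram above: five ConvTranspose2d blocks map a
# 100-dim noise vector to a 3x64x64 image in [-1, 1].
class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3, features=64):
        super().__init__()

        def block(c_in, c_out, stride, padding):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride, padding, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.net = nn.Sequential(
            block(latent_dim, features * 8, 1, 0),    # 512 x 4 x 4
            block(features * 8, features * 4, 2, 1),  # 256 x 8 x 8
            block(features * 4, features * 2, 2, 1),  # 128 x 16 x 16
            block(features * 2, features, 2, 1),      # 64 x 32 x 32
            nn.ConvTranspose2d(features, channels, 4, 2, 1, bias=False),
            nn.Tanh(),                                # output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)
```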
Image (3×64×64)
│
▼
┌──────────────────────────────────┐
│ Conv2d(3→64, 4×4, stride 2) │ → 64×32×32
│ LeakyReLU(0.2) │ (no BatchNorm on first layer)
├──────────────────────────────────┤
│ Conv2d(64→128, 4×4, stride 2) │ → 128×16×16
│ BatchNorm2d + LeakyReLU(0.2) │
├──────────────────────────────────┤
│ Conv2d(128→256, 4×4, stride 2) │ → 256×8×8
│ BatchNorm2d + LeakyReLU(0.2) │
├──────────────────────────────────┤
│ Conv2d(256→512, 4×4, stride 2) │ → 512×4×4
│ BatchNorm2d + LeakyReLU(0.2) │
├──────────────────────────────────┤
│ Conv2d(512→1, 4×4, stride 1) │ → 1×1×1
│ (no activation — logits for BCE) │
└──────────────────────────────────┘
│
▼
Scalar logit (real/fake classification trained with BCEWithLogitsLoss)
The Critic has the same convolutional structure as the Discriminator but with two critical differences:
- No BatchNorm — BatchNorm interferes with the gradient penalty by introducing dependencies between samples in a batch, so all `BatchNorm2d` layers are removed.
- No sigmoid or other output activation — the critic produces a raw score rather than a probability. Higher scores mean "more real".
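A minimal sketch of the Critic with both modifications applied (illustrative, not the notebook's exact class):

```python
import torch
import torch.nn as nn

# Critic sketch: the same conv stack as the Discriminator, but with
# BatchNorm removed and no final activation -- the output is a raw score.
class Critic(nn.Module):
    def __init__(self, channels=3, features=64):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, 2, 1, bias=False),
                nn.LeakyReLU(0.2, inplace=True),   # no BatchNorm anywhere
            )

        self.net = nn.Sequential(
            block(channels, features),             # 64 x 32 x 32
            block(features, features * 2),         # 128 x 16 x 16
            block(features * 2, features * 4),     # 256 x 8 x 8
            block(features * 4, features * 8),     # 512 x 4 x 4
            nn.Conv2d(features * 8, 1, 4, 1, 0),   # raw score, no sigmoid
        )

    def forward(self, x):
        return self.net(x).view(-1)                # one scalar score per image
```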
| Component | Parameters |
|---|---|
| Generator | ~3.6M |
| Discriminator (DCGAN) | ~2.8M |
| Critic (WGAN-GP) | ~2.8M |
Following the DCGAN paper, all convolutional weights are initialised from N(0, 0.02) and BatchNorm weights from N(1.0, 0.02) with biases set to 0.
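This initialisation scheme can be sketched as an `apply`-style helper; the name mirrors the notebook's `init_weights()` but the exact implementation details are assumed.

```python
import torch.nn as nn

# DCGAN-prescribed initialisation: Conv weights ~ N(0, 0.02),
# BatchNorm weights ~ N(1.0, 0.02), BatchNorm biases = 0.
def init_weights(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0.0)

# Usage: model.apply(init_weights)
```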
Sets up PyTorch, torchvision, and GPU detection. Seeds fixed at 42 for reproducibility.
| Parameter | Value | Notes |
|---|---|---|
| `LATENT_DIM` | 100 | Size of noise vector z |
| `IMG_SIZE` | 64 | 64×64 output images |
| `CHANNELS` | 3 | RGB |
| `BATCH_SIZE` | 64 | |
| `LR` | 0.0002 | DCGAN learning rate (both G and D) |
| `BETA1` | 0.5 | Adam β₁ for DCGAN |
| `BETA2` | 0.999 | Adam β₂ for DCGAN |
| `NUM_EPOCHS_DCGAN` | 60 | 50 initial + 10 extended |
| `NUM_EPOCHS_WGAN` | 50 | |
| `LAMBDA_GP` | 10 | Gradient penalty weight |
| `CRITIC_ITERS` | 5 | Critic updates per G update |
Loads the Anime Faces dataset via ImageFolder. Images are resized to 64×64, centre-cropped, and normalised to [-1, 1] using Normalize([0.5]*3, [0.5]*3).
Displays a 4×4 grid of real anime faces from the dataset for reference.
init_weights() — applies DCGAN-prescribed weight initialisation to all Conv and BatchNorm layers.
Generator class — 5-layer ConvTranspose2d network mapping 100-dim noise to 3×64×64 images (see Architecture).
Discriminator class — 5-layer Conv2d network with BatchNorm and LeakyReLU. Outputs a scalar logit per image.
Critic class — same architecture as Discriminator but without BatchNorm and without any final activation. Outputs a raw score.
compute_gradient_penalty():
- Samples a random interpolation factor `α ∈ [0, 1]` per sample.
- Creates interpolated images: `x̂ = α·x_real + (1-α)·x_fake`.
- Computes critic scores on the interpolated images.
- Uses `torch.autograd.grad()` to compute gradients of the critic scores w.r.t. the interpolated images.
- Penalty = `E[(‖∇C(x̂)‖₂ - 1)²]` — pushes gradient norms toward 1 (the 1-Lipschitz constraint).
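The steps above can be sketched as a standalone function. This mirrors the notebook's `compute_gradient_penalty()` but is an illustrative reconstruction; `critic` is any module mapping images to raw scalar scores.

```python
import torch

def compute_gradient_penalty(critic, real, fake, device="cpu"):
    # One interpolation factor per sample, broadcast over C, H, W.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    # Gradients of critic scores w.r.t. the interpolated images.
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # the penalty itself must be differentiable
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```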
Instantiates Generator and Discriminator, runs a forward pass with random noise, and verifies output shapes ([B, 3, 64, 64] for G, [B] for D). Prints parameter counts.
Standard GAN training loop:
- D step: BCE loss on real images (labels smoothed to 0.9) + BCE loss on fakes (label 0.0). Uses `autocast` + `GradScaler` for mixed precision.
- G step: BCE loss pushing D(fake) → 1.0.
- Saves sample grids every 5 epochs and checkpoints every 10 epochs.
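One DCGAN training step, sketched without the notebook's mixed-precision wrappers for clarity; `G`, `D`, and the optimizers are assumed to exist, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def dcgan_step(G, D, opt_G, opt_D, real, latent_dim=100):
    b = real.size(0)
    z = torch.randn(b, latent_dim, 1, 1)
    fake = G(z)

    # D step: real labels smoothed to 0.9, fake labels 0.0.
    opt_D.zero_grad()
    loss_D = (
        F.binary_cross_entropy_with_logits(D(real), torch.full((b,), 0.9))
        + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(b))
    )
    loss_D.backward()
    opt_D.step()

    # G step: push D(fake) toward 1.0.
    opt_G.zero_grad()
    loss_G = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```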
Plots G and D losses across all epochs.
Resumes DCGAN training for 10 additional epochs (total: 60).
Saves dcgan_checkpoint_epoch60.pth (full state) and dcgan_generator_final.pth (generator weights only).
Updated loss plot including all 60 epochs.
WGAN-GP training loop:
- Critic step (×5 per batch): Wasserstein loss + gradient penalty. No mixed precision, since the gradient penalty requires full-precision `autograd.grad`.
- G step: minimise `-E[C(G(z))]` (make the critic score fakes highly).
- Saves sample grids and checkpoints every 5 epochs.
Plots G and Critic losses.
Generates 16 images from the same noise vector through both generators, displayed as DCGAN vs WGAN-GP grids.
Generates 64 samples from each model to visually inspect diversity. DCGAN samples tend to be more repetitive; WGAN-GP samples show greater variation in hair colour, eye colour, pose, and background.
Displays epoch-by-epoch sample grids for both models side-by-side, showing how quality evolves over training.
Side-by-side loss plots: DCGAN (left) vs WGAN-GP (right), highlighting the differences in training dynamics.
Saves wgan_generator_final.pth for deployment.
| Aspect | Detail |
|---|---|
| Loss | BCEWithLogitsLoss — standard binary cross-entropy. |
| Optimizer | Adam (lr=2e-4, β₁=0.5, β₂=0.999) for both G and D. |
| Label smoothing | Real labels set to 0.9 instead of 1.0 — a common trick to prevent the discriminator from becoming overconfident. |
| Mixed precision | autocast + GradScaler for faster training. |
| D:G ratio | 1:1 — one D update per G update. |
| Epochs | 60 (50 + 10 extended). |
| Aspect | Detail |
|---|---|
| Loss | Wasserstein loss: C(fake).mean() - C(real).mean() + λ·GP. |
| Optimizer | Adam (lr=1e-4, β₁=0.0, β₂=0.9) — per the WGAN-GP paper. |
| Gradient Penalty | λ = 10. Enforces 1-Lipschitz constraint on the critic. |
| Mixed precision | Not used — gradient penalty requires torch.autograd.grad which needs full precision. |
| Critic:G ratio | 5:1 — five critic updates per generator update. |
| Epochs | 50. |
- Discriminator loss drops quickly then oscillates, often converging to near-zero — indicating D easily distinguishes real from fake.
- Generator loss tends to be erratic — spikes and drops without a clear downward trend.
- This oscillation is characteristic: when D is too strong, G receives vanishing gradients and may collapse.
- Critic loss starts negative and gradually moves toward zero — reflecting the Wasserstein distance decreasing as G improves.
- Generator loss shows a steady downward trend — directly correlating with improving sample quality.
- Training is much more stable — no sudden spikes or oscillations.
Both models produce recognisable anime faces, but with clear qualitative differences:
| Aspect | DCGAN | WGAN-GP |
|---|---|---|
| Overall quality | Decent faces with some artefacts | Sharper, more detailed faces |
| Colour consistency | Generally good | Generally good |
| Facial features | Sometimes blurry or distorted eyes | Clearer eye structure |
| Background | Often noisy | Cleaner backgrounds |
This is the central experiment of the notebook. When generating 64 samples from random noise:
| Observation | DCGAN | WGAN-GP |
|---|---|---|
| Diversity | Tends toward repetitive outputs — similar hair colours, poses, and expressions | Much greater diversity — varied hair, eyes, poses |
| Mode collapse | Shows signs of partial mode collapse | Minimal mode collapse |
| Face variations | Limited variety in generated attributes | Wider range of anime character styles |
Key insight: DCGAN's BCE-based training allows the generator to "cheat" by finding a few modes that reliably fool the discriminator. WGAN-GP's Wasserstein distance measures the full distributional distance, penalising mode-dropping and incentivising the generator to cover more of the data distribution.
| Metric | DCGAN | WGAN-GP |
|---|---|---|
| D/C loss | Quickly saturates near 0 | Gradually approaches 0 |
| G loss | Erratic, oscillating | Smooth, steadily decreasing |
| Loss-quality correlation | Weak — low D loss ≠ good images | Strong — decreasing C loss = better images |
In DCGAN, you cannot tell from the loss curves alone whether training is going well. In WGAN-GP, the critic loss is a reliable proxy for generation quality — this is one of the most important practical advantages of the Wasserstein formulation.
- Gradient Penalty > No Constraint: GP softly enforces the Lipschitz constraint everywhere, providing smooth, informative gradients to the generator at all times.
- No BatchNorm in Critic: removing BatchNorm prevents inter-sample dependencies, ensuring the gradient penalty is computed correctly per sample.
- β₁ = 0: disabling the first moment (momentum) in Adam prevents the critic from "remembering" old gradient directions, keeping it responsive to the generator's evolving distribution.
- 5:1 Critic-Generator Ratio: the well-trained critic provides a much more accurate Wasserstein distance estimate, giving the generator precise gradient signals.
- Wasserstein Distance Properties: unlike JS divergence, the Wasserstein distance is continuous and differentiable even when the distributions are supported on non-overlapping manifolds — the typical case early in training.
| Model | Epochs | Approximate Time |
|---|---|---|
| DCGAN | 60 | ~38 min |
| WGAN-GP | 50 | ~252 min (~4.2 hrs) |
WGAN-GP is ~8× slower per epoch than DCGAN due to: (1) five critic updates per generator step, (2) the gradient penalty computation requiring `torch.autograd.grad`, and (3) no mixed-precision support for the GP computation. This is the computational cost of stable, mode-collapse-resistant training.
- Platform: Upload the notebook to Kaggle and attach the Anime Faces dataset.
- GPU: Enable a GPU accelerator in the Kaggle notebook settings.
- Execute cells in order. DCGAN trains first (~38 min), then WGAN-GP (~4 hrs).
- Comparison cells at the end generate side-by-side outputs for direct visual comparison.
| Library | Purpose |
|---|---|
| `torch`, `torchvision` | Model definition, transforms, DataLoader, `make_grid` |
| `torch.cuda.amp` | Mixed-precision training for DCGAN (`GradScaler`, `autocast`) |
| `torch.autograd` | Gradient penalty computation for WGAN-GP |
| `numpy` | Seed setting, numerical operations |
| `matplotlib` | Visualisation (sample grids, loss curves) |
| `time`, `os` | Timing, file paths |
All dependencies are pre-installed in the default Kaggle Python 3 Docker image.
- Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014. arXiv:1406.2661 — the foundational GAN paper defining the adversarial training framework.
- Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016. arXiv:1511.06434 — DCGAN: architectural guidelines for stable convolutional GAN training.
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein Generative Adversarial Networks. ICML 2017. arXiv:1701.07875 — the original WGAN paper proposing the Wasserstein distance as an alternative to JS divergence.
- Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017. arXiv:1704.00028 — WGAN-GP: replaces weight clipping with gradient penalty for better stability.
This project is for educational purposes (Generative AI course — AI4009 Assignment 01). Feel free to use and adapt with attribution.