- Team Members
- Project Overview
- Environment Analysis
- Implementation
- Project Structure
- Installation
- Usage
- Results
- Models and Architectures
- Contributing
| First Name | Last Name | Student ID | Email |
|---|---|---|---|
| Antonis | Zikas | 1115202100038 | sdi2100038@di.uoa.gr |
| Panagiotis | Papapostolou | 1115202100142 | sdi2100142@di.uoa.gr |
This project implements and compares several Reinforcement Learning algorithms for training agents on the CartPole-v1 environment from Gymnasium (the maintained successor to OpenAI Gym). The main focus is on Deep Q-Network (DQN) implementations with different architectural variations, alongside a comparative analysis against other RL algorithms.
- 🧠 Multiple DQN Implementations: Standard DQN, Dueling Architecture, and Transformer-based Q-Networks
- 📊 Comprehensive Analysis: Performance comparison with random actions and sensitivity studies
- 🔧 Modular Design: Clean, well-documented code structure
- 📈 Visualization: Detailed plotting and analysis of training results
- 🎮 Environment Testing: Baseline performance analysis with random actions
- 🏆 State-of-the-art Algorithms: Integration with Stable-Baselines3 (PPO, A2C)
The CartPole-v1 environment is a classic control problem where the goal is to balance a pole on a cart by moving the cart left or right.
- Action space: discrete, 2 possible actions
  - 0: Move cart to the left
  - 1: Move cart to the right
- Observation space: continuous, 4-dimensional state vector
  - [0]: Cart Position (range: -4.8 to 4.8)
  - [1]: Cart Velocity (range: -∞ to +∞)
  - [2]: Pole Angle (range: ~-0.418 to 0.418 radians)
  - [3]: Pole Angular Velocity (range: -∞ to +∞)
- Reward: +1 for each timestep the pole remains upright
- Maximum return: 500, because episodes are truncated after 500 steps (Gymnasium's registered reward threshold for CartPole-v1 is 475)
- Termination: the episode ends when the pole angle goes beyond ±12° or the cart position goes beyond ±2.4
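The termination and reward rules above can be sketched in plain Python. This is only an illustration of the stated thresholds, not code from this repository or from Gymnasium; the names `is_terminated` and `episode_return` are hypothetical.

```python
import math

# Thresholds taken from the description above (assumed, not read from Gymnasium).
X_LIMIT = 2.4                     # cart position limit
THETA_LIMIT = math.radians(12.0)  # pole angle limit, about 0.2095 rad


def is_terminated(state):
    """Return True if a 4-element state violates a termination limit."""
    cart_pos, _cart_vel, pole_angle, _pole_ang_vel = state
    return abs(cart_pos) > X_LIMIT or abs(pole_angle) > THETA_LIMIT


def episode_return(next_states, max_steps=500):
    """Sum +1 per step until a terminating state or the step limit."""
    total = 0
    for state in next_states[:max_steps]:
        total += 1  # the step that reaches a terminating state is still rewarded
        if is_terminated(state):
            break
    return total
```

Note that the termination angle (±12°, about 0.2095 rad) is half the observation range for the pole angle (about ±0.418 rad), so an episode ends well before the observation bound is reached.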
- Deep Q-Network (DQN)
  - Experience replay buffer
  - Target network for stable learning
  - ε-greedy exploration strategy
- Dueling Architecture DQN
  - Separate value and advantage streams
  - Improved learning efficiency
- Transformer-based Q-Network
  - Sequential state processing
  - Attention mechanism for temporal dependencies
- Stable-Baselines3 Integration
  - Proximal Policy Optimization (PPO)
  - Advantage Actor-Critic (A2C)
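To make the role of the target network concrete, the standard one-step DQN target can be sketched as follows. This is a minimal illustration in plain Python rather than the repository's training code; `td_target` is a hypothetical helper, and `next_q_values` stands for the target network's Q-values for the next state.

```python
GAMMA = 0.99  # discount factor, same value as GAMMA in src/utils.py


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step DQN target: r + gamma * max_a' Q_target(s', a').

    For a terminal transition there is no future value, so the target is r.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

The online network is then regressed toward this target, while the target network that produced `next_q_values` is only refreshed periodically, which keeps the regression target from shifting at every update.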
- Neural Networks: Fully connected layers with ReLU activations
- Replay Buffer: Experience replay for stable training
- Target Networks: Periodic updates for learning stability
- Exploration Strategy: ε-greedy with exponential decay
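The replay buffer and the ε-greedy strategy listed above can be sketched with the standard library alone. The names `ReplayBuffer`, `decay_epsilon`, and `select_action` are hypothetical, and the multiplicative per-call decay is an assumption; the actual code in `src/replay_buffers.py` and `src/agents.py` may differ.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def decay_epsilon(epsilon, decay=0.995, minimum=0.01):
    """Multiply epsilon by the decay rate, never going below the minimum."""
    return max(minimum, epsilon * decay)


def select_action(q_values, epsilon):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A typical loop would call `select_action` at every step, `push` the resulting transition, and call `decay_epsilon` once per episode (or per step, depending on the schedule chosen).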
Reinforcement-Learning-Assignment/
│
├── notebooks/
│ └── cart_pole.ipynb # Main Jupyter notebook with experiments
│
├── src/
│ ├── agents.py # DQN Agent implementation
│ ├── networks.py # Neural network architectures
│ ├── trainers.py # Training logic and utilities
│ ├── replay_buffers.py # Experience replay buffer
│ ├── testing.py # Model testing and evaluation
│ ├── plotting.py # Visualization utilities
│ ├── utils.py # Helper functions and hyperparameters
│ ├── dqn.py # Main DQN training script
│ ├── env_showcase.py # Environment demonstration
│ ├── stable_baselines_a2c.py # A2C training with Stable-Baselines3
│ └── stable_baselines_ppo.py # PPO training with Stable-Baselines3
│
├── models/ # Saved trained models
│ ├── dqn_model.pth
│ ├── dueling_arc_dqn_model.pth
│ ├── transformer_model.pth
│ ├── ppo_*.pth
│ └── a2c_*.pth
│
├── reports/
│ ├── figs/ # Generated plots and visualizations
│ └── PDFs/ # Final report documents
│
├── logs/
│ └── tensorboard/ # TensorBoard logging for training metrics
│
├── assets/
│ └── imgs/ # Images and diagrams
│
├── docs/ # Assignment documentation
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8+
- CUDA-compatible GPU (optional, for faster training)
- Clone the repository

      git clone <repository-url>
      cd Reinforcement-Learning-Assignment

- Create virtual environment (recommended)

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies

      pip install -r requirements.txt

Key dependencies:
- torch: Deep learning framework
- gymnasium: RL environments (successor to OpenAI Gym)
- stable-baselines3: State-of-the-art RL algorithms
- matplotlib: Plotting and visualization
- numpy: Numerical computations
- jupyter: Interactive notebooks
- Run Environment Showcase

      python src/env_showcase.py

- Train DQN Agent

      python src/dqn.py

- Train with Stable-Baselines3

      python src/stable_baselines_ppo.py  # For PPO
      python src/stable_baselines_a2c.py  # For A2C

- Interactive Analysis

      jupyter notebook notebooks/cart_pole.ipynb

Key hyperparameters are defined in src/utils.py:
GAMMA = 0.99 # Discount factor
LR = 1e-3 # Learning rate
BATCH_SIZE = 64 # Minibatch size
MEMORY_SIZE = 10000 # Replay buffer size
EPSILON_START = 1.0 # Starting exploration probability
EPSILON_END = 0.01 # Minimum exploration probability
EPSILON_DECAY = 0.995 # Epsilon decay rate
TARGET_UPDATE = 10 # Target network update frequency

| Algorithm | Average Score | Success Rate | Training Episodes |
|---|---|---|---|
| Random Actions | ~22 | ~10% | N/A |
| DQN | ~475+ | ~95%+ | 500 |
| Dueling DQN | ~480+ | ~96%+ | 500 |
| Transformer DQN | ~450+ | ~90%+ | 500 |
| PPO (Stable-Baselines3) | ~500 | ~99% | Variable |
| A2C (Stable-Baselines3) | ~495+ | ~98% | Variable |
- ✅ All implemented algorithms significantly outperform random actions
- ✅ Dueling architecture shows slight improvement over standard DQN
- ✅ Stable-Baselines3 implementations achieve near-optimal performance
- ✅ Transformer-based approach shows promise but requires tuning
Standard DQN: Input Layer (4 nodes) → Hidden Layer (128) → Hidden Layer (128) → Hidden Layer (128) → Output Layer (2 nodes)
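As a rough size check, the number of trainable parameters in the fully connected layout above (4 → 128 → 128 → 128 → 2) can be worked out directly. The sketch assumes every layer has a bias term, which the description does not state; `count_mlp_parameters` is a hypothetical helper.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Layer widths from the architecture line: input, three hidden layers, output.
DQN_LAYERS = [4, 128, 128, 128, 2]
DQN_PARAMETERS = count_mlp_parameters(DQN_LAYERS)  # 640 + 16512 + 16512 + 258
```

Under that assumption the network has 33,922 parameters, small enough to train comfortably on a CPU.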
Dueling DQN:
- Shared layers: 4 → 128 → 128 → 128 → 128
- Value stream: 128 → 64 → 1
- Advantage stream: 128 → 64 → 2
- Combination: Q(s,a) = V(s) + (A(s,a) - mean(A(s,·)))
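The combination rule Q(s,a) = V(s) + (A(s,a) - mean(A(s,·))) can be illustrated on plain numbers. This is a minimal sketch of the formula only, not the network code in `src/networks.py`; `dueling_q_values` is a hypothetical name.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage makes the decomposition identifiable:
    the resulting Q-values always average to the state value.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]
```

For example, with V(s) = 1.0 and advantages [2.0, 4.0], the mean advantage is 3.0, giving Q-values [0.0, 2.0], whose mean is again 1.0.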
Transformer Q-Network:
- Sequence length: 10 timesteps
- Embedding dimension: 64
- Attention heads: 4
- Encoder layers: 2
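A sequence model needs the last 10 states as input rather than a single observation. One way to maintain such a window is sketched below; the class name `StateWindow` and the choice to left-pad with zero vectors at the start of an episode are assumptions, not details taken from this repository.

```python
from collections import deque

SEQ_LEN = 10   # sequence length listed above
STATE_DIM = 4  # CartPole observation size


class StateWindow:
    """Keeps the most recent states and returns a fixed-length sequence."""

    def __init__(self, seq_len=SEQ_LEN, state_dim=STATE_DIM):
        self.seq_len = seq_len
        self.state_dim = state_dim
        self.buffer = deque(maxlen=seq_len)

    def reset(self, first_state):
        """Start a new episode with only its first observation."""
        self.buffer.clear()
        self.buffer.append(tuple(first_state))

    def append(self, state):
        """Add the newest state; the oldest is dropped once the window is full."""
        self.buffer.append(tuple(state))

    def as_sequence(self):
        """Return seq_len states, zero-padded on the left (assumed scheme)."""
        missing = self.seq_len - len(self.buffer)
        padding = [(0.0,) * self.state_dim] * missing
        return padding + list(self.buffer)
```

The returned list has shape (10, 4) and could be converted to a tensor before being passed to the encoder.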
The project includes comprehensive visualization tools:
- Training Progress: Score and epsilon decay over episodes
- Performance Comparison: Trained agents vs. random actions
- Sensitivity Analysis: Hyperparameter impact studies
- TensorBoard Integration: Real-time training metrics
This is an academic project for coursework. The implementation follows best practices for:
- Code Organization: Modular, well-documented structure
- Reproducibility: Seed setting for consistent results
- Experimentation: Comprehensive sensitivity studies
- Visualization: Clear, informative plots and metrics
This project is part of the coursework for Reinforcement Learning & Stochastic Games
National and Kapodistrian University of Athens
Department of Informatics and Telecommunications