A Deep Q‑Learning (DQN) implementation in PyTorch with support for Double DQN, dueling networks, and multiple backbones (MLP, 1D‑CNN, LSTM). The repo includes a training script, a replay script for watching saved policies, video recording, and simple result tracking, and it is designed to be easy to extend to new Gymnasium environments.
- Algorithms: DQN, Double DQN, Dueling architecture (toggle in config)
- Backbones: MLP, 1D‑CNN, LSTM (select via `MODEL_TYPE`)
- Targets: hard copy or soft/Polyak updates
- Loss: Huber (default) or MSE
- Replay buffer: uniform experience replay
- Logging/Artifacts: moving‑average score, reward plot, best checkpoint
- Video: optional training videos; replay videos on demand
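
The switches in the feature list above (Double DQN, hard vs. soft/Polyak targets, Huber vs. MSE loss) combine in the optimization step roughly as sketched below. This is a minimal illustration, not the repo's actual `dqn_agent.py`; the names `policy_net` and `target_net` and the batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def compute_loss(policy_net, target_net, batch, gamma, double_dqn=True):
    # Batch layout is an assumption: tensors of states, actions (int64),
    # rewards, next_states, and a float "done" mask (1.0 where the episode ended).
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        if double_dqn:
            # Double DQN: the policy net selects the next action,
            # the target net evaluates it.
            next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        else:
            # Vanilla DQN: the target net both selects and evaluates.
            next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    # Huber loss by default; swap in F.mse_loss for the 'mse' setting.
    return F.smooth_l1_loss(q_values, targets)

def soft_update(policy_net, target_net, tau):
    # Polyak averaging: target <- tau * policy + (1 - tau) * target
    for tp, pp in zip(target_net.parameters(), policy_net.parameters()):
        tp.data.copy_(tau * pp.data + (1.0 - tau) * tp.data)
```

Decoupling action selection (policy net) from action evaluation (target net) is what reduces the overestimation bias of the max operator in vanilla DQN.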
Use the provided Conda file:
conda env create -f environment.yml
conda activate rl

Note: If you don’t have a GPU, use the CPU‑only variant of the environment or remove the CUDA line from `environment.yml`.
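
A quick sanity check after activating the environment confirms that PyTorch and Gymnasium import cleanly and shows whether a GPU is visible (the env name `rl` comes from the command above):

```python
import torch
import gymnasium as gym

print("CUDA available:", torch.cuda.is_available())  # False is fine on CPU-only setups

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
print("Observation shape:", obs.shape, "| Actions:", env.action_space.n)
env.close()
```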
Deep-Q-Learning-Workbench/
├── config.py # Hyperparameters & switches (env, model, loss, target update, etc.)
├── dqn_agent.py # DQN agent (policy/target nets, action select, optimize step)
├── networks.py # MLP / CNN1D / LSTM (+ dueling variants)
├── replay_buffer.py # Uniform replay buffer
├── utils.py # Seeding, plotting, checkpoint & config save
├── main.py # Training entrypoint
├── replay.py # Load a saved model and replay episodes (optional video)
├── environment.yml # Conda environment spec
└── results/ # Auto‑created run folders (checkpoints, plots, videos)
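
For orientation, the dueling variants mentioned for `networks.py` usually follow the pattern below; this is a sketch with an illustrative class name and layer sizes, not the repo's exact code.

```python
import torch
import torch.nn as nn

class DuelingMLP(nn.Module):
    """Illustrative dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, n_states: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.feature(x)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```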
Each training run creates a timestamped folder under results/:
results/<ENV_NAME>_YYYYmmdd_HHMMSS/
├── best_model.pth
├── hyperparameters.txt
├── reward_plot.png
├── total_rewards.txt
└── videos/ (optional, if enabled)
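
A finished run can be inspected offline. The snippet below re-plots the learning curve and assumes `total_rewards.txt` holds one episode reward per line; adjust the parsing if the actual format differs.

```python
import numpy as np
import matplotlib.pyplot as plt

run_dir = "results/LunarLander-v3_20250914_220000"  # example run folder
rewards = np.loadtxt(f"{run_dir}/total_rewards.txt")

window = 100  # should match MOVING_AVG_WINDOW for comparable curves
moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")

plt.plot(rewards, alpha=0.3, label="episode reward")
plt.plot(range(window - 1, len(rewards)), moving_avg, label=f"{window}-episode average")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.savefig(f"{run_dir}/reward_replot.png")
```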
All switches live in `config.py`. Common ones:

- Environment
  - `ENV_NAME`: `'CartPole-v1'` or `'LunarLander-v3'` (extend with any Gymnasium env)
  - `SEED`: random seed
- Model
  - `MODEL_TYPE`: `'MLP' | 'CNN1D' | 'LSTM'`
- Algorithm
  - `double_dqn`: `True | False`
  - `dueling_network`: `True | False`
  - `LOSS`: `'huber' | 'mse'`
  - `SOFT_UPDATE`: `True | False`; `TAU`: Polyak coefficient
  - `GAMMA`: discount factor
- Optimization
  - `LEARNING_RATE`, `BATCH_SIZE`, `REPLAY_BUFFER_SIZE`, `WARMUP_STEPS`
  - `TARGET_UPDATE_FREQ` (used if `SOFT_UPDATE=False`)
- Exploration
  - `EPSILON_START`, `EPSILON_END`, `EPSILON_DECAY` (see the decay sketch after this list)
- Training
  - `NUM_EPISODES`, `MAX_STEPS_PER_EPISODE`, `MOVING_AVG_WINDOW`, `REPORT_INTERVAL`
- I/O
  - `SAVE_MODEL`: save the best checkpoint by moving‑average score
  - `RECORD_VIDEO`: record training videos every `VIDEO_RECORD_INTERVAL` episodes
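
As noted in the Exploration item, the three epsilon values are commonly combined as a multiplicative decay per episode, clipped at `EPSILON_END`; the exact schedule in `config.py` may differ:

```python
import random

EPSILON_START, EPSILON_END, EPSILON_DECAY = 1.0, 0.01, 0.995  # example values

epsilon = EPSILON_START
for episode in range(3):
    # Epsilon-greedy: explore with probability epsilon, act greedily otherwise.
    explore = random.random() < epsilon
    print(f"episode {episode}: epsilon={epsilon:.3f}, explore={explore}")
    epsilon = max(EPSILON_END, epsilon * EPSILON_DECAY)
```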
- Pick the environment and model in `config.py`.
- Run training: `python main.py`

Artifacts are written to `results/<ENV_NAME>_<timestamp>/`.

Tip: enable or disable training video recording via `RECORD_VIDEO` in `config.py` (see the sketch below).
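
The training-video switch is typically backed by Gymnasium's `RecordVideo` wrapper, roughly as sketched here; the interval and folder are placeholder values, and `main.py` may wire this differently:

```python
import gymnasium as gym

VIDEO_RECORD_INTERVAL = 50  # placeholder; mirrors the switch in config.py

env = gym.make("LunarLander-v3", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(
    env,
    video_folder="results/LunarLander-v3_20250914_220000/videos",  # example run folder
    episode_trigger=lambda ep: ep % VIDEO_RECORD_INTERVAL == 0,
)
```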
Use `replay.py` to watch a saved policy or record MP4s.

Option A: edit the constants at the top of `replay.py`:

- Set `RESULTS_PATH` to a specific run folder, e.g. `results/LunarLander-v3_20250914_220000`
- Toggle `SAVE_VIDEO = True` to write MP4s to `replay_videos/` inside that run
- Run `python replay.py`

Option B (if your version supports CLI flags):

python replay.py \
  --checkpoint results/LunarLander-v3_20250914_220000/best_model.pth \
  --episodes 10 --epsilon 0.0 --record-dir replay_videos

If you replay on a different machine (CPU vs GPU), loading still works thanks to map‑location logic (see the sketch below).
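
The map‑location logic mentioned above boils down to loading the checkpoint onto whichever device is available. A sketch, assuming `best_model.pth` stores a plain `state_dict`:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state_dict = torch.load(
    "results/LunarLander-v3_20250914_220000/best_model.pth",
    map_location=device,
)
# policy_net.load_state_dict(state_dict)  # policy_net must match the saved architecture
```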
- Box2D: `LunarLander-v3` requires Box2D. If pip gives trouble, the Conda package `box2d-py` is included in the environment.
- Video encoding: recording uses Gymnasium’s `RecordVideo` wrapper; if FFmpeg is missing, install `imageio-ffmpeg` (included in the env) or your system’s `ffmpeg`.
- Determinism: RL training is inherently stochastic. Seeding improves repeatability, but slight run-to-run differences are expected.
- Extending to new envs: switch `ENV_NAME`, adjust `n_states`/`n_actions` if you use custom spaces, and you’re good to go (see the sketch below).
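
For the last point, `n_states`/`n_actions` can be read directly off the environment’s spaces before switching `ENV_NAME`; this sketch assumes a flat `Box` observation space and a `Discrete` action space:

```python
import gymnasium as gym

env = gym.make("MountainCar-v0")  # any Gymnasium env id
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n
print(n_states, n_actions)  # 2 3 for MountainCar-v0
env.close()
```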
- Prioritized Experience Replay
- NoisyNets or parameter‑space noise for exploration
- Frame stacking / sequence sampling for the LSTM path
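
As a possible starting point for the NoisyNets item, a factorised‑Gaussian noisy layer (Fortunato et al., 2017) can replace the final linear layers of the value/advantage heads; this is a sketch, not part of the repo:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with factorised Gaussian parameter noise."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.bias_sigma, sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scale(x: torch.Tensor) -> torch.Tensor:
        return x.sign() * x.abs().sqrt()

    def reset_noise(self) -> None:
        # Factorised noise: one noise vector per input, one per output.
        eps_in = self._scale(torch.randn(self.in_features))
        eps_out = self._scale(torch.randn(self.out_features))
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # use the mean parameters at evaluation time
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```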

