A clean implementation of Commutative Q-Learning (CQ-L) — the two-step Bellman update from Clark et al. (TMLR submission) that exploits action commutativity in MDPs to speed up Q-learning and DQN convergence.
Two actions a and b are commutative under state s if executing [a, b] and [b, a] from s yields the same expected next-state distribution and reward (Def. 4.1 of the paper). When this holds, the standard Bellman update wastes samples — CQ-L recovers them via a second, gated update.
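For deterministic dynamics, the definition above reduces to a simple rollout check. The sketch below is illustrative only: `commutes`, `add_step`, and `mixed_step` are hypothetical names that are not part of this repo, and comparing the undiscounted two-step reward sum is an assumption about how "same reward" is measured.

```python
import math
from typing import Callable, Tuple

# Hypothetical deterministic transition function: (state, action) -> (next_state, reward).
Step = Callable[[int, int], Tuple[int, float]]


def commutes(step: Step, s: int, a: int, b: int) -> bool:
    """Return True if [a, b] and [b, a] from s end in the same state with the same total reward."""
    s_a, r_a = step(s, a)
    s_ab, r_ab = step(s_a, b)
    s_b, r_b = step(s, b)
    s_ba, r_ba = step(s_b, a)
    return s_ab == s_ba and math.isclose(r_a + r_ab, r_b + r_ba)


def add_step(s: int, a: int) -> Tuple[int, float]:
    # Toy dynamics: action a adds (a + 1) to the running sum at a fixed cost.
    return s + a + 1, -1.0


def mixed_step(s: int, a: int) -> Tuple[int, float]:
    # Toy dynamics: action 0 adds one, action 1 doubles; order matters here.
    return (s + 1, -1.0) if a == 0 else (s * 2, -1.0)


print(commutes(add_step, 0, 0, 2))    # additions commute
print(commutes(mixed_step, 1, 0, 1))  # add-then-double differs from double-then-add
```

Stochastic domains would need to compare next-state distributions and expected rewards rather than single rollouts.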
The CQ-L update (Eq. 3):
(i) Q(s', a') ← (1-α) Q(s', a') + α · (r' + γ max_a'' Q(s'', a''))
(ii) Q(s, a') ← (1-α) Q(s, a') + α · max(Q(s, a'), r + γ Q(s', a'))
Step (ii) only fires when:
- `env.is_commutative(prev_state, prev_action, state, action)` returns `True`, and
- the update would raise `Q(s, a')` (the informative gate).
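The two steps and the gate can be sketched for the tabular case as follows. This is a minimal illustration, not the code in `agents/cql.py`: the function name `cql_update` and its argument names are hypothetical, terminal-state handling is omitted, and it assumes step (ii) reads `Q(s', a')` after step (i) has already updated it.

```python
import numpy as np


def cql_update(Q, s, a, r, s1, a1, r1, s2, commutes, alpha=0.2, gamma=0.99):
    """Apply one CQ-L update for the pair (s, a, r, s1) then (s1, a1, r1, s2).

    `a` is unused by the two updates; it is kept so the signature mirrors the transition pair.
    """
    # Step (i): ordinary Q-learning update for the second transition.
    target_i = r1 + gamma * Q[s2].max()
    Q[s1, a1] = (1 - alpha) * Q[s1, a1] + alpha * target_i

    # Step (ii): gated update of Q(s, a'), only when the oracle says the pair commutes.
    if commutes:
        target_ii = r + gamma * Q[s1, a1]
        # Informative gate: equivalent to the max(...) in Eq. 3, since a lower
        # target would leave Q(s, a') unchanged.
        if target_ii > Q[s, a1]:
            Q[s, a1] = (1 - alpha) * Q[s, a1] + alpha * target_ii
    return Q


Q = np.zeros((3, 2))
cql_update(Q, s=0, a=0, r=1.0, s1=1, a1=1, r1=2.0, s2=2, commutes=True)
print(Q)
```

In practice the `commutes` flag would come from `env.is_commutative(prev_state, prev_action, state, action)`.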
This project uses uv for dependency management. The uv.lock file is checked in for reproducible installs.
git clone https://github.com/ethanmclark1/commutative_rl.git
cd commutative_rl
# Install everything (dev tools + project as editable)
uv sync --extra dev
# Optional extras
uv sync --extra dev --extra robotics --extra wandb
This creates .venv/ automatically. Run any command in the env via uv run:
uv run pytest commutative_rl/tests/
uv run commutative-rl --help
uv run python -c "from commutative_rl.envs import RobotSortingEnv"
To activate the venv shell-style instead: source .venv/bin/activate.
| Domain | Type | Commutativity | Source | Paper section |
|---|---|---|---|---|
| multiset_sum | Tabular + DQN | Unconditional | tmlr/tabular/Game.py + tmlr/dqn/game.py | §5.1, A.5.3, A.5.4 |
| windy_gridworld | Tabular | Conditional (top row + same wind column) | tmlr/tabular/Environment.py | §5.2, A.5.6 |
| urban_planning | Tabular | Conditional (not same bridge target) | tmlr/tabular/UrbanPlan.py | §5.2, A.5.5 |
| robot_sorting | DQN (tabular for small N) | Conditional (disjoint object AND bin) | this repo (MuJoCo-optional) | new |
| mountain_car | DQN | Minimal (action commutes with itself) | tmlr/dqn/dqn_com_gym.py | §5.3 |
Cartpole and Acrobot were part of the original paper's §5.3 evaluation but have been dropped from this repo: Cartpole shows no measurable CQ-L benefit, and Acrobot's variance bands swallow the gap (this was the editor's specific critique). Mountain Car is the only gym domain retained; it serves as a single robustness/boundary example for §5.3 rather than a three-domain weak-signal section.
Set --treat_all_commutative to run the §5.3 ablation that treats every pair as commutative.
- Tabular: γ=0.99, α=0.2, ε=0.25 (0.5 for Urban Planning).
- DQN: γ=0.99, α=2e-5, ε=0.5, layers=[128,128], target update every 25 steps, batch=4, buffer=5000 (20000 for Mountain Car).
- MultisetSum tabular sweeps three element ranges with different episode horizons; defaults match the "active" [240, 900] config from the tmlr code. Switch via `--min_elem_range` / `--max_elem_range` and adjust `--n_episode_steps` / `--n_training_steps` accordingly:
  - `[80, 300]` → 300 ep steps, 2000 train steps, 24000 evals
  - `[240, 900]` → 100 ep steps, 2000 train steps, 4000 evals (default)
  - `[800, 3000]` → 30 ep steps, 2000 train steps, 500 evals
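To make the exploration setting above concrete, here is a generic ε-greedy selection sketch using the tabular default ε=0.25. The helper name `epsilon_greedy` is hypothetical, and the repo's agents may implement exploration differently (for example with a decay schedule).

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.3])
actions = [epsilon_greedy(q_row, epsilon=0.25, rng=rng) for _ in range(1000)]
# Most selections should be the greedy action (index 1).
print(np.bincount(actions, minlength=3) / len(actions))
```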
# Multiset Sum, tabular: CQ-L vs Q-learning vs 2Q-L
uv run commutative-rl --domain multiset_sum --approaches QLearning CQL DoubleQLearning --seed 0
# Multiset Sum, DQN
uv run commutative-rl --domain multiset_sum --approaches DQN CQL_DQN DoubleDQN --seed 0
# Windy Gridworld
uv run commutative-rl --domain windy_gridworld --approaches QLearning CQL DoubleQLearning --seed 0
# Urban Planning
uv run commutative-rl --domain urban_planning --approaches QLearning CQL DoubleQLearning --seed 0
# Mountain Car with potential-based reward shaping (DQN only)
uv run commutative-rl --domain mountain_car --approaches DQN CQL_DQN DoubleDQN --seed 0
# Robot Sorting (discrete pick-place; commutativity = disjoint object AND bin)
uv run commutative-rl --domain robot_sorting --approaches DQN CQL_DQN DoubleDQN \
--n_objects 4 --n_bins 4 --seed 0
# add --render to open a MuJoCo viewer (requires `pip install -e ".[robotics]"`)
A tabletop with N objects and K target bins. The agent's action is a
high-level pick-and-place primitive (object_i, bin_j). Each object has a
single correct bin; correct placements pay utility_correct, incorrect
ones pay utility_incorrect, every action pays -action_cost. A
terminator action ends the episode with a small bonus.
Why this domain showcases CQ-L. Action commutativity is the natural
structure of multi-object manipulation: (pick A, place at slot 1) and
(pick B, place at slot 2) produce the same final arrangement regardless
of order, as long as the two placements don't share an object or a bin.
This is the same kind of conditional commutativity as Urban Planning,
lifted to a robotics setting.
MuJoCo is an optional rendering backend — install via the robotics
extra and pass --render to visualize. The transition dynamics are
abstract (deterministic) so the commutativity oracle is exact; CQ-L can
exploit it without physics noise interfering with the algorithm's
guarantees.
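The "disjoint object AND bin" condition described above can be written as a one-line predicate. This is a sketch under assumptions: `placements_commute` is a hypothetical helper rather than the oracle in `envs/robot_sorting.py`, actions are modeled as `(object_i, bin_j)` tuples, and the terminator action is left out.

```python
from typing import Tuple

# Hypothetical action encoding: (object index, bin index).
Placement = Tuple[int, int]


def placements_commute(first: Placement, second: Placement) -> bool:
    """Two pick-and-place actions commute only if they share neither an object nor a bin."""
    obj_a, bin_a = first
    obj_b, bin_b = second
    return obj_a != obj_b and bin_a != bin_b


print(placements_commute((0, 1), (1, 2)))  # different object, different bin
print(placements_commute((0, 1), (0, 2)))  # same object
print(placements_commute((0, 1), (1, 1)))  # same bin
```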
Per-seed evaluation trajectories land in results/<domain>/<approach>/<instance>/seed_<n>.npz.
from pathlib import Path
from commutative_rl.stats import bootstrap_iqm, sample_efficiency_table
from commutative_rl.stats.analysis import runs_grouped_by_approach
runs = runs_grouped_by_approach(Path("results"), domain="windy_gridworld")
print(bootstrap_iqm(runs)) # IQM with bootstrap 95% CIs
print(sample_efficiency_table(runs, threshold=80)) # Steps to reach a return of 80
Backed by rliable: paired bootstrap CIs, performance profiles, probability of improvement, sample-efficiency tables.
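For readers unfamiliar with the interquartile mean (IQM), the following NumPy-only sketch shows the idea: drop the bottom and top 25% of scores, average the rest, and bootstrap a percentile interval. It is not the repo's `bootstrap_iqm` and does not reproduce rliable's stratified procedure; the names `iqm` and `bootstrap_ci` and the example scores are made up for illustration.

```python
import numpy as np


def iqm(scores):
    """Mean of the middle 50% of scores (25% trimmed from each end)."""
    x = np.sort(np.asarray(scores, dtype=float))
    cut = int(0.25 * x.size)
    return float(x[cut:x.size - cut].mean())


def bootstrap_ci(scores, n_boot=2000, seed=0):
    """Percentile bootstrap 95% interval for the IQM."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float)
    stats = [iqm(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(lo), float(hi)


final_returns = [0, 1, 2, 3, 4, 5, 6, 100]  # made-up per-seed scores with one outlier
print(iqm(final_returns))  # 3.5: the outlier is trimmed away
print(bootstrap_ci(final_returns))
```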
commutative_rl/
├── envs/ # Each env owns its env class, problem dataclass, and generator
│ ├── base.py # BaseEnv abstract class (incl. is_commutative oracle)
│ ├── multiset_sum.py # + YAML cache (generate_or_load, load_problem)
│ ├── windy_gridworld.py
│ ├── urban_planning.py # + generate_problem
│ ├── robot_sorting.py # + generate_problem (MuJoCo-optional render)
│ └── gym_env.py # CartPole/Acrobot/MountainCar wrapper with shaping
├── agents/
│ ├── base.py # Shared train/eval loop
│ ├── q_learning.py # QLearning + DQN + Double-data variants
│ ├── cql.py # Canonical CQ-L (tabular + DQN)
│ ├── networks.py # MLP
│ └── buffers.py # ReplayBuffer with is_commutative flag
├── stats/
│ ├── logging.py # Per-seed .npz logger
│ └── analysis.py # rliable wrappers + pseudo-regret
├── config/default.yaml # Per-domain defaults (paper Sec. A.5)
├── tests/ # Per-domain commutativity + CQ-L behavioral tests
└── main.py
Runtime artifacts (gitignored):
- `results/<domain>/<approach>/<instance>/seed_N.npz`: per-seed eval logs.
- `cache/multiset_sum.yaml`: auto-generated problem instances (reproducible per seed).
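A per-seed `.npz` file can be inspected with plain NumPy. The sketch below writes a throwaway file and lists its arrays; the keys `returns` and `steps` are hypothetical placeholders, since the actual keys written by the logger are not documented here.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_npz(path):
    """Return a mapping from array name to shape for every array stored in the file."""
    with np.load(path) as data:
        return {key: data[key].shape for key in data.files}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "seed_0.npz"
    # Hypothetical contents, only so the example has something to read back.
    np.savez(path, returns=np.zeros(10), steps=np.arange(10))
    summary = summarize_npz(path)

print(summary)
```

Pointing `summarize_npz` at a real file under `results/` shows which arrays the logger actually stores.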
Subclass BaseEnv and implement:
- `reset()`, `step(action_idx)`
- `is_terminating(action_idx)`
- `is_commutative(prev_state, prev_action, state, action)`: the domain oracle (swappable `S_V` in the paper)
- For tabular use: `supports_tabular = True`, `state_to_index`, `n_tabular_states`
See envs/multiset_sum.py for a minimal reference, or envs/windy_gridworld.py for a conditional-commutativity example.
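As a rough illustration of the interface listed above, here is a toy domain written against a stand-in abstract class. Everything in it is an assumption: `BaseEnvSketch` only mimics the method names from the list, the real `BaseEnv` may have different signatures, and the `(state, reward, done)` return shape of `step` is a guess.

```python
from abc import ABC, abstractmethod


class BaseEnvSketch(ABC):
    """Stand-in for the repo's BaseEnv (hypothetical; the real class may differ)."""

    supports_tabular = False

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def step(self, action_idx): ...

    @abstractmethod
    def is_terminating(self, action_idx): ...

    @abstractmethod
    def is_commutative(self, prev_state, prev_action, state, action): ...


class CounterEnv(BaseEnvSketch):
    """Toy domain: actions 0..2 add 1..3 to a clipped counter; action 3 ends the episode."""

    supports_tabular = True

    def __init__(self, target=6, max_sum=12):
        self.target, self.max_sum = target, max_sum
        self.n_tabular_states = max_sum + 1
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action_idx):
        if self.is_terminating(action_idx):
            return self.state, float(self.state == self.target), True
        self.state = min(self.state + action_idx + 1, self.max_sum)
        return self.state, -0.1, False

    def is_terminating(self, action_idx):
        return action_idx == 3

    def is_commutative(self, prev_state, prev_action, state, action):
        # Additions commute with each other; the terminator commutes with nothing.
        return not (self.is_terminating(prev_action) or self.is_terminating(action))

    def state_to_index(self, state):
        return int(state)


env = CounterEnv()
s0 = env.reset()
s1, reward, done = env.step(2)
print(s1, reward, done, env.is_commutative(s0, 2, s1, 1))
```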
MIT. See LICENSE.
@article{clark2026commutative,
title={Improving Q-Learning via Exploiting Action Commutativity in Markov Decision Processes},
author={Clark, Ethan M. and others},
journal={Submitted to TMLR},
year={2026}
}