Commutative Reinforcement Learning

A clean implementation of Commutative Q-Learning (CQ-L) — the two-step Bellman update from Clark et al. (TMLR submission) that exploits action commutativity in MDPs to speed up Q-learning and DQN convergence.

What is action commutativity?

Two actions a and b are commutative under state s if executing [a, b] and [b, a] from s yields the same expected next-state distribution and reward (Def. 4.1 of the paper). When this holds, the standard Bellman update wastes samples — CQ-L recovers them via a second, gated update.
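
For a deterministic environment, the definition reduces to comparing the two orderings directly. The sketch below is illustrative only: the `step(state, action) -> (next_state, reward)` signature, the two toy dynamics, and the choice to compare the summed reward of both steps are assumptions, not code from this repo.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
StepFn = Callable[[State, int], Tuple[State, float]]


def rollout(step: StepFn, state: State, actions: List[int]) -> Tuple[State, float]:
    """Apply actions in order; return the final state and the summed reward."""
    total = 0.0
    for action in actions:
        state, reward = step(state, action)
        total += reward
    return state, total


def commutes(step: StepFn, state: State, a: int, b: int) -> bool:
    """True if [a, b] and [b, a] from `state` agree on final state and reward."""
    return rollout(step, state, [a, b]) == rollout(step, state, [b, a])


def multiset_step(state, action):
    # Toy dynamics: the state is a sorted tuple (a multiset), so order is forgotten.
    return tuple(sorted(state + (action,))), -1.0


def sequence_step(state, action):
    # Toy dynamics: the state remembers insertion order, so order matters.
    return state + (action,), -1.0


print(commutes(multiset_step, (), 3, 5))  # True
print(commutes(sequence_step, (), 3, 5))  # False
```

For stochastic dynamics the same comparison would have to be made on expected next-state distributions and rewards rather than on single rollouts.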

The CQ-L update (Eq. 3):

(i)  Q(s', a') ← (1-α) Q(s', a') + α · (r' + γ max_a'' Q(s'', a''))
(ii) Q(s,  a') ← (1-α) Q(s,  a') + α · max(Q(s, a'), r + γ Q(s', a'))

Step (ii) only fires when:

- env.is_commutative(prev_state, prev_action, state, action) returns True, and
- the update would raise Q(s, a') (the informative gate).
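
A minimal tabular sketch of the two steps, written for a trace s → s' → s'' with rewards r and r'. It is not the code in agents/cql.py: the function name, the list-of-lists Q-table, and the choice to use the freshly updated Q(s', a') inside step (ii) are assumptions, and terminal-state handling is left out.

```python
def cql_update(Q, s, r, s_next, a_next, r_next, s_next2, alpha, gamma, commutative):
    """Apply both updates in place and return Q.

    Q[state][action] holds action values. `r` is the reward of the first
    transition (s -> s_next) and `r_next` the reward of the second one
    (s_next -> s_next2, taken with action a_next).
    """
    # Step (i): ordinary Q-learning update for the second transition.
    target = r_next + gamma * max(Q[s_next2])
    Q[s_next][a_next] = (1 - alpha) * Q[s_next][a_next] + alpha * target

    # Step (ii): update Q(s, a') only if the oracle says the pair commutes
    # and the candidate target would raise the current estimate.
    if commutative:
        candidate = r + gamma * Q[s_next][a_next]
        if candidate > Q[s][a_next]:  # informative gate
            Q[s][a_next] = (1 - alpha) * Q[s][a_next] + alpha * candidate
    return Q


# Three states, two actions, all values initialised to zero.
Q = [[0.0, 0.0] for _ in range(3)]
cql_update(Q, s=0, r=1.0, s_next=1, a_next=1, r_next=2.0, s_next2=2,
           alpha=0.5, gamma=0.9, commutative=True)
print(Q)
```

When the candidate does not exceed Q(s, a'), the max in step (ii) makes the update a no-op, so skipping it explicitly gives the same result.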

Install (uv)

This project uses uv for dependency management. The uv.lock file is checked in for reproducible installs.

git clone https://github.com/ethanmclark1/commutative_rl.git
cd commutative_rl

# Install everything (dev tools + project as editable)
uv sync --extra dev

# Optional extras
uv sync --extra dev --extra robotics --extra wandb

This creates .venv/ automatically. Run any command in the env via uv run:

uv run pytest commutative_rl/tests/
uv run commutative-rl --help
uv run python -c "from commutative_rl.envs import RobotSortingEnv"

To activate the venv shell-style instead: source .venv/bin/activate.

Domains (from the CQ-L paper)

| Domain | Type | Commutativity | Source | Paper section |
| --- | --- | --- | --- | --- |
| `multiset_sum` | Tabular + DQN | Unconditional | tmlr/tabular/Game.py + tmlr/dqn/game.py | §5.1, A.5.3, A.5.4 |
| `windy_gridworld` | Tabular | Conditional (top row + same wind column) | tmlr/tabular/Environment.py | §5.2, A.5.6 |
| `urban_planning` | Tabular | Conditional (not same bridge target) | tmlr/tabular/UrbanPlan.py | §5.2, A.5.5 |
| `robot_sorting` | DQN (tabular for small N) | Conditional (disjoint object AND bin) | this repo (MuJoCo-optional) | new |
| `mountain_car` | DQN | Minimal (action commutes with itself) | tmlr/dqn/dqn_com_gym.py | §5.3 |

Cartpole and Acrobot were part of the original paper's §5.3 evaluation but have been dropped from this repo: Cartpole shows no measurable CQ-L benefit, and Acrobot's variance bands swallow the gap (this was the editor's specific critique). Mountain Car is the only gym domain retained. It serves as a single robustness/boundary example for §5.3 rather than a three-domain weak-signal section.

Set --treat_all_commutative to run the §5.3 ablation that treats every pair as commutative.
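
Conceptually, the ablation swaps the domain oracle for one that always answers True. The wrapper below is a sketch of that idea; `make_oracle` and `self_only` are hypothetical names, and how the flag is actually wired into the agents is not shown in this README.

```python
from typing import Callable

Oracle = Callable[[object, int, object, int], bool]


def make_oracle(domain_oracle: Oracle, treat_all_commutative: bool) -> Oracle:
    """Return the oracle that gates step (ii) of the CQ-L update."""
    if treat_all_commutative:
        # Ablation: every (prev_action, action) pair is treated as commutative.
        return lambda prev_state, prev_action, state, action: True
    return domain_oracle


def self_only(prev_state, prev_action, state, action):
    # Stand-in for a "minimal" oracle: an action commutes only with itself.
    return prev_action == action


print(make_oracle(self_only, False)(None, 0, None, 1))  # False
print(make_oracle(self_only, True)(None, 0, None, 1))   # True
```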

Hyperparameters follow paper A.5 exactly

Tabular: γ=0.99, α=0.2, ε=0.25 (0.5 for Urban Planning).
DQN: γ=0.99, α=2e-5, ε=0.5, layers=[128,128], target update every 25 steps, batch=4, buffer=5000 (20000 for Mountain Car).
MultisetSum tabular sweeps three element ranges with different episode horizons; defaults match the "active" [240, 900] config from the tmlr code. Switch via --min_elem_range / --max_elem_range and adjust --n_episode_steps / --n_training_steps accordingly:
- [80, 300] → 300 ep steps, 2000 train steps, 24000 evals
- [240, 900] → 100 ep steps, 2000 train steps, 4000 evals (default)
- [800, 3000] → 30 ep steps, 2000 train steps, 500 evals
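
The three configurations above can be kept as a small lookup table when scripting sweeps. This is a sketch: `MultisetSumPreset`, its field names, and `preset_for` are illustrative and not part of the package; only the numbers come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultisetSumPreset:
    n_episode_steps: int
    n_training_steps: int
    n_evals: int


# Keyed by the element range listed above.
PRESETS = {
    (80, 300): MultisetSumPreset(300, 2000, 24000),
    (240, 900): MultisetSumPreset(100, 2000, 4000),  # default
    (800, 3000): MultisetSumPreset(30, 2000, 500),
}


def preset_for(elem_range):
    """Look up the horizon settings that go with an element range."""
    key = tuple(elem_range)
    if key not in PRESETS:
        raise KeyError(f"no preset for element range {list(key)}")
    return PRESETS[key]


print(preset_for([240, 900]))
```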

Run

# Multiset Sum, tabular: CQ-L vs Q-learning vs 2Q-L
uv run commutative-rl --domain multiset_sum --approaches QLearning CQL DoubleQLearning --seed 0

# Multiset Sum, DQN
uv run commutative-rl --domain multiset_sum --approaches DQN CQL_DQN DoubleDQN --seed 0

# Windy Gridworld
uv run commutative-rl --domain windy_gridworld --approaches QLearning CQL DoubleQLearning --seed 0

# Urban Planning
uv run commutative-rl --domain urban_planning --approaches QLearning CQL DoubleQLearning --seed 0

# Mountain Car with potential-based reward shaping (DQN only)
uv run commutative-rl --domain mountain_car --approaches DQN CQL_DQN DoubleDQN --seed 0

# Robot Sorting (discrete pick-place; commutativity = disjoint object AND bin)
uv run commutative-rl --domain robot_sorting --approaches DQN CQL_DQN DoubleDQN \
    --n_objects 4 --n_bins 4 --seed 0
# add --render to open a MuJoCo viewer (requires the robotics extra, e.g. `uv sync --extra dev --extra robotics`)

Robot Sorting in more detail

A tabletop with N objects and K target bins. The agent's action is a high-level pick-and-place primitive (object_i, bin_j). Each object has a single correct bin; correct placements pay utility_correct, incorrect ones pay utility_incorrect, every action pays -action_cost. A terminator action ends the episode with a small bonus.
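
The placement reward described above can be sketched as follows. The function name, the `correct_bin` mapping, and the numeric values in the usage lines are assumptions for illustration; the terminator bonus is omitted because its value is not given here.

```python
def placement_reward(obj, bin_idx, correct_bin,
                     utility_correct, utility_incorrect, action_cost):
    """Reward for placing `obj` into `bin_idx`.

    `correct_bin` maps each object index to its single correct bin.
    """
    is_correct = correct_bin[obj] == bin_idx
    utility = utility_correct if is_correct else utility_incorrect
    return utility - action_cost  # every action also pays the action cost


# Hypothetical values, chosen only to exercise both branches.
correct_bin = {0: 2, 1: 0}
print(placement_reward(0, 2, correct_bin, 1.0, -0.5, 0.1))
print(placement_reward(1, 2, correct_bin, 1.0, -0.5, 0.1))
```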

Why this domain showcases CQ-L. Action commutativity is the natural structure of multi-object manipulation: (pick A, place at slot 1) and (pick B, place at slot 2) produce the same final arrangement regardless of order, as long as the two placements don't share an object or a bin. This is the same kind of conditional commutativity as Urban Planning, lifted to a robotics setting.
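
The "disjoint object AND bin" condition amounts to a two-line check. A sketch, assuming actions have already been decoded into (object_index, bin_index) pairs; the real oracle in envs/robot_sorting.py takes the four-argument is_commutative signature and may handle the terminator and action encoding differently.

```python
def placements_commute(first, second):
    """True if two pick-and-place actions can be swapped.

    Each argument is an (object_index, bin_index) pair. The placements
    commute only when they touch different objects and different bins.
    """
    obj_a, bin_a = first
    obj_b, bin_b = second
    return obj_a != obj_b and bin_a != bin_b


print(placements_commute((0, 1), (1, 2)))  # True: disjoint object and bin
print(placements_commute((0, 1), (0, 2)))  # False: same object
print(placements_commute((0, 1), (1, 1)))  # False: same bin
```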

MuJoCo is an optional rendering backend — install via the robotics extra and pass --render to visualize. The transition dynamics are abstract (deterministic) so the commutativity oracle is exact; CQ-L can exploit it without physics noise interfering with the algorithm's guarantees.

Per-seed evaluation trajectories land in results/<domain>/<approach>/<instance>/seed_<n>.npz.

Analyze results

from pathlib import Path
from commutative_rl.stats import bootstrap_iqm, sample_efficiency_table
from commutative_rl.stats.analysis import runs_grouped_by_approach

runs = runs_grouped_by_approach(Path("results"), domain="windy_gridworld")
print(bootstrap_iqm(runs))                          # IQM with bootstrap 95% CIs
print(sample_efficiency_table(runs, threshold=80))  # Steps to reach a return

Backed by rliable: paired bootstrap CIs, performance profiles, probability of improvement, sample-efficiency tables.
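
For readers unfamiliar with the metric, the interquartile mean (IQM) is the mean of the middle 50% of scores. The standard-library sketch below shows the idea with a plain percentile bootstrap over one list of per-run scores; it is a simplification and not what `bootstrap_iqm` or rliable does internally. The function names and defaults are assumptions.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the IQM."""
    rng = random.Random(seed)
    estimates = sorted(
        iqm(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


scores = [0, 1, 2, 3, 4, 5, 6, 100]
print(iqm(scores))           # 3.5, the outlier at 100 is trimmed away
print(bootstrap_ci(scores))  # (lower, upper) bounds of the 95% interval
```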

Repository layout

commutative_rl/
├── envs/                      # Each env owns its env class, problem dataclass, and generator
│   ├── base.py                #   BaseEnv abstract class (incl. is_commutative oracle)
│   ├── multiset_sum.py        #   + YAML cache (generate_or_load, load_problem)
│   ├── windy_gridworld.py
│   ├── urban_planning.py      #   + generate_problem
│   ├── robot_sorting.py       #   + generate_problem (MuJoCo-optional render)
│   └── gym_env.py             #   CartPole/Acrobot/MountainCar wrapper with shaping
├── agents/
│   ├── base.py                # Shared train/eval loop
│   ├── q_learning.py          # QLearning + DQN + Double-data variants
│   ├── cql.py                 # Canonical CQ-L (tabular + DQN)
│   ├── networks.py            # MLP
│   └── buffers.py             # ReplayBuffer with is_commutative flag
├── stats/
│   ├── logging.py             # Per-seed .npz logger
│   └── analysis.py            # rliable wrappers + pseudo-regret
├── config/default.yaml        # Per-domain defaults (paper Sec. A.5)
├── tests/                     # Per-domain commutativity + CQ-L behavioral tests
└── main.py

Runtime artifacts (gitignored):

- results/<domain>/<approach>/<instance>/seed_<n>.npz: per-seed eval logs.
- cache/multiset_sum.yaml: auto-generated problem instances (reproducible per seed).
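
Given the results layout above, the per-seed files for one domain can be collected with pathlib alone. This is a sketch of the directory walk, not the package's `runs_grouped_by_approach`; the function name is hypothetical and the contents of the .npz files are not inspected.

```python
from collections import defaultdict
from pathlib import Path


def seed_files_by_approach(results_dir, domain):
    """Map each approach name to its sorted list of seed_*.npz paths."""
    grouped = defaultdict(list)
    pattern = f"{domain}/*/*/seed_*.npz"  # <approach>/<instance>/seed file
    for path in sorted(Path(results_dir).glob(pattern)):
        approach = path.parts[-3]  # .../<approach>/<instance>/seed_N.npz
        grouped[approach].append(path)
    return dict(grouped)


for approach, files in seed_files_by_approach("results", "windy_gridworld").items():
    print(approach, len(files))
```

If the results directory does not exist yet, the loop simply prints nothing.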

Adding a new domain

Subclass BaseEnv and implement:

- reset() and step(action_idx)
- is_terminating(action_idx)
- is_commutative(prev_state, prev_action, state, action): the domain oracle (swappableS_V in the paper)
- For tabular use: supports_tabular = True, state_to_index, n_tabular_states

See envs/multiset_sum.py for a minimal reference, or envs/windy_gridworld.py for a conditional-commutativity example.
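
As a rough picture of what such a subclass involves, here is a toy domain exposing the members listed above. To stay runnable on its own it does not inherit from the repo's BaseEnv, and several details are assumptions: the `(state, reward, done)` return value of `step`, the `n_actions` attribute, and the reward values.

```python
class SetBitsEnv:
    """Toy domain: action i sets bit i; the last action index terminates."""

    supports_tabular = True

    def __init__(self, n_bits=3):
        self.n_bits = n_bits
        self.n_actions = n_bits + 1  # last index is the terminator
        self.n_tabular_states = 2 ** n_bits
        self.state = (0,) * n_bits

    def reset(self):
        self.state = (0,) * self.n_bits
        return self.state

    def is_terminating(self, action_idx):
        return action_idx == self.n_bits

    def step(self, action_idx):
        done = self.is_terminating(action_idx)
        if not done:
            bits = list(self.state)
            bits[action_idx] = 1
            self.state = tuple(bits)
        # Small cost per bit-setting action, payoff for set bits on termination.
        reward = float(sum(self.state)) if done else -0.1
        return self.state, reward, done

    def is_commutative(self, prev_state, prev_action, state, action):
        # Setting bits commutes in any order; the terminator never commutes.
        return not (self.is_terminating(prev_action) or self.is_terminating(action))

    def state_to_index(self, state):
        # Read the bit tuple as a little-endian binary number.
        return sum(bit << i for i, bit in enumerate(state))


env = SetBitsEnv(3)
start = env.reset()
after_first, _, _ = env.step(0)
print(env.is_commutative(start, 0, after_first, 2))  # True
```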

License

MIT. See LICENSE.

Citation

@article{clark2026commutative,
  title={Improving Q-Learning via Exploiting Action Commutativity in Markov Decision Processes},
  author={Clark, Ethan M. and others},
  journal={Submitted to TMLR},
  year={2026}
}
