
Commutative Reinforcement Learning

License: MIT · Python 3.10+ · Code style: black

A clean implementation of Commutative Q-Learning (CQ-L) — the two-step Bellman update from Clark et al. (TMLR submission) that exploits action commutativity in MDPs to speed up Q-learning and DQN convergence.

What is action commutativity?

Two actions a and b are commutative under state s if executing [a, b] and [b, a] from s yields the same expected next-state distribution and reward (Def. 4.1 of the paper). When this holds, the standard Bellman update wastes samples — CQ-L recovers them via a second, gated update.
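For a deterministic environment, the definition reduces to a direct check: apply both orderings and compare the outcomes. The sketch below assumes a hypothetical pure transition function `step(state, action) -> (next_state, reward)`, which is not part of this repo's API. Comparing the summed reward of the two orderings is one reading of the reward condition in Def. 4.1.

```python
from typing import Callable, Hashable, Tuple

# Hypothetical pure transition function: (state, action) -> (next_state, reward).
Step = Callable[[Hashable, int], Tuple[Hashable, float]]


def commutes(step: Step, state: Hashable, a: int, b: int) -> bool:
    """Return True if [a, b] and [b, a] from `state` give the same outcome.

    Deterministic special case only: compares the final state and the
    summed reward of both orderings.
    """
    mid_ab, r1 = step(state, a)
    end_ab, r2 = step(mid_ab, b)
    mid_ba, r3 = step(state, b)
    end_ba, r4 = step(mid_ba, a)
    return end_ab == end_ba and abs((r1 + r2) - (r3 + r4)) < 1e-9


def add_step(state: int, action: int) -> Tuple[int, float]:
    """Toy dynamics: the action value is added to a running sum."""
    return state + action, -1.0


def double_or_add_step(state: int, action: int) -> Tuple[int, float]:
    """Toy dynamics: action 0 doubles the state, any other action adds it."""
    return (state * 2 if action == 0 else state + action), -1.0
```

With `add_step`, any two actions commute because addition is order-free. With `double_or_add_step`, doubling and adding generally do not commute, so the check fails.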

The CQ-L update (Eq. 3):

(i)  Q(s', a') ← (1-α) Q(s', a') + α · (r' + γ max_a'' Q(s'', a''))
(ii) Q(s,  a') ← (1-α) Q(s,  a') + α · max(Q(s, a'), r + γ Q(s', a'))

Step (ii) only fires when:

  1. env.is_commutative(prev_state, prev_action, state, action) returns True, and
  2. the update would raise Q(s, a') (the informative gate).
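The two steps above can be sketched for the tabular case as follows. This is a minimal illustration, not the repo's `cql.py`: it assumes a NumPy Q-table indexed as `Q[state, action]`, takes the oracle's answer as a boolean `commutes`, applies step (i) before step (ii) so that step (ii) sees the refreshed `Q(s', a')`, and omits terminal-state handling.

```python
import numpy as np


def cql_update(Q, s, a, r, s1, a1, r1, s2, commutes, alpha=0.2, gamma=0.99):
    """Apply steps (i) and (ii) in place for the trajectory s -a-> s1 -a1-> s2.

    `commutes` stands in for the result of env.is_commutative(...).
    The action `a` is kept in the signature only to mirror the trajectory.
    """
    # Step (i): standard Q-learning update for (s', a').
    Q[s1, a1] = (1 - alpha) * Q[s1, a1] + alpha * (r1 + gamma * Q[s2].max())

    # Step (ii): gated update for (s, a'), applied only to commutative pairs.
    if commutes:
        target = r + gamma * Q[s1, a1]
        # max(...) is the informative gate: a target at or below the current
        # estimate leaves Q(s, a') unchanged.
        Q[s, a1] = (1 - alpha) * Q[s, a1] + alpha * max(Q[s, a1], target)
    return Q
```

Because of the `max`, step (ii) can only raise `Q(s, a')`; when the target is not informative, the convex combination reduces to the old value.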

Install (uv)

This project uses uv for dependency management. The uv.lock file is checked in for reproducible installs.

git clone https://github.com/ethanmclark1/commutative_rl.git
cd commutative_rl

# Install everything (dev tools + project as editable)
uv sync --extra dev

# Optional extras
uv sync --extra dev --extra robotics --extra wandb

This creates .venv/ automatically. Run any command in the env via uv run:

uv run pytest commutative_rl/tests/
uv run commutative-rl --help
uv run python -c "from commutative_rl.envs import RobotSortingEnv"

To activate the venv shell-style instead: source .venv/bin/activate.

Domains (from the CQ-L paper)

| Domain | Type | Commutativity | Source | Paper section |
|---|---|---|---|---|
| multiset_sum | Tabular + DQN | Unconditional | tmlr/tabular/Game.py + tmlr/dqn/game.py | §5.1, A.5.3, A.5.4 |
| windy_gridworld | Tabular | Conditional (top row + same wind column) | tmlr/tabular/Environment.py | §5.2, A.5.6 |
| urban_planning | Tabular | Conditional (not same bridge target) | tmlr/tabular/UrbanPlan.py | §5.2, A.5.5 |
| robot_sorting | DQN (tabular for small N) | Conditional (disjoint object AND bin) | this repo (MuJoCo-optional) | new |
| mountain_car | DQN | Minimal (action commutes with itself) | tmlr/dqn/dqn_com_gym.py | §5.3 |

CartPole and Acrobot were part of the original paper's §5.3 evaluation but have been dropped from this repo: CartPole shows no measurable CQ-L benefit, and Acrobot's variance bands swallow the gap (this was the editor's specific critique). Mountain Car is the only gym domain retained; it serves as a single robustness/boundary example for §5.3 rather than a three-domain weak-signal section.

Set --treat_all_commutative to run the §5.3 ablation that treats every pair as commutative.

Hyperparameters follow paper A.5 exactly

  • Tabular: γ=0.99, α=0.2, ε=0.25 (0.5 for Urban Planning).
  • DQN: γ=0.99, α=2e-5, ε=0.5, layers=[128,128], target update every 25 steps, batch=4, buffer=5000 (20000 for Mountain Car).
  • MultisetSum tabular sweeps three element ranges with different episode horizons; defaults match the "active" [240, 900] config from the tmlr code. Switch via --min_elem_range / --max_elem_range and adjust --n_episode_steps / --n_training_steps accordingly:
    • [80, 300] → 300 ep steps, 2000 train steps, 24000 evals
    • [240, 900] → 100 ep steps, 2000 train steps, 4000 evals (default)
    • [800, 3000] → 30 ep steps, 2000 train steps, 500 evals

Run

# Multiset Sum, tabular: CQ-L vs Q-learning vs 2Q-L
uv run commutative-rl --domain multiset_sum --approaches QLearning CQL DoubleQLearning --seed 0

# Multiset Sum, DQN
uv run commutative-rl --domain multiset_sum --approaches DQN CQL_DQN DoubleDQN --seed 0

# Windy Gridworld
uv run commutative-rl --domain windy_gridworld --approaches QLearning CQL DoubleQLearning --seed 0

# Urban Planning
uv run commutative-rl --domain urban_planning --approaches QLearning CQL DoubleQLearning --seed 0

# Mountain Car with potential-based reward shaping (DQN only)
uv run commutative-rl --domain mountain_car --approaches DQN CQL_DQN DoubleDQN --seed 0

# Robot Sorting (discrete pick-place; commutativity = disjoint object AND bin)
uv run commutative-rl --domain robot_sorting --approaches DQN CQL_DQN DoubleDQN \
    --n_objects 4 --n_bins 4 --seed 0
# add --render to open a MuJoCo viewer (requires the robotics extra: uv sync --extra dev --extra robotics)

Robot Sorting in more detail

A tabletop with N objects and K target bins. The agent's action is a high-level pick-and-place primitive (object_i, bin_j). Each object has a single correct bin: correct placements pay utility_correct, incorrect ones pay utility_incorrect, and every action costs action_cost. A terminator action ends the episode with a small bonus.

Why this domain showcases CQ-L. Action commutativity is the natural structure of multi-object manipulation: (pick A, place at slot 1) and (pick B, place at slot 2) produce the same final arrangement regardless of order, as long as the two placements don't share an object or a bin. This is the same kind of conditional commutativity as Urban Planning, lifted to a robotics setting.
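The disjointness rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the code in `envs/robot_sorting.py`: the flat action encoding `object_i * n_bins + bin_j`, the terminator sitting at index `n_objects * n_bins`, and the choice to ignore the states are all hypothetical.

```python
from typing import Tuple


def decode_action(action_idx: int, n_bins: int) -> Tuple[int, int]:
    """Hypothetical encoding: action_idx = object_i * n_bins + bin_j."""
    return divmod(action_idx, n_bins)


def placements_commute(
    prev_action: int, action: int, n_objects: int, n_bins: int
) -> bool:
    """True when two pick-place actions share neither an object nor a bin."""
    terminator = n_objects * n_bins  # assumed index of the terminator action
    if prev_action == terminator or action == terminator:
        # Ending the episode cannot be reordered with a placement.
        return False
    obj_a, bin_a = decode_action(prev_action, n_bins)
    obj_b, bin_b = decode_action(action, n_bins)
    return obj_a != obj_b and bin_a != bin_b
```

With 4 objects and 4 bins, actions 0 (object 0, bin 0) and 5 (object 1, bin 1) commute under this rule, while 0 and 1 (same object) or 0 and 4 (same bin) do not.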

MuJoCo is an optional rendering backend — install via the robotics extra and pass --render to visualize. The transition dynamics are abstract (deterministic) so the commutativity oracle is exact; CQ-L can exploit it without physics noise interfering with the algorithm's guarantees.

Per-seed evaluation trajectories land in results/<domain>/<approach>/<instance>/seed_<n>.npz.

Analyze results

from pathlib import Path
from commutative_rl.stats import bootstrap_iqm, sample_efficiency_table
from commutative_rl.stats.analysis import runs_grouped_by_approach

runs = runs_grouped_by_approach(Path("results"), domain="windy_gridworld")
print(bootstrap_iqm(runs))                          # IQM with bootstrap 95% CIs
print(sample_efficiency_table(runs, threshold=80))  # Steps to reach a return

Backed by rliable: paired bootstrap CIs, performance profiles, probability of improvement, sample-efficiency tables.
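For intuition about what an IQM with a bootstrap interval measures, here is a NumPy-only sketch. It is not the repo's `bootstrap_iqm` and not rliable's implementation (which may stratify across tasks); the function names and the plain percentile bootstrap over a flat array of per-seed scores are assumptions made for illustration.

```python
import numpy as np


def iqm(scores) -> float:
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    k = int(0.25 * x.size)  # number of points trimmed from each tail
    return float(x[k : x.size - k].mean())


def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the IQM over runs."""
    x = np.asarray(scores, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    # Resample runs with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    stats = np.array([iqm(row) for row in x[idx]])
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(stats, [tail, 100 - tail])
    return float(lo), float(hi)
```

Trimming both tails makes the IQM less sensitive to a single outlier seed than the mean, while using more of the data than the median.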

Repository layout

commutative_rl/
├── envs/                      # Each env owns its env class, problem dataclass, and generator
│   ├── base.py                #   BaseEnv abstract class (incl. is_commutative oracle)
│   ├── multiset_sum.py        #   + YAML cache (generate_or_load, load_problem)
│   ├── windy_gridworld.py
│   ├── urban_planning.py      #   + generate_problem
│   ├── robot_sorting.py       #   + generate_problem (MuJoCo-optional render)
│   └── gym_env.py             #   CartPole/Acrobot/MountainCar wrapper with shaping
├── agents/
│   ├── base.py                # Shared train/eval loop
│   ├── q_learning.py          # QLearning + DQN + Double-data variants
│   ├── cql.py                 # Canonical CQ-L (tabular + DQN)
│   ├── networks.py            # MLP
│   └── buffers.py             # ReplayBuffer with is_commutative flag
├── stats/
│   ├── logging.py             # Per-seed .npz logger
│   └── analysis.py            # rliable wrappers + pseudo-regret
├── config/default.yaml        # Per-domain defaults (paper Sec. A.5)
├── tests/                     # Per-domain commutativity + CQ-L behavioral tests
└── main.py

Runtime artifacts (gitignored):

  • results/<domain>/<approach>/<instance>/seed_N.npz — per-seed eval logs.
  • cache/multiset_sum.yaml — auto-generated problem instances (reproducible per seed).
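To inspect the per-seed logs without the analysis helpers, plain NumPy is enough. The array names stored in each file are not documented here, so this sketch only lists whatever is present; the helper names `seed_files` and `describe_npz` are hypothetical.

```python
from pathlib import Path

import numpy as np


def seed_files(results_dir, domain):
    """List per-seed logs under results/<domain>/<approach>/<instance>/."""
    return sorted(Path(results_dir, domain).glob("*/*/seed_*.npz"))


def describe_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}
```

For example, `[describe_npz(p) for p in seed_files("results", "windy_gridworld")]` shows which arrays each run saved before any statistics are computed.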

Adding a new domain

Subclass BaseEnv and implement:

  • reset(), step(action_idx)
  • is_terminating(action_idx)
  • is_commutative(prev_state, prev_action, state, action) — the domain oracle (swappableS_V in the paper)
  • For tabular use: supports_tabular = True, state_to_index, n_tabular_states

See envs/multiset_sum.py for a minimal reference, or envs/windy_gridworld.py for a conditional-commutativity example.
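The shape of such a subclass can be sketched without the package installed. `BaseEnvSketch` below is a stand-in for the repo's `BaseEnv`, whose real signatures may differ; the `(state, reward, done)` return value of `step` and the toy `CounterEnv` domain are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseEnvSketch(ABC):
    """Stand-in for the repo's BaseEnv; the real base class may differ."""

    supports_tabular = False

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def step(self, action_idx): ...

    @abstractmethod
    def is_terminating(self, action_idx): ...

    @abstractmethod
    def is_commutative(self, prev_state, prev_action, state, action): ...


class CounterEnv(BaseEnvSketch):
    """Toy domain: actions 0..2 add 1..3 to a capped counter; action 3 stops."""

    supports_tabular = True

    def __init__(self, target=3, cap=5):
        self.target, self.cap = target, cap
        self.n_tabular_states = cap + 1  # counter values 0..cap
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def is_terminating(self, action_idx):
        return action_idx == 3

    def step(self, action_idx):
        if self.is_terminating(action_idx):
            # Bonus only if the counter sits exactly on the target.
            return self.state, float(self.state == self.target), True
        self.state = min(self.state + action_idx + 1, self.cap)
        return self.state, -0.1, False

    def is_commutative(self, prev_state, prev_action, state, action):
        # Capped additions commute in any order; stopping never does.
        return not (self.is_terminating(prev_action) or self.is_terminating(action))

    def state_to_index(self, state):
        return int(state)
```

Here the oracle is unconditional for additive actions, which mirrors the simplest case in the domain table; a conditional oracle would also inspect `prev_state` and `state`.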

License

MIT. See LICENSE.

Citation

@article{clark2026commutative,
  title={Improving Q-Learning via Exploiting Action Commutativity in Markov Decision Processes},
  author={Clark, Ethan M. and others},
  journal={Submitted to TMLR},
  year={2026}
}
