This repository provides an official implementation of DIPOLE on RL benchmarks, including ExORL and OGBench. Please refer to the corresponding directories for detailed experimental settings and instructions.
DIPOLE (DIchotomous diffusion POLicy improvEment) is a reinforcement learning (RL) algorithm designed for stable and controllable optimization of diffusion-based policies. It addresses key challenges in applying RL to large diffusion policies, including training instability, inefficient credit assignment, and limited controllability of policy greediness.
DIPOLE reformulates KL-regularized RL with a greedified policy regularization objective, which enables the optimal diffusion policy to be decomposed into two dichotomous policies:
- Positive policy: reward maximization by emphasizing high-return actions.
- Negative policy: reward minimization by emphasizing low-return actions.
Both policies are trained using bounded sigmoid-based weighting, ensuring stable and efficient learning without loss explosion.
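To illustrate why bounded weights help, here is a minimal sketch. It assumes the positive and negative weights are logistic functions of an advantage estimate; the function names, the `beta` temperature, and the exact weighting form are illustrative assumptions, not taken from this repository. The exponential weighting is included only for contrast.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dichotomous_weights(advantages, beta=1.0):
    # Hypothetical weighting: the positive policy up-weights high-advantage
    # actions, the negative policy up-weights low-advantage actions.
    # Every weight lies in [0, 1] regardless of the advantage scale.
    pos = [sigmoid(beta * a) for a in advantages]
    neg = [sigmoid(-beta * a) for a in advantages]
    return pos, neg

def exp_weights(advantages, beta=1.0):
    # Unbounded exponential weighting, shown only for comparison.
    return [math.exp(beta * a) for a in advantages]

advantages = [-50.0, -1.0, 0.0, 1.0, 50.0]
pos_w, neg_w = dichotomous_weights(advantages, beta=1.0)
unbounded_w = exp_weights(advantages, beta=1.0)
```

With these inputs, every sigmoid weight stays within [0, 1], while the exponential weight for the largest advantage grows beyond 1e20, which is the kind of scale that can destabilize a weighted loss.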
At inference time, actions are generated by linearly combining the score functions of the positive and negative policies. This combination:
- lets greediness factors control the trade-off between exploitation and stability.
- is mathematically analogous to classifier-free guidance in diffusion models.
As a result, DIPOLE enables explicit and continuous control over policy optimality without retraining.
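The inference-time combination can be sketched as follows. This assumes the standard classifier-free-guidance form, in which the positive score is extrapolated away from the negative score; the name `omega` and the `(1 + omega)` / `-omega` coefficients are assumptions for illustration, and the coefficients used by DIPOLE may differ.

```python
def combine_scores(score_pos, score_neg, omega):
    # Hypothetical CFG-style linear combination of two score estimates
    # evaluated at the same noisy action and diffusion timestep.
    # omega = 0 recovers the positive policy; larger omega pushes the
    # sample further away from the negative policy (greedier behavior).
    return [(1.0 + omega) * sp - omega * sn
            for sp, sn in zip(score_pos, score_neg)]

# Toy per-dimension score values, made up for the example.
score_pos = [1.0, 2.0]
score_neg = [0.5, 1.0]

baseline = combine_scores(score_pos, score_neg, omega=0.0)
greedier = combine_scores(score_pos, score_neg, omega=1.0)
```

Because `omega` only enters at sampling time in this sketch, it can be changed between rollouts without touching the trained networks.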
Our implementation is built upon CFGRL and FQL, on top of OGBench's reference implementations. We thank the authors of these prior studies for making their work publicly available.
If you find this repository useful, please cite:
@article{liang2026dipole,
title={Dichotomous Diffusion Policy Optimization},
author={Ruiming Liang and Yinan Zheng and Kexin Zheng and Tianyi Tan and Jianxiong Li and Liyuan Mao and Zhihao Wang and Guang Chen and Hangjun Ye and Jingjing Liu and Jinqiao Wang and Xianyuan Zhan},
journal={arXiv preprint arXiv:2601.00898},
year={2026}
}

This project is licensed under the MIT License.
