# Rewards as Labels: Revisiting RLVR from a Classification Perspective
**Version Requirement**: ms-swift>4.0
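[Rewards as Labels: Revisiting RLVR from a Classification Perspective](https://arxiv.org/abs/2602.05630) proposes, for GRPO, treating rewards as labels and classifying within each group instead of computing advantages, which turns the policy optimization problem into a classification problem. This addresses the **gradient misassignment** of positive samples and the **gradient domination** of negative samples in the GRPO loss.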
# Rewards as Labels: Revisiting RLVR from a Classification Perspective
**Version Requirement**: ms-swift>4.0
[Rewards as Labels: Revisiting RLVR from a Classification Perspective](https://arxiv.org/abs/2602.05630) proposes a reformulation of GRPO by treating rewards as labels and performing **in-group classification** instead of advantage estimation. This converts the policy optimization problem into a classification problem, thereby addressing two key issues in the GRPO loss:

1. **Gradient Misassignment (Positive Samples)**:
   For positive samples, the gradient magnitude shrinks as the relative log-probability $s$ decreases.
   This is counterintuitive: correct tokens that the model is less confident about should receive larger updates. Instead, GRPO gives more weight to tokens it is already confident about, so under-trained tokens receive too little learning signal.
2. **Gradient Domination (Negative Samples)**:
   For negative samples, the gradient magnitude grows exponentially as $s$ increases.
   As a result, a few overconfident incorrect tokens dominate the gradient and overwhelm the other negative signals in the same group. Because this growth has no upper bound, it can produce unstable, excessively large parameter updates.
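
The two behaviours above can be reproduced numerically with a small sketch. It assumes that $s$ is the log of the policy ratio and that the loss is the standard PPO-style clipped surrogate with `eps = 0.2`. Both are assumptions made for illustration, and `grpo_grad_magnitude` is a hypothetical helper, not an ms-swift function.

```python
import math


def grpo_grad_magnitude(s: float, advantage: float, eps: float = 0.2) -> float:
    """|d/ds| of the clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A), r = exp(s).

    Assumption: s is the log policy ratio and eps is the usual clip range.
    """
    ratio = math.exp(s)
    # The clipped branch is active (zero gradient) only on one side per sign of A.
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    # Otherwise the gradient of r*A with respect to s is A * exp(s).
    return abs(advantage) * ratio


for s in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    pos = grpo_grad_magnitude(s, advantage=1.0)
    neg = grpo_grad_magnitude(s, advantage=-1.0)
    print(f"s={s:+.1f}  positive={pos:8.3f}  negative={neg:8.3f}")
```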
To address the above issues, REAL treats rewards directly as labels and performs **group-wise classification training**.
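
To give some intuition for why a classification view can help, the sketch below uses a plain logistic (binary cross-entropy) loss with $s$ as the logit and the reward as a 0/1 label. This is a generic stand-in chosen for illustration, not the REAL loss from the paper, which this section does not reproduce; `logistic_grad_magnitude` is a hypothetical helper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_grad_magnitude(s: float, label: int) -> float:
    """|d/ds| of binary cross-entropy with logit s, which equals |sigmoid(s) - label|.

    Assumption: label is 1 for a positive (rewarded) sample and 0 for a negative one.
    """
    return abs(sigmoid(s) - label)


for s in (-3.0, -1.0, 0.0, 1.0, 3.0):
    pos = logistic_grad_magnitude(s, label=1)
    neg = logistic_grad_magnitude(s, label=0)
    print(f"s={s:+.1f}  positive={pos:.3f}  negative={neg:.3f}")
```

Under this stand-in loss, positive samples with low $s$ get the largest gradients, and the gradient for negative samples never exceeds 1, which is the opposite of the two GRPO behaviours listed above.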