Fix issue #5242 grad_norm and loss is nan #7171
Conversation
@Glaceon-Hyy, thanks for this PR. Is it possible to convert the repro into a unit test somewhere here?
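A hedged sketch of what such a unit test could look like (hypothetical test name and parametrization; it mirrors the clipping math from the repro in the PR description rather than calling DeepSpeed internals):

```python
import pytest
import torch


@pytest.mark.parametrize("total_norm", [float("nan"), float("inf"), 2.0])
def test_gradient_scale_is_finite_for_invalid_norm(total_norm):
    norm = torch.tensor(total_norm)
    # The fix maps invalid norms (NaN/Inf) to -1 before the clipping math.
    if norm.isinf() or norm.isnan():
        norm = torch.tensor(-1.0)
    clip = torch.clamp((norm / 1.0 + 1e-6) / 1.0, min=1.0)
    combined_scale = clip * 1.0
    # Gradients are multiplied by 1 / combined_scale, which must stay finite.
    assert torch.isfinite(1.0 / combined_scale)
```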
@Glaceon-Hyy, also do you know if setting
I noticed that in commit 61daaa1, even when total_norm produced a NaN instead of the expected -1, the clip calculation (total_norm / self.loss_scale + 1e-6) / self.clip_grad still resulted in NaN; the condition nan > 1 then evaluated to False, which coincidentally handled the invalid value. In commit 1ef9b02, however, torch.clamp(clip, min=1.0) introduced a new issue: when clip is NaN, torch.clamp() returns NaN unchanged. The NaN then propagates to combined_scale, so the subsequent gradient scaling grad.data.mul_(1. / combined_scale) produces NaN. My latest commit adds an explicit check that converts NaN values in clip to 1.0 before applying the clamp. This prevents NaN propagation while maintaining the desired gradient scaling behavior, keeping the computation numerically stable when total_norm becomes invalid.
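A minimal sketch of the check described above, using standalone hypothetical names rather than the exact DeepSpeed code in unscale_and_clip_grads:

```python
import torch


def scale_after_clip(total_norm, loss_scale=1.0, clip_grad=1.0):
    """Return the factor 1 / combined_scale used to rescale gradients."""
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    # torch.clamp passes NaN through unchanged, so map NaN to 1.0 ("no clipping")
    # before clamping; this keeps combined_scale finite when total_norm is invalid.
    clip = torch.where(clip.isnan(), torch.ones_like(clip), clip)
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    return 1.0 / combined_scale
```

With this check, a NaN total_norm yields a scale factor of 1.0 instead of NaN, so the gradients are left unscaled rather than being corrupted.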
Signed-off-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
Force-pushed from 972c8c0 to 632fefe.
@loadams I just force-pushed to fix DCO (Developer Certificate of Origin) issues in the commits. I noticed that my development environment, which uses Magit, did not have Signed-off-by configured by default.
Force-pushed from 632fefe to 49f38a1.
@nelyahu, FYI for any perf impact.
…edai#7171)" This reverts commit 1f70662. Signed-off-by: Nadav Elyahu <nelyahu@habana.ai>
@tjruwase @Glaceon-Hyy @loadams @hwchen2017 can you please review the fix in #7184?
This PR addresses a regression introduced in commit [61daaa1](deepspeedai@61daaa1) that affects gradient clipping when handling infinite values. The modified NaN/Inf handling logic in the total_norm calculation leads to unexpected behavior:

- Original logic ([v0.10.3](https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233)): converted both NaN and Inf to -1 before entering unscale_and_clip_grads.
- Post-commit behavior: when total_norm is Inf, inf_or_nan.logical_not() * total_norm produces NaN instead of 0, causing gradient clipping to fail (see the short illustration after this description).

Here is a minimal reproducible example comparing gradient clipping behavior across implementations.

```python
import torch
import numpy as np
import copy


def test(total_norm):
    test_old_deepspeed(total_norm)
    test_deepspeed(total_norm)
    test_torch(total_norm)
    test_deepspeed_fix(total_norm)


def test_old_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233
    if total_norm == float('inf') or total_norm == -float('inf') or total_norm != total_norm:
        total_norm = torch.tensor(float(-1))
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1848
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    combined_scale = loss_scale
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    if clip > 1:
        combined_scale = clip * loss_scale
    print(f"old_deepspeed: {1. / combined_scale}")


def test_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1710
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)
    err = torch.tensor(-1.0, dtype=torch.float)
    total_norm = inf_or_nan * err + inf_or_nan.logical_not() * total_norm
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed: {1. / combined_scale}")


def test_torch(total_norm_tensor):
    # https://github.com/pytorch/pytorch/blob/v2.6.0/torch/nn/utils/clip_grad.py#L155
    total_norm = copy.deepcopy(total_norm_tensor)
    max_norm = float(1.0)
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    print(f"torch: {clip_coef_clamped}")


def test_deepspeed_fix(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0, dtype=torch.float)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed_fix: {1. / combined_scale}")


if __name__ == '__main__':
    print("*****NAN*****")
    test(torch.tensor(float('nan')))
    print("*****INF*****")
    test(torch.tensor(float('inf')))
    print("*****positive*****")
    test(torch.tensor(float(2.0)))
```

Result: (screenshot of the console output for the NaN, Inf, and positive cases)

---------

Signed-off-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: yisheng <yi.sheng@intel.com>
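For reference, a short standalone illustration of why the masked expression yields NaN for an Inf norm (a minimal sketch, not the DeepSpeed code itself):

```python
import torch

total_norm = torch.tensor(float("inf"))
inf_or_nan = total_norm.isnan().logical_or(total_norm.isinf())  # tensor(True)
# logical_not() gives False, and False * inf is 0.0 * inf, which is NaN rather than 0:
print(inf_or_nan.logical_not() * total_norm)                                     # tensor(nan)
print(inf_or_nan * torch.tensor(-1.0) + inf_or_nan.logical_not() * total_norm)   # tensor(nan)
```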