Fix issue #5242 grad_norm and loss is nan #7171
Conversation
@Glaceon-Hyy, thanks for this PR. Is it possible to convert the repro into a unit test somewhere here?
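A hedged sketch of what such a unit test could look like (hypothetical test name and parametrization; it mirrors the clipping math from the repro in the PR description rather than calling DeepSpeed internals):

```python
import pytest
import torch


@pytest.mark.parametrize("total_norm", [float("nan"), float("inf"), 2.0])
def test_gradient_scale_is_finite_for_invalid_norm(total_norm):
    norm = torch.tensor(total_norm)
    # The fix maps invalid norms (NaN/Inf) to -1 before the clipping math.
    if norm.isinf() or norm.isnan():
        norm = torch.tensor(-1.0)
    clip = torch.clamp((norm / 1.0 + 1e-6) / 1.0, min=1.0)
    combined_scale = clip * 1.0
    # Gradients are multiplied by 1 / combined_scale, which must stay finite.
    assert torch.isfinite(1.0 / combined_scale)
```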
@Glaceon-Hyy, also do you know if setting
I noticed that in commit 61daaa1, even when total_norm produced a NaN instead of the expected -1, the clip calculation (total_norm / self.loss_scale + 1e-6) / self.clip_grad still resulted in NaN; the condition nan > 1 then evaluated to False, which coincidentally handled the invalid value. In commit 1ef9b02, however, torch.clamp(clip, min=1.0) introduced a new issue: when clip is NaN, torch.clamp() returns NaN unchanged. The NaN then propagates to combined_scale, so the subsequent gradient scaling grad.data.mul_(1. / combined_scale) produces NaN. My latest commit adds an explicit check that converts NaN values in clip to 1.0 before applying the clamp. This prevents NaN propagation while maintaining the desired gradient scaling behavior, keeping the computation numerically stable when total_norm becomes invalid.
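A minimal sketch of the check described above, using standalone hypothetical names rather than the exact DeepSpeed code in unscale_and_clip_grads:

```python
import torch


def scale_after_clip(total_norm, loss_scale=1.0, clip_grad=1.0):
    """Return the factor 1 / combined_scale used to rescale gradients."""
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    # torch.clamp passes NaN through unchanged, so map NaN to 1.0 ("no clipping")
    # before clamping; this keeps combined_scale finite when total_norm is invalid.
    clip = torch.where(clip.isnan(), torch.ones_like(clip), clip)
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    return 1.0 / combined_scale
```

With this check, a NaN total_norm yields a scale factor of 1.0 instead of NaN, so the gradients are left unscaled rather than being corrupted.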
Signed-off-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
Force-pushed from 972c8c0 to 632fefe.
@loadams I just force-pushed to fix DCO (Developer Certificate of Origin) issues in the commits. I noticed that my development environment, which uses Magit, did not have Signed-off-by configured by default.
Force-pushed from 632fefe to 49f38a1.
@nelyahu, FYI for any perf impact.
…edai#7171)" This reverts commit 1f70662. Signed-off-by: Nadav Elyahu <nelyahu@habana.ai>
@tjruwase @Glaceon-Hyy @loadams @hwchen2017 can you please review the fix in #7184?
This PR addresses a regression introduced in commit [61daaa1](deepspeedai@61daaa1) that affects gradient clipping when handling infinite values. The modified NaN/Inf handling logic in the total_norm calculation leads to unexpected behavior:

- Original logic ([v0.10.3](https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233)): converted both NaN and Inf to -1 before entering unscale_and_clip_grads.
- Post-commit behavior: when total_norm is Inf, inf_or_nan.logical_not() * total_norm produces NaN instead of 0, causing gradient clipping to fail (see the short illustration after this description).

Here is a minimal reproducible example comparing gradient clipping behavior across implementations.

```python
import torch
import numpy as np
import copy


def test(total_norm):
    test_old_deepspeed(total_norm)
    test_deepspeed(total_norm)
    test_torch(total_norm)
    test_deepspeed_fix(total_norm)


def test_old_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233
    if total_norm == float('inf') or total_norm == -float('inf') or total_norm != total_norm:
        total_norm = torch.tensor(float(-1))
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1848
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    combined_scale = loss_scale
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    if clip > 1:
        combined_scale = clip * loss_scale
    print(f"old_deepspeed: {1. / combined_scale}")


def test_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1710
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)
    err = torch.tensor(-1.0, dtype=torch.float)
    total_norm = inf_or_nan * err + inf_or_nan.logical_not() * total_norm
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed: {1. / combined_scale}")


def test_torch(total_norm_tensor):
    # https://github.com/pytorch/pytorch/blob/v2.6.0/torch/nn/utils/clip_grad.py#L155
    total_norm = copy.deepcopy(total_norm_tensor)
    max_norm = float(1.0)
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    print(f"torch: {clip_coef_clamped}")


def test_deepspeed_fix(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0, dtype=torch.float)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed_fix: {1. / combined_scale}")


if __name__ == '__main__':
    print("*****NAN*****")
    test(torch.tensor(float('nan')))
    print("*****INF*****")
    test(torch.tensor(float('inf')))
    print("*****positive*****")
    test(torch.tensor(float(2.0)))
```

Result: (screenshot of the console output for the NaN, Inf, and positive cases)

---------

Signed-off-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: yisheng <yi.sheng@intel.com>
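For reference, a short standalone illustration of why the masked expression yields NaN for an Inf norm (a minimal sketch, not the DeepSpeed code itself):

```python
import torch

total_norm = torch.tensor(float("inf"))
inf_or_nan = total_norm.isnan().logical_or(total_norm.isinf())  # tensor(True)
# logical_not() gives False, and False * inf is 0.0 * inf, which is NaN rather than 0:
print(inf_or_nan.logical_not() * total_norm)                                     # tensor(nan)
print(inf_or_nan * torch.tensor(-1.0) + inf_or_nan.logical_not() * total_norm)   # tensor(nan)
```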