fix(models): audit fp32 protected tensors #2598

Open: yuhezhang-ai wants to merge 8 commits into main from yuhez/fix/fp32-protected-tensor-audit

yuhezhang-ai commented Jun 16, 2026

Summary

  • Audit the model-specific fp32-protected tensors tracked in #2570 ("Audit model-specific fp32 protected tensors during dtype casting") and keep the remaining model-specific buffers/params protected through dtype casts.
  • Preserve MoE gate/rotary fp32 buffers for MiniMax, HY, Qwen3-Omni, and Qwen3-VL paths.
  • Move Nemotron V3 Mamba A_log/dt_bias/D into fp32 holder modules, with HF-compatible state-dict adapter routing for the public checkpoint keys.
  • Keep callable fp32 holders usable under real FSDP2 by materializing full tensors from FSDP DTensor params and keeping strict holder subtrees unresharded during the parent forward.
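
For reviewers, the state-dict adapter routing in the third bullet can be pictured as a pair of key renames. This is a minimal sketch, not the adapter in this PR: the `_fp32_params` name is borrowed from the holder pattern mentioned below, while the key layout, the example prefix, and the helpers `to_model_key` and `to_hf_key` are hypothetical.

```python
# Hypothetical layout (an assumption, not taken from the PR):
#   HF checkpoint key:  "<prefix>.A_log"
#   in-model key:       "<prefix>._fp32_params.A_log"
PROTECTED = ("A_log", "dt_bias", "D")
HOLDER = "_fp32_params"


def to_model_key(hf_key: str) -> str:
    """Route a public checkpoint key to the holder subtree (load direction)."""
    prefix, _, leaf = hf_key.rpartition(".")
    already_routed = prefix.endswith("." + HOLDER)
    if leaf in PROTECTED and prefix and not already_routed:
        return f"{prefix}.{HOLDER}.{leaf}"
    return hf_key


def to_hf_key(model_key: str) -> str:
    """Map a holder key back to the public checkpoint key (save direction)."""
    prefix, _, leaf = model_key.rpartition(".")
    holder_suffix = "." + HOLDER
    if leaf in PROTECTED and prefix.endswith(holder_suffix):
        return f"{prefix[: -len(holder_suffix)]}.{leaf}"
    return model_key


if __name__ == "__main__":
    key = "backbone.layers.0.mixer.A_log"  # illustrative prefix only
    print(to_model_key(key))
    print(to_hf_key(to_model_key(key)))
```

The point of routing in both directions is that the public checkpoint keys stay unchanged even though the tensors now live in a separate module.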

Why

model.torch_dtype=bf16 and broad dtype casts can round small tensors that are part of a model's numerical/checkpoint contract. Trainable fp32 params that live directly on mixed modules also cannot be isolated cleanly by FSDP, so this follows the _fp32_params holder pattern used by the stacked Qwen GDN branch.
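
To make the rounding concern concrete, here is a small standard-library sketch of what a bf16 cast does to a single fp32 value. The helper `round_to_bf16` and the sample value are illustrative assumptions, not code or data from this PR, and the helper does not handle NaN, infinity, or overflow.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 (ties to even) and return it as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # fp32 bit pattern
    # bf16 keeps the top 16 bits of fp32. Adding this bias before masking
    # turns plain truncation into round-to-nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    value = 0.001  # illustrative small parameter value
    rounded = round_to_bf16(value)
    print(value, rounded, abs(rounded - value) / value)
```

bf16 keeps only 8 significant bits, so the relative error of a single cast can approach 2^-8. For most weights that is acceptable, but it is easy to see why a small tensor that is part of a numerical or checkpoint contract should not round-trip through it.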

The real 2-GPU FSDP smoke exposed one more holder-specific issue: a callable holder that returns its parameter can hand the parent module a sharded DTensor. The fix keeps storage sharded/fp32, but returns the full fp32 tensor value to the caller.
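
The two holder behaviours described above (surviving broad dtype casts, and returning a full tensor rather than a sharded `DTensor`) can be sketched as follows. This is an assumption-laden illustration, not the implementation in this PR: the class name `Fp32Holder`, its `value` attribute, and the `_apply` override are hypothetical, and the `DTensor` branch is not exercised without an initialized device mesh.

```python
import torch
from torch import nn

try:  # public import path in recent PyTorch releases
    from torch.distributed.tensor import DTensor

    _DTENSOR_TYPES = (DTensor,)
except ImportError:  # older releases or builds without this module
    _DTENSOR_TYPES = ()


class Fp32Holder(nn.Module):
    """Hypothetical holder: owns one fp32 parameter and returns it when called."""

    def __init__(self, init: torch.Tensor):
        super().__init__()
        self.value = nn.Parameter(init.detach().clone().to(torch.float32))

    def _apply(self, fn, *args, **kwargs):
        # Wrap the conversion function so device moves still happen but
        # fp32 tensors in this subtree are never cast to another dtype.
        def keep_fp32(t: torch.Tensor) -> torch.Tensor:
            out = fn(t)
            if t.dtype == torch.float32 and out.dtype != torch.float32:
                return t.to(device=out.device)
            return out

        return super()._apply(keep_fp32, *args, **kwargs)

    def forward(self) -> torch.Tensor:
        value = self.value
        if isinstance(value, _DTENSOR_TYPES):
            # Storage stays sharded; the caller receives the gathered tensor.
            value = value.full_tensor()
        return value


if __name__ == "__main__":
    parent = nn.Module()
    parent.proj = nn.Linear(4, 4)
    parent.a_log = Fp32Holder(torch.tensor([0.1234567]))
    parent.to(torch.bfloat16)
    print(parent.proj.weight.dtype, parent.a_log().dtype)
```

Keeping the parameter in its own module is what lets a sharding policy treat it separately from the bf16 weights of the parent.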

Notes

This PR is stacked on yuhez/fix/qwen-gdn-fp32-precision. After that branch lands, this should be rebased onto main.

Validation

  • uv run --no-sync pytest tests/unit_tests/distributed/test_parallelizer_utils.py tests/unit_tests/distributed/test_fp32_compute_contract.py -q (25 passed)
  • focused GPT-OSS/Nemotron/DeepSeek holder tests (30 passed)
  • uv run --no-sync pytest tests/unit_tests/models/nemotron_v3/test_nemotron_v3_state_dict_adapter.py tests/unit_tests/distributed/test_fp32_compute_contract.py -q (30 passed)
  • focused Nemotron V3 Mamba/CP dtype tests (9 passed)
  • focused GPT-OSS holder/state-dict tests (5 passed, 1 skipped)
  • focused fp32 protected dtype regression suite (11 passed)
  • 2-GPU H100 Slurm smoke: fp32-holder-fsdp-smoke-c938fb8-r3, job 12863187, exit 0:0
  • ruff check on touched files
  • git diff --check

copy-pr-bot Bot commented Jun 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa added the r0.5.0 label (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) on Jun 16, 2026

akoumpa commented Jun 16, 2026

/nvskills-ci

yuhezhang-ai commented

/ok to test 704a39c

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
yuhezhang-ai force-pushed the yuhez/fix/fp32-protected-tensor-audit branch from b2ec859 to b035704 on June 16, 2026 21:00
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
yuhezhang-ai commented

/ok to test 9e1db7b

yuhezhang-ai commented

/nvskills-ci

yuhezhang-ai commented

/claude review

assert original_fn is not None, "apply_fsdp2_sharding_recursively not found in module globals"

def _fsdp_by_dtype(module, mesh, mp_policy, offload_policy=None, *args, **kwargs):
def _fsdp_by_dtype(

LGTM — clean PR. The fp32 holder pattern, reshard threading, and state-dict adapter routing all look correct. Good test coverage across models and distributed paths.

yuhezhang-ai marked this pull request as ready for review on June 16, 2026 22:49
yuhezhang-ai requested a review from a team as a code owner on June 16, 2026 22:49
Development

Successfully merging this pull request may close these issues.

Audit model-specific fp32 protected tensors during dtype casting

2 participants