
ci: Update transformers to latest version 5.12.1 #2632

Open
svcnvidia-nemo-ci wants to merge 35 commits into main from transformers_bump_5.12.1


@svcnvidia-nemo-ci


beep boop 🤖: Updating transformers to latest version on pypi

svcnvidia-nemo-ci and others added 30 commits June 9, 2026 12:04
ci: update package version to 0.5.0 (#2472)

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
feat: make mesh accept meshcontext (#2266)

* make mesh accept meshcontext



* fix(transformers): resolve mesh context inputs



* rm



* use moe_overrides



* make create_mesh_context the entry point for dist setup



* fix: add renamed distributed utility files



* fix(vlm): complete dist_setup -> mesh_context rename

Two leftover references to the old setup_distributed/dist_setup API were
missed when the recipe was migrated to create_mesh_context_from_config:

- nemo_automodel/recipes/vlm/finetune.py:794 still read
  self.dist_setup.cp_size, which would raise AttributeError on any PP+CP VLM run.
- tests/unit_tests/recipes/test_finetune_vlm_cp_wiring.py monkeypatched
  the stale symbol "setup_distributed", causing three parametrizations of
  test_setup_skips_pp_media_prechunk_when_cp_preembeds_vlm_inputs to fail
  during pytest setup with AttributeError.



* remove activation checkpointing from meshcontext



* refac



* dedup



* fix



* fix



* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py





* Update skills/distributed-training/SKILL.md





* Update skills/distributed-training/SKILL.md





* Update nemo_automodel/components/distributed/config.py





* Update skills/nemo-automodel-distributed-training/SKILL.md



* docs(distributed): update setup API example



* Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py



* fix(diffusers): remove duplicated DistributedSetup.build call in fsdp2 path

Commit 6acf01f left a duplicated `distributed_setup = DistributedSetup.build(`
line and a stray extra `)` in the fsdp2 branch of the parallelism-manager
builder. This broke `ruff format --check` (CI linting job) and, once formatted,
would have nested the call as `DistributedSetup.build(distributed_setup=...)` —
a kwarg the builder does not accept. Restore the single build() call matching
the ddp branch.



* fix(vlm): pass distributed_setup through Gemma4 joint drafter composite

The distributed-config API refactor routes all distributed settings through a
single ``distributed_setup`` argument and rejects the separate ``moe_mesh`` /
``distributed_config`` / ``pipeline_config`` kwargs in
``NeMoAutoModel*.from_pretrained``. ``Gemma4WithDrafter.from_pretrained`` was
still forwarding those separate kwargs to its inner base/drafter loads, so the
joint-drafter VLM finetune (L2_HF_Transformer_VLM_Gemma4_Joint_Drafter) raised
``TypeError: Distributed settings must be passed with distributed_setup``.

Forward ``distributed_setup`` to both sub-loaders instead, and extend the
pipeline/context-parallel safety guards to read pp_size/cp_size from the
resolved setup so the KV-sharing invariant holds on the recipe path too.




* feat: add use_memory_efficient_lora knob (#2239)

* add use_memory_efficient_lora knob



* add use_memory_efficient_lora



* fix



* delete sft gpt-oss 20b single gpu



* add nemotron nano v3 single-gpu lora example



* add grad ckpt



* add fused lora mlp



* fix(checkpoint): support single-GPU Nemotron-H MoE LoRA checkpoint load

Loading a merged-expert Nemotron-H MoE checkpoint through the default DCP / set_model_state_dict path transiently materializes a second on-device copy of the expert weights, which OOMs a 30B-class model on a single 80GB GPU.

Checkpointer.load now routes single-device custom-model safetensors through the frugal full-state path (load to CPU, merge from_hf on CPU, copy into the model), keeping device memory at ~model size.

_load_full_state_dict_into_model normalizes stray real (CPU) buffers left behind by custom-model meta materialization onto the parameter device (avoids 'Multiple devices found'), and uses plain load_state_dict for non-DTensor models so the full state dict is not moved on-device a second time.

Adds a [nemotron-singlegpu-lora] note plus per-site tags documenting these single-device special cases, links the exercising recipe (examples/llm_finetune/nemotron/nemotron_nano_v3_singlegpu_lora.yaml), and flags the load path for a future refactor.



* feat(peft): add fused LoRA SwiGLU/ReLU² MLP with recompute backward

Fuses gate+up+down+activation into a single autograd Function that saves only (x, gate_out, up_out) and recomputes the activation and down-projection input in backward, roughly halving MLP activation memory at equal speed during LoRA SFT.

SwiGLU forward/backward use elementwise Triton kernels (with in-place backward buffer reuse) and a pure-torch fallback when Triton is unavailable; matmuls stay on cuBLAS. Covers SiLU-SwiGLU (gate/up/down) and non-gated ReLU² (e.g. Nemotron-H dense) MLPs.

install_fused_lora_mlp() swaps each LoRA-applied MLP's forward and falls back to the per-linear path at runtime under DTensor (TP/EP), DoRA, or active dropout, keeping it correct under sharding. Already wired from lora.py; opt out via NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP=1.

Activation recompute follows Megatron-Core's SwiGLUFunction; the fused LoRA-MLP and in-place buffer reuse follow Unsloth's LoRA_MLP (both Apache-2.0).



* refactor(peft): drop NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP env knob

The fused LoRA MLP can already be disabled via the use_memory_efficient_lora
config flag, and fusion auto-falls-back per-MLP under DTensor / DoRA / active
dropout. The env var was a redundant escape hatch; remove it and the now-unused
os import.



* test(checkpoint): align custom-model load-routing guard with single-device fast path

The nemotron-singlegpu-lora change routes single-device (world_size == 1) custom
safetensors models through the frugal full-state fast path instead of DCP. The fast
path now applies the state_dict_adapter from_hf conversion on CPU
(_maybe_adapt_state_dict_from_hf), so custom MoE expert merging still happens — the
guard test's original premise (fast path bypasses conversion) no longer holds.

- Reframe test_custom_model_skips_fast_path_uses_dcp as the multi-rank (sharded)
  case (WORLD_SIZE=2), where DCP per-rank DTensor slicing is genuinely required.
- Add test_single_device_custom_model_uses_fast_path covering the new world_size==1
  behavior (fast path used, DCP not).



---------



* fix(deepseek_v3): initialize weights in fp32 and default router to fp32 (#2450)

* fix(deepseek_v3): init weights in fp32, default router to fp32

Sampling the random init directly in bf16 distorts the variance/mean schedule and
produces exploding first-step gradients (flat/diverging loss) for from-scratch
pretraining. Add an init_weights_in_fp32 context manager that samples in fp32 and
casts back to the resident dtype, and use it in DeepSeek-V3 initialize_weights.
Also default the router (gate_precision) to fp32 to match the HF reference.



* refactor(models): rename init_weights_in_fp32 to yield_fp32_model

Generalize the context manager per review: it's a generic "run this block with
the model in fp32" tool, not init-specific. Yield the model and make the exit
dtype optional (defaults to the model's pre-context float dtype).



---------



* fix(multimodal): migrate finetune recipe to DistributedSetup/MeshContext API

The auto-class-public-api refactor deleted `recipes/_dist_setup.py` and moved recipes
to the `DistributedSetup` / `MeshContext` API, but `multimodal/finetune.py` was left
importing and calling the deleted `_dist_setup.setup_distributed`, so importing the
module raised ModuleNotFoundError and broke the import-check in every Pip/UV install job.

Migrate it to the shared `_distributed_setup_attributes(create_distributed_setup_from_config(...))`
pattern used by the llm/vlm recipes: unpack distributed_setup / mesh_context /
distributed_config / device_mesh / moe_mesh / pp_enabled / pipeline_config /
moe_parallel_config / activation_checkpointing, and update the model-build calls
(`mesh=self.mesh_context`, `moe_config`/`cfg_moe=self.moe_parallel_config`,
`activation_checkpointing=self.activation_checkpointing`).




* feat(speculative): add EAGLE-3 sequence packing and reasoning-mode control (#2444)

* feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training

Add --reasoning {none,save,disable} flag to regenerate.py for controlling
whether target model reasoning content is preserved or suppressed during
data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash
training recipes to exclude reasoning traces from the loss mask.





* feat(speculative): add EAGLE-3 sequence packing for draft training

Pack variable-length chat samples into fixed-width rows for EAGLE-3
training, removing the per-sample padding waste of the default
max_length path. Documents within a row attend block-causally: the
target uses a 4D block-causal mask (SDPA) and the draft uses varlen
FlashAttention-2; cross-document TTT supervision is gated by
doc_remaining so deeper steps never leak across boundaries. Opt-in via
packed_sequence_size > 0, colocated target backend only. Covered by
unit tests plus an FA2-vs-eager parity test.
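
The block-causal attention pattern described above can be sketched in plain Python, assuming a packed row is described only by its per-document lengths. The name `block_causal_mask` and the nested-list representation are illustrative and not taken from this PR; the recipe itself is described as building a 4D tensor mask for SDPA.

```python
def block_causal_mask(doc_lengths):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Hypothetical helper: a token attends causally (j <= i) and only to
    tokens of its own document inside the packed row.
    """
    # doc_id[t] is the index of the document that token t belongs to.
    doc_id = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_id)
    return [
        [doc_id[i] == doc_id[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


# Two documents of length 2 and 3 packed into one row of width 5.
mask = block_causal_mask([2, 3])
```

Token 2 (the first token of the second document) cannot see tokens 0 and 1, which is the property that keeps supervision from leaking across document boundaries.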





---------






* feat(distributed): add selective activation checkpointing for FSDP2 (#2389)

* feat(distributed): add selective activation checkpointing for FSDP2



* fix(distributed): support selective activation checkpointing with torch.compile



* docs(fern): drop selective AC from frozen v0.4 snapshot



* feat(distributed): honor selective activation checkpointing on single GPU



* feat(moe): support selective activation checkpointing with expert parallelism



* fix(model): make DeepSeek MLP dispatch wrapper-safe



* fix(distributed): save expert grouped-GEMM in selective AC and add op trace



* feat(moe): compile selective activation checkpointing wrappers outer



* refactor(distributed): move selective AC into its own module

Extract the TorchTitan-style selective activation checkpointing core out of
the central parallelizer.py into a dedicated activation_checkpointing.py:
op-set construction, the save/recompute policy, block/sub-module wrappers,
KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py
keeps only the thin apply_selective_activation_checkpointing entry point,
which still needs the heavy, transformers-aware _extract_model_layers, so the
dependency stays one-directional (parallelizer -> activation_checkpointing ->
parallelizer_utils) with no circular imports.

Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into
parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a
single call site instead of trace globals plus a helper.

Make the new module's cross-module interface public (drop the leading
underscore) and keep internal op-resolution/plumbing private. Update the moe
and fsdp2 consumers and the unit tests to import from the new module.

Also fix doc wording: clarify that torch.compile must be held fixed when
comparing full vs. selective, and refer to TorchTitan as a reference
implementation rather than "upstream".



* refactor(distributed): move selective-AC trace into the AC module



* test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split



* docs: apply tech-writer edits to gradient-checkpointing guide



---------



* feat(diffusion): improve qwen image finetuning configs (#2442)




* ci: add nemo-run, split qwen-vl-utils from decord for arm (#2456)

* ci: add nemo-run, split qwen-vl-utils from decord for arm




* fix: override in pytorch container




* Update uv lock



---------






* Apply suggestions from code review




* fix(precision): dtype contract bug fixes for FSDP2 mixed-dtype loads (#2419)

* fix(transformers): unify loaded HF dtype via promote_types

Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to
the checkpoint dtype, unify each floating tensor to promote_types(checkpoint,
requested). This honors an explicit fp32 request while preserving
intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is
a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on
HF mixed-dtype loads.



* feat(distributed): default pipeline dtype to FSDP activation dtype

When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from
the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to
param_dtype) so pipeline stage shape inference matches the real activation dtype
(e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is
honored but warned on mismatch, since it can corrupt inter-stage recv buffers.
No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1.
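
A minimal sketch of the defaulting rule described above, assuming dtypes are represented as plain strings. `SketchMpPolicy` and `resolve_pipeline_dtype` are hypothetical names used only for illustration; they are not the project's API.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMpPolicy:
    """Hypothetical stand-in for an FSDP mixed-precision policy."""

    param_dtype: Optional[str] = None
    output_dtype: Optional[str] = None


def resolve_pipeline_dtype(pipeline_dtype, mp_policy, pp_size):
    """Pick the dtype used for pipeline stage shape inference (illustrative)."""
    # No pipeline parallelism or no policy: leave the setting untouched.
    if pp_size == 1 or mp_policy is None:
        return pipeline_dtype
    # Activation dtype: output_dtype, falling back to param_dtype.
    activation_dtype = mp_policy.output_dtype or mp_policy.param_dtype
    if pipeline_dtype is None:
        return activation_dtype
    # An explicit value is honored, but a mismatch is worth a warning.
    if activation_dtype is not None and pipeline_dtype != activation_dtype:
        warnings.warn(
            f"pipeline.dtype={pipeline_dtype} differs from activation dtype "
            f"{activation_dtype}"
        )
    return pipeline_dtype


derived = resolve_pipeline_dtype(None, SketchMpPolicy(param_dtype="bf16"), pp_size=4)
```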


(cherry picked from commit 3f6b246)

* refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage

fully_shard_by_dtype now groups parameters by their required *compute* dtype
instead of their storage dtype, so fp32 master weights (uniform fp32 storage)
still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32
params keep fp32 compute.

Per-parameter compute dtype is resolved by precedence: pinned fp32
(_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each
tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype.
Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the
NemotronH and Qwen3.5 strategies thread the declaration through.
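
The precedence chain above can be sketched as a small lookup. This assumes dtypes are plain strings and that pinned modules are matched by substring on the parameter name; both are simplifications for illustration, and `resolve_compute_dtype` is a hypothetical name.

```python
def resolve_compute_dtype(name, pinned_fp32, recorded_dtypes, param_dtype):
    """Resolve one parameter's compute dtype by precedence (illustrative).

    name:            fully qualified parameter name
    pinned_fp32:     names that must compute in fp32 (assumed substring match)
    recorded_dtypes: dtype recorded per parameter at checkpoint load time
    param_dtype:     the mixed-precision policy's default compute dtype
    """
    # 1. A pinned fp32 declaration always wins.
    if any(pin in name for pin in pinned_fp32):
        return "fp32"
    # 2. Otherwise use the dtype recorded from the checkpoint, if any.
    if name in recorded_dtypes:
        return recorded_dtypes[name]
    # 3. Otherwise fall back to the policy's param_dtype.
    return param_dtype


pins = {"A_log"}
recorded = {"layers.0.norm.weight": "fp32"}
a_log = resolve_compute_dtype("layers.0.linear_attn.A_log", pins, recorded, "bf16")
norm = resolve_compute_dtype("layers.0.norm.weight", pins, recorded, "bf16")
bulk = resolve_compute_dtype("layers.0.mlp.up_proj.weight", pins, recorded, "bf16")
```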


(cherry picked from commit 3dd6b97)

* docs(model-onboarding): document _keep_in_fp32_modules_strict contract

Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/
dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare
them (class attribute vs patch_hf_model instance attribute), and why the pin is
the robust signal across all load paths. Broaden the MoE checklist item and
code comment accordingly.


(cherry picked from commit a11db38)

* test(distributed): add fp32 compute-dtype contract test

Assert the resident compute dtype of every trainable parameter across the model
archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering
the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype,
under fp32 master weights and ordinary loads.


(cherry picked from commit dc83926)

* feat(model): cast frozen modules to compute dtype to avoid mismatch


(cherry picked from commit d321f5e)

* refactor(gemma4): drop projector dtype hook now general frozen cast handles it


(cherry picked from commit 1bc67e2)

* feat(training): add dormant resolve_storage_dtype helper

Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype
to fp32 for full-parameter torch.optim training. Not yet wired into recipes here;
the call sites are marked with breadcrumb comments and enabled in a follow-up PR,
keeping this PR limited to dtype bug fixes with no behavior/memory change.



* fix(model): cast frozen-module buffers and unsharded params to compute dtype



* docs(infra): correct frozen-tower FSDP comment to match sharding reality



* docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs



* docs(mixed-precision): apply tech writer edits



* docs(mixed-precision): drop unresolvable FSDP anchor



---------



* docs(speculative): add subsystem README, fold in regeneration guide (#2448)

Add examples/speculative/README.md covering the whole speculative-decoding
draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE,
DFlash), target-model registry coverage, compute backends (eager vs
flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy,
d2t/t2d draft-vocab compression), target backends (co-located, remote,
offline cache), serving and benchmarking, inference-engine compatibility,
and a consolidated config reference.

Fold the standalone regenerate_with_target.md into the README's data
preparation section (full two-step flow, tuning table, pitfalls) and remove
the separate file so there is a single entry point.



* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support (#2284)

* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support



* fix the memory management for training large 14B wan model

* fix wan2.2 support

* all good for wan2.2

* update



* docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry



* fix another round of code review

* fix(diffusion): sort wan.py imports to satisfy CI isort (I001)



* fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory

Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage
inference) by loading EMA/consolidated state dicts with map_location="cpu";
load_state_dict copies into the already-on-device parameters.



---------





* test: include find_unused_parameters in ddp manager args expectation

The DDP strategy config exposes find_unused_parameters (default False),
so _build_diffusion_parallel_manager_args returns it in the ddp branch.
Update the test's expected dict to match, fixing the L0 unit test failure.




* fix(distributed): address Claude review comments

- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config)
  to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset,
  so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references —
  MeshContext no longer holds strategy_config/pipeline_config/moe_config and
  STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now
  lives in components/distributed/config.py.




---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
Co-authored-by: khazzz1c <khazzz1c@gmail.com>
Co-authored-by: thyways <2484113689@qq.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
feat(vlm): enable Qwen3.5 MoE VLM CP (#2432)

* feat(vlm): enable Qwen3.5 MoE VLM CP



* test(vlm): cover Qwen3.5 MoE VLM CP changes

Add unit coverage for the new/changed code paths in PR #2432:
- cp_utils: opt-in seq_index CP buffer, singleton expansion, arange-continued padding
- Qwen3_5MoeBlock.forward: seq_index threading into linear_attn, stripping on full-attn path
- prepare_model_inputs_for_cp / _pre_embed_only dispatch and text-only forward path
- PreTokenizedDatasetWrapper inject_fake_images gating + build_dataloader passthrough
- _run_validation_epoch: total_tokens not summed over CP ranks




* style(vlm): sort imports in qwen3_5_moe model.py

Fixes ruff I001 (unsorted import block) flagged by CI linting:
`import inspect` was added above `import copy`.




* refactor(qwen): keep CP seq index out of cp utils



* rename qwen medpix cp2 config



* test(qwen): align CP seq-index tests with cp_linear_attn refactor

The "keep CP seq index out of cp utils" refactor moved seq_index handling
out of make_cp_batch_and_ctx and prepare_model_inputs_for_cp into
CPAwareGatedDeltaNet. Update tests accordingly:
- drop obsolete seq_index buffer/padding tests from test_cp_utils
- prepare_model_inputs_for_cp now returns only inputs_embeds + position_ids
- rewrite TestExtractLocalPositions -> TestExtractLocalSeqIndex for the new
  _extract_local_seq_index signature
- add coverage for _build_dual_chunk_local_positions (DualChunkSwap layout)




---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(model): flux2 (#2145)

* flux2 init draft

* update

* fix(diffusion): revert flux1 example and add flux2 inference config

- Revert accidental FLUX.2 changes in flux_t2i_flow.yaml back to FLUX.1-dev
- Add examples/diffusion/generate/configs/generate_flux2.yaml for FLUX.2-dev inference




* fix(diffusion): fix flux2 contiguity and text encoder eval

- Add .contiguous() after permute in _pack_latents and _unpack_latents
  so hidden_states is always contiguous before flash-attention kernel
- Call pipeline.text_encoder.eval() after device placement, consistent
  with FluxProcessor, WanProcessor, and QwenImageProcessor




* feat(diffusion): sync flux2 configs with main performance fields

Add optimizer flags (foreach/fused), performance block, FSDP2 prefetch
tuning, save_checkpoint_every_epoch, and save_consolidated=final to
flux2_t2i_flow.yaml and flux2_t2i_flow_lora.yaml to match the fields
added to flux_t2i_flow.yaml in main.




* fix(diffusion): fix flux2 cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with
a per-sample Bernoulli mask so each sample independently has
cfg_dropout_prob chance of receiving zeroed text embeddings. Also drop
the now-unused `import random`.
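
To illustrate the difference between a per-batch gate and a per-sample mask, here is a dependency-free sketch. It uses nested lists in place of tensors and `random.Random` only to stay self-contained; the function name `cfg_dropout` and its signature are assumptions, and the adapter in this PR is described as using a tensor Bernoulli mask instead.

```python
import random


def cfg_dropout(text_embeds, cfg_dropout_prob, rng):
    """Zero each sample's text embedding independently (illustrative).

    One Bernoulli draw is made per sample, so samples in the same batch
    are dropped or kept independently of each other.
    """
    out = []
    for emb in text_embeds:
        drop = rng.random() < cfg_dropout_prob  # independent draw per sample
        out.append([0.0] * len(emb) if drop else list(emb))
    return out


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
dropped = cfg_dropout(batch, cfg_dropout_prob=0.5, rng=random.Random(0))
```

With a single batch-level gate, every row of `dropped` would be zeroed or kept together; here each row is decided on its own.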




* fix(diffusion): fix flux cfg dropout to apply per-sample not per-batch

Replace single random.random() gate (correlated across entire batch) with
a per-sample Bernoulli mask so each sample independently has
cfg_dropout_prob chance of receiving zeroed text/pooled embeddings.
Also drop the now-unused `import random`.




* test(diffusion): add unit tests for Flux2Adapter and Flux2Processor

- tests/unit_tests/flow_matching/test_flux2_adapter.py: 36 tests covering
  pack/unpack roundtrip + contiguity, 4D positional IDs (img_ids/txt_ids)
  shape/dtype/value correctness, prepare_inputs keys/shapes/normalization/
  CFG dropout, and forward model call kwargs
- tests/unit_tests/diffusion_processors/test_flux2_processor.py: 22 tests
  covering model_type/default_model_name properties, encode_image BN
  normalization + dtype + squeeze, encode_text Mistral3 args + no-clip keys,
  verify_latent shape/NaN/Inf checks, get_cache_data structure, and
  ProcessorRegistry lookup




---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(config): glm4.7 yaml (#2527)

fix(recipe): use distributed.moe schema for glm_4.7_flash packed-seq

The glm_4.7_flash_te_packed_sequence recipe used a top-level `moe_config:`
block with `_target_: nemo_automodel.components.moe.config.MoEParallelizerConfig`,
but that class lives at `nemo_automodel.components.distributed.config`
(moe.config has no such attribute). The config loader eagerly resolves every
`_target_`, so this raised at config-load time, before training started:

  AttributeError: module 'nemo_automodel.components.moe.config'
                  has no attribute 'MoEParallelizerConfig'

The top-level `moe_config:` block is not a supported recipe key: MoE
parallelizer settings are read from `distributed.moe` and activation
checkpointing from `distributed.activation_checkpointing` (see
recipes/_dist_utils.py:parse_distributed_section). Every other MoE recipe
(e.g. glm_4.7_flash_te_deepep.yaml) already uses the distributed.* form.

Fix: drop the top-level moe_config block and move `activation_checkpointing:
false` under `distributed:`. With ep_size>1 the loader builds a default
MoEParallelizerConfig, preserving the original intent (activation
checkpointing off).

Verified against current source: the recipe now loads without the
AttributeError, and parse_distributed_section yields the correct default
MoEParallelizerConfig with activation_checkpointing=False.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…nto `r0.5.0` (#2525)

fix(gemma4): cast dense params without casting buffers (#2359)

* fix(gemma4): avoid bf16 casting dense model buffers



* ADD test code fix(gemma4): avoid bf16 casting dense model buffers



* fix(gemma4): cast dense params without rounding buffers



* Fix Gemma4 MoE import formatting for Ruff



---------

Signed-off-by: kdg6245 <kdg6245@snu.ac.kr>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dogeun Kim <82812668+DOGEUNNKIM@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…2530)

fix: unwrap ModelOutput to extract logits (#2523)

Unwrap ModelOutput to extract logits

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
… `r0.5.0` (#2554)

fix(qwen3_5): make dense VLM pipeline-parallel safe (#2524)

* fix(qwen3_5): make dense VLM pipeline-parallel safe

The dense Qwen3.5/3.6 VLM crashed on the first PP forward with
"KeyError: slice(None, 64, None)". It keeps its own outer forward under PP
but delegates the decoder stack to HF's Qwen3_5TextModel.forward, which is
not PP-aware: it slices self.layers[: num_hidden_layers] (the splitter
rewrites layers into a ModuleDict) and calls self.norm/self.embed_tokens
unconditionally (dropped to None on non-last/non-first stages).

Fix (all in qwen3_5/model.py):
- Add Qwen3_5TextModelPP(Qwen3_5TextModel) overriding forward to present the
  post-split ModuleDict layers as a slice-able ModuleList over the same layer
  objects and swap a dropped norm for nn.Identity, delegating to super() so
  HF's mRoPE/mask/rotary logic is reused unchanged. Pure passthrough off PP.
- In __init__, set self.model.language_model.__class__ = Qwen3_5TextModelPP
  (same instance+weights; class-based so it survives the splitter's deepcopy).
- Make the outer forward PP-stage-aware: stage 0 runs the full HF VLM path,
  middle/last stages feed upstream hidden states straight into the text
  backbone, lm_head runs only on the last stage, MTP is skipped under PP.
  Non-PP (TP/CP/single) path is unchanged.

Verified: TP4xPP4 on 2 nodes (MedPix, 27B dense) trains end to end,
loss 1.82 -> 1.53, ~19s/step, no errors.
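
The "present the dict as a slice-able list" idea can be sketched without PyTorch. This assumes the post-split container is keyed by stringified layer indices, which is an assumption about the splitter's output; `SliceableLayers` is a hypothetical name and not the class added in this PR.

```python
class SliceableLayers:
    """Hypothetical list-like view over a dict of layers keyed by index strings."""

    def __init__(self, layers_by_name):
        # Keep references to the same layer objects, ordered by numeric index.
        self._layers = [layers_by_name[k] for k in sorted(layers_by_name, key=int)]

    def __getitem__(self, index):
        # A list accepts both integers and slices, unlike a plain dict.
        return self._layers[index]

    def __len__(self):
        return len(self._layers)


# A pipeline stage that owns only layers 4..7 of the original stack.
stage_layers = {"4": "layer4", "5": "layer5", "6": "layer6", "7": "layer7"}
view = SliceableLayers(stage_layers)
first_n = view[:64]  # slicing past the end is fine, as with any list
```

Because the view holds the same objects as the dict, no weights are copied; only the container's indexing behavior changes.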




* test(qwen3_5): cover PP-stage dispatch and text-backbone class swap

Add unit tests exercising the previously-uncovered new lines in the dense
VLM PP fix: the __init__ Qwen3_5TextModelPP class swap and the outer
forward's PP-stage dispatch (first / middle / last stage and the non-PP
fall-through), using a tiny CPU VLM with simulated per-stage module layouts
(embed_tokens / lm_head dropped as the splitter would). Raises codecov patch
coverage on model.py.




---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… recipes (2539)` into `r0.5.0` (#2550)

feat(examples): add Nemotron-3-Ultra-550B benchmark and full-SFT recipes (#2539)

* feat(examples): add Nemotron-3-Ultra-550B benchmark and full-SFT recipes

Add two example configs for Nemotron-3-Ultra-550B-A55B on 16x H100 nodes
(128 GPUs, EP=64) and record the 16-node pre-training throughput in the
performance summary:

- examples/llm_benchmark/nemotron/nemotron_ultra_v3_te_deepep.yaml: throughput
  benchmark (torch_mm experts, balanced gate, MockIterableDataset, repeated MTP
  head) measuring 815 tok/s/GPU, 293 TFLOP/s/GPU at 10.05 s/global step.
- examples/llm_finetune/nemotron/nemotron_ultra_v3_full_sft.yaml: full
  supervised fine-tune (real router, gmm experts, THD sequence packing at 4096,
  repeated MTP head) on SQuAD with chat-template formatting and answer-only loss
  masking. Validated end-to-end on 128x H100 (loss 6.1 -> 1.7, ~49 GiB/GPU).
- docs/performance-summary.mdx: add the Ultra 550B pre-training row and its
  benchmark config link.




* refactor(examples): rename Ultra-550B SFT example to _squad, tidy comments

Rename nemotron_ultra_v3_full_sft.yaml -> nemotron_ultra_v3_squad.yaml to match
the directory's <model>_squad.yaml naming convention (the example fine-tunes on
SQuAD). Simplify the inline comments and update the self-referencing launch path
in the header. fake_balanced_gate is dropped from the backend block since it
defaults to False (real router) -- behavior is unchanged.




* chore: restore perf-summary Last Updated date; trailing newline in SQuAD yaml

Revert docs/performance-summary.mdx Last Updated back to 2025-10-02 (the date
bump will be handled separately later). Add the missing trailing newline at EOF
of the Ultra-550B SQuAD example.




---------

Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…nts (2546)` into `r0.5.0` (#2558)

ci: schedule ep-parallel finetune recipes at documented node counts (#2546)

* ci(deepseek_v4): set nodes so recipes run at documented scale

The deepseek_v4 finetune recipes declare large parallelism (pp_size x
ep_size of 128/512 GPUs) but had no `ci:` block, so release-scope
auto-discovery in generate_ci_tests.py emitted them with the default
1 node (8 GPUs). On 8 GPUs non_pp_size = world/pp = 2, which cannot
satisfy ep_size=32, so _create_fsdp2_device_mesh raised
"ValueError: non_pp_size=2 must be a multiple of ep_size=32" and every
deepseek_v4 job failed at recipe.setup().

Add a `ci:` block with the node count each recipe's header already
documents so CI allocates the correct world size:
  - flash_hellaswag / flash_packed_sequence_hellaswag: 16 nodes (pp4*ep32=128)
  - flash_hellaswag_lora: 4 nodes (pp1*ep32=32)
  - pro_*_pp8_ep64_*: 64 nodes (pp8*ep64=512)

Fixes CI jobs 337980465/337980467/337980469/337980471/337980601
(pipeline 54319542).
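
The arithmetic behind the failure and the fix can be checked with a short sketch. `check_ep_mesh` is a hypothetical helper that only mirrors the divisibility rule quoted in the error message; it is not the project's `_create_fsdp2_device_mesh`.

```python
GPUS_PER_NODE = 8  # node size cited in the commit message


def check_ep_mesh(nodes, pp_size, ep_size, gpus_per_node=GPUS_PER_NODE):
    """Return non_pp_size, or raise if ep_size does not divide it (illustrative)."""
    world_size = nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError(f"world_size={world_size} must be a multiple of pp_size={pp_size}")
    non_pp_size = world_size // pp_size
    if non_pp_size % ep_size != 0:
        raise ValueError(
            f"non_pp_size={non_pp_size} must be a multiple of ep_size={ep_size}"
        )
    return non_pp_size


# Default CI allocation: 1 node, pp4 * ep32 -> non_pp_size = 8 / 4 = 2, which fails.
try:
    check_ep_mesh(nodes=1, pp_size=4, ep_size=32)
    default_error = None
except ValueError as exc:
    default_error = str(exc)

# Documented scale: 16 nodes -> non_pp_size = 128 / 4 = 32, which satisfies ep_size=32.
fixed = check_ep_mesh(nodes=16, pp_size=4, ep_size=32)
```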



* ci: set node counts for the remaining ep-parallel finetune recipes

Same root cause and fix as the deepseek_v4 recipes, for the rest of the
config_parallelism/ep_size_not_divisible bucket (AM-419). These recipes
declare large expert/pipeline parallelism but had no `ci:` block, so
release-scope auto-discovery scheduled them on the default 8 GPUs, where
non_pp_size % ep_size != 0 and _create_fsdp2_device_mesh raised
"non_pp_size must be a multiple of ep_size".

Add `ci.nodes` at each recipe's documented scale:
  - glm_5.1_lora, mimo_v2_flash_hellaswag,
    step3p7_medpix_200b_ep32pp4: 16 nodes (pp4*ep32 = 128 GPUs)
  - hy3_preview_deepep_lora (pp1*ep64 = 64),
    step3p7_medpix_200b_lora_pp8ep8_8node (pp8*ep8 = 64): 8 nodes
  - ling_flash_2_0_sft, minimax_m2.7_hellaswag_lora,
    nemotron_ultra_v3_hellaswag_peft: 4 nodes (pp1*ep32 = 32 GPUs)

hy3_preview_deepep: fix ep_size 32 -> 8 (the recipe's own comments say
"ep_size=8 gives 24 experts/rank" and "32 GPUs for full fine-tuning";
192 % 8 = 0) and add ci.nodes: 4 (pp4*ep8 = 32 GPUs).

nemotron_ultra_v3_hellaswag_peft_gb200 is intentionally left unchanged:
it is a GB200-only recipe (4 GPUs/node, 184 GB) and the eos functional
pipeline has no per-recipe GB200 routing, so it remains a known failure.

Fixes CI jobs 337980609 337980497 337980612 337980499 337980515
337980623 337980631 337980380 337980397 (pipeline 54319542).



* Apply suggestion from @akoumpa

* hy3_preview_deepep: pp8/ep8 @ 8 nodes (pp4/ep8 @ 4 nodes OOMs; verified on cw)

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…0.5.0` (#2567)

fix(diffusion): resolve flux nightly CI failures (#2529)

Two startup failures in the diffusion nightly functional tests:

- flux_t2i_flow_lora: the CI launcher unconditionally passed
  --fsdp.dp_size, which injects an 'fsdp' section and conflicts with
  the recipe's 'ddp' section (mutual-exclusion ValueError). The
  launcher now skips the override for DDP-based recipes.
- flux_t2i_flow: _build_diffusion_parallel_manager_args called
  dict() on ConfigNode sections, which are not iterable. Normalize
  fsdp/ddp sections via to_dict() before use; this also fixes the
  identical latent bug in the DDP branch.
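
The second failure can be reproduced with a toy object. A minimal sketch, assuming a section type that exposes `to_dict()` but is not iterable; the `ConfigNode` class below is a stand-in written for this example, and `normalize_section` is a hypothetical helper, not the function in the repository:

```python
class ConfigNode:
    """Stand-in for a config section: holds fields, offers to_dict(), is not iterable."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def to_dict(self):
        # Return a plain-dict copy of the section.
        return dict(self._fields)


def normalize_section(section):
    """Return a plain dict for None, a mapping, or an object exposing to_dict()."""
    if section is None:
        return {}
    if hasattr(section, "to_dict"):
        return section.to_dict()
    return dict(section)


fsdp = ConfigNode(dp_size=8)
try:
    dict(fsdp)  # the failing pattern: dict() needs a mapping or an iterable of pairs
except TypeError as exc:
    print("dict() failed:", exc)

print(normalize_section(fsdp))
```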

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ert (2575)` into `r0.5.0` (#2578)

fix(ci): bump ling_1t_lora_pp local_batch_size to satisfy PP assert (#2575)

fix(ci): bump ling_1t_lora_pp local_batch_size to satisfy PP assert (AM-471)

The ling_1t_lora_pp recipe fails in CI with:

  AssertionError: pp_batch_size 4 // pp_microbatch_size 1 must be >= pp_size 8

train_ft.py requires local_batch_size // pp_microbatch_size >= pp_size so the
pipeline schedule (interleaved1f1b, pp_size=8) has at least pp_size microbatches
to fill its stages. The recipe set local_batch_size=4, giving only 4 microbatches.

Raise local_batch_size 4 -> 8 (= pp_size * pp_microbatch_size). Per-microbatch
size stays 1 (no extra memory), and global_batch_size=512 stays cleanly
divisible: 512 / (local 8 * dp 8) = 8 grad-accum steps.
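
The arithmetic above can be checked with a short sketch. The function names are hypothetical and only mirror the rule quoted from train_ft.py; they are not its implementation:

```python
def check_pp_batch(local_batch_size: int, pp_microbatch_size: int, pp_size: int) -> int:
    """Return the microbatch count, or raise if it cannot fill pp_size stages."""
    n_microbatches = local_batch_size // pp_microbatch_size
    if n_microbatches < pp_size:
        raise AssertionError(
            f"pp_batch_size {local_batch_size} // pp_microbatch_size "
            f"{pp_microbatch_size} must be >= pp_size {pp_size}"
        )
    return n_microbatches


def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_step = local_batch_size * dp_size
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by local_batch_size * dp_size")
    return global_batch_size // per_step


# New setting: 8 microbatches for 8 stages, 8 accumulation steps.
print(check_pp_batch(local_batch_size=8, pp_microbatch_size=1, pp_size=8))
print(grad_accum_steps(global_batch_size=512, local_batch_size=8, dp_size=8))
```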

Also reassign recipe_owner to akoumpa.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…amba merge) (2559)` into `r0.5.0` (#2576)

test: fix all 5 vllm_deploy tests (token drift, nemotron OOM + mamba merge) (#2559)

test: fix nemotron-9b vllm_deploy (OOM, mamba LoRA via merge, token drift)

Job 337980666 (nemotron_nano_9b_squad_peft_vllm_deploy) failed three ways:

1. vLLM EngineCore OOM (62.24/79.11 GiB free < 0.9 target): the HF model
   stayed resident via the PeftModel<->base reference cycle. Fix:
   gc.collect() before empty_cache() + gpu_memory_utilization=0.7.
2. vLLM cannot apply LoRA to NemotronH's fused mamba MambaMixer2 (asserts
   on model.layers.0.mixer.conv1d), independent of the adapter's targets.
   So enable_lora can't serve this model at all. Fix: when
   ci.checkpoint_robustness.vllm_merge_lora is set, merge the adapter into
   the base and deploy the merged model without enable_lora.
3. Exact HF-vs-vLLM greedy token equality is not a valid cross-engine
   invariant. Fix: compare a matching prefix (MIN_MATCH_PREFIX=5).
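
The prefix comparison in item 3 could look roughly like the sketch below. Only the constant MIN_MATCH_PREFIX=5 comes from the message; the function names and token values are illustrative assumptions, not the test's actual code:

```python
MIN_MATCH_PREFIX = 5  # value quoted in the commit message


def matching_prefix_len(a, b) -> int:
    """Number of leading positions where both token sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_agree(hf_tokens, vllm_tokens, min_prefix: int = MIN_MATCH_PREFIX) -> bool:
    """Accept greedy outputs that share at least min_prefix leading tokens."""
    return matching_prefix_len(hf_tokens, vllm_tokens) >= min_prefix


hf = [11, 42, 7, 7, 99, 3, 8]
vllm = [11, 42, 7, 7, 99, 5, 8]  # drifts at index 5
print(matching_prefix_len(hf, vllm), tokens_agree(hf, vllm))
```

Exact equality would reject this pair even though both engines agree on the first five tokens.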

Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…0.5.0` (#2573)

fix(checkpoint): preserve tied lm_head on resume (#2511)

* fix(checkpoint): preserve tied lm_head on resume



* docs(checkpoint): clarify tied lm_head storage check



* fix(checkpoint): retie local lm_head after sharding



* fix(checkpoint): refresh tied lm head state during DCP



---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…74)` into `r0.5.0` (#2579)

fix(ci): set node counts for multi-node VLM finetune recipes (#2574)

* fix(ci): set node counts for multi-node VLM finetune recipes (AM-434)

CI auto-discovers every examples/vlm_finetune/**/*.yaml in the release scope and
runs each on a single 8-GPU node unless the recipe's ci: section requests more.
These three recipes had no ci: section, so they ran with world_size=8 while their
parallelism requires more GPUs: mistral3p5_128b_medpix(_lora) need TP*PP=64 and
qwen3_5_27b_tp4pp4 needs TP*PP=16. _infer_dp_size then raised "world_size must be
divisible by (tp_size * cp_size * pp_size)".

Add a ci: section to each requesting the node count its parallelism needs
(8 nodes for the 128B medpix recipes, 2 nodes for qwen3_5_27b) so the device mesh
builds with dp_size=1 instead of failing.
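
For illustration, the world-size rule behind that error can be sketched as follows. `infer_dp_size` here is a simplified re-statement written for this example, not the repository's _infer_dp_size; tp4/pp4 is read off the qwen3_5_27b_tp4pp4 recipe name:

```python
def infer_dp_size(world_size: int, tp_size: int = 1, cp_size: int = 1, pp_size: int = 1) -> int:
    """Data-parallel size left after tensor, context and pipeline parallelism."""
    model_parallel = tp_size * cp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by (tp_size * cp_size * pp_size)")
    return world_size // model_parallel


# 2 nodes x 8 GPUs with tp4/pp4 leaves dp_size=1.
print(infer_dp_size(16, tp_size=4, pp_size=4))
```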




* Change recipe owner from akoumpa to HuiyingLi

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…v3_cp_test (2577)` into `r0.5.0` (#2580)

fix(recipe): reshard MoE experts after forward in nemotron_nano_v3_cp_test (#2577)

The cp=2/ep=4 variant shards experts on a 2-wide ep_shard FSDP dimension. With
the default reshard_after_forward=False the all-gathered expert weights stay
resident across the whole forward (~11.5 GB/rank); once Adam state is allocated
on step 0, step 1's forward exceeds the 80 GiB H100 budget and OOMs. Setting
moe.reshard_after_forward=true frees that memory after each forward and is
numerically transparent (it only re-runs the all-gather), so the CP-parity
comparison is unaffected.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ci: use digits for spark recipes (#2581)

use digits for spark recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…464) (2585)` into `r0.5.0` (#2586)

ci: Enable activation checkpointing for gemma_2_9b_it_squad (AM-464) (#2585)

enable activation checkpointing for gemma_2_9b_it_squad

gemma_2_9b_it_squad OOMs on 8xH100-80GB in the backward pass at train
step ~10/50 (AM-464, nemo-ci job 337980482). With attn_implementation=
eager and no activation checkpointing, each of the 42 layers keeps a
full [B, heads, S, S] attention-score tensor for backward; the first
uncapped pad-to-longest SQuAD batch then spikes past 80 GB.

Enable activation_checkpointing so per-layer activations are recomputed
in backward (one layer's eager scores resident at a time). Verified on
cw-dfw 8xH100: baseline OOMs at step 10; with this change the run
completes 20 steps + a full validation pass at 71-79 GB peak, with an
identical (loss-neutral) loss curve.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…p (2582)` into `r0.5.0` (#2588)

fix(test): load checkpoint-robustness HF reference via device_map (#2582)

The checkpoint_robustness test's Phase 4 loads a vanilla-HF reference
model on rank 0 only (AutoModelForCausalLM.from_pretrained -> CPU ->
.to(device)). For a 14B checkpoint that takes ~50s warm / ~225s cold; the
other ranks idle at the post-phase _barrier(), so the rank-0 stall
overruns the NCCL watchdog (dist_env.timeout_minutes), the peers abort
(SIGABRT) and rank 0 is SIGKILL'd in teardown. nemo-ci's extractor
reports the rank-0 SIGKILL as "Signal/OOM-Kill", but it is a watchdog
timeout, not a host OOM.

Load the reference model straight onto the GPU via device_map (~12s) for
standard-HF loads; trust_remote_code (needs _no_meta), quantized and
device_map=auto paths are unchanged.

Fixes AM-468 (phi_4_squad, phi_4_squad_peft).

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…53) (2584)` into `r0.5.0` (#2597)

fix(peft): LoRA MLP QLoRA/PP/gemma3n fixes (AM-435, AM-447, AM-453) (#2584)

* fix(peft): make gemma3n per-layer LoRA output safe for in-place use (model-scoped)

gemma3n_vl_4b_medpix_peft (VLM PEFT) failed at backward with "Output 0 of
LoRATritonFunctionBackward is a view and is being modified inplace": transformers
gemma3n project_per_layer_inputs does an in-place op on the output of
per_layer_model_projection, which under the memory-efficient LoRA path is a view
of a custom autograd Function output.

Fix: instead of cloning inside the generic LoRATritonFunction (a clone on every
LoRA forward, for all models), components/_peft/lora.py adds
patch_gemma3n_inplace_lora_views() (called from apply_lora_to_linear_modules). It
structurally detects gemma3n's text model and wraps only its
per_layer_model_projection LoRA forward to return a non-view clone -- one clone
per forward instead of ~200+, and zero cost for non-gemma3n LoRA. clone() is an
autograd identity (grads unchanged); the patch is guarded and idempotent.


(cherry picked from commit f02c0f4)

* fix(peft): skip fused LoRA SwiGLU/ReLU2 path for quantized (QLoRA) base weights

QLoRA 4-bit base weights are packed buffers (e.g. shape (out*in/2, 1)), not a
2D (out_features, in_features) matrix, so the fused path's
``F.linear(x, base_weight)`` failed with "mat1 and mat2 shapes cannot be
multiplied (Nx4096 and 1x14680064)". ``_fusible`` now rejects quantized bases
(bitsandbytes ``quant_state`` marker, or weight shape != (out, in)) so the
per-linear ``LinearLoRA`` path (which dequantizes the base) handles them.
Adds a regression test.

Fixes AM-435.
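
A rough sketch of the kind of check described above, using plain stand-in objects instead of real tensors. `is_fusible` is a hypothetical name, and the 7168 x 4096 layer size is inferred from the numbers in the quoted error, so treat both as assumptions:

```python
from types import SimpleNamespace


def is_fusible(weight, out_features: int, in_features: int) -> bool:
    """True only for a dense 2D (out_features, in_features) base weight."""
    if getattr(weight, "quant_state", None) is not None:
        return False  # bitsandbytes-style marker mentioned in the commit message
    return tuple(weight.shape) == (out_features, in_features)


OUT, IN = 7168, 4096
dense = SimpleNamespace(shape=(OUT, IN))
# Packed 4-bit buffer: two values per byte, so out*in/2 elements in total.
packed = SimpleNamespace(shape=(OUT * IN // 2, 1), quant_state=object())
packed_no_marker = SimpleNamespace(shape=(OUT * IN // 2, 1))

print(is_fusible(dense, OUT, IN), is_fusible(packed, OUT, IN), is_fusible(packed_no_marker, OUT, IN))
```

The shape test is the fallback for quantized weights that do not carry the marker attribute.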



* fix(peft): return fused LoRA MLP grads on their parameter's device (PP/meta-safe)

Under pipeline parallelism torch builds the backward graph with the LoRA
parameters on the meta device while the activations (and the grads computed from
them) are on cuda. The fused LoRASwiGLUMLPFunction / LoRAReLU2MLPFunction returned
the cuda grads, so torch rejected them:

  Function LoRASwiGLUMLPFunctionBackward returned an invalid gradient at index 2
  - expected device meta but got cuda

Move each LoRA gradient onto its parameter's device before returning (a no-op in
normal single-device training; meta in the PP graph pass, so no real gradient is
lost). Verified on 2-GPU PP (qwen3 + fused SwiGLU LoRA): the crash is gone and
training runs end-to-end.

Fixes AM-447.



* refactor(peft): make memory-efficient LoRA output non-view (drop gemma3n patch)

AM-453's gemma3n-specific patch (patch_gemma3n_inplace_lora_views) lived in the
generic _peft/lora.py and only covered one consumer. Root cause: the
memory-efficient LoRATritonFunction returned a *view* of its custom-autograd
output (an in-Function .view()), which torch forbids mutating in place
("Output 0 of LoRATritonFunctionBackward is a view and is being modified inplace").

Move the (N, out) -> (bs, seq, out) reshape OUT of the autograd Function into a thin
apply_memory_efficient_lora wrapper (used by LinearLoRA). The Function now returns a
non-view 2D tensor; the wrapper's reshape is an ordinary autograd view, which supports
in-place ops. This fixes the bug class for any in-place consumer with no model-specific
code, and removes the gemma3n patch + its tests. Adds a generic in-place-safety
regression test (gemma3n-style consumer, no patch).

AM-453.



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…M recipes (2600)` into `r0.5.0` (#2602)

fix(vlm): enable activation checkpointing for 35B Qwen3.5/3.6 VLM recipes (#2600)

The qwen3_5_35b VLM nightly recipe OOMs intermittently in the backward
pass on 8xH100-80GB. Since #1896 restored fp32 master weights for custom
MoE under FSDP2 (a convergence fix), steady-state peak memory rose ~3 GiB
to ~67 GiB, leaving only ~2 GiB of headroom; variable VLM/MoE batch sizes
then push the backward pass over the limit on some steps.

Enable activation_checkpointing (recompute transformer-block activations in
backward) on the two full-FT 35B-A3B recipes that use the memory-tight
pp1/ep8 layout:
- examples/vlm_finetune/qwen3_5_moe/qwen3_5_35b.yaml
- examples/vlm_finetune/qwen3_5_moe/qwen3_6_35b.yaml

Verified qwen3_5_35b on 8xH100-80GB: baseline OOMs at step 31 (peak
67.2 GiB); with AC it completes all 50 steps + validation at 56.3 GiB.
qwen3_6_35b is the identical pp1/ep8 full-FT analog (its ep8cp2 sibling
already enables AC). The pp2/ep4 neat_packing recipe was checked and left
unchanged: it peaks at ~48 GiB and does not OOM.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…grad-accum (2566)` into `r0.5.0` (#2599)

fix(gemma4): FSDP2-safe kv-sharing + skip frozen audio tower on grad-accum (#2566)

* fix(gemma4): make shared_kv_states FSDP2-safe for kv-shared layers




* fix(fsdp2): skip wrapping frozen audio tower to avoid grad-accum crash




* fix(pp): thread shared_kv_states through gemma4 pipeline-parallel forward




* chore(gemma4): update example run commands, warmup, and audio-tower comment




---------

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…gits OOM (2603)` into `r0.5.0` (#2604)

fix(vlm): use FusedLinearCrossEntropy for qwen3_5_9b to avoid logits OOM (#2603)

The qwen3_5_9b VLM nightly recipe (FinetuneRecipeForVLM, medpix) CUDA-OOMs
in MaskedCrossEntropy: it fp32-upcasts the full [num_tokens, vocab] logits
and calls F.cross_entropy, which spikes ~45 GiB (steady ~30 -> 77+ GiB) on
large vision-token batches and OOMs at masked_ce.py:84 (AM-457).

Switch loss_fn to FusedLinearCrossEntropy: with the recipe's logits_to_keep=1
path the full logits matrix is never materialized. Matches the
qwen3_6_27b_medpix recipes, which already use it for the same reason.

Verified on 8xH100-80GB: baseline OOMs at step 31 in cross_entropy; with
FusedLinearCrossEntropy the run completes all 50 steps + validation at
37.7 GiB peak (no extra config needed).
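
The size of such a spike follows directly from the logits shape. A back-of-the-envelope sketch; the token count and vocabulary size in the example call are illustrative assumptions, not values taken from the recipe:

```python
GIB = 1024 ** 3


def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Memory for a dense [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_elem


def upcast_spike_gib(num_tokens: int, vocab_size: int) -> float:
    """Extra memory, in GiB, for the fp32 copy created by upcasting bf16 logits."""
    return logits_bytes(num_tokens, vocab_size, 4) / GIB


# Hypothetical batch: 32768 tokens against a 151936-entry vocabulary.
print(f"{upcast_spike_gib(32768, 151936):.1f} GiB for the fp32 copy alone")
```

The bf16 original stays alive during the upcast, so the transient peak is the bf16 tensor plus the fp32 copy, before any cross-entropy workspace.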

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
… (2589)` into `r0.5.0` (#2605)

fix(distributed): register Falcon-H1 TP plan to fix 34B PEFT OOM (#2589)

falcon_h1_34b_instruct_squad_peft (8xH100, tp_size=4) OOMs during the
first training steps with ~72 GiB allocated per GPU.

Root cause: Falcon-H1 is a hybrid Transformer + Mamba2 model. HuggingFace
ships only `_tp_plan = {"lm_head": "colwise_gather_output"}` for it, and
names its MLP `feed_forward` (not `mlp`). The parallelizer rejects the HF
plan (the `colwise_gather_output` style is unknown) and falls back to the
generic llama-style base plan, whose `model.layers.*.mlp.*` patterns never
match `feed_forward`. The dominant MLP weights (~70% of the 34B) are thus
left replicated across the TP group, so each rank holds almost the whole
model -> OOM. The attention path already sharded fine via the fallback.

Fix: add a dedicated FalconH1ForCausalLM plan that shards `self_attn` and
`feed_forward` (correct module name) and leaves the Mamba2 mixer replicated
(its SSM scan / conv1d are not TP-shardable with stock kernels, same as
Qwen3.5's GatedDeltaNet branch). Registered by both qualified name (native
transformers load) and bare name (trust_remote_code load).
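
Why the fallback plan misses the MLP can be shown with glob-style matching. A minimal sketch: only the `self_attn`/`feed_forward`/`mlp` names and the `model.layers.*.mlp.*` pattern come from the text above; the projection names, the `mamba` module name, and the use of `fnmatchcase` as the matcher are assumptions for illustration:

```python
from fnmatch import fnmatchcase

# Illustrative module names for one decoder layer.
FALCON_H1_MODULES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.feed_forward.gate_proj",
    "model.layers.0.feed_forward.down_proj",
    "model.layers.0.mamba.in_proj",
]

LLAMA_STYLE_PATTERNS = ["model.layers.*.self_attn.*", "model.layers.*.mlp.*"]
FALCON_H1_PATTERNS = ["model.layers.*.self_attn.*", "model.layers.*.feed_forward.*"]


def sharded(module_names, patterns):
    """Modules matched by at least one plan pattern."""
    return [n for n in module_names if any(fnmatchcase(n, p) for p in patterns)]


def replicated(module_names, patterns):
    """Modules that no pattern matches and that therefore stay replicated."""
    return [n for n in module_names if not any(fnmatchcase(n, p) for p in patterns)]


print("fallback plan shards: ", sharded(FALCON_H1_MODULES, LLAMA_STYLE_PATTERNS))
print("dedicated plan shards:", sharded(FALCON_H1_MODULES, FALCON_H1_PATTERNS))
print("left replicated:      ", replicated(FALCON_H1_MODULES, FALCON_H1_PATTERNS))
```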

This is the only Falcon-H1 recipe using tp_size>1; the 0.5B/1.5B/7B
recipes use tp_size=1 and are unaffected. Also drops a pre-existing
unused variable in the touched test file.

Fixes CI job 337980604 (pipeline 54319542).


(cherry picked from commit 90bcf87)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
…ast (2549)` into `r0.5.0` (#2606)

fix(models): keep RoPE frequency buffers fp32 under bf16 model cast (#2549)

* fix(llama): keep RoPE inv_freq in float32 under bf16 model cast

LlamaForCausalLM.__init__ casts the whole model with
self.to(config.torch_dtype). nn.Module.to also casts floating-point buffers,
so the non-persistent inv_freq buffer in LlamaRotaryEmbedding was being
downcast to bf16. _build_cache then upcast it back to float32 to build
the cos/sin tables, but the precision was already lost -- the low-freq
components (largest under rope_theta=5e5 + llama3 scaling) carried up to
~17% relative error.

Vanilla HF keeps inv_freq in float32 (from_pretrained reconstructs it and
never overwrites the non-persistent buffer), so a checkpoint trained with
AutoModel and reloaded in HF computed slightly different RoPE. On a
trained model's peaky output distribution that small logit gap (~0.1 max)
is amplified into a large per-token KL: the checkpoint_robustness Phase 4
(automodel -> vanilla HF) test failed with max KL 1.1e-2 > 5e-3 threshold.

Fix: recompute inv_freq in float32 from config inside _build_cache, so the
rotary tables are independent of the model's parameter dtype. Also fixes
Qwen2, which shares this module.

After the fix, AutoModel and HF logits are bit-identical (max KL 0.0 at all
softmax temperatures; rotary cos/sin match HF exactly). Added a regression
test that fails on the old code (0.0039 abs / 17% rel) and passes now.
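
The precision loss can be illustrated without torch by emulating the bf16 round trip on the inverse frequencies. A standard-library sketch under stated assumptions: head dimension 128 is assumed, rope_theta=5e5 is taken from the text, llama3 scaling is not applied, and the helper names are made up for this example:

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a positive finite value to bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to the even mantissa
    bits &= 0xFFFF0000  # keep sign, exponent and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def inv_freq(dim: int, theta: float):
    """Plain RoPE inverse frequencies theta^(-2i/dim), stored as float32."""
    return [to_fp32(theta ** (-2 * i / dim)) for i in range(dim // 2)]


freqs = inv_freq(128, 5e5)
rounded = [round_to_bf16(f) for f in freqs]
max_rel = max(abs(r - f) / f for f, r in zip(freqs, rounded))


def max_phase_error(position: int) -> float:
    """Largest angle error (radians) at a position; upcasting later cannot undo it."""
    return max(abs(r - f) * position for f, r in zip(freqs, rounded))


print(f"max relative inv_freq error after the bf16 round trip: {max_rel:.2e}")
print(f"max angle error at position 8192: {max_phase_error(8192):.3f} rad")
```

The per-frequency error is small, but it is multiplied by the position index before cos/sin are taken, which is why the rotary tables can diverge noticeably even though each buffer entry is only rounded once.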

* fix(models): keep RoPE frequency buffers fp32 under bf16 model cast

The prior commit fixed llama; the same bug — a rotary inv_freq/freqs_cis
buffer rounded to bf16 by a model-wide .to(dtype), degrading RoPE vs HF
and causing a large logit/KL divergence on vanilla-HF reload — affects 6
more families. Fix each via the existing _keep_in_fp32_modules mechanism
(honored by cast_model_to_dtype), routing the three raw-cast models through
cast_model_to_dtype:

- deepseek_v4, mimo_v2_flash: add "rotary_emb" to _keep_in_fp32_modules_strict
  (matches rotary_emb + rotary_emb_compress / swa_rotary_emb).
- minimax_m3_vl: add "inv_freq" — the vision tower's rotary buffer is not
  under a module named "rotary_emb".
- gemma4_moe, diffusion_gemma: switch raw self.to(dtype) -> cast_model_to_dtype
  and add _keep_in_fp32_modules=["rotary_emb"] (raw .to ignores keep-fp32).
- kimi_k25_vl: switch raw model.to(dtype) -> cast_model_to_dtype and add
  _keep_in_fp32_modules=["freqs_cis","rotary_emb"].

Add cast_model_to_dtype rope-buffer regression tests: non-persistent
inv_freq/freqs_cis preserved across the cast, incl. the unprotected-is-rounded
reproduction case.

* fix(models): honor set-valued _keep_in_fp32_modules + add bf16-init rope tests

cast_model_to_dtype's _get_fp32_module_keywords only collected list-valued
keep-fp32 attributes, but HF's PreTrainedModel.__init__ normalizes
_keep_in_fp32_modules from a class-level list to an instance-level set — so the
gemma4_moe / diffusion_gemma rope fixes (which set
_keep_in_fp32_modules=["rotary_emb"]) were silently no-ops, leaving inv_freq
rounded to bf16. Accept set/tuple too. (deepseek_v4/mimo use the NeMo-only
_keep_in_fp32_modules_strict, which HF doesn't touch, so they were unaffected.)
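
The list-versus-set pitfall can be sketched as below. Both functions are simplified illustrations written for this note, not the repository's _get_fp32_module_keywords; the model is a bare stand-in object:

```python
from types import SimpleNamespace

KEEP_FP32_ATTRS = ("_keep_in_fp32_modules", "_keep_in_fp32_modules_strict")


def keywords_list_only(model) -> set:
    """Old behaviour as described above: only list-valued attributes are collected."""
    found = set()
    for attr in KEEP_FP32_ATTRS:
        value = getattr(model, attr, None)
        if isinstance(value, list):
            found.update(value)
    return found


def keywords_any_collection(model) -> set:
    """Fixed behaviour: accept list, tuple, set and frozenset values."""
    found = set()
    for attr in KEEP_FP32_ATTRS:
        value = getattr(model, attr, None)
        if isinstance(value, (list, tuple, set, frozenset)):
            found.update(value)
    return found


# Instance-level set, as left behind by the normalization described above.
model = SimpleNamespace(_keep_in_fp32_modules={"rotary_emb"})
print(keywords_list_only(model), keywords_any_collection(model))
```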

Add per-model regression tests that build a mini (~2-layer) model, run the real
bf16 init path (initialize_weights, or from_config + cast_model_to_dtype), and
assert the rotary frequency buffers stay float32 while a regular weight is
bfloat16: deepseek_v4, mimo_v2_flash, gemma4_moe, minimax_m3_vl (vision tower),
kimi_k25_vl, and diffusion_gemma (skipped unless the transformers fork is
present). Plus set/tuple coverage for _get_fp32_module_keywords and
cast_model_to_dtype.

(cherry picked from commit e41e076)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
…) (2587)` into `r0.5.0` (#2611)

fix: use TE attention for gpt_oss packed-sequence recipe (AM-438) (#2587)

* fix: use TE attention for gpt_oss packed-sequence recipe (AM-438)

gpt_oss_20b_te_packed_sequence.yaml paired backend.attn=flex with THD
packed sequences, but FlexAttention does not support the THD (3D) layout
(its sink path requires a 4D tensor and ignores cu_seqlens), so the recipe
crashed with "ValueError: not enough values to unpack (expected 4, got 3)".
TE DotProductAttention supports THD/packed natively (qkv_format=thd +
cu_seqlens) and gpt-oss attention sinks via softmax_offset, which is the
path #1757 validated. Switch the recipe's attention backend to te.



* fix(recipe): lower gpt_oss packed-sequence local_batch_size to 1 to avoid MoE OOM

Full fine-tuning of gpt-oss-20b (MoE) on a single 8xH100-80GB node OOMs at
local_batch_size: 4 -- in the MoE experts grouped-GEMM and the vocab-201088
cross-entropy logits. local_batch_size: 1 (global_batch_size stays 32 via
grad accumulation) trains within ~35 GiB/GPU. Activation checkpointing is not
an option here: DeepEP's non-deterministic token dispatch makes the recompute
mismatch the forward.



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…avoid OOM (2609)` into `r0.5.0` (#2612)

fix(oom): use FusedLinearCrossEntropy in qwen3 tulu3 configs to avoid OOM (#2609)

fix(convergence): use FusedLinearCrossEntropy in qwen3 tulu3 configs to avoid OOM

The qwen3-4b and qwen3-moe-30b tulu3 convergence configs (added in #1554) OOM
on 8xH100-80GB at masked_ce.py logits.float(): with truncation: false a long
tulu3 sample builds a full [tokens, vocab] logit tensor whose bf16->fp32 upcast
needs one huge contiguous allocation (48 GiB for 4B lb8, 18.6 GiB for 30B).

Switch all 5 configs to FusedLinearCrossEntropy + output_hidden_states: true,
so the recipe's logits_to_keep=1 path never materializes the full logits. Also
drop qwen3_4b_cp1_flashoptim local_batch_size 8->4: once FLCE removes the logits
bottleneck it exposes a backward-pass activation OOM at lb8 with unbounded
sequences (cp2 is unaffected since CP halves per-GPU seq; the te_fusedadam and
30B configs already use lb2).

Verified on 8xH100-80GB (max_steps 10): baseline OOMs (4B 48 GiB / 30B 18.6 GiB
at logits.float); with the fix the runs complete 10 steps + validation with no
OOM (30B flashoptim, 4B cp1 lb4, 4B cp2).

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
….0` (#2607)

perf(distributed): add retrieval tuning knobs (#2452)

* perf(distributed): add retrieval tuning knobs



* fix(retrieval): unwrap ddp model attrs



* perf(retrieval): speed up ddp grad clipping



* fix(distributed): reduce DDP recipe metrics



* fix(retrieval): preserve optimizer groups and log average loss



* chore(retrieval): drop megatron fsdp side changes



* style(retrieval): sort bi-encoder imports



* style(training): format step scheduler



* test(distributed): fix ddp config expectations



* perf(retrieval): make autocast configurable



* perf(retrieval): wire compile config



* test(diffusion): update DDP manager config expectation



* test(diffusion): update ConfigNode DDP expectation



---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
…g probability (2591)` into `r0.5.0` (#2610)

fix(moe): weight GroupedExpertsTE down-projection bias by routing probability (#2591)

* fix(moe): weight GroupedExpertsTE down-projection bias by routing probability

GroupedExpertsTE applied the MoE down-projection bias UNWEIGHTED: TE's
GroupedLinear adds the down bias inside the grouped GEMM, but the per-token
routing probability is applied at the activation (permuted_probs). So each of
the top-k selected experts contributed a full prob-independent down bias, and
the combine summed them -> ~k*bias instead of ~1*bias per token, across every
MoE layer. This is a large systematic activation offset (gpt-oss-20b finetune
step-0 loss ~8.2 vs the correct ~4.5).

Add the missing (prob - 1) * down_bias term so the net down-bias contribution
is prob * down_bias, matching GroupedExperts (expert_out + down_bias * w) and
GroupedExpertsDeepEP (_apply_bias(..., permuted_probs)). No-op for experts
without bias; correct in both bf16 and fp8 paths; gradients sum to the correct
Sum(prob). Affects all experts=te MoE models with expert bias under EP.
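
A scalar toy model of one token makes the correction easy to verify. This is an illustrative sketch, not the GroupedExpertsTE code: each expert's matmul result is a single number, and the routing probability is applied before the bias, as described above:

```python
import math


def combined_output(probs, expert_matmuls, down_biases, corrected: bool) -> float:
    """Sum of top-k expert outputs when the bias is added inside the (toy) GEMM."""
    total = 0.0
    for p, y, b in zip(probs, expert_matmuls, down_biases):
        out = p * y + b  # prob already applied to the activation; bias added unweighted
        if corrected:
            out += (p - 1.0) * b  # added term: net bias contribution becomes p * b
        total += out
    return total


def reference_output(probs, expert_matmuls, down_biases) -> float:
    """Expected result: each expert's full output weighted by its probability."""
    return sum(p * (y + b) for p, y, b in zip(probs, expert_matmuls, down_biases))


probs = [0.5, 0.25, 0.25]  # top-3 routing probabilities
ys = [2.0, 4.0, -8.0]      # per-expert matmul results before weighting
biases = [1.0, 1.0, 1.0]   # per-expert down-projection bias

print(combined_output(probs, ys, biases, corrected=False))
print(combined_output(probs, ys, biases, corrected=True))
print(reference_output(probs, ys, biases))
```

Without the correction the result is off by (k - sum(prob)) * bias, which for these values is 2.0.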

Verified on gpt-oss-20b (8xH100, ep=8, deepep, te, packed) on cw-dfw: step-0
loss 8.21 -> 5.10, val 4.19 (matches HF reference 4.53 and the gmm path).



* test(moe): 2-GPU GroupedExpertsTE EP-vs-single-GPU down-bias parity guard

Add a 2-GPU (ep_size=2) functional regression test that guards the
GroupedExpertsTE down-projection-bias fix (PR #2591 / Linear AM-487) so it
cannot be silently reverted.

TE's GroupedLinear adds the per-expert down bias UNWEIGHTED inside the grouped
GEMM, but the per-token routing probability is applied at the activation
(permuted_probs). Without the (permuted_probs - 1.0) * down_bias correction,
each of a token's top-k expert contributions carries a full prob-independent
down bias that the combine step sums (~k x bias instead of ~prob x bias) -- a
large systematic offset (gpt-oss-20b step-0 loss ~8.2 vs the correct ~4.5).

The test builds GroupedExpertsTE with 8 experts sharded 4+4 across 2 ranks, a
DeepEP token dispatcher, expert_bias=True, quick_geglu activation and
deterministic weights with a non-zero down bias. It feeds seeded, identical
hidden states / router indices / probs, runs the EP forward, gathers every
rank's local-token output and compares it to a single-GPU reference that
applies the correct prob-weighted down bias via plain matmuls. It also checks
that the tolerance actually separates the buggy (unweighted) output from the
correct one so the guard cannot become vacuous.

Verified on 2xGPU: PASSES with the fix (max_err ~1.6e-2 < tol), FAILS without
it (max_err ~1.7 > tol). Wired into CI via a new L2_MoE matrix entry
(test-folder: moe) on the 2-GPU runners.



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…o r0.5.0 (#2618)

fix(qwen3_5_moe): convert MTP experts as grouped tensors (AM-442) (#2595)

The HF checkpoint stores the MTP block's MoE experts in the SAME grouped
layout as the main decoder layers (mtp.layers.0.mlp.experts.{gate_up_proj,
down_proj}); verified against Qwen/Qwen3.6-35B-A3B's safetensors index.

convert_single_tensor_to_hf split the MTP experts into per-expert HF keys
(mtp.layers.0.mlp.experts.{id}.down_proj.weight). When the checkpoint loader
builds load destinations via to_hf, it requested per-expert keys that don't
exist in the grouped checkpoint, raising 'Missing key in checkpoint
state_dict: mtp.layers.0.mlp.experts.224.down_proj.weight' on the first EP
rank that owns expert 224 (rank 7 with 256 experts / EP=8).

Remove the per-expert MTP split so MTP experts fall through to the generic
grouped rename+transpose used by the main layers, matching the on-disk keys.
The grouped from_hf path already handles loading (incl. EP shard slicing).
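
The rank in the error message can be checked with a small sketch that assumes experts are split into contiguous equal blocks per EP rank (consistent with expert 224 landing on rank 7). The helper names are hypothetical; the key strings follow the layout quoted above:

```python
def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """EP rank owning an expert under contiguous block partitioning (assumed)."""
    per_rank = num_experts // ep_size
    return expert_id // per_rank


def local_experts(rank: int, num_experts: int, ep_size: int) -> range:
    """Expert ids held by one EP rank under the same assumption."""
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


# Keys present in the grouped checkpoint layout described above.
grouped_keys = {
    "mtp.layers.0.mlp.experts.gate_up_proj",
    "mtp.layers.0.mlp.experts.down_proj",
}
# Key the old per-expert split asked for on the rank that owns expert 224.
per_expert_key = "mtp.layers.0.mlp.experts.224.down_proj.weight"

print(expert_owner(224, num_experts=256, ep_size=8))
print(per_expert_key in grouped_keys)
```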

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
svcnvidia-nemo-ci and others added 5 commits June 17, 2026 11:33
fix(bagel): distributed setup init (#2608)

* fix(bagel): pass distributed setup to VLM builder



* fix(bagel): microbatch VAE encode in SFT example



---------

Signed-off-by: Zeyu Zhou <zezhou@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Zeyu Zhou <zezhou@nvidia.com>
…AM-454) (2594)` into `r0.5.0` (#2619)

fix(transformers): keep gemma3n KV sharing working under FSDP2 (AM-454) (#2594)

* fix(transformers): keep gemma3n KV sharing working under FSDP2 (AM-454)

HF gemma3n implements cross-layer KV sharing by threading a single mutable
`shared_kv_states` dict through every decoder layer as a forward kwarg: the
last full-length layer of each attention type writes its K/V into the dict and
the later shared layers read it back.

Under FSDP2 with MixedPrecisionPolicy(cast_forward_inputs=True), the per-layer
fully_shard pre-forward casts inputs via tree_map over (args, kwargs), which
reconstructs the dict fresh for each layer. The writer fills its throwaway copy
and the reader sees an empty one -> KeyError (e.g. KeyError: 18) on the first
forward. Only triggers once fully_shard is active (dp_shard > 1, i.e. multi-GPU).

Fix: inject a pytree-opaque, dict-like _SharedKVStates holder (shared by
reference across all layers, reset per forward) via a per-layer forward
pre-hook. Because it is not a type pytree flattens, tree_map treats it as a
leaf and passes the same instance to every layer, so in-place writes are
visible to readers. Applied from _apply_runtime_compatibility_fixes alongside
the rotary fix; no-op for models without KV sharing.
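
The mechanism can be reproduced with a toy tree map. This is a simplified stand-in written for illustration: `toy_tree_map` only mimics how a pytree map rebuilds plain containers, and `SharedKVHolder` is not the real _SharedKVStates class:

```python
def toy_tree_map(fn, tree):
    """Rebuild plain dicts, lists and tuples; apply fn to everything else (leaves)."""
    if type(tree) is dict:
        return {k: toy_tree_map(fn, v) for k, v in tree.items()}
    if type(tree) in (list, tuple):
        return type(tree)(toy_tree_map(fn, v) for v in tree)
    return fn(tree)


class SharedKVHolder:
    """Dict-like holder that is not a registered container, so it stays a leaf."""

    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]


def cast_inputs(kwargs):
    # Stand-in for the per-layer input cast: leaves pass through, containers are rebuilt.
    return toy_tree_map(lambda leaf: leaf, kwargs)


def writer_layer(**kwargs):
    kwargs["shared_kv_states"][18] = "kv-from-layer-18"


def reader_layer(**kwargs):
    return kwargs["shared_kv_states"][18]


def run(shared):
    """One forward: each layer receives its own freshly mapped kwargs."""
    call_kwargs = {"shared_kv_states": shared}
    writer_layer(**cast_inputs(call_kwargs))
    return reader_layer(**cast_inputs(call_kwargs))


print(run(SharedKVHolder()))
try:
    run({})
except KeyError as exc:
    print("plain dict lost the write, KeyError:", exc)
```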

Verified: gemma3n_vl_4b_medpix on 2 GPUs trains cleanly (was KeyError: 18).




* test(transformers): cover _apply_runtime_compatibility_fixes KV-sharing wiring




* fix(transformers): scope gemma3n KV-sharing holder to gemma3n only

The holder was installed for any model with num_kv_shared_layers > 0, which clobbered gemma4's caller-supplied shared_kv_states (the speculative drafter threads the base model's store via composite.py), regressing test_hf_transformer_vlm_gemma4_joint_drafter with KeyError: 'sliding_attention'. gemma4 already makes its shared store FSDP2-safe in-model (#2566) and preserves a caller-supplied store via setdefault, so gate our holder on model_type startswith 'gemma3n' and leave other kv-sharing models alone.




---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
… apt runtime (2614)` into `r0.5.0` (#2629)

fix(docker): build DeepEP against the NVSHMEM wheel matching the apt runtime (#2614)

* fix(recipe): use hybridep dispatcher for glm_4.5_air_te_deepep internode

The default deepep dispatcher (Buffer.internode_dispatch) faults internode with an illegal memory access at DeepEP csrc/kernels/internode.cu:346 (recv counters stuck at -1, then timeout) on the 8-node ep32 GLM-4.5-Air job. The hybridep backend (HybridEPBuffer) works internode -- verified standalone on EOS (test_hybrid_ep.py 2-node: correctness PASS for BF16 and FP8, ~65 GB/s RDMA). Mirrors gpt_oss_120b, ling_1t_sft, qwen3_moe_*_gb200, deepseek_v3_*_gb200 which already set dispatcher: hybridep for the same deepep internode failure.



* fix(docker): align glm DeepEP/HybridEP build for internode dispatch

Build DeepEP/HybridEP with apt rdma-core v60 (build==runtime libibverbs) and RDMA_CORE_HOME symlinked to the system install, add the libnvshmem_host.so->.so.3 symlink (required for DeepEP fabric handle operations), pin the nvshmem wheel to 3.6.5, set LD_LIBRARY_PATH, and bump DEEPEP_COMMIT to 17cfb817. With dispatcher: hybridep this lets glm_4.5_air_te_deepep complete internode dispatch (8-node, eos) instead of faulting at DeepEP csrc/kernels/internode.cu:346 (recv counters stuck at -1).



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…25)` into `r0.5.0` (#2628)

fix(qwen3_moe): keep native forward under PP so CP+THD works (#2625)

* fix(qwen3_moe): keep native forward under PP so CP+THD works

Qwen3MoeForCausalLM did not declare `_pp_keep_self_forward = True`, so the
pipeline builder replaced its forward with the generic HF pipeline forward.
That generic forward assumes the HF rotary API
(`rotary_emb(hidden_states, position_ids) -> (cos, sin)`), but Qwen3-MoE uses
the gpt_oss-style rope (`position_ids_to_freqs_cis` + `apply_rotary_emb_qk` with
`cu_seqlens`/`cp_size`/`cp_rank`) and a `freqs_cis` decoder-layer API. Under
pp_size>1 with context parallelism + THD this crashed in the first forward:

  RuntimeError: Sizes of tensors must match except in dimension 2.
                Expected size 512 but got size 1   (at apply_rotary_emb torch.cat)

The model's own forward already handles PP stage routing (embed_tokens/norm/
lm_head are None off the owning stage; hidden states arrive in the input_ids
slot via squeeze_input_for_thd) and CP+THD (CP is handled inside TE attention
via cp_size/cp_rank). Declaring `_pp_keep_self_forward = True` (matching the
sibling native MoE models nemotron_v3 / qwen3_5_moe / ling_v2) makes the PP
builder preserve it; the stage wrapper then unwraps the CausalLMOutputWithPast
to a tensor.

Verified on 16xH100 (cw-dfw): qwen3_moe_30b_te_chat_thd at pp_size=2, cp_size=2,
THD trains end-to-end (20/20 steps, exit 0, ~19 GiB/GPU, no OOM, no RoPE crash).

Adds a CPU unit test asserting the opt-in flag is declared and recognized by
model_keeps_self_forward.
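
A minimal sketch of the opt-in pattern, assuming the flag is read as a plain attribute. `keeps_self_forward` below is a hypothetical stand-in for `model_keeps_self_forward`, whose real signature and location are not shown in this message; the two classes are toy placeholders, not the actual models.

```python
class GenericModel:
    """No opt-in flag: a PP builder would swap in the generic forward."""


class NativeForwardModel:
    """Opts in to keeping its own forward under pipeline parallelism."""

    _pp_keep_self_forward = True


def keeps_self_forward(model) -> bool:
    # Hypothetical re-implementation: treat a missing flag as "not opted in".
    return bool(getattr(model, "_pp_keep_self_forward", False))
```

A CPU-only test in this style needs no GPUs or weights: it only asserts that the class declares the flag and that the helper recognizes it.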



* feat(qwen3_moe): enable PP in qwen3_moe_30b_te_chat_thd recipe

Now that Qwen3MoeForCausalLM keeps its native forward under PP, enable pipeline
parallelism on the recipe that motivated the fix (AM-460): pp_size 1->2 with
pp_microbatch_size 1 (must divide local_batch_size=2) and ci.nodes=2 (16xH100).
pp=2 splits the model across 2 nodes so each shard fits (~19 GiB/GPU), resolving
the backward OOM; pp=2 on a single node is memory-neutral vs pp=1 and still OOMs.
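
The divisibility constraint mentioned above can be checked with a few lines. This is a sketch under the assumption that the number of microbatches per step is local_batch_size / pp_microbatch_size; `check_pp_microbatch` is a hypothetical helper, not part of the recipe or the library.

```python
def check_pp_microbatch(local_batch_size: int, pp_microbatch_size: int) -> int:
    """Return the number of microbatches per step, or raise if the sizes clash."""
    if pp_microbatch_size <= 0:
        raise ValueError("pp_microbatch_size must be positive")
    if local_batch_size % pp_microbatch_size != 0:
        raise ValueError(
            f"pp_microbatch_size={pp_microbatch_size} must divide "
            f"local_batch_size={local_batch_size}"
        )
    return local_batch_size // pp_microbatch_size


# Values from the recipe change: local_batch_size=2, pp_microbatch_size=1.
num_microbatches = check_pp_microbatch(2, 1)
```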



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: gitlab-runner <gitlab-runner@gitlab-master.nvidia.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested review from a team and jgerh as code owners June 18, 2026 08:48
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
