ci: Update transformers to latest version 5.12.1 #2632
Open
svcnvidia-nemo-ci wants to merge 35 commits into
feat: make mesh accept meshcontext (#2266) * make mesh accept meshcontext * fix(transformers): resolve mesh context inputs * rm * use moe_overrides * make create_mesh_context the entry point for dist setup * fix: add renamed distributed utility files * fix(vlm): complete dist_setup -> mesh_context rename Two leftover references to the old setup_distributed/dist_setup API were missed when the recipe was migrated to create_mesh_context_from_config: - nemo_automodel/recipes/vlm/finetune.py:794 still read self.dist_setup.cp_size, which would AttributeError on any PP+CP VLM run. - tests/unit_tests/recipes/test_finetune_vlm_cp_wiring.py monkeypatched the stale symbol "setup_distributed", causing three parametrizations of test_setup_skips_pp_media_prechunk_when_cp_preembeds_vlm_inputs to fail during pytest setup with AttributeError. * remove activation checkpointing from meshcontext * refac * dedup * fix * fix * Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py * Update skills/distributed-training/SKILL.md * Update skills/distributed-training/SKILL.md * Update nemo_automodel/components/distributed/config.py * Update skills/nemo-automodel-distributed-training/SKILL.md * docs(distributed): update setup API example * Update nemo_automodel/_diffusers/auto_diffusion_pipeline.py * fix(diffusers): remove duplicated DistributedSetup.build call in fsdp2 path Commit 6acf01f left a duplicated `distributed_setup = DistributedSetup.build(` line and a stray extra `)` in the fsdp2 branch of the parallelism-manager builder. This broke `ruff format --check` (CI linting job) and, once formatted, would have nested the call as `DistributedSetup.build(distributed_setup=...)` — a kwarg the builder does not accept. Restore the single build() call matching the ddp branch. 
* fix(vlm): pass distributed_setup through Gemma4 joint drafter composite The distributed-config API refactor routes all distributed settings through a single ``distributed_setup`` argument and rejects the separate ``moe_mesh`` / ``distributed_config`` / ``pipeline_config`` kwargs in ``NeMoAutoModel*.from_pretrained``. ``Gemma4WithDrafter.from_pretrained`` was still forwarding those separate kwargs to its inner base/drafter loads, so the joint-drafter VLM finetune (L2_HF_Transformer_VLM_Gemma4_Joint_Drafter) raised ``TypeError: Distributed settings must be passed with distributed_setup``. Forward ``distributed_setup`` to both sub-loaders instead, and extend the pipeline/context-parallel safety guards to read pp_size/cp_size from the resolved setup so the KV-sharing invariant holds on the recipe path too. * feat: add use_memory_efficient_lora knob (#2239) * add use_memory_efficient_lora knob * add use_memory_efficient_lora * fix * delete sft gpt-oss 20b single gpu * add nemotron nano v3 single-gpu lora example * add grad ckpt * add fused lora mlp * fix(checkpoint): support single-GPU Nemotron-H MoE LoRA checkpoint load Loading a merged-expert Nemotron-H MoE checkpoint through the default DCP / set_model_state_dict path transiently materializes a second on-device copy of the expert weights, which OOMs a 30B-class model on a single 80GB GPU. Checkpointer.load now routes single-device custom-model safetensors through the frugal full-state path (load to CPU, merge from_hf on CPU, copy into the model), keeping device memory at ~model size. _load_full_state_dict_into_model normalizes stray real (CPU) buffers left behind by custom-model meta materialization onto the parameter device (avoids 'Multiple devices found'), and uses plain load_state_dict for non-DTensor models so the full state dict is not moved on-device a second time. 
Adds a [nemotron-singlegpu-lora] note plus per-site tags documenting these single-device special cases, links the exercising recipe (examples/llm_finetune/nemotron/nemotron_nano_v3_singlegpu_lora.yaml), and flags the load path for a future refactor. * feat(peft): add fused LoRA SwiGLU/ReLU² MLP with recompute backward Fuses gate+up+down+activation into a single autograd Function that saves only (x, gate_out, up_out) and recomputes the activation and down-projection input in backward, roughly halving MLP activation memory at equal speed during LoRA SFT. SwiGLU forward/backward use elementwise Triton kernels (with in-place backward buffer reuse) and a pure-torch fallback when Triton is unavailable; matmuls stay on cuBLAS. Covers SiLU-SwiGLU (gate/up/down) and non-gated ReLU² (e.g. Nemotron-H dense) MLPs. install_fused_lora_mlp() swaps each LoRA-applied MLP's forward and falls back to the per-linear path at runtime under DTensor (TP/EP), DoRA, or active dropout, keeping it correct under sharding. Already wired from lora.py; opt out via NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP=1. Activation recompute follows Megatron-Core's SwiGLUFunction; the fused LoRA-MLP and in-place buffer reuse follow Unsloth's LoRA_MLP (both Apache-2.0). * refactor(peft): drop NEMO_AUTOMODEL_DISABLE_FUSED_LORA_MLP env knob The fused LoRA MLP can already be disabled via the use_memory_efficient_lora config flag, and fusion auto-falls-back per-MLP under DTensor / DoRA / active dropout. The env var was a redundant escape hatch; remove it and the now-unused os import. * test(checkpoint): align custom-model load-routing guard with single-device fast path The nemotron-singlegpu-lora change routes single-device (world_size == 1) custom safetensors models through the frugal full-state fast path instead of DCP. 
The fast path now applies the state_dict_adapter from_hf conversion on CPU (_maybe_adapt_state_dict_from_hf), so custom MoE expert merging still happens — the guard test's original premise (fast path bypasses conversion) no longer holds. - Reframe test_custom_model_skips_fast_path_uses_dcp as the multi-rank (sharded) case (WORLD_SIZE=2), where DCP per-rank DTensor slicing is genuinely required. - Add test_single_device_custom_model_uses_fast_path covering the new world_size==1 behavior (fast path used, DCP not). --------- * fix(deepseek_v3): initialize weights in fp32 and default router to fp32 (#2450) * fix(deepseek_v3): init weights in fp32, default router to fp32 Sampling the random init directly in bf16 distorts the variance/mean schedule and produces exploding first-step gradients (flat/diverging loss) for from-scratch pretraining. Add an init_weights_in_fp32 context manager that samples in fp32 and casts back to the resident dtype, and use it in DeepSeek-V3 initialize_weights. Also default the router (gate_precision) to fp32 to match the HF reference. * refactor(models): rename init_weights_in_fp32 to yield_fp32_model Generalize the context manager per review: it's a generic "run this block with the model in fp32" tool, not init-specific. Yield the model and make the exit dtype optional (defaults to the model's pre-context float dtype). --------- * fix(multimodal): migrate finetune recipe to DistributedSetup/MeshContext API The auto-class-public-api refactor deleted `recipes/_dist_setup.py` and moved recipes to the `DistributedSetup` / `MeshContext` API, but `multimodal/finetune.py` was left importing and calling the deleted `_dist_setup.setup_distributed`, so importing the module raised ModuleNotFoundError and broke the import-check in every Pip/UV install job. 
Migrate it to the shared `_distributed_setup_attributes(create_distributed_setup_from_config(...))` pattern used by the llm/vlm recipes: unpack distributed_setup / mesh_context / distributed_config / device_mesh / moe_mesh / pp_enabled / pipeline_config / moe_parallel_config / activation_checkpointing, and update the model-build calls (`mesh=self.mesh_context`, `moe_config`/`cfg_moe=self.moe_parallel_config`, `activation_checkpointing=self.activation_checkpointing`). * feat(speculative): add EAGLE-3 sequence packing and reasoning-mode control (#2444) * feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training Add --reasoning {none,save,disable} flag to regenerate.py for controlling whether target model reasoning content is preserved or suppressed during data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash training recipes to exclude reasoning traces from the loss mask. * feat(speculative): add EAGLE-3 sequence packing for draft training Pack variable-length chat samples into fixed-width rows for EAGLE-3 training, removing the per-sample padding waste of the default max_length path. Documents within a row attend block-causally: the target uses a 4D block-causal mask (SDPA) and the draft uses varlen FlashAttention-2; cross-document TTT supervision is gated by doc_remaining so deeper steps never leak across boundaries. Opt-in via packed_sequence_size > 0, colocated target backend only. Covered by unit tests plus an FA2-vs-eager parity test. 
--------- * feat(distributed): add selective activation checkpointing for FSDP2 (#2389) * feat(distributed): add selective activation checkpointing for FSDP2 * fix(distributed): support selective activation checkpointing with torch.compile * docs(fern): drop selective AC from frozen v0.4 snapshot * feat(distributed): honor selective activation checkpointing on single GPU * feat(moe): support selective activation checkpointing with expert parallelism * fix(model): make DeepSeek MLP dispatch wrapper-safe * fix(distributed): save expert grouped-GEMM in selective AC and add op trace * feat(moe): compile selective activation checkpointing wrappers outer * refactor(distributed): move selective AC into its own module Extract the TorchTitan-style selective activation checkpointing core out of the central parallelizer.py into a dedicated activation_checkpointing.py: op-set construction, the save/recompute policy, block/sub-module wrappers, KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py keeps only the thin apply_selective_activation_checkpointing entry point, which still needs the heavy, transformers-aware _extract_model_layers, so the dependency stays one-directional (parallelizer -> activation_checkpointing -> parallelizer_utils) with no circular imports. Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a single call site instead of trace globals plus a helper. Make the new module's cross-module interface public (drop the leading underscore) and keep internal op-resolution/plumbing private. Update the moe and fsdp2 consumers and the unit tests to import from the new module. Also fix doc wording: clarify that torch.compile must be held fixed when comparing full vs. selective, and refer to TorchTitan as a reference implementation rather than "upstream". 
* refactor(distributed): move selective-AC trace into the AC module * test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split * docs: apply tech-writer edits to gradient-checkpointing guide --------- * feat(diffusion): improve qwen image finetuning configs (#2442) * ci: add nemo-run, split qwen-vl-utils from decord for arm (#2456) * ci: add nemo-run, split qwen-vl-utils from decord for arm * fix: override in pytorch container * Update uv lock --------- * Apply suggestions from code review * fix(precision): dtype contract bug fixes for FSDP2 mixed-dtype loads (#2419) * fix(transformers): unify loaded HF dtype via promote_types Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to the checkpoint dtype, unify each floating tensor to promote_types(checkpoint, requested). This honors an explicit fp32 request while preserving intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on HF mixed-dtype loads. * feat(distributed): default pipeline dtype to FSDP activation dtype When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to param_dtype) so pipeline stage shape inference matches the real activation dtype (e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is honored but warned on mismatch, since it can corrupt inter-stage recv buffers. No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1. 
(cherry picked from commit 3f6b246) * refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage fully_shard_by_dtype now groups parameters by their required *compute* dtype instead of their storage dtype, so fp32 master weights (uniform fp32 storage) still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32 params keep fp32 compute. Per-parameter compute dtype is resolved by precedence: pinned fp32 (_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype. Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the NemotronH and Qwen3.5 strategies thread the declaration through. (cherry picked from commit 3dd6b97) * docs(model-onboarding): document _keep_in_fp32_modules_strict contract Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/ dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare them (class attribute vs patch_hf_model instance attribute), and why the pin is the robust signal across all load paths. Broaden the MoE checklist item and code comment accordingly. (cherry picked from commit a11db38) * test(distributed): add fp32 compute-dtype contract test Assert the resident compute dtype of every trainable parameter across the model archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype, under fp32 master weights and ordinary loads. 
(cherry picked from commit dc83926) * feat(model): cast frozen modules to compute dtype to avoid mismatch (cherry picked from commit d321f5e) * refactor(gemma4): drop projector dtype hook now general frozen cast handles it (cherry picked from commit 1bc67e2) * feat(training): add dormant resolve_storage_dtype helper Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype to fp32 for full-parameter torch.optim training. Not yet wired into recipes here; the call sites are marked with breadcrumb comments and enabled in a follow-up PR, keeping this PR limited to dtype bug fixes with no behavior/memory change. * fix(model): cast frozen-module buffers and unsharded params to compute dtype * docs(infra): correct frozen-tower FSDP comment to match sharding reality * docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs * docs(mixed-precision): apply tech writer edits * docs(mixed-precision): drop unresolvable FSDP anchor --------- * docs(speculative): add subsystem README, fold in regeneration guide (#2448) Add examples/speculative/README.md covering the whole speculative-decoding draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE, DFlash), target-model registry coverage, compute backends (eager vs flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy, d2t/t2d draft-vocab compression), target backends (co-located, remote, offline cache), serving and benchmarking, inference-engine compatibility, and a consolidated config reference. Fold the standalone regenerate_with_target.md into the README's data preparation section (full two-step flow, tuning table, pitfalls) and remove the separate file so there is a single entry point. 
* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support (#2284) * feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support * fix the memory management for training large 14B wan model * fix wan2.2 support * all good for wan2.2 * update * docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry * fix another round of code review * fix(diffusion): sort wan.py imports to satisfy CI isort (I001) * fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage inference) by loading EMA/consolidated state dicts with map_location="cpu"; load_state_dict copies into the already-on-device parameters. --------- * test: include find_unused_parameters in ddp manager args expectation The DDP strategy config exposes find_unused_parameters (default False), so _build_diffusion_parallel_manager_args returns it in the ddp branch. Update the test's expected dict to match, fixing the L0 unit test failure. * fix(distributed): address Claude review comments - infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config) to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset, so a custom precision policy isn't silently dropped for EP models. - skills/nemo-automodel-distributed-training/SKILL.md: fix stale references: MeshContext no longer holds strategy_config/pipeline_config/moe_config and STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now lives in components/distributed/config.py.
--------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Signed-off-by: Adil Asif <adasif@nvidia.com> Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> Signed-off-by: thyways <2484113689@qq.com> Signed-off-by: khazic <khazzz1c@gmail.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com> Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Signed-off-by: linnan wang <linnanw@nvidia.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com> Co-authored-by: khazzz1c <khazzz1c@gmail.com> Co-authored-by: thyways <2484113689@qq.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Pranav Thombre <pthombre@nvidia.com> Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-authored-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Co-authored-by: linnan wang <linnanw@nvidia.com>
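The per-parameter compute-dtype precedence described in the FSDP2 dtype commits above (pinned fp32 > HF-recorded checkpoint dtype > mp_policy.param_dtype) can be sketched in plain Python. This is only an illustration of the ordering, not the code in this PR: `resolve_compute_dtype`, its arguments, and the string dtype names are hypothetical stand-ins.

```python
from typing import Optional


def resolve_compute_dtype(
    pinned_fp32: bool,
    recorded_dtype: Optional[str],
    policy_param_dtype: str,
) -> str:
    """Pick a compute dtype by precedence (hypothetical helper).

    1. A parameter pinned to fp32 always computes in fp32.
    2. Otherwise the dtype recorded from the checkpoint wins, if any.
    3. Otherwise fall back to the mixed-precision policy's param dtype.
    """
    if pinned_fp32:
        return "float32"
    if recorded_dtype is not None:
        return recorded_dtype
    return policy_param_dtype
```

Under this ordering, a pinned parameter stays in fp32 even when the checkpoint recorded it as bf16, and a parameter with no recorded dtype simply follows the policy.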
feat(vlm): enable Qwen3.5 MoE VLM CP (#2432) * feat(vlm): enable Qwen3.5 MoE VLM CP * test(vlm): cover Qwen3.5 MoE VLM CP changes Add unit coverage for the new/changed code paths in PR #2432: - cp_utils: opt-in seq_index CP buffer, singleton expansion, arange-continued padding - Qwen3_5MoeBlock.forward: seq_index threading into linear_attn, stripping on full-attn path - prepare_model_inputs_for_cp / _pre_embed_only dispatch and text-only forward path - PreTokenizedDatasetWrapper inject_fake_images gating + build_dataloader passthrough - _run_validation_epoch: total_tokens not summed over CP ranks * style(vlm): sort imports in qwen3_5_moe model.py Fixes ruff I001 (unsorted import block) flagged by CI linting: `import inspect` was added above `import copy`. * refactor(qwen): keep CP seq index out of cp utils * rename qwen medpix cp2 config * test(qwen): align CP seq-index tests with cp_linear_attn refactor The "keep CP seq index out of cp utils" refactor moved seq_index handling out of make_cp_batch_and_ctx and prepare_model_inputs_for_cp into CPAwareGatedDeltaNet. Update tests accordingly: - drop obsolete seq_index buffer/padding tests from test_cp_utils - prepare_model_inputs_for_cp now returns only inputs_embeds + position_ids - rewrite TestExtractLocalPositions -> TestExtractLocalSeqIndex for the new _extract_local_seq_index signature - add coverage for _build_dual_chunk_local_positions (DualChunkSwap layout) --------- Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(model): flux2 (#2145)

* flux2 init draft
* update
* fix(diffusion): revert flux1 example and add flux2 inference config
  - Revert accidental FLUX.2 changes in flux_t2i_flow.yaml back to FLUX.1-dev
  - Add examples/diffusion/generate/configs/generate_flux2.yaml for FLUX.2-dev inference
* fix(diffusion): fix flux2 contiguity and text encoder eval
  - Add .contiguous() after permute in _pack_latents and _unpack_latents so hidden_states is always contiguous before flash-attention kernel
  - Call pipeline.text_encoder.eval() after device placement, consistent with FluxProcessor, WanProcessor, and QwenImageProcessor
* feat(diffusion): sync flux2 configs with main performance fields
  Add optimizer flags (foreach/fused), performance block, FSDP2 prefetch tuning, save_checkpoint_every_epoch, and save_consolidated=final to flux2_t2i_flow.yaml and flux2_t2i_flow_lora.yaml to match the fields added to flux_t2i_flow.yaml in main.
* fix(diffusion): fix flux2 cfg dropout to apply per-sample not per-batch
  Replace single random.random() gate (correlated across entire batch) with a per-sample Bernoulli mask so each sample independently has cfg_dropout_prob chance of receiving zeroed text embeddings. Also drop the now-unused `import random`.
* fix(diffusion): fix flux cfg dropout to apply per-sample not per-batch
  Replace single random.random() gate (correlated across entire batch) with a per-sample Bernoulli mask so each sample independently has cfg_dropout_prob chance of receiving zeroed text/pooled embeddings. Also drop the now-unused `import random`.
* test(diffusion): add unit tests for Flux2Adapter and Flux2Processor
  - tests/unit_tests/flow_matching/test_flux2_adapter.py: 36 tests covering pack/unpack roundtrip + contiguity, 4D positional IDs (img_ids/txt_ids) shape/dtype/value correctness, prepare_inputs keys/shapes/normalization/CFG dropout, and forward model call kwargs
  - tests/unit_tests/diffusion_processors/test_flux2_processor.py: 22 tests covering model_type/default_model_name properties, encode_image BN normalization + dtype + squeeze, encode_text Mistral3 args + no-clip keys, verify_latent shape/NaN/Inf checks, get_cache_data structure, and ProcessorRegistry lookup

---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
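The per-sample CFG dropout fix described above can be illustrated without tensors. The sketch below is a plain-Python stand-in, not the adapter code: it uses the standard `random` module and nested lists where the real code presumably uses a tensor mask, and `cfg_dropout_mask` and `apply_cfg_dropout` are hypothetical names.

```python
import random
from typing import List


def cfg_dropout_mask(batch_size: int, drop_prob: float, rng: random.Random) -> List[bool]:
    """Draw one independent Bernoulli sample per batch element (True = drop)."""
    return [rng.random() < drop_prob for _ in range(batch_size)]


def apply_cfg_dropout(text_embeds: List[List[float]], mask: List[bool]) -> List[List[float]]:
    """Zero the text embedding of every sample whose mask entry is True."""
    return [
        [0.0] * len(embed) if drop else list(embed)
        for embed, drop in zip(text_embeds, mask)
    ]
```

A single `random.random()` gate would drop either every sample in the batch or none of them; with one draw per sample, each sample is dropped independently with probability `drop_prob`.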
fix(config): glm4.7 yaml (#2527)

fix(recipe): use distributed.moe schema for glm_4.7_flash packed-seq

The glm_4.7_flash_te_packed_sequence recipe used a top-level `moe_config:` block with `_target_: nemo_automodel.components.moe.config.MoEParallelizerConfig`, but that class lives at `nemo_automodel.components.distributed.config` (moe.config has no such attribute). The config loader eagerly resolves every `_target_`, so this raised at config-load time, before training started:

    AttributeError: module 'nemo_automodel.components.moe.config' has no attribute 'MoEParallelizerConfig'

The top-level `moe_config:` block is not a supported recipe key: MoE parallelizer settings are read from `distributed.moe` and activation checkpointing from `distributed.activation_checkpointing` (see recipes/_dist_utils.py:parse_distributed_section). Every other MoE recipe (e.g. glm_4.7_flash_te_deepep.yaml) already uses the distributed.* form.

Fix: drop the top-level moe_config block and move `activation_checkpointing: false` under `distributed:`. With ep_size>1 the loader builds a default MoEParallelizerConfig, preserving the original intent (activation checkpointing off).

Verified against current source: the recipe now loads without the AttributeError, and parse_distributed_section yields the correct default MoEParallelizerConfig with activation_checkpointing=False.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
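To see why a stale `_target_` fails at config-load time rather than at first use, here is a minimal sketch of eager dotted-path resolution. It is not the project's config loader; `resolve_target` and `can_resolve` are hypothetical helpers built on the standard `importlib` module, and the example paths point at the standard library so the sketch runs anywhere.

```python
import importlib
from typing import Any


def resolve_target(dotted_path: str) -> Any:
    """Import the module part of a dotted path and fetch the final attribute.

    Raises ImportError if the module is missing and AttributeError if the
    module exists but has no such attribute.
    """
    module_name, _, attr_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def can_resolve(dotted_path: str) -> bool:
    """Report whether a target resolves, without letting the error propagate."""
    try:
        resolve_target(dotted_path)
    except (ImportError, AttributeError):
        return False
    return True
```

A loader that calls something like `resolve_target` on every `_target_` while parsing will surface a wrong module path immediately, which matches the failure mode described in the commit message.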
…nto `r0.5.0` (#2525)

fix(gemma4): cast dense params without casting buffers (#2359)

* fix(gemma4): avoid bf16 casting dense model buffers
* ADD test code fix(gemma4): avoid bf16 casting dense model buffers
* fix(gemma4): cast dense params without rounding buffers
* Fix Gemma4 MoE import formatting for Ruff

---------

Signed-off-by: kdg6245 <kdg6245@snu.ac.kr>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Dogeun Kim <82812668+DOGEUNNKIM@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
… `r0.5.0` (#2554) fix(qwen3_5): make dense VLM pipeline-parallel safe (#2524) * fix(qwen3_5): make dense VLM pipeline-parallel safe The dense Qwen3.5/3.6 VLM crashed on the first PP forward with "KeyError: slice(None, 64, None)". It keeps its own outer forward under PP but delegates the decoder stack to HF's Qwen3_5TextModel.forward, which is not PP-aware: it slices self.layers[: num_hidden_layers] (the splitter rewrites layers into a ModuleDict) and calls self.norm/self.embed_tokens unconditionally (dropped to None on non-last/non-first stages). Fix (all in qwen3_5/model.py): - Add Qwen3_5TextModelPP(Qwen3_5TextModel) overriding forward to present the post-split ModuleDict layers as a slice-able ModuleList over the same layer objects and swap a dropped norm for nn.Identity, delegating to super() so HF's mRoPE/mask/rotary logic is reused unchanged. Pure passthrough off PP. - In __init__, set self.model.language_model.__class__ = Qwen3_5TextModelPP (same instance+weights; class-based so it survives the splitter's deepcopy). - Make the outer forward PP-stage-aware: stage 0 runs the full HF VLM path, middle/last stages feed upstream hidden states straight into the text backbone, lm_head runs only on the last stage, MTP is skipped under PP. Non-PP (TP/CP/single) path is unchanged. Verified: TP4xPP4 on 2 nodes (MedPix, 27B dense) trains end to end, loss 1.82 -> 1.53, ~19s/step, no errors. * test(qwen3_5): cover PP-stage dispatch and text-backbone class swap Add unit tests exercising the previously-uncovered new lines in the dense VLM PP fix: the __init__ Qwen3_5TextModelPP class swap and the outer forward's PP-stage dispatch (first / middle / last stage and the non-PP fall-through), using a tiny CPU VLM with simulated per-stage module layouts (embed_tokens / lm_head dropped as the splitter would). Raises codecov patch coverage on model.py. 
--------- Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
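The class-swap technique in the Qwen3.5 PP fix above (reassigning `__class__` so the override survives the splitter's deepcopy, and presenting dict-keyed layers as a sliceable list) can be shown with plain Python objects. This is a toy sketch under stated assumptions: `TextModel` and `TextModelPP` are hypothetical stand-ins for the HF and PP-aware classes, and a plain `dict` stands in for the post-split ModuleDict.

```python
import copy


class TextModel:
    """Stand-in for a backbone whose forward slices its layer container."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        # Slicing works on a list but fails on a dict-like container.
        for layer in self.layers[: len(self.layers)]:
            x = layer(x)
        return x


class TextModelPP(TextModel):
    """Subclass that temporarily presents dict-keyed layers as a list."""

    def forward(self, x):
        if not isinstance(self.layers, dict):
            return super().forward(x)  # pure passthrough off the split path
        original = self.layers
        self.layers = list(original.values())  # same layer objects, sliceable
        try:
            return super().forward(x)
        finally:
            self.layers = original  # restore the dict layout


model = TextModel([lambda x: x + 1, lambda x: x * 2])
model.__class__ = TextModelPP          # same instance, new class
stage = copy.deepcopy(model)           # the copy keeps the swapped class
stage.layers = {"0": stage.layers[0], "1": stage.layers[1]}  # simulate the split
```

Because the swap changes the instance's class rather than attaching anything to the instance, `copy.deepcopy` reconstructs the copy as a `TextModelPP`, and `stage.forward` keeps working after the layers are rewritten into a dict.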
… recipes (2539)` into `r0.5.0` (#2550) feat(examples): add Nemotron-3-Ultra-550B benchmark and full-SFT recipes (#2539) * feat(examples): add Nemotron-3-Ultra-550B benchmark and full-SFT recipes Add two example configs for Nemotron-3-Ultra-550B-A55B on 16x H100 nodes (128 GPUs, EP=64) and record the 16-node pre-training throughput in the performance summary: - examples/llm_benchmark/nemotron/nemotron_ultra_v3_te_deepep.yaml: throughput benchmark (torch_mm experts, balanced gate, MockIterableDataset, repeated MTP head) measuring 815 tok/s/GPU, 293 TFLOP/s/GPU at 10.05 s/global step. - examples/llm_finetune/nemotron/nemotron_ultra_v3_full_sft.yaml: full supervised fine-tune (real router, gmm experts, THD sequence packing at 4096, repeated MTP head) on SQuAD with chat-template formatting and answer-only loss masking. Validated end-to-end on 128x H100 (loss 6.1 -> 1.7, ~49 GiB/GPU). - docs/performance-summary.mdx: add the Ultra 550B pre-training row and its benchmark config link. * refactor(examples): rename Ultra-550B SFT example to _squad, tidy comments Rename nemotron_ultra_v3_full_sft.yaml -> nemotron_ultra_v3_squad.yaml to match the directory's <model>_squad.yaml naming convention (the example fine-tunes on SQuAD). Simplify the inline comments and update the self-referencing launch path in the header. fake_balanced_gate is dropped from the backend block since it defaults to False (real router) -- behavior is unchanged. * chore: restore perf-summary Last Updated date; trailing newline in SQuAD yaml Revert docs/performance-summary.mdx Last Updated back to 2025-10-02 (the date bump will be handled separately later). Add the missing trailing newline at EOF of the Ultra-550B SQuAD example. --------- Signed-off-by: Adil Asif <adasif@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…nts (2546)` into `r0.5.0` (#2558) ci: schedule ep-parallel finetune recipes at documented node counts (#2546) * ci(deepseek_v4): set nodes so recipes run at documented scale The deepseek_v4 finetune recipes declare large parallelism (pp_size x ep_size of 128/512 GPUs) but had no `ci:` block, so release-scope auto-discovery in generate_ci_tests.py emitted them with the default 1 node (8 GPUs). On 8 GPUs non_pp_size = world/pp = 2, which cannot satisfy ep_size=32, so _create_fsdp2_device_mesh raised "ValueError: non_pp_size=2 must be a multiple of ep_size=32" and every deepseek_v4 job failed at recipe.setup(). Add a `ci:` block with the node count each recipe's header already documents so CI allocates the correct world size: - flash_hellaswag / flash_packed_sequence_hellaswag: 16 nodes (pp4*ep32=128) - flash_hellaswag_lora: 4 nodes (pp1*ep32=32) - pro_*_pp8_ep64_*: 64 nodes (pp8*ep64=512) Fixes CI jobs 337980465/337980467/337980469/337980471/337980601 (pipeline 54319542). * ci: set node counts for the remaining ep-parallel finetune recipes Same root cause and fix as the deepseek_v4 recipes, for the rest of the config_parallelism/ep_size_not_divisible bucket (AM-419). These recipes declare large expert/pipeline parallelism but had no `ci:` block, so release-scope auto-discovery scheduled them on the default 8 GPUs, where non_pp_size % ep_size != 0 and _create_fsdp2_device_mesh raised "non_pp_size must be a multiple of ep_size". 
Add `ci.nodes` at each recipe's documented scale: - glm_5.1_lora, mimo_v2_flash_hellaswag, step3p7_medpix_200b_ep32pp4: 16 nodes (pp4*ep32 = 128 GPUs) - hy3_preview_deepep_lora (pp1*ep64 = 64), step3p7_medpix_200b_lora_pp8ep8_8node (pp8*ep8 = 64): 8 nodes - ling_flash_2_0_sft, minimax_m2.7_hellaswag_lora, nemotron_ultra_v3_hellaswag_peft: 4 nodes (pp1*ep32 = 32 GPUs) hy3_preview_deepep: fix ep_size 32 -> 8 (the recipe's own comments say "ep_size=8 gives 24 experts/rank" and "32 GPUs for full fine-tuning"; 192 % 8 = 0) and add ci.nodes: 4 (pp4*ep8 = 32 GPUs). nemotron_ultra_v3_hellaswag_peft_gb200 is intentionally left unchanged: it is a GB200-only recipe (4 GPUs/node, 184 GB) and the eos functional pipeline has no per-recipe GB200 routing, so it remains a known failure. Fixes CI jobs 337980609 337980497 337980612 337980499 337980515 337980623 337980631 337980380 337980397 (pipeline 54319542). * Apply suggestion from @akoumpa * hy3_preview_deepep: pp8/ep8 @ 8 nodes (pp4/ep8 @ 4 nodes OOMs; verified on cw) --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
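The node-count arithmetic behind these `ci.nodes` values can be checked with a few lines of Python. This is a sketch of the divisibility rule quoted in the commit message (non_pp_size must be a multiple of ep_size), not the `_create_fsdp2_device_mesh` implementation; the function names and the 8-GPUs-per-node default are assumptions taken from the figures above.

```python
def non_pp_size(nodes: int, pp_size: int, gpus_per_node: int = 8) -> int:
    """Return the number of ranks left per pipeline stage (world_size / pp_size)."""
    world_size = nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError(f"world_size={world_size} is not divisible by pp_size={pp_size}")
    return world_size // pp_size


def fits_ep(nodes: int, pp_size: int, ep_size: int, gpus_per_node: int = 8) -> bool:
    """Check the rule that non_pp_size must be a multiple of ep_size."""
    return non_pp_size(nodes, pp_size, gpus_per_node) % ep_size == 0
```

With the default single node, `non_pp_size(1, 4)` is 2, which cannot satisfy ep_size=32, while `fits_ep(16, 4, 32)`, `fits_ep(4, 1, 32)` and `fits_ep(64, 8, 64)` all hold, matching the 16-, 4- and 64-node settings listed in the commit.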
…0.5.0` (#2567)

fix(diffusion): resolve flux nightly CI failures (#2529)

Two startup failures in the diffusion nightly functional tests:

- flux_t2i_flow_lora: the CI launcher unconditionally passed --fsdp.dp_size, which injects an 'fsdp' section and conflicts with the recipe's 'ddp' section (mutual-exclusion ValueError). The launcher now skips the override for DDP-based recipes.
- flux_t2i_flow: _build_diffusion_parallel_manager_args called dict() on ConfigNode sections, which are not iterable. Normalize fsdp/ddp sections via to_dict() before use; this also fixes the identical latent bug in the DDP branch.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ert (2575)` into `r0.5.0` (#2578) fix(ci): bump ling_1t_lora_pp local_batch_size to satisfy PP assert (#2575) fix(ci): bump ling_1t_lora_pp local_batch_size to satisfy PP assert (AM-471) The ling_1t_lora_pp recipe fails in CI with: AssertionError: pp_batch_size 4 // pp_microbatch_size 1 must be >= pp_size 8 train_ft.py requires local_batch_size // pp_microbatch_size >= pp_size so the pipeline schedule (interleaved1f1b, pp_size=8) has at least pp_size microbatches to fill its stages. The recipe set local_batch_size=4, giving only 4 microbatches. Raise local_batch_size 4 -> 8 (= pp_size * pp_microbatch_size). Per-microbatch size stays 1 (no extra memory), and global_batch_size=512 stays cleanly divisible: 512 / (local 8 * dp 8) = 8 grad-accum steps. Also reassign recipe_owner to akoumpa. Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
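The constraint and the grad-accum arithmetic in this message can be checked with a short sketch. The function names are hypothetical; only the inequality and the numbers come from the commit text:

```python
def pp_microbatches(local_batch_size, pp_microbatch_size, pp_size):
    """Number of microbatches per step; must be at least pp_size to fill the stages."""
    n_microbatches = local_batch_size // pp_microbatch_size
    if n_microbatches < pp_size:
        raise AssertionError(
            f"pp_batch_size {local_batch_size} // pp_microbatch_size "
            f"{pp_microbatch_size} must be >= pp_size {pp_size}"
        )
    return n_microbatches


def grad_accum_steps(global_batch_size, local_batch_size, dp_size):
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_step = local_batch_size * dp_size
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by local_batch_size * dp_size")
    return global_batch_size // per_step
```

Here `pp_microbatches(4, 1, 8)` reproduces the failing assert, `pp_microbatches(8, 1, 8)` passes, and `grad_accum_steps(512, 8, 8)` returns the 8 steps quoted above.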
…amba merge) (2559)` into `r0.5.0` (#2576)

test: fix all 5 vllm_deploy tests (token drift, nemotron OOM + mamba merge) (#2559)

test: fix nemotron-9b vllm_deploy (OOM, mamba LoRA via merge, token drift)

Job 337980666 (nemotron_nano_9b_squad_peft_vllm_deploy) failed three ways:

1. vLLM EngineCore OOM (62.24/79.11 GiB free < 0.9 target): the HF model stayed resident via the PeftModel<->base reference cycle. Fix: gc.collect() before empty_cache() + gpu_memory_utilization=0.7.
2. vLLM cannot apply LoRA to NemotronH's fused mamba MambaMixer2 (asserts on model.layers.0.mixer.conv1d), independent of the adapter's targets. So enable_lora can't serve this model at all. Fix: when ci.checkpoint_robustness.vllm_merge_lora is set, merge the adapter into the base and deploy the merged model without enable_lora.
3. Exact HF-vs-vLLM greedy token equality is not a valid cross-engine invariant. Fix: compare a matching prefix (MIN_MATCH_PREFIX=5).

Signed-off-by: Adil Asif <adasif@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
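The prefix comparison in fix 3 could look roughly like the sketch below. Only `MIN_MATCH_PREFIX = 5` comes from the commit message; the helper names and the strict "at least 5 leading tokens" rule for short sequences are assumptions:

```python
MIN_MATCH_PREFIX = 5  # value quoted in the commit message


def common_prefix_len(a, b):
    """Length of the longest shared prefix of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_agree(hf_tokens, vllm_tokens, min_prefix=MIN_MATCH_PREFIX):
    """True when the two greedy decodes share at least min_prefix leading tokens."""
    return common_prefix_len(hf_tokens, vllm_tokens) >= min_prefix
```

Two decodes that diverge at the sixth token pass this check; a divergence at the third token does not.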
…0.5.0` (#2573) fix(checkpoint): preserve tied lm_head on resume (#2511) * fix(checkpoint): preserve tied lm_head on resume * docs(checkpoint): clarify tied lm_head storage check * fix(checkpoint): retie local lm_head after sharding * fix(checkpoint): refresh tied lm head state during DCP --------- Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…74)` into `r0.5.0` (#2579) fix(ci): set node counts for multi-node VLM finetune recipes (#2574) * fix(ci): set node counts for multi-node VLM finetune recipes (AM-434) CI auto-discovers every examples/vlm_finetune/**/*.yaml in the release scope and runs each on a single 8-GPU node unless the recipe's ci: section requests more. These three recipes had no ci: section, so they ran with world_size=8 while their parallelism requires more GPUs: mistral3p5_128b_medpix(_lora) need TP*PP=64 and qwen3_5_27b_tp4pp4 needs TP*PP=16. _infer_dp_size then raised "world_size must be divisible by (tp_size * cp_size * pp_size)". Add a ci: section to each requesting the node count its parallelism needs (8 nodes for the 128B medpix recipes, 2 nodes for qwen3_5_27b) so the device mesh builds with dp_size=1 instead of failing. * Change recipe owner from akoumpa to HuiyingLi --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
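The divisibility check that tripped these recipes can be sketched as follows. This is an illustration of the rule quoted in the error message, not the actual `_infer_dp_size` implementation, whose signature is not shown in the source:

```python
def infer_dp_size(world_size, tp_size=1, cp_size=1, pp_size=1):
    """Data-parallel size left over after tensor, context and pipeline parallelism."""
    model_parallel = tp_size * cp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            "world_size must be divisible by (tp_size * cp_size * pp_size)"
        )
    return world_size // model_parallel
```

For qwen3_5_27b_tp4pp4, `infer_dp_size(8, tp_size=4, pp_size=4)` raises on a single 8-GPU node, while two nodes (`world_size=16`) give `dp_size=1`.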
…v3_cp_test (2577)` into `r0.5.0` (#2580) fix(recipe): reshard MoE experts after forward in nemotron_nano_v3_cp_test (#2577) The cp=2/ep=4 variant shards experts on a 2-wide ep_shard FSDP dimension. With the default reshard_after_forward=False the all-gathered expert weights stay resident across the whole forward (~11.5 GB/rank); once Adam state is allocated on step 0, step 1's forward exceeds the 80 GiB H100 budget and OOMs. Setting moe.reshard_after_forward=true frees that headroom and is numerically transparent (recompute of the gather only), so the CP-parity comparison is unaffected. Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ci: use digits for spark recipes (#2581) use digits for spark recipes Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…464) (2585)` into `r0.5.0` (#2586) ci: Enable activation checkpointing for gemma_2_9b_it_squad (AM-464) (#2585) enable activation checkpointing for gemma_2_9b_it_squad gemma_2_9b_it_squad OOMs on 8xH100-80GB in the backward pass at train step ~10/50 (AM-464, nemo-ci job 337980482). With attn_implementation= eager and no activation checkpointing, each of the 42 layers keeps a full [B, heads, S, S] attention-score tensor for backward; the first uncapped pad-to-longest SQuAD batch then spikes past 80 GB. Enable activation_checkpointing so per-layer activations are recomputed in backward (one layer's eager scores resident at a time). Verified on cw-dfw 8xH100: baseline OOMs at step 10; with this change the run completes 20 steps + a full validation pass at 71-79 GB peak, with an identical (loss-neutral) loss curve. Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…p (2582)` into `r0.5.0` (#2588) fix(test): load checkpoint-robustness HF reference via device_map (#2582) The checkpoint_robustness test's Phase 4 loads a vanilla-HF reference model on rank 0 only (AutoModelForCausalLM.from_pretrained -> CPU -> .to(device)). For a 14B checkpoint that is ~50s warm / ~225s cold; the other ranks idle at the post-phase _barrier(), so the rank-0 stall overruns the NCCL watchdog (dist_env.timeout_minutes), the peers abort (SIGABRT) and rank 0 is SIGKILL'd in teardown. nemo-ci's extractor reports the rank-0 SIGKILL as "Signal/OOM-Kill", but it is a watchdog timeout, not a host OOM. Load the reference model straight onto the GPU via device_map (~12s) for standard-HF loads; trust_remote_code (needs _no_meta), quantized and device_map=auto paths are unchanged. Fixes AM-468 (phi_4_squad, phi_4_squad_peft). Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…53) (2584)` into `r0.5.0` (#2597) fix(peft): LoRA MLP QLoRA/PP/gemma3n fixes (AM-435, AM-447, AM-453) (#2584) * fix(peft): make gemma3n per-layer LoRA output safe for in-place use (model-scoped) gemma3n_vl_4b_medpix_peft (VLM PEFT) failed at backward with "Output 0 of LoRATritonFunctionBackward is a view and is being modified inplace": transformers gemma3n project_per_layer_inputs does an in-place op on the output of per_layer_model_projection, which under the memory-efficient LoRA path is a view of a custom autograd Function output. Fix: instead of cloning inside the generic LoRATritonFunction (a clone on every LoRA forward, for all models), components/_peft/lora.py adds patch_gemma3n_inplace_lora_views() (called from apply_lora_to_linear_modules). It structurally detects gemma3n's text model and wraps only its per_layer_model_projection LoRA forward to return a non-view clone -- one clone per forward instead of ~200+, and zero cost for non-gemma3n LoRA. clone() is an autograd identity (grads unchanged); the patch is guarded and idempotent. (cherry picked from commit f02c0f4) * fix(peft): skip fused LoRA SwiGLU/ReLU2 path for quantized (QLoRA) base weights QLoRA 4-bit base weights are packed buffers (e.g. shape (1, out*in/2)), not a 2D (out_features, in_features) matrix, so the fused path's ``F.linear(x, base_weight)`` failed with "mat1 and mat2 shapes cannot be multiplied (Nx4096 and 1x14680064)". ``_fusible`` now rejects quantized bases (bitsandbytes ``quant_state`` marker, or weight shape != (out, in)) so the per-linear ``LinearLoRA`` path (which dequantizes the base) handles them. Adds a regression test. Fixes AM-435. * fix(peft): return fused LoRA MLP grads on their parameter's device (PP/meta-safe) Under pipeline parallelism torch builds the backward graph with the LoRA parameters on the meta device while the activations (and the grads computed from them) are on cuda. 
The fused LoRASwiGLUMLPFunction / LoRAReLU2MLPFunction returned the cuda grads, so torch rejected them: Function LoRASwiGLUMLPFunctionBackward returned an invalid gradient at index 2 - expected device meta but got cuda Move each LoRA gradient onto its parameter's device before returning (a no-op in normal single-device training; meta in the PP graph pass, so no real gradient is lost). Verified on 2-GPU PP (qwen3 + fused SwiGLU LoRA): the crash is gone and training runs end-to-end. Fixes AM-447. * refactor(peft): make memory-efficient LoRA output non-view (drop gemma3n patch) AM-453's gemma3n-specific patch (patch_gemma3n_inplace_lora_views) lived in the generic _peft/lora.py and only covered one consumer. Root cause: the memory-efficient LoRATritonFunction returned a *view* of its custom-autograd output (an in-Function .view()), which torch forbids mutating in place ("Output 0 of LoRATritonFunctionBackward is a view and is being modified inplace"). Move the (N, out) -> (bs, seq, out) reshape OUT of the autograd Function into a thin apply_memory_efficient_lora wrapper (used by LinearLoRA). The Function now returns a non-view 2D tensor; the wrapper's reshape is an ordinary autograd view, which supports in-place ops. This fixes the bug class for any in-place consumer with no model-specific code, and removes the gemma3n patch + its tests. Adds a generic in-place-safety regression test (gemma3n-style consumer, no patch). AM-453. --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
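The QLoRA guard described for AM-435 (reject a base weight that carries a `quant_state` marker or whose shape is not `(out, in)`) can be sketched without any framework. `WeightStub` and `is_fusible` are hypothetical names used only for illustration; the real `_fusible` check may look at more conditions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WeightStub:
    """Hypothetical stand-in for a parameter: just a shape and an optional marker."""

    shape: Tuple[int, ...]
    quant_state: Optional[object] = None


def is_fusible(weight, out_features, in_features):
    """Allow the fused path only for a plain 2D (out_features, in_features) weight."""
    # A bitsandbytes-style quant_state marker means the buffer is packed.
    if getattr(weight, "quant_state", None) is not None:
        return False
    # A packed buffer also fails the shape test even without the marker.
    return tuple(weight.shape) == (out_features, in_features)
```

The shape in the quoted error is consistent with this: a 7168 x 4096 weight packed at 4 bits occupies 7168 * 4096 / 2 = 14680064 elements, the `1x14680064` operand that `F.linear` rejected.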
…M recipes (2600)` into `r0.5.0` (#2602) fix(vlm): enable activation checkpointing for 35B Qwen3.5/3.6 VLM recipes (#2600) The qwen3_5_35b VLM nightly recipe OOMs intermittently in the backward pass on 8xH100-80GB. Since #1896 restored fp32 master weights for custom MoE under FSDP2 (a convergence fix), steady-state peak memory rose ~3 GiB to ~67 GiB, leaving only ~2 GiB of headroom; variable VLM/MoE batch sizes then push the backward pass over the limit on some steps. Enable activation_checkpointing (recompute transformer-block activations in backward) on the two full-FT 35B-A3B recipes that use the memory-tight pp1/ep8 layout: - examples/vlm_finetune/qwen3_5_moe/qwen3_5_35b.yaml - examples/vlm_finetune/qwen3_5_moe/qwen3_6_35b.yaml Verified qwen3_5_35b on 8xH100-80GB: baseline OOMs at step 31 (peak 67.2 GiB); with AC it completes all 50 steps + validation at 56.3 GiB. qwen3_6_35b is the identical pp1/ep8 full-FT analog (its ep8cp2 sibling already enables AC). The pp2/ep4 neat_packing recipe was checked and left unchanged: it peaks at ~48 GiB and does not OOM. Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…grad-accum (2566)` into `r0.5.0` (#2599) fix(gemma4): FSDP2-safe kv-sharing + skip frozen audio tower on grad-accum (#2566) * fix(gemma4): make shared_kv_states FSDP2-safe for kv-shared layers * fix(fsdp2): skip wrapping frozen audio tower to avoid grad-accum crash * fix(pp): thread shared_kv_states through gemma4 pipeline-parallel forward * chore(gemma4): update example run commands, warmup, and audio-tower comment --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…gits OOM (2603)` into `r0.5.0` (#2604) fix(vlm): use FusedLinearCrossEntropy for qwen3_5_9b to avoid logits OOM (#2603) The qwen3_5_9b VLM nightly recipe (FinetuneRecipeForVLM, medpix) CUDA-OOMs in MaskedCrossEntropy: it fp32-upcasts the full [num_tokens, vocab] logits and calls F.cross_entropy, which spikes ~45 GiB (steady ~30 -> 77+ GiB) on large vision-token batches and OOMs at masked_ce.py:84 (AM-457). Switch loss_fn to FusedLinearCrossEntropy: with the recipe's logits_to_keep=1 path the full logits matrix is never materialized. Matches the qwen3_6_27b_medpix recipes, which already use it for the same reason. Verified on 8xH100-80GB: baseline OOMs at step 31 in cross_entropy; with FusedLinearCrossEntropy the run completes all 50 steps + validation at 37.7 GiB peak (no extra config needed). Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
… (2589)` into `r0.5.0` (#2605) fix(distributed): register Falcon-H1 TP plan to fix 34B PEFT OOM (#2589) falcon_h1_34b_instruct_squad_peft (8xH100, tp_size=4) OOMs during the first training steps with ~72 GiB allocated per GPU. Root cause: Falcon-H1 is a hybrid Transformer + Mamba2 model. HuggingFace ships only `_tp_plan = {"lm_head": "colwise_gather_output"}` for it, and names its MLP `feed_forward` (not `mlp`). The parallelizer rejects the HF plan (the `colwise_gather_output` style is unknown) and falls back to the generic llama-style base plan, whose `model.layers.*.mlp.*` patterns never match `feed_forward`. The dominant MLP weights (~70% of the 34B) are thus left replicated across the TP group, so each rank holds almost the whole model -> OOM. The attention path already sharded fine via the fallback. Fix: add a dedicated FalconH1ForCausalLM plan that shards `self_attn` and `feed_forward` (correct module name) and leaves the Mamba2 mixer replicated (its SSM scan / conv1d are not TP-shardable with stock kernels, same as Qwen3.5's GatedDeltaNet branch). Registered by both qualified name (native transformers load) and bare name (trust_remote_code load). This is the only Falcon-H1 recipe using tp_size>1; the 0.5B/1.5B/7B recipes use tp_size=1 and are unaffected. Also drops a pre-existing unused-variable in the touched test file. Verifies CI job 337980604 (pipeline 54319542). (cherry picked from commit 90bcf87) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
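Why the fallback plan left the MLP replicated can be seen with a toy pattern match. This sketch uses `fnmatch` as a stand-in for however the parallelizer resolves plan keys, and the module names and pattern lists are illustrative assumptions, not the registered plan:

```python
from fnmatch import fnmatchcase

# Illustrative plan keys: llama-style fallback vs. a Falcon-H1-specific plan.
BASE_PLAN_PATTERNS = ["model.layers.*.self_attn.*", "model.layers.*.mlp.*"]
FALCON_H1_PATTERNS = ["model.layers.*.self_attn.*", "model.layers.*.feed_forward.*"]

# Illustrative module names for one hybrid layer.
MODULES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.feed_forward.up_proj",
    "model.layers.0.mamba.in_proj",
]


def sharded_modules(module_names, patterns):
    """Modules matched by at least one plan pattern; the rest stay replicated."""
    return [
        name
        for name in module_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

Under the base patterns only the attention projection matches; the dedicated patterns also pick up `feed_forward`, and the Mamba mixer stays replicated in both cases.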
…ast (2549)` into `r0.5.0` (#2606) fix(models): keep RoPE frequency buffers fp32 under bf16 model cast (#2549) * fix(llama): keep RoPE inv_freq in float32 under bf16 model cast LlamaForCausalLM.__init__ casts the whole model with self.to(config.torch_dtype). nn.Module.to rounds floating-point buffers, so the non-persistent inv_freq buffer in LlamaRotaryEmbedding was being downcast to bf16. _build_cache then upcast it back to float32 to build the cos/sin tables, but the precision was already lost -- the low-freq components (largest under rope_theta=5e5 + llama3 scaling) carried up to ~17% relative error. Vanilla HF keeps inv_freq in float32 (from_pretrained reconstructs it and never overwrites the non-persistent buffer), so a checkpoint trained with AutoModel and reloaded in HF computed slightly different RoPE. On a trained model's peaky output distribution that small logit gap (~0.1 max) is amplified into a large per-token KL: the checkpoint_robustness Phase 4 (automodel -> vanilla HF) test failed with max KL 1.1e-2 > 5e-3 threshold. Fix: recompute inv_freq in float32 from config inside _build_cache, so the rotary tables are independent of the model's parameter dtype. Also fixes Qwen2, which shares this module. After the fix, AutoModel and HF logits are bit-identical (max KL 0.0 at all softmax temperatures; rotary cos/sin match HF exactly). Added a regression test that fails on the old code (0.0039 abs / 17% rel) and passes now. * fix(models): keep RoPE frequency buffers fp32 under bf16 model cast The prior commit fixed llama; the same bug — a rotary inv_freq/freqs_cis buffer rounded to bf16 by a model-wide .to(dtype), degrading RoPE vs HF and causing a large logit/KL divergence on vanilla-HF reload — affects 6 more families. 
Fix each via the existing _keep_in_fp32_modules mechanism (honored by cast_model_to_dtype), routing the two raw-cast models through cast_model_to_dtype: - deepseek_v4, mimo_v2_flash: add "rotary_emb" to _keep_in_fp32_modules_strict (matches rotary_emb + rotary_emb_compress / swa_rotary_emb). - minimax_m3_vl: add "inv_freq" — the vision tower's rotary buffer is not under a module named "rotary_emb". - gemma4_moe, diffusion_gemma: switch raw self.to(dtype) -> cast_model_to_dtype and add _keep_in_fp32_modules=["rotary_emb"] (raw .to ignores keep-fp32). - kimi_k25_vl: switch raw model.to(dtype) -> cast_model_to_dtype and add _keep_in_fp32_modules=["freqs_cis","rotary_emb"]. Add cast_model_to_dtype rope-buffer regression tests: non-persistent inv_freq/freqs_cis preserved across the cast, incl. the unprotected-is-rounded reproduction case. * fix(models): honor set-valued _keep_in_fp32_modules + add bf16-init rope tests cast_model_to_dtype's _get_fp32_module_keywords only collected list-valued keep-fp32 attributes, but HF's PreTrainedModel.__init__ normalizes _keep_in_fp32_modules from a class-level list to an instance-level set — so the gemma4_moe / diffusion_gemma rope fixes (which set _keep_in_fp32_modules=["rotary_emb"]) were silently no-ops, leaving inv_freq rounded to bf16. Accept set/tuple too. (deepseek_v4/mimo use the NeMo-only _keep_in_fp32_modules_strict, which HF doesn't touch, so they were unaffected.) Add per-model regression tests that build a mini (~2-layer) model, run the real bf16 init path (initialize_weights, or from_config + cast_model_to_dtype), and assert the rotary frequency buffers stay float32 while a regular weight is bfloat16: deepseek_v4, mimo_v2_flash, gemma4_moe, minimax_m3_vl (vision tower), kimi_k25_vl, and diffusion_gemma (skipped unless the transformers fork is present). Plus set/tuple coverage for _get_fp32_module_keywords and cast_model_to_dtype. 
(cherry picked from commit e41e076) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
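The precision loss from rounding `inv_freq` to bf16 can be reproduced in plain Python. This is a sketch under assumptions: `dim=128` is a hypothetical head dimension, llama3 scaling is omitted, and the bf16 rounding is emulated on float32 bit patterns, so it does not reproduce the exact error figures quoted above:

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000  # keep sign, exponent and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def inv_freq(dim, theta):
    """Standard RoPE inverse frequencies theta^(-2i/dim), stored as float32."""
    return [to_fp32(theta ** (-(2 * i) / dim)) for i in range(dim // 2)]


def max_rel_error(values):
    """Largest relative error introduced by a bf16 round-trip."""
    return max(abs(to_bf16(v) - v) / v for v in values)


def phase_error(position, freq):
    """Absolute rotary-angle error at a position caused by rounding freq to bf16."""
    return abs(position * (to_bf16(freq) - freq))
```

With `rope_theta=5e5`, `max_rel_error(inv_freq(128, 5e5))` is bounded by 2^-8, and `phase_error` shows how that relative error grows with position, which is why recomputing the frequencies in float32 matters.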
…) (2587)` into `r0.5.0` (#2611) fix: use TE attention for gpt_oss packed-sequence recipe (AM-438) (#2587) * fix: use TE attention for gpt_oss packed-sequence recipe (AM-438) gpt_oss_20b_te_packed_sequence.yaml paired backend.attn=flex with THD packed sequences, but FlexAttention does not support the THD (3D) layout (its sink path requires a 4D tensor and ignores cu_seqlens), so the recipe crashed with "ValueError: not enough values to unpack (expected 4, got 3)". TE DotProductAttention supports THD/packed natively (qkv_format=thd + cu_seqlens) and gpt-oss attention sinks via softmax_offset, which is the path #1757 validated. Switch the recipe's attention backend to te. * fix(recipe): lower gpt_oss packed-sequence local_batch_size to 1 to avoid MoE OOM Full fine-tuning of gpt-oss-20b (MoE) on a single 8xH100-80GB node OOMs at local_batch_size: 4 -- in the MoE experts grouped-GEMM and the vocab-201088 cross-entropy logits. local_batch_size: 1 (global_batch_size stays 32 via grad accumulation) trains within ~35 GiB/GPU. Activation checkpointing is not an option here: DeepEP's non-deterministic token dispatch makes the recompute mismatch the forward. --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…avoid OOM (2609)` into `r0.5.0` (#2612) fix(oom): use FusedLinearCrossEntropy in qwen3 tulu3 configs to avoid OOM (#2609) fix(convergence): use FusedLinearCrossEntropy in qwen3 tulu3 configs to avoid OOM The qwen3-4b and qwen3-moe-30b tulu3 convergence configs (added in #1554) OOM on 8xH100-80GB at masked_ce.py logits.float(): with truncation: false a long tulu3 sample builds a full [tokens, vocab] logit tensor whose bf16->fp32 upcast needs one huge contiguous allocation (48 GiB for 4B lb8, 18.6 GiB for 30B). Switch all 5 configs to FusedLinearCrossEntropy + output_hidden_states: true, so the recipe's logits_to_keep=1 path never materializes the full logits. Also drop qwen3_4b_cp1_flashoptim local_batch_size 8->4: once FLCE removes the logits bottleneck it exposes a backward-pass activation OOM at lb8 with unbounded sequences (cp2 is unaffected since CP halves per-GPU seq; the te_fusedadam and 30B configs already use lb2). Verified on 8xH100-80GB (max_steps 10): baseline OOMs (4B 48 GiB / 30B 18.6 GiB at logits.float); with the fix the runs complete 10 steps + validation with no OOM (30B flashoptim, 4B cp1 lb4, 4B cp2). Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
….0` (#2607) perf(distributed): add retrieval tuning knobs (#2452) * perf(distributed): add retrieval tuning knobs * fix(retrieval): unwrap ddp model attrs * perf(retrieval): speed up ddp grad clipping * fix(distributed): reduce DDP recipe metrics * fix(retrieval): preserve optimizer groups and log average loss * chore(retrieval): drop megatron fsdp side changes * style(retrieval): sort bi-encoder imports * style(training): format step scheduler * test(distributed): fix ddp config expectations * perf(retrieval): make autocast configurable * perf(retrieval): wire compile config * test(diffusion): update DDP manager config expectation * test(diffusion): update ConfigNode DDP expectation --------- Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Yuhe Zhang <yuhezhang.zju@gmail.com>
…g probability (2591)` into `r0.5.0` (#2610) fix(moe): weight GroupedExpertsTE down-projection bias by routing probability (#2591) * fix(moe): weight GroupedExpertsTE down-projection bias by routing probability GroupedExpertsTE applied the MoE down-projection bias UNWEIGHTED: TE's GroupedLinear adds the down bias inside the grouped GEMM, but the per-token routing probability is applied at the activation (permuted_probs). So each of the top-k selected experts contributed a full prob-independent down bias, and the combine summed them -> ~k*bias instead of ~1*bias per token, across every MoE layer. This is a large systematic activation offset (gpt-oss-20b finetune step-0 loss ~8.2 vs the correct ~4.5). Add the missing (prob - 1) * down_bias term so the net down-bias contribution is prob * down_bias, matching GroupedExperts (expert_out + down_bias * w) and GroupedExpertsDeepEP (_apply_bias(..., permuted_probs)). No-op for experts without bias; correct in both bf16 and fp8 paths; gradients sum to the correct Sum(prob). Affects all experts=te MoE models with expert bias under EP. Verified on gpt-oss-20b (8xH100, ep=8, deepep, te, packed) on cw-dfw: step-0 loss 8.21 -> 5.10, val 4.19 (matches HF reference 4.53 and the gmm path). * test(moe): 2-GPU GroupedExpertsTE EP-vs-single-GPU down-bias parity guard Add a 2-GPU (ep_size=2) functional regression test that guards the GroupedExpertsTE down-projection-bias fix (PR #2591 / Linear AM-487) so it cannot be silently reverted. TE's GroupedLinear adds the per-expert down bias UNWEIGHTED inside the grouped GEMM, but the per-token routing probability is applied at the activation (permuted_probs). Without the (permuted_probs - 1.0) * down_bias correction, each of a token's top-k expert contributions carries a full prob-independent down bias that the combine step sums (~k x bias instead of ~prob x bias) -- a large systematic offset (gpt-oss-20b step-0 loss ~8.2 vs the correct ~4.5). 
The test builds GroupedExpertsTE with 8 experts sharded 4+4 across 2 ranks, a DeepEP token dispatcher, expert_bias=True, quick_geglu activation and deterministic weights with a non-zero down bias. It feeds seeded, identical hidden states / router indices / probs, runs the EP forward, gathers every rank's local-token output and compares it to a single-GPU reference that applies the correct prob-weighted down bias via plain matmuls. It also checks that the tolerance actually separates the buggy (unweighted) output from the correct one so the guard cannot become vacuous. Verified on 2xGPU: PASSES with the fix (max_err ~1.6e-2 < tol), FAILS without it (max_err ~1.7 > tol). Wired into CI via a new L2_MoE matrix entry (test-folder: moe) on the 2-GPU runners. --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
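The `(prob - 1) * down_bias` correction can be checked with scalars. The sketch below models each expert's down projection as a single multiply-add; the function names are hypothetical and the real code operates on grouped tensors:

```python
def grouped_linear(scaled_act, weight, bias):
    """Models the grouped GEMM: bias is added unweighted, after the matmul."""
    return scaled_act * weight + bias


def combine_unweighted(acts, probs, weights, biases):
    """Old behavior: routing prob applied at the activation only."""
    return sum(
        grouped_linear(p * a, w, b)
        for a, p, w, b in zip(acts, probs, weights, biases)
    )


def combine_corrected(acts, probs, weights, biases):
    """Fixed behavior: add (prob - 1) * bias so the net bias term is prob * bias."""
    return sum(
        grouped_linear(p * a, w, b) + (p - 1.0) * b
        for a, p, w, b in zip(acts, probs, weights, biases)
    )


def combine_reference(acts, probs, weights, biases):
    """Reference: weight the whole expert output, bias included, by the routing prob."""
    return sum(
        p * (a * w + b) for a, p, w, b in zip(acts, probs, weights, biases)
    )
```

With k experts sharing the same bias and probabilities summing to 1, the unweighted combine overshoots the reference by (k - 1) * bias, while the corrected combine matches it.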
…o r0.5.0 (#2618) fix(qwen3_5_moe): convert MTP experts as grouped tensors (AM-442) (#2595) The HF checkpoint stores the MTP block's MoE experts in the SAME grouped layout as the main decoder layers (mtp.layers.0.mlp.experts.{gate_up_proj, down_proj}); verified against Qwen/Qwen3.6-35B-A3B's safetensors index. convert_single_tensor_to_hf split the MTP experts into per-expert HF keys (mtp.layers.0.mlp.experts.{id}.down_proj.weight). When the checkpoint loader builds load destinations via to_hf, it requested per-expert keys that don't exist in the grouped checkpoint, raising 'Missing key in checkpoint state_dict: mtp.layers.0.mlp.experts.224.down_proj.weight' on the first EP rank that owns expert 224 (rank 7 with 256 experts / EP=8). Remove the per-expert MTP split so MTP experts fall through to the generic grouped rename+transpose used by the main layers, matching the on-disk keys. The grouped from_hf path already handles loading (incl. EP shard slicing). Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(bagel): distributed setup init (#2608) * fix(bagel): pass distributed setup to VLM builder * fix(bagel): microbatch VAE encode in SFT example --------- Signed-off-by: Zeyu Zhou <zezhou@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Zeyu Zhou <zezhou@nvidia.com>
…AM-454) (2594)` into `r0.5.0` (#2619) fix(transformers): keep gemma3n KV sharing working under FSDP2 (AM-454) (#2594) * fix(transformers): keep gemma3n KV sharing working under FSDP2 (AM-454) HF gemma3n implements cross-layer KV sharing by threading a single mutable `shared_kv_states` dict through every decoder layer as a forward kwarg: the last full-length layer of each attention type writes its K/V into the dict and the later shared layers read it back. Under FSDP2 with MixedPrecisionPolicy(cast_forward_inputs=True), the per-layer fully_shard pre-forward casts inputs via tree_map over (args, kwargs), which reconstructs the dict fresh for each layer. The writer fills its throwaway copy and the reader sees an empty one -> KeyError (e.g. KeyError: 18) on the first forward. Only triggers once fully_shard is active (dp_shard > 1, i.e. multi-GPU). Fix: inject a pytree-opaque, dict-like _SharedKVStates holder (shared by reference across all layers, reset per forward) via a per-layer forward pre-hook. Because it is not a type pytree flattens, tree_map treats it as a leaf and passes the same instance to every layer, so in-place writes are visible to readers. Applied from _apply_runtime_compatibility_fixes alongside the rotary fix; no-op for models without KV sharing. Verified: gemma3n_vl_4b_medpix on 2 GPUs trains cleanly (was KeyError: 18). * test(transformers): cover _apply_runtime_compatibility_fixes KV-sharing wiring * fix(transformers): scope gemma3n KV-sharing holder to gemma3n only The holder was installed for any model with num_kv_shared_layers > 0, which clobbered gemma4's caller-supplied shared_kv_states (the speculative drafter threads the base model's store via composite.py), regressing test_hf_transformer_vlm_gemma4_joint_drafter with KeyError: 'sliding_attention'. 
gemma4 already makes its shared store FSDP2-safe in-model (#2566) and preserves a caller-supplied store via setdefault, so gate our holder on model_type startswith 'gemma3n' and leave other kv-sharing models alone. --------- Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Huiying <willwin.lee@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
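The failure mode and the fix can be reproduced with a toy model. `tree_map` below is a simplified stand-in for a pytree map that rebuilds containers, and `SharedKVHolder` is a hypothetical name for the opaque holder; neither is the actual PyTorch or repository code:

```python
def tree_map(fn, tree):
    """Simplified pytree map: rebuilds dict/list/tuple containers, applies fn to leaves."""
    if isinstance(tree, dict):
        return {key: tree_map(fn, value) for key, value in tree.items()}
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map(fn, value) for value in tree)
    return fn(tree)


class SharedKVHolder:
    """Dict-like store that is NOT a dict subclass, so tree_map treats it as a leaf."""

    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]

    def __contains__(self, key):
        return key in self._store

    def clear(self):
        # Reset between forwards so stale K/V never leaks into the next step.
        self._store.clear()


def reader_sees_write(shared):
    """Writer layer stores K/V under key 18; report whether a later layer can see it."""
    # Each layer's pre-forward "casts" its kwargs via tree_map before running.
    writer_kwargs = tree_map(lambda leaf: leaf, {"shared_kv_states": shared})
    writer_kwargs["shared_kv_states"][18] = "kv"
    reader_kwargs = tree_map(lambda leaf: leaf, {"shared_kv_states": shared})
    return 18 in reader_kwargs["shared_kv_states"]
```

`reader_sees_write({})` returns `False` because each layer receives a fresh dict copy (the `KeyError: 18` case), while `reader_sees_write(SharedKVHolder())` returns `True` because the same instance is passed through to every layer.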
… apt runtime (2614)` into `r0.5.0` (#2629) fix(docker): build DeepEP against the NVSHMEM wheel matching the apt runtime (#2614) * fix(recipe): use hybridep dispatcher for glm_4.5_air_te_deepep internode The default deepep dispatcher (Buffer.internode_dispatch) faults internode with an illegal memory access at DeepEP csrc/kernels/internode.cu:346 (recv counters stuck at -1, then timeout) on the 8-node ep32 GLM-4.5-Air job. The hybridep backend (HybridEPBuffer) works internode -- verified standalone on EOS (test_hybrid_ep.py 2-node: correctness PASS for BF16 and FP8, ~65 GB/s RDMA). Mirrors gpt_oss_120b, ling_1t_sft, qwen3_moe_*_gb200, deepseek_v3_*_gb200 which already set dispatcher: hybridep for the same deepep internode failure. * fix(docker): align glm DeepEP/HybridEP build for internode dispatch Build DeepEP/HybridEP with apt rdma-core v60 (build==runtime libibverbs) and RDMA_CORE_HOME symlinked to the system install, add the libnvshmem_host.so->.so.3 symlink (required for DeepEP fabric handle operations), pin the nvshmem wheel to 3.6.5, set LD_LIBRARY_PATH, and bump DEEPEP_COMMIT to 17cfb817. With dispatcher: hybridep this lets glm_4.5_air_te_deepep complete internode dispatch (8-node, eos) instead of faulting at DeepEP csrc/kernels/internode.cu:346 (recv counters stuck at -1). --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…25)` into `r0.5.0` (#2628) fix(qwen3_moe): keep native forward under PP so CP+THD works (#2625) * fix(qwen3_moe): keep native forward under PP so CP+THD works Qwen3MoeForCausalLM did not declare `_pp_keep_self_forward = True`, so the pipeline builder replaced its forward with the generic HF pipeline forward. That generic forward assumes the HF rotary API (`rotary_emb(hidden_states, position_ids) -> (cos, sin)`), but Qwen3-MoE uses the gpt_oss-style rope (`position_ids_to_freqs_cis` + `apply_rotary_emb_qk` with `cu_seqlens`/`cp_size`/`cp_rank`) and a `freqs_cis` decoder-layer API. Under pp_size>1 with context parallelism + THD this crashed in the first forward: RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 512 but got size 1 (at apply_rotary_emb torch.cat) The model's own forward already handles PP stage routing (embed_tokens/norm/ lm_head are None off the owning stage; hidden states arrive in the input_ids slot via squeeze_input_for_thd) and CP+THD (CP is handled inside TE attention via cp_size/cp_rank). Declaring `_pp_keep_self_forward = True` (matching the sibling native MoE models nemotron_v3 / qwen3_5_moe / ling_v2) makes the PP builder preserve it; the stage wrapper then unwraps the CausalLMOutputWithPast to a tensor. Verified on 16xH100 (cw-dfw): qwen3_moe_30b_te_chat_thd at pp_size=2, cp_size=2, THD trains end-to-end (20/20 steps, exit 0, ~19 GiB/GPU, no OOM, no RoPE crash). Adds a CPU unit test asserting the opt-in flag is declared and recognized by model_keeps_self_forward. * feat(qwen3_moe): enable PP in qwen3_moe_30b_te_chat_thd recipe Now that Qwen3MoeForCausalLM keeps its native forward under PP, enable pipeline parallelism on the recipe that motivated the fix (AM-460): pp_size 1->2 with pp_microbatch_size 1 (must divide local_batch_size=2) and ci.nodes=2 (16xH100). 
pp=2 splits the model across 2 nodes so each shard fits (~19 GiB/GPU), resolving the backward OOM; pp=2 on a single node is memory-neutral vs pp=1 and still OOMs. --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: gitlab-runner <gitlab-runner@gitlab-master.nvidia.com>
beep boop 🤖: Updating transformers to the latest version on PyPI