
Releases: NVIDIA-NeMo/Megatron-Bridge

NVIDIA Megatron-Bridge 0.4.2

28 May 21:18
c810129


Highlights

  • Expanded performance configs for DeepSeek V3, Qwen, GPT-OSS, and WAN
  • Added support for the fp4_param_gather mixed-precision config
  • Enhanced security in dataset and checkpoint deserialization and in URL loading, with safer trust_remote_code handling
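
The deserialization hardening mentioned above commonly takes the form of a restricted unpickler that resolves only allowlisted globals. The following is a minimal sketch of that pattern using Python's standard pickle module. The names SAFE_BUILTINS, RestrictedUnpickler, and restricted_loads and the contents of the allowlist are illustrative assumptions, not the Megatron-Bridge implementation.

```python
import builtins
import io
import pickle

# Hypothetical allowlist: the only globals a payload may reference.
SAFE_BUILTINS = {"dict", "list", "set", "tuple", "frozenset", "range", "complex"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that rejects every global not on the allowlist."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream references a global.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with the allowlist applied."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers and scalars round-trip unchanged, while a payload that references any other callable (for example builtins.eval) raises pickle.UnpicklingError instead of being resolved.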

Performance

  • NVFP4 with 4-bit parameter AllGather in DP communications (PR#3364, PR#4005)
  • DSV3 B300 recipe tuning (PR#3549)
  • DSV3 B200 recipe tuning (PR#3368)
  • Qwen3 235B A22B B300 recipe tuning (PR#3490)
  • NT3 super B300 recipe tuning (PR#3579)
  • GPT-OSS B200 regression fix (PR#3614)

Software Component

Known issues

  • There is a known issue with Evaluator when installing nvidia-vlmeval inside /opt/NeMo-FW. Please use the /opt/Megatron-Bridge directory to install the package:

      cd /opt/Megatron-Bridge
      uv pip install nvidia-vlmeval

Changelog Details
  • beep boop 🤖: Bumping megatron.bridge to v0.4.1 by @nemo-automation-bot[bot] :: PR: #3363
  • cp: [perf] fix: guard cuda_graph_scope validation against None (3249) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3262
  • cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0 by @yaoyu-33 :: PR: #3305
  • cp: Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (3179) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3324
  • cp: Perf script utility to lock gpu frequency. (2977) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3326
  • cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (3331) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3332
  • cp: fix(perf): read baseline values from golden values when using new format (3334) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3338
  • [docs] chore: bump versions1.json to 0.4.0 (latest) by @ko3n1g :: PR: #3376
  • b200 DSv3 better cfg (#3368), mxfp8 to fp8_cs for h100 gpt-oss #3378 by @malay-nagda :: PR: #3420
  • 2604 perf summary (#3377) by @malay-nagda :: PR: #3405
  • cp: docs(releases): add 26.04 software component versions (3421) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3430
  • cp: b200 DSv3 better cfg (3368) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3401
  • cp: [training] fix: report memory on 2nd iteration to better reflect actual peak (3169) into r0.4.0 by @dingqingy-nv :: PR: #3367
  • cp: Update Qwen3-VL pretrain perf configs for 30B and 235B (3327) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3342
  • cp: docs: Add container version to docs version picker (3434) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3435
  • cp: [docs] Add Megatron Bridge 0.4.0 release notes (#3419) by @chtruong814 :: PR: #3439
  • cp: fix(test): clone mmap-backed tensors before overwriting safetensors file (#3335) by @yaoyu-33 :: PR: #3441
  • cp: [test] refactor: move diffusion tests to test_groups directory (3275) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3442
  • remove archival data from main page by @malay-nagda :: PR: #3448
  • cp: fix: set 644 permissions on COPY'd files to match cloned repos (3431) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3450
  • cp: [perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled (3350) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3453
  • cp: [training] feat: enable fp4_param_gather in MixedPrecisionConfig (3364) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3454
  • cp: fix(docker): replace rdma-core source build with system package install (3429) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3457
  • cp: [training] fix: record CUDA memory history before snapshot so dumps are non-empty (#3487) into r0.4.0 by @dingqingy-nv :: PR: #3508
  • cp: [vulnops][misc] fix: Add allowlist validation for _target_ instantiation (3142) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3540
  • cp: [vulnops][data] fix: Replace unsafe pickle.loads with restricted unpickler in Qwen VL pipeline (3139) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3541
  • cp: [vulnops][ckpt] fix: Use weights_only=True in ModelOpt checkpoint loading (3138) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3542
  • cp: [vulnops][ckpt] fix: Use weights_only=True in TrainState checkpoint loading (3506) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3557
  • cp: [vulnops][data] fix: Replace unsafe pickle.load with restricted unpickler for index metadata (3140) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3558
  • cp: [vulnops] fix: _contains_code_references allowlist bypass leads to RCE (3379) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3559
  • fix: Add security warning for trust_remote_code and remove hardcoded True by @chtruong814 :: PR: #3539
  • cp: Cleanup TE cuda graphs with the right api (3459) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3476
  • cp: Update DeepSeek-V3 configs for B300 (3549) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3565
  • cp: log repo status manual (3570) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3572
  • cp: ci: post merge comment with SHA after successful CI run (3567) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3573
  • cp: [perf] update: switch GPT-OSS GB200 V2 dispatcher default to alltoall (3561) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3577
  • cp: no fp4 param gather (3578) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3580
  • cp: fix(evaluate): skip non-dict golden value entries such as job_id (3581) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3582
  • cp: [vulnops][data] fix: Validate URLs in VLM video loader to prevent SSRF (3482) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3588
  • fix(docker): suppress lightning from uv resolution in fw_pyproject by @ko3n1g :: PR: #3602
  • cp: [vulnops][data] fix: Remove unnecessary allow_pickle=True and add security warnings (3141) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3615
  • cp: [vulnops][data] fix: Replace allow_pickle=True with restricted unpickler in packed dataset loading (3616) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3629
  • cp: add VP for LoRA Lm3 70B (3547) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3596
  • cp: num_layers_fix- qwen vl 235b_a22b on B200 (3589) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3603
  • cp: fix(docker): resolve lightning not found on PyPI by providing local stub (3604) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3606
  • cp: 70b_lora_gb200_bf16_fix (3623) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3627
  • cp: [vulnops] fix: Add SSRF protection to image-loading utilities (3630) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3632
  • chore(beep boop 🤖): Bump uv.lock (r0.4.0, mcore-core_r0.17.0) (2026-04-30) by @svcnvidia-nemo-ci :: PR: #3591
  • cp: [vulnops] fix: Add SSRF protection to audio URL loading (3633) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3636
  • cp: fix(perf): keep PCT binding for deepseek_v3 large_scale on b300 (3656) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3657
  • fix: apply vllm PR 36192 patch and bump pillow to 12.20 by @ko3n1g :: PR: #3671
  • cp: Add previously removed NemotronHBridge SequentialMLP mappings (3628) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3701
  • Use HybridEP flex dispatcher for Qwen3 235B B300 perf configs (#3490) by @rhmukundan :: PR: #3675
  • [build] chore: bump package version to 0.4.2 by @ko3n1g :: PR: #3721
  • [model, ckpt, docs] fix: support HF→Megatron conversion under decentralized PGs (r0.4.0) by @cuichenx :: PR: #3674
  • cp: Fix Gemma3 example folder (3724) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3728
  • cp: Reorganize ModelOpt docs (3715) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3751
  • [model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import (r0.4.0) by @cuichenx :: PR: #3753
  • perf(qwen3-next): set expandable_segments on GB300 BF16/FP8_MX to fix OOM by @ko3n1g :: PR: #3767
  • cp: llama31 405b gb200 nvfp4 no pg overlap (3713) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3773
  • cp: [perf] update: switch GPT-OSS B200 V2 dispatcher default to alltoall (3614) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3682
  • nt3 super nvfp4; lm3.1 405B nvfp4; lm3 70B mxfp8- expandable_segments by @malay-nagda :: PR: #3780
  • cp: [config] Update micro_batch_size to 2 for gemma3 recipe (3815) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3828
  • chore: Bump TE to latest 2.14 and MCore to latest 0.17.0 by @chtruong814 :: PR: #3806
  • qwen3 next env var fix by @malay-nagda :: PR: #3845
  • chore: Bump and remove packages to address CVEs (#3841) by @chtruong814 :: PR: #3855
  • Bump MCore to 2edffa by @chtruong814 :: PR: #3857
  • cp: chore: Bump deps to address CVEs (3919) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3925
  • cp: 2604_patch_perf_summary (3818) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3861
  • cp: 26.04.01_perf_summary (3997) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3998
  • cp: docs: note 26.04 drops PyAV by default and document runtime install (4020) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #4021
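
Several of the [vulnops] entries above add SSRF protection to image, audio, and video URL loading. A minimal sketch of the general idea, using only the standard library, is to accept only http(s) URLs and reject hosts that are loopback, private, or link-local IP literals. The function name is_url_allowed and the ALLOWED_SCHEMES set are hypothetical; this is not the code from those PRs, and the sketch deliberately skips DNS resolution, which a production check would also need.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption for illustration: only plain web schemes are acceptable.
ALLOWED_SCHEMES = {"http", "https"}


def is_url_allowed(url: str) -> bool:
    """Return True if the URL looks safe to fetch from a data loader."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real implementation would resolve the name
        # and re-check every returned address; this sketch does not.
        return True
    # is_global is False for loopback, private, and link-local ranges.
    return ip.is_global
```

With this check, https://example.com/a.png is accepted, while http://127.0.0.1/x, http://169.254.169.254/latest/meta-data, and file:///etc/passwd are rejected.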

26.04-alpha.rc2

07 May 07:08
fd5c473


[MXFP8 param gather] Update param buffer before copy to model weights …

NVIDIA Megatron-Bridge 0.4.1

06 May 21:49
f9b6319


26.04-alpha.rc1

23 Apr 09:32


Merge branch 'PR2411' into 26.04-alpha

NVIDIA Megatron-Bridge 0.4.0

16 Apr 22:46
0fbfe7d


Highlights

Model Collection Support

  • MiniMax M2 / M2.5 support (PR#2602)
  • Kimi 2.5 support, including GB300 MXFP8 recipe and HF config updates (PR#2743)
  • Nemotron 3 Super model support (PR#2912)
  • Sarvam support (PR#1814)
  • Qwen 3.5 VL Bridge with recipes and LoRA bridge / merge support (PR#2530, PR#2654, PR#2736)
  • Qwen 2.5 Omni support (PR#2634)
  • Qwen2-Audio support (PR#2324)
  • Xiaomi MiMo dense MTP model bridge support (PR#2387, by HollowMan6)

Diffusion Collection

Training & Functionality

  • Parquet support for sequence-packing preprocessing, improving handling of larger datasets (PR#2395)
  • Energon integration for sequence packing with WebDataset workflows (PR#2440)
  • Default packed sequences across finetune recipes (PR#2284)
  • More modern finetuning datasets, including OpenMathInstruct V2 and GSM8K (PR#2264)
  • Unified dataset configuration in run_recipe.py (PR#2826)
  • NCCL flight recorder configuration support (PR#2891)
  • Comet ML experiment tracking integration (PR#2910)
  • Refactored SFT and PEFT recipes for VLM workflows (PR#2614)
  • Added the on_checkpoint_save callback event for training workflows (PR#2905)
  • Added MoE LoRA rank normalization for expert layers (PR#3006)
  • Direct export of block-wise FP8 weights and scaling factors (PR#1994)
  • Accelerated first-fit packing with a segment tree for much faster packing on large datasets (PR#2953)
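
To illustrate why a segment tree speeds up first-fit packing, here is a minimal sketch. First-fit places each sequence into the leftmost bin with enough remaining capacity; a max segment tree over the bins finds that bin in O(log n) instead of scanning all bins. The function name first_fit_pack and its signature are assumptions for illustration and are not taken from PR#2953.

```python
def first_fit_pack(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack sequence indices into bins of `capacity` with first-fit."""
    n = len(lengths)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    # tree[node] is the largest remaining capacity in that subtree;
    # leaves size..2*size-1 correspond to bins 0..size-1.
    tree = [capacity] * (2 * size)
    bins: list[list[int]] = [[] for _ in range(size)]

    for idx, length in enumerate(lengths):
        if length > capacity:
            raise ValueError(f"sequence {idx} exceeds bin capacity")
        # Descend to the leftmost leaf whose remaining capacity fits.
        node = 1
        while node < size:
            node = 2 * node if tree[2 * node] >= length else 2 * node + 1
        bins[node - size].append(idx)
        # Update the leaf, then recompute maxima on the path to the root.
        tree[node] -= length
        node //= 2
        while node >= 1:
            tree[node] = max(tree[2 * node], tree[2 * node + 1])
            node //= 2

    return [b for b in bins if b]
```

For example, first_fit_pack([4, 8, 1, 4, 2, 1], 10) returns [[0, 2, 3, 5], [1, 4]]: both bins are filled to exactly 10 tokens.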

Model Optimization

  • Pruning support and documentation (PR#2244)
  • Post-training quantization support for Nano, Super, and Ultra model families (PR#2303)
  • Distillation quantization support in NeMo 2 (PR#2591)

Performance

  • Nemotron 3 Super perf config, including GB200 improvements and BF16 / NVFP4 functional support via module recompute (PR#3208)

Developer Experience & Compatibility

  • ModelConfig and ModelBuilder refactor integrated into the training loop (PR#2798, PR#2671)
  • Dev branch support and documentation updates (PR#2497)
  • Python 3.12 migration announcement (PR#2773)
  • Transformers 5.0 through 5.3 compatibility (PR#2068, PR#2781)
  • PEFT Bridge offline mode support (PR#2574)
  • LoRA merge on CPU (PR#2194)
  • Self-contained Megatron-to-HF export with auto-config synthesis (PR#2778)
  • Scripts and documentation for Megatron-LM and Megatron Bridge correlation
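
As background for the LoRA merge item above, merging folds the low-rank adapter into the base weight using the standard LoRA update W' = W + (alpha / r) · B · A. The sketch below uses plain Python lists so it runs on CPU without any dependencies. The helper names matmul and merge_lora are hypothetical and this is not the Megatron-Bridge merge code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

With a 2x2 identity weight, lora_a = [[1.0, 2.0]], lora_b = [[1.0], [0.5]], and alpha = 2, the merged weight is [[3.0, 4.0], [1.0, 3.0]].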

Examples & Tutorials

Community Contributions

  • @HollowMan6 (Aalto University): Xiaomi MiMo dense MTP bridge support, Qwen 3.5 VL LoRA bridge and merge, and additional export / PEFT fixes (PR#2387, PR#2736, PR#2384, PR#2799)
  • @shaltielshmid: packed-sequence improvements for large datasets and safer model loading defaults (PR#2395, PR#2766)
  • @jaeminh: accelerated first-fit packing with a segment tree (PR#2953)
  • @pavelgein: added the on_checkpoint_save callback event (PR#2905)
  • @ShiftyBlock (UC Berkeley): added auto-config for self-contained Megatron-to-HF export (PR#2778)
  • @erictang000 (Anyscale): added LoRA rank normalization for MoE expert layers (PR#3006)
  • @eternally-z: added direct export support for block-wise FP8 weights and scaling factors (PR#1994)
  • @Hayak3: fixed the unsupported normalization argument for Qwen3-VL (PR#1970)
  • @mohit-sarvam (Sarvam AI): added Sarvam MoE support (PR#1814)

A big thank you to our community contributors for their valuable support!

Changelog Details
  • docs: Update callback code snippets to include all imports needed for example by @ananthsub :: PR: #2283
  • M4 leftover for QWen3-VL with MCore vision encoder by @shifangx :: PR: #2370
  • Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs by @rhmukundan :: PR: #2669
  • [bridge] Fix off-by-one in sliding window size for Gemma2, Gemma3, Mistral, and GPT-OSS by @cuichenx :: PR: #2656
  • fix: Write intermediate results to tmp by @ko3n1g :: PR: #2726
  • Perf recipe dataloader num_workers interface fix by @dingqingy-nv :: PR: #2710
  • Suppress noisy _extra_state warnings during checkpoint loading by @cuichenx :: PR: #2689
  • [model, recipe] Add Qwen 3.5 recipes by @cuichenx :: PR: #2654
  • [ci] chore: add nightly dev commit bump workflow by @ko3n1g :: PR: #2729
  • ci(fix): Unique naming for dev branch by @ko3n1g :: PR: #2747
  • [ci] Refactor Gemma3-VL launch script to run finetune and packed tests separately by @cuichenx :: PR: #2730
  • add qwen2_5_omni by @yuekaizhang :: PR: #2634
  • build: Bump TE 2.13 by @ko3n1g :: PR: #2753
  • [docs, ci] chore: add governance issue forms and triage guide by @yaoyu-33 :: PR: #2716
  • [test] fix: temporarily disable qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2759
  • add nemotron3 super docs by @liding-nv :: PR: #2757
  • ci: Fix stopiteration for Mbridge by @ko3n1g :: PR: #2760
  • GPT-OSS Blackwell MXFP8 recipes by @weijiac0619 :: PR: #2633
  • feat(mimo): phase 2 - model provider, DDP wrapping, process groups by @aroshanghias-nvd :: PR: #2004
  • [build] feat: add OSS NeMo FW dockerfiles by @thomasdhc :: PR: #2722
  • Lm3 70B GB200 FP8_CS SFT cfg update by @malay-nagda :: PR: #2748
  • [docs] chore: use uv run in test file docstring run instructions by @cuichenx :: PR: #2728
  • build: Bump NVRX by @ko3n1g :: PR: #2775
  • NVFP4 memory spike fix compared to M-LM by @sanandaraj5597 :: PR: #2764
  • [doc] feat: Document adapter merge verification in stream_adapter_weights example by @yaoyu-33 :: PR: #2042
  • [doc] chore: Add needs-review to PR state labels guidance by @yaoyu-33 :: PR: #2758
  • [ckpt] fix: broaden exception handling in save_artifacts dynamic module loading by @yaoyu-33 :: PR: #2765
  • [test] fix: use toy configs in qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2761
  • [model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts by @kamran-nvidia :: PR: #2735
  • docs - Update user manual with new MoE features and Megatron FSDP by @onel :: PR: #2529
  • remove encoder_and_decoder usage by @dimapihtar :: PR: #2512
  • Fix attention_mask mismatch in compare.py by @mohsinm-dev :: PR: #2476
  • [model, test] fix: guard hybrid layer count across MCore branches by @yaoyu-33 :: PR: #2776
  • [data] fix: guard eval_interval division to prevent ZeroDivisionError by @yaoyu-33 :: PR: #2732
  • [sync][training] fix: log loss values of exactly 0.0 in training_log() by @mehraakash :: PR: #2740
  • [model] feat: support Qwen 3.5 MTP c...

NVIDIA Megatron-Bridge 0.3.1

20 Mar 22:35


Changelog Details

Performance & Model Configs

  • CP SFT performance improvements (#2527)
  • Nemotron 3 Nano perf config updates (#2560, #2681)
  • Onboard LLaMA3 70B LoRA to B300 and B200 chips (#2588)
  • Update Qwen3 235B B300 configs to match B200 configs (#2706, #2720)
  • Update DeepSeek-V3 B300 config (#2723)
  • DeepSeek-V3: set no_non_det_algo for deterministic training (#2673)
  • Add MoE Sequential MLP mappings in HF Bridges (#2589)

Bug Fixes

  • [training] Cap lr_warmup_steps to be strictly less than lr_decay_steps (#2858)
  • [training] Fix DistillationProvider.to_cfg_dict to save missing keys in run_config (#2594)
  • [training] Fix StopIteration error in MBridge (#2762)
  • [checkpoint] Fix local checkpoint integration (#2709)
  • [checkpoint] Log warning when HuggingFace Hub download fails silently (#2493)
  • [checkpoint] Low-memory save: use AutoBridge directly in distill_llama32_3b-1b to load HF weights (#2860)
  • [inference] Use config.hidden_size directly for Qwen3VL inference wrapper (#2855)
  • [misc] Improve compare.py robustness for multi-GPU and vocab-padded models (#2647)
  • [misc] Fix BOS token mismatch in compare_text_generation (#2889)
  • [misc] Guard eod_id access in compare_text_generation for HF tokenizers (#2853)
  • [misc] Guard missing kubernetes deps (#2871)
  • [example] Fix example scripts and recipe names in release branch (#2862, #2863)

Documentation

  • Add ModelOpt pruning docs (#2629)

NVIDIA Megatron-Bridge 0.3.0

26 Feb 03:51
21b02e0


Highlights

  • Model Collection Support
  • Performance
    • NVFP4 support for Llama3 models
    • HybridEP support for NVL8 systems (PR#494)
    • MLA performance improvement with cuDNN LayerNorm and cuDNN 9.18
    • LN+MXFP8 quantization fusion with TE.sequence and cuDNN backend
    • FSDP support for MoE models with MXFP8 (PR#2135, PR#2239)
    • Muon optimizer support (PR#683)
    • NVFP4 Llama Playbook (PR#1409)
  • Training & Functionality
    • LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
    • Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
    • Decentralized parallel group (M4) end to end support and examples (PR#2011, examples)
    • Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
    • Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
    • Callbacks integration (PR#2063)
    • Low memory save for model importing from HF (fix Deepseek V3 and Kimi-K2 import) (PR#1949)
  • Community Contributions
Changelog Details

NVIDIA Megatron-Bridge 0.2.2

09 Jan 18:14
0465189


NVIDIA Megatron-Bridge 0.2.1

18 Dec 00:04
v0.2.1
1c43b39


  • Performance
    • Activation offloading to host memory support with pipelining
      • Supports the high activation memory needs of MoE models training with dynamic shapes
      • Fixed Nemotron FLOPS calculation model
  • Model Collection Support
    • Ministral 3
  • Enhanced LoRA support
    • LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)

NVIDIA Megatron-Bridge 0.2.0

04 Dec 23:56
v0.2.0
7af9601
