chore: nightly sync main into dev (18_06_2026) by svcnvidia-nemo-ci · Pull Request #5402 · NVIDIA/Megatron-LM

svcnvidia-nemo-ci · 2026-06-18T17:38:52Z

Nightly sync: main → dev (34 commits, 18_06_2026)

Automated nightly sync of main into dev, started from origin/dev with
git merge origin/main --no-edit and resolved surgically to preserve dev-only
features (enforced by the pre-push dev-feature-preservation guard).

Python lines: +4967 / -62 across 36 files

What landed

Main's new, self-contained features were synced cleanly:

Offline logits-based knowledge distillation (megatron/training/distillation/*, Offline Logits-Based Knowledge Distillation #5019)
RL profiling (megatron/rl/rl_profiling.py, Profiling #3110)
Minimal DBuffer + FSDP experimental layout/placement (.../megatron_fsdp/experimental/*, Add minimal DBuffer implementation #4835)
MIMO threading through the training loop + MimoModel.zero_grad_buffer (Thread MIMO support through the stock training loop (schedule + optimizer) #5333, Add MimoModel.zero_grad_buffer delegating to active DDP submodules #5372) and DDP pg_collection threading (Thread pg_collection through wrap_model_chunks_with_ddp #5328), with their new unit tests
Numerous non-conflicting changes across the tree

Conflicts resolved (combined dev + main)

16 files had textual conflicts, resolved by combining both sides:

finalize_model_grads.py — kept both the expert_bias is not None (dev) and frozen_expert_bias (main) guards
rope_utils.py — kept dev's CUDA-graph-compatible THD RoPE (already incorporates the CP packed-freqs fix) + main's apply_rotary_pos_emb default
gpt_model.py (import) / moe/router.py (init) — combined both symbols/attrs
moe/experts.py — kept dev's _unsupported(...) refactor (consistent with the whole function)
fine_grained_activation_offload.py — kept dev's debug msg + added main's _can_manage_tensor_for_offload/_te_do_not_offload guards
transformer_config.py — combined dev's offload asserts with main's fused_group_mlp validation
checkpointing.py — kept main's async-logits scheduling + dev's formatting
theoretical_memory_usage.py — kept main's LatentMoE routed_expert_hidden_size + dev's formatting
arguments.py — combined imports (restored dev's dataclasses/F/PkgVersion), kept dev's args + added main's --rl-profile, --rl-profile-dir, --freeze-all-layers, --override-ckpt-iteration, --logits-* args
pretrain_gpt.py / pretrain_hybrid.py — reconciled get_batch against the merged helper signatures (dev's mtp_on_this_rank backward-compat, dev's dynamic_context_parallel rename, main's _build_cached_logits_loss_func)
dependency triple (pyproject.toml/uv.lock/docker/Dockerfile.ci.dev) and .github/CODEOWNERS kept at dev's versions (verified identical to origin/dev); no new git sources in main to reconcile
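
As an illustration of the "combine both sides" resolutions listed above, here is a minimal sketch of what keeping both guards from the finalize_model_grads.py conflict might look like. It is not the repository's code: the helper name `should_update_expert_bias`, the plain namespace objects standing in for routers, and the assumption that `frozen_expert_bias` is a boolean flag are all hypothetical.

```python
from types import SimpleNamespace


def should_update_expert_bias(router) -> bool:
    """Hypothetical helper: True only if both guards allow the update."""
    # dev-side guard: skip routers that carry no expert bias at all
    if getattr(router, "expert_bias", None) is None:
        return False
    # main-side guard: skip routers whose bias is frozen (assumed boolean flag)
    if getattr(router, "frozen_expert_bias", False):
        return False
    return True


# Three stand-in routers: no bias, frozen bias, and a normal trainable bias.
routers = [
    SimpleNamespace(expert_bias=None, frozen_expert_bias=False),
    SimpleNamespace(expert_bias=[0.0], frozen_expert_bias=True),
    SimpleNamespace(expert_bias=[0.0], frozen_expert_bias=False),
]
print([should_update_expert_bias(r) for r in routers])
```

Only the last router passes both checks, which is the point of combining the two sides rather than picking one.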

Deferred to a future sync (dev-feature-preservation guard)

Where main's modifications to existing dev files would have dropped dev-unique
lines (the guard's hard-abort condition), dev's version was kept and main's change
deferred. These are documented main commits whose changes touched code dev had
diverged on; they will re-sync once the competing work reconciles:

Inference prefill-scheduler rewrite / cudagraph admission gating (Inference: Cudagraph-aware admission gating in prefill scheduler #4870) — dynamic_engine.py kept at dev; main-only test_cg_admission_gating.py removed accordingly
Absorbed-MLA projection refactor separate→combined ([split 3/4] Refactor absorbed MLA projection handling #5245 split 3/5) — absorbed_mla.py + test kept at dev (external combined-spec interface is unchanged, so callers are unaffected)
LatentMoE theoretical-memory tweak (Fix LatentMoE theoretical memory estimate #5145), fused-group-mlp offload ([feat] Support fine-grained activation offloading in fused group mlp #5082) interplay, and assorted inference/context/server changes where dev had local edits

The guard passes (0 dropped dev-only lines) and all changed files parse.
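
The guard's implementation is not shown in this PR, so the following is only a sketch of the idea behind "0 dropped dev-only lines", under stated assumptions: a dev-only line is taken to be a non-blank line that appears in dev's file but not in main's, and it counts as dropped if the merged file no longer contains it. The function name `dropped_dev_only_lines` and the whitespace-insensitive comparison are hypothetical choices.

```python
from collections import Counter
from typing import List


def dropped_dev_only_lines(dev: str, main: str, merged: str) -> List[str]:
    """Return dev-only lines (absent from main) that are missing from merged."""
    main_lines = {line.strip() for line in main.splitlines()}
    merged_counts = Counter(line.strip() for line in merged.splitlines())
    dropped = []
    for line in dev.splitlines():
        key = line.strip()
        if not key or key in main_lines:
            continue  # blank, or shared with main: not a dev-only line
        if merged_counts[key] > 0:
            merged_counts[key] -= 1  # consume one matching line in merged
        else:
            dropped.append(key)
    return dropped


dev = "a = 1\ndev_feature()\n"
main = "a = 1\nmain_feature()\n"
combined = "a = 1\ndev_feature()\nmain_feature()\n"  # both sides kept

print(dropped_dev_only_lines(dev, main, combined))  # merge that keeps dev's line
print(dropped_dev_only_lines(dev, main, main))      # merge that takes main wholesale
```

Under this definition, a resolution that takes main's version of a diverged file trips the check, which matches the deferrals described above.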

🤖 Generated with Claude Code


svcnvidia-nemo-ci · 2026-06-18T20:27:35Z

/ok to test 52bfe0a

Phase-3 CI fix. The merge deferred several main source changes to preserve dev-only code (pre-push guard). These unit tests assert main's deferred behavior, so revert them to dev to match the dev-equivalent source:

- test_fine_grained_activation_offloading.py (main's _can_manage_tensor_for_offload guard)
- test_multi_latent_attention.py + test_optimizer.py (#5310 fused MLA QKV down-proj)
- test_weight_and_optimizer_memory.py (#5145 LatentMoE memory)
- test_hybrid_moe_model.py (#3956 moe_grad_scale_func)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: svcnvidia-nemo-ci <svcnvidia-nemo-ci@nvidia.com>

svcnvidia-nemo-ci · 2026-06-18T21:25:46Z

/ok to test a21f083

svcnvidia-nemo-ci · 2026-06-18T23:30:49Z

✅ Ready for human review — CI summary

166 / 167 non-exempt required checks are green, including the Nemo_CICD_Test
aggregate gate (all unit-test buckets + the full functional/integration golden-value
suite + linting + copyright + wheel builds).
GitHub Actions run: https://github.com/NVIDIA/Megatron-LM/actions/runs/27790303479

The only non-green non-exempt check is cicd-mbridge-testing, which is a
verified pre-existing / external failure (NOT sync-caused) — see evidence below.

Pre-existing failure: `cicd-mbridge-testing`

The MBridge job checks out NVIDIA-NeMo/Megatron-Bridge@main and runs its unit suite
against this mcore commit. The 4 failing tests are all Megatron-Bridge Gemma4 tests:

models/gemma/test_gemma4_modeling.py::...injects_layer_inputs_and_restores_state
models/gemma/test_gemma4_modeling.py::...wraps_checkpointed_forward
models/gemma/test_gemma4_provider.py::...threads_per_layer_inputs_to_each_layer
→ AttributeError: module 'megatron.core.transformer.transformer_block' has no attribute 'checkpointed_forward'
models/gemma_vl/test_gemma4_vl_modeling.py::...scatters_sequence_parallel_decoder_input
→ TypeError: fake_scatter() got an unexpected keyword argument 'group'

Why this is not caused by the sync (empirical evidence):

The module named in the AttributeError is unchanged by this PR: git diff origin/dev HEAD -- megatron/core/transformer/transformer_block.py is empty (merged transformer_block.py is byte-identical to origin/dev).
transformer_block.checkpointed_forward was removed by a dev-side refactor: from megatron.core.recompute import checkpointed_forward exists in the merge-base and origin/main, but origin/dev refactored it away (now uses a _checkpointed_forward method). So the symbol Megatron-Bridge patches is absent because dev removed it, independent of this merge.
The same Launch_Unit_Tests_Core job fails identically for a different concurrent mcore commit: NVIDIA-NeMo/Megatron-Bridge run 27791666020 (branch mcore-testing-27791614845).
The previous nightly sync (chore: nightly sync main into dev (12_06_2026) #5314) passed cicd-mbridge-testing, i.e. before this Megatron-Bridge ↔ mcore API skew appeared.
It is not fixable from mcore: the scatter(group=...) mismatch lives in Megatron-Bridge's Gemma4-VL test mock, and re-adding checkpointed_forward would partially revert dev's intentional refactor (and dev itself is incompatible). The fix belongs in Megatron-Bridge (align Gemma4 tests with dev's _checkpointed_forward API) or a coordinated mcore release.
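
The two error classes quoted above can be reproduced in isolation. The sketch below uses stand-ins only: `transformer_block` is an empty module object created for the example (not the real mcore module), `fake_scatter` is a hypothetical test double with an older signature, and the `group="tp"` value is made up. It shows that both failures come from a name or signature mismatch between caller and callee rather than from merge content.

```python
import types

# Stand-in for a module that no longer exposes `checkpointed_forward`
# at module level (an empty module object, not the real mcore module).
transformer_block = types.ModuleType("transformer_block")


def fake_scatter(tensor):
    """Hypothetical test double written against a signature without `group`."""
    return tensor


def lookup_error() -> str:
    """Access the removed module attribute and return the error text."""
    try:
        transformer_block.checkpointed_forward
    except AttributeError as exc:
        return str(exc)
    return ""


def call_error() -> str:
    """Call the double the way a newer caller would, passing `group`."""
    try:
        fake_scatter([1, 2, 3], group="tp")
    except TypeError as exc:
        return str(exc)
    return ""


print(lookup_error())
print(call_error())
```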

Merge notes

34 commits synced from main (Python: +4967 / −62 across 36 files at merge time).
Protected files kept at dev's version (verified identical): .github/CODEOWNERS, pyproject.toml, uv.lock, docker/Dockerfile.ci.dev. No new [tool.uv.sources] git sources in main required reconciliation.
Conflicts (16) resolved by combining dev + main, e.g. finalize_model_grads.py (both expert-bias guards), rope_utils.py (dev's CUDA-graph THD RoPE + main's apply_rotary_pos_emb default), transformer_config.py (dev offload asserts + main fused_group_mlp validation), checkpointing.py (main async-logits scheduling + dev formatting), arguments.py (restored dev imports + added main's --rl-profile/--freeze-all-layers/--override-ckpt-iteration/--logits-* args).
Dev-feature-preservation guard: to keep the pre-push guard green (0 dropped dev-only lines), several main modifications to existing dev files were deferred in favor of dev's versions where main's change would have dropped dev-unique lines (e.g. dev's absorbed-MLA separate-K/V incl. SP-assert / dynamic-CP, dev's inference engine). Main's new, self-contained features landed cleanly: offline logits distillation, RL profiling, DBuffer/FSDP-experimental, MIMO + DDP pg_collection threading (with their unit tests).
Phase-3 CI fixes (one rolling fix commit): unit tests asserting the deferred main behavior were aligned to dev (test_fine_grained_activation_offloading, test_multi_latent_attention, test_optimizer, test_weight_and_optimizer_memory); test_train_step_schedule_plumbing's mock was extended to tolerate dev's extra train_step args. test_hybrid_moe_model kept main's version (merged config legitimately has moe_grad_scale_func).
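
The mock change for test_train_step_schedule_plumbing is not shown in this comment. As a sketch of the general technique only, a test double can be made tolerant of extra arguments by accepting `*args` and `**kwargs`. The function names, the three leading parameters, and `extra_dev_arg` below are all hypothetical and are not taken from the actual test or from train_step's real signature.

```python
def strict_fake_train_step(forward_step_func, data_iterator, model):
    """Hypothetical test double with a fixed signature."""
    return "stepped"


def tolerant_fake_train_step(forward_step_func, data_iterator, model, *args, **kwargs):
    """Same double, but extra positional/keyword arguments are accepted and ignored."""
    return "stepped"


def call_with_extra_arg(fake):
    """Call the double as a caller that passes one extra (made-up) argument would."""
    try:
        return fake(None, None, None, extra_dev_arg=True)
    except TypeError:
        return "TypeError"


print(call_with_extra_arg(strict_fake_train_step))
print(call_with_extra_arg(tolerant_fake_train_step))
```

The trade-off is that a tolerant double no longer catches accidental signature drift, so it suits plumbing tests that only care whether the call happened.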

🤖 Generated with Claude Code

svcnvidia-nemo-ci · 2026-06-22T16:13:31Z

Superseded by today's nightly sync.

tdene and others added 30 commits June 12, 2026 16:35

Allow for pre-bound socket to be passed in server (#5301)

df9141e

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

Offline Logits-Based Knowledge Distillation (#5019)

277c4f8

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

Handle None values in sampling parameters (#5300)

1f537e8

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>

Add moe loss normalization for RL SFT (#3956)

c0c1f91

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Add code owners for optimizer-related files (#5297)

18a2f55

Signed-off-by: janEbert <janpabloe@nvidia.com> Signed-off-by: Philip Petrakian <ppetrakian@nvidia.com> Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>

Fix EP=1 inference by allocating buffers anyway (#5233)

b45ae73

Signed-off-by: Helen Ngo <helenn@nvidia.com>

Fix crash due to tool call at sequence length (#5302)

806022f

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>

Inference: Cudagraph-aware admission gating in prefill scheduler (#4870)

ef549a6

Signed-off-by: Helen Ngo <helenn@nvidia.com>

Account for reasoning token stripping (#5313)

0022550

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

Thread pg_collection through wrap_model_chunks_with_ddp (#5328)

eb1c677

Signed-off-by: ykarnati <ykarnati@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(beep boop 🤖): Bump (main) (2026-06-15)

59bb1c1

Fix LatentMoE theoretical memory estimate (#5145)

1bcb3b9

Signed-off-by: Shijie Wang <jaywan@nvidia.com>

Add zstandard package to Docker LTS requirements. Fix nightly failures (#5347)

133cf60

Signed-off-by: Ajay Balasa <abalasa@nvidia.com>

Thread MIMO support through the stock training loop (schedule + optimizer) (#5333)

addc601

Signed-off-by: ykarnati <ykarnati@nvidia.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

ci: default functional test time limit to 4h for release/weekly scopes (#5360)

4165673

Signed-off-by: oliver könig <okoenig@nvidia.com>

Fix memory leak with log_max_attention_logit (#4699) (#5067)

72171c0

Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>

Clean up pretrain_gpt.py and pretrain_hybrid.py formatting and remove module globals (#5351)

b60de39

Signed-off-by: ilml <tolong@nvidia.com>

Add full model cuda graph support for MTP inference (#4950)

1cfa834

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Expand the Mamba prefix caching memory safety check to include scratch space buffers (#5348)

a83f408

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Make Megatron RL only materialize last token logit (#4551)

b00cad1

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

Profiling (#3110)

a12484b

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> Co-authored-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

Support fused MLA QKV checkpoint reload (#5310)

2b90b3f

Signed-off-by: sraman <sraman@nvidia.com>

Add minimal DBuffer implementation (#4835)

000dc1c

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

[split 1/5] Fix packed THD RoPE under CP (#5243)

d30e165

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Update copy-pr-bot.yaml [skip ci]

49737fd

Document agent PR commit sign-off and signing (#5381)

2e1183a

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Remove unused distributed pytest markers (#5380)

2463dbe

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

[feat] Support fine-grained activation offloading in fused group mlp (#5082)

5c660c3

Signed-off-by: hongbinl <hongbinl@nvidia.com>

Thread tensor-parallel group into the RADIO patch embedder (#5371)

bd381ac

Signed-off-by: ykarnati <ykarnati@nvidia.com>

Add MimoModel.zero_grad_buffer delegating to active DDP submodules (#5372)

a00c0de

Signed-off-by: ykarnati <ykarnati@nvidia.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

svcnvidia-nemo-ci force-pushed the main2dev/18_06_2026 branch from 0362e81 to 52bfe0a Compare June 18, 2026 20:27

svcnvidia-nemo-ci force-pushed the main2dev/18_06_2026 branch from 52bfe0a to a21f083 Compare June 18, 2026 21:25

svcnvidia-nemo-ci marked this pull request as ready for review June 18, 2026 23:30

svcnvidia-nemo-ci requested review from a team as code owners June 18, 2026 23:30

svcnvidia-nemo-ci added the complexity: high label Jun 18, 2026

svcnvidia-nemo-ci closed this Jun 22, 2026

svcnvidia-nemo-ci wants to merge 36 commits into dev from main2dev/18_06_2026

19 participants
