Context
Megatron-LM dev ToT appears incompatible with Megatron Bridge main ToT: Bridge tests fail when run against the current MLM dev. We need a tracked PoR item to find the latest compatible MLM dev commit and to onboard the MLM dev MoE recipes into Bridge training recipes, so that Bridge can run dev on top of those recipes.
References
Initial recipe inventory
The MLM examples/moe_recipes directory currently contains MoE recipe groups for:
- DeepSeek-V3
- DeepSeek-V4-Flash GB200
- Qwen3-235B-A22B
- Qwen3-30B-A3B
The README indexes hardware/config variants across H100, B200, B300, GB200, and GB300, including BF16, FP8, MXFP8, DeepEP, HybridEP, CUDA graph, EP overlap, paged stash, and offload variants.
Bridge already has relevant target families under:
- src/megatron/bridge/recipes/deepseek/
- src/megatron/bridge/recipes/qwen/
Scope
- Identify and record the latest MLM dev commit compatible with Bridge main; preserve the currently failing ToT result as a known-bad reference.
- Port the MLM dev examples/moe_recipes content into Bridge recipe functions under src/megatron/bridge/recipes/, following the existing ConfigContainer / run_recipe.py conventions.
- Decide whether each onboarded recipe should be a functional training recipe, a performance recipe, or both, especially with the Python recipe unification work in mind.
- Translate MLM YAML sections (DEPENDENCIES, ENV_VARS, ARGS) into Bridge-native config, helper calls, documented runtime assumptions, and optional extras only where needed.
- Preserve model/hardware-specific optimization intent: TP/PP/EP/CP/ETP, batch sizes, sequence length, precision mode, dispatcher choice, CUDA graph settings, overlap settings, offload settings, and any required env vars.
- Add focused validation for recipe import/construction and representative smoke launches with mock data; do not run the full test suite.
- Update recipe exports and any recipe discovery/docs needed so users can invoke the new recipes through Bridge entry points.
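As a rough illustration of the YAML translation item above, the sketch below splits an already-parsed MLM recipe mapping into three buckets: config overrides, environment variables to document, and optional dependencies. Everything here is an assumption for illustration: `RecipeSketch` and `translate_mlm_recipe` are hypothetical names, the layout of ARGS as a flag-to-value mapping and DEPENDENCIES as a list is assumed rather than taken from the MLM recipes, and the result is not Bridge's ConfigContainer.

```python
from dataclasses import dataclass, field


# Hypothetical container for illustration only; this is NOT Bridge's ConfigContainer.
@dataclass
class RecipeSketch:
    overrides: dict = field(default_factory=dict)      # ARGS values keyed by normalised name
    env_vars: dict = field(default_factory=dict)       # runtime assumptions to document
    dependencies: list = field(default_factory=list)   # optional extras, not hard requirements


def translate_mlm_recipe(mlm_yaml: dict) -> RecipeSketch:
    """Split an already-parsed MLM recipe mapping into the three buckets.

    Assumes ARGS maps CLI flags to values, ENV_VARS maps names to values,
    and DEPENDENCIES is a list. The real recipe layout may differ.
    """
    sketch = RecipeSketch()
    for flag, value in mlm_yaml.get("ARGS", {}).items():
        # Normalise "--tensor-model-parallel-size" to "tensor_model_parallel_size".
        name = flag.lstrip("-").replace("-", "_")
        sketch.overrides[name] = value
    # Environment variables are always strings at launch time.
    sketch.env_vars = {k: str(v) for k, v in mlm_yaml.get("ENV_VARS", {}).items()}
    sketch.dependencies = list(mlm_yaml.get("DEPENDENCIES", []))
    return sketch


# Illustrative input; the values are placeholders, not taken from any recipe.
example = {
    "ENV_VARS": {"CUDA_DEVICE_MAX_CONNECTIONS": 1},
    "ARGS": {"--tensor-model-parallel-size": 2, "--expert-model-parallel-size": 8},
    "DEPENDENCIES": [],
}
result = translate_mlm_recipe(example)
```

Keeping the three buckets separate makes it easier to review which settings become Bridge config, which remain documented runtime assumptions, and which would need separate approval as dependencies.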
Acceptance criteria
- A compatible MLM dev commit SHA is documented, along with the incompatible ToT SHA/failure context.
- DeepSeek and Qwen3 MoE MLM dev recipe variants are represented in Bridge recipe code or explicitly scoped out with rationale.
- Recipes can be discovered/imported from Bridge and invoked through the expected training entry point.
- Focused tests or smoke commands are documented in the implementing PR.
- No changes are made inside 3rdparty/Megatron-LM/, and no new required dependencies are added without separate approval.
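One possible way to locate the compatible SHA named in the first criterion is a binary search over the MLM dev history, in the spirit of git bisect. The sketch below is a minimal version under stated assumptions: `latest_compatible` is a hypothetical helper, the commit list is ordered oldest to newest, `is_compatible` stands in for whatever focused Bridge check is chosen, and compatibility is assumed to be monotonic (once a commit breaks Bridge, all later commits are broken too). If the history flips between good and bad, the result is only a candidate that needs manual confirmation.

```python
from typing import Callable, Optional, Sequence


def latest_compatible(
    commits: Sequence[str],
    is_compatible: Callable[[str], bool],
) -> Optional[str]:
    """Return the newest commit for which is_compatible holds, or None.

    commits must be ordered oldest to newest. Assumes monotonic breakage:
    every commit after the first incompatible one is also incompatible.
    """
    lo, hi = 0, len(commits) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_compatible(commits[mid]):
            best = commits[mid]   # good so far; look for a newer good commit
            lo = mid + 1
        else:
            hi = mid - 1          # bad; the answer must be older
    return best


# Placeholder commit names, oldest to newest; not real SHAs.
history = ["a1", "b2", "c3", "d4", "e5"]
known_good = {"a1", "b2", "c3"}
found = latest_compatible(history, lambda sha: sha in known_good)
```

The predicate is called only O(log n) times, which matters when each check is a smoke launch rather than a quick import test.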
Reference commits
- dev reference observed on 2026-06-23: 94f70b6322bad7428dbc6f1f514969ed1d0f9346
- main reference observed on 2026-06-23: d26b3f9ef13a162c9ed1be90587668d1fad14059