PoR: Merge MLM dev branch MoE recipes into Bridge training recipes #4471

Description

@yaoyu-33

Context

Megatron-LM (MLM) dev top-of-tree (ToT) appears to be incompatible with Megatron Bridge main ToT: Bridge tests fail when run against the current MLM dev branch. We need a tracked PoR item to find the latest compatible MLM dev commit and to onboard the MLM dev MoE recipes into the Bridge training recipes, so that Bridge can run against MLM dev using those recipes.
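
One way to find the latest compatible commit is a bisection over the MLM dev history, similar in spirit to `git bisect`. The sketch below is illustrative only: `latest_compatible` and the `is_compatible` callback are hypothetical names, the commit IDs are placeholders, and it assumes compatibility is monotonic (once a commit breaks Bridge, every later commit does too). In practice `is_compatible` would check out the commit and run the relevant Bridge tests.

```python
from typing import Callable, Optional, Sequence


def latest_compatible(
    commits: Sequence[str],
    is_compatible: Callable[[str], bool],
) -> Optional[str]:
    """Return the newest passing commit, or None if every commit fails.

    `commits` must be ordered oldest to newest. Assumes that all commits
    before the first failing one pass and all commits after it fail.
    """
    # Invariant: commits[:lo] are known to pass, commits[hi:] are known to fail.
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_compatible(commits[mid]):
            lo = mid + 1
        else:
            hi = mid
    return commits[lo - 1] if lo > 0 else None


# Placeholder history: the first three commits pass, the last two fail.
history = ["a1", "b2", "c3", "d4", "e5"]
print(latest_compatible(history, lambda sha: history.index(sha) < 3))  # prints c3
```

If compatibility is not monotonic (for example, a break that was later fixed and then reintroduced), the result is only one passing commit next to a failing one, not necessarily the newest passing commit, so the chosen SHA should still be confirmed with a direct test run.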

References

Initial recipe inventory

The MLM examples/moe_recipes directory currently contains MoE recipe groups for:

  • DeepSeek-V3
  • DeepSeek-V4-Flash GB200
  • Qwen3-235B-A22B
  • Qwen3-30B-A3B

The README indexes hardware/config variants across H100, B200, B300, GB200, and GB300, including BF16, FP8, MXFP8, DeepEP, HybridEP, CUDA graph, EP overlap, paged stash, and offload variants.

Bridge already has relevant target families under:

  • src/megatron/bridge/recipes/deepseek/
  • src/megatron/bridge/recipes/qwen/

Scope

  • Identify and record the latest MLM dev commit compatible with Bridge main; preserve the currently failing ToT result as a known-bad reference.
  • Port the MLM dev examples/moe_recipes content into Bridge recipe functions under src/megatron/bridge/recipes/, following the existing ConfigContainer / run_recipe.py conventions.
  • Decide whether each onboarded recipe should be a functional training recipe, a performance recipe, or both, especially with the Python recipe unification work in mind.
  • Translate MLM YAML sections (DEPENDENCIES, ENV_VARS, ARGS) into Bridge-native config, helper calls, documented runtime assumptions, and optional extras only where needed.
  • Preserve model/hardware-specific optimization intent: TP/PP/EP/CP/ETP, batch sizes, sequence length, precision mode, dispatcher choice, CUDA graph settings, overlap settings, offload settings, and any required env vars.
  • Add focused validation for recipe import/construction and representative smoke launches with mock data; do not run the full test suite.
  • Update recipe exports and any recipe discovery/docs needed so users can invoke the new recipes through Bridge entry points.
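
The YAML translation step in the scope above can be sketched as a small mapping layer. This is a rough illustration under stated assumptions, not the Bridge implementation: it assumes each recipe has already been loaded into a dict whose ENV_VARS and ARGS sections are flat key/value mappings, that ARGS keys are MLM command-line flag names, and it uses a hypothetical `TranslatedRecipe` container with made-up field names rather than Bridge's ConfigContainer. DEPENDENCIES is left out because its shape is not described here. The example values are invented and not taken from any actual recipe.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping

# Assumed mapping from MLM CLI flags to hypothetical target field names.
ARG_TO_FIELD = {
    "--tensor-model-parallel-size": "tp",
    "--pipeline-model-parallel-size": "pp",
    "--expert-model-parallel-size": "ep",
    "--context-parallel-size": "cp",
    "--expert-tensor-parallel-size": "etp",
    "--micro-batch-size": "micro_batch_size",
    "--global-batch-size": "global_batch_size",
    "--seq-length": "seq_length",
}


@dataclass
class TranslatedRecipe:
    """Hypothetical holder for the translated pieces of one MLM recipe."""
    settings: Dict[str, Any] = field(default_factory=dict)
    env_vars: Dict[str, str] = field(default_factory=dict)
    unmapped_args: Dict[str, Any] = field(default_factory=dict)


def translate(recipe: Mapping[str, Any]) -> TranslatedRecipe:
    """Split a loaded recipe dict into mapped settings, env vars, and leftovers."""
    out = TranslatedRecipe()
    # Empty YAML sections load as None, so fall back to an empty dict.
    out.env_vars = {k: str(v) for k, v in (recipe.get("ENV_VARS") or {}).items()}
    for arg, value in (recipe.get("ARGS") or {}).items():
        if arg in ARG_TO_FIELD:
            out.settings[ARG_TO_FIELD[arg]] = value
        else:
            # Keep anything without a known mapping visible for manual review.
            out.unmapped_args[arg] = value
    return out


# Invented example values, for illustration only.
example = {
    "ENV_VARS": {"CUDA_DEVICE_MAX_CONNECTIONS": 1},
    "ARGS": {
        "--tensor-model-parallel-size": 2,
        "--expert-model-parallel-size": 8,
        "--moe-token-dispatcher-type": "alltoall",
    },
}
result = translate(example)
print(result.settings)       # {'tp': 2, 'ep': 8}
print(result.unmapped_args)  # {'--moe-token-dispatcher-type': 'alltoall'}
```

Collecting unmapped arguments separately, rather than dropping them, makes it easier to see which optimization settings still need a Bridge-native equivalent or an explicit scope-out rationale.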

Acceptance criteria

  • A compatible MLM dev commit SHA is documented, along with the incompatible ToT SHA/failure context.
  • DeepSeek and Qwen3 MoE MLM dev recipe variants are represented in Bridge recipe code or explicitly scoped out with rationale.
  • Recipes can be discovered/imported from Bridge and invoked through the expected training entry point.
  • Focused tests or smoke commands are documented in the implementing PR.
  • No changes are made inside 3rdparty/Megatron-LM/, and no new required dependencies are added without separate approval.
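
The discover/import criterion above could be checked with a small helper like the following. This is a generic sketch, not an existing Bridge test: `missing_recipes` is a hypothetical name, and the demonstration uses the standard-library `json` module so that it runs anywhere. For Bridge, the module name would presumably be something like `megatron.bridge.recipes.deepseek` (inferred from the directory layout above), with the expected recipe function names supplied by the implementing PR.

```python
import importlib
from typing import Iterable, List


def missing_recipes(module_name: str, expected: Iterable[str]) -> List[str]:
    """Return the expected names that the module does not expose as callables.

    Raises ImportError if the module itself cannot be imported.
    """
    module = importlib.import_module(module_name)
    return [
        name for name in expected
        if not callable(getattr(module, name, None))
    ]


# Demonstration against a standard-library module.
print(missing_recipes("json", ["loads", "dumps", "no_such_recipe"]))
# prints ['no_such_recipe']
```

An empty result means every expected name imports and is callable; it does not show that the recipe constructs a valid config, which the smoke launches with mock data would cover.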

Labels

  • PoR (Plan of record item for roadmap and release tracking)
  • area:recipe (Training recipes and launch configs)
  • area:training (Training loop, callbacks, and runtime integration)
  • feature (New capabilities, enhancements, or enablement work)
  • mlm-sync (Requires API/behavior sync with upstream Megatron-LM changes)
  • tracking (Tracking issue for an ongoing project with smaller steps)
