Refactor pipeline parallel helpers for graph PP reuse #2724
sanketpurandare wants to merge 1 commit into main from sanketpurandare/stack/3
Conversation
tianyu-l
left a comment
Please add a more detailed description about why you are making these changes.
__all__ = [
    "pipeline_llm",
    "get_pipeline_metadata",
    "build_pipeline_schedule",
    "generate_llm_fqn_per_model_part",
    "pipeline_module_split",
    "split_module",
    "get_pp_rank_to_stage_indices_mapping",
]
do all these fields need to be public and exposed?
Sorry this should have been marked as draft PR, it is not ready for review yet.
I have updated the PR description to include the change description.
@claude review this PR
Claude finished @tianyu-l's task in 3m 25s.
Review of PR #2724: Refactor pipeline_parallel.py for graph PP reuse
stages: list[PipelineStage],
loss_fn: Callable,
backward_requires_autograd: bool = True,
scale_grads: bool = False,
why do you need to expose this? IMO we should deprecate this -- IIRC it's assuming all microbatches have the same number of valid tokens.
scale_grads has been removed
local_batch_size: int,
stages: list[PipelineStage],
loss_fn: Callable,
backward_requires_autograd: bool = True,
curious what this field is used for
This is used when we want to run the backward actions for a microbatch without supplying a loss function to the PP runtime. Currently, if we don't supply a loss function, the pp_runtime assumes there is no backward pass. But in graph PP the forward graph contains the loss function, so no explicit loss function is passed to the pp_runtime.
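A minimal stand-in sketch (hypothetical names, not torchtitan's actual runtime) of how such a flag can gate backward scheduling: eager PP supplies a `loss_fn` and relies on autograd, while graph PP supplies `loss_fn=None` plus `backward_requires_autograd=False` to still get backward actions scheduled as explicit backward graphs.

```python
from typing import Callable, Optional


def schedule_has_backward(
    loss_fn: Optional[Callable],
    backward_requires_autograd: bool = True,
) -> bool:
    """Decide whether a PP schedule should emit backward actions (sketch)."""
    if loss_fn is not None:
        return True  # eager PP: backward driven by autograd through the loss
    # No loss_fn: only schedule backward if the stages carry explicit
    # backward graphs, i.e. backward does not require autograd.
    return not backward_requires_autograd
```

With this logic, eager PP behavior is unchanged (no `loss_fn` still means no backward by default), and graph PP opts in by clearing the flag.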
stack-info: PR: #2724, branch: sanketpurandare/stack/3
Stacked PRs:
Refactor pipeline parallel helpers for graph PP reuse
Extract pipeline metadata, module splitting, and PP rank-to-stage mapping from pipeline_llm so graph PP can reuse the underlying setup logic without duplicating it. Add backward_requires_autograd to the schedule builder for graph PP, which runs explicit backward graphs instead of autograd. Existing eager PP behavior is unchanged.
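The extracted rank-to-stage mapping can be illustrated with a small sketch. This assumes a "looped" interleaved placement (stage `s` lives on rank `s % pp_degree`); the real `_get_pp_rank_to_stage_indices_mapping` may support other placements, so treat this as an illustration only.

```python
def pp_rank_to_stage_indices(pp_degree: int, num_stages: int) -> dict[int, list[int]]:
    """Map each PP rank to the stage indices it hosts (looped placement)."""
    return {
        rank: [s for s in range(num_stages) if s % pp_degree == rank]
        for rank in range(pp_degree)
    }
```

For example, 4 stages on 2 ranks gives rank 0 the even stages and rank 1 the odd ones, which is the kind of mapping both eager PP and graph PP need when building their stages.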
How does this refactoring help?
This is the call chain:
pipeline_llm
-> _get_pipeline_metadata
-> _generate_llm_fqn_per_model_part
-> _build_get_mesh_callback
-> _pipeline_module_split
-> _get_pp_rank_to_stage_indices_mapping
-> _split_module
-> PipelineStage
-> parallelize_fn
-> _build_pipeline_schedule
GraphPP will define:
graph_pipeline_llm
-> _get_pipeline_metadata
-> _generate_llm_fqn_per_model_part
-> _build_get_mesh_callback
-> _pipeline_module_split
-> _get_pp_rank_to_stage_indices_mapping
-> _split_module
-> GraphPipelineStage
-> parallelize_fn
-> _build_pipeline_schedule
-> GraphPPRunner
So by overriding only _pipeline_module_split while reusing all the other helpers, we minimize code redundancy and maximize code reuse.
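The two call chains above reduce to a simple parameterization, sketched here in pure Python (function bodies are stand-ins; only the structure mirrors the PR):

```python
def _get_pipeline_metadata(num_stages: int) -> dict:
    # shared helper (stand-in body)
    return {"num_stages": num_stages}


def _pipeline_module_split(meta: dict, stage_cls: str) -> list[str]:
    # shared splitting logic, parameterized by which stage class to construct
    return [f"{stage_cls}[{i}]" for i in range(meta["num_stages"])]


def pipeline_llm(num_stages: int) -> list[str]:
    meta = _get_pipeline_metadata(num_stages)
    return _pipeline_module_split(meta, "PipelineStage")


def graph_pipeline_llm(num_stages: int) -> list[str]:
    meta = _get_pipeline_metadata(num_stages)                  # reused as-is
    return _pipeline_module_split(meta, "GraphPipelineStage")  # only this differs
```

Everything except the stage class (and the split step that builds it) is shared, which is exactly the redundancy the refactor removes.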
Keep pipeline_llm as the only public entrypoint exported by torchtitan.distributed.pipeline_parallel. Make build_pipeline_schedule, generate_llm_fqn_per_model_part, and pipeline_module_split private because they are implementation details with narrower contracts: schedule construction depends on the current PP config shape, LLM FQN generation encodes TorchTitan-specific module naming heuristics, and module splitting assumes models tolerate deleted or empty layer containers.
Update internal and experiment callsites to use the private helper names directly where reuse is still needed. This keeps the reusable code centralized while avoiding accidentally blessing those helpers as stable public API.