User problem
I'm trying to run DPO on gpt-oss-120b with a dataset containing sequences that are over 5000 tokens long. By tuning EP, PP, and the number of nodes, I got the first step to complete, but the job fails with an OOM error after the first step.
  File "/opt/ray_venvs/nemo_rl.models.policy.workers.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 448, in forward
    outputs = float16_to_fp32(outputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 376, in float16_to_fp32
    return conversion_helper(val, float_conversion)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 335, in conversion_helper
    return conversion(val)
           ^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 373, in float_conversion
    val = val.float()
          ^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.80 GiB. GPU 0 has a total capacity of 139.80 GiB of which 1.42 GiB is free. Including non-PyTorch memory, this process has 138.29 GiB memory in use. Of the allocated memory 121.77 GiB is allocated by PyTorch, and 11.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
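A rough sanity check on the failing allocation may help here. The traceback ends in `val.float()` inside `float16_to_fp32`, which upcasts the module outputs to fp32. If those outputs are the logits, their size is tokens × vocabulary × 4 bytes. The sketch below is only an estimate: the vocabulary size of 201,088 is an assumption to verify against the model's config.json, the helper names are hypothetical, and it assumes the vocabulary is not sharded (TP=1 in this run).

```python
GIB = 1024 ** 3  # the PyTorch OOM message reports sizes in GiB


def fp32_logits_gib(num_tokens: int, vocab_size: int) -> float:
    """Size in GiB of an fp32 tensor of shape [num_tokens, vocab_size]."""
    return num_tokens * vocab_size * 4 / GIB  # 4 bytes per fp32 element


def tokens_for_allocation(alloc_gib: float, vocab_size: int) -> int:
    """Number of token rows whose fp32 logits fit in an allocation of alloc_gib."""
    return int(alloc_gib * GIB / (4 * vocab_size))


# Assumption: vocabulary size of gpt-oss-120b; verify against config.json.
ASSUMED_VOCAB_SIZE = 201_088

# fp32 logits for one sequence at policy.max_total_sequence_length=5000.
one_sequence = fp32_logits_gib(5000, ASSUMED_VOCAB_SIZE)

# Token count that would explain the 6.80 GiB request from the error message.
implied_tokens = tokens_for_allocation(6.80, ASSUMED_VOCAB_SIZE)

print(f"fp32 logits for 5000 tokens: {one_sequence:.2f} GiB")
print(f"tokens implied by a 6.80 GiB allocation: {implied_tokens}")
```

Under these assumptions, one 5000-token sequence needs about 3.75 GiB of fp32 logits, and the failed 6.80 GiB request corresponds to roughly 9,000 tokens. That may be consistent with a chosen and a rejected sequence being processed together in one micro-batch, but this is a guess rather than something the traceback confirms. The error also reports 11.13 GiB reserved but unallocated, which is more than the request itself, so the cached memory was apparently not usable for this allocation.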
Expected behavior
DPO job runs for all the steps and successfully exits.
Additional context
- Cluster has 3 p5e nodes each containing 8 GPUs.
- I set PYTORCH_CUDA_ALLOC_CONF to expandable_segments:True, but still face the same issue.
Please share any suggestions for me to try.
Thanks in advance.
Desired outcome
DPO job runs successfully and stores checkpoints.
Alternatives or workarounds considered
- I tuned PP, EP, and the number of nodes in the cluster to get the first step working, but I am struggling to get any further.
Affected area
area:recipe
Urgency / use case
Blocking current work
Environment
Steps/Code to reproduce bug*
- Pull the nemo-rl:0.5.0 image.
- Run with the following config overrides:
uv run examples/run_dpo.py \
    --config examples/configs/dpo.yaml \
    policy.model_name=openai/gpt-oss-120b \
    policy.tokenizer.name=openai/gpt-oss-120b \
    policy.max_total_sequence_length=5000 \
    policy.train_global_batch_size=2 \
    policy.train_micro_batch_size=1 \
    policy.dtensor_cfg.enabled=false \
    policy.megatron_cfg.enabled=true \
    policy.megatron_cfg.sequence_parallel=false \
    policy.megatron_cfg.expert_model_parallel_size=2 \
    policy.megatron_cfg.tensor_model_parallel_size=1 \
    policy.megatron_cfg.context_parallel_size=1 \
    policy.megatron_cfg.pipeline_model_parallel_size=12 \
    policy.make_sequence_length_divisible_by=1 \
    +policy.megatron_cfg.env_vars.NRL_MEGATRON_CHECKPOINT_DIR=/fsx/shared/checkpoints/megatron/gpt-oss-120b \
    policy.megatron_cfg.optimizer.use_distributed_optimizer=true \
    logger.mlflow_enabled=true \
    logger.mlflow.experiment_name=nemo-rl-experiments \
    logger.mlflow.run_name=dpo \
    +logger.mlflow.tracking_uri=<mlflow tracking server> \
    dpo.sft_loss_weight=0.1 \
    dpo.preference_average_log_probs=true \
    cluster.gpus_per_node=8 \
    cluster.num_nodes=3 \
    checkpointing.checkpoint_dir=/fsx/shared/experiments/v1/ \
    data.dataset_name=PreferenceDataset \
    ++data.train_data_path=/fsx/shared/datasets/v1/dpo_dataset_train.jsonl \
    ++data.val_data_paths.val=/fsx/shared/datasets/v1/dpo_dataset_val.jsonl
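For reference, the GPU layout implied by these overrides can be worked out as below. This is a sketch that assumes the usual Megatron-style decomposition world_size = TP × PP × CP × DP. `data_parallel_size` is a hypothetical helper, not a NeMo-RL API, and the layer and expert counts for gpt-oss-120b are assumptions to check against the model's config.json.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel size left over after tensor, pipeline and context parallelism."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


# Values taken from the overrides above.
dp = data_parallel_size(num_nodes=3, gpus_per_node=8, tp=1, pp=12, cp=1)

# Gradient accumulation steps: global batch / (micro batch * DP).
grad_accum_steps = 2 // (1 * dp)

# Assumptions about gpt-oss-120b, not stated in this issue.
ASSUMED_NUM_LAYERS = 36
ASSUMED_NUM_EXPERTS = 128

layers_per_pp_stage = ASSUMED_NUM_LAYERS // 12  # pipeline_model_parallel_size=12
experts_per_ep_rank = ASSUMED_NUM_EXPERTS // 2  # expert_model_parallel_size=2

print(f"DP={dp}, grad accumulation steps={grad_accum_steps}")
print(f"layers per PP stage={layers_per_pp_stage}, experts per EP rank={experts_per_ep_rank}")
```

With 24 GPUs this gives DP=2 and a single micro-batch per rank per step. If the assumed model shape is right, each pipeline stage holds 3 layers and each expert-parallel rank holds 64 experts per MoE layer.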
# -----------------------------------------------
# RUNTIME ENVIRONMENT VARIABLES
# -----------------------------------------------
# Set environment variables needed by the job.
runtimeEnvYAML: |
  envVars:
    FSX_ROOT: "/fsx"
    GPUS_PER_NODE: "8"
    NCCL_DEBUG: "INFO"
    PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
    NCCL_DEBUG_SUBSYS: "ALL"
    TORCH_DISTRIBUTED_DEBUG: "DETAIL"
    CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
  workingDir: "."