
[support] GPT-OSS DPO fails after step 1 #4402

Description

@dhineshkumar-r

User problem

I'm trying to run DPO on gpt-oss-120b with a dataset containing sequences longer than 5,000 tokens. By tuning EP, PP, and the number of nodes, I got the first step to complete, but the job then fails with an OOM error.

  File "/opt/ray_venvs/nemo_rl.models.policy.workers.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 448, in forward
    outputs = float16_to_fp32(outputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 376, in float16_to_fp32
    return conversion_helper(val, float_conversion)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 335, in conversion_helper
    return conversion(val)
           ^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 373, in float_conversion
    val = val.float()
          ^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.80 GiB. GPU 0 has a total capacity of 139.80 GiB of which 1.42 GiB is free. Including non-PyTorch memory, this process has 138.29 GiB memory in use. Of the allocated memory 121.77 GiB is allocated by PyTorch, and 11.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
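
For a rough sense of scale, the failing 6.80 GiB allocation can be related to the size of an fp32 output tensor. The sketch below assumes that the tensor being cast in `float16_to_fp32` is the full-vocabulary logits (the vocabulary is not sharded when tensor_model_parallel_size=1) and that the vocabulary size is 201,088. Both are assumptions, not something the traceback confirms, so check the value against the model config.

```python
GIB = 1024 ** 3
BYTES_PER_FP32 = 4

# Assumption: gpt-oss vocabulary size; verify against the model's config.
VOCAB_SIZE = 201_088


def fp32_logits_gib(num_tokens: int, vocab_size: int) -> float:
    """Size in GiB of an fp32 tensor of shape [num_tokens, vocab_size]."""
    return num_tokens * vocab_size * BYTES_PER_FP32 / GIB


def tokens_for_allocation(alloc_gib: float, vocab_size: int) -> int:
    """Number of token positions whose fp32 logits would fill alloc_gib."""
    return int(alloc_gib * GIB / (BYTES_PER_FP32 * vocab_size))


# fp32 logits for one sequence at policy.max_total_sequence_length=5000
print(f"{fp32_logits_gib(5000, VOCAB_SIZE):.2f} GiB per 5000-token sequence")

# Token positions that a 6.80 GiB fp32 logits tensor would correspond to
print(f"{tokens_for_allocation(6.80, VOCAB_SIZE)} token positions in 6.80 GiB")
```

Under these assumptions the failed allocation covers more token positions than a single 5000-token sequence. That would be consistent with a micro-batch holding both the chosen and the rejected response, though this is a guess rather than something the log shows.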

Expected behavior

The DPO job runs for all steps and exits successfully.

Additional context

  • The cluster has 3 p5e nodes, each with 8 GPUs.
  • I set PYTORCH_CUDA_ALLOC_CONF to expandable_segments:True, but the same OOM still occurs.
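
The figures in the OOM message can be broken down to see where the memory sits. The following is a minimal sketch that parses the message text quoted above. The helper names `parse_oom` and `summarize` are hypothetical, and the regular expressions assume the message wording shown in this issue.

```python
import re

OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 6.80 GiB. GPU 0 has a total capacity "
    "of 139.80 GiB of which 1.42 GiB is free. Including non-PyTorch memory, this "
    "process has 138.29 GiB memory in use. Of the allocated memory 121.77 GiB is "
    "allocated by PyTorch, and 11.13 GiB is reserved by PyTorch but unallocated."
)

# Patterns assume the exact wording of the message above.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "capacity": r"total capacity of ([\d.]+) GiB",
    "free": r"of which ([\d.]+) GiB is free",
    "in_use": r"process has ([\d.]+) GiB memory in use",
    "allocated": r"([\d.]+) GiB is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch",
}


def parse_oom(message: str) -> dict[str, float]:
    """Extract the GiB figures from a CUDA OOM message."""
    return {key: float(re.search(pat, message).group(1)) for key, pat in PATTERNS.items()}


def summarize(stats: dict[str, float]) -> dict[str, float]:
    """Derive a few quantities that the message does not state directly."""
    return {
        # Memory held by the process outside the PyTorch caching allocator
        "non_pytorch": round(
            stats["in_use"] - stats["allocated"] - stats["reserved_unallocated"], 2
        ),
        # How far the request exceeds the memory the driver reports as free
        "shortfall_vs_free": round(stats["requested"] - stats["free"], 2),
        # Free memory plus memory cached by PyTorch but not in use
        "free_plus_cached": round(stats["free"] + stats["reserved_unallocated"], 2),
    }


stats = parse_oom(OOM_MESSAGE)
print(stats)
print(summarize(stats))
```

Free memory plus the cached-but-unallocated memory is larger than the request, which is presumably why the message suggests the allocator setting. Since that setting is already applied here, the 121.77 GiB of live allocations is probably the number that needs to come down.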

Please share any suggestions for me to try.
Thanks in advance.

Desired outcome

DPO job runs successfully and stores checkpoints.

Alternatives or workarounds considered

  • I tuned PP, EP, and the number of nodes in the cluster to get the first step working, but I am struggling to get past it.

Affected area

area:recipe

Urgency / use case

Blocking current work

Environment

Steps/Code to reproduce bug

  1. Pull the nemo-rl:0.5.0 image.
  2. Run DPO with the following config overrides:
 uv run examples/run_dpo.py \
      --config examples/configs/dpo.yaml \
      policy.model_name=openai/gpt-oss-120b \
      policy.tokenizer.name=openai/gpt-oss-120b \
      policy.max_total_sequence_length=5000 \
      policy.train_global_batch_size=2 \
      policy.train_micro_batch_size=1 \
      policy.dtensor_cfg.enabled=false \
      policy.megatron_cfg.enabled=true \
      policy.megatron_cfg.sequence_parallel=false \
      policy.megatron_cfg.expert_model_parallel_size=2 \
      policy.megatron_cfg.tensor_model_parallel_size=1 \
      policy.megatron_cfg.context_parallel_size=1 \
      policy.megatron_cfg.pipeline_model_parallel_size=12 \
      policy.make_sequence_length_divisible_by=1 \
      +policy.megatron_cfg.env_vars.NRL_MEGATRON_CHECKPOINT_DIR=/fsx/shared/checkpoints/megatron/gpt-oss-120b \
      policy.megatron_cfg.optimizer.use_distributed_optimizer=true \
      logger.mlflow_enabled=true \
      logger.mlflow.experiment_name=nemo-rl-experiments \
      logger.mlflow.run_name=dpo \
      +logger.mlflow.tracking_uri=<mlflow tracking server> \
      dpo.sft_loss_weight=0.1 \
      dpo.preference_average_log_probs=true \
      cluster.gpus_per_node=8 \
      cluster.num_nodes=3 \
      checkpointing.checkpoint_dir=/fsx/shared/experiments/v1/ \
      data.dataset_name=PreferenceDataset \
      ++data.train_data_path=/fsx/shared/datasets/v1/dpo_dataset_train.jsonl \
      ++data.val_data_paths.val=/fsx/shared/datasets/v1/dpo_dataset_val.jsonl
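
To make the layout implied by these overrides explicit, here is a small sanity-check sketch. The `ParallelLayout` class is hypothetical and not part of NeMo RL or Megatron. The expert-parallel check is a simplification that assumes expert tensor parallelism equals TP, and the layer count of 36 for gpt-oss-120b is an assumption to verify against the model config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    num_nodes: int
    gpus_per_node: int
    tp: int  # tensor_model_parallel_size
    pp: int  # pipeline_model_parallel_size
    cp: int  # context_parallel_size
    ep: int  # expert_model_parallel_size
    global_batch_size: int
    micro_batch_size: int

    @property
    def world_size(self) -> int:
        return self.num_nodes * self.gpus_per_node

    @property
    def data_parallel_size(self) -> int:
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel:
            raise ValueError("world size must be divisible by TP * PP * CP")
        return self.world_size // model_parallel

    def check_expert_parallel(self) -> None:
        # Simplification: with expert TP equal to TP, EP must divide DP * CP.
        if (self.data_parallel_size * self.cp) % self.ep:
            raise ValueError("EP must divide DP * CP")

    @property
    def grad_accum_steps(self) -> int:
        samples_per_pass = self.micro_batch_size * self.data_parallel_size
        if self.global_batch_size % samples_per_pass:
            raise ValueError("global batch must be divisible by micro batch * DP")
        return self.global_batch_size // samples_per_pass

    def layers_per_stage(self, num_layers: int) -> int:
        if num_layers % self.pp:
            raise ValueError("layer count must be divisible by PP")
        return num_layers // self.pp


# Values taken from the overrides above
layout = ParallelLayout(
    num_nodes=3, gpus_per_node=8, tp=1, pp=12, cp=1, ep=2,
    global_batch_size=2, micro_batch_size=1,
)
NUM_LAYERS = 36  # assumption: gpt-oss-120b transformer layer count

layout.check_expert_parallel()
print("world size:", layout.world_size)
print("data parallel size:", layout.data_parallel_size)
print("gradient accumulation steps:", layout.grad_accum_steps)
print("layers per pipeline stage:", layout.layers_per_stage(NUM_LAYERS))
```

With these numbers the 24 GPUs split into 12 pipeline stages and 2 data-parallel replicas, and each replica processes one micro-batch per step. The sketch only checks divisibility and says nothing about which pipeline stage holds the most memory.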

  # -----------------------------------------------
  # RUNTIME ENVIRONMENT VARIABLES
  # -----------------------------------------------
  # Set environment variables needed by the job.
  runtimeEnvYAML: |
    envVars:
      FSX_ROOT: "/fsx"
      GPUS_PER_NODE: "8"
      NCCL_DEBUG: "INFO"
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
      NCCL_DEBUG_SUBSYS: "ALL"
      TORCH_DISTRIBUTED_DEBUG: "DETAIL"
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    workingDir: "."
