[support] GPT-OSS DPO fails after step 1 #4402

### User problem


I'm trying to run DPO on gpt-oss-120b with a dataset containing sequences that are over 5000 tokens long. By tuning EP, PP, and the number of nodes, I was able to get the first step to complete, but the job then fails with an OOM error.

```
  File "/opt/ray_venvs/nemo_rl.models.policy.workers.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 448, in forward
    outputs = float16_to_fp32(outputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 376, in float16_to_fp32
    return conversion_helper(val, float_conversion)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 335, in conversion_helper
    return conversion(val)
           ^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/transformer/module.py", line 373, in float_conversion
    val = val.float()
          ^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.80 GiB. GPU 0 has a total capacity of 139.80 GiB of which 1.42 GiB is free. Including non-PyTorch memory, this process has 138.29 GiB memory in use. Of the allocated memory 121.77 GiB is allocated by PyTorch, and 11.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
``` 
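For scale, here is a rough back-of-envelope sketch of how large an fp32 copy of the model outputs could be. It rests on assumptions that are not confirmed by the traceback: that the tensor being cast in `float16_to_fp32` is the full logits tensor, that the vocabulary size is 201,088 (please verify against the tokenizer), and that a micro-batch of 1 preference pair means 2 sequences (chosen and rejected) in the forward pass. The helper name `fp32_logits_gib` is hypothetical.

```python
# Back-of-envelope size of an fp32 logits tensor, for comparison with the
# 6.80 GiB allocation reported in the traceback. All inputs are assumptions.

GIB = 1024 ** 3
BYTES_PER_FP32 = 4


def fp32_logits_gib(num_sequences: int, seq_len: int, vocab_size: int) -> float:
    """Size in GiB of a [num_sequences, seq_len, vocab_size] fp32 tensor."""
    return num_sequences * seq_len * vocab_size * BYTES_PER_FP32 / GIB


VOCAB_SIZE = 201_088  # assumed gpt-oss vocabulary size; verify against the tokenizer
NUM_SEQUENCES = 2     # assumed: one preference pair = chosen + rejected
SEQ_LEN = 5000        # policy.max_total_sequence_length from the overrides

size = fp32_logits_gib(NUM_SEQUENCES, SEQ_LEN, VOCAB_SIZE)
print(f"fp32 logits at seq_len={SEQ_LEN}: {size:.2f} GiB")
```

If these assumptions hold, a single fp32 copy of the logits is in the same range as the failed allocation, so the cast may simply be where an already-full GPU runs out of memory rather than the root cause.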


**Expected behavior**

DPO job runs for all the steps and successfully exits.

**Additional context**

* Cluster has 3 p5e nodes each containing 8 GPUs.
* I set `PYTORCH_CUDA_ALLOC_CONF` to `expandable_segments:True`, but still face the same issue.



Please share any suggestions for me to try. 
Thanks in advance.


### Desired outcome

DPO job runs successfully and stores checkpoints.

### Alternatives or workarounds considered

* I tuned PP, EP, and the number of nodes in the cluster to get the first step working, but I'm struggling to get any further.

### Affected area

area:recipe

### Urgency / use case

Blocking current work

### Environment

**Steps/Code to reproduce bug**

1. Pull the nemo-rl:0.5.0 image.
2. Run the job with the following config overrides and runtime environment variables:
```
 uv run examples/run_dpo.py \
      --config examples/configs/dpo.yaml \
      policy.model_name=openai/gpt-oss-120b \
      policy.tokenizer.name=openai/gpt-oss-120b \
      policy.max_total_sequence_length=5000 \
      policy.train_global_batch_size=2 \
      policy.train_micro_batch_size=1 \
      policy.dtensor_cfg.enabled=false \
      policy.megatron_cfg.enabled=true \
      policy.megatron_cfg.sequence_parallel=false \
      policy.megatron_cfg.expert_model_parallel_size=2 \
      policy.megatron_cfg.tensor_model_parallel_size=1 \
      policy.megatron_cfg.context_parallel_size=1 \
      policy.megatron_cfg.pipeline_model_parallel_size=12 \
      policy.make_sequence_length_divisible_by=1 \
      +policy.megatron_cfg.env_vars.NRL_MEGATRON_CHECKPOINT_DIR=/fsx/shared/checkpoints/megatron/gpt-oss-120b \
      policy.megatron_cfg.optimizer.use_distributed_optimizer=true \
      logger.mlflow_enabled=true \
      logger.mlflow.experiment_name=nemo-rl-experiments \
      logger.mlflow.run_name=dpo \
      +logger.mlflow.tracking_uri=<mlflow tracking server> \
      dpo.sft_loss_weight=0.1 \
      dpo.preference_average_log_probs=true \
      cluster.gpus_per_node=8 \
      cluster.num_nodes=3 \
      checkpointing.checkpoint_dir=/fsx/shared/experiments/v1/ \
      data.dataset_name=PreferenceDataset \
      ++data.train_data_path=/fsx/shared/datasets/v1/dpo_dataset_train.jsonl \
      ++data.val_data_paths.val=/fsx/shared/datasets/v1/dpo_dataset_val.jsonl

  # -----------------------------------------------
  # RUNTIME ENVIRONMENT VARIABLES
  # -----------------------------------------------
  # Set environment variables needed by the job.
  runtimeEnvYAML: |
    envVars:
      FSX_ROOT: "/fsx"
      GPUS_PER_NODE: "8"
      NCCL_DEBUG: "INFO"
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
      NCCL_DEBUG_SUBSYS: "ALL"
      TORCH_DISTRIBUTED_DEBUG: "DETAIL"
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    workingDir: "."
```
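For reference, a minimal sketch of how the overrides above map onto the 24 GPUs. It assumes the usual Megatron-style relationships (dense data-parallel size = world size / (TP × PP × CP); expert data-parallel size = world size / (EP × PP), with expert tensor parallelism of 1). The function `parallel_layout` is a hypothetical helper, not part of NeMo RL or Megatron-LM.

```python
# Sanity check of the parallel layout implied by the config overrides.
# The formulas are assumptions about Megatron-style process-group sizing.


def parallel_layout(num_nodes: int, gpus_per_node: int,
                    tp: int, pp: int, cp: int, ep: int) -> dict:
    """Return world size and data-parallel sizes for dense and expert layers."""
    world_size = num_nodes * gpus_per_node

    dense_group = tp * pp * cp
    if world_size % dense_group != 0:
        raise ValueError(f"world size {world_size} not divisible by TP*PP*CP={dense_group}")

    expert_group = ep * pp  # assumes expert tensor parallel size of 1
    if world_size % expert_group != 0:
        raise ValueError(f"world size {world_size} not divisible by EP*PP={expert_group}")

    return {
        "world_size": world_size,
        "data_parallel": world_size // dense_group,
        "expert_data_parallel": world_size // expert_group,
    }


# Values taken from the overrides: 3 nodes x 8 GPUs, TP=1, PP=12, CP=1, EP=2.
layout = parallel_layout(num_nodes=3, gpus_per_node=8, tp=1, pp=12, cp=1, ep=2)
print(layout)
```

Under these assumptions the dense data-parallel size is 2, so with `train_global_batch_size=2` and `train_micro_batch_size=1` each data-parallel rank processes one micro-batch per step.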
