fix(deepseek-v4): restore batch axis for packed-sequence (THD) forward #2651

Open
akoumpa wants to merge 1 commit into main from akoumparouli/nvbug6329577-deepseek-v4-thd-batch

@akoumpa akoumpa commented Jun 20, 2026

NVBug: 6329577

What

Add the missing leading batch dimension to inputs_embeds in DeepseekV4Model.forward
when the input arrives in packed-sequence (THD) layout, before the hc_mult expansion.

Why

Packed-sequence finetuning of DeepSeek-V4-Flash crashes in the model forward pass on the
first optimizer step (NVBugs 6329577):

RuntimeError: The expanded size of the tensor (4) must match the existing size (512)
              at non-singleton dimension 2.  Target sizes: [-1, -1, 4, -1].
              Tensor sizes: [512, 512, 1]

at nemo_automodel/components/models/deepseek_v4/model.py:450:

h = inputs_embeds.unsqueeze(2).expand(-1, -1, self.config.hc_mult, -1).contiguous()

Root cause: the THD packed path (make_cp_batch_and_ctx(use_te=True) ->
process_input_for_thd) collapses the batch dimension, handing the model a rank-1
input_ids of shape [T]. embed_tokens([T]) then yields a rank-2 [T, H] inputs_embeds,
so unsqueeze(2) produces [T, H, 1]. expand(-1, -1, hc_mult, -1) right-aligns its four target
sizes against that rank-3 tensor, so the hc_mult target lands on the non-singleton hidden dim
(dimension 2 of the expanded result) and the call fails. The model already restores the batch
dim on the output side (compute_lm_head_logits(is_thd=True) does unsqueeze(0) -> [1, T, V]);
the input side just lacked the symmetric up-rank. (The original NVBug "suggested fix" was a
no-op, identical to the existing code, and did not address the actual rank mismatch.)

How

In DeepseekV4Model.forward, after computing inputs_embeds and before the hc_mult expand:

if inputs_embeds.dim() == 2:
    inputs_embeds = inputs_embeds.unsqueeze(0)

This is a no-op for the normal BSHD [B, S, H] path and mirrors the existing output-side THD
restoration; downstream position_ids (1-D) and seq_lens (1-D) up-ranks were already present.
7 lines added, 0 removed, 1 file.
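
The effect of the guard on both layouts can be sketched at the shape level. This is an illustrative pure-Python model, not the code in model.py: `restore_batch_axis` and `hc_expand` are hypothetical names that operate on shape tuples, and `hc_expand` assumes the rank-3 [B, S, H] input that the hc_mult expansion expects.

```python
def restore_batch_axis(shape):
    """Model of the guard: rank-2 [T, H] becomes [1, T, H]; other ranks pass through."""
    return (1,) + shape if len(shape) == 2 else shape


def hc_expand(shape, hc_mult):
    """Shape of unsqueeze(2).expand(-1, -1, hc_mult, -1) for a rank-3 [B, S, H] input."""
    if len(shape) != 3:
        raise ValueError(f"expected a rank-3 [B, S, H] shape, got rank {len(shape)}")
    b, s, h = shape
    return (b, s, hc_mult, h)


HC_MULT = 4
thd_embeds = (512, 512)      # packed path: [T, H]
bshd_embeds = (2, 128, 512)  # normal path: [B, S, H]

# THD input gains a leading batch axis of size 1 before the expansion.
print(hc_expand(restore_batch_axis(thd_embeds), HC_MULT))

# BSHD input is left untouched by the guard.
print(hc_expand(restore_batch_axis(bshd_embeds), HC_MULT))
```

Checking the rank rather than a THD flag keeps the guard local to the tensor it repairs, which is why the rank-3 path is unaffected.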

How tested

  • Weightless single-GPU repro (tiny random-init DSV4, hc_mult=4, hidden_size=512) that feeds
    input_ids through the real process_input_for_thd to produce the exact rank-1 [T] layout:
    • Before: reproduces the reported error at model.py:450 (Tensor sizes: [512, 512, 1]).
    • After: forward completes, logits [1, 512, 256] = [1, T, vocab], no NaN.
  • pytest tests/unit_tests/models/deepseek_v4/test_dsv4_model_smoke.py -> 17 passed
    (incl. THD / forward / backward smoke tests). No regressions.

NVBugs: 6329577
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa requested a review from a team as a code owner June 20, 2026 02:04
@akoumpa akoumpa added the r0.5.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Jun 20, 2026
copy-pr-bot Bot commented Jun 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa commented Jun 20, 2026

/ok to test 0862c83

@akoumpa akoumpa enabled auto-merge (squash) June 20, 2026 03:31