fix(vlm): guard validation forward against cuDNN fused-MHA SDPA backend #2659

Draft
akoumpa wants to merge 1 commit into main from akoumparouli/nvbug6293238-vlm-eval-sdpa-guard

akoumpa commented Jun 20, 2026

Contributor

NVBug: 6293238

What

Constrain the VLM validation forward to non-cuDNN SDPA backends so it cannot dispatch to the cuDNN
fused-MHA kernel, which can fail with mha_graph.execute(...).is_good() returning false.

  • New eval_safe_sdpa_kernel() in nemo_automodel/_transformers/kernel_patches.py:
    sdpa_kernel([FLASH_ATTENTION, EFFICIENT_ATTENTION, MATH]) (cuDNN excluded), mirroring the cuDNN
    exclusion resolve_sdpa_method already applies under activation checkpointing.
  • FinetuneRecipeForVLM._run_validation_epoch wraps the eval forward in eval_safe_sdpa_kernel()
    (skipped under context parallelism, where train_ctx already enters a cuDNN-free context).
  • Unit tests for the new helper.

Why

Qwen3.6-27B dense VLM MTP finetune trains 10/10 steps cleanly, then crashes in the end-of-training
validation forward:

RuntimeError: Expected mha_graph.execute(...).is_good() to be true, but got false.

inside F.scaled_dot_product_attention. Root cause is a cuDNN fused-MHA backend defect surfaced by the
validation-forward shape (Qwen3.5 full-attention layers use head_dim=256 and the eval batch carries an
explicit attention mask). The Qwen3.5 VLM is a custom model, so it bypasses _patch_attention
(auto_model.py:543) and never receives the cuDNN-excluding SDPA backend list that HF models get,
which leaves the eval forward free to select cuDNN. Training takes the maskless is_causal=True path that
cuDNN handles, so only validation crashes. This change keeps the published recipe's validation enabled
end-to-end (the POR test currently masks the bug by stripping validation), as recommended in the bug's
"Suggested next steps".

How tested

On cw-dfw (1× H100, container with torch 2.12 / cuDNN 9.21):

  • eval_safe_sdpa_kernel() disables cuDNN SDPA (flash/mem-efficient/math remain enabled) and restores
    the flag on exit.
  • A downsized real Qwen3_5ForConditionalGeneration validation forward (eval/no_grad + activation
    checkpointing + MTP + image, multiple seqlens) runs green under the guard with finite logits.
  • pytest tests/unit_tests/_transformers/test_eval_safe_sdpa_kernel.py + existing kernel_patches tests:
    10 passed. ruff check / ruff format --check clean.

Note: the kernel-level crash no longer reproduces on cuDNN 9.21 (the kernel defect was fixed in a cuDNN
release newer than the one in the bug's original nemo-automodel:nightly image); this change removes the
code-level fragility so older-cuDNN users are protected and the backend choice is explicit.

Scope / risk

Eval-only; training behavior unchanged. CP path unchanged (guard skipped, since CP's context already
excludes cuDNN). Low risk: flash/mem-efficient/math are already the default under activation
checkpointing (AC) for HF models. +83/-1, 3 files.

NVBugs: 6293238
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa requested a review from a team as a code owner June 20, 2026 02:07
@akoumpa akoumpa added the r0.5.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Jun 20, 2026
copy-pr-bot Bot commented Jun 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa commented Jun 20, 2026

Contributor Author

/ok to test 2e7f40c
