fix(vlm): guard validation forward against cuDNN fused-MHA SDPA backend by akoumpa · Pull Request #2659 · NVIDIA-NeMo/Automodel

akoumpa · 2026-06-20T02:07:03Z

NVBug: 6293238

What

Constrain the VLM validation forward to non-cuDNN SDPA backends so it cannot dispatch to the cuDNN
fused-MHA kernel, which can fail with mha_graph.execute(...).is_good() == false.

- New eval_safe_sdpa_kernel() in nemo_automodel/_transformers/kernel_patches.py:
  sdpa_kernel([FLASH_ATTENTION, EFFICIENT_ATTENTION, MATH]) (cuDNN excluded), mirroring the cuDNN
  exclusion resolve_sdpa_method already applies under activation checkpointing.
- FinetuneRecipeForVLM._run_validation_epoch wraps the eval forward in eval_safe_sdpa_kernel()
  (skipped under context parallelism, where train_ctx already enters a cuDNN-free context).
- Unit tests for the new helper.

Why

Qwen3.6-27B dense VLM MTP finetune trains 10/10 steps cleanly, then crashes in the end-of-training
validation forward:

RuntimeError: Expected mha_graph.execute(...).is_good() to be true, but got false.

inside F.scaled_dot_product_attention. The root cause is a cuDNN fused-MHA backend defect surfaced by the
validation-forward shape (Qwen3.5 full-attention layers use head_dim=256 and the eval batch carries an
explicit attention mask). The Qwen3.5 VLM is a custom model, so it bypasses _patch_attention
(auto_model.py:543) and never receives the cuDNN-excluding SDPA backend list that HF models get,
which leaves the eval forward free to select cuDNN. Training takes the maskless is_causal=True path that
cuDNN handles, so only validation crashes. This change keeps the published recipe's validation enabled
end-to-end (the POR test currently masks the bug by stripping validation), as recommended in the bug's
"Suggested next steps".
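
To make the two call shapes concrete, here is a small sketch of a maskless is_causal=True call next to a call with an explicit attention mask. Only head_dim=256 comes from the description above; the batch size, head count and sequence length are arbitrary, and the mask is a plain causal mask chosen for illustration. It runs on CPU and does not reproduce the cuDNN failure.

```python
import torch
import torch.nn.functional as F

# head_dim=256 is taken from the PR description; the other sizes are arbitrary.
batch, heads, seq, head_dim = 2, 4, 16, 256
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Training-style call: no mask tensor, causality requested through the flag.
train_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Validation-style call: an explicit boolean mask (True means "may attend").
explicit_mask = torch.ones(seq, seq, dtype=torch.bool).tril()
eval_out = F.scaled_dot_product_attention(q, k, v, attn_mask=explicit_mask)
```

The two calls compute the same attention here, but they present different arguments to the backend selector, which is why one path can reach a kernel the other never exercises.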

How tested

On cw-dfw (1× H100, container with torch 2.12 / cuDNN 9.21):

- eval_safe_sdpa_kernel() disables cuDNN SDPA (flash/mem-efficient/math remain enabled) and restores
  the flag on exit.
- A downsized real Qwen3_5ForConditionalGeneration validation forward (eval/no_grad + activation
  checkpointing + MTP + image, multiple seqlens) runs green under the guard with finite logits.
- pytest tests/unit_tests/_transformers/test_eval_safe_sdpa_kernel.py + existing kernel_patches tests:
  10 passed.
- ruff check / ruff format --check clean.

Note: the kernel-level crash no longer reproduces on cuDNN 9.21 (the kernel defect was fixed in a newer
cuDNN than the bug's original nemo-automodel:nightly image); this change removes the code-level
fragility so older-cuDNN users are protected and the backend choice is explicit.

Scope / risk

Eval-only; training behavior unchanged. CP path unchanged (guard skipped — CP's context already excludes
cuDNN). Low risk: flash/mem-efficient/math are already the AC-time default for HF models. +83/-1, 3 files.

NVBugs: 6293238
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot · 2026-06-20T02:07:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

akoumpa · 2026-06-20T02:13:31Z

/ok to test 2e7f40c

fix(vlm): guard validation forward against cuDNN fused-MHA SDPA backend

2e7f40c

akoumpa requested a review from a team as a code owner June 20, 2026 02:07

akoumpa added the r0.5.0 label (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) Jun 20, 2026

akoumpa marked this pull request as draft June 20, 2026 02:51

akoumpa wants to merge 1 commit into main from akoumparouli/nvbug6293238-vlm-eval-sdpa-guard
