Add pytorch-cuda-fix-xpu-alignment agent skill #3466

Open
laifenxiawucha wants to merge 42 commits into main from add-pytorch-cuda-xpu-triage-skill

Conversation

Contributor

@laifenxiawucha laifenxiawucha commented Apr 24, 2026

Summary

Add a new agent skill (pytorch-cuda-fix-xpu-alignment) that discovers CUDA and other backend bug fixes in pytorch/pytorch and validates whether the same bugs exist on XPU.

Input / Output

Input: A time window (e.g., "last 1 day") and target repository (pytorch/pytorch).

Output: A local triage_scan_<date>.md file containing:

  • Candidate list with URLs, titles, and dates
  • Filter decisions (reject/pass) with reasons for each candidate
  • Adapted XPU reproducer scripts for passed candidates
  • Local XPU nightly validation results (confirmed / not-reproduced / unverified)
  • Filter and validation statistics summary
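The output file described above could be assembled roughly as follows. This is a minimal sketch, not the skill's actual implementation: the `Candidate` record, the `render_scan` function, the field names, and the Markdown layout are all hypothetical and chosen only to mirror the bullet list.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One upstream bug-fix signal (hypothetical record layout)."""
    url: str
    title: str
    date: str
    decision: str                     # "pass" or "reject"
    reason: str
    validation: Optional[str] = None  # "confirmed", "not-reproduced", "unverified"


def render_scan(date: str, candidates: list[Candidate]) -> str:
    """Render the candidate list plus filter/validation statistics."""
    lines = [f"# Triage scan {date}", ""]
    for c in candidates:
        lines.append(f"- [{c.title}]({c.url}) ({c.date})")
        lines.append(f"  - filter: {c.decision} ({c.reason})")
        if c.validation is not None:
            lines.append(f"  - validation: {c.validation}")

    # Statistics are simple counts over the recorded decisions.
    filter_stats = Counter(c.decision for c in candidates)
    validation_stats = Counter(c.validation for c in candidates if c.validation)
    lines += ["", "## Summary"]
    lines.append("filter: " + ", ".join(
        f"{k}={v}" for k, v in sorted(filter_stats.items())))
    lines.append("validation: " + ", ".join(
        f"{k}={v}" for k, v in sorted(validation_stats.items())))
    return "\n".join(lines) + "\n"
```

Writing the returned string to `triage_scan_<date>.md` would then be a single `Path.write_text` call.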

Workflow example

User: "Scan pytorch/pytorch for CUDA bug fixes in the last 1 day and check if they affect XPU."

The skill then:

  1. Collect — Search pytorch/pytorch via GitHub MCP for recent backend bug-fix signals (issues, PRs, commits). Save a lightweight candidate list to triage_scan_<date>.md.
  2. Process one at a time — For each candidate:
    • Filter: Read details, reject if infra-only / build-CI-only / docs-typo-only; otherwise pass
    • Reproduce: Extract or adapt the upstream regression test for torch.xpu
    • Validate: Run on local XPU nightly, record confirmed / not-reproduced / unverified
  3. Summarize — Append filter statistics and validation statistics to the scan file.
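The Filter step above can be sketched as a small decision function. The heuristic here is an assumption made for illustration: it looks only at the changed-file paths of a PR or commit, and the `INFRA_PREFIXES` and `DOC_SUFFIXES` values are guesses rather than rules taken from the skill. The skill itself reads the candidate's details before deciding.

```python
from pathlib import PurePosixPath

# Assumed locations/suffixes; the real skill may use different signals.
INFRA_PREFIXES = (".github/", ".ci/")
DOC_SUFFIXES = (".md", ".rst", ".txt")


def filter_candidate(changed_files: list[str]) -> tuple[str, str]:
    """Return (decision, reason) for one candidate."""
    if not changed_files:
        # Issues have no file list; let them through for a manual read.
        return "pass", "no file list available"
    if all(f.startswith(INFRA_PREFIXES) for f in changed_files):
        return "reject", "infra-only / build-CI-only"
    if all(PurePosixPath(f).suffix in DOC_SUFFIXES for f in changed_files):
        return "reject", "docs-only"
    return "pass", "touches non-infra, non-doc files"
```

Returning the reason alongside the decision keeps the scan file auditable, since every reject can be traced back to the rule that triggered it.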

What is included

| File | Purpose |
| --- | --- |
| .github/skills/pytorch-cuda-fix-xpu-alignment/SKILL.md | Skill definition and end-to-end workflow |
| .github/skills/pytorch-cuda-fix-xpu-alignment/references/local-xpu-validation-reference.md | Environment setup, CUDA→XPU device mapping, run commands, bug confirmation criteria |
| .github/copilot-instructions.md | Registered in the skills index (1 row added) |
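As an illustration of the CUDA→XPU device mapping mentioned for the validation reference, a reproducer's source text could be adapted with a few textual substitutions. This is a sketch under assumptions: `adapt_to_xpu` and its rule table are hypothetical, the rules cover only the most common spellings, and not every `torch.cuda` API has a `torch.xpu` counterpart, so the adapted script still needs a manual read.

```python
import re

# Assumed substitution rules; order matters only for readability here.
_CUDA_TO_XPU = [
    # torch.cuda.synchronize() -> torch.xpu.synchronize()
    (re.compile(r"\btorch\.cuda\."), "torch.xpu."),
    # "cuda" / 'cuda:0' device strings -> "xpu" / 'xpu:0'
    (re.compile(r"""(["'])cuda(:\d+)?\1"""), r"\g<1>xpu\g<2>\g<1>"),
    # tensor.cuda() -> tensor.to("xpu")
    (re.compile(r"\.cuda\(\)"), '.to("xpu")'),
]


def adapt_to_xpu(source: str) -> str:
    """Rewrite CUDA device references in a reproducer to target XPU."""
    for pattern, replacement in _CUDA_TO_XPU:
        source = pattern.sub(replacement, source)
    return source
```

For example, `adapt_to_xpu('x = torch.ones(2, device="cuda:0")')` yields the same line with `device="xpu:0"`.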

Adds a new agentskills.io-compatible skill package that mines
pytorch/pytorch for backend-divergence bug fixes and generates
minimal Python reproducers for XPU nightly validation.

Skill layout:
  SKILL.md            - instructions, ranking rubric, guardrails
  scripts/            - find_xpu_python, run_collect_env,
                        run_with_xpu_python, update_torch_xpu_nightly
  references/         - GitHub MCP query patterns, local validation guide
Contributor

Copilot AI left a comment

Pull request overview

Skill used: xpu-ops-pr-review.

Adds a new agentskills.io-compatible skill package (pytorch-cuda-xpu-triage) intended to help an agent mine pytorch/pytorch CUDA-fix signals and generate minimal XPU reproducers for local nightly validation.

Changes:

  • Introduces a new skill definition (SKILL.md) plus workflow references for GitHub MCP querying and local XPU validation.
  • Adds helper bash scripts to (a) locate an XPU-capable Python, (b) run scripts / collect_env with it, and (c) install/upgrade XPU nightly wheels.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .github/skills/pytorch-cuda-xpu-triage/SKILL.md | Defines the triage/mining workflow, ranking rubric, output format, and guardrails. |
| .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md | Documents GitHub MCP setup and query patterns for mining upstream fixes. |
| .github/skills/pytorch-cuda-xpu-triage/references/local-xpu-validation-reference.md | Provides a checklist and suggested boilerplate for validating repros on XPU nightly. |
| .github/skills/pytorch-cuda-xpu-triage/scripts/find_xpu_python.sh | Auto-detects an interpreter by probing candidates for torch.xpu support. |
| .github/skills/pytorch-cuda-xpu-triage/scripts/run_with_xpu_python.sh | Runs an arbitrary Python script with the detected interpreter. |
| .github/skills/pytorch-cuda-xpu-triage/scripts/run_collect_env.sh | Runs torch.utils.collect_env with the detected interpreter. |
| .github/skills/pytorch-cuda-xpu-triage/scripts/update_torch_xpu_nightly.sh | Upgrades or installs torch/vision/audio from the XPU nightly wheel index, with optional locking and dry-run. |
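The interpreter auto-detection described for find_xpu_python.sh can be sketched in Python as follows. The PR's script is bash and is not shown in this conversation, so this is only an illustration of the probing idea: `has_xpu`, `find_xpu_python`, the probe one-liner, and the timeout value are assumptions.

```python
import shutil
import subprocess
from typing import Iterable, Optional

# Exits 0 only if torch imports and reports an available XPU device.
PROBE = "import torch, sys; sys.exit(0 if torch.xpu.is_available() else 1)"


def has_xpu(python: str, timeout: float = 60.0) -> bool:
    """Return True if the given interpreter can see an XPU through torch."""
    try:
        result = subprocess.run(
            [python, "-c", PROBE], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing interpreter, permission error, or a hung import.
        return False
    return result.returncode == 0


def find_xpu_python(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate interpreter that passes the probe."""
    for name in candidates:
        path = shutil.which(name)
        if path and has_xpu(path):
            return path
    return None
```

Probing in a subprocess keeps a broken or missing torch install in one environment from affecting the caller; a failed import simply produces a non-zero exit code.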

Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/update_torch_xpu_nightly.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/find_xpu_python.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/update_torch_xpu_nightly.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/update_torch_xpu_nightly.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/find_xpu_python.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/run_collect_env.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/run_with_xpu_python.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/scripts/update_torch_xpu_nightly.sh Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Copilot AI review requested due to automatic review settings April 24, 2026 08:39
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new GitHub Copilot skill to mine recent CUDA/backend fixes in pytorch/pytorch, qualify XPU-relevant candidates, and generate local XPU repro/validation steps.

Changes:

  • Introduces pytorch-cuda-xpu-triage skill definition and end-to-end workflow guidance.
  • Adds reference docs for GitHub MCP search patterns and local XPU validation commands/criteria.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/skills/pytorch-cuda-xpu-triage/SKILL.md | Defines the skill, search/qualification rubric, and expected output/handoff format. |
| .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md | Documents read-only GitHub search/query patterns and a narrowing heuristic. |
| .github/skills/pytorch-cuda-xpu-triage/references/local-xpu-validation-reference.md | Documents local XPU validation assumptions, commands, and bug confirmation criteria. |

Contributor

@Stonepia Stonepia left a comment

Overall LGTM. We can enable this quickly for the initial stage.

@Stonepia
Contributor

The next step should be to land this skill. Currently, it only works with a local OpenCode/Codex environment.

Copilot AI review requested due to automatic review settings April 24, 2026 09:03
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new “pytorch-cuda-xpu-triage” skill to guide discovery of upstream CUDA/backend fixes in pytorch/pytorch and produce minimal local XPU repro/validation steps.

Changes:

  • Adds a skill definition (SKILL.md) describing search strategy, qualification gates, and output format.
  • Adds GitHub MCP query/tooling reference guidance.
  • Adds local XPU validation and evidence collection reference guidance.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/skills/pytorch-cuda-xpu-triage/SKILL.md | Defines the triage skill’s workflow, candidate rubric, and required guardrails. |
| .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md | Documents GitHub search/tool patterns to discover and narrow candidates. |
| .github/skills/pytorch-cuda-xpu-triage/references/local-xpu-validation-reference.md | Provides local XPU run/validation checklists and environment/evidence collection steps. |

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | mnasnet1_0 | 1.030685 | 0.755972 |
| torchbench_bfloat16_training | dcgan | 0.799755 | 0.765201 |

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | mobilenet_v3_large | 0.967016 | 0.842025 |
| torchbench_bfloat16_training | densenet121 | 0.823402 | 0.842823 |
| torchbench_bfloat16_training | resnext50_32x4d | 1.047162 | 0.860684 |
| huggingface_bfloat16_training | AllenaiLongformerBase | 0.894286 | 0.948284 |
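The bucketing in the report above appears to use the lower of the two target-vs-baseline ratios: every row is consistent with that reading, although the bot's actual rule is not shown in this conversation. A minimal sketch under that assumption, with the hypothetical function name `classify`:

```python
def classify(eager: float, inductor: float) -> str:
    """Bucket a model by its worst target-vs-baseline ratio.

    Assumes the report keys on min(eager, inductor); thresholds are the
    80% and 90% boundaries named in the report headings.
    """
    worst = min(eager, inductor)
    if worst < 0.80:
        return "regression"    # red bucket: [-1, 80%)
    if worst < 0.90:
        return "fluctuation"   # yellow bucket: [80%, 90%)
    return "ok"                # not listed in the report
```

Under this reading, mnasnet1_0 lands in the red bucket because of its Inductor ratio (0.755972), even though its Eager ratio is above 1.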

Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
Comment thread .github/skills/pytorch-cuda-xpu-triage/SKILL.md Outdated
@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| timm_models_bfloat16_training | deit_base_distilled_patch16_224 | 0.742053 | 0.751471 |
| timm_models_bfloat16_training | vit_base_patch16_siglip_256 | 0.733104 | 0.758308 |
| timm_models_bfloat16_training | dm_nfnet_f0 | 0.650604 | 0.774531 |
| timm_models_bfloat16_training | visformer_small | 0.742822 | 0.802383 |
| timm_models_bfloat16_training | nfnet_l0 | 0.743742 | 0.819781 |
| timm_models_bfloat16_training | mobilenetv3_large_100 | 0.780790 | 0.820245 |
| timm_models_bfloat16_training | beit_base_patch16_224 | 0.782675 | 0.854589 |
| timm_models_bfloat16_training | adv_inception_v3 | 0.796822 | 0.861245 |
| timm_models_bfloat16_training | mobilevit_s | 0.722264 | 0.956258 |

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| timm_models_bfloat16_training | convnextv2_nano.fcmae_ft_in22k_in1k | 0.814999 | 0.817716 |
| timm_models_bfloat16_training | repvgg_a2 | 0.843529 | 0.847274 |
| timm_models_bfloat16_training | inception_v3 | 0.804447 | 0.848304 |
| timm_models_bfloat16_training | deit_tiny_patch16_224.fb_in1k | 0.874167 | 0.850921 |
| timm_models_bfloat16_training | ghostnet_100 | 0.849341 | 0.857362 |
| timm_models_bfloat16_training | swin_base_patch4_window7_224 | 0.883423 | 0.863834 |
| timm_models_bfloat16_training | mobilenetv2_100 | 0.834205 | 0.877436 |
| timm_models_bfloat16_training | tf_efficientnet_b0 | 0.811884 | 0.935734 |

@laifenxiawucha laifenxiawucha requested a review from Stonepia April 29, 2026 09:04
Comment thread .github/skills/pytorch-cuda-xpu-triage/references/github-mcp-reference.md Outdated
@Stonepia
Contributor

Overall this seems OK for now. I have some concerns about the pipeline:

  1. Please specify in the PR description what you are going to deliver (for example, the input and output of this skill). We need an example that shows the skill's workflow.
  2. The name is misleading. The skill does not actually do CUDA-to-XPU triage; it aligns CUDA fixes with XPU and then provides a reproducer. So this is an issue-discovery task, not issue triage.

…ub-mcp-reference

- Rename skill folder from pytorch-cuda-xpu-triage to pytorch-cuda-fix-xpu-alignment
  (this is fix discovery + alignment, not issue triage)
- Delete github-mcp-reference.md; inline the example query into SKILL.md Step 1
- Update copilot-instructions.md skill index entry
- Update SKILL.md frontmatter and description
@laifenxiawucha laifenxiawucha changed the title Add pytorch-cuda-xpu-triage agent skill Add pytorch-cuda-fix-xpu-alignment agent skill Apr 29, 2026
@laifenxiawucha laifenxiawucha requested a review from Stonepia April 29, 2026 09:39
After validating a bug on XPU, determine whether the fix belongs in
pytorch/pytorch (upstream XPU kernel) or intel/torch-xpu-ops (local kernel).
Add routing statistics to the summary.
laifenxiawucha added a commit that referenced this pull request Apr 29, 2026
Remove dispatch-coverage.md and triage-patterns.md — the domain knowledge
they contained (coverage signals, triage patterns, key interpretations) is
either already in SKILL.md's workflow steps or something the agent should
derive from code.

Rewrite batch-scan-workflow.md: strip JSON schemas, shell command examples,
and Markdown templates. Keep only workflow steps and constraints.

Inline the 'where to look' file paths from dispatch-coverage.md into
SKILL.md Step 2.

Before: 4 files, 342 lines
After:  2 files, 112 lines (comparable to PR #3466's 109 lines)

4 participants