Add W4A8 INT8 activation kernels for batched MoE prefill by digantdesai · Pull Request #19187 · pytorch/executorch

digantdesai · 2026-04-28T15:56:27Z

Stack from ghstack (oldest at bottom):

INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per-row per-tile and
dequantize INT4 weights directly to INT8 (skipping bf16 conversion).
Uses tl.dot(int8, int8) → int32 accumulation with per-tile float32
rescale. 1.7× MoE speedup on A100 at M=1024 with 0.9998 cosine
similarity vs bf16 baseline.

Co-authored-by: Claude <noreply@anthropic.com>
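For readers unfamiliar with the W4A8 scheme described above, here is a minimal pure-Python reference of the arithmetic, not the Triton kernels from this PR. It assumes symmetric quantization with no zero point, a weight group size equal to the K tile, and a float baseline in place of bf16; the function names, shapes, and random test data are all hypothetical.

```python
import math
import random


def quantize_rows_int8(tile):
    """Symmetric per-row dynamic quantization of one activation tile to INT8."""
    q_rows, scales = [], []
    for row in tile:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w4a8_matmul(x, w_q, w_scale, tile_k):
    """x: [M][K] floats, w_q: [N][K] ints in [-8, 7], w_scale: [N][K // tile_k]."""
    m_dim, k_dim, n_dim = len(x), len(x[0]), len(w_q)
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for t, k0 in enumerate(range(0, k_dim, tile_k)):
        # Activations are re-quantized per row for every K tile.
        x_q, x_scale = quantize_rows_int8([row[k0:k0 + tile_k] for row in x])
        for m in range(m_dim):
            for n in range(n_dim):
                # Integer dot product: stands in for tl.dot(int8, int8) -> int32.
                acc = sum(a * b for a, b in zip(x_q[m], w_q[n][k0:k0 + tile_k]))
                # Per-tile float rescale by the activation and weight scales.
                out[m][n] += acc * x_scale[m] * w_scale[n][t]
    return out


def float_matmul(x, w_q, w_scale, tile_k):
    """Baseline: dequantize weights to float and use unquantized activations."""
    return [
        [sum(xv * w_q[n][k] * w_scale[n][k // tile_k] for k, xv in enumerate(row))
         for n in range(len(w_q))]
        for row in x
    ]


def cosine_similarity(a, b):
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(p * q for p, q in zip(fa, fb))
    return dot / (math.sqrt(sum(p * p for p in fa)) * math.sqrt(sum(q * q for q in fb)))


random.seed(0)
M, K, N, TILE_K = 4, 64, 8, 16  # toy sizes, not the benchmarked M=1024 case
x = [[random.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(M)]
w_q = [[random.randint(-8, 7) for _ in range(K)] for _ in range(N)]
w_scale = [[random.uniform(0.01, 0.1) for _ in range(K // TILE_K)] for _ in range(N)]

cos = cosine_similarity(w4a8_matmul(x, w_q, w_scale, TILE_K),
                        float_matmul(x, w_q, w_scale, TILE_K))
print(f"cosine similarity vs float baseline: {cos:.6f}")
```

Because the INT4 weights already fit in INT8, only the activations pick up extra rounding error in this sketch, which is why the result stays close to the float baseline.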


pytorch-bot · 2026-04-28T15:56:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19187

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 5 Cancelled Jobs, 3 Unrelated Failures

As of commit 2b1e1eb with merge base cb4e5ae:

NEW FAILURES - The following jobs have failed:

- MLX / test-mlx-voxtral-realtime / test-mlx-voxtral-realtime (gh)
  Process completed with exit code 126.
- pull / test-static-llama-qnn-linux (stories_260k_bc) / linux-job (gh)
  RuntimeError: Command docker exec -t 7941fb08f2a4a4083887294eb0d6ec73c2c88036d8f1ed722808600f207d9c2c /exec failed with exit code 92
- pull / unittest / macos / macos-job (gh)
  export/tests/test_target_recipes.py::TestTargetRecipes::test_linear_model
- Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / linux-job (gh)
  RuntimeError: Command docker exec -t 9a4a68257fdb4fc0be1ad08181cf63ffea43748ecc93b38a22e7ff245fa74dc3 /exec failed with exit code 1
- Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (nvidia, parakeet-tdt, non-quantized) / linux-job (gh)
  RuntimeError: Command docker exec -t cc69d2e110e04b38a3e11ed325f453f6a9e3e8d44e4774cefb605f91fc57d890 /exec failed with exit code 1
- Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (nvidia, parakeet-tdt, quantized-int4-weight-only) / linux-job (gh)
  RuntimeError: Command docker exec -t bb7b27a978bbf2e268c10cde64baa0a51d2e3e07dec03baf90218284fec7e4bd /exec failed with exit code 1
- trunk / test-coreml-delegate / macos-job (gh)
  The process '/opt/homebrew/bin/git' failed with exit code 128

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / unittest-release / windows / windows-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following jobs failed but were also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

- MLX / test-mlx-qwen35-moe / test-mlx-qwen35-moe (gh) (trunk failure)
  RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
- Test Metal Backend / test-metal-qwen35-moe-tiny / macos-job (gh) (trunk failure)
  /Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/examples/models/qwen3_5_moe/main.cpp:65:16: error: use of undeclared identifier 'cudaSuccess'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-04-28T15:57:09Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Gasoonjia

Thanks for your work! Can you also update the CI to use the INT8 activation type for MoE prefill?

Gasoonjia · 2026-04-29T18:22:52Z

        help="Disable split-K (flash-decoding) SDPA for decode; use tiled SDPA instead.",
    )
+    parser.add_argument(
+        "--moe-activation-dtype",


Maybe calling it prefill-moe-activation-dtype would be better?

I didn't do that because skipping INT8 for decode is a decision we may revisit later.
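To make the naming discussion concrete, here is a small sketch of how such a flag could be wired up. Only the `--moe-activation-dtype` name comes from the diff above; the choices, default, help text, and the `moe_kernel_dtype` helper are assumptions for illustration, not the code in this PR.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the flag name is taken from the diff.
    parser = argparse.ArgumentParser(description="export script (illustrative)")
    parser.add_argument(
        "--moe-activation-dtype",
        choices=["bf16", "int8"],  # assumed set of choices
        default="bf16",            # assumed default
        help="Activation dtype for the batched MoE kernels.",
    )
    return parser


def moe_kernel_dtype(phase, activation_dtype):
    """Hypothetical helper: INT8 activations apply to prefill only for now."""
    if phase == "prefill":
        return activation_dtype
    # Decode keeps bf16 activations, which is the behavior the flag name
    # leaves room to change later.
    return "bf16"


args = build_parser().parse_args(["--moe-activation-dtype", "int8"])
print(moe_kernel_dtype("prefill", args.moe_activation_dtype))
print(moe_kernel_dtype("decode", args.moe_activation_dtype))
```

Keeping the flag name phase-neutral means a later change to decode would only touch the helper, not the command-line interface.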


…oE (#19188)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #19190
* __->__ #19188
* #19187

Add three new Triton kernels for dense W4A16 linear projections that replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights (same format as MoE experts):
- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens, +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a known trade-off for uniform weight format, to be revisited with a CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>
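As a rough illustration of the [N, K//2] packed layout and the M-based dispatch mentioned in the #19188 commit message, the sketch below packs two INT4 values per byte and picks a path by M. The nibble order (even index in the low nibble), the +8 offset encoding, and all helper names are assumptions; the actual kernels may lay out the data differently.

```python
def pack_int4_row(values):
    """Pack K signed INT4 values (range [-8, 7]) into K // 2 bytes."""
    assert len(values) % 2 == 0, "K must be even"
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Assumed encoding: store value + 8 as an unsigned nibble.
        packed.append(((hi + 8) << 4) | (lo + 8))
    return bytes(packed)


def unpack_int4_row(packed):
    """Inverse of pack_int4_row: K // 2 bytes back to K signed values."""
    out = []
    for byte in packed:
        out.append((byte & 0xF) - 8)  # low nibble: even index (assumed)
        out.append((byte >> 4) - 8)   # high nibble: odd index (assumed)
    return out


def choose_path(m):
    """Dual dispatch as described in the commit message: decode vs prefill."""
    return "int4_matvec" if m == 1 else "dequant_w4_to_bf16 + F.linear"


row = [-8, 7, 0, -1, 3, -4]
packed = pack_int4_row(row)
print(len(packed), unpack_int4_row(packed) == row)
print(choose_path(1), "|", choose_path(3072))
```

Using one packed layout for both paths is what allows a single weight blob to serve decode and prefill without duplication.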

… runner (#19190)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #19190
* #19188
* #19187

Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table.

This commit was authored with the assistance of Claude Code.

@digantdesai

This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: #19187 by @digantdesai
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/orig
@diff-train-skip-merge

Co-authored-by: Digant Desai <digantdesai@meta.com>

digantdesai requested a review from lucylq as a code owner April 28, 2026 15:56

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 28, 2026

This was referenced Apr 28, 2026

Add Triton INT4 dense kernels with dequant prefill path for Qwen3.5 MoE #19188

Merged

Remove benchmark scripts from git tracking #19189

Open

Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner #19190

Merged

digantdesai requested review from Gasoonjia and mergennachin April 28, 2026 21:09

digantdesai added 2 commits April 28, 2026 14:18

Gasoonjia approved these changes Apr 29, 2026


digantdesai merged commit 62ba859 into gh/digantdesai/50/base Apr 30, 2026
382 of 397 checks passed

digantdesai deleted the gh/digantdesai/50/head branch April 30, 2026 15:04

digantdesai temporarily deployed to cherry-pick-bot April 30, 2026 15:04 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Apr 30, 2026

Add W4A8 INT8 activation kernels for batched MoE prefill #19226

Merged

digantdesai merged 5 commits into gh/digantdesai/50/base from gh/digantdesai/50/head

2 participants

Conversation

digantdesai commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19187

❌ 7 New Failures, 5 Cancelled Jobs, 3 Unrelated Failures

Uh oh!

github-actions Bot commented Apr 28, 2026

This PR needs a release notes: label

Uh oh!

Gasoonjia left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Gasoonjia Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

digantdesai Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

digantdesai commented Apr 28, 2026 •

edited

Loading

pytorch-bot Bot commented Apr 28, 2026 •

edited

Loading

This PR needs a `release notes:` label