Add W4A8 INT8 activation kernels for batched MoE prefill #19187

Merged
digantdesai merged 5 commits into gh/digantdesai/50/base from gh/digantdesai/50/head
Apr 30, 2026
Contributor

@digantdesai digantdesai commented Apr 28, 2026

Stack from ghstack (oldest at bottom):

INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per-row per-tile and
dequantize INT4 weights directly to INT8 (skipping bf16 conversion).
Uses tl.dot(int8, int8) → int32 accumulation with per-tile float32
rescale. 1.7× MoE speedup on A100 at M=1024 with 0.9998 cosine
similarity vs bf16 baseline.

Co-authored-by: Claude <noreply@anthropic.com>
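
For readers who want to see the arithmetic the description refers to, here is a minimal pure-Python reference sketch of a W4A8 tile loop. It is not the Triton kernel from this PR. The function names, the symmetric signed INT4 range [-8, 7], the absmax/127 activation scale, and the per-channel per-K-tile weight scale layout are all assumptions made for illustration.

```python
import math
import random


def quantize_rows_int8(tile):
    """Symmetric per-row INT8 quantization of one activation tile.

    Returns (quantized rows, per-row float scales). Assumed scheme: absmax / 127.
    """
    q_rows, scales = [], []
    for row in tile:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w4a8_matmul(x, w_q, w_scale, block_k):
    """x: [M][K] floats. w_q: [N][K] ints in [-8, 7] (INT4 values used as INT8).

    w_scale: [N][K // block_k], one float scale per output channel per K-tile
    (hypothetical layout, chosen only for this sketch).
    """
    m_dim, k_dim, n_dim = len(x), len(x[0]), len(w_q)
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for k0 in range(0, k_dim, block_k):
        # Quantize this K-tile of activations per row ("per-row per-tile").
        x_q, x_scale = quantize_rows_int8([row[k0:k0 + block_k] for row in x])
        for m in range(m_dim):
            for n in range(n_dim):
                # Integer dot product, standing in for tl.dot(int8, int8) -> int32.
                acc = sum(a * b for a, b in zip(x_q[m], w_q[n][k0:k0 + block_k]))
                # Per-tile float rescale back to real units.
                out[m][n] += acc * x_scale[m] * w_scale[n][k0 // block_k]
    return out


def reference_matmul(x, w_q, w_scale, block_k):
    """Float reference: dequantize weights, keep activations unquantized."""
    return [
        [
            sum(xv * w_q[n][k] * w_scale[n][k // block_k] for k, xv in enumerate(row))
            for n in range(len(w_q))
        ]
        for row in x
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two matrices, flattened."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(p * q for p, q in zip(fa, fb))
    return dot / (math.sqrt(sum(p * p for p in fa)) * math.sqrt(sum(q * q for q in fb)))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(4)]
    w_q = [[rng.randint(-8, 7) for _ in range(64)] for _ in range(8)]
    w_scale = [[0.05, 0.1] for _ in range(8)]
    out = w4a8_matmul(x, w_q, w_scale, block_k=32)
    ref = reference_matmul(x, w_q, w_scale, block_k=32)
    print("cosine similarity vs float reference:", cosine_similarity(out, ref))
```

The only error source in this sketch is the activation rounding, since the INT4 weights are already integers and need no second rounding step when widened to INT8. The cosine similarity printed here is for random toy data and is not comparable to the 0.9998 figure reported above.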

@digantdesai digantdesai requested a review from lucylq as a code owner April 28, 2026 15:56
pytorch-bot Bot commented Apr 28, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19187

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 5 Cancelled Jobs, 3 Unrelated Failures

As of commit 2b1e1eb with merge base cb4e5ae (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 28, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

@Gasoonjia Gasoonjia left a comment

Thanks for your work! Can you also update the CI to use the int8 activation type for MoE prefill?

Comment thread examples/models/qwen3_5_moe/model.py
help="Disable split-K (flash-decoding) SDPA for decode; use tiled SDPA instead.",
)
parser.add_argument(
"--moe-activation-dtype",
Contributor

Maybe calling it prefill-moe-activation-dtype would be better?

Contributor Author

I didn't do that because we may revisit int8 for decode later, and the flag name shouldn't be tied to prefill.
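
To make the naming discussion concrete, here is a hedged sketch of how a phase-agnostic flag could be wired up. Only the flag name `--moe-activation-dtype` comes from the thread. The choices, default, help text, and the `select_moe_kernel` helper with its returned labels are hypothetical and are not taken from `model.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser containing only the flag under discussion."""
    parser = argparse.ArgumentParser(description="Sketch of the MoE activation flag")
    parser.add_argument(
        "--moe-activation-dtype",
        choices=["bf16", "int8"],  # assumed values, not confirmed by the PR
        default="bf16",            # assumed default
        help="Activation dtype for the batched MoE GEMM kernels.",
    )
    return parser


def select_moe_kernel(activation_dtype, phase):
    """Hypothetical dispatch: INT8 activations apply to prefill only for now."""
    if phase == "prefill" and activation_dtype == "int8":
        return "w4a8"
    # Decode stays on the bf16-activation path; this is the part that may change.
    return "w4a16"


if __name__ == "__main__":
    args = build_parser().parse_args(["--moe-activation-dtype", "int8"])
    print(select_moe_kernel(args.moe_activation_dtype, "prefill"))
    print(select_moe_kernel(args.moe_activation_dtype, "decode"))
```

With a generic flag name, extending INT8 to decode later would only change the dispatch helper and would not require renaming the user-facing option.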

@digantdesai digantdesai merged commit 62ba859 into gh/digantdesai/50/base Apr 30, 2026
382 of 397 checks passed
@digantdesai digantdesai deleted the gh/digantdesai/50/head branch April 30, 2026 15:04
digantdesai added a commit that referenced this pull request Apr 30, 2026
…oE (#19188)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #19190
* __->__ #19188
* #19187

Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed
weights (same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.
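
As an illustration of the [N, K//2] packed layout and the two-way dispatch described above, here is a small pure-Python sketch. The nibble order (even index in the low nibble), the signed [-8, 7] encoding, and the helper names `pack_int4_row`, `unpack_int4_row`, and `choose_path` are assumptions made for this example. The actual kernels may use a different nibble order or an unsigned encoding with a zero point.

```python
def pack_int4_row(values):
    """Pack K signed 4-bit values (range [-8, 7]) into K // 2 bytes.

    Assumed layout: even-indexed value in the low nibble, odd-indexed in the high.
    """
    if len(values) % 2 != 0:
        raise ValueError("K must be even to pack two values per byte")
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in signed 4 bits")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Two's-complement nibbles: -8..-1 map to 8..15.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4_row(packed):
    """Inverse of pack_int4_row: K // 2 bytes back to K signed values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def choose_path(m):
    """Dual dispatch as described in the commit message (labels are descriptive)."""
    if m < 1:
        raise ValueError("M must be at least 1")
    return "int4_matvec" if m == 1 else "dequant_w4_to_bf16 + F.linear"


if __name__ == "__main__":
    row = [-8, 7, 0, -1, 3, -4]
    packed = pack_int4_row(row)
    print(len(packed), unpack_int4_row(packed) == row)
    print(choose_path(1), "|", choose_path(3000))
```

Because both paths read the same packed bytes, a single copy of the weights can serve decode and prefill, which is the trade-off the commit message describes.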

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>
digantdesai added a commit that referenced this pull request Apr 30, 2026
… runner (#19190)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #19190
* #19188
* #19187

Runner now uses llm::Stats with proper timestamps for model load, prefill,
decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h
print_report format: PyTorchObserver JSON line plus human-readable table.

This commit was authored with the assistance of Claude Code.
Gasoonjia pushed a commit that referenced this pull request Apr 30, 2026
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #19187 by @digantdesai
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/orig

@diff-train-skip-merge

Co-authored-by: Digant Desai <digantdesai@meta.com>