Commit 9bf68c3

Author: shijiashuai

docs: refresh benchmark guidance and cleanup tracked binary

Update the English and Chinese docs to match the new benchmark semantics, Tensor Core fallback behavior, and current CI/build workflow. Drop the committed benchmark binary so generated artifacts stop polluting the repository.

1 parent: be983b5. 3 files changed: 111 additions, 91 deletions.
CONTRIBUTING.md

Lines changed: 14 additions & 0 deletions

````diff
@@ -12,12 +12,26 @@
 
 ## Build and Test
 
+Prefer CMake first:
+
+```bash
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(nproc)
+./build/bin/sgemm_benchmark
+cmake --build build --target test_sgemm
+ctest --test-dir build
+```
+
+The Makefile also works for quick local builds:
+
 ```bash
 make GPU_ARCH=sm_86
 make benchmark
 make test
 ```
 
+Note: GitHub Actions currently runs the format check and a containerized CUDA compile-only build; CUDA runtime tests still need to run locally or on a GPU-equipped runner.
+
 ## Coding Style
 
 - CUDA code follows the project's existing style
````

README.md

Lines changed: 46 additions & 36 deletions

````diff
@@ -8,18 +8,26 @@
 
 English | [简体中文](README.zh-CN.md)
 
-Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to **Tensor Core WMMA reaching 40% of cuBLAS throughput**.
+Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to a guarded Tensor Core WMMA path with explicit mixed-precision benchmarking.
 
-## Performance (RTX 3060 Laptop, 1024×1024×1024)
+## Performance
 
-| Kernel | GFLOPS | vs cuBLAS | Time | Key Technique |
-|--------|-------:|----------:|-----:|---------------|
-| **cuBLAS** (ref) | 5727 | 100% | 0.375 ms | NVIDIA optimized library |
-| **Tensor Core** (WMMA) | 2300 | 40.2% | 0.934 ms | FP16→FP32 mixed precision |
-| **Tiled** (32×32) | 753 | 13.1% | 2.853 ms | Shared memory blocking |
-| **Double Buffer** | 701 | 12.2% | 3.064 ms | Compute-memory overlap |
-| **Bank Conflict Free** | 673 | 11.8% | 3.190 ms | Shared memory padding (+1) |
-| **Naive** | 604 | 10.6% | 3.553 ms | One thread per output element |
+The exact GFLOPS you see will depend on GPU model, CUDA version, and problem size.
+The benchmark now reports two Tensor Core views:
 
-*All kernels verified against cuBLAS (allclose: rtol=1e-3, atol=1e-4; Tensor Core: rtol=5e-2)*
+- **Tensor Core (WMMA end-to-end)**: includes FP32→FP16 conversion and safe fallback for non-WMMA-compatible dimensions.
+- **Tensor Core (WMMA compute-only)**: times only the WMMA compute path and is shown only when `M`, `K`, and `N` are multiples of 16.
+
+Verification tolerances are centralized in code:
+
+- Standard FP32 kernels: `rtol=1e-3`, `atol=1e-4`
+- Tensor Core mixed-precision path: `rtol=5e-2`, `atol=1e-2`
+
+The default benchmark set includes:
+
+- aligned square cases: `512x512x512`, `1024x1024x1024`
+- one aligned non-square case: `256x384x640`
+- one unaligned edge case: `511x513x1025`, to exercise the safe Tensor Core fallback
+
+> Note: the printed theoretical peak and roofline numbers are approximate analytical references, not exact hardware limits.
 
````
````diff
@@ -26,22 +34,13 @@
 ## Optimization Roadmap
 
 ```
-┌─────────┐     ┌──────────┐     ┌──────────────┐     ┌───────────────┐
-│  Naive  │────▶│  Tiled   │────▶│  Bank-Free   │────▶│ Double Buffer │
-│ 604 GF  │     │  753 GF  │     │   673 GF     │     │    701 GF     │
-└─────────┘     └──────────┘     └──────────────┘     └───────┬───────┘
-                                                              │
-                                                              ▼
-                                                  ┌───────────────────┐
-                                                  │    Tensor Core    │
-                                                  │  2300 GF (WMMA)   │
-                                                  └───────────────────┘
+Naive -> Tiled -> Bank-Free -> Double Buffer -> Tensor Core (WMMA)
 ```
 
 | Stage | What Changes | Why It Helps |
 |-------|-------------|--------------|
 | **Naive → Tiled** | Load tiles into shared memory | Data reuse reduces global memory traffic by TILE_SIZE× |
 | **Tiled → Bank-Free** | Pad shared memory `[32][33]` | Eliminates 32-way bank conflicts on column access |
-| **Bank-Free → Double Buffer** | Two shared-memory buffers | Overlaps next-tile load with current-tile compute |
+| **Bank-Free → Double Buffer** | Two shared-memory buffers | Restructures tile staging and buffering to reduce memory stalls |
 | **→ Tensor Core** | WMMA API `mma_sync` | Dedicated matrix units, ~8× peak over CUDA cores |
 
````
````diff
@@ -48,14 +47,21 @@
 ## Build & Run
 
-```bash
-# Makefile (adjust GPU arch for your hardware)
-make GPU_ARCH=sm_86
-make benchmark
+Recommended path: CMake
 
-# Or CMake
+```bash
 cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 ./build/bin/sgemm_benchmark
+./build/bin/sgemm_benchmark --dims 256 384 640
+./build/bin/sgemm_benchmark -a
+```
+
+Quick local path: Makefile
+
+```bash
+make GPU_ARCH=sm_86
+make benchmark
+make test
 ```
 
 ## Project Structure
````
````diff
@@ -83,18 +89,22 @@ sgemm-optimization/
 
 ## Testing
 
-Property-based tests with Google Test:
+Google Test coverage includes:
 
 | Property | What It Verifies |
 |----------|-----------------|
-| **Numerical correctness** | All kernels match cuBLAS output (allclose) |
-| **Tensor Core tolerance** | Correct under relaxed FP16 tolerance |
-| **Error detection** | Verification system catches injected errors |
-| **Dimension invariance** | All kernels handle arbitrary aligned sizes |
+| **Numerical correctness** | Standard kernels match cuBLAS across square and non-square cases |
+| **Tensor Core fast path** | WMMA path is validated on `16`-aligned dimensions |
+| **Tensor Core fallback** | Non-aligned dimensions safely fall back to an FP32 kernel |
+| **Small/edge inputs** | Includes `1x1x1` and unaligned edge cases |
+| **Error detection** | Verification helpers stay consistent with benchmark tolerances |
 
 ```bash
+cmake --build build --target test_sgemm
+ctest --test-dir build
+
+# or
 make test
-# Or: cmake --build build --target test_sgemm && ctest --test-dir build
 ```
 
 ## GPU Architecture Reference
````
````diff
@@ -109,10 +119,10 @@ make test
 
 ## Engineering Quality
 
-- **Build**: CMake 3.18+ with `target_include_directories`, `target_compile_options` (generator expressions), FetchContent for GTest v1.14.0
+- **Build**: CMake 3.18+ is the primary build system; Makefile remains available for quick local use
 - **Code style**: clang-format enforced via CI
-- **CI**: GitHub Actions CUDA container build + format check
-- **Testing**: Google Test property-based verification against cuBLAS
+- **CI**: GitHub Actions runs format checks and a containerized CUDA compile-only build; GPU runtime tests are still local / dedicated-runner only
+- **Testing**: Google Test verification against cuBLAS, including Tensor Core fallback and edge-size coverage
 
 ## References
 
````

README.zh-CN.md

Lines changed: 51 additions & 55 deletions
````diff
@@ -12,20 +12,26 @@
 
 This project implements a progressive CUDA SGEMM (Single-precision General Matrix Multiply) optimization, evolving from the simplest triple loop toward cuBLAS-level performance and demonstrating the core optimization techniques of GPU programming.
 
-## Measured Performance
+## Performance Notes
 
-1024×1024×1024 matrix multiplication performance on an NVIDIA GeForce RTX 3060 Laptop GPU (sm_86):
+Actual GFLOPS depends on GPU model, CUDA version, and problem size.
+The benchmark now reports two Tensor Core views:
 
-| Kernel | GFLOPS | vs cuBLAS | Status |
-|--------|--------|-----------|------|
-| cuBLAS (reference) | 5727 | 100% | ✅ PASS |
-| Tensor Core (WMMA) | 2300 | 40.2% | ✅ PASS |
-| Tiled (32×32) | 753 | 13.1% | ✅ PASS |
-| Double Buffer | 701 | 12.2% | ✅ PASS |
-| Bank Conflict Free | 673 | 11.8% | ✅ PASS |
-| Naive | 604 | 10.6% | ✅ PASS |
+- **Tensor Core (WMMA end-to-end)**: includes the FP32→FP16 conversion and the safe fallback for non-WMMA-compatible sizes.
+- **Tensor Core (WMMA compute-only)**: times only the WMMA compute path, shown only when `M/K/N` are all multiples of 16.
 
-*All kernels pass correctness verification against cuBLAS*
+Verification tolerances are unified across the project:
+
+- Standard FP32 kernels: `rtol=1e-3`, `atol=1e-4`
+- Tensor Core mixed-precision path: `rtol=5e-2`, `atol=1e-2`
+
+The default benchmark set includes:
+
+- aligned square cases: `512x512x512`, `1024x1024x1024`
+- one aligned non-square case: `256x384x640`
+- one unaligned edge case: `511x513x1025`, to exercise the safe Tensor Core fallback
+
+> Note: the theoretical peak and roofline figures in the program output are approximate analytical values, not strict hardware peaks.
 
 ## Optimized Variants
 
````

````diff
@@ -34,7 +40,7 @@
 | Naive | Basic triple loop | One output element per thread |
 | Tiled | Shared-memory tiling | Data reuse, fewer global memory accesses |
 | Bank Conflict Free | Eliminates bank conflicts | Shared-memory padding (+1) |
-| Double Buffer | Double-buffered pipeline | Overlaps compute with memory access |
+| Double Buffer | Double-buffered pipeline | Restructured tile staging to reduce memory stalls |
 | Tensor Core | WMMA API | Hardware-accelerated matrix math (FP16→FP32) |
 
 ## Build and Run
````
````diff
@@ -46,52 +52,36 @@
 - GPU: Volta (sm_70) or newer architecture
 - Google Test (optional, for property-based tests)
 
-### Build
+### Build and Run
 
-```bash
-# Adjust for your GPU architecture (sm_86 for the RTX 30 series)
-make GPU_ARCH=sm_86
+Prefer CMake first:
 
-# Or use the default architecture
-make
+```bash
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(nproc)
+./build/bin/sgemm_benchmark
+./build/bin/sgemm_benchmark --dims 256 384 640
+./build/bin/sgemm_benchmark -a
 ```
 
-### Run
+The Makefile also works for quick local builds:
 
 ```bash
-# Run the benchmark
-./build/sgemm_benchmark
-
-# Or via make
+make GPU_ARCH=sm_86
 make benchmark
-
-# Clean the build
+make test
 make clean
 ```
 
-### Output Example
+### Output Notes
 
-```
-===============================================================
-  SGEMM Optimization Benchmark Suite
-===============================================================
-GPU Device: NVIDIA GeForce RTX 3060 Laptop GPU
-Compute Capability: 8.6
-SM Count: 30
-
-===============================================================
-  Benchmarking 1024 x 1024 x 1024 SGEMM
-===============================================================
-
-Kernel              | Dimensions         | Time    | Performance | Pass
------------------------------------------------------------------------
-cuBLAS              | 1024 x 1024 x 1024 | 0.375ms | 5726 GFLOPS | PASS
-Naive               | 1024 x 1024 x 1024 | 3.553ms | 604 GFLOPS  | PASS
-Tiled (32x32)       | 1024 x 1024 x 1024 | 2.853ms | 753 GFLOPS  | PASS
-Bank Conflict Free  | 1024 x 1024 x 1024 | 3.190ms | 673 GFLOPS  | PASS
-Double Buffer       | 1024 x 1024 x 1024 | 3.064ms | 701 GFLOPS  | PASS
-Tensor Core (WMMA)  | 1024 x 1024 x 1024 | 0.934ms | 2300 GFLOPS | PASS
-```
+The program prints per-kernel results that depend on the input sizes and GPU capability.
+When the sizes satisfy the WMMA requirements, it also shows:
+
+- `Tensor Core (WMMA end-to-end)`: includes FP32→FP16 conversion and safe-fallback semantics
+- `Tensor Core (WMMA compute-only)`: times only the WMMA compute path
+
+Exact numbers vary with GPU, CUDA version, and size, so the README no longer pins a performance table from a single machine.
 
 ## Project Layout
 
````

````diff
@@ -208,6 +198,9 @@ for (int t = 0; t < numTiles; ++t) {
 
 ### 5. Tensor Core (WMMA API)
 
+In the current implementation, the Tensor Core entry point automatically falls back to the stable FP32 kernel when the device is below `sm_70` or `M/K/N` are not 16-aligned, rather than attempting unsafe WMMA accesses.
+
+
 **Features:**
 - Dedicated matrix compute units executing D = A×B + C
 - FP16 inputs with FP32 accumulation (mixed precision)
````
````diff
@@ -270,18 +263,21 @@ bool passed = abs_error <= atol + rtol * fabs(ref_val);
 
 ## Property-Based Testing
 
-The test file `tests/test_sgemm.cu` contains the following property tests:
+The test file `tests/test_sgemm.cu` currently covers:
 
-1. **Property 1: Kernel Numerical Correctness** - all kernels agree with cuBLAS
-2. **Property 2: Tensor Core Correctness** - Tensor Core is correct under relaxed tolerance
-3. **Property 3: Error Detection** - the verification system correctly detects errors
-4. **Property 4: Dimension Invariance** - all kernels support arbitrary aligned dimensions
+1. **Standard kernel numerical correctness** - square and non-square results match cuBLAS
+2. **Tensor Core fast path** - the WMMA path is validated on 16-aligned sizes
+3. **Tensor Core fallback** - non-aligned sizes safely fall back to the FP32 kernel
+4. **Small and edge sizes** - includes `1x1x1` and unaligned edge cases
+5. **Error detection** - verification helpers stay consistent with the benchmark tolerances
 
-Running the tests requires Google Test:
+Run the tests:
 ```bash
-# After installing Google Test
+cmake --build build --target test_sgemm
+ctest --test-dir build
+
+# or
 make test
-./build/test_sgemm
 ```
 
 ## Interview Takeaways
````
