docs: refresh benchmark guidance and cleanup tracked binary
Update the English and Chinese docs to match the new benchmark semantics, Tensor Core fallback behavior, and current CI/build workflow. Drop the committed benchmark binary so generated artifacts stop polluting the repository.
README.md: 46 additions & 36 deletions
@@ -8,54 +8,60 @@
 
 English | [简体中文](README.zh-CN.md)
 
-Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to **Tensor Core WMMA reaching 40% of cuBLAS throughput**.
+Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to a guarded Tensor Core WMMA path with explicit mixed-precision benchmarking.
 
-## Performance (RTX 3060 Laptop, 1024×1024×1024)
+## Performance
 
-| Kernel | GFLOPS | vs cuBLAS | Time | Key Technique |