
Commit 20c505d: Update README.md
1 parent 264b619 commit 20c505d

File tree: 1 file changed (+23 -10 lines)


kernels/hgemm/README.md

Lines changed: 23 additions & 10 deletions
@@ -3,7 +3,7 @@
 
 ![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)
 
-[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library of HGEMM kernels written from scratch using Tensor Cores via the WMMA, MMA PTX and CuTe APIs, reaching `98%~100%` of **cuBLAS** performance. The code here is sourced from 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) ![](https://img.shields.io/github/stars/DefTruth/CUDA-Learn-Notes.svg?style=social) and exported as a standalone library; please check out [CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) for the latest updates. Welcome to 🌟👆🏻star this repo to support me, thanks ~ 🎉🎉
+[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library of HGEMM kernels written from scratch using Tensor Cores via the WMMA, MMA PTX and CuTe APIs, reaching `98%~100%` of **cuBLAS** performance. The code here is sourced from 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) ![](https://img.shields.io/github/stars/DefTruth/CUDA-Learn-Notes.svg?style=social) and exported as a standalone library; please check out [CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) for the latest updates. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
 
 <div id="hgemm-sgemm"></div>
 
@@ -83,26 +83,34 @@ void hgemm_mma_stages_block_swizzle_tn_cute(torch::Tensor a, torch::Tensor b, to
 
 ## 📖 Contents
 
+- [📖 Prerequisites](#prerequisites)
 - [📖 Installation](#install)
-- [📖 Python/C++ Testing](#test)
+- [📖 Python Testing](#test)
+- [📖 C++ Testing](#test-cpp)
 - [📖 NVIDIA L20 bench](#perf-l20)
 - [📖 NVIDIA RTX 4090 bench](#perf-4090)
 - [📖 NVIDIA RTX 3080 Laptop bench](#perf-3080)
 - [📖 Docs](#opt-docs)
 - [📖 References](#ref)
 
+## 📖 Prerequisites
+<div id="prerequisites"></div>
+
+- PyTorch >= 2.0, CUDA >= 12.0
+- Recommended: PyTorch >= 2.5.1, CUDA >= 12.6
+
 ## 📖 Installation
 
 <div id="install"></div>
 
-The HGEMM kernels implemented in this repo can be installed as a Python library, `toy-hgemm` (optional)
+The HGEMM kernels implemented in this repo can be installed as a Python library, `toy-hgemm` (optional).
 ```bash
 cd kernels/hgemm
 git submodule update --init --recursive --force # fetch the required `CUTLASS` submodule
 python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl # pip uninstall toy-hgemm -y
 ```
 
-## 📖 Python/C++ Testing
+## 📖 Python Testing
 
 <div id="test"></div>
 
@@ -111,7 +119,7 @@ python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl # pip un
 git submodule update --init --recursive --force
 ```
 
-**Python**: Test the many custom HGEMM kernels via the Python script and compare their performance.
+You can test the many custom HGEMM kernels via the Python script and compare their performance.
 
 ```bash
 # You can test Ada or Ampere only; also Volta, Ampere, Ada, Hopper, ...
@@ -134,7 +142,12 @@ python3 hgemm.py --mma-all --plot --topk 8
 # test default mma kernels & cute hgemm kernels with smem swizzle for all MNK
 python3 hgemm.py --cute-tn --mma --plot
 ```
-**C++**: The HGEMM benchmark also supports C++ testing. Currently, it supports comparisons between the following implementations:
+
+## 📖 C++ Testing
+
+<div id="test-cpp"></div>
+
+The HGEMM benchmark also supports C++ testing. Currently, it supports comparisons between the following implementations:
 
 - MMA HGEMM NN implemented in this repository
 - CuTe HGEMM TN implemented in this repository
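As a hedged aside on terminology: "NN" and "TN" refer to how `B` is stored for the product C = A × B, with TN keeping B transposed (row-major along K). A minimal, library-agnostic sketch, with all names and sizes purely illustrative:

```python
# Illustrative only: what NN vs TN storage means for C = A @ B.
M, N, K = 2, 3, 4
A = [[float(i * K + k) for k in range(K)] for i in range(M)]      # A: M x K, row-major
B_nn = [[float(k * N + j) for j in range(N)] for k in range(K)]   # NN: B stored K x N
B_tn = [[B_nn[k][j] for k in range(K)] for j in range(N)]         # TN: B stored N x K

# Both layouts compute the same product; only the indexing into B differs.
C_nn = [[sum(A[i][k] * B_nn[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
C_tn = [[sum(A[i][k] * B_tn[j][k] for k in range(K)) for j in range(N)] for i in range(M)]
print(C_nn == C_tn)  # True
```

The TN form lets both A and B be read contiguously along K, which is why Tensor Core kernels often prefer it.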
@@ -178,7 +191,7 @@ M N K = 16384 16384 16384, Time = 0.07668429 0.07669371 0.07670784 s, A
 
 <div id="perf-l20"></div>
 
-### NVIDIA L20
+### 📖 NVIDIA L20
 <!--
 The current best implementation reaches roughly `99~100+%` of cuBLAS performance overall on the L20 (theoretical Tensor Cores FP16 throughput: 119.5 TFLOPS). The WMMA API reaches about `95%~98%` of cuBLAS (105-113 TFLOPS vs 105-115 TFLOPS); the MMA API reaches 115 TFLOPS and surpasses cuBLAS in some cases. The CuTe HGEMM version implements Block Swizzle (L2-cache friendly) and SMEM Swizzle (bank-conflict free) and performs best, reaching 116-117 TFLOPS on large matrix multiplies, roughly `98%~100%+` of cuBLAS, surpassing cuBLAS in many cases. Bank conflicts are currently mitigated via SMEM Padding and SMEM Swizzle: the NN layout uses SMEM Padding to reduce bank conflicts, while the TN layout eliminates them via the CUTLASS/CuTe SMEM Swizzle.
 -->
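The SMEM-padding technique mentioned in the optimization note above can be seen with simple bank arithmetic. A hedged sketch, assuming the usual 32 shared-memory banks of 4 bytes and a 64-element fp16 tile row; the numbers are illustrative, not taken from these kernels:

```python
# Illustrative bank math: 32 four-byte banks, fp16 (2-byte) elements.
BANKS, BANK_BYTES, ELEM_BYTES = 32, 4, 2

def bank(row, col, row_elems):
    """Shared-memory bank hit by element (row, col) of a row-major tile."""
    byte_offset = (row * row_elems + col) * ELEM_BYTES
    return (byte_offset // BANK_BYTES) % BANKS

# Unpadded 64-element rows: walking down a column hits the same bank every time.
unpadded = {bank(r, 0, 64) for r in range(32)}
# Padding each row by 8 fp16 elements spreads the column over 8 distinct banks.
padded = {bank(r, 0, 64 + 8) for r in range(32)}
print(len(unpadded), len(padded))  # 1 8
```

Padding costs shared memory for the dead columns, which is why swizzled layouts (as in CUTLASS/CuTe) are preferred when the layout permits.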
@@ -204,7 +217,7 @@ The command for testing all MNK setups (Tip: Performance data for each MNK teste
 python3 hgemm.py --cute-tn --mma --plot
 ```
 
-### NVIDIA GeForce RTX 4090
+### 📖 NVIDIA GeForce RTX 4090
 
 <div id="perf-4090"></div>
 
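The TFLOPS figures quoted throughout these benchmarks follow from the standard 2·M·N·K FLOP count for GEMM. A quick arithmetic sketch, plugging in the 16384³ mean time quoted in the benchmark line earlier:

```python
# A GEMM does 2*M*N*K floating-point ops; TFLOPS = FLOPs / seconds / 1e12.
M = N = K = 16384
seconds = 0.07669371  # mean time from the benchmark output quoted earlier
tflops = 2 * M * N * K / seconds / 1e12
print(round(tflops, 1))  # ~114.7, consistent with the ~115 TFLOPS L20 figures
```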
@@ -224,7 +237,7 @@ On the NVIDIA RTX 4090 (with an FP16 Tensor Cores performance of 330 TFLOPS), th
 python3 hgemm.py --cute-tn --mma --wmma-all --plot
 ```
 
-### NVIDIA GeForce RTX 3080 Laptop
+### 📖 NVIDIA GeForce RTX 3080 Laptop
 
 <div id="perf-3080"></div>
 
@@ -240,7 +253,7 @@ python3 hgemm.py --wmma-all --plot
 ```
 
 <details>
-<summary> 🔑️ Performance Optimization Notes(TODO) !Click here! </summary>
+<summary> 🔑️ Performance Optimization Notes(TODO)</summary>
 
 ## 📖 Performance Optimization Notes
 
