Add fused_transpose_quant op #10644

Open

wants to merge 1 commit into dsv3_dev from fused-transpose-quant-2

Conversation


@lshpku lshpku commented May 23, 2025

PR types

Performance optimization

PR changes

APIs

Description

Adds a fused_transpose_quant operator, defined as follows:

/**
 * Quantizes X along dim[-2], then transposes dim[-1] and dim[-2] of X.
 *
 * Inputs:
 *   X    : [*, M, K], bfloat16
 *
 * Outputs:
 *   out  : [*, K, M], float8_e4m3fn
 *   scale: [*, M/128, K], float32
 *
 * Requirements:
 *   1) batch_size <= 65535
 *   2) M <= 65535 * 128 and M % 128 == 0
 */
std::vector<paddle::Tensor> fused_transpose_quant(const paddle::Tensor& X);

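To make the interface above concrete, the reference loop below spells out the semantics this op is assumed to implement: each scale entry covers one group of 128 consecutive rows along M for a single column k, and out holds the quantized values in the transposed [*, K, M] layout. The amax/448 scaling rule (448 being the largest finite float8_e4m3fn value) is an assumption typical of e4m3 block quantization and is not stated in this PR; the helper name `fused_transpose_quant_ref` is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Reference semantics (naive loops, hypothetical helper name).
// Values are widened to float here; the real op reads bfloat16 and writes float8_e4m3fn.
void fused_transpose_quant_ref(const float* X,   // [B, M, K]
                               float* out,       // [B, K, M] (quantized values, kept as float)
                               float* scale,     // [B, M/128, K]
                               int64_t B, int64_t M, int64_t K) {
  for (int64_t b = 0; b < B; ++b) {
    for (int64_t g = 0; g < M / 128; ++g) {       // one quantization group = 128 rows along M
      for (int64_t k = 0; k < K; ++k) {
        float amax = 0.f;
        for (int64_t r = 0; r < 128; ++r)
          amax = std::fmax(amax, std::fabs(X[(b * M + g * 128 + r) * K + k]));
        // Assumed scaling rule: 448 is the largest finite float8_e4m3fn value.
        float s = amax / 448.f;
        scale[(b * (M / 128) + g) * K + k] = s;
        for (int64_t r = 0; r < 128; ++r) {
          int64_t m = g * 128 + r;
          float q = (s == 0.f) ? 0.f : X[(b * M + m) * K + k] / s;
          out[(b * K + k) * M + m] = q;           // note the transposed [B, K, M] layout
        }
      }
    }
  }
}
```
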
Implementation highlights

  1. The transpose works on 128x128 tiles, giving high parallelism (a simplified sketch of this tiling follows after this list).
  2. Since the M dimension is guaranteed to be a multiple of 128, write-back to out always uses 4x vectorization.
  3. Since the K dimension carries no alignment guarantee, separate 1x/2x/4x vectorized instantiations are used for reading X and writing scale.
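The following CUDA sketch illustrates the tiling and vectorization strategy described above under simplified assumptions. It is not the kernel added in this PR: it omits the fp8 conversion and the per-group scale computation, keeps the data as bf16, and only shows the 4x write path, but it demonstrates the 128x128 shared-memory tile, the scalar loads along the unaligned K dimension, and the 4-element vectorized stores along the 128-aligned M dimension. The kernel name and launch configuration are hypothetical.

```cuda
#include <cuda_bf16.h>

// Simplified illustration of the 128x128 tiled transpose (not the PR's kernel):
// bf16 in, bf16 out, no quantization, 4x vectorized stores along M only.
__global__ void transpose_tile_128x128(const __nv_bfloat16* __restrict__ X,   // [M, K]
                                       __nv_bfloat16* __restrict__ out,       // [K, M]
                                       int M, int K) {
  // 128x128 bf16 tile; the padding columns reduce shared-memory bank conflicts.
  __shared__ __nv_bfloat16 tile[128][128 + 8];

  const int tile_m = blockIdx.y * 128;   // offset along M (always a full tile, since M % 128 == 0)
  const int tile_k = blockIdx.x * 128;   // offset along K (may be a partial tile)

  // Cooperative load with scalar reads, since K carries no alignment guarantee.
  for (int i = threadIdx.x; i < 128 * 128; i += blockDim.x) {
    int r = i / 128, c = i % 128;                      // r along M, c along K
    int m = tile_m + r, k = tile_k + c;
    tile[r][c] = (k < K) ? X[(size_t)m * K + k] : __float2bfloat16(0.f);
  }
  __syncthreads();

  // Transposed write-back: M % 128 == 0, so a 4-element (8-byte) store along M
  // is always in bounds and naturally aligned.
  for (int i = threadIdx.x; i < 128 * (128 / 4); i += blockDim.x) {
    int r = i / 32, c4 = (i % 32) * 4;                 // r along K, c4 along M
    int k = tile_k + r, m = tile_m + c4;
    if (k < K) {
      union { __nv_bfloat16 h[4]; float2 vec; } v;     // pack 4 bf16 into one 8-byte store
      v.h[0] = tile[c4 + 0][r]; v.h[1] = tile[c4 + 1][r];
      v.h[2] = tile[c4 + 2][r]; v.h[3] = tile[c4 + 3][r];
      *reinterpret_cast<float2*>(&out[(size_t)k * M + m]) = v.vec;
    }
  }
}

// Hypothetical launch: one block per 128x128 tile, 256 threads per block.
// transpose_tile_128x128<<<dim3((K + 127) / 128, M / 128), 256>>>(X, out, M, K);
```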

Performance test

Preliminary benchmarks were run on an A100-40G. Since the A100 does not support fp8, int8 was used in place of the fp32-to-fp8 cast; this should still be broadly representative of the performance on H-series (Hopper) GPUs.

| Input x.shape | Time (ns) | Bandwidth (GB/s) | Bandwidth utilization | Notes |
| --- | --- | --- | --- | --- |
| [4, 7168, 4096] | 278,176 | 1280 | 82.3% | 4x vectorization |
| [4, 7168, 4096 + 2] | 294,663 | 1208 | 77.7% | 2x vectorization |
| [4, 7168, 4096 + 1] | 404,087 | 881 | 56.7% | no vectorization |
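For reference, these bandwidth figures are consistent with the expected traffic of roughly 3 bytes per input element (2 B bf16 read + 1 B fp8 write, plus a small fp32 scale write): for [4, 7168, 4096] that is about 4 × 7168 × 4096 × 3 B ≈ 352 MB plus ~4 MB of scales, and 356 MB / 278 µs ≈ 1280 GB/s, i.e. about 82% of the A100-40G's ~1555 GB/s peak HBM bandwidth, which is assumed here as the utilization baseline.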

Pcard-85711

paddle-bot bot commented May 23, 2025

Thanks for your contribution!

@lshpku lshpku force-pushed the fused-transpose-quant-2 branch 2 times, most recently from 422e190 to adf38e5 on May 23, 2025 05:45
@lshpku lshpku force-pushed the fused-transpose-quant-2 branch from adf38e5 to 56f7fe6 on May 23, 2025 06:28