Add fused_transpose_quant op #10644

Open

wants to merge 1 commit into dsv3_dev from fused-transpose-quant-2

Conversation


@lshpku lshpku commented May 23, 2025

PR types

Performance optimization

PR changes

APIs

Description

Adds a fused_transpose_quant operator, defined as follows:

/**
 * Quantizes X along dim[-2], then transposes dim[-1] and dim[-2] of X.
 *
 * Inputs:
 *   X    : [*, M, K], bfloat16
 *
 * Outputs:
 *   out  : [*, K, M], float8_e4m3fn
 *   scale: [*, M/128, K], float32
 *
 * Requirements:
 *   1) batch_size <= 65535
 *   2) M <= 65535 * 128 and M % 128 == 0
 */
std::vector<paddle::Tensor> fused_transpose_quant(const paddle::Tensor& X);

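To make the interface above concrete, the reference loop below spells out the semantics this op is assumed to implement: each scale entry covers one group of 128 consecutive rows along M for a single column k, and out holds the quantized values in the transposed [*, K, M] layout. The amax/448 scaling rule (448 being the largest finite float8_e4m3fn value) is an assumption typical of e4m3 block quantization and is not stated in this PR; the helper name `fused_transpose_quant_ref` is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Reference semantics (naive loops, hypothetical helper name).
// Values are widened to float here; the real op reads bfloat16 and writes float8_e4m3fn.
void fused_transpose_quant_ref(const float* X,   // [B, M, K]
                               float* out,       // [B, K, M] (quantized values, kept as float)
                               float* scale,     // [B, M/128, K]
                               int64_t B, int64_t M, int64_t K) {
  for (int64_t b = 0; b < B; ++b) {
    for (int64_t g = 0; g < M / 128; ++g) {       // one quantization group = 128 rows along M
      for (int64_t k = 0; k < K; ++k) {
        float amax = 0.f;
        for (int64_t r = 0; r < 128; ++r)
          amax = std::fmax(amax, std::fabs(X[(b * M + g * 128 + r) * K + k]));
        // Assumed scaling rule: 448 is the largest finite float8_e4m3fn value.
        float s = amax / 448.f;
        scale[(b * (M / 128) + g) * K + k] = s;
        for (int64_t r = 0; r < 128; ++r) {
          int64_t m = g * 128 + r;
          float q = (s == 0.f) ? 0.f : X[(b * M + m) * K + k] / s;
          out[(b * K + k) * M + m] = q;           // note the transposed [B, K, M] layout
        }
      }
    }
  }
}
```
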
Implementation highlights

  1. The transpose works on 128x128 tiles, giving high parallelism (a simplified sketch of this tiling follows after this list).
  2. Since the M dimension is guaranteed to be a multiple of 128, write-back to out always uses 4x vectorization.
  3. Since the K dimension carries no alignment guarantee, separate 1x/2x/4x vectorized instantiations are used for reading X and writing scale.
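The following CUDA sketch illustrates the tiling and vectorization strategy described above under simplified assumptions. It is not the kernel added in this PR: it omits the fp8 conversion and the per-group scale computation, keeps the data as bf16, and only shows the 4x write path, but it demonstrates the 128x128 shared-memory tile, the scalar loads along the unaligned K dimension, and the 4-element vectorized stores along the 128-aligned M dimension. The kernel name and launch configuration are hypothetical.

```cuda
#include <cuda_bf16.h>

// Simplified illustration of the 128x128 tiled transpose (not the PR's kernel):
// bf16 in, bf16 out, no quantization, 4x vectorized stores along M only.
__global__ void transpose_tile_128x128(const __nv_bfloat16* __restrict__ X,   // [M, K]
                                       __nv_bfloat16* __restrict__ out,       // [K, M]
                                       int M, int K) {
  // 128x128 bf16 tile; the padding columns reduce shared-memory bank conflicts.
  __shared__ __nv_bfloat16 tile[128][128 + 8];

  const int tile_m = blockIdx.y * 128;   // offset along M (always a full tile, since M % 128 == 0)
  const int tile_k = blockIdx.x * 128;   // offset along K (may be a partial tile)

  // Cooperative load with scalar reads, since K carries no alignment guarantee.
  for (int i = threadIdx.x; i < 128 * 128; i += blockDim.x) {
    int r = i / 128, c = i % 128;                      // r along M, c along K
    int m = tile_m + r, k = tile_k + c;
    tile[r][c] = (k < K) ? X[(size_t)m * K + k] : __float2bfloat16(0.f);
  }
  __syncthreads();

  // Transposed write-back: M % 128 == 0, so a 4-element (8-byte) store along M
  // is always in bounds and naturally aligned.
  for (int i = threadIdx.x; i < 128 * (128 / 4); i += blockDim.x) {
    int r = i / 32, c4 = (i % 32) * 4;                 // r along K, c4 along M
    int k = tile_k + r, m = tile_m + c4;
    if (k < K) {
      union { __nv_bfloat16 h[4]; float2 vec; } v;     // pack 4 bf16 into one 8-byte store
      v.h[0] = tile[c4 + 0][r]; v.h[1] = tile[c4 + 1][r];
      v.h[2] = tile[c4 + 2][r]; v.h[3] = tile[c4 + 3][r];
      *reinterpret_cast<float2*>(&out[(size_t)k * M + m]) = v.vec;
    }
  }
}

// Hypothetical launch: one block per 128x128 tile, 256 threads per block.
// transpose_tile_128x128<<<dim3((K + 127) / 128, M / 128), 256>>>(X, out, M, K);
```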

Performance test

Preliminary benchmarks were run on an A100-40G. Since the A100 does not support fp8, int8 was used in place of the fp32-to-fp8 cast; this should still be broadly representative of the performance on H-series (Hopper) GPUs.

| Input x.shape | Time (ns) | Bandwidth (GB/s) | Bandwidth utilization | Notes |
| --- | --- | --- | --- | --- |
| [4, 7168, 4096] | 278,176 | 1280 | 82.3% | 4x vectorization |
| [4, 7168, 4096 + 2] | 294,663 | 1208 | 77.7% | 2x vectorization |
| [4, 7168, 4096 + 1] | 404,087 | 881 | 56.7% | no vectorization |
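For reference, these bandwidth figures are consistent with the expected traffic of roughly 3 bytes per input element (2 B bf16 read + 1 B fp8 write, plus a small fp32 scale write): for [4, 7168, 4096] that is about 4 × 7168 × 4096 × 3 B ≈ 352 MB plus ~4 MB of scales, and 356 MB / 278 µs ≈ 1280 GB/s, i.e. about 82% of the A100-40G's ~1555 GB/s peak HBM bandwidth, which is assumed here as the utilization baseline.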

Pcard-85711

paddle-bot bot commented May 23, 2025

Thanks for your contribution!

@lshpku lshpku force-pushed the fused-transpose-quant-2 branch 2 times, most recently from 422e190 to adf38e5 on May 23, 2025 05:45
@lshpku lshpku force-pushed the fused-transpose-quant-2 branch from adf38e5 to 56f7fe6 on May 23, 2025 06:28