
Commit 1ebd88a

Optimize the Falcon block for inference (#500)
This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches (see the first sketch below).
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in #499) by turning the cached sin/cos tensors of RotaryEmbedding into buffers and disabling inference mode during their creation (see the second sketch below).
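To illustrate the first point, here is a minimal sketch of capturing a block prefix (layer norm plus a fused QKV projection) into a CUDA graph for a fixed single-token shape, so one graph replay stands in for several small kernel launches. The dimensions, module layout, and `run_fused_prefix` helper are illustrative assumptions, not the actual Petals code:

```python
import torch

hidden_size = 4544  # assumed Falcon-7B-like width, for illustration only
device = torch.device("cuda")

ln = torch.nn.LayerNorm(hidden_size).to(device)
qkv_proj = torch.nn.Linear(hidden_size, hidden_size + 2 * 64, bias=False).to(device)

# CUDA graphs replay fixed memory addresses, so inputs must be copied
# into a static buffer of a fixed (here, single-token) shape.
static_input = torch.zeros(1, 1, hidden_size, device=device)

# Warm up on a side stream so capture does not record one-time lazy init.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        qkv_proj(ln(static_input))
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = qkv_proj(ln(static_input))

def run_fused_prefix(hidden_states: torch.Tensor) -> torch.Tensor:
    # One graph replay instead of several small kernel launches.
    static_input.copy_(hidden_states)
    graph.replay()
    return static_output.clone()
```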
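For the backward-pass fix, the sketch below shows one way to precompute the sin/cos tables as non-persistent buffers while explicitly disabling inference mode, so the cached tensors remain usable by autograd. The class layout and `_build_cache` helper are assumptions based on the commit description, not the exact Petals implementation:

```python
import torch

INFERENCE_MAX_LENGTH = 8192  # per the commit description

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self._build_cache(INFERENCE_MAX_LENGTH)

    def _build_cache(self, seq_len: int) -> None:
        # Creating the cache with inference mode disabled keeps the tensors
        # from becoming "inference tensors", which would break backward
        # passes later (the issue discussed in #499).
        with torch.inference_mode(False):
            t = torch.arange(seq_len, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq)
            emb = torch.cat([freqs, freqs], dim=-1)
            self.register_buffer("cos_cached", emb.cos(), persistent=False)
            self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int):
        # Slicing the precomputed buffers avoids recomputing sin/cos
        # on every single-token forward pass.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```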
1 parent d40eb6c commit 1ebd88a

2 files changed: +518 −4 lines changed

