
Commit 1ebd88a

Optimize the Falcon block for inference (#500)
This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches (see the first sketch below).
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in #499) by turning the cached sin/cos tensors of RotaryEmbedding into buffers and disabling inference mode during their creation (see the second sketch below).
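To illustrate the first point, here is a minimal sketch of capturing a block prefix (layer norm plus a fused QKV projection) into a CUDA graph for a fixed single-token shape, so one graph replay stands in for several small kernel launches. The dimensions, module layout, and `run_fused_prefix` helper are illustrative assumptions, not the actual Petals code:

```python
import torch

hidden_size = 4544  # assumed Falcon-7B-like width, for illustration only
device = torch.device("cuda")

ln = torch.nn.LayerNorm(hidden_size).to(device)
qkv_proj = torch.nn.Linear(hidden_size, hidden_size + 2 * 64, bias=False).to(device)

# CUDA graphs replay fixed memory addresses, so inputs must be copied
# into a static buffer of a fixed (here, single-token) shape.
static_input = torch.zeros(1, 1, hidden_size, device=device)

# Warm up on a side stream so capture does not record one-time lazy init.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        qkv_proj(ln(static_input))
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = qkv_proj(ln(static_input))

def run_fused_prefix(hidden_states: torch.Tensor) -> torch.Tensor:
    # One graph replay instead of several small kernel launches.
    static_input.copy_(hidden_states)
    graph.replay()
    return static_output.clone()
```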
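For the backward-pass fix, the sketch below shows one way to precompute the sin/cos tables as non-persistent buffers while explicitly disabling inference mode, so the cached tensors remain usable by autograd. The class layout and `_build_cache` helper are assumptions based on the commit description, not the exact Petals implementation:

```python
import torch

INFERENCE_MAX_LENGTH = 8192  # per the commit description

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self._build_cache(INFERENCE_MAX_LENGTH)

    def _build_cache(self, seq_len: int) -> None:
        # Creating the cache with inference mode disabled keeps the tensors
        # from becoming "inference tensors", which would break backward
        # passes later (the issue discussed in #499).
        with torch.inference_mode(False):
            t = torch.arange(seq_len, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq)
            emb = torch.cat([freqs, freqs], dim=-1)
            self.register_buffer("cos_cached", emb.cos(), persistent=False)
            self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int):
        # Slicing the precomputed buffers avoids recomputing sin/cos
        # on every single-token forward pass.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```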
1 parent d40eb6c commit 1ebd88a

2 files changed: +518 −4 lines changed

