[Feature] Support variable-length sequences for mamba block #244
Conversation
Hello @tridao @albertfgu, thanks for the awesome work on mamba; it is really a strong competitor to the transformer! We have noticed some issues (#236, #180) stating a need for training on variable-length sequences, but the corresponding functionality was not available. Also, in real-world scenarios the length distribution of datasets varies a lot, and simply padding every sample to the maximum length wastes computing resources on meaningless padded tokens. So we implemented this PR and hope it helps!
Force-pushed from 77e58cb to a78a9eb.
Force-pushed from aea08ca to 842bef5.
Hello, it's great to see your work on variable-length data. How can I use the method you provided? Is there any difference in results between it and padding?
Thank you for your interest in this PR! Update (2024/03/19):
Thank you for your reply. Due to performance considerations, I would like to use bidirectional mamba. Should I wait for your updated code?
Hi @EricPaul03, @Dmovic has written a unit test for the backward pass of the mamba block with variable-length sequences, and the results show numerical equality for both the forward and backward passes with varlen inputs. I haven't tried it with bidirectional mamba, but since it is numerically equivalent for the default unidirectional mamba, I think you can just give it a try!
To give a simple example: what we originally pass into the original mamba block is a padded batch of fixed-length sequences, whereas with this PR the variable-length sequences are packed into a single sequence. From the above figure, we can clearly see that through this PR the mamba block can focus computing resources on the variable-length sequences themselves and avoid the overhead of meaningless padding tokens. Variable-length training is very useful for optimizing hardware utilization during training, and the well-known flash attention already supports variable-length training via its `varlen_fwd`/`varlen_bwd` API.
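For illustration, here is a minimal sketch of the packing step described above. The helper names (`pack`, `build_cu_seqlens`) are my own, not part of this PR's API; the point is simply how variable-length sequences become one packed tensor of shape `(1, total_tokens, hidden_dim)` plus the cumulative boundaries `cu_seqlens`:

```python
import torch

def build_cu_seqlens(lengths, device="cpu"):
    """Cumulative sequence boundaries: N sequences give N + 1 monotonically increasing offsets."""
    lengths = torch.as_tensor(lengths, dtype=torch.int32, device=device)
    return torch.cat([torch.zeros(1, dtype=torch.int32, device=device),
                      torch.cumsum(lengths, dim=0).to(torch.int32)])

def pack(sequences):
    """Concatenate a list of (seqlen_i, hidden_dim) tensors into one (1, total_tokens, hidden_dim) tensor."""
    return torch.cat(sequences, dim=0).unsqueeze(0)

# Three toy sequences of lengths 5, 2 and 7 with hidden_dim = 16
hidden_dim = 16
seqs = [torch.randn(l, hidden_dim) for l in (5, 2, 7)]
packed = pack(seqs)                        # shape (1, 14, 16) instead of a padded (3, 7, 16)
cu_seqlens = build_cu_seqlens([5, 2, 7])   # tensor([0, 5, 7, 14]) marks where each sequence starts/ends
```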
Thank you for your answer. This is great code that I will try to use for my project!
Sorry to bother you again. I would like to implement the same operation for bidirectional mamba. Do I also need to reset the values in cu_seqlens when flipping the propagation direction to cope with the flipped sequence, and can the two directions share d_conv?
I think I should divide conv1d_out, delta, etc. into subsequences and reverse each subsequence separately (instead of reversing the entire sequence and using the same cu_seqlens)?
I copied some methods from MixerModel to help use this feature.
For bidirectional mamba, you need to pass in the `cu_seqlens` of the flipped sequence as well.
For example, if you have the original `cu_seqlens`, you can calculate the flipped `cu_seqlens` from it and the total token count.
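As a concrete illustration of that calculation (my own sketch, not code from this PR): assuming the whole packed sequence is flipped for the reverse scan, the boundaries of the flipped packing follow directly from the original `cu_seqlens` and the total token count.

```python
import torch

def flip_cu_seqlens(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Boundaries of the packed sequence after flipping it along the token dimension.

    Example: cu_seqlens = [0, 5, 7, 14] (lengths 5, 2, 7) becomes [0, 7, 9, 14]
    (lengths 7, 2, 5), i.e. the sub-sequence order is reversed as well.
    """
    total = cu_seqlens[-1]
    return total - cu_seqlens.flip(0)
```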
I think you might not need to divide these items into subsequences; all you need is to pass in the corresponding `cu_seqlens`. To combine the benefits of bidirectional mamba and this PR's variable-length sequences, I drew my graphical understanding here. The mechanism can be viewed simply as follows: when scanning bidirectionally, the hidden states need to be reset at the sequence boundaries of both directions.
It's great to see that there is already a paper/project (Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning, NeurIPS'24) adopting our code in the area of offline reinforcement learning.
Hi @zigzagcai, thank you for the great work! I tried to install your version but ran into an `undefined symbol` ImportError (full traceback below). The full pipeline I used is the following:

```bash
# (optionally) clone causal-conv1d, also tried pip install causal-conv1d==1.4.0
git clone https://github.com/Dao-AILab/causal-conv1d
cd causal-conv1d
git checkout v1.4.0
pip install -e .
cd ..

# clone and checkout your PR
git clone https://github.com/state-spaces/mamba
cd mamba
git fetch origin pull/244/head:pr-244
git checkout pr-244
pip install -e .
```

I tried installing with pytorch 2.4 and 2.1 and with cuda 12.5 and 12.1; all settings show the same problem:

```
> python tests/ops/test_mamba_cu_seqlens_equivalence.py
Traceback (most recent call last):
  File "/.../mamba/tests/ops/test_mamba_cu_seqlens_equivalence.py", line 5, in <module>
    from mamba_ssm.modules.mamba_simple import Mamba
  File "/usr/local/lib/python3.10/dist-packages/mamba_ssm/__init__.py", line 3, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, mamba_inner_fn
  File "/usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/selective_scan_interface.py", line 16, in <module>
    import selective_scan_cuda
ImportError: /usr/local/lib/python3.10/dist-packages/selective_scan_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE
```

Additionally, I also took a look at the installed packages.
For example, `pip show mamba-ssm` gives:

```
> pip show mamba-ssm
Name: mamba_ssm
Version: 2.2.2
Summary: Mamba state-space model
Home-page:
Author:
Author-email: Tri Dao <tri@tridao.me>, Albert Gu <agu@cs.cmu.edu>
...
Location: /usr/local/lib/python3.10/dist-packages
Requires: einops, ninja, packaging, setuptools, torch, transformers, triton (causal-conv1d is not here)
Required-by:
```

If this issue doesn't occur for you, could you provide the install script you are using for the most up-to-date version? Thanks!
Hi @JindongJiang, I have shared my minimal reproduction steps here.
Hi @JindongJiang, firstly, thanks for your interest in this PR!
Hi @zigzagcai, thank you very much for the help. Interestingly, deleting the ...
Besides the pytorch and cuda versions, I used the same setup as you suggested:
I will now try cuda 11.8 as well and will let you know if I get the same problem.
Hi @zigzagcai, I am back with the cuda 11.8 results, and the problem still exists. This time I am (almost) fully following your setup script:
The only difference is that I had to do ...
Complete results and env:
It is actually quite surprising that the big discrepancies only happen at the beginning and the end, i.e. in in_proj and out_proj. Could you provide some comments on this? Thanks!
Hi @JindongJiang, the error below is caused by the recent merge commit.
I just reverted the recent merge commit in 0a15f1d. Could you please re-try my branch? I just re-tested the code in my environment and it is okay.
The test results:
FYI, my local environment (including cuda version and pip packages):
BTW @JindongJiang, which GPU model are you using: A100, H100, or something else? That way I can better understand your software and hardware environment.
Hi @zigzagcai, thank you very much for the updates and new commits. I will test the new setup. I got the above results using an A100.
Force-pushed from 0a15f1d to cda4b5a.
Hi @zigzagcai, it seems that the grad discrepancy only exists when I use a docker image on slurm. I have two ways to run the experiments:
Thank you for your help again! I think the problem is not in the implementation then. I will use conda without docker for now.
Very glad to see it is helpful to you! You are right. I guess there might be some conflicts when you try to install packages inside the docker image.
Hi @zigzagcai, here is how I install the dependencies, which might be useful for those working with CUDA 12.5:

```bash
conda create -n your_env_name python=3.10.13
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
git clone git@github.com:hustvl/Vim.git
pip install -e causal-conv1d>=1.1.0
pip install -e mamba-1p1p1
pip install --upgrade huggingface-hub==0.24.0
```

I made a slight adjustment to your example; here is the revised version (abridged):

```python
from collections import Counter
import torch

sentences = [ ... ]
word_counter = Counter(chain(*[sentence.lower().split() for sentence in sentences]))

def variable_length_sequences(new_tensor): ...
def unpack(packed_hidden_states, cu_seqlens): ...
def pack(hidden_states, cu_seqlens): ...

hidden_dim = 256
new_tensor_reeshaped_index = variable_length_sequences(padded_sequences)
out_ref = mamba(hidden_states)
```

I noticed that when processing 4 sentences, you receive embeddings for only 3 sentences (torch.Size([3, 6, 256])). It might be helpful to append last_index + 1 to the list in your variable_length_sequences function (i.e., start_indexes.append(last_index + 1)). This adjustment should ensure that the number of output sentences matches the number of input sentences (torch.Size([4, 6, 256])).

I am now receiving embeddings with a shape of torch.Size([4, 6, 256]). However, one of my sentences contains only three words. Should I apply masking to the returned sequences to remove embeddings that might not be meaningful?

Thanks!
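For what it's worth, the `last_index + 1` fix suggested above matches the general rule that the boundary list for N packed sequences needs N + 1 entries, the last one being the total token count. A small illustrative snippet (the names here are mine, not from the code above):

```python
import torch

# Illustrative: build cu_seqlens from per-sentence lengths.
# For N sequences there must be N + 1 boundaries, ending at the total length.
lengths = torch.tensor([6, 5, 3, 6])        # e.g. 4 sentences, one of them only 3 words long
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
print(cu_seqlens)                           # tensor([ 0,  6, 11, 14, 20]) -> 5 boundaries for 4 sentences
```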
Hi,
Thank you very much for your code and illustrations, but I have some doubts about the seqlen and seq_idx parameters of Mamba2 in the following figure. Could you provide the corresponding illustration for these parameters?
Thanks for the great job! I am confused about whether this version shares hidden states between the packed batches. An answer would be a great help to me!
Hi @CacatuaAlan, thank you! This version of the code can handle packed hidden states, which combine multiple batches of hidden states into one. How do we avoid cross-batch contamination? The hidden states are reset at the sequence boundaries given by `cu_seqlens`, so the packed sequences never contaminate each other.
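To make the boundary-reset idea concrete, here is a slow, pure-PyTorch toy recurrence over a packed input that zeroes its hidden state at every `cu_seqlens` boundary. It only illustrates the mechanism; it is not the PR's CUDA kernel or the actual selective-scan math.

```python
import torch

def packed_linear_scan(x, a, cu_seqlens):
    """Toy recurrence h_t = a * h_{t-1} + x_t over a packed (total_tokens, dim) input,
    with h reset to zero at every sub-sequence start so packed sequences never mix."""
    total, dim = x.shape
    starts = set(cu_seqlens[:-1].tolist())
    h = torch.zeros(dim, dtype=x.dtype)
    out = torch.empty_like(x)
    for t in range(total):
        if t in starts:
            h = torch.zeros(dim, dtype=x.dtype)   # boundary: forget the previous sequence's state
        h = a * h + x[t]
        out[t] = h
    return out

x = torch.randn(14, 4)
cu_seqlens = torch.tensor([0, 5, 7, 14], dtype=torch.int32)
y = packed_linear_scan(x, 0.9, cu_seqlens)
# y[5] depends only on x[5], not on x[0:5], because the state was reset at t = 5.
assert torch.allclose(y[5], x[5])
```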
Hi @zongtianhu,
For example, consider a packed sentence consisting of 7 sub-sentences:
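As an illustrative stand-in with made-up lengths: 7 sub-sentences packed into one sequence give 8 `cu_seqlens` entries, and `seq_idx` tags every token with the index of the sub-sentence it belongs to.

```python
import torch

lengths = torch.tensor([3, 5, 2, 4, 6, 1, 3])   # 7 made-up sub-sentence lengths
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
seq_idx = torch.repeat_interleave(torch.arange(7), lengths)

print(cu_seqlens)   # tensor([ 0,  3,  8, 10, 14, 20, 21, 24])  -> 8 boundaries
print(seq_idx)      # 24 entries: 0,0,0,1,1,1,1,1,2,2,3,3,3,3,4,...,6
```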
Hi @zigzagcai, great work! If I pack all samples as per your example and run the forward pass, it seems I only get the final ssm_state of the whole packed sequence rather than one per sub-sequence. Looking at the cuda kernel, storing the ssm_state happens inside this block (L#279-L#298, selective_scan_fwd_kernel.cuh):
It seems that the InclusiveScan op only returns the last state in the scan?
Hey @zigzagcai! First of all, thank you for your work. I'm trying to use your feature for my project, and I benchmarked the variable-length forward pass against separate forward passes for each sample. Unfortunately, the results are rather disappointing: the separate passes are ~2.5x faster than variable-length batching. Any idea what might be going wrong?
Output:
Environment:
Hi @fzsomb, sorry for the late response. I have checked the code, and there are two points that should be pointed out:
Using my performance test script, the performance comparison for the forward pass is shown here:

```
Generate random cu_seqlens = [0, 239450, 335932, 339432, 429781, 449130, 490937, 596597, 627200]
max diff for output in varlen_mamba fwd pass: 5.960464477539063e-08
mean diff for output in varlen_mamba fwd pass: 2.5034383455135867e-09
max diff for output in varlen_mamba fwd pass: 8.940696716308594e-08
mean diff for output in varlen_mamba fwd pass: 2.5367627998207354e-09
max diff for output in varlen_mamba fwd pass: 5.21540641784668e-08
mean diff for output in varlen_mamba fwd pass: 2.57035126516314e-09
max diff for output in varlen_mamba fwd pass: 8.940696716308594e-08
mean diff for output in varlen_mamba fwd pass: 2.532647425113055e-09
max diff for output in varlen_mamba fwd pass: 7.450580596923828e-08
mean diff for output in varlen_mamba fwd pass: 2.5388156021932673e-09
max diff for output in varlen_mamba fwd pass: 8.940696716308594e-08
mean diff for output in varlen_mamba fwd pass: 2.545051724922587e-09
max diff for output in varlen_mamba fwd pass: 7.450580596923828e-08
mean diff for output in varlen_mamba fwd pass: 2.5323849683900335e-09
max diff for output in varlen_mamba fwd pass: 7.450580596923828e-08
mean diff for output in varlen_mamba fwd pass: 2.5522350899365165e-09
Total forward time for separate: 0.4627113342285156
Total forward time for batched: 0.015723705291748047
```

Environment:
You would see a nearly 30x speedup for the example batched inputs, measured by the forward time of the varlen_mamba block. As a comparison, if you comment out the two lines that precompute `seq_idx` and instead let it be computed on-the-fly in the varlen_mamba forward pass, the performance would be:
We can clearly see that the performance of varlen_mamba is bottlenecked by the on-the-fly construction of `seq_idx`. Therefore, in actual training scenarios, we need to prepare the necessary `cu_seqlens`/`seq_idx`/`position_ids` ahead of time and pass them into the mamba forward pass.
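A minimal sketch of precomputing this metadata once per batch (for example in the data collator) instead of on-the-fly; the helper name and exact return values are my own choices, not the PR's:

```python
import torch

def precompute_varlen_metadata(lengths, device="cpu"):
    """Given per-sample lengths, build cu_seqlens, seq_idx and position_ids up front."""
    lengths = torch.as_tensor(lengths, dtype=torch.int32, device=device)
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32, device=device),
                            lengths.cumsum(0, dtype=torch.int32)])
    # seq_idx: which sub-sequence each token of the packed batch belongs to, shape (1, total_tokens)
    seq_idx = torch.repeat_interleave(
        torch.arange(len(lengths), dtype=torch.int32, device=device), lengths.long()
    ).unsqueeze(0)
    # position_ids: position of each token within its own sub-sequence, shape (1, total_tokens)
    position_ids = torch.cat(
        [torch.arange(l, dtype=torch.long, device=device) for l in lengths.tolist()]
    ).unsqueeze(0)
    return cu_seqlens, seq_idx, position_ids
```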
Support variable-length sequences for the mamba block via `cu_seqlens`/`seq_idx`/`position_ids` in the `forward` pass and `backward` pass, similar to what has been done (such as the cumulative sequence lengths `cu_seqlens` or the lower-triangular block-diagonal `attention mask`) in the flash-attention `varlen_fwd`/`varlen_bwd` API. We have tested that training with variable-length sequences on real-world datasets can bring an end-to-end 2~4x speedup.
Why do we need this?
High speedup and hardware utilization on the real-world datasets we tested. This can be used to improve hardware utilization when you have variable-length sequences and don't want to waste computing resources on meaningless padded tokens. It is especially useful when you train mamba on real-world datasets, where the length distribution varies a lot and a large proportion of samples are short sequences. Last but not least, we ensure exact fwd/bwd numerical equality with the padding approach.
How to use?
Zero learning overhead: the packed mamba API is similar to the packed flash-attn API or the packed mamba2 API. You just need to pack multiple variable-length sequences into one and additionally pass `cu_seqlens`/`seq_idx`/`position_ids` into the mamba `forward` pass. There is no need to modify `causal-conv1d`; just using the original https://github.com/Dao-AILab/causal-conv1d (version >= 1.4.0) is fine.
Note:
We thank @wang-zerui for the forward-pass Python reference implementation and the invaluable discussion on how to ensure numerical equality.
This is joint work with @wang-zerui, @Dmovic, and @ptxu78.
Example usage:
https://github.com/zigzagcai/varlen_mamba/blob/feat/add-cu_seqlens/tests/ops/test_mamba_varlen.py
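The linked test is the authoritative reference for the exact call signature; the following is only a rough, hypothetical sketch of the call pattern, under the assumption that this PR's `Mamba.forward` accepts a `cu_seqlens` keyword (and possibly `seq_idx`/`position_ids`, see the test above):

```python
import torch
from mamba_ssm.modules.mamba_simple import Mamba

device, dtype = "cuda", torch.float32
mamba = Mamba(d_model=256, device=device, dtype=dtype)

# Two variable-length sequences packed into one "batch" of shape (1, total_tokens, d_model)
lens = [37, 91]
seqs = [torch.randn(l, 256, device=device, dtype=dtype) for l in lens]
packed = torch.cat(seqs, dim=0).unsqueeze(0)
cu_seqlens = torch.tensor([0, 37, 128], dtype=torch.int32, device=device)

out_packed = mamba(packed, cu_seqlens=cu_seqlens)   # cu_seqlens keyword assumed from this PR

# Reference: run each sequence separately and compare against the packed output
out_ref = torch.cat([mamba(s.unsqueeze(0)).squeeze(0) for s in seqs], dim=0).unsqueeze(0)
print((out_packed - out_ref).abs().max())           # expected to be around 1e-7 in float32, per the thread above
```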
Limitation:
Some related issues about mamba and flash-attn variable-length training: