Replies: 2 comments
-
I think here, with d = num_heads * head_dim, it doesn't actually split the dimension of a single head, but rather distributes the heads across different GPUs. The computation between the two all-to-all operations is essentially the same as in tensor parallelism, and outside of the all-to-alls it's equivalent to sequence-level data parallelism (DP). Regarding sequence parallelism (SP), I have a question as well: DS-SP is used in mega-ds, but it seems to be compatible only with TP and incompatible with PP (Pipeline Parallelism) and EP (Expert Parallelism)? Additionally, Megatron's native SP doesn't seem to be well validated in mega-ds.
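To illustrate what I mean, here is a rough single-process sketch of that head redistribution (made-up sizes; the real all-to-all is only simulated with transposes/reshapes):

```python
# Rough sketch of the shape change around the first all-to-all in Ulysses.
import numpy as np

N, h, d_h, P = 8, 4, 16, 2              # sequence length, heads, head dim, GPUs
x = np.random.rand(P, N // P, h, d_h)   # axis 0 = which GPU holds that sequence shard

# Before: GPU p holds all h heads for its N/P tokens           -> [N/P, h, d_h]
# After the all-to-all: GPU p holds h/P heads for all N tokens -> [N, h/P, d_h]
by_head = x.transpose(2, 0, 1, 3).reshape(h, N, d_h)   # regroup tokens per head
after = by_head.reshape(P, h // P, N, d_h)             # axis 0 = which GPU holds those heads
print(after.shape)   # (2, 2, 8, 16): each GPU computes full-sequence attention for h/P heads
```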
-
@inkcherry I noticed now that this is mentioned in the introduction as:
while the Methods section, where these details should be present, only includes vague information:
So you are right: it's sequence parallelism throughout most of the run, except inside the blue rectangle in the picture, where it is head parallelism. Thank you!
-
I don't understand why the attention matrix produced by sequence parallelism in DS Ulysses is correct. Looking at Figure 2 from the paper:
DS Ulysses implementation
There are three steps in the computation of the attention matrix:
Total comm cost: 3 all-to-alls of $Nh$ elements + 1 all-to-all of $Nh$ elements
Here's a demo in Python. Assume $K$ and $Q$ to be of shape $[2,2]$. In a serial run:
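Something along these lines (a minimal sketch; the values and the row-wise softmax are just for illustration):

```python
# Serial reference: full softmax(Q K^T) on one process.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1., 2.],
              [3., 4.]])
K = np.array([[5., 6.],
              [7., 8.]])

print(softmax(Q @ K.T))   # the full [2,2] attention matrix
```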
Now assume $P=2$, and split $K$ and $Q$ across 2 processes as $Q_0$, $Q_1$, $K_0$ and $K_1$. You can't recover the softmax above from the partial softmaxes below:
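Continuing the snippet above (each "process" only sees its own sequence shard):

```python
# Split by sequence: process 0 holds (Q0, K0), process 1 holds (Q1, K1).
Q0, Q1 = Q[:1], Q[1:]
K0, K1 = K[:1], K[1:]

# Local softmax(Q_i K_i^T): the cross blocks Q0 K1^T and Q1 K0^T never enter
# the normalization, so the serial softmax above cannot be recovered from these.
print(softmax(Q0 @ K0.T))
print(softmax(Q1 @ K1.T))
```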
So I believe that the sequence-parallel $QK^T$ in DS-Ulysses is not equivalent to a serial implementation, and that different numbers of processes give different $QK^T$ values. What am I missing?
Alternative implementation
I think the correct approach would be to add an "all-reduce sum" as mentioned above, or to do a regular matrix-matrix multiplication, as in:
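A rough single-process simulation of what I have in mind (my own sketch, not DS-Ulysses code; there is no real communication here, the ring exchange of the $K$ shards is just simulated with indexing):

```python
# Ring-style distributed QK^T: each "process" keeps its N/P rows of Q, the K
# shards rotate around the ring, and every process ends up with a complete
# [N/P, N] strip of rows of QK^T, so its row-wise softmax is exact.
import numpy as np

def softmax(x):   # same row-wise softmax as in the serial demo above
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N, d, P = 8, 4, 2
Q = np.random.rand(N, d)
K = np.random.rand(N, d)

Q_shards = np.split(Q, P)   # Q_p stays on process p
K_shards = np.split(K, P)   # K_p starts on process p and then rotates around the ring

strips = [np.empty((N // P, N)) for _ in range(P)]
for step in range(P):                 # step 0 is local; the other P-1 steps need comms
    for p in range(P):
        j = (p + step) % P            # which K shard sits on process p at this step
        strips[p][:, j * (N // P):(j + 1) * (N // P)] = Q_shards[p] @ K_shards[j].T

# Complete rows per process, so no further reduction is needed for the softmax.
A = np.concatenate([softmax(s) for s in strips])
assert np.allclose(A, softmax(Q @ K.T))   # matches the serial result
```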
Total cost: 1 all-scatter of $Nh$ elements + $P-1$ all-to-alls of $Nh$ elements
Remarks about memory consumption and max sequence length
The DS-Ulysses implementation yields a $P$-fold linear reduction of memory for the embeddings, but still requires memory quadratic in the sequence length in order to store the full attention matrix $[N,N]$ on every process, which is the big memory culprit in transformer models. It seems illogical to me that an implementation of sequence parallelism does not parallelize the largest sequence-related tensor (the attention matrix of size $[N,N]$). Long story short, if Ulysses requires each process to be able to hold $[N,N]$ in memory, then you are better off never using Ulysses at all (or any combination of sequence and data parallelism), because doing only data parallelism gives less communication, higher samples/sec, and has the same maximum memory requirement ($[N,N]$).
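To put rough numbers on it (my own assumptions: fp16, $N = 32768$, hidden size 4096, 32 heads, one layer, attention scores fully materialized):

```python
# Back-of-the-envelope memory with the assumed sizes above.
N, hidden, n_heads, bytes_per = 32_768, 4_096, 32, 2
attn  = n_heads * N * N * bytes_per   # materialized [N, N] attention scores, all heads
embed = N * hidden * bytes_per        # one [N, hidden] activation
print(attn / 2**30, embed / 2**30)    # 64.0 GiB vs 0.25 GiB: the [N, N] term dominates
```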
The alternative implementation I propose yields a $P$-fold reduction of the attention matrix and of all but one embedding. This allows for a longer sequence length and can be combined with data parallelism. $P$ is generally small, so the algorithm runs fast, and the proposed MatMul algorithm can be improved to run in fewer than $P-1$ comm steps.
Does my thinking make sense?
Thank you
(cc: @bm-synth, my work alias)