[PP + EP][Master Thread] Enable Pipeline Parallelism (PP) and Expert Parallelism (EP) #133

@lchu6

Background

We are starting an effort to enable PP + EP for our training experiments on Mamba MoE. Compared to the other parallelisms (FSDP, CP, TP), PP is much more complicated to add to an existing code base, for two reasons:

  1. It requires modifications on the model side. Unlike the other parallelisms, which need no model changes, PP requires model-side modifications so the model can be split into stages. This means we need to modify the Mamba repo (details are discussed in [PP + EP][Stage I] PP x Mamba #134).
  2. It requires rewriting the training side. Unlike the other parallelisms, which can be added to an existing training script by adding an extra dimension to the device_mesh and stacking the new parallelism with the existing ones (see the sketch after this list), PP requires a revamp of a large portion of the training script (details are discussed in [TODO]).
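
To make the contrast concrete, here is a minimal sketch of the "add a mesh dimension" pattern with torch.distributed.device_mesh; the dimension names and sizes are illustrative, not our actual config.

```python
# Minimal sketch of the "just add a mesh dimension" pattern (names and
# sizes are illustrative, not our actual config). Assumes the process
# group is already initialized.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

world_size = dist.get_world_size()
ep_size = 8  # assumed

# Going from 1d data parallel to DP + EP is mostly a reshape of the mesh:
mesh = init_device_mesh(
    "cuda",
    (world_size // ep_size, ep_size),
    mesh_dim_names=("dp", "ep"),
)
dp_mesh = mesh["dp"]  # handed to FSDP / the data loader as before
ep_mesh = mesh["ep"]  # handed to the MoE layers for expert sharding

# PP cannot be added this way: the model has to be split into stages and
# the training loop replaced by a pipeline schedule (see #134 and below).
```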

Given the complexity, we are going to do this with a multi-stage plan and provide limited support.

Multi-Stage Plan (updated 03/25)

Stage 1a: EP + MoE Path

  1. modify the Mamba repo to support MoE Mamba
  2. enable EP (Expert Parallelism)
  3. enable EP with fast kernels

Note: we should test this with 1d EP only, i.e. a single copy of each expert, with no duplication that would require a further all-reduce (sketched below).
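
For reference, a conceptual sketch of what 1d EP token dispatch looks like, assuming top-1 routing and a torch.distributed all-to-all; the function and variable names are made up for illustration.

```python
# Conceptual sketch of 1d EP token dispatch. Each EP rank owns a disjoint
# slice of the experts, so there is exactly one copy of every expert and no
# replica all-reduce afterwards. Assumes an initialized EP process group.
import torch
import torch.distributed as dist

def dispatch_to_experts(tokens, expert_ids, num_experts, ep_group):
    ep_size = dist.get_world_size(ep_group)
    experts_per_rank = num_experts // ep_size  # assumes even divisibility

    # Bucket tokens by the rank that owns their chosen expert.
    dest_rank = expert_ids // experts_per_rank
    order = torch.argsort(dest_rank)
    send_counts = torch.bincount(dest_rank, minlength=ep_size)

    # Exchange counts, then tokens, with a single all-to-all each.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=ep_group)
    recv_buf = tokens.new_empty(int(recv_counts.sum()), tokens.shape[-1])
    dist.all_to_all_single(
        recv_buf,
        tokens[order],
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
        group=ep_group,
    )
    # Run the local experts on recv_buf, then reverse the exchange to send
    # results back to the source ranks.
    return recv_buf, order, send_counts, recv_counts
```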

Stage 1b: PP + Mamba Path

  1. modify the Mamba repo to support PP.
  2. a complete revamp of train.py to support PP (see the sketch below).

Note: 1a and 1b are independent efforts, so we should proceed with them in parallel.
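
As a rough idea of the new training-loop shape, here is a sketch assuming we build on torch.distributed.pipelining (PyTorch >= 2.4); the stage construction, the GPipe schedule, and the function names are placeholders, not the final design.

```python
# Rough sketch of the revamped per-rank training loop, assuming
# torch.distributed.pipelining. Schedule choice and names are placeholders.
import torch
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

def build_schedule(stage_module, stage_idx, num_stages, device,
                   n_microbatches, loss_fn):
    # Older PyTorch versions may require example inputs for shape inference;
    # newer ones infer shapes at runtime.
    stage = PipelineStage(stage_module, stage_idx, num_stages, device)
    return ScheduleGPipe(stage, n_microbatches=n_microbatches, loss_fn=loss_fn)

def train_step(schedule, stage_idx, num_stages, batch, target):
    # The loop changes shape: only the first stage feeds inputs, only the
    # last stage sees targets and losses, and middle stages just step.
    if stage_idx == 0:
        schedule.step(batch)
    elif stage_idx == num_stages - 1:
        losses = []
        schedule.step(target=target, losses=losses)
        return torch.stack(losses).mean()
    else:
        schedule.step()
```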

Stage 2: [PP + EP] x [Mamba MoE]

Combine the efforts from both 1a and 1b:

  1. merge the Mamba-side modifications: MoE Mamba + PP Mamba.
  2. merge the training scripts: a 2d/3d device mesh to combine PP and EP (see the sketch below).
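
A rough sketch of what that mesh could look like, assuming a 2d PP x EP layout; the sizes are illustrative.

```python
# Sketch of the Stage 2 mesh (sizes are illustrative): PP along one axis,
# EP along the other, so each pipeline stage owns a 1d EP group for the
# MoE layers that live on that stage.
from torch.distributed.device_mesh import init_device_mesh

pp_size, ep_size = 4, 8  # assumed 32-GPU layout
mesh = init_device_mesh("cuda", (pp_size, ep_size), mesh_dim_names=("pp", "ep"))

pp_rank = mesh["pp"].get_local_rank()  # which pipeline stage this rank builds
ep_group = mesh["ep"].get_group()      # passed to the MoE layers for all-to-all
```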

Limited Support

Any of the limitations below can be lifted as needed, but we are starting with limited support:

  1. PP + EP only. Composing PP with the other parallelisms can get very complicated, and it takes a lot of effort to maintain and keep up to date. We have no intention of turning this repo into another TorchTitan/Megatron-LM that supports an arbitrary combination of all parallelisms (PP + FSDP + CP + TP + EP). For this effort, we will target a clean PP + EP solution only.
  2. handcrafted and hardcoded PP schedule. It would not be hard to implement automated PP splitting for Mamba models, but doing it manually with hardcoded splits has two benefits: 1. a hardcoded physical schedule is safer and clearer; 2. it is easier to modify and try different schedules to tune model performance and find the best split points (see the sketch below). Since we will likely have a fixed setup (a small, fixed number of GPUs with a fixed PP size), we are better off with a non-automated approach, and we can always add automation as needed.
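
As an illustration of the hardcoded-split idea (not the actual split we will ship), something as simple as an explicit stage-to-layers map is what we have in mind:

```python
# Illustrative only: an explicit, hand-edited map from pipeline stage to
# layer indices. The stage count and layer ranges are made up, not the
# actual Mamba MoE configuration.
STAGE_LAYERS = {
    0: range(0, 12),    # embedding + first block of layers
    1: range(12, 24),
    2: range(24, 36),
    3: range(36, 48),   # last block + LM head
}

def layers_for_stage(stage_idx, all_layers):
    # Trying a different split is just editing STAGE_LAYERS by hand.
    return [all_layers[i] for i in STAGE_LAYERS[stage_idx]]
```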
