PD disaggregation
This update simplifies running SGLang with Prefill-Decode disaggregation.
Previously, PD disaggregation required configuring the router on the gateway, which meant
the gateway had to run in the same cluster as the service in order to communicate with
service replicas.
With this update, the router is configured on a service replica group instead, so you can
use a standard gateway outside the service cluster.
Below is an example service configuration for running `zai-org/GLM-4.5-Air-FP8` using replica groups:

```yaml
type: service
name: prefill-decode

image: lmsysorg/sglang:latest
env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  # Router replica group
  - count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4
  # Prefill replica group
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200
  # Decode replica group
  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```

Note: this setup requires the service fleet or cluster to provide a CPU node for the
router replica.
Kubernetes
The `kubernetes` backend adds support for both network and instance volumes.
Network volumes
You can either create a new network volume or register an existing one. To create a new
network volume, specify `size` and optionally `storage_class_name` and/or
`access_modes`:

```yaml
type: volume
backend: kubernetes
name: my-volume

size: 100GB
```

This automatically creates a PersistentVolumeClaim and associates it with the volume.
If you don't specify `storage_class_name`, the decision is delegated to the
DefaultStorageClass admission controller, if enabled. If you don't specify
`access_modes`, it defaults to `[ReadWriteOnce]`. To attach
volumes to multiple runs at the same time, set it to `[ReadWriteMany]` or
`[ReadWriteMany, ReadOnlyMany]`.
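As a sketch, a volume configuration that sets both optional fields could look like this (the storage class name below is an assumption; use one that exists in your cluster):

```yaml
type: volume
backend: kubernetes
name: my-shared-volume

size: 100GB
# Assumed storage class name; replace with one available in your cluster
storage_class_name: standard-rwx
# ReadWriteMany allows attaching the volume to multiple runs at once
access_modes: [ReadWriteMany]
```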
To reuse an existing PersistentVolumeClaim, specify its name in `claim_name`:

```yaml
type: volume
backend: kubernetes
name: my-volume

claim_name: existing-pvc
```

Once a volume configuration is applied, you can attach it to your runs via `volumes`:

```yaml
type: dev-environment
name: vscode-vol
ide: vscode

volumes:
  - name: my-volume
    path: /volume_data
```

Instance volumes
In addition to network volumes, the `kubernetes` backend now supports instance volumes:

```yaml
type: dev-environment
name: vscode-vol
ide: vscode

volumes:
  - instance_path: /mnt/volume
    path: /volume_data
```

Unlike network volumes, which persist across instances, instance volumes persist data
only within a particular instance. They are useful for storing caches or when you
manually mount a shared filesystem into the instance path.
Note: using volumes with the `kubernetes` backend requires the corresponding
permissions.
Performance
Fetching backend offers for the first time has been optimized and is now much faster. As
a result, `dstack apply`, `dstack offer`, and the offers UI are all more responsive.
Here are the improvements for some of the major backends:
- aws — 41.43s => 6.61s (6.3x)
- azure — 12.49s => 5.50s (2.3x)
- gcp — 13.51s => 5.20s (2.6x)
- nebius — 10.74s => 3.80s (2.8x)
- runpod — 9.36s => 0.09s (104x)
- verda — 9.49s => 2.33s (4.1x)
Fleets
In-place update
Backend fleets now have initial support for in-place updates. You can update `nodes`,
`reservation`, `tags`, `resources`, `backends`, `regions`, `availability_zones`,
`instance_types`, `spot_policy`, and `max_price` without re-creating the entire fleet.
If existing idle instances do not match the updated configuration, dstack replaces
them.
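For illustration, with a backend fleet configuration like the following (names and values are hypothetical), you can now edit fields such as `nodes`, `spot_policy`, or `max_price` and re-run `dstack apply` without the fleet being re-created:

```yaml
type: fleet
name: my-fleet

nodes: 2
backends: [aws]
regions: [us-east-1]
# These provisioning fields can now be changed in place
spot_policy: auto
max_price: 4.0
resources:
  gpu: H100
```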
Default resources
Fleets used to have default resources set to `cpu=2.. mem=8GB.. disk=100GB..` when
left unspecified. This meant any offers with fewer resources were excluded from such
fleets. If you wanted to run on a `mem=4GB` VM, you had to specify `resources` in both
the run and fleet configurations.
Now fleets have no default resources, so all offers are available by default. If you
need to add extra constraints on which offers can be provisioned in a fleet, specify
`resources` explicitly.
Run configurations continue to have default minimum resources set to
`cpu=2.. mem=8GB.. disk=100GB..` to avoid provisioning instances that are too small.
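For example, to keep smaller offers out of a fleet, you can still constrain them explicitly; a minimal sketch with illustrative values:

```yaml
type: fleet
name: cpu-fleet

nodes: 1
# Without this explicit constraint, all offers are now available by default
resources:
  cpu: 2..
  memory: 4GB..
```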
Offers
The `dstack offer` CLI command now supports the `--fleet` argument, which allows you to
see only offers from the specified fleets.

```shell
dstack offer --fleet my-fleet --fleet another-project/other-fleet
```

The same is now supported in the UI on both the Offers and Launch pages.
Exports
Importers can now delete an import via
`dstack import delete <export-project>/<export-name>`. This is useful when an export
was created by the exporter, but the importer no longer needs it and does not want to
wait until the exporter deletes it.
AWS
RTX Pro 6000
The `aws` backend adds support for `g7e.*` instances offering `RTXPRO6000` GPUs.
Docker
Default Docker registry
If you'd like to cache Docker images through your own Docker registry, you can now
configure it when starting the `dstack` server:

```shell
export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY=<registry base hostname>
export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY_USERNAME=<registry username>
export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY_PASSWORD=<registry password>
```

These settings should only be used for registries that act as a pull-through cache for
Docker Hub. This is useful if you would like to avoid rate limits when performing many
image pulls.
Migration note
Warning
Since v0.20.0, `dstack` has required fleets to exist before runs can be submitted.
Until now, the deprecated `DSTACK_FF_AUTOCREATED_FLEETS_ENABLED` feature flag allowed submitting runs without fleets. In 0.20.17, this flag has been removed.
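If you relied on that flag, create a fleet before submitting runs. A minimal sketch (the name is a placeholder):

```yaml
type: fleet
name: default-fleet

nodes: 1
```

Apply it with `dstack apply -f <config>.dstack.yml`; once the fleet is provisioned, runs can be submitted as before.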
What's changed
- Drop deprecated scheduled tasks by @r4victor in #3749
- [Docs]: Rename REST API -> HTTP API by @jvstme in #3748
- Rework runner job submission flow by @un-def in #3743
- Default Docker registry and credentials by @jvstme in #3747
- Detect Verda provisioning errors earlier by @jvstme in #3753
- Optimize Python DB tests by @r4victor in #3755
- Add case study on Graphsignal's use of dstack for inference benchmarking by @peterschmidt85 in #3751
- Allow combining on/off idle_duration between runs and fleets by @r4victor in #3756
- Fix no offers retry for scheduled runs by @r4victor in #3759
- Support dynamic run waiting CLI status with extra renderables by @r4victor in #3760
- Kubernetes: add instance volumes support by @un-def in #3758
- Init gateways in background by @r4victor in #3762
- Store source backend config by @r4victor in #3764
- Show offers in dstack apply for elastic container fleets by @peterschmidt85 in #3754
- Support cloud fleet in-place update by @r4victor in #3766
- Set up HTTP ALB listener for ACM gateway by @r4victor in #3767
- Evict jobs if instance is no longer imported by @jvstme in #3772
- Implement cloud fleet in-place update for provisioning fields by @r4victor in #3775
- Drop fleet default min resources by @r4victor in #3776
- Support --fleet in dstack offer by @peterschmidt85 in #3774
- Support imported fleets in dstack fleet get by @jvstme in #3773
- Limit fleet consolidation attempts by @r4victor in #3777
- [Docs]: Examples cleanup and installation updates by @peterschmidt85 in #3765
- Support AWS G7e (RTXPRO6000) instances by @jvstme in #3752
- Support imported fleets in dstack event by @jvstme in #3779
- Drop autocreated fleets by @r4victor in #3782
- Support fleet filters in the Offers and Launch UI by @peterschmidt85 in #3780
- Support router as replica with pipelines by @Bihan in #3721
- Pre-load offers catalog by @r4victor in #3785
- Parallelize get_project_backends_with_models by @r4victor in #3787
- Kubernetes: add support for volumes by @un-def in #3781
- Allow project admins to delete imports by @jvstme in #3783
- Skip best fleet search for dstack offer by @peterschmidt85 in #3788
- Disable go-integration-tests for release by @r4victor in #3791
Full changelog: 0.20.16...0.20.17