0.20.17

Released by @peterschmidt85 on 16 Apr 12:45
PD disaggregation

This update simplifies running SGLang with Prefill-Decode disaggregation.

Previously, PD disaggregation required configuring the router on the gateway, which
meant the gateway had to run in the same cluster as the service so it could communicate
with service replicas.

With this update, the router is configured on a service replica group instead. This
allows using a standard gateway outside the service cluster.

Below is an example service configuration for running zai-org/GLM-4.5-Air-FP8 using replica groups:

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

Note: this setup requires the service fleet or cluster to provide a CPU node for the
router replica.

Kubernetes

The kubernetes backend adds support for both network and instance volumes.

Network volumes

You can either create a new network volume or register an existing one. To create a new
network volume, specify size and optionally storage_class_name and/or
access_modes:

type: volume
backend: kubernetes
name: my-volume

size: 100GB

This automatically creates a PersistentVolumeClaim and associates it with the volume.
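Under the hood, the configuration above corresponds to a PersistentVolumeClaim roughly like the following (the metadata name is illustrative; the actual object is managed by dstack):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-volume  # illustrative; the real name is assigned by dstack
spec:
  accessModes: [ReadWriteOnce]  # the default when access_modes is unset
  resources:
    requests:
      storage: 100Gi
```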

If you don't specify storage_class_name, the decision is delegated to the
DefaultStorageClass admission controller, if enabled.

If you don't specify access_modes, it defaults to [ReadWriteOnce]. To attach
volumes to multiple runs at the same time, set it to [ReadWriteMany] or
[ReadWriteMany, ReadOnlyMany].
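For instance, a volume configuration that pins a storage class and allows attaching to multiple runs might look like this (the storage class name is hypothetical; use one available in your cluster):

```yaml
type: volume
backend: kubernetes
name: my-shared-volume

size: 100GB
# Hypothetical storage class; must exist in your cluster
storage_class_name: standard-rwx
# Allows attaching the volume to multiple runs at the same time
access_modes: [ReadWriteMany]
```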

To reuse an existing PersistentVolumeClaim, specify its name in claim_name:

type: volume
backend: kubernetes
name: my-volume

claim_name: existing-pvc

Once a volume configuration is applied, you can attach it to your runs via volumes:

type: dev-environment
name: vscode-vol

ide: vscode

volumes:
  - name: my-volume
    path: /volume_data

Instance volumes

In addition to network volumes, the kubernetes backend now supports instance volumes:

type: dev-environment
name: vscode-vol

ide: vscode

volumes:
  - instance_path: /mnt/volume
    path: /volume_data

Unlike network volumes, which persist across instances, instance volumes persist data
only within a particular instance. They are useful for storing caches or when you
manually mount a shared filesystem into the instance path.

Note: using volumes with the kubernetes backend requires the corresponding
permissions.

Performance

Fetching backend offers for the first time has been optimized and is now much faster. As
a result, dstack apply, dstack offer, and the offers UI are all more responsive.
Here are the improvements for some of the major backends:

- aws — 41.43s => 6.61s (6.3x)
- azure — 12.49s => 5.50s (2.3x)
- gcp — 13.51s => 5.20s (2.6x)
- nebius — 10.74s => 3.80s (2.8x)
- runpod — 9.36s => 0.09s (104x)
- verda — 9.49s => 2.33s (4.1x)

Fleets

In-place update

Backend fleets now support in-place updates. You can update nodes,
reservation, tags, resources, backends, regions, availability_zones,
instance_types, spot_policy, and max_price without re-creating the entire fleet.
If existing idle instances do not match the updated configuration, dstack replaces
them.
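As a sketch, a fleet configuration like the one below (names and values are illustrative) could have any of the updatable fields changed and re-applied without re-creating the fleet:

```yaml
type: fleet
name: my-fleet

# All of the fields below can be changed in place
nodes: 0..4
spot_policy: auto
max_price: 2.5
backends: [aws, gcp]
resources:
  gpu: H100
```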

Default resources

Fleets used to have default resources set to cpu=2.. mem=8GB.. disk=100GB.. when
left unspecified. This meant any offers with fewer resources were excluded from such
fleets. If you wanted to run on a mem=4GB VM, you had to specify resources in both
the run and fleet configurations.

Now fleets have no default resources, so all offers are available by default. If you
need to add extra constraints on which offers can be provisioned in a fleet, specify
resources explicitly.

Run configurations continue to have default minimum resources set to
cpu=2.. mem=8GB.. disk=100GB.. to avoid provisioning instances that are too small.

Offers

The dstack offer CLI command now supports the --fleet argument, which allows you to
see only offers from the specified fleets.

dstack offer --fleet my-fleet --fleet another-project/other-fleet

The same is now supported in the UI on both the Offers and Launch pages.

Exports

Importers can now delete an import via
dstack import delete <export-project>/<export-name>. This is useful when the importer
no longer needs an export and does not want to wait for the exporter to delete it.

AWS

RTX Pro 6000

The aws backend adds support for g7e.* instances offering RTXPRO6000 GPUs.

Docker

Default Docker registry

If you'd like to cache Docker images through your own Docker registry, you can now
configure it when starting the dstack server:

export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY=<registry base hostname>
export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY_USERNAME=<registry username>
export DSTACK_SERVER_DEFAULT_DOCKER_REGISTRY_PASSWORD=<registry password>

These settings should only be used for registries that act as a pull-through cache for
Docker Hub. This helps avoid Docker Hub rate limits when you have a high volume of
image pulls.

Migration note

Warning

Since v0.20.0, dstack has required fleets before runs can be submitted.

Until now, the deprecated DSTACK_FF_AUTOCREATED_FLEETS_ENABLED feature flag allowed submitting runs without fleets. In 0.20.17, this flag has been removed.

What's changed

Full changelog: 0.20.16...0.20.17