fix(scheduler): prevent 5xx errors during pipeline update (#7072) #7106

Open
jackson-wright wants to merge 2 commits into SeldonIO:v2 from jackson-wright:fix-pipeline-update-race-condition-upstream

@jackson-wright

Summary

Fixes a race condition in the scheduler where updating a Pipeline CRD causes the pipeline to return 5xx errors for a window after the update, even though the new version is healthy and the CRD shows Ready: True.

Closes #7072

Root cause

When a Pipeline is updated, the scheduler creates version N+1 alongside the existing version N. Once N+1 is ready, terminatePipelineGwOldUnterminatedPipelinesIfNeeded marks N's PipelineGwStatus as PipelineTerminate. This fires a sendPipelineEvents call for N, which had two bugs:

Bug 1 — 503 "no healthy upstream" (envoy)
sendPipelineEvents unconditionally called sendPipelineStreamsEventMsg with an empty StreamNames list, causing AddPipelineClusters → CreateClusterForStreams to build a zero-endpoint envoy cluster. Envoy returns 503 until the next event restores the route.

Bug 2 — 500 "pipeline not found" (pipeline-gw)
The PipelineTerminate switch case still called sendPipelineEventsToStreamWithTimestamp, which sent a PipelineDelete event to the pipeline gateway. The pipeline-gw's handleDeletePipeline calls DeletePipeline(name), which removes the entry keyed by "{name}.pipeline" from its KafkaManager. N+1 is held under that same key, so the delete intended for N removes N+1 as well. After this, every inference request through the pipeline-gw returns 500 because LoadOrStorePipeline(name, loadOnly=true) finds nothing.
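
To make the key collision in Bug 2 concrete, here is a minimal sketch of a registry keyed by pipeline name only. It is not the pipeline-gw code: pipelineRegistry, Load, Delete and Has are hypothetical names standing in for the KafkaManager behaviour described above, and the only detail taken from this PR is the "{name}.pipeline" key format.

```go
package main

import (
	"fmt"
	"sync"
)

// pipelineRegistry is a hypothetical stand-in for the pipeline-gw's
// KafkaManager: entries are keyed by pipeline name, never by version.
type pipelineRegistry struct {
	mu      sync.Mutex
	entries map[string]uint32 // "{name}.pipeline" -> version currently loaded
}

func newPipelineRegistry() *pipelineRegistry {
	return &pipelineRegistry{entries: make(map[string]uint32)}
}

// key builds the name-only map key described in the PR.
func key(name string) string { return name + ".pipeline" }

// Load stores (or overwrites) the entry for a pipeline name.
func (r *pipelineRegistry) Load(name string, version uint32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[key(name)] = version
}

// Delete removes the entry for a pipeline name, whichever version holds it.
func (r *pipelineRegistry) Delete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, key(name))
}

// Has reports whether any version of the pipeline is loaded.
func (r *pipelineRegistry) Has(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	_, ok := r.entries[key(name)]
	return ok
}

func main() {
	r := newPipelineRegistry()
	r.Load("demo", 1)          // version N
	r.Load("demo", 2)          // version N+1 overwrites the same key
	r.Delete("demo")           // delete meant for version N
	fmt.Println(r.Has("demo")) // prints "false": N+1 is gone too
}
```

Because the key carries no version, a delete addressed to N cannot be distinguished from a delete addressed to N+1, which is why the fix suppresses the message on the scheduler side.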

Not sending a gateway message for a superseded version is consistent with the existing comment in terminatePipelineGwOldUnterminatedPipelinesIfNeeded:

"we don't need to send any event/message to pipeline-gw since the pipeline is loaded based on name on the pipeline-gw side"

Changes

  • scheduler/pkg/server/pipeline_status.go: Add IsLatestVersion guard in sendPipelineEvents. When the terminating version is not the latest: (1) skip sendPipelineStreamsEventMsg so envoy routes for N+1 are preserved; (2) skip sendPipelineEventsToStreamWithTimestamp so the pipeline-gw delete is not sent, and directly transition N to PipelineTerminated.
  • scheduler/pkg/server/pipeline_status_test.go: Three regression tests covering both bugs and verifying the guard does not affect normal pipeline deletion.
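
The guard described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in pipeline_status.go: the types, the handleTerminate function and the isLatest parameter are hypothetical, and the two counters merely stand in for sendPipelineStreamsEventMsg and sendPipelineEventsToStreamWithTimestamp.

```go
package main

import "fmt"

type gwStatus int

const (
	statusTerminate  gwStatus = iota // termination requested
	statusTerminated                 // termination complete
)

// pipelineVersion is a hypothetical, reduced view of one pipeline version.
type pipelineVersion struct {
	Name    string
	Version uint32
	Status  gwStatus
}

// eventSink counts the downstream messages that would have been sent.
type eventSink struct {
	streamsMsgs int // stand-in for the envoy streams event
	gwDeletes   int // stand-in for the PipelineDelete sent to the pipeline-gw
}

// handleTerminate applies the guard: a superseded version is marked
// terminated without emitting either message; the latest version follows
// the normal deletion path. isLatest is an assumed input standing in for
// the IsLatestVersion check named in the PR.
func handleTerminate(pv *pipelineVersion, isLatest bool, sink *eventSink) {
	if !isLatest {
		pv.Status = statusTerminated // nothing to tear down for this version
		return
	}
	sink.streamsMsgs++ // clear envoy routes
	sink.gwDeletes++   // ask the gateway to delete; status changes elsewhere
}

func main() {
	sink := &eventSink{}
	old := &pipelineVersion{Name: "demo", Version: 1, Status: statusTerminate}
	handleTerminate(old, false, sink)
	fmt.Println(sink.streamsMsgs, sink.gwDeletes, old.Status == statusTerminated)
	// prints "0 0 true"
}
```

In this sketch the latest version stays in the terminate state after the messages go out; how and when the real scheduler moves it to PipelineTerminated is outside what the PR text describes.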

Test Plan

  • TestPipelineUpdateDoesNotClearEnvoyRoutes — old version termination does not publish an empty PipelineStreamsEventMsg (Bug 1)
  • TestPipelineUpdateOldVersionDeleteNotSentToGateway — old version termination does not send PipelineDelete to the gateway (Bug 2)
  • TestPipelineDeleteLatestVersionSentToGateway — terminating the latest version (normal deletion) still sends PipelineDelete to the gateway
  • All 37 scheduler packages pass: go test ./...
  • Validated end-to-end in a live cluster: 746 inference requests at 0.5s intervals during a pipeline spec update, zero 5xx responses

When a Pipeline CRD is updated, the scheduler creates a new version (N+1)
alongside the old version (N). Once N+1 is ready, N is terminated. During
this termination, sendPipelineEvents fires with PipelineGwStatus ==
PipelineTerminate for the old version. Two bugs caused the pipeline to
return 5xx errors for the window between termination and cleanup:

Bug 1 — 503 (envoy "no healthy upstream"):
sendPipelineEvents unconditionally called sendPipelineStreamsEventMsg with
an empty StreamNames list, which caused AddPipelineClusters to build an
envoy cluster with zero endpoints. The fix guards this call with an
IsLatestVersion check: if N is not the latest version, skip the envoy
route removal so N+1's routes remain intact.

Bug 2 — 500 (pipeline-gw "pipeline not found"):
The PipelineTerminate switch case still called
sendPipelineEventsToStreamWithTimestamp, which sent a PipelineDelete event
to the pipeline-gw. The pipeline-gw's handleDeletePipeline calls
DeletePipeline(name), which removes the entry keyed by "{name}.pipeline"
from its KafkaManager — the same key held by N+1. This wiped N+1's Kafka
consumer, causing every subsequent request to fail with 500 even though
the CRD showed Ready: True and envoy routes were intact.

The fix extends the IsLatestVersion guard to the PipelineTerminate switch
case: when N is not the latest version, skip the pipeline-gw delete and
directly transition N to PipelineTerminated. This is consistent with the
comment in terminatePipelineGwOldUnterminatedPipelinesIfNeeded: "we don't
need to send any event/message to pipeline-gw since the pipeline is loaded
based on name on the pipeline-gw side."

Fixes: SeldonIO#7072
…ion (SeldonIO#7072)

Adds three tests that cover the race condition fixed in the preceding commit:

TestPipelineUpdateDoesNotClearEnvoyRoutes
  Verifies that terminating an old pipeline version during an update does
  not publish a PipelineStreamsEventMsg with an empty StreamNames list.
  An empty list causes AddPipelineClusters to build a zero-endpoint envoy
  cluster, returning 503 "no healthy upstream" (Bug 1).

TestPipelineUpdateOldVersionDeleteNotSentToGateway
  Verifies that terminating an old pipeline version during an update does
  not send a PipelineDelete operation to the pipeline gateway. A delete
  message causes the pipeline-gw to call DeletePipeline(name), removing
  the KafkaManager entry shared with the new version and breaking
  inference with 500 errors (Bug 2).

TestPipelineDeleteLatestVersionSentToGateway
  Verifies that the IsLatestVersion guard does NOT suppress the delete
  message when the latest (and only) version of a pipeline is terminated.
  This is the normal pipeline deletion path and must continue to work.

Covers: SeldonIO#7072
@CLAassistant

CLAassistant commented Mar 11, 2026

CLA assistant check
All committers have signed the CLA.

Development

Successfully merging this pull request may close these issues.

Pipeline update causes race condition - pipeline removed from envoy despite Ready status