fix(scheduler): prevent 5xx errors during pipeline update (#7072)#7106
Open
jackson-wright wants to merge 2 commits intoSeldonIO:v2from
Open
fix(scheduler): prevent 5xx errors during pipeline update (#7072)#7106jackson-wright wants to merge 2 commits intoSeldonIO:v2from
jackson-wright wants to merge 2 commits intoSeldonIO:v2from
Conversation
) When a Pipeline CRD is updated, the scheduler creates a new version (N+1) alongside the old version (N). Once N+1 is ready, N is terminated. During this termination, sendPipelineEvents fires with PipelineGwStatus == PipelineTerminate for the old version. Two bugs caused the pipeline to return 5xx errors for the window between termination and cleanup: Bug 1 — 503 (envoy "no healthy upstream"): sendPipelineEvents unconditionally called sendPipelineStreamsEventMsg with an empty StreamNames list, which caused AddPipelineClusters to build an envoy cluster with zero endpoints. The fix guards this call with an IsLatestVersion check: if N is not the latest version, skip the envoy route removal so N+1's routes remain intact. Bug 2 — 500 (pipeline-gw "pipeline not found"): The PipelineTerminate switch case still called sendPipelineEventsToStreamWithTimestamp, which sent a PipelineDelete event to the pipeline-gw. The pipeline-gw's handleDeletePipeline calls DeletePipeline(name), which removes the entry keyed by "{name}.pipeline" from its KafkaManager — the same key held by N+1. This wiped N+1's Kafka consumer, causing every subsequent request to fail with 500 even though the CRD showed Ready: True and envoy routes were intact. The fix extends the IsLatestVersion guard to the PipelineTerminate switch case: when N is not the latest version, skip the pipeline-gw delete and directly transition N to PipelineTerminated. This is consistent with the comment in terminatePipelineGwOldUnterminatedPipelinesIfNeeded: "we don't need to send any event/message to pipeline-gw since the pipeline is loaded based on name on the pipeline-gw side." Fixes: SeldonIO#7072
…ion (SeldonIO#7072) Adds three tests that cover the race condition fixed in the preceding commit: TestPipelineUpdateDoesNotClearEnvoyRoutes Verifies that terminating an old pipeline version during an update does not publish a PipelineStreamsEventMsg with an empty StreamNames list. An empty list causes AddPipelineClusters to build a zero-endpoint envoy cluster, returning 503 "no healthy upstream" (Bug 1). TestPipelineUpdateOldVersionDeleteNotSentToGateway Verifies that terminating an old pipeline version during an update does not send a PipelineDelete operation to the pipeline gateway. A delete message causes the pipeline-gw to call DeletePipeline(name), removing the KafkaManager entry shared with the new version and breaking inference with 500 errors (Bug 2). TestPipelineDeleteLatestVersionSentToGateway Verifies that the IsLatestVersion guard does NOT suppress the delete message when the latest (and only) version of a pipeline is terminated. This is the normal pipeline deletion path and must continue to work. Covers: SeldonIO#7072
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a race condition in the scheduler where updating a Pipeline CRD causes the pipeline to return 5xx errors for a window after the update, even though the new version is healthy and the CRD shows
Ready: True.Closes #7072
Root cause
When a Pipeline is updated, the scheduler creates version N+1 alongside the existing version N. Once N+1 is ready,
terminatePipelineGwOldUnterminatedPipelinesIfNeededmarks N'sPipelineGwStatusasPipelineTerminate. This fires asendPipelineEventscall for N, which had two bugs:Bug 1 — 503 "no healthy upstream" (envoy)
sendPipelineEventsunconditionally calledsendPipelineStreamsEventMsgwith an emptyStreamNameslist, causingAddPipelineClusters→CreateClusterForStreamsto build a zero-endpoint envoy cluster. Envoy returns 503 until the next event restores the route.Bug 2 — 500 "pipeline not found" (pipeline-gw)
The
PipelineTerminateswitch case still calledsendPipelineEventsToStreamWithTimestamp, which sent aPipelineDeleteevent to the pipeline gateway. The pipeline-gw'shandleDeletePipelinecallsDeletePipeline(name), which removes the entry keyed by"{name}.pipeline"from itsKafkaManager— the same key held by N+1. After this, every infer request through the pipeline-gw returns 500 becauseLoadOrStorePipeline(name, loadOnly=true)finds nothing.This is consistent with the existing comment in
terminatePipelineGwOldUnterminatedPipelinesIfNeeded:Changes
IsLatestVersionguard insendPipelineEvents. When the terminating version is not the latest: (1) skipsendPipelineStreamsEventMsgso envoy routes for N+1 are preserved; (2) skipsendPipelineEventsToStreamWithTimestampso the pipeline-gw delete is not sent, and directly transition N toPipelineTerminated.Test Plan
TestPipelineUpdateDoesNotClearEnvoyRoutes— old version termination does not publish an emptyPipelineStreamsEventMsg(Bug 1)TestPipelineUpdateOldVersionDeleteNotSentToGateway— old version termination does not sendPipelineDeleteto the gateway (Bug 2)TestPipelineDeleteLatestVersionSentToGateway— terminating the latest version (normal deletion) still sendsPipelineDeleteto the gatewaygo test ./...