[DRAFT] Add physicsShedder filter for latency-aware load shedding#3973

Draft
ivan-digital wants to merge 4 commits into zalando:master from ivan-digital:feature/physics-shedder


@ivan-digital ivan-digital commented Apr 18, 2026

What

New filter physicsShedder that watches latency to catch slow-but-200 backends (gray failures) — the case admissionControl misses because it only looks at error rate.

It learns what "normal" latency looks like for a route and starts rejecting some traffic when latency clearly drifts above that. Composes with admissionControl on the same route — both honor the Admission-Control response header so they don't double-count.

Closes #3828.

Known tradeoff

A sudden latency step (e.g. backend goes 50ms → 100ms) doesn't trigger shedding right away, because the noise estimate spikes with the change and ends up "explaining away" the spike. Confirmed with a local load test against a slow backend.

The filter still catches error bursts and gradual degradation. For the sudden-step case, would love feedback on the right fix — capping the threshold, lowering the multiplier, or bringing back a memory term.

Tests

Unit tests, fuzz tests, randomized invariants, race-checked runs, scenario tests, tracing, and composition with admissionControl, plus a local load script under skptesting/. go test ./... -race passes cleanly. Verified against a running skipper that the metrics endpoint emits the expected metric names.

Introduces physicsShedder, a route-level filter that catches gray
failures (slow-but-200 backends) by modeling incoming traffic as
resistance against a learned baseline.

  R = avgLatency/latencyTarget + errorWeight * errorRate
  threshold = mu + k*sigma   (EWMA baseline + adaptive deviation)
  pReject = max(0, (R - threshold) / R), clamped to 0.95

Complements admissionControl, which reacts only to error rate. Uses
the same Admission-Control header convention so the two filters
compose on one route without double-counting.

Known v1 tradeoff: a sudden step in latency inflates the adaptive
variance, which can keep the threshold above R during the transient
and let the new latency become the baseline. Collect data in
logInactive or inactive mode first; refine threshold formulation
based on feedback.

Closes zalando#3828

Tests include unit math, warmup gate, ring buffer rotation, mode
behavior, pre/post processor, response error counting, metrics
emission, tracing spans, fuzz on the math, randomized invariants,
concurrent hot path, admissionControl chain composition, and a
local load-test script under skptesting/.

Signed-off-by: ivan-digital <root@ivan.digital>
@ivan-digital ivan-digital changed the title Add physicsShedder filter for latency-aware load shedding [DRAFT] Add physicsShedder filter for latency-aware load shedding Apr 18, 2026
ivan-digital added 3 commits April 18, 2026 13:30
Signed-off-by: ivan-digital <root@ivan.digital>
Signed-off-by: ivan-digital <root@ivan.digital>
Signed-off-by: ivan-digital <root@ivan.digital>
Development

Successfully merging this pull request may close these issues.

load shedding: Physics-based concurrency control