
Conversation

DimasKovas
Member

Problem

Currently, the sk migration algorithm keeps retrying requests to an unavailable sk even after a quorum is reached. This slows the algorithm down when an sk is down.

Summary of changes

  • Cancel retries once a quorum of successful responses is reached.
  • Do not retry exclude requests during the finishing stage.
  • Add a test for migrating a timeline away from an unavailable sk.

@DimasKovas DimasKovas requested a review from a team as a code owner July 23, 2025 09:44
@DimasKovas DimasKovas requested a review from arpad-m July 23, 2025 09:44

github-actions bot commented Jul 23, 2025

9064 tests run: 8414 passed, 0 failed, 650 skipped (full report)


Flaky tests (2)

Postgres 17

Postgres 16

Code coverage* (full report)

  • functions: 34.8% (8830 of 25356 functions)
  • lines: 45.9% (71519 of 155858 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
c53b454 at 2025-07-29T09:24:31.611Z :recycle:

Member

@arpad-m arpad-m left a comment


This PR means that we now basically always involve the reconciler, because we immediately cancel once we have responses from a quorum. I am not sure if this is a good idea.

@DimasKovas
Member Author

The cancel token is passed to backoff::retry. Its description says "cancel cancels new attempts and the backoff sleep." So it does not cancel the ongoing operation; it only cancels new retries. We still wait for the ongoing operations to finish.

If all sks are up, then we will make one request to each of them, and wait for all of them to complete.

I think it's a reasonable compromise: if we have already got a quorum of responses, stop starting new retries for the requests that are failing, but wait for the ongoing ones to finish.
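The compromise above can be sketched as a toy model (this is not the actual storcon or backoff::retry code; the structure, names, and numbers are assumptions made for illustration): each per-safekeeper request loops with retries, and a shared flag stops *new* attempts once a quorum of successes is seen, while attempts already in flight are allowed to finish.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Toy quorum demo: 3 simulated safekeepers, one of which (index 2) is
/// "unavailable" and fails every attempt. Returns the number of
/// successful responses collected.
fn run_quorum_demo() -> usize {
    const NUM_SK: usize = 3;
    const QUORUM: usize = 2;
    const MAX_ATTEMPTS: usize = 10;

    // Set once a quorum is reached; only prevents *new* retry attempts.
    let cancel_new_retries = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel::<bool>();

    for sk in 0..NUM_SK {
        let cancel = Arc::clone(&cancel_new_retries);
        let tx = tx.clone();
        thread::spawn(move || {
            for _attempt in 0..MAX_ATTEMPTS {
                // sk 2 simulates an unavailable safekeeper: every attempt fails.
                let ok = sk != 2;
                if ok {
                    let _ = tx.send(true);
                    return;
                }
                // Before starting a new retry, check whether a quorum of
                // successful responses has already been reached.
                if cancel.load(Ordering::SeqCst) {
                    let _ = tx.send(false);
                    return;
                }
                thread::sleep(Duration::from_millis(20)); // backoff sleep
            }
            let _ = tx.send(false); // gave up after MAX_ATTEMPTS
        });
    }
    drop(tx);

    let mut successes = 0;
    for ok in rx {
        if ok {
            successes += 1;
        }
        if successes >= QUORUM {
            // Quorum reached: stop launching new retries, but keep
            // draining the channel so ongoing attempts can report back.
            cancel_new_retries.store(true, Ordering::SeqCst);
        }
    }
    successes
}

fn main() {
    println!("successful responses: {}", run_quorum_demo());
}
```

The flag plays the role of the cancellation token here: the failing request stops retrying early once the quorum exists, instead of burning through its full retry budget against a dead node.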

There is one corner case where it doesn't work as expected: if somehow we didn't start any request to one sk but already got a quorum of responses from the others, then we will not make any requests to that sk, and it will require reconciliation. But this is very unlikely, because the operations perform some network IO.

I agree that it's not very clear from the code that we only cancel retries. I propose renaming the cancel token to cancel_new_retries to make it clearer. WDYT?

@arpad-m
Member

arpad-m commented Jul 29, 2025

Fair point about the cancellation. A future refactor might reasonably turn that cancellation token into one that cancels the entire request/future; we would then be relying on implementation-detail behaviour here.

I wonder if this behaviour should be a parameter, i.e. we turn it on if we know that a safekeeper is offline (say we didn't get heartbeats).
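One shape the suggested parameter could take, as a hypothetical sketch (the function name and the heartbeat plumbing are assumptions, not the real storcon API): gate the early-cancel behaviour on whether the safekeeper has heartbeated recently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: only enable "cancel retries on quorum" for a
/// safekeeper that looks offline, i.e. whose last heartbeat is older
/// than some threshold (or was never received at all).
fn cancel_retries_on_quorum(last_heartbeat: Option<Instant>, offline_after: Duration) -> bool {
    match last_heartbeat {
        // Never heard from this safekeeper: treat it as offline.
        None => true,
        Some(t) => t.elapsed() >= offline_after,
    }
}

fn main() {
    let threshold = Duration::from_secs(10);
    // A safekeeper that heartbeated just now is considered online,
    // so its retries would not be cancelled early.
    let fresh = Some(Instant::now());
    println!("cancel for fresh sk: {}", cancel_retries_on_quorum(fresh, threshold));
    println!("cancel for silent sk: {}", cancel_retries_on_quorum(None, threshold));
}
```

With a gate like this, a healthy-but-slow safekeeper would still get its full retry budget, while a node that stopped heartbeating would not hold the migration back.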

Member

@arpad-m arpad-m left a comment


whatever, let's merge this

@DimasKovas DimasKovas closed this Aug 1, 2025
Development

Successfully merging this pull request may close these issues.

storcon: lower sk migration timeout when one sk is unavailable