
Conversation

DimasKovas
Member

Problem

Currently, the sk migration algorithm keeps retrying requests to an unavailable sk even after a quorum is reached. This slows the algorithm down when an sk is down.

Summary of changes

  • Cancel retries once a quorum of successful responses is reached.
  • Do not retry exclude requests during the finishing stage.
  • Add a test for migrating a timeline away from an unavailable sk.

@DimasKovas DimasKovas requested a review from a team as a code owner July 23, 2025 09:44
@DimasKovas DimasKovas requested a review from arpad-m July 23, 2025 09:44

github-actions bot commented Jul 23, 2025

9064 tests run: 8414 passed, 0 failed, 650 skipped (full report)


Flaky tests (2)

Postgres 17

Postgres 16

Code coverage* (full report)

  • functions: 34.8% (8830 of 25356 functions)
  • lines: 45.9% (71519 of 155858 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
c53b454 at 2025-07-29T09:24:31.611Z :recycle:

Member

@arpad-m arpad-m left a comment


This PR means that we now basically always involve the reconciler, because we immediately cancel once we have responses from a quorum. I am not sure if this is a good idea.

@DimasKovas
Member Author

The cancel token is passed to backoff::retry. Its description says "cancel cancels new attempts and the backoff sleep." So it does not cancel the ongoing operation; it only cancels new retries. We still wait for the ongoing operations to finish.

If all sks are up, then we will make one request to each of them, and wait for all of them to complete.

I think it's a reasonable compromise: if we have already got a quorum of responses, stop starting new retries for the requests that are failing, but wait for the ongoing ones to finish.
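The compromise above can be sketched as a toy model (this is not the actual storcon or backoff::retry code; the structure, names, and numbers are assumptions made for illustration): each per-safekeeper request loops with retries, and a shared flag stops *new* attempts once a quorum of successes is seen, while attempts already in flight are allowed to finish.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Toy quorum demo: 3 simulated safekeepers, one of which (index 2) is
/// "unavailable" and fails every attempt. Returns the number of
/// successful responses collected.
fn run_quorum_demo() -> usize {
    const NUM_SK: usize = 3;
    const QUORUM: usize = 2;
    const MAX_ATTEMPTS: usize = 10;

    // Set once a quorum is reached; only prevents *new* retry attempts.
    let cancel_new_retries = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel::<bool>();

    for sk in 0..NUM_SK {
        let cancel = Arc::clone(&cancel_new_retries);
        let tx = tx.clone();
        thread::spawn(move || {
            for _attempt in 0..MAX_ATTEMPTS {
                // sk 2 simulates an unavailable safekeeper: every attempt fails.
                let ok = sk != 2;
                if ok {
                    let _ = tx.send(true);
                    return;
                }
                // Before starting a new retry, check whether a quorum of
                // successful responses has already been reached.
                if cancel.load(Ordering::SeqCst) {
                    let _ = tx.send(false);
                    return;
                }
                thread::sleep(Duration::from_millis(20)); // backoff sleep
            }
            let _ = tx.send(false); // gave up after MAX_ATTEMPTS
        });
    }
    drop(tx);

    let mut successes = 0;
    for ok in rx {
        if ok {
            successes += 1;
        }
        if successes >= QUORUM {
            // Quorum reached: stop launching new retries, but keep
            // draining the channel so ongoing attempts can report back.
            cancel_new_retries.store(true, Ordering::SeqCst);
        }
    }
    successes
}

fn main() {
    println!("successful responses: {}", run_quorum_demo());
}
```

The flag plays the role of the cancellation token here: the failing request stops retrying early once the quorum exists, instead of burning through its full retry budget against a dead node.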

There is one corner case where it doesn't work as expected: if somehow we didn't start any request to one sk but already got a quorum of responses from the others, then we will not make any requests to that sk, and it will require reconciliation. But this is very unlikely, because the operations perform some network IO.

I agree that it's not very clear from the code that we only cancel retries. I propose renaming the cancel token to cancel_new_retries to make it clearer. WDYT?

@arpad-m
Member

arpad-m commented Jul 29, 2025

Fair point about the cancellation. A future refactor might reasonably turn that cancellation token into one that cancels the entire request/future; we would then be relying on implementation-detail behaviour here.

I wonder if this behaviour should be a parameter, i.e. we turn it on if we know that a safekeeper is offline (say we didn't get heartbeats).
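One shape the suggested parameter could take, as a hypothetical sketch (the function name and the heartbeat plumbing are assumptions, not the real storcon API): gate the early-cancel behaviour on whether the safekeeper has heartbeated recently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: only enable "cancel retries on quorum" for a
/// safekeeper that looks offline, i.e. whose last heartbeat is older
/// than some threshold (or was never received at all).
fn cancel_retries_on_quorum(last_heartbeat: Option<Instant>, offline_after: Duration) -> bool {
    match last_heartbeat {
        // Never heard from this safekeeper: treat it as offline.
        None => true,
        Some(t) => t.elapsed() >= offline_after,
    }
}

fn main() {
    let threshold = Duration::from_secs(10);
    // A safekeeper that heartbeated just now is considered online,
    // so its retries would not be cancelled early.
    let fresh = Some(Instant::now());
    println!("cancel for fresh sk: {}", cancel_retries_on_quorum(fresh, threshold));
    println!("cancel for silent sk: {}", cancel_retries_on_quorum(None, threshold));
}
```

With a gate like this, a healthy-but-slow safekeeper would still get its full retry budget, while a node that stopped heartbeating would not hold the migration back.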

Member

@arpad-m arpad-m left a comment


whatever, let's merge this

@DimasKovas DimasKovas closed this Aug 1, 2025
Development

Successfully merging this pull request may close these issues.

storcon: lower sk migration timeout when one sk is unavailable