Conversation

@tristan957 tristan957 commented Jul 23, 2025

Problem

This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem.

Summary of changes

The previous change added a mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on demand by having `compute_ctl` go out and download a new config from the control plane/HCC. This change wires that mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration.
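
For orientation, here is a minimal sketch of the shape of that refresh path. All names here (`ComputeSpec`, `download_spec`, `reconfigure_postgres`, `refresh_configuration`) are hypothetical stand-ins, not the actual `compute_ctl` API:

```rust
use anyhow::Result;

/// Placeholder for the real compute spec type.
struct ComputeSpec;

/// Stand-in for the control plane / HCC fetch.
async fn download_spec(control_plane_uri: &str) -> Result<ComputeSpec> {
    todo!("HTTP GET the latest compute spec from {control_plane_uri}")
}

/// Stand-in for applying a spec to the running Postgres.
async fn reconfigure_postgres(_spec: &ComputeSpec) -> Result<()> {
    todo!()
}

/// The overall shape of the on-demand refresh: fetch, then apply.
async fn refresh_configuration(control_plane_uri: &str) -> Result<()> {
    let new_spec = download_spec(control_plane_uri).await?;
    reconfigure_postgres(&new_spec).await
}
```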

PG will become suspicious that it is talking to the wrong pageservers in the following situations (sketched in code after the list):

  1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error)
  2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request
  3. It can connect to a pageserver, but the pageserver returns a malformed response
  4. It can connect to a pageserver, but receiving the GetPage response fails for any other reason

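As an illustration of that detection logic, here is a hedged Rust sketch of how the four cases might be classified; the real check lives in PG's pageserver connection code, and the names here (`GetPageOutcome`, `should_request_config_refresh`) are hypothetical:

```rust
/// Hypothetical classification of a GetPage attempt; the in-tree check
/// lives in PG's pageserver connection code, not in this form.
enum GetPageOutcome {
    Ok(Vec<u8>),       // got a valid page back
    ConnectFailed,     // case 1: e.g., network-level connection refused
    EmptyResponse,     // case 2: connected, but no data returned
    MalformedResponse, // case 3: response could not be parsed
    ReceiveError,      // case 4: any other error receiving the response
}

/// Any failed outcome makes PG suspect a stale config and signal
/// compute_ctl to refresh it.
fn should_request_config_refresh(outcome: &GetPageOutcome) -> bool {
    !matches!(outcome, GetPageOutcome::Ok(_))
}
```
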
This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh the PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if it is the same as the old config it is replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations, PG is likely to detect an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In that case, even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impacting PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers.
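
A minimal sketch of that skip logic, assuming (hypothetically) that the spec exposes a pageserver connection string; the real `compute_ctl` types and fields differ:

```rust
/// Hypothetical, trimmed-down spec; the real type has many more fields.
struct ComputeSpec {
    pageserver_connstring: String,
}

enum ApplyDecision {
    /// Same pageservers as before: applying would hang against the same
    /// broken connections, so skip and wait for an actual migration.
    SkipUnchanged,
    /// Pageservers changed: attempt the reconfiguration.
    Reconfigure,
}

fn decide(old: &ComputeSpec, new: &ComputeSpec) -> ApplyDecision {
    if new.pageserver_connstring == old.pageserver_connstring {
        ApplyDecision::SkipUnchanged
    } else {
        ApplyDecision::Reconfigure
    }
}
```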

How is this tested?

The new code paths are exercised in all existing tests because this mechanism is on by default.

Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

@tristan957 tristan957 requested review from a team as code owners July 23, 2025 20:12
@tristan957 tristan957 requested review from dimitri, myrrc and erikgrinaker and removed request for a team July 23, 2025 20:12

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from d2d72eb to fc7d86f on July 23, 2025 20:24
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from a2f80f4 to 3035eba on July 23, 2025 20:30
@tristan957 tristan957 requested a review from a team as a code owner July 23, 2025 20:30
@tristan957 tristan957 requested review from mattpodraza and mtyazici and removed request for a team July 23, 2025 20:30
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch 2 times, most recently from 5835637 to f0e6970 on July 23, 2025 22:10

github-actions bot commented Jul 23, 2025

8987 tests run: 8338 passed, 0 failed, 649 skipped (full report)


Flaky tests (4)

  • Postgres 17
  • Postgres 15
  • Postgres 14

Code coverage* (full report)

  • functions: 34.8% (8796 of 25297 functions)
  • lines: 45.9% (71347 of 155591 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.

bd421c5 at 2025-07-24T16:25:16.129Z ♻️

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from 3035eba to 5df5be4 on July 24, 2025 00:12
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch 2 times, most recently from 27cd677 to 383990b on July 24, 2025 00:14
@mtyazici

A question: how will this help when the control plane returns the same spec that was stored in the last operation for the endpoint? We won't be changing the pageserver information stored there.

@tristan957 tristan957 commented Jul 24, 2025

> A question: how will this help when the control plane returns the same spec that was stored in the last operation for the endpoint? We won't be changing the pageserver information stored there.

This is a commit for the Hadron Compute Manager. It's a cherry-pick from Hadron. See the `!lakebase_mode` checks for no-oping this in Neon.

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from 383990b to 1a7b6b2 Compare July 24, 2025 07:06

@erikgrinaker erikgrinaker left a comment

LGTM.

We're considering doing something similar for Neon, see https://databricks.atlassian.net/browse/LKB-672.

Base automatically changed from tristan957/compute_ctl-request to main July 24, 2025 14:36
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from 1a7b6b2 to bd421c5 on July 24, 2025 14:53
@tristan957 tristan957 added this pull request to the merge queue Jul 24, 2025
Merged via the queue into main with commit 89554af Jul 24, 2025
102 checks passed
@tristan957 tristan957 deleted the tristan957/compute_ctl-pg branch July 24, 2025 16:59