Conversation

@tristan957 tristan957 commented Jul 23, 2025

Problem

This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem.

Summary of changes

The previous change added a mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on demand by having `compute_ctl` go out and download a new config from the control plane/HCC. This change wires that mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration.
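
For orientation, here is a minimal sketch of the shape of that refresh path. All names here (`ComputeSpec`, `download_spec`, `reconfigure_postgres`, `refresh_configuration`) are hypothetical stand-ins, not the actual `compute_ctl` API:

```rust
use anyhow::Result;

/// Placeholder for the real compute spec type.
struct ComputeSpec;

/// Stand-in for the control plane / HCC fetch.
async fn download_spec(control_plane_uri: &str) -> Result<ComputeSpec> {
    todo!("HTTP GET the latest compute spec from {control_plane_uri}")
}

/// Stand-in for applying a spec to the running Postgres.
async fn reconfigure_postgres(_spec: &ComputeSpec) -> Result<()> {
    todo!()
}

/// The overall shape of the on-demand refresh: fetch, then apply.
async fn refresh_configuration(control_plane_uri: &str) -> Result<()> {
    let new_spec = download_spec(control_plane_uri).await?;
    reconfigure_postgres(&new_spec).await
}
```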

PG will become suspicious that it is talking to the wrong pageservers in the following situations (sketched in code after the list):

  1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error)
  2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request
  3. It can connect to a pageserver, but the pageserver returns a malformed response
  4. It can connect to a pageserver, but receiving the GetPage response fails for any other reason

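As an illustration of that detection logic, here is a hedged Rust sketch of how the four cases might be classified; the real check lives in PG's pageserver connection code, and the names here (`GetPageOutcome`, `should_request_config_refresh`) are hypothetical:

```rust
/// Hypothetical classification of a GetPage attempt; the in-tree check
/// lives in PG's pageserver connection code, not in this form.
enum GetPageOutcome {
    Ok(Vec<u8>),       // got a valid page back
    ConnectFailed,     // case 1: e.g., network-level connection refused
    EmptyResponse,     // case 2: connected, but no data returned
    MalformedResponse, // case 3: response could not be parsed
    ReceiveError,      // case 4: any other error receiving the response
}

/// Any failed outcome makes PG suspect a stale config and signal
/// compute_ctl to refresh it.
fn should_request_config_refresh(outcome: &GetPageOutcome) -> bool {
    !matches!(outcome, GetPageOutcome::Ok(_))
}
```
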
This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh the PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if it is the same as the old config it is replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations, PG is likely to detect an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In that case, even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impacting PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers.
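
A minimal sketch of that skip logic, assuming (hypothetically) that the spec exposes a pageserver connection string; the real `compute_ctl` types and fields differ:

```rust
/// Hypothetical, trimmed-down spec; the real type has many more fields.
struct ComputeSpec {
    pageserver_connstring: String,
}

enum ApplyDecision {
    /// Same pageservers as before: applying would hang against the same
    /// broken connections, so skip and wait for an actual migration.
    SkipUnchanged,
    /// Pageservers changed: attempt the reconfiguration.
    Reconfigure,
}

fn decide(old: &ComputeSpec, new: &ComputeSpec) -> ApplyDecision {
    if new.pageserver_connstring == old.pageserver_connstring {
        ApplyDecision::SkipUnchanged
    } else {
        ApplyDecision::Reconfigure
    }
}
```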

How is this tested?

The new code paths are exercised in all existing tests because this mechanism is on by default.

Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

@tristan957 tristan957 requested review from a team as code owners July 23, 2025 20:12
@tristan957 tristan957 requested review from dimitri, myrrc and erikgrinaker and removed request for a team July 23, 2025 20:12

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from d2d72eb to fc7d86f on July 23, 2025 20:24
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from a2f80f4 to 3035eba on July 23, 2025 20:30
@tristan957 tristan957 requested a review from a team as a code owner July 23, 2025 20:30
@tristan957 tristan957 requested review from mattpodraza and mtyazici and removed request for a team July 23, 2025 20:30
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch 2 times, most recently from 5835637 to f0e6970 on July 23, 2025 22:10

github-actions bot commented Jul 23, 2025

8987 tests run: 8338 passed, 0 failed, 649 skipped (full report)


Flaky tests (4)

  • Postgres 17
  • Postgres 15
  • Postgres 14

Code coverage* (full report)

  • functions: 34.8% (8796 of 25297 functions)
  • lines: 45.9% (71347 of 155591 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.

bd421c5 at 2025-07-24T16:25:16.129Z ♻️

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from 3035eba to 5df5be4 on July 24, 2025 00:12
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch 2 times, most recently from 27cd677 to 383990b on July 24, 2025 00:14
@mtyazici

A question: how will this help when the control plane returns the same spec that was stored in the last operation for the endpoint? We won't be changing the pageserver information stored there.

@tristan957 tristan957 commented Jul 24, 2025

> A question: how will this help when the control plane returns the same spec that was stored in the last operation for the endpoint? We won't be changing the pageserver information stored there.

This is a commit for the Hadron Compute Manager. It's a cherry-pick from Hadron. See the `!lakebase_mode` checks for no-oping this in Neon.

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from 383990b to 1a7b6b2 Compare July 24, 2025 07:06

@erikgrinaker erikgrinaker left a comment

LGTM.

We're considering doing something similar for Neon, see https://databricks.atlassian.net/browse/LKB-672.

Base automatically changed from tristan957/compute_ctl-request to main July 24, 2025 14:36
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-pg branch from 1a7b6b2 to bd421c5 on July 24, 2025 14:53
@tristan957 tristan957 added this pull request to the merge queue Jul 24, 2025
Merged via the queue into main with commit 89554af Jul 24, 2025
102 checks passed
@tristan957 tristan957 deleted the tristan957/compute_ctl-pg branch July 24, 2025 16:59