[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs #12712
Conversation
Force-pushed from d2d72eb to fc7d86f
Force-pushed from a2f80f4 to 3035eba
Force-pushed from 5835637 to f0e6970
8987 tests run: 8338 passed, 0 failed, 649 skipped (full report)

Flaky tests (4): Postgres 17, Postgres 15, Postgres 14

Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results.
bd421c5 at 2025-07-24T16:25:16.129Z :recycle:
Force-pushed from 3035eba to 5df5be4
Force-pushed from 27cd677 to 383990b
A question: how will this help when the control plane returns the same spec that was stored in the last operation for the endpoint? We won't be changing the pageserver information stored there.
This is a commit for the Hadron Compute Manager. It's a cherry-pick from Hadron. See the …
Force-pushed from 383990b to 1a7b6b2
LGTM.
We're considering doing something similar for Neon, see https://databricks.atlassian.net/browse/LKB-672.
Force-pushed from 1a7b6b2 to bd421c5
Problem
This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for the full context of the problem.
Summary of changes
The previous change added a mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on demand by having `compute_ctl` go out and download a new config from the control plane/HCC. This change wired this mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration.

PG will become suspicious that it is talking to the wrong pageservers in the following situations (see the sketch after this list):

1. It cannot connect to a pageserver (e.g., it gets a network-level connection-refused error).
2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request.
3. It can connect to a pageserver, but the pageserver returns a malformed response.
4. It can connect to a pageserver, but there is an error receiving the GetPage response for any other reason.
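As a rough illustration, here is a minimal Rust sketch of how these four failure modes could map to a single "request a config refresh" decision. Everything here is hypothetical: the real detection lives in the Postgres fork's pageserver communication code (in C), and the actual signaling to `compute_ctl` is elided.

```rust
/// Hypothetical classification of GetPage failures that make PG suspect
/// a stale pageserver configuration. Names are illustrative only.
#[allow(dead_code)]
#[derive(Debug)]
enum GetPageFailure {
    /// Case 1: network-level failure, e.g. connection refused.
    ConnectFailed,
    /// Case 2: connected, but no data came back for the GetPage request.
    EmptyResponse,
    /// Case 3: connected, but the response could not be parsed.
    MalformedResponse,
    /// Case 4: connected, but receiving the response failed for another reason.
    ReceiveError(String),
}

/// Any of these failure modes may mean PG is talking to the wrong
/// pageservers, so each one triggers a refresh request to compute_ctl.
fn on_getpage_failure(failure: GetPageFailure) {
    eprintln!("GetPage failed ({failure:?}); requesting config refresh");
    // Here PG would signal compute_ctl to re-download the compute spec
    // from the control plane/HCC (signaling mechanism elided).
}

fn main() {
    on_getpage_failure(GetPageFailure::ConnectFailed);
    on_getpage_failure(GetPageFailure::ReceiveError("timeout".into()));
}
```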
This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh the PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if it is the same as the old config it is replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impacting PG's time to recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers.
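A minimal sketch of that skip-if-unchanged check, assuming a hypothetical `ComputeSpec` type that carries only the pageserver connection strings (compute_ctl's real spec type is much richer):

```rust
/// Hypothetical, stripped-down compute spec; only the part relevant to
/// the comparison is shown.
#[derive(PartialEq, Debug)]
struct ComputeSpec {
    /// Pageserver connection strings PG should use.
    pageserver_connstrings: Vec<String>,
}

fn refresh_configuration(current: &mut ComputeSpec, downloaded: ComputeSpec) {
    if *current == downloaded {
        // Applying a config requires working pageserver connections. If
        // the control plane has not yet migrated tenants and updated the
        // spec, reapplying the same config would only hang and delay
        // recovery, so skip it.
        println!("spec unchanged, skipping reconfiguration");
        return;
    }
    // The spec changed (e.g., tenants migrated to new pageservers), so
    // apply it and let PG reconnect to the right pageservers.
    println!(
        "applying new spec: {:?} -> {:?}",
        current.pageserver_connstrings, downloaded.pageserver_connstrings
    );
    *current = downloaded;
}

fn main() {
    let mut current = ComputeSpec {
        pageserver_connstrings: vec!["postgresql://ps-1:6400".into()],
    };
    // Simulate the control plane moving the tenant to another pageserver.
    let downloaded = ComputeSpec {
        pageserver_connstrings: vec!["postgresql://ps-2:6400".into()],
    };
    refresh_configuration(&mut current, downloaded);
}
```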
How is this tested?

The new code paths are exercised in all existing tests because this mechanism is on by default. It is explicitly tested in `test_runner/regress/test_change_pageserver.py`.