[BRC-1778] Add mechanism to compute_ctl to pull a new config #12711
Merged
Conversation
Force-pushed from e359286 to a2f80f4 (Compare)
HaoyuHuang approved these changes on Jul 23, 2025
8987 tests run: 8337 passed, 0 failed, 650 skipped (full report)

Flaky tests (6): Postgres 17, Postgres 16, Postgres 15, Postgres 14

Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results.
5df5be4 at 2025-07-24T01:16:23.237Z :recycle:
Force-pushed from a2f80f4 to 3035eba (Compare)
## Problem

We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to notify the compute node of PS changes is not robust. We decided to pursue a more robust option where the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigures itself if issues are suspected.

## Summary of changes

To support this self-healing reconfiguration mechanism, several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration", where the compute node reaches out to the control plane to pull a new config and reconfigures PG using the new config, instead of listening for a notification message containing a config to arrive from the control plane. Main changes to `compute_ctl` (a rough sketch of the resulting flow follows below):

1. The `compute_ctl` state machine now has a new state, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers.
2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config.
3. The compute node may enter the new `RefreshConfigurationPending` state from the `Running` or `Failed` states. If the configurator manages to configure the compute node successfully, it enters the `Running` state; otherwise, it stays in `RefreshConfigurationPending` and the configurator thread waits for the next notification if an incorrect config is still suspected.
4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch.

The "incorrect config suspected" notification is delivered using an HTTP endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is currently not called by anyone other than the tests. In a follow-up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers.
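To make the state-machine description above concrete, here is a minimal, self-contained Rust sketch of the transitions and the configurator reaction. Only the state names `Running`, `Failed`, and `RefreshConfigurationPending` come from this PR; the helper names (`request_refresh`, `fetch_config_from_control_plane`, `reconfigure_postgres`) and the overall shape are assumptions for illustration and do not mirror the real `compute_ctl` types.

```rust
// Illustrative sketch only; compute_ctl's real state machine and configurator
// are richer. The helper functions below are stand-ins, not real APIs.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    Running,
    Failed,
    RefreshConfigurationPending, // new state added by this PR
}

/// Signal that the compute may be pointing at the wrong pageservers.
/// Only `Running` and `Failed` computes accept the request (point 3 above).
fn request_refresh(status: &mut ComputeStatus) -> bool {
    match *status {
        ComputeStatus::Running | ComputeStatus::Failed => {
            *status = ComputeStatus::RefreshConfigurationPending;
            true
        }
        _ => false,
    }
}

// Stand-ins for pulling a config from the control plane and applying it
// with `pg_ctl reload` (points 2 and 4 above).
fn fetch_config_from_control_plane() -> Result<String, String> {
    Ok(String::from("{ \"pageserver_connstring\": \"...\" }"))
}
fn reconfigure_postgres(_config: String) -> Result<(), String> {
    Ok(())
}

/// One wake-up of the background configurator thread.
fn configurator_iteration(status: &mut ComputeStatus) {
    if *status != ComputeStatus::RefreshConfigurationPending {
        return;
    }
    match fetch_config_from_control_plane().and_then(reconfigure_postgres) {
        // Success: back to Running.
        Ok(()) => *status = ComputeStatus::Running,
        // Failure: stay in RefreshConfigurationPending and wait for the
        // next notification.
        Err(e) => eprintln!("refresh configuration failed: {e}"),
    }
}

fn main() {
    let mut status = ComputeStatus::Running;
    assert!(request_refresh(&mut status));
    configurator_iteration(&mut status);
    assert_eq!(status, ComputeStatus::Running);
}
```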
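The `/refresh_configuration` endpoint itself only needs to record the suspicion and let the background configurator do the actual work. The sketch below is a hypothetical, generic axum handler (assuming the `axum` and `tokio` crates); it is not `compute_ctl`'s actual HTTP wiring, and the response codes are made up for the example.

```rust
// Hypothetical sketch of a /refresh_configuration endpoint; compute_ctl's
// real HTTP server, route wiring, and response codes may differ.
use std::sync::{Arc, Mutex};

use axum::{extract::State, http::StatusCode, routing::post, Router};

#[derive(Clone, Copy)]
enum ComputeStatus {
    Running,
    Failed,
    RefreshConfigurationPending,
}

async fn refresh_configuration(
    State(status): State<Arc<Mutex<ComputeStatus>>>,
) -> StatusCode {
    let mut status = status.lock().unwrap();
    match *status {
        // Accept the hint and let the background configurator do the work.
        ComputeStatus::Running | ComputeStatus::Failed => {
            *status = ComputeStatus::RefreshConfigurationPending;
            // A real implementation would also wake the configurator thread
            // (e.g. by notifying a condvar) at this point.
            StatusCode::ACCEPTED
        }
        // Already refreshing: nothing more to do.
        _ => StatusCode::CONFLICT,
    }
}

#[tokio::main]
async fn main() {
    let status = Arc::new(Mutex::new(ComputeStatus::Running));
    let app = Router::new()
        .route("/refresh_configuration", post(refresh_configuration))
        .with_state(status);
    // Port chosen arbitrarily for the sketch.
    let listener = tokio::net::TcpListener::bind("127.0.0.1:3080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

One nice property of this shape, per the problem statement above, is that the request carries no config payload: the caller merely signals suspicion and the compute pulls the authoritative config itself, rather than trusting whatever a notification happened to contain.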
## How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a scenario where we use the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires sending a full config to `compute_ctl`) to have the compute node reload and reconfigure its pageservers.

I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways:

* The existing test framework is set up to use local config files for compute nodes only, so it's convenient if I just stick with it.
* The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is probably not suitable in tests. We may need to add another test-only endpoint config to the control plane to make this work.

The config-fetch part of the code is relatively straightforward (and well covered in both production and the KIND test), so it is probably fine to replace it with loading from the local config file for these integration tests.

In addition to making sure that the tests pass, I also manually inspected the logs to make sure that the compute node is indeed reloading the config using the new mechanism instead of going down the old `/configure` path (it turns out the test has bugs that cause compute `/configure` messages to be sent despite the test intending to disable/blackhole them).

```test
2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
Force-pushed from 3035eba to 5df5be4 (Compare)
thesuhas approved these changes on Jul 24, 2025