
Conversation

tristan957
Member

Problem

We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to notify the compute node of PS changes is not robust. We decided to pursue a more robust option: the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigures itself if issues are suspected.

Summary of changes

To support this self-healing reconfiguration mechanism, several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration": instead of listening for a notification message containing a config to arrive from the control plane, the compute node reaches out to the control plane to pull a new config and reconfigures PG using it. Main changes to `compute_ctl`:

  1. The `compute_ctl` state machine now has a new state, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers.
  2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config.
  3. The compute node may enter the new `RefreshConfigurationPending` state from the `Running` or `Failed` states. If the configurator manages to configure the compute node successfully, the node enters the `Running` state; otherwise, it stays in `RefreshConfigurationPending` and the configurator thread waits for the next notification if an incorrect config is still suspected (see the sketch after this list).
  4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch.
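
A minimal sketch of the state transition and the configurator wake-up path described in items 1–3. The type and function names (`ComputeStatus`, `ComputeNode`, `fetch_config_and_reload_pg`) are illustrative stand-ins, not the actual `compute_ctl` definitions:

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    Running,
    Failed,
    RefreshConfigurationPending, // the new state added by this PR
}

struct ComputeNode {
    status: Mutex<ComputeStatus>,
    status_changed: Condvar,
}

impl ComputeNode {
    /// Called by the /refresh_configuration HTTP handler.
    fn request_refresh(&self) {
        let mut status = self.status.lock().unwrap();
        // Only Running or Failed may transition into the pending state.
        if matches!(*status, ComputeStatus::Running | ComputeStatus::Failed) {
            *status = ComputeStatus::RefreshConfigurationPending;
        }
        // Wake the configurator even if we were already pending, so a
        // repeated suspicion triggers another fetch attempt.
        self.status_changed.notify_all();
    }

    /// Background configurator thread.
    fn configurator_loop(self: Arc<Self>) {
        let mut status = self.status.lock().unwrap();
        loop {
            // Sleep until someone signals a status change.
            status = self.status_changed.wait(status).unwrap();
            if *status != ComputeStatus::RefreshConfigurationPending {
                continue;
            }
            drop(status);

            // Pull a new config from the control plane and `pg_ctl reload`
            // (elided here); on success go back to Running, on failure stay
            // in RefreshConfigurationPending and wait for the next signal.
            let refreshed = fetch_config_and_reload_pg();

            status = self.status.lock().unwrap();
            if refreshed {
                *status = ComputeStatus::Running;
            }
        }
    }
}

fn fetch_config_and_reload_pg() -> bool {
    // Config fetch + Postgres reload elided in this sketch.
    true
}
```

The key property is that a failed refresh leaves the node parked in `RefreshConfigurationPending` rather than flapping between states, so the next suspicion signal simply retries the fetch.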

The "incorrect config suspected" notification is delivered using a HTTP endpoint, /refresh_configuration, on compute_ctl. This endpoint is currently not called by anyone other than the tests. In a follow up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers.

How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a scenario that uses the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires sending a full config to `compute_ctl`) to have the compute node reload and reconfigure its pageservers.

I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways:

  • The existing test framework is set up to use local config files for compute nodes only, so it's convenient to just stick with that.
  • The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16 GB RAM, and local file caches), which is probably not suitable for tests. We may need to add another test-only endpoint config to the control plane to make the network fetch work in tests.

The config-fetch part of the code is relatively straightforward (and well covered in both production and the KIND test), so it is probably fine to replace it with loading from the local config file in these integration tests.
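
As a rough illustration of that fallback, a sketch of re-reading the endpoint's local spec file in place of the network fetch from the HCC. The struct and its field are hypothetical, heavily trimmed stand-ins for the real compute spec:

```rust
use std::{fs, path::Path};

use anyhow::Result;
use serde::Deserialize;

// Hypothetical stand-in for the real compute spec.
#[derive(Deserialize, Debug)]
struct SpecSketch {
    // e.g. "postgresql://no_user@localhost:15008", as in the log below
    #[serde(default)]
    pageserver_connstring: Option<String>,
}

/// Reload the spec from the local spec.json the test framework writes,
/// taking the place of the config fetch from the control plane.
fn load_local_spec(path: &Path) -> Result<SpecSketch> {
    let bytes = fs::read(path)?;
    Ok(serde_json::from_slice(&bytes)?)
}
```

The path passed in corresponds to what the log below reports (`.../endpoints/ep-1/spec.json`).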

In addition to making sure that the tests pass, I also manually inspected the logs to verify that the compute node is indeed reloading the config via the new mechanism instead of going down the old `/configure` path (it turns out the test had bugs that caused compute `/configure` messages to be sent even though the test intended to disable/blackhole them).

```text
2024-09-24T18:53:29.573650Z  INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z  INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z  INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG:  received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

@tristan957 tristan957 requested review from a team as code owners July 23, 2025 18:53
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch 4 times, most recently from e359286 to a2f80f4 Compare July 23, 2025 19:12

github-actions bot commented Jul 23, 2025

8987 tests run: 8337 passed, 0 failed, 650 skipped (full report)


Flaky tests (6)

Code coverage* (full report)

  • functions: 34.7% (8802 of 25349 functions)
  • lines: 45.8% (71295 of 155692 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
5df5be4 at 2025-07-24T01:16:23.237Z :recycle:

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from a2f80f4 to 3035eba Compare July 23, 2025 20:30
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from 3035eba to 5df5be4 Compare July 24, 2025 00:12
@tristan957 tristan957 added this pull request to the merge queue Jul 24, 2025
Merged via the queue into main with commit 90cd5a5 Jul 24, 2025
101 checks passed
@tristan957 tristan957 deleted the tristan957/compute_ctl-request branch July 24, 2025 14:36