
Conversation

tristan957
Member

Problem

We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to notify the compute node of PS changes is not robust. We decided to pursue a more robust option: the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigures itself if issues are suspected.

Summary of changes

To support this self-healing reconfiguration mechanism, several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration": instead of listening for a notification message containing a config to arrive from the control plane, the compute node reaches out to the control plane to pull a new config and reconfigures PG using it. Main changes to `compute_ctl`:

  1. The `compute_ctl` state machine now has a new state, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers.
  2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config.
  3. The compute node may enter the new `RefreshConfigurationPending` state from the `Running` or `Failed` states. If the configurator manages to configure the compute node successfully, the node enters the `Running` state; otherwise, it stays in `RefreshConfigurationPending` and the configurator thread waits for the next notification if an incorrect config is still suspected (see the sketch after this list).
  4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch.
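
A minimal sketch of the state transition and the configurator wake-up path described in items 1–3. The type and function names (`ComputeStatus`, `ComputeNode`, `fetch_config_and_reload_pg`) are illustrative stand-ins, not the actual `compute_ctl` definitions:

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    Running,
    Failed,
    RefreshConfigurationPending, // the new state added by this PR
}

struct ComputeNode {
    status: Mutex<ComputeStatus>,
    status_changed: Condvar,
}

impl ComputeNode {
    /// Called by the /refresh_configuration HTTP handler.
    fn request_refresh(&self) {
        let mut status = self.status.lock().unwrap();
        // Only Running or Failed may transition into the pending state.
        if matches!(*status, ComputeStatus::Running | ComputeStatus::Failed) {
            *status = ComputeStatus::RefreshConfigurationPending;
        }
        // Wake the configurator even if we were already pending, so a
        // repeated suspicion triggers another fetch attempt.
        self.status_changed.notify_all();
    }

    /// Background configurator thread.
    fn configurator_loop(self: Arc<Self>) {
        let mut status = self.status.lock().unwrap();
        loop {
            // Sleep until someone signals a status change.
            status = self.status_changed.wait(status).unwrap();
            if *status != ComputeStatus::RefreshConfigurationPending {
                continue;
            }
            drop(status);

            // Pull a new config from the control plane and `pg_ctl reload`
            // (elided here); on success go back to Running, on failure stay
            // in RefreshConfigurationPending and wait for the next signal.
            let refreshed = fetch_config_and_reload_pg();

            status = self.status.lock().unwrap();
            if refreshed {
                *status = ComputeStatus::Running;
            }
        }
    }
}

fn fetch_config_and_reload_pg() -> bool {
    // Config fetch + Postgres reload elided in this sketch.
    true
}
```

The key property is that a failed refresh leaves the node parked in `RefreshConfigurationPending` rather than flapping between states, so the next suspicion signal simply retries the fetch.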

The "incorrect config suspected" notification is delivered using a HTTP endpoint, /refresh_configuration, on compute_ctl. This endpoint is currently not called by anyone other than the tests. In a follow up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers.

How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a scenario that uses the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires sending a full config to `compute_ctl`) to have the compute node reload and reconfigure its pageservers.

I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways:

  • The existing test framework is set up to use local config files for compute nodes only, so it's convenient to just stick with that.
  • The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16 GB RAM, and local file caches), which is probably not suitable for tests. We may need to add another test-only endpoint config to the control plane to make the network fetch work in tests.

The config-fetch part of the code is relatively straightforward (and well covered in both production and the KIND test), so it is probably fine to replace it with loading from the local config file in these integration tests.
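
As a rough illustration of that fallback, a sketch of re-reading the endpoint's local spec file in place of the network fetch from the HCC. The struct and its field are hypothetical, heavily trimmed stand-ins for the real compute spec:

```rust
use std::{fs, path::Path};

use anyhow::Result;
use serde::Deserialize;

// Hypothetical stand-in for the real compute spec.
#[derive(Deserialize, Debug)]
struct SpecSketch {
    // e.g. "postgresql://no_user@localhost:15008", as in the log below
    #[serde(default)]
    pageserver_connstring: Option<String>,
}

/// Reload the spec from the local spec.json the test framework writes,
/// taking the place of the config fetch from the control plane.
fn load_local_spec(path: &Path) -> Result<SpecSketch> {
    let bytes = fs::read(path)?;
    Ok(serde_json::from_slice(&bytes)?)
}
```

The path passed in corresponds to what the log below reports (`.../endpoints/ep-1/spec.json`).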

In addition to making sure that the tests pass, I also manually inspected the logs to verify that the compute node is indeed reloading the config via the new mechanism instead of going down the old `/configure` path (it turns out the test had bugs that caused compute `/configure` messages to be sent even though the test intended to disable/blackhole them).

```text
2024-09-24T18:53:29.573650Z  INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z  INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z  INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG:  received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

@tristan957 tristan957 requested review from a team as code owners July 23, 2025 18:53
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch 4 times, most recently from e359286 to a2f80f4 Compare July 23, 2025 19:12

github-actions bot commented Jul 23, 2025

8987 tests run: 8337 passed, 0 failed, 650 skipped (full report)


Flaky tests (6)

Code coverage* (full report)

  • functions: 34.7% (8802 of 25349 functions)
  • lines: 45.8% (71295 of 155692 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
5df5be4 at 2025-07-24T01:16:23.237Z :recycle:

@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from a2f80f4 to 3035eba Compare July 23, 2025 20:30
@tristan957 tristan957 force-pushed the tristan957/compute_ctl-request branch from 3035eba to 5df5be4 Compare July 24, 2025 00:12
@tristan957 tristan957 added this pull request to the merge queue Jul 24, 2025
Merged via the queue into main with commit 90cd5a5 Jul 24, 2025
101 checks passed
@tristan957 tristan957 deleted the tristan957/compute_ctl-request branch July 24, 2025 14:36