Skip to content

Commit fc7d86f

Browse files
[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs
## Problem This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem. ## Summary of changes The previous change added mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on-demand by having `compute_ctl` go out to download a new config from the control plane/HCC. This change wired this mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration. PG will become suspicious that it is talking to the wrong pageservers in the following situations: 1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error) 2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request 3. It can connect to a pageserver, but the pageserver returns a malformed response 4. It can connect to a pageserver, but there is an error receiving the GetPage request response for any other reason This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if the config is the same as the old config is it replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impact PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers. ## How is this tested? The new code paths are exercised in all existing tests because this mechanism is on by default. Explicitly tested in `test_runner/regress/test_change_pageserver.py`. Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
1 parent a2f80f4 commit fc7d86f

File tree

6 files changed

+108
-14
lines changed

6 files changed

+108
-14
lines changed

compute_tools/src/configurator.rs

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,17 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
9191
// node out of the `RefreshConfigurationPending` state. Would be nice if we can encode this invariant
9292
// into the type system.
9393
assert_eq!(state.status, ComputeStatus::RefreshConfigurationPending);
94+
95+
if state.pspec.as_ref().map(|ps| ps.pageserver_connstr.clone())
96+
== Some(pspec.pageserver_connstr.clone())
97+
{
98+
info!("Refresh configuration: Retrieved spec is the same as the current spec. Waiting for control plane to update the spec before attempting reconfiguration.");
99+
state.status = ComputeStatus::Running;
100+
compute.state_changed.notify_all();
101+
drop(state);
102+
std::thread::sleep(std::time::Duration::from_secs(5));
103+
continue;
104+
}
94105
// state.pspec is consumed by compute.reconfigure() below. Note that compute.reconfigure() will acquire
95106
// the compute.state lock again so we need to have the lock guard go out of scope here. We could add a
96107
// "locked" variant of compute.reconfigure() that takes the lock guard as an argument to make this cleaner,

pgxn/neon/extension_server.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
#include "extension_server.h"
1515
#include "neon_utils.h"
1616

17-
static int extension_server_port = 0;
17+
int hadron_extension_server_port = 0;
1818
static int extension_server_request_timeout = 60;
1919
static int extension_server_connect_timeout = 60;
2020

@@ -47,7 +47,7 @@ neon_download_extension_file_http(const char *filename, bool is_library)
4747
curl_easy_setopt(handle, CURLOPT_CONNECTTIMEOUT, (long)extension_server_connect_timeout /* seconds */ );
4848

4949
compute_ctl_url = psprintf("http://localhost:%d/extension_server/%s%s",
50-
extension_server_port, filename, is_library ? "?is_library=true" : "");
50+
hadron_extension_server_port, filename, is_library ? "?is_library=true" : "");
5151

5252
elog(LOG, "Sending request to compute_ctl: %s", compute_ctl_url);
5353

@@ -82,7 +82,7 @@ pg_init_extension_server()
8282
DefineCustomIntVariable("neon.extension_server_port",
8383
"connection string to the compute_ctl",
8484
NULL,
85-
&extension_server_port,
85+
&hadron_extension_server_port,
8686
0, 0, INT_MAX,
8787
PGC_POSTMASTER,
8888
0, /* no flags required */

pgxn/neon/libpagestore.c

Lines changed: 73 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,8 @@
1313
#include <math.h>
1414
#include <sys/socket.h>
1515

16+
#include <curl/curl.h>
17+
1618
#include "libpq-int.h"
1719

1820
#include "access/xlog.h"
@@ -86,6 +88,8 @@ static int pageserver_response_log_timeout = 10000;
8688
/* 2.5 minutes. A bit higher than highest default TCP retransmission timeout */
8789
static int pageserver_response_disconnect_timeout = 150000;
8890

91+
static int conf_refresh_reconnect_attempt_threshold = 16;
92+
8993
typedef struct
9094
{
9195
char connstring[MAX_SHARDS][MAX_PAGESERVER_CONNSTRING_SIZE];
@@ -130,7 +134,7 @@ static uint64 pagestore_local_counter = 0;
130134
typedef enum PSConnectionState {
131135
PS_Disconnected, /* no connection yet */
132136
PS_Connecting_Startup, /* connection starting up */
133-
PS_Connecting_PageStream, /* negotiating pagestream */
137+
PS_Connecting_PageStream, /* negotiating pagestream */
134138
PS_Connected, /* connected, pagestream established */
135139
} PSConnectionState;
136140

@@ -401,7 +405,7 @@ get_shard_number(BufferTag *tag)
401405
}
402406

403407
static inline void
404-
CLEANUP_AND_DISCONNECT(PageServer *shard)
408+
CLEANUP_AND_DISCONNECT(PageServer *shard)
405409
{
406410
if (shard->wes_read)
407411
{
@@ -423,7 +427,7 @@ CLEANUP_AND_DISCONNECT(PageServer *shard)
423427
* complete the connection (e.g. due to receiving an earlier cancellation
424428
* during connection start).
425429
* Returns true if successfully connected; false if the connection failed.
426-
*
430+
*
427431
* Throws errors in unrecoverable situations, or when this backend's query
428432
* is canceled.
429433
*/
@@ -1030,6 +1034,58 @@ pageserver_disconnect_shard(shardno_t shard_no)
10301034
shard->state = PS_Disconnected;
10311035
}
10321036

1037+
// BEGIN HADRON
1038+
/*
1039+
* Nudge compute_ctl to refresh our configuration. Called when we suspect we may be
1040+
* connecting to the wrong pageservers due to a stale configuration.
1041+
*
1042+
* This is a best-effort operation. If we couldn't send the local loopback HTTP request
1043+
* to compute_ctl or if the request fails for any reason, we just log the error and move
1044+
* on.
1045+
*/
1046+
1047+
extern int hadron_extension_server_port;
1048+
1049+
static void
1050+
hadron_request_configuration_refresh() {
1051+
static CURL *handle = NULL;
1052+
CURLcode res;
1053+
char *compute_ctl_url;
1054+
1055+
if (handle == NULL)
1056+
{
1057+
handle = alloc_curl_handle();
1058+
1059+
curl_easy_setopt(handle, CURLOPT_CUSTOMREQUEST, "POST");
1060+
curl_easy_setopt(handle, CURLOPT_TIMEOUT, 3L /* seconds */ );
1061+
curl_easy_setopt(handle, CURLOPT_POSTFIELDS, "");
1062+
}
1063+
1064+
// Set the URL
1065+
compute_ctl_url = psprintf("http://localhost:%d/refresh_configuration", hadron_extension_server_port);
1066+
1067+
1068+
elog(LOG, "Sending refresh configuration request to compute_ctl: %s", compute_ctl_url);
1069+
1070+
curl_easy_setopt(handle, CURLOPT_URL, compute_ctl_url);
1071+
1072+
res = curl_easy_perform(handle);
1073+
if (res != CURLE_OK)
1074+
{
1075+
elog(WARNING, "compute_ctl refresh_configuration request failed: %s\n", curl_easy_strerror(res));
1076+
}
1077+
1078+
// In regular Postgres usage, it is not necessary to manually free memory allocated by palloc (psprintf) because
1079+
// it will be cleaned up after the "memory context" is reset (e.g. after the query or the transaction is finished).
1080+
// However, the number of times this function gets called during a single query/transaction can be unbounded due to
1081+
// the various retry loops around calls to pageservers. Therefore, we need to manually free this memory here.
1082+
if (compute_ctl_url != NULL)
1083+
{
1084+
pfree(compute_ctl_url);
1085+
}
1086+
}
1087+
// END HADRON
1088+
10331089
static bool
10341090
pageserver_send(shardno_t shard_no, NeonRequest *request)
10351091
{
@@ -1064,6 +1120,9 @@ pageserver_send(shardno_t shard_no, NeonRequest *request)
10641120
while (!pageserver_connect(shard_no, shard->n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
10651121
{
10661122
shard->n_reconnect_attempts += 1;
1123+
if (shard->n_reconnect_attempts > conf_refresh_reconnect_attempt_threshold) {
1124+
hadron_request_configuration_refresh();
1125+
}
10671126
}
10681127
shard->n_reconnect_attempts = 0;
10691128
} else {
@@ -1171,6 +1230,7 @@ pageserver_receive(shardno_t shard_no)
11711230
pfree(msg);
11721231
pageserver_disconnect(shard_no);
11731232
resp = NULL;
1233+
hadron_request_configuration_refresh();
11741234
}
11751235
else if (rc == -2)
11761236
{
@@ -1460,6 +1520,16 @@ pg_init_libpagestore(void)
14601520
PGC_SU_BACKEND,
14611521
0, /* no flags required */
14621522
NULL, NULL, NULL);
1523+
DefineCustomIntVariable("hadron.conf_refresh_reconnect_attempt_threshold",
1524+
"Threshold of the number of consecutive failed pageserver "
1525+
"connection attempts (per shard) before signaling "
1526+
"compute_ctl for a configuration refresh.",
1527+
NULL,
1528+
&conf_refresh_reconnect_attempt_threshold,
1529+
16, 0, INT_MAX,
1530+
PGC_USERSET,
1531+
0,
1532+
NULL, NULL, NULL);
14631533

14641534
DefineCustomIntVariable("neon.pageserver_response_log_timeout",
14651535
"pageserver response log timeout",

test_runner/fixtures/neon_fixtures.py

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4935,9 +4935,19 @@ def reconfigure(self, pageserver_id: int | None = None, safekeepers: list[int] |
49354935
# in the following commands.
49364936
if safekeepers is not None:
49374937
self.active_safekeepers = safekeepers
4938-
self.env.neon_cli.endpoint_reconfigure(
4939-
self.endpoint_id, self.tenant_id, pageserver_id, self.active_safekeepers
4940-
)
4938+
4939+
start_time = time.time()
4940+
while True:
4941+
try:
4942+
self.env.neon_cli.endpoint_reconfigure(
4943+
self.endpoint_id, self.tenant_id, pageserver_id, self.active_safekeepers
4944+
)
4945+
return
4946+
except RuntimeError as e:
4947+
if time.time() - start_time > 120:
4948+
raise e
4949+
log.warning(f"Reconfigure failed with error: {e}. Retrying...")
4950+
time.sleep(5)
49414951

49424952
def refresh_configuration(self):
49434953
assert self.endpoint_id is not None

test_runner/fixtures/workload.py

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,9 @@ def reconfigure(self) -> None:
7878
"""
7979
if self._endpoint is not None:
8080
with ENDPOINT_LOCK:
81+
# It's important that we update config.json before issuing the reconfigure request to make sure
82+
# that PG-initiated spec refresh doesn't mess things up by reverting to the old spec.
83+
self._endpoint.update_pageservers_in_config()
8184
self._endpoint.reconfigure()
8285

8386
def endpoint(self, pageserver_id: int | None = None) -> Endpoint:
@@ -97,10 +100,10 @@ def endpoint(self, pageserver_id: int | None = None) -> Endpoint:
97100
self._endpoint.start(pageserver_id=pageserver_id)
98101
self._configured_pageserver = pageserver_id
99102
else:
100-
if self._configured_pageserver != pageserver_id:
101-
self._configured_pageserver = pageserver_id
102-
self._endpoint.reconfigure(pageserver_id=pageserver_id)
103-
self._endpoint_config = pageserver_id
103+
# It's important that we update config.json before issuing the reconfigure request to make sure
104+
# that PG-initiated spec refresh doesn't mess things up by reverting to the old spec.
105+
self._endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)
106+
self._endpoint.reconfigure(pageserver_id=pageserver_id)
104107

105108
connstring = self._endpoint.safe_psql(
106109
"SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'"

test_runner/regress/test_change_pageserver.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@ def reconfigure_endpoint(endpoint: Endpoint, pageserver_id: int, use_explicit_re
1717
# to make sure that PG-initiated config refresh doesn't mess things up by reverting to the old config.
1818
endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)
1919

20-
# PG will eventually automatically refresh its configuration if it detects connectivity issues with pageservers.
20+
# PG will automatically refresh its configuration if it detects connectivity issues with pageservers.
2121
# We also allow the test to explicitly request a reconfigure so that the test can be sure that the
2222
# endpoint is running with the latest configuration.
2323
#

0 commit comments

Comments
 (0)