Conversation

thesuhas
Contributor

Problem

Commit https://github.com/databricks-eng/hadron/commit/e69c3d632b3919e79c2a1efb8f51d2230f0e137a added metrics (used for alerting) to indicate whether Safekeepers are operating with a degraded quorum due to some of them being down. However, even if all SKs are active/reachable, we probably still want to raise an alert if some of them are really slow or otherwise lagging behind, as it is technically still a "degraded quorum" situation.

Summary of changes

Added a new field `max_active_safekeeper_commit_lag` to the `neon_perf_counters` view that reports the lag between the most advanced and most lagging commit LSNs among active Safekeepers.

Commit LSNs are received from `AppendResponse` messages from SKs and recorded in the `WalProposer`'s shared memory state.

Note that this lag is calculated among active SKs only, to keep the alert clean: if some SKs are inactive, the previously added `num_active_safekeepers` metric will capture that instead.
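
The lag computation described above can be sketched as follows. This is an illustrative Python sketch only (the actual implementation lives in the C walproposer and reads commit LSNs from shared memory); the function name and data layout are hypothetical.

```python
def max_active_commit_lag(safekeepers):
    """Compute the commit-LSN lag among active safekeepers only.

    safekeepers: list of (active, commit_lsn) pairs, one per configured SK.
    Inactive SKs are excluded so this metric stays orthogonal to the
    existing active/configured safekeeper count metrics.
    """
    active_lsns = [lsn for active, lsn in safekeepers if active]
    if len(active_lsns) < 2:
        return 0  # lag is not meaningful with fewer than two active SKs
    return max(active_lsns) - min(active_lsns)
```

For example, with two active SKs at commit LSNs 100 and 90 and one inactive SK far behind at 10, the reported lag is 10, not 90.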

Note: @chen-luo_data pointed out during the PR review that we could probably get the same benefit from a PromQL query over the metrics already exported by SKs, e.g. `max(safekeeper_commit_lsn) by (tenant_id, timeline_id) - min(safekeeper_commit_lsn) by (tenant_id, timeline_id)`.

Given that this code is already written, @haoyu-huang_data suggested that I check in the change anyway, since based on our prior experience the reliability of Prometheus metrics (and especially of the aggregation operators when the result-set cardinality is high) is somewhat questionable.

How is this tested?

Added integration test `test_wal_acceptor.py::test_max_active_safekeeper_commit_lag`.

thesuhas and others added 14 commits July 24, 2025 13:41
… walproposer (#895)

Data corruptions are typically detected on the pageserver side when it
replays WAL records. However, since PS doesn't synchronously replay WAL
records as they are being ingested through safekeepers, we need some
extra plumbing to feed information about pageserver-detected corruptions
during compaction (and/or WAL redo in general) back to SK and PG for
proper action.

We don't yet know what actions PG/SK should take upon receiving the
signal, but we should have the detection and feedback in place.

Add an extra `corruption_detected` field to the `PageserverFeedback`
message that is sent from PS -> SK -> PG. It's a boolean value that is
set to true when PS detects a "critical error" that signals data
corruption, and it's sent in all `PageserverFeedback` messages. Upon
receiving this signal, the safekeeper raises a
`safekeeper_ps_corruption_detected` gauge metric (value set to 1). The
safekeeper then forwards this signal to PG where a
`ps_corruption_detected` gauge metric (value also set to 1) is raised in
the `neon_perf_counters` view.
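
The feedback path above can be illustrated with a small Python sketch. The class and field names mirror the PR description, but this is not the actual safekeeper/extension code; in particular, whether the gauge is ever reset after corruption is detected is not specified in the PR, and the sketch assumes it latches at 1.

```python
from dataclasses import dataclass


@dataclass
class PageserverFeedback:
    # New boolean field carried in every feedback message, PS -> SK -> PG.
    corruption_detected: bool


class CorruptionGauge:
    """Gauge that is raised to 1 when any feedback reports corruption.

    Assumption: once raised, the gauge stays at 1 (no reset path shown).
    """

    def __init__(self):
        self.value = 0

    def observe(self, fb: PageserverFeedback):
        if fb.corruption_detected:
            self.value = 1  # raise the gauge; never lowered in this sketch
```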

Added an integration test in
`test_compaction.py::test_ps_corruption_detection_feedback` that
confirms that the safekeeper and PG can receive the data corruption
signal in the `PageserverFeedback` message in a simulated data
corruption.
Today we don't have any indications (other than spammy logs in PG that
nobody monitors) if the Walproposer in PG cannot connect to/get votes
from all Safekeepers. This means we don't have signals indicating that
the Safekeepers are operating at degraded redundancy. We need these
signals.

Added plumbing in PG extension so that the `neon_perf_counters` view
exports the following gauge metrics on safekeeper health:
- `num_configured_safekeepers`: The total number of safekeepers
configured in PG.
- `num_active_safekeepers`: The number of safekeepers that PG is
actively streaming WAL to.

An alert should be raised whenever `num_active_safekeepers` <
`num_configured_safekeepers`.

The metrics are implemented by adding additional state to the
Walproposer shared memory keeping track of the active statuses of
safekeepers using a simple array. The status of the safekeeper is set to
active (1) after the Walproposer acquires a quorum and starts streaming
data to the safekeeper, and is set to inactive (0) when the connection
with a safekeeper is shut down. We scan the safekeeper status array in
Walproposer shared memory when collecting the metrics to produce results
for the gauges.
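
As an illustration of the counting logic (the real implementation is C code scanning the Walproposer shared-memory status array; the function below is a hypothetical sketch):

```python
def safekeeper_gauges(status_array):
    """Derive the two gauges from the per-safekeeper status array.

    status_array: one entry per configured SK; 1 = actively streaming WAL,
    0 = connection shut down / not yet streaming.
    """
    return {
        "num_configured_safekeepers": len(status_array),
        "num_active_safekeepers": sum(1 for s in status_array if s == 1),
    }
```

With this shape, the alerting condition is simply `num_active_safekeepers < num_configured_safekeepers`.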

Added coverage for the metrics to integration test
`test_wal_acceptor.py::test_timeline_disk_usage_limit`.

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@thesuhas thesuhas marked this pull request as ready for review July 25, 2025 20:47
@thesuhas thesuhas requested review from a team as code owners July 25, 2025 20:47
@thesuhas thesuhas requested review from dimitri, knizhnik, skyzh and HaoyuHuang and removed request for a team and knizhnik July 25, 2025 20:47
@thesuhas thesuhas requested a review from tristan957 July 25, 2025 20:47
Member

@skyzh skyzh left a comment

rest LGTM :)


github-actions bot commented Jul 25, 2025

9130 tests run: 8477 passed, 0 failed, 653 skipped (full report)


Flaky tests (3)

Postgres 17

Postgres 15

Code coverage* (full report)

  • functions: 34.7% (8840 of 25484 functions)
  • lines: 45.7% (71640 of 156721 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bf6f780 at 2025-07-31T15:53:26.016Z :recycle:

Base automatically changed from thesuhas/brc-3051 to main July 30, 2025 15:25
@thesuhas
Copy link
Contributor Author

Closing as re-opened in hadron

@thesuhas thesuhas closed this Jul 31, 2025