Conversation

thesuhas
Contributor

Problem

Commit https://github.com/databricks-eng/hadron/commit/e69c3d632b3919e79c2a1efb8f51d2230f0e137a added metrics (used for alerting) to indicate whether Safekeepers are operating with a degraded quorum due to some of them being down. However, even if all SKs are active/reachable, we probably still want to raise an alert if some of them are really slow or otherwise lagging behind, as it is technically still a "degraded quorum" situation.

Summary of changes

Added a new field `max_active_safekeeper_commit_lag` to the `neon_perf_counters` view that reports the lag between the most advanced and most lagging commit LSNs among active Safekeepers.

Commit LSNs are received from `AppendResponse` messages from SKs and recorded in the `WalProposer`'s shared memory state.

Note that this lag is calculated among active SKs only, to keep the alert clean: if some SKs are inactive, the previously added `num_active_safekeepers` metric will capture that instead.
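
The lag computation described above can be sketched as follows. This is an illustrative Python sketch only (the actual implementation lives in the C walproposer and reads commit LSNs from shared memory); the function name and data layout are hypothetical.

```python
def max_active_commit_lag(safekeepers):
    """Compute the commit-LSN lag among active safekeepers only.

    safekeepers: list of (active, commit_lsn) pairs, one per configured SK.
    Inactive SKs are excluded so this metric stays orthogonal to the
    existing active/configured safekeeper count metrics.
    """
    active_lsns = [lsn for active, lsn in safekeepers if active]
    if len(active_lsns) < 2:
        return 0  # lag is not meaningful with fewer than two active SKs
    return max(active_lsns) - min(active_lsns)
```

For example, with two active SKs at commit LSNs 100 and 90 and one inactive SK far behind at 10, the reported lag is 10, not 90.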

Note: @chen-luo_data pointed out during the PR review that we could probably get the same benefit from a PromQL query over the metrics already exported by SKs, e.g. `max(safekeeper_commit_lsn) by (tenant_id, timeline_id) - min(safekeeper_commit_lsn) by (tenant_id, timeline_id)`.

Given that this code is already written, @haoyu-huang_data suggested that I check in the change anyway, since based on our prior experience the reliability of Prometheus metrics (and especially of the aggregation operators when the result-set cardinality is high) is somewhat questionable.

How is this tested?

Added integration test `test_wal_acceptor.py::test_max_active_safekeeper_commit_lag`.

thesuhas and others added 14 commits July 24, 2025 13:41
… walproposer (#895)

Data corruptions are typically detected on the pageserver side when it
replays WAL records. However, since PS doesn't synchronously replay WAL
records as they are being ingested through safekeepers, we need some
extra plumbing to feed information about pageserver-detected corruptions
during compaction (and/or WAL redo in general) back to SK and PG for
proper action.

We don't yet know what actions PG/SK should take upon receiving the
signal, but we should have the detection and feedback in place.

Add an extra `corruption_detected` field to the `PageserverFeedback`
message that is sent from PS -> SK -> PG. It's a boolean value that is
set to true when PS detects a "critical error" that signals data
corruption, and it's sent in all `PageserverFeedback` messages. Upon
receiving this signal, the safekeeper raises a
`safekeeper_ps_corruption_detected` gauge metric (value set to 1). The
safekeeper then forwards this signal to PG where a
`ps_corruption_detected` gauge metric (value also set to 1) is raised in
the `neon_perf_counters` view.
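
The feedback path above can be illustrated with a small Python sketch. The class and field names mirror the PR description, but this is not the actual safekeeper/extension code; in particular, whether the gauge is ever reset after corruption is detected is not specified in the PR, and the sketch assumes it latches at 1.

```python
from dataclasses import dataclass


@dataclass
class PageserverFeedback:
    # New boolean field carried in every feedback message, PS -> SK -> PG.
    corruption_detected: bool


class CorruptionGauge:
    """Gauge that is raised to 1 when any feedback reports corruption.

    Assumption: once raised, the gauge stays at 1 (no reset path shown).
    """

    def __init__(self):
        self.value = 0

    def observe(self, fb: PageserverFeedback):
        if fb.corruption_detected:
            self.value = 1  # raise the gauge; never lowered in this sketch
```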

Added an integration test in
`test_compaction.py::test_ps_corruption_detection_feedback` that
confirms that the safekeeper and PG can receive the data corruption
signal in the `PageserverFeedback` message in a simulated data
corruption.
Today we don't have any indications (other than spammy logs in PG that
nobody monitors) if the Walproposer in PG cannot connect to/get votes
from all Safekeepers. This means we don't have signals indicating that
the Safekeepers are operating at degraded redundancy. We need these
signals.

Added plumbing in PG extension so that the `neon_perf_counters` view
exports the following gauge metrics on safekeeper health:
- `num_configured_safekeepers`: The total number of safekeepers
configured in PG.
- `num_active_safekeepers`: The number of safekeepers that PG is
actively streaming WAL to.

An alert should be raised whenever `num_active_safekeepers` <
`num_configured_safekeepers`.

The metrics are implemented by adding additional state to the
Walproposer shared memory keeping track of the active statuses of
safekeepers using a simple array. The status of the safekeeper is set to
active (1) after the Walproposer acquires a quorum and starts streaming
data to the safekeeper, and is set to inactive (0) when the connection
with a safekeeper is shut down. We scan the safekeeper status array in
Walproposer shared memory when collecting the metrics to produce results
for the gauges.
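
As an illustration of the counting logic (the real implementation is C code scanning the Walproposer shared-memory status array; the function below is a hypothetical sketch):

```python
def safekeeper_gauges(status_array):
    """Derive the two gauges from the per-safekeeper status array.

    status_array: one entry per configured SK; 1 = actively streaming WAL,
    0 = connection shut down / not yet streaming.
    """
    return {
        "num_configured_safekeepers": len(status_array),
        "num_active_safekeepers": sum(1 for s in status_array if s == 1),
    }
```

With this shape, the alerting condition is simply `num_active_safekeepers < num_configured_safekeepers`.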

Added coverage for the metrics to integration test
`test_wal_acceptor.py::test_timeline_disk_usage_limit`.

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@thesuhas thesuhas marked this pull request as ready for review July 25, 2025 20:47
@thesuhas thesuhas requested review from a team as code owners July 25, 2025 20:47
@thesuhas thesuhas requested review from dimitri, knizhnik, skyzh and HaoyuHuang and removed request for a team and knizhnik July 25, 2025 20:47
@thesuhas thesuhas requested a review from tristan957 July 25, 2025 20:47
Member

@skyzh skyzh left a comment

rest LGTM :)


github-actions bot commented Jul 25, 2025

9130 tests run: 8477 passed, 0 failed, 653 skipped (full report)


Flaky tests (3)

Postgres 17

Postgres 15

Code coverage* (full report)

  • functions: 34.7% (8840 of 25484 functions)
  • lines: 45.7% (71640 of 156721 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bf6f780 at 2025-07-31T15:53:26.016Z :recycle:

Base automatically changed from thesuhas/brc-3051 to main July 30, 2025 15:25
@thesuhas
Copy link
Contributor Author

Closing as re-opened in hadron

@thesuhas thesuhas closed this Jul 31, 2025