[CORE-14831] Cloud Topics: Metrics for L0 GC #29556

Open
oleiman wants to merge 9 commits into redpanda-data:dev from oleiman:ct/core-14831/gc-metrics

@oleiman oleiman commented Feb 6, 2026

Grab bag of metrics for L0 GC. Some might not be that useful.

  • list errors (count)
  • delete errors (count)
  • epoch lag (gauge) (sort of...fuzzy)
  • max deleted epoch (gauge)
  • objects listed (count)
  • objects skipped that were ineligible due to epoch (count)
  • objects skipped that were ineligible due to age (count)
  • total collection rounds (count)
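
To make the distinction between the listed, skipped-by-epoch, and skipped-by-age counters concrete, here is a minimal Python sketch of how one collection round might bump them. It is not the probe code from this PR: `ListedObject`, `classify_round`, the counter names, the `min_age_seconds` threshold, and the order of the two eligibility checks are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ListedObject:
    """Hypothetical stand-in for one L0 object returned by a list call."""

    epoch: int
    age_seconds: float


def classify_round(
    objects: Iterable[ListedObject],
    max_eligible_epoch: int,
    min_age_seconds: float,
) -> Counter:
    """Count what one GC round would list, skip, and consider eligible.

    Assumption: an object is eligible when its epoch is at or below the
    max eligible epoch AND it is at least min_age_seconds old. The epoch
    check runs first, so an object failing both is counted under epoch.
    """
    counts: Counter = Counter()
    counts["collection_rounds"] += 1
    for obj in objects:
        counts["objects_listed"] += 1
        if obj.epoch > max_eligible_epoch:
            counts["skipped_epoch"] += 1
        elif obj.age_seconds < min_age_seconds:
            counts["skipped_age"] += 1
        else:
            counts["eligible"] += 1
    return counts
```

Under these assumptions every listed object falls into exactly one of the three buckets, so `objects_listed` always equals the sum of the two skip counters plus the eligible count.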

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@oleiman oleiman self-assigned this Feb 6, 2026
@oleiman oleiman force-pushed the ct/core-14831/gc-metrics branch 2 times, most recently from 2c6d262 to 089091b Compare February 6, 2026 03:37
@oleiman oleiman marked this pull request as ready for review February 6, 2026 04:29
Copilot AI review requested due to automatic review settings February 6, 2026 04:29

Copilot AI left a comment

Pull request overview

This PR adds comprehensive metrics instrumentation for the L0 garbage collection (GC) process in cloud topics. The changes enable monitoring of GC progress, performance, and operational health.

Changes:

  • Added metrics tracking for GC operations including objects listed, skipped, deleted, collection rounds, and error counts
  • Implemented epoch-based lag tracking to monitor GC progress relative to eligible epochs
  • Added integration tests to verify metrics are correctly reported during GC operations

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/rptest/tests/cloud_topics/l0_gc_test.py | Adds new test class with methods to verify GC metrics are properly reported |
| src/v/cloud_topics/level_zero/gc/level_zero_gc_probe.h | Declares new metric tracking methods and internal state for epoch and operation counters |
| src/v/cloud_topics/level_zero/gc/level_zero_gc_probe.cc | Implements metric registration and epoch lag calculation logic |
| src/v/cloud_topics/level_zero/gc/level_zero_gc.cc | Instruments GC operations with probe calls to track metrics |
| src/v/cloud_topics/level_zero/gc/BUILD | Adds dependency on cloud_topics types for epoch handling |

@oleiman oleiman force-pushed the ct/core-14831/gc-metrics branch from 089091b to 895b9f5 Compare February 6, 2026 04:50
@oleiman oleiman requested a review from Copilot February 6, 2026 04:51

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

@oleiman oleiman marked this pull request as draft February 6, 2026 04:53
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ct/core-14831/gc-metrics branch from 895b9f5 to 37d46a7 Compare February 6, 2026 05:01
@oleiman oleiman marked this pull request as ready for review February 6, 2026 05:01
Given the global max GC-eligible epoch and a shard doing garbage collection,
define that shard's epoch lag as the difference between the max eligible epoch
and the oldest epoch the shard is still working on.

"Oldest epoch" is a bit subtle. To a first approximation, at a given time for
a given shard, the oldest epoch not yet collected is the numerically smallest
eligible epoch found in list results in the bucket. Since that value should
increase monotonically as collection progresses, an all-time minimum would
never advance, so instead we define the point-in-time min deletion epoch as
the minimum epoch found in a single round of garbage collection, and we reset
that value on each trip through the GC loop. This way, the tracked value has
an opportunity to inch forward once all the objects of a given epoch have been
collected.
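
The per-round reset described above can be sketched in a few lines of Python. This is an illustrative model, not the C++ probe from this PR: the `EpochLagTracker` class, its method names, and the choice to report a lag of 0 when a round lists no eligible objects are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpochLagTracker:
    """Hypothetical per-shard tracker for the point-in-time min epoch."""

    max_eligible_epoch: int = 0
    round_min_epoch: Optional[int] = None  # None until an eligible object is seen

    def start_round(self, max_eligible_epoch: int) -> None:
        # Reset the minimum on every trip through the GC loop so the
        # tracked value can move forward once an old epoch is fully collected.
        self.max_eligible_epoch = max_eligible_epoch
        self.round_min_epoch = None

    def observe(self, epoch: int) -> None:
        # Only eligible epochs contribute to the per-round minimum.
        if epoch > self.max_eligible_epoch:
            return
        if self.round_min_epoch is None or epoch < self.round_min_epoch:
            self.round_min_epoch = epoch

    def lag(self) -> int:
        # Assumption: with nothing eligible listed this round, report no lag.
        if self.round_min_epoch is None:
            return 0
        return self.max_eligible_epoch - self.round_min_epoch
```

For example, a round that starts with max eligible epoch 10 and lists objects at epochs 7, 3, 12, and 5 yields a lag of 7 (epoch 12 is ineligible, so the minimum is 3). If the next round starts at max eligible epoch 11 and the epoch 3 and 5 objects are gone, listing 7 and 12 yields a lag of 4, which is the "inching forward" behaviour the commit message describes.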

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ct/core-14831/gc-metrics branch from 37d46a7 to 0efd5e5 Compare February 6, 2026 17:07
@oleiman oleiman added the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Feb 6, 2026
@vbotbuildovich

CI test results

test results on build #80319

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClusterQuotaPartitionMutationTest | test_partition_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/80319#019c33fb-81d0-4081-975d-acfce3227d5a | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0026, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterQuotaPartitionMutationTest&test_method=test_partition_throttle_mechanism |

Labels

area/build, area/redpanda, claude-review
