
[CORE 15747] Partial solution to feature gating cloud topics #29817

Merged
piyushredpanda merged 4 commits into redpanda-data:dev from dotnwat:CORE-15747 on Mar 15, 2026

@dotnwat dotnwat commented Mar 12, 2026

  • Feature gates cloud topics at the Kafka create topic API boundary
  • Gating the subsystem startup is a bit more complicated (follow up)
  • Gating at create time catches 99% of the cases we care about

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Copilot AI review requested due to automatic review settings March 12, 2026 23:21

Copilot AI left a comment

Pull request overview

Adds a new cluster feature flag for “cloud topics” and wires feature-table awareness into Kafka CreateTopics validation so that redpanda.storage.mode=cloud is rejected unless the cloud_topics feature is active.

Changes:

  • Introduces features::feature::cloud_topics and registers it in the feature schema + to_string_view.
  • Extends topic request validation plumbing to pass a features::feature_table* into validators.
  • Gates redpanda.storage.mode=cloud in storage_mode_config_validator on feature::cloud_topics being active, and updates impacted unit tests/build deps.
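
The gating described in this overview can be sketched in isolation. What follows is a minimal, self-contained model and not the Redpanda source: `feature_table`, `storage_mode`, and `validate_storage_mode` are simplified stand-ins invented for illustration, the config flag is passed in as a plain `bool` instead of being read from the cluster config, and non-cloud modes are accepted unconditionally, which is a simplification.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-ins for the real types; names are illustrative only.
enum class feature { cloud_topics };
enum class storage_mode { local, tiered, cloud };

class feature_table {
public:
    void activate(feature f) { _active.insert(f); }
    bool is_active(feature f) const { return _active.count(f) > 0; }

private:
    std::set<feature> _active;
};

// Returns true when a create-topic request with the given storage mode
// should be accepted. A null feature table is treated as "feature inactive".
inline bool validate_storage_mode(
  storage_mode mode, const feature_table* ft, bool cloud_topics_enabled) {
    switch (mode) {
    case storage_mode::cloud:
        if (ft == nullptr || !ft->is_active(feature::cloud_topics)) {
            return false;
        }
        return cloud_topics_enabled;
    case storage_mode::local:
    case storage_mode::tiered:
        // Simplification: this sketch does not model checks for other modes.
        return true;
    }
    return false;
}
```

Treating a null pointer as "inactive" means that a call site with no feature table available fails closed for cloud mode and is unaffected for other modes.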

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/v/kafka/server/tests/validator_tests.cc | Updates validator test callsites for new is_valid(..., feature_table*) signature. |
| src/v/kafka/server/tests/topic_utils_test.cc | Updates test validators to accept feature_table* to match new predicate shape. |
| src/v/kafka/server/handlers/topics/validators.h | Changes validator concept/signatures and adds feature-gating for cloud storage mode. |
| src/v/kafka/server/handlers/topics/topic_utils.h | Threads feature_table* through validate_requests_range predicate/validator execution. |
| src/v/kafka/server/handlers/create_topics.cc | Passes feature table (when available) into request validation. |
| src/v/kafka/server/BUILD | Adds //src/v/features dependency to support new includes. |
| src/v/features/feature_table.h | Adds cloud_topics feature bit and schema entry. |
| src/v/features/feature_table.cc | Adds stringification for feature::cloud_topics. |

Comment on lines 569 to 575
    case model::redpanda_storage_mode::cloud:
        if (
          ft == nullptr
          || !ft->is_active(features::feature::cloud_topics)) {
            return false;
        }
        return config::shard_local_cfg().cloud_topics_enabled();
Copilot AI Mar 12, 2026

The new cloud_topics feature gate logic in storage_mode_config_validator is not covered by tests (e.g., a create-topics request with redpanda.storage.mode=cloud should fail when the feature is inactive and succeed when it is active, assuming config allows it). Adding a focused unit/integration test would help prevent regressions in the gating behavior.

Member Author

you're cute

Member Author

added a ducktape test

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

vbotbuildovich commented Mar 13, 2026

CI test results

test results on build#81723
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CloudTopicsL0GCEpochLagTest | test_epoch_lag_and_catchup | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/81723#019ce47b-343c-4557-9012-935e89ffb3b4 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0110, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCEpochLagTest&test_method=test_epoch_lag_and_catchup |
| TestReadReplicaService | test_identical_lwms_after_delete_records | {"cloud_storage_type": 1, "partition_count": 5} | integration | https://buildkite.com/redpanda/redpanda/builds/81723#019ce478-6203-49a7-adef-b7a245622e91 | FLAKY | 18/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0472, p0=0.2428, reject_threshold=0.0100. adj_baseline=0.1349, p1=0.4815, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestReadReplicaService&test_method=test_identical_lwms_after_delete_records |
test results on build#81776
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CloudTopicsUpgradeTest | test_cloud_topic_create_rejected_during_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/81776#019ce86c-65b2-4673-a1a9-de8157fc6c9b | FAIL | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsUpgradeTest&test_method=test_cloud_topic_create_rejected_during_upgrade |
| CloudTopicsUpgradeTest | test_cloud_topic_create_rejected_during_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/81776#019ce870-d25b-4cc7-8887-13a4bd22bef3 | FAIL | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsUpgradeTest&test_method=test_cloud_topic_create_rejected_during_upgrade |
test results on build#81779
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CloudTopicsUpgradeTest | test_cloud_topic_create_rejected_during_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/81779#019ce8bd-9f76-47d7-a270-1ca941f1e0dc | FAIL | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsUpgradeTest&test_method=test_cloud_topic_create_rejected_during_upgrade |
| CloudTopicsUpgradeTest | test_cloud_topic_create_rejected_during_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/81779#019ce8bf-5473-4679-8fd6-eb9e181b30da | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsUpgradeTest&test_method=test_cloud_topic_create_rejected_during_upgrade |
| TestReadReplicaService | test_identical_lwms_after_delete_records | {"cloud_storage_type": 1, "partition_count": 5} | integration | https://buildkite.com/redpanda/redpanda/builds/81779#019ce8bf-5475-4a36-baf0-e5293fdad0fd | FLAKY | 36/41 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0444, p0=0.1004, reject_threshold=0.0100. adj_baseline=0.1274, p1=0.4111, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestReadReplicaService&test_method=test_identical_lwms_after_delete_records |
test results on build#81786
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/81786#019ce93f-a46a-4f4d-8355-85370b0763fc | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0239, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| src/v/wasm/tests/wasm_transform_test | src/v/wasm/tests/wasm_transform_test | | unit | https://buildkite.com/redpanda/redpanda/builds/81786#019ce929-42ba-464e-a8ac-d660e3104e59 | FAIL | 0/1 | | |

@dotnwat dotnwat added this to the v26.1.1-rc3 milestone Mar 13, 2026
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
WillemKauf previously approved these changes Mar 13, 2026

@WillemKauf WillemKauf left a comment

LGTM but I'll repeat once more publicly that we probably need to find a better way of gating our config system, potentially by wrapping a feature_table accessor within our config store, and adding a definition for gated_property which can transparently respect features being active or not.

Though this also may require changes to the feature_table a la #29707 (i.e. suppressing updates to features until all nodes are updated, and then restarted).

The reason for this being if we have a check for a cluster config at start-up (say to construct a service like cloud_topics_app), gated_property<bool> cloud_topics_enabled may return false (if the feature has not become active yet), but in the future (say at topic creation time), cloud_topics_enabled may return true once the feature is active (but behavior of the system is now undefined/hard to reason about/here be dragons).

😓
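
The `gated_property` idea in this comment is a proposal and is not part of this PR. As a rough illustration of the hazard being described, here is a hypothetical sketch: the class shape, its constructor arguments, and the callable gate are all assumptions made for this example, not an existing Redpanda type.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sketch: a config property whose configured value is only
// visible while a feature gate reports active. Not an existing Redpanda type.
template<typename T>
class gated_property {
public:
    gated_property(T configured, T fallback, std::function<bool()> gate)
      : _configured(std::move(configured))
      , _fallback(std::move(fallback))
      , _gate(std::move(gate)) {}

    // Re-evaluates the gate on every read, so the result can change over
    // the lifetime of the process.
    T operator()() const { return _gate() ? _configured : _fallback; }

private:
    T _configured;
    T _fallback;
    std::function<bool()> _gate;
};
```

With a gate that flips from inactive to active while the process is running, a read taken at startup yields the fallback and a later read yields the configured value. A service that was skipped at startup because of the first read would then be missing when later code paths observe `true`, which is the hard-to-reason-about state the comment warns about.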


vbotbuildovich commented Mar 13, 2026

Retry command for Build#81776

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/upgrade_test.py::CloudTopicsUpgradeTest.test_cloud_topic_create_rejected_during_upgrade

@vbotbuildovich

Retry command for Build#81779

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/upgrade_test.py::CloudTopicsUpgradeTest.test_cloud_topic_create_rejected_during_upgrade

dotnwat commented Mar 13, 2026

ugh so annoying

Verify that cloud topic creation is rejected when the cloud_topics
feature has not yet activated after upgrading from v25.3.x, and
succeeds once the feature becomes active.
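
The behavior this test checks can be pictured with a simplified activation model. The sketch below assumes that a feature becomes active only once the lowest logical version reported by any node reaches the version that introduced the feature; that rule, the function name `feature_active`, and the integer versions are assumptions for illustration and are not taken from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model: a feature activates only when the lowest logical
// version reported by any node reaches the version that introduced it.
// The version numbers used with this function are made up.
inline bool feature_active(
  const std::vector<int>& node_versions, int required_version) {
    if (node_versions.empty()) {
        return false;
    }
    const int cluster_version = *std::min_element(
      node_versions.begin(), node_versions.end());
    return cluster_version >= required_version;
}
```

Under this model a single node that has not been upgraded keeps the feature inactive, so a cloud topic create request is rejected until the last node catches up.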

@andrwng andrwng left a comment

> Gating the subsystem startup is a bit more complicated (follow up)

Curious why this is desirable. It seems like this would make things much more complicated and would work against the mental model of "if the whole cluster is on 26.1 then we're ready for a cloud topic"

dotnwat commented Mar 13, 2026

> > Gating the subsystem startup is a bit more complicated (follow up)
>
> Curious why this is desirable. It seems like this would make things much more complicated and would work against the mental model of "if the whole cluster is on 26.1 then we're ready for a cloud topic"

@andrwng yeh it's debatable if we do it or not. the bullet point is really there as a placeholder for that investigation/consideration.

right now we have these goals:

  1. we want cloud topics to be enabled by default (cloud_topics_enabled=true), so that
  2. when tiered storage is enabled (cloud_storage_enabled=true) so is cloud topics
  3. we also want a feature gate to avoid some annoying situations when cloud topics isn't enabled on all nodes (e.g. missed a restart or something, like we saw from beta customer X).

On boot-up today we can control the construction of cloud_topics_app with a dependency between (1) and (2). But then we still have cloud_topics_enabled() checks scattered throughout the code base that are independent of cloud_storage_enabled() checks. Add to that the feature gate checks.

So maybe the answer here is that all cloud_topics_enabled checks should be rewritten to be equivalent to a conjunction of all 3 conditions? cc @WillemKauf
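
One way to picture the conjunction suggested here is a single predicate over the three conditions. This is a hypothetical sketch: the struct `cloud_topics_inputs` and the function `cloud_topics_usable` are invented for illustration, and in real code the values would come from the cluster config and the feature table instead of a plain struct.

```cpp
#include <cassert>

// Hypothetical inputs for the three conditions discussed in the comment.
struct cloud_topics_inputs {
    bool cloud_topics_enabled;  // cluster config, intended default: true
    bool cloud_storage_enabled; // tiered storage cluster config
    bool feature_active;        // cloud_topics feature gate
};

// Single predicate intended to stand in for scattered per-flag checks.
constexpr bool cloud_topics_usable(const cloud_topics_inputs& in) {
    return in.cloud_topics_enabled && in.cloud_storage_enabled
           && in.feature_active;
}
```

Routing every check through one predicate keeps call sites consistent, but because `feature_active` can change at runtime it does not by itself address the startup-versus-later concern raised earlier in the thread.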

@WillemKauf

> we want cloud topics to be enabled by default (cloud_topics_enabled=true), so that
> when tiered storage is enabled (cloud_storage_enabled=true) so is cloud topics

Do we even have a use for cloud_topics_enabled then? Sounds to me like we want to backpedal on the cluster config and just rely on cloud_storage_enabled=true and a feature flag for cloud_topics. Of course, all the concerns and everything else I raised in my previous comment still stand if we make the change of

> all cloud_topics_enabled checks should be rewritten to be equivalent to a conjunction of all 3 conditions?

@piyushredpanda piyushredpanda merged commit 58b416a into redpanda-data:dev Mar 15, 2026
21 checks passed