
datalake/coordinator: add reset topic state escape hatch #29596

Merged
nvartolomei merged 2 commits into redpanda-data:dev from nvartolomei:nv/iceberg-reset
Feb 20, 2026

Conversation

@nvartolomei
Contributor

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings February 12, 2026 17:34
Contributor

Copilot AI left a comment

Pull request overview

Adds an admin “escape hatch” to reset datalake coordinator per-topic state, including optional per-partition last_committed overrides, and validates it with unit + e2e coverage.

Changes:

  • Add CoordinatorResetState admin RPC + generated client bindings and protobuf types.
  • Implement coordinator/frontend/state-machine support for reset_topic_state updates and overrides.
  • Add coordinator reset unit tests and a new datalake e2e test exercising reset behavior.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `tests/rptest/tests/datalake/datalake_e2e_test.py` | Adds an end-to-end test for coordinator reset semantics and overrides. |
| `tests/rptest/clients/types.py` | Adds a TopicSpec config key for iceberg partition spec used by tests. |
| `tests/rptest/clients/admin/proto/redpanda/core/admin/internal/datalake/v1/datalake_pb2_connect.py` | Adds client stubs for the new CoordinatorResetState RPC. |
| `tests/rptest/clients/admin/proto/redpanda/core/admin/internal/datalake/v1/datalake_pb2.pyi` | Adds typing stubs for new protobuf messages. |
| `tests/rptest/clients/admin/proto/redpanda/core/admin/internal/datalake/v1/datalake_pb2.py` | Updates generated protobuf module with new messages + service method. |
| `src/v/redpanda/admin/services/datalake/datalake.h` | Exposes coordinator_reset_state handler in admin service impl. |
| `src/v/redpanda/admin/services/datalake/datalake.cc` | Implements admin RPC handler calling coordinator frontend reset. |
| `src/v/datalake/coordinator/types.h` | Adds RPC request/reply types for reset topic state + overrides. |
| `src/v/datalake/coordinator/types.cc` | Adds logging formatter for reset reply. |
| `src/v/datalake/coordinator/tests/state_update_test.cc` | Adds unit tests for reset update behavior. |
| `src/v/datalake/coordinator/tests/BUILD` | Adds deps for new reset/override types in tests. |
| `src/v/datalake/coordinator/state_update.h` | Adds reset_topic_state update key + update struct. |
| `src/v/datalake/coordinator/state_update.cc` | Implements apply/can_apply + logging for reset update. |
| `src/v/datalake/coordinator/state_machine.cc` | Applies the new reset update key in the STM. |
| `src/v/datalake/coordinator/service.h` | Adds RPC endpoint for reset topic state. |
| `src/v/datalake/coordinator/service.cc` | Implements RPC dispatch into frontend reset handler. |
| `src/v/datalake/coordinator/rpc.json` | Declares new RPC method for codegen. |
| `src/v/datalake/coordinator/partition_state_override.h` | Introduces the override struct for per-partition fields. |
| `src/v/datalake/coordinator/partition_state_override.cc` | Implements logging formatter for overrides. |
| `src/v/datalake/coordinator/frontend.h` | Adds frontend reset API + local execution helper. |
| `src/v/datalake/coordinator/frontend.cc` | Implements frontend reset path and coordinator-manager invocation. |
| `src/v/datalake/coordinator/coordinator.h` | Adds coordinator sync API to replicate reset updates. |
| `src/v/datalake/coordinator/coordinator.cc` | Implements STM replication of the reset update. |
| `src/v/datalake/coordinator/BUILD` | Adds build targets/deps for new override type. |
| `proto/redpanda/core/admin/internal/datalake/v1/datalake.proto` | Adds protobuf messages + admin RPC definition. |

Operators may need to clear pending files from the datalake coordinator
when Iceberg catalog state becomes inconsistent (e.g. after manual
catalog modifications), which can leave the coordinator stuck.

The new CoordinatorResetState RPC is exposed via the admin API with
SUPERUSER authorization and plumbed through the coordinator frontend,
RPC service, and state machine.
andrwng previously approved these changes Feb 17, 2026
Contributor

@andrwng andrwng left a comment

Nice, LGTM! One nit/question about the test, but doesn't need to block

@nvartolomei
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-rebase

andrwng previously approved these changes Feb 18, 2026
Member

@oleiman oleiman left a comment

lgtm. i have one question about the override logic.

auto check_res = update.can_apply(stm_->state());
if (check_res.has_error()) {
    vlog(
      datalake_log.debug,
Member

nitpick: INFO?

Contributor Author

Copy pasted. Can address in follow up.

Comment on lines +18 to +22
if (p.last_committed.has_value()) {
    fmt::print(o, "{{last_committed: {}}}", p.last_committed.value());
} else {
    fmt::print(o, "{{last_committed: nullopt}}");
}
Member

nitpick/fyi: I think just formatting the optional will do the right thing, no?

Contributor Author

It doesn't work without additional includes here and I refused to add them because it is a hack and we'll have to get rid of it soon-ish.

This violates C++ standard guidelines https://nvartolomei.com/cpp-code-hygiene/#specializing-standard-library-know-your-boundaries.

In newer fmt, ostream formatters aren't used afaik. Preferred correct code rather than yoloing.

Member

missed that it was an ostream operator... why are we even writing one, exactly?
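
For reference, the formatter under discussion can be sketched with the standard library alone, without relying on any formatter for std::optional. This is a minimal sketch, not the code in partition_state_override.cc: `override_fmt_sketch` and `to_string_sketch` are hypothetical names, and the int64_t field type is an assumption.

```cpp
#include <cstdint>
#include <optional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the per-partition override struct.
struct override_fmt_sketch {
    std::optional<int64_t> last_committed;
};

// Handle the optional explicitly, so no formatter for std::optional
// needs to be visible at this point.
inline std::ostream& operator<<(std::ostream& o, const override_fmt_sketch& p) {
    if (p.last_committed.has_value()) {
        o << "{last_committed: " << p.last_committed.value() << "}";
    } else {
        o << "{last_committed: nullopt}";
    }
    return o;
}

// Small helper (hypothetical) that renders the struct to a string.
inline std::string to_string_sketch(const override_fmt_sketch& p) {
    std::ostringstream os;
    os << p;
    return os.str();
}
```

Branching on has_value() keeps the output format stable regardless of which optional formatters happen to be included elsewhere.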

@nvartolomei
Contributor Author

/ci-repeat 1
skip-redpanda-build
skip-rebase

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 19, 2026

Retry command for Build#80732

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsCompactionTest.test_compact@{"cloud_topics_compaction_key_map_memory_kb":10}
tests/rptest/tests/describe_topics_test.py::DescribeTopicsTest.test_describe_topics_with_documentation_and_types
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsCompactionTest.test_compact@{"cloud_topics_compaction_key_map_memory_kb":3}
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCTest.test_l0_gc@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsTest.test_write
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsCompactionTest.test_compact@{"cloud_topics_compaction_key_map_memory_kb":131072}
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkBasicTests.test_disallowed_topic_properties
tests/rptest/tests/cloud_topics/iceberg_test.py::EndToEndCloudTopicsIcebergCompactionTest.test_compaction_preserves_all_offsets_in_iceberg@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/retention_test.py::CloudTopicsRetentionTest.test_size_based_retention@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsTest.test_get_size
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCTest.test_l0_gc@{"cloud_storage_type":2}
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_conflicting_group_migrations
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_basic_pause_unpause@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/retention_test.py::CloudTopicsRetentionTest.test_time_based_retention@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsTxTest.test_write
tests/rptest/tests/cloud_topics/retention_test.py::CloudTopicsRetentionTest.test_size_based_retention@{"cloud_storage_type":2}
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_single_node_pause_unpause@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/e2e_test.py::EndToEndCloudTopicsTest.test_delete_records
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_cloud_topic_unmount_rejected
tests/rptest/tests/cloud_topics/iceberg_test.py::EndToEndCloudTopicsIcebergDeletionTest.test_deletion_blocked_until_translated@{"cloud_storage_type":1}
tests/rptest/tests/cloud_topics/retention_test.py::CloudTopicsRetentionTest.test_time_based_retention@{"cloud_storage_type":2}

@nvartolomei
Contributor Author

/ci-repeat 1

@nvartolomei nvartolomei disabled auto-merge February 19, 2026 11:37
oleiman previously approved these changes Feb 19, 2026
In some cases we want to manually choose where translation
should resume after a coordinator reset. Add per-partition
last_committed overrides and a reset_all_partitions flag.

By default reset_all_partitions is false and the request is a no-op
unless partition_overrides are provided; in that case only the listed
partitions have their pending entries cleared and last_committed set.
Setting reset_all_partitions to true clears all partitions first, then
applies any overrides. This default-safe design minimizes surprise
and makes it harder to accidentally wipe state.
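
The semantics described in this commit message can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the implementation in state_update.cc: all type and function names are hypothetical, pending entries are reduced to strings, "clearing" a partition is assumed to drop only its pending entries, and an override for a partition missing from the state is assumed to create it.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified per-partition coordinator state.
struct partition_state_sketch {
    std::vector<std::string> pending_entries;
    std::optional<int64_t> last_committed;
};

// Hypothetical override: the value last_committed should take after reset.
struct partition_override_sketch {
    std::optional<int64_t> last_committed;
};

using topic_state_sketch = std::map<int32_t, partition_state_sketch>;
using overrides_sketch = std::map<int32_t, partition_override_sketch>;

inline void apply_reset_sketch(
  topic_state_sketch& topic,
  bool reset_all_partitions,
  const overrides_sketch& overrides) {
    // Step 1: optionally clear every partition.
    // Assumption: only pending entries are dropped here.
    if (reset_all_partitions) {
        for (auto& [id, p] : topic) {
            p.pending_entries.clear();
        }
    }
    // Step 2: apply overrides. Listed partitions always have their
    // pending entries cleared and last_committed set.
    for (const auto& [id, ov] : overrides) {
        auto& p = topic[id]; // assumption: missing partitions are created
        p.pending_entries.clear();
        p.last_committed = ov.last_committed;
    }
}
```

With reset_all_partitions false and no overrides, neither loop modifies anything, which matches the no-op default described above.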
@nvartolomei nvartolomei dismissed stale reviews from oleiman and andrwng via f44ca9d February 20, 2026 11:44
@nvartolomei nvartolomei requested a review from oleiman February 20, 2026 11:45
@nvartolomei nvartolomei merged commit f87258f into redpanda-data:dev Feb 20, 2026
30 checks passed

6 participants