
feat(kafka): add topic pattern subscription support#1002

Open
touhou09 wants to merge 3 commits into ArroyoSystems:master from touhou09:feature/kafka-topic-pattern-subscription

Conversation


@touhou09 touhou09 commented Feb 4, 2026

Summary

Add support for subscribing to multiple Kafka topics using regex patterns.
This enables scenarios where a single Arroyo pipeline can consume from multiple topics matching a pattern (e.g., ^logs-.* or ^edge[0-9]+_raw$).

Motivation

Currently, Arroyo's Kafka source connector only supports subscribing to a single explicit topic. In many real-world scenarios (e.g., multi-tenant systems, edge device data collection), data is partitioned across multiple topics with a common naming pattern. This PR adds the ability to subscribe to all topics matching a regex pattern using Kafka's native pattern subscription feature.

Changes

  • Add topic_pattern option to Kafka source table configuration
  • Implement pattern-based subscription using rdkafka's subscribe() method
  • Update KafkaState to track offsets by (topic, partition) key for multi-topic support
  • Add validation: topic and topic_pattern are mutually exclusive
  • Add validation: topic_pattern only valid for source tables, not sinks
  • Update existing tests to work with new field structure
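The mutual-exclusivity and source-only validations listed above can be sketched in plain Rust. This is a hypothetical illustration, not Arroyo's actual code; the `TableType` enum and `validate` function are assumptions for the example.

```rust
// Sketch of the validation rules described above (hypothetical names,
// not Arroyo's real API): `topic` and `topic_pattern` are mutually
// exclusive, and `topic_pattern` is only accepted for source tables.
#[derive(Debug, PartialEq)]
enum TableType {
    Source,
    Sink,
}

fn validate(
    topic: Option<&str>,
    topic_pattern: Option<&str>,
    table_type: TableType,
) -> Result<(), String> {
    match (topic, topic_pattern) {
        // Both set: reject.
        (Some(_), Some(_)) => {
            Err("topic and topic_pattern are mutually exclusive".to_string())
        }
        // Neither set: reject.
        (None, None) => Err("one of topic or topic_pattern is required".to_string()),
        // Pattern on a sink: reject.
        (None, Some(_)) if table_type == TableType::Sink => {
            Err("topic_pattern is only valid for source tables".to_string())
        }
        _ => Ok(()),
    }
}

fn main() {
    assert!(validate(Some("logs"), None, TableType::Sink).is_ok());
    assert!(validate(None, Some("^logs-.*"), TableType::Source).is_ok());
    assert!(validate(Some("logs"), Some("^logs-.*"), TableType::Source).is_err());
    assert!(validate(None, Some("^logs-.*"), TableType::Sink).is_err());
    println!("ok");
}
```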

Usage Example

CREATE TABLE multi_topic_source (
    data TEXT
) WITH (
    connector = 'kafka',
    type = 'source',
    topic_pattern = '^logs-.*',
    format = 'json',
    bootstrap_servers = 'kafka:9092',
    'source.offset' = 'earliest'
);

SELECT * FROM multi_topic_source;

Behavior

When using topic_pattern:

  • Kafka handles partition assignment via consumer group protocol
  • New topics matching the pattern are automatically discovered
  • Rebalancing happens automatically when topics are added/removed
  • The topic metadata field reflects the actual topic each message came from
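Because messages may now arrive from any topic matching the pattern, offsets must be keyed by (topic, partition) rather than by partition alone. A minimal self-contained sketch of that bookkeeping (the `KafkaState` struct and method names here are assumptions for illustration, not the PR's actual types):

```rust
use std::collections::HashMap;

// Hypothetical sketch of multi-topic offset tracking: the map is keyed
// by (topic, partition) so that two topics matched by the same pattern
// (e.g. `^logs-.*`) can share a partition index without colliding.
#[derive(Default)]
struct KafkaState {
    offsets: HashMap<(String, i32), i64>,
}

impl KafkaState {
    // Record the latest consumed offset for a (topic, partition) pair.
    fn record(&mut self, topic: &str, partition: i32, offset: i64) {
        self.offsets.insert((topic.to_string(), partition), offset);
    }

    // Look up the stored offset, if any, for a (topic, partition) pair.
    fn get(&self, topic: &str, partition: i32) -> Option<i64> {
        self.offsets.get(&(topic.to_string(), partition)).copied()
    }
}

fn main() {
    let mut state = KafkaState::default();
    // Two topics matched by `^logs-.*`, both using partition 0.
    state.record("logs-a", 0, 42);
    state.record("logs-b", 0, 7);
    assert_eq!(state.get("logs-a", 0), Some(42));
    assert_eq!(state.get("logs-b", 0), Some(7));
    assert_eq!(state.get("logs-c", 0), None);
    println!("ok");
}
```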

Breaking Changes

None. The topic field remains the default and recommended option for single-topic subscriptions.

Test Plan

  • Updated existing unit tests to use new field structure (Option<String> for topic)
  • Integration tests with pattern subscription (requires Kafka cluster)
  • Manual testing pending

Related Issues

This addresses the need for multi-topic subscription in stream processing pipelines.

