Skip to content

Conversation

jessejlt
Copy link
Member

@jessejlt jessejlt commented Aug 1, 2025

Summary

This PR implements a comprehensive backfill feature for SummaryRules that allows users to process historical data by specifying a date range. The feature enables SummaryRules to automatically advance through time windows from a specified start datetime until reaching an end datetime.

Features

Core Backfill Functionality

  • Date Range Specification: Users can specify start and end datetimes in SummaryRule specs
  • Serial Execution: Backfill operations run one at a time to prevent Kusto overload
  • Progress Tracking: Current backfill position is stored in the spec for visibility
  • Auto-completion: Backfill field is automatically removed when complete
  • Parallel Normal Execution: Backfill runs alongside normal interval execution

API Design

spec:
  backfill:
    start: "2024-01-01T00:00:00Z"    # Current backfill position
    end: "2024-01-31T23:59:59Z"      # Target end time  
    operationId: "operation-id"      # Active operation (managed by system)

Key Capabilities

  • Interval Alignment: Uses same interval boundaries as normal execution
  • Ingestion Delay: Respects configured ingestion delays
  • Error Handling: Robust retry logic for failed operations
  • Validation: 30-day max lookback period with proper date validation
  • Backward Compatibility: Existing SummaryRules work unchanged

Implementation Details

Commits Included

  • feat: add backfill specification to SummaryRule CRD - Core API extension
  • feat: add backfill helper methods to SummaryRule - Time window calculation logic
  • feat: integrate backfill execution into SummaryRuleTask - Task manager integration
  • test: add comprehensive unit tests for backfill task integration - Full test coverage
  • refactor: optimize backfill operation status checking and fix progress advancement - Performance improvements
  • docs: add backfill feature documentation and examples - User documentation

Architecture Decisions

  1. Single Operation Tracking: Tracks one backfill operation at a time for simplicity and Kusto protection
  2. Spec-based Progress: Current position stored in spec for easy monitoring and resume capability
  3. Serial Execution: Prevents overwhelming Kusto with expensive backfill queries
  4. Reconcile Integration: Backfill advances naturally through the existing reconcile loop

Performance Optimizations

  • Operation Status Checking: Optimized to query single operations vs. listing all operations
  • Window Calculation: Reuses existing time window calculation patterns
  • Query Parameterization: Uses parameterized KQL queries to prevent injection

Testing

Unit Test Coverage

  • ✅ Backfill helper methods with various intervals and edge cases
  • ✅ Task integration with mock clocks and fake Kusto clients
  • ✅ Error handling and retry scenarios
  • ✅ Progress advancement and completion detection
  • ✅ Integration with normal rule execution

Validation Scenarios

  • Date range validation (start < end, max 30-day lookback)
  • Time window calculation with different intervals
  • Operation tracking and cleanup
  • Concurrent normal execution during backfill

Documentation

User-Facing Documentation

  • CRD Reference: Updated with comprehensive backfill field documentation
  • Examples: Three practical examples (basic, advanced, migration scenarios)
  • Troubleshooting Guide: Common issues, monitoring, and optimization strategies
  • Best Practices: Performance considerations and limitations

Example Usage

apiVersion: adx-mon.azure.com/v1
kind: SummaryRule
metadata:
  name: example-backfill
spec:
  database: "metrics"
  table: "usage_metrics"
  interval: "1h"
  body: |
    metrics
    | where TimeGenerated between (todatetime('{{ ._startTime }}') .. todatetime('{{ ._endTime }}'))
    | summarize AvgCPU = avg(cpu_percent) by bin(TimeGenerated, 1h)
  backfill:
    start: "2024-01-01T00:00:00Z"
    end: "2024-01-31T23:59:59Z"

Safety & Reliability

Backward Compatibility

  • All backfill fields are optional
  • Existing SummaryRules continue to work unchanged
  • No breaking changes to the API

Error Resilience

  • Failed operations retry on next reconcile cycle
  • Progress is preserved across operator restarts
  • Clear error messages for validation failures

Resource Protection

  • Serial execution prevents Kusto overload
  • 30-day max lookback prevents excessive resource usage
  • Progress tracking enables monitoring and intervention

Next Steps

  1. Review: Please review the implementation and provide feedback
  2. Testing: Consider running integration tests with real Kusto clusters
  3. Documentation: Verify examples work in your environment
  4. Release Planning: Determine appropriate version bump for this feature

Related Issues

This PR addresses the need for historical data processing in SummaryRules, enabling users to:

  • Migrate historical data to new summary tables
  • Backfill data after rule modifications
  • Process data during system outages or maintenance windows

Note: This is a draft PR for initial review. Please provide feedback on the implementation approach and any concerns before marking ready for review.

- Defines backfill spec with start/end datetime and operationId tracking
- Uses serial execution to prevent overwhelming Kusto with expensive queries
- Implements progress tracking via spec mutation with natural completion
- Includes 6-task implementation plan with comprehensive testing
- Maintains backwards compatibility with existing SummaryRules
- Adds 30-day max lookback validation and proper error handling
- Add BackfillSpec struct with start, end, and operationId fields
- Add optional Backfill field to SummaryRuleSpec with validation
- Add 30-day max lookback period constant
- Generate updated CRD manifests with kubebuilder validation
- Maintain backwards compatibility with existing SummaryRules
- Add HasActiveBackfill() to check for active backfill configuration
- Add IsBackfillComplete() to detect completion
- Add GetNextBackfillWindow() for time window calculation with interval alignment
- Add AdvanceBackfillProgress() to move backfill position forward
- Add operation management methods (Set/Get/Clear BackfillOperation)
- Add RemoveBackfill() for cleanup when complete
- Include comprehensive unit tests covering all edge cases
- Handle ingestion delay correctly in window calculations
- Add handleBackfillExecution() method for backfill operation management
- Add handleBackfillOperationStatus() to track running backfill operations
- Add submitNextBackfillWindow() to submit next backfill window
- Integrate backfill execution into main Run() loop alongside normal execution
- Handle backfill completion and automatic cleanup
- Support operation retry and error handling for backfill operations
- Log backfill progress for visibility and debugging
- Add TestSummaryRuleTask_BackfillIntegration with 9 test scenarios
- Cover operation submission, progress advancement, error handling
- Test retry logic, completion detection, and edge cases
- Include ingestion delay support and missing operation handling
- Follow existing test patterns with proper mocking and assertions
…s advancement

- Add GetOperation function field to SummaryRuleTask for single-operation queries
- Implement parameterized KQL query following databaseExists pattern for injection prevention
- Update handleBackfillOperationStatus to use efficient single-operation lookup
- Factor out shared calculateTimeWindow method for backfill operations
- Add comprehensive comments explaining why NextExecutionWindow cannot share logic
- Fix AdvanceBackfillProgress to properly reverse delay calculation
- Update all backfill tests to mock new GetOperation function
- Maintain backward compatibility with existing execution patterns
- Update CRD reference with comprehensive backfill documentation
- Add backfill field to SummaryRule key fields and examples
- Create detailed backfill feature section with workflow explanation
- Add examples directory with three comprehensive YAML examples:
  - Basic hourly aggregation with 30-day backfill
  - Advanced example with ingestion delay and criteria
  - Large-scale daily migration example
- Include comprehensive troubleshooting guide covering:
  - Common issues and solutions
  - Progress monitoring techniques
  - Performance optimization strategies
  - Advanced troubleshooting commands
- Add examples reference to main documentation index
- Document limitations, best practices, and monitoring approaches
@jessejlt jessejlt requested a review from Copilot August 1, 2025 18:15
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a comprehensive backfill feature for SummaryRules that enables historical data processing by specifying date ranges. The feature allows SummaryRules to automatically advance through time windows from a start datetime to an end datetime while continuing normal interval execution in parallel.

Key changes:

  • API Extension: Added BackfillSpec to the SummaryRule CRD with start/end datetime fields and operation tracking
  • Core Logic: Implemented backfill helper methods for window calculation, progress tracking, and completion detection
  • Task Integration: Extended SummaryRuleTask to handle backfill operations alongside normal execution with serial processing to prevent Kusto overload

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
operator/manifests/crds/summaryrules_crd.yaml Added backfill field schema with validation rules to CRD manifest
kustomize/bases/summaryrules_crd.yaml Identical backfill field addition to base CRD definition
api/v1/summaryrule_types.go Core backfill implementation with BackfillSpec type and helper methods
api/v1/summaryrule_types_test.go Comprehensive unit tests for all backfill helper methods
api/v1/zz_generated.deepcopy.go Auto-generated deepcopy methods for BackfillSpec
ingestor/adx/tasks.go Task execution integration with backfill operation handling
ingestor/adx/tasks_test.go Integration tests for backfill task execution
docs/crds.md Updated CRD documentation with backfill feature details
docs/index.md Added link to examples directory
docs/examples/*.yaml Example configurations for basic, advanced, and migration scenarios
docs/examples/backfill-troubleshooting.md Comprehensive troubleshooting guide
docs/examples/README.md Overview of backfill examples and quick start guide
.github/plans/summaryrules/backfill.md Detailed implementation plan and design decisions

- Format long KQL queries using multi-line strings for better readability
- Add comprehensive documentation to helper functions with actual value examples
- Decompose large functions into single-responsibility helpers
- Extract shared utilities (ensureClockIsSet, getIngestionDelay, capWindowEndToCurrentTime)
- Maintain architectural separation between backfill and normal execution time calculations
- All tests passing, no behavioral changes
Copy link
Member Author

@jessejlt jessejlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed review feedback! I've addressed all four comments with comprehensive improvements:

Summary of Changes

Code Quality Improvements:

  1. Multi-line KQL formatting - Enhanced readability of complex Kusto queries
  2. Enhanced documentation - Added actual values (30 days) to constant comments
  3. Function decomposition - Broke down large functions into focused, single-responsibility helpers
  4. Strategic code reuse - Identified and implemented reuse opportunities while preserving architectural differences

Testing & Validation:

  • All existing tests continue to pass
  • No behavioral changes introduced
  • Function decomposition enables better testability

Architecture Preservation:

  • Maintained documented separation between calculateTimeWindow and NextExecutionWindow
  • Created shared utilities for peripheral concerns (clock validation, delay calculation)
  • Followed coding guidelines for focused single-purpose functions

The refactoring significantly improves code maintainability while preserving all existing functionality. Ready for another review cycle!

@jessejlt jessejlt marked this pull request as ready for review August 4, 2025 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant