@kbuci
Contributor

@kbuci kbuci commented Feb 11, 2026

Describe the issue this Pull Request addresses

Adds a config to allow clustering plan strategy to sort file slices by commit time (earlier first) before file size when building clustering groups. This helps use cases (e.g. stitching) that want to cluster older data first to reduce lag. The behavior is opt-in; default remains size-only sorting to preserve existing behavior.

#17956

Summary and Changelog

Summary: New config hoodie.clustering.earlier_instants_first (default false). When enabled, PartitionAwareClusteringPlanStrategy sorts file slices by base file commit time ascending, then by file size descending, so older data is clustered first.
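For reviewers, the intended ordering can be sketched in isolation. This is a minimal illustration, not the code in this PR: `SliceInfo`, `ordering`, and `sorted` are hypothetical names, and the sketch assumes commit times are fixed-width timestamp strings that compare correctly as plain strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClusteringSortSketch {

  // Hypothetical stand-in for a file slice: only the two fields the sort needs.
  static final class SliceInfo {
    final String commitTime; // base file commit time, assumed fixed-width (e.g. "20260211093000")
    final long sizeBytes;    // base file size in bytes

    SliceInfo(String commitTime, long sizeBytes) {
      this.commitTime = commitTime;
      this.sizeBytes = sizeBytes;
    }
  }

  // Size-descending only when the flag is off; commit time ascending,
  // then size descending, when the flag is on.
  static Comparator<SliceInfo> ordering(boolean earlierInstantsFirst) {
    Comparator<SliceInfo> bySizeDesc =
        Comparator.comparingLong((SliceInfo s) -> s.sizeBytes).reversed();
    if (!earlierInstantsFirst) {
      return bySizeDesc;
    }
    return Comparator.comparing((SliceInfo s) -> s.commitTime).thenComparing(bySizeDesc);
  }

  // Returns a sorted copy so the caller's list is left untouched.
  static List<SliceInfo> sorted(List<SliceInfo> slices, boolean earlierInstantsFirst) {
    List<SliceInfo> copy = new ArrayList<>(slices);
    copy.sort(ordering(earlierInstantsFirst));
    return copy;
  }

  public static void main(String[] args) {
    List<SliceInfo> slices = new ArrayList<>();
    slices.add(new SliceInfo("20260101000000", 100L));
    slices.add(new SliceInfo("20260102000000", 500L));
    slices.add(new SliceInfo("20260101000000", 300L));
    for (SliceInfo s : sorted(slices, true)) {
      System.out.println(s.commitTime + " " + s.sizeBytes);
    }
  }
}
```

With the flag off, the largest slice comes first regardless of age. With it on, slices from the earliest commit come first, and size only breaks ties within the same commit time.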

Changelog:

  • HoodieClusteringConfig: Added EARLIER_INSTANTS_FIRST config property (default false) and Builder.withEarlierInstantsFirst(Boolean).
  • HoodieWriteConfig: Added isEarlierInstantsFirst().
  • PartitionAwareClusteringPlanStrategy: Replaced size-only sort with a configurable comparator: when isEarlierInstantsFirst() is true, sort by commit time then by file size (desc); otherwise keep previous size-descending behavior.
  • TestSparkSizeBasedClusteringPlanStrategy: Added createFileSliceWithCommitTime(long, String) and tests: testEarlierInstantsFirstEnabled, testEarlierInstantsFirstDisabled, testCommitTimeOrderingWithSameSizes, testSortingBehaviorComparisonWithAndWithoutEarlierInstantsFirst.

Impact

  • Public API: New config hoodie.clustering.earlier_instants_first and builder method HoodieClusteringConfig.Builder.withEarlierInstantsFirst(Boolean). No breaking changes.
  • User-facing: Optional behavior; default false keeps current clustering order (by size only).
  • Performance: Negligible (one extra comparator key when enabled).
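As a rough illustration of the opt-in behavior, the sketch below sets the new key on a plain `java.util.Properties` object and reads it back with the documented default of `false`. The class and helper names are hypothetical and only mimic what `isEarlierInstantsFirst()` is described as exposing; real writers would pass the key through their usual Hudi write options or the `withEarlierInstantsFirst(Boolean)` builder method.

```java
import java.util.Properties;

public class EarlierInstantsFirstConfigSketch {

  // Config key introduced by this PR.
  static final String EARLIER_INSTANTS_FIRST_KEY = "hoodie.clustering.earlier_instants_first";

  // Hypothetical helper: reads the flag, falling back to the default (false) when unset.
  static boolean isEarlierInstantsFirst(Properties props) {
    return Boolean.parseBoolean(props.getProperty(EARLIER_INSTANTS_FIRST_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    // Unset: existing size-only ordering is kept.
    System.out.println(isEarlierInstantsFirst(props));

    // Opt in: older commits are clustered first.
    props.setProperty(EARLIER_INSTANTS_FIRST_KEY, "true");
    System.out.println(isEarlierInstantsFirst(props));
  }
}
```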

Risk Level

Low. Behavior is off by default. The sorting change is limited to PartitionAwareClusteringPlanStrategy and is covered by new and existing unit tests in TestSparkSizeBasedClusteringPlanStrategy.

Documentation Update

  • Config: Document hoodie.clustering.earlier_instants_first in the clustering config section (description and default false).
  • Website: Optional short note under clustering / tuning that this config can be used to prioritize older data when needed (e.g. stitching). No website change required for merge if docs are in code/config only.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 11, 2026
@kbuci kbuci changed the title feature(table-services) Support clustering file groups with earlier i… feature(table-services) Support clustering file groups with earlier instants times first Feb 11, 2026
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
