feature(table-services) Support clustering file groups with earlier instants times first #18174
+203
−6
Describe the issue this Pull Request addresses
Adds a config to allow clustering plan strategy to sort file slices by commit time (earlier first) before file size when building clustering groups. This helps use cases (e.g. stitching) that want to cluster older data first to reduce lag. The behavior is opt-in; default remains size-only sorting to preserve existing behavior.
#17956
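The ordering described above can be sketched with plain Java comparators. This is an illustration only, not the code from this PR: `SliceInfo`, `sortSlices`, and the sample values are hypothetical stand-ins, and commit times are assumed to be timestamp strings that sort lexicographically in time order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class EarlierInstantsFirstSketch {
    // Hypothetical stand-in for a file slice: only the fields the sort needs.
    static final class SliceInfo {
        final String fileId;
        final String commitTime; // assumed to sort lexicographically in time order
        final long sizeBytes;

        SliceInfo(String fileId, String commitTime, long sizeBytes) {
            this.fileId = fileId;
            this.commitTime = commitTime;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<SliceInfo> sortSlices(List<SliceInfo> slices, boolean earlierInstantsFirst) {
        // Previous behavior: largest files first.
        Comparator<SliceInfo> bySizeDesc =
            Comparator.comparingLong((SliceInfo s) -> s.sizeBytes).reversed();
        // Opt-in behavior: earliest commit time first, size descending as tie-breaker.
        Comparator<SliceInfo> order = earlierInstantsFirst
            ? Comparator.comparing((SliceInfo s) -> s.commitTime).thenComparing(bySizeDesc)
            : bySizeDesc;
        List<SliceInfo> sorted = new ArrayList<>(slices);
        sorted.sort(order);
        return sorted;
    }

    public static void main(String[] args) {
        List<SliceInfo> slices = new ArrayList<>();
        slices.add(new SliceInfo("f1", "20240102", 100L));
        slices.add(new SliceInfo("f2", "20240101", 50L));
        slices.add(new SliceInfo("f3", "20240101", 200L));

        // Flag enabled: f3, f2, f1 (older commit first, larger file first within a commit).
        for (SliceInfo s : sortSlices(slices, true)) {
            System.out.println("enabled:  " + s.fileId);
        }
        // Flag disabled: f3, f1, f2 (size descending only).
        for (SliceInfo s : sortSlices(slices, false)) {
            System.out.println("disabled: " + s.fileId);
        }
    }
}
```

With the flag off the result depends only on size, which matches the stated goal of preserving existing behavior by default.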
Summary and Changelog
Summary: New config `hoodie.clustering.earlier_instants_first` (default `false`). When enabled, `PartitionAwareClusteringPlanStrategy` sorts file slices by base file commit time ascending, then by file size descending, so older data is clustered first.
Changelog:
- `EARLIER_INSTANTS_FIRST` config property (default `false`) and `Builder.withEarlierInstantsFirst(Boolean)`.
- `isEarlierInstantsFirst()`.
- When `isEarlierInstantsFirst()` is true, sort by commit time then by file size (desc); otherwise keep previous size-descending behavior.
- `createFileSliceWithCommitTime(long, String)` and tests: `testEarlierInstantsFirstEnabled`, `testEarlierInstantsFirstDisabled`, `testCommitTimeOrderingWithSameSizes`, `testSortingBehaviorComparisonWithAndWithoutEarlierInstantsFirst`.
Impact
- New config `hoodie.clustering.earlier_instants_first` and builder method `HoodieClusteringConfig.Builder.withEarlierInstantsFirst(Boolean)`. No breaking changes.
- Default `false` keeps current clustering order (by size only).
Risk Level
Low. Behavior is off by default. Sorting change is limited to `PartitionAwareClusteringPlanStrategy` and covered by new and existing unit tests in `TestSparkSizeBasedClusteringPlanStrategy`.
Documentation Update
- Document `hoodie.clustering.earlier_instants_first` in the clustering config section (description and default `false`).
Contributor's checklist