fix: Support data pruning using nested partition columns#18126

Open
linliu-code wants to merge 2 commits into apache:master from linliu-code:nested_partitioning

@linliu-code
Collaborator

@linliu-code linliu-code commented Feb 7, 2026

Describe the issue this Pull Request addresses

There's a change in behavior for SparkHoodieTableFileIndex since 0.14.1. The StructType(partitionFields) it returns doesn't carry the full field path, which causes data validation failures. This behavior was changed as part of this PR: https://github.com/apache/hudi/pull/9863/changes

Summary and Changelog

If there's a table with a nested partition column whose leaf name conflicts with another top-level field, the partitionedSchema passed to the new file group reader is incorrect. The fix is to return the partition field with the full path name instead of the inner field name.
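
To illustrate the collision described above, here is a minimal sketch. It uses small stand-in types rather than Spark's own `StructType`/`StructField` so that it runs without Spark on the classpath. The `flatten` helper, the example schema, and the `dataFields` exclusion rule are assumptions made for illustration and are not copied from the Hudi code.

```scala
// Stand-in types (hypothetical, NOT Spark's classes) so the sketch is self-contained.
sealed trait DataType
case object StringType extends DataType
case class StructType(fields: Seq[StructField]) extends DataType
case class StructField(name: String, dataType: DataType)

object NestedPartitionSketch {
  // Flatten a schema into (full dotted path -> leaf field).
  // Note that the leaf field keeps its short, inner name.
  def flatten(schema: StructType, prefix: String = ""): Map[String, StructField] =
    schema.fields.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      f.dataType match {
        case s: StructType => flatten(s, path)
        case _             => Map(path -> f)
      }
    }.toMap

  // A top-level "level" column plus a nested "nested_record.level" column.
  val schema: StructType = StructType(Seq(
    StructField("level", StringType),
    StructField("nested_record", StructType(Seq(StructField("level", StringType))))
  ))
  val partitionColumns: Seq[String] = Seq("nested_record.level")
  val byPath: Map[String, StructField] = flatten(schema)

  // Old behavior: the partition field is named after the leaf ("level").
  val leafNamed: Seq[StructField] =
    partitionColumns.filter(c => byPath.contains(c)).map(c => byPath(c))

  // New behavior: the partition field is named after the full path.
  val pathNamed: Seq[StructField] =
    partitionColumns.filter(c => byPath.contains(c))
      .map(c => StructField(c, byPath(c).dataType))

  // Assumed exclusion rule: data columns are the top-level fields whose
  // names do not match any partition field name.
  def dataFields(partitionFields: Seq[StructField]): Seq[String] = {
    val partitionNames = partitionFields.map(_.name).toSet
    schema.fields.map(_.name).filterNot(partitionNames.contains)
  }
}
```

Under these assumptions, `dataFields(leafNamed)` drops the unrelated top-level `level` column and yields `Seq("nested_record")`, while `dataFields(pathNamed)` keeps both top-level columns.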

Impact

Medium

Risk Level

Low.

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 7, 2026
@linliu-code linliu-code force-pushed the nested_partitioning branch 3 times, most recently from d6f9ca7 to 413fa60 Compare February 8, 2026 01:00
@linliu-code linliu-code changed the title from "fix: Reproduce nested partition columns pruning data validation failure" to "fix: Support data pruning using nested partition columns" Feb 8, 2026
@linliu-code linliu-code marked this pull request as ready for review February 8, 2026 05:50
@linliu-code linliu-code requested a review from yihua February 8, 2026 05:54
@nsivabalan
Contributor

@hudi-bot run azure

@linliu-code
Collaborator Author

@hudi-bot run azure

The command does not seem to be working. Let me push again to trigger the Azure test.

@apache apache deleted a comment from hudi-bot Feb 10, 2026
@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
})
}

@ParameterizedTest
@CsvSource(Array("COW", "MOR"))
Contributor

Could this test be extracted and called from TestCOWDataSource for COW table and TestMORDataSource for MOR table?

Collaborator Author

sg.

  val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-   .map(column => nameFieldMap.apply(column))
+   .map(column => StructField(column, nameFieldMap.apply(column).dataType))
Contributor

Could generateFieldMap(schema) be fixed instead? In case it is reused later, the same issue will be encountered.

  val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-   .map(column => nameFieldMap.apply(column))
+   .map(column => StructField(column, nameFieldMap.apply(column).dataType))
Contributor

The fix correctly addresses the dataSchema exclusion problem (lines 171-172), but I'm wondering if it could cause issues with partition predicate binding at line 258: partitionSchema.indexWhere(a.name == _.name). When Spark resolves a filter like nested_record.level = 'INFO' against a partition schema with field name "nested_record.level", does the AttributeReference.name reliably come through as "nested_record.level" rather than just "level"? If not, indexWhere would return -1 and blow up. Have you verified that partition pruning works end-to-end with this change, with L258 triggered?
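
The binding concern above can be sketched in isolation. The snippet below makes no claim about which name Spark actually puts on the attribute; it only shows that `indexWhere` returns -1 when the names differ, and one way a guard could surface that early. `Field`, `Attr`, and `bind` are hypothetical stand-ins, not Spark or Hudi APIs.

```scala
object PredicateBindingSketch {
  // Hypothetical stand-ins for a partition schema field and a resolved attribute.
  case class Field(name: String)
  case class Attr(name: String)

  // Look up the attribute's ordinal in the partition schema by name.
  // indexWhere returns -1 when nothing matches, so the miss is reported
  // explicitly instead of being used later as an invalid index.
  def bind(partitionSchema: Seq[Field], a: Attr): Either[String, Int] = {
    val idx = partitionSchema.indexWhere(f => a.name == f.name)
    if (idx >= 0) Right(idx)
    else Left(s"attribute '${a.name}' not found in partition schema: " +
      partitionSchema.map(_.name).mkString(", "))
  }
}
```

With a partition schema of `Seq(Field("nested_record.level"))`, binding `Attr("nested_record.level")` gives `Right(0)`, while binding `Attr("level")` gives a `Left` with the mismatch message, which is the case an end-to-end pruning test would need to rule out.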

@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
