fix: Support data pruning using nested partition columns #18126

linliu-code wants to merge 2 commits into apache:master
Conversation
force-pushed from d6f9ca7 to 413fa60
@hudi-bot run azure
The command doesn't seem to be working. Let me push again to trigger the Azure test.
force-pushed from 413fa60 to eaf9bd8
@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
      writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    })
  }

  @ParameterizedTest
  @CsvSource(Array("COW", "MOR"))
Could this test be extracted and called from TestCOWDataSource for the COW table and TestMORDataSource for the MOR table?
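A minimal sketch of the suggested extraction, assuming a shared trait mixed into both suites (the trait and method names are hypothetical, not from the PR):

```scala
// Hypothetical shared trait; each suite supplies only its own table type.
trait NestedPartitionPruningTestSupport { self: HoodieSparkClientTestBase =>
  // Single-table-type body extracted from the parameterized test.
  protected def verifyNestedPartitionPruning(tableType: String): Unit = {
    // write a table partitioned by a nested column for `tableType`,
    // query with a filter on that column, and assert files are pruned
  }
}

// TestCOWDataSource would then call verifyNestedPartitionPruning("COW")
// and TestMORDataSource verifyNestedPartitionPruning("MOR"), replacing
// the single @CsvSource(Array("COW", "MOR")) parameterized test.
```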
    val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-     .map(column => nameFieldMap.apply(column))
+     .map(column => StructField(column, nameFieldMap.apply(column).dataType))
Could generateFieldMap(schema) be fixed instead? If it is reused later, the same issue will be encountered.
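For reference, a minimal sketch of what a full-path-aware generateFieldMap could look like. This is an assumption about its shape; the actual method in SparkHoodieTableFileIndex may differ:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: flatten a schema into a map keyed by the full dotted path,
// renaming each StructField to that path so callers such as the
// partition-field lookup get the qualified name directly.
def generateFieldMap(schema: StructType, prefix: String = ""): Map[String, StructField] = {
  schema.fields.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType =>
        // keep an entry for the struct itself and recurse into children
        generateFieldMap(nested, path) + (path -> field.copy(name = path))
      case _ =>
        Map(path -> field.copy(name = path))
    }
  }.toMap
}
```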
    val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-     .map(column => nameFieldMap.apply(column))
+     .map(column => StructField(column, nameFieldMap.apply(column).dataType))
The fix correctly addresses the dataSchema exclusion problem (lines 171-172), but I'm wondering if it could cause issues with partition predicate binding at line 258: partitionSchema.indexWhere(a.name == _.name). When Spark resolves a filter like nested_record.level = 'INFO' against a partition schema whose field name is "nested_record.level", does AttributeReference.name reliably come through as "nested_record.level" rather than just "level"? If not, indexWhere would return -1 and blow up. Have you verified that partition pruning works end-to-end with this change, with line 258 actually triggered?
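A small illustration of the concern (field names are illustrative): if the resolved attribute carries only the leaf name while the partition schema stores the full path, the lookup misses:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Partition schema as produced by the fix: full-path field name.
val partitionSchema = StructType(Seq(StructField("nested_record.level", StringType)))

// If AttributeReference.name resolves to just the leaf name, no match:
val leafOnly = partitionSchema.indexWhere("level" == _.name)               // -1
// Versus the full path, which matches as intended:
val fullPath = partitionSchema.indexWhere("nested_record.level" == _.name) // 0
```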
Describe the issue this Pull Request addresses
There's a change in behavior in SparkHoodieTableFileIndex since 0.14.1: the StructType(partitionFields) it returns doesn't carry the full path of nested partition fields, causing data validation failures. The behavior changed as part of https://github.com/apache/hudi/pull/9863
Summary and Changelog
If a table has a nested partition column whose leaf name conflicts with another top-level field, the partitionSchema passed to the new file group reader is incorrect. The fix is to return the partition field with its full path name instead of the inner (leaf) field name, as sketched below.
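As an illustration of the conflict (the schema and field names here are hypothetical), consider a top-level level column alongside a nested partition column nested_record.level:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val tableSchema = StructType(Seq(
  StructField("level", StringType),            // top-level data column
  StructField("nested_record", StructType(Seq(
    StructField("level", StringType))))        // nested partition column
))

// Before: the partition field surfaced only the leaf name, colliding
// with the top-level column of the same name.
val before = StructField("level", StringType)
// After: the partition field carries the full path, so the partition
// schema passed to the file group reader is unambiguous.
val after = StructField("nested_record.level", StringType)
```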
Impact
Medium.
Risk Level
Low.
Documentation Update
Contributor's checklist