fix: Support data pruning using nested partition columns #18126

linliu-code wants to merge 2 commits into apache:master
Conversation
force-pushed from d6f9ca7 to 413fa60
@hudi-bot run azure
The command doesn't seem to be working. Let me push again to trigger the Azure test.
force-pushed from 413fa60 to eaf9bd8
@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
      writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    })
  }

  @ParameterizedTest
  @CsvSource(Array("COW", "MOR"))
Could this test be extracted and called from TestCOWDataSource for the COW table and TestMORDataSource for the MOR table?
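A minimal sketch of the suggested extraction, assuming a shared trait mixed into both suites (the trait and method names are hypothetical, not from the PR):

```scala
// Hypothetical shared trait; each suite supplies only its own table type.
trait NestedPartitionPruningTestSupport { self: HoodieSparkClientTestBase =>
  // Single-table-type body extracted from the parameterized test.
  protected def verifyNestedPartitionPruning(tableType: String): Unit = {
    // write a table partitioned by a nested column for `tableType`,
    // query with a filter on that column, and assert files are pruned
  }
}

// TestCOWDataSource would then call verifyNestedPartitionPruning("COW")
// and TestMORDataSource verifyNestedPartitionPruning("MOR"), replacing
// the single @CsvSource(Array("COW", "MOR")) parameterized test.
```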
    val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-     .map(column => nameFieldMap.apply(column))
+     .map(column => StructField(column, nameFieldMap.apply(column).dataType))
Could generateFieldMap(schema) be fixed instead? If it is reused later, the same issue will be encountered.
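For reference, a minimal sketch of what a full-path-aware generateFieldMap could look like. This is an assumption about its shape; the actual method in SparkHoodieTableFileIndex may differ:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: flatten a schema into a map keyed by the full dotted path,
// renaming each StructField to that path so callers such as the
// partition-field lookup get the qualified name directly.
def generateFieldMap(schema: StructType, prefix: String = ""): Map[String, StructField] = {
  schema.fields.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType =>
        // keep an entry for the struct itself and recurse into children
        generateFieldMap(nested, path) + (path -> field.copy(name = path))
      case _ =>
        Map(path -> field.copy(name = path))
    }
  }.toMap
}
```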
    val partitionFields: Array[StructField] = partitionColumns.get().filter(column => nameFieldMap.contains(column))
-     .map(column => nameFieldMap.apply(column))
+     .map(column => StructField(column, nameFieldMap.apply(column).dataType))
The fix correctly addresses the dataSchema exclusion problem (lines 171-172), but I'm wondering if it could cause issues with partition predicate binding at line 258: partitionSchema.indexWhere(a.name == _.name). When Spark resolves a filter like nested_record.level = 'INFO' against a partition schema whose field name is "nested_record.level", does AttributeReference.name reliably come through as "nested_record.level" rather than just "level"? If not, indexWhere would return -1 and blow up. Have you verified that partition pruning works end-to-end with this change, with line 258 actually triggered?
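A small illustration of the concern (field names are illustrative): if the resolved attribute carries only the leaf name while the partition schema stores the full path, the lookup misses:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Partition schema as produced by the fix: full-path field name.
val partitionSchema = StructType(Seq(StructField("nested_record.level", StringType)))

// If AttributeReference.name resolves to just the leaf name, no match:
val leafOnly = partitionSchema.indexWhere("level" == _.name)               // -1
// Versus the full path, which matches as intended:
val fullPath = partitionSchema.indexWhere("nested_record.level" == _.name) // 0
```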
Describe the issue this Pull Request addresses
There's a change in behavior in SparkHoodieTableFileIndex since 0.14.1: the StructType(partitionFields) it returns doesn't carry the full path of nested partition fields, causing data validation failures. The behavior changed as part of https://github.com/apache/hudi/pull/9863
Summary and Changelog
If a table has a nested partition column whose leaf name conflicts with another top-level field, the partitionSchema passed to the new file group reader is incorrect. The fix is to return the partition field with its full path name instead of the inner (leaf) field name, as sketched below.
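As an illustration of the conflict (the schema and field names here are hypothetical), consider a top-level level column alongside a nested partition column nested_record.level:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val tableSchema = StructType(Seq(
  StructField("level", StringType),            // top-level data column
  StructField("nested_record", StructType(Seq(
    StructField("level", StringType))))        // nested partition column
))

// Before: the partition field surfaced only the leaf name, colliding
// with the top-level column of the same name.
val before = StructField("level", StringType)
// After: the partition field carries the full path, so the partition
// schema passed to the file group reader is unambiguous.
val after = StructField("nested_record.level", StringType)
```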
Impact
Medium.
Risk Level
Low.
Documentation Update
Contributor's checklist