perf: Add HoodieROPathFilter support during file listing to prevent driver OOM#18136

Open
suryaprasanna wants to merge 3 commits into apache:master from suryaprasanna:path-filter-during-listing

Conversation

@suryaprasanna (Contributor)

Describe the issue this Pull Request addresses

This PR addresses OOM (Out of Memory) issues on the Spark driver when querying large Hudi datasets that have multiple versions of files in the same partition. When file listing is performed without filtering, all file versions are loaded into memory, which can cause the driver to run out of memory.

Summary and Changelog

Users can now enable path filtering during file listing to avoid loading multiple file versions into memory on the driver. This is controlled by the new config hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter.

Changes:

  • Added new config FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER (default: false) to enable HoodieROPathFilter during file listing
  • Extended HoodieTableMetadata, BaseTableMetadata, and FileSystemBackedTableMetadata to support optional StoragePathFilter parameter
  • Added FSUtils.getAllDataFilesInPartition overload that accepts path filter option
  • Created HoodieROTableStoragePathFilter wrapper to adapt Hadoop PathFilter to Hudi's StoragePathFilter interface
  • Updated BaseHoodieTableFileIndex to use path filter when enabled, constructing FileSlices directly from filtered files
  • Modified SparkHoodieTableFileIndex to wrap and apply HoodieROTablePathFilter during partition file listing

Impact

Config Changes:

  • New config: hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter (default: false)
    • When enabled, applies HoodieROTablePathFilter during file listing to filter out older file versions
    • Helps prevent OOM issues on driver for large tables with multiple file versions
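
For illustration, a Spark read that opts into the filter might look like the following sketch (the option key is the one added in this PR; `spark` and `basePath` are placeholders, and this fragment assumes a Spark session with Hudi on the classpath):

```java
// Hypothetical usage sketch: opt in to RO path filtering during file listing.
// The flag defaults to false, so existing behavior is unchanged unless set.
Dataset<Row> df = spark.read().format("hudi")
    .option("hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter", "true")
    .load(basePath);
```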

Performance:

  • Reduces memory pressure on Spark driver for datasets with multiple file versions per partition
  • Enables successful queries on very large tables that previously failed with OOM errors

Risk Level

Low - The feature is behind a config flag (default: false) and does not change existing behavior unless explicitly enabled.

Verification:

  • Existing unit tests pass
  • Tested on large production datasets with OOM issues - queries now succeed with filter enabled

Documentation Update

Config documentation is included in the withDocumentation method of the new config property.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 8, 2026
@suryaprasanna suryaprasanna force-pushed the path-filter-during-listing branch 2 times, most recently from 95fda80 to 2dabf5d Compare February 9, 2026 16:16
@suryaprasanna (Contributor Author)

@nsivabalan seems like the checks are not running on the PR. Can you please check?

…on the driver

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, pwason, jingli, meenalb, singh.sumit

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, pwason

Tags: #has_java

JIRA Issues: HUDI-6646

Differential Revision: https://code.uberinternal.com/D17441111

Fix build failures

Fix checkstyle

Refactor code

Create unit tests
@suryaprasanna suryaprasanna force-pushed the path-filter-during-listing branch from 2dabf5d to 9b8395e Compare February 9, 2026 18:23
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Feb 9, 2026
@nsivabalan (Contributor)

Can we consider both table types and all query types, and ensure we wire in the config only where applicable?

@apache apache deleted a comment from hudi-bot Feb 10, 2026
@@ -143,6 +146,7 @@ public abstract class BaseHoodieTableFileIndex implements AutoCloseable {
* @param configProperties unifying configuration (in the form of generic properties)
* @param queryType target query type
* @param queryPaths target DFS paths being queried
Contributor Author:

Add MOR condition.

@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot run azure re-runs the last Azure build.

.fromProperties(configProperties)
.enable(configProperties.getBoolean(ENABLE.key(), DEFAULT_METADATA_ENABLE_FOR_READERS)
&& HoodieTableMetadataUtil.isFilesPartitionAvailable(metaClient))
&& HoodieTableMetadataUtil.isFilesPartitionAvailable(metaClient)
Contributor:

Hey @yihua: let's chat about this as well tomorrow.


if (useROPathFilterForListing && !shouldIncludePendingCommits) {
// Group files by partition path, then by file group ID
Map<String, PartitionPath> partitionsMap = new HashMap<>();
Contributor:

Can we move this to a private method, e.g. generatePartitionFileSlicesPostROTablePathFilter?

* By passing metaClient and completedTimeline, we can sync the view seen from this class against HoodieFileIndex class
*/
public HoodieROTablePathFilter(Configuration conf,
public HoodieROTablePathFilter(StorageConfiguration conf,
Contributor:

Hey @yihua: can you review the changes in this patch?

" them (if possible).")

val FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER: ConfigProperty[Boolean] =
ConfigProperty.key("hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter")
Contributor:

Suggested config key rename: hoodie.datasource.read.file.index.optimize.listing.using.path.filter

properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key, listingModeOverride)
}

var hoodieROTablePathFilterBasedFileListingEnabled = getConfigValue(options, sqlConf,
Contributor:

Once we fix the config key, let's fix these vars as well.

val result = spark.sql(s"select id, name, price, ts from $tableName order by id").collect()
// Should have deleted records where id % 3 = 0 (3, 6, 9)
// Should have doubled price for even ids (2, 4, 8, 10)
assert(result.length == 7) // 10 - 3 deleted = 7
Contributor:

Do you think the below assertion would work?

We can rename one of the earlier versions of a file slice so that HoodieBaseFile parsing will fail. If the RO table path filter works as intended, listing files from a given partition should not fail, since we won't even try to parse the renamed file. But if the RO table path filter did not work, the listing would fail.

@yihua (Contributor) left a comment:

Nice work on adding an opt-in path filter to avoid driver OOM during file listing. The feature is well-motivated and the config is cleanly gated behind a default-off flag. The main concerns are correctness issues in the new loadFileSlicesForPartitions fast path: the partition path passed to FileSlice appears to be absolute rather than relative, and the partition map lookup can NPE if the key doesn't match. It's also worth clarifying MOR table compatibility and fixing the shared Hadoop config mutation in getPartitionPathFilter before merging.

Map<String, PartitionPath> partitionsMap = new HashMap<>();
partitions.forEach(p -> partitionsMap.put(p.path, p));
Map<PartitionPath, List<FileSlice>> partitionToFileSlices = new HashMap<>();

Contributor:

The partitionPathStr here is the absolute path (pathInfo.getPath().getParent().toString()), but FileSlice expects a relative partition path. The existing code path via HoodieTableFileSystemView always uses relative paths, so this would cause mismatches downstream wherever FileSlice.getPartitionPath() is used. Should this be relPartitionPath instead?

// Create FileSlice obj from StoragePathInfo.
String partitionPathStr = pathInfo.getPath().getParent().toString();
String relPartitionPath = FSUtils.getRelativePartitionPath(basePath, pathInfo.getPath().getParent());
HoodieBaseFile baseFile = new HoodieBaseFile(pathInfo);
Contributor:

If relPartitionPath doesn't exactly match a key in partitionsMap, partitionPathObj will be null and the computeIfAbsent call below will throw an NPE. This could happen with path normalization differences (trailing slashes, scheme differences). Could you add a null check, or use getRelativePartitionPath consistently with how PartitionPath.path was originally set?
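
To illustrate the suggested null check, here is a minimal, self-contained sketch; PartitionPath and FileSlice below are hypothetical, stripped-down stand-ins for the real classes, and the trailing-slash mismatch is an invented example of a normalization difference:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in types (illustrative only, not the real Hudi classes).
class PartitionPath { final String path; PartitionPath(String p) { this.path = p; } }
class FileSlice { final String fileId; FileSlice(String id) { this.fileId = id; } }

public class PartitionLookupSketch {
  public static void main(String[] args) {
    Map<String, PartitionPath> partitionsMap = new HashMap<>();
    partitionsMap.put("2026/02/09", new PartitionPath("2026/02/09"));

    Map<PartitionPath, List<FileSlice>> partitionToFileSlices = new HashMap<>();

    // Simulate a listing result whose parent does not normalize to a known key.
    String relPartitionPath = "2026/02/09/";  // trailing slash: no exact match

    PartitionPath partitionPathObj = partitionsMap.get(relPartitionPath);
    if (partitionPathObj == null) {
      // Guard instead of letting a null key flow downstream; real code
      // might normalize the path first, or log and skip the file.
      System.out.println("skipped unmatched partition: " + relPartitionPath);
    } else {
      partitionToFileSlices
          .computeIfAbsent(partitionPathObj, k -> new ArrayList<>())
          .add(new FileSlice("f1"));
    }
    System.out.println("slices=" + partitionToFileSlices.size());
  }
}
```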

List<StoragePathInfo> allFiles = listPartitionPathFiles(partitions, activeTimeline);
log.info("On {} with query instant as {}, it took {}ms to list all files {} Hudi partitions",
metaClient.getTableConfig().getTableName(), queryInstant.map(instant -> instant).orElse("N/A"),
timer.endTimer(), partitions.size());
Contributor:

Have you considered what happens with MOR tables here? HoodieROTablePathFilter only returns base files (it calls fsView.getLatestBaseFiles()), so this path constructs FileSlices without log files. The !shouldIncludePendingCommits guard doesn't prevent MOR tables from reaching this code. It might be worth adding a table-type check (COW only) or documenting this limitation.
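
The suggested table-type guard could be sketched as follows; the enum and flag names below are illustrative stand-ins, not the PR's exact fields:

```java
// Hypothetical guard: only take the filtered fast path for COW tables,
// since the RO path filter yields base files without log files.
enum HoodieTableType { COPY_ON_WRITE, MERGE_ON_READ }

public class TableTypeGuardSketch {
  static boolean useFilteredListing(HoodieTableType tableType,
                                    boolean useROPathFilterForListing,
                                    boolean shouldIncludePendingCommits) {
    return useROPathFilterForListing
        && !shouldIncludePendingCommits
        && tableType == HoodieTableType.COPY_ON_WRITE;
  }

  public static void main(String[] args) {
    // COW with the filter enabled takes the fast path; MOR does not.
    System.out.println(useFilteredListing(HoodieTableType.COPY_ON_WRITE, true, false));
    System.out.println(useFilteredListing(HoodieTableType.MERGE_ON_READ, true, false));
  }
}
```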

@@ -146,6 +147,12 @@ public List<StoragePathInfo> getAllFilesInPartition(StoragePath partitionPath) t
}

@Override
Contributor:

It looks like the @Override annotation that belonged to getAllFilesInPartitions(Collection<String>) has been absorbed by the new method insertion. In the diff, the @Override on line 148 now applies to the new two-arg overload, while the original single-arg method (which is the actual interface abstract method) loses its @Override. Could you add @Override back to the original method?

val conf = HadoopFSUtils.getStorageConf(spark.sparkContext.hadoopConfiguration)
if (specifiedQueryInstant.isDefined) {
conf.set(HoodieCommonConfig.TIMESTAMP_AS_OF.key(), specifiedQueryInstant.get)
}
Contributor:

HadoopFSUtils.getStorageConf(spark.sparkContext.hadoopConfiguration) wraps the shared Hadoop config without copying it. The subsequent conf.set(TIMESTAMP_AS_OF, ...) would mutate the global Spark session config, which could affect other queries in the same session. Could you use getStorageConfWithCopy instead?
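
The hazard can be illustrated with plain java.util.Properties standing in for the shared Hadoop Configuration: mutating the shared object leaks the per-query setting to everyone, while a defensive copy keeps it local. The key and value below are illustrative:

```java
import java.util.Properties;

public class ConfigCopySketch {
  public static void main(String[] args) {
    // Shared session-wide configuration (analogy for the Hadoop conf).
    Properties shared = new Properties();
    shared.setProperty("existing.key", "value");

    // Copy first, then set the per-query timestamp on the copy only.
    Properties perQuery = new Properties();
    perQuery.putAll(shared);
    perQuery.setProperty("hoodie.read.timestamp.as.of", "20260209182300000");

    // The shared config is untouched; only the copy carries the setting.
    System.out.println(shared.getProperty("hoodie.read.timestamp.as.of"));
    System.out.println(perQuery.getProperty("hoodie.read.timestamp.as.of"));
  }
}
```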

hoodieROTablePathFilterBasedFileListingEnabled = getConfigValue(options, sqlConf,
"spark." + DataSourceReadOptions.FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER.key, null)
if (hoodieROTablePathFilterBasedFileListingEnabled != null) {
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER.key,
Contributor:

nit: the comment says "For 0.14 rollout", which looks copied from the HMS listing config block. This is a new 1.2.0 config, so the comment is misleading.

return getAllFilesInPartitions(partitions);
}

public Map<String, List<StoragePathInfo>> getAllFilesInPartitions(Collection<String> partitions)
Contributor:

nit: Have this call getAllFilesInPartitions(partitions, Option.empty()) so it's easier to read?

Comment on lines 157 to +164
Map<String, List<StoragePathInfo>> getAllFilesInPartitions(Collection<String> partitionPaths)
throws IOException;

default Map<String, List<StoragePathInfo>> getAllFilesInPartitions(Collection<String> partitionPaths,
Option<StoragePathFilter> pathFilterOption)
throws IOException {
return getAllFilesInPartitions(partitionPaths);
}
Contributor:

Make getAllFilesInPartitions(Collection&lt;String&gt; partitionPaths) a default method that delegates to getAllFilesInPartitions(partitionPaths, Option.empty()), so subclasses avoid repeating code. Then getAllFilesInPartitions(Collection&lt;String&gt; partitionPaths, Option&lt;StoragePathFilter&gt; pathFilterOption) becomes the abstract method.
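
The suggested reshaping can be sketched with simplified stand-in types (java.util.Optional replaces Hudi's Option, and String replaces StoragePathInfo/StoragePathFilter; the interface and class names are hypothetical):

```java
import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// The one-arg variant is a default that delegates with an empty filter;
// the two-arg overload is the single abstract method implementers write.
interface TableMetadataSketch {
  default Map<String, List<String>> getAllFilesInPartitions(Collection<String> partitionPaths)
      throws IOException {
    return getAllFilesInPartitions(partitionPaths, Optional.empty());
  }

  Map<String, List<String>> getAllFilesInPartitions(Collection<String> partitionPaths,
                                                    Optional<String> pathFilterOption)
      throws IOException;
}

public class DefaultMethodSketch implements TableMetadataSketch {
  @Override
  public Map<String, List<String>> getAllFilesInPartitions(Collection<String> partitionPaths,
                                                           Optional<String> pathFilterOption) {
    Map<String, List<String>> out = new HashMap<>();
    for (String p : partitionPaths) {
      out.put(p, pathFilterOption.map(f -> List.of(p + "/" + f)).orElse(List.of(p + "/all")));
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    TableMetadataSketch md = new DefaultMethodSketch();
    // Callers of the one-arg variant still work without duplicated code.
    System.out.println(md.getAllFilesInPartitions(List.of("p1")));
  }
}
```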

HoodieTimer timer = HoodieTimer.start();
List<StoragePathInfo> allFiles = listPartitionPathFiles(partitions, activeTimeline);
log.info("On {} with query instant as {}, it took {}ms to list all files {} Hudi partitions",
metaClient.getTableConfig().getTableName(), queryInstant.map(instant -> instant).orElse("N/A"),
Contributor:

nit: the .map(instant -> instant) in queryInstant.map(instant -> instant).orElse("N/A") is a no-op; you can simplify to queryInstant.orElse("N/A").
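
A quick demonstration of the no-op, using java.util.Optional in place of Hudi's Option (the instant string is a made-up example):

```java
import java.util.Optional;

public class OptionalIdentitySketch {
  public static void main(String[] args) {
    Optional<String> queryInstant = Optional.of("20260209182300000");
    Optional<String> empty = Optional.empty();

    // Mapping with an identity lambda changes nothing, so both forms agree.
    System.out.println(queryInstant.map(i -> i).orElse("N/A")
        .equals(queryInstant.orElse("N/A")));
    // The empty case falls through to the default either way.
    System.out.println(empty.map(i -> i).orElse("N/A"));
  }
}
```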


public HoodieROTablePathFilter() {
this(new Configuration());
this(HadoopFSUtils.getStorageConf());
Contributor:

HoodieROTablePathFilter and BaseFileOnlyRelation should no longer be used based on the latest master; instead, HoodieCopyOnWriteSnapshotHadoopFsRelationFactory is used.

Contributor:

+1

Let's take a look at all implementations extending from HoodieBaseHadoopFsRelationFactory and wire them in.

@yihua (Contributor) left a comment:

A better and general approach would be adding a file system view based on the latest snapshot only to limit the size of file slices in memory, which is used by the file index. That should solve the problem with better layering.

return getAllDataFilesInPartition(storage, partitionPath, Option.empty());
}

public static List<StoragePathInfo> getAllDataFilesInPartition(HoodieStorage storage,
Contributor:

We might need to change the naming, now that it's not all files.

@nsivabalan (Contributor) left a comment:

Let's add tests for time-travel queries as well.

// Update data in first partition
spark.sql(s"update $tableName set price = 15.0 where id = 1")

// Query single partition with ROPathFilter
Contributor Author:

Add a time-travel test case, and use a different name for getAllFilesInPartitions.

// Add the FileSlice to partitionToFileSlices
PartitionPath partitionPathObj = partitionsMap.get(relPartitionPath);
List<FileSlice> fileSlices = partitionToFileSlices.computeIfAbsent(partitionPathObj, k -> new ArrayList<>());
fileSlices.add(fileSlice);
Contributor:

Can we avoid this special handling and route all the files into the FSV, so that we maintain one flow for all cases? The input files could either already be filtered (if the path filter is applied) or refer to all files (if no path filter). That is much simpler from a maintainability standpoint.



@nsivabalan (Contributor)

A better and general approach would be adding a file system view based on the latest snapshot only to limit the size of file slices in memory, which is used by the file index. That should solve the problem with better layering.

Hey @yihua: based on the latest state of the patch, I feel it nicely sits within HoodieTableMetadata, so we can leverage this with any FSV. I just see some special handling of FileIndex after filtering which we can avoid (feedback shared above); otherwise, the current layering seems OK to me.
