feat(common): Add API to fetch log files created on or before given instant time #18142

Open
nada-attia wants to merge 1 commit into apache:master from nada-attia:nada_oss_commit_porting_03

@nada-attia

Describe the issue this Pull Request addresses

This PR adds a new API getAllLogFilesWithMaxCommit to LogReaderUtils that retrieves log files created by commits with instant timestamps less than or equal to a specified max commit time. This is useful for MDT (Metadata Table) consistency checks and log file validation scenarios.

Summary and Changelog

Summary: Adds a new utility method to fetch log files filtered by commit time, along with the commit times associated with each block in those log files.

Changelog:

  • Added getAllLogFilesWithMaxCommit method to LogReaderUtils class
  • The method:
    • Gets the filtered timeline based on commits completed before or on the max instant
    • Gets all file slices in given partitions based on max commit instant
    • Gets all log files from each file slice
    • For each log file, returns the list of commit instant times for blocks created on or before the max commit instant time
  • Added unit tests in TestLogReaderUtils.java
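
The last step in the changelog (per-log-file filtering of block instant times) can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `LogFileCommitFilterSketch` and `filterBlocksByInstants` are hypothetical names, the maps stand in for real log files and the filtered timeline, and omitting log files with no qualifying block is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in; not the Hudi LogReaderUtils implementation.
class LogFileCommitFilterSketch {

  /**
   * For each log file, keeps the instant times of blocks whose commit appears
   * in allowedInstants (the already-filtered timeline).
   * Assumption: log files with no qualifying block are left out of the result.
   */
  static Map<String, List<String>> filterBlocksByInstants(
      Map<String, List<String>> blockInstantsByLogFile, Set<String> allowedInstants) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : blockInstantsByLogFile.entrySet()) {
      List<String> kept = new ArrayList<>();
      for (String blockInstant : entry.getValue()) {
        if (allowedInstants.contains(blockInstant)) {
          kept.add(blockInstant);
        }
      }
      if (!kept.isEmpty()) {
        result.put(entry.getKey(), kept);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> logs = new LinkedHashMap<>();
    logs.put("log.1", Arrays.asList("001", "002"));
    logs.put("log.2", Arrays.asList("003"));
    // Instants completed on or before the max commit, in this toy example.
    Set<String> allowed = new HashSet<>(Arrays.asList("001", "002"));
    System.out.println(filterBlocksByInstants(logs, allowed));
  }
}
```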

Impact

Public API Changes:

  • Added new public static method getAllLogFilesWithMaxCommit(HoodieTableMetaClient, AbstractTableFileSystemView, List<String>, String, HoodieEngineContext) to LogReaderUtils

User-Facing Changes:
Users can now programmatically retrieve log files filtered by a maximum commit time, which is useful for:

  • MDT consistency validation
  • Log file auditing and debugging
  • Custom tooling for log file analysis

Risk Level

Low

The changes are purely additive:

  • New utility method added to existing class
  • No modification to existing functionality
  • Well-tested with unit tests
  • Uses existing, stable Hudi APIs internally

Documentation Update

None - This is an internal utility method. The Javadoc on the method provides sufficient documentation for developers.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…nstant time

Add getAllLogFilesWithMaxCommit API to LogReaderUtils which:
- Gets filtered timeline based on commits modified before or on the max instant
- Gets all file slices in given partitions based on max commit instant
- Gets all log files from each file slice
- For each log file, returns the list of commit instant times for blocks
  created on or before the max commit instant time

This API is useful for MDT consistency checks and log file validation.

Adapted from internal commit c5f7c924d1a3ad386dd498dc41761e7284d0e60e
to use the modern Storage API instead of deprecated FileSystem API.
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 10, 2026

// filter out all commits completed after the max commit completion time
HoodieTimeline filteredTimeline = fsView.getTimeline().filter(instant -> !fsView.getTimeline()
.findInstantsModifiedAfterByCompletionTime(maxCommitCompletionTime)
Contributor

Can we introduce findInstantsModifiedBeforeByCompletionTime in HoodieTimeline itself, rather than reusing findInstantsModifiedAfterByCompletionTime and negating the output?
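
To illustrate the difference the comment points at, here is a toy comparison of a direct "on or before" filter against a negated "after" filter. Everything here is hypothetical: `InstantSketch` and both method names are invented for the sketch, completion times are assumed to be fixed-width strings that sort lexicographically, and the real HoodieTimeline methods may have different semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a completed instant; not a Hudi class.
final class InstantSketch {
  final String requestedTime;
  final String completionTime;

  InstantSketch(String requestedTime, String completionTime) {
    this.requestedTime = requestedTime;
    this.completionTime = completionTime;
  }
}

class CompletionTimeFilterSketch {

  // Direct form: keep instants completed on or before maxCompletionTime.
  static List<String> completedOnOrBefore(List<InstantSketch> instants, String maxCompletionTime) {
    List<String> kept = new ArrayList<>();
    for (InstantSketch instant : instants) {
      if (instant.completionTime.compareTo(maxCompletionTime) <= 0) {
        kept.add(instant.requestedTime);
      }
    }
    return kept;
  }

  // Negated form: collect the "completed after" instants once, then exclude them.
  static List<String> notCompletedAfter(List<InstantSketch> instants, String maxCompletionTime) {
    List<String> after = new ArrayList<>();
    for (InstantSketch instant : instants) {
      if (instant.completionTime.compareTo(maxCompletionTime) > 0) {
        after.add(instant.requestedTime);
      }
    }
    List<String> kept = new ArrayList<>();
    for (InstantSketch instant : instants) {
      if (!after.contains(instant.requestedTime)) {
        kept.add(instant.requestedTime);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<InstantSketch> instants = Arrays.asList(
        new InstantSketch("001", "002"),
        new InstantSketch("003", "006"),   // requested early, completed late
        new InstantSketch("004", "005"));
    System.out.println(completedOnOrBefore(instants, "005")); // [001, 004]
    System.out.println(notCompletedAfter(instants, "005"));   // [001, 004]
  }
}
```

Both forms return the same instants in this toy model; the direct form needs one pass and states the intent in its name.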

HoodieLogBlock block = reader.next();
String logBlockInstantTime = block.getLogBlockHeader().get(INSTANT_TIME);
// check if the log file contains a block created by a commit that is older than or equal to max commit
if (filteredTimeline.containsInstant(logBlockInstantTime)) {
Contributor

If one of the log blocks' instant times is in the archived timeline, this might not return it, right?
Shouldn't we use a containsOrBeforeTimelineStarts kind of API?

Author

I'm under the impression that we cannot archive a deltacommit that has not been compacted and cleaned. Can you confirm? If that is true, the above case (the associated commit for a log block being archived) wouldn't be possible.
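
The scenario discussed in this thread can be modeled with a toy active timeline. This is a sketch under assumptions: `ActiveTimelineSketch` is a hypothetical class, instant times are assumed to sort lexicographically, and the "contains or before start" logic is a guess at what the reviewer's suggested check would do, not a copy of Hudi code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of an active timeline; archived instants are simply absent.
class ActiveTimelineSketch {
  private final List<String> instantTimes; // sorted ascending

  ActiveTimelineSketch(List<String> sortedInstantTimes) {
    this.instantTimes = new ArrayList<>(sortedInstantTimes);
  }

  // Exact membership: an archived instant is reported as missing.
  boolean containsInstant(String instantTime) {
    return instantTimes.contains(instantTime);
  }

  // Also accepts instants older than the first active instant (assumed archived).
  boolean containsOrBeforeTimelineStarts(String instantTime) {
    boolean beforeStart = !instantTimes.isEmpty()
        && instantTime.compareTo(instantTimes.get(0)) < 0;
    return containsInstant(instantTime) || beforeStart;
  }

  public static void main(String[] args) {
    // "001" and "002" were archived; only "003" and "004" remain active.
    ActiveTimelineSketch timeline = new ActiveTimelineSketch(Arrays.asList("003", "004"));
    System.out.println(timeline.containsInstant("001"));                // false
    System.out.println(timeline.containsOrBeforeTimelineStarts("001")); // true
    System.out.println(timeline.containsOrBeforeTimelineStarts("005")); // false
  }
}
```

Whether the archived case can occur for an uncompacted deltacommit is the open question in the thread; the sketch only shows how the two checks would differ if it did.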

JavaRDD<HoodieRecord> writeRecords2 = jsc().parallelize(records2, 1);
List<WriteStatus> statuses2 = client.upsert(writeRecords2, secondCommit).collect();
assertNoWriteErrors(statuses2);

Contributor

Can we validate from the completed commit metadata that log files were in fact produced? Due to small file handling, there is a chance that no log files are produced, only parquet files.

context());

// Verify results
assertNotNull(logFilesWithMaxCommit);
Contributor

Can we also assert that this is non-empty?

// Should have same or more entries when including more commits
assertTrue(allLogFiles.size() >= logFilesWithMaxCommit.size(),
"Should have same or more log files when including third commit");
}
Contributor

Let's add a test where the first log file's commit time is archived.

Comment on lines +97 to +98
HoodieLogFormat.Reader reader = HoodieLogFormat.newReader(metaClient.getStorage(), logFile, null);
while (reader.hasNext()) {
Contributor

The reader should be wrapped in a try-with-resources block. If reader.hasNext() or reader.next() throws, the reader won't be closed and it leaks a file handle. Every other usage of HoodieLogFormat.newReader in the codebase uses try-with-resources. Could you do the same here?
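
A minimal sketch of the pattern being requested, using a hypothetical `FailingReaderSketch` in place of the real log reader (which is not reproduced here). It shows that try-with-resources closes the reader even when `next()` throws.

```java
// Hypothetical reader that always fails on next(); stands in for a real log reader.
class FailingReaderSketch implements AutoCloseable {
  boolean closed = false;

  boolean hasNext() {
    return true;
  }

  String next() {
    throw new IllegalStateException("corrupt block");
  }

  @Override
  public void close() {
    closed = true;
  }
}

class TryWithResourcesSketch {

  // Reads until failure and reports whether the reader was closed afterwards.
  static boolean readAndReportClosed() {
    FailingReaderSketch handle = new FailingReaderSketch();
    try (FailingReaderSketch reader = handle) {
      while (reader.hasNext()) {
        reader.next(); // throws; close() still runs before the catch block
      }
    } catch (IllegalStateException e) {
      // Expected in this sketch; a real caller would handle or rethrow.
    }
    return handle.closed;
  }

  public static void main(String[] args) {
    System.out.println(readAndReportClosed()); // true
  }
}
```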
