feat(common): Add API to fetch log files created on or before given instant time #18142
nada-attia wants to merge 1 commit into apache:master
…nstant time

Add getAllLogFilesWithMaxCommit API to LogReaderUtils which:
- Gets filtered timeline based on commits modified before or on the max instant
- Gets all file slices in given partitions based on max commit instant
- Gets all log files from each file slice
- For each log file, returns the list of commit instant times for blocks created on or before the max commit instant time

This API is useful for MDT consistency checks and log file validation.

Adapted from internal commit c5f7c924d1a3ad386dd498dc41761e7284d0e60e to use the modern Storage API instead of deprecated FileSystem API.
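The last step of the commit message (per log file, keep only the block instant times that fall within the filtered timeline) can be illustrated with plain collections. This is a minimal sketch, not the PR's code: `LogFileFilterSketch`, `filterByAllowedInstants`, and the map-of-lists input are hypothetical stand-ins for the Hudi types, and dropping log files with no matching block is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filtering step; no Hudi classes are used.
final class LogFileFilterSketch {

    /**
     * For each log file, keeps only the block instant times contained in
     * allowedInstants (a stand-in for the filtered timeline).
     * Assumption: a log file with no matching block is omitted from the result.
     */
    static Map<String, List<String>> filterByAllowedInstants(
            Map<String, List<String>> blockInstantsByLogFile,
            Set<String> allowedInstants) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : blockInstantsByLogFile.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String instantTime : entry.getValue()) {
                if (allowedInstants.contains(instantTime)) {
                    kept.add(instantTime);
                }
            }
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("a.log", Arrays.asList("001", "003"));
        input.put("b.log", Arrays.asList("003"));
        Set<String> allowed = new HashSet<>(Arrays.asList("001", "002"));
        System.out.println(filterByAllowedInstants(input, allowed));
    }
}
```

Returning a map from log file to instant times mirrors the "for each log file, returns the list of commit instant times" wording above; the real method's return type may differ.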
// filter out all commits completed after the max commit completion time
HoodieTimeline filteredTimeline = fsView.getTimeline().filter(instant -> !fsView.getTimeline()
    .findInstantsModifiedAfterByCompletionTime(maxCommitCompletionTime)
Can we introduce findInstantsModifiedBeforeByCompletionTime in HoodieTimeline itself, rather than reusing findInstantsModifiedAfterByCompletionTime and negating the output?
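A rough illustration of the reviewer's point: a direct "on or before" filter states the intent in one comparison and makes the handling of instants without a completion time explicit, instead of leaving it implicit in a negation. The sketch below uses plain Java only; `CompletionTimeFilterSketch`, `completedOnOrBefore`, and the instant-to-completion-time map are hypothetical, and it assumes fixed-width timestamp strings so that lexicographic order matches chronological order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: instant time -> completion time (null if not yet completed).
final class CompletionTimeFilterSketch {

    /**
     * Returns the instant times whose completion time is on or before
     * maxCompletionTime. Instants without a completion time are excluded
     * explicitly rather than as a side effect of negating an "after" filter.
     */
    static List<String> completedOnOrBefore(
            Map<String, String> completionTimeByInstant, String maxCompletionTime) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : completionTimeByInstant.entrySet()) {
            String completionTime = entry.getValue();
            if (completionTime != null && completionTime.compareTo(maxCompletionTime) <= 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> timeline = new LinkedHashMap<>();
        timeline.put("001", "005");
        timeline.put("002", "010");
        timeline.put("003", null); // still pending
        System.out.println(completedOnOrBefore(timeline, "007"));
    }
}
```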
HoodieLogBlock block = reader.next();
String logBlockInstantTime = block.getLogBlockHeader().get(INSTANT_TIME);
// check if the log file contains a block created by a commit that is older than or equal to max commit
if (filteredTimeline.containsInstant(logBlockInstantTime)) {
If one of the log blocks' instant times is in the archived timeline, this check might not return it, right?
Shouldn't we use a containsOrBeforeTimelineStarts kind of API?
I'm under the impression that we cannot archive a deltacommit that has not been compacted and cleaned. Can you confirm? If that is true, the case above (the commit associated with a log block being archived) wouldn't be possible.
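The difference between the two membership checks discussed in this thread can be modeled with a sorted set. This is a sketch of the idea only, not Hudi's implementation: `TimelineMembershipSketch` and `containsOrBeforeStart` are hypothetical names, the active timeline is reduced to a `TreeSet` of instant time strings, and fixed-width timestamps are assumed so that string order is chronological order.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical model of an active timeline as a sorted set of instant times.
final class TimelineMembershipSketch {

    /**
     * True if the instant is in the active set, or is earlier than the first
     * active instant (that is, it may already have been archived).
     */
    static boolean containsOrBeforeStart(TreeSet<String> activeInstants, String instantTime) {
        if (activeInstants.contains(instantTime)) {
            return true;
        }
        return !activeInstants.isEmpty()
            && instantTime.compareTo(activeInstants.first()) < 0;
    }

    public static void main(String[] args) {
        TreeSet<String> active = new TreeSet<>(Arrays.asList("005", "007"));
        // "003" is not in the active set but precedes its start.
        System.out.println(active.contains("003"));
        System.out.println(containsOrBeforeStart(active, "003"));
    }
}
```

A plain containment check rejects an instant that predates the active set, while the "or before start" variant accepts it. Whether that case can occur for log blocks depends on the archival question raised in the reply above.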
JavaRDD<HoodieRecord> writeRecords2 = jsc().parallelize(records2, 1);
List<WriteStatus> statuses2 = client.upsert(writeRecords2, secondCommit).collect();
assertNoWriteErrors(statuses2);
Can we validate from the completed commit metadata that log files were in fact produced? Due to small file handling, there is a chance that no log files are produced, only parquet files.
    context());

// Verify results
assertNotNull(logFilesWithMaxCommit);
Can we also assert that this is non-empty?
// Should have same or more entries when including more commits
assertTrue(allLogFiles.size() >= logFilesWithMaxCommit.size(),
    "Should have same or more log files when including third commit");
}
Let's add a test where the commit time of the first log file is archived.
HoodieLogFormat.Reader reader = HoodieLogFormat.newReader(metaClient.getStorage(), logFile, null);
while (reader.hasNext()) {
The reader should be wrapped in a try-with-resources block. If reader.hasNext() or reader.next() throws, the reader won't be closed and it leaks a file handle. Every other usage of HoodieLogFormat.newReader in the codebase uses try-with-resources. Could you do the same here?
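The pattern the reviewer asks for can be shown without any Hudi types. In this sketch `FakeReader` is a hypothetical stand-in for the log reader (it is not `HoodieLogFormat.Reader`); the point is only that try-with-resources closes the reader even when `next()` throws partway through the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class TryWithResourcesSketch {

    /** Hypothetical reader: iterates blocks and records whether close() was called. */
    static final class FakeReader implements AutoCloseable {
        private final Iterator<String> blocks;
        boolean closed = false;

        FakeReader(List<String> blocks) {
            this.blocks = blocks.iterator();
        }

        boolean hasNext() {
            return blocks.hasNext();
        }

        String next() {
            String block = blocks.next();
            if (block == null) {
                // Simulates a read failure in the middle of the file.
                throw new IllegalStateException("corrupt block");
            }
            return block;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Reads every block; the reader is closed on both normal and exceptional exit. */
    static List<String> readAll(FakeReader reader) {
        List<String> out = new ArrayList<>();
        try (FakeReader r = reader) {
            while (r.hasNext()) {
                out.add(r.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FakeReader bad = new FakeReader(Arrays.asList("block-1", null));
        try {
            readAll(bad);
        } catch (IllegalStateException e) {
            System.out.println("closed after failure: " + bad.closed);
        }
    }
}
```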
Describe the issue this Pull Request addresses
This PR adds a new API getAllLogFilesWithMaxCommit to LogReaderUtils that retrieves log files created by commits with instant timestamps less than or equal to a specified max commit time. This is useful for MDT (Metadata Table) consistency checks and log file validation scenarios.
Summary and Changelog
Summary: Adds a new utility method to fetch log files filtered by commit time, along with the commit times associated with each block in those log files.
Changelog:
- Add getAllLogFilesWithMaxCommit method to LogReaderUtils class
- Add tests in TestLogReaderUtils.java
Impact
Public API Changes:
- Adds getAllLogFilesWithMaxCommit(HoodieTableMetaClient, AbstractTableFileSystemView, List<String>, String, HoodieEngineContext) to LogReaderUtils
User-Facing Changes:
Users can now programmatically retrieve log files filtered by a maximum commit time, which is useful for MDT consistency checks and log file validation.
Risk Level
Low
The changes are purely additive.
Documentation Update
None - This is an internal utility method. The Javadoc on the method provides sufficient documentation for developers.