Conversation

Collaborator

@vividroyjeong vividroyjeong commented Sep 19, 2025

Update VDYP batch to process the partitioned files in parallel

TAC

  • Partitioned files can be processed by VDYP batch in parallel (at least two partitions at a time)
  • Output is stored in a structured way next to the partitioned input files for later aggregation

Out of Scope

  • Aggregation of the output files is a separate step from this
  • Updating the DB with progress as processing completes / fails
  • This module does not need to be deployed to OpenShift

Developer Note:

  • To proceed, call the projection and generate the correct output files using an internal poly.csv and layer.csv.
  • Adjust the existing skeleton to accommodate the actual file formats required for the projection.

Collaborator

@pminter-vivid pminter-vivid left a comment


Besides my specific line comments, I am slightly concerned about the memory expense of parsing the CSV inputs only to recreate CSV input streams for projection. Do you see a version of this in which we do not parse the CSVs at all, beyond partitioning them by FEATURE_ID using just the first part of each line (up to the comma), and simply write the raw CSV lines out to partition folders as input? The projection service would then read the CSVs straight out of the partition input folder. Is there a reason this would not work?
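The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the class and method names (`FeatureIdPartitioner`, `featureId`, `partitionFor`, `partition`) are assumptions, and the real code would write to files in partition folders rather than arbitrary `Writer`s.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.Map;

/**
 * Hypothetical sketch of the reviewer's suggestion: route each raw CSV line
 * to a partition using only the FEATURE_ID prefix (the text before the first
 * comma), never parsing the rest of the line.
 */
public class FeatureIdPartitioner {

    /** Reads the FEATURE_ID prefix without parsing the full CSV line. */
    static String featureId(String line) {
        int comma = line.indexOf(',');
        return comma >= 0 ? line.substring(0, comma) : line;
    }

    /** Maps a FEATURE_ID to one of n partitions. */
    static int partitionFor(String featureId, int n) {
        return Math.floorMod(featureId.hashCode(), n);
    }

    /** Streams input line by line, appending each line to its partition's writer. */
    static void partition(BufferedReader in, Map<Integer, Writer> writers, int n)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            Writer w = writers.get(partitionFor(featureId(line), n));
            w.write(line);
            w.write(System.lineSeparator());
        }
    }
}
```

Because only the prefix of each line is ever inspected, memory stays bounded by one line at a time regardless of input size.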

layerReader.open(executionContext);

// Load all polygon data
loadAllPolygonData();
Collaborator


I am concerned that these two functions, while efficient, may use too much memory in production. Batch will be operating on very large data sets, and loading all of that into memory to partition may be expensive.

Collaborator Author


To address memory concerns, I've refactored the batch module to process files line-by-line with configurable chunk sizes (default 10), store raw CSV data without parsing into Java objects, and use lazy loading to keep only the current chunk in memory.
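The chunk-based lazy reading described above could look something like this. The class and method names are assumptions for illustration only; the real module presumably wires this into Spring Batch readers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch (names assumed): lazily reads raw CSV lines in fixed-size chunks,
 * so only the current chunk is resident in memory at any time.
 */
public class ChunkedCsvReader {
    private final BufferedReader reader;
    private final int chunkSize;

    public ChunkedCsvReader(BufferedReader reader, int chunkSize) {
        this.reader = reader;
        this.chunkSize = chunkSize;
    }

    /** Returns the next chunk of raw lines, or an empty list at end of input. */
    public List<String> nextChunk() throws IOException {
        List<String> chunk = new ArrayList<>(chunkSize);
        String line;
        while (chunk.size() < chunkSize && (line = reader.readLine()) != null) {
            chunk.add(line); // raw CSV line, never parsed into domain objects
        }
        return chunk;
    }
}
```

With a default chunk size of 10, memory use is bounded by ten raw lines rather than the whole file.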

/**
* Loads parameters from the configured parameters file
*/
private Parameters loadParameters() throws IOException {
Collaborator


Is this reading the parameters file from a configuration location just for now, since the processing ticket skips actually reading data from the job? It seems like this should be referenced from the BatchRecord, or some parent of the BatchRecord, rather than an application.properties setting.

Collaborator Author


You're absolutely right. I've refactored the approach - Parameters are now passed from job level via the BatchJobRequest.
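The shape of that refactor might be roughly as follows. This is a sketch only: the inner `Parameters` stand-in and its `chunkSize` field are invented for illustration and do not reflect the real VDYP `Parameters` type, and the `BatchJobRequest` fields shown are assumptions.

```java
/**
 * Sketch (assumed shapes): projection Parameters travel with each job
 * request instead of being read from application.properties, so every
 * batch run can carry its own settings.
 */
public class BatchJobRequest {

    /** Hypothetical stand-in for the real VDYP Parameters type. */
    public static class Parameters {
        private final int chunkSize;

        public Parameters(int chunkSize) {
            this.chunkSize = chunkSize;
        }

        public int getChunkSize() {
            return chunkSize;
        }
    }

    private final Parameters parameters;

    public BatchJobRequest(Parameters parameters) {
        this.parameters = parameters;
    }

    public Parameters getParameters() {
        return parameters;
    }
}
```

The design point is that parameters become per-job data rather than global application configuration.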

@vividroyjeong
Collaborator Author

I implemented your suggestions: the batch module now extracts FEATURE_ID by reading only the first part of each line and writes raw CSV data to partition folders without full parsing. The refactor combines chunk-based batch processing with streaming CSV partitioning.

Collaborator

@pminter-vivid pminter-vivid left a comment


I think you could refactor that utility sanitize function before merging, but I also won't hold this up because of it.

Feel free to refactor that function and then merge or merge if you disagree with the refactor.

/**
* Sanitizes filename for safe logging.
* Removes control characters, line breaks, and limits length.
*/
private String sanitizeFilename(String filename) {
Collaborator


This is a duplicate of sanitizeFileName. As soon as you are making a duplicate of something that operates statically, you should make a static Utils class that does this processing for you.

Please refactor to Utils.SanitizeForLogging(String filename)
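A shared helper along those lines might look like this. It is a sketch of the idea, not the merged code; the 255-character cap is an assumed value, and the method is lowercased to `sanitizeForLogging` per Java naming convention (the comment above writes it as `SanitizeForLogging`).

```java
/**
 * Sketch of a shared static utility for log-safe filenames, replacing the
 * duplicated per-class sanitize methods.
 */
public final class Utils {
    private static final int MAX_LENGTH = 255; // assumed cap, for illustration

    private Utils() {
        // static utility class; no instances
    }

    /** Strips control characters (including line breaks), then truncates. */
    public static String sanitizeForLogging(String filename) {
        if (filename == null) {
            return "";
        }
        String cleaned = filename.replaceAll("\\p{Cntrl}", "");
        return cleaned.length() > MAX_LENGTH
                ? cleaned.substring(0, MAX_LENGTH)
                : cleaned;
    }
}
```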

Collaborator Author


@pminter-vivid refactored to use 'Utils.SanitizeForLogging.' Thanks for the review.


@vividroyjeong vividroyjeong merged commit d072606 into main Sep 25, 2025
12 checks passed
@vividroyjeong vividroyjeong deleted the feature/VDYP-714 branch October 7, 2025 19:19