-
Notifications
You must be signed in to change notification settings - Fork 4
VDYP-714: VDYP Batch - Execution #245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides my specific line commetns I am slightly concerned about the memory expense and processing of parsing out csv inputs only to recreate CSV input streams for projection. Do you see a version of this in which we do not parse the csvs besides partitioining them by FEATURE_ID using just the first bit of each line (up to the comma) and just writing out csvs to partition folders for input. Then for the projection service simply reading the csvs out of the partition input folder. Is there a reason this would not work?
layerReader.open(executionContext); | ||
|
||
// Load all polygon data | ||
loadAllPolygonData(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned that these two functions while efficient may use too much memory in production. Batch will be operating very large data sets loading all that into memory to partition may be expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address memory concerns, I've refactored the batch module to process files line-by-line with configurable chunk sizes (default 10), store raw CSV data without parsing into Java objects, and use lazy loading to keep only the current chunk in memory.
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/service/VdypProjectionService.java
Outdated
Show resolved
Hide resolved
/** | ||
* Loads parameters from the configured parameters file | ||
*/ | ||
private Parameters loadParameters() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this reading parameters file from a configuration location just for now? since the processing ticket is skipping actually reading data from the job? It seems like this should be referenced from the BatchRecord or some parent of the BatchRecord not an application.properties setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right. I've refactored the approach - Parameters are now passed from job level via the BatchJobRequest.
…rdBean to enable CSV header extraction
…reaming CSV partitioning
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/controller/BatchController.java
Fixed
Show fixed
Hide fixed
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/controller/BatchController.java
Fixed
Show fixed
Hide fixed
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/controller/BatchController.java
Fixed
Show fixed
Hide fixed
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/controller/BatchController.java
Fixed
Show fixed
Hide fixed
batch/src/main/java/ca/bc/gov/nrs/vdyp/batch/service/StreamingCsvPartitioner.java
Fixed
Show fixed
Hide fixed
I implemented your suggestions: the batch module now extracts FEATURE_ID by reading only the line's first part, writes raw CSV data to partition folders without full parsing. refactoring - batch processing with chunk-based approach and streaming CSV partitioning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could refactor that Utility sanitze function before merging but I also won't hold this out because of it.
Feel free to refactor that function and then merge or merge if you disagree with the refactor.
* Sanitizes filename for safe logging | ||
* Removes control characters, line breaks, and limits length. | ||
*/ | ||
private String sanitizeFilename(String filename) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a duplicate of sanitizeFileName. As soon as you are making a duplicate of somethign that operates statically you should make a static Utils Class that does this processing for you.
Please refactor to Utils.SanitizeForLogging(String filename)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pminter-vivid refactored to use 'Utils.SanitizeForLogging.' Thanks for the review.
|
Update VDYP batch to process the partitioned files in parallel
TAC
Out of Scope
Developer Note: