feat: get issueDts from AWS in parallel #113
Merged
Linear Issue
IDSSE-1077
IDSSE-1301
Changes
- `Publisher` handles the pika `StreamLostError` exception as well
- Gets `issueDt`s available in AWS for any dataset in parallel
- `ProtocolUtils.get_issues()` accepts a custom value with the `max_workers` argument

Explanation
Previously, `ProtocolUtils` would sequentially crawl a given AWS bucket (such as NBM) to discover the most recent issuance datetimes available. Each `ls` call would wait to complete before attempting to `ls` the next folder. This meant that `get_issues()` response time increased linearly as the `num_issues` argument increased.

Now the function uses simple Python threading to look for recent `issueDt`s in AWS in parallel, sending all the S3 `ls` requests at once and then sorting through what files each one found. By default it does this with up to 24 parallel threads, a value that experimentally proved to be fast.
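The fan-out described above can be sketched with Python's standard `ThreadPoolExecutor`. This is a minimal illustration, not the PR's actual code: the `list_folder` helper and the folder names are hypothetical stand-ins for the real S3 `ls` calls made by `ProtocolUtils`.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 24  # the PR's experimentally-chosen default


def list_folder(prefix: str) -> list[str]:
    """Hypothetical stand-in for an S3 'ls' of one issueDt folder."""
    return [f"{prefix}/file_{i}" for i in range(2)]


def get_issues_parallel(
    prefixes: list[str], max_workers: int = MAX_WORKERS
) -> dict[str, list[str]]:
    """Send all 'ls' requests at once, then gather each folder's listing."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map dispatches every prefix to the thread pool immediately
        # and yields results in the same order as the input prefixes
        results = pool.map(list_folder, prefixes)
    return dict(zip(prefixes, results))
```

Because each thread spends most of its time waiting on network I/O, the wall-clock cost of `get_issues()` stays roughly flat as `num_issues` grows, instead of scaling linearly as it did with the sequential crawl.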