@mackenzie-grimes-noaa mackenzie-grimes-noaa commented Jul 1, 2025

Linear Issue

IDSSE-1077
IDSSE-1301

Changes

  • Don't try to re-establish the RabbitMQ connection inside Publisher
    • Also catch pika's StreamLostError exception?
  • Look up issueDts available in AWS for any dataset in parallel
    • The default is 24 threads, which performed well in testing. ProtocolUtils.get_issues() accepts a custom value via the max_workers argument.

Explanation

Previously, ProtocolUtils would sequentially crawl a given AWS bucket (such as NBM) to discover the most recent issuance datetimes available.

  • Each AWS S3 ls call had to complete before the next folder was listed, so the get_issues() response time increased linearly with the num_issues argument.
  • Getting the latest issueDt might take 500 ms, the latest 3 issueDts about 1.5 seconds, the latest 6 issueDts about 3 seconds, and so on.
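
The linear growth described above can be written as a rough back-of-the-envelope model. This is only a sketch: the function name is hypothetical, and the 500 ms per-call figure is the approximate number quoted above, not a measured constant.

```python
def estimated_sequential_ms(num_issues: int, per_call_ms: int = 500) -> int:
    """Rough latency estimate when each S3 ls call waits for the previous one.

    Assumes one ls call per issueDt and a constant per-call latency,
    which is a simplification of real S3 behavior.
    """
    return num_issues * per_call_ms


for n in (1, 3, 6):
    print(f"{n} issueDt(s): ~{estimated_sequential_ms(n)} ms")
```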

Now the function uses simple Python threading to look for recent issueDts in AWS in parallel, sending the S3 ls requests concurrently and then sorting through the files each one found. By default it uses up to 24 parallel threads, a value that proved fast in informal testing.
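
For readers unfamiliar with the pattern, here is a minimal sketch of a parallel lookup using the standard library's ThreadPoolExecutor. It is not the code from this PR: `find_recent_issues`, `to_prefix`, `fake_ls`, the folder layout, and the `lookback_hours` parameter are all hypothetical stand-ins, and the real S3 ls call is replaced with a stub so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone


def to_prefix(issue_dt):
    # Hypothetical folder layout, for illustration only
    return issue_dt.strftime("%Y%m%d/%H/")


def fake_ls(prefix):
    # Stand-in for an S3 ls call: pretend only the 00, 06, 12, 18Z folders hold files
    hour = int(prefix.split("/")[1])
    return [prefix + "data.grib2"] if hour % 6 == 0 else []


def find_recent_issues(latest, num_issues, ls=fake_ls, max_workers=24, lookback_hours=48):
    # One candidate issueDt per hour, newest first
    candidates = [latest - timedelta(hours=h) for h in range(lookback_hours)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map submits every ls call up front (at most max_workers run
        # at a time) and yields the results in the same order as the inputs
        listings = list(pool.map(lambda dt: ls(to_prefix(dt)), candidates))
    # Keep only the candidates whose folder actually contained files
    found = [dt for dt, files in zip(candidates, listings) if files]
    return sorted(found, reverse=True)[:num_issues]


latest = datetime(2025, 7, 1, 13, tzinfo=timezone.utc)
recent = find_recent_issues(latest, num_issues=3)
print(recent)
```

Because the ls calls spend most of their time waiting on the network, threads are a reasonable fit here even with Python's GIL; a smaller `max_workers` trades speed for fewer simultaneous requests.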

@mackenzie-grimes-noaa mackenzie-grimes-noaa merged commit 95bef15 into main Jul 2, 2025
2 checks passed
@mackenzie-grimes-noaa mackenzie-grimes-noaa deleted the feat/get-issues-parallel branch July 2, 2025 18:04