@vanessavmac vanessavmac commented Aug 4, 2025

Summary

The current batch image processing system runs MLJobs as a single Celery task. This causes issues when processing large numbers of images (e.g. 100+), since the long-running task can be interrupted or lost.

This PR uses Celery as the task queue and RabbitMQ as the message broker to send batches of images as individual process_pipeline_request tasks into queues dedicated to a specific ML pipeline. Processing services pick up tasks based on the pipelines they host. A periodic Celery beat task listens for completed process_pipeline_request tasks and enqueues save_results tasks.
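The per-pipeline fan-out described above can be sketched with stdlib queues. This is a simplified, hypothetical model of the design, not the actual implementation: the real system routes Celery tasks through RabbitMQ queues, and the PipelineRequest fields and return shape here are placeholders.

```python
# Sketch of the per-pipeline queue fan-out, modeled with stdlib queues.
# The real implementation uses Celery tasks routed to RabbitMQ queues;
# names follow the PR, but this standalone version is hypothetical.
from dataclasses import dataclass
from queue import Queue

@dataclass
class PipelineRequest:
    pipeline: str      # which ML pipeline should handle this image
    image_url: str

# One queue per pipeline; a processing service subscribes only to the
# queues for the pipelines it hosts.
queues: dict[str, Queue] = {}

def enqueue_batch(pipeline: str, image_urls: list[str]) -> int:
    """Fan a batch out as one process_pipeline_request task per image."""
    q = queues.setdefault(pipeline, Queue())
    for url in image_urls:
        q.put(PipelineRequest(pipeline, url))
    return q.qsize()

def process_pipeline_request(req: PipelineRequest) -> dict:
    """Worker-side task: run the pipeline's model on one image."""
    return {"pipeline": req.pipeline, "image": req.image_url, "detections": []}

# A worker hosting the (hypothetical) "moth-detector" pipeline drains
# only that pipeline's queue, one small task at a time.
enqueue_batch("moth-detector", ["img1.jpg", "img2.jpg"])
results = []
while not queues["moth-detector"].empty():
    results.append(process_pipeline_request(queues["moth-detector"].get()))
print(len(results))  # → 2
```

Because each image is its own task, a crashed or interrupted worker loses at most one image's work rather than the whole batch.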

The planning of this feature was discussed in #515. See the comments beginning at #515 (comment)

List of Changes

  • Add a process_pipeline_request task, defined on the processing service, which takes a PipelineRequest and returns the model's results.
  • Update the Job model to include subtasks and inprogress_subtasks fields to track the queued Celery tasks (these can be either process_pipeline_request or save_results tasks).
  • Add a periodic check_ml_job_status task which checks the subtasks of an MLJob, updates the job status, and schedules save_results tasks.
  • Update processing services to include Celery workers that subscribe to their pipelines' queues.
  • Introduce a new MLTaskRecord model which stores the results and stats of each Celery task.
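The bookkeeping in the periodic status check can be sketched as follows. This is a minimal stand-in, not the Django model or the actual beat task: the Job class, the finished-set argument, and the returned save_results task names are hypothetical, though the subtasks/inprogress_subtasks field names follow the PR.

```python
# Sketch of the periodic check_ml_job_status logic: reconcile a Job's
# subtask bookkeeping and schedule save_results for finished work.
# Simplified stand-in for the real Django model and Celery beat task.
from dataclasses import dataclass, field

@dataclass
class Job:
    subtasks: set = field(default_factory=set)            # all queued task ids
    inprogress_subtasks: set = field(default_factory=set) # not yet finished
    status: str = "STARTED"

def check_ml_job_status(job: Job, finished: set) -> list:
    """Beat-task sketch: move finished subtasks out of in-progress,
    update the job status, and return save_results tasks to enqueue."""
    done_now = job.inprogress_subtasks & finished
    job.inprogress_subtasks -= done_now
    if not job.inprogress_subtasks:
        job.status = "SUCCESS"
    # one save_results task per completed process_pipeline_request
    return [f"save_results:{task_id}" for task_id in sorted(done_now)]

job = Job(subtasks={"t1", "t2"}, inprogress_subtasks={"t1", "t2"})
to_enqueue = check_ml_job_status(job, finished={"t1"})
print(job.status, to_enqueue)  # → STARTED ['save_results:t1']
check_ml_job_status(job, finished={"t2"})
print(job.status)  # → SUCCESS
```

Keeping the job status derived from the remaining in-progress subtasks means the check is idempotent: re-running it after a missed beat reaches the same final state.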

Related Issues

Addresses (in part?) #515

Detailed Description

Potential side effects or risks associated with the changes...

(image attached)

How to Test the Changes

Instructions on how to test the changes. Include references to automated and/or manual tests that were created or used to test the changes.

Screenshots

If applicable, add screenshots to help explain this PR (ex. Before and after for UI changes).

Deployment Notes

Include instructions if this PR requires specific steps for its deployment (database migrations, config changes, etc.)

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

vanessavmac and others added 30 commits March 23, 2025 11:17

@vanessavmac vanessavmac changed the base branch from main to 706-support-for-reprocessing-detections-and-skipping-detector August 4, 2025 21:28
@vanessavmac vanessavmac requested a review from mihow August 4, 2025 21:37
Base automatically changed from 706-support-for-reprocessing-detections-and-skipping-detector to main August 16, 2025 01:36
@f-PLT f-PLT changed the title Async distributed ML Backend Enable async and distributed processing for the ML backend Aug 28, 2025
@vanessavmac vanessavmac marked this pull request as ready for review September 5, 2025 04:18
@mihow mihow added the VISS-SSEC label Sep 9, 2025