rkuo-danswer
Contributor

Description

Fixes https://linear.app/danswer/issue/DAN-1820/watchdog-on-celery-beat-in-supervisord

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: Check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check


vercel bot commented Apr 15, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Apr 15, 2025 9:27pm

@rkuo-danswer rkuo-danswer marked this pull request as ready for review April 15, 2025 20:57
@rkuo-danswer rkuo-danswer requested a review from a team as a code owner April 15, 2025 20:57
Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR implements a watchdog mechanism for monitoring and auto-restarting the Celery beat scheduler when it becomes unresponsive, using Redis-based heartbeat tracking.

  • Critical bug: monitor-celery-beat task in beat_schedule.py incorrectly uses MONITOR_PROCESS_MEMORY instead of a dedicated monitoring task type
  • Unreachable log message in supervisord_watchdog.py due to infinite loop in main function
  • Missing error handling for subprocess.call in supervisord_watchdog.py when restarting processes
  • Typo in log message ("succeded") in supervisord_watchdog.py
  • Celery version upgrade from beta (5.5.0b4) to stable (5.5.1) improves reliability
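The heartbeat check summarized above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `heartbeat_is_stale`, `check_once`, and the value assigned to `MAX_AGE_SECONDS` are assumptions made for the example, and the Redis read and the supervisord restart are passed in as plain callables so the logic can be exercised without either service.

```python
import time
from typing import Callable, Optional

# MAX_AGE_SECONDS appears in the PR's log line; the value here is an assumption.
MAX_AGE_SECONDS = 600


def heartbeat_is_stale(
    last_beat: Optional[float],
    now: float,
    max_age: float = MAX_AGE_SECONDS,
) -> bool:
    """Return True when no heartbeat exists or the last one is older than max_age."""
    if last_beat is None:
        return True
    return (now - last_beat) > max_age


def check_once(
    read_heartbeat: Callable[[], Optional[float]],
    restart: Callable[[], None],
    now: Optional[float] = None,
) -> bool:
    """Run one watchdog pass; call restart() and return True if the heartbeat is stale."""
    current = time.time() if now is None else now
    if heartbeat_is_stale(read_heartbeat(), current):
        restart()
        return True
    return False
```

In a real watchdog, `read_heartbeat` would presumably fetch a timestamp that the beat process writes to Redis, and `restart` would shell out to `supervisorctl`. Treating a missing heartbeat as stale is a design choice of this sketch; a startup grace period may be preferable to avoid restarting a process that has not yet written its first heartbeat.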

8 file(s) reviewed, 6 comment(s)

f"elapsed_threshold={MAX_AGE_SECONDS}"
)

subprocess.call(["supervisorctl", "-c", conf, "restart", program])
Contributor

logic: subprocess.call return value should be checked for errors

Suggested change
subprocess.call(["supervisorctl", "-c", conf, "restart", program])
result = subprocess.call(["supervisorctl", "-c", conf, "restart", program])
if result != 0:
    logger.error(f"Failed to restart {program} (exit code {result})")
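One way to package the suggested check is a small helper that reports success or failure to the caller. This is only a sketch under stated assumptions: `restart_program` and its `call` parameter are hypothetical names introduced for illustration, and the `OSError` handling (for example, when `supervisorctl` is not on the PATH) goes beyond what the review comment asks for.

```python
import logging
import subprocess
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def restart_program(
    conf: str,
    program: str,
    call: Callable[[Sequence[str]], int] = subprocess.call,
) -> bool:
    """Restart a supervisord-managed program; return True only on exit code 0.

    `call` defaults to subprocess.call and is injectable (hypothetical
    parameter) so the helper can be tested without supervisord installed.
    """
    cmd = ["supervisorctl", "-c", conf, "restart", program]
    try:
        result = call(cmd)
    except OSError as exc:
        # Raised when the executable cannot be started at all.
        logger.error("Could not run supervisorctl: %s", exc)
        return False
    if result != 0:
        logger.error("Failed to restart %s (exit code %s)", program, result)
        return False
    return True
```

Returning a boolean lets the watchdog loop decide whether to retry or back off instead of silently continuing after a failed restart.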

Richard Kuo (Onyx) added 2 commits April 15, 2025 14:22
@rkuo-danswer rkuo-danswer enabled auto-merge April 15, 2025 21:53
@rkuo-danswer rkuo-danswer added this pull request to the merge queue Apr 15, 2025
Merged via the queue into main with commit 2ac41c3 Apr 15, 2025
10 of 11 checks passed
@rkuo-danswer rkuo-danswer deleted the feature/celery-beat-watchdog branch April 15, 2025 22:58
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request Apr 26, 2025
* upgrade celery to release version

* make the watchdog script more reusable

* use constant

* code review

* catch interrupt

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* upgrade celery to release version

* make the watchdog script more reusable

* use constant

* code review

* catch interrupt

---------

Co-authored-by: Richard Kuo (Onyx) <rkuo@onyx.app>
2 participants