I recently ran into an issue where the mail queue stopped processing for unclear reasons. The watchdog correctly detected the problem and raised warnings, but those warnings were sent by email, and because the queue was stalled they were never delivered. The system knew about the problem, but I did not.
To improve resilience and observability, I’d like to propose adding Prometheus-compatible metrics for:
Mail queue status (e.g., length, processing health)
Watchdog alerts and status reports
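To make the request more concrete, here is a rough sketch of what such an endpoint might return, written in standard-library Python and using the Prometheus text exposition format. The metric names (mailqueue_length, mailqueue_oldest_message_age_seconds, watchdog_alerts_total), the check label, and the render_metrics function are all hypothetical placeholders, not existing names in the project.

```python
from typing import Mapping


def _escape_label(value: str) -> str:
    # Label values must escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(queue_length: int,
                   oldest_message_age_seconds: float,
                   watchdog_alerts: Mapping[str, int]) -> str:
    """Render mail queue and watchdog state as Prometheus text exposition.

    All metric names here are illustrative assumptions, not project names.
    """
    lines = [
        "# HELP mailqueue_length Number of messages waiting in the mail queue.",
        "# TYPE mailqueue_length gauge",
        f"mailqueue_length {queue_length}",
        "# HELP mailqueue_oldest_message_age_seconds Age of the oldest queued message.",
        "# TYPE mailqueue_oldest_message_age_seconds gauge",
        f"mailqueue_oldest_message_age_seconds {oldest_message_age_seconds}",
        "# HELP watchdog_alerts_total Watchdog alerts raised, by check name.",
        "# TYPE watchdog_alerts_total counter",
    ]
    # One sample per watchdog check, sorted for stable output.
    for check, count in sorted(watchdog_alerts.items()):
        lines.append(f'watchdog_alerts_total{{check="{_escape_label(check)}"}} {count}')
    return "\n".join(lines) + "\n"
```

With something like this exposed over HTTP, a stalled queue would show up as a steadily growing oldest-message age, which Prometheus could alert on without depending on email delivery.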
Benefits:
Provides a secondary, more robust alerting path
Enables integration with existing Prometheus/Grafana setups
Reduces dependency on email for critical alerts
Helps in identifying issues before they become critical
I believe this would make the system more reliable, especially in production environments.