# Postmortem

## Overview

Between November 26th and 28th, our database experienced an outage caused by insufficient memory on the database VM. As a result, every request that depended on the database failed with status code 500. The incident was identified through our monitoring stack, which provided the evidence needed to diagnose the issue.

## Incident Timeline

- **November 26th:** The database began experiencing memory issues, leading to service disruptions.
- **November 27th:** Continued failures as the database container repeatedly attempted to restart but exited with code 137, indicating it had been killed for running out of memory (a quick way to confirm this is sketched after the timeline).
- **November 28th:** The issue persisted until manual intervention was performed.
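
For context, exit code 137 corresponds to the process being killed with SIGKILL (128 + 9), which is what Docker reports when the kernel's OOM killer or a container memory limit terminates the database. Below is a minimal sketch of how such a failure can be confirmed from the host, assuming the database runs in a Docker container; the container name `db` is illustrative.

```python
import json
import subprocess

def container_state(name: str) -> dict:
    """Return the State section of `docker inspect` for the given container."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)[0]["State"]

if __name__ == "__main__":
    state = container_state("db")  # "db" is a placeholder container name
    print("ExitCode :", state["ExitCode"])    # 137 = 128 + SIGKILL(9)
    print("OOMKilled:", state["OOMKilled"])   # True when Docker attributes the kill to OOM
```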

## Visual Evidence

### Request Duration

### Error Rate

These graphs show the spike in request durations and error rates during the incident window, up to the point where requests failed outright.

## Root Cause

The root cause of the outage was insufficient memory on the database VM, which left the database container unable to stay up after each restart.

## Resolution

- **Immediate Actions:**
  - Restarted the server and the database container. This worked temporarily, which explains the window of normal (though still slow) metrics between the 26th and 27th.
  - Planned to add swap space to stabilize the virtual machine (a sketch of the intended setup follows this list).

- **Long-term Solution:**
  - After repeated database failures, we decided to scale vertically by moving to a more powerful virtual machine (an Azure B2s instance).
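
The swap plan mentioned above amounts to the standard Linux procedure of creating and enabling a swap file. Below is a minimal sketch of those steps, assuming a 2 GiB swap file at `/swapfile`; the size and the Python wrapper around the shell commands are illustrative, not the exact commands we ran.

```python
import subprocess

SWAPFILE = "/swapfile"
SWAP_SIZE = "2G"  # assumed size for illustration

def run(cmd: list[str]) -> None:
    """Run a command as a child process and raise if it fails."""
    subprocess.run(cmd, check=True)

def enable_swap() -> None:
    run(["fallocate", "-l", SWAP_SIZE, SWAPFILE])  # reserve space for the swap file
    run(["chmod", "600", SWAPFILE])                # restrict access to root
    run(["mkswap", SWAPFILE])                      # format the file as swap
    run(["swapon", SWAPFILE])                      # enable it immediately
    # Persist across reboots by adding an /etc/fstab entry.
    with open("/etc/fstab", "a") as fstab:
        fstab.write(f"{SWAPFILE} none swap sw 0 0\n")

if __name__ == "__main__":
    enable_swap()  # must run as root
```

Swap only buys headroom on a memory-starved VM; it does not replace adequate RAM, which is why the vertical scaling above became the lasting fix.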

## Lessons Learned

- **Monitoring Effectiveness:**
  - Our monitoring stack was instrumental in identifying the issue.
  - However, the lack of automated alerting delayed our response.

## How to prevent this in the future

1. **Implement Alerting:**
   - Set up Grafana alerts to notify us of critical issues such as low memory or high CPU usage (an interim stand-in is sketched below).
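
Until Grafana-managed alert rules are in place, a minimal stand-in could be a cron-driven check on the VM that posts to a chat webhook when memory runs low. The sketch below is an assumption, not our current setup: it relies on the third-party `psutil` and `requests` packages, a hypothetical `WEBHOOK_URL`, and an arbitrary 90% threshold.

```python
import psutil    # third-party: pip install psutil
import requests  # third-party: pip install requests

# Hypothetical incoming-webhook URL (Slack, Teams, etc.); replace with a real one.
WEBHOOK_URL = "https://example.com/hooks/ops-alerts"
MEMORY_THRESHOLD = 90.0  # percent of RAM in use before alerting (example value)

def check_memory() -> None:
    """Post an alert to the webhook if memory usage crosses the threshold."""
    used = psutil.virtual_memory().percent
    if used >= MEMORY_THRESHOLD:
        requests.post(
            WEBHOOK_URL,
            json={"text": f"Database VM memory usage at {used:.1f}%"},
            timeout=10,
        )

if __name__ == "__main__":
    # Intended to run from cron, e.g. every five minutes.
    check_memory()
```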