Commit 06f3dca

Merge pull request #175 from OBS-DevOps24/mandatory-2
postmortem for database incident
2 parents 9612f8f + 62374c5

File tree

3 files changed: +43 -1 lines changed


docs/mandatory-2/5_Postmortem.md

Lines changed: 43 additions & 1 deletion
@@ -1,3 +1,45 @@
# Postmortem

-## TODO: write about the db connection issue incident

## Overview
Between November 26th and 28th, our database experienced an outage due to insufficient memory on the database VM. As a result, all requests that depended on the database failed with status code 500. The incident was identified through our monitoring stack, whose request-duration and error-rate dashboards made the failure pattern visible.
## Incident Timeline

- **November 26th:** The database began experiencing memory pressure, leading to service disruptions.
- **November 27th:** Failures continued as the database container attempted to restart but exited with code 137 (SIGKILL), indicating insufficient memory; see the diagnostic sketch after this list.
- **November 28th:** The issue persisted until we intervened manually.
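
Exit code 137 corresponds to 128 + 9 (SIGKILL), which is what a container receives when the kernel's OOM killer reclaims memory. Below is a minimal sketch of how an OOM kill could be confirmed, assuming the database runs as a Docker container and the `docker` Python SDK is available; the container name `db` is hypothetical.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()
container = client.containers.get("db")  # hypothetical container name

# `attrs` mirrors the `docker inspect` output for the container.
state = container.attrs["State"]
print("Exit code: ", state["ExitCode"])   # 137 == 128 + SIGKILL(9)
print("OOM killed:", state["OOMKilled"])  # True when the kernel killed it for memory
```

If `OOMKilled` is true, the restart loop is a resource problem rather than an application bug, which matches what we observed here.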
## Visual Evidence
### Request Duration
![Request Duration](./assets/request-duration.png)
### Error Rate
![Error Rate](./assets/error-rate.png)
These graphs show the spike in request durations and error rates during the incident window, culminating in requests failing outright.
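
For context, a panel like the error-rate graph is typically driven by the ratio of failed to total requests. The snippet below is an illustrative sketch only, assuming Prometheus backs the Grafana dashboards; the endpoint, metric name, and labels are assumptions rather than the exact queries behind our panels.

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

# Fraction of requests answered with a 5xx status over the last 5 minutes.
# `http_requests_total` and its `status` label are placeholders for whatever
# the application actually exports.
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0
print(f"5xx error rate over the last 5m: {error_rate:.2%}")
```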
## Root Cause
The root cause of the outage was a lack of memory for the database VM.
## Resolution

- **Immediate Actions:**
  - Restarted the server and the database container, which worked for a while; this explains the stretch of normal data between the 26th and 27th, although response times remained slow.
  - Planned to add swap space to stabilize the virtual machine.

- **Long-term Solution:**
  - After repeated database failures, we decided to scale vertically to a more powerful virtual machine (an Azure B2s instance).
## Lessons Learned

- **Monitoring Effectiveness:**
  - Our monitoring stack was instrumental in identifying the issue.
  - However, the lack of automated alerting delayed our response.
## How to Prevent This in the Future

1. **Implement Alerting:**
   - Set up Grafana alerts to notify us of critical issues such as low memory and high CPU usage (a sketch of the underlying check follows below).
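
The alert itself would be configured in Grafana, but the condition it evaluates can be sketched in a few lines. The example below is a hedged illustration that assumes node_exporter metrics are scraped by Prometheus; the endpoint, threshold, and notification step are placeholders, not our actual configuration.

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address
THRESHOLD = 0.10                      # flag hosts with less than 10% memory available

# node_exporter exposes these gauges on Linux hosts.
QUERY = "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    instance = sample["metric"].get("instance", "unknown")
    available = float(sample["value"][1])
    if available < THRESHOLD:
        # A Grafana alert would route this to a contact point (Slack, email, ...);
        # printing keeps the sketch self-contained.
        print(f"ALERT: {instance} has only {available:.1%} of memory available")
```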
The other two files in the commit are binary image assets (22.8 KB and 51.3 KB), presumably the request-duration and error-rate graphs referenced above.
