# Post-Mortem Template

> **Project:** Bilko
> **Version:** 0.1
> **Date:** 2026-02-23
> **Author:** Ops Architect
> **Status:** Draft (Template — fill in per P0 incident)
> **Reviewers:** Tech Lead, Alem Bašić

## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | 2026-02-23 | Ops Architect | Initial draft |

---

## INSTRUCTIONS

Post-mortems are required for all P0 incidents and recommended for P1 incidents. Schedule the post-mortem session within 5 business days of incident resolution.

**Blameless culture:** This document is about systems and processes, not people. The goal is to learn and prevent recurrence, not to assign blame.

Create a new file: `POST-MORTEM-YYYY-MM-DD-<title>.md` in `docs/operations/post-mortems/`
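
As an illustration of the naming convention above, the path can be built from a date and a title. This is a minimal sketch: the function name `post_mortem_path`, the slug rules (lowercase, hyphen-separated ASCII), and the use of the incident date for `YYYY-MM-DD` are assumptions, not something this template prescribes.

```python
import re
from datetime import date

# Directory taken from the instruction above.
POST_MORTEM_DIR = "docs/operations/post-mortems"


def post_mortem_path(incident_date: date, title: str) -> str:
    """Build POST-MORTEM-YYYY-MM-DD-<title>.md (hypothetical helper).

    Assumption: <title> is a lowercase slug where every run of
    non-alphanumeric ASCII characters becomes a single hyphen.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{POST_MORTEM_DIR}/POST-MORTEM-{incident_date.isoformat()}-{slug}.md"
```

For example, `post_mortem_path(date(2026, 2, 23), "API OOM during PDF export")` yields `docs/operations/post-mortems/POST-MORTEM-2026-02-23-api-oom-during-pdf-export.md`.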

---

# Post-Mortem: [INCIDENT TITLE]

**Post-Mortem Date:** YYYY-MM-DD
**Incident Date:** YYYY-MM-DD
**Incident Reference:** INC-YYYY-MM-DD-NNN
**Facilitator:** [Name]
**Attendees:** [Names]
**Duration of post-mortem session:** [X minutes]

---

## 1. Executive Summary

**What happened:**
[2-3 sentences: the incident, impact, and resolution]

**Why it happened:**
[1-2 sentences: root cause in plain language]

**What we're doing to prevent recurrence:**
[1-2 sentences: top action items]

---

## 2. Impact Summary

| Metric | Value |
|--------|-------|
| Incident duration | X hours Y minutes |
| Detection time | X minutes from first symptom |
| Response time | X minutes from alert to first action |
| Users impacted | [All / Specific org / None] |
| Financial records affected | [None / Describe] |
| Downtime cost (est.) | [€X in lost productivity / TBD] |
| GDPR breach notification required | [Yes / No] |

---

## 3. Timeline (Detailed)

| Time (CET) | Event | Who | Notes |
|-----------|-------|-----|-------|
| HH:MM | [Event] | [Person] | [Notes] |

**Key timestamps:**
- **First symptom:** HH:MM
- **Alert fired:** HH:MM (detection lag: X min)
- **Incident declared:** HH:MM (response lag: X min)
- **Root cause identified:** HH:MM (diagnosis duration: X min)
- **Fix applied:** HH:MM
- **Service restored:** HH:MM
- **Incident closed:** HH:MM
- **Total user impact duration:** X min
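
The lag and duration values in the list above can be derived from the key timestamps rather than estimated by hand. The sketch below is one possible way to do that. The dictionary keys and function names are hypothetical, and it assumes that all timestamps fall on the same calendar day, that diagnosis duration runs from incident declaration to root cause identification, and that user impact runs from first symptom to service restoration.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_lags(ts: dict[str, str]) -> dict[str, int]:
    """Derive the bracketed values of the key timestamps list (assumed definitions)."""
    return {
        "detection_lag": minutes_between(ts["first_symptom"], ts["alert_fired"]),
        "response_lag": minutes_between(ts["alert_fired"], ts["incident_declared"]),
        "diagnosis_duration": minutes_between(
            ts["incident_declared"], ts["root_cause_identified"]
        ),
        "total_user_impact": minutes_between(
            ts["first_symptom"], ts["service_restored"]
        ),
    }
```

An incident that crosses midnight would produce negative values with this sketch; in that case, use full dates and times instead of `HH:MM`.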

---

## 4. Root Cause Analysis

### What happened technically

[Detailed technical explanation of the failure chain. Be specific: which component, which code path, which query.]

**For Bilko financial incidents, this section must include:**
- Which accounting module was affected (VAT / double-entry / invoice calc / currency)
- Was any financial data corrupted? If yes, for which organizations and in which time window?
- Were NUMERIC(19,4) values preserved correctly during the incident?
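
To help answer the last question, values read back after the incident can be checked against the column type. The following is a minimal sketch, assuming the usual SQL meaning of `NUMERIC(19,4)` (at most 19 significant digits, 4 of them after the decimal point, so at most 15 integer digits). The function name `fits_numeric_19_4` is hypothetical and not part of the Bilko codebase.

```python
from decimal import Decimal

SCALE = 4                  # digits after the decimal point
MAX_INTEGER_DIGITS = 15    # 19 total digits minus 4 fractional digits


def fits_numeric_19_4(value: Decimal) -> bool:
    """Return True if value can be stored in NUMERIC(19,4) without rounding or overflow."""
    if not value.is_finite():
        return False  # NaN and infinities are never valid amounts
    if value.copy_abs() >= Decimal(10) ** MAX_INTEGER_DIGITS:
        return False  # too many integer digits: the column would overflow
    # Quantizing to 4 decimal places must not change the numeric value.
    return value == value.quantize(Decimal(1).scaleb(-SCALE))
```

A value such as `Decimal("1.23456")` fails the check because storing it would drop the fifth decimal place, while `Decimal("1.50000")` passes because its trailing zero carries no information.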

### Why it happened

[The "5 Whys" — trace back to the systemic cause]

1. **Why** did users lose access? → API returned 503 errors
2. **Why** did the API return 503? → Railway service restarted due to OOM
3. **Why** did the service run out of memory? → Invoice PDF generation loaded the entire result set into memory
4. **Why** did we not catch this? → No memory profiling in development and no load testing
5. **Why** was there no load testing? → No performance test plan existed

**Root cause (systemic):** [e.g., "Lack of memory usage monitoring and load testing prior to feature launch"]

### Contributing Factors

| Factor | Category | Severity |
|--------|----------|----------|
| [Factor] | [Process / Code / Infrastructure / Communication] | [High / Med / Low] |

---

## 5. What Went Well

<!-- Genuine positives — what did the team do right? -->

- [e.g., BetterStack alert fired within 2 minutes of downtime starting]
- [e.g., Rollback procedure was documented and worked on first try]
- [e.g., Financial data integrity was preserved — no accounting records corrupted]
- [e.g., User communication was clear and timely]

---

## 6. What Went Poorly

<!-- Be honest — what failed in process, tooling, or response? -->

- [e.g., No memory usage alerting was configured before launch]
- [e.g., The runbook did not cover OOM scenarios]
- [e.g., Detection took 8 minutes because the uptime check interval was 5 minutes]
- [e.g., Only one person knew how to access Railway logs]

---

## 7. Action Items

### High Priority (P0/P1 — complete within 2 weeks)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 1 | [Action] | Prevention | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | Detection | [Name] | YYYY-MM-DD | Open |

### Medium Priority (P2 — complete within 1 month)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 3 | [Action] | Process | [Name] | YYYY-MM-DD | Open |

### Low Priority (P3 — add to backlog)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 4 | [Action] | Nice-to-have | [Name] | Backlog | Open |

**Action categories:** Prevention, Detection, Response, Documentation, Process, Tooling

---

## 8. Lessons Learned

### Technical Lessons

[What did we learn about the technology, the system design, or the code?]

**For Bilko financial system incidents:**
- [e.g., NUMERIC(19,4) Decimal arithmetic must be tested under concurrent load]
- [e.g., VAT calculation should be validated server-side even if client sends computed totals]
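
The server-side VAT validation mentioned in the example lesson above could look roughly like this. It is a sketch under stated assumptions only: the function names, the 2-decimal rounding, and the half-up rounding mode are illustrative choices, and the actual rounding rules depend on the applicable tax regulations and on Bilko's own calculation code.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed rounding unit for VAT amounts


def expected_vat(net: Decimal, rate: Decimal) -> Decimal:
    """Recompute VAT on the server from the net amount and the rate (assumed rounding)."""
    return (net * rate).quantize(CENT, rounding=ROUND_HALF_UP)


def client_vat_is_valid(net: Decimal, rate: Decimal, client_vat: Decimal) -> bool:
    """Accept the client-computed VAT only if it equals the server's own result."""
    return expected_vat(net, rate) == client_vat
```

With an illustrative rate of `Decimal("0.17")`, a net amount of `Decimal("10.05")` gives an expected VAT of `Decimal("1.71")`, so a client-sent value of `Decimal("1.70")` would be rejected.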

### Process Lessons

[What did we learn about our operations process, monitoring, or communication?]

### Culture Lessons

[What did we learn about team practices, communication patterns, or organizational factors?]

---

## 9. Metrics Targets for Next Quarter

Based on this incident, these metrics are now tracked:

| Metric | Current | Target | By |
|--------|---------|--------|----|
| Mean time to detect (MTTD) | X min | < 3 min | YYYY-MM-DD |
| Mean time to respond | X min | < 10 min | YYYY-MM-DD |
| Mean time to resolve (MTTR) | X min | < 60 min | YYYY-MM-DD |

---

## 10. Follow-Up Schedule

- [ ] Action items tracked in GitHub Issues (label: `post-mortem-action`)
- [ ] 2-week check-in: verify P0/P1 actions completed
- [ ] 1-month check-in: verify P2 actions completed
- [ ] Next post-mortem: review whether similar incidents have recurred

---

## Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Facilitator | | | |
| Reviewer | Alem Bašić | | |