Post-Mortem Template
Post-Mortem
Project: Bilko
Version: 0.1
Date: 2026-02-23
Author: Ops Architect
Status: Draft (Template — fill in per P0 incident)
Reviewers: Tech Lead, Alem Bašić
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Ops Architect | Initial draft |
INSTRUCTIONS
Post-mortems are required for all P0 incidents and recommended for P1 incidents. Schedule the post-mortem session within 5 business days of incident resolution.
Blameless culture: This document is about systems and processes, not people. The goal is to learn and prevent recurrence, not to assign blame.
Create a new file: POST-MORTEM-YYYY-MM-DD-.md in docs/operations/post-mortems/
Post-Mortem: [INCIDENT TITLE]
Post-Mortem Date: YYYY-MM-DD
Incident Date: YYYY-MM-DD
Incident Reference: INC-YYYY-MM-DD-NNN
Facilitator: [Name]
Attendees: [Names]
Duration of post-mortem session: [X minutes]
1. Executive Summary
What happened:
[2-3 sentences: the incident, impact, and resolution]
Why it happened:
[1-2 sentences: root cause in plain language]
What we're doing to prevent recurrence:
[1-2 sentences: top action items]
2. Impact Summary
| Metric | Value |
|---|---|
| Incident duration | X hours Y minutes |
| Detection time | X minutes from first symptom |
| Response time | X minutes from alert to first action |
| Users impacted | [All / Specific org / None] |
| Financial records affected | [None / Describe] |
| Downtime cost (est.) | [€X in lost productivity / TBD] |
| GDPR breach notification required | [Yes / No] |
3. Timeline (Detailed)
| Time (CET) | Event | Who | Notes |
|---|---|---|---|
| HH:MM | [Event] | [Person] | [Notes] |
Key timestamps:
First symptom: HH:MM
Alert fired: HH:MM (detection lag: X min)
Incident declared: HH:MM (response lag: X min)
Root cause identified: HH:MM (diagnosis duration: X min)
Fix applied: HH:MM
Service restored: HH:MM
Incident closed: HH:MM
Total user impact duration: X min
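The lag and duration figures above can be derived from the raw timestamps rather than filled in by hand. The following is a minimal sketch, not a prescribed tool: the function names and dictionary keys are hypothetical, and it assumes that detection lag runs from first symptom to alert, response lag from alert to incident declaration, diagnosis duration from declaration to root cause, and total user impact from first symptom to service restoration.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as 'HH:MM'.

    If end is earlier than start, the interval is assumed to cross midnight.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


def timeline_lags(ts: dict[str, str]) -> dict[str, int]:
    """Derive the key intervals (in minutes) from the timeline timestamps.

    The interval definitions are assumptions; adjust them to match
    how the team actually defines each lag.
    """
    return {
        "detection_lag": minutes_between(ts["first_symptom"], ts["alert_fired"]),
        "response_lag": minutes_between(ts["alert_fired"], ts["incident_declared"]),
        "diagnosis_duration": minutes_between(
            ts["incident_declared"], ts["root_cause_identified"]
        ),
        "total_user_impact": minutes_between(
            ts["first_symptom"], ts["service_restored"]
        ),
    }


# Hypothetical example values, for illustration only.
example = {
    "first_symptom": "14:02",
    "alert_fired": "14:10",
    "incident_declared": "14:15",
    "root_cause_identified": "14:47",
    "service_restored": "15:05",
}
print(timeline_lags(example))
```

Deriving the numbers this way keeps the timeline table and the summary figures consistent with each other.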
4. Root Cause Analysis
What happened technically
[Detailed technical explanation of the failure chain. Be specific: which component, which code path, which query.]
For Bilko financial incidents, this section must include:
Which accounting module was affected (VAT / double-entry / invoice calc / currency)
Was any financial data corrupted? If yes, which organizations, which time window
Were NUMERIC(19,4) values preserved correctly during the incident?
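One way to answer the NUMERIC(19,4) question is to check whether each value seen during the incident window could be stored without rounding or overflow. The sketch below assumes the usual meaning of NUMERIC(19,4): 19 significant digits in total, 4 of them after the decimal point, leaving 15 for the integer part. The function names are hypothetical and the check runs on values already loaded into Python, not against the database itself.

```python
from decimal import Decimal

SCALE_QUANTUM = Decimal("0.0001")   # 4 fractional digits
INTEGER_LIMIT = Decimal(10) ** 15   # 19 - 4 = 15 integer digits


def fits_numeric_19_4(value: Decimal) -> bool:
    """True if value can be stored in NUMERIC(19,4) without rounding or overflow."""
    if not value.is_finite():
        # NaN and infinities are never valid monetary amounts.
        return False
    if abs(value) >= INTEGER_LIMIT:
        # Integer part needs more than 15 digits.
        return False
    # If quantizing to 4 decimals changes the value, digits would be lost on write.
    return value == value.quantize(SCALE_QUANTUM)


def find_lossy(values: list[Decimal]) -> list[Decimal]:
    """Return the values that would not survive a round trip through NUMERIC(19,4)."""
    return [v for v in values if not fits_numeric_19_4(v)]


# Hypothetical sample: the second value has 5 fractional digits.
sample = [Decimal("1234.5678"), Decimal("1234.56789"), Decimal("1.23450")]
print(find_lossy(sample))
```

Values built from floats rather than strings often fail this check, which is itself a useful signal when tracing where precision was lost.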
Why it happened
[The "5 Whys" — trace back to the systemic cause]
Why did users lose access? → API returned 503 errors
Why did the API return 503? → Railway service restarted due to OOM
Why did the service run out of memory? → Invoice PDF generation loaded entire result set into memory
Why did we not catch this? → No memory profiling in development, load testing not done
Why was there no load testing? → No performance test plan existed
Root cause (systemic): [e.g., "Lack of memory usage monitoring and load testing prior to feature launch"]
Contributing Factors
| Factor | Category | Severity |
|---|---|---|
| [Factor] | [Process / Code / Infrastructure / Communication] | [High / Med / Low] |
5. What Went Well
[e.g., BetterStack alert fired within 2 minutes of downtime starting]
[e.g., Rollback procedure was documented and worked on first try]
[e.g., Financial data integrity was preserved — no accounting records corrupted]
[e.g., User communication was clear and timely]
6. What Went Poorly
[e.g., No memory usage alerting was configured before launch]
[e.g., The runbook did not cover OOM scenarios]
[e.g., Detection took 8 minutes because uptime check interval was 5 min]
[e.g., Only one person knew how to access Railway logs]
7. Action Items
High Priority (P0/P1 — complete within 2 weeks)
| # | Action | Category | Owner | Due | Status |
|---|---|---|---|---|---|
| 1 | [Action] | Prevention | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | Detection | [Name] | YYYY-MM-DD | Open |
Medium Priority (P2 — complete within 1 month)
| # | Action | Category | Owner | Due | Status |
|---|---|---|---|---|---|
| 3 | [Action] | Process | [Name] | YYYY-MM-DD | Open |
Low Priority (P3 — add to backlog)
| # | Action | Category | Owner | Due | Status |
|---|---|---|---|---|---|
| 4 | [Action] | Nice-to-have | [Name] | Backlog | Open |
Action categories: Prevention, Detection, Response, Documentation, Process, Tooling
8. Lessons Learned
Technical Lessons
[What did we learn about the technology, the system design, or the code?]
For Bilko financial system incidents:
[e.g., NUMERIC(19,4) Decimal arithmetic must be tested under concurrent load]
[e.g., VAT calculation should be validated server-side even if client sends computed totals]
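The server-side VAT validation lesson can be illustrated with a short sketch. It is not Bilko's actual implementation: the function name, the single flat VAT rate, and rounding VAT to 2 decimals with half-up rounding are all assumptions chosen for illustration. The idea is that the server recomputes the totals from the net amount and rejects the request if the client-supplied figures differ.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed rounding unit for VAT amounts


def validate_invoice_totals(
    net: Decimal,
    vat_rate: Decimal,
    client_vat: Decimal,
    client_gross: Decimal,
) -> tuple[Decimal, Decimal]:
    """Recompute VAT and gross on the server and compare with client values.

    Returns the server-computed (vat, gross) on success and raises
    ValueError if the client-supplied totals do not match.
    """
    expected_vat = (net * vat_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    expected_gross = net + expected_vat
    if client_vat != expected_vat or client_gross != expected_gross:
        raise ValueError(
            f"Totals mismatch: expected VAT {expected_vat} and gross "
            f"{expected_gross}, got VAT {client_vat} and gross {client_gross}"
        )
    return expected_vat, expected_gross


# Hypothetical example with an assumed 20% rate.
print(
    validate_invoice_totals(
        Decimal("100.00"), Decimal("0.20"), Decimal("20.00"), Decimal("120.00")
    )
)
```

Persisting the server-computed values, rather than the client's, means a buggy or tampered client cannot write inconsistent totals into the ledger.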
Process Lessons
[What did we learn about our operations process, monitoring, or communication?]
Culture Lessons
[What did we learn about team practices, communication patterns, or organizational factors?]
9. Metrics Targets for Next Quarter
Based on this incident, these metrics are now tracked:
| Metric | Current | Target | By |
|---|---|---|---|
| Mean time to detect (MTTD) | X min | < 3 min | YYYY-MM-DD |
| Mean time to respond | X min | < 10 min | YYYY-MM-DD |
| Mean time to resolve (MTTR) | X min | < 60 min | YYYY-MM-DD |
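The "Current" column can be computed from past incidents instead of estimated. The sketch below assumes each incident is recorded as a dictionary of durations in minutes under the hypothetical keys `detect`, `respond`, and `resolve`, and compares each mean against the targets in the table above (under 3, 10, and 60 minutes).

```python
from statistics import mean

# Targets in minutes, taken from the metrics table.
TARGETS_MIN = {"detect": 3, "respond": 10, "resolve": 60}


def metric_report(incidents: list[dict[str, float]]) -> dict[str, tuple[float, bool]]:
    """Return {metric: (mean_minutes, meets_target)} across the given incidents.

    Raises statistics.StatisticsError if the incident list is empty.
    """
    report = {}
    for name, target in TARGETS_MIN.items():
        current = mean(inc[name] for inc in incidents)
        report[name] = (current, current < target)
    return report


# Hypothetical incident durations, for illustration only.
history = [
    {"detect": 8, "respond": 5, "resolve": 63},
    {"detect": 2, "respond": 7, "resolve": 41},
]
print(metric_report(history))
```

With only a handful of incidents per quarter, a single outlier dominates the mean, so it may be worth reporting the individual values alongside it.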
10. Follow-Up Schedule
Action items tracked in GitHub Issues (label: post-mortem-action)
2-week check-in: verify P0/P1 actions completed
1-month check-in: verify P2 actions completed
Next post-mortem: review if similar incidents recurred
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Facilitator | | | |
| Reviewer | Alem Bašić | | |