Post-Mortem Template

 
 Project: Bilko
 Version: 0.1
 Date: 2026-02-23
 Author: Ops Architect
 Status: Draft (Template — fill in per P0 incident)
 Reviewers: Tech Lead, Alem Bašić 
 
 Document History 
 
 
 
 | Version | Date | Author | Changes |
 |---------|------|--------|---------|
 | 0.1 | 2026-02-23 | Ops Architect | Initial draft |
 
 
 
 
 INSTRUCTIONS 
 Post-mortems are required for all P0 incidents and recommended for P1 incidents. Schedule the post-mortem session within 5 business days of incident resolution. 
 Blameless culture: This document is about systems and processes, not people. The goal is to learn and prevent recurrence, not to assign blame. 
 Create a new file: POST-MORTEM-YYYY-MM-DD-<title>.md in docs/operations/post-mortems/ 
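
 The naming convention above can be sketched as a small helper. This is only an illustration: the function name `post_mortem_path` and the slug rule (lowercase letters and digits joined by hyphens) are assumptions, and the template does not say whether the date in the file name is the incident date or the session date, so the caller chooses.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Directory taken from the instruction above.
POST_MORTEM_DIR = PurePosixPath("docs/operations/post-mortems")


def post_mortem_path(doc_date: date, title: str) -> PurePosixPath:
    """Build POST-MORTEM-YYYY-MM-DD-<title>.md under the post-mortems directory."""
    # Assumed slug rule: lowercase, runs of other characters collapse to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return POST_MORTEM_DIR / f"POST-MORTEM-{doc_date.isoformat()}-{slug}.md"
```

 For example, `post_mortem_path(date(2026, 2, 23), "API OOM during invoice PDF export")` yields `docs/operations/post-mortems/POST-MORTEM-2026-02-23-api-oom-during-invoice-pdf-export.md`.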
 
 Post-Mortem: [INCIDENT TITLE] 
 Post-Mortem Date: YYYY-MM-DD
 Incident Date: YYYY-MM-DD
 Incident Reference: INC-YYYY-MM-DD-NNN
 Facilitator: [Name]
 Attendees: [Names]
 Duration of post-mortem session: [X minutes] 
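
 If the incident reference is checked automatically (for example in a docs lint step), a validator could look roughly like the sketch below. It assumes that `NNN` is a three-digit sequence number and that the embedded date is a real calendar date; the name `parse_incident_ref` is hypothetical.

```python
import re
from datetime import date

# Assumed shape: INC-YYYY-MM-DD-NNN with NNN as a three-digit sequence number.
INCIDENT_REF = re.compile(r"INC-(\d{4})-(\d{2})-(\d{2})-(\d{3})")


def parse_incident_ref(ref: str) -> tuple[date, int]:
    """Return (incident date, sequence number) or raise ValueError."""
    match = INCIDENT_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not an incident reference: {ref!r}")
    year, month, day, seq = (int(group) for group in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day), seq
```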
 
 1. Executive Summary 
 What happened: 
[2-3 sentences: the incident, impact, and resolution] 
 Why it happened: 
[1-2 sentences: root cause in plain language] 
 What we're doing to prevent recurrence: 
[1-2 sentences: top action items] 
 
 2. Impact Summary 
 
 
 
 | Metric | Value |
 |--------|-------|
 | Incident duration | X hours Y minutes |
 | Detection time | X minutes from first symptom |
 | Response time | X minutes from alert to first action |
 | Users impacted | [All / Specific org / None] |
 | Financial records affected | [None / Describe] |
 | Downtime cost (est.) | [€X in lost productivity / TBD] |
 | GDPR breach notification required | [Yes / No] |
 
 
 
 
 3. Timeline (Detailed) 
 
 
 
 | Time (CET) | Event | Who | Notes |
 |------------|-------|-----|-------|
 | HH:MM | [Event] | [Person] | [Notes] |
 
 
 
 Key timestamps: 
 
 First symptom: HH:MM 
 Alert fired: HH:MM (detection lag: X min) 
 Incident declared: HH:MM (response lag: X min) 
 Root cause identified: HH:MM (diagnosis duration: X min) 
 Fix applied: HH:MM 
 Service restored: HH:MM 
 Incident closed: HH:MM 
 Total user impact duration: X min 
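
 The lag values in the list above are simple differences between key timestamps. A minimal sketch of the arithmetic follows; the dictionary keys are hypothetical names, and the interval definitions (response lag measured from the alert, diagnosis from incident declaration, user impact from first symptom to service restoration) are assumptions that should be adjusted to the team's own definitions. It also assumes all timestamps fall on the same calendar day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = int(delta.total_seconds() // 60)
    if minutes < 0:
        raise ValueError(f"{end} is earlier than {start}; timestamps must share a day")
    return minutes


def incident_lags(ts: dict[str, str]) -> dict[str, int]:
    """Derive the lag figures from the key timestamps (assumed definitions)."""
    return {
        "detection_lag_min": minutes_between(ts["first_symptom"], ts["alert_fired"]),
        "response_lag_min": minutes_between(ts["alert_fired"], ts["incident_declared"]),
        "diagnosis_duration_min": minutes_between(
            ts["incident_declared"], ts["root_cause_identified"]
        ),
        "user_impact_min": minutes_between(ts["first_symptom"], ts["service_restored"]),
    }
```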
 
 
 4. Root Cause Analysis 
 What happened technically 
 [Detailed technical explanation of the failure chain. Be specific: which component, which code path, which query.] 
 For Bilko financial incidents, this section must include: 
 
 Which accounting module was affected (VAT / double-entry / invoice calc / currency) 
 Was any financial data corrupted? If yes, which organizations, which time window 
 Were NUMERIC(19,4) values preserved correctly during the incident? 
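
 One way to answer the NUMERIC(19,4) question is to check whether suspect values can be stored in that column type without rounding or overflow. The sketch below uses Python's `decimal` module; the helper name `fits_numeric_19_4` is hypothetical, and it only checks representability (at most 4 fractional digits and at most 15 integer digits), not whether the value is the correct amount.

```python
from decimal import Decimal

SCALE = Decimal("0.0001")                  # 4 fractional digits
MAX_ABS = Decimal("999999999999999.9999")  # 19 significant digits in total


def fits_numeric_19_4(value: Decimal) -> bool:
    """True if value can be stored in NUMERIC(19,4) without rounding or overflow."""
    if not value.is_finite():
        return False  # NaN and infinities are never valid amounts
    if value.copy_abs() > MAX_ABS:
        return False  # too many digits before the decimal point
    # Quantizing to 4 places must not change the value.
    return value == value.quantize(SCALE)
```

 A value that passed through a binary float, such as `Decimal(0.1)`, fails this check because it carries far more than 4 fractional digits, which makes the helper a cheap detector for float contamination.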
 
 Why it happened 
 [The "5 Whys" — trace back to the systemic cause] 
 
 Why did users lose access? → API returned 503 errors 
 Why did the API return 503? → Railway service restarted due to OOM 
 Why did the service run out of memory? → Invoice PDF generation loaded entire result set into memory 
 Why did we not catch this? → No memory profiling was done in development, and no load testing was done before launch 
 Why was there no load testing? → No performance test plan existed 
 
 Root cause (systemic): [e.g., "Lack of memory usage monitoring and load testing prior to feature launch"] 
 Contributing Factors 
 
 
 
 | Factor | Category | Severity |
 |--------|----------|----------|
 | [Factor] | [Process / Code / Infrastructure / Communication] | [High / Med / Low] |
 
 
 
 
 5. What Went Well 

 
 [e.g., BetterStack alert fired within 2 minutes of downtime starting] 
 [e.g., Rollback procedure was documented and worked on first try] 
 [e.g., Financial data integrity was preserved — no accounting records corrupted] 
 [e.g., User communication was clear and timely] 
 
 
 6. What Went Poorly 

 
 [e.g., No memory usage alerting was configured before launch] 
 [e.g., The runbook did not cover OOM scenarios] 
 [e.g., Detection took 8 minutes because uptime check interval was 5 min] 
 [e.g., Only one person knew how to access Railway logs] 
 
 
 7. Action Items 
 High Priority (P0/P1 — complete within 2 weeks) 
 
 
 
 | # | Action | Category | Owner | Due | Status |
 |---|--------|----------|-------|-----|--------|
 | 1 | [Action] | Prevention | [Name] | YYYY-MM-DD | Open |
 | 2 | [Action] | Detection | [Name] | YYYY-MM-DD | Open |
 
 
 
 Medium Priority (P2 — complete within 1 month) 
 
 
 
 | # | Action | Category | Owner | Due | Status |
 |---|--------|----------|-------|-----|--------|
 | 3 | [Action] | Process | [Name] | YYYY-MM-DD | Open |
 
 
 
 Low Priority (P3 — add to backlog) 
 
 
 
 | # | Action | Category | Owner | Due | Status |
 |---|--------|----------|-------|-----|--------|
 | 4 | [Action] | Nice-to-have | [Name] | Backlog | Open |
 
 
 
 Action categories: Prevention, Detection, Response, Documentation, Process, Tooling 
 
 8. Lessons Learned 
 Technical Lessons 
 [What did we learn about the technology, the system design, or the code?] 
 For Bilko financial system incidents: 
 
 [e.g., NUMERIC(19,4) Decimal arithmetic must be tested under concurrent load] 
 [e.g., VAT calculation should be validated server-side even if client sends computed totals] 
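
 As an illustration of the server-side VAT validation lesson, the check can be as small as recomputing the total from the net amount and comparing it with what the client sent. This is a sketch under stated assumptions: the function names are hypothetical, the VAT amount is assumed to be rounded to 2 decimal places with half-up rounding, and the rate is passed in by the caller rather than taken from any Bilko configuration.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed rounding unit for the VAT amount


def expected_total(net: Decimal, vat_rate: Decimal) -> Decimal:
    """Recompute the gross total server-side from net amount and VAT rate."""
    vat = (net * vat_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return net + vat


def client_total_is_valid(net: Decimal, vat_rate: Decimal, client_total: Decimal) -> bool:
    """Accept the client's total only if it matches the server's own computation."""
    return expected_total(net, vat_rate) == client_total
```

 With an example rate of `Decimal("0.20")`, a net amount of `Decimal("19.99")` gives a VAT amount of 4.00 and an expected total of 23.99, so a client-supplied total of 24.00 would be rejected.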
 
 Process Lessons 
 [What did we learn about our operations process, monitoring, or communication?] 
 Culture Lessons 
 [What did we learn about team practices, communication patterns, or organizational factors?] 
 
 9. Metrics Targets for Next Quarter 
 Based on this incident, these metrics are now tracked: 
 
 
 
 | Metric | Current | Target | By |
 |--------|---------|--------|----|
 | Mean time to detect (MTTD) | X min | < 3 min | YYYY-MM-DD |
 | Mean time to respond | X min | < 10 min | YYYY-MM-DD |
 | Mean time to resolve (MTTR) | X min | < 60 min | YYYY-MM-DD |
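
 The "Current" column can be filled by averaging the per-incident figures from past post-mortems and comparing each mean against its target. The sketch below assumes each incident is recorded as a dictionary of minutes with the hypothetical keys `detect`, `respond`, and `resolve`; the targets are the ones listed in the table.

```python
from statistics import mean

# Targets in minutes, taken from the table above (strictly less than).
TARGETS_MIN = {"detect": 3, "respond": 10, "resolve": 60}


def metric_report(incidents: list[dict[str, float]]) -> dict[str, tuple[float, bool]]:
    """Map each metric to (mean minutes across incidents, target met?)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    report = {}
    for name, target in TARGETS_MIN.items():
        current = mean(incident[name] for incident in incidents)
        report[name] = (current, current < target)
    return report
```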
 
 
 
 
 10. Follow-Up Schedule 
 
 Action items tracked in GitHub Issues (label: post-mortem-action) 
 2-week check-in: verify P0/P1 actions completed 
 1-month check-in: verify P2 actions completed 
 Next post-mortem: review if similar incidents recurred 
 
 
 Approval 
 
 
 
 | Role | Name | Date | Signature |
 |------|------|------|-----------|
 | Facilitator | | | |
 | Reviewer | Alem Bašić | | |