# Post-Mortem Template

- **Project:** Bilko
- **Version:** 0.1
- **Date:** 2026-02-23
- **Author:** Ops Architect
- **Status:** Draft (Template: fill in per P0 incident)
- **Reviewers:** Tech Lead, Alem Bašić

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2026-02-23 | Ops Architect | Initial draft |

## INSTRUCTIONS

Post-mortems are required for all P0 incidents and recommended for P1. Schedule within 5 business days of incident resolution.

Blameless culture: This document is about systems and processes, not people. The goal is to learn and prevent recurrence, not to assign blame.

Create a new file: `POST-MORTEM-YYYY-MM-DD-.md` in `docs/operations/post-mortems/`

# Post-Mortem: [INCIDENT TITLE]

- **Post-Mortem Date:** YYYY-MM-DD
- **Incident Date:** YYYY-MM-DD
- **Incident Reference:** INC-YYYY-MM-DD-NNN
- **Facilitator:** [Name]
- **Attendees:** [Names]
- **Duration of post-mortem session:** [X minutes]

## 1. Executive Summary

**What happened:** [2-3 sentences: the incident, impact, and resolution]

**Why it happened:** [1-2 sentences: root cause in plain language]

**What we're doing to prevent recurrence:** [1-2 sentences: top action items]

## 2. Impact Summary

| Metric | Value |
|--------|-------|
| Incident duration | X hours Y minutes |
| Detection time | X minutes from first symptom |
| Response time | X minutes from alert to first action |
| Users impacted | [All / Specific org / None] |
| Financial records affected | [None / Describe] |
| Downtime cost (est.) | [€X in lost productivity / TBD] |
| GDPR breach notification required | [Yes / No] |

## 3. Timeline (Detailed)

| Time (CET) | Event | Who | Notes |
|------------|-------|-----|-------|
| HH:MM | [Event] | [Person] | [Notes] |

Key timestamps:

- First symptom: HH:MM
- Alert fired: HH:MM (detection lag: X min)
- Incident declared: HH:MM (response lag: X min)
- Root cause identified: HH:MM (diagnosis duration: X min)
- Fix applied: HH:MM
- Service restored: HH:MM
- Incident closed: HH:MM
- Total user impact duration: X min

## 4. Root Cause Analysis

### What happened technically

[Detailed technical explanation of the failure chain. Be specific: which component, which code path, which query.]
For Bilko financial incidents, this section must include:

- Which accounting module was affected (VAT / double-entry / invoice calc / currency)
- Was any financial data corrupted? If yes, which organizations and which time window
- Were NUMERIC(19,4) values preserved correctly during the incident?

### Why it happened

[The "5 Whys": trace back to the systemic cause]

1. Why did users lose access? → API returned 503 errors
2. Why did the API return 503? → Railway service restarted due to OOM
3. Why did the service run out of memory? → Invoice PDF generation loaded the entire result set into memory
4. Why did we not catch this? → No memory profiling in development, and no load testing was done
5. Why was there no load testing? → No performance test plan existed

**Root cause (systemic):** [e.g., "Lack of memory usage monitoring and load testing prior to feature launch"]

### Contributing Factors

| Factor | Category | Severity |
|--------|----------|----------|
| [Factor] | [Process / Code / Infrastructure / Communication] | [High / Med / Low] |

## 5. What Went Well

- [e.g., BetterStack alert fired within 2 minutes of downtime starting]
- [e.g., Rollback procedure was documented and worked on first try]
- [e.g., Financial data integrity was preserved: no accounting records corrupted]
- [e.g., User communication was clear and timely]

## 6. What Went Poorly

- [e.g., No memory usage alerting was configured before launch]
- [e.g., The runbook did not cover OOM scenarios]
- [e.g., Detection took 8 minutes because uptime check interval was 5 min]
- [e.g., Only one person knew how to access Railway logs]

## 7. Action Items

### High Priority (P0/P1: complete within 2 weeks)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 1 | [Action] | Prevention | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | Detection | [Name] | YYYY-MM-DD | Open |

### Medium Priority (P2: complete within 1 month)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 3 | [Action] | Process | [Name] | YYYY-MM-DD | Open |

### Low Priority (P3: add to backlog)

| # | Action | Category | Owner | Due | Status |
|---|--------|----------|-------|-----|--------|
| 4 | [Action] | Nice-to-have | [Name] | Backlog | Open |

Action categories: Prevention, Detection, Response, Documentation, Process, Tooling

## 8. Lessons Learned

### Technical Lessons

[What did we learn about the technology, the system design, or the code?]

For Bilko financial system incidents:

- [e.g., NUMERIC(19,4) Decimal arithmetic must be tested under concurrent load]
- [e.g., VAT calculation should be validated server-side even if client sends computed totals]

### Process Lessons

[What did we learn about our operations process, monitoring, or communication?]

### Culture Lessons

[What did we learn about team practices, communication patterns, or organizational factors?]

## 9. Metrics Targets for Next Quarter

Based on this incident, these metrics are now tracked:

| Metric | Current | Target | By |
|--------|---------|--------|----|
| Mean time to detect (MTTD) | X min | < 3 min | YYYY-MM-DD |
| Mean time to respond | X min | < 10 min | YYYY-MM-DD |
| Mean time to resolve (MTTR) | X min | < 60 min | YYYY-MM-DD |

## 10. Follow-Up Schedule

- Action items tracked in GitHub Issues (label: `post-mortem-action`)
- 2-week check-in: verify P0/P1 actions completed
- 1-month check-in: verify P2 actions completed
- Next post-mortem: review if similar incidents recurred

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Facilitator | | | |
| Reviewer | Alem Bašić | | |
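
The lag and duration fields under "Key timestamps" in the Timeline section can be derived from the HH:MM values rather than calculated by hand. The following is a minimal Python sketch, not part of any Bilko tooling. The function names, the dictionary keys, and the sample times are all hypothetical. It also assumes that every timestamp falls on the same calendar day, that response lag is measured from alert to incident declaration, that diagnosis duration is measured from declaration to root cause identification, and that user impact runs from first symptom to service restoration.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as 'HH:MM'.

    Assumes both times are on the same calendar day; an incident that
    crosses midnight would need full dates instead.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        raise ValueError(f"{end} is earlier than {start}")
    return int(delta.total_seconds() // 60)


def timeline_metrics(ts: dict[str, str]) -> dict[str, int]:
    """Derive the bracketed lag fields of the Timeline section (hypothetical keys)."""
    return {
        "detection_lag_min": minutes_between(ts["first_symptom"], ts["alert_fired"]),
        "response_lag_min": minutes_between(ts["alert_fired"], ts["incident_declared"]),
        "diagnosis_duration_min": minutes_between(
            ts["incident_declared"], ts["root_cause_identified"]
        ),
        "total_user_impact_min": minutes_between(
            ts["first_symptom"], ts["service_restored"]
        ),
    }


# Illustrative values only, not taken from a real incident.
example = {
    "first_symptom": "14:02",
    "alert_fired": "14:10",
    "incident_declared": "14:15",
    "root_cause_identified": "14:47",
    "service_restored": "15:05",
}
metrics = timeline_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Averaging `detection_lag_min` across incidents gives a value for the "Current" column of the MTTD row in section 9. The other rows depend on how the team defines the start and end of response and resolution, so the definitions assumed above should be confirmed before reusing them.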