Post-Mortem Template

Project: Bilko
Version: 0.1
Date: 2026-02-23
Author: Ops Architect
Status: Draft (template; fill in per P0 incident)
Reviewers: Tech Lead, Alem Bašić

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Ops Architect	Initial draft

INSTRUCTIONS

Post-mortems are required for all P0 incidents and recommended for P1 incidents. Schedule the session within 5 business days of incident resolution.

Blameless culture: This document is about systems and processes, not people. The goal is to learn and prevent recurrence, not to assign blame.

Create a new file: POST-MORTEM-YYYY-MM-DD-<title>.md in docs/operations/post-mortems/
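The naming convention above can be applied mechanically. The following is a minimal sketch, not an existing Bilko tool: `post_mortem_path` is a hypothetical helper, and the slug rule (lowercase ASCII, hyphens between words) is an assumption, since the template does not say how `<title>` should be written or which date the filename carries.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Directory named in the instructions above.
POST_MORTEM_DIR = PurePosixPath("docs/operations/post-mortems")


def post_mortem_path(doc_date: date, title: str) -> PurePosixPath:
    """Build docs/operations/post-mortems/POST-MORTEM-YYYY-MM-DD-<title>.md."""
    # Assumption: <title> is lowercased and every run of characters other
    # than a-z and 0-9 collapses into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one ASCII letter or digit")
    return POST_MORTEM_DIR / f"POST-MORTEM-{doc_date.isoformat()}-{slug}.md"
```

For example, `post_mortem_path(date(2026, 2, 23), "API OOM on Invoice PDF")` yields `docs/operations/post-mortems/POST-MORTEM-2026-02-23-api-oom-on-invoice-pdf.md`.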

Post-Mortem: [INCIDENT TITLE]

Post-Mortem Date: YYYY-MM-DD
Incident Date: YYYY-MM-DD
Incident Reference: INC-YYYY-MM-DD-NNN
Facilitator: [Name]
Attendees: [Names]
Duration of post-mortem session: [X minutes]

1. Executive Summary

What happened: [2-3 sentences: the incident, impact, and resolution]

Why it happened: [1-2 sentences: root cause in plain language]

What we're doing to prevent recurrence: [1-2 sentences: top action items]

2. Impact Summary

Metric	Value
Incident duration	X hours Y minutes
Detection time	X minutes from first symptom
Response time	X minutes from alert to first action
Users impacted	[All / Specific org / None]
Financial records affected	[None / Describe]
Downtime cost (est.)	[€X in lost productivity / TBD]
GDPR breach notification required	[Yes / No]

3. Timeline (Detailed)

Time (CET)	Event	Who	Notes
HH:MM	[Event]	[Person]	[Notes]

Key timestamps:

First symptom: HH:MM
Alert fired: HH:MM (detection lag: X min)
Incident declared: HH:MM (response lag: X min)
Root cause identified: HH:MM (diagnosis duration: X min)
Fix applied: HH:MM
Service restored: HH:MM
Incident closed: HH:MM
Total user impact duration: X min
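The lag and duration figures in the key timestamps can be derived rather than computed by hand. A minimal sketch, under stated assumptions: the dictionary keys are hypothetical names for the timestamps listed above, all times fall on the same calendar day, and each interval is measured between the two events named in the comments (the template does not define the intervals precisely).

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta.total_seconds() < 0:
        raise ValueError(f"{end} is earlier than {start}; incidents crossing midnight need full dates")
    return int(delta.total_seconds() // 60)


def timeline_lags(ts: dict[str, str]) -> dict[str, int]:
    """Derive the bracketed figures of the key-timestamp list, in minutes."""
    return {
        # Assumed: first symptom -> alert fired
        "detection_lag": minutes_between(ts["first_symptom"], ts["alert_fired"]),
        # Assumed: alert fired -> incident declared
        "response_lag": minutes_between(ts["alert_fired"], ts["incident_declared"]),
        # Assumed: incident declared -> root cause identified
        "diagnosis_duration": minutes_between(ts["incident_declared"], ts["root_cause_identified"]),
        # Assumed: first symptom -> service restored
        "user_impact": minutes_between(ts["first_symptom"], ts["service_restored"]),
    }
```

If the team measures response from the first action rather than from the declaration, only the `response_lag` line needs to change.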

4. Root Cause Analysis

What happened technically

[Detailed technical explanation of the failure chain. Be specific: which component, which code path, which query.]

For Bilko financial incidents, this section must include:

Which accounting module was affected (VAT / double-entry / invoice calc / currency)?
Was any financial data corrupted? If yes, which organizations and which time window?
Were NUMERIC(19,4) values preserved correctly during the incident?
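One way to answer the NUMERIC(19,4) question is to check whether the values handled during the incident could be stored without rounding or overflow. A minimal sketch, assuming the SQL-standard meaning of NUMERIC(19,4) (19 significant digits, 4 of them after the decimal point); `fits_numeric_19_4` is a hypothetical helper, not part of Bilko.

```python
from decimal import Decimal

SCALE_STEP = Decimal("0.0001")    # 4 fractional digits
INTEGER_LIMIT = Decimal(10) ** 15  # 19 - 4 = 15 integer digits


def fits_numeric_19_4(value: Decimal) -> bool:
    """True if value can be stored in NUMERIC(19,4) without rounding or overflow."""
    if not value.is_finite():
        return False  # NaN and Infinity are never valid amounts
    if value.copy_abs() >= INTEGER_LIMIT:
        return False  # too many integer digits: the insert would overflow
    # Rounding to 4 decimal places must not change the value.
    return value == value.quantize(SCALE_STEP)
```

A value that passed through a binary float, such as `Decimal(0.1)`, fails this check because it carries far more than 4 fractional digits, which makes the helper useful for spotting amounts that left the Decimal path at some point.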

Why it happened

[The "5 Whys": trace back from the user-visible symptom to the systemic cause. Example chain:]

Why did users lose access? → The API returned 503 errors
Why did the API return 503? → The Railway service restarted due to OOM
Why did the service run out of memory? → Invoice PDF generation loaded the entire result set into memory
Why did we not catch this? → There was no memory profiling in development and no load testing
Why was there no load testing? → No performance test plan existed

Root cause (systemic): [e.g., "Lack of memory usage monitoring and load testing prior to feature launch"]

Contributing Factors

Factor	Category	Severity
[Factor]	[Process / Code / Infrastructure / Communication]	[High / Med / Low]

5. What Went Well

[e.g., BetterStack alert fired within 2 minutes of downtime starting]
[e.g., Rollback procedure was documented and worked on first try]
[e.g., Financial data integrity was preserved — no accounting records corrupted]
[e.g., User communication was clear and timely]

6. What Went Poorly

[e.g., No memory usage alerting was configured before launch]
[e.g., The runbook did not cover OOM scenarios]
[e.g., Detection took 8 minutes because uptime check interval was 5 min]
[e.g., Only one person knew how to access Railway logs]

7. Action Items

High Priority (P0/P1 — complete within 2 weeks)

#	Action	Category	Owner	Due	Status
1	[Action]	Prevention	[Name]	YYYY-MM-DD	Open
2	[Action]	Detection	[Name]	YYYY-MM-DD	Open

Medium Priority (P2 — complete within 1 month)

#	Action	Category	Owner	Due	Status
3	[Action]	Process	[Name]	YYYY-MM-DD	Open

Low Priority (P3 — add to backlog)

#	Action	Category	Owner	Due	Status
4	[Action]	Nice-to-have	[Name]	Backlog	Open

Action categories: Prevention, Detection, Response, Documentation, Process, Tooling

8. Lessons Learned

Technical Lessons

[What did we learn about the technology, the system design, or the code?]

For Bilko financial system incidents:

[e.g., NUMERIC(19,4) Decimal arithmetic must be tested under concurrent load]
[e.g., VAT calculation should be validated server-side even if client sends computed totals]
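The second example lesson, server-side validation of client-computed totals, can be sketched as follows. Everything specific here is an assumption for illustration: the function names are hypothetical, amounts are rounded to two decimal places with half-up rounding, and the VAT rate is passed in by the caller. Bilko's actual rounding rules and rates are not stated in this template.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed currency precision


def expected_total(net: Decimal, vat_rate: Decimal) -> Decimal:
    """Recompute the gross total on the server from the net amount and VAT rate."""
    # Assumption: VAT is rounded to cents, half-up, before being added to net.
    vat = (net * vat_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return net + vat


def client_total_is_valid(net: Decimal, vat_rate: Decimal, client_total: Decimal) -> bool:
    """Accept the client's total only if it matches the server-side recomputation."""
    return client_total == expected_total(net, vat_rate)
```

With an illustrative rate of 0.17, a net amount of 100.00 is accepted only with a client total of 117.00; any other total is rejected instead of being stored.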

Process Lessons

[What did we learn about our operations process, monitoring, or communication?]

Culture Lessons

[What did we learn about team practices, communication patterns, or organizational factors?]

9. Metrics Targets for Next Quarter

Based on this incident, these metrics are now tracked:

Metric	Current	Target	By
Mean time to detect (MTTD)	X min	< 3 min	YYYY-MM-DD
Mean time to respond	X min	< 10 min	YYYY-MM-DD
Mean time to resolve (MTTR)	X min	< 60 min	YYYY-MM-DD
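The "Current" column can be filled from the incident records of the quarter. A minimal sketch, assuming each incident is recorded as a dictionary with hypothetical keys `detect`, `respond` and `resolve` holding minutes; the limits are the targets from the table above.

```python
from statistics import mean

# Targets from the table above, in minutes; a mean must stay strictly below its limit.
TARGETS = {"detect": 3, "respond": 10, "resolve": 60}


def quarterly_means(incidents: list[dict[str, float]]) -> dict[str, float]:
    """Mean time to detect, respond and resolve across the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {metric: mean(i[metric] for i in incidents) for metric in TARGETS}


def missed_targets(means: dict[str, float]) -> list[str]:
    """Metrics whose mean is not strictly below the target."""
    return [metric for metric, limit in TARGETS.items() if means[metric] >= limit]
```

Feeding the result of `quarterly_means` into `missed_targets` gives the list of metrics that still need action items at the next check-in.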

10. Follow-Up Schedule

Action items tracked in GitHub Issues (label: post-mortem-action)
2-week check-in: verify P0/P1 actions completed
1-month check-in: verify P2 actions completed
Next post-mortem: review whether similar incidents have recurred

Approval

Role	Name	Date	Signature
Facilitator	[Name]	YYYY-MM-DD	[Signature]
Reviewer	Alem Bašić	YYYY-MM-DD	[Signature]