# Incident Report Template

> **Project:** Bilko
> **Version:** 0.1
> **Date:** 2026-02-23
> **Author:** Ops Architect
> **Status:** Draft (Template — fill in per incident)
> **Reviewers:** Tech Lead, Alem Bašić

## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | 2026-02-23 | Ops Architect | Initial draft |

---

## INSTRUCTIONS

Create a new incident report file for each incident:
- Filename: `INCIDENT-YYYY-MM-DD-<short-title>.md`
- Location: `docs/operations/incidents/`
- Fill in all sections within 48 hours of incident resolution
- P0 incidents require a full post-mortem (see `post-mortem.md`)

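The naming rules above can be sketched as a small helper. This is a minimal illustration, not an agreed convention: the lowercase, hyphen-separated slug and the reading of `NNN` as a zero-padded per-day sequence number are assumptions, and the function names `incident_filename` and `incident_id` are hypothetical.

```python
import re
from datetime import date


def incident_filename(incident_date: date, short_title: str) -> str:
    """Build INCIDENT-YYYY-MM-DD-<short-title>.md from a date and a free-text title."""
    # Assumption: the short title is lowercased and every run of
    # non-alphanumeric characters collapses into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_title.lower()).strip("-")
    if not slug:
        raise ValueError("short_title must contain at least one letter or digit")
    return f"INCIDENT-{incident_date.isoformat()}-{slug}.md"


def incident_id(incident_date: date, sequence: int) -> str:
    """Build INC-YYYY-MM-DD-NNN (assumption: NNN is a zero-padded per-day counter)."""
    return f"INC-{incident_date.isoformat()}-{sequence:03d}"


print(incident_filename(date(2026, 2, 23), "API out of memory"))
# INCIDENT-2026-02-23-api-out-of-memory.md
print(incident_id(date(2026, 2, 23), 1))
# INC-2026-02-23-001
```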
---

# Incident Report: [SHORT TITLE]

**Incident ID:** INC-YYYY-MM-DD-NNN
**Reported by:** [Name]
**Date detected:** YYYY-MM-DD HH:MM CET
**Date resolved:** YYYY-MM-DD HH:MM CET
**Total duration:** X hours Y minutes
**Severity:** P0 / P1 / P2 / P3

---

## 1. Incident Summary

<!-- 2-3 sentences: what happened, what was the impact, when was it resolved -->

[Example: On YYYY-MM-DD at HH:MM CET, the Bilko API became unavailable due to an out-of-memory error on Railway. All users were unable to create invoices or access their dashboard for 47 minutes. The incident was resolved by restarting the Railway service and deploying a memory optimization patch.]

---

## 2. Impact

| Metric | Value |
|--------|-------|
| Duration | X min |
| Users affected | [All / Specific org / None] |
| Data loss | [None / Describe if any] |
| Financial records at risk | [None / Describe if any] |
| Revenue impact | [None / Describe if applicable] |
| GDPR reportable | [Yes / No]; if Yes, notify Datatilsynet within 72 hours of becoming aware of the breach |

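For GDPR-reportable incidents, the notification deadline can be worked out with simple date arithmetic. The sketch below assumes the 72-hour window starts when the team becomes aware of a personal data breach and that all timestamps are in the same time zone as the rest of the report. The names `notification_deadline` and `hours_remaining` are hypothetical.

```python
from datetime import datetime, timedelta

# The 72-hour window from the Impact table.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Latest time to notify, counted from the moment the breach became known."""
    return aware_at + NOTIFICATION_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once the deadline has passed."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600


aware = datetime(2026, 2, 23, 14, 30)
print(notification_deadline(aware))                          # 2026-02-26 14:30:00
print(hours_remaining(aware, datetime(2026, 2, 24, 14, 30)))  # 48.0
```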
---

## 3. Timeline

| Time (CET) | Event |
|-----------|-------|
| HH:MM | First signs of degradation (detected by: BetterStack / user report / Sentry) |
| HH:MM | Incident declared by [Name] |
| HH:MM | [First diagnosis step] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix applied] |
| HH:MM | Service restored, health check green |
| HH:MM | Incident resolved, monitoring period started |
| HH:MM | Monitoring period ended — all clear |

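The "Total duration" and "Time to detect" fields can be derived from two timeline timestamps. The sketch below is one way to do that, assuming both timestamps use the `YYYY-MM-DD HH:MM` format from the report header and share one time zone with no clock change in between. The function name `format_duration` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def format_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'X hours Y minutes'."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    if ended < started:
        raise ValueError("end must not be earlier than start")
    # Whole minutes only, matching the minute resolution of the timeline.
    total_minutes = int((ended - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


print(format_duration("2026-02-23 14:05", "2026-02-23 14:52"))  # 0 hours 47 minutes
print(format_duration("2026-02-23 23:30", "2026-02-24 01:15"))  # 1 hours 45 minutes
```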
---

## 4. Root Cause Analysis

**Root cause:** [One sentence — the actual technical cause]

**Example root causes for Bilko:**
- Railway API service exhausted 2GB RAM limit due to memory leak in invoice PDF generation
- Database connection pool exhausted (25 max connections on Railway Starter) under load
- Bad database migration caused index corruption on `invoices` table
- Expired SendGrid API key (not rotated) caused all invoice emails to fail
- Cloudflare R2 API credentials rotated without updating Railway env vars

**Contributing factors:**
- [Factor 1: e.g., No memory usage alerting configured]
- [Factor 2: e.g., No automated secret rotation reminder]

---

## 5. Detection

**How was the incident detected?**
- [ ] BetterStack uptime monitor alert
- [ ] Sentry error rate spike
- [ ] User report (direct to Alem / support)
- [ ] Routine health check
- [ ] Monitoring dashboard review

**Time to detect:** [X minutes from first symptom to alert]

**Was detection fast enough?** [Yes / No — if No, document what would have caught it faster]

---

## 6. Response

### Response Actions Taken

| Time | Action | Result |
|------|--------|--------|
| HH:MM | [Action taken] | [Result] |

### What Worked Well

- [e.g., BetterStack alert fired within 2 min]
- [e.g., Rollback procedure was documented and worked on first try]

### What Didn't Work

- [e.g., Railway logs were hard to search for the specific error]
- [e.g., No backup Railway contact — only one person on-call]

---

## 7. Financial Data Integrity Check

**Required for all P0/P1 incidents. May be skipped for P2/P3 incidents that do not touch accounting data.**

- [ ] All invoices created during incident window verified (count before vs after)
- [ ] VAT calculations for affected period verified correct
- [ ] Double-entry balances verified for all affected transactions
- [ ] No orphaned transactions (debit without credit) in database
- [ ] Affected organization(s) notified if any data discrepancy found

**Verification query (run after incident):**
```sql
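-- Check for transactions without a debit entry during the incident window.
-- NOT EXISTS is used instead of NOT IN so that a NULL "transactionId" cannot hide missing entries.
SELECT t.id, t.created_at, t.amount
FROM transactions t
WHERE t.created_at BETWEEN '<incident_start>' AND '<incident_end>'
  AND NOT EXISTS (
    SELECT 1
    FROM transaction_entries te
    WHERE te."transactionId" = t.id
      AND te.type = 'debit'
  );
-- Expected: 0 rows (all transactions have debit entries)
-- To find debits without a credit, rerun with the credit type value (assumed here to be 'credit').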

-- Check invoice totals match line item sums.
-- Dividing by 100.0 forces decimal division in case tax_rate is an integer column.
SELECT i.id, i.total_amount,
       SUM(ii.quantity * ii.unit_price * (1 + ii.tax_rate / 100.0)) AS calculated_total
FROM invoices i
JOIN invoice_items ii ON ii."invoiceId" = i.id
WHERE i.created_at BETWEEN '<incident_start>' AND '<incident_end>'
GROUP BY i.id, i.total_amount
HAVING ABS(i.total_amount - SUM(ii.quantity * ii.unit_price * (1 + ii.tax_rate / 100.0))) > 0.0001;
-- Expected: 0 rows (invoices without line items are not covered by this inner join)
```
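
The queries above do not check that debits and credits net to zero for each transaction, which the double-entry checklist item calls for. The sketch below shows that check on rows exported from the database. The row shape `(transaction_id, entry_type, amount)`, the `'credit'` type value, and the function name `unbalanced_transactions` are assumptions for illustration and may not match the actual `transaction_entries` schema.

```python
from collections import defaultdict
from decimal import Decimal


def unbalanced_transactions(entries):
    """Return {transaction_id: debits minus credits} for transactions that do not net to zero.

    entries: iterable of (transaction_id, entry_type, amount) tuples, where
    entry_type is 'debit' or 'credit' (hypothetical row shape).
    """
    net = defaultdict(Decimal)  # Decimal() is 0, so every transaction starts balanced
    for transaction_id, entry_type, amount in entries:
        if entry_type == "debit":
            net[transaction_id] += Decimal(amount)
        elif entry_type == "credit":
            net[transaction_id] -= Decimal(amount)
        else:
            raise ValueError(f"unknown entry type: {entry_type!r}")
    return {tid: diff for tid, diff in net.items() if diff != 0}


rows = [
    ("t1", "debit", "100.00"), ("t1", "credit", "100.00"),  # balanced
    ("t2", "debit", "250.00"), ("t2", "credit", "200.00"),  # off by 50.00
    ("t3", "debit", "80.00"),                               # debit without credit
]
print(unbalanced_transactions(rows))
```

Amounts are passed as strings and converted to `Decimal` so that the comparison is exact; binary floats could report a spurious imbalance on values such as 0.10. An empty result corresponds to the "Expected: 0 rows" outcome of the SQL checks.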

---

## 8. User Communication

| Channel | When Sent | Content Summary |
|---------|-----------|-----------------|
| status.bilko.io | HH:MM | "Investigating service disruption" |
| status.bilko.io | HH:MM | Status update with ETA |
| status.bilko.io | HH:MM | "Service restored" |
| Email to affected users | HH:MM (if needed) | [Summary of impact and resolution] |

---

## 9. Action Items

| # | Action | Owner | Due Date | Priority |
|---|--------|-------|----------|----------|
| 1 | [Preventive action] | [Owner] | YYYY-MM-DD | P0/P1/P2 |
| 2 | [Detection improvement] | [Owner] | YYYY-MM-DD | P1 |
| 3 | [Documentation update] | [Owner] | YYYY-MM-DD | P2 |

---

## 10. Follow-Up

- [ ] Post-mortem scheduled (required for P0 incidents): [Date/Time]
- [ ] Action items added to project backlog
- [ ] Runbook updated with new diagnosis/fix steps
- [ ] Monitoring improved to detect this issue faster next time

---

## Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author (on-call) | | | |
| Reviewer | Alem Bašić | | |