SLA Report
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | Initial |
1. Overview
This document defines Drop's Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). It also includes an example monthly SLA report based on simulated post-launch data. Use it as a template for monthly operational reporting.
Reporting Period: February 2026 (example, pre-launch)
Environment: Production (AWS App Runner eu-west-1 + RDS PostgreSQL)
Report Author: Alem Bašić
Distribution: Internal (Alem Bašić, Platform Architect)
2. SLA Definitions
2.1 Service Level Targets
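| Metric | SLO Target | SLA Commitment | Measurement Window |
|---|---|---|---|
| Availability (uptime) | 99.95% | 99.9% | Monthly rolling |
| Response time (p50) | | < 500ms | Monthly rolling |
| Response time (p99) | | < 2,000ms | Monthly rolling |
| Error rate (5xx) | | < 1% | Monthly rolling |
| BankID Login Success Rate | | > 99% | Monthly rolling |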
| KYC Initiation Success Rate | > 99% | > 98% | Monthly rolling |
| Payment Initiation Success Rate | > 99% | > 98% | Monthly rolling |
| Health Check Recovery Time (RTO) | < 10 min | < 30 min | Per incident |
2.2 Availability Budget
| SLA Target | Monthly Allowed Downtime | Weekly Allowed Downtime |
|---|---|---|
| 99.9% | 43.8 minutes/month | 10.1 minutes/week |
| 99.95% | 21.9 minutes/month | 5.0 minutes/week |
| 99.99% | 4.4 minutes/month | 1.0 minutes/week |
Current infrastructure capability: 99.9% (SLA commitment). 99.95% is the SLO target once PgBouncer and multi-AZ standby are in place.
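The downtime budgets in the table above follow directly from the availability target and the length of the window. The sketch below reproduces them, assuming an average month of 365.25 / 12 days, which is what yields the 43.8-minute figure; the function and constant names are illustrative only.

```typescript
// Window lengths in minutes. The average-month length is an assumption that
// matches the 43.8 min/month figure for a 99.9% target.
const MINUTES_PER_WEEK = 7 * 24 * 60; // 10,080
const MINUTES_PER_AVG_MONTH = (365.25 / 12) * 24 * 60; // 43,830

/** Allowed downtime in minutes for an availability target given as a percentage. */
function allowedDowntimeMinutes(targetPercent: number, windowMinutes: number): number {
  if (targetPercent < 0 || targetPercent > 100) {
    throw new RangeError("targetPercent must be between 0 and 100");
  }
  return windowMinutes * (1 - targetPercent / 100);
}

for (const target of [99.9, 99.95, 99.99]) {
  const monthly = allowedDowntimeMinutes(target, MINUTES_PER_AVG_MONTH).toFixed(1);
  const weekly = allowedDowntimeMinutes(target, MINUTES_PER_WEEK).toFixed(1);
  console.log(`${target}%: ${monthly} min/month, ${weekly} min/week`);
}
```

For a specific calendar month the budget differs slightly: a 28-day month at 99.9% allows about 40.3 minutes rather than 43.8.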
2.3 Exclusions
The following are excluded from SLA calculations:
- Scheduled maintenance windows (posted to BetterStack status page > 24h in advance)
- BankID OIDC provider outages (upstream at Vipps MobilePay)
- Open Banking provider outages (upstream — provider TBD)
- Force majeure events (AWS region outage, etc.)
- Sumsub KYC service outages (upstream)
3. Service Level Indicators (SLIs)
3.1 Availability
Definition: Percentage of minutes in the reporting period where GET /api/health returns HTTP 200 with {"status":"ok"}.
Measurement: BetterStack Drop Health Check monitor (1-minute check interval, 3 global locations).
Formula: (total_minutes - downtime_minutes) / total_minutes × 100
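As a minimal sketch of the formula above, the function below computes availability from the total and downtime minutes of a reporting period. The function name and the input values (a 28-day month with 28 minutes of downtime) are illustrative assumptions, not part of any existing tooling.

```typescript
/** Availability as a percentage: (total - downtime) / total * 100. */
function availabilityPercent(totalMinutes: number, downtimeMinutes: number): number {
  if (totalMinutes <= 0) {
    throw new RangeError("totalMinutes must be positive");
  }
  if (downtimeMinutes < 0 || downtimeMinutes > totalMinutes) {
    throw new RangeError("downtimeMinutes must be between 0 and totalMinutes");
  }
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

// A 28-day reporting period contains 28 * 24 * 60 = 40,320 minutes.
const periodMinutes = 28 * 24 * 60;
const availability = availabilityPercent(periodMinutes, 28);
console.log(`${availability.toFixed(2)}%`); // 99.93%
```

With a 1-minute check interval, the number of failed checks approximates the downtime minutes, so the same function can be fed directly from monitor check counts.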
3.2 Response Time
Definition: p50 and p99 response time for all API requests.
Measurement: CloudWatch App Runner request metrics (future: add APM tool).
Alert threshold: p99 > 1,000ms triggers Slack #drop-ops alert.
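CloudWatch computes percentiles itself, but when raw request durations are available (for example from application logs) p50 and p99 can be checked independently. The sketch below uses the nearest-rank method, which is one common definition and may differ slightly from CloudWatch's interpolation; the function name and sample values are hypothetical.

```typescript
/**
 * Nearest-rank percentile: the smallest sample such that at least p percent
 * of all samples are less than or equal to it.
 */
function percentile(samplesMs: readonly number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("no samples");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in (0, 100]");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical request durations in milliseconds.
const durationsMs = [85, 90, 120, 80, 420, 95, 70, 110, 88, 1500];
const p50 = percentile(durationsMs, 50);
const p99 = percentile(durationsMs, 99);

// Alert threshold from this section: p99 above 1,000 ms.
const shouldAlert = p99 > 1000;
console.log({ p50, p99, shouldAlert });
```

With only ten samples, p99 is simply the slowest request; a meaningful p99 needs at least a few hundred samples per window.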
3.3 Error Rate
Definition: Percentage of HTTP requests returning 5xx status codes.
Measurement: Slack alert fires when > 5 errors occur in 60 seconds (from src/lib/alerts.ts).
Formula: 5xx_requests / total_requests × 100
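The formula and the alert rule above can be sketched as follows. This is not the code in src/lib/alerts.ts, which is not shown in this document; the class, its method names, and the sliding-window approach are assumptions used to illustrate "more than 5 errors in 60 seconds".

```typescript
/** 5xx error rate as a percentage of all requests (0 when there is no traffic). */
function errorRatePercent(serverErrors: number, totalRequests: number): number {
  return totalRequests === 0 ? 0 : (serverErrors / totalRequests) * 100;
}

/** Hypothetical sliding-window error counter for the Slack alert rule. */
class ErrorWindow {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = 60_000, threshold: number = 5) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  /** Records one 5xx response at nowMs; returns true when the alert should fire. */
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Keep only errors that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}

// Six errors within six seconds: the sixth one exceeds the threshold of 5.
const errorWindow = new ErrorWindow();
const fired = [0, 1000, 2000, 3000, 4000, 5000].map((t) => errorWindow.record(t));
console.log(fired);
```

A production version would also need to suppress repeat alerts while the condition persists, which this sketch leaves out.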
3.4 BankID Login Success Rate
Definition: Percentage of BankID login initiations that result in a successful session creation.
Measurement: Application audit log (audit_log table): count of session_created events divided by count of bankid_initiate events.
3.5 Transaction Success Rate
Definition: Percentage of initiated transactions (remittance + QR) that reach completed status.
Measurement: Database query: COUNT(*) WHERE status='completed' / COUNT(*) WHERE status IN ('completed','failed').
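The query above counts only transactions in a terminal state, so in-flight transactions do not affect the rate. A minimal sketch of the same calculation in application code follows; the "pending" status and the function name are assumptions, since only 'completed' and 'failed' appear in the query.

```typescript
// "pending" is an assumed non-terminal status; only "completed" and "failed"
// are named in the measurement query.
type TransactionStatus = "pending" | "completed" | "failed";

/**
 * Success rate as a percentage of terminal transactions, or null when no
 * transaction has reached a terminal state yet.
 */
function transactionSuccessRate(statuses: readonly TransactionStatus[]): number | null {
  let completed = 0;
  let failed = 0;
  for (const status of statuses) {
    if (status === "completed") completed++;
    else if (status === "failed") failed++;
  }
  const terminal = completed + failed;
  return terminal === 0 ? null : (completed / terminal) * 100;
}

const exampleStatuses: TransactionStatus[] = ["completed", "completed", "failed", "pending"];
console.log(transactionSuccessRate(exampleStatuses)); // 2 of 3 terminal transactions completed
```

Returning null for an empty period avoids a division by zero and lets the report show TBD instead of a misleading 0% or 100%.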
4. Monthly SLA Report: February 2026 (Example)
NOTE: This is a simulated example report for template purposes. Actual metrics will be populated from BetterStack, CloudWatch, and database queries after public launch.
4.1 Availability
| Metric | Target | Actual | Status |
|---|---|---|---|
| Uptime | 99.9% | 99.93% | PASS |
| Total downtime | < 43.8 min | 28 min (1 incident) | PASS |
| Incidents | N/A | 1 (INC-2026-001) | — |
| Maintenance windows | N/A | 0 | — |
Uptime chart (example):
Week 1 (Feb 03–09): ████████████████████ 100.00%
Week 2 (Feb 10–16): ████████████████████ 100.00%
Week 3 (Feb 17–23): ███████████████████▌ 99.72% (INC-2026-001: 28 min)
Week 4 (Feb 24–28): ████████████████████ 100.00%
─────────────────────────────────────────────────
Monthly: ████████████████████ 99.93%
4.2 Performance
| Metric | Target | Actual | Status |
|---|---|---|---|
| p50 response time | < 200ms | ~85ms | PASS |
| p99 response time | < 1,000ms | ~420ms | PASS |
| p99 during incident | N/A | Timeout (excluded) | N/A |
Slow endpoints (example):
| Endpoint | p99 Latency | Notes |
|---|---|---|
| POST /api/transactions | ~380ms | Sumsub KYC check adds latency |
| GET /api/bank-accounts/balance | ~350ms | Open Banking API round trip |
| POST /api/auth/bankid/callback | ~180ms | DB session write |
| GET /api/health | ~45ms | DB ping |
4.3 Error Rate
| Metric | Target | Actual | Status |
|---|---|---|---|
| Overall error rate | < 1% | 0.06% | PASS |
| Error rate (excl. incident) | < 0.1% | 0.03% | PASS |
| 5xx errors | < 1% | 0.06% | PASS |
| 4xx errors | N/A | 2.1% | INFO (mostly 401 auth failures) |
Top 5xx errors (example):
| Error | Count | Root Cause |
|---|---|---|
| 503 during INC-2026-001 | ~240 | DB connection pool exhaustion |
| 500 Sumsub timeout | 3 | Sumsub API latency spike |
| 503 cold start | 1 | App Runner scale-to-zero (if applicable) |
4.4 BankID Authentication
| Metric | Target | Actual | Status |
|---|---|---|---|
| Login success rate | > 99% | 99.7% | PASS |
| Login failures | — | 0.3% | INFO |
| CSRF rejections | — | 0 | — |
| Age verification failures | — | 2 | INFO |
Note: Login failures during INC-2026-001 excluded from calculation (upstream service impacted by pool exhaustion).
4.5 KYC (Sumsub)
| Metric | Target | Actual | Status |
|---|---|---|---|
| KYC initiation success rate | > 98% | 99.2% | PASS |
| GREEN (approved) rate | N/A | 91% | INFO |
| RED (rejected) rate | N/A | 6% | INFO |
| RETRY rate | N/A | 3% | INFO |
| Average review time | N/A | ~2h | INFO |
4.6 Transactions
| Metric | Target | Actual | Status |
|---|---|---|---|
| Transaction success rate | > 98% | TBD — Open Banking provider not yet live | TBD |
| Remittance completion | > 98% | TBD | TBD |
| QR payment completion | > 98% | TBD | TBD |
5. Incident Summary
| Incident ID | Severity | Start | Duration | Impact | Root Cause | Status |
|---|---|---|---|---|---|---|
| INC-2026-001 | P1 | 2026-02-20 10:30 UTC | 28 min | 100% users | RDS connection pool exhaustion | Closed |
SLA credit impact: 28 minutes downtime. Monthly uptime = 99.93%. SLA (99.9%) met. No SLA credit applicable.
6. Infrastructure Health
6.1 App Runner
| Metric | Value | Notes |
|---|---|---|
| Service status | RUNNING | — |
| Deployments this month | 3 | Bug fix, security patch, config update |
| Deployment failures | 0 | — |
| Average deployment time | ~4 min | — |
6.2 RDS PostgreSQL
| Metric | Value | Threshold | Status |
|---|---|---|---|
| DB instance status | available | — | PASS |
| Free storage space | ~18 GB | Alert if < 2 GB | PASS |
| Backup retention | 7 days | 7 days minimum | PASS |
| Last automated snapshot | < 24h ago | < 24h | PASS |
| PITR enabled | Yes | Required | PASS |
| Max connections (month) | ~45 (peak) | Alert at 70 | PASS (alarm not yet configured) |
6.3 BetterStack Monitors
| Monitor | Checks | Passed | Failed | Uptime |
|---|---|---|---|---|
| Drop Health Check | ~40,320 | ~40,292 | ~28 (during incident) | 99.93% |
| Drop Landing Page | ~40,320 | ~40,320 | 0 | 100.00% |
| US East Health | ~40,320 | ~40,292 | ~28 (during incident) | 99.93% |
7. Security & Compliance
| Check | Status | Notes |
|---|---|---|
| | | - |
| | | logged |
| | | 2026 |
| | | 2026 |
| | | cases |
| | | - |
| | | received |
Overall SLA compliance this period: {{OVERALL_STATUS}}
8. Trending
| Uptime | Error Rate | Incidents |
|---|---|---|
| TBD | TBD | TBD |
Note: Only unplanned downtime counts against SLA uptime calculations. See Section 2.3 for maintenance exclusions.
9. Action Items from This Report
10. SLA Report Distribution & Cadence
Report generation date: Last business day of each month. Data sources: BetterStack dashboard + CloudWatch + audit_log queries.
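The report generation date can be derived programmatically. The sketch below treats Monday to Friday as business days and ignores public holidays, which is an assumption; the function name is illustrative.

```typescript
/**
 * Last Monday-to-Friday day of the given month, in UTC.
 * month is 1-12. Public holidays are not taken into account.
 */
function lastBusinessDay(year: number, month: number): Date {
  // Day 0 of the following month is the last calendar day of this month.
  const day = new Date(Date.UTC(year, month, 0));
  // getUTCDay(): 0 = Sunday, 6 = Saturday. Step back until a weekday is found.
  while (day.getUTCDay() === 0 || day.getUTCDay() === 6) {
    day.setUTCDate(day.getUTCDate() - 1);
  }
  return day;
}

// February 2026 ends on a Saturday, so the report date falls on Friday the 27th.
console.log(lastBusinessDay(2026, 2).toISOString().slice(0, 10)); // 2026-02-27
```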
Related Documents
- Monitoring & Observability
- Incident Report INC-2026-001
- Post-Mortem INC-2026-001
- Operational Runbook
- Disaster Recovery Plan
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |