SLA Report

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Initial SLA targets and example monthly report (pre-launch) |

1. Overview

This document defines Drop's Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). It also includes an example monthly SLA report based on simulated post-launch data, which serves as a template for monthly operational reporting.

Reporting Period: February 2026 (example, pre-launch)
Environment: Production (AWS App Runner eu-west-1 + RDS PostgreSQL)
Report Author: Alem Bašić
Distribution: Internal (Alem Bašić, Platform Architect)


2. SLA Definitions

2.1 Service Level Targets

| Metric | SLO Target | SLA Commitment | Measurement Window |
|---|---|---|---|
| Availability | 99.95% | 99.9% | Monthly rolling |
| API Response Time (p50) | < 200ms | < 500ms | Monthly rolling |
| API Response Time (p99) | < 1,000ms | < 2,000ms | Monthly rolling |
| Error Rate | < 0.1% | < 1% | Monthly rolling |
| BankID Login Success Rate | > 99.5% | > 99% | Monthly rolling |
| KYC Initiation Success Rate | > 99% | > 98% | Monthly rolling |
| Payment Initiation Success Rate | > 99% | > 98% | Monthly rolling |
| Health Check Recovery Time (RTO) | < 10 min | < 30 min | Per incident |

2.2 Availability Budget

| SLA Target | Monthly Allowed Downtime | Weekly Allowed Downtime |
|---|---|---|
| 99.9% | 43.8 minutes/month | 10.1 minutes/week |
| 99.95% | 21.9 minutes/month | 5.0 minutes/week |
| 99.99% | 4.4 minutes/month | 1.0 minutes/week |

Current infrastructure capability: 99.9% (SLA commitment). 99.95% is the SLO target once PgBouncer and multi-AZ standby are in place.
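
The budget figures above follow from simple arithmetic. The sketch below assumes an "average month" of 365.25 / 12 days, which is an assumption that happens to reproduce the 43.8-minute figure; the function and constant names are illustrative and not part of the Drop codebase.

```typescript
// Window lengths in minutes. The average-month value is an assumption
// (365.25 / 12 days) chosen because it reproduces the table's 43.8 minutes.
const MINUTES_PER_WEEK = 7 * 24 * 60; // 10,080
const MINUTES_PER_AVG_MONTH = (365.25 / 12) * 24 * 60; // 43,830

// Allowed downtime in minutes for an availability target given in percent.
function allowedDowntimeMinutes(targetPercent: number, windowMinutes: number): number {
  if (targetPercent < 0 || targetPercent > 100) {
    throw new RangeError("targetPercent must be between 0 and 100");
  }
  return windowMinutes * (1 - targetPercent / 100);
}

// Monthly and weekly budgets for the three targets listed in the table.
const budgets = [99.9, 99.95, 99.99].map((target) => ({
  target,
  monthly: allowedDowntimeMinutes(target, MINUTES_PER_AVG_MONTH),
  weekly: allowedDowntimeMinutes(target, MINUTES_PER_WEEK),
}));
```

Note that a specific calendar month can have a smaller budget than the average: a 28-day month has 40,320 minutes, so a 99.9% target allows about 40.3 minutes of downtime.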

2.3 Exclusions

The following are excluded from SLA calculations:

  • Scheduled maintenance windows (posted to BetterStack status page > 24h in advance)
  • BankID OIDC provider outages (upstream at Vipps MobilePay)
  • Open Banking provider outages (upstream — provider TBD)
  • Force majeure events (AWS region outage, etc.)
  • Sumsub KYC service outages (upstream)

3. Service Level Indicators (SLIs)

3.1 Availability

Definition: Percentage of minutes in the reporting period where GET /api/health returns HTTP 200 with {"status":"ok"}.

Measurement: BetterStack Drop Health Check monitor (1-minute check interval, 3 global locations).

Formula: (total_minutes - downtime_minutes) / total_minutes × 100
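
As a minimal sketch, the formula can be expressed as a small function. The function name and the input validation are illustrative assumptions; the example input is a 28-day month with 28 minutes of downtime.

```typescript
// Availability in percent: (total_minutes - downtime_minutes) / total_minutes × 100.
function availabilityPercent(totalMinutes: number, downtimeMinutes: number): number {
  if (totalMinutes <= 0) {
    throw new RangeError("totalMinutes must be positive");
  }
  if (downtimeMinutes < 0 || downtimeMinutes > totalMinutes) {
    throw new RangeError("downtimeMinutes must be between 0 and totalMinutes");
  }
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

// Example: 28-day month (40,320 minutes) with 28 minutes of downtime.
const exampleAvailability = availabilityPercent(28 * 24 * 60, 28);
const exampleRounded = exampleAvailability.toFixed(2); // "99.93"
```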

3.2 Response Time

Definition: p50 and p99 response time for all API requests.

Measurement: CloudWatch App Runner request metrics (future: add APM tool).

Alert threshold: p99 > 1,000ms triggers Slack #drop-ops alert.
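
For illustration, p50 and p99 can be computed from a batch of latency samples with the nearest-rank method. This is a sketch only: CloudWatch computes its own percentile statistics, and the function names and sample values here are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: readonly number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("at least one sample is required");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("rank out of bounds");
  }
  return value;
}

// True when p99 exceeds the alert threshold (1,000 ms in this document).
function p99BreachesThreshold(samplesMs: readonly number[], thresholdMs: number = 1000): boolean {
  return percentile(samplesMs, 99) > thresholdMs;
}
```

With very few samples, p99 is simply the slowest request, so a single outlier can trigger the alert; a real alarm would normally evaluate the percentile over a time window with enough traffic.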

3.3 Error Rate

Definition: Percentage of HTTP requests returning 5xx status codes.

Measurement: A Slack alert fires when more than 5 errors occur within 60 seconds (implemented in src/lib/alerts.ts).

Formula: 5xx_requests / total_requests × 100
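
The following sketch shows one way to express both the error-rate formula and a "more than 5 errors in 60 seconds" burst check using a sliding window. It is not the contents of src/lib/alerts.ts; the class and function names are hypothetical.

```typescript
// Error rate in percent: 5xx_requests / total_requests × 100.
function errorRatePercent(serverErrors: number, totalRequests: number): number {
  if (totalRequests === 0) {
    return 0; // no traffic, so no measurable error rate
  }
  return (serverErrors / totalRequests) * 100;
}

// Sliding-window burst detector (illustrative, not the production implementation).
class ErrorBurstDetector {
  private timestampsMs: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  constructor(maxErrors: number = 5, windowMs: number = 60_000) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Records one 5xx error at nowMs and returns true when the number of
  // errors inside the window exceeds maxErrors.
  record(nowMs: number): boolean {
    this.timestampsMs.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    this.timestampsMs = this.timestampsMs.filter((t) => t > cutoff);
    return this.timestampsMs.length > this.maxErrors;
  }
}
```

Passing the timestamp in explicitly, rather than reading the clock inside `record`, keeps the detector easy to test.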

3.4 BankID Login Success Rate

Definition: Percentage of BankID login initiations that result in a successful session creation.

Measurement: Application audit log (audit_log table): the number of session_created events divided by the number of bankid_initiate events.
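
A minimal sketch of this ratio, assuming audit log rows have already been loaded into memory. The `AuditEvent` shape and its `eventType` field are hypothetical; only the two event names come from this document.

```typescript
// Hypothetical in-memory shape of an audit_log row.
type AuditEvent = { eventType: string };

// Login success rate in percent, or null when no logins were initiated.
function loginSuccessRatePercent(events: readonly AuditEvent[]): number | null {
  let initiated = 0;
  let created = 0;
  for (const event of events) {
    if (event.eventType === "bankid_initiate") {
      initiated += 1;
    } else if (event.eventType === "session_created") {
      created += 1;
    }
  }
  return initiated === 0 ? null : (created / initiated) * 100;
}
```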

3.5 Transaction Success Rate

Definition: Percentage of initiated transactions (remittance + QR) that reach completed status.

Measurement: Database query: COUNT(*) WHERE status='completed' / COUNT(*) WHERE status IN ('completed','failed').
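
The query above counts only transactions in a terminal state. The sketch below mirrors that logic in application code; the `pending` status is an assumption added to show that non-terminal transactions are left out of the denominator.

```typescript
// "completed" and "failed" come from the query above; "pending" is an assumed
// non-terminal status used for illustration.
type TransactionStatus = "pending" | "completed" | "failed";

// Success rate in percent over terminal transactions, or null if there are none.
function transactionSuccessRatePercent(statuses: readonly TransactionStatus[]): number | null {
  const completed = statuses.filter((s) => s === "completed").length;
  const failed = statuses.filter((s) => s === "failed").length;
  const terminal = completed + failed;
  return terminal === 0 ? null : (completed / terminal) * 100;
}
```

Returning null instead of 0 or 100 when there are no terminal transactions avoids reporting a misleading PASS or FAIL for a period with no data, which matches the TBD entries in the example report.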


4. Monthly SLA Report — February 2026 (Example)

NOTE: This is a simulated example report for template purposes. Actual metrics will be populated from BetterStack, CloudWatch, and database queries after public launch.

4.1 Availability

| Metric | Target | Actual | Status |
|---|---|---|---|
| Uptime | 99.9% | 99.93% | PASS |
| Total downtime | < 43.8 min | 28 min (1 incident) | PASS |
| Incidents | N/A | 1 (INC-2026-001) | |
| Maintenance windows | N/A | 0 | |

Uptime chart (example):

Week 1 (Feb 03–09):  ████████████████████ 100.00%
Week 2 (Feb 10–16):  ████████████████████ 100.00%
Week 3 (Feb 17–23):  ███████████████████▌  99.72% (INC-2026-001: 28 min)
Week 4 (Feb 24–28):  ████████████████████ 100.00%
─────────────────────────────────────────────────
Monthly:              ████████████████████  99.93%

4.2 Performance

| Metric | Target | Actual | Status |
|---|---|---|---|
| p50 response time | < 200ms | ~85ms | PASS |
| p99 response time | < 1,000ms | ~420ms | PASS |
| p99 during incident | N/A | Timeout (excluded) | N/A |

Slow endpoints (example):

| Endpoint | p99 Latency | Notes |
|---|---|---|
| POST /api/transactions | ~380ms | Sumsub KYC check adds latency |
| GET /api/bank-accounts/balance | ~350ms | Open Banking API round trip |
| POST /api/auth/bankid/callback | ~180ms | DB session write |
| GET /api/health | ~45ms | DB ping |

4.3 Error Rate

| Metric | Target | Actual | Status |
|---|---|---|---|
| Overall error rate | < 1% | 0.06% | PASS |
| Error rate (excl. incident) | < 0.1% | 0.03% | PASS |
| 5xx errors | < 1% | 0.06% | PASS |
| 4xx errors | N/A | 2.1% | INFO (mostly 401 auth failures) |

Top 5xx errors (example):

| Error | Count | Root Cause |
|---|---|---|
| 503 during INC-2026-001 | ~240 | DB connection pool exhaustion |
| 500 Sumsub timeout | 3 | Sumsub API latency spike |
| 503 cold start | 1 | App Runner scale-to-zero (if applicable) |

4.4 BankID Authentication

| Metric | Target | Actual | Status |
|---|---|---|---|
| Login success rate | > 99% | 99.7% | PASS |
| Login failures | | 0.3% | INFO |
| CSRF rejections | | 0 | |
| Age verification failures | | 2 | INFO |

Note: Login failures during INC-2026-001 are excluded from the calculation (upstream service impacted by pool exhaustion).

4.5 KYC (Sumsub)

| Metric | Target | Actual | Status |
|---|---|---|---|
| KYC initiation success rate | > 98% | 99.2% | PASS |
| GREEN (approved) rate | N/A | 91% | INFO |
| RED (rejected) rate | N/A | 6% | INFO |
| RETRY rate | N/A | 3% | INFO |
| Average review time | N/A | ~2h | INFO |

4.6 Transactions

| Metric | Target | Actual | Status |
|---|---|---|---|
| Transaction success rate | > 98% | TBD (Open Banking provider not yet live) | TBD |
| Remittance completion | > 98% | TBD | TBD |
| QR payment completion | > 98% | TBD | TBD |

5. Incident Summary

| Incident ID | Severity | Start | Duration | Impact | Root Cause | Status |
|---|---|---|---|---|---|---|
| INC-2026-001 | P1 | 2026-02-20 10:30 UTC | 28 min | 100% users | RDS connection pool exhaustion | Closed |

SLA credit impact: 28 minutes of downtime in a 28-day month (40,320 minutes) gives a monthly uptime of 99.93%, which meets the 99.9% SLA. No SLA credit is applicable.
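
The credit check can be reproduced with a short calculation. This is a sketch under stated assumptions: the interface and function names are hypothetical, and the downtime budget is derived from the actual length of the reporting month rather than an average month.

```typescript
interface AvailabilityResult {
  uptimePercent: number;
  budgetMinutes: number; // downtime allowed by the SLA for this period
  remainingBudgetMinutes: number; // negative when the SLA is breached
  slaMet: boolean;
}

// Compares observed downtime against the budget implied by the SLA target.
function evaluateAvailability(
  totalMinutes: number,
  downtimeMinutes: number,
  slaTargetPercent: number,
): AvailabilityResult {
  const uptimePercent = ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
  const budgetMinutes = totalMinutes * (1 - slaTargetPercent / 100);
  return {
    uptimePercent,
    budgetMinutes,
    remainingBudgetMinutes: budgetMinutes - downtimeMinutes,
    slaMet: downtimeMinutes <= budgetMinutes,
  };
}

// INC-2026-001: 28 minutes of downtime in February 2026 (28 days) against 99.9%.
const february = evaluateAvailability(28 * 24 * 60, 28, 99.9);
```

For this input the budget is about 40.3 minutes, so roughly 12 minutes of budget remain and the SLA is met.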


6. Infrastructure Health

6.1 App Runner

| Metric | Value | Notes |
|---|---|---|
| Service status | RUNNING | |
| Deployments this month | 3 | Bug fix, security patch, config update |
| Deployment failures | 0 | |
| Average deployment time | ~4 min | |

6.2 RDS PostgreSQL

| Metric | Value | Threshold | Status |
|---|---|---|---|
| DB instance status | available | | PASS |
| Free storage space | ~18 GB | Alert if < 2 GB | PASS |
| Backup retention | 7 days | 7 days minimum | PASS |
| Last automated snapshot | < 24h ago | < 24h | PASS |
| PITR enabled | Yes | Required | PASS |
| Max connections (month) | ~45 (peak) | Alert at 70 | PASS (alarm not yet configured) |

6.3 BetterStack Monitors

| Monitor | Checks | Passed | Failed | Uptime |
|---|---|---|---|---|
| Drop Health Check | ~40,320 | ~40,292 | ~28 (during incident) | 99.93% |
| Drop Landing Page | ~40,320 | ~40,320 | 0 | 100.00% |
| US East Health | ~40,320 | ~40,292 | ~28 (during incident) | 99.93% |

7. Security & Compliance

| Check | Status | Notes |
|---|---|---|
| No PII data exposure incidents | PASS | |
| Audit log continuity | PASS | All events logged |
| Secret rotation (JWT_SECRET) | Not yet due | Rotation scheduled Q2 2026 |
| Secret rotation (DB password) | Not yet due | Rotation scheduled Q2 2026 |
| AML alerts reviewed | N/A | No open cases |
| Pending KYC > 24h | 0 | |
| GDPR requests | 0 | None received |

8. Trends

| Month | Uptime | p99 Latency | Error Rate | Incidents |
|---|---|---|---|---|
| Feb 2026 | 99.93% | ~420ms | 0.06% | 1 (P1) |
| Mar 2026 | TBD | TBD | TBD | TBD |
| Apr 2026 | TBD | TBD | TBD | TBD |

9. Action Items from This Report

| # | Action | Owner | Due |
|---|---|---|---|
| 1 | Configure CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week |
| 2 | Implement PgBouncer / RDS Proxy before public launch | Platform | Before v1.0 |
| 3 | Add p99 latency CloudWatch alarm (> 1,000ms → Slack alert) | Platform | Before v1.0 |
| 4 | Review 4xx error volume and categorize auth failures vs user errors | Alem | Within 2 weeks |
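
As a starting point for action item 1, the object below sketches the request parameters for a CloudWatch `PutMetricAlarm` call on the RDS `DatabaseConnections` metric. The alarm name, DB instance identifier, SNS topic ARN, and evaluation settings are hypothetical placeholders; only the metric and the threshold of 70 come from this report.

```typescript
// Parameters in the shape of a CloudWatch PutMetricAlarm request.
// Values marked "hypothetical" or "assumption" must be replaced with real ones.
const databaseConnectionsAlarm = {
  AlarmName: "drop-rds-database-connections-high", // hypothetical name
  AlarmDescription: "RDS DatabaseConnections above 70",
  Namespace: "AWS/RDS",
  MetricName: "DatabaseConnections",
  Dimensions: [
    { Name: "DBInstanceIdentifier", Value: "drop-production" }, // hypothetical instance ID
  ],
  Statistic: "Maximum",
  Period: 60, // seconds per datapoint (assumption)
  EvaluationPeriods: 3, // consecutive breaching periods before alarming (assumption)
  Threshold: 70, // from action item 1
  ComparisonOperator: "GreaterThanThreshold",
  AlarmActions: [
    "arn:aws:sns:eu-west-1:123456789012:drop-ops-alerts", // hypothetical SNS topic
  ],
};
```

An object like this could be passed to the AWS SDK's `PutMetricAlarmCommand` or translated into infrastructure-as-code; which one fits depends on how the rest of the stack is provisioned.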

10. SLA Report Distribution & Cadence

| Audience | Frequency | Format |
|---|---|---|
| Alem Bašić (CEO) | Monthly | This document |
| Platform Architect | Monthly | This document |
| Finanstilsynet (if required) | Annually | Aggregated uptime report |

Report generation date: Last business day of each month.
Data sources: BetterStack dashboard + CloudWatch + audit_log queries.



Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |