Runbook: BankID Failure
Runbook: BankID Integration Failure
Service: BankID OAuth Authentication Severity: CRITICAL (blocks all logins) MTTR Target: <15 minutes Owner: John (AI Director)
Symptoms
Users report they cannot log in. Symptoms include:
Diagnosis
1. Check BankID Service Status
External status page:
# Check BankID status (no official status page, monitor Twitter)
open https://twitter.com/search?q=BankID%20Norge
# Or check community forums
open https://www.reddit.com/r/Norge/search?q=BankID
Quick test:
# Try BankID login from another service (e.g., tax portal)
open https://www.skatteetaten.no/person/
# If BankID works there but not in Drop → problem is our integration
2. Check Drop Logs
# CloudWatch Logs (production)
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-production \
--filter-pattern "bankid" \
--start-time $(date -u -d '10 minutes ago' +%s)000 \
--region eu-west-1
# Look for:
# - "BankID OAuth error: invalid_client"
# - "BankID callback failed: invalid_state"
# - "BankID API timeout"
3. Check Environment Variables
# Verify BankID credentials are set
aws apprunner describe-service \
--service-arn <ARN> \
--region eu-west-1 \
| jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
| grep BANKID
# Expected:
# BANKID_CLIENT_ID: <client-id>
# BANKID_CLIENT_SECRET: <exists> (value hidden)
# BANKID_CALLBACK_URL: https://getdrop.no/api/auth/bankid/callback
4. Check OAuth Flow
Test OAuth initiation:
# Start OAuth flow
curl -X POST https://getdrop.no/api/auth/bankid/initiate \
-H "Content-Type: application/json" \
-d '{"redirectUrl": "/dashboard"}' \
-v
# Expected: HTTP 302 redirect to BankID with state parameter
# If 500: Check BANKID_CLIENT_ID and BANKID_CALLBACK_URL
Test OAuth callback:
# Simulate callback (replace <code> and <state> with real values from BankID redirect)
curl -X GET "https://getdrop.no/api/auth/bankid/callback?code=<code>&state=<state>" \
-v
# Expected: HTTP 302 redirect to /dashboard with auth cookie
# If 401: Check BANKID_CLIENT_SECRET
# If 400: Check state validation logic
Common Causes & Solutions
Cause 1: BankID Service Outage (External)
Probability: 5% (BankID is highly reliable)
Symptoms:
- All BankID logins fail across all services
- BankID status page reports incident
- Social media mentions BankID outage
Solution:
-
Communicate: Post status update to users
Subject: Login temporarily unavailable Body: BankID authentication is experiencing issues. We're monitoring the situation and will restore service as soon as BankID is back online. Estimated: <X> minutes. -
Monitor: Watch BankID Twitter/status for updates
-
Fallback (if available): If demo mode exists, consider temporary activation:
# Enable demo mode (ONLY in emergency, requires Alem approval) aws apprunner update-service --service-arn <ARN> \ --source-configuration "ImageRepository={...}" \ --instance-configuration "EnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}" -
Post-incident: Document outage duration, user impact
ETA: Depends on BankID (typically <2 hours)
Cause 2: Invalid OAuth Credentials
Probability: 20% (after credential rotation or environment change)
Symptoms:
- Logs show: "invalid_client" or "unauthorized_client"
- OAuth flow fails immediately (no redirect to BankID)
Solution:
-
Verify credentials in Vaultwarden:
bw get item "BankID OAuth" --session $BW_SESSION -
Update App Runner environment variables:
aws apprunner update-service --service-arn <ARN> \ --source-configuration "ImageRepository={...}" \ --instance-configuration "EnvironmentVariables={ BANKID_CLIENT_ID=<correct-client-id>, BANKID_CLIENT_SECRET=<correct-secret> }" -
Trigger deployment:
aws apprunner start-deployment --service-arn <ARN> --region eu-west-1 -
Test: Attempt login after deployment completes (3-5 minutes)
ETA: 10 minutes
Cause 3: Callback URL Mismatch
Probability: 15% (after domain change or deployment error)
Symptoms:
- Logs show: "redirect_uri_mismatch"
- BankID redirects to wrong URL (404 or CORS error)
Solution:
-
Check registered callback URL in BankID portal:
- Login to BankID integration portal
- Navigate to OAuth settings
- Verify callback URL:
https://getdrop.no/api/auth/bankid/callback
-
If mismatch, update BankID portal:
- Change redirect URI to match current domain
- Save changes (may require approval, 1-2 hours)
-
Update App Runner env var:
aws apprunner update-service --service-arn <ARN> \ --source-configuration "ImageRepository={...}" \ --instance-configuration "EnvironmentVariables={ BANKID_CALLBACK_URL=https://getdrop.no/api/auth/bankid/callback }" -
Test: Login flow should work after both changes
ETA: 15 minutes (if no BankID approval required), 2 hours (if approval needed)
Cause 4: State Parameter Validation Failure
Probability: 10% (race condition or session timeout)
Symptoms:
- Logs show: "Invalid state parameter"
- User completes BankID flow but callback rejects
Solution:
-
Check session storage:
- BankID state is stored in server session
- If session expires before callback (>10 min), state is lost
-
Increase session timeout (if needed):
// src/lib/auth.ts const SESSION_TIMEOUT = 15 * 60 * 1000; // 15 minutes (was 10) -
Clear stale sessions:
# If using Redis for sessions redis-cli FLUSHDB # If using database sessions sqlite3 drop.db "DELETE FROM sessions WHERE expires_at < datetime('now');" -
Ask user to retry: State timeout is usually one-time issue
ETA: 5 minutes
Cause 5: BankID API Rate Limiting
Probability: 5% (during high-traffic events)
Symptoms:
- Logs show: "rate_limit_exceeded" or HTTP 429
- Intermittent failures (some users succeed, others fail)
Solution:
-
Check rate limit headers in logs:
X-RateLimit-Limit: 100 X-RateLimit-Remaining: 0 X-RateLimit-Reset: 1640000000 -
Wait for rate limit reset: Typically resets every 60 seconds
-
Implement exponential backoff (if not present):
// src/lib/bankid-client.ts async function callBankIDAPI(retries = 3) { try { return await fetch(url); } catch (error) { if (error.status === 429 && retries > 0) { await sleep(1000 * (4 - retries)); // 1s, 2s, 3s return callBankIDAPI(retries - 1); } throw error; } } -
Contact BankID support: If rate limits are too low for production traffic
ETA: 5 minutes (automatic), 1-2 days (if support ticket needed)
Cause 6: Network/Firewall Issues
Probability: 5% (AWS security group misconfiguration)
Symptoms:
- Logs show: "Connection timeout" or "ECONNREFUSED"
- BankID API requests never reach destination
Solution:
-
Check outbound rules (App Runner → BankID):
# App Runner egress is unrestricted by default # Check VPC connector security group (if using VPC) aws ec2 describe-security-groups --group-ids <vpc-connector-sg> --region eu-west-1 -
Test connectivity from container:
# Exec into running container (if possible) curl -v https://oidc.bankid.no/.well-known/openid-configuration # Expected: HTTP 200 with JSON response # If timeout: Network/firewall issue -
Check DNS resolution:
nslookup oidc.bankid.no # Should resolve to BankID IP addresses -
Whitelist BankID IPs (if using strict firewall):
- Contact BankID for IP ranges
- Add to AWS security group outbound rules
ETA: 15 minutes (if quick fix), 1 hour (if requires networking changes)
Emergency Workarounds
Option 1: Fallback to Demo Mode (Temporary)
Use case: BankID outage affects all users, estimated >1 hour downtime
Steps:
-
Enable demo mode:
aws apprunner update-service --service-arn <ARN> \ --instance-configuration "EnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}" -
Communicate to users:
Subject: Temporary login method available Body: Due to BankID outage, we've enabled demo login. Use email/password to access your account. BankID will be restored as soon as possible. -
Monitor BankID status
-
Revert to BankID when available:
aws apprunner update-service --service-arn <ARN> \ --instance-configuration "EnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=live}"
Risk: Demo mode may bypass KYC checks. Only use with Alem approval.
Option 2: Redirect to Status Page
Use case: BankID outage, no ETA, no fallback available
Steps:
-
Deploy maintenance page:
# Update health endpoint to return 503 # This triggers BetterStack alert + status page update -
Show user-friendly message:
<h1>Login Temporarily Unavailable</h1> <p>Our authentication provider (BankID) is experiencing issues.</p> <p>We expect service to resume within <strong>X minutes</strong>.</p> <p>Status updates: <a href="https://status.drop.no">status.drop.no</a></p> -
Monitor and communicate updates every 30 minutes
Post-Incident Actions
-
Document incident:
# Create incident report touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-bankid-failure.md -
Root cause analysis:
- What triggered the failure?
- Why didn't monitoring detect it sooner?
- What prevented faster recovery?
-
Update monitoring:
- Add synthetic BankID login test (every 5 min)
- Alert on OAuth callback failures >5/min
-
Update runbook:
- Add new failure mode if discovered
- Improve diagnosis steps based on what worked
-
Team debrief (if >30 min outage):
- Review timeline
- Identify improvements
- Update on-call procedures
Escalation
| Time | Action |
|---|---|
| 0 min | John starts diagnosis |
| 5 min | If not resolved, alert Alem via Slack + SMS |
| 15 min | If BankID outage confirmed, enable fallback (Alem approval) |
| 30 min | If still unresolved, schedule team call |
| 1 hour | If major outage, public communication via email/social media |
Contacts
- BankID Support: [email protected]
- BankID Phone: +47 XXXX XXXX (24/7 for critical issues)
- Internal: Alem (CEO, final decision on fallback modes)
Related Documentation
docs/architecture/authentication.md— BankID OAuth flowsrc/app/api/auth/bankid/route.ts— BankID integration codedocs/dr-runbook.md— Infrastructure disaster recovery- Vaultwarden item: "BankID OAuth" — Credentials
Last Updated: 2026-02-22 Next Review: Before Phase 2 (Banking Integration)
No comments to display
No comments to display