# Runbook: NBS IPS Outage

# Runbook: NBS IPS Outage

## Purpose

This runbook provides step-by-step procedures for detecting, triaging, and responding to NBS IPS (Narodna Banka Srbije Instant Payment System) outages or degraded performance.

## Trigger

Any of the following conditions indicate potential NBS IPS outage:

1. **High rejection rate:** >10% of payment initiation requests returning `RJCT` status
2. **Timeout spike:** >5 consecutive timeout errors (>30s response time)
3. **HTTP 5xx errors:** NBS IPS API returning 500/502/503/504
4. **NBS status page alert:** Official communication from NBS about system maintenance or outage

## Preconditions

Before following this runbook, ensure:

- [ ] You have access to production logs (CloudWatch, Datadog, or local Docker logs)
- [ ] You have access to NBS IPS status page: https://www.nbs.rs (check payment system availability)
- [ ] You have NBS contact details: platne.institucije@nbs.rs, +381 11 3027 100
- [ ] You have access to Drop Srbija admin dashboard (to pause outbound payments if needed)
- [ ] You have access to status page publishing tool (to notify users)

## Step-by-Step Response

### Step 1: Verify the Issue (ETA: 2 minutes)

**Check Docker logs for NBS IPS errors:**

```bash
docker logs dropsrbija-api | grep NbsIpsLog | tail -50
```

**Look for patterns:**

- Multiple `RJCT` (rejection) responses with error code `AM05` (duplicate submission)
- Multiple `RJCT` with error code `TECH` (technical error)
- Timeout messages: `java.net.SocketTimeoutException: Read timed out`
- HTTP 503 Service Unavailable
- HTTP 500 Internal Server Error

**Example of normal log:**

```
2026-04-16T10:15:32Z [NbsIpsLog] transaction_id=abc123 request_type=initiate response_status=200 nbs_ips_status=ACCP
```

**Example of outage log:**

```
2026-04-16T10:45:12Z [NbsIpsLog] transaction_id=def456 request_type=initiate response_status=503 error_message="Service Temporarily Unavailable"
2026-04-16T10:45:23Z [NbsIpsLog] transaction_id=ghi789 request_type=initiate response_status=500 error_message="Internal Server Error"
```

**Check metrics dashboard** (if available):

- Transaction success rate (should be >95%)
- Average NBS IPS response time (should be <2s)
- Error rate by HTTP status code

### Step 2: Check NBS IPS Status Page (ETA: 1 minute)

**Official NBS status page:**

1. Go to https://www.nbs.rs
2. Navigate to: **Payment Systems → IPS → System Availability**
3. Check for announcements:
   - Scheduled maintenance windows
   - Incident notifications
   - System degradation alerts

**If NBS confirms outage:**
- Note the estimated resolution time (ERT)
- Proceed to Step 4 (Customer Communication)

**If NBS shows "All Systems Operational":**
- Outage may be isolated to Drop Srbija's connection
- Proceed to Step 3 (Technical Diagnosis)

### Step 3: Technical Diagnosis (ETA: 5 minutes)

**Possible causes of isolated failures:**

1. **Network issue between Drop Srbija and NBS:**
   - Check VPN/VPC connectivity (if applicable)
   - Verify mTLS certificates haven't expired
   - Test network path: `curl -I https://ips.nbs.rs`

2. **Rate limiting:**
   - NBS may throttle requests if Drop exceeds transaction quota
   - Check `nbs_ips_logs` table for HTTP 429 (Too Many Requests)
   - Solution: Implement exponential backoff (already in adapter)

3. **Authentication failure:**
   - mTLS client certificate may have expired
   - API key rotation (if NBS uses API keys)
   - Check for HTTP 401/403 errors in logs

4. **Partner bank integration issue:**
   - If Drop operates as agent under Article 24, outage may be at partner bank's IPS gateway
   - Contact bank technical support: [Partner Bank Support Number]

**Run diagnostic test transaction:**

```bash
# Send test payment (100 RSD to known-good recipient)
curl -X POST http://localhost:3003/v1/ips/initiate \
  -H "Authorization: Bearer <admin_jwt>" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientPhone": "+381601234567",
    "amount": 100,
    "description": "IPS diagnostic test"
  }'
```

**Expected responses:**

- **Success:** `{"transactionId": "...", "status": "PENDING"}` → NBS IPS is working
- **Timeout:** No response after 30s → Network/connectivity issue
- **RJCT:** `{"status": "failed", "error": "TECH"}` → NBS technical error
- **HTTP 503:** NBS IPS is down

### Step 4: Customer Communication (ETA: 3 minutes)

**If outage is confirmed (NBS or Drop technical issue):**

**1. Post status page update** (https://status.dropsrbija.rs or in-app banner):

**Serbian template:**

```
🔴 Problemi sa trenutnim plaćanjima

NBS instant plaćanja su trenutno nedostupna zbog [razloga]. 
Radimo na rešavanju problema.

Očekivano vreme povratka: [ETA ili "u najkraćem roku"]

Vaša sredstva su sigurna. Pokušajte ponovo za nekoliko minuta.

Ažurirano: [timestamp]
```

**English translation (for reference):**

```
🔴 Issues with instant payments

NBS instant payments are currently unavailable due to [reason]. 
We are working on resolving the issue.

Expected resolution time: [ETA or "as soon as possible"]

Your funds are safe. Please try again in a few minutes.

Updated: [timestamp]
```

**2. Send SMS to active users** (optional, for extended outages >30 min):

**SMS template:**

```
Drop Srbija: Instant plaćanja trenutno nedostupna zbog problema sa NBS sistemom. Vaša sredstva su sigurna. Pokušajte ponovo za 30 min. Info: dropsrbija.rs/status
```

**3. Email to high-value users** (optional, for outages >2 hours):

Subject: `Obaveštenje o trenutnim problemima sa plaćanjima`

Body: [See incident-notification-procedure.md for full email template]

### Step 5: Escalate to NBS (if needed) (ETA: 5 minutes)

**When to escalate:**

- Outage duration >15 minutes AND NBS status page shows "operational"
- You suspect issue is specific to Drop Srbija's integration
- Multiple banks reporting similar issues (check Serbian fintech community channels)

**NBS Contact:**

- **Email:** platne.institucije@nbs.rs
- **Phone:** +381 11 3027 100 (business hours: 08:00-16:00 CET)
- **Emergency Phone:** [TBD — request during onboarding]

**Escalation email template:**

```
Subject: IPS Payment Failures — Drop Srbija (Urgent)

Poštovani,

Prijavljujemo tehničke probleme sa IPS plaćanjima preko Drop Srbija platforme.

Simptomi:
- Vreme početka: [timestamp]
- Procenat neuspelih transakcija: [X%]
- HTTP status kodovi: [500/503/timeout]
- Broj pogođenih korisnika: [Y]

Dijagnostika:
- NBS status stranica pokazuje "operativno"
- mTLS sertifikati validni do: [expiry date]
- Test transakcije vraćaju: [error message]

Molimo za hitnu pomoć u dijagnostici problema.

Kontakt: [your name], [phone], [email]

Drop Srbija d.o.o.
[Company registration details]
```

### Step 6: Implement Workaround (if available) (ETA: 10 minutes)

**Workaround Option 1: Switch to Alternate Bank Adapter**

If Drop has partnerships with multiple banks:

```bash
# Update environment variable (requires deployment)
export BANK_PARTNER=alternate_bank_id

# Restart API
docker compose restart api
```

**Workaround Option 2: Queue Transactions for Retry**

Backend already implements exponential backoff retry for failed transactions:

- First retry: 30 seconds
- Second retry: 2 minutes
- Third retry: 10 minutes
- After 3 failures: Mark transaction as `failed`, user receives notification

**No manual intervention needed** — adapter handles retries automatically.

**Workaround Option 3: Pause Outbound Payments**

If outage is prolonged (>2 hours) and retries are causing cascading failures:

```bash
# Via admin API (requires admin JWT)
curl -X POST http://localhost:3003/admin/payments/pause \
  -H "Authorization: Bearer <admin_jwt>" \
  -d '{"reason": "NBS IPS outage", "estimatedResumptionTime": "2026-04-16T14:00:00Z"}'
```

**Effect:**
- New payment requests return HTTP 503 with message: "Plaćanja su privremeno nedostupna"
- Existing pending transactions continue retry attempts
- Users see in-app banner: "Trenutno ne primamo nova plaćanja"

**Resume payments:**

```bash
curl -X POST http://localhost:3003/admin/payments/resume \
  -H "Authorization: Bearer <admin_jwt>"
```

### Step 7: Monitor Recovery (ETA: Ongoing)

**Once NBS IPS is back online:**

1. **Check transaction backlog:**

```bash
# Count transactions in "processing" state
psql -h localhost -p 5434 -U dropsrbija -d dropsrbija_prod -c \
  "SELECT COUNT(*) FROM transactions WHERE status = 'processing' AND created_at > NOW() - INTERVAL '2 hours';"
```

2. **Verify retry processing:**

Backend retries failed transactions automatically. Monitor logs:

```bash
docker logs dropsrbija-api | grep "RetryProcessor" | tail -20
```

Expected output:

```
[RetryProcessor] Retrying transaction abc123 (attempt 1/3)
[NbsIpsLog] transaction_id=abc123 response_status=200 nbs_ips_status=ACCP
[RetryProcessor] Transaction abc123 succeeded on retry
```

3. **Update status page:**

```
✅ Problemi rešeni

NBS instant plaćanja su ponovo dostupna. Sva odložena plaćanja će biti procesirana automatski.

Hvala na strpljenju.

Ažurirano: [timestamp]
```

4. **Post-incident review** (within 24 hours):

- Total downtime duration
- Number of affected transactions
- Number of users impacted
- Root cause (NBS outage vs Drop technical issue)
- Lessons learned
- Action items (e.g., multi-bank redundancy, better monitoring)

### Step 8: Report to NBS (if required) (ETA: 72 hours)

**If outage meets NBS incident reporting criteria:**

- Duration >2 hours, OR
- >1000 failed transactions, OR
- Security/data breach involved

**Follow incident notification procedure:**

See [incident-notification-procedure.md](../comms/decisions/incident-notification-procedure.md) for full 3-track reporting protocol:

- **Track 1:** NBS initial notification (within 4 hours)
- **Track 2:** NBS detailed report (within 72 hours)
- **Track 3:** Poverenik data breach notification (if applicable)

## Expected Outcome

**Success criteria:**

- ✅ Outage detected within 5 minutes of occurrence
- ✅ Root cause identified (NBS outage vs Drop technical issue)
- ✅ Users notified within 15 minutes (status page update)
- ✅ NBS contacted (if Drop-specific issue)
- ✅ Workaround implemented (if available)
- ✅ Service restored (or ETA communicated)
- ✅ Post-incident review completed

**Metrics to track:**

- **MTTD (Mean Time to Detect):** <5 minutes
- **MTTR (Mean Time to Resolve):** <30 minutes for Drop issues, variable for NBS outages
- **User notification latency:** <15 minutes

## Escalation Path

**Level 1 (On-call engineer):** Follow Steps 1-6 above

**Level 2 (Technical Lead):** If unresolved after 30 minutes
- Contact: Petter Graff (CodeCraft, petter-graff@alai.no)
- Decide: Implement workaround, escalate to NBS, or wait for NBS resolution

**Level 3 (CEO):** If outage >2 hours OR customer impact >10,000 users
- Contact: Alem Basic (alem@alai.no, +47 404 74 251)
- Decide: Public communication strategy, regulatory notification (NBS + Poverenik)

## Post-Outage Actions

**Mandatory:**

1. **Update runbook** — If new failure mode discovered, add to Step 3 diagnostic checklist
2. **Improve monitoring** — Add alert for specific error pattern that triggered outage
3. **Document lessons learned** — Add entry to [Decision Log](./05-decision-log.md)

**Optional (if pattern recurs):**

4. **Implement multi-bank redundancy** — Partner with 2+ banks for IPS access
5. **Pre-queue transactions** — Buffer transactions locally during known NBS maintenance windows
6. **Enhanced monitoring** — Set up synthetic transaction every 5 minutes to detect outages faster

## Related Documents

- [Incident Notification Procedure](../comms/decisions/incident-notification-procedure.md) — Full NBS/Poverenik reporting protocol
- [Architecture Overview](./01-architecture-overview.md) — NBS IPS integration design
- [Decision Log](./05-decision-log.md) — Historical incident post-mortems

---

**Last Updated:** 2026-04-16  
**Next Review:** After first real NBS IPS outage (to validate runbook effectiveness)