# Bilko Demo Event-Loop Freeze — Root Cause, Fix & Hardening (2026-06-08, MC #103134)

# Bilko Demo Event-Loop Freeze — Root Cause, Fix &amp; Hardening (2026-06-08, MC #103134)

## 1. Incident Summary

**Date:** 2026-06-08  
**Service:** bilko-api-demo (Cloud Run, europe-north1, tribal-sign-487920-k0)  
**Live revision at incident:** bilko-api-demo-00139-b26 (git sha c8dfb6c)  
**MC:** #103134 (child of #103132)  
**Detected by:** CEO (Alem Basic) — no automated alerts existed

**Symptom:** POST /api/v1/auth/login and GET /api/v1/health intermittently returned HTTP 504 with server-side latency of exactly 59.999 s (Cloud Run hard timeout). Observed failure rate: 17–25% of requests. Both endpoints froze simultaneously on both Cloud Run instances. A manual container restart did not fix the issue.

**Why restart did not help:** The blocking code paths are baked into the deployed image (c8dfb6c). The freeze re-triggers the moment any user navigates to Reports or Billing/Usage, which re-executes the same bare JDBC calls on the Netty event-loop thread.

---

## 2. Root Cause — Causal Chain

Source: 5-agent synthesis (petter-synthesis.md) + source-verified app-layer audit (dev-applayer.md) + data/storage audit (devops2-data.md).

### 2.1 Ktor Netty Thread Pool Exhaustion (Primary)

Ktor runs on Netty. On a 1-vCPU Cloud Run container with no callGroupSize set (main branch pre-fix), Netty defaults to approximately 2 call-handling threads. With containerConcurrency=8, Cloud Run routes up to 8 simultaneous requests into the container, all competing for those 2 threads.

Any route that executes a bare orgTransaction() call not wrapped in dbQuery{} runs its JDBC work directly on the Netty call thread, blocking it for the duration of the DB round-trip (10–100 ms typical, up to 30 s if HikariCP must wait for a pool slot). With only 2 call threads and 8 concurrent slots, two simultaneous requests to a bare-orgTransaction route exhaust both call threads. The event loop is frozen.

**Confirmed bare orgTransaction sites on main at incident time:**

<table id="bkmrk-fileline%28s%29route-rep"><thead><tr><th>File</th><th>Line(s)</th><th>Route</th></tr></thead><tbody><tr><td>ReportRoutes.kt</td><td>26</td><td>GET /reports (country lookup)</td></tr><tr><td>ReportRoutes.kt</td><td>240</td><td>GET /reports/kpo (country pre-check)</td></tr><tr><td>ReportRoutes.kt</td><td>264</td><td>GET /reports/kpo/export/pdf (country pre-check)</td></tr><tr><td>ReportRoutes.kt</td><td>285</td><td>GET /reports/kpo/export/xlsx (country pre-check)</td></tr><tr><td>BillingRoutes.kt</td><td>214</td><td>GET /billing/usage (plan tier + invoice count)</td></tr><tr><td>MarketRoutes.kt</td><td>2 sites</td><td>resolveMarketPlugin call sites</td></tr><tr><td>ResourceAccessFilter.kt</td><td>multiple</td><td>Per-request security filter</td></tr><tr><td>Authentication.kt</td><td>impersonation path</td><td>Per impersonation request</td></tr></tbody></table>

### 2.2 PermissionService.loadFromDb — New RBAC Blocking Call (Critical, Entra cutover)

The Entra External ID cutover (MC #103140) introduced RBAC via requirePermission(), which is called on every authenticated request across all 17 route files. On first call per role (4 roles: owner/admin/accountant/viewer), resolve() calls loadFromDb():

```
// BEFORE fix — blocking on Netty event-loop thread
private fun loadFromDb(role: String): Set<String> {
    return transaction {  // bare JDBC on Netty thread
        RolePermissionsTable.selectAll()...
    }
}
```

This fires on every cold-start / scale-out event for every authenticated route. Without the fix, the Entra cutover would have immediately re-introduced a freeze worse than the original: 4 role cache-misses per instance restart, each running blocking JDBC on the event loop. Source: /tmp/evidence-103134/rebase-on-entra.md.

### 2.3 HikariCP connectionTimeout=30,000 ms (Secondary Amplifier)

When the Dispatchers.IO pool workers (correctly dispatched DB calls) hit HikariCP pool pressure — 10 connections shared across 8 concurrent slots — a pool-borrow wait can last up to 30 s per request. Two stacked pool-borrow waits = 60 s = Cloud Run hard kill. This amplifier remains even after the bare-orgTransaction calls are fixed; it is addressed by reducing connectionTimeout to 5 s (P0.3).

### 2.4 Why /health Freezes (Not a DB Problem)

HealthRoutes.kt lines 10–21 are confirmed DB-independent (zero blocking calls, pure in-memory response). The health freeze is not because health touches the DB. It freezes because both Netty call threads are occupied by blocked orgTransaction callers. Health cannot be scheduled. The TCP startup probe continues to pass because the JVM still has port 4001 bound; only HTTP scheduling is stalled.

### 2.5 Why Login Freezes (Login Is the Victim, Not the Cause)

AuthRoutes.kt line 150 correctly wraps login in dbQuery{}. Login itself is not a blocking caller. Login freezes because it cannot obtain a call thread to execute — both threads are consumed by the bare-orgTransaction routes.

### 2.6 Tertiary — Absent Server-Side DB Timeouts

No statement\_timeout, idle\_in\_transaction\_session\_timeout, or lock\_timeout on Cloud SQL (confirmed: databaseFlags null). A single slow query holds a pool connection indefinitely, compounding pool pressure.

### 2.7 gcsfuse Is Not a Cause of Auth/Health Freezes

gcsfuse GC timestamps do not align with freeze onsets (GC at 06:35:04 UTC, freeze at 06:34:52 and 06:35:43). Login does no file I/O. gcsfuse FUSE blocking I/O affects receipt endpoints only; it is a separate hygiene issue and was also offloaded to Dispatchers.IO in the fix.

---

## 3. Fix Applied (MC #103134)

Branch: fix/demo-freeze-on-entra-103134, commit 5252182, rebased on origin/main @ b21b366 (Entra cutover). Merged as PR #279 to main 839ee7f. Source: /tmp/evidence-103134/rebase-on-entra.md.

### Code Fixes

<table id="bkmrk-filechange-routes%2Fre"><thead><tr><th>File</th><th>Change</th></tr></thead><tbody><tr><td>routes/ReportRoutes.kt (4 sites)</td><td>Bare orgTransaction wrapped in dbQuery{}</td></tr><tr><td>routes/BillingRoutes.kt (line 214)</td><td>Bare orgTransaction in /billing/usage wrapped in dbQuery{}</td></tr><tr><td>routes/MarketRoutes.kt (2 sites)</td><td>resolveMarketPlugin call sites wrapped in dbQuery{}</td></tr><tr><td>plugins/ResourceAccessFilter.kt</td><td>fetchResourceOwner + logSecurityViolation made suspend + withContext(IO); Thread.sleep replaced with delay</td></tr><tr><td>plugins/Authentication.kt</td><td>Impersonation transaction wrapped in withContext(Dispatchers.IO)</td></tr><tr><td>services/PermissionService.kt (new finding)</td><td>loadFromDb() now uses runBlocking { withContext(Dispatchers.IO) { ... } } — RBAC per-request DB call offloaded from event loop</td></tr><tr><td>services/ReceiptService.kt</td><td>readLocalDocument + persistLocalIfEnabled made suspend + withContext(IO)</td></tr><tr><td>routes/ExpenseRoutes.kt</td><td>readLocalDocument call moved out of dbQuery{} (now suspend-safe directly)</td></tr><tr><td>plugins/Database.kt</td><td>connectionTimeout 30,000 ms reduced to 5,000 ms</td></tr><tr><td>routes/HealthRoutes.kt</td><td>Added GET /api/v1/health/deep — DB-touching endpoint with withTimeout 3 s + Dispatchers.IO + SELECT 1</td></tr></tbody></table>

### Infrastructure Fixes (cloudbuild-stage.yaml + cloudbuild.yaml)

<table id="bkmrk-changebeforeafter-me"><thead><tr><th>Change</th><th>Before</th><th>After</th></tr></thead><tbody><tr><td>Memory</td><td>512 Mi</td><td>1 Gi</td></tr><tr><td>Liveness probe</td><td>None</td><td>httpGet /api/v1/health/deep, port 4001, initialDelaySeconds 30, periodSeconds 30, failureThreshold 2, timeoutSeconds 8</td></tr><tr><td>Startup probe type</td><td>TCP port 4001, period 240s, failureThreshold 1, timeout 240s</td><td>httpGet /api/v1/health, port 4001, period 10s, failureThreshold 3, timeout 5s</td></tr><tr><td>containerConcurrency (cloudbuild templates)</td><td>80 (default)</td><td>8</td></tr></tbody></table>

### Build and Test Results

```
./gradlew build -x detekt: BUILD SUCCESSFUL in 12s
Tests: 2561 passed, 1 skipped (EntraLiveCiamTokenWp5Test — assumeTrue on stale token file), 0 failures
```

---

## 4. Industry-Standard Scorecard

Condensed from petter-synthesis.md section 2. Full findings in layer audit files.

### Application Layer Deviations

<table id="bkmrk-componentstatusfindi"><thead><tr><th>Component</th><th>Status</th><th>Finding</th></tr></thead><tbody><tr><td>Ktor/Netty callGroupSize</td><td>DEVIATES (pre-fix)</td><td>Defaults to ~2 call threads on 1 vCPU when unset. callGroupSize=32 added in fix; must be confirmed in Cloud Run env (KTOR\_CALL\_GROUP\_SIZE=32 or production profile).</td></tr><tr><td>dbQuery{} discipline</td><td>FIXED</td><td>5 bare orgTransaction sites + PermissionService.loadFromDb all wrapped. Remaining raw transaction{} calls are safely nested inside outer dbQuery{} or on non-hot paths.</td></tr><tr><td>Login path</td><td>STANDARD</td><td>AuthRoutes.kt:150 correctly wraps in dbQuery{}. Login was the victim of thread starvation, not a cause.</td></tr><tr><td>BCrypt dispatcher</td><td>DEVIATES (minor)</td><td>BCrypt (CPU-bound 250–400 ms) runs on Dispatchers.IO. Should be Dispatchers.Default. Not the freeze cause. Follow-on P1.3.</td></tr><tr><td>HikariCP connectionTimeout</td><td>FIXED</td><td>30,000 ms reduced to 5,000 ms. Pool-borrow wait now fails fast with 503.</td></tr><tr><td>Login transaction count</td><td>DEVIATES</td><td>AuthService.login() makes 3 separate orgTransaction{} calls (3 pool borrows). Follow-on P1.1: consolidate to 1.</td></tr><tr><td>Ktor RequestTimeout plugin</td><td>DEVIATES</td><td>Not installed. Stuck handlers held until Cloud Run kills at 60 s. Follow-on P1.5: add 10 s server-side timeout.</td></tr><tr><td>/health/deep endpoint</td><td>ADDED</td><td>DB-touching SELECT 1, 3 s internal timeout, returns 200 {db:up} or 503.</td></tr><tr><td>gcsfuse (ReceiptService)</td><td>FIXED (hygiene)</td><td>Blocking POSIX FUSE I/O offloaded to withContext(IO). Not the auth/health freeze cause, but blocked safe concurrency &gt; 1–8.</td></tr></tbody></table>

### Infrastructure Layer Deviations

<table id="bkmrk-componentstatusfindi-1"><thead><tr><th>Component</th><th>Status</th><th>Finding</th></tr></thead><tbody><tr><td>containerConcurrency</td><td>DEVIATES</td><td>8 with blocking paths present. Industry standard: 1 until all blocking DB/IO paths are on Dispatchers.IO. Reduced from 80 to 8 in fix; full revert to 1 deferred pending gcsfuse SDK swap (P1.2).</td></tr><tr><td>minScale/maxScale</td><td>DEVIATES</td><td>min=2 max=2 — no scaling relief valve. With both instances holding frozen slots, no new instance absorbs traffic. Follow-on P2.2: maxScale=5 + pool=2.</td></tr><tr><td>HikariCP pool math</td><td>FRAGILE</td><td>pool=10, maxScale=2 = 20 connections vs ~22 usable on db-f1-micro. Within bounds by 2 connections only.</td></tr><tr><td>Cloud SQL databaseFlags</td><td>DEVIATES</td><td>Zero flags. No statement\_timeout (follow-on MC #103133, blocked on MC #103149), no idle\_in\_transaction\_session\_timeout, no lock\_timeout.</td></tr><tr><td>Liveness probe</td><td>ADDED</td><td>HTTP liveness on /health/deep. Without it, frozen containers ran for the entire incident duration with no auto-recovery.</td></tr><tr><td>Startup probe</td><td>FIXED</td><td>TCP 240s/1 threshold → HTTP GET /health 10s/3 threshold.</td></tr><tr><td>Cloud Monitoring alerts</td><td>DEVIATES</td><td>0 policies (gcloud monitoring policies list = 0 items confirmed 2026-06-08T08:54 UTC). Incident was CEO-detected. Follow-on P2.1: 5 alert policies.</td></tr><tr><td>Memory</td><td>FIXED</td><td>512 Mi to 1 Gi. JVM + HikariCP + gcsfuse daemon on 512 Mi was tight.</td></tr><tr><td>Cloud SQL public IP</td><td>DEVIATES</td><td>ipv4Enabled=true. No open authorized networks found (mitigated), but public surface exists. Follow-on: private IP + VPC.</td></tr></tbody></table>

---

## 5. Deploy Path and Cross-Session Coordination

Source: /tmp/alai/p2p-pairing-evidence/john-7fedd67f-freeze-fix-coordination-20260608.md

This fix was developed in session 7fedd67f, rebased on the Entra External ID cutover branch (b21b366), and coordinated with the Entra/RBAC workstream (MC #103080, Phase 4). The agreed sequence:

1. **Step 1 (complete):** PR #279 merged fix/demo-freeze-on-entra-103134 to main (839ee7f). Stage auto-deploy triggered. Stage revision bilko-api-stage-00531-cub deployed. Proveo PASS issued.
2. **Step 2 (pending):** Demo and production semver tag is owned by the Entra/RBAC workstream (MC #103080). That single semver tag vX.Y.Z ships Entra + RBAC + freeze fix together via bilko-main-deploy trigger. Neither session tags unilaterally.

**Deploy pipeline mechanics (tool-verified):**

- bilko-stage-auto-deploy: push to branch matching ^main$ triggers cloudbuild-stage.yaml
- bilko-main-deploy: semver tag matching ^v.\*$ triggers cloudbuild.yaml (demo, 8 gates + approval required)
- cloudbuild-demo-api.yaml: DEAD CODE — orphaned trigger deleted 2026-05-21. Do not reference or invoke.

---

## 6. Proveo Stage Validation — PASS

Source: /tmp/evidence-103134/proveo-stage-verdict.json + /tmp/evidence-103134/proveo-stage-validation.md  
Timestamp: 2026-06-08T09:54:10Z | Revision: bilko-api-stage-00531-cub (git sha 839ee7f)

<table id="bkmrk-checkresultdetail-%2Fh"><thead><tr><th>Check</th><th>Result</th><th>Detail</th></tr></thead><tbody><tr><td>/health x30 sequential</td><td>PASS</td><td>0/30 failures, max latency 170 ms</td></tr><tr><td>/health x20 concurrent</td><td>PASS</td><td>0/20 failures, max latency 279 ms</td></tr><tr><td>/health/deep x15 (DB-touching)</td><td>PASS</td><td>Endpoint present, db:up, all under 200 ms</td></tr><tr><td>Concurrency starvation test (6 waves x20 = 120 requests)</td><td>PASS</td><td>0 non-200, 0x 504@60s. Wave 1: 2/20 requests at ~10.8 s HTTP 200 (pool warm-up artifact, not freeze). Waves 2–6: zero over 2 s. Original 504@59.999s pattern NOT observed.</td></tr><tr><td>Auth path</td><td>PASS</td><td>POST /auth/login 410 (legacy retired); POST /auth/entra/session (invalid) 400; GET /invoices (no auth) 401; GET /contacts (no auth) 401. Entra + RBAC enforced.</td></tr><tr><td>Probe config</td><td>PASS</td><td>livenessProbe httpGet /health/deep present; startupProbe httpGet /health present; memory 1Gi; git sha 839ee7f confirmed.</td></tr></tbody></table>

**Overall verdict: PASS.** The event-loop freeze fix is confirmed on stage. The original 504@59.999s symptom is not reproduced under sustained concurrent load. All DB-touching paths return fast. Entra-only auth path correctly enforced.

---

## 7. Follow-Up Tasks

<table id="bkmrk-mc-%2F-itemdescription"><thead><tr><th>MC / Item</th><th>Description</th><th>Status</th></tr></thead><tbody><tr><td>MC #103133</td><td>Cloud SQL statement\_timeout + idle\_in\_transaction\_session\_timeout + lock\_timeout</td><td>Blocked on GCP-write channel MC #103149</td></tr><tr><td>MC #103148</td><td>Gate wc bug</td><td>Open</td></tr><tr><td>MC #103173</td><td>Claim-gate stale-transcript loop</td><td>Open</td></tr><tr><td>Task #7</td><td>PermissionService pre-warm cache at JVM startup (4-role cache hit on first request, no cold DB miss)</td><td>Deferred — recommended polish</td></tr><tr><td>P1.1</td><td>Consolidate AuthService.login() 3 transactions into 1 (reduce pool borrows per login from 3 to 1)</td><td>Deferred</td></tr><tr><td>P1.2</td><td>Replace gcsfuse with GCS Storage SDK in ReceiptService (prerequisite for safe containerConcurrency &gt; 8)</td><td>Deferred</td></tr><tr><td>P1.3</td><td>Move BCrypt from Dispatchers.IO to Dispatchers.Default (CPU-bound work on correct dispatcher)</td><td>Deferred</td></tr><tr><td>P1.5</td><td>Install Ktor RequestTimeout plugin (force-kill stuck handlers at 10 s, return 503 not 504)</td><td>Deferred</td></tr><tr><td>P2.1</td><td>Create 5 Cloud Monitoring alert policies (5xx rate, p99 latency, instance ceiling, Cloud SQL connections, /health/deep uptime)</td><td>Deferred — no blocking dependency</td></tr><tr><td>P2.2</td><td>Scaling fix: containerConcurrency=1, maxScale=5, pool=2 (requires P1.2 first)</td><td>Deferred pending P1.2</td></tr><tr><td>P2.6</td><td>Add k6 event-loop-freeze regression gate to CI (spec in test-observability.md section 4 B5)</td><td>Deferred</td></tr></tbody></table>

---

## 8. Evidence References

- /tmp/alai/bilko-infra-team/petter-synthesis.md — 5-agent root-cause synthesis (Petter Graff)
- /tmp/alai/bilko-infra-team/devops1-cloudrun.md — Cloud Run compute/scaling layer audit (FlowForge)
- /tmp/alai/bilko-infra-team/devops2-data.md — Cloud SQL / HikariCP / gcsfuse / networking audit (FlowForge)
- /tmp/alai/bilko-infra-team/dev-applayer.md — Application layer audit (CodeCraft / Martin Kleppmann lens)
- /tmp/alai/bilko-infra-team/test-observability.md — Observability gaps + Proveo validation plan (Angie Jones)
- /tmp/evidence-103134/rebase-on-entra.md — Fix branch: PermissionService finding, rebase evidence, build/test results
- /tmp/evidence-103134/proveo-stage-verdict.json — Stage validation machine-readable verdict (Proveo, 2026-06-08T09:54:10Z)
- /tmp/evidence-103134/proveo-stage-validation.md — Stage validation human-readable report
- /tmp/alai/p2p-pairing-evidence/john-7fedd67f-freeze-fix-coordination-20260608.md — Cross-session coordination with Entra workstream