# Bilko Security & Engineering Decisions (Observability Program)

This page covers security findings, live fixes, engineering decisions, and arming prerequisites for the Bilko observability/self-healing program. It is a companion to the [Program Overview (MC #103328)](https://docs.alai.no/books/bilko-balkan-accounting-saas/page/bilko-observability-self-healing-program-overview-mc-103328) page.

---

## 1. CRITICAL Security Finding F7 and Fix (MC #103369 to #103371)

### What Securion found (MC #103369, 2026-06-10)

Securion reviewed `POST /auth/test/session` on **bilko-demo-api.alai.no** (live trial API). Verdict: **MOVE\_OFF\_PROD**. Source: /tmp/evidence-103369/verification.json

<table id="bkmrk-idseverityfinding-f7"><thead><tr><th>ID</th><th>Severity</th><th>Finding</th></tr></thead><tbody><tr><td>F7</td><td>CRITICAL</td><td>createTestSession() accepted arbitrary email. No whitelist. Leaked secret mints owner JWT for any registered prospect in bilko-demo-db.</td></tr><tr><td>F6</td><td>HIGH</td><td>Endpoint on live customer trial surface (app.bilko.cloud / bilko-demo-api.alai.no).</td></tr><tr><td>F3</td><td>HIGH</td><td>Generic auth bucket 200 req/min on demo - no endpoint-specific rate-limit.</td></tr><tr><td>F2</td><td>MEDIUM</td><td>Non-constant-time string compare (Kotlin !=). Timing side-channel defect.</td></tr><tr><td>F5</td><td>MEDIUM</td><td>RLS isolates E2E tenant but F7 expanded blast radius to all demo users.</td></tr><tr><td>F1</td><td>HIGH</td><td>Endpoint always registered at startup; 404 only when secret absent.</td></tr><tr><td>F4</td><td>LOW</td><td>Secret strength operator-dependent; no enforced entropy or rotation schedule.</td></tr></tbody></table>

### Fix deployed (MC #103371, 2026-06-10)

PR #330 merged. Deploy run 27272876257 success. Revision bilko-api-demo-00179-wdz at 100% traffic. Source: /tmp/evidence-103371/verification.json

<table id="bkmrk-remediationwhat-chan"><thead><tr><th>Remediation</th><th>What changed</th></tr></thead><tbody><tr><td>Email whitelist (F7 closed)</td><td>testEmail hard-checked against BILKO_E2E_TEST_EMAIL env var. Non-matching returns HTTP 403 BILKO-AUTH-003. DB lookup never reached.</td></tr><tr><td>Constant-time compare (F2)</td><td>Replaced Kotlin != with MessageDigest.isEqual() on both testEmail and secret.</td></tr><tr><td>Dedicated rate-limit (F3)</td><td>5 req/min sub-bucket for /auth/test/session, independent of AUTH_RATE_LIMIT_PER_MINUTE.</td></tr><tr><td>Sentry audit (F3)</td><td>Sentry.captureMessage on any secret mismatch - structured event, not just warn log.</td></tr><tr><td>PERMANENT gate</td><td>F7-WHITELIST-GATE test added to ciam-auth-lifecycle.spec.ts (PR #332). Deploy pipeline now 3/3 tests; blocks on whitelist regression.</td></tr></tbody></table>
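
The whitelist and constant-time checks can be sketched together. The deployed fix is Kotlin and uses MessageDigest.isEqual(); the block below is only a TypeScript analogue built on Node's `crypto.timingSafeEqual`, not the production code. The check order (secret first, then email), the function names, and the config shape are assumptions chosen to be consistent with the probe results reported on this page. The rate limit and the Sentry call are left out.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time string equality. Both inputs are hashed first so the buffers
// always have equal length; timingSafeEqual throws on a length mismatch.
function constantTimeEquals(a: string, b: string): boolean {
  const ha = createHash("sha256").update(a, "utf8").digest();
  const hb = createHash("sha256").update(b, "utf8").digest();
  return timingSafeEqual(ha, hb);
}

// Hypothetical config shape; the fields stand in for the two env values.
interface GuardConfig {
  secret: string;       // BILKO_E2E_TOKEN_SECRET
  allowedEmail: string; // BILKO_E2E_TEST_EMAIL
}

// Returns the HTTP status the guard would produce before any DB lookup.
// Assumed order: wrong secret -> 401, then non-whitelisted email -> 403.
function guardTestSession(
  cfg: GuardConfig,
  secret: string,
  email: string,
): 200 | 401 | 403 {
  if (!constantTimeEquals(secret, cfg.secret)) return 401;
  if (!constantTimeEquals(email, cfg.allowedEmail)) return 403; // BILKO-AUTH-003
  return 200;
}
```

Because both checks run before any lookup, a caller holding the secret but presenting another user's email is rejected without touching the database.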

### Live proof (Proveo independent verification)

Source: /tmp/verify-103371/verification.json and /tmp/evidence-103371/verification.json

<table id="bkmrk-probeexpectedgotresu"><thead><tr><th>Probe</th><th>Expected</th><th>Got</th><th>Result</th></tr></thead><tbody><tr><td>valid secret + non-whitelisted email (attacker@example.com)</td><td>403</td><td>403</td><td>PASS</td></tr><tr><td>valid secret + seeded E2E email</td><td>200</td><td>200</td><td>PASS</td></tr><tr><td>wrong secret + seeded email</td><td>401</td><td>401</td><td>PASS</td></tr></tbody></table>

Deploy run 27274186928 (post-gate PR): success, 3/3 passed, F7-WHITELIST-GATE active.

### Residual open findings (fix before first real paying customers)

- **F4:** No enforced entropy minimum or rotation schedule for BILKO\_E2E\_TOKEN\_SECRET.
- **F1/F6:** Endpoint permanently registered when secret is set; on customer trial surface. Migrate E2E to ephemeral no-traffic revision before meaningful real-customer volume.

---

## 2. CIAM E2E Blocking Gate (MC #103365)

### Design

A Playwright spec (apps/e2e/tests/ciam-auth-lifecycle.spec.ts) runs as a mandatory blocking step in the GCP deploy pipeline (continue-on-error: false). It targets bilko-demo-api.alai.no specifically because the stage environment masks RLS bugs (a documented lesson). Source: /tmp/evidence-103365/verification.json

**Token-seed pattern:** spec calls POST /auth/test/session with the 64-char BILKO\_E2E\_TOKEN\_SECRET from GCP Secret Manager to obtain a real Bilko JWT, then exercises 9 authenticated steps:

1. POST /auth/test/session - 200 (JWT minted)
2. GET /auth/me - 200 (email + RLS identity confirmed)
3. GET /settings/users - 200 (tenant isolation: 1 user in org)
4. PUT /settings {vatNumber} - 200 (supplier OIB seeded)
5. POST /contacts - 201 (authenticated write)
6. POST /invoices - 201 (invoice create)
7. GET /invoices/{id} - 200 (RLS tenant read-back)
8. POST /auth/logout - 204 (refresh token revoked)
9. POST /auth/mobile/refresh (stale token) - 401 (revocation proven)
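
The nine steps above amount to an ordered table of requests and expected statuses, where the first mismatch fails the gate. A minimal sketch of that shape follows; it is not the actual Playwright spec. The `Step` type, the `send` transport, and `runGate` are hypothetical names, and request bodies and token handling are omitted.

```typescript
type Method = "GET" | "POST" | "PUT";

interface Step {
  method: Method;
  path: string;
  expect: number; // expected HTTP status
}

// The nine steps from the list above, in order.
const CIAM_STEPS: Step[] = [
  { method: "POST", path: "/auth/test/session", expect: 200 },
  { method: "GET", path: "/auth/me", expect: 200 },
  { method: "GET", path: "/settings/users", expect: 200 },
  { method: "PUT", path: "/settings", expect: 200 },
  { method: "POST", path: "/contacts", expect: 201 },
  { method: "POST", path: "/invoices", expect: 201 },
  { method: "GET", path: "/invoices/{id}", expect: 200 },
  { method: "POST", path: "/auth/logout", expect: 204 },
  { method: "POST", path: "/auth/mobile/refresh", expect: 401 },
];

// Hypothetical transport: performs the request and resolves to the status code.
type Send = (step: Step) => Promise<number>;

// Runs the steps in order and stops at the first unexpected status,
// reporting the 1-based step number that failed.
async function runGate(
  steps: Step[],
  send: Send,
): Promise<{ ok: boolean; failedAt?: number }> {
  let n = 0;
  for (const step of steps) {
    n += 1;
    const got = await send(step);
    if (got !== step.expect) return { ok: false, failedAt: n };
  }
  return { ok: true };
}
```

Note that step 9 passes only when the server answers 401, so a refresh token that still works after logout fails the gate.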

### Two-sided proof (Proveo)

- **Green:** all 9 steps pass in 1800ms. Deploy proceeds.
- **Red:** bad secret returns 401. Gate cannot be bypassed.

**Current state post-#103371:** 3/3 tests (original 9-step spec + F7-WHITELIST-GATE). Permanent blocking gate in every future deploy.

---

## 3. First Real Incident (503, 2026-06-10)

### Root cause (verified, /tmp/evidence-incident-503/finding.json)

**Transient Cloud Run scale-from-zero + revision cutover blips. NOT a code bug. No outage.**

- bilko-api-demo has minScale=0 (scale-to-zero). Revisions 00186 and 00187 deployed during the 20:36-20:43 UTC window (PR #330/#332 merges via CD).
- 503 latency: 11-16ms (immediate infra-level reject, no Kotlin stack trace).
- An interleaved 200 on the same endpoint at 20:36:44 confirmed service was otherwise healthy.
- Health endpoint bilko-demo-api.alai.no/api/v1/health returned 200 throughout.
- Alert pipeline worked: error-tracking alert fired and was investigated correctly.

### Actions taken

- Alert threshold tuned from >0 errors to **>3 errors/5 min**. Single deploy-cutover blips no longer page.
- min-instances=1 deferred until real paying customers (cost trade-off).
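
The tuned threshold is a strict "more than 3 errors within 5 minutes" test. The real condition lives in the alerting policy; the sketch below only illustrates the arithmetic, with a hypothetical function name and millisecond timestamps as an assumed input format.

```typescript
const WINDOW_MS = 5 * 60 * 1000;
const THRESHOLD = 3;

// True when more than THRESHOLD errors fall inside the 5-minute window
// that ends at nowMs. Exactly 3 errors does not alert.
function shouldAlert(errorTimestampsMs: number[], nowMs: number): boolean {
  const recent = errorTimestampsMs.filter(
    (t) => t > nowMs - WINDOW_MS && t <= nowMs,
  );
  return recent.length > THRESHOLD;
}
```

Under this rule a deploy cutover that produces two or three 503s stays silent, while a fourth error in the same window pages.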

### Sentinel blind-spot exposed and fixed (MC #103420)

The incident revealed the error-count policy (ID 2342970117877340710, "Bilko API Demo - Backend ERROR log rate") was not in the Sentinel's hardcoded ALERT\_POLICIES list. The Sentinel was silently blind to it. Source: /tmp/evidence-103420/verification.json

**Fix:** the Sentinel now discovers policies dynamically from a live `gcloud alpha monitoring policies list` call, backed by a 5-minute on-disk cache (~/system/state/bilko-sentinel-policy-cache.json) and an embedded fallback list. It evaluates 9 policies / 13 conditions per cycle. If gcloud fails, a fallback WARN log is emitted (no silent blind spots). A cache-hit log confirms the target policy on every cycle.
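
The discovery order described above (fresh cache, then live listing, then embedded fallback with a warning) might look like the following sketch. All names are hypothetical, `listLive` stands in for whatever wraps the gcloud call, and the cache store is injected so the example runs without touching disk; the Sentinel's actual code may differ.

```typescript
interface Policy {
  id: string;
  displayName: string;
}

interface CacheEntry {
  fetchedAtMs: number;
  policies: Policy[];
}

// Injected storage; a real implementation would read and write the JSON cache file.
interface CacheStore {
  read(): CacheEntry | null;
  write(entry: CacheEntry): void;
}

const CACHE_TTL_MS = 5 * 60 * 1000;

type Source = "cache" | "live" | "fallback";

function discoverPolicies(
  listLive: () => Policy[], // assumed to throw when the live listing fails
  store: CacheStore,
  fallback: Policy[],
  nowMs: number,
  warn: (msg: string) => void = console.warn,
): { source: Source; policies: Policy[] } {
  // 1. Serve from cache while it is younger than the TTL.
  const cached = store.read();
  if (cached && nowMs - cached.fetchedAtMs < CACHE_TTL_MS) {
    return { source: "cache", policies: cached.policies };
  }
  // 2. Otherwise list live and refresh the cache.
  try {
    const policies = listLive();
    store.write({ fetchedAtMs: nowMs, policies });
    return { source: "live", policies };
  } catch (err) {
    // 3. Never fail silently: warn, then use the embedded list.
    warn(`policy discovery failed, using embedded fallback: ${String(err)}`);
    return { source: "fallback", policies: fallback };
  }
}
```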

---

## 4. Engineering Decisions

*Decided by John (AI Director) under CEO delegation. Decision 2 is grounded in an advisory consult with Kelsey Hightower (SRE). Canonical file: /Users/makinja/business/ALAI-Holding-AS/products/Bilko/docs/infrastructure/DECISIONS-observability-2026-06-10.md*

### Decision 1 - E2E test-session endpoint location

**ACCEPT-WITH-HARDENING. Keep on demo. Overrides pre-fix MOVE\_OFF\_PROD verdict.**

Why: F7 (the CRITICAL basis of MOVE\_OFF\_PROD) is fully remediated. Residual controls are strong (64-char secret, constant-time compare, 5/min rate-limit, Sentry audit, F7-WHITELIST-GATE). Demo is where RLS coverage lives; moving to stage forfeits the gate's purpose. No real customers yet; LOW residual risk.

Future trigger: before meaningful real-customer volume, migrate to a dedicated ephemeral no-traffic Cloud Run revision. Rotate secret on schedule.

### Decision 2 - Tier-1 auto-remediation

**Do NOT enable Tier-1 now. Stay Tier-0 (read-only). Earn promotion via the bar below.**

Grounding: recent agent-caused production incidents (IAM wipe; F7 hole introduced by an agent's own change); Tier-0 is hours old, zero signal calibration; Cloud Run rollback is not migration-aware on a financial system.

#### Kelsey Hightower (SRE) consult

Independent conclusion: automated remediation is justified only after (a) a calibrated track record of correct proposals, (b) migration-safe rollbacks, and (c) an action set signed off by a human engineer. The promotion bar operationalizes this.

#### Promotion bar: Tier-0 to Tier-1 (ALL must be true)

1. 30+ days Tier-0 live AND 20+ evaluated proposals (extend window until 20 proposals).
2. Proposal false-positive rate below 5% (human verdict within 24h; "root cause wrong" or "fix would worsen" = FP).
3. ZERO proposals that would have caused a secondary incident if auto-executed.
4. At least 1 ground-truth case: Tier-0 diagnosed correctly, human executed that exact fix, it resolved the incident.
5. Schema-deploy coupling audit complete + deploy manifest records migrations per revision (rollback safety).
6. Synthetic Entra-CIAM auth probe added to observability (bad rollback can break auth silently).
7. Revisions that are themselves rollbacks are tagged (never roll back to a known-bad revision).
8. Tier-1 action set signed off by a human engineer (not just CEO).
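
Since all eight criteria must hold at once, the bar can be expressed as a single check that reports which criteria are still unmet. The sketch below is illustrative only; the record shape and field names are assumptions, not the ledger's actual schema.

```typescript
// Hypothetical summary of the Tier-0 calibration period.
interface Tier0Record {
  daysLive: number;
  evaluatedProposals: number;
  falsePositives: number;           // "root cause wrong" or "fix would worsen"
  wouldHaveCausedIncident: number;  // criterion 3
  groundTruthCases: number;         // criterion 4
  couplingAuditDone: boolean;       // criterion 5
  authProbeLive: boolean;           // criterion 6
  rollbackRevisionsTagged: boolean; // criterion 7
  engineerSignOff: boolean;         // criterion 8
}

// Returns the numbers of the criteria not yet met; an empty array means promotable.
function unmetCriteria(r: Tier0Record): number[] {
  // "Below 5%" as an integer comparison to avoid floating-point edge cases.
  const fpBelow5Pct =
    r.evaluatedProposals > 0 && r.falsePositives * 100 < 5 * r.evaluatedProposals;
  const checks: Array<[number, boolean]> = [
    [1, r.daysLive >= 30 && r.evaluatedProposals >= 20],
    [2, fpBelow5Pct],
    [3, r.wouldHaveCausedIncident === 0],
    [4, r.groundTruthCases >= 1],
    [5, r.couplingAuditDone],
    [6, r.authProbeLive],
    [7, r.rollbackRevisionsTagged],
    [8, r.engineerSignOff],
  ];
  return checks.filter(([, ok]) => !ok).map(([n]) => n);
}
```

One consequence worth noting: at the minimum sample of 20 proposals, a single false positive is exactly 5%, which is not below 5%, so criterion 2 then requires either zero false positives or a larger sample.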

#### Tier-1 permitted actions (enforced, not advisory)

- Permitted only: roll back to N-1, scale min-instances 0 to 1, Slack escalation.
- Never-automate (must be enforced at the IAM layer, not just in code): any IAM/policy change, any Cloud SQL operation, any secret operation, any DNS/LB/network change, rollback to anything older than N-1, any action during an in-flight deploy or protected business window.
- Pre-fire (ALL must be true): alert firing 5+ min; LLM confidence above calibrated threshold; target revision healthy 10+ min; no migration in bad revision; no prior action in last 60 min; 3-min human-ack window elapsed.
- Circuit breakers: max 2 actions/24h; auto-disable after any failed remediation; pre-action IAM-diff vs known-good snapshot; single-writer lock; audit log written before execution.
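
The pre-fire conditions and two of the circuit breakers lend themselves to a single gate function that returns every reason an action is blocked. The following is a sketch under stated assumptions: the input shape and names are hypothetical, the confidence threshold is a parameter because the calibrated value is not given here, and the IAM-diff, single-writer lock, and audit-log breakers are not modeled.

```typescript
// Hypothetical snapshot of the state evaluated before a Tier-1 action.
interface PreFireInput {
  alertFiringMin: number;
  llmConfidence: number;
  confidenceThreshold: number;       // calibrated value, not specified here
  targetHealthyMin: number;
  badRevisionHasMigration: boolean;
  minSinceLastAction: number | null; // null means no prior action
  humanAckWindowElapsedMin: number;
  actionsLast24h: number;
  disabledAfterFailure: boolean;
}

// Returns every blocking reason; an empty array means the action may fire.
function blockedReasons(i: PreFireInput): string[] {
  const reasons: string[] = [];
  if (i.alertFiringMin < 5) reasons.push("alert firing for less than 5 min");
  if (i.llmConfidence <= i.confidenceThreshold) reasons.push("confidence not above threshold");
  if (i.targetHealthyMin < 10) reasons.push("target revision healthy for less than 10 min");
  if (i.badRevisionHasMigration) reasons.push("bad revision contains a migration");
  if (i.minSinceLastAction !== null && i.minSinceLastAction < 60) {
    reasons.push("prior action within last 60 min");
  }
  if (i.humanAckWindowElapsedMin < 3) reasons.push("3-min human-ack window not elapsed");
  if (i.actionsLast24h >= 2) reasons.push("circuit breaker: 2 actions already taken in 24h");
  if (i.disabledAfterFailure) reasons.push("circuit breaker: disabled after failed remediation");
  return reasons;
}
```

Collecting all reasons, instead of returning on the first failure, keeps the audit log complete for each declined action.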

---

## 5. Tier-1 Arming Prerequisites (MC #103439, Securion MC #103436)

*Source: /tmp/evidence-103436/verification.json. Securion verdict: HARDENING\_REQUIRED.*

Shadow is structurally inert today. Dual barrier confirmed: (1) handleIncident() returns before execution block when MODE==='shadow'; (2) executeRollback and executeScaleFloor throw as their first statement in shadow. Two independent mechanisms. No mutation path exists in shadow.
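
The dual barrier can be illustrated as follows. The function names `handleIncident` and `executeRollback` come from the review; everything else (the factory, the log array, the return values) is a hypothetical sketch and not the Tier-1 module's code.

```typescript
type Mode = "shadow" | "ack" | "auto";

function makeTier1(mode: Mode, log: string[]) {
  function executeRollback(revision: string): void {
    // Barrier 2: the executor refuses to run in shadow, as its first statement.
    if (mode === "shadow") throw new Error("executeRollback blocked in shadow mode");
    log.push(`rollback:${revision}`);
  }

  function handleIncident(proposedRevision: string): "logged" | "executed" {
    log.push(`proposal:${proposedRevision}`);
    // Barrier 1: return before the execution block is ever reached.
    if (mode === "shadow") return "logged";
    executeRollback(proposedRevision);
    return "executed";
  }

  return { handleIncident, executeRollback };
}
```

The barriers are independent: removing the early return in `handleIncident` would still leave the executor throwing, and calling the executor directly bypasses nothing.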

### Hard blockers before ack or auto mode

<table id="bkmrk-findingseverityrequi"><thead><tr><th>Finding</th><th>Severity</th><th>Required fix before arming</th></tr></thead><tbody><tr><td>F5 - Ledger has no integrity protection</td><td>HIGH</td><td>bilko-sentinel-tier1-ledger.jsonl is writable and unsigned. Forged human_verdict entries could satisfy the promotion bar. Fix: HMAC-sign each row; verify on read. BLOCK ack/auto until done.</td></tr><tr><td>F7 - SA actual GCP IAM roles unverified</td><td>MEDIUM</td><td>alai-cli-deployer SA project-level bindings not verified at review time. In shadow: must hold only monitoring.viewer + logging.viewer + run.viewer. For auto: roles/run.developer scoped to bilko-api-demo and bilko-web-demo only (resource-level condition). Must NOT hold cloudsql.*, iam.*, secretmanager.*, or dns.* roles.</td></tr><tr><td>F4 - Ack poll unwired; future identity risk</td><td>INFO (future MEDIUM)</td><td>When ack poll is wired: approver must be verified against hardcoded Slack user ID allowlist. Channel membership not sufficient. Require thread_ts match to prevent cross-incident approval.</td></tr></tbody></table>
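
For the ledger finding, "HMAC-sign each row; verify on read" could take roughly the following form. This is a minimal sketch using Node's `crypto` module; the row shape, the wrapper format, and the function names are assumptions, and key storage is out of scope.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical row shape; the real ledger schema is not shown on this page.
interface LedgerRow {
  [field: string]: string | number | boolean | null;
}

function macFor(payload: string, key: string): Buffer {
  return createHmac("sha256", key).update(payload, "utf8").digest();
}

// Produces one JSONL line. The row is stored as a string so the exact
// signed bytes are preserved and no JSON canonicalization is needed.
function signRow(row: LedgerRow, key: string): string {
  const payload = JSON.stringify(row);
  return JSON.stringify({ payload, mac: macFor(payload, key).toString("hex") });
}

// Returns the row when the MAC verifies, otherwise null.
function verifyLine(line: string, key: string): LedgerRow | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const { payload, mac } = parsed as { payload?: unknown; mac?: unknown };
  if (typeof payload !== "string" || typeof mac !== "string") return null;
  const expected = macFor(payload, key);
  const given = Buffer.from(mac, "hex");
  // timingSafeEqual requires equal lengths, so check that first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  return JSON.parse(payload) as LedgerRow;
}
```

Per-row MACs authenticate individual entries only; they would not detect rows that are deleted or reordered, which would need something like a running chain over previous MACs.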

### Additional findings (non-blocking for shadow)

<table id="bkmrk-findingseveritysumma"><thead><tr><th>Finding</th><th>Severity</th><th>Summary</th></tr></thead><tbody><tr><td>F6 - IAM snapshot bootstrap window</td><td>MEDIUM</td><td>Snapshot can be deleted to force re-baseline. Seal after first write; alert on deletion/recreation.</td></tr><tr><td>F2 - Object.freeze({MODE}) is a no-op</td><td>MEDIUM</td><td>Misleading call; remove or replace with comment. MODE is immutable by JS const semantics in strict mode.</td></tr><tr><td>F8 - Gate 8 inconsistent with Gate 4</td><td>LOW</td><td>Gate 8 warns-and-passes when deploy manifest absent; Gate 4 blocks. Align to block.</td></tr><tr><td>F3 - Module integrity not checked at load</td><td>LOW</td><td>Add SHA-256 startup integrity check for Tier-1 module path.</td></tr><tr><td>F9 - Tier-1 missing execute bit</td><td>LOW</td><td>chmod +x /Users/makinja/system/tools/bilko-sentinel-tier1.js (cosmetic, no runtime impact).</td></tr></tbody></table>
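
The Object.freeze finding is easy to demonstrate: freezing `{MODE}` freezes a newly created wrapper object and has no effect on the variable itself. In the sketch below the variable is declared with `let` purely so the reassignment compiles; the lowercase names are illustrative.

```typescript
// Declared with `let` only to show that the freeze does not protect the binding.
let mode: string = "shadow";

// Creates a new object { mode: "shadow" } and freezes that object, nothing else.
const wrapper = Object.freeze({ mode });

// Succeeds: the variable was never frozen.
mode = "auto";

const bindingChanged = mode === "auto";            // the binding was reassigned
const wrapperUnchanged = wrapper.mode === "shadow"; // the frozen copy kept the old value
const wrapperIsFrozen = Object.isFrozen(wrapper);   // only the throwaway object is frozen
```

With a `const` declaration the binding is already protected against reassignment, so the freeze call adds nothing and can be removed as the finding recommends.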

**Current state:** Tier-1 running in shadow mode. All proposals logged to ~/system/logs/bilko-sentinel-tier1-ledger.jsonl. Calibration clock started. Review at 30 days / 20 proposals.