# Bilko Observability & Self-Healing — Program Overview (MC #103328)

**This is the single entry point for the entire Bilko observability and self-healing program.** It links every sub-page, states current status, and orients any CEO, agent, or engineer arriving for the first time.

---

## Environment Topology

<table id="bkmrk-tiercloud-run-servic"><thead><tr><th>Tier</th><th>Cloud Run services</th><th>Public URL</th><th>Purpose</th></tr></thead><tbody><tr><td>PROD (demo)</td><td>bilko-api-demo, bilko-web-demo</td><td>app.bilko.cloud / bilko-demo-api.alai.no</td><td>Live trial surface — real prospects register here</td></tr><tr><td>STAGE</td><td>bilko-api-stage, bilko-web-stage</td><td>stage.bilko.cloud (internal)</td><td>CI validation; masks some RLS bugs (documented lesson)</td></tr><tr><td>DORMANT</td><td>bilko-web</td><td>—</td><td>Old web service; superseded by bilko-web-demo</td></tr></tbody></table>

**GCP project:** tribal-sign-487920-k0. All observability tooling targets bilko-api-demo and bilko-web-demo, which serve as the canonical production equivalents.

---

## Program Arc (2026-06-09 → 2026-06-11)

1. **GCP-native observability baseline** (MC #103329/P1-A) — Cloud Monitoring dashboard (070613fa…), latency/traffic/saturation/5xx alerts wired end-to-end.
2. **Validation** (MC #103331) — Proveo independent PASS confirming all alert signals fire correctly.
3. **Docs** (MC #103332) — Initial BookStack page published.
4. **Sentinel Tier-0 built** (MC #103337) — Read-only agent: detect → diagnose via Ollama → propose → notify. Proveo PASS. LaunchAgent running at PID 11465 (com.alai.bilko-sentinel).
5. **CD-green + GCP error-tracking** (MC #103364) — CD pipeline repaired; error log metric + alert added. Threshold tuned to >3 errors/5 min after the first real incident (see below).
6. **CIAM E2E blocking gate** (MC #103365) — Playwright CIAM auth-lifecycle spec added as a mandatory blocking step in the deploy pipeline. Proveo two-sided PASS (green + fails-on-broken).
7. **CRITICAL security review** (MC #103369, Securion) — /auth/test/session endpoint on bilko-demo-api found to accept arbitrary email → impersonation risk (F7 = CRITICAL). Full findings + F7 remediation path issued.
8. **F7 security fix deployed** (MC #103371) — Email whitelist, constant-time compare, 5/min rate-limit, Sentry audit. Verified live: attacker email → 403, seeded email → 200. CIAM E2E gate now 3/3 (F7-WHITELIST-GATE added). PR #330 merged, PR #332 (gate) merged.
9. **First real incident handled (503, 2026-06-10)** — Transient 503s during Cloud Run revision cutover/scale-from-zero (not a code bug). The alert pipeline worked. Error-alert threshold raised from >0 to >3 errors per 5 minutes. See the Security & Decisions page for the full post-mortem.
10. **Sentinel dynamic-discovery fix** (MC #103420) — Sentinel was missing the error-count policy (hardcoded list). Fixed to live gcloud discovery + 5-min cache + embedded fallback. 9 policies now evaluated each cycle, 13 conditions total.
11. **Tier-1 shadow** (MC #103435) — Shadow-only armed auto-remediation module built. Structurally inert (dual barrier confirmed by Securion). Will NOT be promoted to ack/auto without clearing the promotion bar.
12. **Securion Tier-1 review** (MC #103436) — Parisa Tabriz lens review. Shadow inert confirmed. 8 findings; 3 must be resolved before ack/auto (F5 HMAC ledger, F7 SA IAM scope, F4 ack allowlist). See Security & Decisions page.
13. **Tier-1 arming prerequisites** (MC #103439) — Hard blockers catalogued. Tier-0 calibration clock starts now. Review at 30 days.
14. **Dashboard maturity roadmap** (MC #103393) — Backlog. SLOs, error-rate tiles, distributed tracing, business metrics. CEO decision: document now, build before meaningful paying-customer volume.
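
Item 8 names three controls in the F7 fix: an email whitelist, a constant-time compare, and a 5/min rate limit. The following is a minimal Python sketch of how such controls could fit together. The allowlist entry, the function names, the per-client keying, and the 429 status for throttled requests are illustrative assumptions, not the deployed implementation; only the 403 and 200 outcomes come from the live verification described above.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

# Hypothetical allowlist; the real seeded addresses are not listed on this page.
ALLOWED_EMAILS = ("seeded-user@example.test",)
MAX_REQUESTS = 5          # "5/min" from the F7 fix description
WINDOW_SECONDS = 60.0

# Client id -> timestamps of that client's recent requests (assumed keying).
_recent: Dict[str, Deque[float]] = defaultdict(deque)


def is_allowed_email(candidate: str) -> bool:
    """Check the candidate against every allowlist entry."""
    candidate_bytes = candidate.strip().lower().encode("utf-8")
    matched = False
    for allowed in ALLOWED_EMAILS:
        # compare_digest does not exit early on the first differing byte.
        if hmac.compare_digest(candidate_bytes, allowed.encode("utf-8")):
            matched = True  # keep looping so every entry is always compared
    return matched


def within_rate_limit(client_id: str, now: Optional[float] = None) -> bool:
    """Sliding-window limiter: at most MAX_REQUESTS per WINDOW_SECONDS."""
    now = time.monotonic() if now is None else now
    window = _recent[client_id]
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True


def session_status(email: str, client_id: str, now: Optional[float] = None) -> int:
    """Return the HTTP status a guarded test-session endpoint might send."""
    if not within_rate_limit(client_id, now):
        return 429  # assumed status for throttled callers
    return 200 if is_allowed_email(email) else 403
```

In this sketch the rate limit is checked before the allowlist, so rejected emails also count toward the limit and repeated guessing is throttled.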

---

## Master Status Table

<table id="bkmrk-mctitlestatusevidenc"><thead><tr><th>MC</th><th>Title</th><th>Status</th><th>Evidence / Notes</th></tr></thead><tbody><tr><td>#103329 (P1-A)</td><td>GCP-native observability</td><td>DONE — Proveo PASS</td><td>Dashboard 070613fa…; alerts wired</td></tr><tr><td>#103331</td><td>Validation</td><td>DONE — Proveo PASS</td><td>All alert signals verified</td></tr><tr><td>#103332</td><td>Docs (initial)</td><td>DONE</td><td>Page 3101 published</td></tr><tr><td>#103337</td><td>Tier-0 Sentinel build</td><td>DONE — Proveo PASS</td><td>LaunchAgent PID 11465 live</td></tr><tr><td>#103364</td><td>CD-fix + error-tracking</td><td>DONE</td><td>Threshold &gt;3/5min after 503 incident</td></tr><tr><td>#103365</td><td>CIAM E2E blocking gate</td><td>DONE — Proveo PASS</td><td>2-sided proof; gate blocks on broken</td></tr><tr><td>#103369</td><td>Securion test-endpoint review</td><td>DONE — verdict MOVE_OFF_PROD (pre-fix); overridden post-F7 fix per Decision 1</td><td>/tmp/evidence-103369/verification.json</td></tr><tr><td>#103371</td><td>F7 security fix</td><td>DONE — Proveo PASS</td><td>PR #330+#332 merged; 3/3 gate; live proof attacker→403</td></tr><tr><td>#103393</td><td>Dashboard maturity roadmap</td><td>BACKLOG (not-now)</td><td>SLOs, tracing, business metrics — before real paying customers</td></tr><tr><td>#103420</td><td>Sentinel dynamic-discovery fix</td><td>DONE — AgentForge PASS</td><td>9 policies, 5-min cache, embedded fallback</td></tr><tr><td>#103435</td><td>Tier-1 shadow build</td><td>DONE — shadow inert</td><td>Dual barrier; Securion review attached</td></tr><tr><td>#103436</td><td>Securion Tier-1 review</td><td>DONE — HARDENING_REQUIRED before ack/auto</td><td>8 findings; F5/F7/F4 block arming</td></tr><tr><td>#103439</td><td>Tier-1 arming prerequisites</td><td>IN PROGRESS — calibration clock running</td><td>30-day / 20-proposal bar; see Decisions page</td></tr></tbody></table>
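
The Sentinel dynamic-discovery row (MC #103420) combines live gcloud discovery, a 5-minute cache, and an embedded fallback. As a rough illustration of that pattern, here is a hedged Python sketch. The class name, the fallback entries, the exact `gcloud` invocation, and the choice to prefer a stale cache over the fallback when discovery fails are all assumptions; the actual Sentinel code is not reproduced on this page.

```python
import json
import subprocess
import time
from typing import Callable, List, Optional

CACHE_TTL_SECONDS = 300.0  # the 5-minute cache from MC #103420

# Hypothetical embedded fallback; the real policy list is not shown here.
EMBEDDED_FALLBACK = ["fallback-policy-a", "fallback-policy-b"]


def gcloud_list_policies() -> List[str]:
    """Assumed discovery call; Sentinel's real invocation may differ."""
    out = subprocess.run(
        ["gcloud", "alpha", "monitoring", "policies", "list", "--format=json"],
        check=True, capture_output=True, text=True, timeout=30,
    ).stdout
    return [policy["name"] for policy in json.loads(out)]


class PolicyDiscovery:
    """Live discovery with a TTL cache and a last-resort embedded list."""

    def __init__(
        self,
        fetch: Callable[[], List[str]] = gcloud_list_policies,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._cached: Optional[List[str]] = None
        self._cached_at = 0.0

    def policies(self) -> List[str]:
        now = self._clock()
        if self._cached is not None and now - self._cached_at < CACHE_TTL_SECONDS:
            return self._cached  # cache still fresh, skip the gcloud call
        try:
            fresh = self._fetch()
        except Exception:
            # Assumed ordering: stale cache first, embedded fallback last.
            if self._cached is not None:
                return self._cached
            return list(EMBEDDED_FALLBACK)
        self._cached, self._cached_at = fresh, now
        return fresh
```

Injecting `fetch` and `clock` keeps the sketch testable without a GCP project; a failed refresh leaves the cache timestamp untouched, so the next cycle retries discovery.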

---

## Key Live URLs

- **GCP Monitoring Dashboard:** https://console.cloud.google.com/monitoring/dashboards?project=tribal-sign-487920-k0 (filter: slug 070613fa…)
- **Demo API health:** https://bilko-demo-api.alai.no/api/v1/health
- **Demo app:** https://app.bilko.cloud
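
For a quick manual check of the demo API health URL listed above, a small Python probe along these lines may be handy. It is a sketch only: it assumes HTTP 200 means healthy and makes no assumption about the response body, which is not documented on this page. The function names are illustrative.

```python
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "https://bilko-demo-api.alai.no/api/v1/health"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses (e.g. 503) arrive as HTTPError


def is_healthy(
    url: str = HEALTH_URL,
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """Assumed health rule: exactly HTTP 200 counts as healthy."""
    try:
        return get_status(url) == 200
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
```

A transient 503 during a revision cutover would show up here as `unhealthy`, so a single failed probe is not by itself evidence of a code bug.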

---

## Documentation Map

<table id="bkmrk-pagewhat-it-covers-p"><thead><tr><th>Page</th><th>What it covers</th></tr></thead><tbody><tr><td><a href="https://docs.alai.no/books/bilko-balkan-accounting-saas/page/bilko-observability-gcp-native-2026-06-10">Page 3101 — Bilko Observability (GCP-native)</a></td><td>Cloud Monitoring setup, dashboard tiles, alert policies, runbook for each alert type</td></tr><tr><td><a href="https://docs.alai.no/books/bilko-balkan-accounting-saas/page/bilko-sentinel-tier-0-self-healing-agent-2026-06-10">Page 3102 — Bilko Sentinel Tier-0</a></td><td>Tier-0 agent design, how detect/diagnose/propose/notify works, LaunchAgent config, audit log path</td></tr><tr><td><a href="https://docs.alai.no/books/bilko-balkan-accounting-saas/page/bilko-sentinel-tier-1-bounded-auto-remediation-shadow-first-2026-06-11">Page 3106 — Bilko Sentinel Tier-1 (shadow-first)</a></td><td>Tier-1 architecture, shadow barriers, action set, circuit breakers, pre-fire gates</td></tr><tr><td><strong>This page</strong> — Program Overview (#103328)</td><td>Single entry point: arc, status table, links to all sub-pages</td></tr><tr><td><a href="https://docs.alai.no/books/bilko-balkan-accounting-saas/page/bilko-security-engineering-decisions-observability-program">Page — Security &amp; Engineering Decisions</a></td><td>F7 security fix, CIAM gate design, first incident post-mortem, Tier-1 arming prerequisites, architectural decisions</td></tr></tbody></table>