Bilko Observability & Self-Healing: Program Overview (MC #103328)

This is the single entry point for the entire Bilko observability and self-healing program. It links every sub-page, states current status, and gives a quick orientation for any CEO, agent, or engineer arriving for the first time.

Environment Topology

| Tier | Cloud Run services | Public URL | Purpose |
| --- | --- | --- | --- |
| PROD (demo) | bilko-api-demo, bilko-web-demo | app.bilko.cloud / bilko-demo-api.alai.no | Live trial surface; real prospects register here |
| STAGE | bilko-api-stage, bilko-web-stage | stage.bilko.cloud (internal) | CI validation; masks some RLS bugs (documented lesson) |
| DORMANT | bilko-web | - | Old web service; superseded by bilko-web-demo |

GCP project: tribal-sign-487920-k0. All observability targets bilko-api-demo and bilko-web-demo as the canonical production equivalents.

Program Arc (2026-06-09 → 2026-06-11)

- GCP-native observability baseline (MC #103329/P1-A): Cloud Monitoring dashboard (070613fa…), latency/traffic/saturation/5xx alerts wired end-to-end.
- Validation (MC #103331): Proveo independent PASS confirming all alert signals fire correctly.
- Docs (MC #103332): Initial BookStack page published.
- Sentinel Tier-0 built (MC #103337): Read-only agent: detect → diagnose via Ollama → propose → notify. Proveo PASS. LaunchAgent running at PID 11465 (com.alai.bilko-sentinel).
- CD-green + GCP error-tracking (MC #103364): CD pipeline repaired; error log metric + alert added. Threshold tuned to >3 errors/5 min after the first real incident (see below).
- CIAM E2E blocking gate (MC #103365): Playwright CIAM auth-lifecycle spec added as a mandatory blocking step in the deploy pipeline. Proveo two-sided PASS (green + fails-on-broken).
- CRITICAL security review (MC #103369, Securion): /auth/test/session endpoint on bilko-demo-api found to accept arbitrary email → impersonation risk (F7 = CRITICAL). Full findings + F7 remediation path issued.
- F7 security fix deployed (MC #103371): Email whitelist, constant-time compare, 5/min rate-limit, Sentry audit. Verified live: attacker email → 403, seeded email → 200. CIAM E2E gate now 3/3 (F7-WHITELIST-GATE added). PR #330 merged, PR #332 (gate) merged.
- First real incident handled (503, 2026-06-10): Transient 503s during Cloud Run revision cutover/scale-from-zero (not a code bug). Alert pipeline worked. Threshold tuned >0 → >3. See Security & Decisions page for full post-mortem.
- Sentinel dynamic-discovery fix (MC #103420): Sentinel was missing the error-count policy (hardcoded list). Fixed to live gcloud discovery + 5-min cache + embedded fallback. 9 policies now evaluated each cycle, 13 conditions total.
- Tier-1 shadow (MC #103435): Shadow-only armed auto-remediation module built. Structurally inert (dual barrier confirmed by Securion). Will NOT be promoted to ack/auto without clearing the promotion bar.
- Securion Tier-1 review (MC #103436): Parisa Tabriz lens review. Shadow inert confirmed. 8 findings; 3 must be resolved before ack/auto (F5 HMAC ledger, F7 SA IAM scope, F4 ack allowlist). See Security & Decisions page.
- Tier-1 arming prerequisites (MC #103439): Hard blockers catalogued. Tier-0 calibration clock starts now. Review at 30 days.
- Dashboard maturity roadmap (MC #103393): Backlog. SLOs, error-rate tiles, distributed tracing, business metrics. CEO decision: document now, build before meaningful paying-customer volume.
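The Sentinel dynamic-discovery fix in the arc above (live discovery, 5-minute cache, embedded fallback) follows a common pattern, sketched below in Python. This is an illustration under stated assumptions, not the Sentinel's actual code: the `PolicyDiscovery` class, the injected `fetch` callable, and the fallback policy names are all hypothetical, and the real agent presumably wraps a gcloud call where this sketch takes a plain function. Preferring a stale cache over the embedded list when discovery fails is also an assumption.

```python
import time
from typing import Callable, List, Optional

CACHE_TTL_SECONDS = 300  # the 5-minute cache mentioned in the arc

# Hypothetical embedded fallback; the real policy names are not on this page.
EMBEDDED_FALLBACK = ["example-latency-policy", "example-5xx-policy"]


class PolicyDiscovery:
    """Return alert policy names: live fetch, then cache, then fallback."""

    def __init__(self, fetch: Callable[[], List[str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # assumed to wrap the live discovery call
        self._clock = clock    # injectable so the TTL can be tested
        self._cached: Optional[List[str]] = None
        self._cached_at = 0.0

    def policies(self) -> List[str]:
        now = self._clock()
        # Serve from cache while it is younger than the TTL.
        if self._cached is not None and now - self._cached_at < CACHE_TTL_SECONDS:
            return self._cached
        try:
            fresh = self._fetch()
        except Exception:
            # Live discovery failed: use the stale cache if there is one,
            # otherwise fall back to the embedded list.
            if self._cached is not None:
                return self._cached
            return list(EMBEDDED_FALLBACK)
        self._cached, self._cached_at = fresh, now
        return fresh
```

With this shape, a newly added alert policy is picked up within one cache window without a code change, which is the failure mode the hardcoded list had.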
Master Status Table

| MC | Title | Status | Evidence / Notes |
| --- | --- | --- | --- |
| #103329 (P1-A) | GCP-native observability | DONE (Proveo PASS) | Dashboard 070613fa…; alerts wired |
| #103331 | Validation | DONE (Proveo PASS) | All alert signals verified |
| #103332 | Docs (initial) | DONE | Page 3101 published |
| #103337 | Tier-0 Sentinel build | DONE (Proveo PASS) | LaunchAgent PID 11465 live |
| #103364 | CD-fix + error-tracking | DONE | Threshold >3/5min after 503 incident |
| #103365 | CIAM E2E blocking gate | DONE (Proveo PASS) | 2-sided proof; gate blocks on broken |
| #103369 | Securion test-endpoint review | DONE; verdict MOVE_OFF_PROD (pre-fix), overridden post-F7 fix per Decision 1 | /tmp/evidence-103369/verification.json |
| #103371 | F7 security fix | DONE (Proveo PASS) | PR #330+#332 merged; 3/3 gate; live proof attacker→403 |
| #103393 | Dashboard maturity roadmap | BACKLOG (not-now) | SLOs, tracing, business metrics; before real paying customers |
| #103420 | Sentinel dynamic-discovery fix | DONE (AgentForge PASS) | 9 policies, 5-min cache, embedded fallback |
| #103435 | Tier-1 shadow build | DONE (shadow inert) | Dual barrier; Securion review attached |
| #103436 | Securion Tier-1 review | DONE (HARDENING_REQUIRED before ack/auto) | 8 findings; F5/F7/F4 block arming |
| #103439 | Tier-1 arming prerequisites | IN PROGRESS (calibration clock running) | 30-day / 20-proposal bar; see Decisions page |

Key Live URLs

- GCP Monitoring Dashboard: https://console.cloud.google.com/monitoring/dashboards?project=tribal-sign-487920-k0 (filter: slug 070613fa…)
- Demo API health: https://bilko-demo-api.alai.no/api/v1/health
- Demo app: https://app.bilko.cloud

Documentation Map

| Page | What it covers |
| --- | --- |
| Page 3101: Bilko Observability (GCP-native) | Cloud Monitoring setup, dashboard tiles, alert policies, runbook for each alert type |
| Page 3102: Bilko Sentinel Tier-0 | Tier-0 agent design, how detect/diagnose/propose/notify works, LaunchAgent config, audit log path |
| Page 3106: Bilko Sentinel Tier-1 (shadow-first) | Tier-1 architecture, shadow barriers, action set, circuit breakers, pre-fire gates |
| This page: Program Overview (#103328) | Single entry point: arc, status table, links to all sub-pages |
| Page: Security & Engineering Decisions | F7 security fix, CIAM gate design, first incident post-mortem, Tier-1 arming prerequisites, architectural decisions |
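The F7 security fix referenced on this page combines an email whitelist, a constant-time compare, and a 5/min rate limit. A minimal Python sketch of how those three checks can compose follows. Everything named here is hypothetical (`ALLOWED_EMAILS`, `check_test_session`, the seeded address), the deployed endpoint's language and framework are not stated on this page, and two details are assumptions: that the limit is keyed per client, and that throttled callers receive a 429. The Sentry audit step is omitted.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

# Hypothetical seeded test account; the real whitelist is not on this page.
ALLOWED_EMAILS = ("seeded-user@example.com",)
RATE_LIMIT = 5       # requests
RATE_WINDOW = 60.0   # seconds, i.e. 5 per minute

_recent: Dict[str, Deque[float]] = defaultdict(deque)


def is_whitelisted(email: str) -> bool:
    """Check the email against every entry using hmac.compare_digest."""
    candidate = email.strip().lower().encode("utf-8")
    matched = False
    for allowed in ALLOWED_EMAILS:
        # No early exit, so timing does not reveal which entry matched.
        matched |= hmac.compare_digest(candidate, allowed.encode("utf-8"))
    return matched


def within_rate_limit(client_key: str, now: float) -> bool:
    """Sliding window: at most RATE_LIMIT calls per RATE_WINDOW seconds."""
    window = _recent[client_key]
    while window and now - window[0] >= RATE_WINDOW:
        window.popleft()  # drop timestamps that have aged out
    if len(window) >= RATE_LIMIT:
        return False
    window.append(now)
    return True


def check_test_session(email: str, client_key: str,
                       clock: Callable[[], float] = time.monotonic) -> int:
    """Return the HTTP status this sketch would answer with."""
    if not within_rate_limit(client_key, clock()):
        return 429  # assumed status for throttled callers
    return 200 if is_whitelisted(email) else 403
```

In this sketch a non-whitelisted address gets 403 and the seeded address gets 200, mirroring the live verification recorded for MC #103371.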