Bilko Sentinel — Tier-0 Self-Healing Agent 2026-06-10
Status
LIVE and Proveo-verified as of 2026-06-10. MC #103337 (AgentForge implementation) + MC #103337 Proveo independent verification. Parent MC #103328.
What It Is
Bilko Sentinel is a read-only ops agent that runs on ANVIL every 3 minutes. It follows a four-stage pipeline:
- Detect: queries the conditions of the 8 GCP Cloud Monitoring alert policies (10 conditions in total) via the Monitoring REST API (GET only), then evaluates the last 6 minutes of time-series data locally against each condition's threshold.
- Enrich: on a breach, fetches recent Cloud Run logs and the current revision/traffic split for the affected service.
- Diagnose: calls FORGE Ollama (qwen2.5:7b-instruct-q8_0 at 10.0.0.2:11434) with a structured JSON prompt (temperature 0.1) to produce a root-cause hypothesis and a recommended action. Falls back to a deterministic template per cause category if Ollama is unreachable.
- Propose: posts exactly one structured proposal per unique incident to Slack #ceo and email [email protected]. Deduplicates by incident key; does not re-notify the same breach for 24 hours.
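The Diagnose stage and its fallback might look roughly like the sketch below. The request body uses the fields of Ollama's /api/generate endpoint (model, prompt, stream, format, options.temperature). The function names, the prompt wording, the cause categories, and the fallback templates are all assumptions for illustration, not the contents of bilko-sentinel.js. It assumes Node 18+ for the global fetch.

```javascript
// Sketch only: names, prompt text, categories, and templates are hypothetical.
const OLLAMA_URL = "http://10.0.0.2:11434/api/generate";
const MODEL = "qwen2.5:7b-instruct-q8_0";

// Hypothetical deterministic templates, one per cause category.
const FALLBACK_TEMPLATES = new Map([
  ["latency", { hypothesis: "Latency above threshold; possible cold starts or slow queries.", action: "Review recent revisions and slow-query logs." }],
  ["error_rate", { hypothesis: "Elevated 5xx rate; possible bad revision.", action: "Compare error logs against the last deployed revision." }],
  ["unknown", { hypothesis: "Cause not classified.", action: "Inspect the alert in the GCP Console." }],
]);

// Build the JSON body for POST /api/generate.
function buildOllamaRequest(incident) {
  return {
    model: MODEL,
    stream: false,                 // one complete response, no streaming chunks
    format: "json",                // ask the model for JSON output
    options: { temperature: 0.1 },
    prompt:
      "Return JSON with keys 'hypothesis' and 'action' for this incident: " +
      JSON.stringify(incident),
  };
}

// Deterministic answer used when Ollama is unreachable or returns garbage.
function fallbackDiagnosis(category) {
  const template = FALLBACK_TEMPLATES.get(category) ?? FALLBACK_TEMPLATES.get("unknown");
  return { ...template, source: "fallback" };
}

// Try Ollama first; any network, HTTP, or parse error selects the fallback.
async function diagnose(incident, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(OLLAMA_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildOllamaRequest(incident)),
    });
    if (!res.ok) throw new Error(`Ollama HTTP ${res.status}`);
    const { response } = await res.json(); // model output arrives as a JSON string
    const parsed = JSON.parse(response);
    return { hypothesis: parsed.hypothesis, action: parsed.action, source: "ollama" };
  } catch {
    return fallbackDiagnosis(incident.category);
  }
}
```

Passing fetchImpl as a parameter is a design choice of this sketch: it lets the fallback path be exercised with a stub that throws, without touching the real endpoint.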
It never changes anything. Proveo independently verified: zero mutating verbs, no GCP mutations of any kind (no run deploy, no set-iam-policy, no SQL writes, no secrets writes). The only HTTP POST in the script goes to the Ollama local inference endpoint, not to googleapis.com.
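The read-only property described above can also be expressed as a runtime allow-list, in addition to the grep-based verification. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the rule it encodes (GET is allowed anywhere, POST only to the Ollama host, everything else is rejected) is a paraphrase of the paragraph above rather than code from the script.

```javascript
// Sketch only: isAllowedRequest is a hypothetical helper, not part of bilko-sentinel.js.
const OLLAMA_HOST = "10.0.0.2:11434";

// Returns true only for requests that cannot mutate GCP:
// any GET, or a POST addressed to the local Ollama inference endpoint.
function isAllowedRequest(method, url) {
  const verb = method.toUpperCase();
  const { host } = new URL(url); // host includes the port when it is non-default
  if (verb === "GET") return true;
  return verb === "POST" && host === OLLAMA_HOST;
}
```

A wrapper around the HTTP client could call this check before every request and throw when it returns false, so that a future code change introducing a mutating call fails loudly instead of silently.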
Infrastructure
| Component | Location |
|---|---|
| Script | /Users/makinja/system/tools/bilko-sentinel.js |
| LaunchAgent plist | /Users/makinja/Library/LaunchAgents/com.alai.bilko-sentinel.plist |
| State file (dedup) | /Users/makinja/system/state/bilko-sentinel-state.json |
| Audit log | /Users/makinja/system/logs/bilko-sentinel-audit.jsonl |
| Run log | /Users/makinja/system/logs/bilko-sentinel.log |
| Host | ANVIL (makinja local Mac) |
| Schedule | 180-second interval, RunAtLoad=true |
| Node.js path | /opt/homebrew/bin/node |
Policies Monitored (8 policies, 10 conditions)
- Cloud SQL CPU utilization high (prod + stage)
- Container restart/crash on prod services
- HTTP 5xx rate high on bilko-api-demo
- HTTP 5xx rate high on bilko-web-demo
- Request latency P95 high on prod services (API + Web — 2 conditions)
- CIAM — High 429 rate on bilko-api-demo (legacy from MC #103245)
- Cloud SQL connections near max on bilko-demo-db
- Uptime check failed (app.bilko.cloud + app-api.bilko.cloud — 2 conditions)
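To make the local evaluation of these conditions concrete, here is a minimal sketch of how one condition might be checked against the last 6 minutes of data. The shape of the condition object, the threshold value, and the simplified point format ({ endTime, value }) are assumptions; the real ALERT_POLICIES entries in the script may differ. The sketch also treats a single breaching point inside the window as a breach, which is a simplification compared with duration-based alerting.

```javascript
// Sketch only: object shapes and the threshold value are hypothetical.
const WINDOW_MS = 6 * 60 * 1000; // evaluate the last 6 minutes

const EXAMPLE_CONDITION = {
  name: "Request latency P95 high (API)",
  comparison: "COMPARISON_GT", // or "COMPARISON_LT"
  threshold: 2000,             // illustrative value in ms, not the real threshold
};

// points: [{ endTime: ISO-8601 string, value: number }]
function evaluateCondition(condition, points, now = Date.now()) {
  // Keep only points whose end time falls inside the evaluation window.
  const values = points
    .filter((p) => {
      const t = Date.parse(p.endTime);
      return t <= now && now - t <= WINDOW_MS;
    })
    .map((p) => p.value);

  if (values.length === 0) return { breached: false, value: null };

  // Take the worst point in the window relative to the comparison direction.
  const below = condition.comparison === "COMPARISON_LT";
  const worst = below ? Math.min(...values) : Math.max(...values);
  const breached = below ? worst < condition.threshold : worst > condition.threshold;
  return { breached, value: worst };
}
```

Returning the worst value alongside the boolean keeps the exact number available for the proposal, which reports the metric value against the threshold.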
Severity Scale
| Label | Meaning |
|---|---|
| P1-DOWN | Service is down or uptime check failing |
| P2-DEGRADED | Elevated error rate or restart loop |
| P3-WARN | Latency spike, DB pressure, CIAM abuse rate |
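One way to apply this scale is a fixed lookup from condition type to label, sketched below. The "kind" strings are hypothetical names invented for this example, and the choice to default unknown kinds to P3-WARN is an assumption; only the three labels and their meanings come from the table above.

```javascript
// Sketch only: the kind strings are hypothetical, not identifiers from the script.
const SEVERITY_BY_KIND = new Map([
  ["uptime_check_failed", "P1-DOWN"],
  ["http_5xx_rate", "P2-DEGRADED"],
  ["container_restart", "P2-DEGRADED"],
  ["latency_p95", "P3-WARN"],
  ["sql_cpu", "P3-WARN"],
  ["sql_connections", "P3-WARN"],
  ["ciam_429_rate", "P3-WARN"],
]);

function severityFor(kind) {
  // Assumption: an unrecognized kind gets the lowest severity instead of throwing.
  return SEVERITY_BY_KIND.get(kind) ?? "P3-WARN";
}
```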
Notification Format
Every proposal contains:
- Header: BILKO SENTINEL — PROPOSAL (Tier-0, no action taken)
- Incident ID, severity, env, resource, condition name
- Metric value vs threshold (exact numbers)
- Root-cause hypothesis (Ollama-generated or deterministic fallback)
- Proposed remediation steps (for human to execute)
- GCP Console link for the alert incident
- Detected timestamp
Dedup key format: bilko-{policyId[-8:]}-{condId[-8:]}. Once notified, silent for 24 hours on the same condition.
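The key format and the 24-hour silence window can be sketched as follows. The [-8:] notation in the key format means "last 8 characters", which is slice(-8) in JavaScript. The state shape (a plain object mapping each key to the last notification time in milliseconds) and the function names are assumptions; the actual layout of bilko-sentinel-state.json is not documented here.

```javascript
// Sketch only: state shape and helper names are hypothetical.
const fs = require("node:fs");

const SILENCE_MS = 24 * 60 * 60 * 1000; // 24 hours

// bilko-{last 8 chars of policyId}-{last 8 chars of condId}
function incidentKey(policyId, condId) {
  return `bilko-${policyId.slice(-8)}-${condId.slice(-8)}`;
}

// True if this key was never notified, or the silence window has elapsed.
function shouldNotify(state, key, now = Date.now()) {
  const last = state[key];
  return last === undefined || now - last >= SILENCE_MS;
}

// Returns a new state object with the notification time recorded.
function markNotified(state, key, now = Date.now()) {
  return { ...state, [key]: now };
}

// Load persisted state; a missing or unreadable file is treated as empty state.
function loadState(path) {
  try {
    return JSON.parse(fs.readFileSync(path, "utf8"));
  } catch {
    return {};
  }
}

function saveState(path, state) {
  fs.writeFileSync(path, JSON.stringify(state, null, 2));
}
```

Persisting the state to disk between runs matters because each 180-second cycle is a fresh process; without it, every cycle would re-notify the same breach.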
Proveo Verification Summary
Proveo (MC #103337) independently verified all critical properties:
| Property | Method | Result |
|---|---|---|
| Read-only guarantee | Exhaustive grep of all spawnSync calls and HTTP methods | CONFIRMED — zero mutating verbs |
| LaunchAgent loaded + healthy | launchctl list \| grep bilko-sentinel shows LastExitStatus=0 | PASS |
| Detect → Propose → Slack delivery | Independent verifier script with synthetic threshold (2ms vs real 9.5ms P95) | PASS — Slack message confirmed in #ceo at 04:24 UTC |
| Detect → Propose → Email delivery | Same synthetic test | PASS — Message-ID confirmed in audit DB |
| Dedup across cycles | Real 2-cycle disk-persistence test (not code inspection only) | PASS — Cycle 2 silent, no second Slack message |
| Healthy = silent | Normal threshold against real metric value | PASS — zero messages sent |
| No GCP mutation | Cloud Run revision before/after comparison | PASS — bilko-api-demo-00167-h9v unchanged |
Gaps reported by AgentForge and their status after Proveo verification: the email exit-code quirk is fixed in the script via a stdout check; the 2-cycle dedup test is now independently proven; Ollama was not re-exercised in the Proveo test, so its live confirmation rests on the builder's synthtest.
Runbook
Pause sentinel
launchctl unload ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist
Resume sentinel
launchctl load ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist
Check last run status
launchctl list | grep bilko-sentinel
# PID="-" = not currently running (between intervals). LastExitStatus=0 = healthy.
tail -20 /Users/makinja/system/logs/bilko-sentinel.log
View audit trail
tail -f /Users/makinja/system/logs/bilko-sentinel-audit.jsonl
Tune alert thresholds
Edit the ALERT_POLICIES array in /Users/makinja/system/tools/bilko-sentinel.js, then reload the agent:
launchctl unload ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist
# edit the script
launchctl load ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist
Tier Model and Safety Rationale
The tier model was defined after the 2026-06 IAM incident, in which an automated set-iam-policy call wiped project IAM. The lesson: any agent that can mutate production infra must earn trust via a demonstrated read-only track record first.
| Tier | Capability | Status | Safety gates |
|---|---|---|---|
| Tier 0 — current | Detect + Diagnose + Propose. Read-only. Posts structured proposal to #ceo and [email protected]. Zero blast radius. | LIVE | No code path to write to GCP. Proveo-verified. |
| Tier 1 — future MC | Bounded auto-remediation: Cloud Run revision rollback, instance scale adjustment, hung service restart. Circuit breaker (max N actions/hour). Full audit trail. Never touches DB schema, IAM, secrets, or financial data. Always announces before acting. | NOT BUILT — separate MC required | Explicit CEO approval token (/tmp/bilko-sentinel-tier1-approved) required before any mutation. Separate script (bilko-sentinel-tier1.js). Only after Tier-0 proves signal quality over weeks. |
| Tier 2 | Broader autonomy. | Probably never for a prod-financial SaaS | N/A |
The IAM incident reference is intentional: Tier-1 will be built with a hard whitelist of reversible Cloud Run and scaling operations only. No set-iam-policy, no SQL DDL, no secret rotation — ever.