Bilko Sentinel — Tier-1 Bounded Auto-Remediation (Shadow-First) 2026-06-11
Status
BUILT — SHADOW mode. MC #103435 (AgentForge build) + MC #103436 (Securion adversarial review). Parent MC #103328. Module: /Users/makinja/system/tools/bilko-sentinel-tier1.js. Tier-0 remains the active detection layer; Tier-1 is shadow-calibrating only.
SRE rationale: DECISIONS-observability-2026-06-10.md, Decision 2 — Kelsey Hightower consult. The bar is not yet met; ship the muscle, start the calibration clock, arm only when proven.
What Tier-1 Is
Tier-1 is the agent that can fix. Where Tier-0 only detects, diagnoses, and proposes, Tier-1 computes the exact bounded action and, when armed, executes it. Permitted action set (exhaustive):
- Roll back bilko-api-demo or bilko-web-demo to N-1 only (never older)
- Set Cloud Run --min-instances 0→1 (warm floor for cold-start incidents)
- Slack escalation (always permitted, labelled with current mode)
Never-automate (enforced at IAM, not policy): no IAM/policy change, no Cloud SQL op, no secret op, no DNS/LB/network, no rollback older than N-1, no action during an in-flight deploy or protected business window.
Tier-1 is invoked by the Tier-0 loop inside a try/catch: any Tier-1 error is caught and logged, and Tier-0 continues unaffected.
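A minimal sketch of that isolation, assuming a hypothetical runTier1Safely wrapper inside the Tier-0 loop. handleIncident() is named in this document; the wrapper, the log callback, and the return shape are illustrative assumptions, not the module's actual code.

```javascript
// Hypothetical wrapper: Tier-0 calls Tier-1 and never lets a Tier-1 failure escape.
// `tier1` is any object exposing handleIncident(incident); `log` is any logging function.
async function runTier1Safely(tier1, incident, log = console.error) {
  try {
    const proposal = await tier1.handleIncident(incident);
    return { ok: true, proposal };
  } catch (err) {
    // Log and swallow: Tier-0 detection must continue unaffected.
    const message = err instanceof Error ? err.message : String(err);
    log(`tier1 error (ignored by Tier-0): ${message}`);
    return { ok: false, error: message };
  }
}
```

Awaiting inside the try block matters here: a rejected promise from an async handleIncident() is caught the same way as a synchronous throw.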
Modes (SENTINEL_TIER1_MODE)
Mode is read once at startup and is immutable for the life of the process. An unrecognised or missing value resolves to shadow. The LaunchAgent plist deliberately omits the key, so the process runs in shadow.
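The fail-closed resolution can be sketched as follows. SENTINEL_TIER1_MODE and the three mode names come from this document; resolveMode and VALID_MODES are hypothetical names, and the exact-match rule (no trimming or case folding) is an assumption chosen so that a mistyped value can never arm the agent.

```javascript
// Sketch only: the real module may parse the variable differently.
const VALID_MODES = ['shadow', 'ack', 'auto'];

// Resolve the mode from an environment object.
// Anything missing or not an exact match falls back to shadow.
function resolveMode(env) {
  const raw = env.SENTINEL_TIER1_MODE;
  return VALID_MODES.includes(raw) ? raw : 'shadow';
}

// Read once at startup; a const binding cannot be reassigned afterwards.
const MODE = resolveMode(process.env);
```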
| Mode | Behaviour | Current state |
|---|---|---|
| shadow (DEFAULT) | On a breach: compute the bounded action, announce to #ceo with gate evaluation, write one row to calibration ledger. Execute nothing. | ACTIVE |
| ack | Same as shadow, but if a human posts APPROVE in the #ceo thread within 3 min, execute the action (all 8 gates must pass). Slack poll loop is a follow-on; it currently defers conservatively. Requires F4 hardening first. | NOT YET WIRED |
| auto | Execute automatically when all 8 gates pass and promotionBarMet() returns true. Silence in ack window = proceed. FORBIDDEN until bar met + human-engineer sign-off. | BLOCKED (PROMOTION_BAR_NOT_MET) |
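Since the Slack poll loop is not yet wired, the following is only a sketch of how the per-mode ack-window outcome in the table could be decided. The function name, the { user, text } message shape, and the returned strings are hypothetical. It also assumes that a HOLD or ABORT from anyone in the thread cancels, while APPROVE counts only from an allowlisted user ID (the F4 requirement).

```javascript
// Hypothetical decision function for the 3-minute ack window.
// `messages`: replies collected from the #ceo thread, each { user, text } (assumed shape).
// `approvers`: allowlist of Slack user IDs permitted to APPROVE (assumed shape).
function ackWindowDecision(mode, messages, approvers) {
  const says = (m, word) => m.text.trim().toUpperCase() === word;
  const vetoed = messages.some((m) => says(m, 'HOLD') || says(m, 'ABORT'));
  const approved = messages.some((m) => approvers.includes(m.user) && says(m, 'APPROVE'));

  if (mode === 'shadow') return 'record-only'; // shadow never executes
  if (vetoed) return 'cancel';                 // a veto always wins
  if (mode === 'ack') return approved ? 'proceed' : 'cancel';
  if (mode === 'auto') return 'proceed';       // silence = proceed
  return 'cancel';                             // unknown mode fails closed
}
```

A 'proceed' result covers only the ack window; the remaining gates and, for auto, promotionBarMet() still have to pass before anything executes.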
Current State: SHADOW — Confirmed Inert
Securion adversarial review (MC #103436) + AgentForge build verification (MC #103435) independently confirm: in shadow mode this agent cannot mutate prod. Two independent structural barriers:
- handleIncident() enters an if (MODE === 'shadow') block and returns at line 852; the execution block at line 861+ is outside it and structurally unreachable.
- executeRollback() (line 625) and executeScaleFloor() (line 675) each throw as their literal first statement when MODE === 'shadow'.
Both barriers are independently sufficient.
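The second barrier can be illustrated with a small guard. This is a sketch, not the module's code: makeExecutor and the error string are hypothetical, and `mutate` stands in for whatever executeRollback() or executeScaleFloor() would actually run.

```javascript
// Sketch of an executor that refuses to run in shadow mode before doing anything else.
function makeExecutor(mode, mutate) {
  return function execute(...args) {
    // First statement: throw in shadow, so no later line can reach production.
    if (mode === 'shadow') throw new Error('SHADOW_MODE_NO_EXECUTION');
    return mutate(...args);
  };
}
```

Because the guard sits inside the executor itself, it holds even if a caller forgets its own mode check.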
Live production traffic verified unchanged during shadow simulation: bilko-api-demo-00192-sfv @ 100% before and after. Auto mode tested with SENTINEL_TIER1_MODE=auto and hard-blocked with PROMOTION_BAR_NOT_MET.
8 Pre-fire Gates
ALL eight must be true before any execution in ack or auto mode. In shadow they are evaluated and recorded to the ledger only.
| # | Gate | Detail |
|---|---|---|
| 1 | Alert sustained ≥5 min | Prevents action on transient spikes. Measured from incident.firstSeenAt. |
| 2 | Calibrated LLM confidence | Requires "high" until ≥5 ledger reviews; adjusts to "medium" with ledger data. Derived from calibration, not hardcoded. |
| 3 | N-1 confirmed healthy ≥10 min | Rollback target must have been in Ready=True state for ≥10 min before the current (bad) revision. Unknown = block + escalate. |
| 4 | No schema migration in bad revision | Requires deploy manifest (~/system/state/bilko-deploy-manifest.json). Honest block: absent manifest = BLOCK + escalate to human. Rolling back across a schema migration can corrupt data. |
| 5 | Cooldown: no action in last 60 min | One action per 60-minute window across all types. |
| 6 | 3-min human-ack window | HOLD or ABORT in #ceo thread cancels. Shadow: informational. Ack: requires explicit APPROVE. Auto: silence = proceed. |
| 7 | IAM diff vs known-good snapshot | Compares live Cloud Run service IAM policy against ~/system/state/bilko-sentinel-iam-snapshot.json. Mismatch = block + escalate. Motivated by the 2026-06 IAM wipe incident. |
| 8 | N-1 is not itself a rollback revision | Prevents rolling back to a revision tagged as a known-bad rollback. Checked against deploy manifest isRollback flag. |
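A sketch of how the all-gates-must-pass rule could be evaluated and recorded. The gate object shape, evaluateGates, and the context fields (now, firstSeenAt, and lastActionAt as epoch milliseconds) are assumptions for illustration; only the thresholds for gates 1 and 5 are taken from the table.

```javascript
// Hypothetical gate evaluator: each gate is { id, name, check }.
// check(ctx) must return exactly true to pass; false, undefined, or a throw all block.
function evaluateGates(gates, ctx) {
  const results = gates.map((gate) => {
    let passed = false;
    let reason = '';
    try {
      passed = gate.check(ctx) === true;
    } catch (err) {
      reason = err.message; // unknown state: block and record why
    }
    return { id: gate.id, name: gate.name, passed, reason };
  });
  // An empty gate list must not count as "all passed".
  return { allPassed: results.length > 0 && results.every((r) => r.passed), results };
}

// Two example gates from the table (gate 1 and gate 5).
const MIN = 60 * 1000;
const exampleGates = [
  { id: 1, name: 'alert sustained >= 5 min', check: (c) => c.now - c.firstSeenAt >= 5 * MIN },
  { id: 5, name: 'cooldown 60 min', check: (c) => c.lastActionAt === null || c.now - c.lastActionAt >= 60 * MIN },
];
```

In shadow mode the `results` array is what would be written to the ledger; in ack or auto, `allPassed` is the precondition for execution.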
Circuit Breakers
- Max 2 actions / 24h across all types. Third incident in 24h → human-only escalation.
- Self-disable after failed remediation: post-action health check at +5 min; if service still unhealthy → circuitOpen=true, SENTINEL-CIRCUIT-OPEN to Slack + email. Manual re-enable: set circuitOpen=false in ~/system/state/bilko-sentinel-tier1-state.json.
- Single-writer lock: atomic file lock (~/system/state/bilko-sentinel-tier1.lock) prevents a race on the same revision.
- Audit-before-execute: ~/system/logs/bilko-sentinel-audit.jsonl is written before any gcloud mutation verb. Log write failure → action does not fire.
- Two Slack announcements per action: BEFORE ("about to roll back X to rev Y in 3 min unless HOLD") and AFTER ("rolled back, health check in 5 min").
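The audit-before-execute rule can be sketched like this. auditThenExecute, the entry fields, and the return shape are hypothetical; `mutate` stands in for the gcloud call, which is deliberately not shown.

```javascript
const fs = require('node:fs');

// Sketch: append the audit row first, and run the mutation only if that write succeeded.
function auditThenExecute(auditPath, entry, mutate) {
  try {
    const row = JSON.stringify({ ts: new Date().toISOString(), ...entry });
    fs.appendFileSync(auditPath, row + '\n');
  } catch (err) {
    // Log write failure: the action does not fire.
    return { fired: false, reason: `audit write failed: ${err.code || err.message}` };
  }
  return { fired: true, result: mutate() };
}
```

Using a synchronous append keeps the ordering obvious: control reaches `mutate()` only after the write call has returned without error.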
Promotion Bar: shadow → auto
promotionBarMet() is a hard gate on the auto path. It reads the calibration ledger at runtime rather than using hardcoded values. It currently returns FALSE, so the auto path refuses with PROMOTION_BAR_NOT_MET.
| Criterion | Required | Current (2026-06-11) |
|---|---|---|
| daysLive_gte30 | ≥30 days since first ledger entry | 0 days — NOT MET |
| evaluatedProposals_gte20 | ≥20 proposals in ledger | 2 — NOT MET |
| fpRate_lt5pct | Human-reviewed FP rate <5% | 100% default — NOT MET |
| groundTruthHit | ≥1 row with human_verdict=correct | 0 — NOT MET |
| deployManifestExists | ~/system/state/bilko-deploy-manifest.json present | Absent — NOT MET |
When the bar is eventually met, human-engineer sign-off is still required before the plist is updated to set SENTINEL_TIER1_MODE=auto. That is an explicit audited step, not an automatic promotion.
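A sketch of how the five criteria in the table could be computed from parsed ledger rows. The criterion names are taken from the table; evaluatePromotionBar is a hypothetical name, and several details are assumptions: `ts` is an ISO-8601 string, a row counts as reviewed once human_verdict is anything other than "not-yet-reviewed", and a false positive is any reviewed row whose verdict is not "correct".

```javascript
// Sketch of a promotion-bar evaluation; not the module's promotionBarMet().
function evaluatePromotionBar(rows, manifestExists, now = Date.now()) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  // An unparseable ts yields NaN, which makes the daysLive comparison false (fails closed).
  const firstTs = rows.length ? Math.min(...rows.map((r) => Date.parse(r.ts))) : now;
  const reviewed = rows.filter((r) => r.human_verdict !== 'not-yet-reviewed');
  const correct = reviewed.filter((r) => r.human_verdict === 'correct').length;
  // With no reviews yet, default to the worst case (100%).
  const fpRate = reviewed.length ? (reviewed.length - correct) / reviewed.length : 1;

  const criteria = {
    daysLive_gte30: (now - firstTs) / DAY_MS >= 30,
    evaluatedProposals_gte20: rows.length >= 20,
    fpRate_lt5pct: fpRate < 0.05,
    groundTruthHit: correct >= 1,
    deployManifestExists: manifestExists === true,
  };
  return { met: Object.values(criteria).every(Boolean), fpRate, criteria };
}
```

Even when `met` is true, this only satisfies the automated half of the bar; the human-engineer sign-off and plist change remain a separate manual step.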
Arming Prerequisites (Securion #103436 — MC #103439)
These findings block arming (ack/auto); they do not block shadow. Securion re-review is required before the mode key is added to the plist.
| Finding | Severity | Required before arming |
|---|---|---|
| F5 — Ledger integrity | HIGH | HMAC-sign each ledger row (key stored outside ledger path). promotionBarMet() must verify HMAC before counting. Without this, forged human_verdict entries can satisfy the promotion bar and arm auto. |
| F7 — SA IAM scope | MEDIUM | Verify alai-cli-deployer holds only monitoring.viewer + logging.viewer + run.viewer in shadow. For auto: run.developer scoped by resource condition to bilko-api-demo + bilko-web-demo only. Must NOT hold cloudsql.*, iam.*, secretmanager.*, dns.*. |
| F4 — Ack approver allowlist | INFO | Before Slack poll loop is wired: define constant with allowed Slack user IDs. Poll must verify message.user + thread_ts — not just message text. |
| F6 — IAM snapshot seal | MEDIUM | chmod 0444 after first write. Add sealed flag; require manual unsealing for reset. Populate bilko-web-demo immediately. |
| F2 — Misleading Object.freeze | MEDIUM | Remove Object.freeze({MODE}) at line 57 — it freezes a discarded object, not the const binding. Replace with clarifying comment. |
| F8 — Gate 8 inconsistency | LOW | Align Gate 8 with Gate 4: block (not warn-and-pass) when deploy manifest absent or N-1 not in manifest. |
| F3 — Module integrity | LOW | Add startup SHA-256 check of the module file against a stored known-good value outside the module path. |
All items tracked in MC #103439. Securion re-review required before mode key is added to plist.
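For F5, one possible shape of the row signing is sketched below using Node's built-in crypto module. The `sig` field name, hex encoding, and signing over JSON.stringify of the row are assumptions; key storage and rotation are out of scope here.

```javascript
const crypto = require('node:crypto');

// Sketch: sign a ledger row with HMAC-SHA256. Any existing sig is dropped before signing.
function signRow(row, key) {
  const { sig, ...body } = row;
  const mac = crypto.createHmac('sha256', key).update(JSON.stringify(body)).digest('hex');
  return { ...body, sig: mac };
}

// Verify a row before it is counted towards the promotion bar.
function verifyRow(row, key) {
  if (typeof row.sig !== 'string') return false;
  const expected = Buffer.from(signRow(row, key).sig, 'hex');
  const actual = Buffer.from(row.sig, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return actual.length === expected.length && crypto.timingSafeEqual(actual, expected);
}
```

One consequence of this design: updating human_verdict changes the signed body, so whatever tool records a review would have to re-sign the row with the same key. A hand-edited verdict would then fail verification and not be counted.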
Calibration Ledger
Every shadow proposal appends one row to /Users/makinja/system/logs/bilko-sentinel-tier1-ledger.jsonl
Row schema: { ts, incidentId, policyName, condName, resource, diagnosis, confidence, computedAction, n1Info, gates, mode, human_verdict }
The human_verdict field starts as "not-yet-reviewed". After review, set it to one of: correct | wrong-rootcause | would-have-worsened. A weekly summary is posted to #ceo automatically: proposal count, reviewed count, FP rate, ground-truth hits, promotion bar status.
node /Users/makinja/system/tools/bilko-sentinel-tier1.js --weekly-summary
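As an illustration of what such a summary has to tally, here is a sketch that counts proposals and verdicts from the raw JSONL text. summariseLedger and the shape of its result are hypothetical; the behaviour of the real --weekly-summary flag is not shown in this document.

```javascript
// Sketch: tally a summary from the raw text of the ledger file.
// Malformed lines are counted and skipped rather than trusted.
function summariseLedger(jsonlText) {
  const summary = { proposals: 0, reviewed: 0, malformed: 0, verdicts: {} };
  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    let row;
    try {
      row = JSON.parse(line);
    } catch {
      row = null;
    }
    if (row === null || typeof row !== 'object') {
      summary.malformed += 1;
      continue;
    }
    summary.proposals += 1;
    const verdict = row.human_verdict ?? 'not-yet-reviewed';
    summary.verdicts[verdict] = (summary.verdicts[verdict] || 0) + 1;
    if (verdict !== 'not-yet-reviewed') summary.reviewed += 1;
  }
  return summary;
}
```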
Infrastructure
| Component | Location |
|---|---|
| Module | /Users/makinja/system/tools/bilko-sentinel-tier1.js |
| Weekly summary plist | /Users/makinja/system/tools/com.alai.bilko-sentinel-tier1-weekly-summary.plist |
| Calibration ledger | /Users/makinja/system/logs/bilko-sentinel-tier1-ledger.jsonl |
| IAM snapshot | /Users/makinja/system/state/bilko-sentinel-iam-snapshot.json |
| State file | /Users/makinja/system/state/bilko-sentinel-tier1-state.json |
| Execution audit log | /Users/makinja/system/logs/bilko-sentinel-audit.jsonl |
| Run log | /Users/makinja/system/logs/bilko-sentinel-tier1.log |
| Single-writer lock | /Users/makinja/system/state/bilko-sentinel-tier1.lock |
| GCP project | tribal-sign-487920-k0, region europe-north1 |
| SA | alai-cli-deployer@tribal-sign-487920-k0.iam.gserviceaccount.com |
| Allowed services | bilko-api-demo, bilko-web-demo |
| Host | ANVIL (makinja local Mac) — invoked by Tier-0 loop |
Runbook
Self-test in shadow
node /Users/makinja/system/tools/bilko-sentinel-tier1.js --self-test
Evaluate promotion bar
node /Users/makinja/system/tools/bilko-sentinel-tier1.js --promotionbar-test
# exits 0 if bar met, 1 if not
Inspect calibration ledger
tail -f /Users/makinja/system/logs/bilko-sentinel-tier1-ledger.jsonl
Re-enable after circuit-open
# Edit ~/system/state/bilko-sentinel-tier1-state.json
# Set "circuitOpen": false
Flip to ack mode (AFTER all hardening items + Securion re-review)
# Add to LaunchAgent plist EnvironmentVariables:
# <key>SENTINEL_TIER1_MODE</key><string>ack</string>
launchctl unload ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist
launchctl load ~/Library/LaunchAgents/com.alai.bilko-sentinel.plist