Reality Anchor — Probe Daemons and Watchdog
Reality Anchor — Probe Daemons and Watchdog
Status: DRAFT (MC #101450, 2026-05-19) Author: FlowForge / Kelsey Hightower Doctrine: Reality Anchor v1 (approved 2026-05-15, docs.alai.no/books/system-architecture/page/reality-anchor-doctrine-v1-final)
Why This Exists
The Reality Anchor doctrine (2026-05-15) established that probe output IS evidence — deterministic tool output, not LLM inference. Two probe daemons were deployed to provide continuous fleet health signals:
com.john.auto-verify-regression— regression suite against the anti-hallucination probe librarycom.john.ollama-health-probe— Ollama fleet health (ANVIL + FORGE endpoints)
In the week of 2026-05-11 to 2026-05-18, both daemons stopped producing fresh state output. Root cause: auto-verify-regression was scheduled at StartCalendarInterval (once daily at 06:00) rather than a continuous interval. Combined with the absence of a watchdog, there was no circuit-breaker to detect and recover from the audit blind spot.
This document describes the fix applied under MC #101450 and the ongoing watchdog architecture.
Daemon Inventory
1. com.john.auto-verify-regression
| Property | Value |
|---|---|
| Plist | ~/Library/LaunchAgents/com.john.auto-verify-regression.plist |
| Script | ~/system/tools/auto-verify-regression.js |
| Interval | 900 seconds (15 minutes) — changed from daily StartCalendarInterval |
| RunAtLoad | true |
| Stdout log | ~/system/logs/auto-verify-regression.log |
| State written | ~/system/logs/auto-verify-regression.log (tail -1 = regression result) |
What it does: Runs the 5-probe regression suite against the anti-hallucination probe library. Each probe runs a known-bad case (expected FAIL) and a known-good case (expected PASS). Emits 5/5 PASS or lists failures. Failure = evidence pipeline degraded.
2. com.john.ollama-health-probe
| Property | Value |
|---|---|
| Plist | ~/Library/LaunchAgents/com.john.ollama-health-probe.plist |
| Script | ~/system/tools/ollama-health-probe.sh |
| Interval | 60 seconds (unchanged) |
| RunAtLoad | true |
| Stdout log | ~/system/logs/ollama-health-probe.out |
| State written | ~/system/state/ollama-fleet.json |
What it does: Probes localhost:11434 (ANVIL) and 10.0.0.2:11434 (FORGE) via GET /api/tags. Writes JSON status (healthy/degraded/down) to ollama-fleet.json. Sends Slack alert to #ops on status transitions. DEGRADED = primary down, backup (Tailscale) up.
3. com.john.reality-anchor-watchdog (NEW — MC #101450)
| Property | Value |
|---|---|
| Plist | ~/Library/LaunchAgents/com.john.reality-anchor-watchdog.plist |
| Script | ~/system/tools/reality-anchor-watchdog.sh |
| Interval | 3600 seconds (1 hour) |
| RunAtLoad | true |
| Alert log | ~/.cache/reality-anchor-stale-alerts.log |
What it does: Checks mtime of each probe's state file every hour. If any state file has not been written in > 24 hours, it:
- Logs
STALE_PROBE_ALERTto~/.cache/reality-anchor-stale-alerts.log - Calls
launchctl start <daemon>for one auto-restart attempt - Logs the restart result (success or escalation-needed)
If state is fresh, logs OK with current age.
Alert Path
Probe state file mtime > 24h
→ reality-anchor-watchdog fires
→ ~/.cache/reality-anchor-stale-alerts.log (STALE_PROBE_ALERT line)
→ launchctl start <probe> (auto-restart attempt)
→ if restart fails: "ESCALATION NEEDED" logged
Manual escalation path:
grep "ESCALATION NEEDED" ~/.cache/reality-anchor-stale-alerts.log
→ Slack #ops manual alert
→ CEO notification if probe offline > 48h
Future: connect reality-anchor-stale-alerts.log growth to a Slack webhook. When file size increases since last check cycle, post to #ops. This closes the loop from watchdog to human-visible alert without requiring a separate daemon.
Recovery Runbook
If probes are stale:
# 1. Check state
launchctl list | grep -E "auto-verify-regression|ollama-health-probe|reality-anchor-watchdog"
cat ~/.cache/reality-anchor-stale-alerts.log | tail -20
# 2. Manual restart (watchdog does this automatically, but for immediate action)
launchctl start com.john.auto-verify-regression
launchctl start com.john.ollama-health-probe
# 3. Verify within 60s
ls -lat ~/system/state/ollama-fleet.json ~/system/logs/auto-verify-regression.log
# 4. If plist is unloaded (not listed at all):
launchctl load ~/Library/LaunchAgents/com.john.auto-verify-regression.plist
launchctl load ~/Library/LaunchAgents/com.john.ollama-health-probe.plist
launchctl load ~/Library/LaunchAgents/com.john.reality-anchor-watchdog.plist
E2E Test
Proveo validation test: ~/system/tests/reality-anchor-recovery-test.sh
--dry-runflag: mocks destructive steps (safe for CI / scheduled validation)- Live mode: requires operator confirmation before stopping Ollama
- Tests A (stop detection), B (recovery detection), C (watchdog stale alert)
Run: bash ~/system/tests/reality-anchor-recovery-test.sh --dry-run
Change Log
| Date | Change | MC |
|---|---|---|
| 2026-05-15 | Reality Anchor doctrine approved; probes deployed | #100818–#100833 |
| 2026-05-19 | auto-verify-regression interval changed to 900s; watchdog created | #101450 |