Reality Anchor — Probe Daemons and Watchdog

Status: DRAFT (MC #101450, 2026-05-19) Author: FlowForge / Kelsey Hightower Doctrine: Reality Anchor v1 (approved 2026-05-15, docs.alai.no/books/system-architecture/page/reality-anchor-doctrine-v1-final)

Why This Exists

The Reality Anchor doctrine (2026-05-15) established that probe output IS evidence — deterministic tool output, not LLM inference. Two probe daemons were deployed to provide continuous fleet health signals:

com.john.auto-verify-regression — regression suite against the anti-hallucination probe library
com.john.ollama-health-probe — Ollama fleet health (ANVIL + FORGE endpoints)

In the week of 2026-05-11 to 2026-05-18, both daemons stopped producing fresh state output. Root cause: auto-verify-regression was scheduled at StartCalendarInterval (once daily at 06:00) rather than a continuous interval. Combined with the absence of a watchdog, there was no circuit-breaker to detect and recover from the audit blind spot.

This document describes the fix applied under MC #101450 and the ongoing watchdog architecture.

Daemon Inventory

1. com.john.auto-verify-regression

Property	Value
Plist	~/Library/LaunchAgents/com.john.auto-verify-regression.plist
Script	~/system/tools/auto-verify-regression.js
Interval	900 seconds (15 minutes) — changed from daily StartCalendarInterval
RunAtLoad	true
Stdout log	~/system/logs/auto-verify-regression.log
State written	~/system/logs/auto-verify-regression.log (tail -1 = regression result)

What it does: Runs the 5-probe regression suite against the anti-hallucination probe library. Each probe runs a known-bad case (expected FAIL) and a known-good case (expected PASS). Emits 5/5 PASS or lists failures. Failure = evidence pipeline degraded.

2. com.john.ollama-health-probe

Property	Value
Plist	~/Library/LaunchAgents/com.john.ollama-health-probe.plist
Script	~/system/tools/ollama-health-probe.sh
Interval	60 seconds (unchanged)
RunAtLoad	true
Stdout log	~/system/logs/ollama-health-probe.out
State written	~/system/state/ollama-fleet.json

What it does: Probes localhost:11434 (ANVIL) and 10.0.0.2:11434 (FORGE) via GET /api/tags. Writes JSON status (healthy/degraded/down) to ollama-fleet.json. Sends Slack alert to #ops on status transitions. DEGRADED = primary down, backup (Tailscale) up.

3. com.john.reality-anchor-watchdog (NEW — MC #101450)

Property	Value
Plist	~/Library/LaunchAgents/com.john.reality-anchor-watchdog.plist
Script	~/system/tools/reality-anchor-watchdog.sh
Interval	3600 seconds (1 hour)
RunAtLoad	true
Alert log	~/.cache/reality-anchor-stale-alerts.log

What it does: Checks mtime of each probe's state file every hour. If any state file has not been written in > 24 hours, it:

Logs STALE_PROBE_ALERT to ~/.cache/reality-anchor-stale-alerts.log
Calls launchctl start <daemon> for one auto-restart attempt
Logs the restart result (success or escalation-needed)

If state is fresh, logs OK with current age.

Alert Path

Probe state file mtime > 24h
  → reality-anchor-watchdog fires
    → ~/.cache/reality-anchor-stale-alerts.log (STALE_PROBE_ALERT line)
    → launchctl start <probe> (auto-restart attempt)
    → if restart fails: "ESCALATION NEEDED" logged

Manual escalation path:
  grep "ESCALATION NEEDED" ~/.cache/reality-anchor-stale-alerts.log
  → Slack #ops manual alert
  → CEO notification if probe offline > 48h

Future: connect reality-anchor-stale-alerts.log growth to a Slack webhook. When file size increases since last check cycle, post to #ops. This closes the loop from watchdog to human-visible alert without requiring a separate daemon.

Recovery Runbook

If probes are stale:

# 1. Check state
launchctl list | grep -E "auto-verify-regression|ollama-health-probe|reality-anchor-watchdog"
cat ~/.cache/reality-anchor-stale-alerts.log | tail -20

# 2. Manual restart (watchdog does this automatically, but for immediate action)
launchctl start com.john.auto-verify-regression
launchctl start com.john.ollama-health-probe

# 3. Verify within 60s
ls -lat ~/system/state/ollama-fleet.json ~/system/logs/auto-verify-regression.log

# 4. If plist is unloaded (not listed at all):
launchctl load ~/Library/LaunchAgents/com.john.auto-verify-regression.plist
launchctl load ~/Library/LaunchAgents/com.john.ollama-health-probe.plist
launchctl load ~/Library/LaunchAgents/com.john.reality-anchor-watchdog.plist

E2E Test

Proveo validation test: ~/system/tests/reality-anchor-recovery-test.sh

--dry-run flag: mocks destructive steps (safe for CI / scheduled validation)
Live mode: requires operator confirmation before stopping Ollama
Tests A (stop detection), B (recovery detection), C (watchdog stale alert)

Run: bash ~/system/tests/reality-anchor-recovery-test.sh --dry-run

Change Log

Date	Change	MC
2026-05-15	Reality Anchor doctrine approved; probes deployed	#100818–#100833
2026-05-19	auto-verify-regression interval changed to 900s; watchdog created	#101450