Disk & Memory Health Alarms — What Fires, Where It Lands, How to Test

 Why This System Exists 
 On 2026-06-02, makinja's /System/Volumes/Data volume reached 100% capacity (145Mi free). This caused system-wide failures: 
 
 Bash/sshd/mosh-server failed with ENOSPC errors 
 CEO was locked out (unable to mosh in from ab-mac) 
 Nobody was alerted — the health monitor logged breaches to a SQLite database that no one actively monitored 
 
 The root cause of the disk fill was evidence_ledger bloat (92.9M duplicate rows, 21GB database — fixed in MC #102796). However, the alert silence was a separate critical gap: the monitoring system recorded breaches but never notified anyone. 
 This document describes the alarm system built in MC #102812 to ensure health breaches reach the CEO immediately. 
 Related incident memo: incident_diskfull_evidence_ledger_bloat_2026-06-02.md 
 
 What the Monitor Checks 
 Script: /Users/makinja/system/tools/health-monitor-anvil.js 
 The monitor runs these checks every 300 seconds (5 minutes): 
 1. Disk Usage 
 
 Volumes checked: 
 
 makinja host: Both df / (root) AND df /System/Volumes/Data (where user data lives on APFS) 
 ANVIL host: Only df / (single-volume system) 
 
 
 Thresholds: 
 
 WARN: 80% 
 ALERT: 90% 
 CRITICAL: 95% 
 
 
 Value reported: Maximum of all checked volumes 
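 The "maximum of all checked volumes" rule can be sketched roughly as follows. This is an illustration only, not the code in health-monitor-anvil.js: the names parseDfPercent and maxDiskUsage are hypothetical, and it assumes the POSIX output format of df -P, where the capacity column is the only field of the form NN%.

```javascript
// Sketch only: parseDfPercent and maxDiskUsage are hypothetical names,
// not functions known to exist in health-monitor-anvil.js.
const { execFileSync } = require("node:child_process");

// Extract the capacity percentage from `df -P <path>` output.
// POSIX layout: Filesystem 512-blocks Used Available Capacity Mounted on
function parseDfPercent(dfOutput) {
  const lines = dfOutput.trim().split("\n");
  const fields = lines[lines.length - 1].trim().split(/\s+/);
  const capacity = fields.find((f) => /^\d+%$/.test(f));
  if (!capacity) throw new Error("no capacity column in df output");
  return parseInt(capacity, 10);
}

// Report the worst (highest) usage across every volume checked.
// runDf is injectable so the logic can be exercised without calling df.
function maxDiskUsage(
  paths,
  runDf = (p) => execFileSync("df", ["-P", p], { encoding: "utf8" })
) {
  return Math.max(...paths.map((p) => parseDfPercent(runDf(p))));
}
```

 On makinja this would be called with both paths, for example maxDiskUsage(["/", "/System/Volumes/Data"]); on ANVIL with ["/"] only.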
 
 2. Memory Usage 
 
 Source: vm_stat (macOS memory statistics) 
 Calculation: (wired + active + compressed pages) / total pages × 100 
 Thresholds: 
 
 WARN: 80% 
 ALERT: 90% 
 CRITICAL: 95% 
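 As a rough illustration of the calculation above, the sketch below parses vm_stat text and applies (wired + active + compressed) / total × 100. The helper name memoryUsedPercent is hypothetical, and deriving total pages from os.totalmem() divided by the page size is an assumption; the document does not say where the monitor gets its total.

```javascript
// Sketch only: memoryUsedPercent is a hypothetical helper.
// Assumption: total pages = total physical bytes / vm_stat page size.
function memoryUsedPercent(vmStatOutput, totalBytes) {
  const sizeMatch = /page size of (\d+) bytes/.exec(vmStatOutput);
  if (!sizeMatch) throw new Error("page size not found in vm_stat output");
  const pageSize = Number(sizeMatch[1]);

  // vm_stat prints lines such as "Pages active:     100000."
  const pages = (label) => {
    const m = new RegExp(`^${label}:\\s+(\\d+)\\.`, "m").exec(vmStatOutput);
    if (!m) throw new Error(`vm_stat line not found: ${label}`);
    return Number(m[1]);
  };

  const used =
    pages("Pages wired down") +
    pages("Pages active") +
    pages("Pages occupied by compressor");
  const totalPages = totalBytes / pageSize;
  return (used * 100) / totalPages;
}
```

 A caller might pass the output of vm_stat together with require("node:os").totalmem().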
 
 
 
 3. CPU Load 
 
 Source: os.loadavg()[1] (5-minute load average) 
 Thresholds (M3 Ultra = 24 cores): 
 
 WARN: 8 
 ALERT: 12 
 CRITICAL: 20 
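 The three-level thresholds used by the disk, memory, and CPU checks can be expressed as one small classifier. The sketch below is illustrative: classify and CPU_THRESHOLDS are hypothetical names, and the use of >= (a value equal to the threshold counts as a breach) is inferred from the example alert in this document, where a value of 95% against a threshold of 95% is reported as CRITICAL.

```javascript
// Sketch only: classify and CPU_THRESHOLDS are hypothetical names.
const os = require("node:os");

// CPU load thresholds as listed above.
const CPU_THRESHOLDS = { warn: 8, alert: 12, critical: 20 };

// Return the most severe level whose threshold the value meets or exceeds.
function classify(value, t) {
  if (value >= t.critical) return { status: "critical", threshold: t.critical };
  if (value >= t.alert) return { status: "alert", threshold: t.alert };
  if (value >= t.warn) return { status: "warn", threshold: t.warn };
  return { status: "ok", threshold: null };
}

// os.loadavg() returns [1-min, 5-min, 15-min]; index 1 is the 5-minute average.
const fiveMinLoad = os.loadavg()[1];
console.log(classify(fiveMinLoad, CPU_THRESHOLDS));
```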
 
 
 
 4. Ollama Health 
 
 Check: HTTP GET to http://localhost:11434/api/tags (or the host given by $OLLAMA_HOST)
 Status: OK if responding with valid JSON, ALERT if unreachable/invalid 
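 A probe of this kind might look like the sketch below. It is not the monitor's actual code: checkOllama and its return shape are hypothetical, the 5-second timeout is an assumed value, and it assumes $OLLAMA_HOST, when set, holds a full base URL including the scheme. It relies on the global fetch and AbortSignal.timeout available in Node 18 and later.

```javascript
// Sketch only: checkOllama and its return shape are hypothetical.
// Assumptions: OLLAMA_HOST (if set) is a full base URL; 5 s timeout is arbitrary.
async function checkOllama(
  fetchImpl = fetch,
  base = process.env.OLLAMA_HOST || "http://localhost:11434"
) {
  try {
    const res = await fetchImpl(`${base}/api/tags`, {
      signal: AbortSignal.timeout(5000),
    });
    if (!res.ok) return { status: "alert", message: `HTTP ${res.status}` };
    await res.json(); // rejects if the body is not valid JSON
    return { status: "ok", message: "Ollama responding" };
  } catch (err) {
    // Connection refused, timeout, bad URL, or invalid JSON all land here.
    return { status: "alert", message: `Ollama unreachable or invalid: ${err.message}` };
  }
}
```

 Passing fetchImpl as a parameter keeps the check testable without a running Ollama instance.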
 
 
 Where Alerts Land 
 When a threshold is breached, alerts are sent via this three-tier fallback chain:
 Primary: Telegram 
 
 Target: Chat ID 224494223 (CEO's Telegram user ID) 
 Mechanism: Calls ~/system/tools/telegram-agent.js --send 
 Timeout: 10 seconds 
 
 Fallback 1: Email 
 
 Target: alem@alai.no 
 Mechanism: macOS mail command 
 Timeout: 5 seconds 
 
 Fallback 2: Log File 
 
 Path: ~/system/logs/health-monitor-alerts.log 
 Purpose: Last-resort record if all delivery channels fail 
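 The fallback chain amounts to trying each channel in order and stopping at the first success. The sketch below shows that pattern under stated assumptions: deliver and the channel objects are hypothetical, and each send function is expected to enforce its own timeout (10 seconds for Telegram, 5 seconds for email) and to throw or reject on failure.

```javascript
// Sketch only: deliver and the { name, send } channel shape are hypothetical.
// Each send() is assumed to throw/reject on failure or timeout.
async function deliver(text, channels) {
  for (const { name, send } of channels) {
    try {
      await send(text);
      return name; // first channel that succeeds ends the chain
    } catch (err) {
      console.error(`[alert] ${name} failed: ${err.message}`);
    }
  }
  return null; // every channel failed, including the log file
}
```

 A caller would pass the channels in priority order, for example Telegram, then email, then a function that appends to ~/system/logs/health-monitor-alerts.log.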
 
 Alert Format 
 Subject: 🚨 [LEVEL] — [check_name] on [hostname]

[message]

Value: [current_value] | Threshold: [threshold]
Host: [hostname]
Time: [ISO timestamp]
 
 Example: 
 🚨 CRITICAL — disk on Makinja-sin-Mac-Studio.local

Disk /System/Volumes/Data: 95% used (NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /)

Value: 95% | Threshold: 95%
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
 
 
 Cooldown and Deduplication 
 To prevent alert spam during sustained breaches: 
 State File 
 Path: ~/system/config/health-monitor-alert-state.json 
 Contains last-alert timestamps per check: 
{
  "disk": 1735854869000,
  "memory": 1735854500000
}
 
 Cooldown Rules 
 
 Standard alerts (WARN/ALERT): Maximum 1 alert per check per 60 minutes 
 CRITICAL alerts: Always bypass cooldown (immediate notification) 
 
 Behavior Table 

 Scenario                        Behavior
 ------------------------------  --------------------------------------------------
 First disk WARN                 Alert sent immediately
 Second disk WARN 5 min later    Suppressed (within cooldown)
 Disk CRITICAL 10 min later      Alert sent (bypasses cooldown)
 Check recovers to OK            Next breach can alert after 60 min from last alert
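
 The cooldown rules above reduce to a short decision function. The following is a minimal sketch, not the monitor's actual implementation: shouldSend and recordSent are hypothetical names, and the state object is assumed to have the same shape as the state file (check name mapped to the epoch-millisecond timestamp of the last alert).

```javascript
// Sketch only: shouldSend and recordSent are hypothetical names.
const COOLDOWN_MS = 60 * 60 * 1000; // 60 minutes

// state maps check name -> epoch ms of the last alert sent for that check.
function shouldSend(state, check, level, now = Date.now()) {
  if (level === "critical") return true; // CRITICAL always bypasses cooldown
  const last = state[check];
  return last === undefined || now - last >= COOLDOWN_MS;
}

// Return a new state object with the check's last-alert time updated.
function recordSent(state, check, now = Date.now()) {
  return { ...state, [check]: now };
}
```

 Because the cooldown is keyed per check, a suppressed disk alert does not delay a memory alert.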
 
 
 
 
 The APFS Gotcha 
 Problem 1: Multiple Volumes 
 On modern macOS with APFS, user data lives on /System/Volumes/Data, NOT on / (root). A naive df / check would have missed the 2026-06-02 incident entirely.
 Solution: The monitor checks BOTH volumes on makinja and reports the higher usage. 
 Problem 2: Local Time Machine Snapshots 
 APFS local snapshots (created by Time Machine) keep freed disk blocks pinned until the snapshot is deleted. This means:
 
 You delete 20GB of files 
 df still shows disk full 
 The space isn't reclaimed until snapshots are purged 
 
 Check snapshots: 
 tmutil listlocalsnapshots /
 
 Delete snapshots: 
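for snapshot in $(tmutil listlocalsnapshots / | grep 'com.apple.TimeMachine'); do
  # deletelocalsnapshots expects only the date part of the snapshot name
  snapshot_date="${snapshot#com.apple.TimeMachine.}"
  sudo tmutil deletelocalsnapshots "${snapshot_date%.local}"
done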
 
 Alert caveat: All disk breach alerts on makinja include this note:
 
 "NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /" 
 
 
 How to Test the System Safely 
 Dry-Run Mode (No Actual Alerts) 
 HEALTH_MONITOR_DRY_RUN=1 /opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
 
 Output example: 
 [ALERT DRY-RUN] Would send: 🚨 WARN — cpu_load on Makinja-sin-Mac-Studio.local
5-min load average: 9.16

Value: 9.16 | Threshold: 8
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
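
 A dry-run gate of this kind is typically a single environment check in front of the delivery call. The sketch below is an assumption about how such a gate could be written, not a copy of the monitor's code: sendOrPreview is a hypothetical name, and only the HEALTH_MONITOR_DRY_RUN variable and the "[ALERT DRY-RUN] Would send:" prefix come from this document.

```javascript
// Sketch only: sendOrPreview is a hypothetical helper.
// Returns true if the alert was actually handed to send(), false in dry-run mode.
function sendOrPreview(text, send, env = process.env) {
  if (env.HEALTH_MONITOR_DRY_RUN === "1") {
    console.log(`[ALERT DRY-RUN] Would send: ${text}`);
    return false;
  }
  send(text);
  return true;
}
```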
 
 Force a Synthetic Breach 
 Option 1: Lower Thresholds Temporarily 
 Edit /Users/makinja/system/tools/health-monitor-anvil.js:
const THRESHOLDS = {
  cpu_load: { warn: 1, alert: 2, critical: 5 }, // Will trigger immediately
  memory: { warn: 10, alert: 20, critical: 30 },
  disk: { warn: 10, alert: 20, critical: 30 },
};
 
 Run once manually: 
 /opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
 
 Check Telegram/email for alert delivery. 
 IMPORTANT: Restore original thresholds after testing. 
 Option 2: Mock a High Value 
 Temporarily modify a check function to return a breach value: 
function checkDisk() {
  // ... existing code ...
  const maxPct = 96; // Force CRITICAL
  // ... rest of function
}
 
 Verify Alert Delivery 
 
 Telegram: Check chat 224494223 for message 
 Email: Check alem@alai.no inbox 
 Database: Query health_events table: 
 
 sqlite3 ~/system/databases/health-events.db \
 "SELECT timestamp, check_name, status, value, threshold, message 
 FROM health_events 
 WHERE status IN ('warn','alert','critical') 
 ORDER BY timestamp DESC 
 LIMIT 10;"
 
 
 Alert state: Check cooldown state: 
 
 cat ~/system/config/health-monitor-alert-state.json
 
 
 Scheduling 
 makinja (Mac Studio) 
 LaunchAgent: ~/Library/LaunchAgents/com.john.health-monitor.plist 
 Interval: 300 seconds (5 minutes) 
 Verify it's loaded: 
 launchctl list | grep com.john.health-monitor
 
 Expected output: 
 -	0	com.john.health-monitor
 
 (A PID of - means the job is loaded but not currently running; it starts at the next interval. The 0 is the exit status of the last run.)
 Manual reload after changes: 
 launchctl unload ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist
 
 ANVIL (M3 Ultra Remote Host) 
 Status: Deployment to ANVIL is pending (as of 2026-06-02). 
 Deployment steps (when ready): 
 # 1. Copy script
scp /Users/makinja/system/tools/health-monitor-anvil.js \
 ANVIL:/Users/makinja/system/tools/

# 2. Copy LaunchAgent plist
scp /Users/makinja/Library/LaunchAgents/com.john.health-monitor.plist \
 ANVIL:/Users/makinja/Library/LaunchAgents/

# 3. SSH into ANVIL and activate
ssh ANVIL
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl list | grep health-monitor

# 4. Test run
/opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
 
 Note: ANVIL will only check df / (no /System/Volumes/Data check, as that's makinja-specific). 
 
 Database Logging 
 All checks (OK and breaches) are recorded to:
 Database: ~/system/databases/health-events.db 
 Table: health_events 
 Schema 
CREATE TABLE health_events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  source TEXT NOT NULL,      -- 'anvil'
  check_name TEXT NOT NULL,  -- 'disk', 'memory', 'cpu_load', 'ollama'
  status TEXT NOT NULL,      -- 'ok', 'warn', 'alert', 'critical', 'error'
  value REAL,                -- Measured value (e.g., 85.3 for 85.3%)
  threshold REAL,            -- Threshold that was breached (e.g., 80)
  message TEXT,              -- Human-readable message
  metadata TEXT              -- JSON, if needed
);
 
 Query Recent Breaches 
 sqlite3 ~/system/databases/health-events.db <<SQL
SELECT datetime(timestamp, 'localtime') as time,
 check_name,
 status,
 value || CASE WHEN check_name IN ('disk','memory') THEN '%' ELSE '' END as value,
 message
FROM health_events
WHERE status != 'ok'
 AND timestamp > datetime('now', '-24 hours')
ORDER BY timestamp DESC
LIMIT 20;
SQL
 
 
 Related Fix: evidence_ledger Bloat 
 The root cause of the 2026-06-02 disk-full was a separate issue (MC #102796): 
 
 mc.js bootstrap inserted session_id: entry.session_id || null 
 SQLite's UNIQUE(task_id, session_id, action) constraint treats every NULL as distinct, so rows with a NULL session_id never conflict with each other 
 Every cold-start re-imported ~2054 JSONL lines → 92.9M duplicate rows (21GB database) 
 
 Fix applied: 
 
 Added dedup index: UNIQUE INDEX idx_evidence_ledger_dedup ON evidence_ledger(task_id, COALESCE(session_id,''), COALESCE(file_path,''), action) 
 Pruned backups: mc-backlog-ttl-sweep.sh now keeps only last 3 TTL backups (was: keep all → 14 files/176GB) 
 Reclaimed space: Stopped litestream → wal_checkpoint(TRUNCATE) + VACUUM → restarted → purged APFS snapshots 
 
 Result: 92.9M rows → 1617, database 21GB → 33MB 
 Watch for regression: If disk fills again, check evidence_ledger row count first: 
 sqlite3 ~/system/databases/mission-control.db \
 "SELECT COUNT(*) FROM evidence_ledger;"
 
 If millions, the dedup index may have regressed. 
 
 Troubleshooting 
 No Alerts Received 
 
 
 Check LaunchAgent is running: 
 launchctl list | grep health-monitor
 
 If missing, load it manually (see Scheduling section). 
 
 
 Check recent events in database: 
 sqlite3 ~/system/databases/health-events.db \
 "SELECT * FROM health_events ORDER BY timestamp DESC LIMIT 5;"
 
 If no recent entries, the script isn't running. 
 
 
 Check Telegram agent: 
 /opt/homebrew/bin/node ~/system/tools/telegram-agent.js --send 224494223 "Test alert"
 
 If this fails, check Telegram token/chat ID. 
 
 
 Check email delivery: 
 echo "Test email body" | mail -s "Test subject" alem@alai.no
 
 If this fails, check macOS mail configuration. 
 
 
 Check log file: 
 tail -20 ~/system/logs/health-monitor-alerts.log
 
 
 
 False Positives (Unnecessary Alerts) 
 
 Disk: Check for APFS snapshots (see APFS Gotcha section) 
 Memory: vm_stat counts compressed memory; high usage may be normal under heavy load 
 CPU: Sustained load is normal during builds; adjust thresholds if needed 
 
 Alert Spam 
 
 Verify cooldown state file exists:
 cat ~/system/config/health-monitor-alert-state.json
 
 
 If file is corrupted or missing, the script will recreate it on next run 
 CRITICAL alerts bypass cooldown by design 
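
 Tolerating a missing or corrupted state file usually means treating any read or parse failure as an empty state. The sketch below illustrates that approach; loadAlertState and saveAlertState are hypothetical names, and the write-to-temp-then-rename step is a suggested safeguard against half-written files, not something this document confirms the monitor does.

```javascript
// Sketch only: loadAlertState and saveAlertState are hypothetical names.
const fs = require("node:fs");

// Return the stored state, or {} if the file is missing or unreadable as JSON.
function loadAlertState(path) {
  try {
    const parsed = JSON.parse(fs.readFileSync(path, "utf8"));
    // Accept only a plain object; anything else is treated as corrupted.
    const isObject = parsed && typeof parsed === "object" && !Array.isArray(parsed);
    return isObject ? parsed : {};
  } catch {
    return {}; // missing file or invalid JSON
  }
}

// Write to a temp file first, then rename, so readers never see a partial file.
function saveAlertState(path, state) {
  const tmp = `${path}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
  fs.renameSync(tmp, path);
}
```

 Note that an empty state means the next WARN or ALERT is sent immediately, since no cooldown timestamp exists.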
 
 
 Security Notes 
 Slack Integration is DISABLED 
 The original implementation included Slack delivery, but the Slack token is disabled. Do not rely on Slack for alerts.
 Telegram Token 
 The Telegram integration uses ~/system/tools/telegram-agent.js, which reads credentials from a secure location. If alerts stop working, verify the token is still valid:
 /opt/homebrew/bin/node ~/system/tools/telegram-agent.js --verify
 
 
 Related Documentation 
 
 Incident memo: incident_diskfull_evidence_ledger_bloat_2026-06-02.md 
 MC task: #102812 
 Evidence_ledger fix: MC #102796 
 Implementation evidence: /tmp/alai/disk-mem-alarms-102812/flowforge-evidence.md 
 
 
 Last updated: 2026-06-02 (MC #102812) 
 Owner: FlowForge (Kelsey Hightower) 
 Documented by: Skillforge