Disk & Memory Health Alarms — What Fires, Where It Lands, How to Test

Why This System Exists

On 2026-06-02, makinja's /System/Volumes/Data volume reached 100% capacity (145Mi free). This caused system-wide failures:

The root cause of the disk fill was evidence_ledger bloat (92.9M duplicate rows, 21GB database — fixed in MC #102796). However, the alert silence was a separate critical gap: the monitoring system recorded breaches but never notified anyone.

This document describes the alarm system built in MC #102812 to ensure health breaches reach the CEO immediately.


What the Monitor Checks

Script: /Users/makinja/system/tools/health-monitor-anvil.js

The monitor runs these checks every 300 seconds (5 minutes):

1. Disk Usage

2. Memory Usage

3. CPU Load

4. Ollama Health


Where Alerts Land

When a threshold is breached, alerts are sent via this three-tier fallback chain:

Primary: Telegram

Fallback 1: Email

Fallback 2: Log File
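The fallback chain above can be sketched roughly as follows. This is an illustration only, not the code in health-monitor-anvil.js: the function name deliverWithFallback and the channel objects (each with a name and a send method) are hypothetical, and the sketch assumes a later tier is tried only when the previous one fails.

```javascript
// Hypothetical sketch of a three-tier fallback: try each channel in order
// and stop at the first one whose send() does not throw.
async function deliverWithFallback(alertText, channels) {
  const errors = [];
  for (const channel of channels) {
    try {
      await channel.send(alertText);
      return { deliveredVia: channel.name, errors };
    } catch (err) {
      // Record the failure and fall through to the next tier.
      errors.push(`${channel.name}: ${err.message}`);
    }
  }
  // Every tier failed; the caller decides how to surface this.
  return { deliveredVia: null, errors };
}
```

Passing the channels in order (Telegram, email, log file) keeps the ordering in one place, and the returned errors array shows which tiers were skipped and why.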

Alert Format

Subject: 🚨 [LEVEL] — [check_name] on [hostname]

[message]

Value: [current_value] | Threshold: [threshold]
Host: [hostname]
Time: [ISO timestamp]

Example:

🚨 CRITICAL — disk on Makinja-sin-Mac-Studio.local

Disk /System/Volumes/Data: 95% used (NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /)

Value: 95% | Threshold: 95%
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
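
For reference, the layout above could be produced by a formatter along these lines. The function name formatAlert and its parameter names are hypothetical; only the output layout is taken from this document.

```javascript
// Hypothetical formatter that reproduces the alert layout shown above.
// `time` is a Date; it is rendered as an ISO 8601 UTC timestamp.
function formatAlert({ level, checkName, hostname, message, value, threshold, time }) {
  return [
    `🚨 ${level} — ${checkName} on ${hostname}`,
    '',
    message,
    '',
    `Value: ${value} | Threshold: ${threshold}`,
    `Host: ${hostname}`,
    `Time: ${time.toISOString()}`,
  ].join('\n');
}
```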

Cooldown and Deduplication

To prevent alert spam during sustained breaches:

State File

Path: ~/system/config/health-monitor-alert-state.json

Contains last-alert timestamps per check:

{
  "disk": 1735854869000,
  "memory": 1735854500000
}

Cooldown Rules

Behavior Table

Scenario                     | Behavior
First disk WARN              | Alert sent immediately
Second disk WARN 5 min later | Suppressed (within cooldown)
Disk CRITICAL 10 min later   | Alert sent (bypasses cooldown)
Check recovers to OK         | Next breach can alert after 60 min from last alert
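
The behavior in the table can be expressed as a small decision function. This is a sketch under assumptions: the name shouldSendAlert is hypothetical, the 60-minute window is taken from the table, and because the state file stores only one timestamp per check, the sketch treats every CRITICAL as bypassing the cooldown. The actual script may apply a narrower rule.

```javascript
const COOLDOWN_MS = 60 * 60 * 1000; // 60 minutes, per the behavior table

// state has the shape of the state file: { [checkName]: lastAlertEpochMs }.
function shouldSendAlert(state, checkName, level, nowMs) {
  if (level === 'critical') return true; // assumption: CRITICAL always bypasses
  const last = state[checkName];
  if (last === undefined) return true;   // no alert recorded for this check yet
  return nowMs - last >= COOLDOWN_MS;    // otherwise wait out the cooldown
}
```

After a send, the caller would write nowMs back into the state under checkName so that the next run sees the updated timestamp.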

The APFS Gotcha

Problem 1: Multiple Volumes

On modern macOS with APFS, user data lives on /System/Volumes/Data, NOT on / (root). A naive df / check would have missed the 2026-06-02 incident entirely.

Solution: The monitor checks BOTH volumes on makinja and reports the higher usage.
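
A minimal sketch of the "report the higher usage" step, assuming the input is the text printed by df -P / /System/Volumes/Data (POSIX format, one Capacity column per line). The function name maxDiskUsagePercent is hypothetical and is not taken from the monitor script.

```javascript
// Returns the highest Capacity percentage found in `df -P` output,
// or null when no data line carries a percentage.
function maxDiskUsagePercent(dfOutput) {
  const percents = dfOutput
    .trim()
    .split('\n')
    .slice(1) // skip the header line
    .map((line) => line.split(/\s+/).find((tok) => /^\d+%$/.test(tok)))
    .filter(Boolean)
    .map((tok) => parseInt(tok, 10));
  return percents.length ? Math.max(...percents) : null;
}

// Possible usage on the host (not executed here):
//   const { execFileSync } = require('node:child_process');
//   const out = execFileSync('df', ['-P', '/', '/System/Volumes/Data'], { encoding: 'utf8' });
//   const pct = maxDiskUsagePercent(out);
```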

Problem 2: Local Time Machine Snapshots

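APFS local snapshots (created by Time Machine) keep freed disk blocks pinned until the snapshot is deleted. As a result, deleting files may not free any space on the volume until the snapshots that reference them are removed.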

Check snapshots:

tmutil listlocalsnapshots /

Delete snapshots:

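# Snapshots are listed as com.apple.TimeMachine.<date>.local;
# deletelocalsnapshots expects only the <date> part.
for snapshot_date in $(tmutil listlocalsnapshots / | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}'); do
  sudo tmutil deletelocalsnapshots "$snapshot_date"
done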

All disk breach alerts on makinja include this caveat:

"NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /"


How to Test the System Safely

Dry-Run Mode (No Actual Alerts)

HEALTH_MONITOR_DRY_RUN=1 /opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js

Output example:

[ALERT DRY-RUN] Would send: 🚨 WARN — cpu_load on Makinja-sin-Mac-Studio.local
5-min load average: 9.16

Value: 9.16 | Threshold: 8
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
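
The dry-run gate could look roughly like the sketch below. It assumes the script checks HEALTH_MONITOR_DRY_RUN for the exact value 1 and prints the "[ALERT DRY-RUN] Would send:" prefix seen in the sample output. The function names isDryRun and dispatch are hypothetical.

```javascript
// Assumption: HEALTH_MONITOR_DRY_RUN=1 means "print the alert, do not send it".
function isDryRun(env = process.env) {
  return env.HEALTH_MONITOR_DRY_RUN === '1';
}

// Calls send(alertText) unless dry-run is active.
// Returns true when the alert was handed to send(), false when it was only logged.
function dispatch(alertText, send, env = process.env, log = console.log) {
  if (isDryRun(env)) {
    log(`[ALERT DRY-RUN] Would send: ${alertText}`);
    return false;
  }
  send(alertText);
  return true;
}
```

Taking env and log as parameters keeps the gate testable without touching the real environment or any delivery channel.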

Force a Synthetic Breach

Option 1: Lower Thresholds Temporarily

Edit /Users/makinja/system/tools/health-monitor-anvil.js:

const THRESHOLDS = {
  cpu_load: { warn: 1, alert: 2, critical: 5 },  // Will trigger immediately
  memory: { warn: 10, alert: 20, critical: 30 },
  disk: { warn: 10, alert: 20, critical: 30 },
};

Run once manually:

/opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js

Check Telegram/email for alert delivery.

IMPORTANT: Restore original thresholds after testing.
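
To see why the lowered thresholds trigger on the first run, here is a sketch of how a measured value might map to a level. It assumes a breach means value >= threshold, which is consistent with the example alert where 95% against a 95% threshold is reported as CRITICAL. The function name classify is hypothetical.

```javascript
// Hypothetical mapping from a measured value to an alert level.
// Checks the most severe threshold first so the highest matching level wins.
function classify(value, { warn, alert, critical }) {
  if (value >= critical) return { level: 'critical', threshold: critical };
  if (value >= alert) return { level: 'alert', threshold: alert };
  if (value >= warn) return { level: 'warn', threshold: warn };
  return { level: 'ok', threshold: null };
}

// With the temporary test thresholds, a 5-min load average of 9.16 is CRITICAL.
console.log(classify(9.16, { warn: 1, alert: 2, critical: 5 }).level);
```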

Option 2: Mock a High Value

Temporarily modify a check function to return a breach value:

function checkDisk() {
  // ... existing code ...
  const maxPct = 96; // Force CRITICAL
  // ... rest of function
}

Verify Alert Delivery

  1. Telegram: Check chat 224494223 for message
  2. Email: Check alem@alai.no inbox
  3. Database: Query health_events table:
sqlite3 ~/system/databases/health-events.db \
  "SELECT timestamp, check_name, status, value, threshold, message 
   FROM health_events 
   WHERE status IN ('warn','alert','critical') 
   ORDER BY timestamp DESC 
   LIMIT 10;"
  4. Alert state: Check cooldown state:
cat ~/system/config/health-monitor-alert-state.json

Scheduling

makinja (Mac Studio)

LaunchAgent: ~/Library/LaunchAgents/com.john.health-monitor.plist

Interval: 300 seconds (5 minutes)

Verify it's loaded:

launchctl list | grep com.john.health-monitor

Expected output:

-	0	com.john.health-monitor

(A PID of - means the job is loaded but not currently running; it starts at the next interval. The second column, 0, is the last exit status.)
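
If the check needs to be scripted, one line of launchctl list output can be parsed as sketched below. The three columns are PID, last exit status, and label. The function name parseLaunchctlLine is hypothetical.

```javascript
// Parses one line of `launchctl list` output: "<PID>\t<status>\t<label>".
// A PID of "-" means the job is loaded but has no running process.
function parseLaunchctlLine(line) {
  const [pid, status, label] = line.trim().split(/\s+/);
  return {
    label,
    running: pid !== '-',
    pid: pid === '-' ? null : Number(pid),
    lastExitStatus: Number(status),
  };
}
```

For the expected output above, this yields running: false and lastExitStatus: 0. A non-zero exit status would suggest the last run failed and is worth checking against the log file.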

Manual reload after changes:

launchctl unload ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist

ANVIL (M3 Ultra Remote Host)

Status: Deployment to ANVIL is pending (as of 2026-06-02).

Deployment steps (when ready):

# 1. Copy script
scp /Users/makinja/system/tools/health-monitor-anvil.js \
    ANVIL:/Users/makinja/system/tools/

# 2. Copy LaunchAgent plist
scp /Users/makinja/Library/LaunchAgents/com.john.health-monitor.plist \
    ANVIL:/Users/makinja/Library/LaunchAgents/

# 3. SSH into ANVIL and activate
ssh ANVIL
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl list | grep health-monitor

# 4. Test run
/opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js

Note: ANVIL will only check df / (no /System/Volumes/Data check, as that's makinja-specific).


Database Logging

All checks (OK and breaches) are recorded to:

Database: ~/system/databases/health-events.db
Table: health_events

Schema

CREATE TABLE health_events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  source TEXT NOT NULL,              -- 'anvil'
  check_name TEXT NOT NULL,          -- 'disk', 'memory', 'cpu_load', 'ollama'
  status TEXT NOT NULL,              -- 'ok', 'warn', 'alert', 'critical', 'error'
  value REAL,                        -- Measured value (e.g., 85.3 for 85.3%)
  threshold REAL,                    -- Threshold that was breached (e.g., 80)
  message TEXT,                      -- Human-readable message
  metadata TEXT                      -- JSON, if needed
);

Query Recent Breaches

sqlite3 ~/system/databases/health-events.db <<SQL
SELECT datetime(timestamp, 'localtime') as time,
       check_name,
       status,
       value || CASE WHEN check_name IN ('disk','memory') THEN '%' ELSE '' END as value,
       message
FROM health_events
WHERE status != 'ok'
  AND timestamp > datetime('now', '-24 hours')
ORDER BY timestamp DESC
LIMIT 20;
SQL

The root cause of the 2026-06-02 disk-full incident was a separate issue, evidence_ledger bloat (MC #102796).

Fix applied:

  1. Added dedup index: UNIQUE INDEX idx_evidence_ledger_dedup ON evidence_ledger(task_id, COALESCE(session_id,''), COALESCE(file_path,''), action)
  2. Pruned backups: mc-backlog-ttl-sweep.sh now keeps only last 3 TTL backups (was: keep all → 14 files/176GB)
  3. Reclaimed space: Stopped litestream → wal_checkpoint(TRUNCATE) + VACUUM → restarted → purged APFS snapshots

Result: 92.9M rows → 1617, database 21GB → 33MB

Watch for regression: If disk fills again, check evidence_ledger row count first:

sqlite3 ~/system/databases/mission-control.db \
  "SELECT COUNT(*) FROM evidence_ledger;"

If the count is in the millions, the dedup index may have regressed.


Troubleshooting

No Alerts Received

  1. Check LaunchAgent is running:

    launchctl list | grep health-monitor
    

    If missing, load it manually (see Scheduling section).

  2. Check recent events in database:

    sqlite3 ~/system/databases/health-events.db \
      "SELECT * FROM health_events ORDER BY timestamp DESC LIMIT 5;"
    

    If no recent entries, the script isn't running.

  3. Check Telegram agent:

    /opt/homebrew/bin/node ~/system/tools/telegram-agent.js --send 224494223 "Test alert"
    

    If this fails, check Telegram token/chat ID.

  4. Check email delivery:

    echo "Test email body" | mail -s "Test subject" alem@alai.no
    

    If this fails, check macOS mail configuration.

  5. Check log file:

    tail -20 ~/system/logs/health-monitor-alerts.log
    

False Positives (Unnecessary Alerts)

Alert Spam


Security Notes

Slack Integration is DISABLED

The original implementation included Slack delivery, but the Slack token is disabled. Do not rely on Slack for alerts.

Telegram Token

The Telegram integration uses ~/system/tools/telegram-agent.js, which reads credentials from a secure location. If alerts stop working, verify the token is still valid:

/opt/homebrew/bin/node ~/system/tools/telegram-agent.js --verify


Last updated: 2026-06-02 (MC #102812)
Owner: FlowForge (Kelsey Hightower)
Documented by: Skillforge


Revision #1
Created 2026-06-02 19:40:51 UTC by John
Updated 2026-06-02 19:40:51 UTC by John