# Disk & Memory Health Alarms — What Fires, Where It Lands, How to Test

## Why This System Exists

On 2026-06-02, makinja's `/System/Volumes/Data` volume reached 100% capacity (145Mi free). This caused system-wide failures:
- Bash/sshd/mosh-server failed with ENOSPC errors
- CEO was locked out (unable to mosh in from ab-mac)
- **Nobody was alerted** — the health monitor logged breaches to a SQLite database that no one actively monitored

The root cause of the disk fill was evidence_ledger bloat (92.9M duplicate rows, 21GB database — fixed in MC #102796). However, the *alert silence* was a separate critical gap: the monitoring system recorded breaches but never notified anyone.

This document describes the alarm system built in MC #102812 to ensure health breaches reach the CEO immediately.

**Related incident memo:** [incident_diskfull_evidence_ledger_bloat_2026-06-02.md](file:///Users/makinja/.claude/projects/-Users-makinja/memory/incident_diskfull_evidence_ledger_bloat_2026-06-02.md)

---

## What the Monitor Checks

**Script:** `/Users/makinja/system/tools/health-monitor-anvil.js`

The monitor runs these checks every 300 seconds (5 minutes):

### 1. Disk Usage
- **Volumes checked:**
  - **makinja host:** Both `df /` (root) AND `df /System/Volumes/Data` (where user data lives on APFS)
  - **ANVIL host:** Only `df /` (single-volume system)
- **Thresholds:**
  - WARN: 80%
  - ALERT: 90%
  - CRITICAL: 95%
- **Value reported:** Maximum of all checked volumes
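
The disk check described above can be sketched as follows. This is a minimal illustration, assuming `df -P` (POSIX output, where the fifth column is `Capacity`) and a `>=` comparison against each threshold. The names `parseDfCapacity`, `classify`, and `checkVolumes` are hypothetical and are not confirmed to exist in `health-monitor-anvil.js`.

```javascript
// Sketch only: these helper names are illustrative, not taken from the real script.
const { execFileSync } = require('child_process');

// Thresholds as documented above (percent used).
const DISK_THRESHOLDS = { warn: 80, alert: 90, critical: 95 };

// Extract the "Capacity" percentage from `df -P <path>` output.
// Format: one header line, then one data line whose 5th column is e.g. "95%".
function parseDfCapacity(dfOutput) {
  const lines = dfOutput.trim().split('\n');
  const fields = lines[lines.length - 1].trim().split(/\s+/);
  const pct = parseInt(fields[4], 10);
  if (Number.isNaN(pct)) throw new Error('Unexpected df output: ' + dfOutput);
  return pct;
}

// Map a measured value to a status (assumes "at or above" semantics).
function classify(value, t) {
  if (value >= t.critical) return 'critical';
  if (value >= t.alert) return 'alert';
  if (value >= t.warn) return 'warn';
  return 'ok';
}

// Check every volume and report the highest usage.
function checkVolumes(paths) {
  const values = paths.map((p) =>
    parseDfCapacity(execFileSync('df', ['-P', p], { encoding: 'utf8' }))
  );
  const maxPct = Math.max(...values);
  return { value: maxPct, status: classify(maxPct, DISK_THRESHOLDS) };
}
```

On makinja this would be called as `checkVolumes(['/', '/System/Volumes/Data'])`, and on ANVIL as `checkVolumes(['/'])`.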

### 2. Memory Usage
- **Source:** `vm_stat` (macOS memory statistics)
- **Calculation:** (wired + active + compressed pages) / total pages × 100
- **Thresholds:**
  - WARN: 80%
  - ALERT: 90%
  - CRITICAL: 95%
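
The memory calculation above can be sketched like this. It is an illustration under assumptions: the `vm_stat` labels used (`Pages wired down`, `Pages active`, `Pages occupied by compressor`) are the ones current macOS prints, total pages is derived from `os.totalmem()` divided by the page size in the `vm_stat` header, and the helper names `parseVmStat` and `memoryUsedPct` are hypothetical.

```javascript
// Sketch only: parseVmStat and memoryUsedPct are hypothetical helper names.
const os = require('os');

// Parse `vm_stat` text into { pageSize, pages: { label: count } }.
function parseVmStat(text) {
  const sizeMatch = text.match(/page size of (\d+) bytes/);
  if (!sizeMatch) throw new Error('vm_stat header not recognised');
  const pages = {};
  for (const line of text.split('\n')) {
    // Data lines look like: "Pages active:      123456."
    const m = line.match(/^(.+?):\s+(\d+)\.?\s*$/);
    if (m) pages[m[1].trim()] = Number(m[2]);
  }
  return { pageSize: Number(sizeMatch[1]), pages };
}

// (wired + active + compressed pages) / total pages * 100
function memoryUsedPct(text, totalBytes = os.totalmem()) {
  const { pageSize, pages } = parseVmStat(text);
  const used =
    (pages['Pages wired down'] || 0) +
    (pages['Pages active'] || 0) +
    (pages['Pages occupied by compressor'] || 0);
  const totalPages = totalBytes / pageSize;
  return (used / totalPages) * 100;
}
```

Passing `totalBytes` explicitly keeps the function testable with canned `vm_stat` output.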

### 3. CPU Load
- **Source:** `os.loadavg()[1]` (5-minute load average)
- **Thresholds (M3 Ultra = 24 cores):**
  - WARN: 8
  - ALERT: 12
  - CRITICAL: 20

### 4. Ollama Health
- **Check:** HTTP GET to `http://localhost:11434/api/tags` (or `$OLLAMA_HOST`)
- **Status:** OK if responding with valid JSON, ALERT if unreachable/invalid
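
A rough sketch of this check, assuming Node 18 or later (global `fetch` and `AbortSignal.timeout`). Several details are assumptions rather than facts from the script: the 5-second timeout, that `$OLLAMA_HOST` holds a full URL including the scheme, and that "valid JSON" means a body with a `models` array. The names `isValidTagsBody` and `checkOllama` are hypothetical.

```javascript
// Sketch only: helper names, timeout, and validity rule are assumptions.

// Treat the response as valid if it parses and carries a `models` array.
function isValidTagsBody(body) {
  try {
    const parsed = JSON.parse(body);
    return parsed !== null && Array.isArray(parsed.models);
  } catch {
    return false;
  }
}

async function checkOllama(
  baseUrl = process.env.OLLAMA_HOST || 'http://localhost:11434'
) {
  try {
    const res = await fetch(`${baseUrl}/api/tags`, {
      signal: AbortSignal.timeout(5000), // assumed timeout, not from the source
    });
    const ok = res.ok && isValidTagsBody(await res.text());
    return { status: ok ? 'ok' : 'alert' };
  } catch {
    // Unreachable, refused, or timed out.
    return { status: 'alert' };
  }
}
```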

---

## Where Alerts Land

When a threshold is breached, alerts are sent via this **three-tier fallback chain**:

### Primary: Telegram
- **Target:** Chat ID `224494223` (CEO's Telegram user ID)
- **Mechanism:** Calls `~/system/tools/telegram-agent.js --send`
- **Timeout:** 10 seconds

### Fallback 1: Email
- **Target:** `alem@alai.no`
- **Mechanism:** macOS `mail` command
- **Timeout:** 5 seconds

### Fallback 2: Log File
- **Path:** `~/system/logs/health-monitor-alerts.log`
- **Purpose:** Last-resort record if all delivery channels fail
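
The fallback chain above could look roughly like the sketch below. It assumes each channel is tried in order and the chain stops at the first success. `deliverAlert` and `buildChannels` are hypothetical names, and the `telegram-agent.js` argument order is taken from the troubleshooting command later in this document.

```javascript
// Sketch only: the real script's internals are not shown in this document.
const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Try each channel in order; stop at the first one that does not throw.
function deliverAlert(text, channels) {
  const failures = [];
  for (const channel of channels) {
    try {
      channel.send(text);
      return { deliveredVia: channel.name, failures };
    } catch (err) {
      failures.push(`${channel.name}: ${err.message}`);
    }
  }
  return { deliveredVia: null, failures };
}

// Channels with the documented targets and timeouts (10 s and 5 s).
function buildChannels(subject) {
  const home = os.homedir();
  return [
    {
      name: 'telegram',
      send: (text) =>
        execFileSync(
          '/opt/homebrew/bin/node',
          [path.join(home, 'system/tools/telegram-agent.js'), '--send', '224494223', text],
          { timeout: 10000 }
        ),
    },
    {
      name: 'email',
      send: (text) =>
        execFileSync('mail', ['-s', subject, 'alem@alai.no'], { input: text, timeout: 5000 }),
    },
    {
      name: 'logfile',
      send: (text) =>
        fs.appendFileSync(path.join(home, 'system/logs/health-monitor-alerts.log'), text + '\n'),
    },
  ];
}
```

Passing the channels in as a list keeps the ordering explicit and lets a test substitute fake channels without sending anything.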

### Alert Format
```
Subject: 🚨 [LEVEL] — [check_name] on [hostname]

[message]

Value: [current_value] | Threshold: [threshold]
Host: [hostname]
Time: [ISO timestamp]
```

**Example:**
```
🚨 CRITICAL — disk on Makinja-sin-Mac-Studio.local

Disk /System/Volumes/Data: 95% used (NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /)

Value: 95% | Threshold: 95%
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
```
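
For reference, a message in this format could be assembled as in the sketch below. `formatAlert` and its `unit` parameter are hypothetical; the `%` suffix for disk and memory values is inferred from the examples in this document.

```javascript
// Sketch only: formatAlert is a hypothetical helper name.
function formatAlert({ level, check, host, message, value, threshold, unit = '', time = new Date() }) {
  // \u2014 is the long dash used in the documented subject line.
  const subject = `🚨 ${level.toUpperCase()} \u2014 ${check} on ${host}`;
  const body = [
    message,
    '',
    `Value: ${value}${unit} | Threshold: ${threshold}${unit}`,
    `Host: ${host}`,
    `Time: ${time.toISOString()}`,
  ].join('\n');
  return { subject, body };
}
```

A disk breach would pass `unit: '%'`, while `cpu_load` would leave `unit` empty.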

---

## Cooldown and Deduplication

To prevent alert spam during sustained breaches:

### State File
**Path:** `~/system/config/health-monitor-alert-state.json`

Contains last-alert timestamps per check:
```json
{
  "disk": 1735854869000,
  "memory": 1735854500000
}
```

### Cooldown Rules
- **Standard alerts (WARN/ALERT):** Maximum 1 alert per check per 60 minutes
- **CRITICAL alerts:** Always bypass cooldown (immediate notification)

### Behavior Table
| Scenario | Behavior |
|----------|----------|
| First disk WARN | Alert sent immediately |
| Second disk WARN 5 min later | Suppressed (within cooldown) |
| Disk CRITICAL 10 min later | Alert sent (bypasses cooldown) |
| Check recovers to OK, then breaches again | WARN/ALERT is sent only once 60 min have passed since the last alert for that check |
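
The cooldown rules can be expressed as a small pure function. This is a sketch under assumptions: `shouldSend` and `recordSent` are hypothetical names, the state object mirrors the JSON state file (check name mapped to epoch milliseconds), and whether a CRITICAL alert also refreshes the timestamp is not specified here.

```javascript
// Sketch only: shouldSend and recordSent are hypothetical helper names.
const COOLDOWN_MS = 60 * 60 * 1000; // 60 minutes

// state maps check name -> epoch ms of the last alert sent.
function shouldSend(state, check, level, now = Date.now()) {
  if (level === 'critical') return true; // CRITICAL always bypasses cooldown
  const last = state[check];
  return last === undefined || now - last >= COOLDOWN_MS;
}

// Return a new state object with the check's timestamp updated.
function recordSent(state, check, now = Date.now()) {
  return { ...state, [check]: now };
}
```

Walking the table: the first disk WARN is sent because the state has no `disk` entry, a second WARN five minutes later is suppressed, and a CRITICAL ten minutes later is sent regardless of the stored timestamp.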

---

## The APFS Gotcha

### Problem 1: Multiple Volumes
On modern macOS with APFS, user data lives on `/System/Volumes/Data`, NOT on `/` (root). A naive `df /` check would have **missed the 2026-06-02 incident entirely**.

**Solution:** The monitor checks BOTH volumes on makinja and reports the higher usage.

### Problem 2: Local Time Machine Snapshots
APFS local snapshots (created by Time Machine) keep the blocks of deleted files pinned until the snapshot itself is deleted. This means:
- You delete 20GB of files
- `df` still shows disk full
- **The space isn't reclaimed until snapshots are purged**

**Check snapshots:**
```bash
tmutil listlocalsnapshots /
```

**Delete snapshots:**
```bash
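for snapshot in $(tmutil listlocalsnapshots / | grep 'com.apple.TimeMachine'); do
  # deletelocalsnapshots expects only the date stamp, e.g. 2026-06-02-193429
  stamp="${snapshot#com.apple.TimeMachine.}"
  sudo tmutil deletelocalsnapshots "${stamp%.local}"
done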
```

**Alert message includes this caveat:** All disk breach alerts on makinja include the note:
> "NOTE: APFS local snapshots may hide reclaimed space; check tmutil listlocalsnapshots /"

---

## How to Test the System Safely

### Dry-Run Mode (No Actual Alerts)
```bash
HEALTH_MONITOR_DRY_RUN=1 /opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
```

**Output example:**
```
[ALERT DRY-RUN] Would send: 🚨 WARN — cpu_load on Makinja-sin-Mac-Studio.local
5-min load average: 9.16

Value: 9.16 | Threshold: 8
Host: Makinja-sin-Mac-Studio.local
Time: 2026-06-02T19:34:29.983Z
```

### Force a Synthetic Breach

#### Option 1: Lower Thresholds Temporarily
Edit `/Users/makinja/system/tools/health-monitor-anvil.js`:
```javascript
const THRESHOLDS = {
  cpu_load: { warn: 1, alert: 2, critical: 5 },  // Will trigger immediately
  memory: { warn: 10, alert: 20, critical: 30 },
  disk: { warn: 10, alert: 20, critical: 30 },
};
```

Run once manually:
```bash
/opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
```

Check Telegram/email for alert delivery.

**IMPORTANT:** Restore original thresholds after testing.

#### Option 2: Mock a High Value
Temporarily modify a check function to return a breach value:
```javascript
function checkDisk() {
  // ... existing code ...
  const maxPct = 96; // Force CRITICAL
  // ... rest of function
}
```

### Verify Alert Delivery
1. **Telegram:** Check chat 224494223 for message
2. **Email:** Check `alem@alai.no` inbox
3. **Database:** Query `health_events` table:
```bash
sqlite3 ~/system/databases/health-events.db \
  "SELECT timestamp, check_name, status, value, threshold, message 
   FROM health_events 
   WHERE status IN ('warn','alert','critical') 
   ORDER BY timestamp DESC 
   LIMIT 10;"
```
4. **Alert state:** Check cooldown state:
```bash
cat ~/system/config/health-monitor-alert-state.json
```

---

## Scheduling

### makinja (Mac Studio)
**LaunchAgent:** `~/Library/LaunchAgents/com.john.health-monitor.plist`

**Interval:** 300 seconds (5 minutes)

**Verify it's loaded:**
```bash
launchctl list | grep com.john.health-monitor
```

**Expected output:**
```
-	0	com.john.health-monitor
```

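(The columns are PID, last exit status, and label. A `-` in the PID column means the job is loaded but not currently running; it starts on the next interval. The `0` is the exit status of the last run.)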
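
If the check needs to be scripted, a line of `launchctl list` output can be split into its three tab-separated columns. A minimal sketch, with `parseLaunchctlLine` as a hypothetical helper name:

```javascript
// Sketch only: parseLaunchctlLine is a hypothetical helper.
// `launchctl list` prints: PID <tab> last exit status <tab> label
function parseLaunchctlLine(line) {
  const [pid, status, label] = line.trim().split('\t');
  return {
    label,
    running: pid !== '-',
    pid: pid === '-' ? null : Number(pid),
    lastExitStatus: Number(status),
  };
}
```

A non-zero `lastExitStatus` suggests the previous run of the script failed and is worth investigating even when the job is loaded.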

**Manual reload after changes:**
```bash
launchctl unload ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist
```

### ANVIL (M3 Ultra Remote Host)

**Status:** Deployment to ANVIL is pending (as of 2026-06-02).

**Deployment steps (when ready):**
```bash
# 1. Copy script
scp /Users/makinja/system/tools/health-monitor-anvil.js \
    ANVIL:/Users/makinja/system/tools/

# 2. Copy LaunchAgent plist
scp /Users/makinja/Library/LaunchAgents/com.john.health-monitor.plist \
    ANVIL:/Users/makinja/Library/LaunchAgents/

# 3. SSH into ANVIL, then run the remaining commands inside that session
ssh ANVIL
launchctl load ~/Library/LaunchAgents/com.john.health-monitor.plist
launchctl list | grep health-monitor

# 4. Test run
/opt/homebrew/bin/node ~/system/tools/health-monitor-anvil.js
```

**Note:** ANVIL will only check `df /` (no `/System/Volumes/Data` check, as that's makinja-specific).

---

## Database Logging

All checks (OK and breaches) are recorded to:
**Database:** `~/system/databases/health-events.db`
**Table:** `health_events`

### Schema
```sql
CREATE TABLE health_events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT NOT NULL DEFAULT (datetime('now')),
  source TEXT NOT NULL,              -- 'anvil'
  check_name TEXT NOT NULL,          -- 'disk', 'memory', 'cpu_load', 'ollama'
  status TEXT NOT NULL,              -- 'ok', 'warn', 'alert', 'critical', 'error'
  value REAL,                        -- Measured value (e.g., 85.3 for 85.3%)
  threshold REAL,                    -- Threshold that was breached (e.g., 80)
  message TEXT,                      -- Human-readable message
  metadata TEXT                      -- JSON, if needed
);
```

### Query Recent Breaches
```bash
sqlite3 ~/system/databases/health-events.db <<SQL
SELECT datetime(timestamp, 'localtime') as time,
       check_name,
       status,
       value || CASE WHEN check_name IN ('disk','memory') THEN '%' ELSE '' END as value,
       message
FROM health_events
WHERE status != 'ok'
  AND timestamp > datetime('now', '-24 hours')
ORDER BY timestamp DESC
LIMIT 20;
SQL
```

---

## Related Fix: evidence_ledger Bloat

The root cause of the 2026-06-02 disk-full was a separate issue (MC #102796):
- `mc.js` bootstrap inserted `session_id: entry.session_id || null`
- SQLite's `UNIQUE(task_id, session_id, action)` constraint treats NULL as always distinct
- Every cold-start re-imported ~2054 JSONL lines → **92.9M duplicate rows (21GB database)**

**Fix applied:**
1. Added dedup index: `UNIQUE INDEX idx_evidence_ledger_dedup ON evidence_ledger(task_id, COALESCE(session_id,''), COALESCE(file_path,''), action)`
2. Pruned backups: `mc-backlog-ttl-sweep.sh` now keeps only last 3 TTL backups (was: keep all → 14 files/176GB)
3. Reclaimed space: Stopped litestream → `wal_checkpoint(TRUNCATE)` + `VACUUM` → restarted → purged APFS snapshots

**Result:** 92.9M rows → 1617, database 21GB → 33MB

**Watch for regression:** If disk fills again, check `evidence_ledger` row count first:
```bash
sqlite3 ~/system/databases/mission-control.db \
  "SELECT COUNT(*) FROM evidence_ledger;"
```

If the count is in the millions, the dedup index may be missing or no longer enforced.

---

## Troubleshooting

### No Alerts Received

1. **Check LaunchAgent is running:**
   ```bash
   launchctl list | grep health-monitor
   ```
   If missing, load it manually (see Scheduling section).

2. **Check recent events in database:**
   ```bash
   sqlite3 ~/system/databases/health-events.db \
     "SELECT * FROM health_events ORDER BY timestamp DESC LIMIT 5;"
   ```
   If the newest entry is more than a few minutes old, the script isn't running on schedule.

3. **Check Telegram agent:**
   ```bash
   /opt/homebrew/bin/node ~/system/tools/telegram-agent.js --send 224494223 "Test alert"
   ```
   If this fails, check Telegram token/chat ID.

4. **Check email delivery:**
   ```bash
   echo "Test email body" | mail -s "Test subject" alem@alai.no
   ```
   If this fails, check macOS mail configuration.

5. **Check log file:**
   ```bash
   tail -20 ~/system/logs/health-monitor-alerts.log
   ```

### False Positives (Unnecessary Alerts)

- **Disk:** Check for APFS snapshots (see APFS Gotcha section)
- **Memory:** vm_stat counts compressed memory; high usage may be normal under heavy load
- **CPU:** Sustained load is normal during builds; adjust thresholds if needed

### Alert Spam

- Verify cooldown state file exists:
  ```bash
  cat ~/system/config/health-monitor-alert-state.json
  ```
- If file is corrupted or missing, the script will recreate it on next run
- CRITICAL alerts bypass cooldown by design

---

## Security Notes

### Slack Integration is DISABLED
The original implementation included Slack delivery, but **the Slack token is disabled**. Do not rely on Slack for alerts.

### Telegram Token
The Telegram integration uses `~/system/tools/telegram-agent.js`, which reads credentials from a secure location. If alerts stop working, verify the token is still valid:
```bash
/opt/homebrew/bin/node ~/system/tools/telegram-agent.js --verify
```

---

## Related Documentation

- **Incident memo:** [incident_diskfull_evidence_ledger_bloat_2026-06-02.md](file:///Users/makinja/.claude/projects/-Users-makinja/memory/incident_diskfull_evidence_ledger_bloat_2026-06-02.md)
- **MC task:** #102812
- **Evidence_ledger fix:** MC #102796
- **Implementation evidence:** `/tmp/alai/disk-mem-alarms-102812/flowforge-evidence.md`

---

**Last updated:** 2026-06-02 (MC #102812)  
**Owner:** FlowForge (Kelsey Hightower)  
**Documented by:** Skillforge