LightRAG Health Monitoring Runbook

LightRAG Observability Runbook (MC #99400 Updated)

Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545


Purpose

Health monitoring and troubleshooting for the LightRAG observability stack following the MC #99400 hotfix (secrets hygiene + probe topology resolution).

This runbook covers:

  • Probe topology (three separate probe surfaces)
  • Canonical CF Access service token
  • Vault-sourced token injection pattern
  • Troubleshooting failure modes
  • Auto-heal removal rationale

Probe Topology

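LightRAG health is monitored via THREE separate probe surfaces, each with a different purpose and authentication requirement. The monitoring daemon in the last row of the table is not a fourth surface: it runs the lightrag-health.sh probes on a schedule and adds Slack alerting.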

| Probe Surface | Script | Endpoint | Auth Required | Layer Probed | Failure Modes Caught | Status Labels |
|---|---|---|---|---|---|---|
| CF Tunnel Probe | lightrag-health.sh (check_cf_tunnel) | https://lightrag.alai.no/health | YES (CF Access headers) | Cloudflare Access + App | CF auth failure (302), CF outage, app down | HEALTHY / DOWN / AUTH_FAIL / UNREACHABLE |
| Direct VM Probe | lightrag-health.sh (check_vm_direct_access) | http://20.240.61.67:9621/health | NO (internal IP) | Azure VM app-layer | App crash, VM down, network unreachable | HEALTHY / DOWN / UNREACHABLE |
| Boot Probe | boot.sh lines 88-94 | http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP}) | NO (internal IP) | Azure VM app-layer | App crash, VM down | HEALTHY / DOWN / UNREACHABLE |
| Monitoring Daemon | com.john.lightrag-monitor (calls lightrag-health-with-alert.sh) | Same as lightrag-health.sh (both CF tunnel + direct VM) | YES for CF tunnel layer | Both layers + Slack alerts | All failure modes + sends alerts to #alerts | Sends Slack alert on exit code 2 |

Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.
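
As an illustration of how the status labels in the table could be derived, here is a minimal sketch that maps curl's exit status and the HTTP status code to a label. The function names classify_probe and probe are hypothetical and are not taken from lightrag-health.sh; treating any unexpected HTTP code as DOWN is also an assumption of this sketch.

```shell
# Sketch only: classify_probe and probe are hypothetical names, not the
# functions inside lightrag-health.sh.

# Map curl's exit status and the HTTP status code to a runbook status label.
classify_probe() {
  local curl_exit="$1" http_code="$2"
  case "$curl_exit" in
    0) ;;                              # got an HTTP response; inspect it below
    7) echo "DOWN"; return ;;          # curl exit 7: failed to connect (e.g. refused)
    *) echo "UNREACHABLE"; return ;;   # timeout (28), DNS failure (6), other network errors
  esac
  case "$http_code" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;
    *)       echo "DOWN" ;;            # 5xx and anything unexpected (assumption)
  esac
}

# Probe a URL; extra arguments (e.g. -H headers) are passed through to curl.
# curl is run without -L so a Cloudflare Access 302 is reported, not followed.
probe() {
  local url="$1" http_code curl_exit
  shift
  http_code=$(curl -s -o /dev/null --connect-timeout 5 --max-time 10 \
    -w '%{http_code}' "$@" "$url")
  curl_exit=$?
  classify_probe "$curl_exit" "$http_code"
}
```

For example, `probe http://20.240.61.67:9621/health` would exercise the direct VM surface, while the CF tunnel surface would need the two CF-Access headers passed as extra `-H` arguments.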


Canonical CF Access Service Token

Token Location

Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)

⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)

Scripts Using This Token

  • ~/system/tools/lightrag-health.sh (CF tunnel check)
  • ~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)

Daemons Using This Token

  • com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

Rotation Policy

Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)

Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. The current token has no expiration configured.


MC #99513 — Plist EnvironmentVariables Pattern (LaunchAgent Context)

Date: 2026-05-06
Why: LaunchAgent context has no TTY → BW interactive unlock impossible at scheduled fire times.
Pattern: CF Access credentials injected directly into plist EnvironmentVariables block.
Hardening: Plist mode 0600 (NOT default 0644) prevents launchctl print exposure to other local users.
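
The 0600 hardening is easy to lose when the plist is re-created or edited with a tool that resets permissions. The following is a minimal sketch of a mode check; file_mode and plist_mode_ok are hypothetical helper names, and the fallback between GNU and BSD/macOS `stat` syntax is there only so the sketch runs on either platform.

```shell
# Sketch only: file_mode and plist_mode_ok are hypothetical helper names.

# Print a file's permission bits in octal (e.g. 600).
# GNU stat uses -c, BSD/macOS stat uses -f, so try one and fall back to the other.
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Succeed only if the file is readable and writable by its owner alone.
plist_mode_ok() {
  [ "$(file_mode "$1")" = "600" ]
}

# Usage (commented out so the snippet has no side effects when sourced):
#   plist="$HOME/Library/LaunchAgents/com.john.lightrag-monitor.plist"
#   plist_mode_ok "$plist" || chmod 600 "$plist"
```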

Configuration

  • Plist: ~/Library/LaunchAgents/com.john.lightrag-monitor.plist
  • Required keys in <key>EnvironmentVariables</key>: CF_ACCESS_CLIENT_ID + CF_ACCESS_CLIENT_SECRET
  • Script lightrag-health.sh lines 28-31 use ${VAR:-} parameter expansion to preserve env-injected values; otherwise the script's own initialization clobbers env vars.
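
To make the `${VAR:-}` point concrete, the sketch below contrasts the two initialization styles. Both functions are illustrative only and are not copied from lightrag-health.sh lines 28-31.

```shell
# Sketch only: these two functions are illustrative, not copied from lightrag-health.sh.

# Clobbering initialization: discards whatever launchd injected from the plist.
init_clobbering() {
  CF_ACCESS_CLIENT_ID=""
  CF_ACCESS_CLIENT_SECRET=""
}

# Preserving initialization: keeps an inherited value and defaults to empty
# when the variable is unset, which also keeps "set -u" scripts from aborting.
init_preserving() {
  CF_ACCESS_CLIENT_ID="${CF_ACCESS_CLIENT_ID:-}"
  CF_ACCESS_CLIENT_SECRET="${CF_ACCESS_CLIENT_SECRET:-}"
}
```

With the preserving form, a script can test for non-empty values first and attempt the Bitwarden lookup only when the environment did not supply credentials.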

Known Regression

Trade-off accepted under MC #99513: plaintext CF Access credentials in plist (mode 0600).

  • Blast radius: single CF Access application (revocable in seconds from CF dashboard, not user/account-tier)
  • Supersede plan: when MC #99495 fleet Keychain Services migration lands, replace plist injection with Keychain-backed lookup.

Rotation Procedure (when CF token rotates)

  1. Revoke old token in Cloudflare Access dashboard
  2. Generate new client_id + client_secret in CF
  3. Update BW item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2 (canonical) — login.username + login.password (legacy storage) AND custom fields cf-access-client-id / cf-access-client-secret (post-D2 future state)
  4. Edit plist EnvironmentVariables block with new values
  5. launchctl bootout gui/$(id -u)/com.john.lightrag-monitor && launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.john.lightrag-monitor.plist
  6. launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor — verify exit code 0 or 1

Cross-references

  • MC #99513 — this fix
  • MC #99495 — fleet-wide BW session hygiene + Keychain migration (supersedes this pattern when delivered)
  • MC #99400 — original LightRAG observability runbook (D8 publish)
  • D2 (BW field rename + DEPRECATED untag) DEFERRED 2026-05-06 — pending CEO Bitwarden vault unlock; runtime is now plist-driven, BW lookup is fallback only

Vault-Sourced Token Injection Pattern

All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.

Shell Script Pattern

# BW session liveness check (updated location ~/.cache/bw-session)
BW_SESSION_FILE="${HOME}/.cache/bw-session"
if [[ ! -f "$BW_SESSION_FILE" ]] || [[ ! -s "$BW_SESSION_FILE" ]]; then
  echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
  exit 1
fi

BW_SESSION=$(cat "$BW_SESSION_FILE")

# Load CF Access credentials from vault using custom fields (MC #99495 pattern)
load_cf_credentials() {
  local bw_item

  # Test the command directly instead of inspecting $? afterwards
  if ! bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null) \
     || [[ -z "$bw_item" ]]; then
    echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION." >&2
    return 1
  fi

  # Extract from custom fields array; ".fields[]?" tolerates items with no custom fields
  CF_ACCESS_CLIENT_ID=$(jq -r '.fields[]? | select(.name=="cf-access-client-id") | .value' <<<"$bw_item")
  CF_ACCESS_CLIENT_SECRET=$(jq -r '.fields[]? | select(.name=="cf-access-client-secret") | .value' <<<"$bw_item")

  if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
    echo "ERROR: CF Access credentials not found in Bitwarden item custom fields." >&2
    return 1
  fi

  return 0
}

# Load credentials
if ! load_cf_credentials; then
  echo "FATAL: Cannot proceed without CF Access credentials."
  exit 2
fi

# Use in curl
curl -s \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  https://lightrag.alai.no/health
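
Because the canonical item currently carries the credentials both in login.username / login.password (legacy storage) and in the custom fields, a lookup that prefers the custom field and falls back to the login entry may be useful until D2 lands. This is a sketch under assumptions: bw_credential is a hypothetical helper, jq must be on PATH, and the mapping of login.username to the client ID and login.password to the client secret is assumed rather than confirmed by the runbook.

```shell
# Sketch only: bw_credential is a hypothetical helper; it needs jq on PATH.

# Print a credential from a Bitwarden item's JSON: prefer the named custom
# field, fall back to the given login.* key, print nothing if neither exists.
bw_credential() {
  local item_json="$1" field_name="$2" login_key="$3"
  printf '%s' "$item_json" | jq -r --arg n "$field_name" --arg k "$login_key" \
    '([.fields[]? | select(.name == $n) | .value][0]) // .login[$k] // empty'
}

# Usage (commented out so the snippet has no side effects when sourced):
#   CF_ACCESS_CLIENT_ID=$(bw_credential "$bw_item" cf-access-client-id username)
#   CF_ACCESS_CLIENT_SECRET=$(bw_credential "$bw_item" cf-access-client-secret password)
```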

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

Root Cause:

  • BW_SESSION expired → ~/.cache/bw-session stale or missing
  • Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
  • CF Access service token rotated without updating Bitwarden

Fix:

# Refresh BW session and store the new session key where the scripts expect it
bw unlock --raw > "${HOME}/.cache/bw-session"
chmod 600 "${HOME}/.cache/bw-session"

# Verify canonical token is live
BW_SESSION=$(cat "${HOME}/.cache/bw-session")
bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION")
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value'
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value'

# Test with curl
CF_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
CF_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')
curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SECRET" \
  https://lightrag.alai.no/health

DOWN (HTTP 5xx or Connection Refused)

Symptom: Probe returns HTTP 500/502/503 or connection refused.

Root Cause:

  • LightRAG app crashed
  • Docker container stopped
  • Azure VM down
  • Neo4j backend unavailable

Fix:

# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag

# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"

# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai <user>@<vm-host>

# Check Docker containers (-a also lists stopped containers)
docker ps -a --filter name=lightrag
docker ps -a --filter name=neo4j

# Restart if needed (-a so that a stopped container is matched too)
docker restart $(docker ps -aq --filter name=lightrag)

UNREACHABLE (Timeout or Network Error)

Symptom: Probe times out after 5-10 seconds without HTTP response.

Root Cause:

  • ANVIL → Azure VM network path broken
  • ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
  • Azure VM firewall blocking port 9621

Fix:

# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check current ISP IP
curl -s https://ifconfig.co

# Compare to NSG rule source IP
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"
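
Before pushing a new source prefix into the NSG rule, it may be worth checking that the lookup service actually returned an IPv4 address (and not an error page or an IPv6 address) and that the rule really is stale. The helpers below are a minimal sketch; is_ipv4 and nsg_needs_update are hypothetical names, not part of any existing script.

```shell
# Sketch only: is_ipv4 and nsg_needs_update are hypothetical helper names.

# Succeed if the argument is a dotted-quad IPv4 address with octets 0-255.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  local octet
  for octet in $(printf '%s' "$1" | tr '.' ' '); do
    [ "$octet" -le 255 ] || return 1
  done
}

# Succeed (exit 0) if the NSG rule's source prefix no longer matches the
# current public IP. A trailing /32 on the rule prefix is ignored.
nsg_needs_update() {
  local current_ip="$1" rule_prefix="$2"
  [ "${rule_prefix%/32}" != "$current_ip" ]
}

# Usage (commented out so the snippet has no side effects when sourced):
#   NEW_IP=$(curl -s https://ifconfig.co)
#   RULE=$(az network nsg rule show -g rg-alai-lightrag \
#            --nsg-name vm-alai-lightragNSG -n allow-lightrag-macstudio \
#            --query "sourceAddressPrefix" -o tsv)
#   if is_ipv4 "$NEW_IP" && nsg_needs_update "$NEW_IP" "$RULE"; then
#     echo "NSG rule is stale: $RULE vs $NEW_IP"
#   fi
```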

Auto-Heal Removal (MC #99400 D5)

Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

Rationale

The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

  1. Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
  2. ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
  3. Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
  4. False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

CEO Decision (2026-05-06)

D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

Archived Files

  • ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
  • ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
  • ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
  • RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

Replacement Plan

Azure Monitor can directly alert on:

  • VM metrics (CPU, memory, disk, network)
  • Application Insights custom metrics (if LightRAG reports to AppInsights)
  • Log Analytics queries (if LightRAG logs to Log Analytics workspace)

Action groups can:

  • Send Slack alerts via webhook
  • Trigger Azure Automation runbooks for auto-remediation
  • Page on-call engineer via PagerDuty/Opsgenie

This will be scoped in a future child MC per CEO directive.


com.john.lightrag-monitor LaunchAgent

Status: ACTIVE (approved + activated 2026-05-06)

Schedule

Runs periodically. The changelog records a StartInterval of 900 s (MC #99495); check the plist's StartInterval or StartCalendarInterval for the current schedule.

What It Probes

  • Calls ~/system/tools/lightrag-health-with-alert.sh
  • Runs both CF tunnel probe + direct VM probe
  • Sends Slack alert to #alerts channel on exit code 2 (errors)
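
The alert-on-exit-code-2 behaviour can be pictured with the sketch below. It is not the content of lightrag-health-with-alert.sh: should_alert and run_with_alert are hypothetical names, SLACK_WEBHOOK_URL is an assumed environment variable, and posting a JSON body with a text key to an incoming webhook is one common way to reach Slack, not necessarily the one used here.

```shell
# Sketch only: should_alert and run_with_alert are hypothetical, and
# SLACK_WEBHOOK_URL is an assumed variable that the runbook does not define.

# Only exit code 2 (errors) triggers an alert; 0 and 1 do not.
should_alert() {
  [ "$1" -eq 2 ]
}

# Run a health command, alert on exit code 2, and pass the exit code through.
run_with_alert() {
  local health_cmd="$1" output status
  output=$("$health_cmd" 2>&1)
  status=$?
  if should_alert "$status" && [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    # jq builds the JSON body so quotes and newlines in the output are escaped.
    jq -n --arg text "LightRAG health check failed (exit $status): $output" '{text: $text}' |
      curl -s -X POST -H 'Content-Type: application/json' --data @- \
        "$SLACK_WEBHOOK_URL" >/dev/null
  fi
  return "$status"
}
```

Returning the original exit code keeps launchd's view of the job consistent with what the health script reported.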

Environment Variables

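The LaunchAgent plist includes:

  • LIGHTRAG_VM_IP: set to 20.240.61.67 (Azure VM internal IP)
  • CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET: CF Access credentials injected per MC #99513
  • BW session sourced from ~/.cache/bw-session is a fallback only; the runtime is plist-driven since MC #99513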

Manual Trigger

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Known Issues

  • Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.

Token Rotation Procedure

Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)

When to Rotate

  • Suspected token compromise
  • Periodic security hygiene (e.g., every 90 days — CEO decision pending)
  • After employee offboarding (if token was shared)

Rotation Steps

  1. Generate new service token in Cloudflare dashboard:

    • Navigate to Zero Trust → Access → Service Auth
    • Create new service token for lightrag.alai.no Access policy
    • Copy Client ID + Client Secret
  2. Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

    BW_SESSION=$(cat "${HOME}/.cache/bw-session")
    bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION"
    # Update custom fields: cf-access-client-id = "<new_client_id>"
    # Update custom fields: cf-access-client-secret = "<new_client_secret>"
    
  3. Test new token:

    curl -s -w "\nHTTP: %{http_code}\n" \
      -H "CF-Access-Client-Id: <new_client_id>" \
      -H "CF-Access-Client-Secret: <new_client_secret>" \
      https://lightrag.alai.no/health
    
  4. Verify health probes work:

    bash ~/system/tools/lightrag-health.sh
    # Should return HEALTHY (exit code 0)
    
  5. Update cf-access-token-registry.json:

    # Write to a temp file first; replace the registry only if jq succeeded
    tmp=$(mktemp)
    jq --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      '.["lightrag.alai.no"].last_verified = $ts' \
      ~/system/state/cf-access-token-registry.json > "$tmp" \
      && mv "$tmp" ~/system/state/cf-access-token-registry.json
    
  6. Revoke old token in Cloudflare dashboard (after confirming new token works)


Known Dependencies

Azure VM NSG — Source IP Sensitivity

The direct VM probe (http://20.240.61.67:9621/health) depends on an NSG allow-rule on Azure VM vm-alai-lightrag that allowlists the ANVIL machine's source IP.

Risk: If ANVIL's ISP rotates its public IP address, the NSG source-IP rule becomes stale and the direct VM probe will silently degrade to UNREACHABLE without any Cloudflare involvement. The CF tunnel probe will still work (it routes through Cloudflare, no direct IP dependency), but direct VM access will fail, giving a misleading "CF layer OK / VM layer unreachable" diagnosis.

Detection: Run curl -s https://ifconfig.co on ANVIL and compare to the NSG rule's source address prefix:

az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

Long-term fix: Provision an internal DNS hostname for the VM so that the probe endpoint is hostname-based and survives IP rotation without NSG changes. File a child MC when ANVIL ISP rotation becomes a recurring issue.



Changelog

2026-05-06 — MC #99495 cleanup: StartInterval 900s MTTD reduction, BIC hostname migration to ollama.alai.no, jq custom-fields extraction, Known Dependencies section added
2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)


Document Owner: FlowForge
Last Updated: 2026-05-06 (MC #99495)
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8