LightRAG Health Monitoring Runbook

LightRAG Observability Runbook (MC #99400 Updated)

Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545

Purpose

Health monitoring and troubleshooting for the LightRAG observability stack following the MC #99400 hotfix (secrets hygiene + probe topology resolution).

This runbook covers:

Probe topology (three separate probe surfaces)
Canonical CF Access service token
Vault-sourced token injection pattern
Troubleshooting failure modes
Auto-heal removal rationale

Probe Topology

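LightRAG health is monitored via three separate probe surfaces, each with a different purpose and authentication requirement. The fourth row describes the monitoring daemon, which runs the CF tunnel and direct VM probes on a schedule and sends Slack alerts: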

Probe Surface	Script	Endpoint	Auth Required	Layer Probed	Failure Modes Caught	Status Labels
CF Tunnel Probe	`lightrag-health.sh` (check_cf_tunnel)	`https://lightrag.alai.no/health`	YES (CF Access headers)	Cloudflare Access + App	CF auth failure (302), CF outage, app down	HEALTHY | DOWN | AUTH_FAIL | UNREACHABLE
Direct VM Probe	`lightrag-health.sh` (check_vm_direct_access)	`http://20.240.61.67:9621/health`	NO (internal IP)	Azure VM app-layer	App crash, VM down, network unreachable	HEALTHY | DOWN | UNREACHABLE
Boot Probe	`boot.sh` lines 88-94	`http://20.240.61.67:9621/health` (or `${LIGHTRAG_VM_IP}`)	NO (internal IP)	Azure VM app-layer	App crash, VM down	HEALTHY | DOWN | UNREACHABLE
Monitoring Daemon	`com.john.lightrag-monitor` (calls lightrag-health-with-alert.sh)	Same as lightrag-health.sh (both CF tunnel + direct VM)	YES for CF tunnel layer	Both layers + Slack alerts	All failure modes + sends alerts to #alerts	Sends Slack alert on exit code 2

Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.
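The status labels in the table above can be derived from a curl result. The following is a minimal sketch, not the logic in lightrag-health.sh: the function name classify_probe is hypothetical, and the handling of unexpected codes is an assumption. It relies on curl reporting HTTP code 000 when no response arrives, exit code 7 when the connection fails, and exit code 28 on timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a curl result to the runbook's status labels.
#   $1 = HTTP status code reported by curl (000 when no response was received)
#   $2 = curl exit code (7 = failed to connect, 28 = operation timed out)
classify_probe() {
  local http_code="$1" curl_rc="$2"
  case "$curl_rc" in
    28) echo "UNREACHABLE"; return ;;   # timed out without an HTTP response
    7)  echo "DOWN"; return ;;          # connection refused
  esac
  case "$http_code" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;        # CF Access did not accept the service token
    5??)     echo "DOWN" ;;             # HTTP 5xx from the app or proxy
    000)     echo "UNREACHABLE" ;;      # assumption: any other no-response error
    *)       echo "DOWN" ;;             # assumption: unexpected codes count as DOWN
  esac
}

# Intended wiring (not executed here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url"); rc=$?
#   classify_probe "$code" "$rc"
```

The direct VM and boot probes never send CF Access headers, so AUTH_FAIL should only appear for the CF tunnel probe.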

Canonical CF Access Service Token

Token Location

Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)

⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)

Scripts Using This Token

~/system/tools/lightrag-health.sh (CF tunnel check)
~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)

Daemons Using This Token

com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

Rotation Policy

Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)

Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. Current token has no expiration configured.

Vault-Sourced Token Injection Pattern

All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.

Shell Script Pattern

# BW session liveness check (updated location ~/.cache/bw-session)
BW_SESSION_FILE="${HOME}/.cache/bw-session"
if [[ ! -f "$BW_SESSION_FILE" ]] || [[ ! -s "$BW_SESSION_FILE" ]]; then
  echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
  exit 1
fi

BW_SESSION=$(cat "$BW_SESSION_FILE")

# Load CF Access credentials from vault using custom fields (MC #99495 pattern)
function load_cf_credentials() {
  local bw_item
  bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)

  if [[ $? -ne 0 ]] || [[ -z "$bw_item" ]]; then
    echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION."
    return 1
  fi

  # Extract from custom fields array (jq pattern)
  CF_ACCESS_CLIENT_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
  CF_ACCESS_CLIENT_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')

  if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
    echo "ERROR: CF Access credentials not found in Bitwarden item custom fields."
    return 1
  fi

  return 0
}

# Load credentials
if ! load_cf_credentials; then
  echo "FATAL: Cannot proceed without CF Access credentials."
  exit 2
fi

# Use in curl
curl -s \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  https://lightrag.alai.no/health

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

Root Cause:

BW_SESSION expired → ~/.cache/bw-session stale or missing
Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
CF Access service token rotated without updating Bitwarden

Fix:

# Refresh BW session and store the new session key in ~/.cache/bw-session
bw unlock --raw > "${HOME}/.cache/bw-session"

# Verify canonical token is live
BW_SESSION=$(cat "${HOME}/.cache/bw-session")
bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION")
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value'
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value'

# Test with curl
CF_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
CF_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')
curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SECRET" \
  https://lightrag.alai.no/health

DOWN (HTTP 5xx or Connection Refused)

Symptom: Probe returns HTTP 500/502/503 or connection refused.

Root Cause:

LightRAG app crashed
Docker container stopped
Azure VM down
Neo4j backend unavailable

Fix:

# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag

# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"

# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai <user>@<vm-host>

# Check Docker containers
docker ps --filter name=lightrag
docker ps --filter name=neo4j

# Restart if needed
docker restart $(docker ps -q --filter name=lightrag)

UNREACHABLE (Timeout or Network Error)

Symptom: Probe times out after 5-10 seconds without HTTP response.

Root Cause:

ANVIL → Azure VM network path broken
ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
Azure VM firewall blocking port 9621

Fix:

# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check current ISP IP
curl -s https://ifconfig.co

# Compare to NSG rule source IP
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"

Auto-Heal Removal (MC #99400 D5)

Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

Rationale

The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

CEO Decision (2026-05-06)

D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

Archived Files

~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

Replacement Plan

Azure Monitor can directly alert on:

VM metrics (CPU, memory, disk, network)
Application Insights custom metrics (if LightRAG reports to AppInsights)
Log Analytics queries (if LightRAG logs to Log Analytics workspace)

Action groups can:

Send Slack alerts via webhook
Trigger Azure Automation runbooks for auto-remediation
Page on-call engineer via PagerDuty/Opsgenie

This will be scoped in a future child MC per CEO directive.

com.john.lightrag-monitor LaunchAgent

Status: ACTIVE (approved + activated 2026-05-06)

Schedule

Runs every 900 seconds (StartInterval, set under MC #99495). Check the plist's StartInterval or StartCalendarInterval to confirm the current schedule.

What It Probes

Calls ~/system/tools/lightrag-health-with-alert.sh
Runs both CF tunnel probe + direct VM probe
Sends Slack alert to #alerts channel on exit code 2 (errors)
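The alerting behaviour listed above (run the health check, alert only on exit code 2) can be sketched as a small wrapper. This is an illustrative sketch, not the contents of lightrag-health-with-alert.sh: the function name run_health_with_alert and the idea of passing the alert sender as a command are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a health command and alert only on exit code 2.
#   $1 = health check command (e.g. the path to lightrag-health.sh)
#   $2 = alert command that accepts the message as its first argument
run_health_with_alert() {
  local health_cmd="$1" alert_cmd="$2" output rc
  output=$("$health_cmd" 2>&1)   # capture probe output for the alert body
  rc=$?
  if [ "$rc" -eq 2 ]; then
    "$alert_cmd" "LightRAG health check failed (exit 2): $output"
  fi
  return "$rc"                   # preserve the probe's exit code for launchd logs
}

# Intended wiring (not executed here; send_slack_alert is a placeholder name):
#   run_health_with_alert "$HOME/system/tools/lightrag-health.sh" send_slack_alert
```

Keeping the alert sender as a separate command makes the wrapper testable without posting to #alerts.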

Environment Variables

The LaunchAgent plist includes:

LIGHTRAG_VM_IP — Set to 20.240.61.67 (Azure VM internal IP)
BW session sourced from ~/.cache/bw-session (vault-sourced credentials)
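A probe script can honour LIGHTRAG_VM_IP while falling back to the documented address when the variable is unset. The following is a minimal sketch; the function name lightrag_direct_url is hypothetical and is not taken from boot.sh or lightrag-health.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the direct VM health URL.
# Uses LIGHTRAG_VM_IP when set, otherwise the address documented in this runbook.
lightrag_direct_url() {
  local ip="${LIGHTRAG_VM_IP:-20.240.61.67}"
  printf 'http://%s:9621/health\n' "$ip"
}

# Intended wiring (not executed here):
#   curl -s --connect-timeout 5 "$(lightrag_direct_url)"
```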

Manual Trigger

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Known Issues

Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.

Token Rotation Procedure

Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)

When to Rotate

Suspected token compromise
Periodic security hygiene (e.g., every 90 days — CEO decision pending)
After employee offboarding (if token was shared)

Rotation Steps

Generate new service token in Cloudflare dashboard:
- Navigate to Zero Trust → Access → Service Auth
- Create new service token for lightrag.alai.no Access policy
- Copy Client ID + Client Secret

Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

BW_SESSION=$(cat "${HOME}/.cache/bw-session")
bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION"
# Update custom field cf-access-client-id = "<new_client_id>"
# Update custom field cf-access-client-secret = "<new_client_secret>"

Test new token:

curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: <new_client_id>" \
  -H "CF-Access-Client-Secret: <new_client_secret>" \
  https://lightrag.alai.no/health

Verify health probes work:

bash ~/system/tools/lightrag-health.sh
# Should return HEALTHY (exit code 0)

Update cf-access-token-registry.json:

jq '.["lightrag.alai.no"].last_verified = "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"' \
  ~/system/state/cf-access-token-registry.json > /tmp/registry.json
mv /tmp/registry.json ~/system/state/cf-access-token-registry.json

Revoke old token in Cloudflare dashboard (after confirming new token works)
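The Bitwarden update step above can also be scripted, because bw edit accepts encoded item JSON on stdin. The sketch below assumes the item already has custom fields named cf-access-client-id and cf-access-client-secret. The helper name set_cf_fields and the variables ITEM_ID, NEW_ID, and NEW_SECRET are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the two CF Access custom fields in an item's JSON.
# Reads the item JSON on stdin and prints the updated JSON on stdout.
#   $1 = new client ID, $2 = new client secret
set_cf_fields() {
  jq --arg id "$1" --arg secret "$2" '
    .fields |= map(
      if   .name == "cf-access-client-id"     then .value = $id
      elif .name == "cf-access-client-secret" then .value = $secret
      else . end)'
}

# Intended wiring (not executed here):
#   bw get item "$ITEM_ID" --session "$BW_SESSION" \
#     | set_cf_fields "$NEW_ID" "$NEW_SECRET" \
#     | bw encode \
#     | bw edit item "$ITEM_ID" --session "$BW_SESSION"
```

Other custom fields pass through unchanged, so the edit does not disturb unrelated data on the item.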

Known Dependencies

Azure VM NSG — Source IP Sensitivity

The direct VM probe (http://20.240.61.67:9621/health) depends on an NSG allow-rule on Azure VM vm-alai-lightrag that whitelists the ANVIL machine's source IP.

Risk: If ANVIL's ISP rotates its public IP address, the NSG source-IP rule becomes stale and the direct VM probe will silently degrade to UNREACHABLE without any Cloudflare involvement. The CF tunnel probe will still work (it routes through Cloudflare, no direct IP dependency), but direct VM access will fail, giving a misleading "CF layer OK / VM layer unreachable" diagnosis.

Detection: Run curl -s https://ifconfig.co on ANVIL and compare to the NSG rule's source address prefix:

az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"
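The comparison described under Detection can be automated with a small check. This is a sketch under stated assumptions: the function name nsg_rule_is_current is hypothetical, and it assumes the rule holds a single address, with or without a /32 suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the NSG rule prefix matches the current public IP.
#   $1 = current public IP (e.g. from: curl -s https://ifconfig.co)
#   $2 = sourceAddressPrefix from the NSG rule, with or without a /32 suffix
nsg_rule_is_current() {
  local current_ip="$1" rule_prefix="$2"
  [ "${rule_prefix%/32}" = "$current_ip" ]
}

# Intended wiring (not executed here); -o tsv strips the JSON quotes from the value:
#   rule=$(az network nsg rule show -g rg-alai-lightrag \
#            --nsg-name vm-alai-lightragNSG -n allow-lightrag-macstudio \
#            --query "sourceAddressPrefix" -o tsv)
#   nsg_rule_is_current "$(curl -s https://ifconfig.co)" "$rule" || echo "NSG rule is stale"
```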

Long-term fix: Provision an internal DNS hostname for the VM so that the probe endpoint is hostname-based and survives IP rotation without NSG changes. File a child MC when ANVIL ISP rotation becomes a recurring issue.

Related Documentation

Azure LightRAG Migration Runbook
CF-BIC Whitelist Rule — INFRA-CF-001
MC #99400 Evidence Pack: ~/system/evidence/99400-proveo-pass/
Forged Prompt: ~/system/prompts/forged/99400.md
CF Access Token Registry: ~/system/state/cf-access-token-registry.json

Changelog

2026-05-06 — MC #99495 cleanup: StartInterval 900s MTTD reduction, BIC hostname migration to ollama.alai.no, jq custom-fields extraction, Known Dependencies section added
2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)

Document Owner: FlowForge
Last Updated: 2026-05-06 (MC #99495)
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8
