LightRAG Health Monitoring Runbook

LightRAG Observability Runbook (MC #99400 Updated)

Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545

Purpose

Health monitoring and troubleshooting for LightRAG observability stack following MC #99400 hotfix (secrets hygiene + probe topology resolution).

This runbook covers:

Probe topology (three separate probe surfaces)
Canonical CF Access service token
Vault-sourced token injection pattern
Troubleshooting failure modes
Auto-heal removal rationale

Probe Topology

LightRAG health is monitored via THREE separate probe surfaces, each with different purposes and authentication requirements:

Probe Surface	Script	Endpoint	Auth Required	Layer Probed	Failure Modes Caught	Status Labels
CF Tunnel Probe	`lightrag-health.sh` (check_cf_tunnel)	`https://lightrag.alai.no/health`	YES (CF Access headers)	Cloudflare Access + App	CF auth failure (302), CF outage, app down	HEALTHY \| DOWN \| AUTH_FAIL \| UNREACHABLE
Direct VM Probe	`lightrag-health.sh` (check_vm_direct_access)	`http://20.240.61.67:9621/health`	NO (internal IP)	Azure VM app-layer	App crash, VM down, network unreachable	HEALTHY \| DOWN \| UNREACHABLE
Boot Probe	`boot.sh` lines 88-94	`http://20.240.61.67:9621/health` (or `${LIGHTRAG_VM_IP}`)	NO (internal IP)	Azure VM app-layer	App crash, VM down	HEALTHY \| DOWN \| UNREACHABLE
Monitoring Daemon	`com.john.lightrag-monitor` (calls lightrag-health-with-alert.sh)	Same as lightrag-health.sh (both CF tunnel + direct VM)	YES for CF tunnel layer	Both layers + Slack alerts	All failure modes + sends alerts to #alerts	Sends Slack alert on exit code 2

Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.

Canonical CF Access Service Token

Token Location

Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)

⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)

Scripts Using This Token

~/system/tools/lightrag-health.sh (CF tunnel check)
~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)

Daemons Using This Token

com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

Rotation Policy

Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)

Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. Current token has no expiration configured.

Vault-Sourced Token Injection Pattern

All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.

Shell Script Pattern

# BW session liveness check
if [[ ! -f /tmp/bw-session ]] || [[ ! -s /tmp/bw-session ]]; then
  echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
  exit 1
fi

BW_SESSION=$(cat /tmp/bw-session)

# Load CF Access credentials from vault
function load_cf_credentials() {
  local item_data
  item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)
  
  if [[ $? -ne 0 ]] || [[ -z "$item_data" ]]; then
    echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION."
    return 1
  fi
  
  CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username // empty' | sed 's/^CF-Access-Client-Id: //')
  CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')
  
  if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
    echo "ERROR: CF Access credentials not found in Bitwarden item."
    return 1
  fi
  
  return 0
}

# Load credentials
if ! load_cf_credentials; then
  echo "FATAL: Cannot proceed without CF Access credentials."
  exit 2
fi

# Use in curl
curl -s \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  https://lightrag.alai.no/health

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

Root Cause:

BW_SESSION expired → /tmp/bw-session stale or missing
Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
CF Access service token rotated without updating Bitwarden

Fix:

# Refresh BW session
bw unlock
# Copy new session token to /tmp/bw-session

# Verify canonical token is live
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session) | jq -r '.login.username, .login.password'

# Test with curl
curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: $(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session) | sed 's/^CF-Access-Client-Id: //')" \
  -H "CF-Access-Client-Secret: $(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session))" \
  https://lightrag.alai.no/health

DOWN (HTTP 5xx or Connection Refused)

Symptom: Probe returns HTTP 500/502/503 or connection refused.

Root Cause:

LightRAG app crashed
Docker container stopped
Azure VM down
Neo4j backend unavailable

Fix:

# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag

# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"

# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai [email protected]

# Check Docker containers
docker ps --filter name=lightrag
docker ps --filter name=neo4j

# Restart if needed
docker restart $(docker ps -q --filter name=lightrag)

UNREACHABLE (Timeout or Network Error)

Symptom: Probe times out after 5-10 seconds without HTTP response.

Root Cause:

ANVIL → Azure VM network path broken
ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
Azure VM firewall blocking port 9621

Fix:

# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check current ISP IP
curl -s https://ifconfig.co

# Compare to NSG rule source IP
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"

Auto-Heal Removal (MC #99400 D5)

Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

Rationale

The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

CEO Decision (2026-05-06)

D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

Archived Files

~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

Replacement Plan

Azure Monitor can directly alert on:

VM metrics (CPU, memory, disk, network)
Application Insights custom metrics (if LightRAG reports to AppInsights)
Log Analytics queries (if LightRAG logs to Log Analytics workspace)

Action groups can:

Send Slack alerts via webhook
Trigger Azure Automation runbooks for auto-remediation
Page on-call engineer via PagerDuty/Opsgenie

This will be scoped in a future child MC per CEO directive.

com.john.lightrag-monitor LaunchAgent

Status: ACTIVE (approved + activated 2026-05-06)

Schedule

Runs periodically (check plist StartInterval or StartCalendarInterval for current schedule).

What It Probes

Calls ~/system/tools/lightrag-health-with-alert.sh
Runs both CF tunnel probe + direct VM probe
Sends Slack alert to #alerts channel on exit code 2 (errors)

Environment Variables

The LaunchAgent plist includes:

LIGHTRAG_VM_IP — Set to 20.240.61.67 (Azure VM internal IP)
BW session sourced from /tmp/bw-session (vault-sourced credentials)

Manual Trigger

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Known Issues

Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.

Token Rotation Procedure

Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)

When to Rotate

Suspected token compromise
Periodic security hygiene (e.g., every 90 days — CEO decision pending)
After employee offboarding (if token was shared)

Rotation Steps

Generate new service token in Cloudflare dashboard:
- Navigate to Zero Trust → Access → Service Auth
- Create new service token for lightrag.alai.no Access policy
- Copy Client ID + Client Secret

Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session)
# Update username = "CF-Access-Client-Id: <new_client_id>"
# Update password = "<new_client_secret>"

Test new token:

curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: <new_client_id>" \
  -H "CF-Access-Client-Secret: <new_client_secret>" \
  https://lightrag.alai.no/health

Verify health probes work:

bash ~/system/tools/lightrag-health.sh
# Should return HEALTHY (exit code 0)

Update cf-access-token-registry.json:

jq '.["lightrag.alai.no"].last_verified = "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"' \
  ~/system/state/cf-access-token-registry.json > /tmp/registry.json
mv /tmp/registry.json ~/system/state/cf-access-token-registry.json

Revoke old token in Cloudflare dashboard (after confirming new token works)

Azure LightRAG Migration Runbook
CF-BIC Whitelist Rule — INFRA-CF-001
MC #99400 Evidence Pack: ~/system/evidence/99400-proveo-pass/
Forged Prompt: ~/system/prompts/forged/99400.md
CF Access Token Registry: ~/system/state/cf-access-token-registry.json

Changelog

2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)

Document Owner: FlowForge
Last Updated: 2026-05-06
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8

LightRAG Health Monitoring Runbook

LightRAG Observability Runbook (MC #99400 Updated)

Purpose

Probe Topology

Canonical CF Access Service Token

Token Location

Scripts Using This Token

Daemons Using This Token

Rotation Policy

Vault-Sourced Token Injection Pattern

Shell Script Pattern

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

DOWN (HTTP 5xx or Connection Refused)

UNREACHABLE (Timeout or Network Error)

Auto-Heal Removal (MC #99400 D5)

Rationale

CEO Decision (2026-05-06)

Archived Files

Replacement Plan

com.john.lightrag-monitor LaunchAgent

Schedule

What It Probes

Environment Variables

Manual Trigger

Known Issues

Token Rotation Procedure

When to Rotate

Rotation Steps

Related Documentation

Changelog