LightRAG Health Monitoring Runbook
LightRAG Observability Runbook (MC #99400 Updated)
Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545
Purpose
Health monitoring and troubleshooting for LightRAG observability stack following MC #99400 hotfix (secrets hygiene + probe topology resolution).
This runbook covers:
- Probe topology (three separate probe surfaces)
- Canonical CF Access service token
- Vault-sourced token injection pattern
- Troubleshooting failure modes
- Auto-heal removal rationale
Probe Topology
LightRAG health is monitored via THREE separate probe surfaces, each with different purposes and authentication requirements:
| Probe Surface | Script | Endpoint | Auth Required | Layer Probed | Failure Modes Caught | Status Labels |
|---|---|---|---|---|---|---|
| CF Tunnel Probe | lightrag-health.sh (check_cf_tunnel) | https://lightrag.alai.no/health | YES (CF Access headers) | Cloudflare Access + App | CF auth failure (302), CF outage, app down | HEALTHY / DOWN / AUTH_FAIL / UNREACHABLE |
| Direct VM Probe | lightrag-health.sh (check_vm_direct_access) | http://20.240.61.67:9621/health | NO (internal IP) | Azure VM app-layer | App crash, VM down, network unreachable | HEALTHY / DOWN / UNREACHABLE |
| Boot Probe | boot.sh lines 88-94 | http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP}) | NO (internal IP) | Azure VM app-layer | App crash, VM down | HEALTHY / DOWN / UNREACHABLE |
| Monitoring Daemon | com.john.lightrag-monitor (calls lightrag-health-with-alert.sh) | Same as lightrag-health.sh (both CF tunnel + direct VM) | YES for CF tunnel layer | Both layers + Slack alerts | All failure modes + sends alerts to #alerts | Sends Slack alert on exit code 2 |
Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.
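The status labels in the table above can be derived from a probe's curl exit code and HTTP status. The following is a minimal sketch, not the actual logic in lightrag-health.sh: the function name classify_probe_result and the exact mapping are assumptions based on the failure modes described in this runbook.

```shell
# Hypothetical helper (assumption: the real lightrag-health.sh may use different rules).
# $1 = curl exit code, $2 = HTTP status code reported by curl (-w '%{http_code}')
classify_probe_result() {
  local curl_exit="$1" http_code="$2"
  if [ "$curl_exit" -ne 0 ]; then
    case "$curl_exit" in
      7) echo "DOWN" ;;         # curl exit 7: failed to connect (e.g. connection refused)
      *) echo "UNREACHABLE" ;;  # curl exit 28 (timeout) and other network errors
    esac
    return 0
  fi
  case "$http_code" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;  # CF Access redirect or rejected credentials
    5??)     echo "DOWN" ;;
    *)       echo "DOWN" ;;       # assumption: any other response counts as not healthy
  esac
}

# Usage (network call, shown commented out):
# http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#   "http://${LIGHTRAG_VM_IP:-20.240.61.67}:9621/health")
# classify_probe_result "$?" "$http_code"
```

The direct VM and boot probes never produce AUTH_FAIL in practice because they send no credentials, so the same helper could serve all probe surfaces.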
Canonical CF Access Service Token
Token Location
Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)
⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)
Scripts Using This Token
- ~/system/tools/lightrag-health.sh (CF tunnel check)
- ~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)
Daemons Using This Token
- com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)
Rotation Policy
Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)
Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. Current token has no expiration configured.
Vault-Sourced Token Injection Pattern
All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.
Shell Script Pattern
# BW session liveness check (updated location ~/.cache/bw-session)
BW_SESSION_FILE="${HOME}/.cache/bw-session"
if [[ ! -f "$BW_SESSION_FILE" ]] || [[ ! -s "$BW_SESSION_FILE" ]]; then
    echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
    exit 1
fi
BW_SESSION=$(cat "$BW_SESSION_FILE")
# Load CF Access credentials from vault using custom fields (MC #99495 pattern)
function load_cf_credentials() {
    # Declare separately so that $? below reflects the bw command, not 'local'
    local bw_item
    bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)
    if [[ $? -ne 0 ]] || [[ -z "$bw_item" ]]; then
        echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION."
        return 1
    fi
    # Extract from custom fields array (jq pattern)
    CF_ACCESS_CLIENT_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
    CF_ACCESS_CLIENT_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')
    if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
        echo "ERROR: CF Access credentials not found in Bitwarden item custom fields."
        return 1
    fi
    return 0
}
# Load credentials
if ! load_cf_credentials; then
    echo "FATAL: Cannot proceed without CF Access credentials."
    exit 2
fi
# Use in curl
curl -s \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
https://lightrag.alai.no/health
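Headers passed with -H appear in the curl process arguments, which other local processes can read while the command runs. One possible hardening, sketched below, is to feed the headers to curl as a config on stdin (curl's --config - option). The helper name cf_access_curl_config is hypothetical, and the sketch assumes the credential values contain no double quotes or backslashes.

```shell
# Hypothetical helper: print a curl config carrying the CF Access headers.
# $1 = client ID, $2 = client secret. printf is a shell builtin, so the
# values are not exposed as arguments of a separate process.
cf_access_curl_config() {
  printf 'header = "CF-Access-Client-Id: %s"\n' "$1"
  printf 'header = "CF-Access-Client-Secret: %s"\n' "$2"
}

# Usage (network call, shown commented out):
# cf_access_curl_config "$CF_ACCESS_CLIENT_ID" "$CF_ACCESS_CLIENT_SECRET" \
#   | curl -s --config - https://lightrag.alai.no/health
```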
Failure Modes and Troubleshooting
AUTH_FAIL (HTTP 302 or 401)
Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.
Root Cause:
- BW_SESSION expired → stale or missing ~/.cache/bw-session
- Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
- CF Access service token rotated without updating Bitwarden
Fix:
# Refresh BW session
bw unlock
# Copy new session token to ~/.cache/bw-session
# Verify canonical token is live
BW_SESSION=$(cat "${HOME}/.cache/bw-session")
bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION")
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value'
echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value'
# Test with curl
CF_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
CF_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')
curl -s -w "\nHTTP: %{http_code}\n" \
-H "CF-Access-Client-Id: $CF_ID" \
-H "CF-Access-Client-Secret: $CF_SECRET" \
https://lightrag.alai.no/health
DOWN (HTTP 5xx or Connection Refused)
Symptom: Probe returns HTTP 500/502/503 or connection refused.
Root Cause:
- LightRAG app crashed
- Docker container stopped
- Azure VM down
- Neo4j backend unavailable
Fix:
# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag
# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"
# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai [email protected]
# Check Docker containers
docker ps --filter name=lightrag
docker ps --filter name=neo4j
# Restart if needed
docker restart $(docker ps -q --filter name=lightrag)
UNREACHABLE (Timeout or Network Error)
Symptom: Probe times out after 5-10 seconds without HTTP response.
Root Cause:
- ANVIL → Azure VM network path broken
- ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
- Azure VM firewall blocking port 9621
Fix:
# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health
# Check current ISP IP
curl -s https://ifconfig.co
# Compare to NSG rule source IP
az network nsg rule show \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--query "sourceAddressPrefix"
# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--source-address-prefixes "${NEW_IP}/32"
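Before updating the NSG rule, it may be worth confirming that the looked-up address is a well-formed IPv4 address and that it actually differs from the rule. The sketch below shows one way to do that; the helper names is_ipv4 and nsg_rule_is_stale are hypothetical and not part of the existing scripts.

```shell
# Hypothetical helpers (not part of the original scripts).

# Succeeds only for a dotted-quad IPv4 address with each octet in 0-255.
is_ipv4() {
  case "$1" in
    *[!0-9.]*|*..*|.*|*.|"") return 1 ;;  # reject HTML error pages, IPv6, empty output
  esac
  local old_ifs="$IFS" octet
  IFS=.
  set -- $1            # split on dots into positional parameters
  IFS="$old_ifs"
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    case "$octet" in ????*) return 1 ;; esac
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Succeeds when the NSG rule no longer matches the current public IP.
# $1 = current public IP, $2 = sourceAddressPrefix (with or without /32)
nsg_rule_is_stale() {
  local current="$1" rule="${2%/32}"
  [ "$current" != "$rule" ]
}

# Usage (network calls, shown commented out; -o tsv strips the JSON quotes):
# NEW_IP=$(curl -s https://ifconfig.co)
# RULE_IP=$(az network nsg rule show -g rg-alai-lightrag \
#   --nsg-name vm-alai-lightragNSG -n allow-lightrag-macstudio \
#   --query "sourceAddressPrefix" -o tsv)
# if is_ipv4 "$NEW_IP" && nsg_rule_is_stale "$NEW_IP" "$RULE_IP"; then
#   echo "NSG rule is stale: update it to ${NEW_IP}/32"
# fi
```

The validation step guards against writing an error page or an IPv6 address into the rule if the IP lookup returns something unexpected.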
Auto-Heal Removal (MC #99400 D5)
Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06
Rationale
The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:
- Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
- ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
- Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
- False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.
CEO Decision (2026-05-06)
D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).
Archived Files
- ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
- ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
- ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
- RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md
Replacement Plan
Azure Monitor can directly alert on:
- VM metrics (CPU, memory, disk, network)
- Application Insights custom metrics (if LightRAG reports to AppInsights)
- Log Analytics queries (if LightRAG logs to Log Analytics workspace)
Action groups can:
- Send Slack alerts via webhook
- Trigger Azure Automation runbooks for auto-remediation
- Page on-call engineer via PagerDuty/Opsgenie
This will be scoped in a future child MC per CEO directive.
com.john.lightrag-monitor LaunchAgent
Status: ACTIVE (approved + activated 2026-05-06)
Schedule
Runs periodically. The MC #99495 cleanup set StartInterval to 900s; check the plist StartInterval or StartCalendarInterval to confirm the current schedule.
What It Probes
- Calls ~/system/tools/lightrag-health-with-alert.sh
- Runs both CF tunnel probe + direct VM probe
- Sends Slack alert to #alerts channel on exit code 2 (errors)
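The alert-on-exit-code-2 behaviour described above could look roughly like the sketch below. This is not the content of lightrag-health-with-alert.sh: the function names and the SLACK_WEBHOOK_URL environment variable are assumptions made for illustration.

```shell
# Hypothetical wrapper; the real lightrag-health-with-alert.sh may differ.

# Runs the health check script and returns its exit code.
run_health_check() {
  bash "$HOME/system/tools/lightrag-health.sh"
}

# Posts a plain-text message to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumed environment variable, not defined in this runbook.
send_slack_alert() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\":\"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null
}

# Alerts only on exit code 2 and passes the original exit code through.
health_check_with_alert() {
  local rc=0
  run_health_check || rc=$?
  if [ "$rc" -eq 2 ]; then
    send_slack_alert "LightRAG health check failed (exit code 2)"
  fi
  return "$rc"
}
```

Returning the original exit code keeps the wrapper transparent to launchd and to anyone running it by hand.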
Environment Variables
The LaunchAgent plist includes:
- LIGHTRAG_VM_IP: set to 20.240.61.67 (Azure VM internal IP)
- BW session sourced from ~/.cache/bw-session (vault-sourced credentials)
Manual Trigger
launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
Known Issues
- Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.
Token Rotation Procedure
Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)
When to Rotate
- Suspected token compromise
- Periodic security hygiene (e.g., every 90 days — CEO decision pending)
- After employee offboarding (if token was shared)
Rotation Steps
1. Generate new service token in Cloudflare dashboard:
   - Navigate to Zero Trust → Access → Service Auth
   - Create new service token for lightrag.alai.no Access policy
   - Copy Client ID + Client Secret
2. Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:
       BW_SESSION=$(cat "${HOME}/.cache/bw-session")
       bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION"
       # Update custom field cf-access-client-id = "<new_client_id>" (bare value; the scripts add the header name)
       # Update custom field cf-access-client-secret = "<new_client_secret>"
3. Test new token:
       curl -s -w "\nHTTP: %{http_code}\n" \
         -H "CF-Access-Client-Id: <new_client_id>" \
         -H "CF-Access-Client-Secret: <new_client_secret>" \
         https://lightrag.alai.no/health
4. Verify health probes work:
       bash ~/system/tools/lightrag-health.sh
       # Should return HEALTHY (exit code 0)
5. Update cf-access-token-registry.json:
       jq '.["lightrag.alai.no"].last_verified = "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"' \
         ~/system/state/cf-access-token-registry.json > /tmp/registry.json
       mv /tmp/registry.json ~/system/state/cf-access-token-registry.json
6. Revoke old token in Cloudflare dashboard (after confirming new token works)
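The Bitwarden CLI's edit command takes the modified item as encoded JSON, so the Bitwarden update step is commonly scripted as a get, jq, encode, edit pipeline. The sketch below shows the jq part as a hypothetical helper, set_cf_fields, which assumes the item already has custom fields named cf-access-client-id and cf-access-client-secret.

```shell
# Hypothetical helper: read a Bitwarden item as JSON on stdin and print it
# with the two CF Access custom fields replaced.
# $1 = new client ID, $2 = new client secret.
# The values are passed to jq through its environment rather than as arguments.
set_cf_fields() {
  CF_NEW_ID="$1" CF_NEW_SECRET="$2" jq '
    .fields |= map(
      if   .name == "cf-access-client-id"     then .value = env.CF_NEW_ID
      elif .name == "cf-access-client-secret" then .value = env.CF_NEW_SECRET
      else . end)'
}

# Usage (vault calls, shown commented out):
# ITEM_ID="b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2"
# bw get item "$ITEM_ID" --session "$BW_SESSION" \
#   | set_cf_fields "<new_client_id>" "<new_client_secret>" \
#   | bw encode \
#   | bw edit item "$ITEM_ID" --session "$BW_SESSION"
```

Other custom fields on the item pass through unchanged, so the helper only touches the two credential values.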
Known Dependencies
Azure VM NSG — Source IP Sensitivity
The direct VM probe (http://20.240.61.67:9621/health) depends on an NSG allow-rule on Azure VM vm-alai-lightrag that whitelists the ANVIL machine's source IP.
Risk: If ANVIL's ISP rotates its public IP address, the NSG source-IP rule becomes stale and the direct VM probe will silently degrade to UNREACHABLE without any Cloudflare involvement. The CF tunnel probe will still work (it routes through Cloudflare, no direct IP dependency), but direct VM access will fail, giving a misleading "CF layer OK / VM layer unreachable" diagnosis.
Detection: Run curl -s https://ifconfig.co on ANVIL and compare to the NSG rule's source address prefix:
az network nsg rule show \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--query "sourceAddressPrefix"
Long-term fix: Provision an internal DNS hostname for the VM so that the probe endpoint is hostname-based and survives IP rotation without NSG changes. File a child MC when ANVIL ISP rotation becomes a recurring issue.
Related Documentation
- Azure LightRAG Migration Runbook
- CF-BIC Whitelist Rule — INFRA-CF-001
- MC #99400 Evidence Pack: ~/system/evidence/99400-proveo-pass/
- Forged Prompt: ~/system/prompts/forged/99400.md
- CF Access Token Registry: ~/system/state/cf-access-token-registry.json
Changelog
2026-05-06 — MC #99495 cleanup: StartInterval 900s MTTD reduction, BIC hostname migration to ollama.alai.no, jq custom-fields extraction, Known Dependencies section added
2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)
Document Owner: FlowForge
Last Updated: 2026-05-06 (MC #99495)
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8