LightRAG Health Monitoring Runbook
LightRAG Observability Runbook (MC #99400 Updated)
Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545
Purpose
Health monitoring and troubleshooting for the LightRAG observability stack following the MC #99400 hotfix (secrets hygiene + probe topology resolution).
This runbook covers:
- Probe topology (three separate probe surfaces)
- Canonical CF Access service token
- Vault-sourced token injection pattern
- Troubleshooting failure modes
- Auto-heal removal rationale
Probe Topology
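LightRAG health is monitored via THREE separate probe surfaces, each with a different purpose and authentication requirement. The table's fourth row is the monitoring daemon, which is not a separate surface: it runs the CF tunnel and direct VM probes on a schedule and adds Slack alerting.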
| Probe Surface | Script | Endpoint | Auth Required | Layer Probed | Failure Modes Caught | Status Labels |
|---|---|---|---|---|---|---|
| CF Tunnel Probe | lightrag-health.sh (check_cf_tunnel) | https://lightrag.alai.no/health | YES (CF Access headers) | Cloudflare Access + App | CF auth failure (302), CF outage, app down | HEALTHY / DOWN / AUTH_FAIL / UNREACHABLE |
| Direct VM Probe | lightrag-health.sh (check_vm_direct_access) | http://20.240.61.67:9621/health | NO (internal IP) | Azure VM app-layer | App crash, VM down, network unreachable | HEALTHY / DOWN / UNREACHABLE |
| Boot Probe | boot.sh lines 88-94 | http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP}) | NO (internal IP) | Azure VM app-layer | App crash, VM down | HEALTHY / DOWN / UNREACHABLE |
| Monitoring Daemon | com.john.lightrag-monitor (calls lightrag-health-with-alert.sh) | Same as lightrag-health.sh (both CF tunnel + direct VM) | YES for CF tunnel layer | Both layers + Slack alerts | All failure modes + sends alerts to #alerts | Sends Slack alert on exit code 2 |
Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.
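The status labels in the table can be derived from curl's exit code and the HTTP status. The following is a minimal sketch of that mapping, not the logic in lightrag-health.sh itself: `classify_probe` and `probe` are hypothetical names, the timeouts are illustrative, and treating every status other than 200, 302, and 401 as DOWN is an assumption.

```shell
#!/usr/bin/env bash
# classify_probe is a hypothetical helper, not taken from lightrag-health.sh.
# $1 = curl exit code, $2 = HTTP status as reported by curl -w '%{http_code}'.
classify_probe() {
  local curl_exit="$1" http_code="$2"
  case "$curl_exit" in
    0) ;;                              # an HTTP response arrived; inspect the status below
    7) echo "DOWN"; return ;;          # curl exit 7: failed to connect (e.g. connection refused)
    *) echo "UNREACHABLE"; return ;;   # timeout (28), DNS failure (6), other network errors
  esac
  case "$http_code" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;
    *)       echo "DOWN" ;;            # 5xx, and by assumption any other status
  esac
}

# probe URL [extra curl args...]: runs one request and prints the status label.
probe() {
  local url="$1"; shift
  local http_code curl_exit=0
  http_code=$(curl -s -o /dev/null -w '%{http_code}' \
    --connect-timeout 5 --max-time 10 "$@" "$url") || curl_exit=$?
  classify_probe "$curl_exit" "$http_code"
}

# Usage (requires network access):
#   probe http://20.240.61.67:9621/health
#   probe https://lightrag.alai.no/health \
#     -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
#     -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET"
```

Because curl is run without `-L`, a Cloudflare Access redirect surfaces as 302 rather than being followed, which is what lets the CF tunnel probe report AUTH_FAIL.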
Canonical CF Access Service Token
Token Location
Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)
⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)
Scripts Using This Token
- ~/system/tools/lightrag-health.sh (CF tunnel check)
- ~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)
Daemons Using This Token
- com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)
Rotation Policy
Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)
Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. Current token has no expiration configured.
Vault-Sourced Token Injection Pattern
All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.
Shell Script Pattern
# BW session liveness check (-s is false when the file is missing or empty)
if [[ ! -s /tmp/bw-session ]]; then
    echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first." >&2
    exit 1
fi
BW_SESSION=$(cat /tmp/bw-session)

# Load CF Access credentials from vault into CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET
load_cf_credentials() {
    local item_data
    # Test the command directly so a failed or empty lookup is caught
    if ! item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null) \
        || [[ -z "$item_data" ]]; then
        echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION." >&2
        return 1
    fi
    # The username field stores the full header text; strip the header-name prefix
    CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username // empty' | sed 's/^CF-Access-Client-Id: //')
    CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')
    if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
        echo "ERROR: CF Access credentials not found in Bitwarden item." >&2
        return 1
    fi
    return 0
}

# Load credentials
if ! load_cf_credentials; then
    echo "FATAL: Cannot proceed without CF Access credentials." >&2
    exit 2
fi

# Use in curl
curl -s \
    -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
    -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
    https://lightrag.alai.no/health
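One hygiene detail in the pattern above: headers passed with `-H` are visible in the process argument list while curl runs. A minimal sketch of an alternative is to hand the headers to curl as a config read from stdin (`-K -`). The helper name `cf_curl_config` is hypothetical, and the sketch assumes the credentials contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# cf_curl_config is a hypothetical helper: it prints a curl config fragment
# carrying the two CF Access headers. printf is a shell builtin, so the secret
# never appears as an argument of an external process.
# $1 = client ID, $2 = client secret (assumed free of double quotes and backslashes).
cf_curl_config() {
  printf 'header = "CF-Access-Client-Id: %s"\n' "$1"
  printf 'header = "CF-Access-Client-Secret: %s"\n' "$2"
}

# Usage (requires network access and loaded credentials):
#   cf_curl_config "$CF_ACCESS_CLIENT_ID" "$CF_ACCESS_CLIENT_SECRET" \
#     | curl -s -K - https://lightrag.alai.no/health
```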
Failure Modes and Troubleshooting
AUTH_FAIL (HTTP 302 or 401)
Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.
Root Cause:
- BW_SESSION expired → /tmp/bw-session stale or missing
- Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
- CF Access service token rotated without updating Bitwarden
Fix:
# Refresh BW session and write the new session token to /tmp/bw-session
# (umask 077 keeps the file readable by the current user only)
(umask 077; bw unlock --raw > /tmp/bw-session)
# Verify canonical token is live
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)" | jq -r '.login.username, .login.password'
# Test with curl
curl -s -w "\nHTTP: %{http_code}\n" \
-H "CF-Access-Client-Id: $(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session) | sed 's/^CF-Access-Client-Id: //')" \
-H "CF-Access-Client-Secret: $(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session))" \
https://lightrag.alai.no/health
DOWN (HTTP 5xx or Connection Refused)
Symptom: Probe returns HTTP 500/502/503 or connection refused.
Root Cause:
- LightRAG app crashed
- Docker container stopped
- Azure VM down
- Neo4j backend unavailable
Fix:
# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag
# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"
# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai [email protected]
# Check Docker containers (-a also lists stopped containers)
docker ps -a --filter name=lightrag
docker ps -a --filter name=neo4j
# Restart if needed (-a so that a stopped container is matched too)
docker restart $(docker ps -aq --filter name=lightrag)
UNREACHABLE (Timeout or Network Error)
Symptom: Probe times out after 5-10 seconds without HTTP response.
Root Cause:
- ANVIL → Azure VM network path broken
- ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
- Azure VM firewall blocking port 9621
Fix:
# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health
# Check current ISP IP
curl -s https://ifconfig.co
# Compare to NSG rule source IP
az network nsg rule show \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--query "sourceAddressPrefix"
# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--source-address-prefixes "${NEW_IP}/32"
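The compare-then-update step above can be guarded so the NSG rule is only rewritten when the address really changed and the lookup returned a usable IPv4 address. This is a sketch under assumptions: `nsg_needs_update` is a hypothetical helper, and it assumes the rule stores a single IPv4 host, with or without a /32 suffix.

```shell
#!/usr/bin/env bash
# nsg_needs_update is a hypothetical helper.
# $1 = sourceAddressPrefix currently on the NSG rule, $2 = current public IP.
# Returns 0 when the rule needs updating, 1 when it already matches,
# and 2 when $2 is not a dotted-quad IPv4 address (IPv6, empty, or an error page).
nsg_needs_update() {
  local nsg_prefix="$1" current_ip="$2"
  [[ "$current_ip" =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]] || return 2
  # Treat "1.2.3.4" and "1.2.3.4/32" as the same host.
  [[ "${nsg_prefix%/32}" != "$current_ip" ]]
}

# Usage (requires network access and an authenticated az CLI):
#   current_ip=$(curl -s -4 https://ifconfig.co)
#   nsg_prefix=$(az network nsg rule show -g rg-alai-lightrag \
#     --nsg-name vm-alai-lightragNSG -n allow-lightrag-macstudio \
#     --query sourceAddressPrefix -o tsv)
#   if nsg_needs_update "$nsg_prefix" "$current_ip"; then
#     az network nsg rule update -g rg-alai-lightrag \
#       --nsg-name vm-alai-lightragNSG -n allow-lightrag-macstudio \
#       --source-address-prefixes "${current_ip}/32"
#   fi
```

The format check matters because an empty or malformed response from the IP lookup would otherwise be written straight into the rule.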
Auto-Heal Removal (MC #99400 D5)
Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06
Rationale
The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:
- Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
- ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
- Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
- False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.
CEO Decision (2026-05-06)
D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).
Archived Files
- ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
- ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
- ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
- RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md
Replacement Plan
Azure Monitor can directly alert on:
- VM metrics (CPU, memory, disk, network)
- Application Insights custom metrics (if LightRAG reports to AppInsights)
- Log Analytics queries (if LightRAG logs to Log Analytics workspace)
Action groups can:
- Send Slack alerts via webhook
- Trigger Azure Automation runbooks for auto-remediation
- Page on-call engineer via PagerDuty/Opsgenie
This will be scoped in a future child MC per CEO directive.
com.john.lightrag-monitor LaunchAgent
Status: ACTIVE (approved + activated 2026-05-06)
Schedule
Runs periodically (check plist StartInterval or StartCalendarInterval for current schedule).
What It Probes
- Calls ~/system/tools/lightrag-health-with-alert.sh
- Runs both CF tunnel probe + direct VM probe
- Sends Slack alert to #alerts channel on exit code 2 (errors)
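The alert-on-exit-code-2 behaviour can be pictured as a thin wrapper around the health script. The sketch below is illustrative only and may differ from what lightrag-health-with-alert.sh actually does: `run_with_alert` and the `SLACK_WEBHOOK_URL` variable are hypothetical, and posting through a Slack incoming webhook is an assumption.

```shell
#!/usr/bin/env bash
# run_with_alert is a hypothetical wrapper: it runs the given command, posts a
# Slack message only when that command exits with code 2, and passes the
# command's exit code through unchanged.
# SLACK_WEBHOOK_URL is an assumed environment variable (incoming webhook URL).
run_with_alert() {
  local rc=0
  "$@" || rc=$?
  if [[ "$rc" -eq 2 ]]; then
    # A failed alert post must not mask the health check's own exit code.
    curl -s -X POST -H 'Content-Type: application/json' \
      --data '{"text": "LightRAG health check failed (exit code 2)"}' \
      "$SLACK_WEBHOOK_URL" >/dev/null || true
  fi
  return "$rc"
}

# Usage:
#   run_with_alert bash ~/system/tools/lightrag-health.sh
```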
Environment Variables
The LaunchAgent plist includes:
- LIGHTRAG_VM_IP: set to 20.240.61.67 (Azure VM internal IP)
- BW session sourced from /tmp/bw-session (vault-sourced credentials)
Manual Trigger
launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
Known Issues
- Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File a separate MC for Ollama endpoint CF Access auth configuration.
Token Rotation Procedure
Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)
When to Rotate
- Suspected token compromise
- Periodic security hygiene (e.g., every 90 days — CEO decision pending)
- After employee offboarding (if token was shared)
Rotation Steps
1. Generate new service token in Cloudflare dashboard:
   - Navigate to Zero Trust → Access → Service Auth
   - Create new service token for lightrag.alai.no Access policy
   - Copy Client ID + Client Secret
2. Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:
       bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session)
       # Update username = "CF-Access-Client-Id: <new_client_id>"
       # Update password = "<new_client_secret>"
3. Test new token:
       curl -s -w "\nHTTP: %{http_code}\n" \
         -H "CF-Access-Client-Id: <new_client_id>" \
         -H "CF-Access-Client-Secret: <new_client_secret>" \
         https://lightrag.alai.no/health
4. Verify health probes work:
       bash ~/system/tools/lightrag-health.sh
       # Should return HEALTHY (exit code 0)
5. Update cf-access-token-registry.json:
       jq '.["lightrag.alai.no"].last_verified = "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"' \
         ~/system/state/cf-access-token-registry.json > /tmp/registry.json
       mv /tmp/registry.json ~/system/state/cf-access-token-registry.json
6. Revoke old token in Cloudflare dashboard (after confirming new token works)
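The Bitwarden update step can also be done non-interactively. The sketch below uses the Bitwarden CLI's get, jq, encode, edit pipeline. `rotate_bw_item` is a hypothetical helper name, and the username format follows the convention described in the rotation steps. Note that `jq --arg` briefly exposes the new secret in the process list, so this is an assumption-laden convenience rather than a hardened procedure.

```shell
#!/usr/bin/env bash
# rotate_bw_item is a hypothetical helper.
# $1 = Bitwarden item ID, $2 = new client ID, $3 = new client secret, $4 = BW session token.
rotate_bw_item() {
  local item_id="$1" new_id="$2" new_secret="$3" session="$4"
  # Fetch the item, rewrite the two login fields, then base64-encode the JSON
  # and pipe it back into `bw edit item`, which reads the encoded item from stdin.
  bw get item "$item_id" --session "$session" \
    | jq --arg u "CF-Access-Client-Id: $new_id" --arg p "$new_secret" \
         '.login.username = $u | .login.password = $p' \
    | bw encode \
    | bw edit item "$item_id" --session "$session"
}

# Usage (requires an unlocked vault):
#   rotate_bw_item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" \
#     "<new_client_id>" "<new_client_secret>" "$(cat /tmp/bw-session)"
```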
Related Documentation
- Azure LightRAG Migration Runbook
- CF-BIC Whitelist Rule — INFRA-CF-001
- MC #99400 Evidence Pack: ~/system/evidence/99400-proveo-pass/
- Forged Prompt: ~/system/prompts/forged/99400.md
- CF Access Token Registry: ~/system/state/cf-access-token-registry.json
Changelog
2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)
Document Owner: FlowForge
Last Updated: 2026-05-06
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8