LightRAG Health Monitoring Runbook

LightRAG Observability Runbook (MC #99400 Updated)

Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545


Purpose

Health monitoring and troubleshooting for the LightRAG observability stack following the MC #99400 hotfix (secrets hygiene + probe topology resolution).

This runbook covers:

  • Probe topology (three separate probe surfaces)
  • Canonical CF Access service token
  • Vault-sourced token injection pattern
  • Troubleshooting failure modes
  • Auto-heal removal rationale

Probe Topology

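LightRAG health is monitored via three separate probe surfaces, each with a different purpose and authentication requirement. The monitoring daemon listed with them is not a fourth surface: it runs the lightrag-health.sh probes on a schedule and adds Slack alerting.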

CF Tunnel Probe

  • Script: lightrag-health.sh (check_cf_tunnel)
  • Endpoint: https://lightrag.alai.no/health
  • Auth Required: YES (CF Access headers)
  • Layer Probed: Cloudflare Access + App
  • Failure Modes Caught: CF auth failure (302), CF outage, app down
  • Status Labels: HEALTHY | DOWN | AUTH_FAIL | UNREACHABLE

Direct VM Probe

  • Script: lightrag-health.sh (check_vm_direct_access)
  • Endpoint: http://20.240.61.67:9621/health
  • Auth Required: NO (internal IP)
  • Layer Probed: Azure VM app-layer
  • Failure Modes Caught: App crash, VM down, network unreachable
  • Status Labels: HEALTHY | DOWN | UNREACHABLE

Boot Probe

  • Script: boot.sh lines 88-94
  • Endpoint: http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP})
  • Auth Required: NO (internal IP)
  • Layer Probed: Azure VM app-layer
  • Failure Modes Caught: App crash, VM down
  • Status Labels: HEALTHY | DOWN | UNREACHABLE

Monitoring Daemon

  • Script: com.john.lightrag-monitor (calls lightrag-health-with-alert.sh)
  • Endpoint: Same as lightrag-health.sh (both CF tunnel + direct VM)
  • Auth Required: YES for CF tunnel layer
  • Layer Probed: Both layers + Slack alerts
  • Failure Modes Caught: All failure modes + sends alerts to #alerts
  • Status Labels: Sends Slack alert on exit code 2

Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.
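
The status labels above can be derived from two values that curl already provides: the HTTP status code and curl's own exit code. The sketch below is one possible mapping, based on the symptoms listed under Failure Modes and Troubleshooting. The function name classify_probe is hypothetical, it is not taken from lightrag-health.sh, and the choice to treat any other HTTP code as DOWN is an assumption.

```shell
# Hypothetical helper (not the actual lightrag-health.sh logic).
# $1 = HTTP status code reported by curl ("000" when no response arrived)
# $2 = curl exit code
classify_probe() {
  # Transport-level failures first: curl exit 6 = could not resolve host,
  # 28 = operation timed out, 7 = failed to connect (e.g. connection refused).
  case "$2" in
    6|28) echo "UNREACHABLE"; return 0 ;;
    7)    echo "DOWN"; return 0 ;;
  esac

  case "$1" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;    # CF Access rejected or redirected the request
    5??)     echo "DOWN" ;;         # app or backend error
    000)     echo "UNREACHABLE" ;;  # no HTTP response at all
    *)       echo "DOWN" ;;         # assumption: any other code counts as DOWN
  esac
}

# Example (not executed here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://20.240.61.67:9621/health)
#   classify_probe "$code" "$?"
```

For the direct VM and boot probes the AUTH_FAIL branch should never fire, since those endpoints require no CF Access headers.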


Canonical CF Access Service Token

Token Location

Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)

⚠️ DO NOT use Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 302, token dead)

Scripts Using This Token

  • ~/system/tools/lightrag-health.sh (CF tunnel check)
  • ~/system/boot.sh (optional, not currently used since boot probe uses direct VM IP)

Daemons Using This Token

  • com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

Rotation Policy

Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)

Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. Current token has no expiration configured.


Vault-Sourced Token Injection Pattern

All scripts MUST source CF Access tokens from Bitwarden vault at runtime. NEVER hardcode credentials in scripts.

Shell Script Pattern

# BW session liveness check (-s is true only when the file exists and is non-empty)
if [[ ! -s /tmp/bw-session ]]; then
  echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first." >&2
  exit 1
fi

BW_SESSION=$(cat /tmp/bw-session)

# Load CF Access credentials from vault into CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET
load_cf_credentials() {
  local item_data
  # "local" is declared on its own line so that $? below reflects bw, not local
  item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)

  if [[ $? -ne 0 ]] || [[ -z "$item_data" ]]; then
    echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION." >&2
    return 1
  fi

  # The username field is stored with a "CF-Access-Client-Id: " prefix; strip it
  CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username // empty' | sed 's/^CF-Access-Client-Id: //')
  CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')

  if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
    echo "ERROR: CF Access credentials not found in Bitwarden item." >&2
    return 1
  fi

  return 0
}

# Load credentials
if ! load_cf_credentials; then
  echo "FATAL: Cannot proceed without CF Access credentials." >&2
  exit 2
fi

# Use in curl
curl -s \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  https://lightrag.alai.no/health
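
Passing the secret with -H places it on curl's command line, where other local processes can read it through the process list. A possible hardening, sketched below, feeds the headers to curl through a config stream on stdin instead (curl's --config - option). The function names cf_curl_config and probe_cf_tunnel are hypothetical and are not part of the existing scripts, and the 10-second timeout is an assumption.

```shell
# Hypothetical helpers; assumes CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET
# were already loaded from the vault as in the pattern above.

# Print a curl config carrying the two CF Access headers.
# $1 = client ID, $2 = client secret
cf_curl_config() {
  printf 'header = "CF-Access-Client-Id: %s"\n' "$1"
  printf 'header = "CF-Access-Client-Secret: %s"\n' "$2"
}

# Probe the tunnel endpoint and print only the HTTP status code.
# The headers reach curl via stdin (--config -), not via its argument list.
probe_cf_tunnel() {
  cf_curl_config "$CF_ACCESS_CLIENT_ID" "$CF_ACCESS_CLIENT_SECRET" |
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      --config - https://lightrag.alai.no/health
}
```

printf is a shell builtin, so the secret does not appear in any process's arguments on its way to curl.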

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

Root Cause:

  • BW_SESSION expired → /tmp/bw-session stale or missing
  • Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
  • CF Access service token rotated without updating Bitwarden

Fix:

# Refresh BW session and store the new session key (--raw prints only the key)
bw unlock --raw > /tmp/bw-session
chmod 600 /tmp/bw-session

# Verify canonical token is live (this prints the secret, so run it in a private terminal)
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)" | jq -r '.login.username, .login.password'

# Test with curl (expect HTTP: 200)
curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: $(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session "$(cat /tmp/bw-session)" | sed 's/^CF-Access-Client-Id: //')" \
  -H "CF-Access-Client-Secret: $(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session "$(cat /tmp/bw-session)")" \
  https://lightrag.alai.no/health
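
One of the root causes above is a script still pointing at the deprecated Bitwarden item. A quick way to rule that out is to search the script locations for the old ID. The sketch below assumes nothing beyond grep; the function name find_deprecated_token_refs is hypothetical.

```shell
# Deprecated item ID, as listed under "Canonical CF Access Service Token".
DEPRECATED_ITEM_ID="61d0bf21-2823-4ae3-a141-95046434591a"

# List files under the given paths that still reference the deprecated item.
# Exit status is 0 when at least one reference is found, non-zero otherwise.
find_deprecated_token_refs() {
  # -r: recurse into directories, -l: print matching file names only,
  # -F: treat the ID as a fixed string rather than a regex.
  grep -rlF -- "$DEPRECATED_ITEM_ID" "$@" 2>/dev/null
}

# Example (not executed here):
#   find_deprecated_token_refs ~/system/tools ~/system/boot.sh
```

Any file it prints should be switched to the canonical item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2.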

DOWN (HTTP 5xx or Connection Refused)

Symptom: Probe returns HTTP 500/502/503 or connection refused.

Root Cause:

  • LightRAG app crashed
  • Docker container stopped
  • Azure VM down
  • Neo4j backend unavailable

Fix:

# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag

# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"

# SSH to VM (if access configured); substitute the login user and VM address
ssh -i ~/.ssh/azure_alai <user>@<vm-host>

# Check Docker containers (-a also lists stopped containers)
docker ps -a --filter name=lightrag
docker ps -a --filter name=neo4j

# Restart if needed (-a so that a stopped container is matched as well)
docker restart $(docker ps -aq --filter name=lightrag)

UNREACHABLE (Timeout or Network Error)

Symptom: Probe times out after 5-10 seconds without HTTP response.

Root Cause:

  • ANVIL → Azure VM network path broken
  • ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
  • Azure VM firewall blocking port 9621

Fix:

# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check current ISP IP
curl -s https://ifconfig.co

# Compare to NSG rule source IP
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

# If IPs differ, update NSG rule
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"
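
The update above writes whatever ifconfig.co returns straight into the NSG rule. If the lookup fails or returns an IPv6 address or an error page, the rule would be set to something unusable. The sketch below shows one way to validate the value first. The function name is_ipv4 is hypothetical and this check is not part of the existing scripts.

```shell
# Hypothetical guard: succeed only if $1 is a dotted-quad IPv4 address.
# The body runs in a subshell (parentheses) so the IFS change does not leak.
is_ipv4() (
  # Reject empty input, anything other than digits and dots,
  # and leading, trailing, or doubled dots.
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*..*) exit 1 ;;
  esac

  # Split on dots into positional parameters and require exactly four octets.
  IFS=.
  set -- $1
  [ "$#" -eq 4 ] || exit 1

  # Each octet must be at most three digits and no larger than 255.
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || exit 1
  done
)

# Example (not executed here):
#   NEW_IP=$(curl -s https://ifconfig.co)
#   if is_ipv4 "$NEW_IP"; then
#     az network nsg rule update ... --source-address-prefixes "${NEW_IP}/32"
#   else
#     echo "ERROR: unexpected IP lookup result: $NEW_IP" >&2
#   fi
```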

Auto-Heal Removal (MC #99400 D5)

Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

Rationale

The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

  1. Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
  2. ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, script failed silently.
  3. Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
  4. False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

CEO Decision (2026-05-06)

D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

Archived Files

  • ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
  • ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
  • ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
  • RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

Replacement Plan

Azure Monitor can directly alert on:

  • VM metrics (CPU, memory, disk, network)
  • Application Insights custom metrics (if LightRAG reports to AppInsights)
  • Log Analytics queries (if LightRAG logs to Log Analytics workspace)

Action groups can:

  • Send Slack alerts via webhook
  • Trigger Azure Automation runbooks for auto-remediation
  • Page on-call engineer via PagerDuty/Opsgenie

This will be scoped in a future child MC per CEO directive.


com.john.lightrag-monitor LaunchAgent

Status: ACTIVE (approved + activated 2026-05-06)

Schedule

Runs periodically (check plist StartInterval or StartCalendarInterval for current schedule).
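
A minimal sketch for reading the schedule from the plist, using PlistBuddy, which ships with macOS. The plist path is an assumption (the standard per-user LaunchAgents directory); this runbook does not state where the live plist is installed. The function name print_monitor_schedule is hypothetical.

```shell
# Assumed location of the live plist; override MONITOR_PLIST if it differs.
MONITOR_PLIST="${MONITOR_PLIST:-$HOME/Library/LaunchAgents/com.john.lightrag-monitor.plist}"

# Print StartInterval (seconds) if set, otherwise StartCalendarInterval.
print_monitor_schedule() {
  if [ ! -f "$MONITOR_PLIST" ]; then
    echo "ERROR: plist not found: $MONITOR_PLIST" >&2
    return 1
  fi
  # PlistBuddy's Print command fails when the key is absent,
  # so fall through to the calendar-based key in that case.
  /usr/libexec/PlistBuddy -c 'Print :StartInterval' "$MONITOR_PLIST" 2>/dev/null ||
    /usr/libexec/PlistBuddy -c 'Print :StartCalendarInterval' "$MONITOR_PLIST"
}
```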

What It Probes

  • Calls ~/system/tools/lightrag-health-with-alert.sh
  • Runs both CF tunnel probe + direct VM probe
  • Sends Slack alert to #alerts channel on exit code 2 (errors)
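
The alert-on-exit-code-2 behavior can be pictured as a thin wrapper around the health script. The sketch below is illustrative only: the contents of lightrag-health-with-alert.sh are not reproduced in this runbook, and the function name run_health_with_alert, the message text, and the alert command interface are all hypothetical.

```shell
# Hypothetical wrapper, not the actual lightrag-health-with-alert.sh.
# $1 = command that runs the health check
# $2 = command that sends an alert, called with the message as its argument
run_health_with_alert() {
  "$1"
  status=$?

  # Only exit code 2 triggers an alert; other codes pass through silently.
  if [ "$status" -eq 2 ]; then
    "$2" "LightRAG health check failed (exit code 2)"
  fi

  # Preserve the health check's exit code for the caller (launchd).
  return "$status"
}

# Example (not executed here); send_slack_alert is a placeholder name:
#   run_health_with_alert ~/system/tools/lightrag-health.sh send_slack_alert
```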

Environment Variables

The LaunchAgent plist includes:

  • LIGHTRAG_VM_IP — Set to 20.240.61.67 (Azure VM internal IP)
  • BW session sourced from /tmp/bw-session (vault-sourced credentials)

Manual Trigger

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Known Issues

  • Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.

Token Rotation Procedure

Rotation Cadence: Manual (no TTL configured — rotation policy is OPEN)

When to Rotate

  • Suspected token compromise
  • Periodic security hygiene (e.g., every 90 days — CEO decision pending)
  • After employee offboarding (if token was shared)

Rotation Steps

  1. Generate new service token in Cloudflare dashboard:

    • Navigate to Zero Trust → Access → Service Auth
    • Create new service token for lightrag.alai.no Access policy
    • Copy Client ID + Client Secret
  2. Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

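    # bw edit takes the whole item as encoded JSON: fetch it, patch it with jq, encode it, write it back.
    # NEW_CLIENT_ID / NEW_CLIENT_SECRET are shell variables holding the values copied in step 1.
    bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)" \
      | jq --arg id "$NEW_CLIENT_ID" --arg secret "$NEW_CLIENT_SECRET" '.login.username = ("CF-Access-Client-Id: " + $id) | .login.password = $secret' \
      | bw encode | bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)"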
    
  3. Test new token:

    curl -s -w "\nHTTP: %{http_code}\n" \
      -H "CF-Access-Client-Id: <new_client_id>" \
      -H "CF-Access-Client-Secret: <new_client_secret>" \
      https://lightrag.alai.no/health
    
  4. Verify health probes work:

    bash ~/system/tools/lightrag-health.sh
    # Should return HEALTHY (exit code 0)
    
  5. Update cf-access-token-registry.json:

    jq --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" '.["lightrag.alai.no"].last_verified = $ts' \
      ~/system/state/cf-access-token-registry.json > /tmp/registry.json \
      && mv /tmp/registry.json ~/system/state/cf-access-token-registry.json
    
  6. Revoke old token in Cloudflare dashboard (after confirming new token works)



Changelog

2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)


Document Owner: FlowForge
Last Updated: 2026-05-06
Approved By: CEO (Alem Basic) — MC #99400 deliverable D8