LightRAG Health Monitoring Runbook
LightRAG Health MonitoringObservability Runbook (MC #99400 Updated)
Status: ACTIVE
Created:Last Updated: 2026-04-2105-06 (MC #99400 hotfix)
Owner: FlowForge (AgentForge)
Related: MC #8545,#99400, INFRA-CF-001MC #8545
Purpose
Continuous healthHealth monitoring and troubleshooting for LightRAG observability stack following MC #99400 hotfix (Azuresecrets VMhygiene + Cloudflare)probe followingtopology the 2026-04-20 outage fix (CF Browser Integrity Check configuration)resolution).
This runbook covers:
HealthProbechecktopologyscript(threeusageseparate probe surfaces)InterpretingCanonicalresultsCF Access service tokenAutomatedVault-sourcedmonitoringtokensetupinjection pattern- Troubleshooting
commonfailureissuesmodes RollbackAuto-healproceduresremoval rationale
ArchitectureProbe OverviewTopology
LightRAG runshealth onis Azuremonitored VMvia (20.240.61.67:9621)THREE separate probe surfaces, each with different purposes and isauthentication exposed via Cloudflare tunnel at https://lightrag.basicconsulting.no. The system depends on:requirements:
Azure VM— Docker containers (lightrag + neo4j)Cloudflare tunnel— Routes traffic through Mac Studio relayCloudflare Access— Authentication via service tokensCloudflare BIC rule— Allows automation clients (Python UA)Ollama upstream—https://ollama.basicconsulting.nofor LLM inference
See: Azure LightRAG Migration Runbook
Health Check Script
Location
~/system/tools/lightrag-health.sh
Manual Execution
bash ~/system/tools/lightrag-health.sh
Output
Terminal:Colored status summary (green/yellow/red per layer)JSON:~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.json(machine-readable)Markdown:~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.md(human-readable)
Exit Codes
0— All checks passed (healthy)1— Warnings detected (degraded but operational)2— Errors detected (critical issues)
Check Layers
Layer 1: Azure VM Health
| Auth Required | Layer Probed | Failure Modes Caught | Status Labels | |||
|---|---|---|---|---|---|---|
| CF Tunnel Probe | (check_cf_tunnel) |
https://lightrag.alai.no/health |
CF auth failure (302), CF outage, app down | HEALTHY | DOWN | AUTH_FAIL | UNREACHABLE | ||
| Direct VM Probe | (check_vm_direct_access) |
http://20.240.61.67:9621/health |
App crash, VM down, network unreachable | HEALTHY | DOWN | UNREACHABLE | ||
| Boot Probe | boot.sh lines 88-94 |
http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP}) |
NO (internal IP) | Azure VM app-layer | App crash, VM down | HEALTHY | DOWN | UNREACHABLE |
| Monitoring Daemon | com.john.lightrag-monitor (calls lightrag-health-with-alert.sh) |
Same as lightrag-health.sh (both CF tunnel + |
YES for CF tunnel layer | Both layers + Slack alerts | All failure modes + sends alerts to #alerts | Sends Slack alert on exit code 2 |
Note:Key Decision (CEO approved): SSHBoot accessprobe currentlyuses unavailabledirect (publickeyVM auth).IP Manualfor verificationspeed required+ viano Azureauth Portal or after SSH key setup.
Layer 2: Cloudflare Network
| ||
| Canonical | Access Token |
|
Critical:Bitwarden Item ID: python_uab42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token check(note: verifiestypo thein CF-BIC-001original ruleitem isname)
⚠️ IfDO thisNOT failsuse withBitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 403,302, automationtoken clients (pi-orchestrator, lightrag-outbox-ingest.js) will break.dead)
LayerScripts 3:Using ApplicationThis HealthToken
| ~/ | (CF |
| ~/ | currently
Daemons Using This Token
com.john.lightrag-monitor(LaunchAgent for scheduled health checks + Slack alerts)
Rotation Policy
Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)
Note: FirstCF queryAccess afterservice idletokens maydo takenot longerexpire (coldby start).default Ifunless timeout,a retryTTL once.
Layerexplicitly 4:set. OllamaCurrent Upstream
token |
Critical: LightRAG requires these specific models. If missing, queries will fail.configured.
InterpretingVault-Sourced ResultsToken Injection Pattern
Green (Exit 0) — Healthy
All criticalscripts checksMUST passed.source SystemCF operational.
tokens from Bitwarden vault at runtime. Action:NEVER hardcode credentials in scripts. None required.
YellowShell (ExitScript 1) — WarningsPattern
Non-critical issues detected. System degraded but operational.
Common warnings:
Action: Review warning details. Monitor next check. Escalate if warnings persist 3+ checks.
Red (Exit 2) — Errors
Critical issues detected. System may be non-operational or partially failed.
Common errors:
Query endpoint timeout (> 30s)HTTP 403 from Python UA (BIC rule disabled)Ollama models missingDirect VM access failed
Action:
Review error details in JSON evidenceFollow troubleshooting section belowIf unresolved after 30 min, consider rollback (see Azure LightRAG Migration Runbook)
Automated Monitoring Setup
LaunchAgent Installation (DRAFT — Pending Alem Approval)
Draft file: ~/system/evidence/lightrag-monitor-launchagent-draft.plist
Schedule: Daily at 9:00 AM (frequent for 4-week monitoring period)
Installation steps (when approved):
# 1.BW Copysession draftliveness check
if [[ ! -f /tmp/bw-session ]] || [[ ! -s /tmp/bw-session ]]; then
echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
exit 1
fi
BW_SESSION=$(cat /tmp/bw-session)
# Load CF Access credentials from vault
function load_cf_credentials() {
local item_data
item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)
if [[ $? -ne 0 ]] || [[ -z "$item_data" ]]; then
echo "ERROR: Failed to LaunchAgentsretrieve cpCF ~Access token from Bitwarden. Check BW_SESSION."
return 1
fi
CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username /system/evidence/lightrag-monitor-launchagent-draft.plist/ \empty' ~| sed 's/^CF-Access-Client-Id: /Library/LaunchAgents/com.john.lightrag-monitor.plist/')
CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')
if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
echo "ERROR: CF Access credentials not found in Bitwarden item."
return 1
fi
return 0
}
# 2. Load thecredentials
agentif launchctl! loadload_cf_credentials; ~/Library/LaunchAgents/com.john.lightrag-monitor.plistthen
echo "FATAL: Cannot proceed without CF Access credentials."
exit 2
fi
# 3. Start immediately (optional)
launchctl start com.john.lightrag-monitor
Manual trigger:
launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
Logs:
stdout:~/system/logs/lightrag-monitor/stdout.logstderr:~/system/logs/lightrag-monitor/stderr.log
Slack Alerts (To Be Implemented)
When LaunchAgent detects exit code 2 (errors), send alert to #alerts channel:
node ~/system/tools/slack.js send alerts "🚨 LightRAG health check FAILED at $(date). Check ~/system/evidence/lightrag-health-*.json"
This requires wrapping the health check scriptUse in a post-execution hook (see plist comments).
Health History Database
Location: ~/system/databases/lightrag-health.db
Schema: ~/system/tools/lightrag-health-db-init.sql
Tables
health_checks— Overall check resultshealth_check_details— Individual layer/check results
Views
health_checks_summary— Last 30 checkshealth_checks_trend— Daily aggregates
Query Examples
Last 10 checks:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT timestamp, overall_status, errors, warnings FROM health_checks ORDER BY created_at DESC LIMIT 10;"
Trend over last 7 days:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT * FROM health_checks_trend WHERE check_date >= date('now', '-7 days');"
All errors in last 24 hours:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT hc.timestamp, hcd.layer, hcd.check_name, hcd.message FROM health_checks hc
JOIN health_check_details hcd ON hc.id = hcd.health_check_id
WHERE hcd.status = 'error' AND hc.created_at >= datetime('now', '-24 hours');"
Note: Database logging will be implemented in next iteration of the health script.
Troubleshooting
Issue: Query endpoint timeout (HTTP 000, 35s)
Possible causes:
First query after idle (cold start)Ollama FORGE overloadedNetwork path issue (Mac Studio → CF → Azure → CF → Mac Studio)
Diagnosis:
# Test Ollama upstream directlycurl
curl -s https://ollama.basicconsulting.no/api/tags \
-H "CF-Access-Client-Id: $(grep CF_ACCESS_CLIENT_ID ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')"CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $(grep CF_ACCESS_CLIENT_SECRET ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" | jq '.models | length'
# Check if FORGE is responding
curl http://10.0.0.2:11434/api/ps
# Test query directly with extended timeout
curl -s --max-time 60 \
-H "CF-Access-Client-Id: ..." \
-H "CF-Access-Client-Secret: ..." \
-H "Content-Type: application/json" \
-X POST \
-d '{"query":"test","mode":"naive","only_need_context":false}'CF_ACCESS_CLIENT_SECRET" \
https://lightrag.basicconsulting.alai.no/queryhealth
Failure Modes and Troubleshooting
AUTH_FAIL (HTTP 302 or 401)
Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.
Root Cause:
- BW_SESSION expired →
/tmp/bw-sessionstale or missing - Wrong Bitwarden item ID (using deprecated
61d0bf21instead of canonicalb42cb5c2) - CF Access service token rotated without updating Bitwarden
Fix:
If cold start: Retry once, should succeedIf FORGE overloaded: Identify competing workload, throttle/stopIf persistent: Check Azure LightRAG Migration Runbook for tunnel troubleshooting
Issue: Python UA blocked (HTTP 403)
Root cause: CF Browser Integrity Check rule disabled or misconfigured.
Diagnosis:
# Refresh BW session
bw unlock
# Copy new session token to /tmp/bw-session
# Verify canonical token is live
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session) | jq -r '.login.username, .login.password'
# Test with Python UAcurl
curl -s -w "\nHTTP: %{http_code}\n" \
-A "Python/3.11 urllib/1.26" \
-H "CF-Access-Client-Id: ...$(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session) | sed 's/^CF-Access-Client-Id: //')" \
-H "CF-Access-Client-Secret: ...$(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session))" \
https://lightrag.basicconsulting.alai.no/health
Fix:
Verify CF Configuration Rule (Ruleset4fc2c122d04d4791a5d17409b097c510, Rulec5990f19f655441180ae886f4512de40)Ensure rule is enabled and expression includeslightrag.basicconsulting.noSee:~/system/rules/cf-proxied-api-bic-whitelist.md
Critical: This is a repeat of the 2026-04-20 outage. If rule is disabled, all automation breaks.
Issue:DOWN Ollama(HTTP models5xx missingor Connection Refused)
Symptoms:Symptom: Probe api_tagscheckreturns failsHTTP 500/502/503 or warnsconnection about missing models.refused.
RequiredRoot models:Cause:
LightRAGqwen2.5-coder:32b-instruct-q8_0(LLMappinference)crashedDockerbge-m3:latest(embeddings)container stopped- Azure VM down
- Neo4j backend unavailable
Fix:
# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag
# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"
# SSH to FORGEVM (10.0.0.2)if access configured)
ssh [email protected]-i ~/.ssh/azure_alai [email protected]
# PullCheck missingDocker modelscontainers
ollamadocker pullps qwen2.5-coder:32b-instruct-q8_0--filter ollamaname=lightrag
pulldocker bge-m3:latestps --filter name=neo4j
# VerifyRestart ollamaif listneeded
|docker greprestart $(docker ps -Eq "(qwen2.5-coder:32b-instruct-q8_0|bge-m3:latest)"--filter name=lightrag)
Issue:UNREACHABLE Direct(Timeout VMor accessNetwork failedError)
Symptoms:Symptom: Probe direct_accesschecktimes returnsout after 5-10 seconds without HTTP error or timeout.response.
Diagnosis:Root Cause:
- ANVIL → Azure VM network path broken
- ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
- Azure VM firewall blocking port 9621
Fix:
# Test direct HTTPVM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health
# Check NSGcurrent rules (Mac StudioISP IP
maycurl have-s changed)https://ifconfig.co
# Compare to NSG rule source IP
az network nsg rule show \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--query "sourceAddressPrefix"
# Compare to current ISP IP
curl -s https://ifconfig.co
Fix:
If MacIPs Studio ISP IP rotated,differ, update NSG rule:
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--source-address-prefixes "${NEW_IP}/32"
Auto-Heal Removal (MC #99400 D5)
Note:Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06
Rationale
The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:
- Wrong probe target: Checked
127.0.0.1:9621(localhost), but LightRAG runs on AzureresourcesVM20.240.61.67:9621. Auto-heal never triggered on real CF outages. - ADC token drift: Relied on Application Default Credentials (
rg-alai-lightrag)ADC)arefornotAzurecurrentlyCLIvisiblecommands. When ADC tokens expired, script failed silently. - Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
- False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.
CEO Decision (2026-05-06)
D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).
Archived Files
~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/- RIP Note:
~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md
Replacement Plan
Azure Monitor can directly alert on:
- VM metrics (CPU, memory, disk, network)
- Application Insights custom metrics (if LightRAG reports to AppInsights)
- Log Analytics queries (if LightRAG logs to Log Analytics workspace)
Action groups can:
- Send Slack alerts via
webhookaz - Trigger Azure Automation runbooks for auto-remediation
- Page on-call engineer via PagerDuty/Opsgenie
This maywill indicatebe differentscoped subscriptionin ora accessfuture issue.child DirectMC HTTPper accessCEO confirms VM is operational.directive.
Rollbackcom.john.lightrag-monitor ProcedureLaunchAgent
IfStatus: LightRAG stack becomes unstableACTIVE (exitapproved code+ 2activated persisting2026-05-06)
Schedule
Runs min,periodically (check plist StartInterval or CEOStartCalendarInterval directive):for current schedule).
Follow:
What AzureIt LightRAG Migration Runbook → Section "Rollback Procedure"
Summary:
Revert consumer URLs fromhttps://lightrag.basicconsulting.notohttp://localhost:9621Restart local Docker LightRAGVerify local serviceOptionally deprovision Azure VM
Expected rollback time: 5-15 minutes
Data loss risk: ZERO (local volumes preserved)
Maintenance
Weekly Tasks (First 4 Weeks)Probes
- Calls
Review health check trend via database query~/system/tools/lightrag-health-with-alert.sh - Runs
CheckbothforCFpersistenttunnelwarnings/errorsprobe + direct VM probe - Sends
VerifySlackevidencealertfilestoare#alertsbeingchannelgenerated Compare latency trends (p50, p95)
After 4 Weeks
If system stable (noon exit code 2 in(errors)
4
weeks):
Environment Variables
The LaunchAgent plist includes:
ReduceLIGHTRAG_VM_IPmonitoring—frequencySet to20.240.61.67(Azure VM internal IP)- BW session sourced from
daily/tmp/bw-session(vault-sourced credentials)
Manual Trigger
launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
Known Issues
- Ollama API 302:
ollama.alai.noreturns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated toweeklyMC Update#99400.LaunchAgentFileseparateStartCalendarIntervaltoMCrunforonOllamaMondaysendpointonlyCF ArchiveAccessoldauthevidence files (keep last 30)configuration.
EvidenceToken FilesRotation Procedure
AllRotation healthCadence: checksManual generate(no timestampedTTL evidence:configured — rotation policy is OPEN)
When to Rotate
- Suspected token compromise
- Periodic security hygiene (e.g., every 90 days — CEO decision pending)
- After employee offboarding (if token was shared)
Rotation Steps
-
Location:Generate new service token in Cloudflare dashboard:- Navigate to
~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.*Zero Trust → Access → Service Auth - Create new service token for
lightrag.alai.noAccess policy - Copy Client ID + Client Secret
- Navigate to
-
Retention:UpdateKeepBitwardenlastitem30 days, archive older to Azure Blob Storage.Example archive command (to be automated)b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:findbw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session) # Update username = "CF-Access-Client-Id: <new_client_id>" # Update password = "<new_client_secret>" -
Test new token:
curl -s -w "\nHTTP: %{http_code}\n" \ -H "CF-Access-Client-Id: <new_client_id>" \ -H "CF-Access-Client-Secret: <new_client_secret>" \ https://lightrag.alai.no/health -
Verify health probes work:
bash ~/system/evidencetools/lightrag-health.sh # Should return HEALTHY (exit code 0) -
Update
cf-access-token-registry.json:jq '.["lightrag.alai.no"].last_verified = "'"$(date -name "lightrag-health-*.json" -mtimeu +30%Y-%m-%dT%H:%M:%SZ)"'"' \| xargs tar -czf~/system/evidence/archive-$(datestate/cf-access-token-registry.json+%Y%m).tar.gz>#/tmp/registry.jsonUploadmvto Azure Blob az storage blob upload \ --account-name plockfrontstaging \ --container-name evidence \ --name lightrag-health-archive-$(date +%Y%m).tar.gz \ --file/tmp/registry.json ~/system/evidence/archive-$(date +%Y%m).tar.gzstate/cf-access-token-registry.json -
Revoke old token in Cloudflare dashboard (after confirming new token works)
Related Documentation
- Azure LightRAG Migration Runbook
— Full migration details + rollback - CF-BIC Whitelist Rule — INFRA-CF-001
- MC
Task#99400#8545Evidence—Pack:Health~/system/evidence/99400-proveo-pass/ - Forged
projectPrompt:~/system/prompts/forged/99400.md - CF Access Token Registry:
~/system/state/cf-access-token-registry.json
Changelog
2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)
Document Owner: FlowForge
Last Updated: 2026-04-2105-06
Approved By: PendingCEO (Alem approvalBasic) for— LaunchAgentMC installation#99400 deliverable D8