LightRAG Health Monitoring Runbook
LightRAG Health Monitoring Runbook
Status: ACTIVE
Created: 2026-04-21
Owner: FlowForge (AgentForge)
Related: MC #8545, INFRA-CF-001
Purpose
Continuous health monitoring for LightRAG stack (Azure VM + Cloudflare) following the 2026-04-20 outage fix (CF Browser Integrity Check configuration).
This runbook covers:
- Health check script usage
- Interpreting results
- Automated monitoring setup
- Troubleshooting common issues
- Rollback procedures
Architecture Overview
LightRAG runs on Azure VM (20.240.61.67:9621) and is exposed via Cloudflare tunnel at https://lightrag.basicconsulting.no. The system depends on:
- Azure VM — Docker containers (lightrag + neo4j)
- Cloudflare tunnel — Routes traffic through Mac Studio relay
- Cloudflare Access — Authentication via service tokens
- Cloudflare BIC rule — Allows automation clients (Python UA)
- Ollama upstream —
https://ollama.basicconsulting.nofor LLM inference
See: Azure LightRAG Migration Runbook
Health Check Script
Location
~/system/tools/lightrag-health.sh
Manual Execution
bash ~/system/tools/lightrag-health.sh
Output
- Terminal: Colored status summary (green/yellow/red per layer)
- JSON:
~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.json(machine-readable) - Markdown:
~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.md(human-readable)
Exit Codes
0— All checks passed (healthy)1— Warnings detected (degraded but operational)2— Errors detected (critical issues)
Check Layers
Layer 1: Azure VM Health
| Check | What it tests | Healthy criteria |
|---|---|---|
direct_access |
Direct HTTP to VM IP:port | HTTP 200, status=healthy |
docker_containers |
Container status via SSH | lightrag + neo4j running, healthy |
Note: SSH access currently unavailable (publickey auth). Manual verification required via Azure Portal or after SSH key setup.
Layer 2: Cloudflare Network
| Check | What it tests | Healthy criteria |
|---|---|---|
cf_tunnel |
HTTPS via CF tunnel | HTTP 200, latency < 2s |
cf_bic_rule |
BIC rule configuration | Rule enabled, covers both endpoints |
python_ua |
Python client access | HTTP 200 with Python UA |
Critical: python_ua check verifies the CF-BIC-001 rule is active. If this fails with HTTP 403, automation clients (pi-orchestrator, lightrag-outbox-ingest.js) will break.
Layer 3: Application Health
| Check | What it tests | Healthy criteria |
|---|---|---|
health_endpoint |
/health endpoint |
status=healthy, pipeline_busy=false |
query_endpoint |
/query with naive mode |
HTTP 200, valid response, < 30s |
Note: First query after idle may take longer (cold start). If timeout, retry once.
Layer 4: Ollama Upstream
| Check | What it tests | Healthy criteria |
|---|---|---|
api_tags |
Ollama model availability | qwen2.5-coder:32b + bge-m3 present |
Critical: LightRAG requires these specific models. If missing, queries will fail.
Interpreting Results
Green (Exit 0) — Healthy
All critical checks passed. System operational.
Action: None required.
Yellow (Exit 1) — Warnings
Non-critical issues detected. System degraded but operational.
Common warnings:
Action: Review warning details. Monitor next check. Escalate if warnings persist 3+ checks.
Red (Exit 2) — Errors
Critical issues detected. System may be non-operational or partially failed.
Common errors:
- Query endpoint timeout (> 30s)
- HTTP 403 from Python UA (BIC rule disabled)
- Ollama models missing
- Direct VM access failed
Action:
- Review error details in JSON evidence
- Follow troubleshooting section below
- If unresolved after 30 min, consider rollback (see Azure LightRAG Migration Runbook)
Automated Monitoring Setup
LaunchAgent Installation (DRAFT — Pending Alem Approval)
Draft file: ~/system/evidence/lightrag-monitor-launchagent-draft.plist
Schedule: Daily at 9:00 AM (frequent for 4-week monitoring period)
Installation steps (when approved):
# 1. Copy draft to LaunchAgents
cp ~/system/evidence/lightrag-monitor-launchagent-draft.plist \
~/Library/LaunchAgents/com.john.lightrag-monitor.plist
# 2. Load the agent
launchctl load ~/Library/LaunchAgents/com.john.lightrag-monitor.plist
# 3. Start immediately (optional)
launchctl start com.john.lightrag-monitor
Manual trigger:
launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
Logs:
- stdout:
~/system/logs/lightrag-monitor/stdout.log - stderr:
~/system/logs/lightrag-monitor/stderr.log
Slack Alerts (To Be Implemented)
When LaunchAgent detects exit code 2 (errors), send alert to #alerts channel:
node ~/system/tools/slack.js send alerts "🚨 LightRAG health check FAILED at $(date). Check ~/system/evidence/lightrag-health-*.json"
This requires wrapping the health check script in a post-execution hook (see plist comments).
Health History Database
Location: ~/system/databases/lightrag-health.db
Schema: ~/system/tools/lightrag-health-db-init.sql
Tables
health_checks— Overall check resultshealth_check_details— Individual layer/check results
Views
health_checks_summary— Last 30 checkshealth_checks_trend— Daily aggregates
Query Examples
Last 10 checks:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT timestamp, overall_status, errors, warnings FROM health_checks ORDER BY created_at DESC LIMIT 10;"
Trend over last 7 days:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT * FROM health_checks_trend WHERE check_date >= date('now', '-7 days');"
All errors in last 24 hours:
sqlite3 ~/system/databases/lightrag-health.db \
"SELECT hc.timestamp, hcd.layer, hcd.check_name, hcd.message FROM health_checks hc
JOIN health_check_details hcd ON hc.id = hcd.health_check_id
WHERE hcd.status = 'error' AND hc.created_at >= datetime('now', '-24 hours');"
Note: Database logging will be implemented in next iteration of the health script.
Troubleshooting
Issue: Query endpoint timeout (HTTP 000, 35s)
Possible causes:
- First query after idle (cold start)
- Ollama FORGE overloaded
- Network path issue (Mac Studio → CF → Azure → CF → Mac Studio)
Diagnosis:
# Test Ollama upstream directly
curl -s https://ollama.basicconsulting.no/api/tags \
-H "CF-Access-Client-Id: $(grep CF_ACCESS_CLIENT_ID ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" \
-H "CF-Access-Client-Secret: $(grep CF_ACCESS_CLIENT_SECRET ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" | jq '.models | length'
# Check if FORGE is responding
curl http://10.0.0.2:11434/api/ps
# Test query directly with extended timeout
curl -s --max-time 60 \
-H "CF-Access-Client-Id: ..." \
-H "CF-Access-Client-Secret: ..." \
-H "Content-Type: application/json" \
-X POST \
-d '{"query":"test","mode":"naive","only_need_context":false}' \
https://lightrag.basicconsulting.no/query
Fix:
- If cold start: Retry once, should succeed
- If FORGE overloaded: Identify competing workload, throttle/stop
- If persistent: Check Azure LightRAG Migration Runbook for tunnel troubleshooting
Issue: Python UA blocked (HTTP 403)
Root cause: CF Browser Integrity Check rule disabled or misconfigured.
Diagnosis:
# Test with Python UA
curl -s -w "\nHTTP: %{http_code}\n" \
-A "Python/3.11 urllib/1.26" \
-H "CF-Access-Client-Id: ..." \
-H "CF-Access-Client-Secret: ..." \
https://lightrag.basicconsulting.no/health
Fix:
- Verify CF Configuration Rule (Ruleset
4fc2c122d04d4791a5d17409b097c510, Rulec5990f19f655441180ae886f4512de40) - Ensure rule is enabled and expression includes
lightrag.basicconsulting.no - See:
~/system/rules/cf-proxied-api-bic-whitelist.md
Critical: This is a repeat of the 2026-04-20 outage. If rule is disabled, all automation breaks.
Issue: Ollama models missing
Required models:
qwen2.5-coder:32b-instruct-q8_0(LLM inference)bge-m3:latest(embeddings)
Fix:
# SSH to FORGE (10.0.0.2)
ssh [email protected]
# Pull missing models
ollama pull qwen2.5-coder:32b-instruct-q8_0
ollama pull bge-m3:latest
# Verify
ollama list | grep -E "(qwen2.5-coder:32b-instruct-q8_0|bge-m3:latest)"
Issue: Direct VM access failed
Symptoms: direct_access check returns HTTP error or timeout.
Diagnosis:
# Test direct HTTP
curl -s --connect-timeout 5 http://20.240.61.67:9621/health
# Check NSG rules (Mac Studio IP may have changed)
az network nsg rule show \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--query "sourceAddressPrefix"
# Compare to current ISP IP
curl -s https://ifconfig.co
Fix: If Mac Studio ISP IP rotated, update NSG rule:
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
-g rg-alai-lightrag \
--nsg-name vm-alai-lightragNSG \
-n allow-lightrag-macstudio \
--source-address-prefixes "${NEW_IP}/32"
Note: Azure resources (rg-alai-lightrag) are not currently visible via az CLI. This may indicate different subscription or access issue. Direct HTTP access confirms VM is operational.
Rollback Procedure
If LightRAG stack becomes unstable (exit code 2 persisting > 30 min, or CEO directive):
Follow: Azure LightRAG Migration Runbook → Section "Rollback Procedure"
Summary:
- Revert consumer URLs from
https://lightrag.basicconsulting.notohttp://localhost:9621 - Restart local Docker LightRAG
- Verify local service
- Optionally deprovision Azure VM
Expected rollback time: 5-15 minutes
Data loss risk: ZERO (local volumes preserved)
Maintenance
Weekly Tasks (First 4 Weeks)
- Review health check trend via database query
- Check for persistent warnings/errors
- Verify evidence files are being generated
- Compare latency trends (p50, p95)
After 4 Weeks
If system stable (no exit code 2 in 4 weeks):
- Reduce monitoring frequency from daily to weekly
- Update LaunchAgent
StartCalendarIntervalto run on Mondays only - Archive old evidence files (keep last 30)
Evidence Files
All health checks generate timestamped evidence:
Location: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.*
Retention: Keep last 30 days, archive older to Azure Blob Storage.
Example archive command (to be automated):
find ~/system/evidence -name "lightrag-health-*.json" -mtime +30 \
| xargs tar -czf ~/system/evidence/archive-$(date +%Y%m).tar.gz
# Upload to Azure Blob
az storage blob upload \
--account-name plockfrontstaging \
--container-name evidence \
--name lightrag-health-archive-$(date +%Y%m).tar.gz \
--file ~/system/evidence/archive-$(date +%Y%m).tar.gz
Related Documentation
- Azure LightRAG Migration Runbook — Full migration details + rollback
- CF-BIC Whitelist Rule — INFRA-CF-001
- MC Task #8545 — Health monitoring project
Changelog
2026-04-21 — Initial version (baseline setup + first run)
Document Owner: FlowForge
Last Updated: 2026-04-21
Approved By: Pending Alem approval for LaunchAgent installation