LightRAG Health Monitoring Runbook

LightRAG Health Observability Runbook (MC #99400 Updated)

Status: ACTIVE
Last Updated: 2026-05-06 (MC #99400 hotfix)
Owner: FlowForge
Related: MC #99400, MC #8545

Purpose

Health monitoring and troubleshooting for the LightRAG observability stack following the MC #99400 hotfix (secrets hygiene + probe topology resolution).

This runbook covers:

Probe topology (three separate probe surfaces)
Canonical CF Access service token
Vault-sourced token injection pattern
Troubleshooting failure modes
Auto-heal removal rationale

Probe Topology

LightRAG health is monitored via three separate probe surfaces, each with a different purpose and authentication requirement:


| Probe Surface | Script | Endpoint | Auth Required | Layer Probed | Failure Modes Caught | Status Labels |
| --- | --- | --- | --- | --- | --- | --- |
| CF Tunnel Probe | `lightrag-health.sh` (check_cf_tunnel) | `https://lightrag.alai.no/health` | YES (CF Access headers) | Cloudflare Access + App | CF auth failure (302), CF outage, app down | HEALTHY \| DOWN \| AUTH_FAIL \| UNREACHABLE |
| Direct VM Probe | `lightrag-health.sh` (check_vm_direct_access) | `http://20.240.61.67:9621/health` | NO (internal IP) | Azure VM app-layer | App crash, VM down, network unreachable | HEALTHY \| DOWN \| UNREACHABLE |
| Boot Probe | `boot.sh` lines 88-94 | `http://20.240.61.67:9621/health` (or `${LIGHTRAG_VM_IP}`) | NO (internal IP) | Azure VM app-layer | App crash, VM down | HEALTHY \| DOWN \| UNREACHABLE |
| Monitoring Daemon | `com.john.lightrag-monitor` (calls lightrag-health-with-alert.sh) | Same as lightrag-health.sh (both CF tunnel + direct VM) | YES for CF tunnel layer | Both layers + Slack alerts | All failure modes + sends alerts to #alerts | Sends Slack alert on exit code 2 |
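The status labels in the table can be derived from a probe's HTTP code and curl exit code. The following is a minimal sketch, not the logic of lightrag-health.sh itself: the function name `classify_probe` is hypothetical, and mapping any other unexpected HTTP code to DOWN is an assumption.

```bash
# Hypothetical helper: map a curl result to the runbook's status labels.
# $1 = HTTP status code as printed by curl -w '%{http_code}' (000 = no response)
# $2 = curl exit code
classify_probe() {
  case "$1" in
    200)     echo "HEALTHY" ;;
    302|401) echo "AUTH_FAIL" ;;   # CF Access redirect or rejected token
    5??)     echo "DOWN" ;;        # app or backend error
    000)
      # curl exit code 7 means "failed to connect" (e.g. connection refused).
      # Anything else without a response (e.g. 28 = timeout) is UNREACHABLE.
      if [ "$2" -eq 7 ]; then echo "DOWN"; else echo "UNREACHABLE"; fi ;;
    *)       echo "DOWN" ;;        # assumption: other codes count as DOWN
  esac
}

# Example wiring (requires network access, shown for illustration only):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url"); rc=$?
#   classify_probe "$code" "$rc"
```

The direct VM and boot probes never send CF Access headers, so in practice they would only ever produce HEALTHY, DOWN, or UNREACHABLE.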

Key Decision (CEO approved): Boot probe uses direct VM IP for speed + no auth dependency. CF tunnel probe remains in lightrag-health.sh for external smoke-check.

Canonical CF Access Service Token

Location

Bitwarden Item ID: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token (note: typo in original item name)

⚠️ DO NOT use Bitwarden item ID 61d0bf21-2823-4ae3-a141-95046434591a: DEPRECATED 2026-05-06 (returned HTTP 302, token dead)

Scripts Using This Token

`~/system/tools/lightrag-health.sh` (CF tunnel check)
`~/system/boot.sh` (optional, not currently used since boot probe uses direct VM IP)

Daemons Using This Token

com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

Rotation Policy

Cadence: Manual (no TTL configured)
Owner: FlowForge
Next Rotation: TBD (CEO decision pending)

Note: CF Access service tokens do not expire by default unless a TTL is explicitly set. The current token has no expiration configured.

Vault-Sourced Token Injection Pattern

All scripts MUST source CF Access tokens from the Bitwarden vault at runtime. NEVER hardcode credentials in scripts.

Shell Script Pattern


# BW session liveness check
if [[ ! -f /tmp/bw-session ]] || [[ ! -s /tmp/bw-session ]]; then
  echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
  exit 1
fi

BW_SESSION=$(cat /tmp/bw-session)

# Load CF Access credentials from vault
function load_cf_credentials() {
  local item_data
  item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)

  if [[ $? -ne 0 ]] || [[ -z "$item_data" ]]; then
    echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION."
    return 1
  fi

  # The username field is stored with a "CF-Access-Client-Id: " prefix; strip it.
  CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username // empty' | sed 's/^CF-Access-Client-Id: //')
  CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')

  if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
    echo "ERROR: CF Access credentials not found in Bitwarden item."
    return 1
  fi

  return 0
}

# Load credentials
if ! load_cf_credentials; then
  echo "FATAL: Cannot proceed without CF Access credentials."
  exit 2
fi
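The pattern above strips the "CF-Access-Client-Id: " prefix from the Bitwarden username field with sed. As a rough alternative sketch, the same normalization can be done with shell parameter expansion, which avoids an extra process and leaves an unprefixed value untouched. The function name `strip_client_id_prefix` and the sample value are hypothetical.

```bash
# Hypothetical helper: remove the "CF-Access-Client-Id: " prefix if present.
# $1 = raw username field as stored in the Bitwarden item
strip_client_id_prefix() {
  # ${1#pattern} deletes the shortest leading match and is a no-op otherwise.
  printf '%s\n' "${1#CF-Access-Client-Id: }"
}

# Example with a made-up value (not a real token):
strip_client_id_prefix "CF-Access-Client-Id: example-id.access"
```

Accepting both the prefixed and the bare form means the script keeps working if the vault item is later stored without the header name.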


Use in curl

curl -s \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  https://lightrag.alai.no/health

Failure Modes and Troubleshooting

AUTH_FAIL (HTTP 302 or 401)

Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

Root Cause:

BW_SESSION expired → /tmp/bw-session stale or missing

Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)

CF Access service token rotated without updating Bitwarden

Fix:


# Refresh BW session
bw unlock
# Copy new session token to /tmp/bw-session

# Verify canonical token is live
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)" | jq -r '.login.username, .login.password'

# Test with curl
curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: $(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session "$(cat /tmp/bw-session)" | sed 's/^CF-Access-Client-Id: //')" \
  -H "CF-Access-Client-Secret: $(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session "$(cat /tmp/bw-session)")" \
  https://lightrag.alai.no/health


DOWN (HTTP 5xx or Connection Refused)

Symptom: Probe returns HTTP 500/502/503 or connection refused.

Root Cause:

LightRAG app crashed

Docker container stopped

Azure VM down

Neo4j backend unavailable

Fix:

# Check Azure VM status via Azure Portal
# https://portal.azure.com → vm-alai-lightrag

# Or via az CLI
az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"

# SSH to VM (if access configured)
ssh -i ~/.ssh/azure_alai [email protected]

# Check Docker containers
docker ps --filter name=lightrag
docker ps --filter name=neo4j

# Restart if needed
docker restart $(docker ps -q --filter name=lightrag)

UNREACHABLE (Timeout or Network Error)

Symptom: Probe times out after 5-10 seconds without an HTTP response.

Root Cause:

ANVIL → Azure VM network path broken

ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)

Azure VM firewall blocking port 9621

Fix:

# Test direct VM connectivity
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check current ISP IP
curl -s https://ifconfig.co

# Compare to NSG rule source IP
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

If IPs differ, update NSG rule:

NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"
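The comparison between the current ISP IP and the NSG rule's source prefix can be automated so the update only runs when needed. This is a minimal sketch under assumptions: the function name `nsg_rule_is_stale` is hypothetical, the IPs in the example are documentation addresses, and the rule is assumed to hold a single address, either bare or in /32 form.

```bash
# Hypothetical helper: succeed (exit 0) when the NSG rule no longer matches
# the current public IP and therefore needs updating.
# $1 = current public IP, e.g. output of: curl -s https://ifconfig.co
# $2 = sourceAddressPrefix from the NSG rule, e.g. "203.0.113.7/32"
nsg_rule_is_stale() {
  # Treat a bare IP and its /32 CIDR form as equivalent.
  [ "${2%/32}" != "$1" ]
}

# Example with documentation-range addresses:
if nsg_rule_is_stale "203.0.113.7" "198.51.100.4/32"; then
  echo "NSG rule outdated: update required"
fi
```

In a real run the second argument would come from the `az network nsg rule show` command above, with `-o tsv` added so the value is printed without JSON quotes.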

Auto-Heal Removal (MC #99400 D5)

Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

Rationale

The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.

ADC token drift: Relied on Application Default Credentials (ADC) for Azure CLI commands. When ADC tokens expired, the script failed silently.

Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.

False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

CEO Decision (2026-05-06)

D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

Archived Files

~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06

~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06

~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/

RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

Replacement Plan

Azure Monitor can directly alert on:

VM metrics (CPU, memory, disk, network)

Application Insights custom metrics (if LightRAG reports to AppInsights)

Log Analytics queries (if LightRAG logs to Log Analytics workspace)

Action groups can:

Send Slack alerts via webhook

Trigger Azure Automation runbooks for auto-remediation

Page on-call engineer via PagerDuty/Opsgenie

This will be scoped in a future child MC per CEO directive.

com.john.lightrag-monitor LaunchAgent

Status: ACTIVE (approved + activated 2026-05-06)

Schedule

Runs periodically (check plist StartInterval or StartCalendarInterval for current schedule).

What It Probes

Calls ~/system/tools/lightrag-health-with-alert.sh
Runs both CF tunnel probe + direct VM probe
Sends Slack alert to #alerts channel on exit code 2 (errors)
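The alert-on-exit-code-2 behavior can be pictured as a thin wrapper around the health script. The sketch below is not the contents of lightrag-health-with-alert.sh: the function name `run_health_with_alert` and the stub commands are hypothetical and only illustrate the control flow.

```bash
# Hypothetical wrapper: run a health command and call an alert command
# only when the health command exits with code 2 (errors).
# $1 = health command, $2 = alert command (receives one message argument)
run_health_with_alert() {
  rc=0
  "$1" || rc=$?
  if [ "$rc" -eq 2 ]; then
    "$2" "LightRAG health check FAILED (exit code 2) at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  fi
  # Propagate the original exit code so launchd and callers still see it.
  return "$rc"
}

# Example with stub commands standing in for the real script and Slack sender:
fake_health() { return 2; }
fake_alert()  { echo "ALERT: $1"; }
run_health_with_alert fake_health fake_alert || echo "health exit code: $?"
```

Passing the exit code through unchanged keeps the wrapper transparent: exit codes 0 and 1 produce no alert, and the caller can still tell them apart.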

Environment Variables

The LaunchAgent plist includes:

LIGHTRAG_VM_IP: set to 20.240.61.67 (Azure VM internal IP)

BW session sourced from /tmp/bw-session (vault-sourced credentials)

Manual Trigger

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Known Issues

Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to MC #99400. File a separate MC for the Ollama endpoint's Access auth configuration.

Token Rotation Procedure

Rotation Cadence: Manual (no TTL configured; rotation policy is OPEN)

When to Rotate

Suspected token compromise

Periodic security hygiene (e.g., every 90 days — CEO decision pending)

After employee offboarding (if token was shared)

Rotation Steps

Generate new service token in Cloudflare dashboard:
- Navigate to Zero Trust → Access → Service Auth
- Create new service token for lightrag.alai.no Access policy
- Copy Client ID + Client Secret

Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

# bw edit expects the modified item as encoded JSON, so fetch, patch, encode, then edit.
# Replace the <new_client_id> and <new_client_secret> placeholders before running.
bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)" \
  | jq --arg id "CF-Access-Client-Id: <new_client_id>" --arg secret "<new_client_secret>" \
       '.login.username = $id | .login.password = $secret' \
  | bw encode \
  | bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$(cat /tmp/bw-session)"
# Resulting username = "CF-Access-Client-Id: <new_client_id>"
# Resulting password = "<new_client_secret>"

Test new token:

curl -s -w "\nHTTP: %{http_code}\n" \
  -H "CF-Access-Client-Id: <new_client_id>" \
  -H "CF-Access-Client-Secret: <new_client_secret>" \
  https://lightrag.alai.no/health

Verify health probes work:

bash ~/system/tools/lightrag-health.sh
# Should return HEALTHY (exit code 0)

Update cf-access-token-registry.json:

jq '.["lightrag.alai.no"].last_verified = "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"' \
  ~/system/state/cf-access-token-registry.json > /tmp/registry.json
mv /tmp/registry.json ~/system/state/cf-access-token-registry.json

Revoke old token in Cloudflare dashboard (after confirming new token works)

Related Documentation

Azure LightRAG Migration Runbook
CF-BIC Whitelist Rule: INFRA-CF-001
MC #99400 Evidence Pack: ~/system/evidence/99400-proveo-pass/

Forged Prompt: ~/system/prompts/forged/99400.md

CF Access Token Registry: ~/system/state/cf-access-token-registry.json

Changelog

2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
2026-04-21 — Initial version (baseline setup + first run)

Document Owner: FlowForge
Last Updated: 2026-05-06
Approved By: CEO (Alem Basic), MC #99400 deliverable D8
