LightRAG Health Monitoring Runbook

Domain note (2026-05-17): References to lightrag.basicconsulting.no and ollama.basicconsulting.no are legacy hostnames. Current live endpoints: lightrag.alai.no and ollama.alai.no.

Status: ACTIVE
Created: 2026-04-21
Owner: FlowForge (AgentForge)
Related: MC #8545, INFRA-CF-001

Purpose

Continuous health monitoring for the LightRAG stack (Azure VM + Cloudflare) following the 2026-04-20 outage fix (CF Browser Integrity Check configuration).

This runbook covers:

Health check script usage
Interpreting results
Automated monitoring setup
Troubleshooting common issues
Rollback procedures

Architecture Overview

LightRAG runs on an Azure VM (20.240.61.67:9621) and is exposed via a Cloudflare tunnel at https://lightrag.basicconsulting.no. The system depends on:

Azure VM: Docker containers (lightrag + neo4j)

Cloudflare tunnel: Routes traffic through Mac Studio relay

Cloudflare Access: Authentication via service tokens

Cloudflare BIC rule: Allows automation clients (Python UA)

Ollama upstream: https://ollama.basicconsulting.no for LLM inference

See: Azure LightRAG Migration Runbook

Health Check Script

Location

~/system/tools/lightrag-health.sh

Manual Execution

bash ~/system/tools/lightrag-health.sh

Output

Terminal: Colored status summary (green/yellow/red per layer)

JSON: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.json (machine-readable)

Markdown: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.md (human-readable)

Exit Codes

0 — All checks passed (healthy)

1 — Warnings detected (degraded but operational)

2 — Errors detected (critical issues)
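The exit codes above can be turned into a status label by a caller. The following is a minimal sketch only; `describe_exit_code` is a hypothetical helper name and is not part of lightrag-health.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of lightrag-health.sh): map the documented
# exit codes 0/1/2 to a short status label. Any other code is "unknown".
describe_exit_code() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "degraded" ;;
    2) echo "critical" ;;
    *) echo "unknown" ;;
  esac
}
```

A caller could run `bash ~/system/tools/lightrag-health.sh` and then pass `$?` to `describe_exit_code` to log a readable status.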

Check Layers

Layer 1: Azure VM Health

Check	What it tests	Healthy criteria
`direct_access`	Direct HTTP to VM IP:port	HTTP 200, status=healthy
`docker_containers`	Container status via SSH	lightrag + neo4j running, healthy

Note: SSH access is currently unavailable (publickey auth). Manual verification is required via Azure Portal or after SSH key setup.

Layer 2: Cloudflare Network

Check	What it tests	Healthy criteria
`cf_tunnel`	HTTPS via CF tunnel	HTTP 200, latency < 2s
`cf_bic_rule`	BIC rule configuration	Rule enabled, covers both endpoints
`python_ua`	Python client access	HTTP 200 with Python UA

Critical: The python_ua check verifies that the CF-BIC-001 rule is active. If it fails with HTTP 403, automation clients (pi-orchestrator, lightrag-outbox-ingest.js) will break.

Layer 3: Application Health

Check	What it tests	Healthy criteria
`health_endpoint`	`/health` endpoint	status=healthy, pipeline_busy=false
`query_endpoint`	`/query` with naive mode	HTTP 200, valid response, < 30s

Note: The first query after an idle period may take longer (cold start). If it times out, retry once.

Layer 4: Ollama Upstream

Check	What it tests	Healthy criteria
`api_tags`	Ollama model availability	qwen2.5-coder:32b + bge-m3 present

Critical: LightRAG requires these specific models. If missing, queries will fail.
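The model requirement above can be checked against a list of model names. This is a sketch under stated assumptions: `missing_models` is a hypothetical helper, it expects one model name per line on stdin, and it matches by substring so that a short name such as qwen2.5-coder:32b also matches a longer tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every required model (given as arguments)
# that does not appear in the newline-separated model list on stdin.
# Matching is a fixed-string substring match, not an exact tag match.
missing_models() {
  local available required
  available="$(cat)"
  for required in "$@"; do
    printf '%s\n' "$available" | grep -qF -- "$required" || printf '%s\n' "$required"
  done
}
```

One possible use is to feed it the names from the Ollama tags response, for example by piping the output of `jq -r '.models[].name'` into `missing_models "qwen2.5-coder:32b" "bge-m3"`. Empty output means nothing is missing.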

Interpreting Results

Green (Exit 0): Healthy

All critical checks passed. System operational.

Action: None required.

Yellow (Exit 1): Warnings

Non-critical issues detected. System degraded but operational.

Common warnings:

SSH access unavailable (known limitation)
CF API token unavailable (can't verify BIC rule, but Python UA test compensates)
Slow response times (> 2s but < 30s)

Action: Review warning details. Monitor next check. Escalate if warnings persist 3+ checks.
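The "persist 3+ checks" rule can be evaluated from a sequence of recent exit codes. The sketch below is illustrative only: both function names are hypothetical, exit codes are passed oldest first, and any non-zero code (warning or error) counts toward the streak, which is an assumption beyond what the runbook states.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Arguments are exit codes of recent checks, oldest first.
# consecutive_non_healthy prints the length of the unbroken run of non-zero
# codes at the end of the sequence (a 0 resets the streak).
consecutive_non_healthy() {
  local count=0 code
  for code in "$@"; do
    if [ "$code" -eq 0 ]; then
      count=0
    else
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Succeeds (returns 0) when the streak has reached three or more checks.
should_escalate() {
  [ "$(consecutive_non_healthy "$@")" -ge 3 ]
}
```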

Red (Exit 2): Errors

Critical issues detected. System may be non-operational or partially failed.

Common errors:

Query endpoint timeout (> 30s)

HTTP 403 from Python UA (BIC rule disabled)

Ollama models missing

Direct VM access failed

Action:

Review error details in JSON evidence

Follow troubleshooting section below

If unresolved after 30 min, consider rollback (see Azure LightRAG Migration Runbook)

Automated Monitoring Setup

LaunchAgent Installation (DRAFT — Pending Alem Approval)

Draft file: ~/system/evidence/lightrag-monitor-launchagent-draft.plist

Schedule: Daily at 9:00 AM (frequent for 4-week monitoring period)

Installation steps (when approved):

# 1. Copy draft to LaunchAgents
cp ~/system/evidence/lightrag-monitor-launchagent-draft.plist \
   ~/Library/LaunchAgents/com.john.lightrag-monitor.plist

# 2. Load the agent
launchctl load ~/Library/LaunchAgents/com.john.lightrag-monitor.plist

# 3. Start immediately (optional)
launchctl start com.john.lightrag-monitor


Manual trigger:

launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor

Logs:

stdout: ~/system/logs/lightrag-monitor/stdout.log

stderr: ~/system/logs/lightrag-monitor/stderr.log


Slack Alerts (To Be Implemented)

When the LaunchAgent detects exit code 2 (errors), send an alert to the #alerts channel:

node ~/system/tools/slack.js send alerts "🚨 LightRAG health check FAILED at $(date). Check ~/system/evidence/lightrag-health-*.json"

This requires wrapping the health check script in a post-execution hook (see plist comments).
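A post-execution hook of this kind might look like the following sketch. `run_with_alert` is a hypothetical function. It takes the name of a health check command and the name of an alert command, so the real script and the Slack sender are supplied by the caller, not hardcoded here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a health check command and call an alert command
# only when the check exits with code 2 (errors). The check's exit code is
# passed through so the LaunchAgent still sees the real result.
run_with_alert() {
  local health_cmd="$1" alert_cmd="$2" status=0
  "$health_cmd" || status=$?
  if [ "$status" -eq 2 ]; then
    # A failed alert must not mask the health check result.
    "$alert_cmd" "LightRAG health check FAILED at $(date)" || true
  fi
  return "$status"
}
```

In practice the two arguments could be small shell functions: one that runs `bash ~/system/tools/lightrag-health.sh`, and one that forwards its first argument to `node ~/system/tools/slack.js send alerts`.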


Health History Database

Location: ~/system/databases/lightrag-health.db

Schema: ~/system/tools/lightrag-health-db-init.sql
Tables

health_checks: Overall check results
health_check_details: Individual layer/check results

Views

health_checks_summary: Last 30 checks
health_checks_trend: Daily aggregates


Query Examples

Last 10 checks:

sqlite3 ~/system/databases/lightrag-health.db \
  "SELECT timestamp, overall_status, errors, warnings FROM health_checks ORDER BY created_at DESC LIMIT 10;"

Trend over last 7 days:

sqlite3 ~/system/databases/lightrag-health.db \
  "SELECT * FROM health_checks_trend WHERE check_date >= date('now', '-7 days');"

All errors in last 24 hours:

sqlite3 ~/system/databases/lightrag-health.db \
  "SELECT hc.timestamp, hcd.layer, hcd.check_name, hcd.message FROM health_checks hc
   JOIN health_check_details hcd ON hc.id = hcd.health_check_id
   WHERE hcd.status = 'error' AND hc.created_at >= datetime('now', '-24 hours');"

Note: Database logging will be implemented in next iteration of the health script.
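Until that logging exists, one way to record a run is to build an INSERT statement by hand. This is only a sketch: `build_insert_sql` is a hypothetical helper, the column names are taken from the query examples above, and the real schema in lightrag-health-db-init.sql may differ. Inputs are assumed to be trusted values produced by the health script, since no SQL escaping is done.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an INSERT statement for the health_checks table.
# Column names are assumed from the query examples, not from the real schema.
# Arguments: timestamp, overall status label, error count, warning count.
build_insert_sql() {
  local ts="$1" status="$2" errors="$3" warnings="$4"
  printf "INSERT INTO health_checks (timestamp, overall_status, errors, warnings) VALUES ('%s', '%s', %d, %d);\n" \
    "$ts" "$status" "$errors" "$warnings"
}
```

The output could be piped into `sqlite3 ~/system/databases/lightrag-health.db`, which reads SQL from standard input.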


Troubleshooting

Issue: Query endpoint timeout (HTTP 000, 35s)

Possible causes:

First query after idle (cold start)

Ollama FORGE overloaded

Network path issue (Mac Studio → CF → Azure → CF → Mac Studio)

Diagnosis:

# Test Ollama upstream directly
# (grep -A1 also captures the <string> line that follows the <key> line in the plist)
PLIST=~/Library/LaunchAgents/com.john.pi-orchestrator.plist
CF_ID=$(grep -A1 CF_ACCESS_CLIENT_ID "$PLIST" | sed -n 's/.*<string>\(.*\)<\/string>.*/\1/p')
CF_SECRET=$(grep -A1 CF_ACCESS_CLIENT_SECRET "$PLIST" | sed -n 's/.*<string>\(.*\)<\/string>.*/\1/p')
curl -s https://ollama.basicconsulting.no/api/tags \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SECRET" | jq '.models | length'

# Check if FORGE is responding
curl http://10.0.0.2:11434/api/ps

# Test query directly with extended timeout
curl -s --max-time 60 \
  -H "CF-Access-Client-Id: ..." \
  -H "CF-Access-Client-Secret: ..." \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{"query":"test","mode":"naive","only_need_context":false}' \
  https://lightrag.basicconsulting.no/query


Fix:

If cold start: Retry once, should succeed

If FORGE overloaded: Identify competing workload, throttle/stop

If persistent: Check Azure LightRAG Migration Runbook for tunnel troubleshooting
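The "retry once" step for cold starts can be wrapped in a small helper. This is a sketch only: `retry_once` and the `RETRY_DELAY` variable are hypothetical names, and the 5 second default delay is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command; if it fails, wait RETRY_DELAY seconds
# (default 5, an arbitrary choice) and run it exactly one more time.
# The return status is that of the last attempt.
retry_once() {
  local delay="${RETRY_DELAY:-5}"
  "$@" && return 0
  sleep "$delay"
  "$@"
}
```

For example, the extended-timeout query shown in the diagnosis steps could be placed in a function and invoked as `retry_once that_function`.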



Issue: Python UA blocked (HTTP 403)

Root cause: CF Browser Integrity Check rule disabled or misconfigured.

Diagnosis:

# Test with Python UA
curl -s -w "\nHTTP: %{http_code}\n" \
  -A "Python/3.11 urllib/1.26" \
  -H "CF-Access-Client-Id: ..." \
  -H "CF-Access-Client-Secret: ..." \
  https://lightrag.basicconsulting.no/health

Fix:


Verify CF Configuration Rule (Ruleset 4fc2c122d04d4791a5d17409b097c510, Rule c5990f19f655441180ae886f4512de40)

Ensure rule is enabled and expression includes lightrag.basicconsulting.no

See: ~/system/rules/cf-proxied-api-bic-whitelist.md


Critical: This is a repeat of the 2026-04-20 outage. If rule is disabled, all automation breaks.


Issue: Ollama models missing

Symptoms: api_tags check fails or warns about missing models.

Required models:

qwen2.5-coder:32b-instruct-q8_0 (LLM inference)
bge-m3:latest (embeddings)

Fix:

# SSH to FORGE (10.0.0.2)
ssh [email protected]

# Pull missing models
ollama pull qwen2.5-coder:32b-instruct-q8_0
ollama pull bge-m3:latest

# Verify
ollama list | grep -E "(qwen2.5-coder:32b-instruct-q8_0|bge-m3:latest)"


Issue: Direct VM access failed

Symptoms: direct_access check returns HTTP error or timeout.

Diagnosis:

# Test direct HTTP
curl -s --connect-timeout 5 http://20.240.61.67:9621/health

# Check NSG rules (Mac Studio IP may have changed)
az network nsg rule show \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --query "sourceAddressPrefix"

# Compare to current ISP IP
curl -s https://ifconfig.co

Fix:
If the Mac Studio ISP IP has rotated, update the NSG rule:
NEW_IP=$(curl -s https://ifconfig.co)
az network nsg rule update \
  -g rg-alai-lightrag \
  --nsg-name vm-alai-lightragNSG \
  -n allow-lightrag-macstudio \
  --source-address-prefixes "${NEW_IP}/32"

Note: Azure resources (rg-alai-lightrag) are not currently visible via az CLI. This may indicate different subscription or access issue. Direct HTTP access confirms VM is operational.
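The comparison between the current public IP and the NSG rule's source prefix can be scripted. A minimal sketch follows; `nsg_rule_is_stale` is a hypothetical helper and assumes the rule holds a single address, with or without a /32 suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (return 0) when the NSG rule's source prefix
# no longer matches the current public IP.
# Arguments: current public IP, source address prefix from the NSG rule.
nsg_rule_is_stale() {
  local current_ip="$1" rule_prefix="$2"
  # Strip a trailing /32 so "203.0.113.7/32" and "203.0.113.7" compare equal.
  [ "${rule_prefix%/32}" != "$current_ip" ]
}
```

The first argument could come from `curl -s https://ifconfig.co` and the second from the `az network nsg rule show` command above with `-o tsv` added to get plain text output.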


Rollback Procedure

If LightRAG stack becomes unstable (exit code 2 persisting > 30 min, or CEO directive):

Follow: Azure LightRAG Migration Runbook → Section "Rollback Procedure"

Summary:

Revert consumer URLs from https://lightrag.basicconsulting.no to http://localhost:9621
Restart local Docker LightRAG
Verify local service
Optionally deprovision Azure VM


Expected rollback time: 5-15 minutes

Data loss risk: ZERO (local volumes preserved)


Maintenance

Weekly Tasks (First 4 Weeks)


Review health check trend via database query
Check for persistent warnings/errors
Verify evidence files are being generated
Compare latency trends (p50, p95)

After 4 Weeks

If the system is stable (no exit code 2 in 4 weeks):

Reduce monitoring frequency from daily to weekly
Update LaunchAgent StartCalendarInterval to run on Mondays only
Archive old evidence files (keep last 30)
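For the weekly latency comparison (p50, p95), a percentile can be computed from a plain list of latency values. This is a sketch under stated assumptions: `percentile` is a hypothetical helper, it uses the nearest-rank method, it expects an integer percentile, and how latencies are extracted from the evidence files is left open because their JSON layout is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nearest-rank percentile of the newline-separated
# numbers on stdin. Argument: integer percentile (e.g. 50 or 95).
# Fails with a non-zero status when the input is empty.
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```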


Evidence Files

All health checks generate timestamped evidence:

Location: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.*
Retention: Keep last 30 days, archive older to Azure Blob Storage.

Example archive command (to be automated):

find ~/system/evidence -name "lightrag-health-*.json" -mtime +30 \
  | xargs tar -czf ~/system/evidence/archive-$(date +%Y%m).tar.gz

# Upload to Azure Blob
az storage blob upload \
  --account-name plockfrontstaging \
  --container-name evidence \
  --name lightrag-health-archive-$(date +%Y%m).tar.gz \
  --file ~/system/evidence/archive-$(date +%Y%m).tar.gz

Related Documentation

Azure LightRAG Migration Runbook: Full migration details + rollback
CF-BIC Whitelist Rule: INFRA-CF-001
MC Task #8545: Health monitoring project


Changelog
2026-05-06 — MC #99495 cleanup: StartInterval 900s MTTD reduction, BIC hostname migration to ollama.alai.no, jq custom-fields extraction, Known Dependencies section added

2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated

2026-04-21 — Initial version (baseline setup + first run)

Document Owner: FlowForge

Last Updated: 2026-04-21

Approved By: Pending approval for LaunchAgent installation
