Skip to main content

LightRAG Health Monitoring Runbook

LightRAG Health MonitoringObservability Runbook (MC #99400 Updated)

Status: ACTIVE
Created:Last Updated: 2026-04-2105-06 (MC #99400 hotfix)
Owner: FlowForge (AgentForge)
Related: MC #8545,#99400, INFRA-CF-001MC #8545


Purpose

Continuous healthHealth monitoring and troubleshooting for LightRAG observability stack following MC #99400 hotfix (Azuresecrets VMhygiene + Cloudflare)probe followingtopology the 2026-04-20 outage fix (CF Browser Integrity Check configuration)resolution).

This runbook covers:

  • HealthProbe checktopology script(three usageseparate probe surfaces)
  • InterpretingCanonical resultsCF Access service token
  • AutomatedVault-sourced monitoringtoken setupinjection pattern
  • Troubleshooting commonfailure issuesmodes
  • RollbackAuto-heal proceduresremoval rationale

ArchitectureProbe OverviewTopology

LightRAG runshealth onis Azuremonitored VMvia (20.240.61.67:9621)THREE separate probe surfaces, each with different purposes and isauthentication exposed via Cloudflare tunnel at https://lightrag.basicconsulting.no. The system depends on:requirements:

  1. Azure VM — Docker containers (lightrag + neo4j)
  2. Cloudflare tunnel — Routes traffic through Mac Studio relay
  3. Cloudflare Access — Authentication via service tokens
  4. Cloudflare BIC rule — Allows automation clients (Python UA)
  5. Ollama upstream — https://ollama.basicconsulting.no for LLM inference

See: Azure LightRAG Migration Runbook


Health Check Script

Location

~/system/tools/lightrag-health.sh

Manual Execution

bash ~/system/tools/lightrag-health.sh

Output

  • Terminal: Colored status summary (green/yellow/red per layer)
  • JSON: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.json (machine-readable)
  • Markdown: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.md (human-readable)

Exit Codes

  • 0 — All checks passed (healthy)
  • 1 — Warnings detected (degraded but operational)
  • 2 — Errors detected (critical issues)

Check Layers

Layer 1: Azure VM Health

criteria healthy
CheckProbe Surface What it testsScript HealthyEndpoint Auth RequiredLayer ProbedFailure Modes CaughtStatus Labels
CF Tunnel Probedirect_accesslightrag-health.sh (check_cf_tunnel)https://lightrag.alai.no/health DirectYES HTTP(CF toAccess VM IP:portheaders) HTTPCloudflare 200,Access status=healthy+ AppCF auth failure (302), CF outage, app downHEALTHY | DOWN | AUTH_FAIL | UNREACHABLE
Direct VM Probedocker_containerslightrag-health.sh (check_vm_direct_access)http://20.240.61.67:9621/health ContainerNO status(internal via SSHIP) lightragAzure VM app-layerApp crash, VM down, network unreachableHEALTHY | DOWN | UNREACHABLE
Boot Probeboot.sh lines 88-94http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP})NO (internal IP)Azure VM app-layerApp crash, VM downHEALTHY | DOWN | UNREACHABLE
Monitoring Daemoncom.john.lightrag-monitor (calls lightrag-health-with-alert.sh)Same as lightrag-health.sh (both CF tunnel + neo4jdirect running,VM) YES for CF tunnel layerBoth layers + Slack alertsAll failure modes + sends alerts to #alertsSends Slack alert on exit code 2

Note:Key Decision (CEO approved): SSHBoot accessprobe currentlyuses unavailabledirect (publickeyVM auth).IP Manualfor verificationspeed required+ viano Azureauth Portal or after SSH key setup.

Layer 2: Cloudflare Network

forexternalsmoke-check.


Access
CheckWhat it testsHealthy criteria
cf_tunnelHTTPS viadependency. CF tunnel HTTPprobe 200,remains latencyin <lightrag-health.sh 2s
cf_bic_rule BIC

Canonical ruleCF configuration

RuleService enabled,Token covers

Token both endpoints

python_uaPython client accessHTTP 200 with Python UA
Location

Critical:Bitwarden Item ID: python_uab42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service token check(note: verifiestypo thein CF-BIC-001original ruleitem isname)

active.

⚠️ IfDO thisNOT failsuse withBitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returned HTTP 403,302, automationtoken clients (pi-orchestrator, lightrag-outbox-ingest.js) will break.dead)

LayerScripts 3:Using ApplicationThis HealthToken

(CFcheck) currentlydirectVMIP)
CheckWhat it testsHealthy criteria
  • health_endpoint
  • ~/healthsystem/tools/lightrag-health.sh endpoint status=healthy,tunnel pipeline_busy=false
  • query_endpoint
  • ~/querysystem/boot.sh with(optional, naivenot mode HTTPused 200,since validboot response,probe <uses 30s

    Daemons Using This Token

    • com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

    Rotation Policy

    Cadence: Manual (no TTL configured)
    Owner: FlowForge
    Next Rotation: TBD (CEO decision pending)

    Note: FirstCF queryAccess afterservice idletokens maydo takenot longerexpire (coldby start).default Ifunless timeout,a retryTTL once.

    is

    Layerexplicitly 4:set. OllamaCurrent Upstream

    token hasnoexpiration
    CheckWhat it testsHealthy criteria
    api_tagsOllama model availabilityqwen2.5-coder:32b + bge-m3 present

    Critical: LightRAG requires these specific models. If missing, queries will fail.configured.


    InterpretingVault-Sourced ResultsToken Injection Pattern

    Green (Exit 0) — Healthy

    All criticalscripts checksMUST passed.source SystemCF operational.

    Access

    tokens from Bitwarden vault at runtime. Action:NEVER hardcode credentials in scripts. None required.

    YellowShell (ExitScript 1) — WarningsPattern

    Non-critical issues detected. System degraded but operational.

    Common warnings:

    • SSH access unavailable (known limitation)
    • CF API token unavailable (can't verify BIC rule, but Python UA test compensates)
    • Slow response times (> 2s but < 30s)

    Action: Review warning details. Monitor next check. Escalate if warnings persist 3+ checks.

    Red (Exit 2) — Errors

    Critical issues detected. System may be non-operational or partially failed.

    Common errors:

    • Query endpoint timeout (> 30s)
    • HTTP 403 from Python UA (BIC rule disabled)
    • Ollama models missing
    • Direct VM access failed

    Action:

    1. Review error details in JSON evidence
    2. Follow troubleshooting section below
    3. If unresolved after 30 min, consider rollback (see Azure LightRAG Migration Runbook)

    Automated Monitoring Setup

    LaunchAgent Installation (DRAFT — Pending Alem Approval)

    Draft file: ~/system/evidence/lightrag-monitor-launchagent-draft.plist

    Schedule: Daily at 9:00 AM (frequent for 4-week monitoring period)

    Installation steps (when approved):

    # 1.BW Copysession draftliveness check
    if [[ ! -f /tmp/bw-session ]] || [[ ! -s /tmp/bw-session ]]; then
      echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
      exit 1
    fi
    
    BW_SESSION=$(cat /tmp/bw-session)
    
    # Load CF Access credentials from vault
    function load_cf_credentials() {
      local item_data
      item_data=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)
      
      if [[ $? -ne 0 ]] || [[ -z "$item_data" ]]; then
        echo "ERROR: Failed to LaunchAgentsretrieve cpCF ~Access token from Bitwarden. Check BW_SESSION."
        return 1
      fi
      
      CF_ACCESS_CLIENT_ID=$(echo "$item_data" | jq -r '.login.username /system/evidence/lightrag-monitor-launchagent-draft.plist/ \empty' ~| sed 's/^CF-Access-Client-Id: /Library/LaunchAgents/com.john.lightrag-monitor.plist/')
      CF_ACCESS_CLIENT_SECRET=$(echo "$item_data" | jq -r '.login.password // empty')
      
      if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
        echo "ERROR: CF Access credentials not found in Bitwarden item."
        return 1
      fi
      
      return 0
    }
    
    # 2. Load thecredentials
    agentif launchctl! loadload_cf_credentials; ~/Library/LaunchAgents/com.john.lightrag-monitor.plistthen
      echo "FATAL: Cannot proceed without CF Access credentials."
      exit 2
    fi
    
    # 3. Start immediately (optional)
    launchctl start com.john.lightrag-monitor
    

    Manual trigger:

    launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
    

    Logs:

    • stdout: ~/system/logs/lightrag-monitor/stdout.log
    • stderr: ~/system/logs/lightrag-monitor/stderr.log

    Slack Alerts (To Be Implemented)

    When LaunchAgent detects exit code 2 (errors), send alert to #alerts channel:

    node ~/system/tools/slack.js send alerts "🚨 LightRAG health check FAILED at $(date). Check ~/system/evidence/lightrag-health-*.json"
    

    This requires wrapping the health check scriptUse in a post-execution hook (see plist comments).


    Health History Database

    Location: ~/system/databases/lightrag-health.db

    Schema: ~/system/tools/lightrag-health-db-init.sql

    Tables

    • health_checks — Overall check results
    • health_check_details — Individual layer/check results

    Views

    • health_checks_summary — Last 30 checks
    • health_checks_trend — Daily aggregates

    Query Examples

    Last 10 checks:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT timestamp, overall_status, errors, warnings FROM health_checks ORDER BY created_at DESC LIMIT 10;"
    

    Trend over last 7 days:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT * FROM health_checks_trend WHERE check_date >= date('now', '-7 days');"
    

    All errors in last 24 hours:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT hc.timestamp, hcd.layer, hcd.check_name, hcd.message FROM health_checks hc
       JOIN health_check_details hcd ON hc.id = hcd.health_check_id
       WHERE hcd.status = 'error' AND hc.created_at >= datetime('now', '-24 hours');"
    

    Note: Database logging will be implemented in next iteration of the health script.


    Troubleshooting

    Issue: Query endpoint timeout (HTTP 000, 35s)

    Possible causes:

    1. First query after idle (cold start)
    2. Ollama FORGE overloaded
    3. Network path issue (Mac Studio → CF → Azure → CF → Mac Studio)

    Diagnosis:

    # Test Ollama upstream directlycurl
    curl -s https://ollama.basicconsulting.no/api/tags \
      -H "CF-Access-Client-Id: $(grep CF_ACCESS_CLIENT_ID ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')"CF_ACCESS_CLIENT_ID" \
      -H "CF-Access-Client-Secret: $(grep CF_ACCESS_CLIENT_SECRET ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" | jq '.models | length'
    
    # Check if FORGE is responding
    curl http://10.0.0.2:11434/api/ps
    
    # Test query directly with extended timeout
    curl -s --max-time 60 \
      -H "CF-Access-Client-Id: ..." \
      -H "CF-Access-Client-Secret: ..." \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{"query":"test","mode":"naive","only_need_context":false}'CF_ACCESS_CLIENT_SECRET" \
      https://lightrag.basicconsulting.alai.no/queryhealth
    

    Failure Modes and Troubleshooting

    AUTH_FAIL (HTTP 302 or 401)

    Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

    Root Cause:

    • BW_SESSION expired → /tmp/bw-session stale or missing
    • Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
    • CF Access service token rotated without updating Bitwarden

    Fix:

    • If cold start: Retry once, should succeed
    • If FORGE overloaded: Identify competing workload, throttle/stop
    • If persistent: Check Azure LightRAG Migration Runbook for tunnel troubleshooting

    Issue: Python UA blocked (HTTP 403)

    Root cause: CF Browser Integrity Check rule disabled or misconfigured.

    Diagnosis:

    # Refresh BW session
    bw unlock
    # Copy new session token to /tmp/bw-session
    
    # Verify canonical token is live
    bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session) | jq -r '.login.username, .login.password'
    
    # Test with Python UAcurl
    curl -s -w "\nHTTP: %{http_code}\n" \
      -A "Python/3.11 urllib/1.26" \
      -H "CF-Access-Client-Id: ...$(bw get username 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session) | sed 's/^CF-Access-Client-Id: //')" \
      -H "CF-Access-Client-Secret: ...$(bw get password 'b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2' --session $(cat /tmp/bw-session))" \
      https://lightrag.basicconsulting.alai.no/health
    

    Fix:

    1. Verify CF Configuration Rule (Ruleset 4fc2c122d04d4791a5d17409b097c510, Rule c5990f19f655441180ae886f4512de40)
    2. Ensure rule is enabled and expression includes lightrag.basicconsulting.no
    3. See: ~/system/rules/cf-proxied-api-bic-whitelist.md

    Critical: This is a repeat of the 2026-04-20 outage. If rule is disabled, all automation breaks.


    Issue:DOWN Ollama(HTTP models5xx missingor Connection Refused)

    Symptoms:Symptom: api_tagsProbe checkreturns failsHTTP 500/502/503 or warnsconnection about missing models.refused.

    RequiredRoot models:Cause:

    • qwen2.5-coder:32b-instruct-q8_0LightRAG (LLMapp inference)crashed
    • bge-m3:latestDocker (embeddings)container stopped
    • Azure VM down
    • Neo4j backend unavailable

    Fix:

    # Check Azure VM status via Azure Portal
    # https://portal.azure.com → vm-alai-lightrag
    
    # Or via az CLI
    az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"
    
    # SSH to FORGEVM (10.0.0.2)if access configured)
    ssh [email protected]-i ~/.ssh/azure_alai [email protected]
    
    # PullCheck missingDocker modelscontainers
    ollamadocker pullps qwen2.5-coder:32b-instruct-q8_0--filter ollamaname=lightrag
    pulldocker bge-m3:latestps --filter name=neo4j
    
    # VerifyRestart ollamaif listneeded
    |docker greprestart $(docker ps -Eq "(qwen2.5-coder:32b-instruct-q8_0|bge-m3:latest)"--filter name=lightrag)
    

    Issue:UNREACHABLE Direct(Timeout VMor accessNetwork failedError)

    Symptoms:Symptom: direct_accessProbe checktimes returnsout after 5-10 seconds without HTTP error or timeout.response.

    Diagnosis:Root Cause:

    • ANVIL → Azure VM network path broken
    • ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
    • Azure VM firewall blocking port 9621

    Fix:

    # Test direct HTTPVM connectivity
    curl -s --connect-timeout 5 http://20.240.61.67:9621/health
    
    # Check NSGcurrent rules (Mac StudioISP IP
    maycurl have-s changed)https://ifconfig.co
    
    # Compare to NSG rule source IP
    az network nsg rule show \
      -g rg-alai-lightrag \
      --nsg-name vm-alai-lightragNSG \
      -n allow-lightrag-macstudio \
      --query "sourceAddressPrefix"
    
    # Compare to current ISP IP
    curl -s https://ifconfig.co
    

    Fix: If MacIPs Studio ISP IP rotated,differ, update NSG rule:

    rule
    NEW_IP=$(curl -s https://ifconfig.co)
    az network nsg rule update \
      -g rg-alai-lightrag \
      --nsg-name vm-alai-lightragNSG \
      -n allow-lightrag-macstudio \
      --source-address-prefixes "${NEW_IP}/32"
    

    Auto-Heal Removal (MC #99400 D5)

    Note:Status: com.alai.lightrag-auto-heal LaunchAgent REMOVED 2026-05-06

    Rationale

    The auto-heal daemon was designed to automatically restart LightRAG on health check failures. However, it had fatal design flaws:

    1. Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure resourcesVM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.
    2. ADC token drift: Relied on Application Default Credentials (rg-alai-lightrag)ADC) arefor notAzure currentlyCLI visiblecommands. When ADC tokens expired, script failed silently.
    3. Blast radius risk: Restart logic could affect production LightRAG instance with 121K pending docs.
    4. False sense of safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

    CEO Decision (2026-05-06)

    D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

    Archived Files

    • ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
    • ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
    • ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
    • RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.md

    Replacement Plan

    Azure Monitor can directly alert on:

    • VM metrics (CPU, memory, disk, network)
    • Application Insights custom metrics (if LightRAG reports to AppInsights)
    • Log Analytics queries (if LightRAG logs to Log Analytics workspace)

    Action groups can:

    • Send Slack alerts via azwebhook
    • CLI.
    • Trigger Azure Automation runbooks for auto-remediation
    • Page on-call engineer via PagerDuty/Opsgenie

    This maywill indicatebe differentscoped subscriptionin ora accessfuture issue.child DirectMC HTTPper accessCEO confirms VM is operational.directive.


    Rollbackcom.john.lightrag-monitor ProcedureLaunchAgent

    IfStatus: LightRAG stack becomes unstableACTIVE (exitapproved code+ 2activated persisting2026-05-06)

    >

    Schedule

    30

    Runs min,periodically (check plist StartInterval or CEOStartCalendarInterval directive):for current schedule).

    Follow:

    What AzureIt LightRAG Migration Runbook → Section "Rollback Procedure"

    Summary:

    1. Revert consumer URLs from https://lightrag.basicconsulting.no to http://localhost:9621
    2. Restart local Docker LightRAG
    3. Verify local service
    4. Optionally deprovision Azure VM

    Expected rollback time: 5-15 minutes
    Data loss risk: ZERO (local volumes preserved)


    Maintenance

    Weekly Tasks (First 4 Weeks)Probes

    • Calls Review health check trend via database query~/system/tools/lightrag-health-with-alert.sh
    • Runs Checkboth forCF persistenttunnel warnings/errorsprobe + direct VM probe
    • Sends VerifySlack evidencealert filesto are#alerts beingchannel generated
    •  Compare latency trends (p50, p95)

    After 4 Weeks

    If system stable (noon exit code 2 in(errors) 4 weeks):

    Environment Variables

    The LaunchAgent plist includes:

    • ReduceLIGHTRAG_VM_IP monitoring frequencySet to 20.240.61.67 (Azure VM internal IP)
    • BW session sourced from daily/tmp/bw-session (vault-sourced credentials)

    Manual Trigger

    launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
    

    Known Issues

    • Ollama API 302: ollama.alai.no returns HTTP 302 on every health check, triggering false-alarm Slack alerts. This is a pre-existing issue unrelated to weekly
    • MC
    • Update#99400. LaunchAgentFile StartCalendarIntervalseparate toMC runfor onOllama Mondaysendpoint only
    • CF
    • ArchiveAccess oldauth evidence files (keep last 30)configuration.

    EvidenceToken FilesRotation Procedure

    AllRotation healthCadence: checksManual generate(no timestampedTTL evidence:configured — rotation policy is OPEN)

    When to Rotate

    • Suspected token compromise
    • Periodic security hygiene (e.g., every 90 days — CEO decision pending)
    • After employee offboarding (if token was shared)

    Rotation Steps

    1. Location:Generate new service token in Cloudflare dashboard:

      • Navigate to ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.*Zero Trust → Access → Service Auth

      • Create new service token for lightrag.alai.no Access policy
      • Copy Client ID + Client Secret
    2. Retention:Update KeepBitwarden lastitem 30 days, archive older to Azure Blob Storage.

      Example archive command (to be automated)b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2:

      findbw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session $(cat /tmp/bw-session)
      # Update username = "CF-Access-Client-Id: <new_client_id>"
      # Update password = "<new_client_secret>"
      
    3. Test new token:

      curl -s -w "\nHTTP: %{http_code}\n" \
        -H "CF-Access-Client-Id: <new_client_id>" \
        -H "CF-Access-Client-Secret: <new_client_secret>" \
        https://lightrag.alai.no/health
      
    4. Verify health probes work:

      bash ~/system/evidencetools/lightrag-health.sh
      # Should return HEALTHY (exit code 0)
      
    5. Update cf-access-token-registry.json:

      jq '.["lightrag.alai.no"].last_verified = "'"$(date -name "lightrag-health-*.json" -mtimeu +30%Y-%m-%dT%H:%M:%SZ)"'"' \
        | xargs tar -czf
        ~/system/evidence/archive-$(datestate/cf-access-token-registry.json +%Y%m).tar.gz> #/tmp/registry.json
      Uploadmv to Azure Blob
      az storage blob upload \
        --account-name plockfrontstaging \
        --container-name evidence \
        --name lightrag-health-archive-$(date +%Y%m).tar.gz \
        --file/tmp/registry.json ~/system/evidence/archive-$(date +%Y%m).tar.gzstate/cf-access-token-registry.json
      
    6. Revoke old token in Cloudflare dashboard (after confirming new token works)



    Changelog

    2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
    2026-04-21 — Initial version (baseline setup + first run)


    Document Owner: FlowForge
    Last Updated: 2026-04-2105-06
    Approved By: PendingCEO (Alem approvalBasic) for LaunchAgentMC installation#99400 deliverable D8