Skip to main content

LightRAG Health Monitoring Runbook

LightRAG ObservabilityHealth Monitoring Runbook

Domain note (MC2026-05-17): #99400References Updated)to lightrag.basicconsulting.no and ollama.basicconsulting.no are legacy hostnames. Current live endpoints: lightrag.alai.no and ollama.alai.no.

Status: ACTIVE
Last Updated:Created: 2026-05-06 (MC #99400 hotfix)04-21
Owner: FlowForge (AgentForge)
Related: MC #99400,#8545, MC #8545INFRA-CF-001


Purpose

HealthContinuous health monitoring and troubleshooting for LightRAG observabilitystack stack(Azure VM + Cloudflare) following MCthe #994002026-04-20 hotfixoutage fix (secretsCF hygieneBrowser +Integrity probeCheck topology resolution)configuration).

This runbook covers:

  • ProbeHealth topologycheck (threescript separate probe surfaces)usage
  • CanonicalInterpreting CF Access service tokenresults
  • Vault-sourcedAutomated tokenmonitoring injection patternsetup
  • Troubleshooting failurecommon modesissues
  • Auto-healRollback removal rationaleprocedures

ProbeArchitecture TopologyOverview

LightRAG healthruns on Azure VM (20.240.61.67:9621) and is monitoredexposed via THREECloudflare separatetunnel probeat surfaces,https://lightrag.basicconsulting.no. eachThe withsystem differentdepends purposeson:

and
    authentication
  1. Azure requirements:VM — Docker containers (lightrag + neo4j)
  2. Cloudflare tunnel — Routes traffic through Mac Studio relay
  3. Cloudflare Access — Authentication via service tokens
  4. Cloudflare BIC rule — Allows automation clients (Python UA)
  5. Ollama upstream — https://ollama.basicconsulting.no for LLM inference

See: Azure LightRAG Migration Runbook


Health Check Script

Location

~/system/tools/lightrag-health.sh

Manual Execution

bash ~/system/tools/lightrag-health.sh

Output

  • Terminal: Colored status summary (green/yellow/red per layer)
  • JSON: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.json (machine-readable)
  • Markdown: ~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.md (human-readable)

Exit Codes

  • 0 — All checks passed (healthy)
  • 1 — Warnings detected (degraded but operational)
  • 2 — Errors detected (critical issues)

Check Layers

Layer 1: Azure VM Health

Healthy running,
Probe SurfaceCheck ScriptWhat it tests Endpoint Auth RequiredLayer ProbedFailure Modes CaughtStatus Labelscriteria
CF Tunnel Probelightrag-health.sh (check_cf_tunnel)https://lightrag.alai.no/healthdirect_access YESDirect (CFHTTP Accessto headers)VM IP:port CloudflareHTTP Access200, + AppCF auth failure (302), CF outage, app downHEALTHY | DOWN | AUTH_FAIL | UNREACHABLEstatus=healthy
Direct VM Probelightrag-health.sh (check_vm_direct_access)http://20.240.61.67:9621/healthdocker_containers NOContainer (internalstatus IP)via SSH Azure VM app-layerApp crash, VM down, network unreachableHEALTHY | DOWN | UNREACHABLE
Boot Probeboot.sh lines 88-94http://20.240.61.67:9621/health (or ${LIGHTRAG_VM_IP})NO (internal IP)Azure VM app-layerApp crash, VM downHEALTHY | DOWN | UNREACHABLE
Monitoring Daemoncom.john.lightrag-monitor (calls lightrag-health-with-alert.sh)Same as lightrag-health.sh (both CF tunnellightrag + directneo4j VM) YES for CF tunnel layerBoth layers + Slack alertsAll failure modes + sends alerts to #alertsSends Slack alert on exit code 2healthy

KeyNote: DecisionSSH access currently unavailable (CEOpublickey approved):auth). BootManual probeverification usesrequired directvia VMAzure IPPortal foror speedafter +SSH nokey authsetup.

dependency.

Layer 2: Cloudflare Network

probeexternalsmoke-check.


Canonical

Service
CheckWhat it testsHealthy criteria
cf_tunnelHTTPS via CF tunnel HTTP remains200, inlatency lightrag-health.sh< for2s
cf_bic_rule BIC CFrule Accessconfiguration Rule Tokenenabled,

Tokencovers Location

both endpoints
python_uaPython client accessHTTP 200 with Python UA

Bitwarden Item ID:Critical: b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2
Bitwarden Item Name: ligthrag monitor deamon service tokenpython_ua (note:check typoverifies inthe originalCF-BIC-001 itemrule name)

is

⚠️active. DOIf NOTthis usefails Bitwarden item ID: 61d0bf21-2823-4ae3-a141-95046434591a — DEPRECATED 2026-05-06 (returnedwith HTTP 302,403, tokenautomation dead)clients (pi-orchestrator, lightrag-outbox-ingest.js) will break.

ScriptsLayer Using3: ThisApplication TokenHealth

  • tunnel
  • usedVMIP)

    Daemons Using This Token

    • com.john.lightrag-monitor (LaunchAgent for scheduled health checks + Slack alerts)

    Rotation Policy

    Cadence: Manual (no TTL configured)
    Owner: FlowForge
    Next Rotation: TBD (CEO decision pending)

    CheckWhat it testsHealthy criteria
    ~health_endpoint/system/tools/lightrag-health.shhealth (CFendpoint status=healthy, check)pipeline_busy=false
    ~query_endpoint/system/boot.shquery (optional,with notnaive currentlymode HTTP since200, bootvalid proberesponse, uses< direct30s

    Note: CFFirst Accessquery serviceafter tokensidle domay nottake expirelonger by(cold defaultstart). unlessIf atimeout, TTLretry isonce.

    explicitly

    Layer set.4: CurrentOllama tokenUpstream

    hasnoexpirationconfigured.
    CheckWhat it testsHealthy criteria
    api_tagsOllama model availabilityqwen2.5-coder:32b + bge-m3 present

    Critical: LightRAG requires these specific models. If missing, queries will fail.


    MCInterpreting #99513Results

    Green (Exit 0)PlistHealthy

    EnvironmentVariables

    All Patterncritical (LaunchAgentchecks Context)passed. System operational.

    Date:Action: 2026-05-06
    None Why: LaunchAgent context has no TTY → BW interactive unlock impossible at scheduled fire times.
    Pattern: CF Access credentials injected directly into plist EnvironmentVariables block.
    Hardening: Plist mode 0600 (NOT default 0644) prevents launchctl print exposure to other local users.required.

    Configuration

    Yellow
      (Exit
    • Plist:1) ~/Library/LaunchAgents/com.john.lightrag-monitor.plist
    • Required keys in <key>EnvironmentVariables</key>: CF_ACCESS_CLIENT_ID + CF_ACCESS_CLIENT_SECRET
    • Script lightrag-health.sh lines 28-31 use ${VAR:-} parameter expansion to preserve env-injected values; otherwise the script's own initialization clobbers env vars.

    Known RegressionWarnings

    Non-critical issues detected. System degraded but operational.

    Trade-offCommon accepted under MC #99513:warnings: plaintext CF Access credentials in plist (mode 0600).

    • BlastSSH radius:access single CF Access applicationunavailable (revocableknown in seconds from CF dashboard, not user/account-tier)limitation)
    • SupersedeCF plan:API whentoken MCunavailable #99495(can't fleetverify KeychainBIC Servicesrule, migrationbut lands,Python replaceUA plisttest injectioncompensates)
    • with
    • Slow Keychain-backedresponse lookup.times (> 2s but < 30s)

    Action: Review warning details. Monitor next check. Escalate if warnings persist 3+ checks.

    RotationRed Procedure(Exit 2) — Errors

    Critical issues detected. System may be non-operational or partially failed.

    Common errors:

    • Query endpoint timeout (> 30s)
    • HTTP 403 from Python UA (BIC rule disabled)
    • Ollama models missing
    • Direct VM access failed

    Action:

    1. Review error details in JSON evidence
    2. Follow troubleshooting section below
    3. If unresolved after 30 min, consider rollback (see Azure LightRAG Migration Runbook)

    Automated Monitoring Setup

    LaunchAgent Installation (DRAFT — Pending Alem Approval)

    Draft file: ~/system/evidence/lightrag-monitor-launchagent-draft.plist

    Schedule: Daily at 9:00 AM (frequent for 4-week monitoring period)

    Installation steps (when CFapproved):

    token
    # rotates)1. 
      Copy
    1. Revokedraft oldto tokenLaunchAgents incp Cloudflare Access dashboard
    2. Generate new client_id + client_secret in CF
    3. Update BW item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2 (canonical) — login.username + login.password (legacy storage) AND custom fields cf-access-client-id ~/ cf-access-client-secret (post-D2 future state)
    4. Edit system/evidence/lightrag-monitor-launchagent-draft.plist EnvironmentVariables block with new values
    5. launchctl bootout gui/$(id -u)/com.john.lightrag-monitor && launchctl bootstrap gui/$(id -u)\ ~/Library/LaunchAgents/com.john.lightrag-monitor.plist # 2. Load the agent launchctl load ~/Library/LaunchAgents/com.john.lightrag-monitor.plist # 3. Start immediately (optional) launchctl start com.john.lightrag-monitor
  • Manual trigger:

    launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
    

    Logs:

    verify
    • stdout: ~/system/logs/lightrag-monitor/stdout.log
    • stderr: ~/system/logs/lightrag-monitor/stderr.log

    Slack Alerts (To Be Implemented)

    When LaunchAgent detects exit code 02 or(errors), 1

  • send alert to #alerts channel:

    node ~/system/tools/slack.js send alerts "🚨 LightRAG health check FAILED at $(date). Check ~/system/evidence/lightrag-health-*.json"
    

    This requires wrapping the health check script in a post-execution hook (see plist comments).


    Health History Database

    Location: ~/system/databases/lightrag-health.db

    Schema: ~/system/tools/lightrag-health-db-init.sql

    Cross-referencesTables

    • MC #99513health_checksthisOverall fixcheck results
    • MC #99495health_check_detailsfleet-wideIndividual BWlayer/check session hygiene + Keychain migration (supersedes this pattern when delivered)
    • MC #99400 — original LightRAG observability runbook (D8 publish)
    • D2 (BW field rename + DEPRECATED untag) DEFERRED 2026-05-06 — pending CEO Bitwarden vault unlock; runtime is now plist-driven, BW lookup is fallback onlyresults

    Views

    • health_checks_summary — Last 30 checks
    • health_checks_trend — Daily aggregates

    Query Examples

    Last 10 checks:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT timestamp, overall_status, errors, warnings FROM health_checks ORDER BY created_at DESC LIMIT 10;"
    

    Trend over last 7 days:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT * FROM health_checks_trend WHERE check_date >= date('now', '-7 days');"
    

    All errors in last 24 hours:

    sqlite3 ~/system/databases/lightrag-health.db \
      "SELECT hc.timestamp, hcd.layer, hcd.check_name, hcd.message FROM health_checks hc
       JOIN health_check_details hcd ON hc.id = hcd.health_check_id
       WHERE hcd.status = 'error' AND hc.created_at >= datetime('now', '-24 hours');"
    

    Note: Database logging will be implemented in next iteration of the health script.


    Vault-Sourced Token Injection PatternTroubleshooting

    Issue: Query endpoint timeout (HTTP 000, 35s)

    AllPossible scriptscauses:

    MUST
      source
    1. First query after idle (cold start)
    2. Ollama FORGE overloaded
    3. Network path issue (Mac Studio → CF Access tokensAzure from BitwardenCF vault atMac runtime.Studio)

    NEVER hardcode credentials in scripts.Diagnosis:

    Shell Script Pattern

    # BWTest sessionOllama livenessupstream check (updated location ~/.cache/bw-session)
    BW_SESSION_FILE="${HOME}/.cache/bw-session"
    if [[ ! -f "$BW_SESSION_FILE" ]] || [[ ! -s "$BW_SESSION_FILE" ]]; then
      echo "ERROR: BW_SESSION file missing or empty. Run 'bw unlock' first."
      exit 1
    fi
    
    BW_SESSION=$(cat "$BW_SESSION_FILE")
    
    # Load CF Access credentials from vault using custom fields (MC #99495 pattern)
    function load_cf_credentials() {
      local bw_item
      bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION" 2>/dev/null)
      
      if [[ $? -ne 0 ]] || [[ -z "$bw_item" ]]; then
        echo "ERROR: Failed to retrieve CF Access token from Bitwarden. Check BW_SESSION."
        return 1
      fi
      
      # Extract from custom fields array (jq pattern)
      CF_ACCESS_CLIENT_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
      CF_ACCESS_CLIENT_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')
      
      if [[ -z "$CF_ACCESS_CLIENT_ID" ]] || [[ -z "$CF_ACCESS_CLIENT_SECRET" ]]; then
        echo "ERROR: CF Access credentials not found in Bitwarden item custom fields."
        return 1
      fi
      
      return 0
    }
    
    # Load credentials
    if ! load_cf_credentials; then
      echo "FATAL: Cannot proceed without CF Access credentials."
      exit 2
    fi
    
    # Use in curldirectly
    curl -s https://ollama.basicconsulting.no/api/tags \
      -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID"(grep CF_ACCESS_CLIENT_ID ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" \
      -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET"(grep CF_ACCESS_CLIENT_SECRET ~/Library/LaunchAgents/com.john.pi-orchestrator.plist | sed 's/.*<string>\(.*\)<\/string>/\1/')" | jq '.models | length'
    
    # Check if FORGE is responding
    curl http://10.0.0.2:11434/api/ps
    
    # Test query directly with extended timeout
    curl -s --max-time 60 \
      -H "CF-Access-Client-Id: ..." \
      -H "CF-Access-Client-Secret: ..." \
      -H "Content-Type: application/json" \
      -X POST \
      -d '{"query":"test","mode":"naive","only_need_context":false}' \
      https://lightrag.alai.basicconsulting.no/healthquery
    

    Failure Modes and Troubleshooting

    AUTH_FAIL (HTTP 302 or 401)

    Symptom: Probe returns HTTP 302 redirect or 401 Unauthorized instead of 200 OK.

    Root Cause:

    • BW_SESSION expired → ~/.cache/bw-session stale or missing
    • Wrong Bitwarden item ID (using deprecated 61d0bf21 instead of canonical b42cb5c2)
    • CF Access service token rotated without updating Bitwarden

    Fix:

    • If cold start: Retry once, should succeed
    • If FORGE overloaded: Identify competing workload, throttle/stop
    • If persistent: Check Azure LightRAG Migration Runbook for tunnel troubleshooting

    Issue: Python UA blocked (HTTP 403)

    Root cause: CF Browser Integrity Check rule disabled or misconfigured.

    Diagnosis:

    # Refresh BW session
    bw unlock
    # Copy new session token to ~/.cache/bw-session
    
    # Verify canonical token is live
    BW_SESSION=$(cat "${HOME}/.cache/bw-session")
    bw_item=$(bw get item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION")
    echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value'
    echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value'
    
    # Test with curlPython CF_ID=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-id") | .value')
    CF_SECRET=$(echo "$bw_item" | jq -r '.fields[] | select(.name=="cf-access-client-secret") | .value')UA
    curl -s -w "\nHTTP: %{http_code}\n" \
      -A "Python/3.11 urllib/1.26" \
      -H "CF-Access-Client-Id: $CF_ID"..." \
      -H "CF-Access-Client-Secret: $CF_SECRET"..." \
      https://lightrag.alai.basicconsulting.no/health
    

    Fix:

    1. Verify CF Configuration Rule (Ruleset 4fc2c122d04d4791a5d17409b097c510, Rule c5990f19f655441180ae886f4512de40)
    2. Ensure rule is enabled and expression includes lightrag.basicconsulting.no
    3. See: ~/system/rules/cf-proxied-api-bic-whitelist.md

    Critical: This is a repeat of the 2026-04-20 outage. If rule is disabled, all automation breaks.


    DOWNIssue: (HTTPOllama 5xxmodels or Connection Refused)missing

    Symptom:Symptoms: Probeapi_tags returnscheck HTTP 500/502/503fails or connectionwarns refused.about missing models.

    RootRequired Cause:models:

    • LightRAGqwen2.5-coder:32b-instruct-q8_0 app(LLM crashedinference)
    • Dockerbge-m3:latest container stopped
    • Azure VM down
    • Neo4j backend unavailable(embeddings)

    Fix:

    # Check Azure VM status via Azure Portal
    # https://portal.azure.com → vm-alai-lightrag
    
    # Or via az CLI
    az vm show -d -g rg-alai-lightrag -n vm-alai-lightrag --query "powerState"
    
    # SSH to VMFORGE (if access configured)10.0.0.2)
    ssh -i ~/.ssh/azure_alai [email protected][email protected]
    
    # CheckPull Dockermissing containersmodels
    dockerollama pspull --filterqwen2.5-coder:32b-instruct-q8_0
    name=lightragollama dockerpull ps --filter name=neo4jbge-m3:latest
    
    # RestartVerify
    ifollama neededlist docker| restart $(docker psgrep -qE --filter name=lightrag)"(qwen2.5-coder:32b-instruct-q8_0|bge-m3:latest)"
    

    UNREACHABLEIssue: (TimeoutDirect orVM Networkaccess Error)failed

    Symptom:Symptoms: Probedirect_access timescheck out after 5-10 seconds withoutreturns HTTP response.error or timeout.

    Root Cause:

    • ANVIL → Azure VM network path broken
    • ISP IP rotation (Mac Studio public IP changed, NSG rule outdated)
    • Azure VM firewall blocking port 9621

    Fix:Diagnosis:

    # Test direct VM connectivityHTTP
    curl -s --connect-timeout 5 http://20.240.61.67:9621/health
    
    # Check currentNSG ISPrules (Mac Studio IP curlmay -shave https://ifconfig.co
    
    # Compare to NSG rule source IPchanged)
    az network nsg rule show \
      -g rg-alai-lightrag \
      --nsg-name vm-alai-lightragNSG \
      -n allow-lightrag-macstudio \
      --query "sourceAddressPrefix"
    
    # Compare to current ISP IP
    curl -s https://ifconfig.co
    

    Fix: If IPsMac differ,Studio ISP IP rotated, update NSG rulerule:

    NEW_IP=$(curl -s https://ifconfig.co)
    az network nsg rule update \
      -g rg-alai-lightrag \
      --nsg-name vm-alai-lightragNSG \
      -n allow-lightrag-macstudio \
      --source-address-prefixes "${NEW_IP}/32"
    

    Note: Azure resources (rg-alai-lightrag) are not currently visible via az CLI. This may indicate different subscription or access issue. Direct HTTP access confirms VM is operational.


    Auto-HealRollback Removal (MC #99400 D5)Procedure

    If LightRAG stack becomes unstable (exit code 2 persisting > 30 min, or CEO directive):

    Status:Follow: com.alai.lightrag-auto-healAzure LaunchAgentLightRAG REMOVEDMigration 2026-05-06Runbook → Section "Rollback Procedure"

    Rationale

    Summary:

    The

      auto-heal
    1. Revert daemonconsumer wasURLs designedfrom https://lightrag.basicconsulting.no to automaticallyhttp://localhost:9621
    2. restart
    3. Restart local Docker LightRAG
    4. on
    5. Verify local service
    6. Optionally deprovision Azure VM

    Expected rollback time: 5-15 minutes
    Data loss risk: ZERO (local volumes preserved)


    Maintenance

    Weekly Tasks (First 4 Weeks)

    •  Review health check failures.trend However,via itdatabase had fatal design flaws:

      1. Wrong probe target: Checked 127.0.0.1:9621 (localhost), but LightRAG runs on Azure VM 20.240.61.67:9621. Auto-heal never triggered on real CF outages.query
      2. ADC token drift: Relied on Application Default Credentials (ADC)Check for Azurepersistent CLI commands. When ADC tokens expired, script failed silently.warnings/errors
      3. Blast radiusVerify risk:evidence Restartfiles logicare couldbeing affect production LightRAG instance with 121K pending docs.generated
      4. False senseCompare oflatency safety: Having a non-functional auto-heal daemon creates operational risk — humans assume "auto-heal will catch it" when it won't.

      CEO Decisiontrends (2026-05-06)

      p50,

      D5 = RIP: Remove auto-heal LaunchAgent + script. Replacement = Azure Monitor alert action (child MC to be filed separately).

      Archived Files

      • ~/Library/LaunchAgents/_archive/com.alai.lightrag-auto-heal.plist.deprecated-99400-2026-05-06
      • ~/system/tools/_archive/lightrag-auto-heal.sh.deprecated-99400-2026-05-06
      • ~/system/state/_archive/lightrag-auto-heal-deprecated-99400-2026-05-06/
      • RIP Note: ~/Library/LaunchAgents/_archive/RIP-NOTE-lightrag-auto-heal-99400.mdp95)

      ReplacementAfter Plan4 Weeks

      AzureIf Monitorsystem can directly alert on:

      • VM metricsstable (CPU, memory, disk, network)
      • Application Insights custom metrics (if LightRAG reports to AppInsights)
      • Log Analytics queries (if LightRAG logs to Log Analytics workspace)

      Action groups can:

      • Send Slack alerts via webhook
      • Trigger Azure Automation runbooks for auto-remediation
      • Page on-call engineer via PagerDuty/Opsgenie

      This will be scoped in a future child MC per CEO directive.


      com.john.lightrag-monitor LaunchAgent

      Status: ACTIVE (approved + activated 2026-05-06)

      Schedule

      Runs periodically (check plist StartInterval or StartCalendarInterval for current schedule).

      What It Probes

      • Calls ~/system/tools/lightrag-health-with-alert.sh
      • Runs both CF tunnel probe + direct VM probe
      • Sends Slack alert to #alerts channel onno exit code 2 (errors)
      • in
      4

      Environment Variables

      The LaunchAgent plist includes:weeks):

      • LIGHTRAG_VM_IPReduce monitoring Setfrequency from daily to 20.240.61.67 (Azure VM internal IP)weekly
      • BWUpdate session sourced fromLaunchAgent ~/.cache/bw-sessionStartCalendarInterval (vault-sourcedto credentials)

      Manual Trigger

      launchctl kickstart -k gui/$(id -u)/com.john.lightrag-monitor
      

      Known Issues

      • Ollama API 302: ollama.alai.no returns HTTP 302run on everyMondays healthonly
      • check,
      • Archive triggeringold false-alarmevidence Slackfiles alerts.(keep Thislast is a pre-existing issue unrelated to MC #99400. File separate MC for Ollama endpoint CF Access auth configuration.30)

      TokenEvidence Rotation ProcedureFiles

      All health checks generate timestamped evidence:

      Rotation Cadence:Location: Manual (no TTL configured — rotation policy is OPEN)~/system/evidence/lightrag-health-YYYYMMDD-HHMMSS.*

      When

      Retention: Keep last 30 days, archive older to Rotate

      Azure
        Blob
      • Suspected token compromise
      • Periodic security hygiene (e.g., every 90 days — CEO decision pending)
      • After employee offboarding (if token was shared)

      Rotation Steps

      1. Storage.

        GenerateExample newarchive servicecommand token in Cloudflare dashboard:

        • Navigate (to Zerobe Trust → Access → Service Auth
        • Create new service token for lightrag.alai.no Access policy
        • Copy Client ID + Client Secret
      2. Update Bitwarden item b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2automated):

        BW_SESSION=$(cat "${HOME}/.cache/bw-session")
        bw edit item "b42cb5c2-dc9b-4a43-bcc7-4adcfde992b2" --session "$BW_SESSION"
        # Update custom fields: cf-access-client-id = "<new_client_id>"
        # Update custom fields: cf-access-client-secret = "<new_client_secret>"
        
      3. Test new token:

        curl -s -w "\nHTTP: %{http_code}\n" \
          -H "CF-Access-Client-Id: <new_client_id>" \
          -H "CF-Access-Client-Secret: <new_client_secret>" \
          https://lightrag.alai.no/health
        
      4. Verify health probes work:

        bashfind ~/system/tools/lightrag-health.shevidence # Should return HEALTHY (exit code 0)
        
      5. Update cf-access-token-registry.json:

        jq '.["lightrag.alai.no"].last_verified =-name "'"lightrag-health-*.json" -mtime +30 \
          | xargs tar -czf ~/system/evidence/archive-$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"'Y%m).tar.gz
        
        \# ~/system/state/cf-access-token-registry.jsonUpload > /tmp/registry.json
        mv /tmp/registry.json ~/system/state/cf-access-token-registry.json
        
      6. Revoke old token in Cloudflare dashboard (after confirming new token works)


      Known Dependencies

      Azure VM NSG — Source IP Sensitivity

      The direct VM probe (http://20.240.61.67:9621/health) depends on an NSG allow-rule onto Azure VMBlob vm-alai-support that whitelists the ANVIL machine's source IP.

      Risk: If ANVIL's ISP rotates its public IP address, the NSG source-IP rule becomes stale and the direct VM probe will silently degrade to UNREACHABLE without any Cloudflare involvement. The CF tunnel probe will still work (it routes through Cloudflare, no direct IP dependency), but direct VM access will fail, giving a misleading "CF layer OK / VM layer unreachable" diagnosis.

      Detection: Run curl -s https://ifconfig.co on ANVIL and compare to the NSG rule's source address prefix:

      az networkstorage nsgblob rule show \
        -g rg-alai-lightragupload \
        --nsg-account-name vm-alai-lightragNSG \
        -n allow-lightrag-macstudioplockfrontstaging \
        --querycontainer-name "sourceAddressPrefix"evidence \
        --name lightrag-health-archive-$(date +%Y%m).tar.gz \
        --file ~/system/evidence/archive-$(date +%Y%m).tar.gz
      

      Long-term fix: Provision an internal DNS hostname for the VM so that the probe endpoint is hostname-based and survives IP rotation without NSG changes. File a child MC when ANVIL ISP rotation becomes a recurring issue.



      Changelog

      2026-05-06 — MC #99495 cleanup: StartInterval 900s MTTD reduction, BIC hostname migration to ollama.alai.no, jq custom-fields extraction, Known Dependencies section added
      2026-05-06 — MC #99400 hotfix: secrets hygiene + probe topology resolution, auto-heal removed, com.john.lightrag-monitor activated
      2026-04-21 — Initial version (baseline setup + first run)


      Document Owner: FlowForge
      Last Updated: 2026-05-06 (MC #99495)04-21
      Approved By: CEOPending (Alem Basic)approval for MCLaunchAgent #99400 deliverable D8installation