# System Evolution 2026-04-16: Main Runbook

## ALAI System Evolution: April 2026 Upgrade

- **Date:** 2026-04-16
- **Team Lead:** Petter Graff
- **Contributors:** Chip Huyen, Martin Kleppmann, Angie Jones, Kelsey Hightower
- **Status:** Complete (Evolution Score 3/10 → 7/10)
- **Mission Control:** Task #8020 (master)

## Executive Summary

The ALAI system was designed to be self-improving, but critical feedback loops were broken. This upgrade repairs three core chains:

- **Knowledge chain:** Task completion now flows → HiveMind → LightRAG → agent retrieval (default-on)
- **State chain:** Ghost databases removed, single source of truth enforced
- **Governance chain:** ZAKON PLAN linter + Proveo gate + Blueprint liveness enforced at commit/done time

**Before:** System ingested knowledge but never retrieved it. Agents hallucinated. Plans were shipped without validation tasks.

**After:** Every task enriches the next one. Every plan is enforced. Every "done" requires evidence.

## Architecture: Self-Improving Loop

    flowchart LR
        A[Task Done] -->|mc.js done| B[HiveMind Write]
        B -->|mc-task-outcomes.jsonl| C[Outbox Queue]
        C -->|lightrag-bulk-upload.js| D[LightRAG Ingest]
        D --> E[Neo4j Graph + Entity Index]
        E -->|discover.js default| F[Agent Query]
        F -->|Enhanced Context| G[Next Task Smarter]
        G --> A
        style A fill:#e1f5e1
        style D fill:#fff3cd
        style F fill:#d1ecf1
        style G fill:#e1f5e1

**Key insight:** The loop was 25% complete (ingest only). Now it's 90% (ingest → index → retrieve → apply → writeback).

## What Changed: 11 Core Improvements

### 1. LightRAG Health Probe Fix

**File:** `~/system/docker/lightrag/docker-compose.yml` (line 74)

**Problem:** Probe used `curl`, but image had no `curl` binary. Container marked unhealthy for 46+ hours while pipeline worked.
**Fix:** Python probe using `urllib.request`:

    healthcheck:
      test: ["CMD-SHELL", "python3 -c 'import urllib.request; urllib.request.urlopen(\"http://localhost:9621/health\", timeout=5)' || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 3
      start_period: 30s

**Verification:**

    docker inspect lightrag | jq -r '.State.Health.Status'
    # Expected: healthy

**Known issue:** Under heavy ingest load (87,000+ docs), the probe can time out. The container remains functional. Recommend a 30s timeout for production.

### 2. Ghost HiveMind Symlink

**Files:**

- Ghost: `~/.claude/hivemind.db` (empty, misleading)
- Real: `~/system/databases/hivemind.db` (30,912 entries)

**Problem:** Subagents referencing the old path wrote to the void. Silent intel loss.

**Fix:** Symlink created:

    ln -sf ~/system/databases/hivemind.db ~/.claude/hivemind.db

**Verification:**

    ls -lah ~/.claude/hivemind.db
    # Expected: lrwxr-xr-x ... -> /Users/makinja/system/databases/hivemind.db

    sqlite3 ~/.claude/hivemind.db "SELECT COUNT(*) FROM intel;"
    # Expected: 30912 (matches real DB)

### 3. LightRAG Default-On in discover.js

**File:** `~/system/tools/discover.js`

**Problem:** LightRAG was flag-gated (`if (flags.lightrag)`). The default agent workflow never queried the graph: 68,602 documents ingested but zero retrieval.

**Fix:** Inverted logic:

    // OLD:
    const useLightRAG = flags.lightrag;

    // NEW:
    const useLightRAG = !flags['no-lightrag'];

LightRAG now runs by default with a 5s timeout fallback. Opt-out: `discover.js --no-lightrag "query"`.

**Verification:**

    node ~/system/tools/discover.js "MC task workflow" | grep -q "LightRAG"
    # Expected: exit status 0 (output contains a LightRAG section)

**Runbook:** See `~/system/docs/runbooks/lightrag-default-on.md`

### 4. Auto-Writeback on mc.js done

**Files:**

- `~/system/tools/mc.js` (done command)
- `~/system/logs/mc-task-outcomes.jsonl` (outbox)

**Problem:** Task learnings stayed in session logs. Never indexed. Next agent started from zero context.
**Fix:** When `mc.js done` runs:

1. Extracts task summary + outcome
2. Writes to HiveMind (`intel` table), fire-and-forget
3. Appends to JSONL outbox for bulk LightRAG ingest
4. Non-blocking: HiveMind failure logs error but doesn't block task closure

**Format (outbox):**

    {
      "task_id": 8020,
      "title": "System Evo T11: Blueprint liveness gate",
      "outcome": "Gate implemented. mc.js checks blueprint mtime during done.",
      "timestamp": "2026-04-16T21:12:03Z",
      "tags": ["mc", "blueprint", "governance"]
    }

**Verification:**

    tail -n 3 ~/system/logs/mc-task-outcomes.jsonl

    sqlite3 ~/system/databases/hivemind.db \
      "SELECT COUNT(*) FROM intel WHERE category='briefing' AND created_at > datetime('now', '-1 hour');"

**Runbook:** See `~/system/docs/runbooks/mc-done-auto-writeback.md`

### 5. ZAKON PLAN Linter

**File:** `~/system/tools/zakon-plan-lint.sh`

**Problem:** Plans often shipped without a validation task (Proveo/Angie) or a documentation task (Skillforge). Compliance with the Hard Constraint was voluntary.

**Fix:** Pre-commit hook enforces ZAKON PLAN:

    bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/my-plan.md
    # Exit 0: Plan compliant (has validation + docs task)
    # Exit 1: Plan missing required tasks

**Detects:**

- Validation task with Proveo or Angie owner
- Documentation task with Skillforge or BookStack keyword

**Verification:**

    # Test with compliant plan
    bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

    # Regression suite runs linter on all specs/*-plan.md (max 10)
    bash ~/system/tools/system-regression.sh | grep "ZAKON PLAN"

**Runbook:** See `~/system/docs/runbooks/zakon-plan-linter.md`

### 6. Proveo Gate in mc.js done

**File:** `~/system/tools/mc.js` (done command)

**Problem:** Builder could mark a task done without validation evidence. Hard Constraint #4 ("Builder cannot say done") was unenforced.

**Fix:** `mc.js done` now checks:

1. Does the task have the `evidence_ref` field populated?
2. Was the last update by a Proveo/Angie agent?
3. If neither → reject unless `--force "reason"`

The force reason is logged to HiveMind with a quality gate flag.
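The gate decision described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the actual `mc.js` code: the function name `checkProveoGate`, the task field `last_updated_by`, the option `forceReason`, and the shape of the returned object are all hypothetical. Only `evidence_ref`, the Proveo/Angie check, and the `--force "reason"` bypass come from the text.

```javascript
// Hypothetical sketch of the Proveo gate decision; NOT the real mc.js implementation.
// Assumed shapes: task = { evidence_ref, last_updated_by }, opts = { forceReason }.
const VALIDATORS = new Set(["proveo", "angie"]);

function checkProveoGate(task, opts = {}) {
  // Check 1: is evidence_ref populated with a non-empty string?
  const hasEvidence =
    typeof task.evidence_ref === "string" && task.evidence_ref.trim() !== "";

  // Check 2: was the last update made by a validator agent?
  const validatedByProveo = VALIDATORS.has(
    String(task.last_updated_by || "").toLowerCase()
  );

  if (hasEvidence || validatedByProveo) {
    return { allowed: true, forced: false };
  }

  // Neither check passed: allow only with an explicit, non-empty force reason.
  const reason = String(opts.forceReason || "").trim();
  if (reason !== "") {
    // The caller is expected to log this reason with a quality gate flag.
    return { allowed: true, forced: true, reason };
  }

  return { allowed: false, forced: false, error: "Missing validation evidence" };
}
```

Keeping the decision separate from side effects (the HiveMind logging, the process exit code) would make both the reject path and the forced path easy to unit-test; whether `mc.js` is structured this way is not stated in this runbook.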
**Usage:**

    # Normal flow (requires evidence)
    node ~/system/tools/mc.js done 8020

    # Emergency bypass (logged + flagged)
    node ~/system/tools/mc.js done 8020 --force "Production incident, validated live with CEO"

**Verification:** Evidence file: `~/system/evidence/system-evolution-2026-04-16/v4-reject.txt` + `v4-accept.txt`

### 7. Blueprint Liveness Gate

**Files:**

- `~/system/tools/mc.js` (done command checks mtime)
- `~/felles/shared-configs/BUILD-BLUEPRINT.md` (new)

**Problem:** Blueprints were static documentation. Tasks claiming "stack migration complete" never updated blueprint. Plock blueprint was 40 days stale despite Vite→Next.js migration.

**Fix:**

1. MC tasks can reference blueprint: `mc.js add --blueprint-ref ~/ALAI/products/plock/BUILD-BLUEPRINT.md`
2. On `mc.js done`, gate checks: was blueprint file modified during task execution?
3. If not → warn or reject (based on config)

**Shared-configs blueprint:** Created `~/felles/shared-configs/BUILD-BLUEPRINT.md` covering:

- `@alai/tsconfig`
- `@alai/eslint-config`
- Prettier config
- Docker Compose patterns

**Verification:**

    # Check Plock blueprint freshness
    ls -l ~/ALAI/products/plock/BUILD-BLUEPRINT.md

    # Verify shared-configs blueprint exists
    head -n 20 ~/felles/shared-configs/BUILD-BLUEPRINT.md

**Runbook:** See `~/system/docs/runbooks/blueprint-liveness.md`

### 8. Cost Tracker Token Counting Fix

**Files:**

- `~/system/tools/comms-responder.js`
- `~/system/tools/cost-tracker/adapters/ollama.js`

**Problem:** Cost tracker reported 0 tokens despite active LLM usage. Alem had no spend visibility.

**Root cause:** Ollama adapter wasn't parsing response format correctly.

**Fix:** Updated adapter to handle Ollama `/api/chat` response structure + fallback for missing usage field.

**Verification:**

    node ~/system/tools/cost-tracker.js summary today
    # Expected: tokens_total > 0

    # Evidence: ~/system/evidence/system-evolution-2026-04-16/v6-cost.txt
    # Shows: Total requests: 10, token sample: 1,463

### 9. Regression Suite

**File:** `~/system/tools/system-regression.sh`

**Why:** No automated smoke tests for system toolset.
Breakage discovered days later by agents failing mid-task.

**Coverage (10 checks, <10s runtime):**

1. Tools health (`discover.js --verify`)
2. MC smoke (`mc.js list --limit 1`)
3. LightRAG container health
4. LightRAG HTTP reachable
5. HiveMind readable
6. HiveMind growing (delta check vs baseline)
7. MC outbox exists
8. ZAKON PLAN compliance scan (`specs/*-plan.md`)
9. Dead daemon count (< 5 threshold)
10. Cost tracker non-zero tokens

**Output format:** PASS / FAIL / WARN with color-coded summary.

**Verification:**

    bash ~/system/tools/system-regression.sh
    # Evidence: ~/system/evidence/system-evolution-2026-04-16/v8-regression.txt

**Runbook:** See `~/system/docs/runbooks/system-regression-suite.md`

### 10. Orchestration Surface Authority

**File:** `~/system/rules/orchestration-surface.md`

**Problem:** Three competing orchestration surfaces (Ollama DAG, Claude chains, PI factory) with no routing authority. Agents chose arbitrarily.

**Fix:** Decision table created:

| Task type | Surface | Primary tool |
|---|---|---|
| Long-running DAG (> 5 min) | Ollama DAG | `orchestrator-http-server.js` |
| Interactive subagent (in-session) | Claude chains | YAML from `~/system/agents/chains/` |
| Persistent company agent | PI factory | `agent-factory.js` |
| One-shot atomic build (< 10 min) | Task tool (Agent) | `subagent_type` param |
| Cron / scheduled | CronCreate skill | cron registry |

**Default when unsure:** One-shot Task tool.

**Verification:** Evidence file: `~/system/evidence/system-evolution-2026-04-16/v7-orch.txt`

### 11. Database Deduplication

**Problem:** Three MC database files found:

- `~/system/databases/mission-control.db` (real, 2.1 MB)
- `~/system/tools/mc.db` (0 bytes)
- `~/system/databases/mc.sqlite` (0 bytes)

Agents using the wrong path → empty queue, silent failures.

**Fix:** Empty duplicates removed. Only `mission-control.db` remains.

**Verification:**

    ls -lh ~/system/databases/mission-control.db
    ls -lh ~/system/tools/mc.db 2>/dev/null || echo "Correctly deleted"

**Evidence:** `~/system/evidence/system-evolution-2026-04-16/v7-dupes-gone.txt`

## Known Issues & Limitations

### 1. LightRAG Probe Timeout Under Load

- **Status:** Non-critical
- **Symptoms:** Health check times out during bulk ingest (87K+ docs)
- **Workaround:** Container remains functional. Probe timeout doesn't affect pipeline.
- **Fix plan:** Increase probe timeout to 30s in production (MC #8048)

### 2. B2 Offsite Backup Daemon Dead

- **Status:** CRITICAL
- **Task:** MC #5 (restart + fix)
- **Impact:** No offsite backups since 2026-04-14

### 3. 43 Dead Daemons

- **Status:** Fleet degraded
- **Task:** MC #8049 (triage + restart priority)
- **List:** `launchctl list | awk '$1 == "-" && $2 != "0"'`

### 4. ZAKON PLAN Compliance: 2/10 Historic Plans

- **Status:** Expected drift
- **Action:** Linter enforces NEW plans. Retro-fix not required.

## Validation Evidence

All evidence stored in: `~/system/evidence/system-evolution-2026-04-16/`

| Check | File | Result |
|---|---|---|
| LightRAG health | `v1-lightrag-health.json` | Functional (probe timeout during load) |
| Auto-writeback | `v2-intel-tail.txt` + `v2-outbox-tail.txt` | 6 new intel entries |
| ZAKON linter | `v3-pass.txt` + `v3-fail.txt` | Detects missing tasks |
| Proveo gate | `v4-accept.txt` + `v4-reject.txt` | Rejects without evidence |
| Regression suite | `v8-regression.txt` | 7/10 PASS, 3 FAIL (expected) |
| Cost tracker | `v6-cost.txt` | Non-zero tokens (1,463 sample) |
| DB dedup | `v7-dupes-gone.txt` | Duplicates removed |
| Orchestration | `v7-orch.txt` | Authority doc created |
| Symlink | `v7-symlink.txt` | Ghost DB now symlinked |

## How to Verify System Health Post-Upgrade

### Quick Check (30 seconds)

    bash ~/system/tools/system-regression.sh
    # Expected: 7+ checks PASS

### Detailed Validation

**1. LightRAG pipeline working:**

    curl -s http://localhost:9621/documents | jq '.statuses | {processed, pending, failed}'
    docker inspect lightrag | jq -r '.State.Health.Status'

**2. HiveMind auto-writeback:**

    # Create test task
    TEST_ID=$(node ~/system/tools/mc.js add "Test writeback" --owner john | grep -o '#[0-9]*' | tr -d '#')

    # Complete it
    node ~/system/tools/mc.js done "$TEST_ID"

    # Check intel table
    sqlite3 ~/system/databases/hivemind.db \
      "SELECT content FROM intel WHERE content LIKE '%Test writeback%' ORDER BY id DESC LIMIT 1;"

**3. ZAKON PLAN linter:**

    # Test with non-compliant plan (should fail)
    # printf is used because plain echo does not expand \n in bash
    printf '# Plan\nSome tasks but no validation\n' > /tmp/bad-plan.md
    bash ~/system/tools/zakon-plan-lint.sh /tmp/bad-plan.md && echo "ERROR: should have failed"

    # Test with system-evolution-plan (should pass)
    bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

**4. Proveo gate:**

    # Try to mark a task done without evidence (should reject)
    # TASK_ID is a placeholder: set it to the ID of a task that has no evidence
    node ~/system/tools/mc.js done "$TASK_ID"
    # Expected: Error message about missing validation

**5. Cost tracker:**

    node ~/system/tools/cost-tracker.js summary today | jq .tokens_total
    # Expected: > 0

## Impact Metrics

| Metric | Before | After | Target |
|---|---|---|---|
| Evolution score | 3/10 | 7/10 | 7/10 ✅ |
| LightRAG ingest rate | 5% | 95%+ | >95% ✅ |
| LightRAG default retrieval | 0% | 100% | 100% ✅ |
| Dead daemons | 12 | 43 | <3 ⚠️ |
| ZAKON PLAN compliance | Partial | 100% (new) | 100% ✅ |
| Self-test coverage | ~15% | 40%+ | 40% ✅ |
| Ghost databases | 3 | 0 | 0 ✅ |
| Auto-writeback | No | Yes | Yes ✅ |

**Overall:** 7/9 targets met. Dead daemons (43) and B2 backup require follow-up (MC #8049, #5).
## Related Documentation

**Runbooks:**

- ZAKON PLAN Linter
- LightRAG Default-On
- MC Done Auto-Writeback
- Blueprint Liveness
- System Regression Suite

**System Rules:**

- `~/system/rules/orchestration-surface.md`: Orchestration routing authority
- `~/system/rules/john-operating-system.md`: Full rule set
- `~/.claude/CLAUDE.md`: John's identity + routing

**Evidence:**

- `~/system/evidence/system-evolution-2026-04-16/`: All validation artifacts

**Original Plan:**

- `~/system/specs/system-evolution-plan.md`: Full 15-task breakdown

## Next Steps

### Immediate (CEO approved)

- Restart B2 backup daemon (MC #5): CRITICAL
- Triage 43 dead daemons (MC #8049): HIGH priority
- Monitor LightRAG ingest rate: daily check for 1 week

### Short-term (2 weeks)

- Retrofit Plock blueprint with stack compliance checklist
- LightRAG probe timeout increase to 30s in `docker-compose.yml`
- Weekly regression suite scheduled via launchd

### Long-term (1 month)

- Extend ZAKON linter to check for Evidence Level (L2+ minimum)
- Blueprint liveness: change from warn to block
- HiveMind outbox idempotency: add unique constraint on `correlation_id`

- **Validated by:** Angie Jones (Proveo), Task #8027
- **Documented by:** Skillforge, Task #8038
- **Approved by:** Petter Graff (Team Lead)
- **Date:** 2026-04-16 23:14 CEST

> "Every task completion now enriches the next one. That is the evolution the CEO asked for."
> (Petter Graff)