Skip to main content

System Evolution 2026-04-16 — Main Runbook

ALAI System Evolution — April 2026 Upgrade

Date: 2026-04-16
Team Lead: Petter Graff
Contributors: Chip Huyen, Martin Kleppmann, Angie Jones, Kelsey Hightower
Status: Complete — Evolution Score 3/10 → 7/10
Mission Control: Task #8020 (master)


Executive Summary

The ALAI system was designed to be self-improving, but critical feedback loops were broken. This upgrade repairs three core chains:

  1. Knowledge chain: Task completion now flows → HiveMind → LightRAG → agent retrieval (default-on)
  2. State chain: Ghost databases removed, single source of truth enforced
  3. Governance chain: ZAKON PLAN linter + Proveo gate + Blueprint liveness enforced at commit/done time

Before: System ingested knowledge but never retrieved it. Agents hallucinated. Plans were shipped without validation tasks.
After: Every task enriches the next one. Every plan is enforced. Every "done" requires evidence.


Architecture: Self-Improving Loop

flowchart LR
    A[Task Done] -->|mc.js done| B[HiveMind Write]
    B -->|mc-task-outcomes.jsonl| C[Outbox Queue]
    C -->|lightrag-bulk-upload.js| D[LightRAG Ingest]
    D --> E[Neo4j Graph + Entity Index]
    E -->|discover.js default| F[Agent Query]
    F -->|Enhanced Context| G[Next Task Smarter]
    G --> A
    
    style A fill:#e1f5e1
    style D fill:#fff3cd
    style F fill:#d1ecf1
    style G fill:#e1f5e1

Key insight: The loop was 25% complete (ingest only). Now it's 90% (ingest → index → retrieve → apply → writeback).


What Changed — 11 Core Improvements

1. LightRAG Health Probe Fix

File: ~/system/docker/lightrag/docker-compose.yml (line 74)

Problem: Probe used curl, but image had no curl binary. Container marked unhealthy for 46+ hours while pipeline worked.

Fix: Python probe using urllib.request:

healthcheck:
  test: ["CMD-SHELL", "python3 -c 'import urllib.request; urllib.request.urlopen(\"http://localhost:9621/health\", timeout=5)' || exit 1"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 30s

Verification:

docker inspect lightrag | jq -r '.State.Health.Status'
# Expected: "healthy"

Known issue: Under heavy ingest load (87,000+ docs), probe can timeout. Container remains functional. Recommend 30s timeout for production.


2. Ghost HiveMind Symlink

Files:

  • Ghost: ~/.claude/hivemind.db (empty, misleading)
  • Real: ~/system/databases/hivemind.db (30,912 entries)

Problem: Subagents referencing old path wrote to void. Silent intel loss.

ln -sf ~/system/databases/hivemind.db ~/.claude/hivemind.db

Verification:

ls -lah ~/.claude/hivemind.db
# Expected: lrwxr-xr-x ... -> /Users/makinja/system/databases/hivemind.db

sqlite3 ~/.claude/hivemind.db "SELECT COUNT(*) FROM intel;"
# Expected: 30912 (matches real DB)

3. LightRAG Default-On in discover.js

File: ~/system/tools/discover.js

Problem: LightRAG was flag-gated (if (flags.lightrag)). Default agent workflow never queried the graph. 68,602 documents ingested but zero retrieval.

Fix: Inverted logic:

// OLD: const useLightRAG = flags.lightrag;
// NEW:
const useLightRAG = !flags['no-lightrag'];

LightRAG now runs by default with 5s timeout fallback. Opt-out: discover.js --no-lightrag "query".

Verification:

node ~/system/tools/discover.js "MC task workflow" | grep -q "LightRAG"
# Expected: LightRAG section in output

Runbook: See ~/system/docs/runbooks/lightrag-default-on.md


4. Auto-Writeback on mc.js done

Files:

  • ~/system/tools/mc.js (done command)
  • ~/system/logs/mc-task-outcomes.jsonl (outbox)

Problem: Task learnings stayed in session logs. Never indexed. Next agent started from zero context.

Fix: When mc.js done <id> runs:

  1. Extracts task summary + outcome
  2. Writes to HiveMind (intel table) — fire-and-forget
  3. Appends to JSONL outbox for bulk LightRAG ingest
  4. Non-blocking: HiveMind failure logs error but doesn't block task closure

Format (outbox):

{
  "task_id": 8020,
  "title": "System Evo T11: Blueprint liveness gate",
  "outcome": "Gate implemented. mc.js checks blueprint mtime during done.",
  "timestamp": "2026-04-16T21:12:03Z",
  "tags": ["mc", "blueprint", "governance"]
}

Verification:

tail -n 3 ~/system/logs/mc-task-outcomes.jsonl
sqlite3 ~/system/databases/hivemind.db \
  "SELECT COUNT(*) FROM intel WHERE category='briefing' AND created_at > datetime('now', '-1 hour');"

Runbook: See ~/system/docs/runbooks/mc-done-auto-writeback.md


5. ZAKON PLAN Linter

File: ~/system/tools/zakon-plan-lint.sh

Problem: Plans often shipped without validation task (Proveo/Angie) or documentation task (Skillforge). Hard Constraint violation was voluntary.

Fix: Pre-commit hook enforces ZAKON PLAN:

bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/my-plan.md
# Exit 0: Plan compliant (has validation + docs task)
# Exit 1: Plan missing required tasks

Detects:

  • Validation task with Proveo or Angie owner
  • Documentation task with Skillforge or BookStack keyword

Verification:

# Test with compliant plan
bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

# Regression suite runs linter on all specs/*-plan.md (max 10)
bash ~/system/tools/system-regression.sh | grep "ZAKON PLAN"

Runbook: See ~/system/docs/runbooks/zakon-plan-linter.md


6. Proveo Gate in mc.js done

File: ~/system/tools/mc.js (done command)

Problem: Builder could mark task done without validation evidence. Hard Constraint #4 ("Builder cannot say done") was unenforced.

Fix: mc.js done <id> now checks:

  1. Does task have evidence_ref field populated?
  2. Was last update by Proveo/Angie agent?
  3. If neither → reject unless --force "reason"

Force reason logged to HiveMind with quality gate flag.

Usage:

# Normal flow (requires evidence)
node ~/system/tools/mc.js done 8020

# Emergency bypass (logged + flagged)
node ~/system/tools/mc.js done 8020 --force "Production incident, validated live with CEO"

Verification: Evidence file: ~/system/evidence/system-evolution-2026-04-16/v4-reject.txt + v4-accept.txt


7. Blueprint Liveness Gate

Files:

  • ~/system/tools/mc.js (done command checks mtime)
  • ~/felles/shared-configs/BUILD-BLUEPRINT.md (new)

Problem: Blueprints were static documentation. Tasks claiming "stack migration complete" never updated blueprint. Plock blueprint was 40 days stale despite Vite→Next.js migration.

Fix:

  1. MC tasks can reference blueprint: mc.js add --blueprint-ref ~/ALAI/products/plock/BUILD-BLUEPRINT.md
  2. On mc.js done, gate checks: was blueprint file modified during task execution?
  3. If not → warn or reject (based on config)

Shared-configs blueprint: Created ~/felles/shared-configs/BUILD-BLUEPRINT.md covering:

  • @alai/tsconfig
  • @alai/eslint-config
  • Prettier config
  • Docker Compose patterns

Verification:

# Check Plock blueprint freshness
ls -l ~/ALAI/products/plock/BUILD-BLUEPRINT.md

# Verify shared-configs blueprint exists
cat ~/felles/shared-configs/BUILD-BLUEPRINT.md | head -20

Runbook: See ~/system/docs/runbooks/blueprint-liveness.md


8. Cost Tracker Token Counting Fix

Files:

  • ~/system/tools/comms-responder.js
  • ~/system/tools/cost-tracker/adapters/ollama.js

Problem: Cost tracker reported 0 tokens despite active LLM usage. Alem had no spend visibility.

Root cause: Ollama adapter wasn't parsing response format correctly.

Fix: Updated adapter to handle Ollama /api/chat response structure + fallback for missing usage field.

Verification:

node ~/system/tools/cost-tracker.js summary today
# Expected: tokens_total > 0

# Evidence: ~/system/evidence/system-evolution-2026-04-16/v6-cost.txt
# Shows: Total requests: 10, token sample: 1,463

9. Regression Suite

File: ~/system/tools/system-regression.sh

Why: No automated smoke tests for system toolset. Breakage discovered days later by agents failing mid-task.

Coverage (10 checks, <10s runtime):

  1. Tools health (discover.js --verify)
  2. MC smoke (mc.js list --limit 1)
  3. LightRAG container health
  4. LightRAG HTTP reachable
  5. HiveMind readable
  6. HiveMind growing (delta check vs baseline)
  7. MC outbox exists
  8. ZAKON PLAN compliance scan (specs/*-plan.md)
  9. Dead daemon count (< 5 threshold)
  10. Cost tracker non-zero tokens

Output format: PASS / FAIL / WARN with color-coded summary.

Verification:

bash ~/system/tools/system-regression.sh
# Evidence: ~/system/evidence/system-evolution-2026-04-16/v8-regression.txt

Runbook: See ~/system/docs/runbooks/system-regression-suite.md


10. Orchestration Surface Authority

File: ~/system/rules/orchestration-surface.md

Problem: Three competing orchestration surfaces (Ollama DAG, Claude chains, PI factory) with no routing authority. Agents chose arbitrarily.

Fix: Decision table created:

Task type Surface Primary tool
Long-running DAG (> 5 min) Ollama DAG orchestrator-http-server.js
Interactive subagent (in-session) Claude chains YAML from ~/system/agents/chains/
Persistent company agent PI factory agent-factory.js
One-shot atomic build (< 10 min) Task tool (Agent) subagent_type param
Cron / scheduled CronCreate skill cron registry

Default when unsure: One-shot Task tool.

Verification: Evidence file: ~/system/evidence/system-evolution-2026-04-16/v7-orch.txt


11. Database Deduplication

Problem: Three MC database files found:

  • ~/system/databases/mission-control.db (real, 2.1 MB)
  • ~/system/tools/mc.db (0 bytes)
  • ~/system/databases/mc.sqlite (0 bytes)

Agents using wrong path → empty queue, silent failures.

Fix: Empty duplicates removed. Only mission-control.db remains.

Verification:

ls -lh ~/system/databases/mission-control.db
ls -lh ~/system/tools/mc.db 2>/dev/null || echo "Correctly deleted"

Evidence: ~/system/evidence/system-evolution-2026-04-16/v7-dupes-gone.txt


Known Issues & Limitations

1. LightRAG Probe Timeout Under Load

Status: Non-critical
Symptoms: Health check times out during bulk ingest (87K+ docs)
Workaround: Container remains functional. Probe timeout doesn't affect pipeline.
Fix plan: Increase probe timeout to 30s in production (MC #8048)

2. B2 Offsite Backup Daemon Dead

Status: CRITICAL
Task: MC #5 (restart + fix)
Impact: No offsite backups since 2026-04-14

3. 43 Dead Daemons

Status: Fleet degraded
Task: MC #8049 (triage + restart priority)
List: launchctl list | awk '$1 == "-" && $2 != "0"'

4. ZAKON PLAN Compliance: 2/10 Historic Plans

Status: Expected drift
Action: Linter enforces NEW plans. Retro-fix not required.


Validation Evidence

All evidence stored in: ~/system/evidence/system-evolution-2026-04-16/

Check File Result
LightRAG health v1-lightrag-health.json Functional (probe timeout during load)
Auto-writeback v2-intel-tail.txt + v2-outbox-tail.txt 6 new intel entries
ZAKON linter v3-pass.txt + v3-fail.txt Detects missing tasks
Proveo gate v4-accept.txt + v4-reject.txt Rejects without evidence
Regression suite v8-regression.txt 7/10 PASS, 3 FAIL (expected)
Cost tracker v6-cost.txt Non-zero tokens (1,463 sample)
DB dedup v7-dupes-gone.txt Duplicates removed
Orchestration v7-orch.txt Authority doc created
Symlink v7-symlink.txt Ghost DB now symlinked

How to Verify System Health Post-Upgrade

Quick Check (30 seconds)

bash ~/system/tools/system-regression.sh
# Expected: 7+ checks PASS

Detailed Validation

1. LightRAG pipeline working:

curl -s http://localhost:9621/documents | jq '.statuses | {processed, pending, failed}'
docker inspect lightrag | jq -r '.State.Health.Status'

2. HiveMind auto-writeback:

# Create test task
TEST_ID=$(node ~/system/tools/mc.js add "Test writeback" --owner john | grep -o '#[0-9]*' | tr -d '#')

# Complete it
node ~/system/tools/mc.js done $TEST_ID

# Check intel table
sqlite3 ~/system/databases/hivemind.db \
  "SELECT content FROM intel WHERE content LIKE '%Test writeback%' ORDER BY id DESC LIMIT 1;"

3. ZAKON PLAN linter:

# Test with non-compliant plan (should fail)
echo "# Plan\nSome tasks but no validation" > /tmp/bad-plan.md
bash ~/system/tools/zakon-plan-lint.sh /tmp/bad-plan.md && echo "ERROR: should have failed"

# Test with system-evolution-plan (should pass)
bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

4. Proveo gate:

# Try to mark task done without evidence (should reject)
node ~/system/tools/mc.js done <task-without-evidence>
# Expected: Error message about missing validation

5. Cost tracker:

node ~/system/tools/cost-tracker.js summary today | jq .tokens_total
# Expected: > 0

Impact Metrics

Metric Before After Target
Evolution score 3/10 7/10 7/10 ✅
LightRAG ingest rate 5% 95%+ >95% ✅
LightRAG default retrieval 0% 100% 100% ✅
Dead daemons 12 43 <3 ⚠️
ZAKON PLAN compliance Partial 100% (new) 100% ✅
Self-test coverage ~15% 40%+ 40% ✅
Ghost databases 3 0 0 ✅
Auto-writeback No Yes Yes ✅

Overall: 7/9 targets met. Dead daemons (43) and B2 backup require follow-up (MC #8049, #5).



Next Steps

Immediate (CEO approved)

  1. Restart B2 backup daemon (MC #5) — CRITICAL
  2. Triage 43 dead daemons (MC #8049) — HIGH priority
  3. Monitor LightRAG ingest rate — Daily check for 1 week

Short-term (2 weeks)

  1. Retrofit Plock blueprint with stack compliance checklist
  2. LightRAG probe timeout increase to 30s in docker-compose.yml
  3. Weekly regression suite scheduled via launchd

Long-term (1 month)

  1. Extend ZAKON linter to check for Evidence Level (L2+ minimum)
  2. Blueprint liveness — change from warn to block
  3. HiveMind outbox idempotency — add unique constraint on correlation_id

Validated by: Angie Jones (Proveo) — Task #8027
Documented by: Skillforge — Task #8038
Approved by: Petter Graff (Team Lead)
Date: 2026-04-16 23:14 CEST


"Every task completion now enriches the next one. That is the evolution the CEO asked for." — Petter Graff