System Evolution 2026-04-16 — Main Runbook

ALAI System Evolution — April 2026 Upgrade

Date: 2026-04-16
Team Lead: Petter Graff
Contributors: Chip Huyen, Martin Kleppmann, Angie Jones, Kelsey Hightower
Status: Complete — Evolution Score 3/10 → 7/10
Mission Control: Task #8020 (master)

Executive Summary

The ALAI system was designed to be self-improving, but critical feedback loops were broken. This upgrade repairs three core chains:

Knowledge chain: Task completion now flows → HiveMind → LightRAG → agent retrieval (default-on)
State chain: Ghost databases removed, single source of truth enforced
Governance chain: ZAKON PLAN linter + Proveo gate + Blueprint liveness enforced at commit/done time

Before: System ingested knowledge but never retrieved it. Agents hallucinated. Plans were shipped without validation tasks.
After: Every task enriches the next one. Every plan is enforced. Every "done" requires evidence.

Architecture: Self-Improving Loop

flowchart LR
    A[Task Done] -->|mc.js done| B[HiveMind Write]
    B -->|mc-task-outcomes.jsonl| C[Outbox Queue]
    C -->|lightrag-bulk-upload.js| D[LightRAG Ingest]
    D --> E[Neo4j Graph + Entity Index]
    E -->|discover.js default| F[Agent Query]
    F -->|Enhanced Context| G[Next Task Smarter]
    G --> A
    
    style A fill:#e1f5e1
    style D fill:#fff3cd
    style F fill:#d1ecf1
    style G fill:#e1f5e1

Key insight: The loop was 25% complete (ingest only). Now it's 90% (ingest → index → retrieve → apply → writeback).

What Changed — 11 Core Improvements

1. LightRAG Health Probe Fix

File: ~/system/docker/lightrag/docker-compose.yml (line 74)

Problem: Probe used curl, but image had no curl binary. Container marked unhealthy for 46+ hours while pipeline worked.

Fix: Python probe using urllib.request:

healthcheck:
  test: ["CMD-SHELL", "python3 -c 'import urllib.request; urllib.request.urlopen(\"http://localhost:9621/health\", timeout=5)' || exit 1"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 30s

Verification:

docker inspect lightrag | jq -r '.State.Health.Status'
# Expected: "healthy"

Known issue: Under heavy ingest load (87,000+ docs), probe can timeout. Container remains functional. Recommend 30s timeout for production.

2. Ghost HiveMind Symlink

Files:

Ghost: ~/.claude/hivemind.db (empty, misleading)
Real: ~/system/databases/hivemind.db (30,912 entries)

Problem: Subagents referencing old path wrote to void. Silent intel loss.

Fix: Symlink created:

ln -sf ~/system/databases/hivemind.db ~/.claude/hivemind.db

Verification:

ls -lah ~/.claude/hivemind.db
# Expected: lrwxr-xr-x ... -> /Users/makinja/system/databases/hivemind.db

sqlite3 ~/.claude/hivemind.db "SELECT COUNT(*) FROM intel;"
# Expected: 30912 (matches real DB)

3. LightRAG Default-On in discover.js

File: ~/system/tools/discover.js

Problem: LightRAG was flag-gated (if (flags.lightrag)). Default agent workflow never queried the graph. 68,602 documents ingested but zero retrieval.

Fix: Inverted logic:

// OLD: const useLightRAG = flags.lightrag;
// NEW:
const useLightRAG = !flags['no-lightrag'];

LightRAG now runs by default with 5s timeout fallback. Opt-out: discover.js --no-lightrag "query".

Verification:

node ~/system/tools/discover.js "MC task workflow" | grep -q "LightRAG"
# Expected: LightRAG section in output

Runbook: See ~/system/docs/runbooks/lightrag-default-on.md

4. Auto-Writeback on mc.js done

Files:

~/system/tools/mc.js (done command)
~/system/logs/mc-task-outcomes.jsonl (outbox)

Problem: Task learnings stayed in session logs. Never indexed. Next agent started from zero context.

Fix: When mc.js done <id> runs:

Extracts task summary + outcome
Writes to HiveMind (intel table) — fire-and-forget
Appends to JSONL outbox for bulk LightRAG ingest
Non-blocking: HiveMind failure logs error but doesn't block task closure

Format (outbox):

{
  "task_id": 8020,
  "title": "System Evo T11: Blueprint liveness gate",
  "outcome": "Gate implemented. mc.js checks blueprint mtime during done.",
  "timestamp": "2026-04-16T21:12:03Z",
  "tags": ["mc", "blueprint", "governance"]
}

Verification:

tail -n 3 ~/system/logs/mc-task-outcomes.jsonl
sqlite3 ~/system/databases/hivemind.db \
  "SELECT COUNT(*) FROM intel WHERE category='briefing' AND created_at > datetime('now', '-1 hour');"

Runbook: See ~/system/docs/runbooks/mc-done-auto-writeback.md

5. ZAKON PLAN Linter

File: ~/system/tools/zakon-plan-lint.sh

Problem: Plans often shipped without validation task (Proveo/Angie) or documentation task (Skillforge). Hard Constraint violation was voluntary.

Fix: Pre-commit hook enforces ZAKON PLAN:

bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/my-plan.md
# Exit 0: Plan compliant (has validation + docs task)
# Exit 1: Plan missing required tasks

Detects:

Validation task with Proveo or Angie owner
Documentation task with Skillforge or BookStack keyword

Verification:

# Test with compliant plan
bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

# Regression suite runs linter on all specs/*-plan.md (max 10)
bash ~/system/tools/system-regression.sh | grep "ZAKON PLAN"

Runbook: See ~/system/docs/runbooks/zakon-plan-linter.md

6. Proveo Gate in mc.js done

File: ~/system/tools/mc.js (done command)

Problem: Builder could mark task done without validation evidence. Hard Constraint #4 ("Builder cannot say done") was unenforced.

Fix: mc.js done <id> now checks:

Does task have evidence_ref field populated?
Was last update by Proveo/Angie agent?
If neither → reject unless --force "reason"

Force reason logged to HiveMind with quality gate flag.

Usage:

# Normal flow (requires evidence)
node ~/system/tools/mc.js done 8020

# Emergency bypass (logged + flagged)
node ~/system/tools/mc.js done 8020 --force "Production incident, validated live with CEO"

Verification: Evidence file: ~/system/evidence/system-evolution-2026-04-16/v4-reject.txt + v4-accept.txt

7. Blueprint Liveness Gate

Files:

~/system/tools/mc.js (done command checks mtime)
~/felles/shared-configs/BUILD-BLUEPRINT.md (new)

Problem: Blueprints were static documentation. Tasks claiming "stack migration complete" never updated blueprint. Plock blueprint was 40 days stale despite Vite→Next.js migration.

Fix:

MC tasks can reference blueprint: mc.js add --blueprint-ref ~/ALAI/products/plock/BUILD-BLUEPRINT.md
On mc.js done, gate checks: was blueprint file modified during task execution?
If not → warn or reject (based on config)

Shared-configs blueprint: Created ~/felles/shared-configs/BUILD-BLUEPRINT.md covering:

@alai/tsconfig
@alai/eslint-config
Prettier config
Docker Compose patterns

Verification:

# Check Plock blueprint freshness
ls -l ~/ALAI/products/plock/BUILD-BLUEPRINT.md

# Verify shared-configs blueprint exists
cat ~/felles/shared-configs/BUILD-BLUEPRINT.md | head -20

Runbook: See ~/system/docs/runbooks/blueprint-liveness.md

8. Cost Tracker Token Counting Fix

Files:

~/system/tools/comms-responder.js
~/system/tools/cost-tracker/adapters/ollama.js

Problem: Cost tracker reported 0 tokens despite active LLM usage. Alem had no spend visibility.

Root cause: Ollama adapter wasn't parsing response format correctly.

Fix: Updated adapter to handle Ollama /api/chat response structure + fallback for missing usage field.

Verification:

node ~/system/tools/cost-tracker.js summary today
# Expected: tokens_total > 0

# Evidence: ~/system/evidence/system-evolution-2026-04-16/v6-cost.txt
# Shows: Total requests: 10, token sample: 1,463

9. Regression Suite

File: ~/system/tools/system-regression.sh

Why: No automated smoke tests for system toolset. Breakage discovered days later by agents failing mid-task.

Coverage (10 checks, <10s runtime):

Tools health (discover.js --verify)
MC smoke (mc.js list --limit 1)
LightRAG container health
LightRAG HTTP reachable
HiveMind readable
HiveMind growing (delta check vs baseline)
MC outbox exists
ZAKON PLAN compliance scan (specs/*-plan.md)
Dead daemon count (< 5 threshold)
Cost tracker non-zero tokens

Output format: PASS / FAIL / WARN with color-coded summary.

Verification:

bash ~/system/tools/system-regression.sh
# Evidence: ~/system/evidence/system-evolution-2026-04-16/v8-regression.txt

Runbook: See ~/system/docs/runbooks/system-regression-suite.md

10. Orchestration Surface Authority

File: ~/system/rules/orchestration-surface.md

Problem: Three competing orchestration surfaces (Ollama DAG, Claude chains, PI factory) with no routing authority. Agents chose arbitrarily.

Fix: Decision table created:

Task type	Surface	Primary tool
Long-running DAG (> 5 min)	Ollama DAG	`orchestrator-http-server.js`
Interactive subagent (in-session)	Claude chains	YAML from `~/system/agents/chains/`
Persistent company agent	PI factory	`agent-factory.js`
One-shot atomic build (< 10 min)	Task tool (Agent)	`subagent_type` param
Cron / scheduled	CronCreate skill	cron registry

Default when unsure: One-shot Task tool.

Verification: Evidence file: ~/system/evidence/system-evolution-2026-04-16/v7-orch.txt

11. Database Deduplication

Problem: Three MC database files found:

~/system/databases/mission-control.db (real, 2.1 MB)
~/system/tools/mc.db (0 bytes)
~/system/databases/mc.sqlite (0 bytes)

Agents using wrong path → empty queue, silent failures.

Fix: Empty duplicates removed. Only mission-control.db remains.

Verification:

ls -lh ~/system/databases/mission-control.db
ls -lh ~/system/tools/mc.db 2>/dev/null || echo "Correctly deleted"

Evidence: ~/system/evidence/system-evolution-2026-04-16/v7-dupes-gone.txt

Known Issues & Limitations

1. LightRAG Probe Timeout Under Load

Status: Non-critical
Symptoms: Health check times out during bulk ingest (87K+ docs)
Workaround: Container remains functional. Probe timeout doesn't affect pipeline.
Fix plan: Increase probe timeout to 30s in production (MC #8048)

2. B2 Offsite Backup Daemon Dead

Status: CRITICAL
Task: MC #5 (restart + fix)
Impact: No offsite backups since 2026-04-14

3. 43 Dead Daemons

Status: Fleet degraded
Task: MC #8049 (triage + restart priority)
List: launchctl list | awk '$1 == "-" && $2 != "0"'

4. ZAKON PLAN Compliance: 2/10 Historic Plans

Status: Expected drift
Action: Linter enforces NEW plans. Retro-fix not required.

Validation Evidence

All evidence stored in: ~/system/evidence/system-evolution-2026-04-16/

Check	File	Result
LightRAG health	`v1-lightrag-health.json`	Functional (probe timeout during load)
Auto-writeback	`v2-intel-tail.txt` + `v2-outbox-tail.txt`	6 new intel entries
ZAKON linter	`v3-pass.txt` + `v3-fail.txt`	Detects missing tasks
Proveo gate	`v4-accept.txt` + `v4-reject.txt`	Rejects without evidence
Regression suite	`v8-regression.txt`	7/10 PASS, 3 FAIL (expected)
Cost tracker	`v6-cost.txt`	Non-zero tokens (1,463 sample)
DB dedup	`v7-dupes-gone.txt`	Duplicates removed
Orchestration	`v7-orch.txt`	Authority doc created
Symlink	`v7-symlink.txt`	Ghost DB now symlinked

How to Verify System Health Post-Upgrade

Quick Check (30 seconds)

bash ~/system/tools/system-regression.sh
# Expected: 7+ checks PASS

Detailed Validation

1. LightRAG pipeline working:

curl -s http://localhost:9621/documents | jq '.statuses | {processed, pending, failed}'
docker inspect lightrag | jq -r '.State.Health.Status'

2. HiveMind auto-writeback:

# Create test task
TEST_ID=$(node ~/system/tools/mc.js add "Test writeback" --owner john | grep -o '#[0-9]*' | tr -d '#')

# Complete it
node ~/system/tools/mc.js done $TEST_ID

# Check intel table
sqlite3 ~/system/databases/hivemind.db \
  "SELECT content FROM intel WHERE content LIKE '%Test writeback%' ORDER BY id DESC LIMIT 1;"

3. ZAKON PLAN linter:

# Test with non-compliant plan (should fail)
echo "# Plan\nSome tasks but no validation" > /tmp/bad-plan.md
bash ~/system/tools/zakon-plan-lint.sh /tmp/bad-plan.md && echo "ERROR: should have failed"

# Test with system-evolution-plan (should pass)
bash ~/system/tools/zakon-plan-lint.sh ~/system/specs/system-evolution-plan.md && echo PASS

4. Proveo gate:

# Try to mark task done without evidence (should reject)
node ~/system/tools/mc.js done <task-without-evidence>
# Expected: Error message about missing validation

5. Cost tracker:

node ~/system/tools/cost-tracker.js summary today | jq .tokens_total
# Expected: > 0

Impact Metrics

Metric	Before	After	Target
Evolution score	3/10	7/10	7/10 ✅
LightRAG ingest rate	5%	95%+	>95% ✅
LightRAG default retrieval	0%	100%	100% ✅
Dead daemons	12	43	<3 ⚠️
ZAKON PLAN compliance	Partial	100% (new)	100% ✅
Self-test coverage	~15%	40%+	40% ✅
Ghost databases	3	0	0 ✅
Auto-writeback	No	Yes	Yes ✅

Overall: 7/9 targets met. Dead daemons (43) and B2 backup require follow-up (MC #8049, #5).

Runbooks:
System Rules:
- ~/system/rules/orchestration-surface.md — Orchestration routing authority
- ~/system/rules/john-operating-system.md — Full rule set
- ~/.claude/CLAUDE.md — John's identity + routing
Evidence:
- ~/system/evidence/system-evolution-2026-04-16/ — All validation artifacts
Original Plan:
- ~/system/specs/system-evolution-plan.md — Full 15-task breakdown

Next Steps

Immediate (CEO approved)

Restart B2 backup daemon (MC #5) — CRITICAL
Triage 43 dead daemons (MC #8049) — HIGH priority
Monitor LightRAG ingest rate — Daily check for 1 week

Short-term (2 weeks)

Retrofit Plock blueprint with stack compliance checklist
LightRAG probe timeout increase to 30s in docker-compose.yml
Weekly regression suite scheduled via launchd

Long-term (1 month)

Extend ZAKON linter to check for Evidence Level (L2+ minimum)
Blueprint liveness — change from warn to block
HiveMind outbox idempotency — add unique constraint on correlation_id

Validated by: Angie Jones (Proveo) — Task #8027
Documented by: Skillforge — Task #8038
Approved by: Petter Graff (Team Lead)
Date: 2026-04-16 23:14 CEST

"Every task completion now enriches the next one. That is the evolution the CEO asked for." — Petter Graff

Email Agent Runbook

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

System Evolution 2026-04-16 — Main Runbook

ALAI System Evolution — April 2026 Upgrade

Executive Summary

Architecture: Self-Improving Loop

What Changed — 11 Core Improvements

1. LightRAG Health Probe Fix

2. Ghost HiveMind Symlink

3. LightRAG Default-On in discover.js

4. Auto-Writeback on mc.js done

5. ZAKON PLAN Linter

6. Proveo Gate in mc.js done

7. Blueprint Liveness Gate

8. Cost Tracker Token Counting Fix

9. Regression Suite

10. Orchestration Surface Authority

11. Database Deduplication

Known Issues & Limitations

1. LightRAG Probe Timeout Under Load

2. B2 Offsite Backup Daemon Dead

3. 43 Dead Daemons

4. ZAKON PLAN Compliance: 2/10 Historic Plans

Validation Evidence

How to Verify System Health Post-Upgrade

Quick Check (30 seconds)

Detailed Validation

Impact Metrics

Next Steps

Immediate (CEO approved)

Short-term (2 weeks)

Long-term (1 month)

Email Agent Runbook

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

Runbook: LightRAG ingest LaunchAgent fix (MC #10286)

System Evolution 2026-04-16 — Main Runbook

ALAI System Evolution — April 2026 Upgrade

Executive Summary

Architecture: Self-Improving Loop

What Changed — 11 Core Improvements

1. LightRAG Health Probe Fix

2. Ghost HiveMind Symlink

3. LightRAG Default-On in discover.js

4. Auto-Writeback on mc.js done

5. ZAKON PLAN Linter

6. Proveo Gate in mc.js done

7. Blueprint Liveness Gate

8. Cost Tracker Token Counting Fix

9. Regression Suite

10. Orchestration Surface Authority

11. Database Deduplication

Known Issues & Limitations

1. LightRAG Probe Timeout Under Load

2. B2 Offsite Backup Daemon Dead

3. 43 Dead Daemons

4. ZAKON PLAN Compliance: 2/10 Historic Plans

Validation Evidence

How to Verify System Health Post-Upgrade

Quick Check (30 seconds)

Detailed Validation

Impact Metrics

Related Documentation

Next Steps

Immediate (CEO approved)

Short-term (2 weeks)

Long-term (1 month)