# T2 — Kleppmann distributed systems audit

# Distributed Systems &amp; Data Layer Audit (MC #10357 T2 — Martin Kleppmann)

**Auditor:** Martin Kleppmann (DDIA) **Date:** 2026-04-30 **Scope:** ALAI orchestration data plane — kernel, tool scripts, SQLite stores **Method:** Read-only file inspection + sqlite3 queries against live databases \---

## 1. Event Flow &amp; Consistency

### Traced path: task intake → execution → completion

**Canonical flow as implemented:**1\. **Intake** — `mc.js add` inserts into `tasks` (mission-control.db). Status: `open`. No corresponding `task\_scheduling` row yet. 2. **Poll** — `pi-orchestrator.js` (daemon, 30s default poll) executes a `getNextTask()` query: LEFT JOIN `tasks` with `task\_scheduling`, filtering on `status IN ('open','in\_progress')`, `dead\_letter=0`, `next\_eligible &lt;= now`, `lease\_until &lt; now`. Priority ordering: H→M→L, then `updated\_at ASC`. 3. **Claim** — `pi-orchestrator` sets `tasks.status = 'in\_progress'`, upserts `task\_scheduling.lease\_until = datetime('now', '+N minutes')`, inserts `active\_work(agent, task\_id, session\_id)`. These are three separate non-atomic DML statements (better-sqlite3 synchronous, but no BEGIN TRANSACTION wrapping the trio in code reviewed). 4. **Execute** — model selection (tier 1–5), agent spawn (Ollama or `claude` CLI subprocess). Execution happens outside the database entirely. 5. **Complete** — `mc.js done <id> "outcome"` sets `tasks.status = 'done'`, clears `task\_scheduling.lease\_until`, removes `active\_work` row. `task\_history` append. 6. **Failure path** — after `MAX\_TASK\_RETRIES = 3`, `task\_scheduling.dead\_letter = 1`, `tasks.status = 'blocked'`, entry in `dead\_letter\_queue`.</id>

**Events durability:** `events.db` has a proper schema with `idempotency\_key UNIQUE`, `correlation\_id`, `causation\_id`, `publisher`, `retry\_count`, `dead\_letter` table — this is a full transactional outbox pattern. **However, it is underused.** `pi-orchestrator.js` does not write to `events.db` for task transitions; it uses `postHiveMind()` (fire-and-forget to HiveMind daemon) instead. `chain-runner.js` uses `bus.emit()` (in-process event bus, not durable). The durable `events.db` store exists but the primary orchestration path bypasses it. **Source of truth fragmentation (confirmed by query):**```text
tasks:           10,220 rows  (primary state machine)
task_scheduling:  4,211 rows  (covers 41% of tasks — not all tasks have scheduling entries)
active_work:          2 rows  (in-memory live view — only current active agents)
task_history:    36,865 rows  (audit log — orphan FK violations detected, see §2)

```

State is split across four tables plus `active\_work`. A task can be `status=in\_progress` without a `task\_scheduling` row (confirmed: 1 task in this state). The `active\_work` table has a stale row (`skillforge|10342|ready\_for\_review`) — the task is no longer `in\_progress` but the `active\_work` entry was not cleared. **This is a consistency bug.**

**Replayability:** The `events.db` schema supports replay (`status`, `retry\_count`, `idempotency\_key`), but since orchestration transitions are not written there, an event replay would be incomplete. `task\_history` provides an immutable audit trail but lacks payload to reconstruct state from scratch. \---

## 2. Data Layer Health

### Schema summary (mission-control.db)

**tasks table** — 38 columns including `version INTEGER DEFAULT 0` (optimistic locking field exists but usage requires code audit to confirm enforcement), `quality\_signals TEXT DEFAULT '{}'`, `dod\_evidence`, `builder\_model`, `validator\_model`, `validation\_timestamp`. Schema has grown organically via ALTER TABLE (evidenced by column ordering and `review\_requeue\_count` trailing addition). ```sql
-- Indexes:
CREATE INDEX idx_tasks_status_owner ON tasks(status, owner);
CREATE INDEX idx_tasks_status_priority ON tasks(status, priority, created_at);

```

Only two indexes on 10,220 rows. The primary poll query uses `status`, `priority`, `updated\_at` — the existing index on `(status, priority, created\_at)` covers this well. No index on `delegated\_to` despite the primary `getNextTask()` path filtering on `delegated\_to = 'pi-orchestrator'` first. This is a **missing index**. **task\_scheduling table:**```sql
CREATE TABLE task_scheduling (
  task_id INTEGER PRIMARY KEY REFERENCES "tasks"(id),
  cb_state TEXT DEFAULT 'closed' CHECK(cb_state IN ('closed', 'open', 'half_open')),
  attempt_count INTEGER DEFAULT 0,
  last_attempt TEXT, next_eligible TEXT, lease_until TEXT,
  last_outcome TEXT, last_error TEXT, dead_letter INTEGER DEFAULT 0
);

```

No index on `next\_eligible` or `lease\_until`, both used in hot query path. No index on `dead\_letter`. **active\_work table:**```sql
CREATE TABLE active_work (
  agent TEXT PRIMARY KEY,
  task_id INTEGER DEFAULT NULL,
  FOREIGN KEY (task_id) REFERENCES "tasks"(id)
);

```

`agent TEXT PRIMARY KEY` — means only one task per agent name. This is correct for the current design but implies no parallelism per named agent. **Row counts (actuals):**| Table | Count | |---|---| | tasks | 10,220 | | tasks (open) | 890 | | tasks (in\_progress) | 2 | | tasks (done) | 6,605 | | tasks (paused) | 2,202 | | tasks (blocked) | 492 | | tasks (ready\_for\_review) | 29 | | task\_scheduling | 4,211 | | active\_work | 2 | | task\_history | ~36,865 | | outbox | 3,262 total, 0 unprocessed | | dead\_letter\_queue | 22 | | task\_scheduling.dead\_letter=1 | 52 | | events.db total | 3,710 | | events.db dead | 19 |

**Foreign key violations (PRAGMA foreign\_key\_check):**```text
task_history → tasks: 19 orphan rows (task_ids: 962, 982, 983, 984, 1770-1772, 7436-7438, 7446, 14430-14431, 15883, 33500, 36110-36111, 36127, 36133)
task_scheduling → tasks: 1 orphan row (task_id: 9351)
task_metrics → tasks: 7 orphan rows (10053-10054, 10056, 10058-10059, 10061, 10309)

```

Total: **27 dangling foreign key references**. SQLite enforces FK constraints only when `PRAGMA foreign\_keys = ON` is set per connection. `pi-orchestrator.js` uses `busy\_timeout = 3000` but does not set `PRAGMA foreign\_keys = ON` — FK integrity is not enforced at runtime. **WAL state:**```text
mission-control.db      25.6 MB  (main)
mission-control.db-wal  23.3 MB  (WAL = 91% of main — WAL checkpoint is NOT running)
hivemind.db             60.1 MB  (main)
hivemind.db-wal        147.0 MB  (WAL = 245% of main — severely bloated)
flywheel.db            224.2 MB  (main)
flywheel.db-wal        263.6 MB  (WAL = 118% of main)
events.db               14.6 MB  (main)
events.db-wal            4.5 MB  (WAL = 31% — acceptable)

```

`PRAGMA wal\_checkpoint` returned `0|1|0` for mission-control.db, confirming the checkpoint ran 0 pages — the WAL has not been checkpointed recently. hivemind.db WAL at 147MB is a **P1 operational risk**: if the process crashes, SQLite must replay the entire WAL on next open, causing multi-second startup delays. Large WALs also inflate Litestream replication bandwidth.

**Last vacuum:** No `VACUUM` or `ANALYZE` evidence found. The db has `freelist\_count = 0` (no free pages), which is normal but may indicate vacuum has never run or ran recently on a growing dataset. **Growth rate:**```text
Week 2026-W09: 798 tasks/week
Week 2026-W13: 1354 tasks/week (peak)
Week 2026-W16: 1145 tasks/week
Week 2026-W17: 790 tasks/week

```

At ~900 tasks/week, mission-control.db will reach 50,000 tasks in ~4 months. At current page size (4096 bytes), the database will reach ~500MB without periodic archival of done/blocked tasks. \---

## 3. Concurrency &amp; Ordering

### Multiple writers

The following processes can write to mission-control.db concurrently:

- `pi-orchestrator.js` daemon (primary writer — poll loop + task state transitions)
- `mc.js` CLI (human-invoked — add, done, block, assign, start)
- `sprint-pipeline.js` (calls `mc.js start` via `execSync` — indirect writer)
- `pipeline-controller.js` (reads MC DB directly readonly + indirect via mc.js)
- `cross-company-bus.js` (creates tasks via `mc.js add`)
- `chain-runner.js` (updates tasks via MC script calls)
- Multiple Claude agent subprocesses calling `mc.js` (spawned by pi-orchestrator)

**Concurrency mechanism:** better-sqlite3 is **synchronous** — all reads/writes block the Node.js event loop. With WAL mode enabled (`PRAGMA journal\_mode = WAL`), readers do not block writers and writers do not block readers. Multiple writers still serialise through SQLite's single-writer lock. `pi-orchestrator.js` sets `busy\_timeout = 3000` ms — meaning it will wait up to 3s for the lock, then throw. This is acceptable for a single-machine deployment. **Lease mechanism analysis:**The lease is implemented in `task\_scheduling.lease\_until`. The `getNextTask()` query filters `lease\_until &lt; datetime('now')` before claiming. Claim path:

```js
// From getNextTask() — three separate statements, not in a single transaction:
1. UPDATE task_scheduling SET lease_until = datetime('now', '+N min')  
2. UPDATE tasks SET status = 'in_progress'
3. INSERT OR REPLACE INTO active_work(agent, task_id, ...)

```

**Critical bug:** These three writes are not wrapped in a single SQLite transaction. If the process crashes between statements 1 and 2, `task\_scheduling.lease\_until` is set but `tasks.status` remains `open`, and `active\_work` has no entry. The task appears "leased" to a dead worker but also appears available to the next poller once the lease expires — which is the correct recovery. **However**, between crash and lease expiry, `getNextTask()` will skip this task unnecessarily, creating transient unavailability without notice. **Confirmed stuck leases (from query):** 3,254 of 3,255 `task\_scheduling` rows with a `lease\_until` value have an *expired* lease (`lease\_until &lt; datetime('now')`). This means nearly all leases are in the past — which is normal if tasks completed (the lease was not cleared to NULL on completion, just left expired). But it also means the lease mechanism provides **zero protection** against concurrent double-claiming, because any new poller will see `lease\_until &lt; now` as valid to claim again. This is by design (lease expiry = re-eligibility), but implies: **if two pi-orchestrator instances ran simultaneously, both could claim the same task** because the claim is not atomic (no SELECT...FOR UPDATE, SQLite has no such construct, and no BEGIN IMMEDIATE around the claim trio). **Reconcile mechanism:** `reconcileWorkerCounters()` (pi-orchestrator.js line ~3248) runs every 30s to detect drift between `activeTasks` Map and `activeWorkers` counters. This addresses the in-process ghost slot problem but not the cross-process double-claim problem. **On daemon crash mid-task:** The task remains `status=in\_progress` with an expired lease. The next orchestrator cycle (after restart) will pick it up from `getNextTask()` because `lease\_until &lt; now`. `attempt\_count` is incremented. After 3 retries, it goes to DLQ. This recovery path is correct but relies on the orchestrator restarting — if it does not restart (manual intervention needed), the task is orphaned in `in\_progress` indefinitely. Confirmed: MC task #10330 is currently `in\_progress` with `active\_work` row showing `john` as agent — this appears to be a legitimate in-progress task, not a ghost. **Event log:** `events.db` has a `dead\_letter` table with 19 entries — these are events that exhausted retries. The events schema includes `correlation\_id` and `causation\_id` for lineage tracing. However, as noted in §1, the primary orchestration path does not write to this store. \---

## 4. Backup &amp; Recovery

### What exists

**Litestream** (`/Users/makinja/system/config/litestream.yml`): Comprehensive — 50+ databases replicated to Azure Blob Storage (`alaibackups0ebb.blob.core.windows.net`, container `system-db-backups`). Tiering: - **P0-critical** (mission-control.db, hivemind.db, costs.db, events.db, durable-runner.db, knowledge.db, email-inbox.db, alem-directives.db): `sync-interval: 1s`, `retention: 168h` (7 days)
- **P0-financial** (fiken.db, invoices.db, contracts.db, leads.db): `sync-interval: 1s`, `retention: 720h` (30 days)
- **P1** (agent-routing.db, bee-index.db, contacts.db, etc.): `sync-interval: 1s`, `retention: 72h`
- **P2** (prompt-cache.db, baikal-caldav.db, etc.): `sync-interval: 10s`, `retention: 24h`

**flywheel.db** (224MB main + 263MB WAL): `sync-interval: 30s` — reasonable given size, but WAL bloat means Litestream is continuously replicating a large dirty WAL. **azure-db-backup.sh** (`com.alai.azure-db-backup`, every 4h): Independent Postgres + Docker volume + Qdrant backup to Azure Blob. This covers the cloud products (Bilko, Drop) but is reported as silently failing (daemon-fleet watchdog MC #10173). The error log shows: `rm: /tmp/az-backup-50926: Directory not empty` — non-fatal cleanup failure but indicative of script state pollution. **Point-in-time restore:** Litestream `.ltx` format supports point-in-time restore within the retention window. `litestream.yml.bak-20260425` and restore config at `litestream-restore.yml` confirm the restore procedure has been documented. **Manual snapshots:** `mission-control.db.bak-20260306-175717` (4.5MB) and `mission-control.db.bak-before-7082` (21.6MB) exist on disk — evidence of manual before/after snapshots. The 2026-03-06 snapshot is ~7 weeks old, covering a period when the database was ~1/5 current size. These are not automated. **RPO assessment:**- mission-control.db: RPO = ~1s (Litestream sync-interval) + Azure write latency (~100ms). **Effective RPO: ~2s**. This is excellent for a SQLite-based system.
- Litestream replication of a 23MB WAL: the WAL is being replicated continuously, but a large un-checkpointed WAL means recovery requires replaying that WAL — **estimated recovery time: 5–15 seconds** on a healthy machine.

**RTO assessment:**- If mission-control.db corrupts NOW: restore from Litestream latest snapshot + WAL segments. Litestream restore to local path, then restart pi-orchestrator. Assuming Azure access is available: **RTO ~2–5 minutes**.
- If ANVIL (the Mac Studio) dies: restore requires a new machine, reinstall Homebrew + Node.js + Litestream, then `litestream restore`. **RTO ~30–60 minutes** for full service restoration.

**What is lost on corruption NOW:**- `active\_work`: 2 current rows — 2 in-flight tasks would need manual status check/reset
- Any `task\_history`, `outbox`, or `approval\_tokens` rows written in the last ~2 seconds
- The 23.3MB WAL (uncommitted pages) — Litestream replicates WAL pages, so this is covered

**Untested restore:** No evidence of a restore drill in logs or memory files. `litestream-restore.yml` exists as documentation but there is a note in MEMORY.md that `azure-db-backup` was silently failing. If Litestream's Azure SP credentials (`AZURE\_CLIENT\_ID`) have rotated or the SP permissions have changed, the replicas may be stale without any alert. This is the **single most dangerous gap** in the backup posture. **ingest-queue.sqlite.corrupt** in `~/system/state/`: A 6.4MB corrupt SQLite file alongside a backup — evidence of a past corruption event. No RCA documented in accessible files. \---

## 5. Evolution Risk

### The 5,251-line pi-orchestrator.js problem

`pi-orchestrator.js` is a 5,251-line monolith that owns: semantic classification, model selection, prompt construction (with 4 RAG sources), worker pool management, quality gating (lint, test, semantic), proof-of-work verification, routing token writing, DLQ writing, HiveMind posting, cost tracking, and the main daemon loop. It has at least 14 conceptually independent subsystems co-located in one file.

**Concrete risks:**1\. **Single process, single point of failure.** If pi-orchestrator crashes mid-task, all in-flight work is abandoned. The lease mechanism provides recovery but with a gap of `lease\_until` duration (typically 5–30 minutes) before re-eligibility.

2\. **In-process worker pool counters** (`activeWorkers` object) are process-local. These are not persisted to the database. On restart, `activeWorkers` resets to zero even if tasks are still `status=in\_progress` in the DB — the orchestrator will over-claim slots until `reconcileWorkerCounters()` detects the drift (up to 30s).

3\. **No command authority boundary.** `chain-runner.js`, `sprint-pipeline.js`, `pipeline-controller.js`, `cross-company-bus.js`, and `mc.js` all write directly to mission-control.db. There is no single command handler. A schema migration or constraint change in `tasks` requires auditing all 14+ writers. The 38-column `tasks` table with columns added via `ALTER TABLE` is the result.

4\. **`task\_scheduling` not populated for all tasks.** 4,211 scheduling rows for 10,220 tasks means 58% of tasks have no scheduling metadata. `getNextTask()` handles this with `LEFT JOIN ... COALESCE(ts.cb\_state, 'closed')` — safe but means circuit breaker and lease state is silently absent for most tasks.

5\. **`sprint-pipeline.js` DAG state in JSON column.** `sprints.dag\_json TEXT` stores the full DAG serialisation inside a SQLite text column. This means querying or filtering on DAG phase status requires deserialisation in application code, cannot use SQL indexes, and is vulnerable to partial writes leaving malformed JSON.

### Modernisation recommendations

**Recommendation 1 — Immediate: WAL checkpointing cron (hours)**Run `PRAGMA wal\_checkpoint(TRUNCATE)` on hivemind.db, flywheel.db, and mission-control.db every 4 hours via LaunchAgent. The 147MB hivemind WAL is a live production risk. Cost: 2 hours, zero architectural change. **Recommendation 2 — Short term: Atomic claim transaction (1–2 days)**Wrap the task claim (lease\_until update + status=in\_progress + active\_work insert) in a single `BEGIN IMMEDIATE ... COMMIT`. This is trivially achievable with better-sqlite3's `db.transaction()` API and eliminates the three-statement race window. Also add `CREATE INDEX IF NOT EXISTS idx\_task\_scheduling\_next\_eligible ON task\_scheduling(next\_eligible, dead\_letter)` and `CREATE INDEX ON tasks(delegated\_to)` to fix the missing indexes in the hot query path. **Recommendation 3 — Short term: FK enforcement (1 day)**Add `db.pragma('foreign\_keys = ON')` to every better-sqlite3 connection open in pi-orchestrator.js, mc.js, and all tool scripts. Then fix the 27 existing orphan rows (most are safe to delete — they reference tasks that were hard-deleted, violating the soft-delete principle). This prevents a class of data inconsistency bugs at zero performance cost. **Recommendation 4 — Medium term: Extract command boundary (1–2 weeks)**Move all task state mutations (open→in\_progress, in\_progress→done, etc.) behind `mc.js` as the **single writer**. All tools that currently open mission-control.db directly as a writer (sprint-pipeline.js, pipeline-controller.js, cross-company-bus.js) should call `mc.js` via IPC or HTTP instead. This eliminates the 14-writer problem and makes schema evolution safe. The `outbox` table in mission-control.db is already structurally correct for this — route all mutations through it with `processed=0`, and have mc.js process the outbox as the canonical writer. **Recommendation 5 — Medium term: Decompose pi-orchestrator.js (2–4 weeks)**Split into four services: - `task-classifier.js` — semantic classification + caching (stateless, ~200 LoC)
- `model-router.js` — model selection + fleet health + routing tokens (~300 LoC)
- `agent-executor.js` — subprocess management + quality gate + proof-of-work (~500 LoC)
- `task-scheduler.js` — the daemon loop + lease management + DLQ + concurrency counters (~300 LoC)

These communicate through the existing `events.db` store (which is already correctly designed for this). This eliminates the single-process SPOF and enables independent restart of each component. **Recommendation 6 — Long term: Event-sourced task store (4–8 weeks)**Replace the multi-table task state machine with an event-sourced design: one `task\_events` table (taskId, eventType, payload, timestamp, actor) as the source of truth; `tasks` becomes a materialised view rebuilt from events. This enables full replay, audit trail, and time-travel queries. The existing `task\_history` table is a partial implementation of this pattern. The `events.db` schema (with `idempotency\_key`, `correlation\_id`, `causation\_id`) is the correct foundation — extend it to cover task lifecycle events and wire pi-orchestrator to write there instead of posting to HiveMind. **CDC option:** An alternative to event sourcing is Litestream-level CDC: Litestream already replicates WAL pages; a consumer reading those pages could reconstruct a change log. This requires custom Litestream consumer code but avoids changing the write path. \---

## Top 3 Architectural Risks

**RISK 1 — Unverified Litestream backup integrity (CRITICAL)**Litestream is configured for 1s RPO on mission-control.db, but the azure-db-backup daemon is known-silently-failing (MC #10173). If the Litestream LaunchAgent has also silently stopped (daemon watchdog found 11 failures), the Azure replicas may be hours or days stale. A corruption event today could result in data loss far exceeding the stated 2s RPO. **Action needed: verify last successful Litestream replication timestamp immediately.** Run `litestream snapshots -config ~/system/config/litestream.yml` and check dates. Then run a restore drill to a temp path. **RISK 2 — Non-atomic task claim + 14 concurrent writers (HIGH)**The three-statement claim (lease/status/active\_work) is non-atomic. With 14 identified writers accessing mission-control.db without a command authority boundary, there is no structural guarantee against double-claims, lost updates, or constraint bypass. The `version` column in `tasks` exists for optimistic locking but its enforcement is not verified — if writers do not check `WHERE version = ?` in their UPDATE statements, it provides no protection. **RISK 3 — 147MB hivemind.db WAL + 264MB flywheel.db WAL (HIGH)**Both WALs exceed their main database file sizes. WAL mode accumulates pages when readers hold long-running transactions or when `wal\_autocheckpoint` threshold is never reached due to write volume. These databases cannot benefit from Litestream's efficient snapshot mechanism until checkpointed. On an unexpected restart, SQLite must replay the entire WAL — for hivemind.db this means replaying 147MB before any operation can proceed, causing multi-second to minute-level startup delays. Under extreme conditions (full disk), the WAL files could prevent writes to all SQLite databases sharing the same volume. \---

## Top 3 Strengths

**STRENGTH 1 — Litestream replication architecture**The investment in Litestream with proper tier classification (P0/P1/P2), 7-day retention for critical stores, and Azure Blob backend is genuinely sophisticated for a single-machine SQLite deployment. This puts ALAI's backup posture ahead of many funded startups. The `litestream-restore.yml` and SP-based auth show operational maturity. **STRENGTH 2 — Circuit breaker + dead-letter queue pattern**`task\_scheduling.cb\_state IN ('closed', 'open', 'half\_open')` with `attempt\_count`, `next\_eligible` backoff, and `dead\_letter` flag implements a textbook circuit breaker. The DLQ path (auto-block + Slack alert after 3 retries) prevents runaway retry storms. 52 entries in DLQ with max 5 attempts shows the mechanism is working in production. **STRENGTH 3 — events.db transactional outbox schema**The `events.db` schema — with `idempotency\_key UNIQUE`, `correlation\_id`, `causation\_id`, `status` state machine, `dead\_letter` table, and proper indexes — is correctly designed for at-least-once event delivery with deduplication. The schema supports everything needed for an event-sourced architecture. The gap is that the primary orchestration path does not yet write to it; when it does, this becomes a genuine durable, replayable event log. \---

## Appendix: Key File Paths

- `/Users/makinja/system/kernel/pi-orchestrator.js` — 5,251 LoC main daemon
- `/Users/makinja/system/kernel/cross-company-bus.js` — lateral event routing
- `/Users/makinja/system/tools/chain-runner.js` — DAG chain execution
- `/Users/makinja/system/tools/sprint-pipeline.js` — sprint coordination
- `/Users/makinja/system/tools/pipeline-controller.js` — 13-phase lifecycle
- `/Users/makinja/system/databases/mission-control.db` — 25.6MB, 10,220 tasks
- `/Users/makinja/system/databases/mission-control.db-wal` — 23.3MB (needs checkpoint)
- `/Users/makinja/system/databases/hivemind.db-wal` — 147MB (CRITICAL — needs checkpoint)
- `/Users/makinja/system/databases/flywheel.db-wal` — 264MB (needs checkpoint)
- `/Users/makinja/system/databases/events.db` — 3,710 events, 19 dead
- `/Users/makinja/system/config/litestream.yml` — 741 lines, 50+ DBs replicated
- `/Users/makinja/system/state/ingest-queue.sqlite.corrupt` — evidence of past corruption