# Design Spec

# Pillar #5 — Autonomous Skill-Chain Meta-Orchestrator

## Design Spec — Agentic OS v1 Hardening

**MC:** #99129 | **Parent:** #99063 | **Date:** 2026-05-05 | **Status:** ready\_for\_review

**Synthesizer:** Petter Graff (CodeCraft)

**Sources:** 6 specialist sections at `/tmp/pillar5-99129/section-{1..6}-*.md` (4256 lines total) +

forged baseline `/Users/makinja/system/prompts/forged/99129.md`

---

## §1. Executive Summary

### Problem

Pillars #1-#4 produced an Agentic OS that thinks, schedules, remembers, and uses skills. What it

does not do is run **stateful, multi-step, durable, scheduled skill chains** that survive reboots,

have human-in-loop gates, and refuse to claim victory without disk evidence. Recent ALAI failures

(MLX phantom 2026-05-04, broken-JOHN 2026-05-02, Drop AWS phantom 2026-04-30) all share one root:

agent-emitted "done" with no runtime artefact behind it. Skill chains amplify that risk by chaining

N self-attesting links — phantom compounds linearly.

### Decision

Pillar #5 is a **thin convention layer**, not a new orchestration engine. It reuses primitives that

already exist on disk (`mc.js` task\_scheduling, launchd cron, skills, sub-agents, SQLite WAL,

`liveness-claim-validator.sh` hook, ai-factory-pipeline 64-gate stack). Two new artefacts only:

`chain-runner.sh` (≤150 LOC bash polling loop) and `phantom-link-detector.js` (15-min Node daemon).

Trigger model is **cron + sub-agent dispatch hybrid**: launchd fires `chain-runner.sh`, which polls

`mc.js list --delegated chain-runner`, advances ONE step per tick by spawning ONE sub-agent Task,

then exits. Pure recursive parent-spawning is rejected (cost penalty + observability collapse).

Pure cron-only stateless is rejected (no inheritance of 64-gate stack). Hybrid wins.

State lives in **mission-control.db** (`chain_runs` + `chain_steps` tables — Bruce's section-1

schema read of pipeline.db confirmed zero chain-shaped tables, default rule applies). Runtime evidence lives in

**JSONL append log** at `~/system/logs/chains/<chain>-<run>.jsonl` written ONLY by chain-runner.sh

(petter-graff "runtime not testimony" wins on observability; openai-ca durability argument wins on

storage; both adopted, state-file rejected).

### Deliverables (DESIGN — build phase = child MCs)

1. SQLite DDL for `chain_runs` + `chain_steps` (mission-control.db)
2. `chain-runner.sh` ≤150 LOC bash, shellcheck clean, dry-run + signal traps + exit codes 0-4
3. `phantom-link-detector.js` Node daemon, 15-min cadence with 7-min offset from watchdog
4. Three reference YAML pipelines: daily-inbox-triage (FIRST SHIP), weekly-client-report,

 email-pipeline-fix-loop (LAST SHIP — only after 7 clean days on first two)

1. Two new launchd plists + one watchdog integration call (no third plist for fix-loop)
2. Cost telemetry: AGENT\_COST event + cost-tracker.js `chain` subcommand + $3/day pre-flight gate
3. Threat mitigations addressing 8 attack surfaces (3 HIGH must block ship)

### Ship Gate

**80% daemon fleet health minimum.** Currently 79.7% (114/143 healthy per

`~/system/state/daemon-fleet-status.json` 2026-05-05T08:34Z). Two critical daemons must be fixed

first: `com.john.db-backup` (down\_exit\_256) and `com.alai.cost-daily-report` (calendar\_err\_32512).

### Cost Envelope

- **Hard ceiling:** $3.00/day combined across all `delegated_to=chain-runner` tasks
- **Expected monthly:** $2.80 (44 runs/month: 30 daily + 4 weekly + 10 fix-loop)
- **2× worst-case monthly:** $17.27 — recommend $20/month as Pillar #5 budget envelope
- **Daily peak risk:** ~$1.79 (12 fix-loop incidents in one day) — under ceiling
- **Ceiling breach scenario:** 23+ fix-loop firings in 24h — ceiling enforcement halts at run 22

---

## §2. Architecture Overview

### Components Diagram

```
                ┌──────────────┐                ┌──────────────────┐
                │   launchd    │  cron tick     │  daemon-fleet-   │
                │ (cron only,  │   (5min)       │    watchdog      │
                │ never exec)  │                │ (existing, 15min)│
                └──────┬───────┘                └────────┬─────────┘
                       │                                 │ failure detected
                       ▼                                 ▼
              ┌────────────────────┐         mc.js add --delegated
              │  chain-runner.sh   │◀─────────chain-runner --priority H
              │ (bash, ≤150 LOC,   │
              │  one step per tick,│
              │  exits cleanly)    │
              └─────┬────────┬─────┘
                    │        │
           polls mc.js       │ writes JSONL events
                    │        │ (NEVER from sub-agent)
                    ▼        ▼
        ┌───────────────────────────┐    ┌────────────────────────────┐
        │   mission-control.db      │    │ ~/system/logs/chains/      │
        │ ┌──────────┬───────────┐  │    │  <chain>-<run>.jsonl       │
        │ │chain_runs│chain_steps│  │    │  4 event types:            │
        │ ├──────────┴───────────┤  │    │   STEP_START, STEP_OUTPUT, │
        │ │tasks│task_scheduling │  │    │   AGENT_COST, STEP_END     │
        │ │runs (cost ledger)   │  │    └────────────────────────────┘
        │ └─────────────────────┘  │
        └────────────┬──────────────┘
                     │
                     │ dispatches ONE sub-agent per link
                     ▼
              ┌──────────────────┐         ┌─────────────────────────┐
              │  Task() (claude  │ writes  │  ~/system/state/chains/ │
              │   CLI headless,  │────────▶│  <chain>/<run>/         │
              │  inherits 64-gate│         │   step-N-<name>.json    │
              │  stack)          │         │  (atomic .tmp+rename)   │
              └──────────────────┘         └─────────┬───────────────┘
                                                     │
                                                     ▼
                                       ┌──────────────────────────────┐
                                       │ phantom-link-detector.js     │
                                       │ (Node, 15min, 7min offset)   │
                                       │  - file existence check      │
                                       │  - min size + SHA-256        │
                                       │  - chain_run_id correlation  │
                                       │  - cost=0 anomaly check      │
                                       │  → marks phantom_suspected   │
                                       │    blocks downstream done    │
                                       └──────────────────────────────┘
```

### Trigger Types (closed set in v1)

<table id="bkmrk-typesourceuse-casecr"><thead><tr><th>Type</th><th>Source</th><th>Use case</th></tr></thead><tbody><tr><td>`cron`</td><td>launchd `StartCalendarInterval`</td><td>daily-inbox-triage (07:00), weekly-client-report (Mon 09:00)</td></tr><tr><td>`watchdog`</td><td>`daemon-fleet-watchdog.sh` calls `mc.js add --delegated chain-runner`</td><td>email-pipeline-fix-loop (no plist of its own)</td></tr><tr><td>`manual`</td><td>Human/orchestrator `mc.js add --delegated chain-runner`</td><td>one-off, testing, incident response</td></tr></tbody></table>

A fourth type (`event_bus`) is **reserved for v2** when Pillar #9 multi-host runtime activates.

### Boundary Rules

- **Cron triggers, never executes work.** Plists call `chain-runner.sh --enqueue <chain>`; that

 script writes one MC task and exits.

- **MC is the queue.** No parallel queue primitive. `task_scheduling.lease_until` (mc.js:204),

 `attempt_count` (mc.js:201), `cb_state` (mc.js:200), `dead_letter` (mc.js:207) ARE the chain

 queue (Bruce §1).

- **Skills execute the work.** `~/.claude/skills/sp-*/SKILL.md` is the unit of skill content.

 Chain links wrap, do not replace, skills.

- **Sub-agents isolate the blast radius.** Each link spawns one Claude Code Task; the sub-agent

 inherits the 64-gate stack at `~/system/specs/ai-factory-pipeline.md` §2.

- **chain-runner is the only writer of state.** Sub-agents write only to `output_path`. JSONL,

 DB rows, lease state — all chain-runner.sh exclusively.

### Reuse vs New

<table id="bkmrk-existing-primitivepa"><thead><tr><th>Existing primitive</th><th>Path</th><th>Reuse for</th></tr></thead><tbody><tr><td>`mc.js` task\_scheduling</td><td>`~/system/tools/mc.js` lines 163,198,204,3089-3144</td><td>Chain queue, leases, cb\_state, DLQ</td></tr><tr><td>launchd</td><td>`~/Library/LaunchAgents/`</td><td>Cron triggers, phantom-detector cadence</td></tr><tr><td>Skills</td><td>`~/.claude/skills/sp-*/SKILL.md`</td><td>Per-link work units</td></tr><tr><td>Sub-agents (Task)</td><td>`claude` CLI headless</td><td>Per-link execution + 64-gate inheritance</td></tr><tr><td>SQLite WAL</td><td>mission-control.db</td><td>Durable chain state, atomic UPDATE</td></tr><tr><td>`liveness-claim-validator.sh`</td><td>`~/.claude/hooks/` (LIVE per `~/system/state/postflight-cleared-99127.json`)</td><td>Block bare LIVE/DONE claims</td></tr><tr><td>`bash-danger-gate.sh`</td><td>`~/.claude/hooks/` (LIVE)</td><td>Inherits across sub-agent dispatch</td></tr><tr><td>`cost-tracker.js`</td><td>`~/system/tools/cost-tracker.js`</td><td>Cost dual-write, daily aggregation, new `chain` subcommand</td></tr></tbody></table>

<table id="bkmrk-new-artefactpathpurp"><thead><tr><th>**NEW artefact**</th><th>**Path**</th><th>**Purpose**</th></tr></thead><tbody><tr><td>`chain-runner.sh`</td><td>`/Users/makinja/system/tools/chain-runner.sh` \[NEW\]</td><td>Cron-triggered polling + per-step dispatch</td></tr><tr><td>`phantom-link-detector.js`</td><td>`/Users/makinja/system/tools/phantom-link-detector.js` \[NEW\]</td><td>15-min phantom + cost anomaly daemon</td></tr><tr><td>`chain-validator.js`</td><td>`/Users/makinja/system/tools/chain-validator.js` \[NEW\]</td><td>Pre-registration YAML schema check</td></tr><tr><td>`secret-scanner.sh`</td><td>`~/.claude/hooks/secret-scanner.sh` \[NEW\]</td><td>PostToolUse hook on chain state dir (T6 mitigation)</td></tr><tr><td>`chain_runs` + `chain_steps`</td><td>mission-control.db \[NEW DDL\]</td><td>Durable chain state</td></tr><tr><td>Two LaunchAgent plists</td><td>`~/Library/LaunchAgents/com.alai.chain-{daily-inbox,weekly-report}.plist` \[NEW\]</td><td>Cron triggers</td></tr><tr><td>Phantom-detector plist</td><td>`~/Library/LaunchAgents/com.alai.chain-phantom-detector.plist` \[NEW\]</td><td>15-min scan</td></tr></tbody></table>

---

## §3. Storage &amp; State

### DB Location Decision — RESOLVED

**Verdict: mission-control.db.** Bruce verified pipeline.db schema this session (section-1 §1):

tables `pipelines`, `phase_history`, `gate_checks`, `blueprint_runs`, `phase_results`,

`blueprint_lineage`, `_litestream_seq`, `_litestream_lock`. None are chain-state-shaped.

`blueprint_runs` is the closest analog but client-project-scoped, not skill-chain-scoped.

Forged-prompt synthesizer's default rule applies: pipeline.db has no relevant tables → mission-

control.db. This **resolves** the unresolved item from forged-99129.md disagreement #3.

mission-control.db already holds `tasks` + `task_scheduling` (queue), `runs` (cost ledger with

`trace_id` for free-JOIN cost aggregation), `pipeline_events`, `outbox`. WAL mode active

(verified by Litestream marker tables). Single-file WAL is correct at ALAI volume (≤10 chain

runs/day, single-host ANVIL).

### chain\_runs + chain\_steps DDL

```sql
BEGIN;

CREATE TABLE IF NOT EXISTS chain_runs (
  chain_run_id   TEXT    PRIMARY KEY,           -- UUID v4 (NOT wall-clock; multi-host safe)
  chain_name     TEXT    NOT NULL,              -- "daily-inbox-triage" | etc.
  mc_task_id     INTEGER REFERENCES tasks(id),  -- chain MC task
  triggered_by   TEXT    NOT NULL
                 CHECK(triggered_by IN ('cron','watchdog','manual','retry')),
  status         TEXT    NOT NULL DEFAULT 'running'
                 CHECK(status IN ('running','awaiting_human','succeeded','failed',
                                  'phantom_suspected','cost_halted','dead_letter')),
  total_links    INTEGER NOT NULL DEFAULT 0,
  current_link   INTEGER NOT NULL DEFAULT 0,
  cost_usd_total REAL    NOT NULL DEFAULT 0.0,
  started_at     TEXT    NOT NULL DEFAULT (datetime('now')),
  completed_at   TEXT    DEFAULT NULL,
  lease_until    TEXT    DEFAULT NULL,          -- chain-level (30min); step lease on chain_steps
  schema_version INTEGER NOT NULL DEFAULT 1,
  estimated_cost_usd  REAL,                     -- from YAML cost_ceiling_usd_per_run
  cache_hit_ratio_avg REAL,                     -- rolling avg across steps
  phantom_check_status TEXT DEFAULT 'pending'   -- 'pending' | 'clean' | 'phantom_suspected'
);

CREATE INDEX IF NOT EXISTS idx_chain_runs_name_status ON chain_runs(chain_name, status);
CREATE INDEX IF NOT EXISTS idx_chain_runs_started     ON chain_runs(started_at);
CREATE INDEX IF NOT EXISTS idx_chain_runs_mc_task     ON chain_runs(mc_task_id);

CREATE TABLE IF NOT EXISTS chain_steps (
  id             INTEGER PRIMARY KEY AUTOINCREMENT,
  chain_run_id   TEXT    NOT NULL REFERENCES chain_runs(chain_run_id) ON DELETE CASCADE,
  step_name      TEXT    NOT NULL,
  step_index     INTEGER NOT NULL,
  skill          TEXT    NOT NULL,
  status         TEXT    NOT NULL DEFAULT 'pending'
                 CHECK(status IN ('pending','running','done','failed',
                                  'awaiting_human','phantom_suspected','skipped')),
  attempt_count  INTEGER NOT NULL DEFAULT 0,
  max_attempts   INTEGER NOT NULL DEFAULT 2,    -- 0 for human_gate links
  human_gate     INTEGER NOT NULL DEFAULT 0
                 CHECK(human_gate IN (0,1)),
  input_path     TEXT    DEFAULT NULL,
  output_path    TEXT    DEFAULT NULL,
  output_sha256  TEXT    DEFAULT NULL,
  cost_usd       REAL    NOT NULL DEFAULT 0.0,
  input_tokens   INTEGER NOT NULL DEFAULT 0,
  output_tokens  INTEGER NOT NULL DEFAULT 0,
  cache_creation_tokens INTEGER DEFAULT 0,
  cache_read_tokens     INTEGER DEFAULT 0,
  cache_hit_ratio       REAL DEFAULT 0,
  model          TEXT    DEFAULT NULL,
  started_at     TEXT    DEFAULT NULL,
  completed_at   TEXT    DEFAULT NULL,
  lease_until    TEXT    DEFAULT NULL,          -- step-level lease (10min)
  error_text     TEXT    DEFAULT NULL,
  compensate_on_failure TEXT DEFAULT NULL,
  UNIQUE (chain_run_id, step_name)              -- idempotency key
);

CREATE INDEX IF NOT EXISTS idx_chain_steps_run_status   ON chain_steps(chain_run_id, status);
CREATE INDEX IF NOT EXISTS idx_chain_steps_run_index    ON chain_steps(chain_run_id, step_index);
CREATE INDEX IF NOT EXISTS idx_chain_steps_lease        ON chain_steps(lease_until)
  WHERE status = 'running';
CREATE INDEX IF NOT EXISTS idx_chain_steps_phantom_scan ON chain_steps(status, output_path)
  WHERE status = 'done' AND output_path IS NOT NULL;

CREATE TABLE IF NOT EXISTS chain_runs_archive (
  chain_run_id   TEXT    PRIMARY KEY,
  chain_name     TEXT    NOT NULL,
  -- (matches chain_runs schema + archived_at)
  archived_at    TEXT    NOT NULL DEFAULT (datetime('now'))
);

COMMIT;
```

Full DDL with all archive columns: see section-1 §7 (lines 651-728).

### Atomic Checkpoint Protocol

SQLite WAL handles DB row atomicity natively. Sidecar JSON evidence files (sub-agent outputs)

use `.tmp` + `fsync` + `rename(2)` (APFS atomic) + SHA-256 capture:

```bash
TMP_PATH="${OUTPUT_PATH}.tmp"
write_output_to "${TMP_PATH}"
python3 -c "import os,sys; fd=os.open(sys.argv[1],os.O_RDONLY); os.fsync(fd); os.close(fd)" "${TMP_PATH}"
SHA=$(shasum -a 256 "${TMP_PATH}" | awk '{print $1}')
mv -f "${TMP_PATH}" "${OUTPUT_PATH}"
sqlite3 "${MC_DB}" "UPDATE chain_steps SET output_path='${OUTPUT_PATH}', output_sha256='${SHA}'
                    WHERE chain_run_id='${CHAIN_RUN_ID}' AND step_name='${STEP_NAME}';"
```

Resume integrity: re-compute SHA on disk, compare to `chain_steps.output_sha256`. Mismatch → reset

to `pending`, rerun. Orphan `.tmp` cleanup: `find ~/system/state/chains/ -name "*.tmp" -mmin +30`

runs as part of phantom-link-detector.

Artifact path convention (NOT `/tmp/` — cleared on reboot per CONSTRAINTS):

```
~/system/state/chains/<chain_name>/<chain_run_id>/step-<N>-<step_name>.json
```

### JSONL Event Log

**Path:** `~/system/logs/chains/<chain_name>-<chain_run_id>.jsonl` (per-run isolation,

petter-graff isolation argument; cross-run queries via `find` + `jq`).

**Four event types:**

- `STEP_START` — written by chain-runner.sh BEFORE sub-agent dispatch
- `STEP_OUTPUT` — written AFTER verifying output file + SHA-256 match
- `AGENT_COST` — written AFTER reading `runs` table for sub-agent's cost (NOT from sub-agent narrative)
- `STEP_END` — written AFTER STEP\_OUTPUT verified + chain\_steps.status updated

**Critical ownership rule (section-1 §4):** Sub-agents do NOT write to JSONL. They write only

to `output_path`. chain-runner.sh is the sole writer. "runtime not testimony" — code review

must reject any chain-runner that delegates JSONL writing to sub-agents.

Full event schemas with field tables: see section-1 §4.2 (lines 312-407) and section-2 §2.9

(lines 656-692).

### Lease Semantics — RESOLVED

**New `lease_until` column on `chain_steps`, NOT reuse of `task_scheduling.lease_until`** (Bruce §5.1).

`task_scheduling` is keyed by `task_id`; one MC task = one row. Reusing it for step-level leases

would either inflate the task table (one MC per step = operational noise + dispatch logic breakage)

or repurpose the chain MC row (losing chain-level vs step-level distinction). Clean column wins.

The chain MC task's `task_scheduling` row continues to serve cb\_state, dead\_letter, chain-level

retry counting (mc.js:3089).

Atomic lease acquire (CAS pattern):

```sql
UPDATE chain_steps
SET status='running', attempt_count=attempt_count+1, started_at=datetime('now'),
    lease_until=datetime('now', '+10 minutes')
WHERE chain_run_id=:chain_run_id AND step_name=:step_name
  AND status IN ('pending','failed')
  AND (lease_until IS NULL OR lease_until < datetime('now'));
-- chain-runner checks changes()==1; if 0, another runner holds lease, exit 0
```

SQLite WAL serializes writes; `changes()==1` is the optimistic-lock confirmation. Section-1 §5.2

(lines 432-510) carries full lease release + expiry recovery SQL.

### Retention / Age-Out (section-1 §6)

- **&gt;24h running/awaiting\_human** → mark `dead_letter`, cascade chain\_steps to `failed`
- **&gt;7d completed** → INSERT into `chain_runs_archive`, DELETE from `chain_runs` (CASCADE)
- **JSONL files** → mv to `~/system/logs/chains/_archive/YYYY-MM/`
- **Orphan `.tmp` files** → `find ... -mmin +30` → log as `ORPHAN_TMP` event

Age-out runs daily 03:00 via `com.alai.chain-ageout.plist` (StartCalendarInterval, NOT KeepAlive).

Build-phase plist; cite section-1 §6.3 (lines 610-640) for full XML.

---

## §4. chain-runner.sh + phantom-link-detector

### chain-runner.sh — Architecture

**Stateless bash polling loop.** Not a daemon, not a scheduler. ≤150 LOC. Each invocation:

poll mc.js → read chain\_steps → dispatch ONE sub-agent → wait for ready → verify output →

Proveo gate → advance row → release lease → exit. LaunchAgent re-triggers on cadence.

**Polling cadence:** `StartInterval=60` (chain-runner poll plist) for the inner loop;

`StartCalendarInterval` for the per-chain triggers. Do NOT poll tighter than 5min without

confirming daemon-fleet health (CONSTRAINT §).

**Exit codes:**

<table id="bkmrk-codemeaning0no-runna"><thead><tr><th>Code</th><th>Meaning</th></tr></thead><tbody><tr><td>0</td><td>No runnable task, or step dispatched and awaiting ready (normal cron tick)</td></tr><tr><td>1</td><td>Internal error (DB query fail, lock file, disk full)</td></tr><tr><td>2</td><td>Chain halted at human gate — emit HUMAN\_GATE event, exit clean</td></tr><tr><td>3</td><td>Cost ceiling exceeded — all chains halted, Slack alerted</td></tr><tr><td>4</td><td>Circuit breaker OPEN for this chain — skip, log, exit</td></tr></tbody></table>

**Signal handling:** SIGTERM/SIGINT → write STEP\_INTERRUPTED → release lease → exit 1.

SIGKILL untrapped; lease expiry (10 minutes per mc.js:1367) handles kill-9 resume.

**Max runtime per step:** 8 minutes (`timeout 480`). 10-min lease window leaves 2-min cleanup

buffer before lease expiry.

**Dry-run mode (`--dry-run`):** Read DB; print planned dispatch (chain\_id, run\_id, step, skill,

input\_path, output\_path); never call Task(); never mutate DB; exit 0.

### Sub-Agent Dispatch Contract

Each link dispatches exactly ONE sub-agent. Checkpoint passed as `--resume-from-checkpoint <abs_path>`

argument (NOT environment variable — args appear in `set -x` traces, env vars don't):

```bash
timeout 480 node "${CLAUDE_CLI}" task \
  --skill "${skill}" \
  --resume-from-checkpoint "${input_path}" \
  --output-path "${output_path}" \
  --chain-task-id "${chain_task_id}" \
  --chain-run-id "${CHAIN_RUN_ID}" \
  --step "${STEP_NAME}" \
  2>>"${STEP_STDERR_LOG}"
```

Sub-agent contract on receipt:

1. Read `--resume-from-checkpoint` file (cold start = `{}`; otherwise previous link's output)
2. Do skill work
3. Write output to `--output-path` using atomic `.tmp+rename`
4. Call `mc.js ready <chain_task_id>` with comment containing output\_path
5. **Do NOT call `mc.js done`** — done is exclusively Proveo's signal (Hard Constraint #4)

### Atomic Step Transition

```
READ checkpoint → ACQUIRE lease (CAS UPDATE) → WRITE STEP_START to JSONL →
UPDATE chain_steps.status='running' → SPAWN sub-agent (Task, timeout 480) →
WAIT for ready_for_review (poll mc.js show every 30s, max 8min) → 
VERIFY output_path exists on disk → INVOKE Proveo gate → 
WRITE STEP_END (JSONL FIRST, then SQL — mitigates skew, section-2 §2.7) →
WRITE next chain_steps row (status=pending, input_path=current output_path) →
RELEASE lease → EXIT
```

Full bash pseudocode: section-2 §2.3 (lines 224-298). LOC budget table: section-2 §2.10.

### Failure Handling

- **Crash mid-link:** lease expires (10min), next tick reads `WHERE status IN ('pending','failed')`

 ORDER BY id ASC, resumes from earliest unfinished step. Done steps immutable.

- **Lease expiry:** `_renew_lease` SQL UPDATE every 30s during ready-poll
- **Partial writes:** `.tmp` orphan only; phantom-detector reports `>30min` orphans
- **Sub-agent timeout:** STEP\_TIMEOUT event → increment attempt\_count → status=failed → release
- **Retry budget exhaustion:** `attempt_count >= max_retries` → cb\_state='OPEN', dead\_letter=1,

 STEP\_DEAD event, H-priority MC dead-letter task, Slack alert, EXIT 4

- **cb\_state circuit breaker:** check before dispatch; exit 4 if 'open' (mc.js:3101-3105 semantics)

Full SQL + bash: section-2 §2.4 (lines 328-415).

### Phantom-Link-Detector Daemon

**Node.js, 15-min cadence, 7-min offset from daemon-fleet-watchdog** (avoids SQLite contention:

phantom at :00/:15/:30/:45, watchdog at :07/:22/:37/:52).

**Algorithm per `chain_steps WHERE status='done' AND output_path IS NOT NULL` (24h window):**

1. `fs.existsSync(output_path)` — file present?
2. File size &gt;= MIN\_STEP\_OUTPUT\_BYTES (64) — empty-file bypass guard (T3 mitigation)
3. Parse JSON; verify `chain_run_id` + `step_name` fields match (T3)
4. Re-compute SHA-256, compare to `chain_steps.output_sha256` (T3)
5. Find matching `STEP_END status:ok` event in JSONL (T3)
6. Cost anomaly check: 5+ consecutive zero-cost agentic steps OR (2+ zero-cost AND total=0) (section-5 §5.6)

On mismatch:

- `chain_steps.status='phantom_suspected'`
- `chain_runs.status='phantom_suspected'`
- PHANTOM\_DETECTED / PHANTOM\_COST\_DETECTED event to JSONL
- Slack alert to `#alerts`
- mc.js comment to chain task: "PHANTOM LINK DETECTED... Downstream done BLOCKED"

**Blocking downstream done:** ai-factory-pipeline B10 fail-secure postflight gate already checks

`chain_runs.status` before allowing `mc.js done` — no mc.js code change needed.

**False-positive handling:** legitimate "process and delete" → `output_path: null` in chain YAML

(detector skips `WHERE output_path IS NULL`). 2-minute grace period on freshly-completed steps.

### Saga / Compensation

Per-link attribute `compensate_on_failure: <skill>`, NOT chain-wide. Only links crossing

external-API boundaries (section-2 §2.5):

<table id="bkmrk-pipelinelinks-needin"><thead><tr><th>Pipeline</th><th>Links needing compensation</th><th>Compensating skill</th></tr></thead><tbody><tr><td>daily-inbox-triage</td><td>None — read-only classification + idempotent INSERT OR IGNORE</td><td>N/A</td></tr><tr><td>weekly-client-report</td><td>`bookstack-publish` (CEO-visible side-effect), `client-email-send` (at-most-once)</td><td>`bookstack-revert-draft`; `email-send-partial-log`</td></tr><tr><td>email-pipeline-fix-loop</td><td>`apply-fix` (production config mutation)</td><td>`email-pipeline-rollback` (reverse-order via step2-proposal.json rollback\_procedure)</td></tr></tbody></table>

Idempotency tokens for external POSTs are **mandatory** regardless of compensation config —

`chain_run_id + step_id` composite stored in SQLite before POST attempted.

### Implementation Language Guard — RESOLVED

**bash + Node ONLY.** chain-runner.sh = bash. phantom-link-detector.js = Node. mc.js helpers = Node.

Python permitted only for the one-line `os.fsync()` invocation (bash has no fsync primitive).

**No Kotlin** (genesis: `feedback_john_kotlin_rabbit_hole_2026-05-02.md`). Three reasons (section-2 §2.8):

opacity (existing alai-hooks Kotlin CLI has 13/64 OPAQUE gates), runtime dependency (Gatekeeper

SIGKILL on cp documented in `reference_hook_system_2026-05-04.md`), rabbit-hole attractor.

Structural prevention: shellcheck pass is acceptance criterion → Kotlin impossible by construction.

**No TypeScript.** mc.js is plain Node JavaScript; chain-runner helpers stay in JavaScript.

TypeScript adds a build step that fails silently.

**No OpenAI Agents SDK** (DPA pending; vendor lock; bypasses 64-gate stack).

---

## §5. YAML Pipeline Schema + 3 Reference Pipelines

### Schema Spec (top-level fields)

<table id="bkmrk-fieldtypeconstraintc"><thead><tr><th>Field</th><th>Type</th><th>Constraint</th></tr></thead><tbody><tr><td>`chain_id`</td><td>string</td><td>`/^[a-z][a-z0-9-]{1,63}$/`, unique across `~/system/agents/chains/`</td></tr><tr><td>`schema_version`</td><td>int</td><td>must equal 1</td></tr><tr><td>`description`</td><td>string</td><td>≤120 chars</td></tr><tr><td>`priority`</td><td>enum</td><td>`H</td><td>M</td><td>L` (H forces human\_gate on all agentic links)</td></tr><tr><td>`trigger.type`</td><td>enum</td><td>`cron</td><td>watchdog</td><td>manual`</td></tr><tr><td>`trigger.spec`</td><td>object</td><td>shape per type (cron: plist\_label + calendar\_interval; watchdog: source\_plist + dedup\_window\_minutes; manual: optional allowed\_origins)</td></tr><tr><td>`cost_ceiling_usd_per_run`</td><td>float</td><td>`0 < x ≤ 5.0` (build phase enforced)</td></tr><tr><td>`human_gate_default`</td><td>bool</td><td>overridden true if priority=H</td></tr><tr><td>`max_retries_default`</td><td>int</td><td>`0..5`</td></tr><tr><td>`links[]`</td><td>list</td><td>length `1..20`</td></tr></tbody></table>

### Per-link fields

`name`, `skill`, `agent` (required); `status_initial` (`pending|skip_if_exists`),

`human_gate`, `compensation_skill`, `evidence_paths[]` (must start with `/Users/makinja/system/state/chains/` —

NO `/tmp/`), `idempotency_key_template` (Go-template, vars: chain\_run\_id, link\_name, date, attempt),

`timeout_seconds` (30..1800), `max_retries` (0..5; human\_gate links capped to 0),

`on_fail` (`stop|retry|compensate|dead_letter`).

Full spec with rationale per field: section-3 §3.1.2 (lines 76-149).

### Three Reference Pipelines

#### A. daily-inbox-triage — FIRST SHIP

- **Trigger:** cron, 07:00 daily (`com.alai.chain-daily-inbox.plist`)
- **Priority:** M
- **Cost ceiling:** $0.25/run (target $0.20)
- **Links (4):** fetch-inbox (deterministic, $0) → classify (Sonnet, ~$0.018 warm) →

 draft-replies (Sonnet, ~$0.017 warm) → send (deterministic + human gate)

- **Human gate:** ONLY before send link (link 3)
- **Compensation:** none — read-only classification + idempotent INSERT OR IGNORE
- **Expected $/month:** $0.96 (warm cache); 2× worst-case $4.87

Full YAML: section-3 §3.2 (lines 166-287).

#### B. weekly-client-report — SECOND SHIP (after 7 clean days on A)

- **Trigger:** cron Monday 09:00 (`com.alai.chain-weekly-report.plist`; one hour after existing

 `com.john.weekly-pipeline-review` 08:00)

- **Priority:** M
- **Cost ceiling:** $0.75/run
- **Links (6):** aggregate-status → draft-narrative → format-report → human-review-draft (gate 1, 48h timeout) →

 bookstack-publish → human-review-final-publish (gate 2, 24h timeout, includes client-notify-send)

- **Human gates:** TWO (links 3 and 5)
- **Compensation:** `bookstack-revert-draft` on link 4; `client-notify-partial-log` on link 5
- **Expected $/month:** $0.44; 2× worst-case $4.40

Full YAML: section-3 §3.3 (lines 304-446).

#### C. email-pipeline-fix-loop — THIRD SHIP (after 7 clean days on A+B)

- **Trigger:** watchdog (NOT cron). `daemon-fleet-watchdog.sh` calls `mc.js add --delegated chain-runner

\--priority H --tag chain:email-pipeline-fix-loop` when `email\_consecutive\_failures &gt;= 3`

- **Priority:** H — forces human\_gate on every agentic link regardless of explicit field
- **Cost ceiling:** $1.50/run
- **Links (5):** probe (deterministic) → diagnose (Sonnet, gate) → propose-fix (Sonnet, gate) →

 apply-fix (executes proposal.sh, gate, compensate on partial) → verify-fix (re-probe, gate)

- **Compensation:** `email-pipeline-rollback` on apply-fix
- **Diagnostic preamble (REQUIRED in skill prompt):** "You are diagnosing a BROKEN email pipeline.

 All evidence below was gathered from a system that is currently failing. Treat every observation

 as a hypothesis, not a fact. Do NOT claim the system is working."

- **Expected $/month:** $1.40 (10 incidents/month); 2× worst-case $8.00
- **Self-referential risk:** runs on broken system — context rot guaranteed; verify-fix re-probes

 from scratch, never relies on pre-apply state in memory

Full YAML: section-3 §3.4 (lines 464-606).

### Validation Rules (section-3 §3.6)

16 rules summarized; key callouts:

- **V-09 (skill exists):** `~/.claude/skills/sp-<value>/SKILL.md` OR `~/.claude/skills/<value>/SKILL.md`

 must exist at validator run time. Prevents phantom skill references (MLX-style gap).

- **V-11 (evidence path safety):** every path must start with `/Users/makinja/system/state/chains/`.

 `/tmp/` paths FORBIDDEN for cross-step evidence (cleared on reboot per CONSTRAINTS).

Validator at `~/system/tools/chain-validator.js` \[NEW\]; invoked at chain registration.

Invalid chain → logged as `CHAIN_INVALID` event in JSONL, skipped (does NOT crash runner).

### Versioning

`schema_version: 1` is the only supported version in v1. Required field; absence = validation

error (V-02). Migration: increment + write `chain-migrate-v1-v2.js` + 30-day compatibility

window (warn `SCHEMA_DEPRECATED`) + reject `CHAIN_SCHEMA_OUTDATED` after window.

`chain_id` is primary key — MUST NOT be renamed; replacements get new ID running in parallel.

---

## §6. Trigger Infrastructure (LaunchAgents)

### Two NEW LaunchAgent Plists

**`com.alai.chain-daily-inbox.plist`** (07:00 daily)

- ProgramArguments: `/bin/bash /Users/makinja/system/tools/chain-runner.sh --enqueue daily-inbox-triage --trigger cron`
- StartCalendarInterval: Hour=7, Minute=0
- KeepAlive=false; RunAtLoad=NOT set (no off-hours flood on launchctl load)
- StandardOutPath/StandardErrorPath: `~/system/logs/chains/chain-daily-inbox-launchd.log` (merged)

**`com.alai.chain-weekly-report.plist`** (Monday 09:00)

- StartCalendarInterval: Weekday=1, Hour=9, Minute=0
- One hour AFTER existing `com.john.weekly-pipeline-review` (verified 08:00) — separation

 intentional; pipeline-review fires first, errors surface before chain begins

- Does NOT duplicate `com.john.weekly-pipeline-review`. That plist calls `weekly-pipeline-review.js`

 directly; this plist enqueues an MC chain task

### Phantom-Detector Plist

**`com.alai.chain-phantom-detector.plist`** (15-min interval)

- macOS launchd does NOT support `*/15` cron syntax. StartCalendarInterval expressed as

 ARRAY of four dicts (Minute=0, Minute=15, Minute=30, Minute=45) per `man launchd.plist`

- KeepAlive=false (one-shot scan; ghost-worker defence)
- Calls `~/system/tools/chain-phantom-detector.sh` shell wrapper → `phantom-link-detector.js`

Full XML: section-4 §4.2.1, §4.2.2, §4.3 (lines 29-244).

### Watchdog Integration (NO new plist for fix-loop)

`email-pipeline-fix-loop` is event-driven, NOT calendar-driven. Adding a cron plist would fire

on every tick regardless of email pipeline health → H-priority MC noise. Existing

`com.alai.daemon-fleet-watchdog` (StartInterval=900, KeepAlive=false, RunAtLoad=true) gains

4 lines inside `daemon-fleet-watchdog.sh` (section-4 §4.2.3 lines 152-168):

```bash
EMAIL_FAILURE_THRESHOLD=3
if [[ "$email_consecutive_failures" -ge "$EMAIL_FAILURE_THRESHOLD" ]]; then
  /opt/homebrew/bin/node /Users/makinja/system/tools/mc.js add \
    "Chain run: email-pipeline-fix-loop (incident: ${INCIDENT_ID})" \
    --priority H --delegated chain-runner --desc "..." \
    >> /Users/makinja/system/logs/chains/chain-email-fix-launchd.log 2>&1
  /opt/homebrew/bin/node /Users/makinja/system/tools/slack.js send "#alerts" "..."
fi
```

### KeepAlive Decision Matrix (section-4 §4.5)

<table id="bkmrk-componentpatternrati"><thead><tr><th>Component</th><th>Pattern</th><th>Rationale</th></tr></thead><tbody><tr><td>chain-runner.sh poll plist</td><td>StartInterval=60</td><td>Ghost-worker defence; lease semantics need clean exit</td></tr><tr><td>chain-daily-inbox trigger</td><td>StartCalendarInterval Hour=7</td><td>Calendar event; exits after mc.js add</td></tr><tr><td>chain-weekly-report trigger</td><td>StartCalendarInterval Weekday=1 Hour=9</td><td>Calendar event</td></tr><tr><td>chain-phantom-detector</td><td>StartCalendarInterval ARRAY (4× Minute)</td><td>Discrete scan; KeepAlive=false</td></tr><tr><td>daemon-fleet-watchdog</td><td>StartInterval=900 (existing, unchanged)</td><td>Watchdog integration adds bash inside</td></tr><tr><td>email-pipeline-fix-loop</td><td>NO PLIST</td><td>Event-driven via watchdog hook</td></tr></tbody></table>

### Lid-Close / Caffeinate (section-4 §4.6)

`man launchd.plist` (StartCalendarInterval): "Unlike cron... launchd will start the job the

next time the computer wakes up. If multiple intervals transpire before the computer is woken,

those events will be coalesced into one event upon wake."

**Coalescing is wanted behaviour:** machine closed Sun-Mon → ONE Mon 09:00 weekly-report fire on

wake (not N queued stale runs). caffeinate NOT required.

`StartInterval` (phantom-detector, runner-poll) does NOT fire on missed ticks — fires at next

boundary after wake. Max delay 60s — acceptable.

If sub-agent running when lid closes: macOS suspends; on wake, `lease_until` likely expired →

chain-runner treats as failed, retries from last clean checkpoint.

### Logging Contract (section-4 §4.7)

```
~/system/logs/chains/
  chain-daily-inbox-launchd.log           # launchd stdout+stderr (merged)
  chain-weekly-report-launchd.log
  chain-phantom-detector-launchd.log
  chain-email-fix-launchd.log
  chain-events.jsonl                      # rolling, in-progress + recent
  archive/
    chain-events-<chain_run_id>.jsonl     # archived per chain_run_id on completion
```

Rotation via existing `com.john.log-rotate` (extend config to include `~/system/logs/chains/*.log`,

7-day retention). `chain-events.jsonl` requires append-only handling; archive on chain done/failed.

**Pre-flight:** `mkdir -p ~/system/logs/chains/archive` BEFORE any plist load (LaunchAgent does NOT

create parent dirs; silent log-write failure otherwise).

### 80% Daemon Fleet Health SHIP GATE

**HARD GATE.** Currently **79.7% (114/143)** per `~/system/state/daemon-fleet-status.json`

2026-05-05T08:34Z. One healthy daemon below threshold. **Do NOT load chain plists** until gate opens.

Two critical daemons must be remediated FIRST:

1. **`com.john.db-backup`** (down\_exit\_256) — mission-control.db backup; chain\_runs corruption has

 no fallback without it

1. **`com.alai.cost-daily-report`** (calendar\_err\_32512) — chain cost telemetry depends on

 cost-tracker.js daily report; broken report obscures $3/day ceiling proximity

After fixing, re-run `bash /Users/makinja/bin/daemon-fleet-watchdog.sh` and re-evaluate. If

healthy ≥116 (81%), gate opens. **Genesis:** SENTINEL v3 47/138 healthy figure; building cron-fleet

on 34% failure-rate scheduler doubles failure surface. Phantom-detection subsystem validity itself

depends on this gate (section-4 §4.8 lines 499-503).

**Pre-flight checklist (section-4 §4.4):** snapshot daemon-fleet → plutil -lint each plist →

`mkdir -p` log dirs → load phantom-detector FIRST → load chain trigger plists → verify all loaded

via `launchctl list | grep com.alai.chain-` → check no stale `mc.js list --delegated chain-runner`

runs.

---

## §7. Cost Telemetry &amp; Ceiling

### Per-Link Cost Capture

**Source of truth:** Anthropic API `usage` block from `claude` CLI exit JSON. NEVER from sub-agent

narrative text (testimony vs evidence — same discipline as JSONL ownership).

```json
{
  "usage": {
    "input_tokens": 4200, "output_tokens": 810,
    "cache_creation_input_tokens": 3100, "cache_read_input_tokens": 980
  },
  "model": "claude-sonnet-4-6"
}
```

chain-runner.sh computes `cost_usd` using same formula as `cost-tracker.js/estimateCost()`:

- Input: $3.00/M tokens; Output: $15.00/M tokens; Cache read: $0.30/M (10% discount); Cache create: at input rate
- Pricing constants from cost-tracker.js PRICING table (Sonnet-4.6 2026-04-06 rates)

**AGENT\_COST event** appended to JSONL atomically (`.tmp` + `cat >>` since JSONL append is atomic

at OS level for writes &lt; PIPE\_BUF=4096 bytes on Darwin):

```json
{
  "ts":"2026-05-05T06:33:44Z", "event":"AGENT_COST",
  "chain_run_id":"...", "step_name":"classify", "model":"claude-sonnet-4-6",
  "input_tokens":4200, "output_tokens":810,
  "cache_creation_input_tokens":3100, "cache_read_input_tokens":980,
  "cost_usd":0.0181, "cache_hit_ratio":0.233
}
```

### Aggregation — RESOLVED

**Decision: explicit SQL UPDATE on STEP\_END, NOT SQLite trigger.** Two reasons (section-5 §5.2):

(1) triggers are invisible to runtime, hard to debug; (2) trigger would fire on any future direct

DB writes (admin corrections), corrupting running total. Explicit UPDATE is grep-able, testable

with `--dry-run`, single-writer auditable.

```sql
UPDATE chain_runs
SET total_cost_usd = (SELECT COALESCE(SUM(cost_usd),0) FROM chain_steps WHERE chain_run_id = ?)
WHERE chain_run_id = ?;
```

WAL mode (already active per Litestream markers) provides read consistency for ceiling check

without blocking writer.

### Dual Write

After JSONL append, chain-runner.sh writes to `costs.db` via `cost-tracker.js trackCost()` with

`source: 'chain-runner'`, `agent_name: 'chain-runner/<chain_name>/<step_name>'`,

`metadata: {chain_run_id, step_index, attempt}`. No schema change to cost-tracker.js — agent\_name

encodes chain context, metadata carries structured chain\_run\_id for reconciliation.

### $3/Day Hard Ceiling

**Pre-flight check** (NOT post-hoc) BEFORE acquiring next chain task lease:

```bash
DAILY_SPEND=$(sqlite3 "${MC_DB}" "
  SELECT COALESCE(SUM(total_cost_usd), 0)
  FROM chain_runs
  WHERE started_at > datetime('now', '-1 day')
    AND status NOT IN ('cancelled', 'phantom_suspected');")
[ daily ≥ ceiling ] && {
  sqlite3 "${MC_DB}" "UPDATE task_scheduling
    SET cb_state='OPEN', cb_opened_at=datetime('now'), cb_reason='daily_cost_ceiling_exceeded'
    WHERE delegated_to='chain-runner' AND status IN ('open','pending');"
  slack #alerts; append COST_CEILING_BREACHED event; exit 0
}
```

`cb_state='OPEN'` is the same circuit-breaker mc.js already respects — natural halt without new

code path. **Recovery automatic:** 24h rolling window decays below $3.00 → next tick passes

ceiling check → dispatch resumes. Cron does NOT add duplicate chain tasks (existing-open check).

**Warn threshold:** $2.00 (67%) → Slack warning, continue dispatching. Externalize via

`~/system/config/chain-runner.env`:

```
CHAIN_DAILY_COST_CEILING_USD=3.00
CHAIN_COST_ALERT_THRESHOLD_USD=2.00
```

### cost-tracker.js `chain` Subcommand \[NEW\]

```
node ~/system/tools/cost-tracker.js chain <chain_name_or_run_id> [--window 7d|30d|all]
```

Cross-DB join (costs.db + mission-control.db). Output: per-run + per-step breakdown with cache hit

%, ACTUAL vs ESTIMATE variance, daily ceiling headroom. Variance &gt; 50% → exit 1 → chain-runner

appends `COST_VARIANCE_WARNING` event.

Full output sample + JS impl: section-5 §5.4 (lines 287-353).

### Cache Hit Telemetry

Columns added to chain\_steps DDL: `cache_creation_tokens`, `cache_read_tokens`, `cache_hit_ratio`.

`chain_runs.cache_hit_ratio_avg` updated at chain completion. Daily report at

`~/system/tools/chain-cache-report.js` \[NEW\] runs after last step of each chain run; checks

prefix-stability (SHA-256 of stable prompt portion vs yesterday).

### Phantom-Cost Detection (section-5 §5.6)

Inverse of MLX phantom: completed steps with `cost_usd=0` across all of them. Zero cost legitimate

ONLY for `type: deterministic` (shell, no Claude) or local Ollama/MLX (priced $0). 5+ consecutive

zero-cost agentic STEP\_END events = `phantom_suspected`. Stricter: 2+ zero-cost AND total=0 → flag.

False-positive escape: chain YAML opt-in `zero_cost_ok: true` (absent by default — safe side flags).

### Pipeline Cost Projection Table

<table id="bkmrk-pipelineruns%2Fmonthex"><thead><tr><th>Pipeline</th><th>Runs/month</th><th>Expected $/run (warm)</th><th>Cold $/run</th><th>Expected $/month</th><th>2× worst $/month</th></tr></thead><tbody><tr><td>daily-inbox-triage</td><td>30</td><td>$0.0319</td><td>$0.0811</td><td>$0.96</td><td>$4.87</td></tr><tr><td>weekly-client-report</td><td>4</td><td>$0.111</td><td>$0.255</td><td>$0.44</td><td>$4.40</td></tr><tr><td>email-pipeline-fix-loop</td><td>10</td><td>$0.140</td><td>$0.257</td><td>$1.40</td><td>$8.00</td></tr><tr><td>**Combined**</td><td>**44**</td><td>—</td><td>—</td><td>**$2.80**</td><td>**$17.27**</td></tr></tbody></table>

Section-5 §5.7 (lines 506-616) carries full per-step token + cost tables.

---

## §8. Threat Surface &amp; Mitigations

8 threats identified (Parisa §6). 3 HIGH must be addressed before ship.

### T1 — HIGH — Shell Injection in chain-runner.sh

**Vector:** YAML-derived strings (`skill`, `agent`, `evidence_paths[]`, `step_name`,

`compensate_on_failure`) interpolated unquoted into bash → arbitrary code execution as makinja.

YAML directory writable by any sub-agent (compromised sub-agent or malicious skill output).

**Mitigation (must-fix before ship):**

1. Always double-quote: `claude --skill "$SKILL_NAME"` (NEVER `$SKILL_NAME` bare)
2. Whitelist regex `^[a-zA-Z0-9_-]{1,64}$` on every YAML field BEFORE shell use; reject, do NOT

 sanitize

1. Path whitelist on `evidence_paths[]`: realpath must start with `~/system/state/chains/`
2. Load YAML via `node -e "require('js-yaml')"`, NEVER shell heredoc
3. shellcheck SC2086 (unquoted variable expansion) MUST remain active in CI

**Residual risk:** LOW after mitigations. Consider `chmod 444` on `~/system/agents/chains/` —

sub-agents have no legitimate reason to modify chain definitions.

### T2 — HIGH — APPROVE Token Forgery via mc.js Comment

**Vector:** chain-runner.sh runs as launchd daemon, NOT inside Claude Code session. Therefore:

- No UserPromptSubmit hook fires
- No `/tmp/ceo-turn-<session>.txt` exists for daemon
- PreToolUse Bash hooks only fire within Claude Code agent session
- `mc.js comment` `detectActor()` is hostname-only (mc.js:528-554, 2726-2745); no session token check

`ceo-token-origin-gate.sh` does NOT protect daemon-invoked `mc.js comment` path. Runaway chain-runner

or compromised sub-agent can self-issue APPROVE → advances chain past human gate. Same pattern as

"broken JOHN" 2026-05-02.

**Mitigation (must-fix before ship):**

1. **Switch to `mc.js approval_tokens` table** (already exists, mc.js:4130-4165), NOT comment parsing.

 chain-runner.sh writes `chain_steps.status='awaiting_human'` + new approval\_token record →

 exits. Next poll checks `approval_tokens WHERE status='approved'`. \*\*chain-runner NEVER writes

 APPROVE — only reads one.\*\*

1. CEO approval via `mc.js approval grant <token_id>` invoked in CEO's Claude Code session

 (where ceo-token-origin-gate.sh IS active)

1. mc.js `comment` handler extension: if text matches `\bAPPROVE\b/i` AND no

 `CLAUDE_SESSION_ID` env + valid CEO turn log → exit 2

**Residual risk:** MEDIUM. approval\_tokens table writable by any process running as makinja.

Cryptographic signing out of scope for v1 bash+Node. **System-wide implication:** mc.js comment

APPROVE forgery gap is NOT Pillar #5-specific — propose system-wide hardening MC (see §12).

### T3 — HIGH — Phantom-Link False-Negative (Empty File Bypass)

**Vector:** Sub-agent writes empty file at `output_path` before exiting → file existence check

passes → chain advances → downstream reads empty → silent failure or null propagation. Direct

analog of MLX HTTP 200 / no actual work pattern.

**Mitigation:**

1. `MIN_STEP_OUTPUT_BYTES=64` in phantom-link-detector.js
2. JSON parse + verify `chain_run_id` + `step_name` fields match
3. Re-compute SHA-256 vs `chain_steps.output_sha256` (catches replace-with-empty after checksum)
4. Find matching `STEP_END status:ok` in chain-events.jsonl
5. `required_output_fields[]` per-step YAML schema contract; non-null + non-empty validation

**Residual risk:** LOW. Phantom-link-detector is structural + provenance verifier, NOT semantic.

Semantic correctness = Proveo gate's domain.

### T4 — MED — JSONL Log Tampering

**Vector:** Same-UID sub-agents can append fake STEP\_END events; insert mid-file via temp+concat;

truncation attack.

**Mitigation:**

1. SHA-256 hash chain — each event includes `prev_hash` + `seq`; phantom-detector re-walks chain
2. `chain_runs.event_seq_watermark` — detects file shrinkage
3. `fs.open(path, 'a')` O\_APPEND only; never `writeFileSync` (truncates)
4. Sub-agent prompt explicit: "MUST NOT write to `~/system/logs/chains/`. Evidence files go to

 `~/system/state/chains/<chain_id>/<run_id>/` only."

1. File mode 0644 owned makinja (necessary, not sufficient — same-UID still writes)

**Residual risk:** MEDIUM. Hash chain prevents undetected injection but not full-log rewrite. True

append-only requires `chattr +a` (Linux) — macOS lacks direct equivalent. For ALAI v1 threat

model (accidental hallucination, not adversarial external attacker), hash chain + watermark sufficient.

### T5 — MED — Lease Takeover / Split-Brain

**Vector:** Two chain-runner instances acquire same chain\_run\_id lease → both dispatch same

sub-agent → both write same output\_path → both append STEP\_END → external POSTs execute twice.

**Mitigation:**

1. Atomic SQL UPDATE with WHERE clause on lease\_until — `changes()==0` → exit 0 (CAS pattern,

 section-1 §5.2 + section-6 mitigation 1). SQLite WAL serializes; impossible for two writers

 both see `changes==1`.

1. `node mc.js active | grep chain-runner` startup check (soft guard; SQL UPDATE is hard guard)
2. StartCalendarInterval (NOT KeepAlive) on chain trigger plists; ProcessType=Background
3. Log `chain_runs.runner_pid` on lease acquisition; second instance sees PID mismatch → exit clean

**Residual risk:** LOW after WAL atomic UPDATE. Standard SQLite job-queue pattern.

### T6 — HIGH — Secrets Exfiltration Through Skill Output

**Vector:** daily-inbox-triage reads inbox containing API keys, Bitwarden exports, GCP tokens,

contract terms. Prompt injection in malicious email → sub-agent writes secrets to chain-events.jsonl

or `~/system/state/chains/`. Persistent on disk (no TTL in spec); readable by future chains.

**No existing secret-scanner hook in `~/.claude/hooks/`** for Write/Edit on chain state dirs.

**Mitigation (must-fix before ship; NEW HOOK REQUIRED):**

1. **NEW hook `~/.claude/hooks/secret-scanner.sh`** \[NEW\] — PostToolUse Write/Edit targeting

 `~/system/state/chains/` and `~/system/logs/chains/`. Pattern:

```
   (sk-[a-zA-Z0-9]{20,}|AIza[0-9A-Za-z\-_]{35}|AKIA[0-9A-Z]{16}|ghp_[a-zA-Z0-9]{36}|
    -----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY|["\s]password["\s]*[:=]["\s]*[^\s]{8,})
```

 Match → block Write, log `/tmp/secret-scanner.log`, Slack alert

1. Sub-agent context isolation: email-fetch writes ONLY structured metadata (no raw bodies) to

 `step1.json`. If classification needs body content, ephemeral via Claude context window, NOT disk

1. `~/system/state/chains/` mode 0700 (owner-only)
2. Evidence file TTL: 72h auto-purge via phantom-link-detector.js or separate cleanup
3. Email-triage skill prompt explicit: "Process metadata only. Do not reproduce raw email body

 text. Write only {sender, subject, classification, priority, mc\_id}."

**Residual risk:** MEDIUM. Prompt injection in emails persistent. Recommend human-gate at

email-fetch→classify boundary for first 30 days production.

### T7 — MED — Cost Exhaustion DoS

**Vector:** Pre-flight gap (Lee §3 post-link timing) — single Opus call &gt;$1; three Opus calls

exhaust $3/day before ceiling fires post-step. Aggregator timing (per-run-completion not real-time)

allows runaway concurrent dispatch. Retry cascade: 2N invocations at $0.50/call.

**Mitigation:**

1. **Pre-STEP\_START budget check** — sum today's chain spend + step's `cost_estimate_usd` &gt;

 ceiling → halt + cb\_halted before dispatch

1. **REQUIRED `cost_estimate_usd` per link** in YAML; chain-runner rejects YAML at load time if absent
2. **Per-step `max_cost_usd` hard ceiling** — actual STEP\_END cost &gt; max → cb\_state=OPEN, halt chain
3. Model enforcement: chain YAML header `ceo_approved_opus_links: []` (empty = no Opus). Validator

 rejects `agent: opus` or `model: claude-opus*` without \[CEO\_APPROVED\] marker

**Residual risk:** LOW for intentional overspend after pre-flight. Pathological wrong estimate

(model pricing change) not covered → monthly pricing-update reminder.

### T8 — LOW — Liveness-Claim-Validator Bypass via Synonym Avoidance

**Vector:** `liveness-claim-validator.sh` CLAIM\_PATTERN at hook line 122 (regex citation, not a claim):

```
\bLIVE\b|\bverified\b|\boperational\b|cost=\$0 verified|\bDONE\b
```

Case-sensitive match.

Synonyms bypass: "finished", "completed", "succeeded", "all links passed", "pipeline up and running",

"confirmed", "active". MLX-style failure mode.

**Mitigation:**

1. Extend CLAIM\_PATTERN with chain-completion synonyms (false-positive-prone in legitimate context;

 prefer targeted approach)

1. **Dedicated `chain-completion-validator` hook** (PreToolUse Write, same three paths). Triggered

 on `chain_run_id` AND past-tense completion language. Required evidence: `chain-events\.jsonl` +

 `STEP_END` + `chain_run_id.*ok` + `postflight-cleared` within ±15 lines

1. **mc.js ready cross-validation:** before any chain-associated task ready, mc.js checks

 `chain_runs.status='succeeded'` AND all `chain_steps.status='done'` AND no `phantom_suspected`.

 Fail → exit 2

**Residual risk:** MEDIUM. Pattern-matching natural language inherently incomplete. Strongest

mitigation is structural: architecturally impossible for chain to declare done without runtime

STEP\_END + phantom-detector clearance.

### Threat Summary Table

<table id="bkmrk-%23threatsevexisting-d"><thead><tr><th>\#</th><th>Threat</th><th>Sev</th><th>Existing Defense</th><th>Gap</th><th>Ship-Blocking?</th></tr></thead><tbody><tr><td>T1</td><td>Shell injection (YAML→bash)</td><td>HIGH</td><td>bash-danger-gate (pipe-to-shell)</td><td>No pre-quoted variable expansion guard</td><td>YES</td></tr><tr><td>T2</td><td>APPROVE forgery (daemon path)</td><td>HIGH</td><td>ceo-token-origin-gate (Claude session only)</td><td>Daemon has no session; mc.js comment hostname-only</td><td>YES</td></tr><tr><td>T3</td><td>Phantom empty-file bypass</td><td>HIGH</td><td>phantom-link-detector existence check</td><td>No content/schema validation</td><td>NO (Phase 1b)</td></tr><tr><td>T4</td><td>JSONL tampering</td><td>MED</td><td>runtime-only convention</td><td>Same-UID sub-agents append; no integrity</td><td>NO (Phase 1b)</td></tr><tr><td>T5</td><td>Lease takeover</td><td>MED</td><td>task\_scheduling.lease\_until column</td><td>chain-runner.sh CAS UPDATE not yet written</td><td>NO (initial build)</td></tr><tr><td>T6</td><td>Secrets exfil through email</td><td>HIGH</td><td>bash-danger-gate (pipe-to-shell)</td><td>No credential scanner on chain state dir</td><td>YES (NEW HOOK)</td></tr><tr><td>T7</td><td>Cost exhaustion DoS</td><td>MED</td><td>$3/day post-link ceiling</td><td>Pre-flight gap; single Opus spike</td><td>NO (initial build)</td></tr><tr><td>T8</td><td>Validator synonym bypass</td><td>LOW</td><td>liveness-claim-validator (LIVE/DONE)</td><td>"finished/completed/succeeded" bypass</td><td>NO (post-v1)</td></tr></tbody></table>

**Three SHIP-BLOCKING child MCs identified (T1, T2, T6).**

Full exploitation paths + code snippets: section-6 §Threat 1-8 (lines 32-644).

---

## §9. Build-Phase Plan (Child MCs)

<table id="bkmrk-child-mcownerestimat"><thead><tr><th>Child MC</th><th>Owner</th><th>Estimate</th><th>Deliverables</th></tr></thead><tbody><tr><td>**\#99129-A**</td><td>Petter Graff (CodeCraft)</td><td>8h</td><td>`chain-runner.sh` ≤150 LOC + `phantom-link-detector.js` + Node helpers; shellcheck clean; dry-run; T1 + T5 mitigations baked in</td></tr><tr><td>**\#99129-B**</td><td>Bruce Momjian (CodeCraft)</td><td>1h</td><td>chain\_runs + chain\_steps DDL applied via migration to mission-control.db; verify `.schema` returns expected; chain\_runs\_archive ready</td></tr><tr><td>**\#99129-C**</td><td>Parisa Tabriz (Securion)</td><td>4h</td><td>NEW `~/.claude/hooks/secret-scanner.sh` (T6); mc.js comment APPROVE forgery hardening — switch to approval\_tokens primitive (T2); register hook in `~/.claude/settings.json`</td></tr><tr><td>**\#99129-D**</td><td>Hadi Hariri (CodeCraft)</td><td>1h</td><td>Commit 3 reference YAMLs to `~/system/agents/chains/`; chain-validator.js with 16 rules; V-09 + V-11 enforced</td></tr><tr><td>**\#99129-E**</td><td>Kelsey Hightower (FlowForge)</td><td>2h</td><td>Fix `com.john.db-backup` + `com.alai.cost-daily-report` to clear 80% gate; load 3 plists + watchdog integration; pre-flight checklist run</td></tr><tr><td>**\#99129-F**</td><td>Lee Robinson (CodeCraft)</td><td>2h</td><td>cost-tracker.js `chain` subcommand; AGENT\_COST event spec; pre-STEP\_START ceiling check (T7); chain-cache-report.js</td></tr><tr><td>**\#99129-G**</td><td>Angie Jones (Proveo)</td><td>4h</td><td>E2E validation: cold start, resume-after-kill mid-link 2, phantom trip, human-gate APPROVE advance + ceo-origin block, cost ≤$0.20/run. Real evidence in `/tmp/proveo-99129/`</td></tr><tr><td>**\#99129-H**</td><td>Skillforge</td><td>2h</td><td>BookStack page on Pillars shelf at https://docs.basicconsulting.no/books/agentic-os/page/pillar-5-skill-chains</td></tr></tbody></table>

**Ship sequence:**

1. -A through -F build artefacts (parallel where independent)
2. -B + -E gate: 80% daemon health PASSED + DDL applied → unblock plist load
3. **daily-inbox-triage v1 ships first**
4. **7 clean days** (no phantom alerts, cost within projection, no broken-JOHN events) → ship

 weekly-client-report v2

1. **7 more clean days** → ship email-pipeline-fix-loop v3
2. -G validation runs end-to-end against daily-inbox-triage v1 BEFORE marking Pillar #5 done
3. -H BookStack publish

---

## §10. Acceptance Criteria for #99129 (DESIGN PHASE)

<table id="bkmrk-accriterionverificat"><thead><tr><th>AC</th><th>Criterion</th><th>Verification</th></tr></thead><tbody><tr><td>AC1</td><td>Spec file exists at `/Users/makinja/system/specs/agentic-os-pillar5-skillchains-2026-05-05.md`</td><td>`ls -la <path>` returns file</td></tr><tr><td>AC2</td><td>All 6 specialist sections referenced with file paths</td><td>grep `/tmp/pillar5-99129/section-` in spec returns 6+</td></tr><tr><td>AC3</td><td>3 reference pipeline YAMLs documented (NOT committed — that's build phase)</td><td>§5 contains daily-inbox-triage, weekly-client-report, email-pipeline-fix-loop section headers + cite section-3 line ranges</td></tr><tr><td>AC4</td><td>Threat mitigations enumerated with severity</td><td>§8 contains 8-row threat table with HIGH/MED/LOW + ship-blocking column</td></tr><tr><td>AC5</td><td>Cost envelope + ship gate documented</td><td>§1 + §6 + §7 contain $3/day ceiling + 80% fleet gate + projected $2.80/mo</td></tr><tr><td>AC6</td><td>ZAKON PLAN compliance (validation + docs in §11)</td><td>§11 lists Proveo MC + Skillforge MC explicitly</td></tr><tr><td>AC7</td><td>4 forged-prompt unresolved items resolved or escalated</td><td>DB location resolved §3; lease-column resolved §3; language guard resolved §4; saga-scope resolved §4</td></tr></tbody></table>

---

## §11. ZAKON PLAN — Required Follow-Up Tasks

**Both REQUIRED before Pillar #5 can be marked done.**

### Validation (Proveo / Angie Jones) — Child MC #99129-G

End-to-end test of `daily-inbox-triage` v1 with REAL evidence (NOT dry-run only):

1. Cold start — fresh chain\_runs row, fresh JSONL file
2. Resume-after-kill mid-link 2 — `kill -9 chain-runner` → next tick reads

 `WHERE status IN ('pending','failed')` ORDER BY id ASC → resumes from link 2 only,

 links 0-1 untouched

1. Phantom-link-detector trip — `rm <output_path>` mid-run → next 15-min tick → status=phantom\_suspected,

 downstream `mc.js done` BLOCKED by postflight gate

1. Human-gate halt + `mc.js comment <chain_task_id> "APPROVE <run_id>"` advances chain;

 ceo-token-origin-gate blocks self-issued APPROVE (T2 mitigation verified)

1. Cost ≤$0.20/run (warm) verified via `cost-tracker.js chain daily-inbox-triage --window 1d`

Real evidence committed to `/tmp/proveo-99129/`. PASS verdict at

`/tmp/proveo-99129-postflight.json`.

### Documentation (Skillforge) — Child MC #99129-H

BookStack publish on Pillars shelf:

- URL: https://docs.basicconsulting.no/books/agentic-os/page/pillar-5-skill-chains
- Cover: architecture, chain\_runs/chain\_steps DDL, runbook for adding 4th chain,

 phantom-link-detector incident response, T1-T8 threat playbook

- URL recorded in `/Users/makinja/system/state/postflight-cleared-99129.json`

---

## §12. Open Questions / Risks

1. **pipeline.db unused for chains.** Bruce's section-1 schema read found zero chain-shaped

 tables. Should it be archived or repurposed in Pillar #11 cleanup MC? \*\*Defer to Pillar #11

 cleanup; not Pillar #5 scope.\*\*

1. **Daemon-fleet 79.7% structural risk.** Chains depend on a partially-broken scheduler. Even

 after fixing db-backup + cost-daily-report (target 81%+), 19% of fleet remains unhealthy.

 The phantom-detection subsystem itself depends on a healthy phantom-detector plist. Recommend

 quarterly daemon-fleet audit MC as ongoing operational hygiene.

1. **mc.js comment APPROVE forgery gap (T2) is system-wide.** Affects ALL daemon-invoked mc.js

 comment paths (not just chain-runner). Propose **system-wide hardening MC** —

 "Audit all daemon → mc.js comment paths for APPROVE forgery vector" — separate from #99129-C

 which only addresses Pillar #5 surface.

1. **First-ship epistemics tension (Parisa flag).** daily-inbox-triage is LOWEST complexity but

 HIGHEST data sensitivity (CEO inbox contains secrets). Recommend human-gate at email-fetch→

 classify boundary for first 30 days production, in addition to send-link gate. Trade slight

 automation cost for substantially reduced exfiltration blast radius.

1. **Cache prefix invalidation risk.** Cost projections assume ≥75% cache hit on warm runs. If

 chain system prompt changes (new skill schema, system prompt edit), cache\_creation\_tokens

 spike + cache\_read\_tokens drop to zero → costs spike 2-4×. `chain-cache-report.js` prefix-

 stability check (SHA-256 of stable prompt portion) flags this within 1 day.

1. **24h vs 7d vs 30d retention windows** are estimates from Bruce §6 + petter-graff §3. May

 need tuning after first 30 days production telemetry — increment via build-phase MC, not

 spec change.

---

## Appendix A — Source Section Files

<table id="bkmrk-filelinesauthordomai"><thead><tr><th>File</th><th>Lines</th><th>Author</th><th>Domain</th></tr></thead><tbody><tr><td>`/tmp/pillar5-99129/section-1-storage-schema.md`</td><td>763</td><td>Bruce Momjian</td><td>DDL + DB location decision + retention</td></tr><tr><td>`/tmp/pillar5-99129/section-2-chain-runner-architecture.md`</td><td>722</td><td>Petter Graff</td><td>bash runner + signals + phantom detector + saga + language guard</td></tr><tr><td>`/tmp/pillar5-99129/section-3-yaml-pipelines.md`</td><td>882</td><td>Hadi Hariri</td><td>YAML schema + 3 pipelines + validation V-01..V-16 + versioning</td></tr><tr><td>`/tmp/pillar5-99129/section-4-launchd-daemon.md`</td><td>522</td><td>Kelsey Hightower</td><td>plist XML + daemon-fleet pre-flight + KeepAlive matrix + 80% gate</td></tr><tr><td>`/tmp/pillar5-99129/section-5-cost-telemetry.md`</td><td>644</td><td>Lee Robinson</td><td>AGENT\_COST event + ceiling + cache hit + phantom-cost + cost-tracker.js extension</td></tr><tr><td>`/tmp/pillar5-99129/section-6-threat-surface.md`</td><td>723</td><td>Parisa Tabriz</td><td>8 threats T1-T8 + exploitation paths + mitigations + ship-blocking matrix</td></tr><tr><td>**Total source lines**</td><td>**4256**</td><td>—</td><td>—</td></tr><tr><td>`/Users/makinja/system/prompts/forged/99129.md`</td><td>121</td><td>Synthesizer (petter-graff)</td><td>Forged baseline with 10 disagreements + decisions</td></tr></tbody></table>

---

## Appendix B — Disagreements Resolved

Per `/Users/makinja/system/prompts/forged/99129.md` <disagreements> block:</disagreements>

<table id="bkmrk-%23tensionresolutionci"><thead><tr><th>\#</th><th>Tension</th><th>Resolution</th><th>Citation</th></tr></thead><tbody><tr><td>1</td><td>Recursive sub-agent vs cron+stateless</td><td>**HYBRID**: chain-runner.sh = cron+stateless trigger and link advancer; each link's work dispatched as sub-agent Task call (inherits 64-gate stack)</td><td>§2 Architecture Overview; section-2 §2.0</td></tr><tr><td>2</td><td>Checkpoint storage shape (JSONL vs JSON file vs SQL)</td><td>**BOTH SQL + JSONL, NO state-file**. SQL = canonical state (durability, queryable, joins); JSONL = runtime-written event log; state-file rejected</td><td>§3 Storage; section-1 §4</td></tr><tr><td>3</td><td>DB location (mission-control.db vs pipeline.db)</td><td>**RESOLVED — mission-control.db**. Bruce's section-1 schema read confirmed pipeline.db has zero chain-shaped tables; default rule applies</td><td>§3 Storage; section-1 §1</td></tr><tr><td>4</td><td>Human-in-loop primitive (mc.js comment vs file-flag)</td><td>**mc.js comment canonical, Slack mirror-only.** File-flag bypasses ceo-token-origin-gate (broken-JOHN pattern) → REJECTED. T2 in §8 hardens this further (use approval\_tokens, not comment parsing)</td><td>§5 + §8 T2; section-6 Threat 2</td></tr><tr><td>5</td><td>Saga compensation pattern</td><td>**Per-link attribute, not chain-wide.** External-API boundary only (email send, BookStack publish). Read-only links + idempotent DB writes do NOT need compensation</td><td>§4 Saga; section-2 §2.5</td></tr><tr><td>6</td><td>OpenAI Agents SDK</td><td>**Both panelists rule out.** Hard constraint regardless</td><td>§4 Implementation Language Guard; section-2 §2.8</td></tr><tr><td>7</td><td>Implementation language</td><td>**bash + Node ONLY.** No Kotlin (rabbit-hole genesis), no TypeScript (silent build failure), no Python beyond fsync one-liner</td><td>§4 Implementation Language Guard; section-2 §2.8</td></tr><tr><td>8</td><td>First-ship pipeline ranking</td><td>**AGREED: daily-inbox-triage first; email-pipeline-fix-loop last.** Build order: triage → weekly → fix-loop. Fix-loop ships ONLY after 7 clean days on first two</td><td>§5 + §9 Build-Phase Plan</td></tr><tr><td>9</td><td>liveness-claim-validator integration</td><td>**Hook is LIVE per `/Users/makinja/system/state/postflight-cleared-99127.json` 2026-05-05T06:54Z. Pillar #5 reuses, does not reimplement.** Operationalized via phantom-link-detector</td><td>§2 Components Diagram; §8 T8</td></tr><tr><td>10</td><td>ANVIL/FORGE distributed nature</td><td>**v1 single-host (ANVIL), distributed concerns documented for v2.** UUID `chain_run_id` adopted NOW to avoid future schema migration. Clock skew, lid-close, mc.js↔JSONL skew already mitigated in v1</td><td>§2 Boundary Rules; section-2 §2.7</td></tr></tbody></table>

All 10 forged-prompt disagreements resolved. AC7 satisfied.

---

*End of spec — Pillar #5 Skill-Chain Meta-Orchestrator design.*

*Next: Mehanik clearance → child MC dispatch (#99129-A through -H) → 80% daemon health gate → daily-inbox-triage v1 first ship.*

*Bismillah.*