Claude Code Multi-Session Isolation

Status: Production (all 7 P0 resources verified SAFE)
Date: 2026-05-18
Owner: Petter Graff (architect), CodeCraft (implementation), Proveo (validation), Securion (threat review)
Parent MC: #101305 (Phase 2)

---

## What Broke

From 2026-05-13 onward, ALAI runs 6+ concurrent Claude Code sessions daily (12 sessions on 2026-05-15). Each session writes to shared state files with zero locking. On 2026-05-18 at 14:42, `~/system/memory/SESSION-STATE.md` was rewritten mid-session from session `256da42c` to session `a10b7bc9` between two reads in the same `/sync` skill invocation — John's continuity context silently flipped to another session's "Next Steps."

Three CEO-visible collisions confirmed before probing began:
1. Session continuity lost — John's "Next Steps" overwritten by last-writer-wins across concurrent sessions
2. Gate verdicts corrupted — `last-validator-verdict.json` written by session A, read by session B's `mc.js done`, passing/failing the wrong task
3. Cost tracking undercount — 1 of 4 concurrent Stop hooks' INSERTs lost in `costs.db`, causing `cost-tracker.js summary` to understate spend

The multi-session concurrency rate is accelerating: 6 sessions/day in May 2026 is 3× the February baseline. Without isolation, the collision surface grows quadratically.
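
One way to read "quadratically", assuming every pair of concurrent sessions is a potential collision on a shared resource, is that n sessions form n(n-1)/2 pairs. The sketch below works that arithmetic for the session counts mentioned above; `collision_pairs` is an illustrative helper, not part of the ALAI tooling, and the pairwise model itself is an assumption.

```bash
# Assumption: collision surface ~ number of unordered session pairs, n*(n-1)/2.
collision_pairs() {
  local n="$1"
  echo $(( n * (n - 1) / 2 ))
}

# 2 = February baseline implied by "3x", 6 = May 2026, 12 = peak on 2026-05-15.
for n in 2 6 12; do
  echo "sessions=$n pairs=$(collision_pairs "$n")"
done
```

Under this reading, going from 2 to 6 daily sessions raises the pair count from 1 to 15, and the 12-session peak gives 66.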

---

## Collision Ledger

Empirical probe evidence from `/tmp/session-collision-20260518T{143721,143735}/probe.jsonl` (T3 Phase 1):

| P0 # | Resource | Path | Probe Verdict | Before-Fix Blast Radius |
|------|----------|------|---------------|-------------------------|
| P0-1 | SESSION-STATE.md | `~/system/memory/SESSION-STATE.md` | LAST_WRITER_WINS (A:line 6, B:line 8) | John's continuity context; "Next Steps" lost between sessions |
| P0-2 | last-validator-verdict.json | `~/system/state/last-validator-verdict.json` | LAST_WRITER_WINS (A:line 26, B:line 36) | Gate verdict read by wrong session; silent `mc.js done` pass/fail corruption |
| P0-3 | .ledger-root-hash | `~/system/state/.ledger-root-hash` | LAST_WRITER_WINS (A:line 31, B:line 43) | Evidence integrity check bypassed; stale hash passed when ledger changed |
| P0-4 | costs.db | `~/system/databases/costs.db` | SAFE at w=2 (A:line 16), LAST_WRITER_WINS at w=4 (B:line 22, 1 INSERT lost) | Financial audit trail undercount; CEO cost reports incorrect |
| P0-5 | incident_mode flag | `/tmp/incident-mode` | LAST_WRITER_WINS (A:line 41, B:line 57) | One session's incident response silently cleared by unrelated session |
| P0-6 | prompt_forge active | `/tmp/prompt-forge-active` | LAST_WRITER_WINS (A:line 46, B:line 64) | Model-override gate suppressed/enabled globally for all sessions |
| P0-7 | skill-registry.db | `~/system/databases/skill-registry.db` | LAST_WRITER_WINS at w=2 (A:line 21, 1 increment lost), non-deterministic at w=4 (B:line 29 SAFE) | Skill-use telemetry undercount degrades routing decisions |

Probed: 8 of 71 T1 inventory resources. P1 (13 resources) and P2 (14 resources) deferred.

---

## Isolation Model

Seven P0 collisions → three patterns applied:

### Pattern 1: Per-Session-Path (P0-1, P0-2, P0-5, P0-6)

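Each session writes to its own file, named by adding a per-session suffix to the base name (see the paths below), instead of a single global file. At session boot (P0-1 only), compaction merges all per-session files modified within the last 4h into the canonical view.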

Implementation:
- P0-1: `SESSION-STATE-.md` written by `session-ledger.sh`; compacted by `enforce-next-steps.sh` at boot (lines 62-108); cleanup in `parent-session-cleanup.sh` (line 74)
- P0-2: `last-validator-verdict-.json` written by `session-output-validator.sh` (lines 491, 549); `mc.js done` reads per-session path (lines 2939-2966) with fail-closed gate if absent
- P0-5: `/tmp/incident-mode-` written by `incident-response-mode.sh` (lines 31-42); orphan purge at 4h (lines 52-59)
- P0-6: `/tmp/prompt-forge-active-` set by `/prompt-forge` skill (SKILL.md Step 0, line 57); reader bypass in `sonnet-default-gate.sh` (line 108) and `claude-sonnet-default.sh` (line 16)

Rollback: Set `ISOLATION_SESSION_STATE_SCOPED=0`, `ISOLATION_VERDICT_SESSION_SCOPE=0`, `ISOLATION_INCIDENT_SESSION_SCOPE=0`, or `ISOLATION_PROMPTFORGE_SESSION_SCOPE=0` to revert individual resources.
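
The path selection behind Pattern 1 can be sketched as below. This is a minimal illustration, not the code in `session-ledger.sh`: the helper name `state_path`, its argument order, and the use of the session identifier as the suffix are assumptions; only the flag name and its default of `1` come from this page.

```bash
# Sketch only. state_path and its arguments are hypothetical.
# Assumption: the per-session suffix is the session identifier.
state_path() {
  local dir="$1" base="$2" ext="$3" session_id="$4"
  # Flag defaults to 1 (isolation on); only an explicit 0 reverts to the global file.
  if [ "${ISOLATION_SESSION_STATE_SCOPED:-1}" != "0" ]; then
    echo "$dir/$base-$session_id.$ext"
  else
    echo "$dir/$base.$ext"
  fi
}

# Per-session path (example session id taken from the incident above).
state_path "$HOME/system/memory" SESSION-STATE md 256da42c

# Rolled back: the same call resolves to the single global file.
( ISOLATION_SESSION_STATE_SCOPED=0; state_path "$HOME/system/memory" SESSION-STATE md 256da42c )
```

Keeping the flag check inside the path resolver means writers and readers that share the helper cannot disagree about which file is authoritative.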

### Pattern 2: Advisory Lock via lockf (P0-3)

macOS ships `lockf(1)` at `/usr/bin/lockf` (not GNU `flock(1)`). Exclusive lock wraps `mc.js ready` invocation; lock released by kernel on process death (SIGKILL-safe per T8 Q1 live test).

Implementation:
- `mc-ready-gate.sh` (lines 98-112): `lockf -k -t 30 ~/system/state/.ledger-root-hash.lock node ~/system/tools/mc.js ready`
- Lock file kept via `-k` flag for reuse
- Fail-closed: exits 2 if `lockf` binary absent

Rollback: Set `ISOLATION_LEDGER_HASH_FLOCK=0`.
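
A fail-closed wrapper along these lines might look like the sketch below. The `lockf` flags, the exit code 2, and the flag name come from this page; `run_with_ledger_lock` and `LOCKF_BIN` are hypothetical names, and treating flag value `0` as "run unlocked" is an assumption about what rollback means here.

```bash
# Sketch only. LOCKF_BIN and run_with_ledger_lock are hypothetical names.
LOCKF_BIN="${LOCKF_BIN:-/usr/bin/lockf}"

run_with_ledger_lock() {
  local lock_file="$1"
  shift
  # Assumption: flag set to 0 means run the command without the lock.
  if [ "${ISOLATION_LEDGER_HASH_FLOCK:-1}" = "0" ]; then
    "$@"
    return $?
  fi
  # Fail closed: refuse to run unlocked when the lockf binary is missing.
  if [ ! -x "$LOCKF_BIN" ]; then
    echo "lockf not found at $LOCKF_BIN; refusing to run unlocked" >&2
    return 2
  fi
  # -k keeps the lock file for reuse; -t 30 gives up after 30 seconds.
  "$LOCKF_BIN" -k -t 30 "$lock_file" "$@"
}

# Usage, mirroring the mc-ready-gate.sh invocation (not run here):
# run_with_ledger_lock ~/system/state/.ledger-root-hash.lock node ~/system/tools/mc.js ready
```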

### Pattern 3: SQLite WAL + BEGIN IMMEDIATE + Retry (P0-4, P0-7)

SQLite Write-Ahead Log (WAL) mode + `BEGIN IMMEDIATE` transaction + application-layer retry loop (an immediate first attempt, then up to 5 retries with backoffs of 50ms, 100ms, 200ms, 400ms, 800ms).

Why BEGIN IMMEDIATE was required:
- T9 added `PRAGMA busy_timeout` but used DEFERRED transactions (default in sqlite3)
- Under w=4 burst, multiple connections acquired SHARED locks simultaneously; first write triggered RESERVED lock race → silent INSERT loss (costs.db) and UPDATE non-determinism (skill-registry.db)
- `BEGIN IMMEDIATE` acquires RESERVED lock upfront; only one writer proceeds, others get `SQLITE_BUSY` immediately and retry in application layer

Implementation:
- P0-4: `claude-cli-cost-hook.sh` (lines 135-215): Python `isolation_level=None` (autocommit mode), `BEGIN IMMEDIATE`, INSERT, `COMMIT`, wrapped in retry loop
- P0-7: `skill-use-counter.sh` (lines 24-60): bash heredoc `BEGIN IMMEDIATE; UPDATE; COMMIT;`, wrapped in retry loop
- Both DBs already in WAL mode (confirmed: `sqlite3 "PRAGMA journal_mode;"` → `wal`)
- Exit-code check + `BUSY_TIMEOUT_EXHAUSTED` / `SKILL_DB_ERROR_FINAL` log on retry exhaustion

Rollback: Set `ISOLATION_SQLITE_WAL=0`.
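
The retry loop can be sketched in bash roughly as follows. It reads the schedule above as one immediate attempt followed by five backed-off retries, which is an interpretation. `retry_with_backoff`, `increment_skill_use`, and the `skill_usage` table and columns are hypothetical; the real hooks log `BUSY_TIMEOUT_EXHAUSTED` or `SKILL_DB_ERROR_FINAL` where this sketch prints a generic message.

```bash
# Sketch only. Function names and the skill_usage schema are hypothetical.
retry_with_backoff() {
  local delay
  # Immediate first attempt, then one retry after each backoff delay (seconds).
  for delay in 0 0.05 0.1 0.2 0.4 0.8; do
    sleep "$delay"
    if "$@"; then
      return 0
    fi
  done
  echo "retries exhausted: $*" >&2
  return 1
}

increment_skill_use() {
  local db="$1" skill="$2"
  # -bail stops at the first error, so a failed BEGIN is not followed by the UPDATE.
  # The skill name is interpolated unescaped: acceptable for a sketch only.
  sqlite3 -bail "$db" <<SQL
BEGIN IMMEDIATE;
UPDATE skill_usage SET uses = uses + 1 WHERE name = '$skill';
COMMIT;
SQL
}

# Usage (not run here):
# retry_with_backoff increment_skill_use ~/system/databases/skill-registry.db sync
```

Checking the exit status of each attempt is the part that matters: a writer that loses the `BEGIN IMMEDIATE` race fails loudly and is retried instead of being dropped.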

---

## Feature Flags

Six flags control isolation behavior (all default `1` = on):

| Flag | Controls | Revert Path |
|------|----------|-------------|
| `ISOLATION_SESSION_STATE_SCOPED` | P0-1 per-session SESSION-STATE | Revert `session-ledger.sh` write target; disable compaction in `enforce-next-steps.sh` |
| `ISOLATION_VERDICT_SESSION_SCOPE` | P0-2 per-session verdict | Revert `session-output-validator.sh` write path + `mc.js` done gate check |
| `ISOLATION_LEDGER_HASH_FLOCK` | P0-3 lockf advisory lock | Remove `lockf` wrapper from `mc-ready-gate.sh` |
| `ISOLATION_SQLITE_WAL` | P0-4 costs.db + P0-7 skill-registry.db BEGIN IMMEDIATE + retry | Revert to PRAGMA-only or bare INSERT/UPDATE |
| `ISOLATION_INCIDENT_SESSION_SCOPE` | P0-5 per-session incident flag | Revert `incident-response-mode.sh` to global `/tmp/incident-mode` |
| `ISOLATION_PROMPTFORGE_SESSION_SCOPE` | P0-6 per-session prompt-forge marker | Revert `sonnet-default-gate.sh` + skill SKILL.md to global path |

Set any flag to `0` in `~/.claude/settings.local.json` env block or export in hook environment to disable.
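
To see which flags are effective in a given hook or shell environment, a small helper like the one below can print them. This is a sketch: `print_isolation_flags` is a hypothetical name, and it only inspects the current process environment, not the contents of `settings.local.json`.

```bash
# Sketch only. Unset flags are reported as 1, the default stated above.
print_isolation_flags() {
  local flag value
  for flag in \
    ISOLATION_SESSION_STATE_SCOPED \
    ISOLATION_VERDICT_SESSION_SCOPE \
    ISOLATION_LEDGER_HASH_FLOCK \
    ISOLATION_SQLITE_WAL \
    ISOLATION_INCIDENT_SESSION_SCOPE \
    ISOLATION_PROMPTFORGE_SESSION_SCOPE
  do
    # Indirect lookup by name; the names are fixed above, so eval is safe here.
    eval "value=\${$flag:-1}"
    echo "$flag=$value"
  done
}

print_isolation_flags
```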

---

## Validation

### Final Evidence (T10-ter, MC #101325)

Four validation runs with updated harness (sha256 `acdbcd6abea1f1085f7c88056e59c747d073da6756889e9dcf5d54babd0bcfe3`):

| Run | Mode | Writers | Verdict | Probe Path |
|-----|------|---------|---------|------------|
| G | default | 2 | P0-4 SAFE [line 16], P0-7 SAFE [line 21]; P0-1/2/3/5/6 LWW expected in default mode | `/tmp/session-collision-20260518T160822/probe.jsonl` (sha256: `8da33aee...`) |
| H | default | 4 | P0-4 SAFE [line 22], P0-7 SAFE [line 29]; P0-1/2/3/5/6 LWW expected | `/tmp/session-collision-20260518T160829/probe.jsonl` (sha256: `2c13824e...`) |
| I | per-session | 2 | All 5 per-session P0s SAFE (lines 5,9,13,17,21) | `/tmp/session-collision-20260518T160837/probe.jsonl` (sha256: `c20ebf1e...`) |
| J | per-session | 4 | All 5 per-session P0s SAFE (lines 7,13,19,25,31) | `/tmp/session-collision-20260518T160843/probe.jsonl` (sha256: `cecccfc1...`) |

Stability: Run H repeated 3× (H-2, H-3, H-4) — P0-4 SAFE 3/3, P0-7 SAFE 3/3. Total: 4/4 SAFE at w=4 for SQLite resources.

### Before-After Summary

| P0 # | T3 Baseline (pre-fix) | T10-ter (post-fix) |
|------|-----------------------|--------------------|
| P0-1 | LWW at w=4 | SAFE in per-session mode (Run J line 7) |
| P0-2 | LWW at w=4 | SAFE in per-session mode (Run J line 13) |
| P0-3 | LWW at w=4 | SAFE in per-session mode (Run J line 19, lockf) |
| P0-4 | LWW at w=4 (1 INSERT lost) | SAFE at w=4 (Run H line 22, BEGIN IMMEDIATE) |
| P0-5 | LWW at w=4 | SAFE in per-session mode (Run J line 25) |
| P0-6 | LWW at w=4 | SAFE in per-session mode (Run J line 31) |
| P0-7 | LWW at w=2 (non-deterministic) | SAFE at w=4 (Run H line 29, BEGIN IMMEDIATE) |

---

## Runbook

### 1. How to Detect a Collision

Run the collision harness against production state (read-only inventory mode) or against `/tmp` sandbox fixtures (write mode):

```bash
# Production read-only inventory (lists shared resources, no writes)
bash ~/system/tools/diagnose-session-collision.sh --inventory-only

# Sandbox collision test — default mode (simulates pre-fix behavior for comparison)
bash ~/system/tools/diagnose-session-collision.sh --writers 4 --targets all

# Sandbox collision test — per-session mode (simulates post-fix production)
bash ~/system/tools/diagnose-session-collision.sh --per-session-mode --writers 4 --targets per-session-all
```

Expected output post-fix:
- Default mode: P0-1/2/3/5/6 show `LAST_WRITER_WINS` (correct — single fixture path simulates the race), P0-4/7 show `SAFE`
- Per-session mode: All 5 per-session P0s (`session_state_ps`, `last_verdict_ps`, `ledger_hash_ps`, `incident_mode_ps`, `prompt_forge_ps`) show `SAFE`

Verdict location: `/tmp/session-collision-/probe.jsonl` — each line is a JSON verdict with fields: `ts`, `resource`, `verdict`, `writers`, `pre_hash`, `post_hash`, `lost_writers`, `deadlocked_writers`
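
To pull only the non-SAFE verdicts out of a probe file, a filter like the following may help. It assumes one JSON object per line with a `verdict` string field, as described above, and makes no assumption about whitespace inside the JSON. `unsafe_verdicts` is a hypothetical helper name.

```bash
# Sketch only. Prints "<line number>:<json>" for every verdict that is not SAFE.
# Exit status 1 means no non-SAFE verdicts were found.
unsafe_verdicts() {
  local probe_file="$1"
  grep -n '"verdict"' "$probe_file" \
    | grep -Ev '"verdict"[[:space:]]*:[[:space:]]*"SAFE"'
}

# Usage (not run here), with probe_dir set to a harness output directory:
# unsafe_verdicts "$probe_dir/probe.jsonl"
```

The line-number prefix matches how verdicts are cited elsewhere on this page.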

### 2. How to Roll Back Any Single Isolation

Set the corresponding feature flag to `0`:

```bash
# Roll back P0-1 (SESSION-STATE per-session)
export ISOLATION_SESSION_STATE_SCOPED=0

# Roll back P0-4 + P0-7 (SQLite BEGIN IMMEDIATE)
export ISOLATION_SQLITE_WAL=0

# Roll back P0-3 (lockf on ledger-root-hash)
export ISOLATION_LEDGER_HASH_FLOCK=0
```

Persistent rollback: Add to `~/.claude/settings.local.json`:
```json
{
  "env": {
    "ISOLATION_SESSION_STATE_SCOPED": "0"
  }
}
```

Validation: Re-run harness with the flag disabled to confirm rollback worked.

IMPORTANT: Rolling back P0-4 or P0-7 restores the LAST_WRITER_WINS collision at w=4. Only roll back if BEGIN IMMEDIATE is causing production deadlocks (none observed in 4 validation runs + 3 stability repeats).

### 3. How to Add a New Shared Resource to Isolation

When a new shared resource is identified (e.g., a new `/tmp/global-marker` file or a new SQLite DB):

Step 1: Add to inventory

Edit `~/system/specs/multi-session/shared-state-inventory.md` (T1 artifact):
- List the resource path
- Classify: `per-session` | `global-single-writer` | `global-multi-writer` | `external-singleton`
- Cite the file/line that proves it is touched (e.g., `hook-name.sh:42`)

Step 2: Write a probe in the harness

Edit `~/system/tools/diagnose-session-collision.sh`:
- Add a `writer_` function that writes to a sandbox fixture
- Add a verdict function if the resource needs custom logic (e.g., per-session file enumeration, lock-attempt counting)
- Add the resource name to the `TARGETS` array
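
The harness's real function signatures are not shown on this page, so the following is only a shape sketch of a writer plus a custom verdict for a per-session resource. `writer_example`, `verdict_example`, `FIXTURE_DIR`, and the `global-marker-` file name are all hypothetical; check `diagnose-session-collision.sh` for the actual conventions before copying anything.

```bash
# Sketch only. All names below are hypothetical, not the harness's real interface.
FIXTURE_DIR=$(mktemp -d)

# Each writer writes its own per-session fixture file in the sandbox.
writer_example() {
  local writer_id="$1"
  echo "payload from writer $writer_id" > "$FIXTURE_DIR/global-marker-$writer_id"
}

# SAFE when every writer's file survived; otherwise report how many were lost.
verdict_example() {
  local writers="$1" found
  found=$(find "$FIXTURE_DIR" -type f -name 'global-marker-*' | wc -l | tr -d ' ')
  if [ "$found" -eq "$writers" ]; then
    echo "SAFE"
  else
    echo "LAST_WRITER_WINS lost_writers=$((writers - found))"
  fi
}

# Launch 4 concurrent writers, wait for all of them, then compute the verdict.
for id in 1 2 3 4; do
  writer_example "$id" &
done
wait
verdict_example 4
```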

Step 3: Run the harness

```bash
# Replace NEW_TARGET with the resource name added to the TARGETS array in Step 2
bash ~/system/tools/diagnose-session-collision.sh --writers 4 --targets NEW_TARGET
```

Step 4: Decide pattern from catalogue

From `/Users/makinja/system/specs/multi-session/isolation-model.md` §2 (Pattern Catalogue):
- per-session-path: Single-consumer or append-only state (e.g., session logs)
- advisory-flock (lockf): Last-writer-wins file with single authoritative value (e.g., a hash file)
- SQLite WAL + BEGIN IMMEDIATE + retry: SQLite DB with concurrent INSERTs/UPDATEs
- CAS lease (mc.js claim): Cross-session resource allocation (e.g., task claiming)
- singleton-broker queue: High-risk writes that need daemon supervision (e.g., MEMORY.md)
- deprecate-and-replace: The global resource is a design defect; eliminate it

Step 5: Implement the pattern

Follow the implementation notes in `isolation-model.md` §4 (Per-P0 Design Table). Add a feature flag (e.g., `ISOLATION_NEW_RESOURCE=1`) for rollback safety.

Step 6: Validate

Run `diagnose-session-collision.sh` with the new isolation enabled. Verdict must be `SAFE` at w=4.

Step 7: Update this runbook

Add the new resource to the Collision Ledger table above and document the chosen pattern + rollback flag.

---

## Known Limitations

### P1 Resources (13 total) — Not Yet Addressed

From `COLLISION-LEDGER.md` rows 8-17:
- `lightrag-ingest-health.json` — SAFE at w=2, LAST_WRITER_WINS at w=4 (2 of 4 increments lost)
- `evidence-ledger.jsonl` — not probed; suspected interleaved appends under concurrent `mc.js done`
- `evidence-index.jsonl` — not probed; read at session boot without write lock
- Mehanik cleared markers (`/tmp/mehanik-cleared-`) — not probed; two sessions on same MC can both see cleared marker
- Evidence dirs (`/tmp/evidence-/`) — not probed; numeric sequence collision risk
- Claim schema stubs (`/tmp/claim-schema-.json`) — not probed; two sessions on same MC write conflicting schemas
- Hop-build started markers (`/tmp/hop-build-started-`) — not probed; 8 stale files present; double-build or skip-build risk
- Opus override token (`/tmp/opus-override-token`) — not probed; non-atomic consume allows two sessions to bypass cost gate
- John bash override token (`/tmp/john-bash-override-token`) — not probed; same TOCTOU as opus token
- MCP Playwright server (singleton) — not probed; unknown whether browser contexts are session-isolated
- LightRAG ingest API (`http://localhost:9621`) — not probed; concurrent POST from all sessions; LightRAG's own concurrency handling unverified
- MEMORY.md daemon write path — not probed; memory-writer.js queue serialisation under concurrent flush requests

Require Phase 2 sprint 2 or explicit CEO scope expansion.

### P2 Resources (14 total) — Design-Quality Improvements

From `COLLISION-LEDGER.md` rows 18-27:
- `blueprint-override-ledger.jsonl`, `h-ready-audit.jsonl`, `verdict-ledger.jsonl`, `daily-logs/.md`, `GOTCHA-task-.md`, `hivemind.db`, `knowledge.db`, `session-save.log`
- No CEO-visible blast radius confirmed in T3
- Deferred to backlog

### MCP Singleton Servers — Unprobed

- Playwright browser: unknown whether page state leaks between concurrent `mcpplaywrightnavigate` calls
- Docker MCP: unknown whether container state is session-isolated
- Spreadsheet MCP: unknown whether workbook handles are session-scoped

Require separate external-service isolation plan.

### Harness Measures /tmp Clones, Not Live State

The collision harness writes to `/tmp/session-collision-/fixtures/`, not production paths. Verdicts are correct for concurrency pattern analysis but do not directly measure live production contention. The harness is a structural test, not a load test.

To measure live contention: inspect hook execution logs (`~/system/memory/logs/hook-execution.log`) for `BUSY_TIMEOUT_HIT` (costs.db) or `SKILL_DB_ERROR_FINAL` (skill-registry.db) occurrences during high-concurrency periods.
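
A quick way to do that inspection is to count marker occurrences in the log. The sketch below assumes nothing about the log line format beyond the marker strings named above; `count_contention_markers` is a hypothetical helper name.

```bash
# Sketch only. Counts log lines containing each contention marker.
count_contention_markers() {
  local log_file="$1" marker
  for marker in BUSY_TIMEOUT_HIT SKILL_DB_ERROR_FINAL; do
    # grep -c prints 0 and exits 1 when nothing matches, so ignore the status.
    echo "$marker=$(grep -c -- "$marker" "$log_file" || true)"
  done
}

# Usage (not run here):
# count_contention_markers ~/system/memory/logs/hook-execution.log
```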

---

## Out-of-Scope

The following were explicitly excluded from Phase 2:

1. P1 resources (13 items listed above) — require separate plan
2. P2 resources (14 items listed above) — backlog
3. External singletons (MCP servers, LightRAG, Qdrant, Ollama) — require external-service isolation plan
4. Hook scratch state not in T3 probe surface:
   - MEMORY.md direct write path (protected by mmwb daemon redirect)
   - `settings.local.json` (CEO-only writes per T1 classification)
5. Legacy /tmp markers cleanup (8 stale `hop-build-started-*` files present) — cleanup cron needed but collision risk unprobed

No existing hook was removed in Phase 2. Any future removal requires named CEO approval.

---

## Architecture Notes

### Why lockf, Not flock?

macOS (Darwin 25.2.0) does not ship `flock(1)` (util-linux). macOS provides `lockf(1)` at `/usr/bin/lockf`, which uses the BSD `flock(2)` kernel primitive. Semantics:
- `flock -x lockfile cmd` → `lockf -k -t 30 lockfile cmd`
- `-k` keeps the lock file on exit (required for reuse)
- `-t N` sets timeout in seconds (0 = non-blocking)
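- Lock is released by the kernel on any process death (SIGKILL-safe, confirmed by the T8 Q1 live test; this matches `flock(2)` semantics, where the lock is dropped once the last descriptor for the locked open file is closed, which the kernel does at process exit)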

### Why BEGIN IMMEDIATE, Not Just PRAGMA busy_timeout?

SQLite's default transaction mode is DEFERRED: `BEGIN DEFERRED` acquires no locks until the database is first accessed, and a write lock is requested only at the first write. Under w=4 burst with WAL mode:
1. Four connections open
2. Each executes `PRAGMA busy_timeout=5000`
3. Each executes `INSERT` (implicit BEGIN DEFERRED)
4. All four acquire SHARED locks
5. First write attempts to upgrade to RESERVED — succeeds
6. Other three attempt upgrade — all get SQLITE_BUSY
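7. `PRAGMA busy_timeout` does not rescue these writers: when a connection that already holds a read lock fails to upgrade to a write lock, SQLite can return SQLITE_BUSY immediately instead of invoking the busy handler, so the timeout-based retry is bypassed.

Result: in the worst case 1 of 4 INSERTs succeeds and 3 fail silently (exit code 5 from the sqlite3 CLI, which the hook may not check); the T3 probe at w=4 recorded 1 lost INSERT.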

`BEGIN IMMEDIATE` acquires RESERVED lock upfront. Only one connection gets RESERVED; others block (or get SQLITE_BUSY) at BEGIN, where busy_timeout applies correctly. Application-layer retry loop ensures all writers eventually succeed.

### Why Compaction Only at Boot (P0-1)?

Per-session `SESSION-STATE-.md` files accumulate during the day. Compaction at boot (not at every session end) minimizes file I/O. The 4h mtime staleness filter ensures dead sessions' files are ignored. Compaction uses atomic write (`tmp+mv`) to prevent partial-write corruption if `enforce-next-steps.sh` is killed mid-boot.

### Why 4h Staleness Filter?

Claude Code sessions under normal use are ≤ 2h (median ~30min, p95 ~90min per session log analysis). 4h allows for extended debugging sessions (e.g., CEO deep-dive on a single task) while filtering overnight orphans. Session files older than 4h at boot time are assumed stale and skipped in compaction.
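
The staleness filter and the atomic write can be sketched together as below. This is not the code in `enforce-next-steps.sh`: `compact_session_state` is a hypothetical name, plain concatenation stands in for whatever merge the real script performs, and the `SESSION-STATE-*.md` glob assumes the per-session suffix sits between the base name and the extension.

```bash
# Sketch only. Concatenation is a stand-in for the real merge logic.
compact_session_state() {
  local state_dir="$1" canonical="$2" tmp f
  # Temp file in the same directory so mv is a same-filesystem atomic rename.
  tmp=$(mktemp "$state_dir/.compact.XXXXXX") || return 1
  # -mmin -240 selects files modified less than 240 minutes (4h) ago.
  find "$state_dir" -maxdepth 1 -type f -name 'SESSION-STATE-*.md' -mmin -240 \
    | sort \
    | while IFS= read -r f; do
        cat "$f"
      done > "$tmp"
  # Readers see either the old canonical file or the complete new one.
  mv "$tmp" "$canonical"
}

# Usage (not run here):
# compact_session_state ~/system/memory ~/system/memory/SESSION-STATE.md
```

Stale files are skipped here, not deleted; removal stays with the separate cleanup script.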

### WAL Sidecar Files

WAL mode creates `-wal` and `-shm` sidecar files next to each SQLite DB:
- `-wal`: Write-Ahead Log (holds changes, including committed transactions, that have not yet been checkpointed into the main DB file)
- `-shm`: Shared memory index (used by readers to find data in WAL)

NEVER manually delete these files while any Claude Code session is running. Deleting them can discard committed transactions or corrupt the DB. macOS purges `/tmp` on reboot, but `~/system/databases/` is persistent, so sidecar files there remain until a checkpoint flushes them.

To verify WAL mode is active:
```bash
sqlite3 ~/system/databases/costs.db "PRAGMA journal_mode;"
# Output: wal
```

To revert to DELETE mode (NOT recommended unless WAL is causing issues):
```bash
sqlite3 ~/system/databases/costs.db "PRAGMA journal_mode=DELETE;"
```

---

## Evidence Files

All referenced evidence paths are archived in `~/system/specs/multi-session/`:

| File | Purpose | Lines | sha256 |
|------|---------|-------|--------|
| `COLLISION-LEDGER.md` | T5 ranked ledger, 28 resources | 128 | (T5 final version) |
| `isolation-model.md` | T7 P0-only design | 194 | (T7 final version) |
| `threat-review-t8.md` | T8 Securion review | 244 | (T8 final version) |
| `t9-implementation-log.md` | T9 P0 implementation | 251 | (T9 final version) |
| `t9-bis-implementation-log.md` | T9-bis harness + P0-6 writer | 159 | (T9-bis final version) |
| `t9-ter-implementation-log.md` | T9-ter SQLite BEGIN IMMEDIATE | 148 | (T9-ter final version) |
| `t10-ter-validation-report.md` | T10-ter PASS evidence | 169 | (T10-ter final version) |
| `/tmp/session-collision-20260518T160822/probe.jsonl` | Run G (w=2 default) | 50 lines | `8da33aee...` |
| `/tmp/session-collision-20260518T160829/probe.jsonl` | Run H (w=4 default) | 53 lines | `2c13824e...` |
| `/tmp/session-collision-20260518T160837/probe.jsonl` | Run I (w=2 per-session) | 25 lines | `c20ebf1e...` |
| `/tmp/session-collision-20260518T160843/probe.jsonl` | Run J (w=4 per-session) | 33 lines | `cecccfc1...` |

Harness location: `/Users/makinja/system/tools/diagnose-session-collision.sh` (1013 lines, sha256 `acdbcd6a...` post-T9-ter).

---

## Related Documentation

- [MC Claim Protocol](https://docs.alai.no/books/infrastructure/page/mc-claim-protocol) — Cross-session task claiming via CAS lease (already production before this work)
- [ADR-024 Agent Team Topology](https://docs.alai.no/books/system-architecture/page/agent-team-topology-adr-024) — Agent process supervision (single-session scope)
- [ZAKON NULA](https://docs.alai.no/books/rules/page/zakon-nula-tool-first) — Tool-first doctrine that drove the debug-before-solution mandate (T6 phase gate)

---

Created: 2026-05-18
Last Updated: 2026-05-18
Plan: `/Users/makinja/system/specs/claude-code-multi-session-isolation-plan.md` (207 lines)
MC Parent: #101305 (Phase 2)
Evidence Integrity: All verdicts cite probe.jsonl line numbers; no LLM inference in ledger or validation

Revision #3
Created 2026-05-18 18:35:46 UTC by John
Updated 2026-06-21 20:03:44 UTC by John
