Atomic-write pattern for shared state files (POSIX os.replace)

1. Why This Matters

In a multi-session environment where hooks, tools, and daemons write to shared state files (JSON configs, task markers, session identifiers), a naive open() + write() + close() pattern creates a torn-write hazard:

  • Concurrent sessions racing to write the same file can corrupt each other's writes (last-writer-wins with no atomicity guarantee)
  • Crash mid-write (SIGKILL, disk-full, context compaction, kernel panic) leaves the file in a partial or zero-byte state
  • Silent corruption of session isolation guarantees — hooks reading an empty or malformed file may silently fall back to legacy global state or fail-open, defeating ZAKON enforcement

Impact: ZAKON #27 (active-thread enforcement) and ZAKON #28 (max-depth gate) rely on per-session state files that must NEVER contain partial writes. A torn write to /tmp/mc-active-task-$PID causes the hook to fall back to the global /tmp/mc-active-task, silently defeating session isolation.

2. The Pattern — POSIX Atomic Rename

2.1 Python Pattern

The correct pattern uses tempfile + fsync + os.replace() to guarantee atomicity:

import os
import tempfile

def write_active_task(task_id, claude_pid=None):
    """Write active task for this session (atomic POSIX rename pattern).

    Writes to a tempfile in the same directory as the target, then uses
    os.replace() for an atomic swap. A crash or SIGKILL during the write
    leaves the target either absent (first write) or containing the previous
    complete value — never a partial write.
    """
    task_file = get_session_task_file(claude_pid)
    dir_ = os.path.dirname(task_file) or "."
    fd, tmp = tempfile.mkstemp(prefix=".active-task-", dir=dir_)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(task_id))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, task_file)
    except Exception:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise

Why this works:

  1. tempfile.mkstemp() creates a unique temp file in the SAME directory (same filesystem) as the target
  2. Write content to the temp file, flush buffers, call fsync() to ensure data is on disk
  3. os.replace(tmp, target) performs an atomic rename: on POSIX it maps to rename(2), which swaps the target in one atomic step
  4. Readers see either the old complete file OR the new complete file — never a partial write
  5. If the process crashes before os.replace(), the temp file is abandoned but the target is untouched (or absent if first write)
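
The same steps carry over to the JSON state files mentioned in Section 1. The following is a minimal sketch, not the project's implementation: atomic_write_json and the ".tmp-state-" prefix are hypothetical names chosen for illustration, and the helper takes the target path directly instead of resolving it through get_session_task_file().

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write payload as JSON to path via temp file + fsync + os.replace().

    Hypothetical helper: the name and the temp-file prefix are assumptions.
    """
    dir_ = os.path.dirname(path) or "."
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(prefix=".tmp-state-", suffix=".json", dir=dir_)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see the old file until this call, and the new file after it.
        os.replace(tmp, path)
    except Exception:
        # Best-effort cleanup; the target itself was never opened for writing.
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "counter.json")
        atomic_write_json(target, {"count": 0})
        atomic_write_json(target, {"count": 1})
        with open(target) as f:
            print(json.load(f))
```

Serializing straight into the temp file means a payload that json.dump() cannot encode fails before os.replace() is reached, so the previous file content stays in place.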

2.2 Bash Pattern

For bash hooks writing to state files, use the mktemp + mv pattern:

# Atomic write in bash using mktemp + mv
TARGET="/tmp/some-state-file.json"
CONTENT='{"count":0,"ts":"2026-05-03T10:00:00Z"}'

# Create temp file in same directory as target (same filesystem requirement)
TMP=$(mktemp "${TARGET}.XXXXXX")
echo "$CONTENT" > "$TMP"
mv -f "$TMP" "$TARGET"  # POSIX atomic on same filesystem

Why mv is atomic: On POSIX, mv within the same filesystem calls rename(2), which is atomic. Same guarantee as Python's os.replace().

Constraints:

  • mktemp template must use same directory as $TARGET (guarantees same filesystem, required for atomic mv)
  • Use printf or echo to write to $TMP, NOT to $TARGET
  • mv -f atomically replaces $TARGET (POSIX guarantees this on same filesystem)
  • No portable fsync in bash: durability across power loss requires an explicit fsync from Python (os.fsync()) or Node.js (fs.fsyncSync())

3. What It Replaces — The Anti-Pattern

3.1 Python Anti-Pattern

DO NOT USE:

# WRONG — non-atomic, torn-write hazard
def write_active_task_WRONG(task_id, task_file):
    with open(task_file, "w") as f:
        f.write(str(task_id))

Why this is broken:

  • The open("w") call truncates the file immediately (size=0 bytes)
  • The write() may be buffered and not hit disk until close() or explicit flush()
  • A SIGKILL or crash between truncate and flush leaves a zero-byte file
  • A concurrent reader during the write window sees partial content or empty file
  • No reader/writer can distinguish "empty because not written yet" from "empty because crashed mid-write"
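
The truncation window can be observed directly. This is a small illustrative sketch (size_right_after_open_w is a hypothetical helper, not part of the hooks) that measures the file size after open(path, "w") returns but before anything is written.

```python
import os
import tempfile


def size_right_after_open_w(path):
    """Return the file size seen after open(path, 'w') but before any write."""
    f = open(path, "w")
    try:
        # A concurrent reader, or a crash at this point, sees this size.
        return os.path.getsize(path)
    finally:
        f.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "active-task.txt")
        with open(target, "w") as f:
            f.write("OLD-TASK-11111")
        print(os.path.getsize(target))          # 14 bytes of old content
        print(size_right_after_open_w(target))  # 0: old content is already gone
```

The old content is discarded at open time, so even a writer that never reaches write() has already destroyed the previous value.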

3.2 Bash Anti-Pattern

DO NOT USE:

# WRONG — torn-write hazard in bash
echo "$TASK_ID" > /tmp/mc-active-task-$$

The > operator truncates the file immediately, then writes. A crash between truncate and write completion leaves a zero-byte or partial file — identical hazard to the Python anti-pattern.

4. Same-Filesystem Requirement

The dir= kwarg in tempfile.mkstemp(prefix=".active-task-", dir=dir_) is critical:

  • os.replace() only works when the source and target are on the same filesystem; across filesystems it fails with OSError (EXDEV on Linux) instead of renaming
  • A cross-device mv (e.g., /tmp and /home on different partitions) degrades to copy-then-delete, which is NOT atomic
  • By creating the temp file in the same directory as the target (os.path.dirname(task_file)), we guarantee same-device
  • If dirname is empty (target in cwd), fall back to "."

Verification: df -h /tmp vs df -h ~/.claude/hooks — if different mount points, you MUST use dir= kwarg with target's parent directory.

For bash: Use mktemp "${TARGET}.XXXXXX" template — the suffix pattern ensures temp file is created in the same directory as $TARGET.
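
The df comparison can also be approximated from Python. This is a rough sketch under stated assumptions: same_filesystem and temp_dir_for are hypothetical helper names, and comparing st_dev is only a heuristic (bind mounts and similar setups can still make a rename fail), so creating the temp file in the target's own directory remains the reliable approach.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """True if both existing paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def temp_dir_for(target):
    """Directory to pass as dir= to mkstemp: the target's parent, or '.'."""
    return os.path.dirname(target) or "."


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "mc-active-task-1234")
        # The temp directory chosen for the target is on the target's device.
        print(same_filesystem(temp_dir_for(target), d))
        # A bare filename has an empty dirname, so the fallback is the cwd.
        print(temp_dir_for("state.json"))
```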

5. Crash Recovery Semantics

Scenario | Before os.replace() | After os.replace()
First write, no prior file | Target absent, temp exists | Target exists with new content
Overwrite existing file | Target has old content, temp exists | Target has new content
Crash during write() | Target unchanged (or absent), temp partial/incomplete | N/A (replace() never called)
Crash during fsync() | Target unchanged, temp may have partial data on disk | N/A
Crash after os.replace() | N/A | Target has new complete content (atomic swap already done)

Key guarantee: The target file NEVER contains partial writes. A reader always sees either:

  1. File absent (no write has completed yet), OR
  2. File with the last successfully-completed write's full content

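The exception handler (except Exception: os.unlink(tmp)) removes the temp file when the write fails with a Python exception, preventing temp-file accumulation. It cannot run on SIGKILL or power loss, so those crashes leave the abandoned temp file behind while the target stays untouched.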
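
On the reading side, the two guaranteed states map onto a small function. This is a sketch only: read_active_task is a hypothetical name, and treating an empty file as an error rather than falling back to global state is an assumption about the desired fail-closed behavior, not something the existing hooks are known to do.

```python
import os
import tempfile


def read_active_task(task_file):
    """Return the stored task ID, or None if no write has completed yet."""
    try:
        with open(task_file, "r") as f:
            content = f.read()
    except FileNotFoundError:
        # State 1: file absent, no write has completed.
        return None
    if content == "":
        # Assumption: with atomic writers an empty file is unexpected,
        # so fail loudly instead of silently falling back.
        raise ValueError(f"empty state file: {task_file}")
    # State 2: full content of the last completed write.
    return content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "mc-active-task-1234")
        print(read_active_task(path))
        with open(path, "w") as f:
            f.write("MC-99076")
        print(read_active_task(path))
```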

6. Testing Pattern

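Unit test crash-recovery by mocking a step inside the write to raise an exception. Patch os.fsync (or os.replace) rather than builtins.open: the writer in Section 2.1 goes through os.fdopen(), so a patched builtins.open is never called and no failure would be injected.

import os
import tempfile
import unittest
from unittest.mock import patch

# Assumption: write_active_task_atomic(task_id, target) is the Section 2.1
# pattern with the target path passed in explicitly.

class TestAtomicWrite(unittest.TestCase):

    def test_crash_during_overwrite_preserves_old_content(self):
        """If write crashes after target exists, old content is preserved."""
        with tempfile.TemporaryDirectory() as tmpdir:
            target = os.path.join(tmpdir, "test-task.txt")

            # Write initial content
            with open(target, "w") as f:
                f.write("OLD-TASK-11111")

            # Simulate a crash during the second write: fsync fails after the
            # temp file is written but before os.replace() runs
            with patch("os.fsync", side_effect=OSError("Simulated crash")):
                with self.assertRaises(OSError):
                    write_active_task_atomic("NEW-TASK-22222", target)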

            # Old content must survive
            with open(target, "r") as f:
                content = f.read()
            self.assertEqual(content, "OLD-TASK-11111")

            # No temp files leaked
            leaked_temps = [f for f in os.listdir(tmpdir) if f.startswith(".active-task-")]
            self.assertEqual(len(leaked_temps), 0)

What this validates:

  • Exception during write → old content survives intact
  • No temp files leaked to disk (cleanup path works)
  • File state is never partial or corrupt
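
The "never partial" claim can also be exercised under concurrency. The following is a minimal sketch, assuming a POSIX system; atomic_write restates the Section 2.1 pattern with an explicit path, and observe_concurrent_writes is a hypothetical harness, not part of the existing test suite.

```python
import os
import tempfile
import threading


def atomic_write(path, text):
    """Section 2.1 pattern with an explicit target path (illustrative copy)."""
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(prefix=".active-task-", dir=dir_)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except Exception:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise


def observe_concurrent_writes(path, values, rounds=20):
    """Run one writer thread per value while a reader polls the file.

    Returns the set of distinct contents the reader observed.
    """
    atomic_write(path, values[0])  # make sure the target exists first
    seen = set()
    stop = threading.Event()

    def writer(value):
        for _ in range(rounds):
            atomic_write(path, value)

    def reader():
        while True:  # read at least once, then until the writers finish
            with open(path) as f:
                seen.add(f.read())
            if stop.is_set():
                break

    reader_thread = threading.Thread(target=reader)
    writer_threads = [threading.Thread(target=writer, args=(v,)) for v in values]
    reader_thread.start()
    for t in writer_threads:
        t.start()
    for t in writer_threads:
        t.join()
    stop.set()
    reader_thread.join()
    return seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        values = ["A" * 4096, "B" * 4096, "C" * 4096]
        seen = observe_concurrent_writes(os.path.join(d, "task.txt"), values)
        print(seen <= set(values))
```

Every observed content should be one of the complete values; a mixed or empty read would indicate a torn write.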

7. When to Apply

Use this pattern for any hook/lib writing JSON or state files where torn writes = corruption:

  • /tmp/mc-active-task-$SESSION_ID — ZAKON #28 depth gate relies on this
  • /tmp/active-thread-$SESSION_ID.txt — ZAKON #27 active-thread enforcement shadow file
  • ~/.claude/session-state.md shadow files (if per-session scoping is added)
  • Counter files (/tmp/john-mc-turn-counter.json, /tmp/ceo-approved-token-uses-*.count)
  • Mehanik clearance markers (/tmp/mehanik-cleared-<MC> with session_id field)
  • Any file where a concurrent reader must NEVER see partial data

Do NOT use for:

  • Log files (append-only, partial writes acceptable)
  • Human-edited markdown files (git-tracked, editor handles temp files)
  • SQLite databases (has internal transaction layer)

8. Sites Covered

This pattern has been applied to the following high-risk state file writes:

8.1 Python Sites (Phase 2A, MC #99076)

  • ~/.claude/hooks/archive/lib-legacy/session_id.py:138-161: write_active_task() function (S8 surface: /tmp/mc-active-task-$SESSION_ID)

8.2 Bash Hook Sites (Phase 2B-2, MC #99080)

8 atomic-write patches applied across 4 hooks covering surfaces S3, S8, S9, S10:

File | Line | Pattern | Surface | Description
mc-turn-reset.sh | 12 | Python tempfile.mkstemp + os.replace | S8 | Reset MC turn counter
mc-turn-reset.sh | 20 | Bash mktemp + mv | S3 | Reset CEO_APPROVED token counter
mc-turn-reset.sh | 23 | Bash mktemp + mv | S9 | Reset dispatch turn counter
ceo-intent-classifier.sh | 38 | Python tempfile.mkstemp + os.replace | S10 | Write CEO intent classification
one-ceo-turn-dispatch-cap.sh | 33 | Python tempfile.mkstemp + os.replace | S9 | Increment dispatch counter
one-ceo-turn-dispatch-cap.sh | 50 | Python tempfile.mkstemp + os.replace | S9 | Rollback dispatch counter on failure
one-ceo-turn-mc-cap.sh | 40 | Python tempfile.mkstemp + os.replace | S8 | Increment MC add counter
one-ceo-turn-mc-cap.sh | 59 | Python tempfile.mkstemp + os.replace | S8 | Rollback MC counter on failure

Any hook that writes to /tmp/mc-active-task-* or other session-scoped state files with a bare > redirection must be patched to use this pattern; an unpatched site defeats the atomic-write guarantee added in Phase 2A.

Validation: All 8 sites passed Proveo crash-safety testing (AC5: runtime exception AFTER write+fsync but BEFORE os.replace/mv — old content preserved, no temp file leak). See /tmp/proveo-99080-2026-05-03.json.
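
The AC5 condition (failure after write+fsync but before the rename) can be reproduced in a few lines. This is an illustrative sketch only, not the Proveo harness: atomic_write restates the Section 2.1 pattern with an explicit path, and check_failure_before_replace is a hypothetical name.

```python
import os
import tempfile
from unittest.mock import patch


def atomic_write(path, text):
    """Section 2.1 pattern with an explicit target path (illustrative copy)."""
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(prefix=".active-task-", dir=dir_)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except Exception:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise


def check_failure_before_replace(dir_):
    """Inject a failure at os.replace(); return (target content, leaked temps)."""
    target = os.path.join(dir_, "counter.txt")
    atomic_write(target, "OLD")
    # The temp file is fully written and fsynced before this patched call fails.
    with patch("os.replace", side_effect=OSError("injected before rename")):
        try:
            atomic_write(target, "NEW")
        except OSError:
            pass
    with open(target) as f:
        content = f.read()
    leaked = [n for n in os.listdir(dir_) if n.startswith(".active-task-")]
    return content, leaked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(check_failure_before_replace(d))
```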

9. Reference

  • MC #99076: Phase 2A atomic-write patch on session_id.py (Python pattern)
  • MC #99080: Phase 2B-2 atomic-write patches on 4 bash hooks (8 line-level sites)
  • MC #99078: Phase 2B-1 bash atomicity audit (identified 8 UNSAFE sites)
  • MC #99069: Session Isolation Audit (parent task, genesis of the finding)
  • Spec: ~/system/specs/session-isolation-audit-2026-05-03.md §3 W1 (Weakness 1) + Appendix A
  • Spec: ~/system/specs/bash-atomicity-audit-2026-05-03.md — Phase 2B-1 full inventory + fix templates
  • Source: ~/.claude/hooks/archive/lib-legacy/session_id.py lines 138-161 (Python pattern reference)
  • Source: ~/.claude/hooks/mc-turn-reset.sh, ceo-intent-classifier.sh, one-ceo-turn-dispatch-cap.sh, one-ceo-turn-mc-cap.sh (bash pattern implementations)
  • Tests: ~/.claude/hooks/archive/lib-legacy/test_session_id_atomic.py (5 unit tests covering crash-recovery)
  • Proveo Reports:
    • /tmp/postflight-99076/proveo-report.md (Phase 2A Python validation)
    • /tmp/postflight-99080/proveo-report.md (Phase 2B-2 bash validation)

10. Further Reading

  • Martin Kleppmann panelist review (/tmp/forged-99069-martin-kleppmann.md §2 Weakness 1): "write_active_task() is not atomic. Lines 138-142 use a bare open(task_file, 'w') write with no mktemp + os.replace() pattern. If the hook is interrupted mid-write (SIGKILL, context compaction crash, disk-full), the file is left in a partial or zero-byte state."
  • Linux rename(2) man page: "If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing."
  • Best-in-class reference: one-ceo-turn-mc-cap.sh:108-113 (already used mktemp + mv for counter increment before the Phase 2B audit; the correct pattern)

Generated by Skillforge for MC #99076 — Phase 2A Session Isolation Fix
Updated: 2026-05-03 (MC #99080, Phase 2B-2 bash hook atomicity expansion)
Last verified: 2026-05-03 — Proveo Phase 2B-2 report