Skip to main content

Anthropic Outage Resilience — 529 Auto-Fallback Runbook

Anthropic Outage Resilience — 529 Auto-Fallback Runbook

MC: #104217 T5
Owner: Skillforge
Date: 2026-06-22
Status: Production (Active)


Executive Summary

What It Does:
When Anthropic API returns HTTP 529 (overloaded) on ALAI agent/tool paths, the system auto-enables offline-mode and routes LLM work to local Ollama (FORGE or ANVIL) within 30 seconds. Auto-recovery occurs when Anthropic becomes healthy again (5-minute health check cycle).

What It Protects:

  • Agent LLM calls via adapters/claude-api.js (line 194, 231)
  • Company Mesh comms-responder legacy path
  • Company worker CLI stderr path
  • Tool execution requiring LLM reasoning

What It Does NOT Protect (Honest Limits):

  • John's own Claude Code CLI session 529s (not interceptable — hooks run after Claude's internal API call)
  • During full Anthropic outage, John-the-orchestrator degrades to john-lite for bounded triage only, NOT full orchestration
  • H/BLOCKER/deploy/security tasks are rejected in offline mode (quality gates require full reasoning)

Cost:

  • Development: $1,800 one-time (MC #104217 T1+T2+T4)
  • Operational: $0/month (local Ollama)
  • Avoided productivity loss: $1,200-$2,400/month (2-4 stalls/week × 2h × $150/h CEO time)

Key Dependency:
FORGE Ollama (10.0.0.2:11434) must be reachable. Falls back to ANVIL (localhost:11434) if FORGE down.


1. System Architecture

1.1 Auto-Detection Layer (T1)

File: /Users/makinja/system/tools/anthropic-529-detector.js
Owner: FlowForge
Evidence: /tmp/evidence-104217/t1-hook/

How It Works:

  1. Wraps all Anthropic API calls with wrapAnthropicCall() middleware
  2. Catches errors and applies is529Error() detector:
    • HTTP status code 529
    • Error message contains "overload" (case-insensitive)
    • Word-boundary regex /\b(status|code|http|error)\s*529\b/i (avoids false positives on "529ms", "in 529 milliseconds")
    • Anthropic SDK error.type === 'overloaded_error'
  3. On 529 match:
    • Writes /tmp/john-offline-mode flag with metadata (timestamp, reason)
    • Spawns background recovery daemon
    • Re-throws original error (caller decides how to handle)

State Files:

  • /tmp/john-offline-mode — Offline-mode flag (checked by tier-router.js)
  • /tmp/anthropic-529-detector.json — Detector state (trigger time, health check history)

Recovery Behavior:

  • Auto-health-check every 5 minutes when offline-mode active
  • If Anthropic responds with status != 529, auto-disables offline-mode
  • TTL: Max 2 hours offline before forcing re-check

See full runbook at: /tmp/evidence-104217/t5-runbook.md