Anthropic Outage Resilience — 529 Auto-Fallback Runbook
Anthropic Outage Resilience — 529 Auto-Fallback Runbook
MC: #104217 T5
Owner: Skillforge
Date: 2026-06-22
Status: Production (Active)
Executive Summary
What It Does:
When Anthropic API returns HTTP 529 (overloaded) on ALAI agent/tool paths, the system auto-enables offline-mode and routes LLM work to local Ollama (FORGE or ANVIL) within 30 seconds. Auto-recovery occurs when Anthropic becomes healthy again (5-minute health check cycle).
What It Protects:
- Agent LLM calls via
adapters/claude-api.js(line 194, 231) - Company Mesh comms-responder legacy path
- Company worker CLI stderr path
- Tool execution requiring LLM reasoning
What It Does NOT Protect (Honest Limits):
- John's own Claude Code CLI session 529s (not interceptable — hooks run after Claude's internal API call)
- During full Anthropic outage, John-the-orchestrator degrades to
john-litefor bounded triage only, NOT full orchestration - H/BLOCKER/deploy/security tasks are rejected in offline mode (quality gates require full reasoning)
Cost:
- Development: $1,800 one-time (MC #104217 T1+T2+T4)
- Operational: $0/month (local Ollama)
- Avoided productivity loss: $1,200-$2,400/month (2-4 stalls/week × 2h × $150/h CEO time)
Key Dependency:
FORGE Ollama (10.0.0.2:11434) must be reachable. Falls back to ANVIL (localhost:11434) if FORGE down.
1. System Architecture
1.1 Auto-Detection Layer (T1)
File: /Users/makinja/system/tools/anthropic-529-detector.js
Owner: FlowForge
Evidence: /tmp/evidence-104217/t1-hook/
How It Works:
- Wraps all Anthropic API calls with
wrapAnthropicCall()middleware - Catches errors and applies
is529Error()detector:- HTTP status code 529
- Error message contains "overload" (case-insensitive)
- Word-boundary regex
/\b(status|code|http|error)\s*529\b/i(avoids false positives on "529ms", "in 529 milliseconds") - Anthropic SDK
error.type === 'overloaded_error'
- On 529 match:
- Writes
/tmp/john-offline-modeflag with metadata (timestamp, reason) - Spawns background recovery daemon
- Re-throws original error (caller decides how to handle)
- Writes
State Files:
/tmp/john-offline-mode— Offline-mode flag (checked by tier-router.js)/tmp/anthropic-529-detector.json— Detector state (trigger time, health check history)
Recovery Behavior:
- Auto-health-check every 5 minutes when offline-mode active
- If Anthropic responds with status != 529, auto-disables offline-mode
- TTL: Max 2 hours offline before forcing re-check
See full runbook at: /tmp/evidence-104217/t5-runbook.md