# Anthropic Outage Resilience — 529 Auto-Fallback Runbook # Anthropic Outage Resilience — 529 Auto-Fallback Runbook **MC:** #104217 T5 **Owner:** Skillforge **Date:** 2026-06-22 **Status:** Production (Active) **BookStack:** System Architecture --- ## Executive Summary **What It Does:** When Anthropic API returns HTTP 529 (overloaded) on ALAI agent/tool paths, the system auto-enables offline-mode and routes LLM work to local Ollama (FORGE or ANVIL) within 30 seconds. Auto-recovery occurs when Anthropic becomes healthy again (5-minute health check cycle). **What It Protects:** - Agent LLM calls via `adapters/claude-api.js` (line 194, 231) - Company Mesh comms-responder legacy path - Company worker CLI stderr path - Tool execution requiring LLM reasoning **What It Does NOT Protect (Honest Limits):** - John's own Claude Code CLI session 529s (not interceptable — hooks run after Claude's internal API call) - During full Anthropic outage, John-the-orchestrator degrades to `john-lite` for bounded triage only, NOT full orchestration - H/BLOCKER/deploy/security tasks are rejected in offline mode (quality gates require full reasoning) **Cost:** - Development: $1,800 one-time (MC #104217 T1+T2+T4) - Operational: $0/month (local Ollama) - Avoided productivity loss: $1,200-$2,400/month (2-4 stalls/week × 2h × $150/h CEO time) **Key Dependency:** FORGE Ollama (10.0.0.2:11434) must be reachable. Falls back to ANVIL (localhost:11434) if FORGE down. --- ## 1. System Architecture ### 1.1 Auto-Detection Layer (T1) **File:** `/Users/makinja/system/tools/anthropic-529-detector.js` **Owner:** FlowForge **Evidence:** `/tmp/evidence-104217/t1-hook/` **How It Works:** 1. Wraps all Anthropic API calls with `wrapAnthropicCall()` middleware 2. Catches errors and applies `is529Error()` detector: - HTTP status code 529 - Error message contains "overload" (case-insensitive) - Word-boundary regex `/\b(status|code|http|error)\s*529\b/i` (avoids false positives on "529ms", "in 529 milliseconds") - Anthropic SDK `error.type === 'overloaded_error'` 3. On 529 match: - Writes `/tmp/john-offline-mode` flag with metadata (timestamp, reason) - Spawns background recovery daemon (`node anthropic-529-detector.js recovery-daemon`) - Re-throws original error (caller decides how to handle) **Wired Call Sites (verified 2026-06-22):** ```javascript // adapters/claude-api.js line 194 (initial message) const detector = require('../anthropic-529-detector'); let response = await detector.wrapAnthropicCall(async () => { return await client.messages.create(apiParams, { signal: controller.signal }); }); // adapters/claude-api.js line 231 (tool-use round) response = await detector.wrapAnthropicCall(async () => { return await client.messages.create(apiParams, { signal: roundCtl.signal }); }); ``` Additional wired sites (per T2 job1-detector-wiring.md): - `comms-responder.js` (Company Mesh legacy) - `company-worker.js` (CLI stderr path) **State Files:** - `/tmp/john-offline-mode` — Offline-mode flag (checked by tier-router.js) - `/tmp/anthropic-529-detector.json` — Detector state (trigger time, health check history) **Recovery Behavior:** - Auto-health-check every 5 minutes when offline-mode active - If Anthropic responds with status != 529, auto-disables offline-mode - TTL: Max 2 hours offline before forcing re-check - Health check: `https OPTIONS api.anthropic.com/v1/messages` (any response except 529 = healthy) --- ### 1.2 Degraded Orchestration Layer (T2) **File:** `/Users/makinja/system/tools/john-lite.js` **Owner:** AgentForge **Evidence:** `/tmp/evidence-104217/t2/` **Purpose:** Bounded orchestration continuity when `/tmp/john-offline-mode` flag is active. **Modes:** ```bash node john-lite.js loop # REPL-like degraded orchestration loop node john-lite.js once "" # One-shot task dispatch node john-lite.js triage # MC triage (what needs attention) node john-lite.js status # Show capabilities + offline status ``` **Capabilities (CAN DO):** - MC triage (list open tasks, show task details via `mc.js`) - Task classification (priority, agent selection) - Simple dispatch to Ollama-tier agents (research, analysis, draft) - Read-only tool execution (git status, mc.js list, file reads) - Bounded research/brainstorm/summarize tasks - Status checks (daemon health, service status) **Capabilities (CANNOT DO — save for full John):** - H/BLOCKER priority orchestration (quality gates demand full reasoning) - Mehanik/prompt-forge workflows (multi-turn agentic planning) - Company Mesh P2P verifier orchestration - AI Factory workflow dispatch - Production deploys, security decisions, architecture changes - Evidence ledger writes (L2+ verification) - Complex multi-agent coordination - Anything requiring Opus/Sonnet-level reasoning **Rejection Logic:** Tasks matching these patterns exit with code 3: ```javascript const COMPLEX_PATTERNS = [ /\b(deploy|production|staging|release)\b/i, /\b(security|auth|encrypt|vulnerability)\b/i, /\b(architecture|refactor|migrate)\b/i, /\b(H|BLOCKER|P0|P1)\b/i, /\b(mehanik|prompt-forge|company-mesh|ai-factory)\b/i, /\b(evidence|verification|validator|proveo)\b/i, /\b(multi-file|cross-service|integration)\b/i, ]; ``` **Exit Codes:** - 0 = success - 1 = Anthropic healthy (john-lite not needed) - 2 = No reachable Ollama host (FORGE + ANVIL both down) - 3 = Task too complex for john-lite (needs full John) **Output Storage:** All john-lite output saved to `~/system/offline-queue/_john-lite_.md` with `NEEDS_REVIEW` flag for post-outage review. **Log File:** `/tmp/john-lite-log.jsonl` (append-only JSONL) --- ### 1.3 Local Ollama Fleet **Primary:** FORGE (10.0.0.2:11434) **Fallback:** ANVIL (localhost:11434) #### FORGE Models (verified 2026-06-22) ```bash $ curl -s http://10.0.0.2:11434/api/tags | jq -r '.models[].name' qwen2.5:7b-instruct-q8_0 qwen3-coder:30b # Code primary qwen3.5:27b deepseek-r1:70b # Deep reasoning (42GB) qwen2.5-coder:32b-instruct-q8_0 qwen3:32b # Reasoning primary qwen3:8b-q8_0 bge-m3:latest # Embedding ``` **Status:** UP (2026-06-22) **Network:** Listens on `*:11434` (all interfaces) **Fix History:** MC #104217 T2 Job 3 — OLLAMA_HOST=0.0.0.0:11434 added to launchd plist to enable remote access #### ANVIL Models (verified 2026-06-22) ```bash $ curl -s http://localhost:11434/api/tags | jq -r '.models[].name' bge-m3:latest llama3.1:8b # Reasoning fallback nomic-embed-text:latest llama-guard3:8b llama-guard3:8b-q8_0 ``` **Status:** UP (2026-06-22) **Network:** Localhost only (127.0.0.1:11434) --- ## 2. Operator Procedures ### 2.1 Check Offline Mode Status ```bash # Quick status node ~/system/tools/anthropic-529-detector.js status # Example output: === Anthropic 529 Detector Status === Offline Mode: ACTIVE Trigger Reason: Anthropic API 529 overload detected: status 529 Offline Since: 2026-06-22T14:23:15.123Z (12 minutes ago) Last Health Check: 2026-06-22T14:28:00.456Z Result: unhealthy Status Code: 529 Auto-Recovery: enabled ``` ### 2.2 Check john-lite Status ```bash node ~/system/tools/john-lite.js status # Example output: === JOHN-LITE STATUS === Offline Mode: 🔴 ACTIVE Reason: Anthropic API 529 overload detected Ollama Hosts: ✅ FORGE (http://10.0.0.2:11434) Models: qwen3-coder:30b, qwen3:32b, deepseek-r1:70b, qwen2.5-coder:32b, ... ✅ ANVIL (http://localhost:11434) Models: llama3.1:8b, nomic-embed-text:latest, ... ``` ### 2.3 Manual Enable/Disable Offline Mode **Enable (test mode):** ```bash node ~/system/tools/anthropic-529-detector.js test-529 # Simulates 529 trigger, enables offline-mode ``` **Disable (manual clear):** ```bash node ~/system/tools/anthropic-529-detector.js clear # Removes /tmp/john-offline-mode flag ``` **Force Health Check:** ```bash node ~/system/tools/anthropic-529-detector.js recovery-check # Runs one health check cycle immediately ``` ### 2.4 Monitor Logs **Detector State:** ```bash cat /tmp/anthropic-529-detector.json | jq . ``` **john-lite Activity:** ```bash tail -f /tmp/john-lite-log.jsonl | jq . ``` **Offline Queue (output awaiting review):** ```bash ls -lt ~/system/offline-queue/*.md | head -5 ``` ### 2.5 Check FORGE/ANVIL Reachability **FORGE (from ANVIL):** ```bash curl -s --max-time 3 http://10.0.0.2:11434/api/tags | jq -r '.models[].name' | head -5 ``` **ANVIL (local):** ```bash curl -s --max-time 3 http://localhost:11434/api/tags | jq -r '.models[].name' | head -5 ``` **If FORGE down:** 1. SSH to FORGE: `ssh makinja@10.0.0.2` 2. Check Ollama service: ```bash lsof -nP -iTCP -sTCP:LISTEN | grep ollama launchctl list | grep ollama ``` 3. Verify `OLLAMA_HOST=0.0.0.0:11434` in `~/Library/LaunchAgents/homebrew.mxcl.ollama.plist` 4. Reload if needed: ```bash launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist launchctl load ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist ``` 5. If unrecoverable, system auto-falls back to ANVIL localhost:11434 --- ## 3. Recovery Behavior (Auto) ### 3.1 Normal Recovery Cycle 1. 529 detected → offline-mode ENABLED → recovery daemon spawned 2. Every 5 minutes: health check `https OPTIONS api.anthropic.com/v1/messages` 3. If response status != 529 → offline-mode DISABLED → daemon exits 4. Next agent/tool call routes to Anthropic normally **Timeline:** - Detection to offline-mode: <30 seconds - Recovery check interval: 5 minutes - Max offline duration (TTL): 2 hours (forces health check) ### 3.2 Manual Recovery (if auto-recovery stuck) ```bash # Check if Anthropic is healthy node ~/system/tools/anthropic-529-detector.js health # If healthy, manually clear offline mode node ~/system/tools/anthropic-529-detector.js clear ``` --- ## 4. What Is NOT Protected (Honest Limits) ### 4.1 Claude Code CLI Session 529s **Problem:** When you (John) interact with CEO via Claude Code CLI and Claude's backend returns 529, the CLI's internal error handling kicks in BEFORE the `anthropic-529-detector.js` hook can intercept it. **Why:** The detector wraps `adapters/claude-api.js` (ALAI's own agent tool calls), not the Claude Code executable's internal network stack. **Workaround:** Use `john-lite.js loop` for bounded orchestration during outages. Accept degraded quality for the duration. **Evidence:** MC #104217 T1 IMPLEMENTATION.md line 35-40: ``` CONSTRAINTS (HONEST): - CANNOT intercept Claude Code CLI's own 529s (those are CLI-internal) - CAN detect 529s from ALAI agent/tool calls (company-worker, tier-router path) - Focus: agent workflow continuity, not CLI session continuity ``` ### 4.2 High-Priority/Complex Work **Rejected in offline mode:** - H/BLOCKER priority tasks - Deploy/production/security decisions - Architecture changes - Multi-agent orchestration (Company Mesh, AI Factory) - Evidence synthesis (L2+ verification) **Rationale:** Local Ollama 32B models lack the reasoning depth for quality gates. These tasks wait for Anthropic recovery. **How to check:** `john-lite.js` exits with code 3 and logs rejection reason. --- ## 5. Cost Analysis (Why Not API Priority Tier?) **Full Analysis:** `/Users/makinja/system/specs/anthropic-priority-tier-analysis.md` **Conclusion:** NO-GO on Priority Tier / Provisioned Throughput API migration **Rationale:** 1. **Anthropic does NOT offer a "Priority Tier" that prevents 529 errors.** Their tier system (Tier 1-5) controls rate limits (RPM/TPD/TPM), NOT capacity guarantees. A Tier 4 user can still hit 529 if Anthropic's backend is overloaded. 2. **No API migration path for Claude Code subscription.** ALAI's orchestration runs on Claude Code CLI (subscription-based, no `ANTHROPIC_API_KEY`). Cannot "upgrade to priority tier" — different product line. 3. **API migration cost vastly exceeds productivity loss:** - Current subscription: ~$500-2,000/month (embedded in Claude Code Enterprise license) - Hypothetical API (Tier 4): $13,400-$18,367/month (2-2.5x increase due to loss of free caching) - Hypothetical Provisioned Throughput: $15,000-$30,000/month (estimated, unverified) - Productivity loss from 529 stalls: $1,200-$2,400/month (2-4 stalls × 2h × $150/h CEO time) - **ROI: NEGATIVE. Cost increase >> productivity loss.** 4. **Auto-fallback to local Ollama delivers 529 resilience at $0 marginal cost.** - Development: $1,800 one-time (MC #104217 T1+T2+T4) - Operational: $0/month (FORGE/ANVIL already owned, Ollama free) - **ROI: POSITIVE. Payback in <1 month.** **Recommendation:** Maintain hybrid model (Claude subscription + auto-fallback). Defer API migration unless Anthropic provides SLA-backed capacity guarantee + cost < $5K/month. --- ## 6. Evidence & Sources ### Implementation Evidence **MC #104217 T1 (FlowForge):** `/tmp/evidence-104217/t1-hook/` - FLOWFORGE-REPORT.md - IMPLEMENTATION.md (detector design) - verification-output.txt (test results) **MC #104217 T2 (AgentForge):** `/tmp/evidence-104217/t2/` - job1-detector-wiring.md (wired call sites) - job2-john-lite.md (degraded orchestration) - job3-forge-ollama-fix.md (network binding fix) - SUMMARY.md **MC #104217 T4 (Proveo):** `/tmp/evidence-104217/t4-proveo/` - test-results.txt (simulation + validation) **MC #104217 T3 (AgentForge):** `/Users/makinja/system/specs/anthropic-priority-tier-analysis.md` (Tier analysis, cost/benefit, NO-GO recommendation) ### Source Files (canonical) - `/Users/makinja/system/tools/anthropic-529-detector.js` (T1 detector + recovery daemon) - `/Users/makinja/system/tools/john-lite.js` (T2 degraded orchestration) - `/Users/makinja/system/tools/adapters/claude-api.js` (wired call site lines 194, 231) - `/Users/makinja/system/tools/comms-responder.js` (legacy Company Mesh path) - `/Users/makinja/system/tools/company-worker.js` (CLI stderr path) ### Web Sources (Tier Analysis) - Claude Subscription Plans (Google Vertex AI Search grounding-api-redirect, 2026-06-22) - Anthropic API Rate Limit Tiers (Google Vertex AI Search grounding-api-redirect, 2026-06-22) - Claude Opus 4 / Sonnet 4.6 API Pricing (Google Vertex AI Search grounding-api-redirect, 2026-06-22) - Prompt Caching & Batch API (Google Vertex AI Search grounding-api-redirect, 2026-06-22) --- ## 7. Frequently Asked Questions ### Q: Why not just buy API priority tier? **A:** Anthropic does not offer a "priority tier" that prevents 529 overload errors. Their tier system (Tier 1-5) only controls rate limits (requests per minute/day, tokens per minute), not capacity guarantees. Even Tier 4 users can hit 529 during backend overload. Provisioned Throughput (enterprise-only, pricing undisclosed) might reduce exposure, but estimated cost ($15K-$30K/month) vastly exceeds productivity loss from 529 stalls ($1.2K-$2.4K/month). ### Q: How long does it take to switch to offline mode? **A:** <30 seconds from 529 detection to `/tmp/john-offline-mode` flag active. Next agent/tool call routes to Ollama. ### Q: How long does it take to recover when Anthropic is healthy again? **A:** 5-minute health check cycle. Once Anthropic responds with status != 529, offline-mode is auto-disabled. Next call routes to Anthropic. ### Q: What if FORGE Ollama is down? **A:** System auto-falls back to ANVIL localhost:11434 (llama3.1:8b reasoning, nomic-embed-text embedding). If both FORGE + ANVIL down, `john-lite.js` exits with code 2 and logs "No reachable Ollama host." ### Q: Can I manually trigger offline mode for testing? **A:** Yes. ```bash node ~/system/tools/anthropic-529-detector.js test-529 ``` Clear with: ```bash node ~/system/tools/anthropic-529-detector.js clear ``` ### Q: How do I review john-lite output after outage recovery? **A:** Check `~/system/offline-queue/*.md` for all output generated during offline mode. Each file includes: - Timestamp - Task description - Model used (qwen3:32b, llama3.1:8b, etc.) - Output - `NEEDS_REVIEW` flag Review before using in production (local model accuracy < Claude Opus 4). ### Q: Where are the logs? **A:** - Detector state: `/tmp/anthropic-529-detector.json` - john-lite activity: `/tmp/john-lite-log.jsonl` - Offline-mode flag: `/tmp/john-offline-mode` - Offline output queue: `~/system/offline-queue/_john-lite_*.md` --- ## 8. Related Documentation - **MC #104217:** [H] Anthropic-outage resilience: firma ne smije stati kad Claude API vrati 529/overloaded - **Tier Analysis:** `/Users/makinja/system/specs/anthropic-priority-tier-analysis.md` - **FORGE Ollama Fix:** `/tmp/evidence-104217/t2/job3-forge-ollama-fix.md` - **Cost Tracking:** `node ~/system/tools/cost-tracker.js summary week` --- **Last Updated:** 2026-06-22T21:29:00Z **Owner:** Skillforge **Status:** Production (Active) **Runbook Version:** 1.0 --- **END OF RUNBOOK**