ANVIL SPOF Elimination Plan (2026-04-20) Status: DRAFT — Awaiting Proveo validation + Alem approval Author: Kelsey Hightower / FlowForge Date: 2026-04-20 MC Task: #8515 ANVIL SPOF elimination sprint Deadline: 2026-05-01 ANVIL SPOF Elimination Plan Author: FlowForge (Kelsey Hightower) | MC Task #8515 Date: 2026-04-20 Status: DRAFT — Awaiting Alem approval before any implementation Executive Summary ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage, kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference, all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage. RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases. Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month additional infrastructure cost (FORGE is already owned and powered). Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+) Architecture Overview ANVIL (primary) FORGE (warm standby) Mac Studio M3 Ultra 96GB Mac Studio M3 Ultra 256GB 100.103.49.98 (Tailscale) 100.104.164.86 (Tailscale) 10.0.0.1 (Thunderbolt) 10.0.0.2 (Thunderbolt) │ │ │ Thunderbolt Bridge (< 1ms) │ └────────────────────────────────-─┘ │ ▼ Azure Blob Storage alaibackups0ebb system-db-backups container (litestream WAL segments, all DBs) All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore). FORGE does NOT write back to Azure. Azure is the single durable WAL store. Phase 1 — Litestream Expansion (all ~67 DBs) 1.1 Database Tier Classification Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only. P0 — Mission Critical (system stops without these) Database Size Write Freq Justification mission-control.db 26 MB Very high Primary task ledger — all MC operations. CURRENTLY REPLICATED. hivemind.db 162 MB High Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED. tasks.db 4 KB High Active task queue — active work in flight costs.db 256 KB High Token cost tracking, budget enforcement events.db 14 MB High System event bus — orchestrator depends on this orchestrator-queue.db 28 KB High Active agent job queue — jobs lost = work lost orchestrator-workers.db 36 KB High Worker state — active session tracking durable-runner.db 896 KB Medium Durable task execution state session-index.db 56 MB High Agent session state — all active sessions knowledge.db 192 MB Medium RAG knowledge base — primary retrieval corpus emails.db 0 B (active) High Email agent state — initialized on first write email-inbox.db 3.1 MB High Live email queue alem-directives.db active WAL High CEO directives — highest trust data P0 — Financial / Legal (loss = regulatory exposure) Database Size Write Freq Justification fiken.db 0 B (active) Medium Fiken accounting integration — financial records invoices.db 36 KB Medium Invoice state — revenue tracking contracts.db 40 KB Low Signed contracts — legal documents leads.db 256 KB Medium Sales pipeline — business development P1 — Operational (system degrades without these) Database Size Write Freq Justification agent-routing.db 4.1 MB Medium Routing decisions, agent assignment bee-index.db 4.2 MB Medium Bee task index bih-tenders.db 640 KB Low BiH market tenders — business intelligence browser-tasks.db active WAL Medium Browser automation queue companies.db 0 B (active) Low Company registry contacts.db 192 KB Low CRM contacts deploy-registry.db 16 KB Low Deployment history design-reviews.db 64 KB Low Design review state distill.db 2.0 MB Medium Knowledge distillation cache documents.db 32 KB Low Document registry drafts.db 360 KB Medium Draft content drift.db active WAL Medium Config drift detection email-audit.db 256 KB Medium Email audit trail email-briefing.db 0 B (active) Low Daily briefing state email-index.db 0 B (active) Low Email search index email-tracking.db 36 KB Medium Email delivery tracking escalations.db 24 KB Medium Escalation queue facts.db 20 KB Low System facts store flywheel.db 432 MB Low Flywheel learning data — largest DB goals.db 44 KB Medium OKR / goal tracking guardrails-audit.db 10 MB Medium Safety audit trail health-events.db 15 MB High System health events hivemind-archive.db 6.7 MB Low HiveMind historical archive master-control.db 0 B (active) Medium Master control state mc.db 0 B (active) Medium Mission control alias minions.db 192 KB Medium Minion agent registry observability.db 44 KB Medium Metrics and traces orchestrator-events.db 0 B (active) Medium Orchestrator event log pipeline.db active WAL Medium CI/CD pipeline state projects.db 40 KB Low Project registry routing-outcomes.db 192 KB Medium Tier routing outcome log skill-improvements.db 20 KB Low Skill improvement tracking skill-registry.db 128 KB Low Agent skill registry sprint-pipeline.db 32 KB Medium Sprint pipeline state strategy-tracker.db 128 KB Low Strategic initiative tracking teams.db 40 KB Low Team registry tenders.db 384 KB Low Norwegian tender data tickets.db active WAL Medium Support ticket tracking tool-audit.db 6.1 MB Medium Tool usage audit tool-registry.db 128 KB Low Tool registry trace-events.db 52 MB High Distributed trace store applications-tracker.db 12 KB Low Job/grant applications P2 — Cache / Reconstructible (loss = inconvenience only) Database Size Write Freq Justification baikal-caldav.db 108 KB Low CalDAV cache — reconstructible from Baikal prompt-cache.db 320 KB Medium LLM prompt cache — can warm from scratch prompt-metrics.db 28 KB Low Prompt performance metrics rag-cache.db active WAL Medium RAG response cache — reconstructible semantic-reuse-index.db 192 KB Medium Semantic cache — reconstructible stbs.db 0 B (active) Low STBS data — empty telemetry.db 24 KB Medium Telemetry — can lose without ops impact token-cost.db active WAL Medium Cost log — reconstructible from API receipts usage.db 0 B (active) Low Usage tracking — empty vcr.db active WAL Low HTTP cassette cache — reconstructible 1.2 Retention Strategy Current retention for the 2 replicated DBs: 72h. This is insufficient for P0. Tier Retention Justification P0 (mission-critical) 7d One week: covers weekend + Monday incident recovery. 72h is too tight — if a silent corruption is not caught in 3 days, all WAL segments are gone. P0 (financial/legal) 30d Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows. P1 72h Current default. Operationally acceptable. P2 24h Cache data. Disk cost matters more than recovery depth. Retention-check-interval: 1h for all tiers (current default, correct). Sync-interval: 1s for all tiers P0 and P1. 10s for P2 (reduce Azure transaction cost on low-value data). Azure storage cost estimate at current sizes (~1.2 GB total databases): WAL segments are incremental. Estimate ~500 MB/day delta across all active DBs. 7-day P0 WAL: ~3.5 GB. 30-day financial: ~1 GB. P1 72h: ~1 GB. Total Azure Blob: ~6 GB. At ~€0.02/GB/month = ~€0.12/month. Negligible. 1.3 New litestream.yml Path: /Users/makinja/system/config/litestream.yml Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn. Note on knowledge.db (192 MB): P0, sync-interval 1s — it's actively written by RAG ingestion. # Litestream — SQLite streaming replication to Azure Blob Storage # Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98) # Config: /Users/makinja/system/config/litestream.yml # Auth: Azure SP (alai-backup-writer) via client credentials # SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669) # SP has Storage Blob Data Contributor on system-db-backups container # Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env # Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block) # Updated: 2026-04-20 — ANVIL SPOF Elimination Sprint (MC #8515) # # Tier reference: # P0-critical: retention 7d, sync 1s # P0-financial: retention 30d, sync 1s # P1: retention 72h, sync 1s (or 30s for large DBs) # P2: retention 24h, sync 10s dbs: # ── P0 MISSION CRITICAL ────────────────────────────────────────────────────── - path: /Users/makinja/system/databases/mission-control.db replicas: - name: mc-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/mission-control retention: 168h # 7 days retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/hivemind.db replicas: - name: hivemind-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/hivemind retention: 168h # 7 days retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/tasks.db replicas: - name: tasks-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/tasks retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/costs.db replicas: - name: costs-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/costs retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/events.db replicas: - name: events-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/events retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/orchestrator-queue.db replicas: - name: orch-queue-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/orchestrator-queue retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/orchestrator-workers.db replicas: - name: orch-workers-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/orchestrator-workers retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/durable-runner.db replicas: - name: durable-runner-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/durable-runner retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/session-index.db replicas: - name: session-index-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/session-index retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/knowledge.db replicas: - name: knowledge-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/knowledge retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/emails.db replicas: - name: emails-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/emails retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/email-inbox.db replicas: - name: email-inbox-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/email-inbox retention: 168h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/alem-directives.db replicas: - name: alem-directives-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/alem-directives retention: 168h retention-check-interval: 1h sync-interval: 1s # ── P0 FINANCIAL / LEGAL ───────────────────────────────────────────────────── - path: /Users/makinja/system/databases/fiken.db replicas: - name: fiken-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/fiken retention: 720h # 30 days retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/invoices.db replicas: - name: invoices-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/invoices retention: 720h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/contracts.db replicas: - name: contracts-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/contracts retention: 720h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/leads.db replicas: - name: leads-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/leads retention: 720h retention-check-interval: 1h sync-interval: 1s # ── P1 OPERATIONAL ─────────────────────────────────────────────────────────── - path: /Users/makinja/system/databases/agent-routing.db replicas: - name: agent-routing-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/agent-routing retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/bee-index.db replicas: - name: bee-index-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/bee-index retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/bih-tenders.db replicas: - name: bih-tenders-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/bih-tenders retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/browser-tasks.db replicas: - name: browser-tasks-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/browser-tasks retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/companies.db replicas: - name: companies-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/companies retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/contacts.db replicas: - name: contacts-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/contacts retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/deploy-registry.db replicas: - name: deploy-registry-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/deploy-registry retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/design-reviews.db replicas: - name: design-reviews-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/design-reviews retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/distill.db replicas: - name: distill-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/distill retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/documents.db replicas: - name: documents-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/documents retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/drafts.db replicas: - name: drafts-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/drafts retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/drift.db replicas: - name: drift-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/drift retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/email-audit.db replicas: - name: email-audit-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/email-audit retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/email-briefing.db replicas: - name: email-briefing-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/email-briefing retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/email-index.db replicas: - name: email-index-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/email-index retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/email-tracking.db replicas: - name: email-tracking-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/email-tracking retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/escalations.db replicas: - name: escalations-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/escalations retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/facts.db replicas: - name: facts-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/facts retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/flywheel.db replicas: - name: flywheel-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/flywheel retention: 72h retention-check-interval: 1h sync-interval: 30s # 432MB — throttle sync to reduce Azure transactions - path: /Users/makinja/system/databases/goals.db replicas: - name: goals-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/goals retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/guardrails-audit.db replicas: - name: guardrails-audit-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/guardrails-audit retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/health-events.db replicas: - name: health-events-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/health-events retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/hivemind-archive.db replicas: - name: hivemind-archive-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/hivemind-archive retention: 72h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/master-control.db replicas: - name: master-control-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/master-control retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/mc.db replicas: - name: mc-db-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/mc-db retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/minions.db replicas: - name: minions-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/minions retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/observability.db replicas: - name: observability-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/observability retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/orchestrator-events.db replicas: - name: orch-events-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/orchestrator-events retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/pipeline.db replicas: - name: pipeline-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/pipeline retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/projects.db replicas: - name: projects-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/projects retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/routing-outcomes.db replicas: - name: routing-outcomes-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/routing-outcomes retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/skill-improvements.db replicas: - name: skill-improvements-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/skill-improvements retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/skill-registry.db replicas: - name: skill-registry-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/skill-registry retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/sprint-pipeline.db replicas: - name: sprint-pipeline-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/sprint-pipeline retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/strategy-tracker.db replicas: - name: strategy-tracker-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/strategy-tracker retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/teams.db replicas: - name: teams-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/teams retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/tenders.db replicas: - name: tenders-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/tenders retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/tickets.db replicas: - name: tickets-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/tickets retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/tool-audit.db replicas: - name: tool-audit-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/tool-audit retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/tool-registry.db replicas: - name: tool-registry-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/tool-registry retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/trace-events.db replicas: - name: trace-events-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/trace-events retention: 72h retention-check-interval: 1h sync-interval: 1s - path: /Users/makinja/system/databases/applications-tracker.db replicas: - name: applications-tracker-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/applications-tracker retention: 72h retention-check-interval: 1h sync-interval: 1s # ── P2 CACHE / RECONSTRUCTIBLE ─────────────────────────────────────────────── - path: /Users/makinja/system/databases/baikal-caldav.db replicas: - name: baikal-caldav-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/baikal-caldav retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/prompt-cache.db replicas: - name: prompt-cache-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/prompt-cache retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/prompt-metrics.db replicas: - name: prompt-metrics-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/prompt-metrics retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/semantic-reuse-index.db replicas: - name: semantic-reuse-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/semantic-reuse-index retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/stbs.db replicas: - name: stbs-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/stbs retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/telemetry.db replicas: - name: telemetry-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/telemetry retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/token-cost.db replicas: - name: token-cost-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/token-cost retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/usage.db replicas: - name: usage-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/usage retention: 24h retention-check-interval: 1h sync-interval: 10s - path: /Users/makinja/system/databases/vcr.db replicas: - name: vcr-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/vcr retention: 24h retention-check-interval: 1h sync-interval: 10s 1.4 Implementation Steps (ANVIL) Stop litestream: launchctl stop com.alai.litestream Replace /Users/makinja/system/config/litestream.yml with the config above. Validate config: /opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate Start litestream: launchctl start com.alai.litestream Verify all DBs appear in Azure: az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --query "[].name" | wc -l (expect ~67+ entries). Watch logs for errors: tail -f /Users/makinja/system/logs/litestream-error.log Phase 2 — FORGE Hardware / OS Decision 2.1 FORGE Already Exists — Hardware Decision Is Made FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86. User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b, deepseek-r1:70b, qwen3-coder, and bge-m3. No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned). 2.2 Why FORGE Wins Over Every Alternative Option Cost/mo Latency to ANVIL Apple Silicon macOS parity Verdict FORGE (Mac Studio M3U 256GB, owned) 0 EUR < 1ms (Thunderbolt) Yes (M3 Ultra) Yes (same LaunchAgent ecosystem) CHOSEN Mac Mini M4 Pro (purchase) ~50 EUR amortized < 1ms if local Yes Yes Redundant — FORGE exists Hetzner Linux VM (CCX33) ~30-50 EUR 10-30ms (internet) No (x86) No (systemd, not launchd) Budget option only if FORGE fails Azure VM (Sweden Central) ~60-80 EUR 10-30ms No No Closest to Azure storage but no Apple Silicon Decision: Use FORGE as warm standby. Zero additional cost. Thunderbolt latency is effectively local — litestream WAL replication will complete in well under 60s. 2.3 FORGE Bootstrap Prerequisites FORGE already runs Ollama. What is missing: litestream installed on FORGE (check: brew list litestream on basicas@FORGE) Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID) ~/system/databases/ directory created on FORGE litestream-restore.sh daemon script written and loaded as LaunchAgent on FORGE SSH key access from ANVIL to FORGE for health check and failover scripts Phase 3 — Continuous Restore on FORGE (< 60s RPO) 3.1 Architecture FORGE runs litestream restore in a watch loop per database. Litestream 0.5.x does not have a native watch mode — it restores a snapshot + WAL segments. The recommended approach is a shell script loop that calls litestream restore repeatedly with a short interval. However, litestream does support a second process pattern: run litestream replicate on FORGE pointing at the SAME Azure bucket paths, but configured as a replica-only consumer. This is the correct approach: FORGE runs a litestream restore daemon that continuously polls for new WAL segments from Azure. 3.2 Continuous Restore Strategy Use litestream restore with the -if-replica-exists flag in a loop: #!/usr/bin/env bash # /Users/basicas/system/scripts/litestream-restore-loop.sh # Runs on FORGE. Continuously restores all P0+P1 DBs from Azure. # Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target) set -euo pipefail LITESTREAM=/opt/homebrew/bin/litestream CONFIG=/Users/basicas/system/config/litestream-restore.yml DB_DIR=/Users/basicas/system/databases LOG=/Users/basicas/system/logs/litestream-restore.log INTERVAL=30 # seconds between restore cycles while true; do echo "[$(date -Iseconds)] Starting restore cycle" >> "$LOG" # Restore each DB defined in restore config # litestream restore will only apply new WAL segments if DB already exists $LITESTREAM restore -config "$CONFIG" -if-replica-exists >> "$LOG" 2>&1 || true echo "[$(date -Iseconds)] Restore cycle complete, sleeping ${INTERVAL}s" >> "$LOG" sleep "$INTERVAL" done 3.3 FORGE litestream-restore.yml A separate config file on FORGE that mirrors ANVIL's litestream.yml but uses restore semantics. FORGE is READ-ONLY consumer. It never writes back to Azure. Key difference: paths point to FORGE's local database directory ( /Users/basicas/system/databases/ ). The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to. # /Users/basicas/system/config/litestream-restore.yml # FORGE warm standby — continuous restore from Azure # DO NOT run litestream replicate with this config — restore only dbs: - path: /Users/basicas/system/databases/mission-control.db replicas: - name: mc-abs type: abs endpoint: https://alaibackups0ebb.blob.core.windows.net bucket: system-db-backups path: litestream/mission-control # ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL) # P2 DBs: omit from restore config — not worth continuous restore overhead 3.4 FORGE LaunchAgent for Restore Loop Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist Label com.alai.litestream-restore ProgramArguments /bin/bash /Users/basicas/system/scripts/litestream-restore-loop.sh EnvironmentVariables AZURE_STORAGE_ACCOUNT alaibackups0ebb AZURE_CLIENT_ID 1a0b3018-0c31-474b-918f-531b0a29a669 AZURE_CLIENT_SECRET RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP AZURE_TENANT_ID 3454a03f-20b4-4bda-a116-2293c459aecd KeepAlive RunAtLoad StandardOutPath /Users/basicas/system/logs/litestream-restore.log StandardErrorPath /Users/basicas/system/logs/litestream-restore-error.log ThrottleInterval 10 3.5 RPO Calculation ANVIL litestream sync-interval: 1s (WAL segment flushed to Azure every 1s for P0) FORGE restore poll interval: 30s Azure propagation: < 1s (same-region, in-blob operations) Worst-case RPO: 31s (well under 60s target) Expected average RPO: ~15-20s Phase 4 — Ollama Failover Tier Routing 4.1 Current State Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ. The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths. The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down, and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down. 4.2 Failover Config Extension Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block: "ollamaHosts": { "anvil": { "url": "http://localhost:11434", "tailscale_url": "http://100.103.49.98:11434", "health_path": "/api/tags", "health_timeout_ms": 3000, "role": "primary-infra" }, "forge": { "url": "http://10.0.0.2:11434", "tailscale_url": "http://100.104.164.86:11434", "health_path": "/api/tags", "health_timeout_ms": 3000, "role": "primary-compute" } }, "failoverRules": { "anvil-down": { "redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"], "to_forge_models": { "llama3.1:8b": "llama3.1:8b", "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0" }, "note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents" }, "forge-down": { "redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"], "to_claude": true, "note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable — FORGE failure is rare)" } } 4.3 Health Check Daemon A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes status to a JSON file that ollama-engine.js reads before routing: Path: /Users/makinja/system/daemons/ollama-health-monitor.js // Pseudocode — implementation by CodeCraft // Runs every 15s, writes to /tmp/ollama-health.json // { // "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }, // "forge": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" } // } // tier-router.js reads this file before every dispatch // If anvil.healthy === false: redirect tier 1/2 requests to forge // If forge.healthy === false: redirect tier 2c/3 requests to claude 4.4 Manual Failover Command For Phase 1 (before automatic failover is implemented): # On ANVIL, when FORGE is down — force all routing to ANVIL echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json # When ANVIL is down, from FORGE (if FORGE has ollama-engine.js): # Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge" Phase 5 — DNS / Service Discovery 5.1 Options Evaluated Option Mechanism Failover Speed Complexity Cost Tailscale MagicDNS DNS record swap via Tailscale API Manual: ~1 min Low Free Cloudflare DNS + health check CF Load Balancer health-check → DNS swap Automatic: ~30s Medium ~$5/month Local /etc/hosts on each node Static entries, no automatic failover Manual: ~1 min None Free Cloudflare Tunnel alias DNS alias behind CF Tunnel ~30s Medium Free tier 5.2 Recommendation: Tailscale MagicDNS Chosen: Tailscale MagicDNS with manual DNS swap. Rationale: All nodes (ANVIL, FORGE, ab-mac) are already on the same Tailscale network. Tailscale MagicDNS can assign a hostname anvil.alai.internal (or use the device name directly). Current hardcoded addresses ( localhost:11434 , 10.0.0.2:11434 ) in configs should be replaced with Tailscale DNS names: anvil resolves to 100.103.49.98, forge resolves to 100.104.164.86. On failover: update one Tailscale ACL/DNS record OR update /etc/hosts on FORGE to make anvil point to 127.0.0.1 (making FORGE answer for anvil traffic locally). Implementation: In Tailscale admin console: verify MagicDNS is enabled for the tailnet. Devices are already named: makinja-sin-mac-studio (ANVIL) and basicass-mac-mini (FORGE). Add a Tailscale DNS override: anvil.alai → 100.103.49.98 (ANVIL primary). Add to all tool configs: replace localhost:11434 with anvil.alai:11434 , 10.0.0.2:11434 with forge.alai:11434 . Failover procedure: update Tailscale DNS record anvil.alai → 100.104.164.86 (FORGE). This takes effect across all nodes within ~30s (Tailscale DNS TTL). Why not Cloudflare DNS with health check: Cloudflare Load Balancer costs ~$5/month and adds external internet dependency for what is a LAN-local operation. Overkill for current scale. Revisit if ALAI adds a third node outside the LAN. Phase 6 — External Heartbeat 6.1 Requirement An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls). 6.2 Mechanism: GitHub Actions Cron (Recommended) Chosen: GitHub Actions scheduled workflow. Cost: free (GitHub public repo or private with Actions minutes). No Azure Function setup required. # .github/workflows/anvil-heartbeat.yml # In a private ALAI GitHub repo (e.g., alai-infra or system-health) name: ANVIL Heartbeat on: schedule: - cron: '* * * * *' # Every minute jobs: heartbeat: runs-on: ubuntu-latest timeout-minutes: 1 steps: - name: Check ANVIL health via Tailscale id: health run: | # ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL # Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL) # Option B: Use Tailscale GitHub Action to join the tailnet and check directly STATUS=$(curl -s -o /dev/null -w "%{http_code}" \ --connect-timeout 10 \ --max-time 15 \ ${{ secrets.ANVIL_HEALTH_URL }}) echo "status=$STATUS" >> $GITHUB_OUTPUT - name: Alert Slack if down if: steps.health.outputs.status != '200' uses: slackapi/slack-github-action@v1 with: payload: | { "channel": "#ops", "text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ github.run_started_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed." } env: SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }} 6.3 ANVIL Health Endpoint ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel) or via Tailscale GitHub Action. The simplest approach: Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port 8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."} . Expose via existing Cloudflare Tunnel infrastructure. 6.4 Alert Escalation 2 consecutive failures (2 minutes down): Slack #ops message. 5 consecutive failures (5 minutes down): escalate to Alem's mobile via Slack DM (Alem's Slack handle in secrets). 6.5 Azure Function Alternative Azure Function with Timer trigger (every 60s) is viable but requires: Azure subscription billing (Consumption plan: ~$0/month for < 1M executions — effectively free) Azure Function App deployment and maintenance More setup complexity than GitHub Actions Verdict: GitHub Actions preferred for simplicity. Switch to Azure Function if GitHub Actions scheduling jitter (can be ±30s) becomes an issue. Phase 7 — Shared Secrets (FORGE Bitwarden Access) 7.1 Problem FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session. 7.2 Options Option Description Risk Separate BW account on FORGE FORGE has its own Bitwarden account with shared collection Low — independent Shared BW session sync ANVIL writes /tmp/bw-session to FORGE via rsync Medium — session expires Azure Key Vault break-glass Critical secrets in AKV, FORGE SP can read them Low — Azure dependency Environment variables in plist Secrets baked into LaunchAgent plist on FORGE Low but plaintext risk 7.3 Recommendation: Two-Layer Approach Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently. FORGE has bw CLI installed. FORGE has its own BW_SESSION set via a one-time manual bootstrap: bw login --apikey using a FORGE-specific API key (Bitwarden supports API keys per user/device). Session is stored in /Users/basicas/.bw-session and refreshed by a LaunchAgent on FORGE. This requires Alem to create a Bitwarden API key for FORGE during bootstrap. Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap. The Azure SP secret ( AZURE_CLIENT_SECRET ) is placed directly in the com.alai.litestream-restore.plist EnvironmentVariables block — same pattern as ANVIL. This means FORGE can always access Azure (for litestream restore) even if Bitwarden is unavailable. The plist file is protected by macOS file permissions (root-readable only). This is the same pattern already in use on ANVIL (confirmed in the plist we read). Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets. Create a new SP alai-forge-reader with Key Vault Secrets User role. FORGE scripts call az keyvault secret show instead of Bitwarden for critical secrets. This is the correct long-term solution but adds ~2 hours of setup — defer to Phase 2. 7.4 Bootstrap Sequence for FORGE Secrets # On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge): # 1. Install bw CLI brew install bitwarden-cli # 2. Login with API key (avoids interactive login) export BW_CLIENTID="" export BW_CLIENTSECRET="" bw login --apikey bw unlock --passwordenv BW_MASTER_PASSWORD # or interactive # 3. Store session bw unlock > /Users/basicas/.bw-session # 4. Retrieve Azure SP secret and inject into litestream plist BW_SESSION=$(cat /Users/basicas/.bw-session) AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION") # Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task) This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill after all phases are implemented. This is a REAL drill — not a dry run. 8.1 Pre-Drill Prerequisites Phase 1 complete: all ~67 DBs replicating to Azure (verify with az storage blob list count) Phase 3 complete: FORGE restore loop running, confirmed by checking FORGE DB file timestamps Phase 4 complete: Ollama health monitor daemon running on ANVIL Phase 5 complete: Tailscale MagicDNS configured ( anvil.alai resolves correctly) Phase 6 complete: GitHub Actions heartbeat workflow deployed and sending test ping Phase 7 complete: FORGE Bitwarden session independently functional 8.2 Drill Procedure Step 1: Establish baseline (T=0) # On ANVIL — record current state node ~/system/tools/mc.js stats # Record open task count sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" # Record date -Iseconds > /tmp/drill-start.txt Step 2: Simulate ANVIL failure # Graceful shutdown (simulates power outage or kernel panic recovery) # DO NOT run on production without Alem present sudo shutdown -h now # Or: launchctl stop all non-essential services # Alternative: kill Ollama + stop litestream + stop pi-orchestrator (partial failure sim) launchctl stop com.alai.litestream launchctl stop com.john.pi-orchestrator launchctl stop com.john.ollama-serve-v2 Step 3: Measure time to alert (T=2 min) GitHub Actions heartbeat should fire within 2 minutes of ANVIL going offline. Angie records: timestamp of Slack #ops alert arrival. Expected: < 2 min 30s from shutdown to Slack alert. Step 4: FORGE failover execution (T=3 min target) # On FORGE (basicas@100.104.164.86) # 1. Verify latest DBs restored ls -la ~/system/databases/*.db | head -5 sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" # Compare to baseline — delta should be < 60s of writes # 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE) # (Alem updates in Tailscale admin console) # 3. Start pi-orchestrator on FORGE (if installed) # OR: update tier-routing.json to route all requests to forge endpoints # 4. Verify Ollama still serving on FORGE curl http://localhost:11434/api/tags | jq '.models | length' Step 5: Measure RPO # On FORGE after failover BASELINE=$(cat /tmp/drill-baseline-count.txt) # From Step 1 CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'") echo "Task count delta: $((BASELINE - CURRENT))" # Check last WAL segment timestamp in Azure az storage blob list \ --container-name system-db-backups \ --account-name alaibackups0ebb \ --prefix litestream/mission-control \ --auth-mode login \ --query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]" # Record last WAL segment time vs ANVIL shutdown time = actual RPO Step 6: Measure RTO RTO = time from "ANVIL confirmed down" to "FORGE serving requests with < 60s RPO data". Record timestamps at each step. Target: < 5 minutes total. Step 7: Restore ANVIL and verify # Start ANVIL back up # Verify litestream resumes replication tail -f /Users/makinja/system/logs/litestream.log # Verify FORGE restore loop detects ANVIL is back and no duplicate writes 8.3 Acceptance Criteria (Angie signs off when ALL pass) Criterion Target Measured Slack alert latency < 2 min 30s TBD FORGE DB data lag (RPO) < 60s TBD Time to FORGE serving (RTO) < 5 min TBD P0 DB count on FORGE 17 DBs TBD Ollama inference on FORGE Working (test prompt) TBD No data loss on ANVIL restart mission-control.db row count matches TBD 8.4 Findings Documentation After the drill, Angie produces a findings report: Actual RPO measured Actual RTO measured Any P0 DB that failed to restore Any daemon that did not restart on FORGE Recommendations for Phase 2 (automatic failover improvements) Phase 9 — Skillforge BookStack Runbook Specification This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA . 9.1 Required Sections 9.1.1 Overview Page System architecture diagram (ANVIL — Thunderbolt — FORGE — Azure Blob) Node inventory: ANVIL (96GB M3U), FORGE (256GB M3U), Azure (alaibackups0ebb) RPO/RTO targets and current measured values 9.1.2 Litestream Configuration How litestream works (WAL replication explained for non-experts) DB tier classification table (P0/P1/P2) with justification Retention policy per tier How to add a new DB to replication (step-by-step) How to verify replication is working: az storage blob list command + expected output Where logs live: /Users/makinja/system/logs/litestream.log and -error.log 9.1.3 FORGE Warm Standby What FORGE has installed (litestream, Ollama, models) How the restore loop works: script location, poll interval, log location How to verify FORGE is current: check DB timestamps against Azure last-modified How to SSH to FORGE from ANVIL 9.1.4 Failover Runbook (Step-by-Step) Pre-conditions checklist Decision tree: partial failure vs full ANVIL down Manual failover steps (numbered, copy-pasteable commands) DNS failover: how to update Tailscale MagicDNS Ollama failover: how to edit tier-routing.json on FORGE Expected time per step Rollback procedure: restoring ANVIL to primary 9.1.5 Failure Mode Catalog Failure Detection Response Recovery ANVIL Ollama crash ollama-health-monitor.json Tier routing auto-redirects to FORGE Restart com.john.ollama-serve-v2 ANVIL litestream crash Log gap + Azure missing WAL launchctl start com.alai.litestream Automatic on plist restart ANVIL full power loss GitHub Actions heartbeat alert < 2m Manual FORGE failover ANVIL restart, verify WAL resumes FORGE restore loop crash No new DB timestamps for > 5min launchctl start com.alai.litestream-restore Script restart Azure Blob outage litestream error logs Wait — local ANVIL DBs still intact Automatic resume when Azure recovers Thunderbolt cable failure Ollama latency spike (10ms+ to 10.0.0.2) Routes via Tailscale (100ms+ but functional) Replug Thunderbolt 9.1.6 Monitoring & Alerts GitHub Actions heartbeat: link to workflow, how to check last run Slack #ops: what alerts look like, who is responsible for response How to manually trigger a health check 9.1.7 Secrets & Credentials Azure SP: alai-backup-writer — where stored, how to rotate FORGE Bitwarden: how FORGE unlocks independently What to do if Bitwarden is inaccessible (break-glass: Azure credentials in plist) 9.1.8 DR Drill Schedule Quarterly drill required (next: 90 days after Phase 8 drill) Drill checklist (link to Phase 8 checklist above) Where to store drill findings (BookStack page: DR Drill Log) 9.2 Diagrams Required Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow Failover decision tree : Who detects, who acts, what order DB tier heatmap : Visual table of all 67 DBs colored by tier 9.3 BookStack Sync Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync daemon will keep it current thereafter. Implementation Order & Timeline Phase Description Owner Est. Hours Dependency 1 Litestream expansion (update yml, reload daemon) FlowForge 2h None 2 FORGE bootstrap (litestream install, DB dir, SP creds in plist) FlowForge 1h Phase 1 3 Continuous restore loop on FORGE FlowForge 2h Phase 2 4 Ollama health monitor daemon + failover config FlowForge + CodeCraft 3h Phase 3 5 Tailscale MagicDNS configuration FlowForge 1h None 6 GitHub Actions heartbeat workflow FlowForge 1h Phase 5 7 FORGE Bitwarden bootstrap FlowForge (Alem physical action) 30min Phase 2 8 Proveo DR drill Proveo (Angie Jones) 2h All phases done 9 BookStack runbook Skillforge 3h Phase 8 Total estimated implementation time: ~15.5 hours across 9 phases. Critical path: Phases 1 → 2 → 3 (unblock parallel: 4, 5, 6, 7) → 8 → 9. Risk Register Risk Likelihood Impact Mitigation litestream overloads Azure with 67 DBs at 1s interval Low Medium P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion FORGE disk fills with restored DBs Low Medium FORGE has 256GB RAM but internal SSD may vary — check df -h on FORGE before bootstrap Thunderbolt cable failure isolates FORGE Low Low Tailscale provides fallback path (100ms latency but functional) WAL segments corrupt between ANVIL write and FORGE restore Very Low High litestream uses SHA256 checksums on all WAL segments — corruption detected at restore Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write Medium Low litestream initializes on first write; these are pre-configured for when they get data GitHub Actions cron jitter (can skip minutes) Medium Low Two consecutive failures required before alert — single skip is acceptable Open Questions for Alem FORGE SSH access: SSH to FORGE (basicas@100.104.164.86) is currently failing due to "too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation. FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB of database files + WAL segments. df -h on FORGE before Phase 2. FORGE macOS user: Confirmed user is basicas . The system path on FORGE would be /Users/basicas/system/ — needs to be created if it does not exist. Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden). Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access (alembasic@gmail.com account). Alem configures this step. ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL. Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup. TL;DR FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86). No hardware purchase needed. Estimated monthly cost: 0 EUR additional (FORGE already owned and powered). Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs. GitHub Actions heartbeat: free tier. Total: < €1/month increase . Estimated implementation time: ~15.5 hours across 9 phases. Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR. Full HA with automatic failover and DR drill: ~13.5 hours additional. Immediate action (highest leverage): Phase 1 — update litestream.yml to cover all 67 DBs. This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours. FORGE restore is what converts the backup into an actual hot standby. Alem approval required before implementation.