ANVIL SPOF Elimination Plan (2026-04-20)
Status: DRAFT — Awaiting Proveo validation + Alem approval
Author: Kelsey Hightower / FlowForge
Date: 2026-04-20
MC Task: #8515 ANVIL SPOF elimination sprint
Deadline: 2026-05-01
ANVIL SPOF Elimination Plan
Author: FlowForge (Kelsey Hightower) | MC Task #8515
Date: 2026-04-20
Status: DRAFT — Awaiting Alem approval before any implementation
Executive Summary
ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage, kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference, all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage. RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases.
Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month additional infrastructure cost (FORGE is already owned and powered).
Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+)
Architecture Overview
ANVIL (primary) FORGE (warm standby)
Mac Studio M3 Ultra 96GB Mac Studio M3 Ultra 256GB
100.103.49.98 (Tailscale) 100.104.164.86 (Tailscale)
10.0.0.1 (Thunderbolt) 10.0.0.2 (Thunderbolt)
│ │
│ Thunderbolt Bridge (< 1ms) │
└────────────────────────────────-─┘
│
▼
Azure Blob Storage
alaibackups0ebb
system-db-backups container
(litestream WAL segments, all DBs)
All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore). FORGE does NOT write back to Azure. Azure is the single durable WAL store.
Phase 1 — Litestream Expansion (all ~67 DBs)
1.1 Database Tier Classification
Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only.
P0 — Mission Critical (system stops without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| mission-control.db | 26 MB | Very high | Primary task ledger — all MC operations. CURRENTLY REPLICATED. |
| hivemind.db | 162 MB | High | Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED. |
| tasks.db | 4 KB | High | Active task queue — active work in flight |
| costs.db | 256 KB | High | Token cost tracking, budget enforcement |
| events.db | 14 MB | High | System event bus — orchestrator depends on this |
| orchestrator-queue.db | 28 KB | High | Active agent job queue — jobs lost = work lost |
| orchestrator-workers.db | 36 KB | High | Worker state — active session tracking |
| durable-runner.db | 896 KB | Medium | Durable task execution state |
| session-index.db | 56 MB | High | Agent session state — all active sessions |
| knowledge.db | 192 MB | Medium | RAG knowledge base — primary retrieval corpus |
| emails.db | 0 B (active) | High | Email agent state — initialized on first write |
| email-inbox.db | 3.1 MB | High | Live email queue |
| alem-directives.db | active WAL | High | CEO directives — highest trust data |
P0 — Financial / Legal (loss = regulatory exposure)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| fiken.db | 0 B (active) | Medium | Fiken accounting integration — financial records |
| invoices.db | 36 KB | Medium | Invoice state — revenue tracking |
| contracts.db | 40 KB | Low | Signed contracts — legal documents |
| leads.db | 256 KB | Medium | Sales pipeline — business development |
P1 — Operational (system degrades without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| agent-routing.db | 4.1 MB | Medium | Routing decisions, agent assignment |
| bee-index.db | 4.2 MB | Medium | Bee task index |
| bih-tenders.db | 640 KB | Low | BiH market tenders — business intelligence |
| browser-tasks.db | active WAL | Medium | Browser automation queue |
| companies.db | 0 B (active) | Low | Company registry |
| contacts.db | 192 KB | Low | CRM contacts |
| deploy-registry.db | 16 KB | Low | Deployment history |
| design-reviews.db | 64 KB | Low | Design review state |
| distill.db | 2.0 MB | Medium | Knowledge distillation cache |
| documents.db | 32 KB | Low | Document registry |
| drafts.db | 360 KB | Medium | Draft content |
| drift.db | active WAL | Medium | Config drift detection |
| email-audit.db | 256 KB | Medium | Email audit trail |
| email-briefing.db | 0 B (active) | Low | Daily briefing state |
| email-index.db | 0 B (active) | Low | Email search index |
| email-tracking.db | 36 KB | Medium | Email delivery tracking |
| escalations.db | 24 KB | Medium | Escalation queue |
| facts.db | 20 KB | Low | System facts store |
| flywheel.db | 432 MB | Low | Flywheel learning data — largest DB |
| goals.db | 44 KB | Medium | OKR / goal tracking |
| guardrails-audit.db | 10 MB | Medium | Safety audit trail |
| health-events.db | 15 MB | High | System health events |
| hivemind-archive.db | 6.7 MB | Low | HiveMind historical archive |
| master-control.db | 0 B (active) | Medium | Master control state |
| mc.db | 0 B (active) | Medium | Mission control alias |
| minions.db | 192 KB | Medium | Minion agent registry |
| observability.db | 44 KB | Medium | Metrics and traces |
| orchestrator-events.db | 0 B (active) | Medium | Orchestrator event log |
| pipeline.db | active WAL | Medium | CI/CD pipeline state |
| projects.db | 40 KB | Low | Project registry |
| routing-outcomes.db | 192 KB | Medium | Tier routing outcome log |
| skill-improvements.db | 20 KB | Low | Skill improvement tracking |
| skill-registry.db | 128 KB | Low | Agent skill registry |
| sprint-pipeline.db | 32 KB | Medium | Sprint pipeline state |
| strategy-tracker.db | 128 KB | Low | Strategic initiative tracking |
| teams.db | 40 KB | Low | Team registry |
| tenders.db | 384 KB | Low | Norwegian tender data |
| tickets.db | active WAL | Medium | Support ticket tracking |
| tool-audit.db | 6.1 MB | Medium | Tool usage audit |
| tool-registry.db | 128 KB | Low | Tool registry |
| trace-events.db | 52 MB | High | Distributed trace store |
| applications-tracker.db | 12 KB | Low | Job/grant applications |
P2 — Cache / Reconstructible (loss = inconvenience only)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| baikal-caldav.db | 108 KB | Low | CalDAV cache — reconstructible from Baikal |
| prompt-cache.db | 320 KB | Medium | LLM prompt cache — can warm from scratch |
| prompt-metrics.db | 28 KB | Low | Prompt performance metrics |
| rag-cache.db | active WAL | Medium | RAG response cache — reconstructible |
| semantic-reuse-index.db | 192 KB | Medium | Semantic cache — reconstructible |
| stbs.db | 0 B (active) | Low | STBS data — empty |
| telemetry.db | 24 KB | Medium | Telemetry — can lose without ops impact |
| token-cost.db | active WAL | Medium | Cost log — reconstructible from API receipts |
| usage.db | 0 B (active) | Low | Usage tracking — empty |
| vcr.db | active WAL | Low | HTTP cassette cache — reconstructible |
1.2 Retention Strategy
Current retention for the 2 replicated DBs: 72h. This is insufficient for P0.
| Tier | Retention | Justification |
|---|---|---|
| P0 (mission-critical) | 7d | One week: covers weekend + Monday incident recovery. 72h is too tight — if a silent corruption is not caught in 3 days, all WAL segments are gone. |
| P0 (financial/legal) | 30d | Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows. |
| P1 | 72h | Current default. Operationally acceptable. |
| P2 | 24h | Cache data. Disk cost matters more than recovery depth. |
Retention-check-interval: 1h for all tiers (current default, correct).
Sync-interval: 1s for all tiers P0 and P1. 10s for P2 (reduce Azure transaction cost on low-value data).
Azure storage cost estimate at current sizes (~1.2 GB total databases):
- WAL segments are incremental. Estimate ~500 MB/day delta across all active DBs.
- 7-day P0 WAL: ~3.5 GB. 30-day financial: ~1 GB. P1 72h: ~1 GB.
- Total Azure Blob: ~6 GB. At ~€0.02/GB/month = ~€0.12/month. Negligible.
1.3 New litestream.yml
Path: /Users/makinja/system/config/litestream.yml
Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn.
Note on knowledge.db (192 MB): P0, sync-interval 1s — it's actively written by RAG ingestion.
# Litestream — SQLite streaming replication to Azure Blob Storage
# Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98)
# Config: /Users/makinja/system/config/litestream.yml
# Auth: Azure SP (alai-backup-writer) via client credentials
# SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669)
# SP has Storage Blob Data Contributor on system-db-backups container
# Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env
# Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block)
# Updated: 2026-04-20 — ANVIL SPOF Elimination Sprint (MC #8515)
#
# Tier reference:
# P0-critical: retention 7d, sync 1s
# P0-financial: retention 30d, sync 1s
# P1: retention 72h, sync 1s (or 30s for large DBs)
# P2: retention 24h, sync 10s
dbs:
# ── P0 MISSION CRITICAL ──────────────────────────────────────────────────────
- path: /Users/makinja/system/databases/mission-control.db
replicas:
- name: mc-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/mission-control
retention: 168h # 7 days
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/hivemind.db
replicas:
- name: hivemind-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/hivemind
retention: 168h # 7 days
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/tasks.db
replicas:
- name: tasks-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/tasks
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/costs.db
replicas:
- name: costs-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/costs
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/events.db
replicas:
- name: events-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/events
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/orchestrator-queue.db
replicas:
- name: orch-queue-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/orchestrator-queue
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/orchestrator-workers.db
replicas:
- name: orch-workers-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/orchestrator-workers
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/durable-runner.db
replicas:
- name: durable-runner-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/durable-runner
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/session-index.db
replicas:
- name: session-index-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/session-index
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/knowledge.db
replicas:
- name: knowledge-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/knowledge
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/emails.db
replicas:
- name: emails-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/emails
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/email-inbox.db
replicas:
- name: email-inbox-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/email-inbox
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/alem-directives.db
replicas:
- name: alem-directives-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/alem-directives
retention: 168h
retention-check-interval: 1h
sync-interval: 1s
# ── P0 FINANCIAL / LEGAL ─────────────────────────────────────────────────────
- path: /Users/makinja/system/databases/fiken.db
replicas:
- name: fiken-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/fiken
retention: 720h # 30 days
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/invoices.db
replicas:
- name: invoices-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/invoices
retention: 720h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/contracts.db
replicas:
- name: contracts-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/contracts
retention: 720h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/leads.db
replicas:
- name: leads-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/leads
retention: 720h
retention-check-interval: 1h
sync-interval: 1s
# ── P1 OPERATIONAL ───────────────────────────────────────────────────────────
- path: /Users/makinja/system/databases/agent-routing.db
replicas:
- name: agent-routing-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/agent-routing
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/bee-index.db
replicas:
- name: bee-index-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/bee-index
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/bih-tenders.db
replicas:
- name: bih-tenders-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/bih-tenders
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/browser-tasks.db
replicas:
- name: browser-tasks-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/browser-tasks
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/companies.db
replicas:
- name: companies-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/companies
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/contacts.db
replicas:
- name: contacts-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/contacts
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/deploy-registry.db
replicas:
- name: deploy-registry-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/deploy-registry
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/design-reviews.db
replicas:
- name: design-reviews-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/design-reviews
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/distill.db
replicas:
- name: distill-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/distill
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/documents.db
replicas:
- name: documents-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/documents
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/drafts.db
replicas:
- name: drafts-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/drafts
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/drift.db
replicas:
- name: drift-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/drift
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/email-audit.db
replicas:
- name: email-audit-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/email-audit
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/email-briefing.db
replicas:
- name: email-briefing-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/email-briefing
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/email-index.db
replicas:
- name: email-index-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/email-index
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/email-tracking.db
replicas:
- name: email-tracking-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/email-tracking
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/escalations.db
replicas:
- name: escalations-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/escalations
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/facts.db
replicas:
- name: facts-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/facts
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/flywheel.db
replicas:
- name: flywheel-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/flywheel
retention: 72h
retention-check-interval: 1h
sync-interval: 30s # 432MB — throttle sync to reduce Azure transactions
- path: /Users/makinja/system/databases/goals.db
replicas:
- name: goals-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/goals
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/guardrails-audit.db
replicas:
- name: guardrails-audit-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/guardrails-audit
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/health-events.db
replicas:
- name: health-events-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/health-events
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/hivemind-archive.db
replicas:
- name: hivemind-archive-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/hivemind-archive
retention: 72h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/master-control.db
replicas:
- name: master-control-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/master-control
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/mc.db
replicas:
- name: mc-db-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/mc-db
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/minions.db
replicas:
- name: minions-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/minions
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/observability.db
replicas:
- name: observability-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/observability
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/orchestrator-events.db
replicas:
- name: orch-events-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/orchestrator-events
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/pipeline.db
replicas:
- name: pipeline-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/pipeline
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/projects.db
replicas:
- name: projects-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/projects
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/routing-outcomes.db
replicas:
- name: routing-outcomes-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/routing-outcomes
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/skill-improvements.db
replicas:
- name: skill-improvements-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/skill-improvements
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/skill-registry.db
replicas:
- name: skill-registry-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/skill-registry
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/sprint-pipeline.db
replicas:
- name: sprint-pipeline-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/sprint-pipeline
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/strategy-tracker.db
replicas:
- name: strategy-tracker-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/strategy-tracker
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/teams.db
replicas:
- name: teams-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/teams
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/tenders.db
replicas:
- name: tenders-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/tenders
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/tickets.db
replicas:
- name: tickets-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/tickets
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/tool-audit.db
replicas:
- name: tool-audit-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/tool-audit
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/tool-registry.db
replicas:
- name: tool-registry-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/tool-registry
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/trace-events.db
replicas:
- name: trace-events-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/trace-events
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
- path: /Users/makinja/system/databases/applications-tracker.db
replicas:
- name: applications-tracker-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/applications-tracker
retention: 72h
retention-check-interval: 1h
sync-interval: 1s
# ── P2 CACHE / RECONSTRUCTIBLE ───────────────────────────────────────────────
- path: /Users/makinja/system/databases/baikal-caldav.db
replicas:
- name: baikal-caldav-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/baikal-caldav
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/prompt-cache.db
replicas:
- name: prompt-cache-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/prompt-cache
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/prompt-metrics.db
replicas:
- name: prompt-metrics-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/prompt-metrics
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/semantic-reuse-index.db
replicas:
- name: semantic-reuse-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/semantic-reuse-index
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/stbs.db
replicas:
- name: stbs-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/stbs
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/telemetry.db
replicas:
- name: telemetry-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/telemetry
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/token-cost.db
replicas:
- name: token-cost-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/token-cost
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/usage.db
replicas:
- name: usage-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/usage
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
- path: /Users/makinja/system/databases/vcr.db
replicas:
- name: vcr-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/vcr
retention: 24h
retention-check-interval: 1h
sync-interval: 10s
1.4 Implementation Steps (ANVIL)
- Stop litestream:
launchctl stop com.alai.litestream - Replace
/Users/makinja/system/config/litestream.ymlwith the config above. - Validate config:
/opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate - Start litestream:
launchctl start com.alai.litestream - Verify all DBs appear in Azure:
az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --query "[].name" | wc -l(expect ~67+ entries). - Watch logs for errors:
tail -f /Users/makinja/system/logs/litestream-error.log
Phase 2 — FORGE Hardware / OS Decision
2.1 FORGE Already Exists — Hardware Decision Is Made
FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86. User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b, deepseek-r1:70b, qwen3-coder, and bge-m3.
No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned).
2.2 Why FORGE Wins Over Every Alternative
| Option | Cost/mo | Latency to ANVIL | Apple Silicon | macOS parity | Verdict |
|---|---|---|---|---|---|
| FORGE (Mac Studio M3U 256GB, owned) | 0 EUR | < 1ms (Thunderbolt) | Yes (M3 Ultra) | Yes (same LaunchAgent ecosystem) | CHOSEN |
| Mac Mini M4 Pro (purchase) | ~50 EUR amortized | < 1ms if local | Yes | Yes | Redundant — FORGE exists |
| Hetzner Linux VM (CCX33) | ~30-50 EUR | 10-30ms (internet) | No (x86) | No (systemd, not launchd) | Budget option only if FORGE fails |
| Azure VM (Sweden Central) | ~60-80 EUR | 10-30ms | No | No | Closest to Azure storage but no Apple Silicon |
Decision: Use FORGE as warm standby. Zero additional cost. Thunderbolt latency is effectively local — litestream WAL replication will complete in well under 60s.
2.3 FORGE Bootstrap Prerequisites
FORGE already runs Ollama. What is missing:
- litestream installed on FORGE (check:
brew list litestreamon basicas@FORGE) - Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
~/system/databases/directory created on FORGElitestream-restore.shdaemon script written and loaded as LaunchAgent on FORGE- SSH key access from ANVIL to FORGE for health check and failover scripts
Phase 3 — Continuous Restore on FORGE (< 60s RPO)
3.1 Architecture
FORGE runs litestream restore in a watch loop per database. Litestream 0.5.x does not have
a native watch mode — it restores a snapshot + WAL segments. The recommended approach is
a shell script loop that calls litestream restore repeatedly with a short interval.
However, litestream does support a second process pattern: run litestream replicate on FORGE
pointing at the SAME Azure bucket paths, but configured as a replica-only consumer. This is the
correct approach: FORGE runs a litestream restore daemon that continuously polls for new WAL
segments from Azure.
3.2 Continuous Restore Strategy
Use litestream restore with the -if-replica-exists flag in a loop:
#!/usr/bin/env bash
# /Users/basicas/system/scripts/litestream-restore-loop.sh
# Runs on FORGE. Continuously restores all P0+P1 DBs from Azure.
# Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target)
set -euo pipefail
LITESTREAM=/opt/homebrew/bin/litestream
CONFIG=/Users/basicas/system/config/litestream-restore.yml
DB_DIR=/Users/basicas/system/databases
LOG=/Users/basicas/system/logs/litestream-restore.log
INTERVAL=30 # seconds between restore cycles
while true; do
echo "[$(date -Iseconds)] Starting restore cycle" >> "$LOG"
# Restore each DB defined in restore config
# litestream restore will only apply new WAL segments if DB already exists
$LITESTREAM restore -config "$CONFIG" -if-replica-exists >> "$LOG" 2>&1 || true
echo "[$(date -Iseconds)] Restore cycle complete, sleeping ${INTERVAL}s" >> "$LOG"
sleep "$INTERVAL"
done
3.3 FORGE litestream-restore.yml
A separate config file on FORGE that mirrors ANVIL's litestream.yml but uses restore semantics.
FORGE is READ-ONLY consumer. It never writes back to Azure.
Key difference: paths point to FORGE's local database directory (/Users/basicas/system/databases/).
The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to.
# /Users/basicas/system/config/litestream-restore.yml
# FORGE warm standby — continuous restore from Azure
# DO NOT run litestream replicate with this config — restore only
dbs:
- path: /Users/basicas/system/databases/mission-control.db
replicas:
- name: mc-abs
type: abs
endpoint: https://alaibackups0ebb.blob.core.windows.net
bucket: system-db-backups
path: litestream/mission-control
# ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL)
# P2 DBs: omit from restore config — not worth continuous restore overhead
3.4 FORGE LaunchAgent for Restore Loop
Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.alai.litestream-restore</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>/Users/basicas/system/scripts/litestream-restore-loop.sh</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>AZURE_STORAGE_ACCOUNT</key>
<string>alaibackups0ebb</string>
<key>AZURE_CLIENT_ID</key>
<string>1a0b3018-0c31-474b-918f-531b0a29a669</string>
<key>AZURE_CLIENT_SECRET</key>
<string>RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP</string>
<key>AZURE_TENANT_ID</key>
<string>3454a03f-20b4-4bda-a116-2293c459aecd</string>
</dict>
<key>KeepAlive</key>
<true/>
<key>RunAtLoad</key>
<true/>
<key>StandardOutPath</key>
<string>/Users/basicas/system/logs/litestream-restore.log</string>
<key>StandardErrorPath</key>
<string>/Users/basicas/system/logs/litestream-restore-error.log</string>
<key>ThrottleInterval</key>
<integer>10</integer>
</dict>
</plist>
3.5 RPO Calculation
- ANVIL litestream sync-interval: 1s (WAL segment flushed to Azure every 1s for P0)
- FORGE restore poll interval: 30s
- Azure propagation: < 1s (same-region, in-blob operations)
- Worst-case RPO: 31s (well under 60s target)
- Expected average RPO: ~15-20s
Phase 4 — Ollama Failover Tier Routing
4.1 Current State
Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the
primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ.
The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths.
The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down, and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down.
4.2 Failover Config Extension
Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block:
"ollamaHosts": {
"anvil": {
"url": "http://localhost:11434",
"tailscale_url": "http://100.103.49.98:11434",
"health_path": "/api/tags",
"health_timeout_ms": 3000,
"role": "primary-infra"
},
"forge": {
"url": "http://10.0.0.2:11434",
"tailscale_url": "http://100.104.164.86:11434",
"health_path": "/api/tags",
"health_timeout_ms": 3000,
"role": "primary-compute"
}
},
"failoverRules": {
"anvil-down": {
"redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"],
"to_forge_models": {
"llama3.1:8b": "llama3.1:8b",
"qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0"
},
"note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents"
},
"forge-down": {
"redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"],
"to_claude": true,
"note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable — FORGE failure is rare)"
}
}
4.3 Health Check Daemon
A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes
status to a JSON file that ollama-engine.js reads before routing:
Path: /Users/makinja/system/daemons/ollama-health-monitor.js
// Pseudocode — implementation by CodeCraft
// Runs every 15s, writes to /tmp/ollama-health.json
// {
// "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" },
// "forge": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }
// }
// tier-router.js reads this file before every dispatch
// If anvil.healthy === false: redirect tier 1/2 requests to forge
// If forge.healthy === false: redirect tier 2c/3 requests to claude
4.4 Manual Failover Command
For Phase 1 (before automatic failover is implemented):
# On ANVIL, when FORGE is down — force all routing to ANVIL
echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json
# When ANVIL is down, from FORGE (if FORGE has ollama-engine.js):
# Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge"
Phase 5 — DNS / Service Discovery
5.1 Options Evaluated
| Option | Mechanism | Failover Speed | Complexity | Cost |
|---|---|---|---|---|
| Tailscale MagicDNS | DNS record swap via Tailscale API | Manual: ~1 min | Low | Free |
| Cloudflare DNS + health check | CF Load Balancer health-check → DNS swap | Automatic: ~30s | Medium | ~$5/month |
| Local /etc/hosts on each node | Static entries, no automatic failover | Manual: ~1 min | None | Free |
| Cloudflare Tunnel alias | DNS alias behind CF Tunnel | ~30s | Medium | Free tier |
5.2 Recommendation: Tailscale MagicDNS
Chosen: Tailscale MagicDNS with manual DNS swap.
Rationale:
- All nodes (ANVIL, FORGE, ab-mac) are already on the same Tailscale network.
- Tailscale MagicDNS can assign a hostname
anvil.alai.internal(or use the device name directly). - Current hardcoded addresses (
localhost:11434,10.0.0.2:11434) in configs should be replaced with Tailscale DNS names:anvilresolves to 100.103.49.98,forgeresolves to 100.104.164.86. - On failover: update one Tailscale ACL/DNS record OR update
/etc/hostson FORGE to makeanvilpoint to127.0.0.1(making FORGE answer for anvil traffic locally).
Implementation:
- In Tailscale admin console: verify MagicDNS is enabled for the tailnet.
- Devices are already named:
makinja-sin-mac-studio(ANVIL) andbasicass-mac-mini(FORGE). - Add a Tailscale DNS override:
anvil.alai→ 100.103.49.98 (ANVIL primary). - Add to all tool configs: replace
localhost:11434withanvil.alai:11434,10.0.0.2:11434withforge.alai:11434. - Failover procedure: update Tailscale DNS record
anvil.alai→ 100.104.164.86 (FORGE). This takes effect across all nodes within ~30s (Tailscale DNS TTL).
Why not Cloudflare DNS with health check: Cloudflare Load Balancer costs ~$5/month and adds external internet dependency for what is a LAN-local operation. Overkill for current scale. Revisit if ALAI adds a third node outside the LAN.
Phase 6 — External Heartbeat
6.1 Requirement
An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls).
6.2 Mechanism: GitHub Actions Cron (Recommended)
Chosen: GitHub Actions scheduled workflow. Cost: free (GitHub public repo or private with Actions minutes). No Azure Function setup required.
# .github/workflows/anvil-heartbeat.yml
# In a private ALAI GitHub repo (e.g., alai-infra or system-health)
name: ANVIL Heartbeat
on:
schedule:
- cron: '* * * * *' # Every minute
jobs:
heartbeat:
runs-on: ubuntu-latest
timeout-minutes: 1
steps:
- name: Check ANVIL health via Tailscale
id: health
run: |
# ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL
# Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL)
# Option B: Use Tailscale GitHub Action to join the tailnet and check directly
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 \
--max-time 15 \
${{ secrets.ANVIL_HEALTH_URL }})
echo "status=$STATUS" >> $GITHUB_OUTPUT
- name: Alert Slack if down
if: steps.health.outputs.status != '200'
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"channel": "#ops",
"text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ github.run_started_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed."
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
6.3 ANVIL Health Endpoint
ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel) or via Tailscale GitHub Action. The simplest approach:
Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port
8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."}.
Expose via existing Cloudflare Tunnel infrastructure.
6.4 Alert Escalation
- 2 consecutive failures (2 minutes down): Slack #ops message.
- 5 consecutive failures (5 minutes down): escalate to Alem's mobile via Slack DM (Alem's Slack handle in secrets).
6.5 Azure Function Alternative
Azure Function with Timer trigger (every 60s) is viable but requires:
- Azure subscription billing (Consumption plan: ~$0/month for < 1M executions — effectively free)
- Azure Function App deployment and maintenance
- More setup complexity than GitHub Actions
Verdict: GitHub Actions preferred for simplicity. Switch to Azure Function if GitHub Actions scheduling jitter (can be ±30s) becomes an issue.
Phase 7 — Shared Secrets (FORGE Bitwarden Access)
7.1 Problem
FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session.
7.2 Options
| Option | Description | Risk |
|---|---|---|
| Separate BW account on FORGE | FORGE has its own Bitwarden account with shared collection | Low — independent |
| Shared BW session sync | ANVIL writes /tmp/bw-session to FORGE via rsync | Medium — session expires |
| Azure Key Vault break-glass | Critical secrets in AKV, FORGE SP can read them | Low — Azure dependency |
| Environment variables in plist | Secrets baked into LaunchAgent plist on FORGE | Low but plaintext risk |
7.3 Recommendation: Two-Layer Approach
Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently.
- FORGE has
bwCLI installed. - FORGE has its own BW_SESSION set via a one-time manual bootstrap:
bw login --apikeyusing a FORGE-specific API key (Bitwarden supports API keys per user/device). - Session is stored in
/Users/basicas/.bw-sessionand refreshed by a LaunchAgent on FORGE. - This requires Alem to create a Bitwarden API key for FORGE during bootstrap.
Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap.
- The Azure SP secret (
AZURE_CLIENT_SECRET) is placed directly in thecom.alai.litestream-restore.plistEnvironmentVariables block — same pattern as ANVIL. - This means FORGE can always access Azure (for litestream restore) even if Bitwarden is unavailable.
- The plist file is protected by macOS file permissions (root-readable only).
- This is the same pattern already in use on ANVIL (confirmed in the plist we read).
Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets.
- Create a new SP
alai-forge-readerwith Key Vault Secrets User role. - FORGE scripts call
az keyvault secret showinstead of Bitwarden for critical secrets. - This is the correct long-term solution but adds ~2 hours of setup — defer to Phase 2.
7.4 Bootstrap Sequence for FORGE Secrets
# On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge):
# 1. Install bw CLI
brew install bitwarden-cli
# 2. Login with API key (avoids interactive login)
export BW_CLIENTID="<forge-api-key-id from Bitwarden>"
export BW_CLIENTSECRET="<forge-api-key-secret>"
bw login --apikey
bw unlock --passwordenv BW_MASTER_PASSWORD # or interactive
# 3. Store session
bw unlock > /Users/basicas/.bw-session
# 4. Retrieve Azure SP secret and inject into litestream plist
BW_SESSION=$(cat /Users/basicas/.bw-session)
AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION")
# Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET
Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task)
This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill after all phases are implemented. This is a REAL drill — not a dry run.
8.1 Pre-Drill Prerequisites
- Phase 1 complete: all ~67 DBs replicating to Azure (verify with
az storage blob listcount) - Phase 3 complete: FORGE restore loop running, confirmed by checking FORGE DB file timestamps
- Phase 4 complete: Ollama health monitor daemon running on ANVIL
- Phase 5 complete: Tailscale MagicDNS configured (
anvil.alairesolves correctly) - Phase 6 complete: GitHub Actions heartbeat workflow deployed and sending test ping
- Phase 7 complete: FORGE Bitwarden session independently functional
8.2 Drill Procedure
Step 1: Establish baseline (T=0)
# On ANVIL — record current state
node ~/system/tools/mc.js stats # Record open task count
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" # Record
date -Iseconds > /tmp/drill-start.txt
Step 2: Simulate ANVIL failure
# Graceful shutdown (simulates power outage or kernel panic recovery)
# DO NOT run on production without Alem present
sudo shutdown -h now # Or: launchctl stop all non-essential services
# Alternative: kill Ollama + stop litestream + stop pi-orchestrator (partial failure sim)
launchctl stop com.alai.litestream
launchctl stop com.john.pi-orchestrator
launchctl stop com.john.ollama-serve-v2
Step 3: Measure time to alert (T=2 min)
- GitHub Actions heartbeat should fire within 2 minutes of ANVIL going offline.
- Angie records: timestamp of Slack #ops alert arrival.
- Expected: < 2 min 30s from shutdown to Slack alert.
Step 4: FORGE failover execution (T=3 min target)
# On FORGE ([email protected])
# 1. Verify latest DBs restored
ls -la ~/system/databases/*.db | head -5
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'"
# Compare to baseline — delta should be < 60s of writes
# 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE)
# (Alem updates in Tailscale admin console)
# 3. Start pi-orchestrator on FORGE (if installed)
# OR: update tier-routing.json to route all requests to forge endpoints
# 4. Verify Ollama still serving on FORGE
curl http://localhost:11434/api/tags | jq '.models | length'
Step 5: Measure RPO
# On FORGE after failover
BASELINE=$(cat /tmp/drill-baseline-count.txt) # From Step 1
CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'")
echo "Task count delta: $((BASELINE - CURRENT))"
# Check last WAL segment timestamp in Azure
az storage blob list \
--container-name system-db-backups \
--account-name alaibackups0ebb \
--prefix litestream/mission-control \
--auth-mode login \
--query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]"
# Record last WAL segment time vs ANVIL shutdown time = actual RPO
Step 6: Measure RTO
- RTO = time from "ANVIL confirmed down" to "FORGE serving requests with < 60s RPO data".
- Record timestamps at each step. Target: < 5 minutes total.
Step 7: Restore ANVIL and verify
# Start ANVIL back up
# Verify litestream resumes replication
tail -f /Users/makinja/system/logs/litestream.log
# Verify FORGE restore loop detects ANVIL is back and no duplicate writes
8.3 Acceptance Criteria (Angie signs off when ALL pass)
| Criterion | Target | Measured |
|---|---|---|
| Slack alert latency | < 2 min 30s | TBD |
| FORGE DB data lag (RPO) | < 60s | TBD |
| Time to FORGE serving (RTO) | < 5 min | TBD |
| P0 DB count on FORGE | 17 DBs | TBD |
| Ollama inference on FORGE | Working (test prompt) | TBD |
| No data loss on ANVIL restart | mission-control.db row count matches | TBD |
8.4 Findings Documentation
After the drill, Angie produces a findings report:
- Actual RPO measured
- Actual RTO measured
- Any P0 DB that failed to restore
- Any daemon that did not restart on FORGE
- Recommendations for Phase 2 (automatic failover improvements)
Phase 9 — Skillforge BookStack Runbook Specification
This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page
at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA.
9.1 Required Sections
9.1.1 Overview Page
- System architecture diagram (ANVIL — Thunderbolt — FORGE — Azure Blob)
- Node inventory: ANVIL (96GB M3U), FORGE (256GB M3U), Azure (alaibackups0ebb)
- RPO/RTO targets and current measured values
9.1.2 Litestream Configuration
- How litestream works (WAL replication explained for non-experts)
- DB tier classification table (P0/P1/P2) with justification
- Retention policy per tier
- How to add a new DB to replication (step-by-step)
- How to verify replication is working:
az storage blob listcommand + expected output - Where logs live:
/Users/makinja/system/logs/litestream.logand-error.log
9.1.3 FORGE Warm Standby
- What FORGE has installed (litestream, Ollama, models)
- How the restore loop works: script location, poll interval, log location
- How to verify FORGE is current: check DB timestamps against Azure last-modified
- How to SSH to FORGE from ANVIL
9.1.4 Failover Runbook (Step-by-Step)
- Pre-conditions checklist
- Decision tree: partial failure vs full ANVIL down
- Manual failover steps (numbered, copy-pasteable commands)
- DNS failover: how to update Tailscale MagicDNS
- Ollama failover: how to edit tier-routing.json on FORGE
- Expected time per step
- Rollback procedure: restoring ANVIL to primary
9.1.5 Failure Mode Catalog
| Failure | Detection | Response | Recovery |
|---|---|---|---|
| ANVIL Ollama crash | ollama-health-monitor.json | Tier routing auto-redirects to FORGE | Restart com.john.ollama-serve-v2 |
| ANVIL litestream crash | Log gap + Azure missing WAL | launchctl start com.alai.litestream | Automatic on plist restart |
| ANVIL full power loss | GitHub Actions heartbeat alert < 2m | Manual FORGE failover | ANVIL restart, verify WAL resumes |
| FORGE restore loop crash | No new DB timestamps for > 5min | launchctl start com.alai.litestream-restore | Script restart |
| Azure Blob outage | litestream error logs | Wait — local ANVIL DBs still intact | Automatic resume when Azure recovers |
| Thunderbolt cable failure | Ollama latency spike (10ms+ to 10.0.0.2) | Routes via Tailscale (100ms+ but functional) | Replug Thunderbolt |
9.1.6 Monitoring & Alerts
- GitHub Actions heartbeat: link to workflow, how to check last run
- Slack #ops: what alerts look like, who is responsible for response
- How to manually trigger a health check
9.1.7 Secrets & Credentials
- Azure SP: alai-backup-writer — where stored, how to rotate
- FORGE Bitwarden: how FORGE unlocks independently
- What to do if Bitwarden is inaccessible (break-glass: Azure credentials in plist)
9.1.8 DR Drill Schedule
- Quarterly drill required (next: 90 days after Phase 8 drill)
- Drill checklist (link to Phase 8 checklist above)
- Where to store drill findings (BookStack page: DR Drill Log)
9.2 Diagrams Required
- Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow
- Failover decision tree: Who detects, who acts, what order
- DB tier heatmap: Visual table of all 67 DBs colored by tier
9.3 BookStack Sync
Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and
triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync
daemon will keep it current thereafter.
Implementation Order & Timeline
| Phase | Description | Owner | Est. Hours | Dependency |
|---|---|---|---|---|
| 1 | Litestream expansion (update yml, reload daemon) | FlowForge | 2h | None |
| 2 | FORGE bootstrap (litestream install, DB dir, SP creds in plist) | FlowForge | 1h | Phase 1 |
| 3 | Continuous restore loop on FORGE | FlowForge | 2h | Phase 2 |
| 4 | Ollama health monitor daemon + failover config | FlowForge + CodeCraft | 3h | Phase 3 |
| 5 | Tailscale MagicDNS configuration | FlowForge | 1h | None |
| 6 | GitHub Actions heartbeat workflow | FlowForge | 1h | Phase 5 |
| 7 | FORGE Bitwarden bootstrap | FlowForge (Alem physical action) | 30min | Phase 2 |
| 8 | Proveo DR drill | Proveo (Angie Jones) | 2h | All phases done |
| 9 | BookStack runbook | Skillforge | 3h | Phase 8 |
Total estimated implementation time: ~15.5 hours across 9 phases. Critical path: Phases 1 → 2 → 3 (unblock parallel: 4, 5, 6, 7) → 8 → 9.
Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| litestream overloads Azure with 67 DBs at 1s interval | Low | Medium | P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion |
| FORGE disk fills with restored DBs | Low | Medium | FORGE has 256GB RAM but internal SSD may vary — check df -h on FORGE before bootstrap |
| Thunderbolt cable failure isolates FORGE | Low | Low | Tailscale provides fallback path (100ms latency but functional) |
| WAL segments corrupt between ANVIL write and FORGE restore | Very Low | High | litestream uses SHA256 checksums on all WAL segments — corruption detected at restore |
| Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write | Medium | Low | litestream initializes on first write; these are pre-configured for when they get data |
| GitHub Actions cron jitter (can skip minutes) | Medium | Low | Two consecutive failures required before alert — single skip is acceptable |
Open Questions for Alem
-
FORGE SSH access: SSH to FORGE ([email protected]) is currently failing due to "too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation.
-
FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB of database files + WAL segments.
df -hon FORGE before Phase 2. -
FORGE macOS user: Confirmed user is
basicas. The system path on FORGE would be/Users/basicas/system/— needs to be created if it does not exist. -
Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden).
-
Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access ([email protected] account). Alem configures this step.
-
ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL. Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup.
TL;DR
FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86). No hardware purchase needed.
Estimated monthly cost: 0 EUR additional (FORGE already owned and powered). Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs. GitHub Actions heartbeat: free tier. Total: < €1/month increase.
Estimated implementation time: ~15.5 hours across 9 phases. Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR. Full HA with automatic failover and DR drill: ~13.5 hours additional.
Immediate action (highest leverage): Phase 1 — update litestream.yml to cover all 67 DBs. This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours. FORGE restore is what converts the backup into an actual hot standby.
Alem approval required before implementation.
No comments to display
No comments to display