ANVIL SPOF Elimination Plan (2026-04-20)
Status: DRAFT — Awaiting Proveo validation + Alem approval
Author: Kelsey Hightower / FlowForge
Date: 2026-04-20
MC Task: #8515 ANVIL SPOF elimination sprint
Deadline: 2026-05-01
Executive Summary
ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage,
kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference,
all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage.
RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases.
Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via
Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible
via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month
additional infrastructure cost (FORGE is already owned and powered).
Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+)
Architecture Overview
ANVIL (primary)                    FORGE (warm standby)
Mac Studio M3 Ultra 96GB           Mac Studio M3 Ultra 256GB
100.103.49.98 (Tailscale)          100.104.164.86 (Tailscale)
10.0.0.1 (Thunderbolt)             10.0.0.2 (Thunderbolt)
        │                                  │
        │    Thunderbolt Bridge (< 1ms)    │
        └──────────────────────────────────┘
                          │
                          ▼
                 Azure Blob Storage
                   alaibackups0ebb
             system-db-backups container
         (litestream WAL segments, all DBs)
All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore).
FORGE does NOT write back to Azure. Azure is the single durable WAL store.
Phase 1 — Litestream Expansion (all ~67 DBs)
1.1 Database Tier Classification
Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only.
P0 — Mission Critical (system stops without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| mission-control.db | 26 MB | Very high | Primary task ledger for all MC operations. CURRENTLY REPLICATED. |
| hivemind.db | 162 MB | High | Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED. |
| tasks.db | 4 KB | High | Active task queue: work in flight |
| costs.db | 256 KB | High | Token cost tracking, budget enforcement |
| events.db | 14 MB | High | System event bus; the orchestrator depends on it |
| orchestrator-queue.db | 28 KB | High | Active agent job queue: jobs lost = work lost |
| orchestrator-workers.db | 36 KB | High | Worker state: active session tracking |
| durable-runner.db | 896 KB | Medium | Durable task execution state |
| session-index.db | 56 MB | High | Agent session state for all active sessions |
| knowledge.db | 192 MB | Medium | RAG knowledge base: primary retrieval corpus |
| emails.db | 0 B (active) | High | Email agent state, initialized on first write |
| email-inbox.db | 3.1 MB | High | Live email queue |
| alem-directives.db | active WAL | High | CEO directives: highest-trust data |
P0 — Financial / Legal (loss = regulatory exposure)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| fiken.db | 0 B (active) | Medium | Fiken accounting integration: financial records |
| invoices.db | 36 KB | Medium | Invoice state: revenue tracking |
| contracts.db | 40 KB | Low | Signed contracts: legal documents |
| leads.db | 256 KB | Medium | Sales pipeline: business development |
P1 — Operational (system degrades without these)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| agent-routing.db | 4.1 MB | Medium | Routing decisions, agent assignment |
| bee-index.db | 4.2 MB | Medium | Bee task index |
| bih-tenders.db | 640 KB | Low | BiH market tenders: business intelligence |
| browser-tasks.db | active WAL | Medium | Browser automation queue |
| companies.db | 0 B (active) | Low | Company registry |
| contacts.db | 192 KB | Low | CRM contacts |
| deploy-registry.db | 16 KB | Low | Deployment history |
| design-reviews.db | 64 KB | Low | Design review state |
| distill.db | 2.0 MB | Medium | Knowledge distillation cache |
| documents.db | 32 KB | Low | Document registry |
| drafts.db | 360 KB | Medium | Draft content |
| drift.db | active WAL | Medium | Config drift detection |
| email-audit.db | 256 KB | Medium | Email audit trail |
| email-briefing.db | 0 B (active) | Low | Daily briefing state |
| email-index.db | 0 B (active) | Low | Email search index |
| email-tracking.db | 36 KB | Medium | Email delivery tracking |
| escalations.db | 24 KB | Medium | Escalation queue |
| facts.db | 20 KB | Low | System facts store |
| flywheel.db | 432 MB | Low | Flywheel learning data (largest DB) |
| goals.db | 44 KB | Medium | OKR / goal tracking |
| guardrails-audit.db | 10 MB | Medium | Safety audit trail |
| health-events.db | 15 MB | High | System health events |
| hivemind-archive.db | 6.7 MB | Low | HiveMind historical archive |
| master-control.db | 0 B (active) | Medium | Master control state |
| mc.db | 0 B (active) | Medium | Mission control alias |
| minions.db | 192 KB | Medium | Minion agent registry |
| observability.db | 44 KB | Medium | Metrics and traces |
| orchestrator-events.db | 0 B (active) | Medium | Orchestrator event log |
| pipeline.db | active WAL | Medium | CI/CD pipeline state |
| projects.db | 40 KB | Low | Project registry |
| routing-outcomes.db | 192 KB | Medium | Tier routing outcome log |
| skill-improvements.db | 20 KB | Low | Skill improvement tracking |
| skill-registry.db | 128 KB | Low | Agent skill registry |
| sprint-pipeline.db | 32 KB | Medium | Sprint pipeline state |
| strategy-tracker.db | 128 KB | Low | Strategic initiative tracking |
| teams.db | 40 KB | Low | Team registry |
| tenders.db | 384 KB | Low | Norwegian tender data |
| tickets.db | active WAL | Medium | Support ticket tracking |
| tool-audit.db | 6.1 MB | Medium | Tool usage audit |
| tool-registry.db | 128 KB | Low | Tool registry |
| trace-events.db | 52 MB | High | Distributed trace store |
| applications-tracker.db | 12 KB | Low | Job/grant applications |
P2 — Cache / Reconstructible (loss = inconvenience only)
| Database | Size | Write Freq | Justification |
|---|---|---|---|
| baikal-caldav.db | 108 KB | Low | CalDAV cache, reconstructible from Baikal |
| prompt-cache.db | 320 KB | Medium | LLM prompt cache, can warm from scratch |
| prompt-metrics.db | 28 KB | Low | Prompt performance metrics |
| rag-cache.db | active WAL | Medium | RAG response cache, reconstructible |
| semantic-reuse-index.db | 192 KB | Medium | Semantic cache, reconstructible |
| stbs.db | 0 B (active) | Low | STBS data (empty) |
| telemetry.db | 24 KB | Medium | Telemetry, can lose without ops impact |
| token-cost.db | active WAL | Medium | Cost log, reconstructible from API receipts |
| usage.db | 0 B (active) | Low | Usage tracking (empty) |
| vcr.db | active WAL | Low | HTTP cassette cache, reconstructible |
1.2 Retention Strategy
Current retention for the 2 replicated DBs: 72h. This is insufficient for P0.
| Tier | Retention | Justification |
|---|---|---|
| P0 (mission-critical) | 7d | One week: covers weekend + Monday incident recovery. 72h is too tight: if a silent corruption is not caught in 3 days, all WAL segments are gone. |
| P0 (financial/legal) | 30d | Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows. |
| P1 | 72h | Current default. Operationally acceptable. |
| P2 | 24h | Cache data. Disk cost matters more than recovery depth. |
Retention-check-interval: 1h for all tiers (current default, correct).
Sync-interval: 1s for P0 and P1, except large or archival P1 DBs, which are throttled (see 1.3); 10s for P2 to reduce Azure transaction cost on low-value data.
Azure storage cost estimate at current sizes (~1.2 GB total databases):
WAL segments are incremental. Estimate ~500 MB/day delta across all active DBs.
7-day P0 WAL: ~3.5 GB. 30-day financial: ~1 GB. P1 72h: ~1 GB.
Total Azure Blob: ~6 GB. At ~€0.02/GB/month = ~€0.12/month. Negligible.
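The estimate above can be reproduced with a few lines of shell. This is only a sketch of the arithmetic: estimate_cost is a hypothetical helper, the per-tier sizes are the plan's own rough figures, and the flat per-GB price is the assumed rate quoted above rather than a tariff lookup.

```shell
#!/usr/bin/env bash
# estimate_cost PRICE_PER_GB GB...  -> prints "<total GB> <EUR per month>"
# Hypothetical helper; inputs are the plan's rough WAL volume estimates.
estimate_cost() {
  price="$1"; shift
  printf '%s\n' "$@" | LC_ALL=C awk -v price="$price" '
    { total += $1 }
    END { printf "%.1f %.2f\n", total, total * price }'
}

# P0 7-day WAL, financial 30-day WAL, P1 72h WAL (GB)
estimate_cost 0.02 3.5 1 1
```

The three tiers sum to 5.5 GB, which the plan rounds up to ~6 GB; either way the monthly cost stays in the range of a few euro cents.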
1.3 New litestream.yml
Path: /Users/makinja/system/config/litestream.yml
Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn.
Note on knowledge.db (192 MB): P0, sync-interval 1s — it's actively written by RAG ingestion.
# Litestream: SQLite streaming replication to Azure Blob Storage
# Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98)
# Config: /Users/makinja/system/config/litestream.yml
# Auth: Azure SP (alai-backup-writer) via client credentials
#   SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669)
#   SP has Storage Blob Data Contributor on system-db-backups container
#   Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env
# Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block)
# Updated: 2026-04-20, ANVIL SPOF Elimination Sprint (MC #8515)
#
# Tier reference:
#   P0-critical:  retention 7d,  sync 1s
#   P0-financial: retention 30d, sync 1s
#   P1:           retention 72h, sync 1s (or 30s for large DBs)
#   P2:           retention 24h, sync 10s
dbs:
  # ── P0 MISSION CRITICAL ──────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
        retention: 168h # 7 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/hivemind.db
    replicas:
      - name: hivemind-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind
        retention: 168h # 7 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tasks.db
    replicas:
      - name: tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tasks
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/costs.db
    replicas:
      - name: costs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/costs
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/events.db
    replicas:
      - name: events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/events
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-queue.db
    replicas:
      - name: orch-queue-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-queue
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-workers.db
    replicas:
      - name: orch-workers-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-workers
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/durable-runner.db
    replicas:
      - name: durable-runner-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/durable-runner
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/session-index.db
    replicas:
      - name: session-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/session-index
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/knowledge.db
    replicas:
      - name: knowledge-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/knowledge
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/emails.db
    replicas:
      - name: emails-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/emails
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-inbox.db
    replicas:
      - name: email-inbox-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-inbox
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/alem-directives.db
    replicas:
      - name: alem-directives-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/alem-directives
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P0 FINANCIAL / LEGAL ─────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/fiken.db
    replicas:
      - name: fiken-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/fiken
        retention: 720h # 30 days
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/invoices.db
    replicas:
      - name: invoices-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/invoices
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/contracts.db
    replicas:
      - name: contracts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contracts
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/leads.db
    replicas:
      - name: leads-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/leads
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P1 OPERATIONAL ───────────────────────────────────────────────────────────
  - path: /Users/makinja/system/databases/agent-routing.db
    replicas:
      - name: agent-routing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/agent-routing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/bee-index.db
    replicas:
      - name: bee-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bee-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/bih-tenders.db
    replicas:
      - name: bih-tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bih-tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/browser-tasks.db
    replicas:
      - name: browser-tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/browser-tasks
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/companies.db
    replicas:
      - name: companies-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/companies
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/contacts.db
    replicas:
      - name: contacts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contacts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/deploy-registry.db
    replicas:
      - name: deploy-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/deploy-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/design-reviews.db
    replicas:
      - name: design-reviews-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/design-reviews
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/distill.db
    replicas:
      - name: distill-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/distill
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/documents.db
    replicas:
      - name: documents-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/documents
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/drafts.db
    replicas:
      - name: drafts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drafts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/drift.db
    replicas:
      - name: drift-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drift
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-audit.db
    replicas:
      - name: email-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-briefing.db
    replicas:
      - name: email-briefing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-briefing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-index.db
    replicas:
      - name: email-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/email-tracking.db
    replicas:
      - name: email-tracking-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-tracking
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/escalations.db
    replicas:
      - name: escalations-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/escalations
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/facts.db
    replicas:
      - name: facts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/facts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/flywheel.db
    replicas:
      - name: flywheel-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/flywheel
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 30s # 432 MB: throttle sync to reduce Azure transactions
  - path: /Users/makinja/system/databases/goals.db
    replicas:
      - name: goals-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/goals
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/guardrails-audit.db
    replicas:
      - name: guardrails-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/guardrails-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/health-events.db
    replicas:
      - name: health-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/health-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/hivemind-archive.db
    replicas:
      - name: hivemind-archive-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind-archive
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/master-control.db
    replicas:
      - name: master-control-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/master-control
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/mc.db
    replicas:
      - name: mc-db-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mc-db
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/minions.db
    replicas:
      - name: minions-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/minions
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/observability.db
    replicas:
      - name: observability-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/observability
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/orchestrator-events.db
    replicas:
      - name: orch-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/pipeline.db
    replicas:
      - name: pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/projects.db
    replicas:
      - name: projects-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/projects
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/routing-outcomes.db
    replicas:
      - name: routing-outcomes-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/routing-outcomes
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/skill-improvements.db
    replicas:
      - name: skill-improvements-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-improvements
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/skill-registry.db
    replicas:
      - name: skill-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/sprint-pipeline.db
    replicas:
      - name: sprint-pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/sprint-pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/strategy-tracker.db
    replicas:
      - name: strategy-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/strategy-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/teams.db
    replicas:
      - name: teams-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/teams
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tenders.db
    replicas:
      - name: tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tickets.db
    replicas:
      - name: tickets-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tickets
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tool-audit.db
    replicas:
      - name: tool-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/tool-registry.db
    replicas:
      - name: tool-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/trace-events.db
    replicas:
      - name: trace-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/trace-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  - path: /Users/makinja/system/databases/applications-tracker.db
    replicas:
      - name: applications-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/applications-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s
  # ── P2 CACHE / RECONSTRUCTIBLE ───────────────────────────────────────────────
  - path: /Users/makinja/system/databases/baikal-caldav.db
    replicas:
      - name: baikal-caldav-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/baikal-caldav
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/prompt-cache.db
    replicas:
      - name: prompt-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/prompt-metrics.db
    replicas:
      - name: prompt-metrics-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-metrics
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/rag-cache.db
    replicas:
      - name: rag-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/rag-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/semantic-reuse-index.db
    replicas:
      - name: semantic-reuse-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/semantic-reuse-index
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/stbs.db
    replicas:
      - name: stbs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/stbs
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/telemetry.db
    replicas:
      - name: telemetry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/telemetry
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/token-cost.db
    replicas:
      - name: token-cost-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/token-cost
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/usage.db
    replicas:
      - name: usage-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/usage
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
  - path: /Users/makinja/system/databases/vcr.db
    replicas:
      - name: vcr-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/vcr
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
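
Because every entry in the config follows the same ten-line pattern, the file could be generated from a short tier table instead of being maintained by hand. The sketch below illustrates the idea; emit_db is a hypothetical helper (not part of litestream), only four sample rows are shown, and it derives the replica name from the database name, whereas a few entries in the config above use shortened names (for example mc-abs and orch-queue-abs).

```shell
#!/usr/bin/env bash
# Sketch: emit one litestream "dbs" entry per database from a tier table.
DB_DIR=/Users/makinja/system/databases
ENDPOINT=https://alaibackups0ebb.blob.core.windows.net
BUCKET=system-db-backups

# emit_db NAME RETENTION SYNC_INTERVAL (hypothetical helper)
emit_db() {
  name="$1" retention="$2" sync="$3"
  cat <<EOF
  - path: ${DB_DIR}/${name}.db
    replicas:
      - name: ${name}-abs
        type: abs
        endpoint: ${ENDPOINT}
        bucket: ${BUCKET}
        path: litestream/${name}
        retention: ${retention}
        retention-check-interval: 1h
        sync-interval: ${sync}
EOF
}

echo "dbs:"
# Columns: name retention sync-interval (sample rows, one per tier)
while read -r name retention sync; do
  emit_db "$name" "$retention" "$sync"
done <<'ROWS'
tasks 168h 1s
invoices 720h 1s
flywheel 72h 30s
vcr 24h 10s
ROWS
```

Keeping the tier table as the single source of truth would make it harder for a database to be listed in 1.1 but missed in the config.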
1.4 Implementation Steps (ANVIL)
1. Stop litestream: launchctl stop com.alai.litestream
2. Replace /Users/makinja/system/config/litestream.yml with the config above.
3. Validate config: /opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate (confirm that the installed litestream build accepts -config-validate; if it does not, litestream databases -config /Users/makinja/system/config/litestream.yml also parses the config and lists every database)
4. Start litestream: launchctl start com.alai.litestream
5. Verify all DBs appear in Azure: az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --query "[].name" -o tsv | wc -l (each database writes several blobs, so expect at least ~67 entries).
6. Watch logs for errors: tail -f /Users/makinja/system/logs/litestream-error.log
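
A raw blob count does not show which database is missing. As a rough sketch, the verification step could instead compare the replica paths in the config against the blob names returned by Azure. missing_replicas is a hypothetical helper; it assumes blob names arrive one per line on stdin and that every replica path in the config starts with litestream/.

```shell
#!/usr/bin/env bash
# missing_replicas CONFIG
# Reads blob names on stdin and prints every replica path from CONFIG
# that has no blob underneath it.
missing_replicas() {
  config="$1"
  blobs="$(cat)"
  # Replica lines look like "path: litestream/<name>"; database lines start
  # with "- path:" and are therefore not matched.
  sed -n 's/^[[:space:]]*path:[[:space:]]*\(litestream\/[^[:space:]#]*\).*/\1/p' "$config" |
  while read -r prefix; do
    printf '%s\n' "$blobs" | grep -q "^${prefix}/" || printf '%s\n' "$prefix"
  done
}

# Usage on ANVIL (requires az login, so it is left commented out here):
# az storage blob list --container-name system-db-backups \
#   --account-name alaibackups0ebb --prefix litestream/ --auth-mode login \
#   --query "[].name" -o tsv |
#   missing_replicas /Users/makinja/system/config/litestream.yml
```

Empty output would mean every configured replica has at least one blob; any printed path points at a database that has not started replicating.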
Phase 2 — FORGE Hardware / OS Decision
2.1 FORGE Already Exists — Hardware Decision Is Made
FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected
to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86.
User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b,
deepseek-r1:70b, qwen3-coder, and bge-m3.
No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned).
2.2 Why FORGE Wins Over Every Alternative
| Option | Cost/mo | Latency to ANVIL | Apple Silicon | macOS parity | Verdict |
|---|---|---|---|---|---|
| FORGE (Mac Studio M3U 256GB, owned) | 0 EUR | < 1ms (Thunderbolt) | Yes (M3 Ultra) | Yes (same LaunchAgent ecosystem) | CHOSEN |
| Mac Mini M4 Pro (purchase) | ~50 EUR amortized | < 1ms if local | Yes | Yes | Redundant: FORGE exists |
| Hetzner Linux VM (CCX33) | ~30-50 EUR | 10-30ms (internet) | No (x86) | No (systemd, not launchd) | Budget option only if FORGE fails |
| Azure VM (Sweden Central) | ~60-80 EUR | 10-30ms | No | No | Closest to Azure storage but no Apple Silicon |
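Decision: Use FORGE as warm standby. Zero additional cost. The Thunderbolt link gives near-local latency for health checks and failover traffic; litestream replication itself flows via Azure (see Architecture Overview) and is expected to complete in well under 60s.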
2.3 FORGE Bootstrap Prerequisites
FORGE already runs Ollama. What is missing:
litestream installed on FORGE (check: brew list litestream on basicas@FORGE)
Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
~/system/databases/ directory created on FORGE
litestream-restore.sh daemon script written and loaded as LaunchAgent on FORGE
SSH key access from ANVIL to FORGE for health check and failover scripts
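
Most of these prerequisites can be checked mechanically before the restore daemon is loaded. The following is a minimal preflight sketch meant to run as basicas on FORGE; missing_env and preflight are hypothetical helper names, and the SSH key check is left out because it has to be run from ANVIL.

```shell
#!/usr/bin/env bash
# missing_env VAR...  -> prints the names of variables that are unset or empty
missing_env() {
  for var in "$@"; do
    eval "val=\${$var:-}"
    [ -n "$val" ] || printf '%s\n' "$var"
  done
}

# preflight -> prints one line per missing prerequisite, returns 1 if any
preflight() {
  status=0
  command -v litestream >/dev/null 2>&1 ||
    { echo "missing: litestream binary"; status=1; }
  [ -d "$HOME/system/databases" ] ||
    { echo "missing: $HOME/system/databases"; status=1; }
  for name in $(missing_env AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID); do
    echo "missing: env $name"; status=1
  done
  return "$status"
}

# Usage: preflight || exit 1
```

Running the check from the same environment the LaunchAgent will use matters, since variables exported in an interactive shell are not visible to launchd jobs.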
Phase 3 — Continuous Restore on FORGE (< 60s RPO)
3.1 Architecture
FORGE runs litestream restore in a loop, once per database. Litestream 0.5.x does not have
a native watch mode: each restore call downloads a snapshot plus the WAL segments that follow it.
An alternative, running litestream replicate on FORGE against the same Azure bucket paths, is
ruled out because FORGE must remain a read-only consumer (see 3.3). The approach is therefore
a shell script daemon on FORGE that calls litestream restore repeatedly at a short interval,
polling Azure for new WAL segments.
3.2 Continuous Restore Strategy
Use litestream restore with the -if-replica-exists flag in a loop:
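#!/usr/bin/env bash
# /Users/basicas/system/scripts/litestream-restore-loop.sh
# Runs on FORGE. Continuously restores all P0+P1 DBs from Azure.
# Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target)
set -euo pipefail

LITESTREAM=/opt/homebrew/bin/litestream
CONFIG=/Users/basicas/system/config/litestream-restore.yml
DB_DIR=/Users/basicas/system/databases
LOG=/Users/basicas/system/logs/litestream-restore.log
INTERVAL=30 # seconds between restore cycles

mkdir -p "$DB_DIR" "$(dirname "$LOG")"

while true; do
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Starting restore cycle" >> "$LOG"
  # litestream restore takes one database path per call and refuses to
  # overwrite an existing file, so each DB listed in the restore config is
  # restored to a temporary path and then moved into place.
  # (Verify both behaviours against the installed litestream version.)
  { grep -E '^[[:space:]]*- path: .+\.db$' "$CONFIG" || true; } |
    sed -E 's/^[[:space:]]*- path: //' |
    while read -r db; do
      tmp="${db}.restoring"
      rm -f "$tmp" "${tmp}-wal" "${tmp}-shm"
      if "$LITESTREAM" restore -config "$CONFIG" -if-replica-exists -o "$tmp" "$db" >> "$LOG" 2>&1 \
          && [ -f "$tmp" ]; then
        mv -f "$tmp" "$db"
      fi
    done
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Restore cycle complete, sleeping ${INTERVAL}s" >> "$LOG"
  sleep "$INTERVAL"
done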
3.3 FORGE litestream-restore.yml
A separate config file on FORGE that mirrors ANVIL's litestream.yml but uses restore semantics.
FORGE is a READ-ONLY consumer. It never writes back to Azure.
Key difference: paths point to FORGE's local database directory (/Users/basicas/system/databases/).
The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to.
# /Users/basicas/system/config/litestream-restore.yml
# FORGE warm standby: continuous restore from Azure
# DO NOT run litestream replicate with this config (restore only)
dbs:
  - path: /Users/basicas/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
  # ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL)
  # P2 DBs: omit from restore config, not worth continuous restore overhead
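
Since the FORGE config mirrors ANVIL's, it could be derived from ANVIL's litestream.yml rather than written by hand. The sketch below shows one way to do that with sed. to_forge_config is a hypothetical helper, and it assumes the ANVIL file keeps its "P2 CACHE" section comment as the marker where the P2 entries begin.

```shell
#!/usr/bin/env bash
# to_forge_config: reads ANVIL's litestream.yml on stdin and writes a
# FORGE restore config to stdout.
to_forge_config() {
  sed -e '/P2 CACHE/,$d' \
      -e '/^[[:space:]]*- path:/s#/Users/makinja/#/Users/basicas/#' \
      -e '/^[[:space:]]*retention/d' \
      -e '/^[[:space:]]*sync-interval/d'
}

# Usage (paths as used elsewhere in this plan):
# to_forge_config < litestream.yml > litestream-restore.yml
```

The four expressions drop everything from the P2 section onward, repoint each database path at FORGE's home directory, and strip the retention and sync settings, which only matter to the writer. Header comments are copied unchanged and would still describe ANVIL, so they need a manual touch-up.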
3.4 FORGE LaunchAgent for Restore Loop
Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist
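<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.alai.litestream-restore</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/basicas/system/scripts/litestream-restore-loop.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>AZURE_STORAGE_ACCOUNT</key>
    <string>alaibackups0ebb</string>
    <key>AZURE_CLIENT_ID</key>
    <string>1a0b3018-0c31-474b-918f-531b0a29a669</string>
    <key>AZURE_CLIENT_SECRET</key>
    <string>RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP</string>
    <key>AZURE_TENANT_ID</key>
    <string>3454a03f-20b4-4bda-a116-2293c459aecd</string>
  </dict>
  <key>KeepAlive</key>
  <true/>
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/basicas/system/logs/litestream-restore.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/basicas/system/logs/litestream-restore-error.log</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
</dict>
</plist>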
3.5 RPO Calculation
ANVIL litestream sync-interval: 1s (WAL segment flushed to Azure every 1s for P0)
FORGE restore poll interval: 30s
Azure propagation: < 1s (same-region, in-blob operations)
Worst-case RPO: ~32s (1s sync + 30s poll + up to 1s propagation, not counting the time the restore itself takes); well under the 60s target
Expected average RPO: ~15-20s
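The arithmetic behind these figures can be sketched as follows. This is an illustrative calculation only: the function name and the optional restoreSeconds parameter are assumptions, and the sub-second Azure propagation is rounded up to a full second to get an upper bound.

```javascript
// Replication lag seen on FORGE, in seconds.
// syncInterval: ANVIL litestream sync-interval
// pollInterval: FORGE restore loop interval
// propagation:  Azure propagation (rounded up to 1s here)
// restoreSeconds: assumed placeholder for the time the restore itself takes
function rpoEstimate({ syncInterval, pollInterval, propagation, restoreSeconds = 0 }) {
  // Worst case: a write lands just after a sync and just after a poll.
  const worst = syncInterval + pollInterval + propagation + restoreSeconds;
  // Average: on average a write waits half of each interval.
  const average = syncInterval / 2 + pollInterval / 2 + propagation + restoreSeconds;
  return { worst, average };
}

const estimate = rpoEstimate({ syncInterval: 1, pollInterval: 30, propagation: 1 });
console.log(estimate); // { worst: 32, average: 16.5 }
```

The average of 16.5s sits inside the expected ~15-20s range, and the upper bound stays well under the 60s target as long as a restore cycle finishes in a few seconds.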
Phase 4 — Ollama Failover Tier Routing
4.1 Current State
Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the
primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ.
The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths.
The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down,
and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down.
4.2 Failover Config Extension
Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block:
"ollamaHosts": {
  "anvil": {
    "url": "http://localhost:11434",
    "tailscale_url": "http://100.103.49.98:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-infra"
  },
  "forge": {
    "url": "http://10.0.0.2:11434",
    "tailscale_url": "http://100.104.164.86:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-compute"
  }
},
"failoverRules": {
  "anvil-down": {
    "redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"],
    "to_forge_models": {
      "llama3.1:8b": "llama3.1:8b",
      "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0"
    },
    "note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents"
  },
  "forge-down": {
    "redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"],
    "to_claude": true,
    "note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable; FORGE failure is rare)"
  }
}
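How a router might consume these rules can be sketched as below. The rules object copies the failoverRules values above; the resolveRoute helper, its return shape, and the choice to return null when no target is available are assumptions for illustration, not the existing tier-router behaviour.

```javascript
// Values copied from the failoverRules block; the helper below is hypothetical.
const failoverRules = {
  'anvil-down': {
    redirect_anvil_tiers: ['1', '2', '2t', '2cHQ'],
    to_forge_models: {
      'llama3.1:8b': 'llama3.1:8b',
      'qwen2.5-coder:32b': 'qwen2.5-coder:32b-instruct-q8_0',
    },
  },
  'forge-down': {
    redirect_forge_tiers: ['2c', '2cf', '2d', '3', '3s', '3r'],
    to_claude: true,
  },
};

// health: { anvil: { healthy: boolean }, forge: { healthy: boolean } }
function resolveRoute(tier, model, health, rules = failoverRules) {
  const anvilDown = rules['anvil-down'];
  const forgeDown = rules['forge-down'];

  if (anvilDown.redirect_anvil_tiers.includes(tier)) {
    if (health.anvil.healthy) return { host: 'anvil', model };
    // ANVIL is down: use the mapped FORGE model if FORGE can serve it.
    const forgeModel = anvilDown.to_forge_models[model];
    if (health.forge.healthy && forgeModel) return { host: 'forge', model: forgeModel };
    return null; // assumption: no route when neither host can serve the tier
  }

  if (forgeDown.redirect_forge_tiers.includes(tier)) {
    if (health.forge.healthy) return { host: 'forge', model };
    if (forgeDown.to_claude) return { host: 'claude', model: null };
    return null;
  }

  return null; // unknown tier
}

console.log(resolveRoute('2', 'qwen2.5-coder:32b',
  { anvil: { healthy: false }, forge: { healthy: true } }));
```

Keeping the decision in one pure function makes the failover rules easy to unit-test without a running Ollama instance.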
4.3 Health Check Daemon
A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes
status to a JSON file that ollama-engine.js reads before routing:
Path: /Users/makinja/system/daemons/ollama-health-monitor.js
// Pseudocode — implementation by CodeCraft
// Runs every 15s, writes to /tmp/ollama-health.json
// {
// "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" },
// "forge": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }
// }
// tier-router.js reads this file before every dispatch
// If anvil.healthy === false: redirect tier 1/2 requests to forge
// If forge.healthy === false: redirect tier 2c/3 requests to claude
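A minimal sketch of the polling side of this daemon follows, assuming Node.js 18+ (global fetch and AbortSignal.timeout). The function names, the injectable fetchFn parameter, and the write-then-rename step are assumptions added for illustration; the URLs, health path, timeout, and output file come from the sections above. The actual implementation remains with CodeCraft.

```javascript
const fs = require('fs');

// Health URLs built from the ollamaHosts block (url + health_path).
const HOSTS = {
  anvil: 'http://localhost:11434/api/tags',
  forge: 'http://10.0.0.2:11434/api/tags',
};

// Returns true only when the endpoint answers with a 2xx status in time.
async function checkHost(url, timeoutMs = 3000, fetchFn = fetch) {
  try {
    const res = await fetchFn(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false; // timeout, refused connection, DNS failure
  }
}

// Polls every host once and writes the status file described above.
async function pollOnce(outFile = '/tmp/ollama-health.json', fetchFn = fetch) {
  const status = {};
  for (const [name, url] of Object.entries(HOSTS)) {
    status[name] = {
      healthy: await checkHost(url, 3000, fetchFn),
      last_check: new Date().toISOString(),
    };
  }
  // Write to a temp file and rename, so a reader never sees a partial file.
  fs.writeFileSync(`${outFile}.tmp`, JSON.stringify(status, null, 2));
  fs.renameSync(`${outFile}.tmp`, outFile);
  return status;
}

// To run as a daemon (not executed here):
//   pollOnce(); setInterval(() => pollOnce(), 15_000);
```

Passing fetchFn as a parameter lets the routing logic be tested with a stub instead of live Ollama endpoints.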
4.4 Manual Failover Command
For Phase 1 (before automatic failover is implemented):
# On ANVIL, when FORGE is down — force all routing to ANVIL
echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json
# When ANVIL is down, from FORGE (if FORGE has ollama-engine.js):
# Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge"
Phase 5 — DNS / Service Discovery
5.1 Options Evaluated
| Option | Mechanism | Failover Speed | Complexity | Cost |
|---|---|---|---|---|
| Tailscale MagicDNS | DNS record swap via Tailscale API | Manual: ~1 min | Low | Free |
| Cloudflare DNS + health check | CF Load Balancer health-check → DNS swap | Automatic: ~30s | Medium | ~$5/month |
| Local /etc/hosts on each node | Static entries, no automatic failover | Manual: ~1 min | None | Free |
| Cloudflare Tunnel alias | DNS alias behind CF Tunnel | ~30s | Medium | Free tier |
5.2 Recommendation: Tailscale MagicDNS
Chosen: Tailscale MagicDNS with manual DNS swap.
Rationale:
All nodes (ANVIL, FORGE, ab-mac) are already on the same Tailscale network.
Tailscale MagicDNS can assign a hostname anvil.alai.internal (or use the device name directly).
Current hardcoded addresses ( localhost:11434 , 10.0.0.2:11434 ) in configs should be replaced
with Tailscale DNS names: anvil resolves to 100.103.49.98, forge resolves to 100.104.164.86.
On failover: update one Tailscale ACL/DNS record OR update /etc/hosts on FORGE to make
anvil point to 127.0.0.1 (making FORGE answer for anvil traffic locally).
Implementation:
In Tailscale admin console: verify MagicDNS is enabled for the tailnet.
Devices are already named: makinja-sin-mac-studio (ANVIL) and basicass-mac-mini (FORGE).
Add a Tailscale DNS override: anvil.alai → 100.103.49.98 (ANVIL primary).
Add to all tool configs: replace localhost:11434 with anvil.alai:11434 , 10.0.0.2:11434 with forge.alai:11434 .
Failover procedure: update Tailscale DNS record anvil.alai → 100.104.164.86 (FORGE).
This takes effect across all nodes within ~30s (Tailscale DNS TTL).
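The address replacement in the tool configs can be sketched as a simple text rewrite. The mapping is taken from the implementation steps above; the rewriteHosts helper is hypothetical, and it assumes it is run only against ANVIL-side configs, where localhost:11434 really means ANVIL.

```javascript
// Hardcoded address -> DNS name, as listed in the implementation steps.
const HOST_MAP = {
  'localhost:11434': 'anvil.alai:11434',
  '10.0.0.2:11434': 'forge.alai:11434',
};

// Replaces every occurrence of each hardcoded address in a config file's text.
function rewriteHosts(configText, map = HOST_MAP) {
  let out = configText;
  for (const [from, to] of Object.entries(map)) {
    out = out.split(from).join(to);
  }
  return out;
}

const before = '{"anvil":"http://localhost:11434","forge":"http://10.0.0.2:11434"}';
console.log(rewriteHosts(before));
// {"anvil":"http://anvil.alai:11434","forge":"http://forge.alai:11434"}
```

Reviewing the rewritten text as a diff before saving is advisable, since a plain string replace cannot tell an Ollama URL from any other use of the same address.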
Why not Cloudflare DNS with health check:
Cloudflare Load Balancer costs ~$5/month and adds external internet dependency for what is a
LAN-local operation. Overkill for current scale. Revisit if ALAI adds a third node outside the LAN.
Phase 6 — External Heartbeat
6.1 Requirement
An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops
if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls).
6.2 Mechanism: GitHub Actions Cron (Recommended)
Chosen: GitHub Actions scheduled workflow. Cost: free on a public repo; on a private repo every run
consumes Actions minutes (billed per started minute), so check the plan's included quota. No Azure Function setup required.
# .github/workflows/anvil-heartbeat.yml
# In a private ALAI GitHub repo (e.g., alai-infra or system-health)
name: ANVIL Heartbeat
on:
  schedule:
    - cron: '* * * * *' # Requests every minute; GitHub Actions runs scheduled workflows at most once every 5 minutes
jobs:
  heartbeat:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - name: Check ANVIL health via Tailscale
        id: health
        run: |
          # ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL
          # Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL)
          # Option B: Use Tailscale GitHub Action to join the tailnet and check directly
          # "|| true" keeps this step from failing when curl cannot connect (status is
          # then 000); a failed step would cause the alert step below to be skipped
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            --connect-timeout 10 \
            --max-time 15 \
            "${{ secrets.ANVIL_HEALTH_URL }}" || true)
          echo "status=$STATUS" >> "$GITHUB_OUTPUT"
      - name: Alert Slack if down
        if: steps.health.outputs.status != '200'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "channel": "#ops",
              "text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ github.run_started_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
6.3 ANVIL Health Endpoint
ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel)
or via Tailscale GitHub Action. The simplest approach:
Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port
8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."} .
Expose via existing Cloudflare Tunnel infrastructure.
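A minimal sketch of such a health server, using only the Node.js http module. The response body matches the shape given above and the port is 8099; the /health path, the function names, and binding to 127.0.0.1 (so that only the tunnel can reach it) are assumptions.

```javascript
const http = require('http');

// Builds the JSON body described above: {"status":"ok","host":"anvil","ts":"..."}
function healthBody(now = new Date()) {
  return JSON.stringify({ status: 'ok', host: 'anvil', ts: now.toISOString() });
}

// Returns an http.Server that answers 200 on GET /health and 404 otherwise.
// The caller decides when and where to listen.
function createHealthServer() {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/health') {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(healthBody());
    } else {
      res.writeHead(404);
      res.end();
    }
  });
}

// Usage (not executed here):
//   createHealthServer().listen(8099, '127.0.0.1');
```

The server deliberately checks nothing beyond "the process can answer HTTP"; deeper checks (Ollama, litestream) would be separate additions.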
6.4 Alert Escalation
2 consecutive failures (2 minutes down): Slack #ops message.
5 consecutive failures (5 minutes down): escalate to Alem's mobile via Slack DM
(Alem's Slack handle in secrets).
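The escalation thresholds can be expressed as a small state function. This is a sketch only: the function names and level labels are hypothetical, and it assumes the poller persists the consecutive-failure count between runs (for example in a cache or repository variable), which the workflow shown earlier does not yet do.

```javascript
// Updates the consecutive-failure counter after one poll.
// Any status other than HTTP 200 counts as a failure (curl reports 000 on no connection).
function nextFailureCount(previous, httpStatus) {
  return httpStatus === 200 ? 0 : previous + 1;
}

// Maps the counter to the escalation level described above.
function alertLevel(consecutiveFailures) {
  if (consecutiveFailures >= 5) return 'dm-alem';   // escalate to Alem via Slack DM
  if (consecutiveFailures >= 2) return 'slack-ops'; // post to Slack #ops
  return 'none';                                    // a single missed poll is tolerated
}

// Example: one success followed by two failed polls.
const count = [200, 0, 0].reduce(nextFailureCount, 0);
console.log(count, alertLevel(count)); // 2 slack-ops
```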
6.5 Azure Function Alternative
Azure Function with Timer trigger (every 60s) is viable but requires:
Azure subscription billing (Consumption plan: ~$0/month for < 1M executions — effectively free)
Azure Function App deployment and maintenance
More setup complexity than GitHub Actions
Verdict: GitHub Actions preferred for simplicity. Switch to an Azure Function if GitHub Actions
scheduling (5-minute minimum interval for scheduled workflows, with runs delayed or dropped under load) cannot meet the 60s polling requirement.
Phase 7 — Shared Secrets (FORGE Bitwarden Access)
7.1 Problem
FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without
depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session.
7.2 Options
| Option | Description | Risk |
|---|---|---|
| Separate BW account on FORGE | FORGE has its own Bitwarden account with shared collection | Low (independent) |
| Shared BW session sync | ANVIL writes /tmp/bw-session to FORGE via rsync | Medium (session expires) |
| Azure Key Vault break-glass | Critical secrets in AKV, FORGE SP can read them | Low (Azure dependency) |
| Environment variables in plist | Secrets baked into LaunchAgent plist on FORGE | Low, but plaintext risk |
7.3 Recommendation: Two-Layer Approach (Plus a Future Third Layer)
Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently.
FORGE has bw CLI installed.
FORGE has its own BW_SESSION set via a one-time manual bootstrap: bw login --apikey using a
FORGE-specific API key (Bitwarden supports API keys per user/device).
Session is stored in /Users/basicas/.bw-session and refreshed by a LaunchAgent on FORGE.
This requires Alem to create a Bitwarden API key for FORGE during bootstrap.
Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap.
The Azure SP secret ( AZURE_CLIENT_SECRET ) is placed directly in the
com.alai.litestream-restore.plist EnvironmentVariables block — same pattern as ANVIL.
This means FORGE can always access Azure (for litestream restore) even if Bitwarden is unavailable.
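The plist file must be restricted with macOS file permissions (chmod 600, readable only by its owner, basicas), because a LaunchAgent in ~/Library/LaunchAgents is loaded as that user.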
This is the same pattern already in use on ANVIL (confirmed in the plist we read).
Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets.
Create a new SP alai-forge-reader with Key Vault Secrets User role.
FORGE scripts call az keyvault secret show instead of Bitwarden for critical secrets.
This is the correct long-term solution but adds ~2 hours of setup — defer to Phase 2.
7.4 Bootstrap Sequence for FORGE Secrets
# On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge):
# 1. Install bw CLI
brew install bitwarden-cli
# 2. Login with API key (avoids interactive login)
export BW_CLIENTID=""
export BW_CLIENTSECRET=""
bw login --apikey
# 3. Unlock and store only the session key (--raw prints the key without the
#    surrounding instructions); omit --passwordenv to be prompted interactively
bw unlock --passwordenv BW_MASTER_PASSWORD --raw > /Users/basicas/.bw-session
chmod 600 /Users/basicas/.bw-session
# 4. Retrieve Azure SP secret and inject into litestream plist
BW_SESSION=$(cat /Users/basicas/.bw-session)
AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION")
# Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET
plutil -replace EnvironmentVariables.AZURE_CLIENT_SECRET -string "$AZ_SECRET" \
  /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist
Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task)
This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill
after all phases are implemented. This is a REAL drill — not a dry run.
8.1 Pre-Drill Prerequisites
Phase 1 complete: all ~67 DBs replicating to Azure (verify with az storage blob list count)
Phase 3 complete: FORGE restore loop running, confirmed by checking FORGE DB file timestamps
Phase 4 complete: Ollama health monitor daemon running on ANVIL
Phase 5 complete: Tailscale MagicDNS configured ( anvil.alai resolves correctly)
Phase 6 complete: GitHub Actions heartbeat workflow deployed and sending test ping
Phase 7 complete: FORGE Bitwarden session independently functional
8.2 Drill Procedure
Step 1: Establish baseline (T=0)
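# On ANVIL: record current state
node ~/system/tools/mc.js stats # Record open task count
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" | tee /tmp/drill-baseline-count.txt
date -Iseconds > /tmp/drill-start.txt
# Copy both files to /tmp on FORGE before shutting ANVIL down; Step 5 reads the baseline there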
Step 2: Simulate ANVIL failure
# Option A: full shutdown (simulates power outage or kernel panic recovery)
# DO NOT run on production without Alem present
sudo shutdown -h now
# Option B: partial failure simulation (stop litestream, pi-orchestrator, and Ollama only)
launchctl stop com.alai.litestream
launchctl stop com.john.pi-orchestrator
launchctl stop com.john.ollama-serve-v2
Step 3: Measure time to alert (T=2 min)
GitHub Actions heartbeat should fire within 2 minutes of ANVIL going offline.
Angie records: timestamp of Slack #ops alert arrival.
Expected: < 2 min 30s from shutdown to Slack alert.
Step 4: FORGE failover execution (T=3 min target)
# On FORGE (basicas@100.104.164.86)
# 1. Verify latest DBs restored
ls -la ~/system/databases/*.db | head -5
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'"
# Compare to baseline — delta should be < 60s of writes
# 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE)
# (Alem updates in Tailscale admin console)
# 3. Start pi-orchestrator on FORGE (if installed)
# OR: update tier-routing.json to route all requests to forge endpoints
# 4. Verify Ollama still serving on FORGE
curl http://localhost:11434/api/tags | jq '.models | length'
Step 5: Measure RPO
# On FORGE after failover
BASELINE=$(cat /tmp/drill-baseline-count.txt) # From Step 1
CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'")
echo "Task count delta: $((BASELINE - CURRENT))"
# Check last WAL segment timestamp in Azure
az storage blob list \
--container-name system-db-backups \
--account-name alaibackups0ebb \
--prefix litestream/mission-control \
--auth-mode login \
--query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]"
# Record last WAL segment time vs ANVIL shutdown time = actual RPO
Step 6: Measure RTO
RTO = time from "ANVIL confirmed down" to "FORGE serving requests with < 60s RPO data".
Record timestamps at each step. Target: < 5 minutes total.
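Once the timestamps are recorded, the two measurements reduce to simple differences. The sketch below assumes ISO 8601 timestamps; the function and field names are hypothetical, and the example timestamps are made up purely to show the calculation.

```javascript
// Difference between two ISO 8601 timestamps, in seconds.
function secondsBetween(startIso, endIso) {
  return (Date.parse(endIso) - Date.parse(startIso)) / 1000;
}

// anvilDownAt:    when ANVIL was confirmed down (Step 2)
// lastWalAt:      last-modified time of the newest WAL segment in Azure (Step 5)
// forgeServingAt: when FORGE started serving requests (Step 4)
function drillMetrics({ anvilDownAt, lastWalAt, forgeServingAt }) {
  const rpoSeconds = secondsBetween(lastWalAt, anvilDownAt);
  const rtoSeconds = secondsBetween(anvilDownAt, forgeServingAt);
  return {
    rpoSeconds,
    rtoSeconds,
    rpoOk: rpoSeconds < 60,  // target: RPO < 60s
    rtoOk: rtoSeconds < 300, // target: RTO < 5 min
  };
}

// Illustrative values only, not drill results.
console.log(drillMetrics({
  anvilDownAt: '2026-05-01T10:00:00Z',
  lastWalAt: '2026-05-01T09:59:58Z',
  forgeServingAt: '2026-05-01T10:04:10Z',
}));
// { rpoSeconds: 2, rtoSeconds: 250, rpoOk: true, rtoOk: true }
```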
Step 7: Restore ANVIL and verify
# Start ANVIL back up
# Verify litestream resumes replication
tail -f /Users/makinja/system/logs/litestream.log
# Verify FORGE restore loop detects ANVIL is back and no duplicate writes
8.3 Acceptance Criteria (Angie signs off when ALL pass)
| Criterion | Target | Measured |
|---|---|---|
| Slack alert latency | < 2 min 30s | TBD |
| FORGE DB data lag (RPO) | < 60s | TBD |
| Time to FORGE serving (RTO) | < 5 min | TBD |
| P0 DB count on FORGE | 17 DBs | TBD |
| Ollama inference on FORGE | Working (test prompt) | TBD |
| No data loss on ANVIL restart | mission-control.db row count matches | TBD |
8.4 Findings Documentation
After the drill, Angie produces a findings report:
Actual RPO measured
Actual RTO measured
Any P0 DB that failed to restore
Any daemon that did not restart on FORGE
Recommendations for Phase 2 (automatic failover improvements)
Phase 9 — Skillforge BookStack Runbook Specification
This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page
at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA .
9.1 Required Sections
9.1.1 Overview Page
System architecture diagram (ANVIL — Thunderbolt — FORGE — Azure Blob)
Node inventory: ANVIL (96GB M3U), FORGE (256GB M3U), Azure (alaibackups0ebb)
RPO/RTO targets and current measured values
9.1.2 Litestream Configuration
How litestream works (WAL replication explained for non-experts)
DB tier classification table (P0/P1/P2) with justification
Retention policy per tier
How to add a new DB to replication (step-by-step)
How to verify replication is working: az storage blob list command + expected output
Where logs live: /Users/makinja/system/logs/litestream.log and -error.log
9.1.3 FORGE Warm Standby
What FORGE has installed (litestream, Ollama, models)
How the restore loop works: script location, poll interval, log location
How to verify FORGE is current: check DB timestamps against Azure last-modified
How to SSH to FORGE from ANVIL
9.1.4 Failover Runbook (Step-by-Step)
Pre-conditions checklist
Decision tree: partial failure vs full ANVIL down
Manual failover steps (numbered, copy-pasteable commands)
DNS failover: how to update Tailscale MagicDNS
Ollama failover: how to edit tier-routing.json on FORGE
Expected time per step
Rollback procedure: restoring ANVIL to primary
9.1.5 Failure Mode Catalog
| Failure | Detection | Response | Recovery |
|---|---|---|---|
| ANVIL Ollama crash | /tmp/ollama-health.json (written by ollama-health-monitor) | Tier routing auto-redirects to FORGE | Restart com.john.ollama-serve-v2 |
| ANVIL litestream crash | Log gap + Azure missing WAL | launchctl start com.alai.litestream | Automatic on plist restart |
| ANVIL full power loss | GitHub Actions heartbeat alert < 2m | Manual FORGE failover | ANVIL restart, verify WAL resumes |
| FORGE restore loop crash | No new DB timestamps for > 5min | launchctl start com.alai.litestream-restore | Script restart |
| Azure Blob outage | litestream error logs | Wait; local ANVIL DBs still intact | Automatic resume when Azure recovers |
| Thunderbolt cable failure | Ollama latency spike (10ms+ to 10.0.0.2) | Routes via Tailscale (100ms+ but functional) | Replug Thunderbolt |
9.1.6 Monitoring & Alerts
GitHub Actions heartbeat: link to workflow, how to check last run
Slack #ops: what alerts look like, who is responsible for response
How to manually trigger a health check
9.1.7 Secrets & Credentials
Azure SP: alai-backup-writer — where stored, how to rotate
FORGE Bitwarden: how FORGE unlocks independently
What to do if Bitwarden is inaccessible (break-glass: Azure credentials in plist)
9.1.8 DR Drill Schedule
Quarterly drill required (next: 90 days after Phase 8 drill)
Drill checklist (link to Phase 8 checklist above)
Where to store drill findings (BookStack page: DR Drill Log)
9.2 Diagrams Required
Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow
Failover decision tree: Who detects, who acts, what order
DB tier heatmap: Visual table of all 67 DBs colored by tier
9.3 BookStack Sync
Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and
triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync
daemon will keep it current thereafter.
Implementation Order & Timeline
| Phase | Description | Owner | Est. Hours | Dependency |
|---|---|---|---|---|
| 1 | Litestream expansion (update yml, reload daemon) | FlowForge | 2h | None |
| 2 | FORGE bootstrap (litestream install, DB dir, SP creds in plist) | FlowForge | 1h | Phase 1 |
| 3 | Continuous restore loop on FORGE | FlowForge | 2h | Phase 2 |
| 4 | Ollama health monitor daemon + failover config | FlowForge + CodeCraft | 3h | Phase 3 |
| 5 | Tailscale MagicDNS configuration | FlowForge | 1h | None |
| 6 | GitHub Actions heartbeat workflow | FlowForge | 1h | Phase 5 |
| 7 | FORGE Bitwarden bootstrap | FlowForge (Alem physical action) | 30min | Phase 2 |
| 8 | Proveo DR drill | Proveo (Angie Jones) | 2h | All phases done |
| 9 | BookStack runbook | Skillforge | 3h | Phase 8 |
Total estimated implementation time: ~15.5 hours across 9 phases.
Critical path: Phases 1 → 2 → 3 (unblock parallel: 4, 5, 6, 7) → 8 → 9.
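The totals can be checked mechanically from the hours and dependencies in the timeline table. The sketch below is illustrative: the data structure and function name are assumptions, and "All phases done" for Phase 8 is encoded as a dependency on Phases 1 through 7.

```javascript
// Hours and dependencies as listed in the implementation timeline.
const phases = {
  1: { hours: 2, deps: [] },
  2: { hours: 1, deps: [1] },
  3: { hours: 2, deps: [2] },
  4: { hours: 3, deps: [3] },
  5: { hours: 1, deps: [] },
  6: { hours: 1, deps: [5] },
  7: { hours: 0.5, deps: [2] },
  8: { hours: 2, deps: [1, 2, 3, 4, 5, 6, 7] },
  9: { hours: 3, deps: [8] },
};

// Earliest finish time of a phase if independent phases run in parallel:
// latest finish among its dependencies plus its own duration.
function earliestFinish(id, memo = {}) {
  if (memo[id] !== undefined) return memo[id];
  const { hours, deps } = phases[id];
  const start = Math.max(0, ...deps.map((d) => earliestFinish(d, memo)));
  return (memo[id] = start + hours);
}

const totalEffort = Object.values(phases).reduce((sum, p) => sum + p.hours, 0);
console.log(totalEffort);       // 15.5
console.log(earliestFinish(9)); // 13
```

Under these assumptions the total effort matches the 15.5 hours stated above, while the longest dependency chain (1 → 2 → 3 → 4 → 8 → 9) is 13 hours if the remaining phases are worked in parallel.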
Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| litestream overloads Azure with 67 DBs at 1s interval | Low | Medium | P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion |
| FORGE disk fills with restored DBs | Low | Medium | FORGE has 256GB RAM but internal SSD may vary; check df -h on FORGE before bootstrap |
| Thunderbolt cable failure isolates FORGE | Low | Low | Tailscale provides fallback path (100ms latency but functional) |
| WAL segments corrupt between ANVIL write and FORGE restore | Very Low | High | litestream uses SHA256 checksums on all WAL segments; corruption detected at restore |
| Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write | Medium | Low | litestream initializes on first write; these are pre-configured for when they get data |
| GitHub Actions cron jitter (can skip minutes) | Medium | Low | Two consecutive failures required before alert; single skip is acceptable |
Open Questions for Alem
FORGE SSH access: SSH to FORGE (basicas@100.104.164.86) is currently failing due to
"too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's
key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation.
FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB
of database files + WAL segments. df -h on FORGE before Phase 2.
FORGE macOS user: Confirmed user is basicas . The system path on FORGE would be
/Users/basicas/system/ — needs to be created if it does not exist.
Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key
in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden).
Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access
(alembasic@gmail.com account). Alem configures this step.
ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL.
Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup.
TL;DR
FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86).
No hardware purchase needed.
Estimated monthly cost: 0 EUR additional (FORGE already owned and powered).
Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs.
GitHub Actions heartbeat: free tier.
Total: < €1/month increase.
Estimated implementation time: ~15.5 hours across 9 phases.
Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR.
Full HA with automatic failover and DR drill: ~10.5 hours additional (15.5 hours total minus the 5-hour critical path).
Immediate action (highest leverage): Phase 1 — update litestream.yml to cover all 67 DBs.
This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours.
FORGE restore is what converts the backup into an actual warm standby.
Alem approval required before implementation.