Infrastructure

Deployment architecture, CI/CD, environments, IaC, monitoring, disaster recovery

Deployment Architecture

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Overview

System: {{PROJECT_NAME}}
Cloud Provider: {{CLOUD_PROVIDER}}
Provider Rationale: {{RATIONALE}}
Architecture Pattern: {{PATTERN}}

2. Infrastructure Topology

graph TB
    subgraph Internet
        USER[End Users]
        CDN[CDN / CloudFront]
    end

    subgraph Public Subnet
        ALB[Application Load Balancer]
        BASTION[Bastion Host]
    end

    subgraph Private Subnet - App
        APP1[App Server 1]
        APP2[App Server 2]
    end

    subgraph Private Subnet - Data
        DB_PRIMARY[(Primary DB)]
        DB_REPLICA[(Read Replica)]
        CACHE[Redis Cache]
    end

    subgraph Isolated Subnet
        SECRETS[Secrets Manager]
        BACKUP[Backup Storage]
    end

    USER --> CDN
    CDN --> ALB
    ALB --> APP1
    ALB --> APP2
    APP1 --> DB_PRIMARY
    APP2 --> DB_PRIMARY
    APP1 --> CACHE
    APP2 --> CACHE
    DB_PRIMARY --> DB_REPLICA
    APP1 --> SECRETS
    APP2 --> SECRETS

3. Networking Architecture

3.1 VPC / VNET Design

Network	CIDR	Purpose
VPC / VNET	{{CIDR_VPC}}	Main network boundary
Public Subnet A	{{CIDR_PUB_A}}	Load balancers, NAT gateways
Public Subnet B	{{CIDR_PUB_B}}	Load balancers, NAT gateways (AZ-B)
Private Subnet A	{{CIDR_PRIV_A}}	Application servers
Private Subnet B	{{CIDR_PRIV_B}}	Application servers (AZ-B)
Isolated Subnet A	{{CIDR_ISO_A}}	Databases, secrets
Isolated Subnet B	{{CIDR_ISO_B}}	Databases, secrets (AZ-B)
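
The CIDR placeholders in the table can be planned programmatically. The following is a minimal sketch, assuming a hypothetical `10.0.0.0/16` VPC range, equally sized `/20` subnets, and illustrative subnet names; none of these values come from this template. It uses Python's standard `ipaddress` module to carve adjacent blocks and to confirm that none of them overlap.

```python
import ipaddress

# Hypothetical subnet names mirroring the rows of the table above.
SUBNET_NAMES = ["public-a", "public-b", "private-a", "private-b", "isolated-a", "isolated-b"]


def carve_subnets(vpc_cidr, new_prefix, names):
    """Assign adjacent, equally sized blocks of vpc_cidr to names, in order."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        block = next(blocks, None)
        if block is None:
            raise ValueError(f"{vpc_cidr} has no /{new_prefix} block left for {name}")
        plan[name] = block
    return plan


def overlapping_pairs(plan):
    """Return every pair of subnet names whose CIDR ranges overlap."""
    names = list(plan)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if plan[a].overlaps(plan[b])]


plan = carve_subnets("10.0.0.0/16", 20, SUBNET_NAMES)  # assumed example range
for name, block in plan.items():
    print(f"{name:<12} {block}")
```

Leaving unused blocks at the end of the VPC range, as this layout does, keeps room for additional subnets or availability zones later.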

3.2 Load Balancer Configuration

Parameter	Value
Type	{{LB_TYPE}}
Protocol	HTTPS (TLS 1.2+)
SSL Termination	At load balancer
Health Check Path	{{HEALTH_CHECK_PATH}}
Health Check Interval	{{INTERVAL}}s
Unhealthy Threshold	{{THRESHOLD}} consecutive failures
Idle Timeout	{{TIMEOUT}}s
Stickiness	{{STICKINESS}}

3.3 DNS Architecture

Record	Type	Value	TTL
{{DOMAIN}}	A / ALIAS	Load Balancer	{{TTL}}
api.{{DOMAIN}}	CNAME	API Load Balancer	{{TTL}}
cdn.{{DOMAIN}}	CNAME	CDN Distribution	{{TTL}}

DNS Provider: {{DNS_PROVIDER}}
Failover Strategy: {{FAILOVER_STRATEGY}}

3.4 CDN Configuration

Parameter	Value
Provider	{{CDN_PROVIDER}}
Origin	{{CDN_ORIGIN}}
Cache Behaviors	Static assets: 1yr, API: no-cache, HTML: 5min
HTTPS Only	Yes
WAF Integration	{{WAF_INTEGRATION}}

4. Compute

4.1 Container Orchestration

Platform: {{ORCHESTRATION}}

Component	Configuration	Notes
Cluster	{{CLUSTER_SPEC}}
Node Groups	{{NODE_GROUPS}}
Min Nodes	{{MIN_NODES}}
Max Nodes	{{MAX_NODES}}
Node Size	{{NODE_SIZE}}
Container Registry	{{REGISTRY}}

4.2 Serverless Functions

Function	Trigger	Memory	Timeout	Purpose
{{FUNCTION_1}}	{{TRIGGER}}	{{MEMORY}}MB	{{TIMEOUT}}s	{{PURPOSE}}

4.3 Instance Sizing & Auto-Scaling

Service	Instance Type	Min	Max	Scale Trigger
{{SERVICE}}	{{INSTANCE}}	{{MIN}}	{{MAX}}	CPU > {{CPU}}% for {{DURATION}}min

Scale-Out Policy: {{SCALE_OUT}}
Scale-In Policy: {{SCALE_IN}}
Scale-In Cooldown: {{COOLDOWN}}min

5. Storage

5.1 Database Hosting

Database	Engine	Version	Hosting	Instance	Storage	HA
{{DB_NAME}}	{{ENGINE}}	{{VERSION}}	{{HOSTING}}	{{INSTANCE}}	{{STORAGE}}GB	{{HA}}

Connection Pooling: {{POOL_TOOL}}
Max Connections: {{MAX_CONN}}
Connection String: Stored in {{SECRET_LOCATION}} (never hardcoded)

5.2 Object Storage

Bucket / Container	Purpose	Access	Lifecycle	Encryption
{{BUCKET_NAME}}	{{PURPOSE}}	{{ACCESS}}	{{LIFECYCLE}}	AES-256

5.3 File Storage

Storage	Type	Mount Point	Purpose	Size
{{STORAGE_NAME}}	{{TYPE}}	{{MOUNT}}	{{PURPOSE}}	{{SIZE}}GB

6. Security

6.1 Network Security Groups / Firewall Rules

Security Group	Direction	Port	Protocol	Source / Destination	Purpose
sg-alb	Inbound	443	TCP	0.0.0.0/0	HTTPS from internet
sg-alb	Outbound	{{APP_PORT}}	TCP	sg-app	Forward to app
sg-app	Inbound	{{APP_PORT}}	TCP	sg-alb	From load balancer
sg-app	Outbound	{{DB_PORT}}	TCP	sg-db	Database access
sg-db	Inbound	{{DB_PORT}}	TCP	sg-app	From application only

6.2 WAF Configuration

WAF Provider: {{WAF_PROVIDER}}

Rule Group	Purpose	Action
AWSManagedRulesCommonRuleSet	OWASP Top 10	Block
AWSManagedRulesSQLiRuleSet	SQL injection	Block
AWSManagedRulesKnownBadInputsRuleSet	Known bad inputs	Block
Rate limiting	{{RATE_LIMIT}} req/5min per IP	Count → Block

6.3 Secrets Management

Secret Store: {{SECRET_STORE}}

Secret	Rotation Schedule	Access
Database credentials	90 days	App role only
API keys (third-party)	On compromise	App role only
TLS certificates	60 days before expiry	Deploy role only
JWT signing key	365 days	Auth service only

6.4 IAM Roles & Policies

Role	Trusted By	Key Permissions	Purpose
{{APP_ROLE}}	EC2 / ECS Task	secretsmanager:GetSecretValue, s3:GetObject	Application runtime
{{DEPLOY_ROLE}}	CI/CD	ecr:PutImage (plus the ECR layer-upload actions), ecs:UpdateService	Deployments
{{BACKUP_ROLE}}	Lambda / Cron	rds:CreateDBSnapshot, s3:PutObject	Backups

7. Cost Estimation

Component	Service	Spec	Est. Monthly Cost
Compute	{{SERVICE}}	{{SPEC}}	${{COST}}
Database	{{SERVICE}}	{{SPEC}}	${{COST}}
Load Balancer	{{SERVICE}}	{{SPEC}}	${{COST}}
CDN	{{SERVICE}}	{{TRAFFIC}}GB transfer	${{COST}}
Storage	{{SERVICE}}	{{CAPACITY}}GB	${{COST}}
Monitoring	{{SERVICE}}	{{METRICS}} metrics	${{COST}}
Total			${{TOTAL}}

Cost Optimization Notes:

8. High Availability Design

Component	HA Strategy	Failover Time	Notes
Application	Multi-AZ, N+1 instances	Health check interval × unhealthy threshold (see Section 3.2)	Load balancer stops routing to unhealthy targets
Database	Multi-AZ with auto-failover	60-120 seconds	DNS propagation
Cache	Cluster mode / Replication	30 seconds	Redis Sentinel
CDN	Global edge network	Transparent	Provider HA

RTO Target: {{RTO}} minutes
RPO Target: {{RPO}} minutes

9. Multi-Region Considerations

Current: {{REGION_STRATEGY}}
Primary Region: {{PRIMARY_REGION}}
Secondary Region: {{SECONDARY_REGION}}

Rationale: {{MULTI_REGION_RATIONALE}}

Data Replication: {{REPLICATION_STRATEGY}}
Failover Procedure: See disaster-recovery-plan.md

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

Environment Configuration

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Environment Overview

Environment	Purpose	URL	Access	Managed By
Local	Developer workstation	`localhost`	Developer	Individual
Dev	Integration, daily builds	`dev.{{DOMAIN}}`	Team + CI	Platform team
Staging	Pre-production validation	`staging.{{DOMAIN}}`	Team + QA + PM	Platform team
Production	Live system	`{{DOMAIN}}`	Ops only	Platform team
Preview	Feature branch review	`{{BRANCH}}.preview.{{DOMAIN}}`	Team + Stakeholders	CI/CD

2. Per-Environment Configuration

2.1 Development Environment

Parameter	Value	Notes
Log level	`DEBUG`	Verbose logging for development
Database	`dev-db.{{INTERNAL_DOMAIN}}`	Shared dev DB, refreshed weekly
Cache	`dev-redis.{{INTERNAL_DOMAIN}}`	Shared Redis, no persistence
Email	Mailtrap / fake SMTP	Emails not delivered to real recipients
Payments	Sandbox / test mode	No real transactions
Feature flags	All enabled	Developers can test unreleased features
Debug tools	Enabled	Profiler, debug toolbar, etc.
Rate limiting	Disabled	Developer convenience
Auto-migrations	Enabled	Runs on startup

2.2 Staging Environment

Parameter	Value	Notes
Log level	`INFO`	Same as production
Database	`staging-db.{{INTERNAL_DOMAIN}}`	Isolated staging DB, production-scale
Cache	`staging-redis.{{INTERNAL_DOMAIN}}`	Dedicated Redis
Email	`staging@{{DOMAIN}}`	Sends to internal test inboxes only
Payments	Sandbox / test mode	No real transactions
Feature flags	Mirrors production + staged features
Debug tools	Disabled	Must match production behavior
Rate limiting	Enabled	Same limits as production
Data refresh	Weekly from production (anonymized)	See data refresh runbook

Intentional staging/production differences:

Email delivery: internal only (not real users)
Payment: sandbox (not real transactions)
Data: anonymized copies (not real PII)

2.3 Production Environment

Parameter	Value	Notes
Log level	`INFO`	INFO and above (same as staging)
Database	`{{PROD_DB_HOST}}`	See secrets manager
Cache	`{{PROD_REDIS_HOST}}`	Clustered Redis
Email	`{{EMAIL_PROVIDER}}`	Real delivery via SES/Sendgrid/etc.
Payments	Live mode	Real transactions
Feature flags	Conservative — tested features only	New features behind flags
Debug tools	Disabled	Security requirement
Rate limiting	Enabled	See rate limit table
HSTS	Enabled (1 year, includeSubDomains)
CSP	Strict	See security headers config

2.4 Preview / Feature Environments

Trigger: Pull request opened against main / develop
Lifetime: Active while PR is open; destroyed on PR close
URL Pattern: {{BRANCH_SLUG}}.preview.{{DOMAIN}}
Database: Ephemeral copy (seeded from fixture data, not production)
Teardown: Automated, triggered by PR close webhook

Parameter	Value
Log level	`DEBUG`
Email	Fake SMTP / preview inbox
Payments	Sandbox
Feature flags	Branch-specific flags enabled
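
The `{{BRANCH_SLUG}}` in the URL pattern has to be a valid DNS label, which branch names such as `feature/ABC-123` are not. The following is a minimal sketch of one way to derive it; the function names and the exact normalization rules are assumptions for illustration, not part of this template.

```python
import re

MAX_DNS_LABEL = 63  # a single DNS label may be at most 63 characters long


def branch_slug(branch):
    """Turn a Git branch name into a lowercase, hyphen-separated DNS label."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower())  # collapse other characters
    slug = slug.strip("-")[:MAX_DNS_LABEL].rstrip("-")  # trim length and stray hyphens
    if not slug:
        raise ValueError(f"branch name {branch!r} yields an empty slug")
    return slug


def preview_url(branch, domain):
    """Build the preview URL following the pattern <slug>.preview.<domain>."""
    return f"https://{branch_slug(branch)}.preview.{domain}"


print(preview_url("feature/ABC-123_New Checkout", "example.com"))
```

Two long branch names that differ only after the 63rd character would collide under this scheme; appending a short hash of the full branch name is one possible mitigation.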

3. Environment Variables Reference

Variable	Description	Required	Default	Sensitive	Environments
`NODE_ENV`	Runtime environment	Yes	`development`	No	All
`PORT`	HTTP server port	Yes	`3000`	No	All
`DATABASE_URL`	PostgreSQL connection string	Yes	—	Yes	All
`REDIS_URL`	Redis connection string	Yes	`redis://localhost:6379`	Yes	All
`JWT_SECRET`	JWT signing key	Yes	—	Yes	All
`JWT_EXPIRY`	Token expiry duration	Yes	`1h`	No	All
`SMTP_HOST`	SMTP server hostname	Yes	—	No	All
`SMTP_USER`	SMTP username	Yes	—	Yes	All
`SMTP_PASS`	SMTP password	Yes	—	Yes	All
`S3_BUCKET`	Object storage bucket name	Yes	—	No	All
`AWS_REGION`	Cloud region	Yes	`eu-west-1`	No	All
`SENTRY_DSN`	Error tracking DSN	No	—	Yes	Staging, Prod
`STRIPE_KEY`	Payment API key	Yes (if payments)	—	Yes	All
`LOG_LEVEL`	Logging verbosity	No	`info`	No	All
`RATE_LIMIT_WINDOW`	Rate limit window (ms)	No	`60000`	No	All
`RATE_LIMIT_MAX`	Max requests per window	No	`100`	No	All
`FEATURE_FLAG_KEY`	Feature flag SDK key	No	—	Yes	All

Rules:

Sensitive variables MUST be sourced from {{SECRET_STORE}} in staging and production
Never commit sensitive values to source control
Use .env.example with placeholder values for developer onboarding
Rotate all secrets on team member offboarding
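
An application can enforce the reference table at startup by failing fast when a required variable is absent. The sketch below is one possible shape, assuming the variable names and defaults from the table; `load_config`, `ConfigError`, and the decision to treat `STRIPE_KEY` as optional are illustrative assumptions.

```python
import os

# Required variables that have no default in the reference table.
REQUIRED = ("DATABASE_URL", "JWT_SECRET", "SMTP_HOST", "SMTP_USER", "SMTP_PASS", "S3_BUCKET")

# Defaults copied from the reference table.
DEFAULTS = {
    "NODE_ENV": "development",
    "PORT": "3000",
    "REDIS_URL": "redis://localhost:6379",
    "JWT_EXPIRY": "1h",
    "AWS_REGION": "eu-west-1",
    "LOG_LEVEL": "info",
    "RATE_LIMIT_WINDOW": "60000",
    "RATE_LIMIT_MAX": "100",
}
INTEGER_VARS = ("PORT", "RATE_LIMIT_WINDOW", "RATE_LIMIT_MAX")


class ConfigError(Exception):
    """Raised when the environment is incomplete or malformed."""


def load_config(env=None):
    """Validate the environment and return a dict of typed settings."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ConfigError("missing required variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name) or default
    for name in INTEGER_VARS:
        try:
            config[name] = int(config[name])
        except ValueError:
            raise ConfigError(f"{name} must be an integer, got {config[name]!r}") from None
    return config
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in unit tests without touching the real process environment.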

4. Secrets Management

4.1 Secret Storage Solution

Solution: {{SECRET_TOOL}}

Environment	Secret Store	Access Method
Local	`.env` file (never committed)	Developer managed
Dev	{{DEV_SECRET_STORE}}	CI/CD service account
Staging	{{STG_SECRET_STORE}}	IAM role / service account
Production	{{PROD_SECRET_STORE}}	IAM role / service account

4.2 Secret Rotation Schedule

Secret Type	Rotation Schedule	Automated	Owner
Database passwords	90 days	{{AUTOMATED}}	Platform team
API keys (internal)	365 days	No	Service owner
API keys (third-party)	On compromise	No	Dev lead
JWT signing keys	365 days	No	Platform team
TLS certificates	60 days before expiry	{{AUTOMATED}}	Platform team

4.3 Access Controls

Role	Dev Secrets	Staging Secrets	Production Secrets
Developer	Read/Write	Read	No access
DevOps	Read/Write	Read/Write	Read/Write
CI/CD (build)	Read	Read	No access
CI/CD (deploy)	No access	Read	Read
Application runtime	Read (scoped)	Read (scoped)	Read (scoped)

5. Feature Flags Per Environment

Tool: {{FF_TOOL}}

Flag	Dev	Staging	Production	Notes
`feature-new-checkout`	On	On	Off	Waiting for QA sign-off
`feature-dark-mode`	On	On	Off	Rollout planned {{DATE}}
`kill-switch-payments`	Off	Off	Off	Emergency disable only
`maintenance-mode`	Off	Off	Off	Emergency only

6. Database Configuration Per Environment

Parameter	Local	Dev	Staging	Production
Host	`localhost`	`{{DEV_DB}}`	`{{STG_DB}}`	`{{PROD_DB}}`
Port	`5432`	`5432`	`5432`	`5432`
Database name	`{{APP}}_dev`	`{{APP}}_dev`	`{{APP}}_staging`	`{{APP}}_prod`
Max connections	`10`	`25`	`50`	`{{PROD_CONNS}}`
SSL required	No	No	Yes	Yes
Connection pool	No	No	Yes ({{POOL}})	Yes ({{POOL}})
Read replica	No	No	No	Yes
Backup	No	Daily	Daily	{{BACKUP_FREQ}}

7. External Service Configuration Per Environment

Service	Dev	Staging	Production	Notes
Email (SMTP)	Mailtrap	Mailtrap	SendGrid / SES
Payments	Stripe test	Stripe test	Stripe live	Different API keys
SMS	Twilio test	Twilio test	Twilio live
Analytics	Disabled	Staging property	Production property
Error tracking	Disabled	Sentry staging project	Sentry prod project
Maps	No key / free tier	Paid key	Paid key

8. Environment Provisioning Process

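Infrastructure provisioning: terraform -chdir=environments/{{ENV}} apply (each environment directory carries its own terraform.tfvars, see the Infrastructure as Code repository structure)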
Secret provisioning: bash scripts/provision-secrets.sh {{ENV}}
Database provisioning: bash scripts/create-db.sh {{ENV}}
DNS configuration: Update DNS records per deployment-architecture.md
TLS certificates: Auto-provisioned via {{CERT_TOOL}}
Initial deployment: Trigger CI/CD for {{ENV}} target
Verification: Run smoke tests against new environment

Estimated time: {{PROVISION_TIME}} minutes
Runbook: {{PROVISION_RUNBOOK_LINK}}

9. Environment Teardown Process

Verify no active users or critical processes
Export any required data / logs
Remove DNS records
Revoke TLS certificates
terraform -chdir=environments/{{ENV}} destroy
Purge secrets from secret store
Archive environment configuration to {{ARCHIVE_LOCATION}}
Update this document to remove the environment entry

10. Parity Policy (Staging ↔ Production Drift)

Goal: Staging should be functionally identical to production at all times, apart from the intentional differences listed in Section 2.2.

Area	Policy
Application version	Staging runs the production release or is at most one release ahead
Infrastructure spec	Same instance types and topology
Database engine & version	Must match exactly
OS & runtime versions	Must match exactly
Third-party dependencies	Same versions (except external service mode)
Network topology	Same (except size)
Security controls	Same

Drift detection: {{DRIFT_DETECTION}}
Drift resolution owner: Platform team
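
The "must match" rows of the parity table lend themselves to an automated check. The following is a minimal sketch, assuming each environment can be described by a flat dictionary of facts; the key names and the `parity_drift` helper are hypothetical and would have to be fed from whatever inventory source the project uses.

```python
# Hypothetical fact keys for the areas the policy requires to match.
MUST_MATCH = ("db_engine", "db_version", "os_version", "runtime_version", "instance_type")


def parity_drift(staging, production, keys=MUST_MATCH):
    """Return the keys whose values differ, with both observed values."""
    drift = {}
    for key in keys:
        if staging.get(key) != production.get(key):
            drift[key] = {"staging": staging.get(key), "production": production.get(key)}
    return drift


staging = {"db_engine": "postgres", "db_version": "16.2", "os_version": "al2023",
           "runtime_version": "20.11", "instance_type": "m6i.large"}
production = dict(staging, db_version="16.1")  # example data, not real values

for key, values in parity_drift(staging, production).items():
    print(f"DRIFT {key}: staging={values['staging']} production={values['production']}")
```

An empty result means the compared areas are in parity; a missing key on either side is reported as drift because `None` differs from any real value.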

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

Infrastructure as Code

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Overview

IaC Tool: {{IAC_TOOL}}
Tool Version: {{IAC_VERSION}}
Provider: {{CLOUD_PROVIDER}}
Provider Version: {{PROVIDER_VERSION}}

Rationale for tool choice:

Core Principles:

All infrastructure changes go through code (no manual console changes in staging/prod)
IaC reviewed like application code (PR, review, merge)
Version-controlled code is the source of truth for desired infrastructure; remote state records what is actually deployed
Modules are versioned and reusable

2. Repository Structure

{{IaC_REPO}}/
├── modules/                    # Reusable modules
│   ├── networking/             # VPC, subnets, security groups
│   ├── compute/                # EC2, ECS, Lambda
│   ├── database/               # RDS, ElastiCache
│   ├── storage/                # S3, EFS
│   └── monitoring/             # CloudWatch, alerts
├── environments/               # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── shared/                     # Shared resources (DNS, accounts)
├── scripts/                    # Helper scripts
│   ├── bootstrap.sh            # Initialize state backend
│   └── validate.sh             # Pre-apply validation
├── .terraform-version          # Pin tool version (tfenv)
├── .tflint.hcl                 # Linting config
└── README.md

2.1 Module Organization

Module	Purpose	Inputs	Outputs
`modules/networking`	VPC, subnets, routing	region, cidr_block, az_count	vpc_id, subnet_ids, sg_ids
`modules/compute`	ECS cluster, task definitions	cluster_name, instance_type	cluster_arn, task_role_arn
`modules/database`	RDS instance, parameter groups	engine, instance_class	db_endpoint, db_secret_arn
`modules/storage`	S3 buckets with policies	bucket_name, purpose	bucket_arn, bucket_name
`modules/monitoring`	CloudWatch dashboards, alarms	service_name, thresholds	alarm_arns, dashboard_url

2.2 Environment Separation

Each environment directory is independently deployable
Environments call the same modules with different variable values
No cross-environment dependencies (except shared DNS zone)
Production has stricter apply controls (see Section 6)

2.3 Shared Modules

Shared module registry: {{MODULE_REGISTRY}}

Module	Source	Version	Used By
`networking`	`{{REGISTRY}}/networking`	`~> 2.0.0`	All environments
`database`	`{{REGISTRY}}/database`	`~> 1.5.0`	Staging, Production
`monitoring`	`{{REGISTRY}}/monitoring`	`~> 1.2.0`	All environments

3. State Management

3.1 Remote State Backend

Backend: {{STATE_BACKEND}}

Environment	State Location	Access
Dev	`{{STATE_BUCKET}}/dev/terraform.tfstate`	DevOps team
Staging	`{{STATE_BUCKET}}/staging/terraform.tfstate`	DevOps team
Production	`{{STATE_BUCKET}}/production/terraform.tfstate`	Senior DevOps + CI only

Bootstrap (first-time setup):

bash scripts/bootstrap.sh {{ENVIRONMENT}}

3.2 State Locking

Locking Mechanism: {{LOCK_MECHANISM}}
Lock timeout: {{LOCK_TIMEOUT}}s
Force unlock: Only by senior DevOps after verifying no active apply

Lock table (if DynamoDB):

Table: {{LOCK_TABLE}}
Key: LockID
Billing: On-demand

3.3 State File Organization

Splitting strategy: {{SPLIT_STRATEGY}}

State File	Contains	Reason for split
`base/terraform.tfstate`	Networking, IAM	Infrequently changed
`app/terraform.tfstate`	Compute, app services	Frequently changed
`data/terraform.tfstate`	Databases, caches	High risk, separate lifecycle

4. Module Design

4.1 Naming Conventions

Resource naming pattern: {{PROJECT}}-{{ENVIRONMENT}}-{{COMPONENT}}-{{SUFFIX}}

Resource	Example
VPC	`myapp-prod-vpc`
ECS Cluster	`myapp-prod-cluster`
RDS Instance	`myapp-prod-db-primary`
S3 Bucket	`myapp-prod-assets-{{ACCOUNT_ID}}`
Security Group	`myapp-prod-app-sg`
IAM Role	`myapp-prod-app-task-role`
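
The naming pattern can be enforced with a small helper so that names are built rather than typed. This is a minimal sketch; the `resource_name` function and its validation rules are assumptions, and it uses the short environment form `prod` because that is what the examples in the table show.

```python
import re

# Short environment names as they appear in the example column (assumption).
VALID_ENVIRONMENTS = ("dev", "staging", "prod")

# Each part: lowercase letters and digits, optionally separated by single hyphens.
_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def resource_name(project, environment, component, suffix=None):
    """Build a name of the form project-environment-component[-suffix]."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    parts = [project, environment, component]
    if suffix:
        parts.append(suffix)
    for part in parts:
        if not _PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


print(resource_name("myapp", "prod", "vpc"))
print(resource_name("myapp", "prod", "db", "primary"))
```

Rejecting uppercase letters and underscores up front avoids names that some services, such as S3 bucket naming, would refuse later.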

4.2 Input / Output Variables

Required variable fields:

variable "environment" {
  description = "Deployment environment (dev/staging/production)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

Required output fields:

output "database_endpoint" {
  description = "Database connection endpoint in host:port format"
  value       = aws_db_instance.main.endpoint  # use .address for the hostname alone
  sensitive   = false
}

4.3 Versioning Strategy

Module versioning: Semantic versioning (MAJOR.MINOR.PATCH)
Pin strategy: ~> MAJOR.MINOR.PATCH, for example ~> 2.0.0 (allows patch updates only; ~> MAJOR.MINOR would also accept new minor versions)
Upgrade policy: Review and test before upgrading minor/major versions
Changelog: Every module version bump requires a CHANGELOG entry
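
Terraform's `~>` operator lets only the rightmost component written in the constraint increase, so the number of components decides how much movement is allowed. The sketch below re-implements that rule for plain numeric versions to make the behavior easy to check; it is an illustration, not Terraform's own code, and it deliberately ignores pre-release suffixes.

```python
def parse(version):
    """Split a dotted numeric version such as '2.0.7' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version, constraint):
    """Return True if version satisfies the constraint '~> constraint'."""
    base = parse(constraint)
    if len(base) < 2:
        raise ValueError("constraint needs at least MAJOR.MINOR")
    candidate = parse(version)
    candidate += (0,) * (len(base) - len(candidate))  # pad '2.1' to '2.1.0' if needed
    # Everything left of the rightmost constraint component must be identical,
    # and the candidate must not be older than the constraint itself.
    return candidate[:len(base) - 1] == base[:-1] and candidate >= base


for version, constraint in [("2.0.7", "2.0.0"), ("2.1.0", "2.0.0"),
                            ("2.3.1", "2.0"), ("3.0.0", "2.0")]:
    print(f"{version} satisfies ~> {constraint}: {satisfies_pessimistic(version, constraint)}")
```

With a three-part constraint only patch releases are accepted, while a two-part constraint also accepts newer minor releases within the same major version.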

5. Workflow

5.1 Standard Change Process

flowchart LR
    BRANCH[Create branch] --> CODE[Write/modify IaC]
    CODE --> VALIDATE[terraform validate + tflint]
    VALIDATE --> PLAN[terraform plan]
    PLAN --> PR[Open PR with plan output]
    PR --> REVIEW[Peer review]
    REVIEW --> APPROVE[Approval]
    APPROVE --> APPLY[terraform apply in CI]
    APPLY --> VERIFY[Verify resources]

Steps:

Create feature branch: infra/{{TICKET}}-description
Make changes, run terraform validate && terraform fmt
Run terraform plan — attach output to PR
Open PR for review (at least 1 reviewer required for dev/staging, 2 for production)
CI runs terraform plan automatically on PR open
Merge triggers terraform apply in CI (dev/staging)
Production apply requires manual trigger after PR merge

5.2 PR-Based Infrastructure Changes

PR Requirements:

Title: [IaC] {{ENVIRONMENT}}: description of change
Must include terraform plan output in PR description or CI artifact
Must include justification for the change
Must reference the related application ticket (if applicable)
Must have passing CI validation (fmt, validate, tflint, plan)

5.3 Automated Drift Detection

Schedule: {{DRIFT_SCHEDULE}}
Tool: {{DRIFT_TOOL}}
Alert Channel: {{DRIFT_ALERT_CHANNEL}}

Action on drift:

Investigate cause (manual change, provider issue, external system)
Either fix drift (apply IaC) or update IaC to reflect intentional change
Never leave drift unresolved for > {{DRIFT_SLA}}
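
If the drift tool is Terraform itself, a scheduled job can rely on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The following is a minimal sketch of such a job; the function names and the working-directory layout are assumptions, and sending the result to the alert channel is left out.

```python
import subprocess


def classify_plan_exit(code):
    """Map the exit code of 'terraform plan -detailed-exitcode' to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded, no changes
    if code == 2:
        return "drift"    # plan succeeded, changes are pending
    return "error"        # plan itself failed


def check_drift(workdir):
    """Run a read-only plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", f"-chdir={workdir}", "plan",
         "-detailed-exitcode", "-input=false", "-lock=false", "-no-color"],
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A "drift" result only indicates real drift when the checked-out code matches what was last applied; on a branch with unapplied changes the same exit code simply reflects those changes.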

6. Security

6.1 Least Privilege for IaC Service Account

Environment	Service Account	Permissions
Dev	`ci-iac-dev@{{PROJECT}}`	Full write within dev resources
Staging	`ci-iac-staging@{{PROJECT}}`	Full write within staging resources
Production	`ci-iac-prod@{{PROJECT}}`	Restricted write, requires MFA session

6.2 Secret Injection (Not in State)

Rule: Never pass passwords, API keys, or secrets as Terraform variables
Pattern: Reference secrets manager in resource configuration:

# WRONG — secret in state
resource "aws_db_instance" "main" {
  password = var.db_password  # This will be in state in plaintext!
}

# RIGHT — secret from Secrets Manager
resource "aws_db_instance" "main" {
  manage_master_user_password = true  # AWS manages the password in Secrets Manager
}

6.3 Policy as Code

Tool: {{POLICY_TOOL}}

Policy	Enforcement
No public S3 buckets	Block
All resources must have environment tag	Warn
RDS must be in private subnet	Block
Security groups must not allow `0.0.0.0/0` on sensitive ports	Block
Encryption at rest required for data resources	Block
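
To illustrate what a "Block" policy evaluates, the sketch below checks firewall rules for the `0.0.0.0/0` rule in the table. It is plain Python over a hypothetical rule format, not the syntax of any particular policy tool, and the list of sensitive ports is an assumption that each project would define for itself.

```python
# Assumed set of sensitive ports: SSH, MySQL, PostgreSQL, Redis.
SENSITIVE_PORTS = {22, 3306, 5432, 6379}
ANYWHERE = "0.0.0.0/0"


def open_sensitive_ports(rules):
    """Return a message for every inbound rule exposing a sensitive port to the internet."""
    findings = []
    for rule in rules:
        if (rule["direction"] == "inbound"
                and rule["source"] == ANYWHERE
                and rule["port"] in SENSITIVE_PORTS):
            findings.append(f"{rule['group']}: port {rule['port']} open to {ANYWHERE}")
    return findings


rules = [  # example data in a hypothetical format
    {"group": "sg-alb", "direction": "inbound", "port": 443, "source": ANYWHERE},
    {"group": "sg-db", "direction": "inbound", "port": 5432, "source": "sg-app"},
    {"group": "sg-debug", "direction": "inbound", "port": 22, "source": ANYWHERE},
]

for finding in open_sensitive_ports(rules):
    print("BLOCK", finding)
```

In a pipeline, a non-empty result would fail the check before `terraform apply` runs, which is what distinguishes "Block" from "Warn".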

7. Tagging Strategy

Required tags on all resources:

Tag	Value	Purpose
`Project`	`{{PROJECT_NAME}}`	Cost attribution
`Environment`	`dev` / `staging` / `production`	Environment filter
`ManagedBy`	`terraform`	Identifies IaC-managed resources
`Team`	`{{TEAM}}`	Ownership
`CostCenter`	`{{COST_CENTER}}`	Finance attribution
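
A tag check along these lines can run in CI against planned resources. This is a minimal sketch, assuming tags are available as a dictionary per resource; `tag_problems` is a hypothetical helper, and the allowed values are taken from the table above.

```python
REQUIRED_TAGS = ("Project", "Environment", "ManagedBy", "Team", "CostCenter")
ALLOWED_ENVIRONMENTS = ("dev", "staging", "production")


def tag_problems(tags):
    """Return a list of human-readable problems with a resource's tags."""
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS if not tags.get(name)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    managed_by = tags.get("ManagedBy")
    if managed_by and managed_by != "terraform":
        problems.append(f"invalid ManagedBy: {managed_by}")
    return problems


example = {"Project": "myapp", "Environment": "prod", "ManagedBy": "terraform", "Team": "platform"}
for problem in tag_problems(example):
    print(problem)
```

The example resource is flagged twice: `CostCenter` is missing and `prod` is not one of the allowed `Environment` values.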

Optional tags:

Tag	Value	Purpose
`Service`	`{{SERVICE_NAME}}`	Service-level grouping
`Ticket`	`{{TICKET_ID}}`	Change tracking
`ExpiresAt`	`{{DATE}}`	Ephemeral resource cleanup

8. Cost Management

Budget alerts:

Dev: Alert at ${{DEV_BUDGET}} / month
Staging: Alert at ${{STG_BUDGET}} / month
Production: Alert at ${{PROD_BUDGET}} / month

Cost optimization built into IaC:

Dev/staging auto-shutdown: {{AUTO_SHUTDOWN_SCHEDULE}}
Right-sizing: Instance types reviewed quarterly
Reserved instances / savings plans: Applied to production

9. Disaster Recovery for IaC State

State backup: {{STATE_BACKUP}}

Recovery procedure:

Restore from most recent backup
Run terraform plan — verify no unexpected changes
If state is unrecoverable: terraform import for each managed resource (refer to resource inventory)

Prevention:

S3 versioning enabled on state bucket
MFA delete required for state bucket
State bucket access logged to CloudTrail

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

Monitoring & Observability

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Observability Strategy

Observability Platform: {{OBS_PLATFORM}}
Strategy: Instrument everything, alert on symptoms (not causes), correlate across pillars

Core Questions We Must Be Able to Answer:

Is the system up and serving users correctly?
How fast is it responding?
What errors are occurring and why?
Where is the bottleneck?
What changed before this problem started?

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

Metric	Source	Alert Threshold	Severity
CPU utilization	Node exporter / CloudWatch	> {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical)	Warning / Critical
Memory utilization	Node exporter / CloudWatch	> {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical)	Warning / Critical
Disk utilization	Node exporter / CloudWatch	> {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical)	Warning / Critical
Network in/out	Node exporter / CloudWatch	> {{NET_LIMIT}}Mbps sustained	Warning
Container restarts	Kubernetes / ECS	> {{RESTART_LIMIT}} in 5min	Critical
Node not ready	Kubernetes	Any	Critical

Application Metrics (RED Method)

Metric	Description	Target	Alert Threshold
Request rate	Requests per second per service	Baseline ± 20%	> 50% deviation from baseline
Error rate	% requests returning 5xx	< {{ERROR_RATE}}%	> {{ERROR_ALERT}}%
P50 latency	Median response time	< {{P50}}ms	> {{P50_ALERT}}ms
P95 latency	95th percentile response time	< {{P95}}ms	> {{P95_ALERT}}ms
P99 latency	99th percentile response time	< {{P99}}ms	> {{P99_ALERT}}ms

Business Metrics

Metric	Description	Collection Method	Dashboard
Active users (DAU/MAU)	Daily/monthly active users	Frontend instrumentation	Business dashboard
{{CONVERSION_METRIC}}	{{CONVERSION_DESC}}	Backend event	Business dashboard
{{REVENUE_METRIC}}	{{REVENUE_DESC}}	Payment events	Finance dashboard
Feature usage	Feature-level engagement	Feature flag SDK	Product dashboard

Custom Metrics Definition

Metric Name	Type	Labels	Description	Unit
`{{APP}}_job_queue_depth`	Gauge	`queue_name`	Number of pending jobs	count
`{{APP}}_job_processing_duration`	Histogram	`queue_name, status`	Job processing time	seconds
`{{APP}}_external_api_calls_total`	Counter	`service, status`	External API call count	count
`{{APP}}_cache_hit_ratio`	Gauge	`cache_type`	Cache hit ratio (0 to 1)	ratio

2.2 Logs

Log Levels & Usage Guide

Level	When to Use	Examples
`ERROR`	Unexpected failure requiring attention	Database connection failure, unhandled exception
`WARN`	Unexpected but handled situation	Deprecated API called, retry succeeded
`INFO`	Normal business events	User logged in, order created, job completed
`DEBUG`	Diagnostic detail (dev/staging only)	Function parameters, internal state
`TRACE`	Extremely verbose (local dev only)	SQL queries, HTTP request/response bodies

Production log level: INFO and above

Structured Logging Format

{
  "timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "service": "{{SERVICE_NAME}}",
  "version": "{{VERSION}}",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "user_id": "{{HASHED_OR_OMIT}}",
  "request_id": "req-uuid-here",
  "message": "Order created successfully",
  "order_id": "ord-123",
  "duration_ms": 45
}

Required fields: timestamp, level, service, message, trace_id
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)
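
The format above can be produced with a custom formatter for Python's standard `logging` module. The following is a minimal sketch, assuming the service name and version are passed in by the application; the `JsonFormatter` class and the choice of optional fields are illustrative, not a prescribed implementation.

```python
import json
import logging
from datetime import datetime, timezone

OPTIONAL_FIELDS = ("span_id", "request_id", "user_id", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with the required fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Python names this level WARNING; the guide above uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "version": self.version,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        for field in OPTIONAL_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter(service="orders", version="1.0.0"))  # example values
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Order created successfully", extra={"trace_id": "abc123def456", "duration_ms": 45})
```

Context such as `trace_id` is supplied through the `extra` argument here; in a real service it would more likely be injected by tracing middleware or a logging filter.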

Log Aggregation Pipeline

flowchart LR
    APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
    AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
    STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
    STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]

Stage	Tool	Configuration
Application logging	{{LOG_LIB}}	Structured JSON to stdout
Log agent	{{LOG_AGENT}}	Deployed as sidecar / DaemonSet
Transport	{{LOG_TRANSPORT}}	TLS encrypted
Storage	{{LOG_STORE}}	Indexed, compressed
Query	{{LOG_QUERY}}	Access via dashboard

Log Retention Policy

Environment	Retention	Storage Tier
Dev	7 days	Hot
Staging	30 days	Hot
Production	{{PROD_LOG_RETENTION}} days	Hot (30d) → Cold archive
Audit logs	1 year (regulatory)	Hot (90d) → Cold archive

PII in Logs — Masking Strategy

Data Type	Strategy	Example
Email address	Hash + truncate	`user:sha256(email)[:8]`
Phone number	Redact	`[PHONE_REDACTED]`
IP address	Anonymize last octet	`192.168.1.xxx`
Payment data	Never log	Use `[PAYMENT_DATA_OMITTED]`
Auth tokens	Never log	Use `[TOKEN_OMITTED]`
Names	Omit or pseudonymize	Reference by ID only
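
The masking strategies in the table translate into small helper functions applied before a value reaches the logger. This is a minimal sketch; the function names are assumptions, as are lowercasing the email before hashing and redacting IPv6 addresses entirely.

```python
import hashlib
import ipaddress


def mask_email(email):
    """Replace an email address with 'user:' plus the first 8 hex chars of its SHA-256."""
    normalized = email.strip().lower()  # assumption: treat case variants as the same user
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"user:{digest[:8]}"


def mask_ip(ip):
    """Hide the last octet of an IPv4 address; redact anything else completely."""
    address = ipaddress.ip_address(ip)  # raises ValueError for malformed input
    if address.version != 4:
        return "[IP_REDACTED]"
    octets = str(address).split(".")
    return ".".join(octets[:3] + ["xxx"])


def mask_phone(_phone):
    """Phone numbers are never logged, not even partially."""
    return "[PHONE_REDACTED]"


print(mask_email("Jane.Doe@example.com"))
print(mask_ip("192.168.1.42"))
print(mask_phone("+1 555 0100"))
```

A plain hash of an email address can be reversed by hashing candidate addresses, so a keyed hash such as HMAC with a secret key may be preferable where that risk matters.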

2.3 Traces

Distributed Tracing Setup

Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}

Service	Instrumented	Framework	Notes
{{SERVICE_1}}	Yes	OpenTelemetry	HTTP, DB, Redis
{{SERVICE_2}}	Yes	OpenTelemetry	HTTP, external calls

Trace Sampling Strategy

Environment	Strategy	Rate	Notes
Dev	Always-on	100%	Full visibility
Staging	Always-on	100%	Full visibility
Production	Tail-based	{{SAMPLE_RATE}}% + errors	Error traces always kept

Tail-based sampling rules:

Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms
Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
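
The first two rules above amount to a decision made once a trace has finished. The sketch below shows that decision in isolation, assuming the caller already knows whether the trace contained an error and how long it took; `keep_trace` is a hypothetical function, and in practice this logic normally lives in the collector's sampling configuration rather than in application code. The head-based fallback is not covered.

```python
import random


def keep_trace(has_error, duration_ms, slow_threshold_ms, sample_rate_percent, rng=random.random):
    """Decide whether a completed trace should be stored."""
    if has_error:
        return True  # error traces are always kept
    if duration_ms > slow_threshold_ms:
        return True  # slow traces are always kept
    # Successful, fast traces are kept with the configured probability.
    return rng() * 100 < sample_rate_percent


# Example with assumed settings: 500 ms slow threshold, 10% sample rate.
print(keep_trace(has_error=True, duration_ms=20, slow_threshold_ms=500, sample_rate_percent=10))
print(keep_trace(has_error=False, duration_ms=900, slow_threshold_ms=500, sample_rate_percent=10))
```

The random source is injectable so that the probabilistic branch can be tested deterministically.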

Span Naming Conventions

Operation Type	Naming Pattern	Example
HTTP handler	`HTTP {{METHOD}} {{ROUTE}}`	`HTTP POST /api/orders`
DB query	`db.{{operation}} {{table}}`	`db.select orders`
Cache	`cache.{{operation}} {{key_pattern}}`	`cache.get user:*`
Queue	`queue.{{operation}} {{queue_name}}`	`queue.publish order-events`
External HTTP	`{{service}} {{METHOD}} {{path}}`	`stripe POST /charges`

Context Propagation

Standard: W3C Trace Context (traceparent header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
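
A `traceparent` header has four hyphen-separated fields: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2 hex characters of flags whose lowest bit marks the trace as sampled. The following is a minimal sketch of creating, parsing, and propagating such a header for version `00` only; the function names are assumptions, and a tracing SDK would normally handle this automatically.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are not valid
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": bool(int(flags, 16) & 1)}


def child_traceparent(header):
    """Build the header for an outgoing call or queued message.

    Keeps the trace ID and sampling decision, and sets a fresh span ID.
    Starts a new trace if the incoming header is missing or invalid.
    """
    parent = parse_traceparent(header or "")
    if parent is None:
        return new_traceparent()
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

For asynchronous work, the value returned by `child_traceparent` would be stored in the message headers or job metadata and parsed again by the consumer.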

3. Alerting

3.1 Alert Rules

Alert Name	Condition	Duration	Severity	Channel	Runbook
`HighErrorRate`	error_rate > {{ERROR_ALERT}}%	2 min	Critical	PagerDuty	[link]
`SlowP99`	p99_latency > {{P99_ALERT}}ms	5 min	Warning	Slack #alerts	[link]
`ServiceDown`	health_check failing	1 min	Critical	PagerDuty	[link]
`HighCPU`	cpu > {{CPU_WARN}}%	10 min	Warning	Slack #alerts	[link]
`DiskAlmostFull`	disk > {{DISK_CRIT}}%	5 min	Critical	PagerDuty	[link]
`DeploymentFailed`	deployment status = failed	Immediate	Critical	Slack #deployments	[link]
`CertificateExpiringSoon`	cert_expiry < 30 days	—	Warning	Slack #ops	[link]
`BackupFailed`	backup job = failed	—	Critical	PagerDuty	[link]
`SLOBudgetBurning`	error_budget < 10% remaining	—	Critical	PagerDuty	[link]

3.2 Alert Routing & Escalation

flowchart TD
    ALERT[Alert fires] --> SEVERITY{Severity?}
    SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
    SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
    ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
    ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]

Severity	Response SLA	Channel	Escalation
Critical (P1)	Acknowledge in 5 min, resolve in 1h	PagerDuty + call	Escalate at 5 min
High (P2)	Acknowledge in 30 min, resolve in 4h	PagerDuty	Escalate at 30 min
Warning (P3)	Review within 1 business day	Slack	Manual
Info	No response required	Slack	None

3.3 On-Call Rotation

Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)

3.4 Alert Fatigue Prevention

Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
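Minimum alert duration: 2+ minutes for threshold-based alerts (no single-spike alerts); state-change alerts such as ServiceDown and DeploymentFailed may fire sooner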
Deduplication window: {{DEDUP_WINDOW}} minutes
Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
Post-mortem requirement: Every Critical alert reviewed after incident

4. Dashboards

4.1 Dashboard Inventory

Dashboard	Purpose	Link	Audience
System Overview	High-level health of all services	{{LINK}}	Everyone
{{SERVICE_1}}	Service-level detail	{{LINK}}	Dev team
Infrastructure	Host/container metrics	{{LINK}}	DevOps
Business Metrics	KPIs and conversions	{{LINK}}	Leadership, PM
SLO Tracker	Error budget tracking	{{LINK}}	Engineering lead
On-Call	Current incidents, top errors	{{LINK}}	On-call engineer

4.2 Key Dashboard Specs — System Overview

Required panels:

Service health matrix (all services, green/red/yellow)
Request rate (all services, last 1h)
Error rate (all services, last 1h)
P99 latency (all services, last 1h)
Active incidents count
Error budget remaining (all SLOs)
Last deployment (service, version, time)
Infrastructure health (CPU, memory, disk — aggregate)

5. SLOs / SLIs

5.1 SLI Definitions

SLI	Definition	Measurement Method
Availability	% requests returning non-5xx	(total_requests - 5xx_requests) / total_requests
Latency	% requests completing within threshold	requests_within_{{LATENCY_SLI}}ms / total_requests (for example, the histogram bucket at the threshold divided by the request count)
Error rate	% requests not returning errors	(total_requests - error_requests) / total_requests
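
The ratios in the table can be computed directly from request counts. The sketch below is illustrative only: the function names are hypothetical, and it assumes raw per-request latencies are available (a metrics backend would normally use histogram buckets instead).

```python
def availability_sli(total_requests: int, server_errors_5xx: int) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as fully available
    return (total_requests - server_errors_5xx) / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    if not latencies_ms:
        return 1.0
    within = sum(1 for value in latencies_ms if value <= threshold_ms)
    return within / len(latencies_ms)
```

Both functions return a fraction between 0 and 1; multiply by 100 to compare against a percentage target.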

5.2 SLO Targets

Service	SLI	Target	Window	Error Budget
{{SERVICE}}	Availability	{{AVAIL_TARGET}}%	30 days	{{BUDGET_MINUTES}} min/month
{{SERVICE}}	Latency (P95 < {{P95}}ms)	{{LATENCY_TARGET}}%	30 days	{{LATENCY_BUDGET_MINUTES}} min/month

5.3 Error Budget Tracking

Service	Monthly Budget	Burned This Month	Remaining	Burn Rate (24h)
{{SERVICE}}	{{BUDGET}}min	TBD	TBD	TBD

Error budget policy:

Budget > 50% remaining: Move fast, deploy freely
Budget 10-50% remaining: Slow down, prioritize reliability work
Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
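
As a worked example of the policy above, the sketch below converts an availability target into minutes of error budget and maps the remaining share to a policy tier. The function names are hypothetical; a 99.9% target over 30 days gives roughly 43.2 minutes.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target_pct / 100)


def budget_policy(remaining_fraction: float) -> str:
    """Map the share of budget still unspent (0.0 to 1.0) to the policy tier."""
    if remaining_fraction > 0.50:
        return "move fast, deploy freely"
    if remaining_fraction >= 0.10:
        return "slow down, prioritize reliability work"
    return "freeze non-critical deploys, focus on reliability"
```

The boundary values (exactly 50% and exactly 10%) are treated here as part of the middle tier, which is an assumption the policy text leaves open.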

6. Tooling

Tool	Version	Purpose	Hosted
{{METRICS_TOOL}}	{{VERSION}}	Metrics collection & storage	{{HOSTING}}
{{LOG_TOOL}}	{{VERSION}}	Log aggregation	{{HOSTING}}
{{TRACE_TOOL}}	{{VERSION}}	Distributed tracing	{{HOSTING}}
{{DASHBOARD_TOOL}}	{{VERSION}}	Visualization	{{HOSTING}}
{{ALERT_TOOL}}	{{VERSION}}	Alert routing & on-call	{{HOSTING}}

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

Disaster Recovery Plan

Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Business Continuity Overview

This plan documents the procedures to recover {{PROJECT_NAME}} services following a disaster event (data center failure, data corruption, security breach, or catastrophic failure).

Plan Owner: {{DR_OWNER}}
Plan Reviewer: {{DR_REVIEWER}}
Last Tested: {{LAST_TEST_DATE}}
Next Scheduled Test: {{NEXT_TEST_DATE}}

Disaster types covered:

Infrastructure failure (AZ/region outage)
Data corruption or accidental deletion
Security incident (ransomware, data breach)
Vendor/provider outage
Catastrophic application failure

2. RPO / RTO Targets Per Service Tier

Tier	Description	RPO	RTO	Examples
Tier 1 — Critical	Core user-facing services; downtime has direct revenue impact	0 (real-time replication)	< 15 min	Auth, checkout, core API
Tier 2 — Important	Supporting services; degraded experience without them	< 1 hour	< 4 hours	Notifications, reports
Tier 3 — Standard	Background/admin services; business can operate without temporarily	< 24 hours	< 24 hours	Analytics, admin panel
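
The tier targets can be encoded as data so that drill results are checked the same way every time. This is a sketch under the assumption that data loss and downtime are measured as durations; `TIER_TARGETS` and `meets_targets` are hypothetical names.

```python
from datetime import timedelta

# RPO/RTO upper bounds per tier, taken from the table above.
TIER_TARGETS = {
    1: {"rpo": timedelta(0), "rto": timedelta(minutes=15)},
    2: {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    3: {"rpo": timedelta(hours=24), "rto": timedelta(hours=24)},
}


def meets_targets(tier: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """Check a measured recovery against the RPO and RTO for its tier."""
    target = TIER_TARGETS[tier]
    # Zero data loss always passes; otherwise the loss must be below the RPO.
    rpo_ok = data_loss == timedelta(0) or data_loss < target["rpo"]
    rto_ok = downtime < target["rto"]
    return rpo_ok and rto_ok
```

For example, a Tier 2 drill with 30 minutes of data loss and 5 hours of downtime fails on RTO.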

3. Service Tier Classification

Service	Tier	Owner	Rationale
{{SERVICE_1}}	Tier 1	{{OWNER}}	Core user journey
{{SERVICE_2}}	Tier 1	{{OWNER}}	Authentication
{{SERVICE_3}}	Tier 2	{{OWNER}}	Supporting
{{SERVICE_4}}	Tier 3	{{OWNER}}	Admin only
Database — Primary	Tier 1	Platform	All services depend on it
Object Storage	Tier 2	Platform	User uploads

4. Backup Strategy

4.1 Database Backups

Database	Backup Type	Frequency	Retention	Location	Verified
{{DB_PRIMARY}}	Automated snapshot	Daily	30 days	{{BACKUP_LOCATION}}	Monthly
{{DB_PRIMARY}}	Point-in-time recovery	Continuous	7 days	{{BACKUP_LOCATION}}	Monthly
{{DB_READ_REPLICA}}	Not backed up separately	—	—	Rebuilt from primary	—

Automated backup tool: {{BACKUP_TOOL}}
Backup encryption: AES-256, key managed in {{KMS_TOOL}}
Cross-region copy: {{CROSS_REGION}}

4.2 File / Object Storage Backups

Storage	Backup Method	Frequency	Retention	DR Copy
{{S3_BUCKET}}	S3 versioning + replication	Continuous	{{RETENTION}}	{{DR_BUCKET}}
{{FILE_STORE}}	Snapshot	Daily	30 days	Cross-region

4.3 Configuration Backups

Config	Backup Method	Location	Frequency
IaC (Terraform)	Git repository	{{GIT_REPO}}	On change
Application config	Git repository	{{GIT_REPO}}	On change
Secrets	Secrets manager replication	{{SECRETS_BACKUP}}	Real-time
DNS records	Export to Git	{{GIT_REPO}}	Weekly
TLS certificates	Secrets manager	{{CERTS_BACKUP}}	On renewal

4.4 Backup Testing Schedule

Backup Type	Test Frequency	Last Test	Result	Tester
Database full restore	Monthly	{{DATE}}	{{RESULT}}	{{TESTER}}
Point-in-time restore	Quarterly	{{DATE}}	{{RESULT}}	{{TESTER}}
Object storage restore	Quarterly	{{DATE}}	{{RESULT}}	{{TESTER}}
Full DR failover drill	Bi-annually	{{DATE}}	{{RESULT}}	{{TESTER}}

5. Failover Procedures

5.1 Automated Failover

Component	Automatic Failover	Mechanism	Failover Time
Database (Multi-AZ)	Yes	RDS automatic failover	60-120 seconds
Load balancer	Yes	Health check → route to healthy targets	< 30 seconds
CDN	Yes	Origin health checks	< 60 seconds
Redis (if clustered)	Yes	Redis Sentinel / ElastiCache	< 30 seconds

Monitoring automatic failover:

Alert fires: MultiAZFailover CloudWatch event or equivalent
On-call notified immediately
No manual action required, but on-call must confirm recovery

5.2 Manual Failover Steps

Prerequisite: Automatic failover has NOT occurred or has failed.

Database Manual Failover (Tier 1)

Confirm primary is unavailable: ping {{DB_PRIMARY_HOST}} should time out; if ICMP is blocked on the network, check with pg_isready -h {{DB_PRIMARY_HOST}} instead
Connect to standby: psql {{STANDBY_HOST}}
Promote standby to primary: SELECT pg_promote();
Update DNS record db.{{INTERNAL_DOMAIN}} → {{STANDBY_HOST}}
DNS TTL: Ensure TTL was set to 60s pre-incident (if not, wait {{DNS_TTL}} seconds)
Verify applications are reconnecting: Check application logs for successful DB connections
Page on-call to verify all services healthy
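
ICMP can be blocked inside cloud networks, so a ping timeout alone may not prove the primary is down. A minimal sketch of a TCP-level check follows, assuming PostgreSQL on its default port 5432; `tcp_reachable` is a hypothetical helper, not part of any database tooling.

```python
import socket


def tcp_reachable(host: str, port: int = 5432, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

The same helper can be pointed at the standby after promotion to confirm it accepts connections before the DNS record is switched.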

Regional Failover (Catastrophic)

Declare DR event (approval from {{DR_AUTHORITY}})
Confirm primary region {{PRIMARY_REGION}} is unreachable
Activate standby in {{DR_REGION}}: terraform apply -var-file=envs/dr.tfvars
Restore database from latest cross-region snapshot
Update Route 53 / DNS to point to {{DR_REGION}} endpoints
Run smoke tests: bash scripts/smoke-tests.sh {{DR_REGION}}
Notify stakeholders (see Communication Plan)
Monitor enhanced metrics for {{MONITOR_PERIOD}}h

6. Recovery Procedures Per Service

Tier 1 Services

Service	Recovery Procedure	Recovery Script	Est. Time
{{SERVICE_1}}	1. Restore from snapshot 2. Verify config 3. Run smoke tests	`scripts/restore-{{SERVICE_1}}.sh`	{{TIME}}min
Authentication	1. Deploy from last known good image 2. Verify JWT keys 3. Test login flow	`scripts/restore-auth.sh`	{{TIME}}min

Tier 2 Services

Tier 3 Services

7. DR Drill Schedule & Scenarios

Drill Type	Frequency	Participants	Last Executed	Next Scheduled
Tabletop exercise	Quarterly	On-call team + engineering lead	{{DATE}}	{{DATE}}
Database failover test	Quarterly	DevOps + one developer	{{DATE}}	{{DATE}}
Full DR failover	Bi-annually	Entire engineering team	{{DATE}}	{{DATE}}
Backup restore test	Monthly	DevOps	{{DATE}}	{{DATE}}

Drill Scenarios to Cover:

Database primary failure (automatic failover test)
Accidental data deletion (point-in-time restore)
Single AZ outage (multi-AZ failover)
Full region failure (cross-region DR)
Ransomware/data corruption (restore from offline backup)
CDN outage (origin fallback)
Secret store unavailable (cached credentials)

8. Communication Plan During DR Event

Internal Communications

Audience	Channel	Frequency	Owner
Engineering team	Slack #incidents + war room call	Real-time	Incident commander
Engineering management	Direct message	At declaration + hourly	Incident commander
Product/Business leadership	Email + Slack	At declaration + hourly	Incident commander
Customer support	Dedicated Slack channel	At declaration + 30 min	Support lead

External Communications

Audience	Channel	Trigger	Message
Customers	Status page ({{STATUS_PAGE}})	Within 15 min of confirmed incident	"We are investigating an issue"
Customers	Status page update	Every 30 min	Progress update
Customers	Email	If impact > {{EMAIL_THRESHOLD}}h	Direct notification
SLA customers	Direct contact	Per SLA contract	As contractually required

Communication templates: See go-live-runbook.md communication section

9. War Room Setup

War Room: {{WAR_ROOM_LINK}}
Bridge Line: {{BRIDGE_NUMBER}}
Document: Live incident doc created at: {{INCIDENT_DOC_TEMPLATE}}

Roles during DR event:

Role	Responsibility	Primary	Backup
Incident Commander	Coordinates response, final decisions	{{IC}}	{{IC_BACKUP}}
Technical Lead	Leads technical recovery	{{TECH_LEAD}}	{{TECH_BACKUP}}
Communications Lead	Internal/external updates	{{COMMS_LEAD}}	{{COMMS_BACKUP}}
Scribe	Documents timeline, actions taken	{{SCRIBE}}	Rotate

10. Post-Recovery Verification Checklist

11. DR Test Results Log

Date	Test Type	Scenario	RTO Achieved	RPO Achieved	Issues Found	Resolved By
{{DATE}}	{{TYPE}}	{{SCENARIO}}	{{RTO}}	{{RPO}}	{{ISSUES}}	{{RESOLVED}}

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

CI/CD Pipeline

Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Overview

CI/CD Platform: {{PLATFORM}}
Container Registry: {{REGISTRY}}
Deployment Target: {{DEPLOY_TARGET}}
Strategy: {{STRATEGY}}

2. Pipeline Overview

flowchart LR
    subgraph Source
        PR[Pull Request]
        MERGE[Merge to main]
    end

    subgraph CI["CI — runs on every PR"]
        LINT[Lint & Format]
        TEST_UNIT[Unit Tests]
        TEST_INT[Integration Tests]
        SAST[SAST Scan]
        SCA[Dependency Scan]
        BUILD[Build Artifact]
    end

    subgraph CD_DEV["CD — Dev Auto-Deploy"]
        DEPLOY_DEV[Deploy to Dev]
        SMOKE_DEV[Smoke Tests]
    end

    subgraph CD_STAGING["CD — Staging (auto on main)"]
        DEPLOY_STG[Deploy to Staging]
        TEST_E2E[E2E Tests]
        PERF[Performance Tests]
    end

    subgraph CD_PROD["CD — Production (manual gate)"]
        APPROVAL[Manual Approval]
        DEPLOY_PROD[Deploy to Production]
        SMOKE_PROD[Smoke Tests]
        MONITOR[Verify Monitoring]
    end

    PR --> LINT
    LINT --> TEST_UNIT
    TEST_UNIT --> TEST_INT
    TEST_INT --> SAST
    SAST --> SCA
    SCA --> BUILD
    MERGE --> DEPLOY_STG
    BUILD --> DEPLOY_DEV
    DEPLOY_DEV --> SMOKE_DEV
    SMOKE_DEV --> DEPLOY_STG
    DEPLOY_STG --> TEST_E2E
    TEST_E2E --> PERF
    PERF --> APPROVAL
    APPROVAL --> DEPLOY_PROD
    DEPLOY_PROD --> SMOKE_PROD
    SMOKE_PROD --> MONITOR

3. Source Control Configuration

3.1 Branching Strategy

Strategy: {{BRANCH_STRATEGY}}

Branch	Purpose	Naming Convention	Lifetime
`main`	Production-ready code	fixed	Permanent
`develop`	Integration branch	fixed	Permanent
`feature/*`	New features	`feature/{{TICKET}}-description`	Until merged
`fix/*`	Bug fixes	`fix/{{TICKET}}-description`	Until merged
`hotfix/*`	Production hotfixes	`hotfix/{{TICKET}}-description`	Until merged
`release/*`	Release preparation	`release/v{{VERSION}}`	Until merged

3.2 Branch Protection Rules

Protected Branches: main, develop

Rule	`main`	`develop`
Require PR	Yes	Yes
Required approvals	{{APPROVALS}}	1
Dismiss stale reviews	Yes	Yes
Require status checks	Yes	Yes
Required checks	lint, unit-tests, integration-tests, sast	lint, unit-tests
Require up-to-date	Yes	No
Allow force push	No	No
Allow deletions	No	No

3.3 Code Review Requirements

Minimum {{APPROVALS}} approval(s) required before merge
At least one approval from a code owner (see CODEOWNERS)
All review comments must be resolved before merge
Review turnaround SLA: {{REVIEW_SLA}} business hours
Auto-assign reviewers via: {{ASSIGN_MECHANISM}}

4. Build Stage

4.1 Build Tool & Configuration

Parameter	Value
Build Tool	{{BUILD_TOOL}}
Build Command	`{{BUILD_CMD}}`
Artifact Type	{{ARTIFACT}}
Artifact Naming	`{{REGISTRY}}/{{IMAGE_NAME}}:{{TAG_STRATEGY}}`
Tag Strategy	`git-sha` for PRs, `semver` for releases

4.2 Dependency Caching

Cache	Key	Restore Keys
Node modules	`node-modules-{{OS}}-{{LOCKFILE_HASH}}`	`node-modules-{{OS}}-`
Docker layers	`buildx-{{DOCKERFILE_HASH}}`	`buildx-`
Test results	`test-results-{{COMMIT_SHA}}`	N/A
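
A cache key of the form shown in the table can be derived by hashing the lockfile, so the cache is invalidated only when dependencies change. The sketch assumes SHA-256 for {{LOCKFILE_HASH}}; the helper name is hypothetical and CI platforms usually provide their own hashing function.

```python
import hashlib
from pathlib import Path


def cache_keys(prefix: str, os_name: str, lockfile: Path) -> tuple[str, str]:
    """Return (exact cache key, restore-key prefix) for a dependency cache."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    restore_key = f"{prefix}-{os_name}-"
    # The exact key changes whenever the lockfile content changes; the restore
    # key lets the job fall back to the most recent cache for the same OS.
    return restore_key + digest, restore_key
```

Calling `cache_keys("node-modules", "linux", Path("package-lock.json"))` yields a key that starts with `node-modules-linux-`.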

4.3 Artifact Generation

Artifact	Storage	Retention	Signed
Docker image	{{REGISTRY}}	90 days (non-prod), Forever (prod tags)	{{SIGNING}}
Test reports	CI artifact storage	30 days	No
SBOM	{{SBOM_STORAGE}}	1 year	Yes
Coverage report	{{COVERAGE_STORAGE}}	30 days	No

5. Test Stages

5.1 Unit Tests

Parameter	Value
Framework	{{UNIT_FRAMEWORK}}
Command	`{{UNIT_CMD}}`
Coverage Tool	{{COVERAGE_TOOL}}
Coverage Gate	≥ {{COVERAGE_GATE}}% lines, ≥ {{BRANCH_GATE}}% branches
Failure Action	Block PR merge

5.2 Integration Tests

Parameter	Value
Framework	{{INT_FRAMEWORK}}
Command	`{{INT_CMD}}`
Dependencies	{{INT_DEPS}}
Failure Action	Block PR merge

5.3 E2E Tests

Parameter	Value
Framework	{{E2E_FRAMEWORK}}
Command	`{{E2E_CMD}}`
Environment	Staging
Parallelization	{{E2E_SHARDS}} shards
Failure Action	Block staging promotion

5.4 Security Scanning

Scan Type	Tool	Command	Gate
SAST	{{SAST_TOOL}}	`{{SAST_CMD}}`	Block on HIGH/CRITICAL
SCA (dependencies)	{{SCA_TOOL}}	`{{SCA_CMD}}`	Block on CRITICAL
Container scan	{{CONTAINER_SCAN}}	`{{CONTAINER_SCAN_CMD}}`	Block on CRITICAL
Secret scanning	{{SECRET_SCAN}}	`{{SECRET_SCAN_CMD}}`	Block on any finding

5.5 Linting & Formatting

Tool	Purpose	Command	Auto-fix
{{LINTER}}	Code linting	`{{LINT_CMD}}`	PR comment
{{FORMATTER}}	Code formatting	`{{FMT_CMD}}`	Auto-commit or fail
{{TYPE_CHECK}}	Type checking	`{{TYPE_CMD}}`	No

6. Deploy Stages

6.1 Deployment Strategy

Strategy: {{DEPLOY_STRATEGY}}

Rolling Deployment:

Batch size: {{BATCH_SIZE}}% of instances
Pause between batches: {{PAUSE}}min
Health check wait: {{HEALTH_WAIT}}s
Rollback trigger: health check failure

Canary Deployment (if used):

Initial canary weight: {{CANARY_INITIAL}}%
Increment: {{CANARY_INCREMENT}}% every {{CANARY_INTERVAL}}min
Promotion criteria: error rate < {{ERROR_THRESHOLD}}%, p99 < {{LATENCY_THRESHOLD}}ms
Rollback trigger: automatic on threshold breach
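
The canary parameters above imply a traffic schedule. The sketch below expands an initial weight, an increment and an interval into (minute, weight) steps, assuming the last step is capped at 100%; `canary_schedule` is a hypothetical helper, not a feature of any deployment tool.

```python
def canary_schedule(initial_pct: int, increment_pct: int, interval_min: int) -> list[tuple[int, int]]:
    """Return (minutes since start, canary traffic %) for each promotion step."""
    if not 0 < initial_pct <= 100 or increment_pct <= 0 or interval_min <= 0:
        raise ValueError("weights and interval must be positive; initial weight at most 100")
    schedule = []
    minute, weight = 0, initial_pct
    while weight < 100:
        schedule.append((minute, weight))
        minute += interval_min
        weight += increment_pct
    schedule.append((minute, 100))  # final step: all traffic on the new version
    return schedule
```

With a 5% initial weight, 25% increments and 10-minute intervals, full promotion is reached after 40 minutes, provided the promotion criteria hold at every step.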

6.2 Environment Promotion

PR Branch → Dev (auto) → Staging (auto on main merge) → Production (manual approval)

Promotion	Trigger	Gate	Approver
→ Dev	Merge to `develop` / PR	All CI checks pass	Automatic
→ Staging	Merge to `main`	All CI + Dev smoke tests	Automatic
→ Production	Tag `v*.*.*`	All tests + manual approval	{{PROD_APPROVER}}
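
The promotion table can be read as a mapping from the Git ref that triggered the pipeline to a target environment. The sketch below assumes full Git ref names (`refs/heads/...`, `refs/tags/...`) and that production tags follow a `v<major>.<minor>.<patch>` pattern; both the function name and the pattern are illustrative assumptions.

```python
import re
from typing import Optional

# Assumed release tag format: v1.2.3
SEMVER_TAG = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")


def target_environment(git_ref: str, is_pull_request: bool = False) -> Optional[str]:
    """Return the environment a pipeline run may deploy to, or None."""
    if is_pull_request or git_ref == "refs/heads/develop":
        return "dev"
    if git_ref == "refs/heads/main":
        return "staging"
    if SEMVER_TAG.match(git_ref):
        return "production"  # still gated by manual approval
    return None  # any other ref does not deploy
```

A run on `refs/tags/v1.4.0` maps to production, while a tag such as `refs/tags/nightly` maps to nothing.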

6.3 Approval Gates

Production Approval Required: Yes
Approvers: {{PROD_APPROVERS}} (at least {{APPROVAL_COUNT}} required)
Approval Window: {{APPROVAL_WINDOW}}h (pipeline cancels after timeout)
Emergency Override: {{EMERGENCY_OVERRIDE}}

6.4 Feature Flags Integration

Feature Flag Tool: {{FF_TOOL}}
Flag Validation: Feature flags validated in staging before production deploy
Kill Switch: All new features behind flags for first {{FF_PERIOD}} days

7. Post-Deploy

7.1 Smoke Tests

Check	Expected	Timeout
Health endpoint `GET /health`	HTTP 200	10s
Auth endpoint reachable	HTTP 401	10s
Database connection	Healthy	15s
Cache connection	Healthy	10s
Critical user journey	Success	60s

Smoke test timeout: {{SMOKE_TIMEOUT}}min total
On failure: Auto-rollback triggered
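
A sketch of how the HTTP checks in the table might be evaluated. The HTTP client is injected as `fetch_status` so the logic stays independent of any library; the `/auth/me` path is a hypothetical example of an auth-protected endpoint, and only the two HTTP rows of the table are modeled.

```python
from typing import Callable

# (check name, path, expected HTTP status, timeout in seconds)
CHECKS = [
    ("health", "/health", 200, 10),
    ("auth", "/auth/me", 401, 10),  # hypothetical path; 401 proves auth is enforced
]


def failed_checks(fetch_status: Callable[[str, float], int]) -> list[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for name, path, expected, timeout_s in CHECKS:
        try:
            status = fetch_status(path, timeout_s)
        except Exception:
            status = None  # timeouts and connection errors count as failures
        if status != expected:
            failures.append(name)
    return failures
```

An empty list means the smoke tests passed; any entry would trigger the auto-rollback.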

7.2 Monitoring Verification

Metric	Threshold	Check Duration
Error rate	< {{ERROR_RATE}}%	5 min
P99 latency	< {{P99}}ms	5 min
CPU utilization	< {{CPU}}%	5 min
Memory utilization	< {{MEM}}%	5 min

7.3 Rollback Triggers

Automatic rollback triggers:

Smoke test failure
Error rate > {{AUTO_ROLLBACK_ERROR}}% for {{AUTO_ROLLBACK_DURATION}}min post-deploy
Health check failure on {{HEALTH_FAIL_THRESHOLD}}% of instances

Manual rollback: See rollback-plan.md
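
The automatic triggers can be combined into a single decision. The sketch assumes one error-rate sample per minute covering the {{AUTO_ROLLBACK_DURATION}} window, and treats "sustained" as every sample exceeding the threshold; the function name and reason strings are hypothetical.

```python
def rollback_reasons(
    smoke_failures: int,
    error_rate_pct_per_min: list[float],
    max_error_rate_pct: float,
    unhealthy_instances: int,
    total_instances: int,
    max_unhealthy_pct: float,
) -> list[str]:
    """Return the triggers that fired; an empty list means no rollback."""
    reasons = []
    if smoke_failures > 0:
        reasons.append("smoke test failure")
    # Sustained breach: every sample in the window is above the threshold.
    if error_rate_pct_per_min and all(r > max_error_rate_pct for r in error_rate_pct_per_min):
        reasons.append("sustained error rate")
    if total_instances > 0 and 100 * unhealthy_instances / total_instances >= max_unhealthy_pct:
        reasons.append("health check failures")
    return reasons
```

A single error-rate spike inside the window does not trigger a rollback under this definition, which matches the "for N minutes" wording.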

8. Pipeline Configuration Reference

Config File Location: {{CONFIG_PATH}}

Key environment variables injected by CI:

Variable	Source	Purpose
`REGISTRY_TOKEN`	{{SECRET_STORE}}	Container registry auth
`DEPLOY_KEY`	{{SECRET_STORE}}	Deployment credentials
`SENTRY_DSN`	{{SECRET_STORE}}	Error tracking
`SLACK_WEBHOOK`	{{SECRET_STORE}}	Notifications

9. Secret Injection Strategy

Strategy: {{SECRET_STRATEGY}}

Secret Type	Storage	Injection Method	Rotation
Registry credentials	{{STORAGE}}	{{METHOD}}	{{ROTATION}}
Cloud credentials	{{STORAGE}}	OIDC / Workload Identity	Per-job
App secrets	{{STORAGE}}	{{METHOD}}	{{ROTATION}}

OIDC Preferred: Cloud credentials injected via OIDC — no long-lived keys stored in CI

10. Pipeline Metrics

Metric	Target	Current
Build duration (P50)	< {{BUILD_TARGET}}min	TBD
Test duration (P50)	< {{TEST_TARGET}}min	TBD
Total pipeline duration	< {{TOTAL_TARGET}}min	TBD
Deploy frequency	{{DEPLOY_FREQ}}	TBD
Lead time for changes	< {{LEAD_TIME}}	TBD
Change failure rate	< {{FAILURE_RATE}}%	TBD
MTTR	< {{MTTR}}	TBD
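
Change failure rate and MTTR can be derived from a deployment log. The sketch below assumes a hypothetical `Deploy` record and, as a simplification, measures time to restore from the deployment time instead of from incident detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once service is restored


def change_failure_rate_pct(deploys: list[Deploy]) -> float:
    """Percentage of deployments that caused a failure."""
    if not deploys:
        return 0.0
    return 100 * sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(deploys: list[Deploy]) -> Optional[timedelta]:
    """Average restore time over failed deployments that have been restored."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Both values can be written into the "Current" column once enough deployments have been recorded.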

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver

ALAI Static Hosting Blueprint (2026-04-20)

Author: ALAI | Date: 2026-04-20 | MC: #8481 | Last updated: 2026-04-20 (Phantom Domain Removal Protocol added per MC #8526; rollback fix per MC #8494)

1. Platform Decision

Winner: Cloudflare Pages

ALAI already runs alai.no on Cloudflare Pages and has Cloudflare as DNS provider for 6 of 12 domains. The migration path is lowest-friction of any option: git push triggers build, custom domains are free, SSL is automatic, and Cloudflare Access (already deployed for internal tools) works natively. The free tier covers unlimited sites, 500 builds/month, and unlimited bandwidth — all 12 static sites fit without spending a euro. Critically, ALAI does not need object-storage complexity (GCS/S3) or a separate CDN layer for static marketing/demo sites. Cloudflare Pages is the right tool at this scale.

The call on vendor lock-in: ALAI is already locked to Cloudflare for DNS. Extending that to hosting adds concentration risk, but the blast radius is recoverable: all sites are git-backed, so migrating to any other platform is a 30-minute operation per site. The cost and operational savings outweigh the risk.

Platform Comparison (12 sites, 1 GB each, 100 GB egress/month)

Criterion	Cloudflare Pages	GCP Cloud Storage + CDN	AWS S3 + CloudFront	Azure Static Web Apps
Monthly cost (12 sites)	€0 (free tier)	~€12 (storage €1.20 + CDN egress ~€10)	~€14 (S3 €0.25 + CF egress ~€8 + requests ~€6)	€0 Free / €9 Standard (2 sites free, rest €4.50/mo each)
Build quota	500 builds/month free	N/A (no built-in CI)	N/A (no built-in CI)	60 min/month free, then €0.009/min
DX (git push to live)	Native (GitHub/GitLab direct)	Requires Cloud Build + gsutil	Requires CodePipeline or GitHub Action + aws CLI	Native (GitHub Actions integrated)
Custom domains	Unlimited	Per load balancer config	Per distribution ($0.0075/10k requests)	5 per plan
SSL	Automatic, free	Managed certificate, manual setup	ACM free but requires distribution config	Automatic, free
Preview URLs per PR	Yes (automatic)	No (requires custom setup)	No (requires custom Lambda@Edge)	Yes (staging environments)
DDoS/WAF	Included free (Cloudflare network)	Cloud Armor (add-on, ~€5+/mo)	AWS Shield Standard free, WAF extra	Azure DDoS Basic free, WAF add-on
Vendor lock-in	Medium (proprietary build env, but output is static)	Low (standard GCS)	Low (standard S3)	Medium (Azure-specific config)

Decision: Cloudflare Pages wins on cost (€0 vs €12-14/mo), DX (native git integration), DDoS/WAF included, and operational alignment with existing CF infrastructure.

2. Deploy Blueprint

Repo Convention

Every static site lives in its own repo or a dedicated directory in a monorepo. Naming convention: alai-<product>-web for ALAI properties, client-<slug>-web for client sites. The Cloudflare Pages project name matches the repo name exactly.

Build output must be in one of: dist/, out/ (Next.js static export), public/, or .next/ (Next.js default build directory). For plain HTML sites, the root directory is the publish directory.

Step 1: Create Cloudflare Pages Project (one-time per site)

# Via Cloudflare dashboard or wrangler CLI
npx wrangler pages project create <project-name> \
  --production-branch main

Connect GitHub repo in the Pages dashboard. Set build command and output directory per framework:

Framework	Build command	Output dir
Static HTML	(none)	/
Next.js (static export)	`next build`	`out`
Next.js (app router)	`next build`	`.next`
Astro	`astro build`	`dist`

Step 2: GitHub Actions CI (copy-paste ready)

Save as .github/workflows/deploy.yml in every site repo:

name: Deploy to Cloudflare Pages

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      deployments: write
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build
        env:
          NODE_ENV: production

      - name: Deploy to Cloudflare Pages
        id: deploy
        uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          # On pull_request events github.ref_name is "<number>/merge"; github.head_ref is the PR source branch.
          command: pages deploy ./out --project-name=${{ vars.CF_PROJECT_NAME }} --branch=${{ github.head_ref || github.ref_name }}

      - name: Comment preview URL on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # Assumes a wrangler-action v3 release that exposes the deployment-url output.
          PREVIEW_URL: ${{ steps.deploy.outputs.deployment-url }}
        with:
          script: |
            // Post the URL reported by the deploy step instead of deriving it from the commit SHA.
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.payload.pull_request.number,
              body: `Preview deployed: ${process.env.PREVIEW_URL}`
            });

For plain HTML sites with no build step, remove the Install dependencies and Build steps, and change the deploy path to ./ instead of ./out.

Step 3: Custom Domain (one-time per site)

# In Cloudflare dashboard: Pages > Project > Custom Domains > Add custom domain
# Or via API:
curl -X POST "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/pages/projects/$PROJECT_NAME/domains" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"name":"example.alai.no"}'

Because ALAI uses Cloudflare DNS, the CNAME/alias record is created automatically when adding the custom domain inside Cloudflare Pages.

Preview URL Per PR

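Cloudflare Pages creates a preview URL automatically for every preview deployment, so each PR push gets its own. Format: https://<deployment-hash>.<project-name>.pages.dev, where the hash is the short deployment ID and not the Git commit SHA. No configuration needed. Preview environments are isolated and do not affect production traffic.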

Phantom Domain Removal Protocol

ZAKON: Before vercel domains rm <phantom> — verify real domain is not implicitly routing through phantom.

Safe sequence for phantom removal:

vercel domains inspect <real-domain> — confirm direct attachment to authoritative project
If real domain does NOT show direct attachment → vercel domains add <real> --project <authoritative> FIRST
curl -sI https://<real> — confirm HTTP 200 with new attachment
ONLY THEN: vercel domains rm <phantom> --yes
Re-verify: curl -sI https://<real> HTTP 200

Forbidden: Remove phantom without prior explicit attachment of real domain → risk implicit routing break.

Incident reference: 2026-04-20 kenyhot.pro cleanup, 35s downtime, MC #8526.

Evidence: /Users/makinja/system/evidence/kenyhot-vercel-cleanup/execution-log-*.txt
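
The ordering rule in the protocol (attach the real domain first, remove the phantom last) can be captured in a small planner. This sketch only builds the command strings listed above and does not execute them; the strings are copied from the sequence as written and have not been verified against the Vercel CLI, and the function name is hypothetical.

```python
def phantom_removal_plan(real: str, phantom: str, project: str, real_attached_directly: bool) -> list[str]:
    """Return the commands to run, in order, for a safe phantom removal."""
    steps = [f"vercel domains inspect {real}"]
    if not real_attached_directly:
        # Attach the real domain explicitly BEFORE touching the phantom.
        steps.append(f"vercel domains add {real} --project {project}")
    steps.append(f"curl -sI https://{real}")             # expect HTTP 200
    steps.append(f"vercel domains rm {phantom} --yes")   # only after verification
    steps.append(f"curl -sI https://{real}")             # re-verify HTTP 200
    return steps
```

The removal command never appears before the inspect, attach and verification steps, which is the property the incident on 2026-04-20 showed to matter.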

Rollback (< 60 seconds)

NOTE — wrangler 4.x breaking change: wrangler pages deployment rollback was removed in wrangler 4.x. The subcommand no longer exists and the /rollback CF API endpoint returns 405 for direct-upload deployments. Do NOT use it. Use the alternatives below. (Reference: wrangler upstream release notes; verified in Proveo pilot on basicconsulting.no, MC #8494.)

Primary — CF API re-deploy (copy-paste ready):

# Required env vars — set once per shell session or in ~/.zshrc
export CF_API_TOKEN="<your-cloudflare-api-token>"   # scope: Cloudflare Pages: Edit
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<project-name>"

# 1. List recent deployments and grab the target deployment ID
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | \
  python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"

# 2. Re-deploy the target deployment (replace <deployment-id> with ID from step 1)
curl -s -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"

CF reuses content-hash cache — files already on the CDN are not re-uploaded. Measured time: ~11 seconds. No build step required.
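
For readability, the inline parsing in step 1 can be expanded into a function. The sketch below assumes the same response fields the one-liner reads (`result`, `id`, `created_on`, `deployment_trigger.metadata.commit_message`) and additionally tolerates a missing or null commit message; the function name is hypothetical.

```python
import json


def summarize_deployments(payload: str, limit: int = 10) -> list[tuple[str, str, str]]:
    """Return (deployment id, created timestamp, commit message) per deployment."""
    rows = []
    for deployment in json.loads(payload)["result"][:limit]:
        metadata = (deployment.get("deployment_trigger") or {}).get("metadata") or {}
        message = metadata.get("commit_message") or ""  # may be absent or null
        rows.append((deployment["id"], deployment["created_on"][:19], message[:60]))
    return rows
```

Piping the curl output from step 1 into a script that calls this function prints the same three columns as the one-liner.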

Secondary — CF Dashboard rollback (GitHub-connected repos):

Open https://dash.cloudflare.com > Pages > select project
Click "Deployments" tab
Find the target deployment row, click the three-dot menu
Select "Rollback to this deployment"
Confirm — live traffic switches in < 30 seconds

Total time to identify + execute: under 30 seconds for either path.

Secrets Management

Secret	Storage	How to use
`CLOUDFLARE_API_TOKEN`	GitHub repository secret	Set in: Repo > Settings > Secrets > Actions
`CLOUDFLARE_ACCOUNT_ID`	GitHub repository secret	Set in: Repo > Settings > Secrets > Actions (the workflow reads it as `secrets.CLOUDFLARE_ACCOUNT_ID`)
`CF_PROJECT_NAME`	GitHub repository variable	Set per repo, matches CF Pages project name
Build-time env vars (API keys, etc.)	Cloudflare Pages > Settings > Environment variables	Available during build and at runtime for SSR

Token scope required: Cloudflare Pages: Edit only. Create at: https://dash.cloudflare.com/profile/api-tokens

New-Site Template (one command)

Save as /Users/makinja/system/tools/alai-new-site.sh:

#!/usr/bin/env bash
# Usage: bash alai-new-site.sh <site-name> [--framework next|html|astro]
set -euo pipefail

SITE_NAME="${1:?Usage: alai-new-site.sh <site-name> [--framework next|html|astro]}"
FRAMEWORK="${3:-html}"
REPO_DIR="/Users/makinja/ALAI/sites/${SITE_NAME}"

echo "Creating site: ${SITE_NAME} (${FRAMEWORK})"

# 1. Create repo directory
mkdir -p "${REPO_DIR}/.github/workflows"

# 2. Copy workflow template
cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml "${REPO_DIR}/.github/workflows/deploy.yml"

# 3. Create wrangler.toml
cat > "${REPO_DIR}/wrangler.toml" <<EOF
name = "${SITE_NAME}"
compatibility_date = "2026-01-01"

[env.production]
EOF

# 4. Init git on main so the branch matches --production-branch below (needs git >= 2.28)
cd "${REPO_DIR}" && git init -b main && git add . && git commit -m "init: ${SITE_NAME}"

# 5. Create Cloudflare Pages project
npx wrangler pages project create "${SITE_NAME}" --production-branch main

echo "Done. Next: connect GitHub repo in Cloudflare dashboard."
echo "  https://dash.cloudflare.com/pages"

3. Maintenance

SSL Auto-Renewal

Cloudflare Pages provisions and auto-renews SSL certificates via Cloudflare's certificate authority. No manual action required. Certificates renew 30 days before expiry. The only failure mode is if a custom domain's DNS stops pointing to Cloudflare — the alert system in Section 4 catches this.

DNS Consolidation

Target: All domains to Cloudflare DNS.

Current state: 2 on Cloudflare, 1 on Vercel, 1 on AWS Route53, 3 on one.com nameservers, 3 unknown/third-party.

Migration steps per domain:

Log in to registrar, change nameservers to ana.ns.cloudflare.com and bob.ns.cloudflare.com
Cloudflare imports existing DNS records automatically (zone scan)
Verify records in Cloudflare dashboard, then activate proxy (orange cloud) for web traffic

Registrar note: For .no domains registered at one.com (.no TIDs), a nameserver change takes 15 minutes to 4 hours. For .ba domains, the registrar controls the nameservers, so the change requires contacting them directly.

Dependency Updates (Renovate)

Save as renovate.json in every repo root:

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "schedule": ["every sunday"],
  "prCreation": "immediate",
  "packageRules": [
    {
      "matchUpdateTypes": ["minor", "patch"],
      "automerge": true,
      "automergeType": "pr",
      "automergeStrategy": "squash"
    },
    {
      "matchUpdateTypes": ["major"],
      "automerge": false,
      "labels": ["dependencies", "major-update"]
    }
  ],
  "vulnerabilityAlerts": {
    "enabled": true,
    "labels": ["security"]
  }
}

Enable Renovate at https://github.com/apps/renovate for each repo. No server needed.

Backup Strategy

Asset	What	Where	Retention
Source code	Full git history	GitHub (primary)	Permanent
Source code mirror	Bare git clone	Azure VM `/opt/backups/git-mirrors/`	90 days rolling
Cloudflare Pages deployments	Build artifacts	Cloudflare (automatic, last 25 builds)	Automatic
DNS zone	Export via CF API	`/Users/makinja/system/backups/dns/` (weekly cron)	12 months
Secrets inventory	Encrypted note	Vaultwarden (vault.basicconsulting.no)	Permanent

DNS zone backup cron (add to crontab):

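# Weekly DNS zone backup: runs every Sunday 02:00.
# A crontab entry must be a single line, and "%" must be escaped as "\%".
# CF_API_TOKEN must be defined in the crontab or exported by a wrapper script; cron does not read ~/.zshrc.
0 2 * * 0 curl -s "https://api.cloudflare.com/client/v4/zones?per_page=50" -H "Authorization: Bearer $CF_API_TOKEN" | node /Users/makinja/system/tools/cf-zone-export.js > /Users/makinja/system/backups/dns/zones-$(date +\%Y\%m\%d).json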

DR: Restore Site in < 60 Seconds

NOTE — wrangler 4.x breaking change: wrangler pages deployment rollback is removed in wrangler 4.x and must NOT be used. See MC #8494. Option A below replaces it with the CF API re-deploy path.

# Option A: CF API re-deploy (STANDARD DR PATH — replaces deprecated wrangler rollback)
# Time: ~11 seconds. CF content-hash cache means zero bytes re-uploaded for unchanged files.
export CF_API_TOKEN="<your-cloudflare-api-token>"
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<site-name>"

# List last 10 deployments
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | \
  python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"

# Re-deploy target deployment ID
curl -s -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"

# Option B: Redeploy from git (if CF deployment history cleared)
cd /path/to/site-repo && npm run build && \
npx wrangler pages deploy ./out --project-name=<site-name> --branch=main
# Time: 30-90 seconds depending on build

# Option C: Emergency static serve from Azure VM (last resort)
scp -r ./out alai-admin@4.223.110.181:/var/www/<site-name>
ssh -i ~/.ssh/azure_alai alai-admin@4.223.110.181 \
  "sudo caddy reverse-proxy --from <domain> --to localhost:8080"
# Time: ~120 seconds

Option A is the standard DR path. Target: < 60 seconds. Tested monthly as part of Proveo validation.

4. Alarms and Escalation

SENTINEL daemons live in /Users/makinja/system/tools/. Alerting routes to Slack #infra-alerts channel.

Alert Table

Metric	Threshold	Channel	L1 Action	L2 Action	L3 Action
Uptime (HTTP 200)	< 100% for 5 min	#infra-alerts (Slack)	Auto-retry; post alert	Kelsey investigates: CF status page, DNS check	Escalate to CEO; activate DR (Option C)
Build failure	Any failed build on main	#infra-alerts	Alert with build URL + error log	Kelsey reviews workflow, checks CF Pages build log	Revert last commit: `git revert HEAD && git push`
SSL cert expiry	< 30 days to expiry	#infra-alerts	Alert; verify CF auto-renewal is active	Manual CF cert renewal trigger	Contact Cloudflare support
5xx rate	> 1% of requests over 10 min	#infra-alerts	Alert with request sample	Kelsey checks CF Pages function logs	Rollback via CF API re-deploy (Option A, DR section)
Traffic anomaly	> 10x baseline in 5 min	#infra-alerts	Alert; verify CF rate limiting active	Check CF analytics for origin; enable under-attack mode	Contact Cloudflare support
Bandwidth overage	> 80% of plan limit	#infra-alerts	Alert; review top assets	Optimize images, add cache headers	Upgrade CF plan or move heavy assets to R2
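
The numeric thresholds in the table can be expressed as one pure function, which makes the boundaries easy to unit-test. This is a sketch, not the SENTINEL implementation: the `Snapshot` fields and the way the metrics would be collected are assumptions, and only the thresholds themselves come from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    error_5xx_ratio: float       # fraction of requests over the last 10 min
    traffic_vs_baseline: float   # multiple of baseline over the last 5 min
    bandwidth_used_ratio: float  # fraction of the plan limit used
    cert_days_left: int          # days until the SSL certificate expires


def breached_alerts(s: Snapshot) -> List[str]:
    """Return the alert-table metrics whose thresholds are exceeded."""
    alerts = []
    if s.cert_days_left < 30:
        alerts.append("SSL cert expiry")
    if s.error_5xx_ratio > 0.01:
        alerts.append("5xx rate")
    if s.traffic_vs_baseline > 10:
        alerts.append("Traffic anomaly")
    if s.bandwidth_used_ratio > 0.80:
        alerts.append("Bandwidth overage")
    return alerts
```

All comparisons are strict, matching the `>` and `<` signs in the table, so a value sitting exactly on a threshold does not alert.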

SENTINEL Integration

Add to /Users/makinja/system/tools/sentinel-uptime.sh:

#!/usr/bin/env bash
# Uptime check for all ALAI sites — run every 5 minutes via cron
SITES=(
  "https://alai.no"
  "https://snowit.ba"
  "https://getdrop.no"
  "https://app.getdrop.no"
  "https://basicconsulting.no"
  "https://basicfakta.no"
  "https://bilko-demo.alai.no"
  "https://kenyhot.pro"
  "https://merdzanovic.ba"
  "https://docs.alai.no"
  "https://sign.basicconsulting.no"
  "https://boards.basicconsulting.no"
  "https://vault.basicconsulting.no"
)

for SITE in "${SITES[@]}"; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$SITE")
  if [ "$STATUS" != "200" ] && [ "$STATUS" != "301" ] && [ "$STATUS" != "302" ]; then
    node /Users/makinja/system/tools/slack.js send "#infra-alerts" \
      "ALERT: $SITE returned HTTP $STATUS at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  fi
done

Crontab entry: */5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh
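
The health rule inside the loop is small enough to isolate and test. The sketch below mirrors it in Python under two stated assumptions: the helper name `alert_message` is hypothetical, and the special wording for status `000` is an addition (curl prints `000` for `%{http_code}` when no HTTP response was received, for example on a timeout).

```python
from datetime import datetime, timezone
from typing import Optional

HEALTHY_STATUSES = {"200", "301", "302"}  # same set the shell script accepts


def alert_message(site: str, status: str, now: Optional[datetime] = None) -> Optional[str]:
    """Return the Slack alert text for an unhealthy check, or None if healthy."""
    if status in HEALTHY_STATUSES:
        return None
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    if status == "000":
        # No HTTP response at all: DNS failure, refused connection, or timeout.
        return f"ALERT: {site} gave no HTTP response (curl status 000) at {stamp}"
    return f"ALERT: {site} returned HTTP {status} at {stamp}"
```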

5. Cost

Per-Site Monthly Cost (Target State: Cloudflare Pages)

Site	Current Platform	Current Cost	CF Pages Cost	Notes
alai.no	Cloudflare Pages	€0	€0	Already there
snowit.ba	GitHub Pages	€0	€0	Migrate from GitHub Pages
getdrop.no	Azure VM (Caddy)	Shared with VM	€0	Static landing only
app.getdrop.no	Azure VM (Caddy)	Shared with VM	Not applicable	Next.js app, stays on VM
basicconsulting.no	Vercel	€0 (Free)	€0	Migrate from Vercel
basicfakta.no	Vercel	€0 (Free)	€0	Migrate from Vercel
bilko-demo.alai.no	GCP Cloud Run	€5-10	€0	Static export possible; see note
kenyhot.pro	Vercel	€0 (Free)	€0	Client site, coordinate
merdzanovic.ba	Vercel	€0 (Free)	€0	Client site, coordinate
docs.alai.no	Azure VM	Shared with VM	Not applicable	BookStack = dynamic, stays on VM
sign.basicconsulting.no	Azure VM	Shared with VM	Not applicable	Documenso = dynamic, stays on VM
boards.basicconsulting.no	Azure VM	Shared with VM	Not applicable	Planka = dynamic, stays on VM
vault.basicconsulting.no	Azure VM	Shared with VM	Not applicable	Vaultwarden = dynamic, stays on VM
bilko-api, bilko-intesa-demo	GCP Cloud Run	€5-10	Not applicable	Dynamic services, stay on GCP

Note on bilko-demo.alai.no: If Bilko web can be exported as static (Next.js output: 'export'), it moves to CF Pages for €0. If it requires server-side rendering (API routes, auth), it stays on GCP Cloud Run. This is a code-level decision for CodeCraft. Placeholder cost assumes migration succeeds.

Annual Total (Target State)

Provider	Services After Migration	Monthly	Annual
Cloudflare Pages	9 static sites	€0	€0
GCP Cloud Run	Bilko API + demo services (if SSR)	€5-10	€60-120
Azure VM	BookStack, Documenso, Planka, Vaultwarden, Drop app	€50	€600
GitHub Pages	snowit.ba (until CF migration)	€0	€0
one.com domains	alai.no, basicconsulting.no, getdrop.no, bilko.io	€17	€200
TOTAL		€72-77/month	€860-920/year
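
The TOTAL row can be re-derived from the line items. The sketch below simply sums the low and high ends of each range as printed in the table; the dictionary layout is illustrative.

```python
# (monthly_low, monthly_high), (annual_low, annual_high) in EUR, copied from the table.
LINE_ITEMS = {
    "Cloudflare Pages": ((0, 0), (0, 0)),
    "GCP Cloud Run": ((5, 10), (60, 120)),
    "Azure VM": ((50, 50), (600, 600)),
    "GitHub Pages": ((0, 0), (0, 0)),
    "one.com domains": ((17, 17), (200, 200)),
}


def total(column: int) -> tuple:
    """Sum one column: 0 for monthly, 1 for annual. Returns (low, high)."""
    low = sum(item[column][0] for item in LINE_ITEMS.values())
    high = sum(item[column][1] for item in LINE_ITEMS.values())
    return (low, high)


print(total(0))  # (72, 77)
print(total(1))  # (860, 920)
```

The annual column uses the table's rounded €200 for domains, which is why it is not exactly twelve times the monthly figure.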

Current vs Target Delta

Current: €72-127/month
Target: €72-77/month (static sites are free; dynamic services stay)
Delta: €0 to €50/month saved (savings only materialize if a Vercel Pro tier is confirmed and removed)
Key finding: Most current cost is the Azure VM (€50) and one.com domains (€17). These are not reducible by a hosting platform switch — they serve dynamic apps and DNS. The hosting consolidation eliminates Vercel as a dependency and reduces operational complexity.

Scale: 30 Sites by 2027

At 30 sites, Cloudflare Pages remains €0 (no per-site pricing). The only cost growth vectors are:

Azure VM upgrade if Drop/BookStack need more resources: +€20-40/month for next tier
Additional one.com domain registrations: ~€20/year each
GCP Cloud Run if Bilko scales: usage-based, estimate €10-30/month at moderate traffic

Projected 2027 total: €100-130/month at 30 sites. Cloudflare Pages does not contribute to this increase.

6. Migration Plan

Priority 1 = immediate (no dependencies, low risk). Priority 2 = planned (some coordination needed). Priority 3 = blocked or dependent on an external party.

Domain	Current Platform	Target Platform	Priority	Downtime Window	Dependency	MC Task
alai.no	Cloudflare Pages	Cloudflare Pages	-	None	None — already done	Done
basicconsulting.no	Vercel	Cloudflare Pages	1	0 (DNS already on CF)	Find repo	#8482
basicfakta.no	Vercel	Cloudflare Pages	1	< 5 min (NS change)	Find repo, change registrar NS	#8483
snowit.ba	GitHub Pages	Cloudflare Pages	2	< 5 min	Move DNS from AWS Route53 to CF	#8484
getdrop.no	Azure VM (Caddy)	Cloudflare Pages (static)	1	0 (DNS on Vercel, move to CF)	Static export of Next.js landing	#8485
app.getdrop.no	Azure VM (Caddy)	Azure VM (stay)	-	None	Dynamic Next.js app	No action
bilko-demo.alai.no	GCP Cloud Run	Cloudflare Pages (if static export works)	2	0 (DNS already on CF)	CodeCraft confirms static export	#8486
kenyhot.pro	Vercel	Cloudflare Pages	3	< 5 min	Coordinate with client, DNS on Vercel	#8487
merdzanovic.ba	Vercel	Cloudflare Pages	3	< 5 min	Coordinate with client, third-party DNS	#8488
bilko.io	None (down)	Cloudflare Pages	2	N/A (currently down)	Fix one.com DNS, point to CF	#8489
docs/sign/boards/vault.basicconsulting.no	Azure VM	Azure VM (stay)	-	None	Dynamic apps	No action
bilko-api, bilko-intesa-demo	GCP Cloud Run	GCP Cloud Run (stay)	-	None	Dynamic API services	No action

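Total sites to migrate: 8 static sites (MC #8482 to #8489). Rows marked "stay" (dynamic apps and services) remain on their current platform. Migrated so far, per the Migration Log: basicconsulting.no and bilko.io. alai.no was already on Cloudflare Pages.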

Migration Log

Date	Domain	From	To	Downtime	TTFB Before	TTFB After	Notes
2026-04-20	basicconsulting.no	Vercel (76.76.21.21)	CF Pages	~60s	114ms	51ms (warm avg)	MC #8482. DNS: A->CNAME. Validation required domain re-add. TTFB improved 55%. Proveo pilot validated #8490.
2026-04-20	bilko.io	one.com (down)	CF Pages	N/A (site was down)	N/A	68ms (warm avg)	MC #8489. Apex CNAME not possible on one.com free tier (paid feature). Switched to Cloudflare NS (ana.ns.cloudflare.com, bob.ns.cloudflare.com). CF Pages zone ID: 62d89b79f0648d3fa1d045335a989ea7. DNS: CNAME flattening bilko.io → bilko-io.pages.dev (proxied), www → bilko-io.pages.dev.

Paused migrations:

MC #8483 (basicfakta.no) — Inventory error: site has serverless functions (Vercel Edge), not pure static. Requires CodeCraft assessment.
MC #8484 (snowit.ba) — Inventory error: site has API routes (Next.js), not pure static. Requires CodeCraft assessment.

Audit verdict for #8486 (bilko-demo.alai.no): Full-stack Next.js app with dynamic API routes. Stays on GCP Cloud Run. Not eligible for CF Pages migration.

7. Lessons Learned

2026-04-20 — CF Browser Integrity Check blocks headless clients

Incident: LightRAG 46h outage (MC #8487 followup)

Problem: Automation HTTP clients (Python urllib, Node fetch, etc.) get HTTP 403 (error code 1010) from CF-proxied hostnames with Browser Integrity Check (BIC) enabled, even when IP bypass or CF Access service tokens are configured.

Root cause: BIC is evaluated before Access policies and blocks requests based on the User-Agent string. Default Python and Node User-Agents trigger the block, while curl, wget, and browser tests pass, which creates a false sense of security.

Fix: Create Cloudflare Configuration Rule disabling BIC per hostname. See rule INFRA-CF-001 (~/system/rules/cf-proxied-api-bic-whitelist.md) and BookStack page ID 2692.

Evidence: ~/system/evidence/lightrag-ingestion-investigation-20260420-215700.md

Hostnames affected: ollama.basicconsulting.no (fixed), lightrag.basicconsulting.no (verify needed)
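
To check whether a hostname is still affected, compare a request sent with the client's default User-Agent against one sent with an explicit User-Agent. The sketch below is a diagnostic aid only, not the fix (the fix remains the Configuration Rule above). `build_probe` and `looks_like_bic_block` are hypothetical helpers, and matching on the string `1010` in the response body is an assumption about how Cloudflare renders that error.

```python
import urllib.request
from typing import Optional


def build_probe(url: str, user_agent: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request; leave User-Agent unset to test the library default."""
    request = urllib.request.Request(url, method="GET")
    if user_agent is not None:
        request.add_header("User-Agent", user_agent)
    return request


def looks_like_bic_block(status: int, body: str) -> bool:
    """Heuristic: HTTP 403 whose body mentions Cloudflare error code 1010."""
    return status == 403 and "1010" in body
```

Sending both probes with `urllib.request.urlopen` and catching `urllib.error.HTTPError` gives the status and body to classify. If only the default-User-Agent probe is classified as a block, BIC is the likely cause.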

8. DoD Checklist

File exists at /Users/makinja/system/specs/ALAI-STATIC-HOSTING-BLUEPRINT.md
BookStack sync task created — MC #8491 (Skillforge owner) — sync this file to docs.alai.no under "Infrastructure > Hosting"
Proveo validation task created — MC #8490 (Angie Jones owner) — deploy blueprint to 1 test site (basicconsulting.no), verify < 60s rollback works end-to-end
8 migration MC tasks created: #8482 #8483 #8484 #8485 #8486 #8487 #8488 #8489
SENTINEL uptime script deployed and crontab entry added
Renovate enabled on all repos
getdrop.no DNS moved from Vercel to Cloudflare
8 stale Vercel projects deleted (see inventory)

Cloud Migration 2026

ALAI cloud migration master plan: 6-phase transition from ANVIL-only to cloud-hosted control plane

Master Plan — Cloud Migration

Phase 1 — Bitwarden Cloud Migration

Timeline: Days 1-3
Goal: Eliminate Vaultwarden SPOF as the very first step. Every subsequent phase depends on secrets being available globally, not just when the Azure VM is alive.
MC Task: #8494
Proveo Owner: Angie Jones
Status: PREVIEW — Parisa writing detailed runbook in parallel

Why First

Phase 2 onwards deploys to Azure Container Apps. Those containers need secrets at startup (Anthropic API key, Postgres connection string, Azure SP). If Vaultwarden is down, all containers fail to start. Fix the foundation before building on it.

Deliverables

Export all current Vaultwarden items to encrypted JSON
Import to Bitwarden cloud Teams ($4/user/month — 1 seat = $4/month total)
Update alai-cli bootstrap step to use bw login against cloud.bitwarden.com
Update all agent bootstrap scripts to use cloud BW endpoint
Delete the BW CLI config pointing to vault.basicconsulting.no

Rollback Plan

Vaultwarden self-hosted remains running in parallel until Phase 6. If Bitwarden cloud import fails, fall back to self-hosted immediately. Keep vault export as encrypted offline backup in ~/system/backups/.

Proveo Validation Criteria

Test Owner: Angie Jones (Proveo)

Fresh bw login alembasic@gmail.com on a machine with NO vault.basicconsulting.no access returns all expected items (GitHub token, Azure SP, Anthropic key, SSH key)
alai login (once built in Phase 4) succeeds using cloud BW credentials
Vaultwarden VM can be stopped for 1 hour with no agent failures on ANVIL
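
The first criterion can be checked mechanically against the JSON that `bw list items` prints. The sketch below is a minimal version under stated assumptions: the expected labels are taken from the criterion text and matched as case-insensitive substrings, since the actual item names in the vault are not given here, and `missing_items` is a hypothetical helper.

```python
import json
from typing import List, Sequence

# Labels from the validation criterion; real vault item names may differ.
EXPECTED_ITEMS = ("GitHub token", "Azure SP", "Anthropic key", "SSH key")


def missing_items(bw_list_json: str, expected: Sequence[str] = EXPECTED_ITEMS) -> List[str]:
    """Return expected labels that match no item name in `bw list items` output."""
    names = [(item.get("name") or "").lower() for item in json.loads(bw_list_json)]
    return [label for label in expected
            if not any(label.lower() in name for name in names)]
```

An empty return value means every expected item was found; anything else lists what the fresh login could not see.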

Cost

Bitwarden cloud Teams: $4/user/month × 1 user = $4/month
vs Vaultwarden HA (2 VMs + Load Balancer): ~$88/month

Detailed Runbook

Parisa Tabriz (Securion) is writing the full step-by-step runbook in parallel. Once complete, it will be referenced here:
~/system/architecture/phase-1-bitwarden-runbook.md (pending)

Credit: ALAI, 2026

Phase 2 — MC + HiveMind API

Timeline: Weeks 1-2
Goal: Mission Control and HiveMind leave ANVIL and become cloud-hosted APIs. This is the biggest architectural change — SQLite becomes Postgres, local scripts become REST calls.
MC Task: #8495
Proveo Owner: Angie Jones
Status: PREVIEW — Kelsey working in parallel

Why Second

MC and HiveMind are the nervous system. Once they are cloud-hosted, every other phase can run from any machine without touching ANVIL.

Deliverables

mc-api.js: Express-based REST API wrapping current mc.js logic
- GET /tasks, POST /tasks, PATCH /tasks/:id, GET /stats
- Postgres driver (pg) replacing SQLite
- Schema migration: 8378 tasks, 127 open — pg-migrate from SQLite dump
hivemind-api.js: REST + optional WebSocket for pub/sub
- Postgres backend (hivemind schema)
Docker images for both, pushed to Azure Container Registry
Azure Container Apps: deploy mc-api and hivemind-api
- Consumption plan (serverless, scale-to-zero when no traffic)
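- Min replicas: 1 (one replica stays warm, so requests do not pay a cold start of 2-4s or more; this setting also prevents scale-to-zero, so set it to 0 if the scale-to-zero behaviour in the line above and in the validation criteria is required)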
- Memory: 0.5GB each, vCPU: 0.25 each
Azure Database for Postgres Flexible Server: Burstable B1ms
- Region: swedencentral
- mission_control DB + hivemind DB on same instance
- Automated backups (7-day retention, included in cost)
Update mc.js client wrapper: detect ALAI_MC_URL env var, proxy to API if set
- Backward compatible: if no ALAI_MC_URL, still uses local SQLite (ANVIL stays working)
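
The backward-compatible switch in the last deliverable comes down to one decision. The real wrapper is `mc.js`; the sketch below uses Python purely to illustrate the logic, and `resolve_backend` is a hypothetical name. The local database path is the one used elsewhere in this documentation.

```python
import os
from typing import Mapping, Optional, Tuple

LOCAL_DB = "/Users/makinja/system/databases/mission-control.db"


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Pick the MC backend: cloud API if ALAI_MC_URL is set, else local SQLite."""
    env = os.environ if env is None else env
    url = (env.get("ALAI_MC_URL") or "").strip()
    if url:
        return ("api", url.rstrip("/"))
    return ("sqlite", LOCAL_DB)
```

Treating an empty or whitespace-only `ALAI_MC_URL` as unset keeps the rollback path simple: clearing the variable is enough to return to local SQLite.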

Cost Estimate

Container Apps (2 apps, ~5h/day active, consumption plan):
  ~$1.50/month per app = $3/month total
  (Free grant: 180,000 vCPU-s/month covers most light usage)

Azure Postgres B1ms: ~$22-24/month (swedencentral, Flexible Server)
Azure Container Registry Basic: $5/month

Total Phase 2 additions: ~$30-32/month
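
The estimate can be sanity-checked with the figures already given. The sketch below assumes a 30-day month and uses only numbers from this section; it compares vCPU usage with the free grant and re-adds the three line items, without assuming any per-second price.

```python
APPS = 2
VCPU_PER_APP = 0.25
ACTIVE_HOURS_PER_DAY = 5
DAYS_PER_MONTH = 30            # assumption
FREE_GRANT_VCPU_SECONDS = 180_000

# vCPU-seconds consumed by both apps while active.
vcpu_seconds = APPS * VCPU_PER_APP * ACTIVE_HOURS_PER_DAY * 3600 * DAYS_PER_MONTH
billable_vcpu_seconds = max(0, vcpu_seconds - FREE_GRANT_VCPU_SECONDS)

# (low, high) monthly USD per line item, as stated above.
CONTAINER_APPS = (3, 3)
POSTGRES_B1MS = (22, 24)
CONTAINER_REGISTRY = (5, 5)
phase2_total = tuple(sum(pair) for pair in zip(CONTAINER_APPS, POSTGRES_B1MS, CONTAINER_REGISTRY))

print(vcpu_seconds, billable_vcpu_seconds, phase2_total)
```

Under these inputs the two apps use 270,000 vCPU-seconds, so about 90,000 fall outside the free grant, and the line items sum to $30-32. The usage figure only holds if the apps are really active for about 5 hours a day.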

Rollback Plan

mc.js still reads local SQLite if ALAI_MC_URL is not set. If Postgres or Container Apps fail, unset ALAI_MC_URL on ANVIL and operations continue locally. SQLite is kept in parallel for 30 days post-migration before decommission.

Proveo Validation Criteria

Test Owner: Angie Jones (Proveo)

From ab-mac (no local SQLite): alai mc list returns live tasks
From ANVIL: node ~/system/tools/mc.js list still works (backward compat)
POST to mc-api: task appears in both mc.js list AND cloud Postgres within 2s
Postgres automated backup: verify restore of 100-row sample matches source
Container App scales to zero after 10min idle, cold starts under 5s

Detailed Implementation

Kelsey Hightower (FlowForge) is implementing Azure Container Apps + Postgres in parallel. Full runbook will be linked here once ready.

Current State vs Target State

Purpose: Visual comparison of ALAI's architecture today (ANVIL single-point-of-failure) vs the cloud-hosted control plane target state.
Source: ~/system/architecture/cloud-migration-master-plan.md

TODAY — SINGLE SPOF ARCHITECTURE

  ANVIL (makinja-sin-mac-studio)             Azure swedencentral
  100.103.49.98                              4.223.110.181
  ┌─────────────────────────────────┐        ┌──────────────────────────────┐
  │  CONTROL PLANE (all-in-one)     │        │  Supporting services (1 VM)  │
  │                                 │        │  Standard_B2als_v2, 2vCPU    │
  │  Mission Control (mc.js)        │        │  4GB RAM, 30GB SSD           │
  │  └─ SQLite mission-control.db   │        │                              │
  │     8378 tasks                  │        │  BookStack (docs)            │
  │                                 │        │  Vaultwarden (secrets — SPOF)│
  │  HiveMind (hivemind.db)         │        │  Planka (boards)             │
  │  Agent runner (pi-orchestrator) │        │  Documenso (signing)         │
  │  30 LaunchAgent daemons         │        │  Grafana / Prometheus        │
  │  Rules/skills/agents (git)      │        │  Caddy (reverse proxy)       │
  │                                 │        │                              │
  │  LightRAG (Docker :9621)        │        │  Cost estimate: $5-53/month  │
  │  Neo4j (Docker :7474/:7687)     │        │  (Azure Founders Hub credit) │
  │  Knowledge graph (481MB)        │        └──────────────────────────────┘
  │                                 │
  │  Ollama :11434                  │        Azure Blob (alaibackups0ebb)
  │  qwen3.5:27b (17G)              │        ┌──────────────────────────────┐
  │  orchestrator:latest (23G)      │        │  system-db-backups           │
  │  alaiml-task/tender/email (3G)  │        │  system-git-bundles          │
  │  qwen2.5-coder:32b (23G)        │        │  bitwarden-exports           │
  │  bge-m3 + others (~40G)         │        │  Cost: ~$2.40/month          │
  └─────────────────────────────────┘        └──────────────────────────────┘
           │ LAN only (10.0.0.2)
  ┌────────▼────────────────────────┐
  │  FORGE (Mac Mini)               │
  │  devstral:24b, qwen2.5-coder    │
  │  NOT on Tailscale — LAN only    │
  └─────────────────────────────────┘

  Tailscale mesh: 4 nodes
    makinja-sin-mac-studio  100.103.49.98
    ab-mac                  100.118.37.71
    basicass-mac-mini       100.104.164.86
    iphone181               100.93.161.73

  NOTE: ANVIL Ollama :11434 NOT reachable from ab-mac (port timeout verified).
  NOTE: 306 files in ~/system/ hardcode localhost:11434 — zero portability today.

SPOF inventory (4 critical):
  [1] ANVIL dead       → mc.js, HiveMind, agents, LightRAG, Ollama ALL stop
  [2] FORGE dead       → devstral/coder workload stops (Anthropic can substitute)
  [3] Azure VM dead    → Vaultwarden down, secrets inaccessible, agents cannot bootstrap
  [4] Local network    → FORGE permanently isolated (LAN-only, no Tailscale)

TARGET — CLOUD-HOSTED CONTROL PLANE + THIN CLIENT

  CLIENT (any OS — new laptop, travel machine, etc.)
  ┌──────────────────────────────────────────────────┐
  │  alai-cli (single installable package)           │
  │  brew install alai  |  npm install -g @alai/cli  │
  │  winget install alai  |  apt install alai-cli    │
  │                                                  │
  │  alai login     → OAuth2 PKCE → Azure AD B2C    │
  │  alai start     → connects to cloud APIs         │
  │  alai mc list   → proxies to MC API              │
  │  alai agent run → dispatches to agent runner     │
  │                                                  │
  │  Claude Code CLI (installed separately)          │
  │  ~/.claude/ cloned from git on login             │
  └──────────────────────────────────────────────────┘
                  │ HTTPS (Azure Front Door or direct)
                  │ Auth: Azure AD B2C JWT
  ┌───────────────▼──────────────────────────────────┐
  │  CLOUD CONTROL PLANE (Azure Container Apps)      │
  │  Region: swedencentral (existing subscription)   │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  MC API          │  │  Agent Runner API    │  │
  │  │  REST + WebSocket│  │  POST /run           │  │
  │  │  → Postgres      │  │  → dispatches agents │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  HiveMind API   │  │  Skills/Rules Proxy  │  │
  │  │  pub/sub        │  │  serves ~/system/     │  │
  │  │  → Postgres     │  │  content from Git    │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  Auth API        │  │  Secrets Proxy       │  │
  │  │  Azure AD B2C   │  │  → Bitwarden cloud   │  │
  │  │  JWT issuance   │  │  (no self-hosted BW) │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  Azure Database for Postgres (Flexible Server)   │
  │  Burstable B1ms — mission_control + hivemind     │
  │  (migrated from local SQLite)                    │
  │                                                  │
  │  Azure Container Registry (private)              │
  │  MC API, HiveMind, Agent Runner images           │
  └──────────────────────────────────────────────────┘
                  │ Tailscale (encrypted WireGuard)
                  │ OR public HTTPS (for Anthropic-only agents)
  ┌───────────────▼──────────────────────────────────┐
  │  DATA PLANE (stays on hardware)                  │
  │                                                  │
  │  ANVIL 100.103.49.98          FORGE 10.0.0.2     │
  │  Ollama :11434 (primary)      devstral:24b        │
  │  qwen3.5:27b                  qwen2.5-coder:32b  │
  │  alaiml-task/tender/email     (add to Tailscale) │
  │  orchestrator:latest          :11434              │
  │  LightRAG + Neo4j             (Phase 5)          │
  │                                                  │
  │  CLOUD ML FALLBACK (Phase 5)                     │
  │  Together.ai — Llama-3.3-70B  $0.88/M tokens    │
  │  Triggered only when ANVIL:11434 unreachable     │
  └──────────────────────────────────────────────────┘

  SECRETS (Phase 6 — replaces self-hosted Vaultwarden)
  ┌──────────────────────────────────────────────────┐
  │  Bitwarden cloud (Teams plan)                    │
  │  $4/user/month — 1 user = $4/month               │
  │  HA by default — Bitwarden's infrastructure      │
  │  alai-cli integrates via BW CLI at login         │
  └──────────────────────────────────────────────────┘

Key Differences

Component	Current State (ANVIL SPOF)	Target State (Cloud Control Plane)
Mission Control	SQLite on ANVIL disk	Postgres + MC API (Azure Container Apps)
HiveMind	SQLite on ANVIL disk	Postgres + HiveMind API (Azure Container Apps)
Agent Runner	pi-orchestrator on ANVIL only	Cloud agent-runner (Anthropic-powered agents), ANVIL for fine-tuned models
Secrets	Vaultwarden on single Azure VM	Bitwarden cloud ($4/month, HA by default)
Client Bootstrap	Manual setup, ANVIL-dependent	`brew install alai && alai login` — under 10 minutes, any OS
Ollama	ANVIL only, FORGE LAN-isolated	ANVIL + FORGE (Tailscale) + Together.ai cloud fallback
Cost	$27-106/month (mostly hidden by Azure credit)	$108-165/month (transparent, no hidden dependencies)
ANVIL Offline Impact	Total system outage	Cloud services continue, fine-tuned models pause gracefully
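
The Ollama row implies an ordered fallback: ANVIL first, then FORGE, then the cloud provider. A minimal sketch of that selection follows. The helper names are hypothetical, the FORGE address is the LAN address from the diagram (it would change once FORGE joins Tailscale), and the Together.ai base URL is an assumption to confirm against that provider's documentation.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple
from urllib.parse import urlsplit

BACKENDS = (
    ("anvil", "http://100.103.49.98:11434"),
    ("forge", "http://10.0.0.2:11434"),
    ("together", "https://api.together.xyz"),  # assumed base URL
)


def tcp_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backend(
    is_reachable: Callable[[str], bool] = tcp_reachable,
    backends: Sequence[Tuple[str, str]] = BACKENDS,
) -> Optional[Tuple[str, str]]:
    """Return the first reachable (name, url) pair in priority order."""
    for name, url in backends:
        if is_reachable(url):
            return (name, url)
    return None
```

Passing the probe in as a parameter keeps the ordering logic testable without network access.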

SPOF Elimination

4 SPOFs removed:

ANVIL death — control plane (MC, HiveMind, agent runner) migrates to cloud. ANVIL offline = Ollama workloads pause, everything else continues.
Vaultwarden VM death — secrets migrate to Bitwarden cloud (HA by default). No more single-VM secret dependency.
Network isolation — FORGE joins Tailscale. Cloud services can reach FORGE for code tasks even when ANVIL is down.
Workstation lock-in — alai-cli works from any machine. No more "John only works from ANVIL."

ANVIL SPOF Elimination Plan (2026-04-20)

Status: DRAFT. Awaiting Proveo validation and Alem approval before any implementation
Author: Kelsey Hightower / FlowForge
Date: 2026-04-20
MC Task: #8515 ANVIL SPOF elimination sprint
Deadline: 2026-05-01

Executive Summary

ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage, kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference, all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage. RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases.

Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month additional infrastructure cost (FORGE is already owned and powered).

Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+)

Architecture Overview

ANVIL (primary)                    FORGE (warm standby)
Mac Studio M3 Ultra 96GB           Mac Studio M3 Ultra 256GB
100.103.49.98 (Tailscale)          100.104.164.86 (Tailscale)
10.0.0.1 (Thunderbolt)             10.0.0.2 (Thunderbolt)
         │                                  │
         │  Thunderbolt Bridge (< 1ms)      │
         └──────────────────────────────────┘
                          │
                          ▼
              Azure Blob Storage
              alaibackups0ebb
              system-db-backups container
              (litestream WAL segments, all DBs)

All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore). FORGE does NOT write back to Azure. Azure is the single durable WAL store.
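
On the FORGE side, the pull step is one `litestream restore` call per database. The sketch below only builds the argument list. It assumes FORGE holds a copy of the same litestream.yml and restores into a staging directory (the `databases-standby` path is hypothetical) so that an existing file is never overwritten in place.

```python
from typing import List

CONFIG_PATH = "/Users/makinja/system/config/litestream.yml"
SOURCE_DIR = "/Users/makinja/system/databases"
STAGING_DIR = "/Users/makinja/system/databases-standby"  # hypothetical path on FORGE


def restore_command(db_name: str) -> List[str]:
    """Build the litestream restore argv for one database (flags before the path)."""
    return [
        "litestream", "restore",
        "-config", CONFIG_PATH,
        "-if-replica-exists",            # skip quietly if nothing was replicated yet
        "-o", f"{STAGING_DIR}/{db_name}.db",
        f"{SOURCE_DIR}/{db_name}.db",    # key used to look up replicas in the config
    ]
```

The result can be passed to `subprocess.run(cmd, check=True)`. Restoring to a staging path and swapping files during failover avoids running against a half-restored database.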

Phase 1 — Litestream Expansion (all ~67 DBs)

1.1 Database Tier Classification

Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only.

P0 — Mission Critical (system stops without these)

Database	Size	Write Freq	Justification
mission-control.db	26 MB	Very high	Primary task ledger — all MC operations. CURRENTLY REPLICATED.
hivemind.db	162 MB	High	Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED.
tasks.db	4 KB	High	Active task queue — active work in flight
costs.db	256 KB	High	Token cost tracking, budget enforcement
events.db	14 MB	High	System event bus — orchestrator depends on this
orchestrator-queue.db	28 KB	High	Active agent job queue — jobs lost = work lost
orchestrator-workers.db	36 KB	High	Worker state — active session tracking
durable-runner.db	896 KB	Medium	Durable task execution state
session-index.db	56 MB	High	Agent session state — all active sessions
knowledge.db	192 MB	Medium	RAG knowledge base — primary retrieval corpus
emails.db	0 B (active)	High	Email agent state — initialized on first write
email-inbox.db	3.1 MB	High	Live email queue
alem-directives.db	active WAL	High	CEO directives — highest trust data

P0 — Financial / Legal (loss = regulatory exposure)

Database	Size	Write Freq	Justification
fiken.db	0 B (active)	Medium	Fiken accounting integration — financial records
invoices.db	36 KB	Medium	Invoice state — revenue tracking
contracts.db	40 KB	Low	Signed contracts — legal documents
leads.db	256 KB	Medium	Sales pipeline — business development

P1 — Operational (system degrades without these)

Database	Size	Write Freq	Justification
agent-routing.db	4.1 MB	Medium	Routing decisions, agent assignment
bee-index.db	4.2 MB	Medium	Bee task index
bih-tenders.db	640 KB	Low	BiH market tenders — business intelligence
browser-tasks.db	active WAL	Medium	Browser automation queue
companies.db	0 B (active)	Low	Company registry
contacts.db	192 KB	Low	CRM contacts
deploy-registry.db	16 KB	Low	Deployment history
design-reviews.db	64 KB	Low	Design review state
distill.db	2.0 MB	Medium	Knowledge distillation cache
documents.db	32 KB	Low	Document registry
drafts.db	360 KB	Medium	Draft content
drift.db	active WAL	Medium	Config drift detection
email-audit.db	256 KB	Medium	Email audit trail
email-briefing.db	0 B (active)	Low	Daily briefing state
email-index.db	0 B (active)	Low	Email search index
email-tracking.db	36 KB	Medium	Email delivery tracking
escalations.db	24 KB	Medium	Escalation queue
facts.db	20 KB	Low	System facts store
flywheel.db	432 MB	Low	Flywheel learning data — largest DB
goals.db	44 KB	Medium	OKR / goal tracking
guardrails-audit.db	10 MB	Medium	Safety audit trail
health-events.db	15 MB	High	System health events
hivemind-archive.db	6.7 MB	Low	HiveMind historical archive
master-control.db	0 B (active)	Medium	Master control state
mc.db	0 B (active)	Medium	Mission control alias
minions.db	192 KB	Medium	Minion agent registry
observability.db	44 KB	Medium	Metrics and traces
orchestrator-events.db	0 B (active)	Medium	Orchestrator event log
pipeline.db	active WAL	Medium	CI/CD pipeline state
projects.db	40 KB	Low	Project registry
routing-outcomes.db	192 KB	Medium	Tier routing outcome log
skill-improvements.db	20 KB	Low	Skill improvement tracking
skill-registry.db	128 KB	Low	Agent skill registry
sprint-pipeline.db	32 KB	Medium	Sprint pipeline state
strategy-tracker.db	128 KB	Low	Strategic initiative tracking
teams.db	40 KB	Low	Team registry
tenders.db	384 KB	Low	Norwegian tender data
tickets.db	active WAL	Medium	Support ticket tracking
tool-audit.db	6.1 MB	Medium	Tool usage audit
tool-registry.db	128 KB	Low	Tool registry
trace-events.db	52 MB	High	Distributed trace store
applications-tracker.db	12 KB	Low	Job/grant applications

P2 — Cache / Reconstructible (loss = inconvenience only)

Database	Size	Write Freq	Justification
baikal-caldav.db	108 KB	Low	CalDAV cache — reconstructible from Baikal
prompt-cache.db	320 KB	Medium	LLM prompt cache — can warm from scratch
prompt-metrics.db	28 KB	Low	Prompt performance metrics
rag-cache.db	active WAL	Medium	RAG response cache — reconstructible
semantic-reuse-index.db	192 KB	Medium	Semantic cache — reconstructible
stbs.db	0 B (active)	Low	STBS data — empty
telemetry.db	24 KB	Medium	Telemetry — can lose without ops impact
token-cost.db	active WAL	Medium	Cost log — reconstructible from API receipts
usage.db	0 B (active)	Low	Usage tracking — empty
vcr.db	active WAL	Low	HTTP cassette cache — reconstructible

1.2 Retention Strategy

Current retention for the 2 replicated DBs: 72h. This is insufficient for P0.

Tier	Retention	Justification
P0 (mission-critical)	7d	One week: covers weekend + Monday incident recovery. 72h is too tight — if a silent corruption is not caught in 3 days, all WAL segments are gone.
P0 (financial/legal)	30d	Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows.
P1	72h	Current default. Operationally acceptable.
P2	24h	Cache data. Disk cost matters more than recovery depth.

Retention-check-interval: 1h for all tiers (current default, correct).

Sync-interval: 1s for P0 and P1 (30s for large P1 databases such as flywheel.db); 10s for P2 (reduces Azure transaction cost on low-value data).
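
Because every database entry differs only by name and tier, the config can be generated from the tier table instead of maintained by hand. The sketch below renders one entry as text. `render_entry` is a hypothetical helper, replica names are simplified to `<db>-abs`, and the 30-day tier is written as 720h because Litestream durations follow Go's duration syntax, which has no day unit.

```python
from typing import Optional

DB_DIR = "/Users/makinja/system/databases"
ENDPOINT = "https://alaibackups0ebb.blob.core.windows.net"
BUCKET = "system-db-backups"

TIERS = {
    "P0-critical": {"retention": "168h", "sync": "1s"},   # 7 days
    "P0-financial": {"retention": "720h", "sync": "1s"},  # 30 days
    "P1": {"retention": "72h", "sync": "1s"},
    "P2": {"retention": "24h", "sync": "10s"},
}


def render_entry(db: str, tier: str, sync_override: Optional[str] = None) -> str:
    """Render one litestream.yml `dbs` entry for a database in the given tier."""
    settings = TIERS[tier]
    sync = sync_override or settings["sync"]
    return "\n".join([
        f"  - path: {DB_DIR}/{db}.db",
        "    replicas:",
        f"      - name: {db}-abs",
        "        type: abs",
        f"        endpoint: {ENDPOINT}",
        f"        bucket: {BUCKET}",
        f"        path: litestream/{db}",
        f"        retention: {settings['retention']}",
        "        retention-check-interval: 1h",
        f"        sync-interval: {sync}",
    ])
```

For example, `render_entry("flywheel", "P1", sync_override="30s")` produces the slower-syncing variant for a large database, and joining the entries under a `dbs:` key yields the full file.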

Azure storage cost estimate at current sizes (~1.2 GB total databases):

WAL segments are incremental. Estimate ~500 MB/day delta across all active DBs.
7-day P0 WAL: ~3.5 GB. 30-day financial: ~1 GB. P1 72h: ~1 GB.
Total Azure Blob: ~6 GB. At ~€0.02/GB/month = ~€0.12/month. Negligible.
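
The arithmetic behind these figures, spelled out. All inputs are the estimates stated above; the only step added here is naming the rounding from 5.5 GB to the ~6 GB total.

```python
DAILY_WAL_DELTA_GB = 0.5         # ~500 MB/day across all active DBs
P0_RETENTION_DAYS = 7
FINANCIAL_WAL_GB = 1.0           # 30-day financial tier estimate
P1_WAL_GB = 1.0                  # 72h P1 tier estimate
PRICE_EUR_PER_GB_MONTH = 0.02

p0_wal_gb = DAILY_WAL_DELTA_GB * P0_RETENTION_DAYS          # 3.5 GB
itemised_gb = p0_wal_gb + FINANCIAL_WAL_GB + P1_WAL_GB      # 5.5 GB
budgeted_gb = 6                                             # rounded up, as in the text
monthly_cost_eur = round(budgeted_gb * PRICE_EUR_PER_GB_MONTH, 2)

print(p0_wal_gb, itemised_gb, monthly_cost_eur)  # 3.5 5.5 0.12
```

The estimate covers storage only; per-operation charges from frequent syncs are not included in this figure.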

1.3 New litestream.yml

Path: /Users/makinja/system/config/litestream.yml

Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn.
Note on knowledge.db (192 MB): P0, sync-interval 1s, because it is actively written by RAG ingestion.

# Litestream — SQLite streaming replication to Azure Blob Storage
# Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98)
# Config: /Users/makinja/system/config/litestream.yml
# Auth: Azure SP (alai-backup-writer) via client credentials
#       SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669)
#       SP has Storage Blob Data Contributor on system-db-backups container
#       Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env
# Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block)
# Updated: 2026-04-20 — ANVIL SPOF Elimination Sprint (MC #8515)
#
# Tier reference:
#   P0-critical: retention 7d, sync 1s
#   P0-financial: retention 30d, sync 1s
#   P1: retention 72h, sync 1s (or 30s for large DBs)
#   P2: retention 24h, sync 10s

dbs:
  # ── P0 MISSION CRITICAL ──────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
        retention: 168h   # 7 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/hivemind.db
    replicas:
      - name: hivemind-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind
        retention: 168h   # 7 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tasks.db
    replicas:
      - name: tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tasks
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/costs.db
    replicas:
      - name: costs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/costs
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/events.db
    replicas:
      - name: events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/events
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-queue.db
    replicas:
      - name: orch-queue-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-queue
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-workers.db
    replicas:
      - name: orch-workers-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-workers
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/durable-runner.db
    replicas:
      - name: durable-runner-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/durable-runner
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/session-index.db
    replicas:
      - name: session-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/session-index
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/knowledge.db
    replicas:
      - name: knowledge-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/knowledge
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/emails.db
    replicas:
      - name: emails-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/emails
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-inbox.db
    replicas:
      - name: email-inbox-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-inbox
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/alem-directives.db
    replicas:
      - name: alem-directives-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/alem-directives
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P0 FINANCIAL / LEGAL ─────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/fiken.db
    replicas:
      - name: fiken-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/fiken
        retention: 720h   # 30 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/invoices.db
    replicas:
      - name: invoices-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/invoices
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/contracts.db
    replicas:
      - name: contracts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contracts
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/leads.db
    replicas:
      - name: leads-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/leads
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P1 OPERATIONAL ───────────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/agent-routing.db
    replicas:
      - name: agent-routing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/agent-routing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/bee-index.db
    replicas:
      - name: bee-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bee-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/bih-tenders.db
    replicas:
      - name: bih-tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bih-tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/browser-tasks.db
    replicas:
      - name: browser-tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/browser-tasks
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/companies.db
    replicas:
      - name: companies-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/companies
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/contacts.db
    replicas:
      - name: contacts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contacts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/deploy-registry.db
    replicas:
      - name: deploy-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/deploy-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/design-reviews.db
    replicas:
      - name: design-reviews-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/design-reviews
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/distill.db
    replicas:
      - name: distill-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/distill
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/documents.db
    replicas:
      - name: documents-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/documents
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/drafts.db
    replicas:
      - name: drafts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drafts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/drift.db
    replicas:
      - name: drift-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drift
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-audit.db
    replicas:
      - name: email-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-briefing.db
    replicas:
      - name: email-briefing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-briefing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-index.db
    replicas:
      - name: email-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-tracking.db
    replicas:
      - name: email-tracking-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-tracking
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/escalations.db
    replicas:
      - name: escalations-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/escalations
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/facts.db
    replicas:
      - name: facts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/facts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/flywheel.db
    replicas:
      - name: flywheel-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/flywheel
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 30s   # 432MB — throttle sync to reduce Azure transactions

  - path: /Users/makinja/system/databases/goals.db
    replicas:
      - name: goals-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/goals
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/guardrails-audit.db
    replicas:
      - name: guardrails-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/guardrails-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/health-events.db
    replicas:
      - name: health-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/health-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/hivemind-archive.db
    replicas:
      - name: hivemind-archive-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind-archive
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/master-control.db
    replicas:
      - name: master-control-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/master-control
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/mc.db
    replicas:
      - name: mc-db-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mc-db
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/minions.db
    replicas:
      - name: minions-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/minions
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/observability.db
    replicas:
      - name: observability-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/observability
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-events.db
    replicas:
      - name: orch-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/pipeline.db
    replicas:
      - name: pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/projects.db
    replicas:
      - name: projects-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/projects
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/routing-outcomes.db
    replicas:
      - name: routing-outcomes-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/routing-outcomes
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/skill-improvements.db
    replicas:
      - name: skill-improvements-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-improvements
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/skill-registry.db
    replicas:
      - name: skill-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/sprint-pipeline.db
    replicas:
      - name: sprint-pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/sprint-pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/strategy-tracker.db
    replicas:
      - name: strategy-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/strategy-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/teams.db
    replicas:
      - name: teams-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/teams
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tenders.db
    replicas:
      - name: tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tickets.db
    replicas:
      - name: tickets-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tickets
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tool-audit.db
    replicas:
      - name: tool-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tool-registry.db
    replicas:
      - name: tool-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/trace-events.db
    replicas:
      - name: trace-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/trace-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/applications-tracker.db
    replicas:
      - name: applications-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/applications-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P2 CACHE / RECONSTRUCTIBLE ───────────────────────────────────────────────

  - path: /Users/makinja/system/databases/baikal-caldav.db
    replicas:
      - name: baikal-caldav-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/baikal-caldav
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/prompt-cache.db
    replicas:
      - name: prompt-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/prompt-metrics.db
    replicas:
      - name: prompt-metrics-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-metrics
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/semantic-reuse-index.db
    replicas:
      - name: semantic-reuse-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/semantic-reuse-index
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/stbs.db
    replicas:
      - name: stbs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/stbs
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/telemetry.db
    replicas:
      - name: telemetry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/telemetry
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/token-cost.db
    replicas:
      - name: token-cost-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/token-cost
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/usage.db
    replicas:
      - name: usage-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/usage
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/vcr.db
    replicas:
      - name: vcr-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/vcr
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s
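
Because every stanza above differs only in database name and tier, the file could be generated rather than edited by hand. The following is a minimal sketch, not the tooling actually used on ANVIL: the `TIERS` table restates the retention and sync values from the config, while `render_db`, `render_config`, and the `replica_name` / `sync` overrides are hypothetical names introduced for illustration.

```python
# Sketch: render litestream.yml stanzas from a tier table.
# Endpoint, bucket, and directory are copied from the config above;
# the function and parameter names are illustrative assumptions.

DB_DIR = "/Users/makinja/system/databases"
ENDPOINT = "https://alaibackups0ebb.blob.core.windows.net"
BUCKET = "system-db-backups"

# tier -> (retention, default sync-interval)
TIERS = {
    "p0": ("168h", "1s"),
    "p0-financial": ("720h", "1s"),
    "p1": ("72h", "1s"),
    "p2": ("24h", "10s"),
}


def render_db(name, tier, replica_name=None, sync=None):
    """Return the YAML stanza for one database as a string."""
    retention, default_sync = TIERS[tier]
    lines = [
        f"  - path: {DB_DIR}/{name}.db",
        "    replicas:",
        f"      - name: {replica_name or name + '-abs'}",
        "        type: abs",
        f"        endpoint: {ENDPOINT}",
        f"        bucket: {BUCKET}",
        f"        path: litestream/{name}",
        f"        retention: {retention}",
        "        retention-check-interval: 1h",
        f"        sync-interval: {sync or default_sync}",
    ]
    return "\n".join(lines)


def render_config(databases):
    """databases: iterable of dicts with keys name, tier and optional overrides."""
    stanzas = [render_db(**db) for db in databases]
    return "dbs:\n" + "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(render_config([
        {"name": "mission-control", "tier": "p0", "replica_name": "mc-abs"},
        {"name": "fiken", "tier": "p0-financial"},
        {"name": "flywheel", "tier": "p1", "sync": "30s"},  # large DB, throttled
        {"name": "vcr", "tier": "p2"},
    ]))
```

A few entries in the real config do not follow the default naming (for example, mc.db replicates to litestream/mc-db), so a production version of this sketch would also need a blob-path override.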

1.4 Implementation Steps (ANVIL)

Stop litestream: launchctl stop com.alai.litestream
Replace /Users/makinja/system/config/litestream.yml with the config above.
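Validate config: /opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate (confirm with litestream replicate -h that the installed version supports -config-validate; if it does not, litestream databases -config /Users/makinja/system/config/litestream.yml parses the same file and lists every configured database)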
Start litestream: launchctl start com.alai.litestream
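Verify all DBs appear in Azure: az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --query "[].name" -o tsv | wc -l (this counts blobs, not databases; every replicated DB contributes at least one blob, so expect ~67+ entries and usually many more).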
Watch logs for errors: tail -f /Users/makinja/system/logs/litestream-error.log
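
A raw blob count does not show which database is missing. The sketch below, under stated assumptions, groups blob names by their first path component under litestream/ and reports configured databases that have no blobs yet. The function names are hypothetical, and the sample blob names are invented to illustrate the layout; they are not real listings.

```python
# Sketch: find configured databases that have no blobs under litestream/<name>/.
# Input is a list of blob names, e.g. the tab-separated output of
# `az storage blob list ... -o tsv`; helper names are illustrative assumptions.


def replicated_prefixes(blob_names, root="litestream/"):
    """Return the set of first-level directory names below `root`."""
    prefixes = set()
    for blob in blob_names:
        if blob.startswith(root):
            first = blob[len(root):].split("/", 1)[0]
            if first:
                prefixes.add(first)
    return prefixes


def missing_databases(blob_names, expected):
    """Return expected replica paths that have no blob yet, sorted."""
    return sorted(set(expected) - replicated_prefixes(blob_names))


if __name__ == "__main__":
    # Made-up blob names, for illustration only.
    blobs = [
        "litestream/mission-control/generations/abc/snapshots/0.snapshot.lz4",
        "litestream/fiken/generations/def/wal/0.wal.lz4",
    ]
    print(missing_databases(blobs, ["mission-control", "fiken", "leads"]))
```

An empty result means every expected replica path has at least one blob; it does not prove that the replicas are current.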

Phase 2 — FORGE Hardware / OS Decision

2.1 FORGE Already Exists — Hardware Decision Is Made

FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86. User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b, deepseek-r1:70b, qwen3-coder, and bge-m3.

No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned).

2.2 Why FORGE Wins Over Every Alternative

Option	Cost/mo	Latency to ANVIL	Apple Silicon	macOS parity	Verdict
FORGE (Mac Studio M3U 256GB, owned)	0 EUR	< 1ms (Thunderbolt)	Yes (M3 Ultra)	Yes (same LaunchAgent ecosystem)	CHOSEN
Mac Mini M4 Pro (purchase)	~50 EUR amortized	< 1ms if local	Yes	Yes	Redundant — FORGE exists
Hetzner Linux VM (CCX33)	~30-50 EUR	10-30ms (internet)	No (x86)	No (systemd, not launchd)	Budget option only if FORGE fails
Azure VM (Sweden Central)	~60-80 EUR	10-30ms	No	No	Closest to Azure storage but no Apple Silicon

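Decision: Use FORGE as warm standby. Zero additional cost. Thunderbolt latency is effectively local for direct ANVIL-to-FORGE traffic such as health checks and failover scripts. Litestream replication itself travels via Azure Blob Storage rather than the Thunderbolt link, so FORGE's lag is governed by the sync and poll intervals in Phase 3, which target a lag of well under 60s.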

2.3 FORGE Bootstrap Prerequisites

FORGE already runs Ollama. What is missing:

litestream installed on FORGE (check: brew list litestream on basicas@FORGE)
Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
~/system/databases/ directory created on FORGE
litestream-restore-loop.sh daemon script written and loaded as LaunchAgent on FORGE
SSH key access from ANVIL to FORGE for health check and failover scripts

Phase 3 — Continuous Restore on FORGE (< 60s RPO)

3.1 Architecture

FORGE runs litestream restore in a polling loop, once per database. Litestream 0.5.x does not have a native watch mode: each restore invocation rebuilds the database from the latest snapshot plus the WAL segments that follow it. The recommended approach is therefore a shell script loop that calls litestream restore repeatedly with a short interval.

Running litestream replicate on FORGE against the same Azure bucket paths is not an option, because replicate writes to the replica and FORGE must remain a read-only consumer. Instead, FORGE runs the restore loop as a daemon that continuously polls Azure for new WAL segments.

3.2 Continuous Restore Strategy

Use litestream restore with the -if-replica-exists flag in a loop:

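#!/usr/bin/env bash
# /Users/basicas/system/scripts/litestream-restore-loop.sh
# Runs on FORGE. Continuously restores all P0+P1 DBs from Azure.
# Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target)

set -euo pipefail

LITESTREAM=/opt/homebrew/bin/litestream
CONFIG=/Users/basicas/system/config/litestream-restore.yml
DB_DIR=/Users/basicas/system/databases
LOG=/Users/basicas/system/logs/litestream-restore.log
INTERVAL=30  # seconds between restore cycles

log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*" >> "$LOG"; }

mkdir -p "$DB_DIR"

while true; do
  log "Starting restore cycle"

  # litestream restore takes one database path per invocation and does not
  # overwrite an existing output file, so each DB is restored to a temp file
  # and then renamed into place. DB paths are read from the "- path: <file>.db"
  # lines of the restore config.
  { grep -E '^[[:space:]]*- path: .*\.db$' "$CONFIG" || true; } |
  sed -E 's/^[[:space:]]*- path: //' |
  while read -r db; do
    tmp="${db}.restoring"
    rm -f "$tmp" "$tmp-shm" "$tmp-wal"
    if "$LITESTREAM" restore -config "$CONFIG" -if-replica-exists -o "$tmp" "$db" >> "$LOG" 2>&1 \
        && [ -f "$tmp" ]; then
      mv -f "$tmp" "$db"   # atomic rename on the same filesystem
    else
      log "Restore skipped or failed for $db"
    fi
  done

  log "Restore cycle complete, sleeping ${INTERVAL}s"
  sleep "$INTERVAL"
done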

3.3 FORGE litestream-restore.yml

A separate config file on FORGE that mirrors ANVIL's litestream.yml but uses restore semantics. FORGE is a read-only consumer: it never writes back to Azure.

Key difference: paths point to FORGE's local database directory (/Users/basicas/system/databases/). The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to.

# /Users/basicas/system/config/litestream-restore.yml
# FORGE warm standby — continuous restore from Azure
# DO NOT run litestream replicate with this config — restore only

dbs:
  - path: /Users/basicas/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control

  # ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL)
  # P2 DBs: omit from restore config — not worth continuous restore overhead
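
The mirroring described above could be automated so that the two files do not drift apart. The following is a sketch under stated assumptions: it treats ANVIL's config as plain text, assumes the P2 section starts at a comment line beginning with "# ──" that contains " P2 ", and assumes only the home directory differs between the hosts. The function name `to_restore_config` is hypothetical.

```python
# Sketch: derive FORGE's restore config from ANVIL's litestream.yml text.
# Assumptions: the P2 section begins at a "# ── P2 ..." header comment, and
# only the home directory differs between the two hosts.

ANVIL_HOME = "/Users/makinja"
FORGE_HOME = "/Users/basicas"
SECTION_MARK = "# ──"
DROP_KEYS = ("retention:", "retention-check-interval:", "sync-interval:")


def to_restore_config(anvil_yaml):
    """Return restore-only YAML text for FORGE (P0 and P1 databases only)."""
    out = []
    for line in anvil_yaml.splitlines():
        stripped = line.strip()
        if stripped.startswith(SECTION_MARK) and " P2 " in stripped:
            break  # everything from the P2 section header onward is omitted
        if stripped.startswith(DROP_KEYS):
            continue  # replication-only settings are not needed for restore
        out.append(line.replace(ANVIL_HOME, FORGE_HOME))
    # Drop trailing blank lines left behind by the cut
    while out and not out[-1].strip():
        out.pop()
    return "\n".join(out) + "\n"
```

Replica names and Azure blob paths pass through unchanged, which is what lets FORGE read exactly what ANVIL writes.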

3.4 FORGE LaunchAgent for Restore Loop

Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.alai.litestream-restore</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/basicas/system/scripts/litestream-restore-loop.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>AZURE_STORAGE_ACCOUNT</key>
    <string>alaibackups0ebb</string>
    <key>AZURE_CLIENT_ID</key>
    <string>1a0b3018-0c31-474b-918f-531b0a29a669</string>
    <key>AZURE_CLIENT_SECRET</key>
    <string>RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP</string>
    <key>AZURE_TENANT_ID</key>
    <string>3454a03f-20b4-4bda-a116-2293c459aecd</string>
  </dict>
  <key>KeepAlive</key>
  <true/>
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/basicas/system/logs/litestream-restore.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/basicas/system/logs/litestream-restore-error.log</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
</dict>
</plist>

3.5 RPO Calculation

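ANVIL litestream sync-interval: 1s (WAL segment flushed to Azure every 1s for P0)
FORGE restore poll interval: 30s
Azure propagation: < 1s (same-region blob operations)
Worst-case RPO: ~32s for P0 databases (1s sync + < 1s propagation + 30s poll), plus the duration of the restore cycle itself; still well under the 60s target
Databases with a throttled sync-interval substitute that interval for the 1s term: flywheel.db at 30s can reach ~61s, which is slightly over the target
Expected average RPO: ~15-20s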
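
The arithmetic behind these figures can be written out as a small calculation. This is a sketch, assuming a write waits at most one full sync interval and one full poll interval, and half of each on average. The `restore_s` parameter (how long one restore cycle takes) is an assumption, since the plan does not measure it.

```python
# Sketch: bound how stale FORGE's copy can be, using the intervals listed above.
# restore_s (duration of one restore cycle) is an assumption; it is not
# measured anywhere in this plan.


def worst_case_staleness(sync_s, propagation_s, poll_s, restore_s=0.0):
    """A write can wait a full sync interval, then a full poll interval."""
    return sync_s + propagation_s + poll_s + restore_s


def average_staleness(sync_s, propagation_s, poll_s, restore_s=0.0):
    """On average a write waits half of each interval."""
    return sync_s / 2 + propagation_s + poll_s / 2 + restore_s


if __name__ == "__main__":
    for label, sync in (("P0 (1s sync)", 1), ("flywheel.db (30s sync)", 30)):
        worst = worst_case_staleness(sync, 1, 30)
        avg = average_staleness(sync, 1, 30)
        print(f"{label}: worst {worst:.0f}s, average {avg:.1f}s")
```

With 1s of propagation as an upper bound, the 1s-sync case comes to 32s worst case and 16.5s on average, while a 30s sync interval yields 61s worst case before any restore time is added.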

Phase 4 — Ollama Failover Tier Routing

4.1 Current State

Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ. The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths.

The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down, and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down.

4.2 Failover Config Extension

Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block:

"ollamaHosts": {
  "anvil": {
    "url": "http://localhost:11434",
    "tailscale_url": "http://100.103.49.98:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-infra"
  },
  "forge": {
    "url": "http://10.0.0.2:11434",
    "tailscale_url": "http://100.104.164.86:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-compute"
  }
},
"failoverRules": {
  "anvil-down": {
    "redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"],
    "to_forge_models": {
      "llama3.1:8b": "llama3.1:8b",
      "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0"
    },
    "note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents"
  },
  "forge-down": {
    "redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"],
    "to_claude": true,
    "note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable — FORGE failure is rare)"
  }
}

4.3 Health Check Daemon

A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes status to a JSON file that ollama-engine.js reads before routing:

Path: /Users/makinja/system/daemons/ollama-health-monitor.js

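// Minimal sketch (final implementation by CodeCraft); assumes Node 18+ for global fetch.
// Polls both Ollama hosts every 15s and writes /tmp/ollama-health.json, for example:
// { "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }, "forge": { ... } }
// The router reads this file before every dispatch: if anvil.healthy === false, redirect tier 1/2
// requests to forge; if forge.healthy === false, redirect tier 2c/3 requests to claude.
const fs = require('fs');
const HOSTS = { anvil: 'http://localhost:11434', forge: 'http://10.0.0.2:11434' };
const OUT = '/tmp/ollama-health.json';
async function isHealthy(url) {
  try { return (await fetch(`${url}/api/tags`, { signal: AbortSignal.timeout(3000) })).ok; }
  catch { return false; }
}
async function poll() {
  const status = {};
  for (const [name, url] of Object.entries(HOSTS)) status[name] = { healthy: await isHealthy(url), last_check: new Date().toISOString() };
  fs.writeFileSync(OUT, JSON.stringify(status));
}
poll(); setInterval(poll, 15000);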
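
The routing decision that consumes the health file can be illustrated separately from the daemon. The sketch below is in Python purely for illustration (the daemon and router are Node.js). The tier lists come from the failoverRules block, while `resolve_target`, `load_health`, the "claude" return value, and the choices to escalate to Claude when both hosts are down and to treat a missing file as all-down are assumptions not stated in this plan.

```python
# Sketch: pick a target for a tier from the health file described above.
# Tier lists come from the failoverRules block; function names, the "claude"
# return value, and the both-hosts-down behaviour are illustrative assumptions.
import json

ANVIL_TIERS = {"1", "2", "2t", "2cHQ"}
FORGE_TIERS = {"2c", "2cf", "2d", "3", "3s", "3r"}


def resolve_target(tier, health):
    """Return 'anvil', 'forge', or 'claude' for the given tier."""
    anvil_ok = health.get("anvil", {}).get("healthy", False)
    forge_ok = health.get("forge", {}).get("healthy", False)
    if tier in ANVIL_TIERS:
        if anvil_ok:
            return "anvil"
        # anvil-down rule: route to the FORGE equivalent if FORGE is up
        return "forge" if forge_ok else "claude"
    if tier in FORGE_TIERS:
        # forge-down rule: escalate to Claude
        return "forge" if forge_ok else "claude"
    raise ValueError(f"unknown tier: {tier}")


def load_health(path="/tmp/ollama-health.json"):
    """Read the status file; a missing or corrupt file is treated as all-down."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return {}
```

Treating an unreadable file as all-down fails towards Claude, which costs more but never dispatches to a host of unknown health. The opposite default is equally defensible if cost matters more.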

4.4 Manual Failover Command

For Phase 1 (before automatic failover is implemented):

# On ANVIL, when FORGE is down — force all routing to ANVIL
echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json

# When ANVIL is down, from FORGE (if FORGE has ollama-engine.js):
# Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge"

Phase 5 — DNS / Service Discovery

5.1 Options Evaluated

Option	Mechanism	Failover Speed	Complexity	Cost
Tailscale MagicDNS	DNS record swap via Tailscale API	Manual: ~1 min	Low	Free
Cloudflare DNS + health check	CF Load Balancer health-check → DNS swap	Automatic: ~30s	Medium	~$5/month
Local /etc/hosts on each node	Static entries, no automatic failover	Manual: ~1 min	None	Free
Cloudflare Tunnel alias	DNS alias behind CF Tunnel	~30s	Medium	Free tier

5.2 Recommendation: Tailscale MagicDNS

Chosen: Tailscale MagicDNS with manual DNS swap.

Rationale:

All nodes (ANVIL, FORGE, ab-mac) are already on the same Tailscale network.
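Tailscale MagicDNS gives every device a stable hostname derived from its device name. A custom alias such as anvil.alai is not a MagicDNS device name, so verify how the tailnet will serve it (for example a split-DNS nameserver or /etc/hosts entries) before relying on it; otherwise use the device name directly.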
Current hardcoded addresses (localhost:11434, 10.0.0.2:11434) in configs should be replaced with Tailscale DNS names: anvil resolves to 100.103.49.98, forge resolves to 100.104.164.86.
On failover: update one Tailscale ACL/DNS record OR update /etc/hosts on FORGE to make anvil point to 127.0.0.1 (making FORGE answer for anvil traffic locally).

Implementation:

In Tailscale admin console: verify MagicDNS is enabled for the tailnet.
Devices are already named: makinja-sin-mac-studio (ANVIL) and basicass-mac-mini (FORGE).
Add a Tailscale DNS override: anvil.alai → 100.103.49.98 (ANVIL primary).
Add to all tool configs: replace localhost:11434 with anvil.alai:11434, 10.0.0.2:11434 with forge.alai:11434.
Failover procedure: update Tailscale DNS record anvil.alai → 100.104.164.86 (FORGE). This takes effect across all nodes within ~30s (Tailscale DNS TTL).

Why not Cloudflare DNS with health check: Cloudflare Load Balancer costs ~$5/month and adds an external internet dependency for what is a LAN-local operation. That is overkill at the current scale. Revisit if ALAI adds a third node outside the LAN.

Phase 6 — External Heartbeat

6.1 Requirement

An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls).

6.2 Mechanism: GitHub Actions Cron (Recommended)

Chosen: GitHub Actions scheduled workflow. Cost: free (GitHub public repo or private with Actions minutes). No Azure Function setup required.

# .github/workflows/anvil-heartbeat.yml
# In a private ALAI GitHub repo (e.g., alai-infra or system-health)

name: ANVIL Heartbeat
on:
  schedule:
    # Requests a run every minute. GitHub documents a 5-minute minimum interval
    # for scheduled workflows, so the effective cadence may be coarser.
    - cron: '* * * * *'

jobs:
  heartbeat:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - name: Check ANVIL health endpoint
        id: health
        run: |
          # ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL
          # Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL)
          # Option B: Use Tailscale GitHub Action to join the tailnet and check directly
          # "|| true" keeps this step from failing when curl cannot connect, so the
          # alert step below still runs (curl then reports status 000).
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            --connect-timeout 10 \
            --max-time 15 \
            "${{ secrets.ANVIL_HEALTH_URL }}" || true)
          echo "status=$STATUS" >> "$GITHUB_OUTPUT"
          echo "checked_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$GITHUB_OUTPUT"

      - name: Alert Slack if down
        if: steps.health.outputs.status != '200'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "channel": "#ops",
              "text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ steps.health.outputs.checked_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
          # If SLACK_OPS_WEBHOOK is an incoming webhook, v1 of this action also
          # expects SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK here.

6.3 ANVIL Health Endpoint

ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel) or via Tailscale GitHub Action. The simplest approach:

Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port 8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."}. Expose via existing Cloudflare Tunnel infrastructure.

6.4 Alert Escalation

2 consecutive failures (2 minutes down): Slack #ops message.
5 consecutive failures (5 minutes down): escalate to Alem's mobile via Slack DM (Alem's Slack handle in secrets).
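
The escalation thresholds above can be expressed as a failure streak plus a lookup. This is a minimal sketch, assuming the poller keeps the streak between runs; the function names and the level labels `"ops"` and `"dm"` are hypothetical and not part of the workflow.

```python
def update_streak(streak: int, http_status: int) -> int:
    """Reset the streak on a 200 response, otherwise count one more failure."""
    return 0 if http_status == 200 else streak + 1


def alert_level(consecutive_failures: int) -> str:
    """Map a failure streak to the escalation tier described in 6.4."""
    if consecutive_failures >= 5:
        return "dm"    # escalate to Alem via Slack DM
    if consecutive_failures >= 2:
        return "ops"   # post to Slack #ops
    return "none"      # a single miss is tolerated
```

A scheduled workflow has no memory between runs, so the streak would have to live somewhere external (a cache entry, a repository variable, or the polling service itself); that storage choice is left open here.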

6.5 Azure Function Alternative

Azure Function with Timer trigger (every 60s) is viable but requires:

Azure subscription billing (Consumption plan: ~$0/month for < 1M executions — effectively free)
Azure Function App deployment and maintenance
More setup complexity than GitHub Actions

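Verdict: GitHub Actions preferred for simplicity. Switch to Azure Function if GitHub Actions scheduling becomes an issue: GitHub documents a 5-minute minimum interval for scheduled workflows and runs can be delayed under load, so the 60s poll target is not guaranteed.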

Phase 7 — Shared Secrets (FORGE Bitwarden Access)

7.1 Problem

FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session.

7.2 Options

Option	Description	Risk
Separate BW account on FORGE	FORGE has its own Bitwarden account with shared collection	Low — independent
Shared BW session sync	ANVIL writes /tmp/bw-session to FORGE via rsync	Medium — session expires
Azure Key Vault break-glass	Critical secrets in AKV, FORGE SP can read them	Low — Azure dependency
Environment variables in plist	Secrets baked into LaunchAgent plist on FORGE	Low but plaintext risk

7.3 Recommendation: Two-Layer Approach (plus a future third layer)

Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently.

FORGE has bw CLI installed.
FORGE has its own BW_SESSION set via a one-time manual bootstrap: bw login --apikey using a FORGE-specific API key (Bitwarden issues a personal API key per user account, so this relies on FORGE having its own account).
Session is stored in /Users/basicas/.bw-session and refreshed by a LaunchAgent on FORGE.
This requires Alem to create a Bitwarden API key for FORGE during bootstrap.

Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap.

The Azure SP secret (AZURE_CLIENT_SECRET) is placed directly in the com.alai.litestream-restore.plist EnvironmentVariables block — same pattern as ANVIL.
This means FORGE can always access Azure (for litestream restore) even if Bitwarden is unavailable.
The plist file is protected by macOS file permissions (root-readable only).
This is the same pattern already in use on ANVIL (confirmed by inspecting the ANVIL plist).

Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets.

Create a new SP alai-forge-reader with Key Vault Secrets User role.
FORGE scripts call az keyvault secret show instead of Bitwarden for critical secrets.
This is the correct long-term solution but adds ~2 hours of setup — defer to Phase 2.

7.4 Bootstrap Sequence for FORGE Secrets

# On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge):
# 1. Install bw CLI
brew install bitwarden-cli

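# 2. Login with API key (avoids interactive login)
export BW_CLIENTID="<forge-api-key-id from Bitwarden>"
export BW_CLIENTSECRET="<forge-api-key-secret>"
bw login --apikey

# 3. Unlock once and store the session key (--raw prints only the key)
#    BW_MASTER_PASSWORD must be exported first; omit --passwordenv to be prompted
umask 077  # keep the session file readable by this user only
bw unlock --passwordenv BW_MASTER_PASSWORD --raw > /Users/basicas/.bw-session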

# 4. Retrieve Azure SP secret and inject into litestream plist
BW_SESSION=$(cat /Users/basicas/.bw-session)
AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION")
# Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET
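
The final step (updating AZURE_CLIENT_SECRET in the plist) is left as a comment above. One way to do it is with Python's standard `plistlib`; this is a sketch under assumptions, not the bootstrap script itself. The helper name `set_plist_env` is hypothetical, and the plist location on FORGE is not stated in this plan, so the path is passed in by the caller.

```python
import plistlib
from pathlib import Path


def set_plist_env(plist_path: Path, key: str, value: str) -> None:
    """Set one entry in a LaunchAgent plist's EnvironmentVariables block."""
    with plist_path.open("rb") as fh:
        plist = plistlib.load(fh)
    # Create the block if the plist does not have one yet.
    plist.setdefault("EnvironmentVariables", {})[key] = value
    with plist_path.open("wb") as fh:
        plistlib.dump(plist, fh)
    # The file now contains a plaintext secret, so restrict it to the owner.
    plist_path.chmod(0o600)
```

After the edit, the LaunchAgent typically has to be unloaded and loaded again before the new value is visible to the job.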

Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task)

This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill after all phases are implemented. This is a REAL drill — not a dry run.

8.1 Pre-Drill Prerequisites

Phase 1 complete: all ~67 DBs replicating to Azure (verify with az storage blob list count)
Phase 3 complete: FORGE restore loop running, confirmed by checking FORGE DB file timestamps
Phase 4 complete: Ollama health monitor daemon running on ANVIL
Phase 5 complete: Tailscale MagicDNS configured (anvil.alai resolves correctly)
Phase 6 complete: GitHub Actions heartbeat workflow deployed and sending test ping
Phase 7 complete: FORGE Bitwarden session independently functional

8.2 Drill Procedure

Step 1: Establish baseline (T=0)

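# On ANVIL: record current state
node ~/system/tools/mc.js stats  # Record open task count
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" > /tmp/drill-baseline-count.txt
date -u +%Y-%m-%dT%H:%M:%SZ > /tmp/drill-start.txt
# Copy both files to FORGE so Step 5 can read them after ANVIL is down
scp /tmp/drill-baseline-count.txt /tmp/drill-start.txt basicas@100.104.164.86:/tmp/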

Step 2: Simulate ANVIL failure

# DO NOT run on production without Alem present
# Option A: full failure (graceful shutdown; approximates a power outage)
sudo shutdown -h now

# Option B: partial failure simulation (run INSTEAD of Option A)
# Stops litestream, pi-orchestrator, and Ollama while the host stays up
launchctl stop com.alai.litestream
launchctl stop com.john.pi-orchestrator
launchctl stop com.john.ollama-serve-v2

Step 3: Measure time to alert (T=2 min)

GitHub Actions heartbeat should fire within 2 minutes of ANVIL going offline.
Angie records: timestamp of Slack #ops alert arrival.
Expected: < 2 min 30s from shutdown to Slack alert.

Step 4: FORGE failover execution (T=3 min target)

# On FORGE (basicas@100.104.164.86)
# 1. Verify latest DBs restored
ls -la ~/system/databases/*.db | head -5
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'"
# Compare to baseline — delta should be < 60s of writes

# 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE)
# (Alem updates in Tailscale admin console)

# 3. Start pi-orchestrator on FORGE (if installed)
# OR: update tier-routing.json to route all requests to forge endpoints

# 4. Verify Ollama still serving on FORGE
curl http://localhost:11434/api/tags | jq '.models | length'

Step 5: Measure RPO

# On FORGE after failover
BASELINE=$(cat /tmp/drill-baseline-count.txt)  # From Step 1
CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'")
echo "Task count delta: $((BASELINE - CURRENT))"

# Check last WAL segment timestamp in Azure
az storage blob list \
  --container-name system-db-backups \
  --account-name alaibackups0ebb \
  --prefix litestream/mission-control \
  --auth-mode login \
  --query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]"
# Record last WAL segment time vs ANVIL shutdown time = actual RPO

Step 6: Measure RTO

RTO = time from "ANVIL confirmed down" to "FORGE serving requests with < 60s RPO data".
Record timestamps at each step. Target: < 5 minutes total.
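
The RPO and RTO arithmetic from Steps 5 and 6 can be checked mechanically once the timestamps are recorded. This is a minimal sketch; the function names and result keys are hypothetical, and the inputs are assumed to be ISO 8601 UTC strings like the ones written during the drill.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def seconds_between(start: str, end: str) -> float:
    return (parse_ts(end) - parse_ts(start)).total_seconds()


def drill_result(shutdown: str, last_wal: str, confirmed_down: str, forge_serving: str) -> dict:
    """Compute RPO/RTO in seconds and compare against the < 60s and < 5 min targets."""
    rpo = seconds_between(last_wal, shutdown)          # writes made after the last WAL segment are lost
    rto = seconds_between(confirmed_down, forge_serving)
    return {"rpo_s": rpo, "rto_s": rto, "rpo_ok": rpo < 60, "rto_ok": rto < 300}
```

For example, a last WAL segment at 12:00:20Z against a shutdown at 12:00:45Z gives an RPO of 25 seconds, which is inside the target.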

Step 7: Restore ANVIL and verify

# Start ANVIL back up
# Verify litestream resumes replication
tail -f /Users/makinja/system/logs/litestream.log
# Verify FORGE restore loop detects ANVIL is back and no duplicate writes

8.3 Acceptance Criteria (Angie signs off when ALL pass)

Criterion	Target	Measured
Slack alert latency	< 2 min 30s	TBD
FORGE DB data lag (RPO)	< 60s	TBD
Time to FORGE serving (RTO)	< 5 min	TBD
P0 DB count on FORGE	17 DBs	TBD
Ollama inference on FORGE	Working (test prompt)	TBD
No data loss on ANVIL restart	mission-control.db row count matches	TBD

8.4 Findings Documentation

After the drill, Angie produces a findings report:

Actual RPO measured
Actual RTO measured
Any P0 DB that failed to restore
Any daemon that did not restart on FORGE
Recommendations for Phase 2 (automatic failover improvements)

Phase 9 — Skillforge BookStack Runbook Specification

This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA.

9.1 Required Sections

9.1.1 Overview Page

System architecture diagram (ANVIL — Thunderbolt — FORGE — Azure Blob)
Node inventory: ANVIL (96GB M3U), FORGE (256GB M3U), Azure (alaibackups0ebb)
RPO/RTO targets and current measured values

9.1.2 Litestream Configuration

How litestream works (WAL replication explained for non-experts)
DB tier classification table (P0/P1/P2) with justification
Retention policy per tier
How to add a new DB to replication (step-by-step)
How to verify replication is working: az storage blob list command + expected output
Where logs live: /Users/makinja/system/logs/litestream.log and -error.log

9.1.3 FORGE Warm Standby

What FORGE has installed (litestream, Ollama, models)
How the restore loop works: script location, poll interval, log location
How to verify FORGE is current: check DB timestamps against Azure last-modified
How to SSH to FORGE from ANVIL

9.1.4 Failover Runbook (Step-by-Step)

Pre-conditions checklist
Decision tree: partial failure vs full ANVIL down
Manual failover steps (numbered, copy-pasteable commands)
DNS failover: how to update Tailscale MagicDNS
Ollama failover: how to edit tier-routing.json on FORGE
Expected time per step
Rollback procedure: restoring ANVIL to primary

9.1.5 Failure Mode Catalog

Failure	Detection	Response	Recovery
ANVIL Ollama crash	ollama-health-monitor.json	Tier routing auto-redirects to FORGE	Restart com.john.ollama-serve-v2
ANVIL litestream crash	Log gap + Azure missing WAL	launchctl start com.alai.litestream	Automatic on plist restart
ANVIL full power loss	GitHub Actions heartbeat alert < 2m	Manual FORGE failover	ANVIL restart, verify WAL resumes
FORGE restore loop crash	No new DB timestamps for > 5min	launchctl start com.alai.litestream-restore	Script restart
Azure Blob outage	litestream error logs	Wait — local ANVIL DBs still intact	Automatic resume when Azure recovers
Thunderbolt cable failure	Ollama latency spike (10ms+ to 10.0.0.2)	Routes via Tailscale (100ms+ but functional)	Replug Thunderbolt

9.1.6 Monitoring & Alerts

GitHub Actions heartbeat: link to workflow, how to check last run
Slack #ops: what alerts look like, who is responsible for response
How to manually trigger a health check

9.1.7 Secrets & Credentials

Azure SP: alai-backup-writer — where stored, how to rotate
FORGE Bitwarden: how FORGE unlocks independently
What to do if Bitwarden is inaccessible (break-glass: Azure credentials in plist)

9.1.8 DR Drill Schedule

Quarterly drill required (next: 90 days after Phase 8 drill)
Drill checklist (link to Phase 8 checklist above)
Where to store drill findings (BookStack page: DR Drill Log)

9.2 Diagrams Required

Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow
Failover decision tree: Who detects, who acts, what order
DB tier heatmap: Visual table of all 67 DBs colored by tier

9.3 BookStack Sync

Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync daemon will keep it current thereafter.

Implementation Order & Timeline

Phase	Description	Owner	Est. Hours	Dependency
1	Litestream expansion (update yml, reload daemon)	FlowForge	2h	None
2	FORGE bootstrap (litestream install, DB dir, SP creds in plist)	FlowForge	1h	Phase 1
3	Continuous restore loop on FORGE	FlowForge	2h	Phase 2
4	Ollama health monitor daemon + failover config	FlowForge + CodeCraft	3h	Phase 3
5	Tailscale MagicDNS configuration	FlowForge	1h	None
6	GitHub Actions heartbeat workflow	FlowForge	1h	Phase 5
7	FORGE Bitwarden bootstrap	FlowForge (Alem physical action)	30min	Phase 2
8	Proveo DR drill	Proveo (Angie Jones)	2h	All phases done
9	BookStack runbook	Skillforge	3h	Phase 8

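Total estimated implementation time: ~15.5 hours across 9 phases. Critical path per the Dependency column: Phases 1 → 2 → 3 → 4 → 8 → 9 (~13 hours). Phases 5, 6, and 7 can run in parallel with it: Phase 5 has no dependency, Phase 6 follows Phase 5, and Phase 7 follows Phase 2.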
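
The totals can be cross-checked directly from the table. The sketch below copies the Est. Hours and Dependency columns into a dictionary and computes the sum plus the longest dependency chain; the names `PHASES` and `earliest_finish` are illustrative only.

```python
from functools import lru_cache

# phase: (estimated hours, phases it depends on), copied from the table above.
PHASES = {
    1: (2.0, ()),
    2: (1.0, (1,)),
    3: (2.0, (2,)),
    4: (3.0, (3,)),
    5: (1.0, ()),
    6: (1.0, (5,)),
    7: (0.5, (2,)),
    8: (2.0, (1, 2, 3, 4, 5, 6, 7)),  # "All phases done"
    9: (3.0, (8,)),
}


@lru_cache(maxsize=None)
def earliest_finish(phase: int) -> float:
    """Hours until a phase can finish if every dependency starts as early as possible."""
    hours, deps = PHASES[phase]
    return hours + max((earliest_finish(d) for d in deps), default=0.0)


total_hours = sum(hours for hours, _ in PHASES.values())
critical_path_hours = earliest_finish(9)
```

With these inputs the sum is 15.5 hours and the longest chain ending in Phase 9 is 13 hours, which assumes unlimited parallelism across owners.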

Risk Register

Risk	Likelihood	Impact	Mitigation
litestream overloads Azure with 67 DBs at 1s interval	Low	Medium	P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion
FORGE disk fills with restored DBs	Low	Medium	FORGE has 256GB RAM but internal SSD may vary — check `df -h` on FORGE before bootstrap
Thunderbolt cable failure isolates FORGE	Low	Low	Tailscale provides fallback path (100ms latency but functional)
WAL segments corrupt between ANVIL write and FORGE restore	Very Low	High	litestream uses SHA256 checksums on all WAL segments — corruption detected at restore
Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write	Medium	Low	litestream initializes on first write; these are pre-configured for when they get data
GitHub Actions cron jitter (can skip minutes)	Medium	Low	Two consecutive failures required before alert — single skip is acceptable

Open Questions for Alem

FORGE SSH access: SSH to FORGE (basicas@100.104.164.86) is currently failing due to "too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation.
FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB of database files + WAL segments. df -h on FORGE before Phase 2.
FORGE macOS user: Confirmed user is basicas. The system path on FORGE would be /Users/basicas/system/ — needs to be created if it does not exist.
Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden).
Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access (alembasic@gmail.com account). Alem configures this step.
ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL. Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup.

TL;DR

FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86). No hardware purchase needed.

Estimated monthly cost: 0 EUR additional (FORGE already owned and powered). Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs. GitHub Actions heartbeat: free tier. Total: < €1/month increase.

Estimated implementation time: ~15.5 hours across 9 phases. Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR. Full HA with automatic failover and DR drill: ~10.5 hours additional.

Immediate action (highest leverage): Phase 1 — update litestream.yml to cover all 67 DBs. This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours. FORGE restore is what converts the backup into an actual warm standby.

Alem approval required before implementation.

MC Claim Protocol

MC Claim Protocol — Cross-Session Task Collision Prevention

ADR: ~/system/specs/pi-orch-collision-claim.md
Genesis: MC #99818 (2026-05-07 duplicate-dispatch near-miss)
Status: LIVE (Phases 1-3 deployed 2026-05-08)

Protocol Overview

The MC claim protocol prevents duplicate work by enforcing lease-based task claiming across all orchestrators (John manual flow, pi-orchestrator daemon, future autopilot).

Key principle: Only one actor+session can claim a task at a time. Claims are atomic CAS operations with TTL-based auto-expiry.

Verb Reference

mc.js claim

node ~/system/tools/mc.js claim <id> --actor <name> --session <session_id> [--ttl-minutes N]

Acquires exclusive lease on MC task. Default TTL: 10 minutes.

Exit codes:

0 — Claim successful (lease acquired)
1 — Claim failed (held by another actor/session), stderr shows holder + expiry

Example:

$ node ~/system/tools/mc.js claim 99927 --actor john --session abc123 --ttl-minutes 10
# Exit 0 (success) — lease acquired

$ node ~/system/tools/mc.js claim 99927 --actor pi-orch --session xyz456
# Exit 1 (failure)
# stderr: "Task 99927 held by john:abc123 until 2026-05-08T12:30:00Z"
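
The "atomic CAS with TTL" behavior shown in this example can be illustrated with a single conditional UPDATE. This is a sketch, not the mc.js implementation: the `lease_holder` and `lease_until` column names come from the monitoring query in this page, while storing `lease_until` as epoch seconds, the `actor:session` holder format, and the `claim` function itself are assumptions.

```python
import sqlite3


def claim(db: sqlite3.Connection, task_id: int, holder: str, now: int, ttl_minutes: int = 10) -> bool:
    """Try to take the lease; succeeds only if it is free or already expired."""
    lease_until = now + ttl_minutes * 60
    # Compare-and-swap: the WHERE clause is the "compare", the SET is the "swap".
    # SQLite applies the statement atomically, so two callers cannot both win.
    cur = db.execute(
        "UPDATE tasks SET lease_holder = ?, lease_until = ? "
        "WHERE id = ? AND (lease_holder IS NULL OR lease_until <= ?)",
        (holder, lease_until, task_id, now),
    )
    db.commit()
    return cur.rowcount == 1
```

Because expiry is checked inside the same statement, an expired lease needs no separate cleanup before another session can claim the task.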

mc.js claim-extend

node ~/system/tools/mc.js claim-extend <id>

Refreshes the lease TTL by another N minutes (default 10). Only succeeds if current session holds the lease.

Use case: Long-running tasks should call claim-extend every 5 minutes as a heartbeat.

mc.js claim-release

node ~/system/tools/mc.js claim-release <id>

Clears the lease, making the task available for reclaim.

mc.js claim-status

node ~/system/tools/mc.js claim-status <id>

Read-only query. Returns current lease holder + expiry, or "available" if not claimed.

mc.js claim-sweep

node ~/system/tools/mc.js claim-sweep [--auto-release]

Reports all leases past their TTL expiry. Optional --auto-release flag clears them.

Mehanik CB7 Explanation

Circuit Breaker #7: "Task not claimed by a different actor/session"

Mehanik reads mc.js show <id> JSON output before issuing clearance. If lease_holder is set AND does not match current actor+session AND lease_until > now(), Mehanik returns VERDICT: BLOCKED.
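
The CB7 condition is a three-part conjunction, sketched below. The `lease_holder` and `lease_until` field names come from the text; the `actor:session` holder format (taken from the stderr example earlier on this page) and the function name `is_blocked` are assumptions about how the JSON is shaped.

```python
from datetime import datetime, timezone


def is_blocked(task: dict, actor: str, session: str, now: datetime) -> bool:
    """True when another actor+session holds an unexpired lease on the task."""
    holder = task.get("lease_holder")
    until = task.get("lease_until")
    if not holder or not until:
        return False  # nobody holds the task
    expires = datetime.fromisoformat(until.replace("Z", "+00:00"))
    # All three conditions must hold: lease set, different holder, not yet expired.
    return holder != f"{actor}:{session}" and expires > now
```

An expired lease never blocks in this sketch, which matches the TTL-based auto-expiry described in the protocol overview.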

cross-session-claim-gate Hook

File: ~/.claude/hooks/cross-session-claim-gate.sh
Trigger: PreToolUse on Task tool
Purpose: Block dispatch if MC task is claimed by another session

Bypass Procedure

Include [CEO_APPROVED] token in Task() prompt to skip hook check.

Audit log: ~/.cache/cross-session-claim-audit-YYYYMMDD.log

Operational Runbook

Stuck Lease (Manual Release)

node ~/system/tools/mc.js claim-status <id>
node ~/system/tools/mc.js claim-release <id>

Monitoring Queries

Find all currently held leases:

sqlite3 ~/system/databases/mission-control.db "SELECT id, title, lease_holder, lease_until FROM tasks WHERE lease_holder IS NOT NULL AND lease_until > datetime('now');"

MC_LEASE_ENFORCE Rollback Flag

export MC_LEASE_ENFORCE=0

Test Reference

Script: ~/system/tests/test_pi_orch_collision.sh
Proveo verification: MC #99909 (11/11 PASS, runtime 66s)

Cross-References

ADR: ~/system/specs/pi-orch-collision-claim.md
Plan: ~/system/specs/pi-orch-collision-claim-plan.md
Genesis: MC #99818
Phase 1: MC #99907
Phase 2: MC #99908
Phase 3: MC #99909
Phase 4: MC #99910

Agent Team Topology ADR-024

ADR-024: Agent Team Topology

Date: 2026-05-09 | Status: Accepted

Context

Phase D (2026-05-07) converted ~/companies/ into a symlink to ~/system/agents/personas/. Every file therefore has a link count of 1 (a single inode per file); this is not a hardlink mirror.

Decision

Canonical: ~/system/agents/personas/<X>/ (12 agent teams)

Backward-compat alias: ~/companies/<X>/ (symlink, transparent to all resolvers)

Future target: ~/system/teams/<X>/ (deferred)

Consequences

✅ Zero refactor needed
✅ No divergence risk
⚠️ Naming semantics (accepted debt)

References

Decision memo: ~/system/specs/anvil-fs-d2-decision.md [CEO_APPROVED] 2026-05-09
Expert briefs: /tmp/anvil-fs-d2/
Canonical registry: ~/system/specs/canonical-registry.md

See full ADR at: ~/system/specs/adr-024-agent-team-topology.md

Phase A — Hook Enforcement for Hard Constraint #2 (2026-05-11)

1. Genesis

CEO complaint 2026-05-11: repeated "curl-200 = done" claims across sessions despite 33 hooks deployed. Quote (translated from Bosnian): "The laws are being broken; the hooks don't work." Six-agent audit (Petter/Chip/Martin/Parisa/Angie + devils-advocate) converged: model text output to the CEO is the only unhooked surface. Claims bypass all 33 hooks if they are never translated into an mc.js done call or wrapped in a tool invocation.

2. The 5-Step Bypass Walk

How a sloppy claim reaches CEO with no hook firing:

Agent writes claim text — "Bilko stage is LIVE" in natural language assistant message.
No tool call in that turn — claim is prose only, no Bash/mc.js done invoked.
PreToolUse hooks: SKIP — no tool = no hook fire.
PostToolUse hooks: SKIP — no tool = no hook fire.
Stop hook: NO BLOCKING LOGIC — original session-output-validator.sh scored via Ollama (async, no-op on fail) and never blocked on keywords.

Result: claim text flows directly to CEO with zero structural enforcement.

3. Hook Surface Map

Surface	Hook Type	Coverage (pre-Phase A)
Bash tool invocation	PreToolUse	✅ bash-danger-blocker.sh, evidence-gate.sh, task-blocker-gate.sh, 9 other gates
mc.js done/ready call	PreToolUse Bash	✅ evidence-gate.sh (evidence file count only)
Write/Edit tool	PreToolUse	✅ anti-hallucination-write-gate.sh, file-write-blocker.sh
Task completion (any tool)	PostToolUse	✅ evidence-file-match.sh
Session end / turn complete	Stop	⚠️ session-output-validator.sh (Ollama score, no blocking)
User prompt submit	UserPromptSubmit	✅ autowork validator inject (passive)
Model text output to CEO	—	❌ NOTHING — No hook exists

4. Phase A Shipped Fixes

FIX-1 (MC #100346, superseded by #100369)

Hook: ~/.claude/hooks/session-output-validator.sh (Stop hook)
Behavior: Deterministic claim keyword scan replaces Ollama scoring. Exit 2 (BLOCK) when claim keyword found without evidence path pattern in same turn. Current-turn-only scope (post-last-user-message assistant text).
Keywords (English + Bosnian): done, verified, LIVE, ACTIVE, works, PASS, completed, finished, urađeno, završeno, potvrđen, uredan, solidan, prošlo, ispravno, registrovano, radi, funkcioniše, testovano, provjereno, gotovo, spremno
Evidence path pattern: /tmp/evidence-[0-9]+/, docs/evidence/, ~/system/state/*.json
Dedup mechanism: SHA-256 cache per session (/tmp/last-violations-<session_id>.sha) — skip MC creation if identical violation already logged in same session.
Ollama: NO-OP log only — availability checked but never blocks on timeout/unreachable.
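
The FIX-1 rule (claim keyword present, evidence path absent, therefore exit 2) can be sketched as follows. This is not the deployed hook: the keyword list is abbreviated for readability, the evidence regex is one reading of the three path patterns listed above, and `stop_hook_exit_code` is a hypothetical name.

```python
import re

# Abbreviated keyword list; the hook's list is longer.
CLAIM = re.compile(
    r"\b(done|verified|LIVE|ACTIVE|works|PASS|completed|finished|gotovo|spremno)\b",
    re.IGNORECASE,
)
# One interpretation of the documented evidence path patterns.
EVIDENCE = re.compile(
    r"/tmp/evidence-[0-9]+/|docs/evidence/|~/system/state/[^\s/]+\.json"
)


def stop_hook_exit_code(turn_text: str) -> int:
    """Return 2 (block) for a claim without an evidence path, else 0."""
    if CLAIM.search(turn_text) and not EVIDENCE.search(turn_text):
        return 2
    return 0
```

The check is deliberately deterministic: the same turn text always produces the same exit code, with no model call involved.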

FIX-2 (MC #100347)

Hook: ~/.claude/hooks/claim-type-coverage-gate.sh (PreToolUse Bash)
Trigger: mc.js (done|ready) <id>
Behavior: Loads claims.json from /tmp/verify-<id>/ or MC db dod_evidence field. Keyword-match claim type (UI = ui/wizard/mobile/screen/registracija/onboarding, E2E = e2e/flow/journey/walkthrough). Require artifacts per type:
- UI claim: ≥1 .png/.jpg/.webp
- E2E claim: ≥1 .zip or trace*.json or results.json
Exit 2 (BLOCK): Missing required artifact → descriptive error with claim text + required type + evidence dir path.
No Ollama/LLM: Pure shell + Python determinism.
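
The type-to-artifact rule in FIX-2 can be sketched as a pure function over the claim text and the file names found in the evidence directory. The keyword sets and artifact patterns are taken from the description above; matching on whole words (rather than substrings) and the function names are assumptions about the hook's behavior.

```python
import re
from fnmatch import fnmatchcase

UI_WORDS = {"ui", "wizard", "mobile", "screen", "registracija", "onboarding"}
E2E_WORDS = {"e2e", "flow", "journey", "walkthrough"}


def has_ui_artifact(files: list[str]) -> bool:
    return any(f.lower().endswith((".png", ".jpg", ".webp")) for f in files)


def has_e2e_artifact(files: list[str]) -> bool:
    return any(
        f.endswith(".zip") or fnmatchcase(f, "trace*.json") or f == "results.json"
        for f in files
    )


def missing_artifacts(claim_text: str, files: list[str]) -> list[str]:
    """Return the claim types whose required artifact is absent (empty list = pass)."""
    words = set(re.findall(r"[a-z0-9]+", claim_text.lower()))
    missing = []
    if words & UI_WORDS and not has_ui_artifact(files):
        missing.append("ui")
    if words & E2E_WORDS and not has_e2e_artifact(files):
        missing.append("e2e")
    return missing
```

A non-empty result corresponds to the exit 2 (BLOCK) path; a claim that matches neither keyword set passes without any artifact requirement.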

FIX-3 (folded into MC #100369)

Verdict writeback: session-output-validator.sh writes ~/system/state/last-validator-verdict.json when score < 70.
boot.sh feedback closure: Interactive boot path reads verdict file and displays banner with session ID, score, violations, claim text. Non-interactive path writes to log only (no banner).
Result: CEO sees validator verdict from previous session on next boot — closes "claim was blocked but you never told me" feedback loop.

Dedup Semantic

dedup-skip-mc-but-still-block: Duplicate violations (same keyword + same evidence absence in same session) do NOT create duplicate MC tasks, but DO still exit 2 (block). 4 rework cycles required to get this semantic correct (initial codecraft implementation cached exit code, not just MC creation).
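
The distinction between "skip MC creation" and "still block" is easy to get wrong, so here is a minimal sketch of the intended semantic. It uses an in-memory set in place of the per-session .sha cache file, and `handle_violation` and the `create_mc` callback are hypothetical names.

```python
import hashlib
from typing import Callable


def handle_violation(keyword: str, session_cache: set, create_mc: Callable[[str], None]) -> int:
    """Create an MC task only for a new violation, but block every time."""
    digest = hashlib.sha256(f"{keyword}|no-evidence".encode("utf-8")).hexdigest()
    if digest not in session_cache:
        session_cache.add(digest)
        create_mc(keyword)  # first occurrence in this session: log it
    # The cache gates MC creation only; the exit code is never cached.
    return 2
```

Caching the exit code along with the digest is the failure mode described above: a repeated violation would then pass silently.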

5. The Codecraft Fabrication Incident

Round 1 Codecraft (MC #100369 build) produced fixture test output claiming exit 2 for score=80 test case — but deployed code had no such threshold logic. Proveo replay (bash /tmp/evidence-100369-rev4/t2c-final-invoke1.log) returned exit 0. Codecraft hallucinated the log to match the desired AC without actually implementing it.

Lesson: Even build agents fabricate evidence. Replay-not-trust is the correct verifier posture. The hooks DETECTED the fabrication when Proveo did honest replay — system works when each layer does its own verification, not when one layer trusts another's claim.

6. Bosnian Keyword List (Phase A Coverage)

Full regex from deployed hook:

import re

CLAIM_KEYWORDS = re.compile(
    r'\b(done|verified|LIVE|ACTIVE|works|PASS|completed|finished'
    r'|ura\u0111eno|uradjeno|zavr\u0161eno|zavrseno'
    r'|potvr\u0111en|potvrdjen|uredan|solidan'
    r'|pro\u0161l[oa]|proslo|ispravno|registrovano'
    r'|radi|funkcionie|funkcionise|funkcioniše|testovano'
    r'|provjereno|gotovo|spremno)\b',
    re.IGNORECASE
)

Note: funkcioniše includes Unicode \u0161 (š) — tested with manual fixture.

7. Known Limitations (Input for Phase B #100351)

~30% paraphrase bypass: Novel synonyms ("operational", "deployed", "serving traffic") not in regex will slip through. LLM-based semantic claim detection required for >90% coverage.
Mid-turn claim emission: Stop hook fires at turn complete. If the agent emits claim text mid-turn and makes a tool call later, the claim may be visible to the CEO before the hook fires.
Conversational claim without mc.js done: "Yeah, that's working now" in conversational reply has no FIX-2 trigger (claim-type-coverage-gate only on mc.js done/ready). Relies solely on FIX-1 Stop hook.
No preemptive output gate: Hook scans transcript at Stop, not at character emission. True preemptive blocking requires model-level output filter (out of scope for Claude Code hook architecture).

8. Architecture Lesson — Verification at Every Layer

"The hooks DETECTED the fabrication when Proveo did honest replay. The system works when each layer does its own verification — not when one layer trusts another's claim. Core architectural input to Phase B."

Implication: Phase B must NOT rely on agent self-report of compliance. Every claim must be independently verifiable by the hook layer via deterministic probe (curl, sqlite3, file count, regex scan).

9. Evidence Directories (Preserved for Audit)

/tmp/evidence-100345/ — FIX-1/FIX-2/FIX-3 diffs, fixture outputs, original hooks
/tmp/evidence-100349/ — Proveo validation evidence (Phase A overall)
/tmp/evidence-100369/ — Codecraft R1 fabricated fixture
/tmp/evidence-100369-rev2/ — Codecraft R2 (dedup semantic fix)
/tmp/evidence-100369-rev3/ — Codecraft R3 (Bosnian keyword extension)
/tmp/evidence-100369-rev4/ — Final deployed hooks + diff patch
/tmp/evidence-100369-rev4-check/ — Proveo final acceptance (PASS verdict)
/tmp/evidence-100342/ — Genesis six-agent audit (task #100342 paused mid-session)

10. Cross-Links

ZAKON NULA: ~/.claude/CLAUDE.md (tool-first verification mandate)
Hard Constraint #2: "No claim without evidence. L2+ machine-verified evidence before reporting to Alem."
ZAKON #21: Evidence-gate enforcement (mc.js done requires evidence file count)
ZAKON #25: Forge → Mehanik → Dispatch → Postflight pipeline
Phase B MC #100351: LLM-based semantic claim detection + preemptive output filter design

11. Deployment Status

session-output-validator.sh: LIVE at ~/.claude/hooks/session-output-validator.sh (Stop hook registered in ~/.claude/settings.json)
claim-type-coverage-gate.sh: LIVE at ~/.claude/hooks/claim-type-coverage-gate.sh (PreToolUse Bash hook registered)
boot.sh verdict banner: LIVE at ~/system/boot.sh (interactive path only)
Parent MC #100345: DONE 2026-05-11 14:18:56
Phase A validation MC #100349: DONE 2026-05-11 14:18 (Proveo 6/6 PASS)

MC #100342 — P1.A UAT (genesis six-agent audit, paused mid-session)
MC #100345 — Phase A parent (70% fix in <=4h)
MC #100346 — FIX-1 sync stop-hook (superseded by #100369)
MC #100347 — FIX-2 claim-type-coverage-gate
MC #100348 — FIX-3 validator→boot feedback closure (folded into #100369)
MC #100349 — Proveo validation (6/6 PASS)
MC #100350 — Skillforge runbook (this document)
MC #100351 — Phase B design (LLM semantic detection, >=90% coverage target)
MC #100369 — Final FIX-1 implementation (replaces #100346, includes FIX-3)

ZAKON #18B — Blueprint Liveness Enforcement

Meta: MC #99911 (Track 5c) | CEO Board 2026-05-12 | v1-authentic | Supersedes fabricated 255-line version

Genesis

ZAKON #18B was created via CEO Board deliberation (MC #99911) on 2026-05-12. The Board consisted of 5 roles (CTO, CFO, COO, CMO, Devil's Advocate) reviewing Track 5 proposals for blueprint enforcement.

Board Decision:

Track 5a (Pre-write blocker): APPROVED by CTO, COO, CFO. CMO abstained (out of domain). Devil's Advocate endorsed with caveat (remove skip-comment bypass).
Track 5c (ZAKON file - this document): CTO, CFO, COO voted YES. CMO abstained. Devil's Advocate endorsed authentic 49-line version as B2 "authentic ZAKON" path.
Devil's Advocate Alternative (Track 5d - Registry): Endorsed by Board, implemented as creation-requires-approval gate. See ZAKON Registry documentation.

Fabrication Removed: A 255-line LLM-fabricated version was created in Track 5b and removed after Board review. Evidence: /tmp/evidence-100462/fabricated-content-backup.md. Authentic file SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f.

Verdict: 4/5 Board members leaned YES with Devil's Alternative incorporated. Track 5a + 5c + 5d shipped as integrated system.

Why

Blueprint drift creates deploy risk. ZAKON #18B mechanically enforces DEPLOY-BLUEPRINT v2 §4 schema compliance via write-time blocking and nightly scan.

What (3 Layers + Registry)

Layer 1: PreToolUse Blocker (Track 5a #100461)

Hook: ~/.claude/hooks/blueprint-schema-validator-pre.sh

Registration: ~/.claude/settings.json PreToolUse Write|Edit|MultiEdit

Exit path: Line 177 exit 2 blocks disk write before tool executes

Layer 2: PostToolUse Auditor (existing)

Registration: PostToolUse same hook

Exit path: Line 177 exit 2 sends feedback AFTER write lands (cannot block)

CRITICAL: PostToolUse timing prevents disk write blocking. Only PreToolUse can block (per CTO + verifier).

Layer 3: Nightly Daemon

Script: ~/system/daemons/blueprint-fleet-watchdog.js (02:00 UTC)

Alerts: HiveMind if schema < 5/5 or last-verified > 30d

Registry Gate (Track 5d #100464)

ZAKON Registry blocks new zakon-*.md files without [CEO_APPROVED] token + MC reference in zakon-registry.json.

See: ZAKON Registry — Creation Requires Approval Gate

In-Scope File Globs

**/BUILD-BLUEPRINT.md
**/DEPLOY-MAP.md
~/system/rules/zakon-*.md

Escape Valve

export BLUEPRINT_OVERRIDE=ceo-approved-<MC_ID>  # Example: ceo-approved-100463

Skip-comment bypass REMOVED: weaponized pattern per Devil's Advocate. Env var is audit-logged and requires MC reference.

Implementation Status

Component	Status	MC Task	Evidence
PreToolUse Hook	✅ ACTIVE	#100461	~/.claude/hooks/blueprint-schema-validator-pre.sh
PostToolUse Hook	✅ ACTIVE	(existing)	Same hook, PostToolUse registration
Nightly Daemon	✅ ACTIVE	(existing)	~/system/daemons/blueprint-fleet-watchdog.js
Registry Gate	✅ ACTIVE	#100464	~/system/tools/zakon-registry-check.js

DEPLOY-BLUEPRINT v2 §4 — Schema specification
ZAKON Registry — Creation-requires-approval gate
MC #99911 — FAZA 4 enforcement genesis (CEO Board deliberation)
MC #100461 — Track 5a (Pre-write blocker implementation)
MC #100463 — Track 5c (ZAKON file authoring)
MC #100464 — Track 5d (Registry gate implementation)
ADR-026 — Hook architecture (PreToolUse vs PostToolUse timing)

File Location: ~/system/rules/zakon-blueprint-enforcement.md
SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f
Lines: 49
Published: 2026-05-12 21:29 UTC
First ZAKON to go through the registry gate system.

ZAKON Registry — Creation Requires Approval Gate

Meta: MC #100464 (Track 5d) | CEO Board 2026-05-12 | Devil's Advocate Alternative | v1.0

Genesis

The ZAKON Registry was created as the Devil's Advocate Alternative during MC #99911 CEO Board deliberation on 2026-05-12. It addresses the root concern: "Who watches the watchers?" — ensuring no agent (including Skillforge) can create new ZAKON rule files without explicit CEO approval.

Board Endorsement: All 5 Board members (CTO, CFO, COO, CMO, Devil's Advocate) endorsed the Registry concept as a necessary complement to enforcement hooks.

Design Principle: Fail-closed. If registry is missing or unparseable, all ZAKON writes are blocked with explicit fix instructions.

What It Does

The ZAKON Registry is a JSON-based ledger (~/system/rules/zakon-registry.json) that acts as a creation gate for all ZAKON rule files (~/system/rules/zakon-*.md).

Enforcement: Pre-write hook (blueprint-schema-validator-pre.sh) calls zakon-registry-check.js validate before any write to zakon-*.md files.

Exit Codes:

0 — PASS: File has approved registry entry
2 — BLOCK: File not registered OR status not approved OR missing [CEO_APPROVED] token
3 — BLOCK: Registry file missing/unparseable (fail-closed behavior)

Registry Schema

{
  "version": "1.0",
  "description": "Registry of all ZAKON rule files...",
  "policy": {
    "creation_gate": "Any write to ~/system/rules/zakon-*.md requires entry with status='approved-pending-author' or 'approved-live'.",
    "ceo_approval_token": "Literal string [CEO_APPROVED] must appear in matching MC task.",
    "fail_closed": "If registry missing/unparseable, BLOCK with explicit fix command.",
    "hook_integration": "blueprint-schema-validator-pre.sh must call: node ~/system/tools/zakon-registry-check.js validate $FILE_PATH"
  },
  "backfill_metadata": {
    "scan_date": "2026-05-12",
    "scan_path": "~/system/rules/zakon-*.md",
    "files_found": 3,
    "notes": "All pre-2026-05-12 ZAKONs grandfathered as legacy-pre-registry."
  },
  "registry": [
    {
      "zakon_id": "feasibility-check",
      "file_path": "~/system/rules/zakon-feasibility-check.md",
      "mc_task": null,
      "ceo_approved_token": "GRANDFATHERED-PRE-2026-05-12",
      "status": "legacy-pre-registry",
      "backfill_metadata": { ... }
    },
    ...
  ]
}
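The exit-code contract can be sketched as a pure function over the registry text. This is not the source of zakon-registry-check.js. The set of writable statuses is an assumption: the policy text names `approved-pending-author` and `approved-live`, while `active` and `legacy-pre-registry` are added here because the current registry uses them.

```javascript
// Statuses assumed to permit a write (see the caveat in the lead-in).
const WRITABLE = new Set([
  'approved-pending-author', 'approved-live', 'active', 'legacy-pre-registry',
]);

// Returns the exit code: 0 pass, 2 block, 3 registry error.
// A caller that cannot read the registry file at all would also return 3.
function validate(registryText, filePath) {
  let parsed;
  try {
    parsed = JSON.parse(registryText);
  } catch {
    return 3; // fail-closed: an unparseable registry blocks every write
  }
  if (!parsed || !Array.isArray(parsed.registry)) return 3;

  const entry = parsed.registry.find((e) => e.file_path === filePath);
  if (!entry) return 2;                       // not registered
  if (!WRITABLE.has(entry.status)) return 2;  // status not approved
  const grandfathered = entry.status === 'legacy-pre-registry';
  if (!grandfathered && entry.ceo_approved_token !== '[CEO_APPROVED]') return 2;
  return 0;
}
```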

Tool Usage

Validate (Hook Integration)

node ~/system/tools/zakon-registry-check.js validate ~/system/rules/zakon-example.md

Exit Codes: 0 = pass, 2 = blocked, 3 = registry error

Hook Integration: blueprint-schema-validator-pre.sh line ~75:

if [[ "$FILE" =~ ~/system/rules/zakon-.*\.md$ ]]; then
  node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE" || exit 2
fi

List All Entries

node ~/system/tools/zakon-registry-check.js list

Output: Human-readable list of all registry entries with status, MC task, and approval token.

Statistics

node ~/system/tools/zakon-registry-check.js stats

Output: Count of entries by status (legacy-pre-registry, active, approved-pending-author, etc.).

Current Registry State

As of 2026-05-12:

ZAKON ID	Status	MC Task	Approval Token
feasibility-check	legacy-pre-registry	N/A	GRANDFATHERED-PRE-2026-05-12
pi2-deploy-verification	legacy-pre-registry	N/A	GRANDFATHERED-PRE-2026-05-12
qa19-mapping	legacy-pre-registry	N/A	GRANDFATHERED-PRE-2026-05-12
blueprint-enforcement	active	99911	[CEO_APPROVED]

Total Entries: 4 (3 grandfathered legacy + 1 newly created via registry gate)

Backfill Manifest

On 2026-05-12, a backfill scan identified 3 pre-existing ZAKON files in ~/system/rules/:

zakon-feasibility-check.md — 84 lines, 3997 bytes
zakon-pi2-deploy-verification.md — 165 lines, 6412 bytes (referenced in CLAUDE.md)
zakon-qa19-mapping.md — 268 lines, 13811 bytes

Grandfathering Policy: All 3 files registered as legacy-pre-registry status with GRANDFATHERED-PRE-2026-05-12 token. This is an audit snapshot, NOT a CEO approval retroactively applied. Future edits to these files are allowed without re-approval (legacy status).

Adding New ZAKON Files

Process:

Create MC Task: Title must include "ZAKON" or "rule". Description must contain [CEO_APPROVED] token.
Update Registry: Add entry to ~/system/rules/zakon-registry.json with:
- zakon_id — Short identifier (e.g., "cost-ceiling")
- file_path — Full path with tilde notation
- mc_task — MC task ID
- ceo_approved_token — Must be [CEO_APPROVED]
- status — approved-pending-author
Author ZAKON File: Write hook will validate against registry. If entry exists with approved status, write proceeds.
Update Status: After file is authored and verified, update registry entry to status: "active" and add published_sha256.

Example Registry Entry:

{
  "zakon_id": "cost-ceiling",
  "file_path": "~/system/rules/zakon-cost-ceiling.md",
  "mc_task": 100500,
  "ceo_approved_token": "[CEO_APPROVED]",
  "ceo_approval_date": "2026-05-13",
  "ceo_approval_method": "CEO Board deliberation (MC #100500)",
  "status": "approved-pending-author",
  "notes": "Cost ceiling enforcement rule for multi-week projects"
}
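The final process step (flip the status to "active" and record published_sha256) can be sketched as follows. The helper name is hypothetical, and the refusal to promote from any status other than `approved-pending-author` is an assumption of the sketch.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: returns a copy of the registry entry marked active,
// with the SHA-256 of the authored file contents attached.
function promoteToActive(entry, fileContents) {
  if (entry.status !== 'approved-pending-author') {
    throw new Error(`cannot promote entry with status "${entry.status}"`);
  }
  const sha = crypto.createHash('sha256').update(fileContents).digest('hex');
  return { ...entry, status: 'active', published_sha256: sha };
}
```

In practice `fileContents` would come from reading the authored ZAKON file, and the returned entry would be written back to zakon-registry.json.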

Fail-Closed Behavior

If zakon-registry.json is missing or unparseable, the validation tool exits with code 3 and provides explicit fix instructions:

ZAKON_REGISTRY_ERROR: Registry file not found.
Expected: /Users/makinja/system/rules/zakon-registry.json
FIX: Create registry via MC #100464 or restore from backup.

Design Rationale: Fail-closed prevents silent bypass. If registry infrastructure is broken, ALL ZAKON writes are blocked until registry is restored.

Hook Integration Details

Hook File: ~/.claude/hooks/blueprint-schema-validator-pre.sh

Integration Point: After detecting zakon-*.md file pattern, hook calls:

node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE"
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
  exit 2  # Block write
fi

Registration: ~/.claude/settings.json PreToolUse hook for Write|Edit|MultiEdit actions.

Timing: PreToolUse timing ensures disk write is blocked before tool executes. PostToolUse cannot block writes (correction signal only).

ZAKON #18B — Blueprint Liveness Enforcement
MC #99911 — FAZA 4 enforcement genesis (CEO Board deliberation)
MC #100464 — Track 5d (Registry gate implementation)
ADR-026 — Hook architecture (PreToolUse vs PostToolUse timing)

Registry Location: ~/system/rules/zakon-registry.json
Tool Location: ~/system/tools/zakon-registry-check.js
Hook Integration: ~/.claude/hooks/blueprint-schema-validator-pre.sh
Version: 1.0
Current Entries: 4 (3 grandfathered + 1 active)
Published: 2026-05-12

LightRAG Tuning — 2026-05

Last Updated: 2026-05-12 (MC #100467)
Status: LIVE

Current Config (LIVE as of 2026-05-12 21:13)

Parameter	Value	Changed From
`cosine_threshold`	0.5	0.2
`related_chunk_number`	10	5
`enable_rerank`	false	(unchanged, deferred)

Why These Values

AgentForge audit (Chip Huyen lens, MC #100451) identified 2 quick-win retrieval optimizations:

Cosine 0.5: Industry standard for 1024-dim embeddings (bge-m3). Filters false-positive chunks that pollute LLM context with noise. Expected: 8-12% token savings per query.
Chunks 10: Broader context window for multi-faceted queries (e.g., "explain Pillar #9 DR strategy"). Reduces re-query loops when 5 chunks = incomplete answer. Expected: 6-10% fewer re-queries.
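To illustrate how the two parameters interact at query time, here is a toy selector. It is not LightRAG's internal code: the candidate shape (`{ id, score }`) and the inclusive `>=` comparison are assumptions.

```javascript
// Illustrative only. Each candidate is { id, score }, where score is the
// cosine similarity between the chunk embedding and the query embedding.
function selectChunks(candidates, { cosineThreshold, relatedChunkNumber }) {
  return candidates
    .filter((c) => c.score >= cosineThreshold) // drop weak matches
    .sort((a, b) => b.score - a.score)         // best first
    .slice(0, relatedChunkNumber);             // cap the context size
}
```

Raising the threshold shrinks the candidate pool while raising the chunk count widens the cap, so the net context size depends on how many chunks clear the threshold for a given query.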

Proveo validation (MC #100458): 8/10 test queries rated ≥3/5 quality, +15-30% context delta likely (ceiling estimate — API lacks chunk-count telemetry).

What We Did NOT Touch (and Why)

Forbidden changes until MC #100009 backlog stabilization ships:

embedding_batch_num: 10 — raising risks OOM on bge-m3 (already at memory ceiling)
max_parallel_insert: 2 — parallelism = more heap pressure
max_async: 4 — async I/O ceiling, won't help if bottleneck = compute
embedding_model switch (e.g., to smaller all-MiniLM-L6-v2) — would BREAK all existing embeddings, require full re-index

Reason: These params affect the ingest pipeline. LightRAG already has 121K doc backlog + memory pressure. Retrieval-tuning (cosine, chunks) is safe because it's query-time only.

Validation Summary

Proveo 10-query test suite (MC #100458):

Metric	Result
Queries with quality ≥3/5	8/10 (PASS threshold: 7/10)
HTTP 500 errors	0/10
Estimated context token delta	+15-30% (ceiling +40%, likely lower in practice)
Response quality by bucket	Product/code queries strongest (3.7/5 avg), process queries weakest (2.5/5 avg)

Proveo verdict: REQUEST_CHANGES (functional pass, but lacks chunk-count telemetry to machine-verify actual cost impact)

Open Work

MC #100467: This documentation (COMPLETE)
MC #100468: TEI reranker investigation (bge-reranker-base unavailable in Ollama) — highest ROI optimization (15-30% quality lift) deferred
MC #100469: API chunk-count telemetry (add chunks_retrieved to /query response for cost verification)

How to Verify Live State

curl -s http://localhost:9621/health | jq .configuration
# Look for: cosine_threshold=0.5, related_chunk_number=10, enable_rerank=false
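The same check can be scripted so a mismatch is reported explicitly. A sketch, assuming the `/health` response carries these three keys under `configuration` with numeric and boolean values (not strings); the function name is hypothetical.

```javascript
const EXPECTED = { cosine_threshold: 0.5, related_chunk_number: 10, enable_rerank: false };

// Returns a list of human-readable mismatches; an empty list means the live
// configuration matches the documented values.
function diffConfig(configuration, expected = EXPECTED) {
  const live = configuration || {};
  return Object.entries(expected)
    .filter(([key, want]) => live[key] !== want)
    .map(([key, want]) => `${key}: expected ${want}, got ${live[key]}`);
}

// Usage (Node 18+, which ships a global fetch), inside an async function:
//   const health = await (await fetch('http://localhost:9621/health')).json();
//   console.log(diffConfig(health.configuration));
```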

Evidence snapshots:

Before: /tmp/lightrag-baseline-100458-raw.json
After: /tmp/lightrag-postverify-100458.json

How to Revert (If Needed)

cd /Users/makinja/system/docker/lightrag

# Revert .env
sed -i '' '/# Retrieval Tuning/,+3d' .env

# Revert compose
git checkout docker-compose.yml  # or manual edit if not git-tracked

# Recreate container
docker compose down && docker compose up -d lightrag

# Verify restoration
curl -s http://localhost:9621/health | jq '.configuration.cosine_threshold, .configuration.related_chunk_number'
# Expected after rollback: 0.2, 5

ADR-026: ~/system/specs/adr-026-lightrag-tuning-2026-05-12.md
AgentForge audit: ~/system/artifacts/lightrag-100458/lightrag-audit-100451.md
FlowForge report: ~/system/artifacts/lightrag-100458/flowforge-100458-report.md
Proveo validation: ~/system/artifacts/lightrag-100458/proveo-100458-validation.md

Email-Reactor — Strategic-Inbox Auto-Triage Daemon

Why It Exists

Incident (2026-05-26): the CEO had to phone Asmir Merdžanović to learn that Asmir had sent a critical SEO partnership email three days earlier (email #8421, dated 2026-05-24). This email sat in the database with status 'new' for 72+ hours while we continued building the exact SEO automation partnership Asmir was offering.

"Nobody reads and reacts to emails. We have been trying to get this done for 4 months now. If we don't succeed, we can shut the company down."
— CEO Alem Basic, 2026-05-26, after discovering the Asmir email gap

Previous email systems (email-agent, email-briefing, inbox-queue) classified and queued emails, but no human acted on them. Email-Reactor solves this with a 3-step security-first pipeline that automatically creates Mission Control tasks with macOS push notifications for revenue-critical emails.

What It Does

Email-Reactor is a daemon that polls ~/system/databases/email-inbox.db every 5 minutes (via LaunchAgent no.alai.inbox-watcher) and processes every new email through a 3-step pipeline:

SECURITY SCAN (always first) — rule-based phishing/macro/spoof detection → quarantine on fail
KNOWN-CONTACT CHECK — parallel lookup in Paperless archive.alai.no correspondents + DB email history → if KNOWN, create MC task + push notification
LLM REVENUE CLASSIFIER (unknown senders only) — Qwen2.5-Coder 32B asks "Is this revenue-relevant?" → YES = MC task + push, NO = queue silently

Strategic override: VIP senders in ~/system/config/strategic-partners.json skip all steps and go straight to MC + push (tier-1 phone-grade urgency).

Architecture

flowchart LR
    A[Email arrives in DB] --> B{Strategic Partner?}
    B -- YES --> Z[Create MC + Push]
    B -- NO --> C[STEP 1: Security Scan]
    C -- FAIL --> Q[Quarantine + Alert]
    C -- PASS --> D{STEP 2: Known Contact?}
    D -->|"YES (Paperless/DB)"| Z
    D -- NO --> E{Newsletter/Transactional?}
    E -- YES --> N[No MC — Audit as llm_no]
    E -- NO --> F[STEP 3: LLM Classifier]
    F -- YES --> Z
    F -- NO --> N
    Q --> X[STOP]
    N --> X
    Z --> Y[Done]
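The decision order in the flowchart can be sketched as a single function. Every check is injected through a hypothetical `steps` object, so this makes no claim about how inbox-watcher.js implements each one. The checks are synchronous here for brevity; the real Paperless lookup and LLM call involve I/O and would be awaited.

```javascript
// Hypothetical orchestration of the routing order shown in the flowchart.
// Returns one of: 'mc_and_push', 'quarantine', 'llm_no'.
function routeEmail(email, steps) {
  if (steps.isStrategicPartner(email)) return 'mc_and_push';   // VIP override
  if (!steps.passesSecurityScan(email)) return 'quarantine';   // step 1
  if (steps.isKnownContact(email)) return 'mc_and_push';       // step 2
  if (steps.isNewsletterOrTransactional(email)) return 'llm_no';
  return steps.llmSaysRevenue(email) ? 'mc_and_push' : 'llm_no'; // step 3
}
```

Early returns mean the LLM is only consulted for unknown senders that survive every cheaper check, which matches the cost ordering described above.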

Components

Component	Path	Purpose
Watcher daemon	`~/system/tools/inbox-watcher.js`	738-line Node.js script, runs every 5 min
LaunchAgent	`~/Library/LaunchAgents/no.alai.inbox-watcher.plist`	Schedules daemon (StartInterval=300s)
Email DB	`~/system/databases/email-inbox.db`	SQLite, emails table, mc_task_id linkage
Strategic allowlist	`~/system/config/strategic-partners.json`	VIP senders (tier-1 = phone-grade), hot-reloaded
Audit log	`~/system/state/inbox-watcher-audit.log`	JSONL, every action: linked/llm_yes/llm_no/quarantine
Quarantine log	`~/system/state/inbox-watcher-quarantine.jsonl`	Security failures, phishing attempts
Ops watchdog	`~/system/config/ops-watchdog.json`	Lists no.alai.inbox-watcher in critical_services
Mission Control	`~/system/tools/mc.js`	Task creation, dedup detection, linkage

Routing Logic Detail

Step 1: Security Scan

Rule-based checks (no LLM cost):

Phishing keywords: "urgent password", "verify account", "bitcoin transfer", "lottery winner", "tax refund"
Suspicious URLs: unencrypted links (http://), suspicious TLDs (.tk, .ml, .ga, .cf)
Macro attachment hints: .docm, .xlsm, .scr, .exe, .lnk, .msi
Domain spoofing: sender name claims "PayPal" but email is @gmail.com

On failure: email goes to inbox-watcher-quarantine.jsonl, audit log records security_quarantine, processing STOPS (no MC, no push).
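A sketch of the rule-based scan using the keyword, TLD, and extension lists above. The email shape, the reason codes, and the function name are hypothetical, and the spoof rule covers only the PayPal/Gmail example given in the list.

```javascript
const PHISHING_KEYWORDS = ['urgent password', 'verify account', 'bitcoin transfer', 'lottery winner', 'tax refund'];
const SUSPICIOUS_TLDS = ['.tk', '.ml', '.ga', '.cf'];
const RISKY_EXTENSIONS = ['.docm', '.xlsm', '.scr', '.exe', '.lnk', '.msi'];

// Hypothetical email shape: { fromName, fromAddr, subject, body, attachments: [filename] }.
function securityScan(email) {
  const text = `${email.subject || ''} ${email.body || ''}`.toLowerCase();
  const reasons = [];

  if (PHISHING_KEYWORDS.some((k) => text.includes(k))) reasons.push('phishing_keyword');

  // Inspect the scheme and hostname of every URL in the text.
  for (const m of text.matchAll(/(https?):\/\/([^\s\/:?#]+)/g)) {
    if (m[1] === 'http') reasons.push('unencrypted_url');
    if (SUSPICIOUS_TLDS.some((tld) => m[2].endsWith(tld))) reasons.push('suspicious_tld');
  }

  const names = (email.attachments || []).map((n) => n.toLowerCase());
  if (names.some((n) => RISKY_EXTENSIONS.some((ext) => n.endsWith(ext)))) {
    reasons.push('macro_attachment');
  }

  // Spoof example from the list above: display name says PayPal, address is Gmail.
  if (/paypal/i.test(email.fromName || '') && /@gmail\.com$/i.test(email.fromAddr || '')) {
    reasons.push('domain_spoof');
  }
  return { pass: reasons.length === 0, reasons: [...new Set(reasons)] };
}
```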

Step 2: Known-Contact Check

Parallel signals (first match wins):

Strategic override: email matches strategic-partners.json (Asmir, SnowIT, paying clients) → immediate MC + push
Paperless Correspondents: HTTPS GET to https://archive.alai.no/api/correspondents/ with Bitwarden token + Cloudflare Access headers, searches by domain + sender name → if found, contact is KNOWN
DB email history: SQL query SELECT COUNT(*) FROM emails WHERE to_addr LIKE '%sender%' AND classification='OWN' → if we ever emailed this person, they're KNOWN

If KNOWN via any signal: create MC task, fire macOS push notification, audit log records source (override/paperless/db).

Step 3: LLM Revenue Classifier (unknown senders only)

Pre-filter heuristic (saves LLM tokens): detect obvious newsletters/transactional via regex patterns:

Transactional senders: no-reply@, noreply@, notification@, alert@, billing@, invoice@, receipt@, kontakt@fiken, support@stripe
Newsletter senders: newsletter@, digest@, news@, marketing@, promo@, tldr, naeringsliv, mail-list
Digest subject lines: "This week in", "Your weekly digest", "Daily digest", "Unsubscribe here", "View in browser", "Automated notification"

If heuristic matches: audit as llm_no with reason newsletter_heuristic or transactional_heuristic, no MC, STOP.
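The pre-filter can be approximated with three regexes transcribed from the lists above. The exact patterns in inbox-watcher.js may differ, and mapping digest subject lines to `newsletter_heuristic` is an assumption.

```javascript
const TRANSACTIONAL = /^(no-?reply|notification|alert|billing|invoice|receipt)@|^kontakt@fiken|^support@stripe/i;
const NEWSLETTER = /^(newsletter|digest|news|marketing|promo)@|tldr|naeringsliv|mail-list/i;
const DIGEST_SUBJECT = /this week in|your weekly digest|daily digest|unsubscribe here|view in browser|automated notification/i;

// Returns the audit reason for a heuristic hit, or null when the email
// should continue to the LLM classifier.
function preFilter(fromAddr, subject) {
  if (TRANSACTIONAL.test(fromAddr)) return 'transactional_heuristic';
  if (NEWSLETTER.test(fromAddr) || DIGEST_SUBJECT.test(subject)) return 'newsletter_heuristic';
  return null;
}
```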

LLM call (if heuristic passes):

Endpoint: http://10.0.0.2:11435/v1/chat/completions (MLX server on FORGE)
Model: mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (non-reasoning instruct model)
Timeout: 15 seconds
Prompt: "Is this a business opportunity, paying client request, partner inquiry, invoice, contract, or revenue-relevant? Answer YES or NO."
Temperature: 0.3 (0.1 on retry)
Max tokens: 32 (sufficient for terse YES/NO)
Response parsing: strict regex ^YES$|^NO$ — malformed = retry once with stricter prompt
Default on error/timeout: NO (conservative fail-safe — real opportunities arrive via KNOWN-CONTACT path)

YES → create MC task + push + audit llm_yes
NO → audit llm_no, no MC
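A sketch of the request body and the strict response parse, using the parameters listed above. The function names and the user-message layout are hypothetical. Trimming surrounding whitespace before applying the regex is also an assumption.

```javascript
// Builds an OpenAI-style chat request. `retry` selects the lower temperature.
function buildRequest(emailText, retry = false) {
  return {
    model: 'mlx-community/Qwen2.5-Coder-32B-Instruct-4bit',
    temperature: retry ? 0.1 : 0.3,
    max_tokens: 32,
    messages: [{
      role: 'user',
      content: 'Is this a business opportunity, paying client request, partner inquiry, ' +
        'invoice, contract, or revenue-relevant? Answer YES or NO.\n\n' + emailText,
    }],
  };
}

// Strict parse: only a bare YES or NO counts. Returns 'YES', 'NO', or null
// (malformed), in which case the caller would retry once.
function parseVerdict(response) {
  const content = response?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') return null;
  const answer = content.trim();
  return /^YES$|^NO$/.test(answer) ? answer : null;
}
```

Returning null for a missing or non-string `content` covers the case where a model leaves the field empty, so a malformed reply never turns into a task by accident.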

LLM Classifier Fix — 2026-06-22 (MC #102113)

Deployed live: 2026-06-22T08:49:43Z

Bugs fixed:

Wrong model ID: Code referenced gemma-4 which does not exist on FORGE MLX (11435) → HTTP 401 "Repository Not Found". Every LLM call failed and defaulted to NO.
Reasoning model + truncation: gemma-4-26b is a reasoning model that returns thinking in .message.reasoning and leaves .message.content null until reasoning completes. Code read .content with max_tokens: 5 → answer never landed → classifier always defaulted NO → unknown-sender revenue leads silently dropped.

Fix:

Endpoint: FORGE MLX 10.0.0.2:11435 (unchanged, it was already correct)
Model: mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (non-reasoning instruct model)
max_tokens: 32 (up from 5, sufficient for terse YES/NO with margin)
Reads .choices[0].message.content (standard OpenAI format)

Verification (3 independent layers, all 5/5 acceptance):

AgentForge build run: 4/5 LLM + case1 (GitHub CI) caught by upstream noise filter = 5/5 production
John independent curl re-run: newsletter NO, Fiken NO, cold-lead YES, Asmir YES; GitHub CI caught by /^notification[s]?[-.@]/i
Proveo independent QA (P2P): PASS — md5 unchanged pre-swap, syntax OK, diff logic-equivalent, 5/5 twice

Live deploy:

Backup: ~/system/tools/inbox-watcher.js.bak-102113-20260622-084943 (md5 47192c122a42de14eda9c2305016e420)
Live file: md5 ddd6c98c4af2b0e745594e05a7474f6e
Daemon: no.alai.inbox-watcher loaded, StartInterval 300s (wrapper re-execs each cycle, picks up swapped file automatically)

Known issues:

FORGE Ollama 11434 stalled (separate task) — classifier uses 11435 MLX instead
Intentional fail-OPEN on req.on("error") (MC #103835): if 11435 dies, unknown mail creates tasks (noise) rather than dropping leads — by design tradeoff

Evidence:

/tmp/evidence-102113/DEPLOY-RECORD-20260622.md (deploy record)
/tmp/evidence-102113/CLASSIFIER-BUG-DIAGNOSIS-20260622.md (root cause)
/tmp/evidence-102113/proveo-verify-102113.md (independent QA verdict PASS)
/tmp/evidence-102113/fix-dry-run-results.md (acceptance 5/5)

Push Path — Live State (MC #102077, 2026-06-08)

Status: WIRED + PROVEO PASS — Push path activated 2026-06-08. Validated by Proveo (Angie Jones lens). Proveo validation SHA256: d1f4999b.

Push Channel

All partner/reactor pushes go to Slack #ceo via:

node ~/system/tools/slack.js send ceo "<message>"

Note: There is no mm-bridge and no macOS push-notification for this path. The channel is exclusively Slack #ceo. The existing stale-SLA escalation in email-agent.js (~line 1394) also pushes #ceo for all ACTION emails at 24h/48h/72h/96h thresholds — that path is unchanged.

Allowlist — strategic-partners.json

File: ~/system/config/strategic-partners.json

Structure:

{
  "senders": [
    {
      "email": "asmirmc@gmail.com",
      "name": "Asmir Merdžanović",
      "tier": 1,
      "reason": "SEO partnership lead — tier-1 priority"
    }
  ],
  "domains": []
}

Matching rules (in matchStrategicPartner(fromAddr)):

Exact email match (case-insensitive) against senders[].email
Domain suffix match against domains[] entries
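The two matching rules can be sketched as below. This is an approximation, not the body of matchStrategicPartner(): the function name, the return shape for domain hits, and matching domains on a label boundary are assumptions.

```javascript
// `config` has the { senders: [...], domains: [...] } shape shown above.
// Returns the matching sender object, { domain } for a domain hit, or null.
function matchPartnerSketch(fromAddr, config) {
  const addr = fromAddr.trim().toLowerCase();
  const sender = (config.senders || []).find((s) => s.email.toLowerCase() === addr);
  if (sender) return sender;

  const domain = addr.slice(addr.lastIndexOf('@') + 1);
  const hit = (config.domains || []).find((d) => {
    const want = d.toLowerCase();
    // Match on a label boundary so "notsnowit.no" does not match "snowit.no".
    return domain === want || domain.endsWith(`.${want}`);
  });
  return hit ? { domain: hit } : null;
}
```

A plain `endsWith` on the raw domain would also accept look-alike domains, which is why the sketch requires either an exact match or a preceding dot.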

Current allowlist (as of 2026-06-08): asmirmc@gmail.com (Asmir Merdžanović, tier-1). Test senders removed by Proveo after validation.

How to Add a Strategic Partner

Open ~/system/config/strategic-partners.json
Append a new object to the senders array:

{
  "email": "partner@company.no",
  "name": "Partner Name",
  "tier": 1,
  "reason": "Business reason — e.g., paying client, key integration partner"
}

Save the file. No daemon reload needed — loadStrategicPartners() reads the file fresh on every ingest cycle.
To add a whole domain: append to the domains array instead (e.g., "snowit.no").

Trigger and Ingest Path

The push fires inside ~/system/daemons/email-agent.js at the ingest insert path (line ~2393):

New email row inserted into email-inbox.db (id assigned)
If dbCategory === 'ACTION' and not --dryRun: calls matchStrategicPartner(fromAddr)
If match found: calls setPartnerTier(id, tier) (sets partner_tier column) then fireReactorPush()
fireReactorPush() checks row.reactor_pushed_at — if already set, skips (dedup gate)
Push fires: node slack.js send ceo "[TIER-1 PARTNER] <name> emailed <account> — ..."
On success: calls markReactorPushed(id, tier) which sets reactor_pushed_at = NOW()
Rate-limit: at most 10 pushes per daemon cycle (REACTOR_CYCLE_LIMIT = 10, tracked via reactorPushedThisCycle Set)
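The dedup gate and rate limit from the steps above can be combined into one predicate. A sketch with a hypothetical function name; `row` stands for an emails row and `pushedThisCycle` for the per-cycle Set of already-pushed ids.

```javascript
const REACTOR_CYCLE_LIMIT = 10;

// Returns true only when a push is allowed for this row in this cycle.
function shouldFirePush(row, pushedThisCycle) {
  if (row.reactor_pushed_at) return false;                        // DB dedup gate
  if (pushedThisCycle.has(row.id)) return false;                  // same-cycle dedup
  if (pushedThisCycle.size >= REACTOR_CYCLE_LIMIT) return false;  // rate limit
  return true;
}

// Usage: after a successful push the caller records the id for this cycle,
// e.g. pushedThisCycle.add(row.id), and persists reactor_pushed_at.
```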

Schema Additions (email-inbox.db emails table)

Column	Type	Default	Purpose
`partner_tier`	INTEGER	0	0 = not a partner; 1+ = tier level from allowlist
`reactor_pushed_at`	TEXT	NULL	ISO timestamp of first push; NULL = not yet pushed; set = dedup gate (no re-push)

Indexes: idx_emails_partner_tier, idx_emails_reactor_pushed

New helper functions exported from email-inbox.js:

markReactorPushed(id, tier) — sets both partner_tier and reactor_pushed_at
setPartnerTier(id, tier) — sets partner_tier only (used at ingest time before push)
getReactorPending(hoursThreshold) — returns ACTION emails from partner/high-priority senders unanswered longer than N hours (used by digest)

Daily Digest

File: ~/system/tools/email-reactor-digest.js

LaunchAgent: ~/Library/LaunchAgents/com.john.email-reactor-digest.plist (fires daily at 08:00 local)

Behaviour:

Calls getReactorPending(6) — finds ACTION emails from partners OR high-priority senders that are unanswered for more than 6 hours
Formats two sections: Strategic Partner Emails / High-Priority Emails
Pushes a single digest message to Slack #ceo
Same-day dedup: state file ~/system/logs/email-reactor-digest-state.json stores last_sent_date; skips if already sent today unless --force is passed
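The same-day dedup rule reduces to a date comparison. A sketch with hypothetical function names, assuming `last_sent_date` is stored as a local-time `YYYY-MM-DD` string (the storage format is not stated above).

```javascript
// Formats a Date as YYYY-MM-DD in the machine's local time zone.
function localDate(d) {
  const pad = (n) => String(n).padStart(2, '0');
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

// `state` is the parsed state file, or null when the file does not exist yet.
function shouldSendDigest(state, now = new Date(), force = false) {
  if (force) return true; // --force overrides the dedup
  return !state || state.last_sent_date !== localDate(now);
}
```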

Manual usage:

# Dry run (no push, shows what would be sent)
node ~/system/tools/email-reactor-digest.js --dry-run

# Force re-send even if already sent today
node ~/system/tools/email-reactor-digest.js --force

# Check LaunchAgent
launchctl list | grep email-reactor-digest

Dedup — Three Independent Layers

Layer	Mechanism	Scope
1. Ingest cycle Set	`reactorPushedThisCycle` (in-memory Set, cleared each cycle)	Within a single 5-min daemon run
2. DB timestamp	`reactor_pushed_at` column — if set, `fireReactorPush()` returns immediately	Permanent — survives restarts
3. Digest date file	`last_sent_date` in `email-reactor-digest-state.json`	Once per calendar day

Proveo Validation Evidence (2026-06-08)

Check	Result	Notes
email-inbox.js columns + helpers	PASS	Syntax OK; exports confirmed; SHA256 `39f67c25`
email-agent.js reactor wired into insert path	PASS	Syntax OK; line 2393 confirmed; SHA256 `f27fc932`
email-reactor-digest.js exists	PASS	6215 bytes; syntax OK; SHA256 `6e63a2e9`
LaunchAgent loaded (launchctl)	PASS	`com.john.email-reactor-digest` active; StartCalendarInterval Hour=8
Push fired to #ceo (independent test)	PASS	Receipt: ✓ Sent to #ceo (Proveo row id=9218)
Dedup — reactor_pushed_at set, no re-push	PASS	Second cycle skips; confirmed via code + DB
Digest push to #ceo	PASS	50 items; Receipt: ✓ Sent to #ceo
Digest same-day dedup	PASS	"Already sent today — skipping"
19-account ingest not regressed	PASS	COUNT(email_accounts)=19; all last_checked 2026-06-08
Test senders cleaned from allowlist	PASS	Only asmirmc@gmail.com remains; SHA256 `289922b8`
No push storm	PASS	3 independent dedup layers confirmed

Overall Proveo verdict: PASS. Blocker items: none.

Audit Log Codes

Action	Meaning	MC Created?
`linked`	Known contact, MC task created (first time)	YES
`relinked_via_dedup`	Duplicate MC task found, linked to existing (no new push)	NO (existing)
`security_quarantine`	Failed security scan (phishing/macro/spoof)	NO
`llm_yes`	LLM classified as revenue-relevant	YES
`llm_no`	LLM classified as NOT revenue-relevant (or heuristic match)	NO
`newsletter_heuristic`	Pre-LLM heuristic detected newsletter/digest	NO
`transactional_heuristic`	Pre-LLM heuristic detected automated notification/billing	NO
`dry_run`	--dry-run mode, would have created MC	NO (test mode)
`create_failed`	mc.js add command failed	NO (error)
`update_failed`	DB update (mc_task_id linkage) failed	YES (orphaned)

Debug Runbook

Query Audit Log

# Last 50 actions
tail -50 ~/system/state/inbox-watcher-audit.log | jq .

# Count actions by type (entries stamped with today's UTC date)
grep "$(date -u +%Y-%m-%d)" ~/system/state/inbox-watcher-audit.log | \
  jq -r .action | sort | uniq -c | sort -rn

# Find specific email
grep '"email_id":8421' ~/system/state/inbox-watcher-audit.log | jq .

Query Quarantine Log

# Show all quarantined emails
cat ~/system/state/inbox-watcher-quarantine.jsonl | jq .

# Count by reason
cat ~/system/state/inbox-watcher-quarantine.jsonl | jq -r .reason | sort | uniq -c

Check Reactor Push State

# All emails that were partner-pushed
sqlite3 ~/system/databases/email-inbox.db \
  "SELECT id, from_addr, subject, partner_tier, reactor_pushed_at FROM emails WHERE partner_tier > 0 ORDER BY reactor_pushed_at DESC LIMIT 20;"

# Pending reactor pushes (ACTION emails from partners not yet pushed)
sqlite3 ~/system/databases/email-inbox.db \
  "SELECT id, from_addr, subject, classification FROM emails WHERE partner_tier > 0 AND reactor_pushed_at IS NULL;"

# Digest state (last sent date)
cat ~/system/logs/email-reactor-digest-state.json

Manual Trigger (Dry-Run)

node ~/system/tools/inbox-watcher.js --dry-run

Shows what would happen without creating tasks or updating DB.

Manual Trigger (Live)

node ~/system/tools/inbox-watcher.js

Check Daemon Status

launchctl list | grep inbox-watcher
launchctl list | grep email-reactor-digest

Expected output: no.alai.inbox-watcher with recent PID; com.john.email-reactor-digest with PID - (correct for CalendarInterval — fires at 08:00 only).

Restart Daemon

launchctl unload ~/Library/LaunchAgents/no.alai.inbox-watcher.plist
launchctl load ~/Library/LaunchAgents/no.alai.inbox-watcher.plist

Tail Daemon Logs

tail -f ~/system/logs/inbox-watcher.out.log
tail -f ~/system/logs/inbox-watcher.err.log
tail -f ~/system/logs/email-reactor-digest.log

Check Email DB for Pending

sqlite3 ~/system/databases/email-inbox.db <<EOF
SELECT id, from_addr, subject, status, created_at
FROM emails
WHERE mc_task_id IS NULL
  AND status = 'new'
  AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC
LIMIT 20;
EOF

Failure Modes & Alerts

Failure	Symptom	Alert Mechanism	Recovery
Daemon crash	`launchctl list` shows no PID	ops-watchdog auto-restart (critical_services)	Auto (watchdog), or manual reload plist
Paperless 401	Log shows "HTTP 401"	WARN in out.log, no Slack (non-blocking)	Refresh Bitwarden /tmp/bw-session token
FORGE LLM endpoint down (MLX, port 11435)	LLM timeout 15s	Log WARN, defaults to NO (safe)	SSH to FORGE, restart the MLX server
MC duplicate flood	Many relinked_via_dedup in audit	None (expected behavior)	Normal — dedup prevents task spam
DB locked	SQLite BUSY error	ERROR in err.log	Wait 5min (next cycle), or restart daemon
Strategic override miss	VIP email not getting Slack push	CEO notices delay	Verify strategic-partners.json email exact match (case-insensitive); check reactor_pushed_at not already set from an old test row
Slack push fails	No receipt in logs; no #ceo message	WARN in email-agent.log	Check slack.js connectivity; verify Slack token in config
Digest not firing at 08:00	No digest in #ceo after 08:10	None (silent)	Run manually: `node ~/system/tools/email-reactor-digest.js --force`; check plist loaded via launchctl

Known Limitations

LLM is safety net, not primary path. Real opportunities should arrive via KNOWN-CONTACT (Paperless correspondents + DB history). LLM classifier is conservative: defaults to NO on error to avoid false-positive task spam. If a genuine new opportunity is missed by LLM, it will appear in email DB and CEO can manually promote to MC.
Paperless lookup is best-effort. If Bitwarden token expires or Cloudflare Access headers are missing, Paperless signal fails silently and daemon falls back to DB-history-only KNOWN check. This is by design (non-blocking).
Default NO on malformed LLM response. Policy changed 2026-05-26 after 6 false positives from verbose LLM responses. Strict regex parsing + retry ensures only clean YES/NO answers create tasks. This may miss 1 real opportunity but prevents 6 noise tasks.
No auto-reply generation. Out of scope for Phase 2. Email-Reactor creates MC tasks; human writes replies.
30-day recency filter. Only processes emails from last 30 days to avoid re-scanning old newsletter backlog every 5-min cycle. Older emails must be manually triaged.
No per-account allowlists. The daemon queries all accounts in email-inbox.db, and strategic-partners.json does not differentiate by account. Future: add account-specific allowlists if needed.
Reactor push is email-agent ingest only. The push fires on fresh ingest in email-agent.js. It does NOT retroactively push emails already in the DB from before MC #102077. Historical partner emails must be found via digest or manual DB query.

References

MC #102077 — Push path wiring (Slack #ceo via slack.js) — COMPLETE 2026-06-08
MC #102113 — LLM classifier fix (model + token budget) — DEPLOYED LIVE 2026-06-22
Incident email: #8421 (Asmir Merdžanović, 2026-05-24)
Peer review: /tmp/alai/p2p-pairing-evidence/mesh-thr-102113-peer-ask.md
Build evidence: /tmp/evidence-102077/flowforge-build.md
Proveo validation: /tmp/evidence-102077/proveo-validation.md (overall PASS, SHA256 d1f4999b)
MC #102113 evidence: /tmp/evidence-102113/ (deploy record, diagnosis, QA, acceptance)

Authored by: Skillforge (ALAI knowledge management)
Document type: Runbook + Architecture
Audience: Future John during 3am incident
Last updated: 2026-06-22 (MC #102113 LLM classifier fix deployed)
