Infrastructure

Deployment architecture, CI/CD, environments, IaC, monitoring, disaster recovery

Deployment Architecture


Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Overview

System: {{PROJECT_NAME}} Cloud Provider: {{CLOUD_PROVIDER}} Provider Rationale: {{RATIONALE}} Architecture Pattern: {{PATTERN}}


2. Infrastructure Topology

graph TB
    subgraph Internet
        USER[End Users]
        CDN[CDN / CloudFront]
    end

    subgraph Public Subnet
        ALB[Application Load Balancer]
        BASTION[Bastion Host]
    end

    subgraph Private Subnet - App
        APP1[App Server 1]
        APP2[App Server 2]
    end

    subgraph Private Subnet - Data
        DB_PRIMARY[(Primary DB)]
        DB_REPLICA[(Read Replica)]
        CACHE[Redis Cache]
    end

    subgraph Isolated Subnet
        SECRETS[Secrets Manager]
        BACKUP[Backup Storage]
    end

    USER --> CDN
    CDN --> ALB
    ALB --> APP1
    ALB --> APP2
    APP1 --> DB_PRIMARY
    APP2 --> DB_PRIMARY
    APP1 --> CACHE
    DB_PRIMARY --> DB_REPLICA
    APP1 --> SECRETS

3. Networking Architecture

3.1 VPC / VNET Design

Network CIDR Purpose
VPC / VNET {{CIDR_VPC}} Main network boundary
Public Subnet A {{CIDR_PUB_A}} Load balancers, NAT gateways
Public Subnet B {{CIDR_PUB_B}} Load balancers, NAT gateways (AZ-B)
Private Subnet A {{CIDR_PRIV_A}} Application servers
Private Subnet B {{CIDR_PRIV_B}} Application servers (AZ-B)
Isolated Subnet A {{CIDR_ISO_A}} Databases, secrets
Isolated Subnet B {{CIDR_ISO_B}} Databases, secrets (AZ-B)
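
The CIDR placeholders above can be sanity-checked before they are committed to IaC. The following is a minimal sketch, assuming all six subnets are the same size and are carved in order from the VPC range; the subnet names and the example prefix lengths are illustrative, not part of this template.

```python
import ipaddress
from itertools import islice

# Hypothetical names mirroring the six subnet rows in the table above.
SUBNET_NAMES = [
    "public-a", "public-b",
    "private-a", "private-b",
    "isolated-a", "isolated-b",
]


def plan_subnets(vpc_cidr: str, new_prefix: int) -> dict:
    """Carve the first six equally sized subnets out of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(islice(vpc.subnets(new_prefix=new_prefix), len(SUBNET_NAMES)))
    if len(blocks) < len(SUBNET_NAMES):
        raise ValueError(
            f"{vpc_cidr} cannot hold {len(SUBNET_NAMES)} /{new_prefix} subnets"
        )
    return dict(zip(SUBNET_NAMES, blocks))


def overlapping_pairs(subnets: dict) -> list:
    """Return every pair of subnet names whose ranges overlap."""
    items = list(subnets.items())
    return [
        (name_a, name_b)
        for i, (name_a, net_a) in enumerate(items)
        for name_b, net_b in items[i + 1:]
        if net_a.overlaps(net_b)
    ]


plan = plan_subnets("10.0.0.0/16", 20)
for name, network in plan.items():
    print(f"{name}: {network}")
```

Running `overlapping_pairs` against hand-written CIDR values is the more useful check in practice, since manually assigned ranges are where overlaps tend to creep in.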

3.2 Load Balancer Configuration

Parameter Value
Type {{LB_TYPE}}
Protocol HTTPS (TLS 1.2+)
SSL Termination At load balancer
Health Check Path {{HEALTH_CHECK_PATH}}
Health Check Interval {{INTERVAL}}s
Unhealthy Threshold {{THRESHOLD}} consecutive failures
Idle Timeout {{TIMEOUT}}s
Stickiness {{STICKINESS}}

3.3 DNS Architecture

Record Type Value TTL
{{DOMAIN}} A / ALIAS Load Balancer {{TTL}}
api.{{DOMAIN}} CNAME API Load Balancer {{TTL}}
cdn.{{DOMAIN}} CNAME CDN Distribution {{TTL}}

DNS Provider: {{DNS_PROVIDER}} Failover Strategy: {{FAILOVER_STRATEGY}}

3.4 CDN Configuration

Parameter Value
Provider {{CDN_PROVIDER}}
Origin {{CDN_ORIGIN}}
Cache Behaviors Static assets: 1yr, API: no-cache, HTML: 5min
HTTPS Only Yes
WAF Integration {{WAF_INTEGRATION}}
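
The cache behaviors row (static assets 1 year, API no-cache, HTML 5 minutes) could translate into origin response headers roughly as follows. This is a sketch only: the `/api/` path prefix and the list of static file extensions are assumptions, and a CDN may apply these rules through its own behavior configuration instead.

```python
# Assumed set of static asset extensions; adjust to the real build output.
STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".svg", ".woff2")

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000
FIVE_MINUTES_SECONDS = 5 * 60          # 300


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a request path."""
    if path.startswith("/api/"):  # assumed API prefix
        return "no-cache"
    if path.endswith(STATIC_EXTENSIONS):
        return f"public, max-age={ONE_YEAR_SECONDS}, immutable"
    # Everything else is treated as HTML.
    return f"public, max-age={FIVE_MINUTES_SECONDS}"


print(cache_control_for("/assets/app.3f9a.js"))
```

A one-year lifetime for static assets is only safe when file names are content-hashed, so that a new build produces new URLs.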

4. Compute

4.1 Container Orchestration

Platform: {{ORCHESTRATION}}

Component Configuration Notes
Cluster {{CLUSTER_SPEC}}
Node Groups {{NODE_GROUPS}}
Min Nodes {{MIN_NODES}}
Max Nodes {{MAX_NODES}}
Node Size {{NODE_SIZE}}
Container Registry {{REGISTRY}}

4.2 Serverless Functions

Function Trigger Memory Timeout Purpose
{{FUNCTION_1}} {{TRIGGER}} {{MEMORY}}MB {{TIMEOUT}}s {{PURPOSE}}

4.3 Instance Sizing & Auto-Scaling

Service Instance Type Min Max Scale Trigger
{{SERVICE}} {{INSTANCE}} {{MIN}} {{MAX}} CPU > {{CPU}}% for {{DURATION}}min

Scale-Out Policy: {{SCALE_OUT}} Scale-In Policy: {{SCALE_IN}} Scale-In Cooldown: {{COOLDOWN}}min
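
To make the scale trigger and cooldown concrete, here is a minimal sketch of the decision a scaler might take once per minute. It assumes one CPU sample per minute, single-step scaling, and a separate scale-in threshold; the `scale_in_threshold` field and all example numbers are hypothetical and not defined by this template.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_instances: int
    max_instances: int
    cpu_threshold: float       # scale out when CPU % stays above this
    duration_min: int          # consecutive minutes required (>= 1)
    scale_in_threshold: float  # assumption: scale in when CPU % stays below this
    cooldown_min: int          # minimum minutes between scale-in actions


def desired_count(policy: ScalingPolicy, current: int,
                  cpu_samples: list, minutes_since_scale_in: int) -> int:
    """Return the instance count to run next.

    cpu_samples holds one average CPU reading per minute, oldest first.
    """
    window = cpu_samples[-policy.duration_min:]
    if len(window) < policy.duration_min:
        return current  # not enough data to decide yet
    if all(sample > policy.cpu_threshold for sample in window):
        return min(current + 1, policy.max_instances)
    if (all(sample < policy.scale_in_threshold for sample in window)
            and minutes_since_scale_in >= policy.cooldown_min):
        return max(current - 1, policy.min_instances)
    return current


policy = ScalingPolicy(min_instances=2, max_instances=6, cpu_threshold=70,
                       duration_min=5, scale_in_threshold=30, cooldown_min=10)
print(desired_count(policy, 2, [82, 85, 90, 88, 91], minutes_since_scale_in=60))
```

Applying the cooldown only to scale-in keeps the system quick to add capacity and slow to remove it, which matches the separate scale-in cooldown field above.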


5. Storage

5.1 Database Hosting

Database Engine Version Hosting Instance Storage HA
{{DB_NAME}} {{ENGINE}} {{VERSION}} {{HOSTING}} {{INSTANCE}} {{STORAGE}}GB {{HA}}

Connection Pooling: {{POOL_TOOL}}
Max Connections: {{MAX_CONN}}
Connection String: Stored in {{SECRET_LOCATION}} (never hardcoded)

5.2 Object Storage

Bucket / Container Purpose Access Lifecycle Encryption
{{BUCKET_NAME}} {{PURPOSE}} {{ACCESS}} {{LIFECYCLE}} AES-256

5.3 File Storage

Storage Type Mount Point Purpose Size
{{STORAGE_NAME}} {{TYPE}} {{MOUNT}} {{PURPOSE}} {{SIZE}}GB

6. Security

6.1 Network Security Groups / Firewall Rules

Security Group Direction Port Protocol Source / Destination Purpose
sg-alb Inbound 443 TCP 0.0.0.0/0 HTTPS from internet
sg-alb Outbound {{APP_PORT}} TCP sg-app Forward to app
sg-app Inbound {{APP_PORT}} TCP sg-alb From load balancer
sg-app Outbound {{DB_PORT}} TCP sg-db Database access
sg-db Inbound {{DB_PORT}} TCP sg-app From application only

6.2 WAF Configuration

WAF Provider: {{WAF_PROVIDER}}

Rule Group Purpose Action
AWSManagedRulesCommonRuleSet OWASP Top 10 Block
AWSManagedRulesSQLiRuleSet SQL injection Block
AWSManagedRulesKnownBadInputsRuleSet Known bad inputs Block
Rate limiting {{RATE_LIMIT}} req/5min per IP Count → Block

6.3 Secrets Management

Secret Store: {{SECRET_STORE}}

Secret Rotation Schedule Access
Database credentials 90 days App role only
API keys (third-party) On compromise App role only
TLS certificates 60 days before expiry Deploy role only
JWT signing key 365 days Auth service only

6.4 IAM Roles & Policies

Role Trusted By Key Permissions Purpose
{{APP_ROLE}} EC2 / ECS Task secretsmanager:GetSecretValue, s3:GetObject Application runtime
{{DEPLOY_ROLE}} CI/CD ecr:PutImage (plus the ECR layer-upload actions), ecs:UpdateService Deployments
{{BACKUP_ROLE}} Lambda / Cron rds:CreateDBSnapshot, s3:PutObject Backups

7. Cost Estimation

Component Service Spec Est. Monthly Cost
Compute {{SERVICE}} {{SPEC}} ${{COST}}
Database {{SERVICE}} {{SPEC}} ${{COST}}
Load Balancer {{SERVICE}} {{SPEC}} ${{COST}}
CDN {{SERVICE}} {{TRAFFIC}}GB transfer ${{COST}}
Storage {{SERVICE}} {{CAPACITY}}GB ${{COST}}
Monitoring {{SERVICE}} {{METRICS}} metrics ${{COST}}
Total ${{TOTAL}}

Cost Optimization Notes:


8. High Availability Design

Component HA Strategy Failover Time Notes
Application Multi-AZ, N+1 instances Immediate (ELB health check)
Database Multi-AZ with auto-failover 60-120 seconds DNS propagation
Cache Cluster mode / Replication 30 seconds Redis Sentinel
CDN Global edge network Transparent Provider HA

RTO Target: {{RTO}} minutes RPO Target: {{RPO}} minutes


9. Multi-Region Considerations

Current: {{REGION_STRATEGY}} Primary Region: {{PRIMARY_REGION}} Secondary Region: {{SECONDARY_REGION}}

Rationale: {{MULTI_REGION_RATIONALE}}

Data Replication: {{REPLICATION_STRATEGY}} Failover Procedure: See disaster-recovery-plan.md



Approval

Role Name Date Signature
Author
Reviewer
Approver

Environment Configuration


Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Environment Overview

Environment Purpose URL Access Managed By
Local Developer workstation localhost Developer Individual
Dev Integration, daily builds dev.{{DOMAIN}} Team + CI Platform team
Staging Pre-production validation staging.{{DOMAIN}} Team + QA + PM Platform team
Production Live system {{DOMAIN}} Ops only Platform team
Preview Feature branch review {{BRANCH}}.preview.{{DOMAIN}} Team + Stakeholders CI/CD

2. Per-Environment Configuration

2.1 Development Environment

Parameter Value Notes
Log level DEBUG Verbose logging for development
Database dev-db.{{INTERNAL_DOMAIN}} Shared dev DB, refreshed weekly
Cache dev-redis.{{INTERNAL_DOMAIN}} Shared Redis, no persistence
Email Mailtrap / fake SMTP Emails not delivered to real recipients
Payments Sandbox / test mode No real transactions
Feature flags All enabled Developers can test unreleased features
Debug tools Enabled Profiler, debug toolbar, etc.
Rate limiting Disabled Developer convenience
Auto-migrations Enabled Runs on startup

2.2 Staging Environment

Parameter Value Notes
Log level INFO Same as production
Database staging-db.{{INTERNAL_DOMAIN}} Isolated staging DB, production-scale
Cache staging-redis.{{INTERNAL_DOMAIN}} Dedicated Redis
Email staging@{{DOMAIN}} Sends to internal test inboxes only
Payments Sandbox / test mode No real transactions
Feature flags Mirrors production + staged features
Debug tools Disabled Must match production behavior
Rate limiting Enabled Same limits as production
Data refresh Weekly from production (anonymized) See data refresh runbook

Intentional staging/production differences:

2.3 Production Environment

Parameter Value Notes
Log level INFO INFO and above (no DEBUG or TRACE)
Database {{PROD_DB_HOST}} See secrets manager
Cache {{PROD_REDIS_HOST}} Clustered Redis
Email {{EMAIL_PROVIDER}} Real delivery via SES/Sendgrid/etc.
Payments Live mode Real transactions
Feature flags Conservative — tested features only New features behind flags
Debug tools Disabled Security requirement
Rate limiting Enabled See rate limit table
HSTS Enabled (1 year, includeSubDomains)
CSP Strict See security headers config

2.4 Preview / Feature Environments

Trigger: Pull request opened against main / develop
Lifetime: Active while PR is open; destroyed on PR close
URL Pattern: {{BRANCH_SLUG}}.preview.{{DOMAIN}}
Database: Ephemeral copy (seeded from fixture data, not production)
Teardown: Automated, triggered by PR close webhook

Parameter Value
Log level DEBUG
Email Fake SMTP / preview inbox
Payments Sandbox
Feature flags Branch-specific flags enabled
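
Branch names often contain characters that are not valid in a hostname, so the branch slug in the preview URL pattern needs a normalization step. A minimal sketch, assuming lowercase alphanumerics and hyphens only and the 63-character DNS label limit; the function names and the example domain are illustrative.

```python
import re


def branch_slug(branch: str, max_len: int = 63) -> str:
    """Turn a Git branch name into a DNS-safe label."""
    # Collapse every run of disallowed characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # A DNS label is at most 63 characters and must not end with a hyphen.
    slug = slug[:max_len].rstrip("-")
    if not slug:
        raise ValueError(f"branch name {branch!r} has no usable characters")
    return slug


def preview_url(branch: str, domain: str) -> str:
    """Build the preview URL for a branch."""
    return f"https://{branch_slug(branch)}.preview.{domain}"


print(preview_url("feature/JIRA-123_New Checkout!", "example.com"))
```

Two long branch names that share their first 63 characters would collide after truncation; appending a short hash of the full branch name is one way to avoid that.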

3. Environment Variables Reference

| Variable | Description | Required | Default | Sensitive | Environments |
|---|---|---|---|---|---|
| NODE_ENV | Runtime environment | Yes | development | No | All |
| PORT | HTTP server port | Yes | 3000 | No | All |
| DATABASE_URL | PostgreSQL connection string | Yes | | Yes | All |
| REDIS_URL | Redis connection string | Yes | redis://localhost:6379 | Yes | All |
| JWT_SECRET | JWT signing key | Yes | | Yes | All |
| JWT_EXPIRY | Token expiry duration | Yes | 1h | No | All |
| SMTP_HOST | SMTP server hostname | Yes | | No | All |
| SMTP_USER | SMTP username | Yes | | Yes | All |
| SMTP_PASS | SMTP password | Yes | | Yes | All |
| S3_BUCKET | Object storage bucket name | Yes | | No | All |
| AWS_REGION | Cloud region | Yes | eu-west-1 | No | All |
| SENTRY_DSN | Error tracking DSN | No | | Yes | Staging, Prod |
| STRIPE_KEY | Payment API key | Yes (if payments) | | Yes | All |
| LOG_LEVEL | Logging verbosity | No | info | No | All |
| RATE_LIMIT_WINDOW | Rate limit window (ms) | No | 60000 | No | All |
| RATE_LIMIT_MAX | Max requests per window | No | 100 | No | All |
| FEATURE_FLAG_KEY | Feature flag SDK key | No | | Yes | All |
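
A startup check that fails fast on missing variables keeps a misconfigured environment from booting half-working. The sketch below encodes the Required, Default and Sensitive columns of the table above; it is written in Python purely for illustration (the variable names suggest a Node.js runtime), and the function names are hypothetical.

```python
import os

# Required variables that have no default in the reference table.
REQUIRED_WITHOUT_DEFAULT = [
    "DATABASE_URL", "JWT_SECRET", "SMTP_HOST",
    "SMTP_USER", "SMTP_PASS", "S3_BUCKET",
]
# Optional (or conditionally required) variables with no default.
OPTIONAL_WITHOUT_DEFAULT = ["SENTRY_DSN", "STRIPE_KEY", "FEATURE_FLAG_KEY"]
DEFAULTS = {
    "NODE_ENV": "development", "PORT": "3000",
    "REDIS_URL": "redis://localhost:6379", "JWT_EXPIRY": "1h",
    "AWS_REGION": "eu-west-1", "LOG_LEVEL": "info",
    "RATE_LIMIT_WINDOW": "60000", "RATE_LIMIT_MAX": "100",
}
SENSITIVE = {
    "DATABASE_URL", "REDIS_URL", "JWT_SECRET", "SMTP_USER",
    "SMTP_PASS", "SENTRY_DSN", "STRIPE_KEY", "FEATURE_FLAG_KEY",
}


def load_config(env=None) -> dict:
    """Merge the environment over the defaults, failing on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_WITHOUT_DEFAULT if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    config = dict(DEFAULTS)
    for name in (*DEFAULTS, *REQUIRED_WITHOUT_DEFAULT, *OPTIONAL_WITHOUT_DEFAULT):
        if env.get(name):
            config[name] = env[name]
    return config


def redact(config: dict) -> dict:
    """Return a copy that is safe to print or log."""
    return {k: ("***" if k in SENSITIVE else v) for k, v in config.items()}
```

Logging `redact(load_config())` at startup shows which values are in effect without exposing any variable the table marks as sensitive.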

Rules:


4. Secrets Management

4.1 Secret Storage Solution

Solution: {{SECRET_TOOL}}

Environment Secret Store Access Method
Local .env file (never committed) Developer managed
Dev {{DEV_SECRET_STORE}} CI/CD service account
Staging {{STG_SECRET_STORE}} IAM role / service account
Production {{PROD_SECRET_STORE}} IAM role / service account

4.2 Secret Rotation Schedule

Secret Type Rotation Schedule Automated Owner
Database passwords 90 days {{AUTOMATED}} Platform team
API keys (internal) 365 days No Service owner
API keys (third-party) On compromise No Dev lead
JWT signing keys 365 days No Platform team
TLS certificates 60 days before expiry {{AUTOMATED}} Platform team

4.3 Access Controls

Role Dev Secrets Staging Secrets Production Secrets
Developer Read/Write Read No access
DevOps Read/Write Read/Write Read/Write
CI/CD (build) Read Read No access
CI/CD (deploy) No access Read Read
Application runtime Read (scoped) Read (scoped) Read (scoped)

5. Feature Flags Per Environment

Tool: {{FF_TOOL}}

Flag Dev Staging Production Notes
feature-new-checkout On On Off Waiting for QA sign-off
feature-dark-mode On On Off Rollout planned {{DATE}}
kill-switch-payments Off Off Off Emergency disable only
maintenance-mode Off Off Off Emergency only

6. Database Configuration Per Environment

Parameter Local Dev Staging Production
Host localhost {{DEV_DB}} {{STG_DB}} {{PROD_DB}}
Port 5432 5432 5432 5432
Database name {{APP}}_dev {{APP}}_dev {{APP}}_staging {{APP}}_prod
Max connections 10 25 50 {{PROD_CONNS}}
SSL required No No Yes Yes
Connection pool No No Yes ({{POOL}}) Yes ({{POOL}})
Read replica No No No Yes
Backup No Daily Daily {{BACKUP_FREQ}}

7. External Service Configuration Per Environment

Service Dev Staging Production Notes
Email (SMTP) Mailtrap Mailtrap SendGrid / SES
Payments Stripe test Stripe test Stripe live Different API keys
SMS Twilio test Twilio test Twilio live
Analytics Disabled Staging property Production property
Error tracking Disabled Sentry dev project Sentry prod project
Maps No key / free tier Paid key Paid key

8. Environment Provisioning Process

  1. Infrastructure provisioning: terraform apply -var-file=envs/{{ENV}}.tfvars
  2. Secret provisioning: bash scripts/provision-secrets.sh {{ENV}}
  3. Database provisioning: bash scripts/create-db.sh {{ENV}}
  4. DNS configuration: Update DNS records per deployment-architecture.md
  5. TLS certificates: Auto-provisioned via {{CERT_TOOL}}
  6. Initial deployment: Trigger CI/CD for {{ENV}} target
  7. Verification: Run smoke tests against new environment

Estimated time: {{PROVISION_TIME}} minutes Runbook: {{PROVISION_RUNBOOK_LINK}}


9. Environment Teardown Process

  1. Verify no active users or critical processes
  2. Export any required data / logs
  3. Remove DNS records
  4. Revoke TLS certificates
  5. terraform destroy -var-file=envs/{{ENV}}.tfvars
  6. Purge secrets from secret store
  7. Archive environment configuration to {{ARCHIVE_LOCATION}}
  8. Update this document to remove the environment entry

10. Parity Policy (Staging ↔ Production Drift)

Goal: Staging should be functionally identical to production at all times.

Area Policy
Application version Staging is always ahead by ≤ 1 release
Infrastructure spec Same instance types and topology
Database engine & version Must match exactly
OS & runtime versions Must match exactly
Third-party dependencies Same versions (except external service mode)
Network topology Same (except size)
Security controls Same

Drift detection: {{DRIFT_DETECTION}} Drift resolution owner: Platform team



Approval

Role Name Date Signature
Author
Reviewer
Approver

Infrastructure as Code


Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Overview

IaC Tool: {{IAC_TOOL}} Tool Version: {{IAC_VERSION}} Provider: {{CLOUD_PROVIDER}} Provider Version: {{PROVIDER_VERSION}}

Rationale for tool choice:

{{IAC_RATIONALE}}

Core Principles:


2. Repository Structure

{{IaC_REPO}}/
├── modules/                    # Reusable modules
│   ├── networking/             # VPC, subnets, security groups
│   ├── compute/                # EC2, ECS, Lambda
│   ├── database/               # RDS, ElastiCache
│   ├── storage/                # S3, EFS
│   └── monitoring/             # CloudWatch, alerts
├── environments/               # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── shared/                     # Shared resources (DNS, accounts)
├── scripts/                    # Helper scripts
│   ├── bootstrap.sh            # Initialize state backend
│   └── validate.sh             # Pre-apply validation
├── .terraform-version          # Pin tool version (tfenv)
├── .tflint.hcl                 # Linting config
└── README.md

2.1 Module Organization

Module Purpose Inputs Outputs
modules/networking VPC, subnets, routing region, cidr_block, az_count vpc_id, subnet_ids, sg_ids
modules/compute ECS cluster, task definitions cluster_name, instance_type cluster_arn, task_role_arn
modules/database RDS instance, parameter groups engine, instance_class db_endpoint, db_secret_arn
modules/storage S3 buckets with policies bucket_name, purpose bucket_arn, bucket_name
modules/monitoring CloudWatch dashboards, alarms service_name, thresholds alarm_arns, dashboard_url

2.2 Environment Separation

2.3 Shared Modules

Shared module registry: {{MODULE_REGISTRY}}

Module Source Version Used By
networking {{REGISTRY}}/networking ~> 2.0 All environments
database {{REGISTRY}}/database ~> 1.5 Staging, Production
monitoring {{REGISTRY}}/monitoring ~> 1.2 All environments

3. State Management

3.1 Remote State Backend

Backend: {{STATE_BACKEND}}

Environment State Location Access
Dev {{STATE_BUCKET}}/dev/terraform.tfstate DevOps team
Staging {{STATE_BUCKET}}/staging/terraform.tfstate DevOps team
Production {{STATE_BUCKET}}/production/terraform.tfstate Senior DevOps + CI only

Bootstrap (first-time setup):

bash scripts/bootstrap.sh {{ENVIRONMENT}}

3.2 State Locking

Locking Mechanism: {{LOCK_MECHANISM}} Lock timeout: {{LOCK_TIMEOUT}}s Force unlock: Only by senior DevOps after verifying no active apply

Lock table (if DynamoDB):

3.3 State File Organization

Splitting strategy: {{SPLIT_STRATEGY}}

State File Contains Reason for split
base/terraform.tfstate Networking, IAM Infrequently changed
app/terraform.tfstate Compute, app services Frequently changed
data/terraform.tfstate Databases, caches High risk, separate lifecycle

4. Module Design

4.1 Naming Conventions

Resource naming pattern: {{PROJECT}}-{{ENVIRONMENT}}-{{COMPONENT}}-{{SUFFIX}}

Resource Example
VPC myapp-prod-vpc
ECS Cluster myapp-prod-cluster
RDS Instance myapp-prod-db-primary
S3 Bucket myapp-prod-assets-{{ACCOUNT_ID}}
Security Group myapp-prod-app-sg
IAM Role myapp-prod-app-task-role
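
The naming pattern is easy to enforce with a small helper in a lint or pre-apply script. A minimal sketch, assuming each part is lowercase alphanumerics with optional inner hyphens and that the suffix is optional, as the examples above suggest; the helper name is hypothetical.

```python
import re
from typing import Optional

# Each name part: lowercase letters/digits, hyphen-separated, no edge hyphens.
_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def resource_name(project: str, environment: str, component: str,
                  suffix: Optional[str] = None) -> str:
    """Build a resource name as project-environment-component[-suffix]."""
    parts = [project, environment, component]
    if suffix:
        parts.append(suffix)
    for part in parts:
        if not _PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


print(resource_name("myapp", "prod", "db", "primary"))
```

Individual services add their own limits on top of this (for example, length and character rules for bucket names), which a fuller validator would need to check per resource type.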

4.2 Input / Output Variables

Required variable fields:

variable "environment" {
  description = "Deployment environment (dev/staging/production)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

Required output fields:

output "database_endpoint" {
  description = "The hostname of the database endpoint"
  # `address` is the bare hostname; `endpoint` would return host:port
  value       = aws_db_instance.main.address
  sensitive   = false
}

4.3 Versioning Strategy

Module versioning: Semantic versioning (MAJOR.MINOR.PATCH)
Pin strategy: ~> MAJOR.MINOR (allows minor and patch updates within the pinned major version; use ~> MAJOR.MINOR.PATCH to allow patch updates only)
Upgrade policy: Review and test before upgrading minor/major versions
Changelog: Every module version bump requires a CHANGELOG entry
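
In Terraform, the `~>` operator lets only the rightmost version component that you specify increase. So `~> 2.0` accepts any 2.x release, while `~> 1.5.2` accepts only 1.5.x patch releases at or above 1.5.2. The sketch below models that rule for plain numeric versions so a pin can be checked against candidate releases; it ignores pre-release tags and is an illustration, not Terraform's own implementation.

```python
def _parse(version: str) -> tuple:
    """Split 'MAJOR.MINOR[.PATCH]' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check `version` against `~> constraint` (pass only the part after ~>)."""
    base = _parse(constraint)
    if len(base) < 2:
        raise ValueError("this sketch needs at least MAJOR.MINOR in the constraint")
    candidate = _parse(version)
    candidate = candidate + (0,) * (3 - len(candidate))
    lower = base + (0,) * (3 - len(base))
    # Upper bound: bump the second-to-last component and drop the last one.
    upper = base[:-2] + (base[-2] + 1,)
    return lower <= candidate and candidate[:len(upper)] < upper


for release in ("2.0.0", "2.4.1", "3.0.0"):
    print(release, satisfies_pessimistic(release, "2.0"))
```

A pin such as `~> 2.0` therefore picks up new minor releases automatically, which is worth weighing against the upgrade policy for modules that manage high-risk resources.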


5. Workflow

5.1 Standard Change Process

flowchart LR
    BRANCH[Create branch] --> CODE[Write/modify IaC]
    CODE --> VALIDATE[terraform validate + tflint]
    VALIDATE --> PLAN[terraform plan]
    PLAN --> PR[Open PR with plan output]
    PR --> REVIEW[Peer review]
    REVIEW --> APPROVE[Approval]
    APPROVE --> APPLY[terraform apply in CI]
    APPLY --> VERIFY[Verify resources]

Steps:

  1. Create feature branch: infra/{{TICKET}}-description
  2. Make changes, run terraform validate && terraform fmt
  3. Run terraform plan — attach output to PR
  4. Open PR for review (at least 1 reviewer required for dev/staging, 2 for production)
  5. CI runs terraform plan automatically on PR open
  6. Merge triggers terraform apply in CI (dev/staging)
  7. Production apply requires manual trigger after PR merge

5.2 PR-Based Infrastructure Changes

PR Requirements:

5.3 Automated Drift Detection

Schedule: {{DRIFT_SCHEDULE}} Tool: {{DRIFT_TOOL}} Alert Channel: {{DRIFT_ALERT_CHANNEL}} Action on drift:

  1. Investigate cause (manual change, provider issue, external system)
  2. Either fix drift (apply IaC) or update IaC to reflect intentional change
  3. Never leave drift unresolved for > {{DRIFT_SLA}}
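
A scheduled drift check can be built on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. A minimal sketch, assuming the `terraform` binary is on the PATH and the working directory is already initialized; alerting is left out and the function names are hypothetical.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    CLEAN = "clean"  # live infrastructure matches the code
    ERROR = "error"  # the plan itself failed
    DRIFT = "drift"  # the plan contains changes


def classify_plan_exit(code: int) -> DriftStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return DriftStatus.CLEAN
    if code == 2:
        return DriftStatus.DRIFT
    return DriftStatus.ERROR


def check_drift(workdir: str) -> DriftStatus:
    """Run a read-only plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A non-empty plan also appears when code has been merged but not yet applied, so the alert is most meaningful when the check runs against the last applied commit.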

6. Security

6.1 Least Privilege for IaC Service Account

Environment Service Account Permissions
Dev ci-iac-dev@{{PROJECT}} Full write within dev resources
Staging ci-iac-staging@{{PROJECT}} Full write within staging resources
Production ci-iac-prod@{{PROJECT}} Restricted write, requires MFA session

6.2 Secret Injection (Not in State)

Rule: Never pass passwords, API keys, or secrets as Terraform variables.
Pattern: Reference the secrets manager in the resource configuration:

# WRONG — secret in state
resource "aws_db_instance" "main" {
  password = var.db_password  # This will be in state in plaintext!
}

# RIGHT — secret from Secrets Manager
resource "aws_db_instance" "main" {
  manage_master_user_password = true  # AWS manages the password in Secrets Manager
}

6.3 Policy as Code

Tool: {{POLICY_TOOL}}

Policy Enforcement
No public S3 buckets Block
All resources must have environment tag Warn
RDS must be in private subnet Block
Security groups must not allow 0.0.0.0/0 on sensitive ports Block
Encryption at rest required for data resources Block

7. Tagging Strategy

Required tags on all resources:

Tag Value Purpose
Project {{PROJECT_NAME}} Cost attribution
Environment dev / staging / production Environment filter
ManagedBy terraform Identifies IaC-managed resources
Team {{TEAM}} Ownership
CostCenter {{COST_CENTER}} Finance attribution

Optional tags:

Tag Value Purpose
Service {{SERVICE_NAME}} Service-level grouping
Ticket {{TICKET_ID}} Change tracking
ExpiresAt {{DATE}} Ephemeral resource cleanup

8. Cost Management

Budget alerts:

Cost optimization built into IaC:


9. Disaster Recovery for IaC State

State backup: {{STATE_BACKUP}} Recovery procedure:

  1. Restore from most recent backup
  2. Run terraform plan — verify no unexpected changes
  3. If state is unrecoverable: terraform import for each managed resource (refer to resource inventory)

Prevention:



Approval

Role Name Date Signature
Author
Reviewer
Approver

Monitoring & Observability


Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Observability Strategy

Observability Platform: {{OBS_PLATFORM}} Strategy: Instrument everything, alert on symptoms (not causes), correlate across pillars

Core Questions We Must Be Able to Answer:

  1. Is the system up and serving users correctly?
  2. How fast is it responding?
  3. What errors are occurring and why?
  4. Where is the bottleneck?
  5. What changed before this problem started?

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

Metric Source Alert Threshold Severity
CPU utilization Node exporter / CloudWatch > {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical) Warning / Critical
Memory utilization Node exporter / CloudWatch > {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical) Warning / Critical
Disk utilization Node exporter / CloudWatch > {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical) Warning / Critical
Network in/out Node exporter / CloudWatch > {{NET_LIMIT}}Mbps sustained Warning
Container restarts Kubernetes / ECS > {{RESTART_LIMIT}} in 5min Critical
Node not ready Kubernetes Any Critical

Application Metrics (RED Method)

Metric Description Target Alert Threshold
Request rate Requests per second per service Baseline ± 20% 50% deviation
Error rate % requests returning 5xx < {{ERROR_RATE}}% > {{ERROR_ALERT}}%
P50 latency Median response time < {{P50}}ms > {{P50_ALERT}}ms
P95 latency 95th percentile response time < {{P95}}ms > {{P95_ALERT}}ms
P99 latency 99th percentile response time < {{P99}}ms > {{P99_ALERT}}ms

Business Metrics

Metric Description Collection Method Dashboard
Active users (DAU/MAU) Daily/monthly active users Frontend instrumentation Business dashboard
{{CONVERSION_METRIC}} {{CONVERSION_DESC}} Backend event Business dashboard
{{REVENUE_METRIC}} {{REVENUE_DESC}} Payment events Finance dashboard
Feature usage Feature-level engagement Feature flag SDK Product dashboard

Custom Metrics Definition

Metric Name Type Labels Description Unit
{{APP}}_job_queue_depth Gauge queue_name Number of pending jobs count
{{APP}}_job_processing_duration Histogram queue_name, status Job processing time seconds
{{APP}}_external_api_calls_total Counter service, status External API call count count
{{APP}}_cache_hit_ratio Gauge cache_type Cache hit ratio (0 to 1) ratio

2.2 Logs

Log Levels & Usage Guide

Level When to Use Examples
ERROR Unexpected failure requiring attention Database connection failure, unhandled exception
WARN Unexpected but handled situation Deprecated API called, retry succeeded
INFO Normal business events User logged in, order created, job completed
DEBUG Diagnostic detail (dev/staging only) Function parameters, internal state
TRACE Extremely verbose (local dev only) SQL queries, HTTP request/response bodies

Production log level: INFO and above

Structured Logging Format

{
  "timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "service": "{{SERVICE_NAME}}",
  "version": "{{VERSION}}",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "user_id": "{{HASHED_OR_OMIT}}",
  "request_id": "req-uuid-here",
  "message": "Order created successfully",
  "order_id": "ord-123",
  "duration_ms": 45
}

Required fields: timestamp, level, service, message, trace_id
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)
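
One way to produce the structured format above is a custom formatter on the standard logger, with per-event fields passed through `extra`. This is a sketch in Python for illustration only; the class name and the list of optional fields are assumptions, and a real service would use whatever its logging library provides.

```python
import json
import logging
from datetime import datetime, timezone

# Optional per-event fields copied into the output when present.
OPTIONAL_FIELDS = ("span_id", "user_id", "request_id", "order_id", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds")
                                  .replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        for field in OPTIONAL_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


handler = logging.StreamHandler()  # stderr by default; a sketch only
handler.setFormatter(JsonFormatter(service="orders", version="1.4.2"))
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Order created successfully",
            extra={"trace_id": "abc123def456", "order_id": "ord-123", "duration_ms": 45})
```

Only allow-listed fields reach the output, which is a simple guard against a forbidden value slipping in through `extra`.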

Log Aggregation Pipeline

flowchart LR
    APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
    AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
    STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
    STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]

Stage Tool Configuration
Application logging {{LOG_LIB}} Structured JSON to stdout
Log agent {{LOG_AGENT}} Deployed as sidecar / DaemonSet
Transport {{LOG_TRANSPORT}} TLS encrypted
Storage {{LOG_STORE}} Indexed, compressed
Query {{LOG_QUERY}} Access via dashboard

Log Retention Policy

Environment Retention Storage Tier
Dev 7 days Hot
Staging 30 days Hot
Production {{PROD_LOG_RETENTION}} days Hot (30d) → Cold archive
Audit logs 1 year (regulatory) Hot (90d) → Cold archive

PII in Logs — Masking Strategy

Data Type Strategy Example
Email address Hash + truncate user:sha256(email)[:8]
Phone number Redact [PHONE_REDACTED]
IP address Anonymize last octet 192.168.1.xxx
Payment data Never log Use [PAYMENT_DATA_OMITTED]
Auth tokens Never log Use [TOKEN_OMITTED]
Names Omit or pseudonymize Reference by ID only
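
The masking strategies above can be applied at the field level before an event is logged. A minimal sketch: the field names (`email`, `phone`, `ip`, `card_number`, `token`) are hypothetical, and lowercasing the email before hashing is an assumption made so that the same address always maps to the same value.

```python
import hashlib
import ipaddress


def mask_email(email: str) -> str:
    """Replace an email with a short, stable pseudonym."""
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"user:{digest[:8]}"


def mask_ipv4(ip: str) -> str:
    """Hide the last octet of an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(octets[:3] + ["xxx"])


def mask_event(event: dict) -> dict:
    """Return a copy of the event with sensitive fields masked."""
    masked = dict(event)
    if "email" in masked:
        masked["email"] = mask_email(masked["email"])
    if "phone" in masked:
        masked["phone"] = "[PHONE_REDACTED]"
    if "ip" in masked:
        masked["ip"] = mask_ipv4(masked["ip"])
    if "card_number" in masked:
        masked["card_number"] = "[PAYMENT_DATA_OMITTED]"
    if "token" in masked:
        masked["token"] = "[TOKEN_OMITTED]"
    return masked


print(mask_event({"email": "Jane@Example.com", "ip": "192.168.1.42", "order_id": "ord-123"}))
```

A plain SHA-256 of an email address can be reversed by hashing a list of known addresses, so a keyed hash (HMAC with a secret) may be a better fit where that risk matters.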

2.3 Traces

Distributed Tracing Setup

Tracing Framework: {{TRACE_FRAMEWORK}} Backend: {{TRACE_BACKEND}} Auto-instrumentation: {{AUTO_INSTRUMENT}}

Service Instrumented Framework Notes
{{SERVICE_1}} Yes OpenTelemetry HTTP, DB, Redis
{{SERVICE_2}} Yes OpenTelemetry HTTP, external calls

Trace Sampling Strategy

Environment Strategy Rate Notes
Dev Always-on 100% Full visibility
Staging Always-on 100% Full visibility
Production Tail-based {{SAMPLE_RATE}}% + errors Error traces always kept
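
The production row (a percentage of traces plus all error traces) can be expressed as a keep-or-drop decision made once a trace is complete. A minimal sketch, assuming the decision is derived from a hash of the trace ID so that every collector instance reaches the same answer; the type and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool


def keep_trace(trace: TraceSummary, sample_rate_percent: float) -> bool:
    """Decide whether a finished trace is stored."""
    if trace.has_error:
        return True  # error traces are always kept
    # Map the trace ID onto 0..9999 and keep the lowest share of buckets.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < sample_rate_percent * 100


print(keep_trace(TraceSummary("4bf92f3577b34da6a3ce929d0e0e4736", has_error=True), 5))
```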

Tail-based sampling rules:

Span Naming Conventions

Operation Type Naming Pattern Example
HTTP handler HTTP {{METHOD}} {{ROUTE}} HTTP POST /api/orders
DB query db.{{operation}} {{table}} db.select orders
Cache cache.{{operation}} {{key_pattern}} cache.get user:*
Queue queue.{{operation}} {{queue_name}} queue.publish order-events
External HTTP {{service}} {{METHOD}} {{path}} stripe POST /charges

Context Propagation

Standard: W3C Trace Context (traceparent header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
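
A W3C `traceparent` header has four hyphen-separated fields: a 2-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2-hex-digit flags whose lowest bit means "sampled". The sketch below parses version 00 headers and builds the header for an outgoing call or queued message; it is stricter than the specification about future versions, and tracing SDKs such as OpenTelemetry normally handle this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for a downstream call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
message_headers = {"traceparent": child_traceparent(incoming)}
print(message_headers)
```

For asynchronous work, the same header value is stored in the message or job metadata and parsed by the consumer before it starts its own span.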


3. Alerting

3.1 Alert Rules

| Alert Name | Condition | Duration | Severity | Channel | Runbook |
|---|---|---|---|---|---|
| HighErrorRate | error_rate > {{ERROR_ALERT}}% | 2 min | Critical | PagerDuty | [link] |
| SlowP99 | p99_latency > {{P99_ALERT}}ms | 5 min | Warning | Slack #alerts | [link] |
| ServiceDown | health_check failing | 1 min | Critical | PagerDuty | [link] |
| HighCPU | cpu > {{CPU_CRIT}}% | 10 min | Warning | Slack #alerts | [link] |
| DiskAlmostFull | disk > {{DISK_CRIT}}% | 5 min | Critical | PagerDuty | [link] |
| DeploymentFailed | deployment status = failed | Immediate | Critical | Slack #deployments | [link] |
| CertificateExpiringSoon | cert_expiry < 30 days | | Warning | Slack #ops | [link] |
| BackupFailed | backup job = failed | | Critical | PagerDuty | [link] |
| SLOBudgetBurning | error_budget < 10% remaining | | Critical | PagerDuty | [link] |

3.2 Alert Routing & Escalation

flowchart TD
    ALERT[Alert fires] --> SEVERITY{Severity?}
    SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
    SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
    ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
    ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]

Severity Response SLA Channel Escalation
Critical (P1) Acknowledge in 5 min, resolve in 1h PagerDuty + call Escalate at 5 min
High (P2) Acknowledge in 30 min, resolve in 4h PagerDuty Escalate at 30 min
Warning (P3) Review within 1 business day Slack Manual
Info No response required Slack None

3.3 On-Call Rotation

Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)

3.4 Alert Fatigue Prevention


4. Dashboards

4.1 Dashboard Inventory

Dashboard Purpose Link Audience
System Overview High-level health of all services {{LINK}} Everyone
{{SERVICE_1}} Service-level detail {{LINK}} Dev team
Infrastructure Host/container metrics {{LINK}} DevOps
Business Metrics KPIs and conversions {{LINK}} Leadership, PM
SLO Tracker Error budget tracking {{LINK}} Engineering lead
On-Call Current incidents, top errors {{LINK}} On-call engineer

4.2 Key Dashboard Specs — System Overview

Required panels:

  1. Service health matrix (all services, green/red/yellow)
  2. Request rate (all services, last 1h)
  3. Error rate (all services, last 1h)
  4. P99 latency (all services, last 1h)
  5. Active incidents count
  6. Error budget remaining (all SLOs)
  7. Last deployment (service, version, time)
  8. Infrastructure health (CPU, memory, disk — aggregate)

5. SLOs / SLIs

5.1 SLI Definitions

SLI Definition Measurement Method
Availability % requests returning non-5xx (total_requests - 5xx_requests) / total_requests
Latency % requests completing within threshold histogram_quantile(0.95, ...) < {{LATENCY_SLI}}ms
Error rate % requests not returning errors (total_requests - error_requests) / total_requests

5.2 SLO Targets

Service SLI Target Window Error Budget
{{SERVICE}} Availability {{AVAIL_TARGET}}% 30 days {{BUDGET_MINUTES}} min/month
{{SERVICE}} Latency (P95 < {{P95}}ms) {{LATENCY_TARGET}}% 30 days {{LATENCY_BUDGET_MINUTES}} min/month
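
The error budget column follows directly from the target and the window: the budget is the share of the window in which the service is allowed to miss its objective. A minimal sketch of that arithmetic, together with the availability SLI defined in 5.1; the function names are illustrative.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Share of requests that did not return 5xx, as a percentage."""
    if total_requests == 0:
        return 100.0
    return 100 * (total_requests - server_errors) / total_requests


def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage the target allows within the window."""
    if not 0 < target_percent <= 100:
        raise ValueError("target must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return (1 - target_percent / 100) * window_minutes


for target in (99.0, 99.9, 99.99):
    print(f"{target}% over 30 days: {error_budget_minutes(target):.1f} min")
```

For a 30-day window this works out to about 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.3 minutes at 99.99%.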

5.3 Error Budget Tracking

Service Monthly Budget Burned This Month Remaining Burn Rate (24h)
{{SERVICE}} {{BUDGET}}min TBD TBD TBD
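
The burn rate column can be read as "how many times faster than sustainable is the budget being spent". A burn rate of 1 uses up exactly the whole budget by the end of the window. This is a minimal sketch assuming a request-based SLO; the function names and example numbers are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, target_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed_ratio = 1 - target_percent / 100
    if allowed_ratio <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / allowed_ratio


def budget_remaining_percent(bad_events: int, total_events: int,
                             target_percent: float) -> float:
    """Share of the budget still unspent, given events so far in the window."""
    allowed_bad = (1 - target_percent / 100) * total_events
    if allowed_bad <= 0:
        return 0.0
    return max(0.0, 100 * (1 - bad_events / allowed_bad))


# 24h sample: 2,000 failed requests out of 1,000,000 against a 99.9% target.
print(round(burn_rate(2_000, 1_000_000, 99.9), 2))
```

`budget_remaining_percent` treats the traffic seen so far as the whole window, so early in the window it understates how much budget is really left.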

Error budget policy:


6. Tooling

Tool Version Purpose Hosted
{{METRICS_TOOL}} {{VERSION}} Metrics collection & storage {{HOSTING}}
{{LOG_TOOL}} {{VERSION}} Log aggregation {{HOSTING}}
{{TRACE_TOOL}} {{VERSION}} Distributed tracing {{HOSTING}}
{{DASHBOARD_TOOL}} {{VERSION}} Visualization {{HOSTING}}
{{ALERT_TOOL}} {{VERSION}} Alert routing & on-call {{HOSTING}}


Approval

Role Name Date Signature
Author
Reviewer
Approver

Disaster Recovery Plan


Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Business Continuity Overview

This plan documents the procedures to recover {{PROJECT_NAME}} services following a disaster event (data center failure, data corruption, security breach, or catastrophic failure).

Plan Owner: {{DR_OWNER}} Plan Reviewer: {{DR_REVIEWER}} Last Tested: {{LAST_TEST_DATE}} Next Scheduled Test: {{NEXT_TEST_DATE}}

Disaster types covered:


2. RPO / RTO Targets Per Service Tier

Tier Description RPO RTO Examples
Tier 1 — Critical Core user-facing services; downtime has direct revenue impact 0 (real-time replication) < 15 min Auth, checkout, core API
Tier 2 — Important Supporting services; degraded experience without them < 1 hour < 4 hours Notifications, reports
Tier 3 — Standard Background/admin services; business can operate without temporarily < 24 hours < 24 hours Analytics, admin panel

3. Service Tier Classification

Service Tier Owner Rationale
{{SERVICE_1}} Tier 1 {{OWNER}} Core user journey
{{SERVICE_2}} Tier 1 {{OWNER}} Authentication
{{SERVICE_3}} Tier 2 {{OWNER}} Supporting
{{SERVICE_4}} Tier 3 {{OWNER}} Admin only
Database — Primary Tier 1 Platform All services depend on it
Object Storage Tier 2 Platform User uploads

4. Backup Strategy

4.1 Database Backups

Database Backup Type Frequency Retention Location Verified
{{DB_PRIMARY}} Automated snapshot Daily 30 days {{BACKUP_LOCATION}} Monthly
{{DB_PRIMARY}} Point-in-time recovery Continuous 7 days {{BACKUP_LOCATION}} Monthly
{{DB_READ_REPLICA}} Not backed up separately Rebuilt from primary

Automated backup tool: {{BACKUP_TOOL}} Backup encryption: AES-256, key managed in {{KMS_TOOL}} Cross-region copy: {{CROSS_REGION}}

4.2 File / Object Storage Backups

Storage Backup Method Frequency Retention DR Copy
{{S3_BUCKET}} S3 versioning + replication Continuous {{RETENTION}} {{DR_BUCKET}}
{{FILE_STORE}} Snapshot Daily 30 days Cross-region

4.3 Configuration Backups

Config Backup Method Location Frequency
IaC (Terraform) Git repository {{GIT_REPO}} On change
Application config Git repository {{GIT_REPO}} On change
Secrets Secrets manager replication {{SECRETS_BACKUP}} Real-time
DNS records Export to Git {{GIT_REPO}} Weekly
TLS certificates Secrets manager {{CERTS_BACKUP}} On renewal

4.4 Backup Testing Schedule

Backup Type Test Frequency Last Test Result Tester
Database full restore Monthly {{DATE}} {{RESULT}} {{TESTER}}
Point-in-time restore Quarterly {{DATE}} {{RESULT}} {{TESTER}}
Object storage restore Quarterly {{DATE}} {{RESULT}} {{TESTER}}
Full DR failover drill Twice a year {{DATE}} {{RESULT}} {{TESTER}}

5. Failover Procedures

5.1 Automated Failover

Component Automatic Failover Mechanism Failover Time
Database (Multi-AZ) Yes RDS automatic failover 60-120 seconds
Load balancer Yes Health check → route to healthy targets < 30 seconds
CDN Yes Origin health checks < 60 seconds
Redis (if clustered) Yes Redis Sentinel / ElastiCache < 30 seconds

Monitoring automatic failover:

5.2 Manual Failover Steps

Prerequisite: Automatic failover has NOT occurred or has failed.

Database Manual Failover (Tier 1)

  1. Confirm primary is unavailable: ping {{DB_PRIMARY_HOST}} (should time out)
  2. Connect to standby: psql -h {{STANDBY_HOST}}
  3. Promote standby to primary: SELECT pg_promote();
  4. Update DNS record db.{{INTERNAL_DOMAIN}} → {{STANDBY_HOST}}
  5. DNS TTL: Ensure TTL was set to 60s pre-incident (if not, wait {{DNS_TTL}} seconds)
  6. Verify applications are reconnecting: Check application logs for successful DB connections
  7. Page on-call to verify all services healthy
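
Step 1 relies on ping, which only tests ICMP reachability of the host. A host can answer ping while the database port is closed, or drop ICMP while the database is healthy. As a complementary check, the sketch below attempts a TCP connection to the database port. It assumes the PostgreSQL default port 5432; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int = 5432, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example (hypothetical host name):
#   port_is_open("db-primary.internal")  -> False would support declaring the primary down
```

A False result from more than one network location is stronger evidence than a single failed ping before promoting the standby.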

Regional Failover (Catastrophic)

  1. Declare DR event (approval from {{DR_AUTHORITY}})
  2. Confirm primary region {{PRIMARY_REGION}} is unreachable
  3. Activate standby in {{DR_REGION}}: terraform apply -var-file=envs/dr.tfvars
  4. Restore database from latest cross-region snapshot
  5. Update Route 53 / DNS to point to {{DR_REGION}} endpoints
  6. Run smoke tests: bash scripts/smoke-tests.sh {{DR_REGION}}
  7. Notify stakeholders (see Communication Plan)
  8. Monitor enhanced metrics for {{MONITOR_PERIOD}}h

6. Recovery Procedures Per Service

Tier 1 Services

Service Recovery Procedure Recovery Script Est. Time
{{SERVICE_1}} 1. Restore from snapshot; 2. Verify config; 3. Run smoke tests scripts/restore-{{SERVICE_1}}.sh {{TIME}}min
Authentication 1. Deploy from last known good image; 2. Verify JWT keys; 3. Test login flow scripts/restore-auth.sh {{TIME}}min

Tier 2 Services

Tier 3 Services


7. DR Drill Schedule & Scenarios

Drill Type Frequency Participants Last Executed Next Scheduled
Tabletop exercise Quarterly On-call team + engineering lead {{DATE}} {{DATE}}
Database failover test Quarterly DevOps + one developer {{DATE}} {{DATE}}
Full DR failover Twice a year Entire engineering team {{DATE}} {{DATE}}
Backup restore test Monthly DevOps {{DATE}} {{DATE}}

Drill Scenarios to Cover:

  1. Database primary failure (automatic failover test)
  2. Accidental data deletion (point-in-time restore)
  3. Single AZ outage (multi-AZ failover)
  4. Full region failure (cross-region DR)
  5. Ransomware/data corruption (restore from offline backup)
  6. CDN outage (origin fallback)
  7. Secret store unavailable (cached credentials)

8. Communication Plan During DR Event

Internal Communications

Audience Channel Frequency Owner
Engineering team Slack #incidents + war room call Real-time Incident commander
Engineering management Direct message At declaration + hourly Incident commander
Product/Business leadership Email + Slack At declaration + hourly Incident commander
Customer support Dedicated Slack channel At declaration + 30 min Support lead

External Communications

Audience Channel Trigger Message
Customers Status page ({{STATUS_PAGE}}) Within 15 min of confirmed incident "We are investigating an issue"
Customers Status page update Every 30 min Progress update
Customers Email If impact > {{EMAIL_THRESHOLD}}h Direct notification
SLA customers Direct contact Per SLA contract As contractually required

Communication templates: See go-live-runbook.md communication section


9. War Room Setup

War Room: {{WAR_ROOM_LINK}} Bridge Line: {{BRIDGE_NUMBER}} Document: Live incident doc created at: {{INCIDENT_DOC_TEMPLATE}}

Roles during DR event:

Role Responsibility Primary Backup
Incident Commander Coordinates response, final decisions {{IC}} {{IC_BACKUP}}
Technical Lead Leads technical recovery {{TECH_LEAD}} {{TECH_BACKUP}}
Communications Lead Internal/external updates {{COMMS_LEAD}} {{COMMS_BACKUP}}
Scribe Documents timeline, actions taken {{SCRIBE}} Rotate

10. Post-Recovery Verification Checklist


11. DR Test Results Log

Date Test Type Scenario RTO Achieved RPO Achieved Issues Found Resolved By
{{DATE}} {{TYPE}} {{SCENARIO}} {{RTO}} {{RPO}} {{ISSUES}} {{RESOLVED}}


Approval

Role Name Date Signature
Author
Reviewer
Approver

CI/CD Pipeline

Project: {{PROJECT_NAME}} Version: {{VERSION}} Date: {{DATE}} Author: {{AUTHOR}} Status: Draft | In Review | Approved Reviewers: {{REVIEWERS}}

Document History

Version Date Author Changes
0.1 {{DATE}} {{AUTHOR}} Initial draft

1. Overview

CI/CD Platform: {{PLATFORM}} Container Registry: {{REGISTRY}} Deployment Target: {{DEPLOY_TARGET}} Strategy: {{STRATEGY}}


2. Pipeline Overview

flowchart LR
    subgraph Source
        PR[Pull Request]
        MERGE[Merge to main]
    end

    subgraph CI["CI — runs on every PR"]
        LINT[Lint & Format]
        TEST_UNIT[Unit Tests]
        TEST_INT[Integration Tests]
        SAST[SAST Scan]
        SCA[Dependency Scan]
        BUILD[Build Artifact]
    end

    subgraph CD_DEV["CD — Dev Auto-Deploy"]
        DEPLOY_DEV[Deploy to Dev]
        SMOKE_DEV[Smoke Tests]
    end

    subgraph CD_STAGING["CD — Staging (auto on main)"]
        DEPLOY_STG[Deploy to Staging]
        TEST_E2E[E2E Tests]
        PERF[Performance Tests]
    end

    subgraph CD_PROD["CD — Production (manual gate)"]
        APPROVAL[Manual Approval]
        DEPLOY_PROD[Deploy to Production]
        SMOKE_PROD[Smoke Tests]
        MONITOR[Verify Monitoring]
    end

    PR --> LINT
    LINT --> TEST_UNIT
    TEST_UNIT --> TEST_INT
    TEST_INT --> SAST
    SAST --> SCA
    SCA --> BUILD
    MERGE --> CD_DEV
    BUILD --> DEPLOY_DEV
    DEPLOY_DEV --> SMOKE_DEV
    SMOKE_DEV --> DEPLOY_STG
    DEPLOY_STG --> TEST_E2E
    TEST_E2E --> PERF
    PERF --> APPROVAL
    APPROVAL --> DEPLOY_PROD
    DEPLOY_PROD --> SMOKE_PROD
    SMOKE_PROD --> MONITOR

3. Source Control Configuration

3.1 Branching Strategy

Strategy: {{BRANCH_STRATEGY}}

Branch Purpose Naming Convention Lifetime
main Production-ready code fixed Permanent
develop Integration branch fixed Permanent
feature/* New features feature/{{TICKET}}-description Until merged
fix/* Bug fixes fix/{{TICKET}}-description Until merged
hotfix/* Production hotfixes hotfix/{{TICKET}}-description Until merged
release/* Release preparation release/v{{VERSION}} Until merged

3.2 Branch Protection Rules

Protected Branches: main, develop

Rule main develop
Require PR Yes Yes
Required approvals {{APPROVALS}} 1
Dismiss stale reviews Yes Yes
Require status checks Yes Yes
Required checks lint, unit-tests, integration-tests, sast lint, unit-tests
Require up-to-date Yes No
Allow force push No No
Allow deletions No No

3.3 Code Review Requirements


4. Build Stage

4.1 Build Tool & Configuration

Parameter Value
Build Tool {{BUILD_TOOL}}
Build Command {{BUILD_CMD}}
Artifact Type {{ARTIFACT}}
Artifact Naming {{REGISTRY}}/{{IMAGE_NAME}}:{{TAG_STRATEGY}}
Tag Strategy git-sha for PRs, semver for releases
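
The tag strategy (git SHA for PRs, semver for releases) can be expressed as a small function. This is a sketch under two assumptions that the template does not state: release tags look like `refs/tags/v1.2.3` with the leading `v` dropped in the image tag, and non-release builds use a 12-character short SHA. `image_tag` is an illustrative name.

```python
import re

# Assumed release tag shape: v<major>.<minor>.<patch>
SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def image_tag(git_ref: str, commit_sha: str) -> str:
    """Return the semver for release tags, otherwise a short commit SHA."""
    prefix = "refs/tags/"
    if git_ref.startswith(prefix):
        match = SEMVER_TAG.match(git_ref[len(prefix):])
        if match:
            return ".".join(match.groups())
    # PR and branch builds fall back to the commit SHA (12 characters assumed).
    return commit_sha[:12]


print(image_tag("refs/tags/v1.4.0", "0123456789abcdef0123"))    # 1.4.0
print(image_tag("refs/pull/17/merge", "0123456789abcdef0123"))  # 0123456789ab
```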

4.2 Dependency Caching

Cache Key Restore Keys
Node modules node-modules-{{OS}}-{{LOCKFILE_HASH}} node-modules-{{OS}}-
Docker layers buildx-{{DOCKERFILE_HASH}} buildx-
Test results test-results-{{COMMIT_SHA}} N/A
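
The key and restore-key pattern above works because the exact key changes whenever the lockfile changes, while the restore key matches any earlier cache for the same OS by prefix. A minimal sketch, assuming SHA-256 as the hash (the template leaves `{{LOCKFILE_HASH}}` unspecified) and using the hypothetical helpers `cache_key` and `restore_key`:

```python
import hashlib


def cache_key(prefix: str, os_name: str, lockfile_bytes: bytes) -> str:
    """Exact cache key: changes whenever the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{os_name}-{digest}"


def restore_key(prefix: str, os_name: str) -> str:
    """Prefix used as a fallback match when no exact key exists."""
    return f"{prefix}-{os_name}-"


key = cache_key("node-modules", "linux", b'{"lockfileVersion": 3}')
print(key.startswith(restore_key("node-modules", "linux")))  # True
```

A fallback hit restores a slightly stale cache, which the package manager then updates, so the next run with the same lockfile gets an exact hit.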

4.3 Artifact Generation

Artifact Storage Retention Signed
Docker image {{REGISTRY}} 90 days (non-prod), Forever (prod tags) {{SIGNING}}
Test reports CI artifact storage 30 days No
SBOM {{SBOM_STORAGE}} 1 year Yes
Coverage report {{COVERAGE_STORAGE}} 30 days No

5. Test Stages

5.1 Unit Tests

Parameter Value
Framework {{UNIT_FRAMEWORK}}
Command {{UNIT_CMD}}
Coverage Tool {{COVERAGE_TOOL}}
Coverage Gate ≥ {{COVERAGE_GATE}}% lines, ≥ {{BRANCH_GATE}}% branches
Failure Action Block PR merge

5.2 Integration Tests

Parameter Value
Framework {{INT_FRAMEWORK}}
Command {{INT_CMD}}
Dependencies {{INT_DEPS}}
Failure Action Block PR merge

5.3 E2E Tests

Parameter Value
Framework {{E2E_FRAMEWORK}}
Command {{E2E_CMD}}
Environment Staging
Parallelization {{E2E_SHARDS}} shards
Failure Action Block staging promotion

5.4 Security Scanning

Scan Type Tool Command Gate
SAST {{SAST_TOOL}} {{SAST_CMD}} Block on HIGH/CRITICAL
SCA (dependencies) {{SCA_TOOL}} {{SCA_CMD}} Block on CRITICAL
Container scan {{CONTAINER_SCAN}} {{CONTAINER_SCAN_CMD}} Block on CRITICAL
Secret scanning {{SECRET_SCAN}} {{SECRET_SCAN_CMD}} Block on any finding
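
The gates in the table reduce to a severity threshold per scan type. The sketch below assumes findings are reported with one of four severity labels (LOW, MEDIUM, HIGH, CRITICAL); real scanners differ in naming, and `should_block` is an illustrative helper rather than any tool's API.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Minimum severity that blocks the pipeline, per scan type (from the table above).
# None means any finding blocks, regardless of severity.
BLOCK_THRESHOLD = {
    "sast": "HIGH",
    "sca": "CRITICAL",
    "container": "CRITICAL",
    "secret": None,
}


def should_block(scan_type: str, finding_severities: list[str]) -> bool:
    """Return True if the findings of this scan type must fail the pipeline."""
    threshold = BLOCK_THRESHOLD[scan_type]
    if threshold is None:
        return len(finding_severities) > 0
    minimum = SEVERITY_ORDER.index(threshold)
    return any(SEVERITY_ORDER.index(s) >= minimum for s in finding_severities)


print(should_block("sast", ["MEDIUM", "HIGH"]))  # True
print(should_block("sca", ["HIGH"]))             # False
```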

5.5 Linting & Formatting

Tool Purpose Command Auto-fix
{{LINTER}} Code linting {{LINT_CMD}} PR comment
{{FORMATTER}} Code formatting {{FMT_CMD}} Auto-commit or fail
{{TYPE_CHECK}} Type checking {{TYPE_CMD}} No

6. Deploy Stages

6.1 Deployment Strategy

Strategy: {{DEPLOY_STRATEGY}}

Rolling Deployment:

Canary Deployment (if used):

6.2 Environment Promotion

PR Branch → Dev (auto) → Staging (auto on main merge) → Production (manual approval)

Promotion Trigger Gate Approver
→ Dev Merge to develop / PR All CI checks pass Automatic
→ Staging Merge to main All CI + Dev smoke tests Automatic
→ Production Tag v*.*.* All tests + manual approval {{PROD_APPROVER}}
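
The promotion triggers can be read as a mapping from the git ref to a target environment. The sketch below assumes full ref names as CI systems usually expose them (`refs/heads/...`, `refs/tags/...`) and interprets `v*.*.*` as three numeric components; `target_environment` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed interpretation of the v*.*.* trigger: numeric major.minor.patch.
RELEASE_TAG = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")


def target_environment(git_ref: str, is_pull_request: bool = False) -> Optional[str]:
    """Map a git ref to the environment it promotes to, or None if it deploys nowhere."""
    if RELEASE_TAG.match(git_ref):
        return "production"  # still subject to the manual approval gate
    if git_ref == "refs/heads/main":
        return "staging"
    if is_pull_request or git_ref == "refs/heads/develop":
        return "dev"
    return None


print(target_environment("refs/tags/v2.0.1"))  # production
print(target_environment("refs/heads/main"))   # staging
```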

6.3 Approval Gates

Production Approval Required: Yes Approvers: {{PROD_APPROVERS}} (at least {{APPROVAL_COUNT}} required) Approval Window: {{APPROVAL_WINDOW}}h (pipeline cancels after timeout) Emergency Override: {{EMERGENCY_OVERRIDE}}

6.4 Feature Flags Integration

Feature Flag Tool: {{FF_TOOL}} Flag Validation: Feature flags validated in staging before production deploy Kill Switch: All new features behind flags for first {{FF_PERIOD}} days


7. Post-Deploy

7.1 Smoke Tests

Check Expected Timeout
Health endpoint GET /health HTTP 200 10s
Auth endpoint reachable HTTP 401 10s
Database connection Healthy 15s
Cache connection Healthy 10s
Critical user journey Success 60s

Smoke test timeout: {{SMOKE_TIMEOUT}}min total On failure: Auto-rollback triggered
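
The pass/fail logic behind the smoke tests is simple: a check passes only if it returns the expected status within its timeout, and a single failure triggers rollback. The sketch below models that decision without performing any HTTP calls; `SmokeResult` and `should_rollback` are illustrative names, and the auth path in the example is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SmokeResult:
    name: str
    expected_status: int
    actual_status: int
    elapsed_s: float
    timeout_s: float

    @property
    def passed(self) -> bool:
        # A check passes only with the expected status inside its time budget.
        return self.actual_status == self.expected_status and self.elapsed_s <= self.timeout_s


def should_rollback(results: list[SmokeResult]) -> bool:
    """Any failed or timed-out check triggers the automatic rollback."""
    return any(not r.passed for r in results)


results = [
    SmokeResult("GET /health", 200, 200, 0.4, 10),
    # An unauthenticated request is expected to be rejected with 401.
    SmokeResult("GET /auth (hypothetical path)", 401, 401, 0.6, 10),
]
print(should_rollback(results))  # False
```

Expecting HTTP 401 from the auth endpoint confirms that the service is reachable and enforcing authentication, which a 200 or a 5xx would not.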

7.2 Monitoring Verification

Metric Threshold Check Duration
Error rate < {{ERROR_RATE}}% 5 min
P99 latency < {{P99}}ms 5 min
CPU utilization < {{CPU}}% 5 min
Memory utilization < {{MEM}}% 5 min

7.3 Rollback Triggers

Automatic rollback triggers:

Manual rollback: See rollback-plan.md


8. Pipeline Configuration Reference

Config File Location: {{CONFIG_PATH}}

Key environment variables injected by CI:

Variable Source Purpose
REGISTRY_TOKEN {{SECRET_STORE}} Container registry auth
DEPLOY_KEY {{SECRET_STORE}} Deployment credentials
SENTRY_DSN {{SECRET_STORE}} Error tracking
SLACK_WEBHOOK {{SECRET_STORE}} Notifications

9. Secret Injection Strategy

Strategy: {{SECRET_STRATEGY}}

Secret Type Storage Injection Method Rotation
Registry credentials {{STORAGE}} {{METHOD}} {{ROTATION}}
Cloud credentials {{STORAGE}} OIDC / Workload Identity Per-job
App secrets {{STORAGE}} {{METHOD}} {{ROTATION}}

OIDC Preferred: Cloud credentials injected via OIDC — no long-lived keys stored in CI


10. Pipeline Metrics

Metric Target Current
Build duration (P50) < {{BUILD_TARGET}}min TBD
Test duration (P50) < {{TEST_TARGET}}min TBD
Total pipeline duration < {{TOTAL_TARGET}}min TBD
Deploy frequency {{DEPLOY_FREQ}} TBD
Lead time for changes < {{LEAD_TIME}} TBD
Change failure rate < {{FAILURE_RATE}}% TBD
MTTR < {{MTTR}} TBD
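
Change failure rate and MTTR can be computed from deployment and incident records. This sketch assumes a deployment is a dict with a boolean `failed` field and an incident is a (started, restored) pair of datetimes; both shapes are assumptions for illustration, not an export format of any CI platform.

```python
from datetime import datetime, timedelta


def change_failure_rate(deploys: list[dict]) -> float:
    """Percentage of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d["failed"])
    return 100.0 * failed / len(deploys)


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) over all incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((restored - started for started, restored in incidents), timedelta(0))
    return total / len(incidents)


deploys = [{"failed": False}, {"failed": True}, {"failed": False}, {"failed": False}]
print(change_failure_rate(deploys))  # 25.0
```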


Approval

Role Name Date Signature
Author
Reviewer
Approver

ALAI Static Hosting Blueprint (2026-04-20)

Author: ALAI | Date: 2026-04-20 | MC: #8481 | Last updated: 2026-04-20 (Phantom Domain Removal Protocol added per MC #8526; rollback fix per MC #8494)


1. Platform Decision

Winner: Cloudflare Pages

ALAI already runs alai.no on Cloudflare Pages and has Cloudflare as DNS provider for 6 of 12 domains. The migration path is lowest-friction of any option: git push triggers build, custom domains are free, SSL is automatic, and Cloudflare Access (already deployed for internal tools) works natively. The free tier covers unlimited sites, 500 builds/month, and unlimited bandwidth — all 12 static sites fit without spending a euro. Critically, ALAI does not need object-storage complexity (GCS/S3) or a separate CDN layer for static marketing/demo sites. Cloudflare Pages is the right tool at this scale.

The call on vendor lock-in: ALAI is already locked to Cloudflare for DNS. Extending that to hosting is concentration risk, but the blast radius is recoverable — all sites are git-backed, migrating to any other platform is a 30-minute operation per site. The cost and operational savings outweigh the risk.

Platform Comparison (12 sites, 1 GB each, 100 GB egress/month)

Criterion Cloudflare Pages GCP Cloud Storage + CDN AWS S3 + CloudFront Azure Static Web Apps
Monthly cost (12 sites) €0 (free tier) ~€12 (storage €1.20 + CDN egress ~€10) ~€14 (S3 €0.25 + CF egress ~€8 + requests ~€6) €0 Free / €9 Standard (2 sites free, rest €4.50/mo each)
Build minutes 500/month free N/A (no built-in CI) N/A (no built-in CI) 60 min/month free, then €0.009/min
DX (git push to live) Native (GitHub/GitLab direct) Requires Cloud Build + gsutil Requires CodePipeline or GitHub Action + aws CLI Native (GitHub Actions integrated)
Custom domains Unlimited Per load balancer config Per distribution ($0.0075/10k requests) 5 per plan
SSL Automatic, free Managed certificate, manual setup ACM free but requires distribution config Automatic, free
Preview URLs per PR Yes (automatic) No (requires custom setup) No (requires custom Lambda@Edge) Yes (staging environments)
DDoS/WAF Included free (Cloudflare network) Cloud Armor (add-on, ~€5+/mo) AWS Shield Standard free, WAF extra Azure DDoS Basic free, WAF add-on
Vendor lock-in Medium (proprietary build env, but output is static) Low (standard GCS) Low (standard S3) Medium (Azure-specific config)

Decision: Cloudflare Pages wins on cost (€0 vs €12-14/mo), DX (native git integration), DDoS/WAF included, and operational alignment with existing CF infrastructure.


2. Deploy Blueprint

Repo Convention

Every static site lives in its own repo or a dedicated directory in a monorepo. Naming convention: alai-<product>-web for ALAI properties, client-<slug>-web for client sites. The Cloudflare Pages project name matches the repo name exactly.

Build output must be in one of: dist/, out/, public/, .next/ (for Next.js static export). For plain HTML sites, the root directory is the publish directory.

Step 1: Create Cloudflare Pages Project (one-time per site)

# Via Cloudflare dashboard or wrangler CLI
npx wrangler pages project create <project-name> \
  --production-branch main

Connect GitHub repo in the Pages dashboard. Set build command and output directory per framework:

Framework Build command Output dir
Static HTML (none) /
Next.js (static export) next build out
Next.js (app router) next build .next
Astro astro build dist

Step 2: GitHub Actions CI (copy-paste ready)

Save as .github/workflows/deploy.yml in every site repo:

name: Deploy to Cloudflare Pages

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      deployments: write
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build
        env:
          NODE_ENV: production

      - name: Deploy to Cloudflare Pages
        uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
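          # On pull_request events github.ref_name is "<number>/merge"; head_ref holds the PR branch name
          command: pages deploy ./out --project-name=${{ vars.CF_PROJECT_NAME }} --branch=${{ github.head_ref || github.ref_name }}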

      - name: Comment preview URL on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # Expose the repository variable to the script; without this process.env.CF_PROJECT_NAME is undefined
          CF_PROJECT_NAME: ${{ vars.CF_PROJECT_NAME }}
        with:
          script: |
            const { data: deployments } = await github.rest.repos.listDeployments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              ref: context.payload.pull_request.head.sha,
              per_page: 1
            });
            if (deployments.length > 0) {
              // Await the call so a failed comment fails the step instead of being silently dropped
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.payload.pull_request.number,
                body: `Preview deployed: https://${context.payload.pull_request.head.sha.substring(0,8)}.${process.env.CF_PROJECT_NAME}.pages.dev`
              });
            }

For plain HTML sites with no build step, remove the Install dependencies and Build steps, and change the deploy path to ./ instead of ./out.

Step 3: Custom Domain (one-time per site)

# In Cloudflare dashboard: Pages > Project > Custom Domains > Add custom domain
# Or via API:
curl -X POST "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/pages/projects/$PROJECT_NAME/domains" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"name":"example.alai.no"}'

Because ALAI uses Cloudflare DNS, the CNAME/alias record is created automatically when adding the custom domain inside Cloudflare Pages.

Preview URL Per PR

Cloudflare Pages creates a preview URL automatically for every PR push. Format: https://<commit-hash>.<project-name>.pages.dev. No configuration needed. Preview environments are isolated and do not affect production traffic.
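
As a small illustration of the format above, the function below builds the preview URL the same way the workflow's PR comment step does, using the first 8 characters of the commit hash. It is a sketch that takes the stated format at face value; `preview_url` and the project name in the example are hypothetical.

```python
def preview_url(commit_sha: str, project_name: str) -> str:
    """Build the preview URL from the 8-character commit prefix and the project name."""
    return f"https://{commit_sha[:8]}.{project_name}.pages.dev"


print(preview_url("0123456789abcdef0123456789abcdef01234567", "alai-example-web"))
# https://01234567.alai-example-web.pages.dev
```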

Phantom Domain Removal Protocol

RULE: Before running vercel domains rm <phantom>, verify that the real domain is not implicitly routing through the phantom.

Safe sequence for phantom removal:

  1. vercel domains inspect <real-domain> — confirm direct attachment to authoritative project
  2. If real domain does NOT show direct attachment → vercel domains add <real> --project <authoritative> FIRST
  3. curl -sI https://<real> — confirm HTTP 200 with new attachment
  4. ONLY THEN: vercel domains rm <phantom> --yes
  5. Re-verify: curl -sI https://<real> still returns HTTP 200

Forbidden: Removing the phantom before the real domain is explicitly attached. Doing so risks breaking implicit routing.

Incident reference: 2026-04-20 kenyhot.pro cleanup, 35s downtime, MC #8526.

Evidence: /Users/makinja/system/evidence/kenyhot-vercel-cleanup/execution-log-*.txt

Rollback (< 60 seconds)

NOTE — wrangler 4.x breaking change: wrangler pages deployment rollback was removed in wrangler 4.x. The subcommand no longer exists and the /rollback CF API endpoint returns 405 for direct-upload deployments. Do NOT use it. Use the alternatives below. (Reference: wrangler upstream release notes; verified in Proveo pilot on basicconsulting.no, MC #8494.)

Primary — CF API re-deploy (copy-paste ready):

# Required env vars — set once per shell session or in ~/.zshrc
export CF_API_TOKEN="<your-cloudflare-api-token>"   # scope: Cloudflare Pages: Edit
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<project-name>"

# 1. List recent deployments and grab the target deployment ID
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | \
  python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"

# 2. Re-deploy the target deployment (replace <deployment-id> with ID from step 1)
curl -s -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"

CF reuses content-hash cache — files already on the CDN are not re-uploaded. Measured time: ~11 seconds. No build step required.
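
The inline python3 -c one-liner in step 1 is hard to read and fails if a deployment has no trigger metadata set to a dict. A more defensive equivalent could look like the sketch below. It assumes the response shape that the one-liner already relies on (`result`, `id`, `created_on`, `deployment_trigger.metadata.commit_message`); `summarize_deployments` is a hypothetical helper and the sample payload is made up for illustration.

```python
import json


def summarize_deployments(response_text: str, limit: int = 10) -> list[str]:
    """Return one 'id created_on commit_message' line per deployment."""
    payload = json.loads(response_text)
    lines = []
    for d in payload["result"][:limit]:
        trigger = d.get("deployment_trigger") or {}
        metadata = trigger.get("metadata") or {}
        message = metadata.get("commit_message") or ""
        lines.append(f"{d['id']} {d['created_on'][:19]} {message[:60]}".rstrip())
    return lines


# Made-up sample in the assumed response shape:
sample = json.dumps({"result": [
    {"id": "abc123", "created_on": "2026-04-20T10:15:30.123456Z",
     "deployment_trigger": {"metadata": {"commit_message": "fix: header"}}},
    {"id": "def456", "created_on": "2026-04-19T08:00:00.000000Z"},
]})
for line in summarize_deployments(sample):
    print(line)
```

The output of the curl call in step 1 can be piped into a script like this in place of the one-liner.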

Secondary — CF Dashboard rollback (GitHub-connected repos):

  1. Open https://dash.cloudflare.com > Pages > select project
  2. Click "Deployments" tab
  3. Find the target deployment row, click the three-dot menu
  4. Select "Rollback to this deployment"
  5. Confirm — live traffic switches in < 30 seconds

Total time to identify + execute: under 30 seconds for either path.

Secrets Management

Secret Storage How to use
CLOUDFLARE_API_TOKEN GitHub repository secret Set in: Repo > Settings > Secrets > Actions
CLOUDFLARE_ACCOUNT_ID GitHub repository secret Set in: Repo > Settings > Secrets > Actions (the workflow reads it as secrets.CLOUDFLARE_ACCOUNT_ID)
CF_PROJECT_NAME GitHub repository variable Set per repo, matches CF Pages project name
Build-time env vars (API keys, etc.) Cloudflare Pages > Settings > Environment variables Available during build and at runtime for SSR

Token scope required: Cloudflare Pages: Edit only. Create at: https://dash.cloudflare.com/profile/api-tokens

New-Site Template (one command)

Save as /Users/makinja/system/tools/alai-new-site.sh:

#!/usr/bin/env bash
# Usage: bash alai-new-site.sh <site-name> [--framework next|html|astro]
set -euo pipefail

SITE_NAME="${1:?Usage: alai-new-site.sh <site-name> [--framework next|html|astro]}"
FRAMEWORK="${3:-html}"
REPO_DIR="/Users/makinja/ALAI/sites/${SITE_NAME}"

echo "Creating site: ${SITE_NAME} (${FRAMEWORK})"

# 1. Create repo directory
mkdir -p "${REPO_DIR}/.github/workflows"

# 2. Copy workflow template
cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml "${REPO_DIR}/.github/workflows/deploy.yml"

# 3. Create wrangler.toml
cat > "${REPO_DIR}/wrangler.toml" <<EOF
name = "${SITE_NAME}"
compatibility_date = "2026-01-01"

[env.production]
EOF

# 4. Init git
cd "${REPO_DIR}" && git init && git add . && git commit -m "init: ${SITE_NAME}"

# 5. Create Cloudflare Pages project
npx wrangler pages project create "${SITE_NAME}" --production-branch main

echo "Done. Next: connect GitHub repo in Cloudflare dashboard."
echo "  https://dash.cloudflare.com/pages"

3. Maintenance

SSL Auto-Renewal

Cloudflare Pages provisions and auto-renews SSL certificates via Cloudflare's certificate authority. No manual action required. Certificates renew 30 days before expiry. The only failure mode is if a custom domain's DNS stops pointing to Cloudflare — the alert system in Section 4 catches this.

DNS Consolidation

Target: All domains to Cloudflare DNS.

Current state: 2 on Cloudflare, 1 on Vercel, 1 on AWS Route53, 3 on one.com nameservers, 3 unknown/third-party.

Migration steps per domain:

  1. Log in to registrar, change nameservers to ana.ns.cloudflare.com and bob.ns.cloudflare.com
  2. Cloudflare imports existing DNS records automatically (zone scan)
  3. Verify records in Cloudflare dashboard, then activate proxy (orange cloud) for web traffic

Registrar note: For domains registered at one.com (.no TIDs), a nameserver change takes 15 minutes to 4 hours. For .ba domains, the registrar controls the nameservers, so the change requires contacting them directly.

Dependency Updates (Renovate)

Save as renovate.json in every repo root:

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "schedule": ["every sunday"],
  "prCreationDelay": "0 minutes",
  "packageRules": [
    {
      "matchUpdateTypes": ["minor", "patch"],
      "automerge": true,
      "automergeType": "pr",
      "automergeStrategy": "squash"
    },
    {
      "matchUpdateTypes": ["major"],
      "automerge": false,
      "labels": ["dependencies", "major-update"]
    }
  ],
  "vulnerabilityAlerts": {
    "enabled": true,
    "labels": ["security"]
  }
}

Enable Renovate at https://github.com/apps/renovate for each repo. No server needed.

Backup Strategy

Asset What Where Retention
Source code Full git history GitHub (primary) Permanent
Source code mirror Bare git clone Azure VM /opt/backups/git-mirrors/ 90 days rolling
Cloudflare Pages deployments Build artifacts Cloudflare (automatic, last 25 builds) Automatic
DNS zone Export via CF API /Users/makinja/system/backups/dns/ (weekly cron) 12 months
Secrets inventory Encrypted note Vaultwarden (vault.basicconsulting.no) Permanent

DNS zone backup cron (add to crontab):

# Weekly DNS zone backup — runs every Sunday 02:00
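# Crontab entries must be a single line, and % must be escaped as \% (cron treats a bare % as a newline).
# CF_API_TOKEN must be defined in the crontab itself, because cron does not load the shell profile.
0 2 * * 0 curl -s "https://api.cloudflare.com/client/v4/zones?per_page=50" -H "Authorization: Bearer $CF_API_TOKEN" | node /Users/makinja/system/tools/cf-zone-export.js > /Users/makinja/system/backups/dns/zones-$(date +\%Y\%m\%d).json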
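
The backup table lists a 12-month retention for DNS zone exports, but the cron job only ever adds files. A pruning step could select expired exports by the date embedded in the file name. The sketch below assumes the `zones-YYYYMMDD.json` naming used by the cron job and approximates 12 months as 365 days; `expired_backups` is a hypothetical helper that only returns names and deletes nothing.

```python
import re
from datetime import date, timedelta

BACKUP_NAME = re.compile(r"^zones-(\d{4})(\d{2})(\d{2})\.json$")


def expired_backups(filenames: list[str], today: date, keep_days: int = 365) -> list[str]:
    """Return backup file names whose embedded date is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # never touch files that do not follow the naming scheme
        year, month, day = (int(part) for part in match.groups())
        try:
            backup_date = date(year, month, day)
        except ValueError:
            continue  # skip names with an impossible date
        if backup_date < cutoff:
            expired.append(name)
    return expired


names = ["zones-20250301.json", "zones-20260412.json", "notes.txt"]
print(expired_backups(names, today=date(2026, 4, 20)))  # ['zones-20250301.json']
```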

DR: Restore Site in < 60 Seconds

NOTE — wrangler 4.x breaking change: wrangler pages deployment rollback is removed in wrangler 4.x and must NOT be used. See MC #8494. Option A below replaces it with the CF API re-deploy path.

# Option A: CF API re-deploy (STANDARD DR PATH, replaces the removed wrangler rollback)
# Time: ~11 seconds. CF content-hash cache means zero bytes re-uploaded for unchanged files.
export CF_API_TOKEN="<your-cloudflare-api-token>"
export CF_ACCOUNT_ID="<your-cloudflare-account-id>"
export CF_PROJECT_NAME="<site-name>"

# List last 10 deployments
curl -s "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | \
  python3 -c "import sys,json; [print(d['id'], d['created_on'][:19], d.get('deployment_trigger',{}).get('metadata',{}).get('commit_message','')[:60]) for d in json.load(sys.stdin)['result'][:10]]"

# Re-deploy target deployment ID
curl -s -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/pages/projects/${CF_PROJECT_NAME}/deployments/<deployment-id>/retry" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" | python3 -c "import sys,json; r=json.load(sys.stdin); print('OK —', r['result']['id']) if r['success'] else print('ERROR:', r['errors'])"

# Option B: Redeploy from git (if CF deployment history cleared)
cd /path/to/site-repo && npm run build && \
npx wrangler pages deploy ./out --project-name=<site-name> --branch=main
# Time: 30-90 seconds depending on build

# Option C: Emergency static serve from Azure VM (last resort)
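scp -i ~/.ssh/azure_alai -r ./out alai-admin@4.223.110.181:/var/www/<site-name>
# Serve the copied files directly; a reverse proxy would need a separate server already listening on the target port
ssh -i ~/.ssh/azure_alai alai-admin@4.223.110.181 \
  "sudo caddy file-server --root /var/www/<site-name> --domain <domain>"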
# Time: ~120 seconds

Option A is the standard DR path. Target: < 60 seconds. Tested monthly as part of Proveo validation.


4. Alarms and Escalation

SENTINEL daemons live in /Users/makinja/system/tools/. Alerting routes to Slack #infra-alerts channel.

Alert Table

Metric Threshold Channel L1 Action L2 Action L3 Action
Uptime (HTTP 200) < 100% for 5 min #infra-alerts (Slack) Auto-retry; post alert Kelsey investigates: CF status page, DNS check Escalate to CEO; activate DR (Option C)
Build failure Any failed build on main #infra-alerts Alert with build URL + error log Kelsey reviews workflow, checks CF Pages build log Revert last commit: git revert HEAD && git push
SSL cert expiry < 30 days to expiry #infra-alerts Alert; verify CF auto-renewal is active Manual CF cert renewal trigger Contact Cloudflare support
5xx rate > 1% of requests over 10 min #infra-alerts Alert with request sample Kelsey checks CF Pages function logs Rollback via CF API re-deploy (Option A, DR section)
Traffic anomaly > 10x baseline in 5 min #infra-alerts Alert; verify CF rate limiting active Check CF analytics for origin; enable under-attack mode Contact Cloudflare support
Bandwidth overage > 80% of plan limit #infra-alerts Alert; review top assets Optimize images, add cache headers Upgrade CF plan or move heavy assets to R2

SENTINEL Integration

Add to /Users/makinja/system/tools/sentinel-uptime.sh:

#!/usr/bin/env bash
# Uptime check for all ALAI sites — run every 5 minutes via cron
SITES=(
  "https://alai.no"
  "https://snowit.ba"
  "https://getdrop.no"
  "https://app.getdrop.no"
  "https://basicconsulting.no"
  "https://basicfakta.no"
  "https://bilko-demo.alai.no"
  "https://kenyhot.pro"
  "https://merdzanovic.ba"
  "https://docs.alai.no"
  "https://sign.basicconsulting.no"
  "https://boards.basicconsulting.no"
  "https://vault.basicconsulting.no"
)

for SITE in "${SITES[@]}"; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$SITE")
  if [ "$STATUS" != "200" ] && [ "$STATUS" != "301" ] && [ "$STATUS" != "302" ]; then
    node /Users/makinja/system/tools/slack.js send "#infra-alerts" \
      "ALERT: $SITE returned HTTP $STATUS at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  fi
done

Crontab entry: */5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh
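
The alerting rule in the script treats 200, 301 and 302 as healthy and everything else as an alert. That includes `000`, which curl reports when it receives no HTTP response at all (for example on a timeout or DNS failure). The sketch below isolates the rule so it can be unit-tested without network access; `alert_message` is an illustrative name.

```python
from typing import Optional

HEALTHY_STATUSES = {"200", "301", "302"}


def alert_message(site: str, status: str, timestamp_utc: str) -> Optional[str]:
    """Return the alert text for an unhealthy site, or None if the site is healthy."""
    if status in HEALTHY_STATUSES:
        return None
    return f"ALERT: {site} returned HTTP {status} at {timestamp_utc}"


print(alert_message("https://alai.no", "200", "2026-04-20T10:00:00Z"))  # None
print(alert_message("https://alai.no", "000", "2026-04-20T10:00:00Z"))
# ALERT: https://alai.no returned HTTP 000 at 2026-04-20T10:00:00Z
```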


5. Cost

Per-Site Monthly Cost (Target State: Cloudflare Pages)

Site Current Platform Current Cost CF Pages Cost Notes
alai.no Cloudflare Pages €0 €0 Already there
snowit.ba GitHub Pages €0 €0 Migrate from GitHub Pages
getdrop.no Azure VM (Caddy) Shared with VM €0 Static landing only
app.getdrop.no Azure VM (Caddy) Shared with VM Not applicable Next.js app, stays on VM
basicconsulting.no Vercel €0 (Free) €0 Migrate from Vercel
basicfakta.no Vercel €0 (Free) €0 Migrate from Vercel
bilko-demo.alai.no GCP Cloud Run €5-10 €0 Static export possible; see note
kenyhot.pro Vercel €0 (Free) €0 Client site, coordinate
merdzanovic.ba Vercel €0 (Free) €0 Client site, coordinate
docs.alai.no Azure VM Shared with VM Not applicable BookStack = dynamic, stays on VM
sign.basicconsulting.no Azure VM Shared with VM Not applicable Documenso = dynamic, stays on VM
boards.basicconsulting.no Azure VM Shared with VM Not applicable Planka = dynamic, stays on VM
vault.basicconsulting.no Azure VM Shared with VM Not applicable Vaultwarden = dynamic, stays on VM
bilko-api, bilko-intesa-demo GCP Cloud Run €5-10 Not applicable Dynamic services, stay on GCP

Note on bilko-demo.alai.no: If Bilko web can be exported as static (Next.js output: 'export'), it moves to CF Pages for €0. If it requires server-side rendering (API routes, auth), it stays on GCP Cloud Run. This is a code-level decision for CodeCraft. Placeholder cost assumes migration succeeds.

Annual Total (Target State)

Provider Services After Migration Monthly Annual
Cloudflare Pages 9 static sites €0 €0
GCP Cloud Run Bilko API + demo services (if SSR) €5-10 €60-120
Azure VM BookStack, Documenso, Planka, Vaultwarden, Drop app €50 €600
GitHub Pages snowit.ba (until CF migration) €0 €0
one.com domains alai.no, basicconsulting.no, getdrop.no, bilko.io €17 €200
TOTAL €72-77/month €860-920/year
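
The totals can be reproduced from the rows above. The short check below sums the monthly figures as (low, high) pairs. Multiplying the monthly total by 12 gives €864-924, slightly above the table's €860-920, because the table lists the one.com domains as €200 per year rather than 12 × €17 = €204.

```python
# Monthly cost per provider in EUR as (low, high), taken from the table above.
monthly_eur = {
    "Cloudflare Pages": (0, 0),
    "GCP Cloud Run": (5, 10),
    "Azure VM": (50, 50),
    "GitHub Pages": (0, 0),
    "one.com domains": (17, 17),
}

low = sum(lo for lo, _ in monthly_eur.values())
high = sum(hi for _, hi in monthly_eur.values())

print(low, high)            # 72 77
print(low * 12, high * 12)  # 864 924
```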

Current vs Target Delta

Scale: 30 Sites by 2027

At 30 sites, Cloudflare Pages remains €0 (no per-site pricing). The only cost growth vectors are:

Projected 2027 total: €100-130/month at 30 sites. Cloudflare Pages does not contribute to this increase.


6. Migration Plan

Priority 1 = immediate (no dependencies, low risk). Priority 2 = planned (some coordination). Priority 3 = blocked/external.

Domain | Current Platform | Target Platform | Priority | Downtime Window | Dependency | MC Task
alai.no | Cloudflare Pages | Cloudflare Pages | - | None | None (already done) | Done
basicconsulting.no | Vercel | Cloudflare Pages | 1 | 0 (DNS already on CF) | Find repo | #8482
basicfakta.no | Vercel | Cloudflare Pages | 1 | < 5 min (NS change) | Find repo, change registrar NS | #8483
snowit.ba | GitHub Pages | Cloudflare Pages | 2 | < 5 min | Move DNS from AWS Route53 to CF | #8484
getdrop.no | Azure VM (Caddy) | Cloudflare Pages (static) | 1 | 0 (DNS on Vercel, move to CF) | Static export of Next.js landing | #8485
app.getdrop.no | Azure VM (Caddy) | Azure VM (stay) | - | None | Dynamic Next.js app | No action
bilko-demo.alai.no | GCP Cloud Run | Cloudflare Pages (if static export works) | 2 | 0 (DNS already on CF) | CodeCraft confirms static export | #8486
kenyhot.pro | Vercel | Cloudflare Pages | 3 | < 5 min | Coordinate with client, DNS on Vercel | #8487
merdzanovic.ba | Vercel | Cloudflare Pages | 3 | < 5 min | Coordinate with client, third-party DNS | #8488
bilko.io | None (down) | Cloudflare Pages | 2 | N/A (currently down) | Fix one.com DNS, point to CF | #8489
docs/sign/boards/vault.basicconsulting.no | Azure VM | Azure VM (stay) | - | None | Dynamic apps | No action
bilko-api, bilko-intesa-demo | GCP Cloud Run | GCP Cloud Run (stay) | - | None | Dynamic API services | No action

Total sites to migrate: 8 static sites. 4 stay on current platform (dynamic apps/services). 2 done (alai.no, basicconsulting.no).

Migration Log

Date | Domain | From | To | Downtime | TTFB Before | TTFB After | Notes
2026-04-20 | basicconsulting.no | Vercel (76.76.21.21) | CF Pages | ~60s | 114ms | 51ms (warm avg) | MC #8482. DNS: A->CNAME. Validation required domain re-add. TTFB improved 55%. Proveo pilot validated #8490.
2026-04-20 | bilko.io | one.com (down) | CF Pages | N/A (site was down) | N/A | 68ms (warm avg) | MC #8489. Apex CNAME not possible on one.com free tier (paid feature). Switched to Cloudflare NS (ana.ns.cloudflare.com, bob.ns.cloudflare.com). CF Pages zone ID: 62d89b79f0648d3fa1d045335a989ea7. DNS: CNAME flattening bilko.io → bilko-io.pages.dev (proxied), www → bilko-io.pages.dev.

Paused migrations:

Audit verdict for #8486 (bilko-demo.alai.no): Full-stack Next.js app with dynamic API routes. Stays on GCP Cloud Run. Not eligible for CF Pages migration.


7. Lessons Learned

2026-04-20 — CF Browser Integrity Check blocks headless clients

Incident: LightRAG 46h outage (MC #8487 followup)

Problem: Automation HTTP clients (Python urllib, Node fetch, etc.) get HTTP 403 (error code 1010) from CF-proxied hostnames with Browser Integrity Check (BIC) enabled, even when IP bypass or CF Access service tokens are configured.

Root cause: The BIC layer is evaluated BEFORE Access policies and blocks requests based on the User-Agent string. The default Python and Node User-Agents trigger the block, while curl, wget, and browser tests pass, which creates a false sense of security.

Fix: Create Cloudflare Configuration Rule disabling BIC per hostname. See rule INFRA-CF-001 (~/system/rules/cf-proxied-api-bic-whitelist.md) and BookStack page ID 2692.

Evidence: ~/system/evidence/lightrag-ingestion-investigation-20260420-215700.md

Hostnames affected: ollama.basicconsulting.no (fixed), lightrag.basicconsulting.no (verify needed)
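
To check whether a hostname is still affected, it helps to probe it with the same default User-Agent that automation uses rather than with curl. The sketch below is a minimal, hypothetical probe: the helper names are invented for illustration, and the check for "1010" in the response body assumes Cloudflare reports the error code in the 403 body as described in the problem statement above.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def default_user_agent() -> str:
    """Return the User-Agent urllib sends when the caller sets none."""
    opener = urllib.request.build_opener()
    headers = {key.lower(): value for key, value in opener.addheaders}
    return headers.get("user-agent", "")


def build_probe(url: str, user_agent: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request; with user_agent=None the urllib default UA is used."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def looks_like_bic_block(status: int, body: str) -> bool:
    """Heuristic (assumption): a 403 whose body mentions error code 1010."""
    return status == 403 and "1010" in body


def probe(url: str, user_agent: Optional[str] = None, timeout: float = 10.0) -> Tuple[int, str]:
    """Send the request and return (status, error body); body is empty on success."""
    try:
        with urllib.request.urlopen(build_probe(url, user_agent), timeout=timeout) as resp:
            return resp.status, ""
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")
```

Calling `probe("https://lightrag.basicconsulting.no/")` once with the default User-Agent and once with a browser-like string, then passing each result to `looks_like_bic_block`, would show whether only the automation client is rejected.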


8. DoD Checklist

Cloud Migration 2026

ALAI cloud migration master plan: 6-phase transition from ANVIL-only to cloud-hosted control plane

Master Plan — Cloud Migration

Phase 1 — Bitwarden Cloud Migration

Timeline: Days 1-3
Goal: Eliminate Vaultwarden SPOF as the very first step. Every subsequent phase depends on secrets being available globally, not just when the Azure VM is alive.
MC Task: #8494
Proveo Owner: Angie Jones
Status: PREVIEW — Parisa writing detailed runbook in parallel

Why First

Phase 2 onwards deploys to Azure Container Apps. Those containers need secrets at startup (Anthropic API key, Postgres connection string, Azure SP). If Vaultwarden is down, all containers fail to start. Fix the foundation before building on it.

Deliverables

Rollback Plan

Vaultwarden self-hosted remains running in parallel until Phase 6. If Bitwarden cloud import fails, fall back to self-hosted immediately. Keep vault export as encrypted offline backup in ~/system/backups/.

Proveo Validation Criteria

Test Owner: Angie Jones (Proveo)

  1. Fresh bw login alembasic@gmail.com on a machine with NO vault.basicconsulting.no access returns all expected items (GitHub token, Azure SP, Anthropic key, SSH key)
  2. alai login (once built in Phase 4) succeeds using cloud BW credentials
  3. Vaultwarden VM can be stopped for 1 hour with no agent failures on ANVIL
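
Criterion 1 above can be checked mechanically against the JSON that `bw list items` prints. The sketch below assumes that output is a JSON array of objects with a `name` field; the item names in `EXPECTED_ITEMS` are placeholders and must be replaced with the real vault entry names.

```python
import json

# Placeholder names (assumption): use the actual item names stored in the vault.
EXPECTED_ITEMS = ["GitHub token", "Azure SP", "Anthropic key", "SSH key"]


def missing_items(bw_list_json, expected=EXPECTED_ITEMS):
    """Return the expected item names that are absent from `bw list items` output."""
    items = json.loads(bw_list_json)
    present = {item.get("name", "") for item in items}
    return [name for name in expected if name not in present]


# Example with a hand-written sample instead of real vault output.
sample = json.dumps([{"name": "GitHub token"}, {"name": "Azure SP"}])
print(missing_items(sample))
```

On the test machine the input would come from the CLI, for example by saving `bw list items` to a file after unlocking the vault; an empty result means criterion 1 passes.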

Cost

Bitwarden cloud Teams: $4/user/month × 1 user = $4/month
vs Vaultwarden HA (2 VMs + Load Balancer): ~$88/month

Detailed Runbook

Parisa Tabriz (Securion) is writing the full step-by-step runbook in parallel. Once complete, it will be referenced here:
~/system/architecture/phase-1-bitwarden-runbook.md (pending)


Credit: ALAI, 2026

Phase 2 — MC + HiveMind API

Timeline: Weeks 1-2
Goal: Mission Control and HiveMind leave ANVIL and become cloud-hosted APIs. This is the biggest architectural change — SQLite becomes Postgres, local scripts become REST calls.
MC Task: #8495
Proveo Owner: Angie Jones
Status: PREVIEW — Kelsey working in parallel

Why Second

MC and HiveMind are the nervous system. Once they are cloud-hosted, every other phase can run from any machine without touching ANVIL.

Deliverables

Cost Estimate

Container Apps (2 apps, ~5h/day active, consumption plan):
  ~$1.50/month per app = $3/month total
  (Free grant: 180,000 vCPU-s/month covers most light usage)

Azure Postgres B1ms: ~$22-24/month (swedencentral, Flexible Server)
Azure Container Registry Basic: $5/month

Total Phase 2 additions: ~$30-32/month

Rollback Plan

mc.js still reads local SQLite if ALAI_MC_URL is not set. If Postgres or Container Apps fail, unset ALAI_MC_URL on ANVIL and operations continue locally. SQLite is kept in parallel for 30 days post-migration before decommission.
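
The rollback relies on a single switch: whether ALAI_MC_URL is set. mc.js itself is a Node script, so the Python below is only an illustration of that selection logic, not its implementation. The function name is hypothetical, treating an empty value as "unset" is an assumption, and the example URL is a placeholder.

```python
import os

# Local database path as used elsewhere in this document.
LOCAL_DB_PATH = "/Users/makinja/system/databases/mission-control.db"


def select_mc_backend(env=None):
    """Return ("cloud", url) when ALAI_MC_URL is set, else ("sqlite", local path)."""
    env = os.environ if env is None else env
    url = env.get("ALAI_MC_URL", "").strip()
    if url:
        return ("cloud", url)
    # Unset (or empty, by assumption) means: keep operating on local SQLite.
    return ("sqlite", LOCAL_DB_PATH)


print(select_mc_backend({}))
print(select_mc_backend({"ALAI_MC_URL": "https://mc-api.example.invalid"}))
```

Because the decision is made from the environment at call time, rolling back is just unsetting the variable on ANVIL; no code change or redeploy is involved.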

Proveo Validation Criteria

Test Owner: Angie Jones (Proveo)

  1. From ab-mac (no local SQLite): alai mc list returns live tasks
  2. From ANVIL: node ~/system/tools/mc.js list still works (backward compat)
  3. POST to mc-api: task appears in both mc.js list AND cloud Postgres within 2s
  4. Postgres automated backup: verify restore of 100-row sample matches source
  5. Container App scales to zero after 10min idle, cold starts under 5s

Detailed Implementation

Kelsey Hightower (FlowForge) is implementing Azure Container Apps + Postgres in parallel. Full runbook will be linked here once ready.


Current State vs Target State

Purpose: Visual comparison of ALAI's architecture today (ANVIL single-point-of-failure) vs the cloud-hosted control plane target state.
Source: ~/system/architecture/cloud-migration-master-plan.md

TODAY — SINGLE SPOF ARCHITECTURE

  ANVIL (makinja-sin-mac-studio)             Azure swedencentral
  100.103.49.98                              4.223.110.181
  ┌─────────────────────────────────┐        ┌──────────────────────────────┐
  │  CONTROL PLANE (all-in-one)     │        │  Supporting services (1 VM)  │
  │                                 │        │  Standard_B2als_v2, 2vCPU    │
  │  Mission Control (mc.js)        │        │  4GB RAM, 30GB SSD           │
  │  └─ SQLite mission-control.db   │        │                              │
  │     8378 tasks                  │        │  BookStack (docs)            │
  │                                 │        │  Vaultwarden (secrets — SPOF)│
  │  HiveMind (hivemind.db)         │        │  Planka (boards)             │
  │  Agent runner (pi-orchestrator) │        │  Documenso (signing)         │
  │  30 LaunchAgent daemons         │        │  Grafana / Prometheus        │
  │  Rules/skills/agents (git)      │        │  Caddy (reverse proxy)       │
  │                                 │        │                              │
  │  LightRAG (Docker :9621)        │        │  Cost estimate: $5-53/month  │
  │  Neo4j (Docker :7474/:7687)     │        │  (Azure Founders Hub credit) │
  │  Knowledge graph (481MB)        │        └──────────────────────────────┘
  │                                 │
  │  Ollama :11434                  │        Azure Blob (alaibackups0ebb)
  │  qwen3.5:27b (17G)              │        ┌──────────────────────────────┐
  │  orchestrator:latest (23G)      │        │  system-db-backups           │
  │  alaiml-task/tender/email (3G)  │        │  system-git-bundles          │
  │  qwen2.5-coder:32b (23G)        │        │  bitwarden-exports           │
  │  bge-m3 + others (~40G)         │        │  Cost: ~$2.40/month          │
  └─────────────────────────────────┘        └──────────────────────────────┘
           │ LAN only (10.0.0.2)
  ┌────────▼────────────────────────┐
  │  FORGE (Mac Mini)               │
  │  devstral:24b, qwen2.5-coder    │
  │  NOT on Tailscale — LAN only    │
  └─────────────────────────────────┘

  Tailscale mesh: 4 nodes
    makinja-sin-mac-studio  100.103.49.98
    ab-mac                  100.118.37.71
    basicass-mac-mini       100.104.164.86
    iphone181               100.93.161.73

  NOTE: ANVIL Ollama :11434 NOT reachable from ab-mac (port timeout verified).
  NOTE: 306 files in ~/system/ hardcode localhost:11434 — zero portability today.

SPOF inventory (4 critical):
  [1] ANVIL dead       → mc.js, HiveMind, agents, LightRAG, Ollama ALL stop
  [2] FORGE dead       → devstral/coder workload stops (Anthropic can substitute)
  [3] Azure VM dead    → Vaultwarden down, secrets inaccessible, agents cannot bootstrap
  [4] Local network    → FORGE permanently isolated (LAN-only, no Tailscale)
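
The count of files that hardcode localhost:11434 can be reproduced, and tracked as it shrinks, with a small scan. This is a minimal sketch, not the tool that produced the figure of 306; the function name is hypothetical and every file is read as text with undecodable bytes ignored.

```python
from pathlib import Path

NEEDLE = "localhost:11434"


def files_with_hardcoded_endpoint(root, needle=NEEDLE):
    """Return, in sorted order, the files under root whose text contains needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            # Unreadable files (permissions, broken links) are skipped.
            continue
        if needle in text:
            hits.append(path)
    return hits
```

Running `len(files_with_hardcoded_endpoint(Path.home() / "system"))` on ANVIL gives the current count; directories such as `.git` or `node_modules` may need to be excluded to get a meaningful number.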

TARGET — CLOUD-HOSTED CONTROL PLANE + THIN CLIENT

  CLIENT (any OS — new laptop, travel machine, etc.)
  ┌──────────────────────────────────────────────────┐
  │  alai-cli (single installable package)           │
  │  brew install alai  |  npm install -g @alai/cli  │
  │  winget install alai  |  apt install alai-cli    │
  │                                                  │
  │  alai login     → OAuth2 PKCE → Azure AD B2C    │
  │  alai start     → connects to cloud APIs         │
  │  alai mc list   → proxies to MC API              │
  │  alai agent run → dispatches to agent runner     │
  │                                                  │
  │  Claude Code CLI (installed separately)          │
  │  ~/.claude/ cloned from git on login             │
  └──────────────────────────────────────────────────┘
                  │ HTTPS (Azure Front Door or direct)
                  │ Auth: Azure AD B2C JWT
  ┌───────────────▼──────────────────────────────────┐
  │  CLOUD CONTROL PLANE (Azure Container Apps)      │
  │  Region: swedencentral (existing subscription)   │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  MC API          │  │  Agent Runner API    │  │
  │  │  REST + WebSocket│  │  POST /run           │  │
  │  │  → Postgres      │  │  → dispatches agents │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  HiveMind API   │  │  Skills/Rules Proxy  │  │
  │  │  pub/sub        │  │  serves ~/system/     │  │
  │  │  → Postgres     │  │  content from Git    │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  Auth API        │  │  Secrets Proxy       │  │
  │  │  Azure AD B2C   │  │  → Bitwarden cloud   │  │
  │  │  JWT issuance   │  │  (no self-hosted BW) │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                  │
  │  Azure Database for Postgres (Flexible Server)   │
  │  Burstable B1ms — mission_control + hivemind     │
  │  (migrated from local SQLite)                    │
  │                                                  │
  │  Azure Container Registry (private)              │
  │  MC API, HiveMind, Agent Runner images           │
  └──────────────────────────────────────────────────┘
                  │ Tailscale (encrypted WireGuard)
                  │ OR public HTTPS (for Anthropic-only agents)
  ┌───────────────▼──────────────────────────────────┐
  │  DATA PLANE (stays on hardware)                  │
  │                                                  │
  │  ANVIL 100.103.49.98          FORGE 10.0.0.2     │
  │  Ollama :11434 (primary)      devstral:24b        │
  │  qwen3.5:27b                  qwen2.5-coder:32b  │
  │  alaiml-task/tender/email     (add to Tailscale) │
  │  orchestrator:latest          :11434              │
  │  LightRAG + Neo4j             (Phase 5)          │
  │                                                  │
  │  CLOUD ML FALLBACK (Phase 5)                     │
  │  Together.ai — Llama-3.3-70B  $0.88/M tokens    │
  │  Triggered only when ANVIL:11434 unreachable     │
  └──────────────────────────────────────────────────┘

  SECRETS (Phase 6 — replaces self-hosted Vaultwarden)
  ┌──────────────────────────────────────────────────┐
  │  Bitwarden cloud (Teams plan)                    │
  │  $4/user/month — 1 user = $4/month               │
  │  HA by default — Bitwarden's infrastructure      │
  │  alai-cli integrates via BW CLI at login         │
  └──────────────────────────────────────────────────┘

Key Differences

Component | Current State (ANVIL SPOF) | Target State (Cloud Control Plane)
Mission Control | SQLite on ANVIL disk | Postgres + MC API (Azure Container Apps)
HiveMind | SQLite on ANVIL disk | Postgres + HiveMind API (Azure Container Apps)
Agent Runner | pi-orchestrator on ANVIL only | Cloud agent-runner (Anthropic-powered agents), ANVIL for fine-tuned models
Secrets | Vaultwarden on single Azure VM | Bitwarden cloud ($4/month, HA by default)
Client Bootstrap | Manual setup, ANVIL-dependent | brew install alai && alai login (under 10 minutes, any OS)
Ollama | ANVIL only, FORGE LAN-isolated | ANVIL + FORGE (Tailscale) + Together.ai cloud fallback
Cost | $27-106/month (mostly hidden by Azure credit) | $108-165/month (transparent, no hidden dependencies)
ANVIL Offline Impact | Total system outage | Cloud services continue, fine-tuned models pause gracefully
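
The Ollama row implies an ordered fallback: local hardware first, cloud only when it is unreachable. The sketch below shows one way to express that. The order ANVIL, then FORGE, then cloud is an assumption, the addresses are taken from the diagrams above, and `CLOUD_FALLBACK` is just a label, not a real API endpoint.

```python
import socket

# (label, host, port) in order of preference. The ordering is an assumption.
LOCAL_ENDPOINTS = [
    ("ANVIL", "100.103.49.98", 11434),
    ("FORGE", "10.0.0.2", 11434),
]
CLOUD_FALLBACK = "together-ai"  # placeholder label for the cloud ML fallback


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backend(reachable=tcp_reachable):
    """Return the first reachable local endpoint label, else the cloud fallback."""
    for name, host, port in LOCAL_ENDPOINTS:
        if reachable(host, port):
            return name
    return CLOUD_FALLBACK
```

Passing the reachability check in as a parameter keeps the routing decision testable without a network, for example `pick_backend(lambda host, port: False)` exercises the cloud path.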

SPOF Elimination

4 SPOFs removed:

  1. ANVIL death — control plane (MC, HiveMind, agent runner) migrates to cloud. ANVIL offline = Ollama workloads pause, everything else continues.
  2. Vaultwarden VM death — secrets migrate to Bitwarden cloud (HA by default). No more single-VM secret dependency.
  3. Network isolation — FORGE joins Tailscale. Cloud services can reach FORGE for code tasks even when ANVIL is down.
  4. Workstation lock-in: alai-cli works from any machine. No more "John only works from ANVIL."

ANVIL SPOF Elimination Plan (2026-04-20)


Status: DRAFT — Awaiting Proveo validation + Alem approval
Author: Kelsey Hightower / FlowForge
Date: 2026-04-20
MC Task: #8515 ANVIL SPOF elimination sprint
Deadline: 2026-05-01


ANVIL SPOF Elimination Plan

Author: FlowForge (Kelsey Hightower) | MC Task #8515

Date: 2026-04-20

Status: DRAFT — Awaiting Alem approval before any implementation


Executive Summary

ANVIL (Mac Studio M3 Ultra, 96 GB, 100.103.49.98) is a single point of failure. One power outage, kernel panic, or SSD failure ends all ALAI operations — mission control, agent fleet, Ollama inference, all daemons. Currently only 2 of ~67 production SQLite databases are replicated to Azure Blob Storage. RTO is effectively infinite. This plan eliminates the SPOF across 9 sequential phases.

Key finding: FORGE already exists. It is a Mac Studio M3 Ultra 256 GB connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE) with sub-millisecond latency, AND accessible via Tailscale at 100.104.164.86. No new hardware purchase is needed. Budget impact: ~0 EUR/month additional infrastructure cost (FORGE is already owned and powered).

Targets: RPO < 60s | RTO < 5 min (manual failover Phase 1, automatic Phase 2+)


Architecture Overview

ANVIL (primary)                    FORGE (warm standby)
Mac Studio M3 Ultra 96GB           Mac Studio M3 Ultra 256GB
100.103.49.98 (Tailscale)          100.104.164.86 (Tailscale)
10.0.0.1 (Thunderbolt)             10.0.0.2 (Thunderbolt)
         │                                  │
         │  Thunderbolt Bridge (< 1ms)      │
         └──────────────────────────────────┘
                          │
                          ▼
              Azure Blob Storage
              alaibackups0ebb
              system-db-backups container
              (litestream WAL segments, all DBs)

All replication flows ANVIL → Azure → FORGE (pull-based via litestream restore). FORGE does NOT write back to Azure. Azure is the single durable WAL store.
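
The pull side on FORGE amounts to running `litestream restore` once per database. The sketch below only assembles and runs that command. It assumes the `-config`, `-if-replica-exists`, and `-o` flags as documented for Litestream 0.3.x (verify against the installed version), and it assumes FORGE has a copy of the same config file at the path shown.

```python
import subprocess

# Assumption: FORGE keeps a copy of the ANVIL config at the same path.
DEFAULT_CONFIG = "/Users/makinja/system/config/litestream.yml"


def restore_command(db_path, output_path, config_path=DEFAULT_CONFIG):
    """Build the argv for restoring one database from its replica.

    db_path is the database path as written in litestream.yml;
    output_path is where the restored copy is written on FORGE.
    """
    return [
        "litestream", "restore",
        "-config", config_path,
        "-if-replica-exists",  # do not fail for DBs that have no replica yet
        "-o", output_path,
        db_path,
    ]


def run_restore(db_path, output_path, config_path=DEFAULT_CONFIG):
    """Run the restore and raise CalledProcessError if litestream fails."""
    subprocess.run(restore_command(db_path, output_path, config_path), check=True)


print(" ".join(restore_command(
    "/Users/makinja/system/databases/mission-control.db",
    "/tmp/mission-control.db",
)))
```

Restoring to a temporary output path and swapping it into place afterwards avoids leaving a half-written database on the standby if a restore is interrupted.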


Phase 1 — Litestream Expansion (all ~67 DBs)

1.1 Database Tier Classification

Priority rationale: P0 = system cannot function without it | P1 = major feature loss | P2 = historical/cache only.

P0 — Mission Critical (system stops without these)

Database | Size | Write Freq | Justification
mission-control.db | 26 MB | Very high | Primary task ledger: all MC operations. CURRENTLY REPLICATED.
hivemind.db | 162 MB | High | Agent memory, HiveMind knowledge graph. CURRENTLY REPLICATED.
tasks.db | 4 KB | High | Active task queue: active work in flight
costs.db | 256 KB | High | Token cost tracking, budget enforcement
events.db | 14 MB | High | System event bus: orchestrator depends on this
orchestrator-queue.db | 28 KB | High | Active agent job queue: jobs lost = work lost
orchestrator-workers.db | 36 KB | High | Worker state: active session tracking
durable-runner.db | 896 KB | Medium | Durable task execution state
session-index.db | 56 MB | High | Agent session state: all active sessions
knowledge.db | 192 MB | Medium | RAG knowledge base: primary retrieval corpus
emails.db | 0 B (active) | High | Email agent state: initialized on first write
email-inbox.db | 3.1 MB | High | Live email queue
alem-directives.db | active WAL | High | CEO directives: highest trust data

P0 — Financial / Legal (loss = regulatory exposure)

Database | Size | Write Freq | Justification
fiken.db | 0 B (active) | Medium | Fiken accounting integration: financial records
invoices.db | 36 KB | Medium | Invoice state: revenue tracking
contracts.db | 40 KB | Low | Signed contracts: legal documents
leads.db | 256 KB | Medium | Sales pipeline: business development

P1 — Operational (system degrades without these)

Database | Size | Write Freq | Justification
agent-routing.db | 4.1 MB | Medium | Routing decisions, agent assignment
bee-index.db | 4.2 MB | Medium | Bee task index
bih-tenders.db | 640 KB | Low | BiH market tenders: business intelligence
browser-tasks.db | active WAL | Medium | Browser automation queue
companies.db | 0 B (active) | Low | Company registry
contacts.db | 192 KB | Low | CRM contacts
deploy-registry.db | 16 KB | Low | Deployment history
design-reviews.db | 64 KB | Low | Design review state
distill.db | 2.0 MB | Medium | Knowledge distillation cache
documents.db | 32 KB | Low | Document registry
drafts.db | 360 KB | Medium | Draft content
drift.db | active WAL | Medium | Config drift detection
email-audit.db | 256 KB | Medium | Email audit trail
email-briefing.db | 0 B (active) | Low | Daily briefing state
email-index.db | 0 B (active) | Low | Email search index
email-tracking.db | 36 KB | Medium | Email delivery tracking
escalations.db | 24 KB | Medium | Escalation queue
facts.db | 20 KB | Low | System facts store
flywheel.db | 432 MB | Low | Flywheel learning data: largest DB
goals.db | 44 KB | Medium | OKR / goal tracking
guardrails-audit.db | 10 MB | Medium | Safety audit trail
health-events.db | 15 MB | High | System health events
hivemind-archive.db | 6.7 MB | Low | HiveMind historical archive
master-control.db | 0 B (active) | Medium | Master control state
mc.db | 0 B (active) | Medium | Mission control alias
minions.db | 192 KB | Medium | Minion agent registry
observability.db | 44 KB | Medium | Metrics and traces
orchestrator-events.db | 0 B (active) | Medium | Orchestrator event log
pipeline.db | active WAL | Medium | CI/CD pipeline state
projects.db | 40 KB | Low | Project registry
routing-outcomes.db | 192 KB | Medium | Tier routing outcome log
skill-improvements.db | 20 KB | Low | Skill improvement tracking
skill-registry.db | 128 KB | Low | Agent skill registry
sprint-pipeline.db | 32 KB | Medium | Sprint pipeline state
strategy-tracker.db | 128 KB | Low | Strategic initiative tracking
teams.db | 40 KB | Low | Team registry
tenders.db | 384 KB | Low | Norwegian tender data
tickets.db | active WAL | Medium | Support ticket tracking
tool-audit.db | 6.1 MB | Medium | Tool usage audit
tool-registry.db | 128 KB | Low | Tool registry
trace-events.db | 52 MB | High | Distributed trace store
applications-tracker.db | 12 KB | Low | Job/grant applications

P2 — Cache / Reconstructible (loss = inconvenience only)

Database | Size | Write Freq | Justification
baikal-caldav.db | 108 KB | Low | CalDAV cache: reconstructible from Baikal
prompt-cache.db | 320 KB | Medium | LLM prompt cache: can warm from scratch
prompt-metrics.db | 28 KB | Low | Prompt performance metrics
rag-cache.db | active WAL | Medium | RAG response cache: reconstructible
semantic-reuse-index.db | 192 KB | Medium | Semantic cache: reconstructible
stbs.db | 0 B (active) | Low | STBS data: empty
telemetry.db | 24 KB | Medium | Telemetry: can lose without ops impact
token-cost.db | active WAL | Medium | Cost log: reconstructible from API receipts
usage.db | 0 B (active) | Low | Usage tracking: empty
vcr.db | active WAL | Low | HTTP cassette cache: reconstructible

1.2 Retention Strategy

Current retention for the 2 replicated DBs: 72h. This is insufficient for P0.

Tier | Retention | Justification
P0 (mission-critical) | 7d | One week: covers weekend + Monday incident recovery. 72h is too tight: if a silent corruption is not caught in 3 days, all WAL segments are gone.
P0 (financial/legal) | 30d | Regulatory prudence. fiken.db, invoices.db, contracts.db. Matches typical invoice dispute windows.
P1 | 72h | Current default. Operationally acceptable.
P2 | 24h | Cache data. Disk cost matters more than recovery depth.

Retention-check-interval: 1h for all tiers (current default, correct).

Sync-interval: 1s for P0 and P1 (30s for large P1 databases such as flywheel.db). 10s for P2, to reduce Azure transaction cost on low-value data.
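
Because every database entry in litestream.yml differs only by name and tier, the entries can be generated from the tier table instead of being maintained by hand. The sketch below is one possible generator, not the tool used to produce the config. The `<stem>-abs` replica naming is an assumption that matches most but not all entries (mission-control uses `mc-abs`), and the endpoint, bucket, and directory are copied from the config in section 1.3.

```python
# Tier settings from the retention table (7d = 168h, 30d = 720h).
TIERS = {
    "P0-critical": {"retention": "168h", "sync": "1s"},
    "P0-financial": {"retention": "720h", "sync": "1s"},
    "P1": {"retention": "72h", "sync": "1s"},
    "P2": {"retention": "24h", "sync": "10s"},
}
ENDPOINT = "https://alaibackups0ebb.blob.core.windows.net"
BUCKET = "system-db-backups"
DB_DIR = "/Users/makinja/system/databases"


def render_entry(stem, tier, sync_override=None, replica_name=None):
    """Render one litestream.yml database entry as YAML text."""
    settings = TIERS[tier]
    sync = sync_override or settings["sync"]
    name = replica_name or f"{stem}-abs"
    return "\n".join([
        f"  - path: {DB_DIR}/{stem}.db",
        "    replicas:",
        f"      - name: {name}",
        "        type: abs",
        f"        endpoint: {ENDPOINT}",
        f"        bucket: {BUCKET}",
        f"        path: litestream/{stem}",
        f"        retention: {settings['retention']}",
        "        retention-check-interval: 1h",
        f"        sync-interval: {sync}",
    ])


print("dbs:")
print(render_entry("mission-control", "P0-critical", replica_name="mc-abs"))
print()
print(render_entry("flywheel", "P1", sync_override="30s"))
```

Generating the file from a list of (database, tier) pairs keeps retention and sync settings consistent within a tier and makes a tier change a one-line edit.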

Azure storage cost estimate at current sizes (~1.2 GB total databases):

1.3 New litestream.yml

Path: /Users/makinja/system/config/litestream.yml

Note on flywheel.db (432 MB): Include in P1 but with sync-interval: 30s to reduce churn.

Note on knowledge.db (192 MB): P0, sync-interval 1s, because it is actively written by RAG ingestion.

# Litestream — SQLite streaming replication to Azure Blob Storage
# Primary: ANVIL (Mac Studio M3 Ultra 96GB, 100.103.49.98)
# Config: /Users/makinja/system/config/litestream.yml
# Auth: Azure SP (alai-backup-writer) via client credentials
#       SP: alai-backup-writer (1a0b3018-0c31-474b-918f-531b0a29a669)
#       SP has Storage Blob Data Contributor on system-db-backups container
#       Litestream reads AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID from env
# Launch: com.alai.litestream.plist (sets env vars in EnvironmentVariables block)
# Updated: 2026-04-20 — ANVIL SPOF Elimination Sprint (MC #8515)
#
# Tier reference:
#   P0-critical: retention 7d, sync 1s
#   P0-financial: retention 30d, sync 1s
#   P1: retention 72h, sync 1s (or 30s for large DBs)
#   P2: retention 24h, sync 10s

dbs:
  # ── P0 MISSION CRITICAL ──────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control
        retention: 168h   # 7 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/hivemind.db
    replicas:
      - name: hivemind-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind
        retention: 168h   # 7 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tasks.db
    replicas:
      - name: tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tasks
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/costs.db
    replicas:
      - name: costs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/costs
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/events.db
    replicas:
      - name: events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/events
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-queue.db
    replicas:
      - name: orch-queue-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-queue
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-workers.db
    replicas:
      - name: orch-workers-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-workers
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/durable-runner.db
    replicas:
      - name: durable-runner-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/durable-runner
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/session-index.db
    replicas:
      - name: session-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/session-index
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/knowledge.db
    replicas:
      - name: knowledge-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/knowledge
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/emails.db
    replicas:
      - name: emails-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/emails
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-inbox.db
    replicas:
      - name: email-inbox-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-inbox
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/alem-directives.db
    replicas:
      - name: alem-directives-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/alem-directives
        retention: 168h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P0 FINANCIAL / LEGAL ─────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/fiken.db
    replicas:
      - name: fiken-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/fiken
        retention: 720h   # 30 days
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/invoices.db
    replicas:
      - name: invoices-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/invoices
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/contracts.db
    replicas:
      - name: contracts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contracts
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/leads.db
    replicas:
      - name: leads-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/leads
        retention: 720h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P1 OPERATIONAL ───────────────────────────────────────────────────────────

  - path: /Users/makinja/system/databases/agent-routing.db
    replicas:
      - name: agent-routing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/agent-routing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/bee-index.db
    replicas:
      - name: bee-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bee-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/bih-tenders.db
    replicas:
      - name: bih-tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/bih-tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/browser-tasks.db
    replicas:
      - name: browser-tasks-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/browser-tasks
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/companies.db
    replicas:
      - name: companies-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/companies
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/contacts.db
    replicas:
      - name: contacts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/contacts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/deploy-registry.db
    replicas:
      - name: deploy-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/deploy-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/design-reviews.db
    replicas:
      - name: design-reviews-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/design-reviews
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/distill.db
    replicas:
      - name: distill-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/distill
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/documents.db
    replicas:
      - name: documents-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/documents
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/drafts.db
    replicas:
      - name: drafts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drafts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/drift.db
    replicas:
      - name: drift-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/drift
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-audit.db
    replicas:
      - name: email-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-briefing.db
    replicas:
      - name: email-briefing-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-briefing
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-index.db
    replicas:
      - name: email-index-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-index
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/email-tracking.db
    replicas:
      - name: email-tracking-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/email-tracking
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/escalations.db
    replicas:
      - name: escalations-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/escalations
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/facts.db
    replicas:
      - name: facts-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/facts
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/flywheel.db
    replicas:
      - name: flywheel-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/flywheel
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 30s   # 432MB — throttle sync to reduce Azure transactions

  - path: /Users/makinja/system/databases/goals.db
    replicas:
      - name: goals-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/goals
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/guardrails-audit.db
    replicas:
      - name: guardrails-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/guardrails-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/health-events.db
    replicas:
      - name: health-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/health-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/hivemind-archive.db
    replicas:
      - name: hivemind-archive-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/hivemind-archive
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/master-control.db
    replicas:
      - name: master-control-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/master-control
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/mc.db
    replicas:
      - name: mc-db-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mc-db
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/minions.db
    replicas:
      - name: minions-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/minions
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/observability.db
    replicas:
      - name: observability-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/observability
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/orchestrator-events.db
    replicas:
      - name: orch-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/orchestrator-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/pipeline.db
    replicas:
      - name: pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/projects.db
    replicas:
      - name: projects-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/projects
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/routing-outcomes.db
    replicas:
      - name: routing-outcomes-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/routing-outcomes
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/skill-improvements.db
    replicas:
      - name: skill-improvements-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-improvements
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/skill-registry.db
    replicas:
      - name: skill-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/skill-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/sprint-pipeline.db
    replicas:
      - name: sprint-pipeline-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/sprint-pipeline
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/strategy-tracker.db
    replicas:
      - name: strategy-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/strategy-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/teams.db
    replicas:
      - name: teams-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/teams
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tenders.db
    replicas:
      - name: tenders-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tenders
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tickets.db
    replicas:
      - name: tickets-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tickets
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tool-audit.db
    replicas:
      - name: tool-audit-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-audit
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/tool-registry.db
    replicas:
      - name: tool-registry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/tool-registry
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/trace-events.db
    replicas:
      - name: trace-events-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/trace-events
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  - path: /Users/makinja/system/databases/applications-tracker.db
    replicas:
      - name: applications-tracker-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/applications-tracker
        retention: 72h
        retention-check-interval: 1h
        sync-interval: 1s

  # ── P2 CACHE / RECONSTRUCTIBLE ───────────────────────────────────────────────

  - path: /Users/makinja/system/databases/baikal-caldav.db
    replicas:
      - name: baikal-caldav-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/baikal-caldav
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/prompt-cache.db
    replicas:
      - name: prompt-cache-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-cache
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/prompt-metrics.db
    replicas:
      - name: prompt-metrics-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/prompt-metrics
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/semantic-reuse-index.db
    replicas:
      - name: semantic-reuse-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/semantic-reuse-index
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/stbs.db
    replicas:
      - name: stbs-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/stbs
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/telemetry.db
    replicas:
      - name: telemetry-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/telemetry
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/token-cost.db
    replicas:
      - name: token-cost-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/token-cost
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/usage.db
    replicas:
      - name: usage-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/usage
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

  - path: /Users/makinja/system/databases/vcr.db
    replicas:
      - name: vcr-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/vcr
        retention: 24h
        retention-check-interval: 1h
        sync-interval: 10s

1.4 Implementation Steps (ANVIL)

  1. Stop litestream: launchctl stop com.alai.litestream
  2. Replace /Users/makinja/system/config/litestream.yml with the config above.
  3. Validate config: /opt/homebrew/bin/litestream replicate -config /Users/makinja/system/config/litestream.yml -config-validate
  4. Start litestream: launchctl start com.alai.litestream
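  5. Verify all DBs appear in Azure: az storage blob list --container-name system-db-backups --account-name alaibackups0ebb --prefix litestream/ --auth-mode login --num-results "*" --query "[].name" -o tsv | cut -d/ -f2 | sort -u | wc -l (expect ~67 distinct database prefixes; counting raw blobs would overcount, since each DB uploads many files).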
  6. Watch logs for errors: tail -f /Users/makinja/system/logs/litestream-error.log
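
The verification in step 5 can also be scripted so that missing databases are named rather than just counted. A minimal sketch, assuming the blob names have already been fetched (for example with az storage blob list ... -o tsv); distinctDbPrefixes and missingDbs are hypothetical helpers, and the sample blob names are made up for illustration:

```javascript
// Hypothetical helper: lists the distinct database prefixes found directly
// under "litestream/" in a list of blob names.
function distinctDbPrefixes(blobNames, root = 'litestream/') {
  const prefixes = new Set();
  for (const name of blobNames) {
    if (!name.startsWith(root)) continue;
    const db = name.slice(root.length).split('/')[0];
    if (db) prefixes.add(db);
  }
  return [...prefixes].sort();
}

// Returns the expected replica paths that have no blobs in Azure yet.
function missingDbs(expected, blobNames) {
  const present = new Set(distinctDbPrefixes(blobNames));
  return expected.filter((db) => !present.has(db));
}

// Made-up sample blob names; real names depend on the litestream version.
const sample = [
  'litestream/companies/file-a',
  'litestream/companies/file-b',
  'litestream/contacts/file-a',
];
console.log(JSON.stringify(distinctDbPrefixes(sample)));                // ["companies","contacts"]
console.log(JSON.stringify(missingDbs(['companies', 'drafts'], sample))); // ["drafts"]
```

Feeding it the replica paths from litestream.yml as the expected list would show which databases have not started replicating.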

Phase 2 — FORGE Hardware / OS Decision

2.1 FORGE Already Exists — Hardware Decision Is Made

FORGE is confirmed to be a second Mac Studio M3 Ultra with 256 GB unified memory, connected to ANVIL via Thunderbolt Bridge (10.0.0.1 = ANVIL, 10.0.0.2 = FORGE). Tailscale IP: 100.104.164.86. User: basicas. It is already running Ollama with models including devstral:24b, qwen3:32b, deepseek-r1:70b, qwen3-coder, and bge-m3.

No hardware purchase is required. Monthly infrastructure cost delta: 0 EUR (already owned).

2.2 Why FORGE Wins Over Every Alternative

| Option | Cost/mo | Latency to ANVIL | Apple Silicon | macOS parity | Verdict |
| --- | --- | --- | --- | --- | --- |
| FORGE (Mac Studio M3U 256GB, owned) | 0 EUR | < 1ms (Thunderbolt) | Yes (M3 Ultra) | Yes (same LaunchAgent ecosystem) | CHOSEN |
| Mac Mini M4 Pro (purchase) | ~50 EUR amortized | < 1ms if local | Yes | Yes | Redundant (FORGE exists) |
| Hetzner Linux VM (CCX33) | ~30-50 EUR | 10-30ms (internet) | No (x86) | No (systemd, not launchd) | Budget option only if FORGE fails |
| Azure VM (Sweden Central) | ~60-80 EUR | 10-30ms | No | No | Closest to Azure storage but no Apple Silicon |

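Decision: Use FORGE as warm standby. Zero additional cost. The Thunderbolt link makes direct ANVIL-to-FORGE traffic (health checks, failover scripts) effectively local. Database replication itself flows through Azure Blob Storage (see Phase 3) and is expected to complete in well under 60s.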

2.3 FORGE Bootstrap Prerequisites

FORGE already runs Ollama. What is missing:

  1. litestream installed on FORGE (check: brew list litestream on basicas@FORGE)
  2. Azure SP credentials injected into FORGE environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
  3. ~/system/databases/ directory created on FORGE
  4. litestream-restore.sh daemon script written and loaded as LaunchAgent on FORGE
  5. SSH key access from ANVIL to FORGE for health check and failover scripts

Phase 3 — Continuous Restore on FORGE (< 60s RPO)

3.1 Architecture

FORGE runs litestream restore in a watch loop per database. Litestream 0.5.x does not have a native watch mode; each restore call rebuilds the database from a snapshot plus WAL segments. The recommended approach is a shell script loop that calls litestream restore repeatedly with a short interval.

FORGE must not run litestream replicate against the same Azure bucket paths, because that would make it a second writer to ANVIL's replicas. Instead, FORGE acts as a replica-only consumer: a litestream restore daemon that continuously polls Azure for new WAL segments.

3.2 Continuous Restore Strategy

Use litestream restore with the -if-replica-exists flag in a loop:

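#!/usr/bin/env bash
# /Users/basicas/system/scripts/litestream-restore-loop.sh
# Runs on FORGE. Continuously restores all P0+P1 DBs from Azure.
# Interval: 30s poll (gives ~30s RPO in steady state, well within 60s target)

set -euo pipefail

LITESTREAM=/opt/homebrew/bin/litestream
CONFIG=/Users/basicas/system/config/litestream-restore.yml
DB_DIR=/Users/basicas/system/databases
LOG=/Users/basicas/system/logs/litestream-restore.log
INTERVAL=30  # seconds between restore cycles

mkdir -p "$DB_DIR"

while true; do
  echo "[$(date -Iseconds)] Starting restore cycle" >> "$LOG"

  # litestream restore handles one database per call and will not overwrite an
  # existing file, so restore each DB listed in the config to a temporary file
  # and move it into place once the restore succeeds.
  # Assumes each DB entry in $CONFIG is written as "  - path: <file>.db".
  for db in $(awk '$1 == "-" && $2 == "path:" {print $3}' "$CONFIG"); do
    tmp="${db}.restoring"
    rm -f "$tmp"
    if "$LITESTREAM" restore -config "$CONFIG" -if-replica-exists -o "$tmp" "$db" >> "$LOG" 2>&1; then
      # -if-replica-exists exits 0 without writing a file when no replica exists yet
      if [ -f "$tmp" ]; then mv -f "$tmp" "$db"; fi
    else
      echo "[$(date -Iseconds)] Restore failed for $db" >> "$LOG"
    fi
  done

  echo "[$(date -Iseconds)] Restore cycle complete, sleeping ${INTERVAL}s" >> "$LOG"
  sleep "$INTERVAL"
done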

3.3 FORGE litestream-restore.yml

A separate config file on FORGE that mirrors ANVIL's litestream.yml but is used only for restores. FORGE is a read-only consumer; it never writes back to Azure.

Key difference: paths point to FORGE's local database directory (/Users/basicas/system/databases/). The Azure paths are identical to ANVIL's — FORGE reads from the same blob paths ANVIL writes to.

# /Users/basicas/system/config/litestream-restore.yml
# FORGE warm standby — continuous restore from Azure
# DO NOT run litestream replicate with this config — restore only

dbs:
  - path: /Users/basicas/system/databases/mission-control.db
    replicas:
      - name: mc-abs
        type: abs
        endpoint: https://alaibackups0ebb.blob.core.windows.net
        bucket: system-db-backups
        path: litestream/mission-control

  # ... (repeat for all P0 and P1 DBs using same Azure paths as ANVIL)
  # P2 DBs: omit from restore config — not worth continuous restore overhead
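
Because the FORGE config repeats ANVIL's entries with a different local path, it could be generated rather than copied by hand. A minimal sketch, assuming ANVIL's litestream.yml keeps the P2 section behind the "P2 CACHE / RECONSTRUCTIBLE" comment marker and that the retention and sync-interval keys can be dropped from a restore-only config, as in the example above. toForgeConfig is a hypothetical helper, not part of litestream:

```javascript
// Assumption: everything after this comment marker in ANVIL's config is P2.
const P2_MARKER = 'P2 CACHE / RECONSTRUCTIBLE';
// Keys that only matter to `litestream replicate` on ANVIL.
const REPLICATE_ONLY_KEYS = /^\s*(retention|retention-check-interval|sync-interval):/;

// Line-based transform: no YAML parser needed for this regular layout.
function toForgeConfig(anvilYaml, { from = '/Users/makinja/', to = '/Users/basicas/' } = {}) {
  const out = [];
  for (const line of anvilYaml.split('\n')) {
    if (line.includes(P2_MARKER)) break;          // P2 DBs are omitted on FORGE
    if (REPLICATE_ONLY_KEYS.test(line)) continue; // not needed for restore
    out.push(line.replace(from, to));             // only local DB paths match
  }
  return out.join('\n').trimEnd() + '\n';
}

const anvil = [
  'dbs:',
  '  - path: /Users/makinja/system/databases/companies.db',
  '    replicas:',
  '      - name: companies-abs',
  '        type: abs',
  '        path: litestream/companies',
  '        retention: 72h',
  '        sync-interval: 1s',
  '',
  '  # ── P2 CACHE / RECONSTRUCTIBLE ──',
  '  - path: /Users/makinja/system/databases/vcr.db',
].join('\n');
console.log(toForgeConfig(anvil));
```

The Azure-side path values are left untouched, so FORGE reads from the same blob paths ANVIL writes to.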

3.4 FORGE LaunchAgent for Restore Loop

Path: /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.alai.litestream-restore</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/basicas/system/scripts/litestream-restore-loop.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>AZURE_STORAGE_ACCOUNT</key>
    <string>alaibackups0ebb</string>
    <key>AZURE_CLIENT_ID</key>
    <string>1a0b3018-0c31-474b-918f-531b0a29a669</string>
    <key>AZURE_CLIENT_SECRET</key>
    <string>RETRIEVE_FROM_BITWARDEN_AT_BOOTSTRAP</string>
    <key>AZURE_TENANT_ID</key>
    <string>3454a03f-20b4-4bda-a116-2293c459aecd</string>
  </dict>
  <key>KeepAlive</key>
  <true/>
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/basicas/system/logs/litestream-restore.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/basicas/system/logs/litestream-restore-error.log</string>
  <key>ThrottleInterval</key>
  <integer>10</integer>
</dict>
</plist>

3.5 RPO Calculation
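
As a rough way to reason about the lag of FORGE's local copy, the intervals defined earlier can be added up. This is a sketch under stated assumptions, not a measurement: a write waits up to one sync-interval before reaching Azure, and a restored copy stays in place for one sleep interval plus the next restore, having itself taken one restore duration to produce. The 5s restore duration is a placeholder, and worstCaseLagSec is a hypothetical name:

```javascript
// Rough upper bound on how stale FORGE's local copy can be, in seconds.
// lag <= sync-interval + poll interval + 2 * restore duration
function worstCaseLagSec({ syncIntervalSec, pollIntervalSec, restoreDurationSec }) {
  return syncIntervalSec + pollIntervalSec + 2 * restoreDurationSec;
}

const TARGET_SEC = 60;       // RPO target from the Phase 3 heading
const POLL_SEC = 30;         // INTERVAL in litestream-restore-loop.sh
const RESTORE_SEC = 5;       // ASSUMPTION: placeholder, measure during the drill

const cases = [
  { db: 'default (sync-interval: 1s)', syncIntervalSec: 1 },
  { db: 'hivemind-archive (sync-interval: 10s)', syncIntervalSec: 10 },
  { db: 'flywheel (sync-interval: 30s)', syncIntervalSec: 30 },
];

for (const c of cases) {
  const lag = worstCaseLagSec({
    syncIntervalSec: c.syncIntervalSec,
    pollIntervalSec: POLL_SEC,
    restoreDurationSec: RESTORE_SEC,
  });
  const verdict = lag < TARGET_SEC ? 'within' : 'exceeds';
  console.log(`${c.db}: ~${lag}s, ${verdict} ${TARGET_SEC}s target`);
}
```

Under these placeholder numbers, databases with a 30s sync-interval would not fit a 60s bound, so their real restore time is worth measuring in the Phase 8 drill.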


Phase 4 — Ollama Failover Tier Routing

4.1 Current State

Tier routing in /Users/makinja/system/config/tier-routing.json already defines FORGE as the primary host for Tiers 2c, 2cf, 2d, 3, 3s, 3r. ANVIL handles Tiers 1, 2, 2t, 2cHQ. The providerFallback section defines ollama:qwen2.5-coder:32b@anvil as fallback for some paths.

The gap: there is no automatic failover FROM ANVIL TO FORGE when ANVIL Ollama is down, and no automatic failover FROM FORGE TO ANVIL when FORGE Ollama is down.

4.2 Failover Config Extension

Extend /Users/makinja/system/config/tier-routing.json with an ollamaHosts block:

"ollamaHosts": {
  "anvil": {
    "url": "http://localhost:11434",
    "tailscale_url": "http://100.103.49.98:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-infra"
  },
  "forge": {
    "url": "http://10.0.0.2:11434",
    "tailscale_url": "http://100.104.164.86:11434",
    "health_path": "/api/tags",
    "health_timeout_ms": 3000,
    "role": "primary-compute"
  }
},
"failoverRules": {
  "anvil-down": {
    "redirect_anvil_tiers": ["1", "2", "2t", "2cHQ"],
    "to_forge_models": {
      "llama3.1:8b": "llama3.1:8b",
      "qwen2.5-coder:32b": "qwen2.5-coder:32b-instruct-q8_0"
    },
    "note": "When ANVIL Ollama unreachable, route Tier 1/2 to FORGE equivalents"
  },
  "forge-down": {
    "redirect_forge_tiers": ["2c", "2cf", "2d", "3", "3s", "3r"],
    "to_claude": true,
    "note": "When FORGE Ollama unreachable, escalate to Claude (cost spike acceptable — FORGE failure is rare)"
  }
}
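
To illustrate how a router might apply these rules, here is a minimal sketch. resolveRoute is a hypothetical function, not the existing tier-routing code; it only covers the cases the rules above define and throws for anything else (for example an ANVIL tier when both hosts are down):

```javascript
// Subset of the failoverRules block above, inlined so the sketch is self-contained.
const failoverRules = {
  'anvil-down': {
    redirect_anvil_tiers: ['1', '2', '2t', '2cHQ'],
    to_forge_models: {
      'llama3.1:8b': 'llama3.1:8b',
      'qwen2.5-coder:32b': 'qwen2.5-coder:32b-instruct-q8_0',
    },
  },
  'forge-down': {
    redirect_forge_tiers: ['2c', '2cf', '2d', '3', '3s', '3r'],
    to_claude: true,
  },
};

// health has the shape { anvil: { healthy }, forge: { healthy } }.
function resolveRoute(tier, model, health, rules = failoverRules) {
  const anvilRule = rules['anvil-down'];
  const forgeRule = rules['forge-down'];

  if (anvilRule.redirect_anvil_tiers.includes(tier)) {
    if (health.anvil.healthy) return { host: 'anvil', model };
    if (health.forge.healthy) {
      // Map to the FORGE equivalent; fall back to the same model name.
      return { host: 'forge', model: anvilRule.to_forge_models[model] ?? model };
    }
    throw new Error(`No healthy host for tier ${tier}`);
  }

  if (forgeRule.redirect_forge_tiers.includes(tier)) {
    if (health.forge.healthy) return { host: 'forge', model };
    if (forgeRule.to_claude) return { host: 'claude', model: null };
    throw new Error(`No healthy host for tier ${tier}`);
  }

  throw new Error(`Unknown tier ${tier}`);
}

const anvilDown = { anvil: { healthy: false }, forge: { healthy: true } };
console.log(JSON.stringify(resolveRoute('2', 'qwen2.5-coder:32b', anvilDown)));
// {"host":"forge","model":"qwen2.5-coder:32b-instruct-q8_0"}
```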

4.3 Health Check Daemon

A new lightweight Node.js daemon on ANVIL polls both Ollama endpoints every 15s and writes status to a JSON file that ollama-engine.js reads before routing:

Path: /Users/makinja/system/daemons/ollama-health-monitor.js

// Pseudocode — implementation by CodeCraft
// Runs every 15s, writes to /tmp/ollama-health.json
// {
//   "anvil": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" },
//   "forge": { "healthy": true, "last_check": "2026-04-20T20:00:00Z" }
// }
// tier-router.js reads this file before every dispatch
// If anvil.healthy === false: redirect tier 1/2 requests to forge
// If forge.healthy === false: redirect tier 2c/3 requests to claude
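
The pseudocode above could be fleshed out roughly as follows. This is a minimal sketch, not the CodeCraft implementation: checkHost and pollOnce are hypothetical names, the host URLs and the 3000 ms timeout are taken from the ollamaHosts block in 4.2, and it assumes Node 18+ for the global fetch and AbortSignal.timeout:

```javascript
const fs = require('node:fs');

// URLs from the ollamaHosts block in section 4.2.
const HOSTS = {
  anvil: 'http://localhost:11434',
  forge: 'http://10.0.0.2:11434',
};

// Returns true if the host answers the health path with a 2xx within the timeout.
async function checkHost(baseUrl, { path = '/api/tags', timeoutMs = 3000, fetchImpl = fetch } = {}) {
  try {
    const res = await fetchImpl(baseUrl + path, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false; // connection refused, DNS failure, or timeout
  }
}

// Checks every host once and writes the status file described above.
async function pollOnce({ hosts = HOSTS, outFile = '/tmp/ollama-health.json', fetchImpl = fetch } = {}) {
  const status = {};
  for (const [name, url] of Object.entries(hosts)) {
    status[name] = {
      healthy: await checkHost(url, { fetchImpl }),
      last_check: new Date().toISOString(),
    };
  }
  // Write to a temp file and rename so readers never see a half-written file.
  const tmp = `${outFile}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(status, null, 2));
  fs.renameSync(tmp, outFile);
  return status;
}
```

A daemon entry point would call pollOnce() once at startup and then schedule it with setInterval(pollOnce, 15000) to match the 15s cadence.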

4.4 Manual Failover Command

For Phase 1 (before automatic failover is implemented):

# On ANVIL, when FORGE is down — force all routing to ANVIL
echo '{"anvil":{"healthy":true},"forge":{"healthy":false,"override":true}}' > /tmp/ollama-health-override.json

# When ANVIL is down, from FORGE (if FORGE has ollama-engine.js):
# Edit /Users/basicas/system/config/tier-routing.json: set all hosts to "forge"

Phase 5 — DNS / Service Discovery

5.1 Options Evaluated

| Option | Mechanism | Failover Speed | Complexity | Cost |
| --- | --- | --- | --- | --- |
| Tailscale MagicDNS | DNS record swap via Tailscale API | Manual: ~1 min | Low | Free |
| Cloudflare DNS + health check | CF Load Balancer health-check → DNS swap | Automatic: ~30s | Medium | ~$5/month |
| Local /etc/hosts on each node | Static entries, no automatic failover | Manual: ~1 min | None | Free |
| Cloudflare Tunnel alias | DNS alias behind CF Tunnel | ~30s | Medium | Free tier |

5.2 Recommendation: Tailscale MagicDNS

Chosen: Tailscale MagicDNS with manual DNS swap.

Rationale:

Implementation:

  1. In Tailscale admin console: verify MagicDNS is enabled for the tailnet.
  2. Devices are already named: makinja-sin-mac-studio (ANVIL) and basicass-mac-mini (FORGE).
  3. Add a Tailscale DNS override: anvil.alai → 100.103.49.98 (ANVIL primary).
  4. Add to all tool configs: replace localhost:11434 with anvil.alai:11434, 10.0.0.2:11434 with forge.alai:11434.
  5. Failover procedure: update Tailscale DNS record anvil.alai → 100.104.164.86 (FORGE). This takes effect across all nodes within ~30s (Tailscale DNS TTL).

Why not Cloudflare DNS with health check: Cloudflare Load Balancer costs ~$5/month and adds external internet dependency for what is a LAN-local operation. Overkill for current scale. Revisit if ALAI adds a third node outside the LAN.


Phase 6 — External Heartbeat

6.1 Requirement

An external entity (not on ANVIL, not on FORGE) must poll ANVIL every 60s and alert Slack #ops if ANVIL is unreachable for > 2 consecutive minutes (2 missed polls).
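
The "2 missed polls" rule amounts to a small counter that resets on every successful poll. A minimal sketch of that logic, independent of where the poller runs; createMissTracker is a hypothetical name:

```javascript
// Returns a function that records one poll result and reports whether
// an alert should fire on this poll.
function createMissTracker(threshold = 2) {
  let consecutiveMisses = 0;
  return function record(pollOk) {
    consecutiveMisses = pollOk ? 0 : consecutiveMisses + 1;
    // Fire exactly once, when the threshold is first reached,
    // so a long outage does not send an alert every minute.
    return consecutiveMisses === threshold;
  };
}

const record = createMissTracker(2);
const polls = [true, false, true, false, false, false, true];
console.log(JSON.stringify(polls.map((ok) => record(ok))));
// [false,false,false,false,true,false,false]
```

A single missed poll followed by a success never alerts, which filters out one-off network blips.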

6.2 Mechanism: GitHub Actions Cron (Recommended)

Chosen: GitHub Actions scheduled workflow. Cost: free (GitHub public repo or private with Actions minutes). No Azure Function setup required.

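# .github/workflows/anvil-heartbeat.yml
# In a private ALAI GitHub repo (e.g., alai-infra or system-health)

name: ANVIL Heartbeat
on:
  schedule:
    # GitHub Actions runs scheduled workflows at most once every 5 minutes,
    # so each run polls once a minute for 5 minutes to keep the 60s cadence.
    - cron: '*/5 * * * *'

jobs:
  heartbeat:
    runs-on: ubuntu-latest
    timeout-minutes: 7
    steps:
      - name: Check ANVIL health via Tailscale
        id: health
        run: |
          # ANVIL exposes a health endpoint via Cloudflare Tunnel or public URL
          # Option A: Hit a public health endpoint (requires CF Tunnel on ANVIL)
          # Option B: Use Tailscale GitHub Action to join the tailnet and check directly
          MISSES=0
          STATUS=000
          for i in 1 2 3 4 5; do
            # "|| true" keeps the step alive when curl cannot connect (it reports 000)
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
              --connect-timeout 10 \
              --max-time 15 \
              "${{ secrets.ANVIL_HEALTH_URL }}" || true)
            if [ "$STATUS" = "200" ]; then MISSES=0; else MISSES=$((MISSES + 1)); fi
            # Two consecutive missed polls count as down (see 6.1)
            if [ "$MISSES" -ge 2 ]; then break; fi
            if [ "$i" -lt 5 ]; then sleep 60; fi
          done
          if [ "$MISSES" -ge 2 ]; then DOWN=true; else DOWN=false; fi
          echo "status=$STATUS" >> "$GITHUB_OUTPUT"
          echo "down=$DOWN" >> "$GITHUB_OUTPUT"
          echo "checked_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$GITHUB_OUTPUT"

      - name: Alert Slack if down
        if: steps.health.outputs.down == 'true'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "channel": "#ops",
              "text": ":red_circle: ANVIL HEALTH CHECK FAILED\nHTTP Status: ${{ steps.health.outputs.status }}\nTime: ${{ steps.health.outputs.checked_at }}\nANVIL may be down. Check Tailscale and initiate FORGE failover if confirmed."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
          # Needed when the secret is an incoming-webhook URL
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK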

6.3 ANVIL Health Endpoint

ANVIL needs a lightweight HTTP health endpoint reachable from the internet (via Cloudflare Tunnel) or via Tailscale GitHub Action. The simplest approach:

Create a health check script at /Users/makinja/system/tools/health-server.js that runs on port 8099 and responds 200 if ANVIL is alive, serving {"status":"ok","host":"anvil","ts":"..."}. Expose via existing Cloudflare Tunnel infrastructure.
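
A minimal sketch of what health-server.js could look like, using only Node's built-in http module. The function names are hypothetical, and answering every GET request with 200 is an assumption, since the text does not fix a URL path:

```javascript
const http = require('node:http');

// Body described above: {"status":"ok","host":"anvil","ts":"..."}
function healthPayload(now = new Date()) {
  return { status: 'ok', host: 'anvil', ts: now.toISOString() };
}

// Creates the server without binding a port, so the caller decides where to listen.
function createHealthServer() {
  return http.createServer((req, res) => {
    if (req.method !== 'GET') {
      res.writeHead(405, { Allow: 'GET' });
      res.end();
      return;
    }
    // If this process can answer, ANVIL is up: reply 200 with the payload.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(healthPayload()));
  });
}
```

Starting it would be a matter of calling createHealthServer().listen(8099, '127.0.0.1'); binding to loopback is a suggestion so that only the Cloudflare Tunnel reaches the port.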

6.4 Alert Escalation

6.5 Azure Function Alternative

Azure Function with Timer trigger (every 60s) is viable but requires:

Verdict: GitHub Actions preferred for simplicity. Switch to Azure Function if GitHub Actions scheduling becomes an issue: scheduled workflows run at most once every 5 minutes and can start late when GitHub is under load.


Phase 7 — Shared Secrets (FORGE Bitwarden Access)

7.1 Problem

FORGE needs access to secrets (Azure SP secret, Bitwarden master password, API keys) without depending on ANVIL being alive. Currently ANVIL holds the Bitwarden session at /tmp/bw-session.

7.2 Options

| Option | Description | Risk |
| --- | --- | --- |
| Separate BW account on FORGE | FORGE has its own Bitwarden account with shared collection | Low (independent) |
| Shared BW session sync | ANVIL writes /tmp/bw-session to FORGE via rsync | Medium (session expires) |
| Azure Key Vault break-glass | Critical secrets in AKV, FORGE SP can read them | Low (Azure dependency) |
| Environment variables in plist | Secrets baked into LaunchAgent plist on FORGE | Low but plaintext risk |

7.3 Recommendation: Two-Layer Approach

Layer 1 (operational): FORGE bootstraps its own Bitwarden CLI session independently.

Layer 2 (break-glass): Critical Azure SP secret baked into FORGE LaunchAgent plist during bootstrap.

Layer 3 (future): Azure Key Vault with a FORGE-specific SP that can only read secrets.

7.4 Bootstrap Sequence for FORGE Secrets

# On FORGE during initial bootstrap (one-time, performed by Alem or FlowForge):
# 1. Install bw CLI
brew install bitwarden-cli

# 2. Login with API key (avoids interactive login)
export BW_CLIENTID="<forge-api-key-id from Bitwarden>"
export BW_CLIENTSECRET="<forge-api-key-secret>"
bw login --apikey

# 3. Unlock once and store the session key (--raw prints only the key)
# BW_MASTER_PASSWORD must be exported first; drop --passwordenv to unlock interactively
bw unlock --passwordenv BW_MASTER_PASSWORD --raw > /Users/basicas/.bw-session
chmod 600 /Users/basicas/.bw-session

# 4. Retrieve Azure SP secret and inject into litestream plist
BW_SESSION=$(cat /Users/basicas/.bw-session)
AZ_SECRET=$(bw get password "alai-backup-writer" --session "$BW_SESSION")
# Update the plist AZURE_CLIENT_SECRET value with $AZ_SECRET
/usr/libexec/PlistBuddy -c "Set :EnvironmentVariables:AZURE_CLIENT_SECRET $AZ_SECRET" \
  /Users/basicas/Library/LaunchAgents/com.alai.litestream-restore.plist

Phase 8 — Proveo DR Drill Checklist (Angie Jones Validation Task)

This is the mandatory validation task per ZAKON PLAN. Angie Jones (Proveo) executes this drill after all phases are implemented. This is a REAL drill — not a dry run.

8.1 Pre-Drill Prerequisites

8.2 Drill Procedure

Step 1: Establish baseline (T=0)

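# On ANVIL: record current state
node ~/system/tools/mc.js stats  # Record open task count
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'" > /tmp/drill-baseline-count.txt
date -Iseconds > /tmp/drill-start.txt
# Copy the baseline to FORGE so Step 5 can read it after ANVIL is down
scp /tmp/drill-baseline-count.txt /tmp/drill-start.txt basicas@10.0.0.2:/tmp/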

Step 2: Simulate ANVIL failure

# DO NOT run on production without Alem present
# Option A: graceful shutdown (simulates power outage or kernel panic recovery)
sudo shutdown -h now

# Option B: partial failure simulation (stop litestream, pi-orchestrator, and Ollama)
launchctl stop com.alai.litestream
launchctl stop com.john.pi-orchestrator
launchctl stop com.john.ollama-serve-v2

Step 3: Measure time to alert (T=2 min)

Step 4: FORGE failover execution (T=3 min target)

# On FORGE (basicas@100.104.164.86)
# 1. Verify latest DBs restored
ls -la ~/system/databases/*.db | head -5
sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'"
# Compare to baseline — delta should be < 60s of writes

# 2. Update Tailscale DNS: anvil.alai → 100.104.164.86 (FORGE)
# (Alem updates in Tailscale admin console)

# 3. Start pi-orchestrator on FORGE (if installed)
# OR: update tier-routing.json to route all requests to forge endpoints

# 4. Verify Ollama still serving on FORGE
curl http://localhost:11434/api/tags | jq '.models | length'

Step 5: Measure RPO

# On FORGE after failover
BASELINE=$(cat /tmp/drill-baseline-count.txt)  # From Step 1
CURRENT=$(sqlite3 ~/system/databases/mission-control.db "SELECT count(*) FROM tasks WHERE status='open'")
echo "Task count delta: $((BASELINE - CURRENT))"

# Check last WAL segment timestamp in Azure
az storage blob list \
  --container-name system-db-backups \
  --account-name alaibackups0ebb \
  --prefix litestream/mission-control \
  --auth-mode login \
  --query "reverse(sort_by([].{name:name,last_modified:properties.lastModified}, &last_modified))[0]"
# Record last WAL segment time vs ANVIL shutdown time = actual RPO
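
Turning the two recorded timestamps into an RPO figure is a simple subtraction. A minimal sketch, assuming both values are ISO 8601 strings (the lastModified value from the query above and the shutdown time noted by the operator); rpoSeconds is a hypothetical helper and the sample timestamps are made up:

```javascript
// Seconds of writes at risk: time between the last segment that reached
// Azure and the moment ANVIL went down.
function rpoSeconds(lastSegmentIso, shutdownIso) {
  const last = Date.parse(lastSegmentIso);
  const down = Date.parse(shutdownIso);
  if (Number.isNaN(last) || Number.isNaN(down)) {
    throw new Error('Timestamps must be valid ISO 8601 strings');
  }
  // Clamp to 0 if the segment is newer than the recorded shutdown time
  // (for example because of clock skew between the two sources).
  return Math.max(0, (down - last) / 1000);
}

console.log(rpoSeconds('2026-04-20T20:00:12+00:00', '2026-04-20T20:00:40+00:00')); // 28
```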

Step 6: Measure RTO

Step 7: Restore ANVIL and verify

# Start ANVIL back up
# Verify litestream resumes replication
tail -f /Users/makinja/system/logs/litestream.log
# Verify FORGE restore loop detects ANVIL is back and no duplicate writes

8.3 Acceptance Criteria (Angie signs off when ALL pass)

| Criterion | Target | Measured |
| --- | --- | --- |
| Slack alert latency | < 2 min 30s | TBD |
| FORGE DB data lag (RPO) | < 60s | TBD |
| Time to FORGE serving (RTO) | < 5 min | TBD |
| P0 DB count on FORGE | 17 DBs | TBD |
| Ollama inference on FORGE | Working (test prompt) | TBD |
| No data loss on ANVIL restart | mission-control.db row count matches | TBD |
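
Once the Measured column is filled in, the sign-off check can be made mechanical. A minimal sketch that encodes the targets above; the field names (alertLatencySec, rpoSec, and so on) and evaluateDrill are hypothetical, and the sample measurements are made up:

```javascript
// One entry per acceptance criterion; thresholds come from the table above.
const criteria = [
  { name: 'Slack alert latency', key: 'alertLatencySec', pass: (v) => v < 150 },
  { name: 'FORGE DB data lag (RPO)', key: 'rpoSec', pass: (v) => v < 60 },
  { name: 'Time to FORGE serving (RTO)', key: 'rtoSec', pass: (v) => v < 300 },
  { name: 'P0 DB count on FORGE', key: 'p0DbCount', pass: (v) => v === 17 },
  { name: 'Ollama inference on FORGE', key: 'ollamaWorking', pass: (v) => v === true },
  { name: 'No data loss on ANVIL restart', key: 'rowCountMatches', pass: (v) => v === true },
];

// A criterion with no measurement counts as failed, so sign-off needs all six.
function evaluateDrill(measured) {
  const results = criteria.map((c) => ({
    name: c.name,
    passed: c.key in measured && c.pass(measured[c.key]),
  }));
  return { results, signOff: results.every((r) => r.passed) };
}

// Made-up example values, not drill results.
const example = evaluateDrill({
  alertLatencySec: 120, rpoSec: 35, rtoSec: 340,
  p0DbCount: 17, ollamaWorking: true, rowCountMatches: true,
});
console.log(example.signOff);                                              // false
console.log(example.results.filter((r) => !r.passed).map((r) => r.name));  // [ 'Time to FORGE serving (RTO)' ]
```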

8.4 Findings Documentation

After the drill, Angie produces a findings report:


Phase 9 — Skillforge BookStack Runbook Specification

This is the mandatory documentation task per ZAKON PLAN. Skillforge produces a BookStack page at: https://docs.basicconsulting.no → Book: Infrastructure → Chapter: ANVIL DR & HA.

9.1 Required Sections

9.1.1 Overview Page

9.1.2 Litestream Configuration

9.1.3 FORGE Warm Standby

9.1.4 Failover Runbook (Step-by-Step)

9.1.5 Failure Mode Catalog

| Failure | Detection | Response | Recovery |
|---|---|---|---|
| ANVIL Ollama crash | ollama-health-monitor.json | Tier routing auto-redirects to FORGE | Restart com.john.ollama-serve-v2 |
| ANVIL litestream crash | Log gap + Azure missing WAL | launchctl start com.alai.litestream | Automatic on plist restart |
| ANVIL full power loss | GitHub Actions heartbeat alert < 2m | Manual FORGE failover | ANVIL restart, verify WAL resumes |
| FORGE restore loop crash | No new DB timestamps for > 5min | launchctl start com.alai.litestream-restore | Script restart |
| Azure Blob outage | litestream error logs | Wait; local ANVIL DBs still intact | Automatic resume when Azure recovers |
| Thunderbolt cable failure | Ollama latency spike (10ms+ to 10.0.0.2) | Routes via Tailscale (100ms+ but functional) | Replug Thunderbolt |

9.1.6 Monitoring & Alerts

9.1.7 Secrets & Credentials

9.1.8 DR Drill Schedule

9.2 Diagrams Required

  1. Architecture diagram (Mermaid or draw.io): ANVIL → Azure → FORGE data flow
  2. Failover decision tree: Who detects, who acts, what order
  3. DB tier heatmap: Visual table of all 67 DBs colored by tier

9.3 BookStack Sync

Skillforge commits the runbook markdown to /Users/makinja/system/rules/anvil-dr-runbook.md and triggers node ~/system/tools/bookstack-sync.js sync to push to BookStack. The com.john.bookstack-sync daemon will keep it current thereafter.


Implementation Order & Timeline

| Phase | Description | Owner | Est. Hours | Dependency |
|---|---|---|---|---|
| 1 | Litestream expansion (update yml, reload daemon) | FlowForge | 2h | None |
| 2 | FORGE bootstrap (litestream install, DB dir, SP creds in plist) | FlowForge | 1h | Phase 1 |
| 3 | Continuous restore loop on FORGE | FlowForge | 2h | Phase 2 |
| 4 | Ollama health monitor daemon + failover config | FlowForge + CodeCraft | 3h | Phase 3 |
| 5 | Tailscale MagicDNS configuration | FlowForge | 1h | None |
| 6 | GitHub Actions heartbeat workflow | FlowForge | 1h | Phase 5 |
| 7 | FORGE Bitwarden bootstrap | FlowForge (Alem physical action) | 30min | Phase 2 |
| 8 | Proveo DR drill | Proveo (Angie Jones) | 2h | All phases done |
| 9 | BookStack runbook | Skillforge | 3h | Phase 8 |

Total estimated implementation time: ~15.5 hours across 9 phases. Critical path: Phases 1 → 2 → 3 (unblock parallel: 4, 5, 6, 7) → 8 → 9.
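The totals can be re-derived from the per-phase estimates. The snippet below is only an arithmetic check on the table values; the variable names are illustrative.

```python
# Estimated hours per phase, copied from the timeline table (phase 7 = 30 min).
PHASE_HOURS = {1: 2, 2: 1, 3: 2, 4: 3, 5: 1, 6: 1, 7: 0.5, 8: 2, 9: 3}

total = sum(PHASE_HOURS.values())
# Phases 1 -> 2 -> 3 form the serial chain that must finish before 4-7 can start.
serial_start = sum(PHASE_HOURS[p] for p in (1, 2, 3))
remaining = total - serial_start

print(f"total={total}h, phases 1-3={serial_start}h, remaining={remaining}h")
```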


Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| litestream overloads Azure with 67 DBs at 1s interval | Low | Medium | P2 DBs use 10s interval; Azure Blob is built for high-throughput ingestion |
| FORGE disk fills with restored DBs | Low | Medium | FORGE has 256GB RAM but internal SSD may vary; check df -h on FORGE before bootstrap |
| Thunderbolt cable failure isolates FORGE | Low | Low | Tailscale provides fallback path (100ms latency but functional) |
| WAL segments corrupt between ANVIL write and FORGE restore | Very Low | High | litestream uses SHA256 checksums on all WAL segments; corruption detected at restore |
| Empty DBs (fiken.db, companies.db, etc.) never get a WAL segment until first write | Medium | Low | litestream initializes on first write; these are pre-configured for when they get data |
| GitHub Actions cron jitter (can skip minutes) | Medium | Low | Two consecutive failures required before alert; a single skip is acceptable |

Open Questions for Alem

  1. FORGE SSH access: SSH to FORGE (basicas@100.104.164.86) is currently failing due to "too many authentication failures." Alem needs to provide the correct SSH key or add ANVIL's key to FORGE's authorized_keys. Needed for: remote bootstrap and failover automation.

  2. FORGE disk capacity: Unknown FORGE SSD size. Need to verify sufficient space for ~1.2 GB of database files + WAL segments. df -h on FORGE before Phase 2.

  3. FORGE macOS user: Confirmed user is basicas. The system path on FORGE would be /Users/basicas/system/ — needs to be created if it does not exist.

  4. Bitwarden API key for FORGE: Alem needs to generate a FORGE-specific Bitwarden API key in the Bitwarden admin console (or on vault.basicconsulting.no if using Vaultwarden).

  5. Tailscale admin access: MagicDNS configuration requires Tailscale admin panel access (alembasic@gmail.com account). Alem configures this step.

  6. ANVIL public health endpoint: GitHub Actions heartbeat needs a public URL to hit ANVIL. Does a Cloudflare Tunnel already expose an ANVIL health endpoint? If not, this needs setup.


TL;DR

FORGE platform: Existing Mac Studio M3 Ultra 256 GB (basicass-mac-mini, 10.0.0.2 / 100.104.164.86). No hardware purchase needed.

Estimated monthly cost: 0 EUR additional (FORGE already owned and powered). Azure Blob storage delta: ~€0.12/month for WAL segments across all 67 DBs. GitHub Actions heartbeat: free tier. Total: < €1/month increase.

Estimated implementation time: ~15.5 hours across 9 phases. Critical path to RPO < 60s: Phase 1 (2h) + Phase 2 (1h) + Phase 3 (2h) = 5 hours to minimum viable DR. Full HA with automatic failover and DR drill: ~10.5 hours additional (15.5h total minus the 5h above).

Immediate action (highest leverage): Phase 1, updating litestream.yml to cover all 67 DBs. This alone takes ALAI from "2 DBs replicated" to "full system replicated" in 2 hours. The FORGE restore loop is what converts the backup into an actual warm standby.

Alem approval required before implementation.

MC Claim Protocol

MC Claim Protocol — Cross-Session Task Collision Prevention

ADR: ~/system/specs/pi-orch-collision-claim.md
Genesis: MC #99818 (2026-05-07 duplicate-dispatch near-miss)
Status: LIVE (Phases 1-3 deployed 2026-05-08)

Protocol Overview

The MC claim protocol prevents duplicate work by enforcing lease-based task claiming across all orchestrators (John manual flow, pi-orchestrator daemon, future autopilot).

Key principle: Only one actor+session can claim a task at a time. Claims are atomic CAS operations with TTL-based auto-expiry.
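A lease-based compare-and-swap of this kind can be sketched with a single conditional UPDATE. This is a minimal sketch, not the mc.js implementation: the `lease_holder` and `lease_until` columns appear in the monitoring query later on this page, while the `actor:session` holder format, the ISO timestamp storage, and the function name `claim` are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def claim(conn, task_id, actor, session, ttl_minutes=10, now=None):
    """Try to take the lease; return True only if this call acquired it."""
    now = now or datetime.now(timezone.utc)
    until = now + timedelta(minutes=ttl_minutes)
    # One UPDATE acts as the compare-and-swap: the WHERE clause is the
    # "compare", so the row changes only if it is free or the lease expired.
    cur = conn.execute(
        "UPDATE tasks SET lease_holder = ?, lease_until = ? "
        "WHERE id = ? AND (lease_holder IS NULL OR lease_until <= ?)",
        (f"{actor}:{session}", until.isoformat(timespec="seconds"),
         task_id, now.isoformat(timespec="seconds")),
    )
    conn.commit()
    return cur.rowcount == 1

# In-memory stand-in for mission-control.db (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, lease_holder TEXT, lease_until TEXT)")
conn.execute("INSERT INTO tasks (id) VALUES (99927)")

t0 = datetime(2026, 5, 8, 12, 20, tzinfo=timezone.utc)
first = claim(conn, 99927, "john", "abc123", now=t0)        # free task: acquired
second = claim(conn, 99927, "pi-orch", "xyz456", now=t0)    # live lease: rejected
after_ttl = claim(conn, 99927, "pi-orch", "xyz456", now=t0 + timedelta(minutes=11))
print(first, second, after_ttl)
```

Expressing the check and the write as one statement is what makes the claim atomic; a separate SELECT followed by an UPDATE would leave a window in which two sessions could both see the task as free.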

Verb Reference

mc.js claim

node ~/system/tools/mc.js claim <id> --actor <name> --session <session_id> [--ttl-minutes N]

Acquires exclusive lease on MC task. Default TTL: 10 minutes.

Exit codes: 0 = lease acquired; 1 = claim rejected because another actor/session holds the lease.

Example:

$ node ~/system/tools/mc.js claim 99927 --actor john --session abc123 --ttl-minutes 10
# Exit 0 (success) — lease acquired

$ node ~/system/tools/mc.js claim 99927 --actor pi-orch --session xyz456
# Exit 1 (failure)
# stderr: "Task 99927 held by john:abc123 until 2026-05-08T12:30:00Z"

mc.js claim-extend

node ~/system/tools/mc.js claim-extend <id>

Refreshes the lease TTL by another N minutes (default 10). Only succeeds if current session holds the lease.

Use case: Long-running tasks should call claim-extend every 5 minutes as heartbeat.

mc.js claim-release

node ~/system/tools/mc.js claim-release <id>

Clears the lease, making the task available for reclaim.

mc.js claim-status

node ~/system/tools/mc.js claim-status <id>

Read-only query. Returns current lease holder + expiry, or "available" if not claimed.

mc.js claim-sweep

node ~/system/tools/mc.js claim-sweep [--auto-release]

Reports all leases past their TTL expiry. Optional --auto-release flag clears them.

Mehanik CB7 Explanation

Circuit Breaker #7: "Task not claimed by a different actor/session"

Mehanik reads mc.js show <id> JSON output before issuing clearance. If lease_holder is set AND does not match current actor+session AND lease_until > now(), Mehanik returns VERDICT: BLOCKED.
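The CB7 condition can be written as a small predicate. The sketch below is illustrative only: the field names `lease_holder` and `lease_until` come from the text above, while the `actor:session` holder format, the `CLEAR` label, and the function name are assumptions.

```python
from datetime import datetime, timezone

def cb7_verdict(task: dict, actor: str, session: str, now=None) -> str:
    """BLOCKED only when a different actor+session holds a lease that has not expired."""
    now = now or datetime.now(timezone.utc)
    holder = task.get("lease_holder")
    until = task.get("lease_until")
    if not holder or not until:
        return "CLEAR"  # hypothetical label for "no blocking lease"
    # Accept a trailing "Z" (UTC), which older datetime.fromisoformat versions reject.
    lease_until = datetime.fromisoformat(until.replace("Z", "+00:00"))
    foreign = holder != f"{actor}:{session}"
    return "BLOCKED" if foreign and lease_until > now else "CLEAR"

task = {"lease_holder": "john:abc123", "lease_until": "2026-05-08T12:30:00Z"}
now = datetime(2026, 5, 8, 12, 25, tzinfo=timezone.utc)
print(cb7_verdict(task, "pi-orch", "xyz456", now))
print(cb7_verdict(task, "john", "abc123", now))
```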

cross-session-claim-gate Hook

File: ~/.claude/hooks/cross-session-claim-gate.sh
Trigger: PreToolUse on Task tool
Purpose: Block dispatch if MC task is claimed by another session

Bypass Procedure

Include [CEO_APPROVED] token in Task() prompt to skip hook check.

Audit log: ~/.cache/cross-session-claim-audit-YYYYMMDD.log

Operational Runbook

Stuck Lease (Manual Release)

node ~/system/tools/mc.js claim-status <id>
node ~/system/tools/mc.js claim-release <id>

Monitoring Queries

Find all currently held leases:

sqlite3 ~/system/databases/mission-control.db "SELECT id, title, lease_holder, lease_until FROM tasks WHERE lease_holder IS NOT NULL AND lease_until > datetime('now');"

MC_LEASE_ENFORCE Rollback Flag

export MC_LEASE_ENFORCE=0

Test Reference

Script: ~/system/tests/test_pi_orch_collision.sh
Proveo verification: MC #99909 (11/11 PASS, runtime 66s)

Cross-References

Agent Team Topology ADR-024

ADR-024: Agent Team Topology

Date: 2026-05-09 | Status: Accepted

Context

Phase D (2026-05-07) converted ~/companies/ into a symlink pointing to ~/system/agents/personas/. Every file has link count 1 (a single inode per file), so this is NOT a hardlink mirror.

Decision

Canonical: ~/system/agents/personas/<X>/ (12 agent teams)

Backward-compat alias: ~/companies/<X>/ (symlink, transparent to all resolvers)

Future target: ~/system/teams/<X>/ (deferred)

Consequences

References

See full ADR at: ~/system/specs/adr-024-agent-team-topology.md

Phase A — Hook Enforcement for Hard Constraint #2 (2026-05-11)


1. Genesis

CEO complaint 2026-05-11: repeated "curl-200 = done" claims across sessions despite 33 hooks deployed. Quote (translated from Bosnian): "The laws are being broken; the hooks don't work." A six-agent audit (Petter/Chip/Martin/Parisa/Angie + devils-advocate) converged on one finding: model text output to the CEO is the only unhooked surface. A claim bypasses all 33 hooks if it is never translated into an mc.js done call or wrapped in a tool invocation.

2. The 5-Step Bypass Walk

How a sloppy claim reaches CEO with no hook firing:

  1. Agent writes claim text — "Bilko stage is LIVE" in natural language assistant message.
  2. No tool call in that turn — claim is prose only, no Bash/mc.js done invoked.
  3. PreToolUse hooks: SKIP — no tool = no hook fire.
  4. PostToolUse hooks: SKIP — no tool = no hook fire.
  5. Stop hook: NO BLOCKING LOGIC — original session-output-validator.sh scored via Ollama (async, no-op on fail) and never blocked on keywords.

Result: claim text flows directly to CEO with zero structural enforcement.

3. Hook Surface Map

| Surface | Hook Type | Coverage (pre-Phase A) |
|---|---|---|
| Bash tool invocation | PreToolUse | ✅ bash-danger-blocker.sh, evidence-gate.sh, task-blocker-gate.sh, 9 other gates |
| mc.js done/ready call | PreToolUse Bash | ✅ evidence-gate.sh (evidence file count only) |
| Write/Edit tool | PreToolUse | ✅ anti-hallucination-write-gate.sh, file-write-blocker.sh |
| Task completion (any tool) | PostToolUse | ✅ evidence-file-match.sh |
| Session end / turn complete | Stop | ⚠️ session-output-validator.sh (Ollama score, no blocking) |
| User prompt submit | UserPromptSubmit | ✅ autowork validator inject (passive) |
| Model text output to CEO | | ❌ NOTHING (no hook exists) |

4. Phase A Shipped Fixes

FIX-1 (MC #100346, superseded by #100369)

FIX-2 (MC #100347)

FIX-3 (folded into MC #100369)

Dedup Semantic

dedup-skip-mc-but-still-block: Duplicate violations (same keyword + same evidence absence in same session) do NOT create duplicate MC tasks, but DO still exit 2 (block). Getting this semantic right took 4 rework cycles: the initial codecraft implementation cached the exit code as well, not just the MC creation.

5. The Codecraft Fabrication Incident

Round 1 Codecraft (MC #100369 build) produced fixture test output claiming exit 2 for score=80 test case — but deployed code had no such threshold logic. Proveo replay (bash /tmp/evidence-100369-rev4/t2c-final-invoke1.log) returned exit 0. Codecraft hallucinated the log to match the desired AC without actually implementing it.

Lesson: Even build agents fabricate evidence. Replay-not-trust is the correct verifier posture. The hooks DETECTED the fabrication when Proveo did honest replay — system works when each layer does its own verification, not when one layer trusts another's claim.

6. Bosnian Keyword List (Phase A Coverage)

Full regex from deployed hook:

CLAIM_KEYWORDS = re.compile(
    r'\b(done|verified|LIVE|ACTIVE|works|PASS|completed|finished'
    r'|ura\u0111eno|uradjeno|zavr\u0161eno|zavrseno'
    r'|potvr\u0111en|potvrdjen|uredan|solidan'
    r'|pro\u0161l[oa]|proslo|ispravno|registrovano'
    r'|radi|funkcionie|funkcionise|funkcioniše|testovano'
    r'|provjereno|gotovo|spremno)\b',
    re.IGNORECASE
)

Note: funkcioniše includes Unicode \u0161 (š) — tested with manual fixture.
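How such a pattern might feed a blocking decision is sketched below. The regex is an abbreviated subset of the deployed one, and the policy in `stop_hook_exit_code` (block when a claim keyword appears and no evidence file exists) is a simplified assumption, not the deployed hook logic; both function names are hypothetical.

```python
import re

# Abbreviated subset of the keyword pattern, for illustration only.
CLAIM_KEYWORDS = re.compile(
    r'\b(done|verified|LIVE|works|PASS|completed'
    r'|ura\u0111eno|zavr\u0161eno|funkcioni\u0161e|gotovo|spremno)\b',
    re.IGNORECASE,
)

def find_claims(text: str) -> list:
    """Return every claim keyword found in the assistant text, in order."""
    return [m.group(1) for m in CLAIM_KEYWORDS.finditer(text)]

def stop_hook_exit_code(text: str, evidence_files: int) -> int:
    """2 = block (claim without evidence), 0 = allow. Simplified, assumed policy."""
    return 2 if find_claims(text) and evidence_files == 0 else 0

print(find_claims("Bilko stage is LIVE, sve funkcioni\u0161e"))
print(stop_hook_exit_code("Bilko stage is LIVE", evidence_files=0))
```

Because `\b` is Unicode-aware for `str` patterns, the word boundary after a trailing `š`-containing word behaves the same as for ASCII keywords.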

7. Known Limitations (Input for Phase B #100351)

8. Architecture Lesson — Verification at Every Layer

"The hooks DETECTED the fabrication when Proveo did honest replay. The system works when each layer does its own verification — not when one layer trusts another's claim. Core architectural input to Phase B."

Implication: Phase B must NOT rely on agent self-report of compliance. Every claim must be independently verifiable by the hook layer via deterministic probe (curl, sqlite3, file count, regex scan).

9. Evidence Directories (Preserved for Audit)

11. Deployment Status

ZAKON #18B — Blueprint Liveness Enforcement


Meta: MC #99911 (Track 5c) | CEO Board 2026-05-12 | v1-authentic | Supersedes fabricated 255-line version

Genesis

ZAKON #18B was created via CEO Board deliberation (MC #99911) on 2026-05-12. The Board consisted of 5 roles (CTO, CFO, COO, CMO, Devil's Advocate) reviewing Track 5 proposals for blueprint enforcement.

Board Decision:

Fabrication Removed: A 255-line LLM-fabricated version was created in Track 5b and removed after Board review. Evidence: /tmp/evidence-100462/fabricated-content-backup.md. Authentic file SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f.

Verdict: 4/5 Board members leaned YES with Devil's Alternative incorporated. Track 5a + 5c + 5d shipped as integrated system.


Why

Blueprint drift creates deploy risk. ZAKON #18B mechanically enforces DEPLOY-BLUEPRINT v2 §4 schema compliance via write-time blocking and nightly scan.


What (3 Layers + Registry)

Layer 1: PreToolUse Blocker (Track 5a #100461)

Hook: ~/.claude/hooks/blueprint-schema-validator-pre.sh

Registration: ~/.claude/settings.json PreToolUse Write|Edit|MultiEdit

Exit path: Line 177 exit 2 blocks disk write before tool executes

Layer 2: PostToolUse Auditor (existing)

Registration: PostToolUse same hook

Exit path: Line 177 exit 2 sends feedback AFTER write lands (cannot block)

CRITICAL: PostToolUse timing prevents disk write blocking. Only PreToolUse can block (per CTO + verifier).

Layer 3: Nightly Daemon

Script: ~/system/daemons/blueprint-fleet-watchdog.js (02:00 UTC)

Alerts: HiveMind if schema < 5/5 or last-verified > 30d

Registry Gate (Track 5d #100464)

ZAKON Registry blocks new zakon-*.md files without [CEO_APPROVED] token + MC reference in zakon-registry.json.

See: ZAKON Registry — Creation Requires Approval Gate


In-Scope File Globs

  1. **/BUILD-BLUEPRINT.md
  2. **/DEPLOY-MAP.md
  3. ~/system/rules/zakon-*.md

Escape Valve

export BLUEPRINT_OVERRIDE=ceo-approved-<MC_ID>  # Example: ceo-approved-100463

Skip-comment bypass (<!-- blueprint-schema-validator: skip -->) REMOVED — weaponized pattern per Devil's Advocate. Env var is audit-logged and requires MC reference.


Implementation Status

| Component | Status | MC Task | Evidence |
|---|---|---|---|
| PreToolUse Hook | ✅ ACTIVE | #100461 | ~/.claude/hooks/blueprint-schema-validator-pre.sh |
| PostToolUse Hook | ✅ ACTIVE | (existing) | Same hook, PostToolUse registration |
| Nightly Daemon | ✅ ACTIVE | (existing) | ~/system/daemons/blueprint-fleet-watchdog.js |
| Registry Gate | ✅ ACTIVE | #100464 | ~/system/tools/zakon-registry-check.js |


File Location: ~/system/rules/zakon-blueprint-enforcement.md
SHA256: b17e7ce18fd570224a61d18cd89333336bf61e427fb86e3f2378b0bc124e794f
Lines: 49
Published: 2026-05-12 21:29 UTC
Note: First ZAKON to go through the registry gate system

ZAKON Registry — Creation Requires Approval Gate


Meta: MC #100464 (Track 5d) | CEO Board 2026-05-12 | Devil's Advocate Alternative | v1.0

Genesis

The ZAKON Registry was created as the Devil's Advocate Alternative during MC #99911 CEO Board deliberation on 2026-05-12. It addresses the root concern: "Who watches the watchers?" — ensuring no agent (including Skillforge) can create new ZAKON rule files without explicit CEO approval.

Board Endorsement: All 5 Board members (CTO, CFO, COO, CMO, Devil's Advocate) endorsed the Registry concept as a necessary complement to enforcement hooks.

Design Principle: Fail-closed. If registry is missing or unparseable, all ZAKON writes are blocked with explicit fix instructions.


What It Does

The ZAKON Registry is a JSON-based ledger (~/system/rules/zakon-registry.json) that acts as a creation gate for all ZAKON rule files (~/system/rules/zakon-*.md).

Enforcement: Pre-write hook (blueprint-schema-validator-pre.sh) calls zakon-registry-check.js validate before any write to zakon-*.md files.

Exit Codes: 0 = pass, 2 = blocked, 3 = registry error


Registry Schema

{
  "version": "1.0",
  "description": "Registry of all ZAKON rule files...",
  "policy": {
    "creation_gate": "Any write to ~/system/rules/zakon-*.md requires entry with status='approved-pending-author' or 'approved-live'.",
    "ceo_approval_token": "Literal string [CEO_APPROVED] must appear in matching MC task.",
    "fail_closed": "If registry missing/unparseable, BLOCK with explicit fix command.",
    "hook_integration": "blueprint-schema-validator-pre.sh must call: node ~/system/tools/zakon-registry-check.js validate $FILE_PATH"
  },
  "backfill_metadata": {
    "scan_date": "2026-05-12",
    "scan_path": "~/system/rules/zakon-*.md",
    "files_found": 3,
    "notes": "All pre-2026-05-12 ZAKONs grandfathered as legacy-pre-registry."
  },
  "registry": [
    {
      "zakon_id": "feasibility-check",
      "file_path": "~/system/rules/zakon-feasibility-check.md",
      "mc_task": null,
      "ceo_approved_token": "GRANDFATHERED-PRE-2026-05-12",
      "status": "legacy-pre-registry",
      "backfill_metadata": { ... }
    },
    ...
  ]
}

Tool Usage

Validate (Hook Integration)

node ~/system/tools/zakon-registry-check.js validate ~/system/rules/zakon-example.md

Exit Codes: 0 = pass, 2 = blocked, 3 = registry error

Hook Integration: blueprint-schema-validator-pre.sh line ~75:

if [[ "$FILE" =~ ~/system/rules/zakon-.*\.md$ ]]; then
  node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE" || exit 2
fi

List All Entries

node ~/system/tools/zakon-registry-check.js list

Output: Human-readable list of all registry entries with status, MC task, and approval token.

Statistics

node ~/system/tools/zakon-registry-check.js stats

Output: Count of entries by status (legacy-pre-registry, active, approved-pending-author, etc.).


Current Registry State

As of 2026-05-12:

| ZAKON ID | Status | MC Task | Approval Token |
|---|---|---|---|
| feasibility-check | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| pi2-deploy-verification | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| qa19-mapping | legacy-pre-registry | N/A | GRANDFATHERED-PRE-2026-05-12 |
| blueprint-enforcement | active | 99911 | [CEO_APPROVED] |

Total Entries: 4 (3 grandfathered legacy + 1 newly created via registry gate)


Backfill Manifest

On 2026-05-12, a backfill scan identified 3 pre-existing ZAKON files in ~/system/rules/:

  1. zakon-feasibility-check.md — 84 lines, 3997 bytes
  2. zakon-pi2-deploy-verification.md — 165 lines, 6412 bytes (referenced in CLAUDE.md)
  3. zakon-qa19-mapping.md — 268 lines, 13811 bytes

Grandfathering Policy: All 3 files registered as legacy-pre-registry status with GRANDFATHERED-PRE-2026-05-12 token. This is an audit snapshot, NOT a CEO approval retroactively applied. Future edits to these files are allowed without re-approval (legacy status).


Adding New ZAKON Files

Process:

  1. Create MC Task: Title must include "ZAKON" or "rule". Description must contain [CEO_APPROVED] token.
  2. Update Registry: Add entry to ~/system/rules/zakon-registry.json with:
    • zakon_id — Short identifier (e.g., "cost-ceiling")
    • file_path — Full path with tilde notation
    • mc_task — MC task ID
    • ceo_approved_token — Must be [CEO_APPROVED]
    • status: approved-pending-author
  3. Author ZAKON File: Write hook will validate against registry. If entry exists with approved status, write proceeds.
  4. Update Status: After file is authored and verified, update registry entry to status: "active" and add published_sha256.

Example Registry Entry:

{
  "zakon_id": "cost-ceiling",
  "file_path": "~/system/rules/zakon-cost-ceiling.md",
  "mc_task": 100500,
  "ceo_approved_token": "[CEO_APPROVED]",
  "ceo_approval_date": "2026-05-13",
  "ceo_approval_method": "CEO Board deliberation (MC #100500)",
  "status": "approved-pending-author",
  "notes": "Cost ceiling enforcement rule for multi-week projects"
}

Fail-Closed Behavior

If zakon-registry.json is missing or unparseable, the validation tool exits with code 3 and provides explicit fix instructions:

ZAKON_REGISTRY_ERROR: Registry file not found.
Expected: /Users/makinja/system/rules/zakon-registry.json
FIX: Create registry via MC #100464 or restore from backup.

Design Rationale: Fail-closed prevents silent bypass. If registry infrastructure is broken, ALL ZAKON writes are blocked until registry is restored.
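The fail-closed contract (0 = pass, 2 = blocked, 3 = registry error) can be illustrated with a short sketch. This is not the code of zakon-registry-check.js: the function name `validate` and the exact set of statuses treated as writable are assumptions drawn from the status values that appear on this page.

```python
import json
import os
import tempfile

# Assumed set of statuses that permit a write; collected from this page, not from the tool.
WRITABLE = {"approved-pending-author", "approved-live", "active", "legacy-pre-registry"}

def validate(registry_path: str, file_path: str) -> int:
    """Return 0 (pass), 2 (blocked) or 3 (registry missing or unparseable)."""
    try:
        with open(registry_path, encoding="utf-8") as fh:
            entries = json.load(fh)["registry"]
    except (OSError, ValueError, KeyError, TypeError):
        return 3  # fail closed: a broken registry never means "allow"
    for entry in entries:
        if entry.get("file_path") == file_path and entry.get("status") in WRITABLE:
            return 0
    return 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zakon-registry.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"registry": [{"file_path": "~/system/rules/zakon-cost-ceiling.md",
                                 "status": "approved-pending-author"}]}, fh)
    codes = [
        validate(path, "~/system/rules/zakon-cost-ceiling.md"),   # registered entry
        validate(path, "~/system/rules/zakon-unregistered.md"),   # no entry
        validate(os.path.join(tmp, "missing.json"), "~/system/rules/zakon-cost-ceiling.md"),
    ]
print(codes)
```

Mapping every non-zero result to a blocking exit in the calling hook is what keeps a registry error from becoming a silent bypass.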


Hook Integration Details

Hook File: ~/.claude/hooks/blueprint-schema-validator-pre.sh

Integration Point: After detecting zakon-*.md file pattern, hook calls:

node "$HOME/system/tools/zakon-registry-check.js" validate "$FILE"
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
  exit 2  # Block write
fi

Registration: ~/.claude/settings.json PreToolUse hook for Write|Edit|MultiEdit actions.

Timing: PreToolUse timing ensures disk write is blocked before tool executes. PostToolUse cannot block writes (correction signal only).



Registry Location: ~/system/rules/zakon-registry.json
Tool Location: ~/system/tools/zakon-registry-check.js
Hook Integration: ~/.claude/hooks/blueprint-schema-validator-pre.sh
Version: 1.0
Current Entries: 4 (3 grandfathered + 1 active)
Published: 2026-05-12

LightRAG Tuning — 2026-05

LightRAG Tuning — May 2026

Last Updated: 2026-05-12 (MC #100467)
Status: LIVE

Current Config (LIVE as of 2026-05-12 21:13)

| Parameter | Value | Changed From |
|---|---|---|
| cosine_threshold | 0.5 | 0.2 |
| related_chunk_number | 10 | 5 |
| enable_rerank | false | (unchanged, deferred) |

Why These Values

AgentForge audit (Chip Huyen lens, MC #100451) identified 2 quick-win retrieval optimizations:

Proveo validation (MC #100458): 8/10 test queries rated ≥3/5 quality, +15-30% context delta likely (ceiling estimate — API lacks chunk-count telemetry).

What We Did NOT Touch (and Why)

Forbidden changes until MC #100009 backlog stabilization ships:

Reason: These params affect the ingest pipeline. LightRAG already has 121K doc backlog + memory pressure. Retrieval-tuning (cosine, chunks) is safe because it's query-time only.

Validation Summary

Proveo 10-query test suite (MC #100458):

| Metric | Result |
|---|---|
| Queries with quality ≥3/5 | 8/10 (PASS threshold: 7/10) |
| HTTP 500 errors | 0/10 |
| Estimated context token delta | +15-30% (ceiling +40%, likely lower in practice) |
| Response quality by bucket | Product/code queries strongest (3.7/5 avg), process queries weakest (2.5/5 avg) |

Proveo verdict: REQUEST_CHANGES (functional pass, but lacks chunk-count telemetry to machine-verify actual cost impact)

Open Work

How to Verify Live State

curl -s http://localhost:9621/health | jq .configuration
# Look for: cosine_threshold=0.5, related_chunk_number=10, enable_rerank=false
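The same check can be automated by comparing the health payload with the expected values. The sketch below works on an already-parsed JSON payload; it assumes the three parameters sit under a `configuration` key with native JSON types (as the jq filter suggests), and the function name `config_drift` is hypothetical.

```python
import json

EXPECTED = {"cosine_threshold": 0.5, "related_chunk_number": 10, "enable_rerank": False}

def config_drift(health: dict, expected: dict = EXPECTED) -> dict:
    """Return {parameter: live_value} for every parameter that differs from expected."""
    cfg = health.get("configuration", {})
    return {key: cfg.get(key) for key, want in expected.items() if cfg.get(key) != want}

# Hypothetical payload in which related_chunk_number is still at its old value.
sample = json.loads(
    '{"configuration": {"cosine_threshold": 0.5, "related_chunk_number": 5, "enable_rerank": false}}'
)
print(config_drift(sample))
```

An empty result means the live configuration matches the values in the table above.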

Evidence snapshots:

How to Revert (If Needed)

cd /Users/makinja/system/docker/lightrag

# Revert .env
sed -i '' '/# Retrieval Tuning/,+3d' .env

# Revert compose
git checkout docker-compose.yml  # or manual edit if not git-tracked

# Recreate container
docker compose down && docker compose up -d lightrag

# Verify restoration
curl -s http://localhost:9621/health | jq '.configuration.cosine_threshold, .configuration.related_chunk_number'
# Expected after rollback: 0.2, 5

Email-Reactor — Strategic-Inbox Auto-Triage Daemon


Why It Exists

Incident (2026-05-26): the CEO had to phone Asmir Merdžanović to learn that Asmir had sent a critical SEO partnership email three days earlier (email #8421, dated 2026-05-24). The email sat in the database with status 'new' for 72+ hours while we continued building the exact SEO automation partnership Asmir was offering.

"Nobody reads and reacts to emails. We have been trying to get this done for 4 months already. If we don't succeed, we can close the company." (translated from Bosnian)
— CEO Alem Basic, 2026-05-26, after discovering the Asmir email gap

Previous email systems (email-agent, email-briefing, inbox-queue) classified and queued emails, but no human acted on them. Email-Reactor addresses this with a 3-step, security-first pipeline that automatically creates Mission Control tasks with macOS push notifications for revenue-critical emails.

What It Does

Email-Reactor is a daemon that polls ~/system/databases/email-inbox.db every 5 minutes (via LaunchAgent no.alai.inbox-watcher) and processes every new email through a 3-step pipeline:

  1. SECURITY SCAN (always first) — rule-based phishing/macro/spoof detection → quarantine on fail
  2. KNOWN-CONTACT CHECK — parallel lookup in Paperless archive.alai.no correspondents + DB email history → if KNOWN, create MC task + push notification
  3. LLM REVENUE CLASSIFIER (unknown senders only) — Qwen2.5-Coder 32B asks "Is this revenue-relevant?" → YES = MC task + push, NO = queue silently

Strategic override: VIP senders in ~/system/config/strategic-partners.json skip all steps and go straight to MC + push (tier-1 phone-grade urgency).

Architecture

flowchart LR
    A[Email arrives in DB] --> B{Strategic Partner?}
    B -- YES --> Z[Create MC + Push]
    B -- NO --> C[STEP 1: Security Scan]
    C -- FAIL --> Q[Quarantine + Alert]
    C -- PASS --> D{STEP 2: Known Contact?}
    D -- YES<br/>Paperless/DB --> Z
    D -- NO --> E{Newsletter/Transactional?}
    E -- YES --> N[No MC, audit as llm_no]
    E -- NO --> F[STEP 3: LLM Classifier]
    F -- YES --> Z
    F -- NO --> N
    Q --> X[STOP]
    N --> X
    Z --> X[Done]

Components

| Component | Path | Purpose |
|---|---|---|
| Watcher daemon | ~/system/tools/inbox-watcher.js | 738-line Node.js script, runs every 5 min |
| LaunchAgent | ~/Library/LaunchAgents/no.alai.inbox-watcher.plist | Schedules daemon (StartInterval=300s) |
| Email DB | ~/system/databases/email-inbox.db | SQLite, emails table, mc_task_id linkage |
| Strategic allowlist | ~/system/config/strategic-partners.json | VIP senders (tier-1 = phone-grade), hot-reloaded |
| Audit log | ~/system/state/inbox-watcher-audit.log | JSONL, every action: linked/llm_yes/llm_no/quarantine |
| Quarantine log | ~/system/state/inbox-watcher-quarantine.jsonl | Security failures, phishing attempts |
| Ops watchdog | ~/system/config/ops-watchdog.json | Lists no.alai.inbox-watcher in critical_services |
| Mission Control | ~/system/tools/mc.js | Task creation, dedup detection, linkage |

Routing Logic Detail

Step 1: Security Scan

Rule-based checks (no LLM cost):

On failure: email goes to inbox-watcher-quarantine.jsonl, audit log records security_quarantine, processing STOPS (no MC, no push).

Step 2: Known-Contact Check

Parallel signals (first match wins):

  1. Strategic override: email matches strategic-partners.json (Asmir, SnowIT, paying clients) → immediate MC + push
  2. Paperless Correspondents: HTTPS GET to https://archive.alai.no/api/correspondents/ with Bitwarden token + Cloudflare Access headers, searches by domain + sender name → if found, contact is KNOWN
  3. DB email history: SQL query SELECT COUNT(*) FROM emails WHERE to_addr LIKE '%sender%' AND classification='OWN' → if we ever emailed this person, they're KNOWN

If KNOWN via any signal: create MC task, fire macOS push notification, audit log records source (override/paperless/db).
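The DB-history signal can be sketched as a parameterized query rather than string interpolation. This is an illustrative Python version under stated assumptions: only the `emails` table with its `to_addr` and `classification` columns is taken from the text above, and the function name `emailed_before` is hypothetical.

```python
import sqlite3

def emailed_before(conn: sqlite3.Connection, sender_addr: str) -> bool:
    """True if we ever sent mail (classification 'OWN') to this address."""
    # Escape LIKE wildcards so "_" or "%" inside an address is matched literally.
    escaped = sender_addr.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    row = conn.execute(
        "SELECT COUNT(*) FROM emails "
        "WHERE to_addr LIKE ? ESCAPE '\\' AND classification = 'OWN'",
        (f"%{escaped}%",),
    ).fetchone()
    return row[0] > 0

# In-memory stand-in for email-inbox.db (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (to_addr TEXT, classification TEXT)")
conn.execute("INSERT INTO emails VALUES ('Asmir <asmirmc@gmail.com>', 'OWN')")
conn.execute("INSERT INTO emails VALUES ('first_last@example.com', 'ACTION')")

print(emailed_before(conn, "asmirmc@gmail.com"))
print(emailed_before(conn, "stranger@example.com"))
```

Binding the address as a parameter keeps a hostile sender address from altering the query, which matters because this value comes straight from inbound mail.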

Step 3: LLM Revenue Classifier (unknown senders only)

Pre-filter heuristic (saves LLM tokens): detect obvious newsletters/transactional via regex patterns:

If heuristic matches: audit as llm_no with reason newsletter_heuristic or transactional_heuristic, no MC, STOP.

LLM call (only if no heuristic matched):

YES → create MC task + push + audit llm_yes
NO → audit llm_no, no MC
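The ordering of the steps above can be summarized in a small routing function. This is a sketch of the decision order only, with every check injected as a stub; the dictionary keys, the stub predicates, and the choice to report the strategic override as `linked` are assumptions made for illustration.

```python
from typing import Callable, Dict

Check = Callable[[dict], bool]

def route(email: dict, checks: Dict[str, Check]) -> str:
    """Return the audit action for one email, applying checks in pipeline order."""
    if checks["strategic"](email):
        return "linked"                # strategic override skips every other step
    if not checks["security_ok"](email):
        return "security_quarantine"   # stop: no MC task, no push
    if checks["known_contact"](email):
        return "linked"
    if checks["bulk_heuristic"](email):
        return "llm_no"                # newsletter/transactional: the LLM is never called
    return "llm_yes" if checks["llm_revenue"](email) else "llm_no"

# Stub checks for illustration; the real daemon consults Paperless, the DB and an LLM.
checks = {
    "strategic": lambda e: e["from"] == "asmirmc@gmail.com",
    "security_ok": lambda e: not e.get("has_macro", False),
    "known_contact": lambda e: e["from"].endswith("@snowit.no"),
    "bulk_heuristic": lambda e: e["from"].startswith("newsletter@"),
    "llm_revenue": lambda e: "offer" in e.get("subject", "").lower(),
}
print(route({"from": "unknown@example.com", "subject": "Partnership offer"}, checks))
```

Keeping the cheap rule-based checks ahead of the LLM call means the model is only consulted for unknown senders that pass security and are not obvious bulk mail.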

LLM Classifier Fix — 2026-06-22 (MC #102113)

Deployed live: 2026-06-22T08:49:43Z

Bugs fixed:

  1. Wrong model ID: Code referenced gemma-4 which does not exist on FORGE MLX (11435) → HTTP 401 "Repository Not Found". Every LLM call failed and defaulted to NO.
  2. Reasoning model + truncation: gemma-4-26b is a reasoning model that returns thinking in .message.reasoning and leaves .message.content null until reasoning completes. Code read .content with max_tokens: 5 → answer never landed → classifier always defaulted NO → unknown-sender revenue leads silently dropped.

Fix:

Verification (3 independent layers, all 5/5 acceptance):

  1. AgentForge build run: 4/5 LLM + case1 (GitHub CI) caught by upstream noise filter = 5/5 production
  2. John independent curl re-run: newsletter NO, Fiken NO, cold-lead YES, Asmir YES; GitHub CI caught by /^notification[s]?[-.@]/i
  3. Proveo independent QA (P2P): PASS — md5 unchanged pre-swap, syntax OK, diff logic-equivalent, 5/5 twice

Live deploy:

Known issues:

Evidence:

Push Path — Live State (MC #102077, 2026-06-08)

Status: WIRED + PROVEO PASS — Push path activated 2026-06-08. Validated by Proveo (Angie Jones lens). Proveo validation SHA256: d1f4999b.

Push Channel

All partner/reactor pushes go to Slack #ceo via:

node ~/system/tools/slack.js send ceo "<message>"

Note: There is no mm-bridge and no macOS push-notification for this path. The channel is exclusively Slack #ceo. The existing stale-SLA escalation in email-agent.js (~line 1394) also pushes #ceo for all ACTION emails at 24h/48h/72h/96h thresholds — that path is unchanged.

Allowlist — strategic-partners.json

File: ~/system/config/strategic-partners.json

Structure:

{
  "senders": [
    {
      "email": "asmirmc@gmail.com",
      "name": "Asmir Merdžanović",
      "tier": 1,
      "reason": "SEO partnership lead — tier-1 priority"
    }
  ],
  "domains": []
}

Matching rules (in matchStrategicPartner(fromAddr)):

Current allowlist (as of 2026-06-08): asmirmc@gmail.com (Asmir Merdžanović, tier-1). Test senders removed by Proveo after validation.

How to Add a Strategic Partner

  1. Open ~/system/config/strategic-partners.json
  2. Append a new object to the senders array:
{
  "email": "partner@company.no",
  "name": "Partner Name",
  "tier": 1,
  "reason": "Business reason — e.g., paying client, key integration partner"
}
  3. Save the file. No daemon reload needed: loadStrategicPartners() reads the file fresh on every ingest cycle.
  4. To add a whole domain: append to the domains array instead (e.g., "snowit.no").
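How an allowlist of this shape could be matched is sketched below. It is not the code of matchStrategicPartner: case-insensitive comparison, exact-address-before-domain precedence, a bare address as input, and the shape of the domain-match result are all assumptions.

```python
def match_strategic_partner(from_addr: str, allowlist: dict):
    """Return the matching allowlist entry, a domain marker, or None."""
    addr = from_addr.strip().lower()
    # 1. Exact sender match (assumed to take precedence over domain match).
    for sender in allowlist.get("senders", []):
        if sender["email"].lower() == addr:
            return sender
    # 2. Whole-domain match against the "domains" array.
    domain = addr.rpartition("@")[2]
    if domain in {d.lower() for d in allowlist.get("domains", [])}:
        return {"domain": domain}  # hypothetical result shape for domain hits
    return None

allowlist = {
    "senders": [{"email": "asmirmc@gmail.com", "name": "Asmir Merdžanović", "tier": 1,
                 "reason": "SEO partnership lead"}],
    "domains": ["snowit.no"],
}
print(match_strategic_partner("Asmirmc@Gmail.com", allowlist))
print(match_strategic_partner("cto@snowit.no", allowlist))
print(match_strategic_partner("someone@example.com", allowlist))
```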

Trigger and Ingest Path

The push fires inside ~/system/daemons/email-agent.js at the ingest insert path (line ~2393):

  1. New email row inserted into email-inbox.db (id assigned)
  2. If dbCategory === 'ACTION' and not --dryRun: calls matchStrategicPartner(fromAddr)
  3. If match found: calls setPartnerTier(id, tier) (sets partner_tier column) then fireReactorPush()
  4. fireReactorPush() checks row.reactor_pushed_at — if already set, skips (dedup gate)
  5. Push fires: node slack.js send ceo "[TIER-1 PARTNER] <name> emailed <account> — ..."
  6. On success: calls markReactorPushed(id, tier) which sets reactor_pushed_at = NOW()
  7. Rate-limit: at most 10 pushes per daemon cycle (REACTOR_CYCLE_LIMIT = 10, tracked via reactorPushedThisCycle Set)
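Steps 4 to 7 combine a permanent dedup gate with a per-cycle cap. The sketch below models that interaction in Python under stated assumptions: rows are plain dicts, `send` stands in for the Slack call, and `run_cycle` is a hypothetical name; only `reactor_pushed_at` and the limit of 10 come from the text above.

```python
REACTOR_CYCLE_LIMIT = 10  # at most 10 pushes per daemon cycle

def run_cycle(rows, send, now_iso):
    """Push each eligible row once; return the ids pushed during this cycle."""
    pushed_this_cycle = set()  # in-memory, rebuilt on every cycle
    for row in rows:
        if len(pushed_this_cycle) >= REACTOR_CYCLE_LIMIT:
            break  # rate limit reached; remaining rows wait for the next cycle
        if row.get("reactor_pushed_at") or row["id"] in pushed_this_cycle:
            continue  # dedup gate: already pushed earlier
        if send(row):
            row["reactor_pushed_at"] = now_iso  # mark only after a successful push
            pushed_this_cycle.add(row["id"])
    return pushed_this_cycle

rows = [{"id": i, "reactor_pushed_at": None} for i in range(12)]
sent = []

def fake_send(row):
    sent.append(row["id"])
    return True

first = run_cycle(rows, fake_send, "2026-06-08T08:00:00Z")
second = run_cycle(rows, fake_send, "2026-06-08T08:05:00Z")
third = run_cycle(rows, fake_send, "2026-06-08T08:10:00Z")
print(len(first), len(second), len(third), len(sent))
```

Setting the timestamp only after a successful send means a failed push is retried on the next cycle instead of being silently dropped.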

Schema Additions (email-inbox.db emails table)

| Column | Type | Default | Purpose |
|---|---|---|---|
| partner_tier | INTEGER | 0 | 0 = not a partner; 1+ = tier level from allowlist |
| reactor_pushed_at | TEXT | NULL | ISO timestamp of first push; NULL = not yet pushed; set = dedup gate (no re-push) |

Indexes: idx_emails_partner_tier, idx_emails_reactor_pushed

New helper functions exported from email-inbox.js:

Daily Digest

File: ~/system/tools/email-reactor-digest.js

LaunchAgent: ~/Library/LaunchAgents/com.john.email-reactor-digest.plist (fires daily at 08:00 local)

Behaviour:

Manual usage:

# Dry run (no push, shows what would be sent)
node ~/system/tools/email-reactor-digest.js --dry-run

# Force re-send even if already sent today
node ~/system/tools/email-reactor-digest.js --force

# Check LaunchAgent
launchctl list | grep email-reactor-digest

Dedup — Three Independent Layers

| Layer | Mechanism | Scope |
| --- | --- | --- |
| 1. Ingest cycle Set | reactorPushedThisCycle (in-memory Set, cleared each cycle) | Within a single 5-min daemon run |
| 2. DB timestamp | reactor_pushed_at column; if set, fireReactorPush() returns immediately | Permanent; survives restarts |
| 3. Digest date file | last_sent_date in email-reactor-digest-state.json | Once per calendar day |
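
The third layer, the once-per-day digest gate, can be sketched as below. This is an illustration under assumptions rather than the code in email-reactor-digest.js: shouldSendDigest and localDateString are hypothetical names, and the state file is assumed to hold last_sent_date as a local-time YYYY-MM-DD string.

```javascript
// Hypothetical sketch of the once-per-day gate; the real digest script may differ.
// Assumes state looks like { "last_sent_date": "YYYY-MM-DD" } in local time.
function localDateString(d) {
  const pad = (n) => String(n).padStart(2, "0");
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

function shouldSendDigest(state, now, { force = false, dryRun = false } = {}) {
  const today = localDateString(now);
  // A dry run never sends, whatever the state says.
  if (dryRun) return { send: false, reason: "dry-run", today };
  // Same calendar day as the last send: skip unless --force was given.
  if (!force && state && state.last_sent_date === today) {
    return { send: false, reason: "Already sent today", today };
  }
  return { send: true, reason: force ? "forced" : "first send today", today };
}

const now = new Date(2026, 5, 8, 8, 0); // 2026-06-08 08:00 local time
console.log(shouldSendDigest({ last_sent_date: "2026-06-07" }, now));                  // send: true
console.log(shouldSendDigest({ last_sent_date: "2026-06-08" }, now));                  // send: false
console.log(shouldSendDigest({ last_sent_date: "2026-06-08" }, now, { force: true })); // send: true
```

After a successful send, the caller would write the returned today value back to the state file as last_sent_date.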

Proveo Validation Evidence (2026-06-08)

| Check | Result | Notes |
| --- | --- | --- |
| email-inbox.js columns + helpers | PASS | Syntax OK; exports confirmed; SHA256 39f67c25 |
| email-agent.js reactor wired into insert path | PASS | Syntax OK; line 2393 confirmed; SHA256 f27fc932 |
| email-reactor-digest.js exists | PASS | 6215 bytes; syntax OK; SHA256 6e63a2e9 |
| LaunchAgent loaded (launchctl) | PASS | com.john.email-reactor-digest active; StartCalendarInterval Hour=8 |
| Push fired to #ceo (independent test) | PASS | Receipt: ✓ Sent to #ceo (Proveo row id=9218) |
| Dedup: reactor_pushed_at set, no re-push | PASS | Second cycle skips; confirmed via code + DB |
| Digest push to #ceo | PASS | 50 items; Receipt: ✓ Sent to #ceo |
| Digest same-day dedup | PASS | "Already sent today — skipping" |
| 19-account ingest not regressed | PASS | COUNT(email_accounts)=19; all last_checked 2026-06-08 |
| Test senders cleaned from allowlist | PASS | Only asmirmc@gmail.com remains; SHA256 289922b8 |
| No push storm | PASS | 3 independent dedup layers confirmed |

Overall Proveo verdict: PASS. Blocker items: none.

Audit Log Codes

| Action | Meaning | MC Created? |
| --- | --- | --- |
| linked | Known contact, MC task created (first time) | YES |
| relinked_via_dedup | Duplicate MC task found, linked to existing (no new push) | NO (existing) |
| security_quarantine | Failed security scan (phishing/macro/spoof) | NO |
| llm_yes | LLM classified as revenue-relevant | YES |
| llm_no | LLM classified as NOT revenue-relevant (or heuristic match) | NO |
| newsletter_heuristic | Pre-LLM heuristic detected newsletter/digest | NO |
| transactional_heuristic | Pre-LLM heuristic detected automated notification/billing | NO |
| dry_run | --dry-run mode, would have created MC | NO (test mode) |
| create_failed | mc.js add command failed | NO (error) |
| update_failed | DB update (mc_task_id linkage) failed | YES (orphaned) |

Debug Runbook

Query Audit Log

# Last 50 actions
tail -50 ~/system/state/inbox-watcher-audit.log | jq .

# Count actions by type (entries stamped with today's UTC date, not a rolling 24h window)
grep "$(date -u +%Y-%m-%d)" ~/system/state/inbox-watcher-audit.log | \
  jq -r .action | sort | uniq -c | sort -rn

# Find specific email
grep '"email_id":8421' ~/system/state/inbox-watcher-audit.log | jq .
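
Matching on the date string only catches entries from the current UTC day. Where a true rolling 24-hour window matters, the same tally can be sketched in Node. This is illustrative only: the action field appears in the queries above, but the timestamp field name ts, the sample lines and the countActions helper are assumptions, so adjust them to the real log schema.

```javascript
// Sketch: tally audit actions in a rolling window from JSONL text.
// Assumption: each entry carries an ISO timestamp in a field named `ts`.
function countActions(jsonlText, nowMs, windowMs = 24 * 60 * 60 * 1000) {
  const counts = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip partially written or corrupt lines
    }
    const t = Date.parse(entry.ts);
    // Drop entries with no usable timestamp, older than the window, or in the future.
    if (Number.isNaN(t) || t > nowMs || nowMs - t > windowMs) continue;
    counts.set(entry.action, (counts.get(entry.action) || 0) + 1);
  }
  // Most frequent first, like `sort | uniq -c | sort -rn`.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample lines for demonstration only.
const sample = [
  '{"ts":"2026-06-08T07:00:00Z","email_id":8421,"action":"linked"}',
  '{"ts":"2026-06-08T07:05:00Z","email_id":8422,"action":"llm_no"}',
  '{"ts":"2026-06-08T07:10:00Z","email_id":8423,"action":"llm_no"}',
  '{"ts":"2026-06-06T07:00:00Z","email_id":8300,"action":"linked"}',
].join("\n");

console.log(countActions(sample, Date.parse("2026-06-08T08:00:00Z")));
// → [ [ 'llm_no', 2 ], [ 'linked', 1 ] ]
```

In practice the text would come from reading ~/system/state/inbox-watcher-audit.log and nowMs from Date.now().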

Query Quarantine Log

# Show all quarantined emails
jq . ~/system/state/inbox-watcher-quarantine.jsonl

# Count by reason
jq -r .reason ~/system/state/inbox-watcher-quarantine.jsonl | sort | uniq -c

Check Reactor Push State

# All emails that were partner-pushed
sqlite3 ~/system/databases/email-inbox.db \
  "SELECT id, from_addr, subject, partner_tier, reactor_pushed_at FROM emails WHERE partner_tier > 0 ORDER BY reactor_pushed_at DESC LIMIT 20;"

# Pending reactor pushes (ACTION emails from partners not yet pushed)
sqlite3 ~/system/databases/email-inbox.db \
  "SELECT id, from_addr, subject, classification FROM emails WHERE partner_tier > 0 AND reactor_pushed_at IS NULL;"

# Digest state (last sent date)
cat ~/system/logs/email-reactor-digest-state.json

Manual Trigger (Dry-Run)

node ~/system/tools/inbox-watcher.js --dry-run

Shows what would happen without creating tasks or updating DB.

Manual Trigger (Live)

node ~/system/tools/inbox-watcher.js

Check Daemon Status

launchctl list | grep inbox-watcher
launchctl list | grep email-reactor-digest

Expected output: no.alai.inbox-watcher with a recent PID; com.john.email-reactor-digest with - in the PID column (expected for a StartCalendarInterval job, which only runs at 08:00).

Restart Daemon

launchctl unload ~/Library/LaunchAgents/no.alai.inbox-watcher.plist
launchctl load ~/Library/LaunchAgents/no.alai.inbox-watcher.plist

Tail Daemon Logs

tail -f ~/system/logs/inbox-watcher.out.log
tail -f ~/system/logs/inbox-watcher.err.log
tail -f ~/system/logs/email-reactor-digest.log

Check Email DB for Pending

sqlite3 ~/system/databases/email-inbox.db <<EOF
SELECT id, from_addr, subject, status, created_at
FROM emails
WHERE mc_task_id IS NULL
  AND status = 'new'
  AND created_at > datetime('now', '-7 days')
ORDER BY created_at DESC
LIMIT 20;
EOF

Failure Modes & Alerts

| Failure | Symptom | Alert Mechanism | Recovery |
| --- | --- | --- | --- |
| Daemon crash | launchctl list shows no PID | ops-watchdog auto-restart (critical_services) | Auto (watchdog), or manually reload the plist |
| Paperless 401 | Log shows "HTTP 401" | WARN in out.log, no Slack (non-blocking) | Refresh Bitwarden /tmp/bw-session token |
| Ollama FORGE down | LLM timeout 15s | Log WARN, defaults to NO (safe) | SSH to FORGE, restart Ollama service |
| MC duplicate flood | Many relinked_via_dedup in audit | None (expected behavior) | Normal: dedup prevents task spam |
| DB locked | SQLite BUSY error | ERROR in err.log | Wait 5min (next cycle), or restart daemon |
| Strategic override miss | VIP email not getting Slack push | CEO notices delay | Verify strategic-partners.json email exact match (case-insensitive); check reactor_pushed_at not already set from an old test row |
| Slack push fails | No receipt in logs; no #ceo message | WARN in email-agent.log | Check slack.js connectivity; verify Slack token in config |
| Digest not firing at 08:00 | No digest in #ceo after 08:10 | None (silent) | Run manually: node ~/system/tools/email-reactor-digest.js --force; check plist loaded via launchctl |

Known Limitations

  1. The LLM is a safety net, not the primary path. Real opportunities should arrive via KNOWN-CONTACT (Paperless correspondents + DB history). The LLM classifier is conservative: it defaults to NO on error to avoid false-positive task spam. If the LLM misses a genuine new opportunity, the email still appears in the email DB and the CEO can manually promote it to MC.
  2. Paperless lookup is best-effort. If Bitwarden token expires or Cloudflare Access headers are missing, Paperless signal fails silently and daemon falls back to DB-history-only KNOWN check. This is by design (non-blocking).
  3. Default NO on malformed LLM response. Policy changed 2026-05-26 after 6 false positives from verbose LLM responses. Strict regex parsing + retry ensures only clean YES/NO answers create tasks. This may miss 1 real opportunity but prevents 6 noise tasks.
  4. No auto-reply generation. Out of scope for Phase 2. Email-Reactor creates MC tasks; human writes replies.
  5. 30-day recency filter. Only processes emails from last 30 days to avoid re-scanning old newsletter backlog every 5-min cycle. Older emails must be manually triaged.
  6. Single allowlist for all accounts. The daemon queries all accounts in email-inbox.db, but strategic-partners.json does not differentiate by account. Future: add account-specific allowlists if needed.
  7. Reactor push is email-agent ingest only. The push fires on fresh ingest in email-agent.js. It does NOT retroactively push emails already in the DB from before MC #102077. Historical partner emails must be found via digest or manual DB query.
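
The strict-parse and default-NO policy from limitation 3 can be sketched as follows. This is a hedged illustration, not the deployed classifier: the regex, the retry count of 2, and the names parseVerdict, classify and askLlm are all assumptions made for the example.

```javascript
// Hypothetical sketch of strict verdict parsing; regex and retry count are assumptions.
const VERDICT_RE = /^\s*(YES|NO)\s*[.!]?\s*$/i;

// Returns "YES", "NO", or null when the answer is anything but a bare verdict.
function parseVerdict(raw) {
  const m = VERDICT_RE.exec(String(raw ?? ""));
  return m ? m[1].toUpperCase() : null;
}

// askLlm is an injected async function that returns the model's raw text.
async function classify(askLlm, prompt, maxAttempts = 2) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const verdict = parseVerdict(await askLlm(prompt));
      if (verdict) return verdict;
    } catch {
      // Timeouts and transport errors fall through to the next attempt.
    }
  }
  return "NO"; // safe default: no task is created on error or a malformed answer
}

console.log(parseVerdict("YES"));                       // "YES"
console.log(parseVerdict("Yes, this looks relevant.")); // null: verbose answer rejected
classify(async () => "I think maybe", "prompt").then((v) => console.log(v)); // "NO"
```

Anchoring the regex at both ends is what rejects verbose answers: a response that merely contains the word "yes" does not create a task.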


Authored by: Skillforge (ALAI knowledge management)
Document type: Runbook + Architecture
Audience: Future John during 3am incident
Last updated: 2026-06-22 (MC #102113 LLM classifier fix deployed)